DeepSeek vs Claude vs Gemini for Roleplay Benchmarks

There are plenty of LLM benchmarks comparing models on coding and math. There are almost none comparing them on roleplay quality.

We run 5 different models in production across Suzune, each assigned to specific roles based on months of testing. This isn’t a synthetic benchmark — it’s what actually happens when real users interact with AI characters for hours at a time.

Open Table of contents

The Models We Use (And Why)
DeepSeek V3.2: The Workhorse
Claude Haiku 4.5: The Editor
Gemini 2.5 Flash: The Director
GLM-5: The World Builder
The Real Comparison: Side by Side
- Analysis
Cost Comparison (Real Production Data)
Recommendations
- If you’re building an RP bot:
- If you just want to chat with AI characters:

The Models We Use (And Why)

Model	Role in Suzune	Why This Model
DeepSeek V3.2	Primary chat	Best cost/quality/freedom ratio
Claude Haiku 4.5	Quality rewrite + fallback	Best prose quality
Gemini 2.5 Flash	NPC direction	Creative, cheap, NSFW-tolerant
Gemini 2.0 Flash	NPC rewrite	Cheapest per token
GLM-5	Scene descriptions	Best at atmosphere and world-building

No single model is “the best.” Each excels at something different.

DeepSeek V3.2: The Workhorse

Role: Primary model for all character conversations

Strengths

NSFW tolerance: Will write explicit content without jailbreaks or prompt hacks
Cost: $0.25/$0.40 per million tokens — roughly 1/10th of Claude Haiku
Context window: Large enough for our needs (128K)
Tool calling: Supports function calling (with caveats — see Quirks)

Weaknesses

Prose quality: Good but not great. Sentences can feel mechanical compared to Claude
Character voice drift: More likely to lose the character’s speech patterns in long conversations
Tokenization glitches: Occasionally produces garbled Japanese compound words

The Quirks You Need to Know

DeepSeek V3.2 has some unique behaviors that require engineering workarounds:

1. Tokenization Glitches

DS3.2 sometimes splits Japanese words incorrectly:

社長 (president) → 社long
部長 (department head) → 部long

We have a cleanup function that catches these:

text = text.replace("社long", "社長")
text = text.replace("部long", "部長")

2. Tool Calls as Plain Text

DS3.2 sometimes outputs function calls as plain text instead of structured tool calls:

generate_image{"expression": "smiling", "scene": "café"}

We built a parser that detects and extracts these, converting them to proper tool calls.

3. NSFW Self-Censorship When Tools Are Active

Interesting one: DS3.2 is more likely to self-censor NSFW content when tool definitions are present in the prompt. Our workaround: if we detect empty responses in NSFW context, we retry without tool definitions.

4. Repetition Loops

Under certain conditions, DS3.2 gets stuck repeating short phrases (e.g., the same word 50 times). We truncate any phrase repeated more than 3 consecutive times.

Verdict

DeepSeek V3.2 is the best overall choice for NSFW roleplay — not because it’s the highest quality, but because it’s the only model that combines decent quality, NSFW freedom, and affordable pricing. Every other model requires compromises on at least one of these axes.

Claude Haiku 4.5: The Editor

Role: Quality rewrite pass (polishes DS3.2 drafts) + fallback for non-NSFW

Strengths

Prose quality: The best creative writing of any model at its price point
Character consistency: Excellent at maintaining voice across long conversations
Instruction following: Follows system prompt rules more precisely than DS3.2
Prompt caching: Anthropic’s prompt caching reduces costs significantly for repeated calls

Weaknesses

NSFW censorship: Will refuse explicit content consistently
Cost: $0.80/$4.00 per million tokens — 10x more than DS3.2 for output
Silent sanitization: Sometimes “rewrites” NSFW content into sanitized versions without explicitly refusing

How We Use It

Claude Haiku is NOT our primary model. It’s our quality editor:

User message → DeepSeek V3.2 (draft, uncensored)
                     ↓
              Claude Haiku (rewrite for quality)
                     ↓
              Censorship check:
                ├── Rewrite OK → use polished version
                └── Rewrite censored → use original DS3.2 draft

The censorship detection looks for:

Rewrite significantly shorter than draft (< 60% length)
Explicit refusal phrases in the rewrite
Language contamination (Claude sometimes switches languages when uncomfortable)

When the rewrite pipeline works (non-explicit scenes), the quality improvement is noticeable — better word choice, more natural rhythm, stronger character voice. For explicit scenes, we skip it entirely and serve the DS3.2 draft.

Cost Optimization: Prompt Caching

Using Anthropic’s native API (not via OpenRouter), we enable prompt caching for the system prompt. Since the character persona rarely changes, cached tokens cost 1/10th of uncached. This makes the rewrite pass much cheaper per message.

Verdict

Claude Haiku is the best prose writer in our stack, but its NSFW restrictions make it unsuitable as a primary model. As a quality layer on top of DS3.2, it’s worth the extra cost for characters where voice quality matters most.

Gemini 2.5 Flash: The Director

Role: NPC concept generation and scene direction

Strengths

Creativity: Excellent at generating interesting NPC concepts, plot twists, and scenarios
NSFW tolerance: More permissive than Claude (will handle suggestive content)
Cost: $0.30/$2.50 per million tokens — much cheaper than Claude for output
Speed: Fast response times

Weaknesses

Character voice: Less consistent at maintaining a specific character’s speech patterns
Japanese quality: Occasionally mixes in unnatural expressions
Training data leakage: Sometimes outputs phrases from its training data in responses

How We Use It

Gemini excels at high-level creative direction rather than line-by-line dialogue:

“Generate an NPC who would create an interesting conflict with Sakura”
“Suggest a plot twist for this storyline”
“Describe the atmosphere of this location”

We don’t use it for primary character chat because it’s less reliable at maintaining character voice, but for creative tasks where variety matters more than consistency, it’s excellent.

Verdict

The best “creative consultant” in our stack. We wouldn’t use it as a primary chat model, but for NPC generation and scene setup, it’s the best balance of creativity, NSFW tolerance, and cost.

GLM-5: The World Builder

Role: Scene descriptions and atmosphere

Strengths

Atmospheric writing: Produces rich, evocative scene descriptions
World-building: Excellent at describing locations, weather, ambient details
Japanese fluency: Handles Japanese text naturally

Weaknesses

Character dialogue: Not as strong for conversational back-and-forth
Cost: $0.80/$2.56 — not the cheapest option
Availability: Less well-known, fewer community resources

How We Use It

GLM-5 handles “RP room scenarios” — the immersive scene descriptions that set the stage:

Rain drums against the floor-to-ceiling windows of the 40th floor office. The city lights blur into watercolor smudges below. A single desk lamp casts a warm pool of light across scattered documents…

For pure atmosphere, GLM-5 outperforms every other model in our stack.

Verdict

A specialist. Not for everything, but unmatched for environmental descriptions and world-building.

The Real Comparison: Side by Side

Here’s the same prompt sent to each model (a character greeting the user in the morning):

DeepSeek V3.2:

Sakura stretches at her desk, coffee already half-empty Oh, morning. You’re early today. …Or am I late again? checks phone Nope, you’re just early. Weirdo.

Claude Haiku 4.5:

The soft click of a keyboard pauses as Sakura glances up, coffee cup suspended halfway to her lips …You’re here already? She sets the cup down carefully, a small smile tugging at the corner of her mouth despite her best efforts I was just getting to the good part of this draft. You always have the worst timing.

Gemini 2.5 Flash:

Good morning! Sakura looks up from her laptop with a surprised expression. “Oh, you’re early! I wasn’t expecting you for another hour.” She quickly saves her document and adjusts her glasses. “Want some coffee? I just made a fresh pot.”

GLM-5:

Morning light filters through half-drawn blinds, catching dust motes in golden suspension. The office carries the bitter-sweet scent of over-steeped coffee. …Mm? Sakura’s pen halts mid-sentence, her gaze lifting from the manuscript with that particular unfocused look of someone pulled from deep concentration.

Analysis

DS3.2: Natural, casual, in-character. Not beautiful prose, but feels real.
Haiku: The best writing. Rich detail, subtle emotion. Prose quality is visibly higher.
Gemini: Functional but generic. Lost the character’s edge — too polite, too accommodating.
GLM-5: Beautiful atmosphere but slow to get to the character interaction. Best for scene-setting, not dialogue.

This is why we use multiple models. Each has a superpower. (For more on how we route between them, see Navigating AI Content Filters for Adult RP.)

Cost Comparison (Real Production Data)

Monthly costs running Suzune with multiple characters:

Model	Usage	Monthly Cost
DeepSeek V3.2	~60% of all calls	$15–25
Claude Haiku	~25% (rewrites)	$5–10
Gemini Flash	~10% (NPCs)	$2–3
GLM-5	~5% (scenes)	$1–2
Total		$23–40

If we ran everything on Claude Haiku: $150–250/month. The multi-model approach saves 80%+ while maintaining quality where it matters. (Full cost breakdown: Running an AI Bot on $50/month.)

Recommendations

If you’re building an RP bot:

Start with DeepSeek V3.2 as your primary. Best cost/quality/freedom ratio.
Add Claude Haiku as a quality layer if you can afford the extra cost.
Use Gemini for creative tasks (NPC generation, plot direction).
Route through OpenRouter — one API key for all models, easy switching.

If you just want to chat with AI characters:

JanitorAI lets you bring your own API key — connect your own OpenRouter key and choose your model
Candy AI and FantasyGF handle the model selection for you — no configuration needed

This article is based on production data from Suzune. Model performance may vary depending on use case, prompt design, and language. We’ll update this comparison as new models are released.

See also: Prompt Engineering for Immersive Roleplay for how we design prompts that work across models.