There are plenty of LLM benchmarks comparing models on coding and math. There are almost none comparing them on roleplay quality.
We run 5 different models in production across Suzune, each assigned to specific roles based on months of testing. This isn’t a synthetic benchmark — it’s what actually happens when real users interact with AI characters for hours at a time.
Table of contents
Open Table of contents
The Models We Use (And Why)
| Model | Role in Suzune | Why This Model |
|---|---|---|
| DeepSeek V3.2 | Primary chat | Best cost/quality/freedom ratio |
| Claude Haiku 4.5 | Quality rewrite + fallback | Best prose quality |
| Gemini 2.5 Flash | NPC direction | Creative, cheap, NSFW-tolerant |
| Gemini 2.0 Flash | NPC rewrite | Cheapest per token |
| GLM-5 | Scene descriptions | Best at atmosphere and world-building |
No single model is “the best.” Each excels at something different.
DeepSeek V3.2: The Workhorse
Role: Primary model for all character conversations
Strengths
- NSFW tolerance: Will write explicit content without jailbreaks or prompt hacks
- Cost: $0.25/$0.40 per million tokens — roughly 1/10th of Claude Haiku
- Context window: Large enough for our needs (128K)
- Tool calling: Supports function calling (with caveats — see Quirks)
Weaknesses
- Prose quality: Good but not great. Sentences can feel mechanical compared to Claude
- Character voice drift: More likely to lose the character’s speech patterns in long conversations
- Tokenization glitches: Occasionally produces garbled Japanese compound words
The Quirks You Need to Know
DeepSeek V3.2 has some unique behaviors that require engineering workarounds:
1. Tokenization Glitches
DS3.2 sometimes splits Japanese words incorrectly:
社長 (president) → 社long
部長 (department head) → 部long
We have a cleanup function that catches these:
text = text.replace("社long", "社長")
text = text.replace("部long", "部長")
2. Tool Calls as Plain Text
DS3.2 sometimes outputs function calls as plain text instead of structured tool calls:
generate_image{"expression": "smiling", "scene": "café"}
We built a parser that detects and extracts these, converting them to proper tool calls.
3. NSFW Self-Censorship When Tools Are Active
Interesting one: DS3.2 is more likely to self-censor NSFW content when tool definitions are present in the prompt. Our workaround: if we detect empty responses in NSFW context, we retry without tool definitions.
4. Repetition Loops
Under certain conditions, DS3.2 gets stuck repeating short phrases (e.g., the same word 50 times). We truncate any phrase repeated more than 3 consecutive times.
Verdict
DeepSeek V3.2 is the best overall choice for NSFW roleplay — not because it’s the highest quality, but because it’s the only model that combines decent quality, NSFW freedom, and affordable pricing. Every other model requires compromises on at least one of these axes.
Claude Haiku 4.5: The Editor
Role: Quality rewrite pass (polishes DS3.2 drafts) + fallback for non-NSFW
Strengths
- Prose quality: The best creative writing of any model at its price point
- Character consistency: Excellent at maintaining voice across long conversations
- Instruction following: Follows system prompt rules more precisely than DS3.2
- Prompt caching: Anthropic’s prompt caching reduces costs significantly for repeated calls
Weaknesses
- NSFW censorship: Will refuse explicit content consistently
- Cost: $0.80/$4.00 per million tokens — 10x more than DS3.2 for output
- Silent sanitization: Sometimes “rewrites” NSFW content into sanitized versions without explicitly refusing
How We Use It
Claude Haiku is NOT our primary model. It’s our quality editor:
User message → DeepSeek V3.2 (draft, uncensored)
↓
Claude Haiku (rewrite for quality)
↓
Censorship check:
├── Rewrite OK → use polished version
└── Rewrite censored → use original DS3.2 draft
The censorship detection looks for:
- Rewrite significantly shorter than draft (< 60% length)
- Explicit refusal phrases in the rewrite
- Language contamination (Claude sometimes switches languages when uncomfortable)
When the rewrite pipeline works (non-explicit scenes), the quality improvement is noticeable — better word choice, more natural rhythm, stronger character voice. For explicit scenes, we skip it entirely and serve the DS3.2 draft.
Cost Optimization: Prompt Caching
Using Anthropic’s native API (not via OpenRouter), we enable prompt caching for the system prompt. Since the character persona rarely changes, cached tokens cost 1/10th of uncached. This makes the rewrite pass much cheaper per message.
Verdict
Claude Haiku is the best prose writer in our stack, but its NSFW restrictions make it unsuitable as a primary model. As a quality layer on top of DS3.2, it’s worth the extra cost for characters where voice quality matters most.
Gemini 2.5 Flash: The Director
Role: NPC concept generation and scene direction
Strengths
- Creativity: Excellent at generating interesting NPC concepts, plot twists, and scenarios
- NSFW tolerance: More permissive than Claude (will handle suggestive content)
- Cost: $0.30/$2.50 per million tokens — much cheaper than Claude for output
- Speed: Fast response times
Weaknesses
- Character voice: Less consistent at maintaining a specific character’s speech patterns
- Japanese quality: Occasionally mixes in unnatural expressions
- Training data leakage: Sometimes outputs phrases from its training data in responses
How We Use It
Gemini excels at high-level creative direction rather than line-by-line dialogue:
- “Generate an NPC who would create an interesting conflict with Sakura”
- “Suggest a plot twist for this storyline”
- “Describe the atmosphere of this location”
We don’t use it for primary character chat because it’s less reliable at maintaining character voice, but for creative tasks where variety matters more than consistency, it’s excellent.
Verdict
The best “creative consultant” in our stack. We wouldn’t use it as a primary chat model, but for NPC generation and scene setup, it’s the best balance of creativity, NSFW tolerance, and cost.
GLM-5: The World Builder
Role: Scene descriptions and atmosphere
Strengths
- Atmospheric writing: Produces rich, evocative scene descriptions
- World-building: Excellent at describing locations, weather, ambient details
- Japanese fluency: Handles Japanese text naturally
Weaknesses
- Character dialogue: Not as strong for conversational back-and-forth
- Cost: $0.80/$2.56 — not the cheapest option
- Availability: Less well-known, fewer community resources
How We Use It
GLM-5 handles “RP room scenarios” — the immersive scene descriptions that set the stage:
Rain drums against the floor-to-ceiling windows of the 40th floor office. The city lights blur into watercolor smudges below. A single desk lamp casts a warm pool of light across scattered documents…
For pure atmosphere, GLM-5 outperforms every other model in our stack.
Verdict
A specialist. Not for everything, but unmatched for environmental descriptions and world-building.
The Real Comparison: Side by Side
Here’s the same prompt sent to each model (a character greeting the user in the morning):
DeepSeek V3.2:
Sakura stretches at her desk, coffee already half-empty Oh, morning. You’re early today. …Or am I late again? checks phone Nope, you’re just early. Weirdo.
Claude Haiku 4.5:
The soft click of a keyboard pauses as Sakura glances up, coffee cup suspended halfway to her lips …You’re here already? She sets the cup down carefully, a small smile tugging at the corner of her mouth despite her best efforts I was just getting to the good part of this draft. You always have the worst timing.
Gemini 2.5 Flash:
Good morning! Sakura looks up from her laptop with a surprised expression. “Oh, you’re early! I wasn’t expecting you for another hour.” She quickly saves her document and adjusts her glasses. “Want some coffee? I just made a fresh pot.”
GLM-5:
Morning light filters through half-drawn blinds, catching dust motes in golden suspension. The office carries the bitter-sweet scent of over-steeped coffee. …Mm? Sakura’s pen halts mid-sentence, her gaze lifting from the manuscript with that particular unfocused look of someone pulled from deep concentration.
Analysis
- DS3.2: Natural, casual, in-character. Not beautiful prose, but feels real.
- Haiku: The best writing. Rich detail, subtle emotion. Prose quality is visibly higher.
- Gemini: Functional but generic. Lost the character’s edge — too polite, too accommodating.
- GLM-5: Beautiful atmosphere but slow to get to the character interaction. Best for scene-setting, not dialogue.
This is why we use multiple models. Each has a superpower. (For more on how we route between them, see Navigating AI Content Filters for Adult RP.)
Cost Comparison (Real Production Data)
Monthly costs running Suzune with multiple characters:
| Model | Usage | Monthly Cost |
|---|---|---|
| DeepSeek V3.2 | ~60% of all calls | $15–25 |
| Claude Haiku | ~25% (rewrites) | $5–10 |
| Gemini Flash | ~10% (NPCs) | $2–3 |
| GLM-5 | ~5% (scenes) | $1–2 |
| Total | $23–40 |
If we ran everything on Claude Haiku: $150–250/month. The multi-model approach saves 80%+ while maintaining quality where it matters. (Full cost breakdown: Running an AI Bot on $50/month.)
Recommendations
If you’re building an RP bot:
- Start with DeepSeek V3.2 as your primary. Best cost/quality/freedom ratio.
- Add Claude Haiku as a quality layer if you can afford the extra cost.
- Use Gemini for creative tasks (NPC generation, plot direction).
- Route through OpenRouter — one API key for all models, easy switching.
If you just want to chat with AI characters:
- JanitorAI lets you bring your own API key — connect your own OpenRouter key and choose your model
- Candy AI and FantasyGF handle the model selection for you — no configuration needed
This article is based on production data from Suzune. Model performance may vary depending on use case, prompt design, and language. We’ll update this comparison as new models are released.
See also: Prompt Engineering for Immersive Roleplay for how we design prompts that work across models.