Your users want to see the character they’re talking to. Not a static avatar that never changes — they want her in a sundress at the beach, in a hoodie on a rainy day, blushing after a confession scene. The moment you add dynamic character portraits to an AI roleplay bot, engagement goes through the roof.
In Suzune, we generate character images on the fly using Flux with per-character LoRA models running on RunPod. This post covers the full pipeline: how LoRA configs are structured, how prompts get composed from expressions + outfits + scenes, and how we keep the whole thing stable in production.
Why LoRA + Flux?
If you’ve worked with Stable Diffusion models, you know the tradeoffs. Base models give you flexibility but zero character consistency. Full fine-tunes give you consistency but cost a fortune to train and host. LoRA (Low-Rank Adaptation) sits in the sweet spot:
- Tiny model files — a LoRA adapter is typically 50-150MB vs multi-GB for a full checkpoint
- Character-locked faces — train on 20-30 images of your character and the model reliably reproduces their features
- Composable — you can swap LoRA weights without reloading the base model
- Fast inference — negligible overhead on top of the base model
We use Flux as the base model because it handles anime/2D styles well out of the box while being strong enough for semi-realistic outputs. If your characters span 2D and 3D aesthetics (ours do), Flux is a solid foundation.
For compute, RunPod has been our go-to. The serverless GPU option means you’re not paying for idle time between portrait requests. More on the cost math in Running an AI Bot on $50/Month.
Per-Character LoRA Config in YAML
Every character in Suzune has a YAML config file. The image generation section looks like this:
```yaml
# characters/sakura/config.yaml
name: Sakura
image:
  method: lora  # or "img2img" for base-image characters
  lora:
    name: sakura_flux_v3
    trigger_word: "1girl, skr_character"
    strength: 0.85
  base_prompt: "masterpiece, best quality, detailed face"
  negative_prompt: "lowres, bad anatomy, extra fingers"
  default_expression: neutral
  default_outfit: school_uniform
```
The key fields:
- `lora.name` — maps to the LoRA file on the inference server. We version these (`_v3`) because you'll retrain as you refine the character design.
- `lora.trigger_word` — the token(s) that activate the LoRA. This gets injected into every prompt automatically. Without it, the base model ignores your LoRA weights entirely.
- `lora.strength` — how hard the LoRA pulls the output toward the trained character. `0.85` is our sweet spot for most characters. Too high (>0.95) and you get artifacts; too low (<0.7) and the face drifts.
This strength value is a server-side setting, not something users control. We tuned it per character during development because each LoRA responds differently depending on training data quality and quantity.
For characters that need visual transformations beyond what LoRA alone can handle, we use img2img with base image switching — covered in detail in Dynamic Character Visuals.
The Prompt Composition System
A raw portrait prompt for Suzune never gets written by hand at generation time. It’s assembled from layers:
```
[trigger_word] + [base_prompt] + [expression] + [outfit] + [scene] + [extras]
```
Here’s how that looks in code:
```python
def compose_prompt(character, context):
    parts = []

    # LoRA trigger (required for character identity)
    parts.append(character.lora.trigger_word)

    # Base quality tags
    parts.append(character.base_prompt)

    # Expression from story context
    expression = detect_expression(context.recent_messages)
    parts.append(EXPRESSION_MAP[expression])
    # e.g. "smile, happy, bright eyes" or "blush, embarrassed, looking away"

    # Outfit — either from wardrobe selection or story context
    outfit = resolve_outfit(character, context)
    parts.append(outfit.prompt_tags)
    # e.g. "black cocktail dress, bare shoulders, high heels"

    # Scene/background
    scene = resolve_scene(context)
    parts.append(scene.prompt_tags)
    # e.g. "night city background, neon lights, rain"

    return ", ".join(parts)
```
The expression detection reads recent chat messages and maps emotional states to prompt tags. Nothing fancy — keyword matching plus a small classifier. The important thing is that the portrait reacts to the conversation. If the character is angry in the story, the generated image shows anger.
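The keyword-matching half can be as simple as a lookup table scanned newest-message-first. A minimal sketch, assuming messages arrive as plain strings; the `EXPRESSION_KEYWORDS` table and its contents are illustrative, not Suzune's actual mappings:

```python
# Illustrative keyword table: maps emotional cues in chat text to an
# expression key (which compose_prompt then maps to prompt tags).
EXPRESSION_KEYWORDS = {
    "happy": ["laugh", "smile", "yay", "wonderful"],
    "angry": ["furious", "angry", "annoyed", "glare"],
    "embarrassed": ["blush", "embarrassed", "flustered"],
}

def detect_expression(recent_messages, default="neutral"):
    # Scan newest-to-oldest so the latest emotional cue wins
    for message in reversed(recent_messages):
        text = message.lower()
        for expression, keywords in EXPRESSION_KEYWORDS.items():
            if any(word in text for word in keywords):
                return expression
    return default
```

A small classifier can sit behind the same interface for messages the keyword pass misses.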
The Wardrobe System
This is where it gets fun. Suzune ships with 284 preset outfits organized by category:
```python
WARDROBE = {
    "casual": [
        {"name": "oversized_hoodie", "tags": "oversized hoodie, bare legs, messy hair"},
        {"name": "summer_dress", "tags": "white sundress, straw hat, sandals"},
        # ...
    ],
    "formal": [
        {"name": "cocktail_black", "tags": "black cocktail dress, updo hairstyle, earrings"},
        {"name": "evening_gown", "tags": "red evening gown, long gloves, elegant"},
        # ...
    ],
    "sleepwear": [ ... ],
    "swimwear": [ ... ],
    "work": [ ... ],
    "fantasy": [ ... ],
    # 20+ categories
}
```
When the story context mentions getting dressed up, going to the beach, or heading to bed, the system picks an appropriate outfit category and selects from the presets. Characters can also have outfit preferences — one character might favor gothic styles while another leans preppy.
The 284 number isn’t arbitrary. We started with about 40 and kept adding as users hit scenarios where the outfit selection felt wrong or repetitive. Wardrobes grow organically.
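Context-driven category selection can be sketched like this, assuming the story text is available as a string. The `SCENE_HINTS` table and `pick_outfit` helper are illustrative stand-ins for the real resolver:

```python
import random

# Illustrative hint table: scene keywords that trigger a wardrobe category
SCENE_HINTS = {
    "swimwear": ["beach", "pool", "swim"],
    "sleepwear": ["bed", "sleepy", "pajama"],
    "formal": ["party", "gala", "dressed up"],
}

def pick_outfit(wardrobe, context_text, preferred_category="casual"):
    text = context_text.lower()
    for category, hints in SCENE_HINTS.items():
        if any(hint in text for hint in hints):
            return random.choice(wardrobe[category])
    # No scene hint: fall back to the character's preferred category
    return random.choice(wardrobe[preferred_category])
```

Per-character preferences slot in through `preferred_category`, which is how one character ends up gothic and another preppy when the scene gives no hint.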
The RunPod Generation Pipeline
Here’s where the rubber meets the GPU. The actual generation flow:
```python
class LoRAGenerator:
    def __init__(self, config):
        self.backend = select_backend(config)
        # Backends: RunPod Public, RunPod Serverless, Nano Banana
        self.job_tracker = JobTracker()

    async def generate_portrait(self, character, context, user):
        prompt = compose_prompt(character, context)
        negative = character.negative_prompt

        payload = {
            "prompt": prompt,
            "negative_prompt": negative,
            "lora": character.lora.name,
            "lora_strength": character.lora.strength,
            "steps": 28,
            "cfg_scale": 7.0,
            "width": 768,
            "height": 1024,
            "seed": random_seed(),
        }

        # Submit to RunPod
        job = await self.backend.submit(payload)
        self.job_tracker.register(job.id, user_id=user.id)

        # Track progress with crash recovery
        result = await self.job_tracker.wait_for_completion(
            job.id,
            timeout=120,
            progress_callback=lambda p: notify_progress(user, p),
        )
        return result.image_url
```
A few things to note:
Backend abstraction. We’ve tested three compute providers and keep the option to switch. RunPod Serverless is our primary, but the abstraction layer (image_backend.py) means swapping providers is a config change, not a rewrite. When RunPod has capacity issues — and it happens during peak hours — we can fall back.
Job tracking with crash recovery. GPU jobs can fail silently. The server restarts, the connection drops, the job hangs. Our JobTracker persists job IDs and their state. If the bot process crashes and restarts, it picks up pending jobs instead of orphaning them. This was a painful lesson learned — without it, users would request an image and never get a response.
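The persistence half of that idea is small. A minimal sketch using a JSON file as the store; the real `JobTracker` also implements `wait_for_completion` and progress callbacks, which are omitted here:

```python
import json
import os

class JobTracker:
    """Crash-safe registry of pending GPU jobs (sketch, JSON-file backed)."""

    def __init__(self, path="pending_jobs.json"):
        self.path = path
        self.jobs = self._load()

    def _load(self):
        # On startup, reload whatever a previous process left behind
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return {}

    def register(self, job_id, user_id):
        self.jobs[job_id] = {"user_id": user_id, "state": "pending"}
        self._flush()

    def complete(self, job_id):
        self.jobs.pop(job_id, None)
        self._flush()

    def pending(self):
        # Called after a restart to resume jobs instead of orphaning them
        return list(self.jobs)

    def _flush(self):
        with open(self.path, "w") as f:
            json.dump(self.jobs, f)
```

A production version would use whatever datastore the bot already has (SQLite, Redis) rather than a flat file, but the contract is the same: every registered job survives a process restart.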
Progress updates. For Telegram-based interactions, we send progress messages (“Generating your image… 40%”) so users know something is happening. A 15-30 second wait with no feedback feels like a bug. With progress updates, it feels like a feature.
Choosing Generation Parameters
We landed on these defaults after extensive testing:
| Parameter | Value | Why |
|---|---|---|
| Steps | 28 | Sweet spot for Flux. 20 is noticeably worse, 35+ is diminishing returns |
| CFG Scale | 7.0 | Standard for Flux. Higher values over-saturate anime styles |
| Resolution | 768x1024 | Portrait aspect ratio. 512x is too low for detail, 1024x1024 wastes compute on background |
| LoRA Strength | 0.80-0.90 | Per-character tuning, but this range works for most |
These are portrait-oriented defaults. For scene-heavy images (character in a landscape), we switch to 1024x768 or 1024x1024.
Gallery and Metadata Tracking
Every generated image gets indexed with full metadata:
```python
image_record = {
    "id": unique_id(),
    "character": "sakura",
    "user_id": user.id,
    "prompt": full_prompt,
    "lora": "sakura_flux_v3",
    "lora_strength": 0.85,
    "expression": "happy",
    "outfit": "summer_dress",
    "scene": "beach_sunset",
    "seed": 48291037,
    "timestamp": now(),
    "favorited": False,
}
```
This metadata powers several features:
- Favorites — users can mark images they like, and we track which prompt combinations produce the best results
- Filtered queries — “show me all images of Sakura in formal wear” is a metadata query, not an AI search
- Reproducibility — with the seed and full prompt stored, any image can be regenerated exactly
- Analytics — we track which outfits and expressions get the most favorites, which informs the wardrobe system
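With records shaped like the dict above, the filtered queries reduce to exact-match filtering. A sketch (a real deployment would push these filters into a database query rather than scan in memory):

```python
def query_gallery(records, **filters):
    """Return records whose fields exactly match every given filter."""
    return [
        record for record in records
        if all(record.get(field) == value for field, value in filters.items())
    ]
```

So `query_gallery(records, character="sakura", outfit="cocktail_black")` answers the "Sakura in formal wear" request with zero AI involved.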
The gallery isn’t just a nice-to-have. It’s a retention mechanic. Users come back to browse their collection, try to get better versions of their favorites, and share results. It turns image generation from a throwaway feature into a collection game.
Training Your Own LoRAs
A quick overview of the LoRA training pipeline, since the generation system is only as good as the models it runs.
Dataset preparation:
- Collect 20-30 high-quality images of the character (consistent art style)
- Crop to face + upper body (the model needs to learn the face, not random backgrounds)
- Caption each image with the trigger word + description tags
- Remove any images with inconsistent features (different eye colors, etc.)
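For step 3, a caption file for one training image might look like the following; the tags are illustrative, and only the trigger word comes from the config earlier in this post:

```
1girl, skr_character, upper body, smiling, school uniform, simple background
```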
Training parameters we’ve found reliable for Flux LoRAs:
- Training steps: 1500-2500 (depends on dataset size)
- Learning rate: 1e-4
- LoRA rank: 16-32 (higher rank = more capacity but larger file)
- Batch size: 1-2 (limited by VRAM)
- Resolution: 768x768 or 1024x1024
Common mistakes we’ve made:
- Overtraining — the LoRA memorizes exact poses from the training set instead of generalizing. Keep validation images separate and check flexibility.
- Inconsistent training data — mixing different art styles in the training set produces a LoRA that can’t commit to either. Pick one style per LoRA.
- Weak trigger words — using common words as triggers (e.g., just "girl") means the trigger competes with normal prompt tokens. Use a unique token like `skr_character`.
We retrain LoRAs roughly every 2-3 months per character as we refine their designs or accumulate better training data.
txt2img vs img2img: When to Use Each
Suzune supports both generation modes, and they serve different purposes:
txt2img (LoRA) is the default. You get a fresh image every time, fully guided by the prompt. Best for:
- Standard portrait generation
- Characters with a single consistent look
- Maximum variety in poses and compositions
img2img (base image) uses a reference image as the starting point. Best for:
- Characters with multiple visual modes (see base image switching)
- When you need tighter control over pose/composition
- Maintaining consistency across a series of images
Some characters in Suzune use txt2img with LoRA. Others use img2img with curated base images. A few use both depending on the scenario. The method field in the character YAML controls which pipeline runs.
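Dispatching on `method` can be a small registry keyed by the YAML value. The decorator and placeholder generators below are an illustrative sketch, not Suzune's actual code:

```python
PIPELINES = {}

def pipeline(name):
    # Register a generator function under its YAML `method` value
    def register(fn):
        PIPELINES[name] = fn
        return fn
    return register

@pipeline("lora")
def generate_txt2img(character, context):
    return f"txt2img:{character}"  # placeholder for the LoRA pipeline

@pipeline("img2img")
def generate_img2img(character, context):
    return f"img2img:{character}"  # placeholder for the base-image pipeline

def generate(method, character, context):
    try:
        return PIPELINES[method](character, context)
    except KeyError:
        raise ValueError(f"unknown image method: {method}") from None
```

Characters that use both modes just pick a `method` per scenario before calling `generate`.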
Production Lessons
After running this system for months and generating thousands of images, here’s what we wish we’d known on day one:
Cache aggressively. If the same character + outfit + expression combo gets requested twice, serve the cached version. GPU time isn’t free. We cache based on a hash of the prompt + LoRA + strength.
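The cache key described above can be sketched as a stable hash over the identity-relevant fields, with the seed deliberately excluded so repeat requests hit the cache:

```python
import hashlib
import json

def cache_key(prompt, lora_name, lora_strength):
    # Serialize with sorted keys so identical inputs always hash the same
    payload = json.dumps(
        {"prompt": prompt, "lora": lora_name, "strength": lora_strength},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```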
Queue, don’t block. Image generation takes 15-30 seconds. Never make the chat wait synchronously. Queue the job, continue the conversation, and deliver the image when it’s ready.
Have fallback images. When the GPU is down, when RunPod is overloaded, when the job times out — have a set of pre-generated images per character that can serve as fallbacks. A generic portrait is better than an error message.
Monitor LoRA drift. After model updates on the inference server, verify your LoRAs still produce correct output. We’ve had base model updates silently degrade LoRA quality. Automated visual regression tests (even simple CLIP-based similarity checks) save you from shipping broken portraits.
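A minimal version of such a check, assuming you already have embedding vectors (e.g. from a CLIP encoder) for a known-good reference portrait and a freshly generated sample; the 0.85 threshold is illustrative and would need tuning per character:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def lora_still_healthy(reference_vec, candidate_vec, threshold=0.85):
    # Flag the LoRA as degraded when a fresh sample drifts too far
    # from a known-good reference portrait in embedding space
    return cosine_similarity(reference_vec, candidate_vec) >= threshold
```

Run this against a fixed-seed test prompt after every inference-server update, and alert when it fails.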
Log everything. Every failed generation, every timeout, every weird artifact. The metadata you collect is how you debug issues that only appear at scale.
What This Costs
Running GPU inference isn’t cheap, but it’s manageable. On RunPod Serverless with Flux:
- ~$0.002-0.005 per image at current pricing
- Average user generates 5-15 images per session
- Monthly cost scales linearly with active users
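Those numbers make the scaling easy to sanity-check. A back-of-envelope model, where `sessions_per_user` and the midpoint defaults are assumptions for illustration, not measured values:

```python
def monthly_image_cost(active_users, sessions_per_user=8,
                       images_per_session=10, cost_per_image=0.0035):
    # Linear model: users x sessions x images x unit price
    return active_users * sessions_per_user * images_per_session * cost_per_image
```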
At our current scale, image generation is about 30% of total infrastructure cost. The full cost breakdown is in Running an AI Bot on $50/Month.
If you’re just starting out, RunPod’s serverless option is the way to go — you pay per second of compute, no idle costs. We burned money on dedicated instances before switching.
Try It Without Building It
If you want to experience AI character portraits without building the whole pipeline yourself, platforms like CandyAI offer character image generation out of the box. It’s a good way to understand what users expect before investing in a custom system.
But if you’re the type who reads 2000-word posts about LoRA pipelines — you probably want to build it yourself. And honestly? The custom system is where the magic is. You control the character identity, the wardrobe, the expressions, the quality. No platform can match that level of customization.
Wrapping Up
The character portrait pipeline is one of the highest-impact features in Suzune. It transforms a text-only chat into something that feels alive. Users care about seeing their character react visually to the story.
The stack is straightforward: Flux as the base model, per-character LoRA adapters for identity, YAML-driven config, prompt composition from expression + outfit + scene layers, and RunPod for inference. The complexity isn’t in any single piece — it’s in making them work together reliably at scale.
If you’re building an AI roleplay bot and haven’t added image generation yet, this is the feature that will change your retention numbers overnight.
Check out the full architecture overview if you want to see how the image pipeline fits into the broader system. And for the companion technique to LoRA-based generation, see how we handle base image switching for character transformations.
Want to start building? Grab a RunPod account, train a LoRA on your character, and start with the simplest possible pipeline — just trigger word + base quality prompt. You can add the wardrobe system, expression detection, and gallery later. Ship the MVP first.