
Dynamic Character Visuals: How One Character Can Look Like Two Different People

One of our characters in Suzune has a secret: she’s beautiful, but she hides it.

Mao is designed as the “plain office worker” archetype — no makeup, thick glasses, hair with no styling, clothes buttoned up to the collar. But when she decides to dress up? Different person. Crimson lipstick, smokey eyeshadow, hair styled, glasses off. Same woman, completely different energy.

The technical challenge: how do you make an AI image generation system produce two visually distinct versions of the same character, automatically, based on the story context?

Here’s how we solved it.


The Problem: One Face, Two Modes

Most AI character image systems use one of two approaches:

  1. LoRA models — A fine-tuned model that always produces the same character look
  2. Appearance prompts — Text descriptions that guide generation (inconsistent across images)

Neither handles the “transformation” use case well: a LoRA locks the character into a single look, and appearance prompts drift too much to keep two distinct looks recognizable as the same person.

What we needed: a system that swaps the character’s visual foundation based on what’s happening in the story.


The Solution: img2img Base Image Variants

Instead of LoRA, Mao uses img2img generation — the model takes a base image as input and transforms it according to the prompt while preserving the core facial structure.

The key insight: use different base images for different character states.

characters/mao/
├── base_image.png            ← Default: glasses, no makeup, plain
├── base_image_serious.png    ← Activated: no glasses, makeup-ready face
└── character.yaml

Both images are the same person with the same facial structure. But the base composition is different:

|  | Default (base_image.png) | Serious (base_image_serious.png) |
| --- | --- | --- |
| Glasses | On | Off |
| Expression | Neutral, slightly stiff | Confident, composed |
| Hair | Unstyled, parted in middle | Slightly refined |
| Makeup | None | Minimal (ready for prompt to add more) |

The img2img pipeline then applies the scene prompt on top of whichever base is active. The result: consistent facial identity with dramatically different vibes.


Automatic Switching Logic

The base image swap is triggered by outfit keywords. When the character’s current outfit includes makeup items, the system automatically switches to the serious base image.

Here’s the core logic:

# Detect makeup in current outfit → switch base image
from pathlib import Path

MAKEUP_KEYWORDS = (
    "lipstick", "eyeshadow", "mascara",
    "eyeliner", "makeup", "cosmetics",
)

def select_base_image(base_image: Path, clothing: str) -> Path:
    """Return the makeup-ready variant when the outfit mentions makeup."""
    if any(kw in clothing.lower() for kw in MAKEUP_KEYWORDS):
        serious_path = base_image.parent / "base_image_serious.png"
        if serious_path.exists():
            return serious_path  # swap!
    return base_image

That’s it. The detection is deliberately simple — keyword matching on the outfit string. No ML, no complex logic. Just: does the outfit mention makeup? → use the makeup-ready face.

Why Keywords Instead of Something Smarter?

Because outfit descriptions are generated by our own system (the wardrobe engine), so we control the vocabulary. We don’t need fuzzy matching when we write the prompts ourselves.

The wardrobe entry for Mao’s “queen mode” makeup looks like this:

{
  "name": "Seductive Queen Makeup (Mao exclusive)",
  "prompt": "seductive queen makeup, no glasses, dark crimson lipstick, heavy mascara, smokey eyeshadow, sharp eyeliner, flawless porcelain skin",
  "tags": ["makeup", "seductive", "queen", "mao"],
  "exclusive": ["mao"]
}

When this wardrobe item is active, the clothing string contains “lipstick” and “eyeshadow” → the keyword check fires → base image swaps.
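As a minimal, self-contained sketch (variable names are illustrative), here is that check firing on the wardrobe entry’s prompt string:

```python
# Keyword set from the detection snippet above
makeup_keywords = (
    "lipstick", "eyeshadow", "mascara",
    "eyeliner", "makeup", "cosmetics",
)

# Clothing string assembled from the active wardrobe entry's prompt
clothing = (
    "seductive queen makeup, no glasses, dark crimson lipstick, "
    "heavy mascara, smokey eyeshadow, sharp eyeliner, "
    "flawless porcelain skin"
)

needs_serious_base = any(kw in clothing.lower() for kw in makeup_keywords)
print(needs_serious_base)  # → True
```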


The Visual Pipeline

Here’s the full flow when Mao generates a selfie:

1. LLM decides to send a selfie (tool call)

2. Load daily outfit from daily_outfit.json

3. Check outfit for makeup keywords

    ┌────┴────┐
    │ No      │ Yes
    ▼         ▼
base_image  base_image_serious
  .png        .png
    │         │
    └────┬────┘

4. Encode base image as input to SDXL img2img

5. Apply scene prompt + outfit prompt

6. RunPod generates image → send to chat

The beauty of this approach: the switching is invisible to the LLM. The character doesn’t need to know which base image is being used. It just describes the scene, and the image system handles the visual consistency.
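Putting steps 2–5 together, a minimal sketch might look like this (file names follow the layout above; the function names are assumptions, and the actual RunPod/SDXL call is left out):

```python
import json
from pathlib import Path

MAKEUP_KEYWORDS = (
    "lipstick", "eyeshadow", "mascara",
    "eyeliner", "makeup", "cosmetics",
)

def resolve_base_image(char_dir: Path, clothing: str) -> Path:
    """Step 3: pick the base image variant from the outfit string."""
    if any(kw in clothing.lower() for kw in MAKEUP_KEYWORDS):
        serious = char_dir / "base_image_serious.png"
        if serious.exists():
            return serious
    return char_dir / "base_image.png"

def build_generation_request(char_dir: Path, scene_prompt: str) -> dict:
    """Steps 2-5: load the outfit, pick a base, assemble the img2img request."""
    outfit = json.loads((char_dir / "daily_outfit.json").read_text())
    clothing = outfit["prompt"]
    return {
        "init_image": str(resolve_base_image(char_dir, clothing)),
        "prompt": f"{scene_prompt}, {clothing}",  # scene + outfit prompt
        # This dict would then be sent to the SDXL img2img endpoint.
    }
```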


Designing the Base Images

The hardest part isn’t the code — it’s creating base images that work well with img2img.

Rules for Good Base Images

1. Same face, different energy

Both base images must be unmistakably the same person. Use the same face shape, eye color, skin tone, and proportions.

Change only the expression, accessories (glasses on/off), hair styling, and makeup readiness.

2. Neutral enough for img2img to work with

Base images shouldn’t be too detailed or specific. The img2img pipeline needs room to apply the scene prompt. If the base image is already wearing a red dress, it’s harder for the model to generate her in a white one.

Keep base images in simple clothing or a neutral composition.

3. High denoising strength for outfit changes, low for facial consistency

This is the balancing act of img2img:

| Denoising Strength | Effect |
| --- | --- |
| 0.3–0.4 | Face stays very consistent, but outfits barely change |
| 0.5–0.6 | Good balance — face recognizable, outfits change well |
| 0.7–0.8 | Outfits change dramatically, but face may drift |

We typically use 0.5–0.6 for general scenes and 0.4–0.5 for close-up portraits where facial consistency matters most.
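In code, that tuning can be a simple heuristic. A sketch, with the helper name and exact values as assumptions based on the ranges above:

```python
def pick_denoising_strength(shot_type: str) -> float:
    """Pick an img2img denoising strength per the trade-off above."""
    if shot_type == "portrait":
        # Close-up: facial consistency matters most
        return 0.45
    # General scene: give the outfit prompt room to work
    return 0.55
```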


Why This Matters for Character Design

The base image variant system isn’t just a technical feature — it’s a character design tool.

Mao’s “transformation” is part of her character arc. She’s designed as someone who doesn’t care about appearances, but when the moment calls for it — a date, a confrontation, or a moment of confidence — she transforms.

This mirrors a popular character archetype in anime and manga: the “hidden beauty” (隠れ美人). The gap between her daily appearance and her full potential is part of what makes her compelling.

Without dynamic visuals, this character concept falls flat. You can describe the transformation in text, but showing it in generated images makes it visceral.

The Gap Effect

In character design, “gap moe” (ギャップ萌え) — the appeal of contrast between a character’s usual presentation and a suddenly revealed different side — is one of the most powerful tools available.

The base image switching system lets us express gap moe visually, not just textually. And users absolutely notice.


Extending the Pattern

While Mao uses “plain → glamorous,” the same pattern works for many character transformations:

| Character Concept | Default Base | Variant Base | Trigger |
| --- | --- | --- | --- |
| Hidden beauty (Mao) | Glasses, no makeup | No glasses, confident | Makeup keywords |
| Warrior/fighter | Casual clothes | Battle-ready, intense | Weapon/armor keywords |
| Shy character | Averted gaze, closed posture | Eye contact, open posture | High affection score |
| Idol/performer | Offstage casual | Stage costume, spotlights | Performance scene |

The trigger doesn’t have to be outfit-based either. You could switch base images on affection score, scene type, story progress, or any other state your system already tracks.

We haven’t implemented all of these yet, but the architecture supports them with minimal changes.
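One way to generalize the keyword check is a small variant registry where the first matching rule wins and the default base is the fallback. A sketch (filenames and keyword sets are illustrative):

```python
# Ordered rules: (variant filename, trigger keywords). First match wins.
VARIANT_RULES = [
    ("base_image_serious.png", ("lipstick", "eyeshadow", "makeup")),
    ("base_image_battle.png", ("sword", "armor", "weapon")),
]

def pick_variant(trigger_text: str) -> str:
    """Return the base image filename for the current character state."""
    text = trigger_text.lower()
    for filename, keywords in VARIANT_RULES:
        if any(kw in text for kw in keywords):
            return filename
    return "base_image.png"
```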


Implementation Checklist

If you want to add this to your own bot:

  1. Create 2+ base images of your character with consistent facial features but different compositions
  2. Name them consistently: base_image.png (default), base_image_[variant].png
  3. Add keyword detection in your image generation pipeline — check the outfit/scene string for trigger words
  4. Swap the base image path before passing it to your img2img model
  5. Test denoising strength — find the sweet spot between outfit flexibility and facial consistency

The code change is genuinely small (< 10 lines). The character design work is where the real effort goes.


Tools We Use

If you’re not ready to build your own image pipeline, platforms like Candy AI and DreamGF offer built-in character customization with appearance variants — not as flexible as a custom system, but a good starting point.


This article is part of WaifuStack’s series on building AI roleplay bots. See also: Prompt Engineering for Immersive Roleplay and From Idea to Production.

Working on something similar? Share your approach on X.


