
Zonghao Yuan fcd4e6c1ab feat(tts): Xiaomi MiMo per-beat voice + MOCK_IMAGE testing aid (#3)

Adds optional Xiaomi MiMo TTS layer on top of the scene/beat engine and a MOCK_IMAGE flag for cheap local TTS iteration.

- Per-character voice provisioning via MiMo voice design → clone, reference audio persisted in session
- Per-line free-form delivery direction (the Director writes instructions in the style of "mustering courage yet shy, voice trembling"; these are sent to MiMo's director channel and never read aloud)
- Per-beat audio served with the scene response; frontend plays via hidden <audio> with typewriter synced to audio duration; mute toggle persisted via localStorage lazy initializer
- Graceful degradation: any TTS step failing → silent beat, game continues
- MOCK_IMAGE=true returns a sharp-generated placeholder PNG so local TTS iteration doesn't burn image tokens
- Recommended config in .env.example: MiMo Token Plan covers TEXT/VISION/TTS with one key (mimo-v2.5-pro for text, mimo-v2.5 omni for vision, mimo-v2.5-tts for TTS)

Squashed from #3:
- feat(tts): Xiaomi MiMo per-beat voice-over + per-session character voices + free-text delivery direction
- feat(engine): MOCK_IMAGE placeholder image for easier local testing
- fix(tts): address Copilot review on PR #3
- fix(tts): Copilot round-2 review feedback

Known limitation: Session.characters carries the full WAV reference audio (~200-300KB/character base64) and round-trips through every /api/scene, /api/vision, /api/insert-beat request. This is intrinsic to MiMo's design→clone model (voice identity IS the audio, no server-side voiceId). Fixing it requires server-side storage, which is out of scope here; the limitation is documented for future hardening.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

2026-05-28 20:45:21 +08:00

apps/web

feat(tts): Xiaomi MiMo per-beat voice + MOCK_IMAGE testing aid (#3)

2026-05-28 20:45:21 +08:00

packages

feat(tts): Xiaomi MiMo per-beat voice + MOCK_IMAGE testing aid (#3)

2026-05-28 20:45:21 +08:00

.gitignore

fix(web): tame Next.js 16 dev server CPU runaway

2026-05-10 10:12:54 +08:00

package.json

refactor: rename project DADA → 云梦 (slug: yume)

2026-05-24 10:14:14 +08:00

pnpm-lock.yaml

feat(tts): Xiaomi MiMo per-beat voice + MOCK_IMAGE testing aid (#3)

2026-05-28 20:45:21 +08:00

pnpm-workspace.yaml

Initial commit: AI-driven visual novel scaffold

2026-05-09 13:29:58 +08:00

README.md

feat: scene/beat architecture — decouple dialogue from image generation (#2 )

2026-05-28 15:20:12 +08:00

tsconfig.base.json

Initial commit: AI-driven visual novel scaffold

2026-05-09 13:29:58 +08:00

vercel.json

feat: prefetch, vision split, provider adapter, UI polish

2026-05-12 19:38:03 +08:00

README.md

云梦

A visual novel directed and painted by AI, one scene at a time. You talk and explore within a scene; when the story turns a corner, the AI paints the next one. You click. It paints. The story unfolds.

How it works

The story unfolds as a sequence of scenes. Each scene is one AI-painted background plus a short tree of beats — moments of narration, dialogue, and the occasional choice. You tap through a scene's beats and the image stays put; only when a choice leads somewhere genuinely new — another place, a new point of view, a jump in time — does the AI paint the next scene.

entering a scene
        │
        ▼
1. Text LLM     directs the whole scene at once — a background prompt
                plus a tree of beats (narration / dialogue / choices)
        │
        ▼
2. Image model  paints the background once, 16:9, no UI baked in
        │
        ▼
[ tap through beats — no model calls, instant ]
        │
        ├─ in-scene choice ──────▶ jump to another beat (instant)
        │
        └─ scene-change choice ──▶ the next scene
                                   (usually pre-generated — see below)
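The flow above can be modelled as a small data structure plus one pure function. This is a minimal sketch under stated assumptions: `Scene`, `Beat`, `Choice`, `Step`, and `advance` are hypothetical names chosen for illustration, and the real definitions in packages/types may look different.

```typescript
// Hypothetical shapes for illustration; the real types in packages/types may differ.
type Choice =
  | { label: string; kind: "in-scene"; targetBeatId: string }
  | { label: string; kind: "scene-change"; nextSceneHint: string };

interface Beat {
  id: string;
  kind: "narration" | "dialogue";
  text: string;
  speaker?: string;
  nextBeatId?: string; // linear continuation, if any
  choices?: Choice[];  // offered once this beat's text has been shown
}

interface Scene {
  backgroundPrompt: string; // sent to the image model once per scene
  beats: Record<string, Beat>;
  entryBeatId: string;
}

type Step =
  | { type: "beat"; beat: Beat }
  | { type: "new-scene"; hint: string }
  | { type: "end" };

// Resolve a tap (no choice) or a picked choice without calling any model.
function advance(scene: Scene, current: Beat, choice?: Choice): Step {
  if (choice && choice.kind === "scene-change") {
    return { type: "new-scene", hint: choice.nextSceneHint };
  }
  // Either an in-scene choice or a plain tap on the current beat.
  const targetId = choice ? choice.targetBeatId : current.nextBeatId;
  const next = targetId ? scene.beats[targetId] : undefined;
  return next ? { type: "beat", beat: next } : { type: "end" };
}
```

Only the `new-scene` step would need the text and image models; every other step is a lookup in the beat tree, which is why tapping through a scene is instant.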

While you're reading one scene, the engine speculatively generates the scenes your choices could lead to — and, for unavoidable next steps, the scene after that. By the time you pick a direction, its image is usually already painted, so the cut feels instant.
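One way to picture the speculative generation described above is a cache of in-flight promises keyed by choice. The sketch below is an assumption-laden illustration, not the engine's actual API: `ScenePrefetcher`, `GeneratedScene`, and the `generate` callback are all hypothetical names.

```typescript
// Hypothetical prefetch cache; names and shapes are illustrative only.
type GeneratedScene = { id: string; imageUrl: string };
type Generate = (choiceId: string) => Promise<GeneratedScene>;

class ScenePrefetcher {
  private inflight = new Map<string, Promise<GeneratedScene>>();
  private generate: Generate;

  constructor(generate: Generate) {
    this.generate = generate;
  }

  // Start generating every reachable scene while the player is still reading.
  prefetch(choiceIds: string[]): void {
    for (const id of choiceIds) {
      // Swallow errors here; a failure only matters if the player picks this choice.
      this.get(id).catch(() => {});
    }
  }

  // Return the in-flight or finished promise if one exists, otherwise start generation.
  get(choiceId: string): Promise<GeneratedScene> {
    let pending = this.inflight.get(choiceId);
    if (!pending) {
      pending = this.generate(choiceId);
      this.inflight.set(choiceId, pending);
      // Drop failed generations so a later pick can retry from scratch.
      pending.catch(() => this.inflight.delete(choiceId));
    }
    return pending;
  }
}
```

When the player finally picks a direction, calling `get` with that choice either resolves immediately (the scene is already painted) or awaits the generation that is already under way, so no work is duplicated.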

Clicking the background itself (not a button) routes through a vision model: it reads where you tapped and decides whether you're exploring the current scene (it inserts a beat — no new image) or moving on (a new scene).
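The click routing can be sketched as two small helpers: one that turns a pixel click into image-relative coordinates, and one that validates the vision model's reply. The README does not show the real prompt or response format, so the `ClickDecision` shape, the field names, and both function names are assumptions made for illustration.

```typescript
// Hypothetical decision shape; the real vision response format is not documented here.
type ClickDecision =
  | { action: "insert-beat"; text: string }
  | { action: "new-scene"; hint: string };

// Convert a pixel click into coordinates relative to the image, clamped to [0, 1].
function normalizeClick(x: number, y: number, width: number, height: number) {
  const clamp = (v: number) => Math.min(1, Math.max(0, v));
  return { x: clamp(x / width), y: clamp(y / height) };
}

// Validate the model's JSON reply; anything unrecognised yields null ("do nothing").
function parseDecision(raw: string): ClickDecision | null {
  try {
    const d = JSON.parse(raw);
    if (d?.action === "insert-beat" && typeof d.text === "string") {
      return { action: "insert-beat", text: d.text };
    }
    if (d?.action === "new-scene" && typeof d.hint === "string") {
      return { action: "new-scene", hint: d.hint };
    }
  } catch {
    // Malformed JSON falls through to the null return below.
  }
  return null;
}
```

Treating an unparseable reply as "do nothing" keeps a stray click from breaking the scene; a real deployment might prefer to retry or fall back to inserting a neutral beat.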

There is no traditional game UI baked into the art. The AI paints the world in whatever style you pick — "stick figure on grid paper" or "cyberpunk noir" — and the dialogue panel and choice buttons are a light HTML layer drawn on top, tuned to sit over the scene.

One-click deploy

After deploying, set the nine environment variables (see below) in your Vercel project. That's it.

Environment variables

Three providers, all independently configurable. Any OpenAI-compatible chat / image endpoint works (OpenAI, Anthropic via OpenAI-compat proxy, Gemini, OpenRouter, DeepSeek, local Ollama, …).

| Provider | Variables | Recommended |
| --- | --- | --- |
| Text · story director | `TEXT_BASE_URL` `TEXT_API_KEY` `TEXT_MODEL` | `claude-opus-4-7` via Anthropic |
| Image · UI renderer | `IMAGE_BASE_URL` `IMAGE_API_KEY` `IMAGE_MODEL` | `gpt-image-2` via OpenAI |
| Vision · click reader | `VISION_BASE_URL` `VISION_API_KEY` `VISION_MODEL` | `gemini-3-flash` via Google |

See apps/web/.env.example for the exact shape.
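Because all three providers share the same three-variable pattern, they can be read with a single helper. This is a sketch, not the project's actual loader: `ProviderConfig`, `Env`, and `loadProvider` are hypothetical names, and only the variable names themselves come from the table above.

```typescript
// Hypothetical config loader; only the variable names come from this README.
type ProviderConfig = { baseUrl: string; apiKey: string; model: string };
type Env = Record<string, string | undefined>;

function loadProvider(prefix: "TEXT" | "IMAGE" | "VISION", env: Env): ProviderConfig {
  // Read one variable and fail fast with a clear message when it is missing or empty.
  const read = (suffix: string): string => {
    const name = `${prefix}_${suffix}`;
    const value = env[name];
    if (!value) throw new Error(`Missing environment variable ${name}`);
    return value;
  };
  return {
    baseUrl: read("BASE_URL"),
    apiKey: read("API_KEY"),
    model: read("MODEL"),
  };
}
```

In a Node or Next.js server context this would typically be called as `loadProvider("TEXT", process.env)`; failing at startup on a missing variable is usually easier to debug than a failed model call mid-story.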

Local development

Requires Node 20+ and pnpm 9+.

pnpm install
cp apps/web/.env.example apps/web/.env.local
# fill in the nine env vars
pnpm dev
# open http://localhost:3000

Project layout

yume/
├── apps/web/              Next.js 16 app — pages + API routes
└── packages/
    ├── types/             shared TypeScript types
    ├── ai-client/         unified OpenAI-compatible clients
    └── engine/            three-stage AI orchestration (open core)

packages/engine is the open core — pure TS, no Next.js or browser dependency. Import it directly to build your own visual-novel front-end (Tauri, Electron, CLI, anywhere).

Cost & limits

Each scene costs roughly $0.15–0.25 in API fees with the recommended model trio (one text + one image call); tapping through a scene's beats is free. To keep transitions instant, the engine also pre-generates scenes you might pick but don't — so real spend runs somewhat higher than the scenes you actually see. There is no rate limiting or auth out of the box — if you make your deployment public, your bill will reflect that. Add limits (and consider lowering the prefetch depth) before sharing widely.
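For a rough budget, the per-scene range above can be multiplied by the number of scenes actually generated, including prefetched scenes that are never shown. The sketch below assumes a hypothetical `unusedPrefetchPerScene` ratio that you would have to measure for your own deployment; only the $0.15 and $0.25 bounds come from this README.

```typescript
// Rough spend estimate; the prefetch ratio is an assumption to be measured, not a documented figure.
interface SpendEstimate {
  generatedScenes: number;
  low: number;  // USD, using the $0.15 lower bound
  high: number; // USD, using the $0.25 upper bound
}

function estimateSpend(scenesSeen: number, unusedPrefetchPerScene: number): SpendEstimate {
  const COST_LOW = 0.15;
  const COST_HIGH = 0.25;
  // Every scene the player sees, plus the speculative ones generated but never picked.
  const generatedScenes = scenesSeen * (1 + unusedPrefetchPerScene);
  return {
    generatedScenes,
    low: generatedScenes * COST_LOW,
    high: generatedScenes * COST_HIGH,
  };
}
```

For example, a 20-scene playthrough with an assumed 0.5 unused prefetched scenes per scene seen means 30 generated scenes, or roughly $4.50 to $7.50 per player.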
