feat(tts): Xiaomi MiMo per-beat voice + MOCK_IMAGE testing aid (#3)

Adds an optional Xiaomi MiMo TTS layer on top of the scene/beat engine and a MOCK_IMAGE flag for cheap local TTS iteration.

- Per-character voice provisioning via MiMo voice design → clone, with the reference audio persisted in the session
- Per-line free-form delivery direction (the Director writes instructions in the style of "鼓起勇气又害羞,声音发颤", i.e. "plucking up courage yet shy, voice trembling"; these are sent to MiMo's director channel and never read aloud)
- Per-beat audio served with the scene response; the frontend plays it via a hidden <audio> element with the typewriter synced to the audio duration; the mute toggle is persisted via a localStorage lazy initializer
- Graceful degradation: if any TTS step fails, the beat is silent and the game continues
- MOCK_IMAGE=true returns a sharp-generated placeholder PNG so local TTS iteration doesn't burn image tokens
- Recommended config in .env.example: the MiMo Token Plan covers TEXT/VISION/TTS with one key (mimo-v2.5-pro for text, mimo-v2.5 omni for vision, mimo-v2.5-tts for TTS)

Squashed from #3:
- feat(tts): Xiaomi MiMo per-beat voice-over + per-session character voices + free-text voice direction
- feat(engine): MOCK_IMAGE placeholder image for easier local testing
- fix(tts): address Copilot review on PR #3
- fix(tts): Copilot round-2 review feedback

Known limitation: Session.characters carries the full WAV reference audio (~200-300KB per character, base64-encoded) and round-trips through every /api/scene, /api/vision, and /api/insert-beat request. This is intrinsic to MiMo's design→clone model (the voice identity IS the audio; there is no server-side voiceId). Fixing it requires server-side storage, which is out of scope; documented for future hardening.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
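
For reviewers, here is a rough sketch of how the typewriter could be paced against a beat's audio while still working for a silent beat. It is not the code in this commit: `typewriterDelayMs`, its parameters, and the 40 ms fallback are hypothetical names and values chosen only for illustration.

```typescript
// Hypothetical helper (not from the commit): per-character typewriter delay in ms.
// audioDurationSec is null when the beat has no audio, e.g. a TTS step failed.
function typewriterDelayMs(
  text: string,
  audioDurationSec: number | null,
  fallbackMs: number = 40, // assumed default typing speed for silent beats
): number {
  // Counts UTF-16 code units, which is adequate for typical CJK and Latin text.
  const chars = text.length;
  if (chars === 0) return 0;

  // Silent beat or unusable duration: fall back to the fixed typing speed
  // so the game continues without audio.
  if (audioDurationSec === null || !isFinite(audioDurationSec) || audioDurationSec <= 0) {
    return fallbackMs;
  }

  // Spread the text evenly across the clip so typing ends when the audio ends.
  return (audioDurationSec * 1000) / chars;
}
```

With a 2-second clip and a 4-character line, each character would appear every 500 ms; with no audio, the fixed fallback applies.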
+33 -12
@@ -1,24 +1,45 @@
 # =============================================================
 # 云梦 — AI 视觉小说
-# Three independently configurable AI providers
-# Any OpenAI-compatible endpoint works (OpenRouter, OpenAI,
-# Anthropic via OpenAI-compat proxy, Gemini, DeepSeek, Ollama).
+# Recommended setup: Xiaomi MiMo Token Plan for TEXT / VISION / TTS
+# (one API key covers all three) + any image provider for IMAGE.
+#
+# Any OpenAI-compatible endpoint works for any slot — OpenRouter,
+# OpenAI, Anthropic via OpenAI-compat proxy, Gemini, DeepSeek, etc.
 # Image generation uses the chat-completions + modalities API
 # (OpenRouter-style), NOT the legacy /images/generations endpoint.
 # =============================================================
 
-# ---- 1. Text LLM (story director) -----------------------------
-TEXT_BASE_URL=https://openrouter.ai/api/v1
-TEXT_API_KEY=sk-or-v1-xxx
-TEXT_MODEL=~anthropic/claude-sonnet-latest
+# ---- 1. Text LLM · scene director ----------------------------------
+# Recommended: MiMo V2.5 Pro (1M context, native JSON-mode, strong CN)
+# Token Plan host: https://token-plan-sgp.xiaomimimo.com/v1
+# Pay-as-you-go host: https://api.xiaomimimo.com/v1 (sk- keys)
+TEXT_BASE_URL=https://token-plan-sgp.xiaomimimo.com/v1
+TEXT_API_KEY=tp-xxx
+TEXT_MODEL=mimo-v2.5-pro
 
-# ---- 2. Image generator (renders the whole UI screen) ---------
+# ---- 2. Image generator (renders the scene background) -------------
+# Any provider supporting chat-completions + modalities image output.
 IMAGE_BASE_URL=https://openrouter.ai/api/v1
 IMAGE_API_KEY=sk-or-v1-xxx
 IMAGE_MODEL=openai/gpt-5.4-image-2
 
-# ---- 3. Vision model (interprets where the user clicked) ------
-VISION_BASE_URL=https://openrouter.ai/api/v1
-VISION_API_KEY=sk-or-v1-xxx
-VISION_MODEL=~google/gemini-flash-latest
+# ---- 3. Vision model · multimodal click interpretation -------------
+# Recommended: MiMo V2.5 omni — multimodal.
+# ⚠️ DO NOT use mimo-v2.5-pro for this slot — Pro is text-only and
+# rejects image_url content parts.
+VISION_BASE_URL=https://token-plan-sgp.xiaomimimo.com/v1
+VISION_API_KEY=tp-xxx
+VISION_MODEL=mimo-v2.5
 
+# ---- 4. TTS · Xiaomi MiMo (optional — leave blank to disable) ------
+# Per-character voice design → clone, with per-line delivery direction.
+# Voice identity = the reference audio kept in the session (no server expiry).
+# The adapter appends -voicedesign / -voiceclone to TTS_SPEECH_MODEL.
+TTS_BASE_URL=https://token-plan-sgp.xiaomimimo.com/v1
+TTS_API_KEY=tp-xxx
+TTS_SPEECH_MODEL=mimo-v2.5-tts
+
+# ---- 5. MOCK_IMAGE — skip image generation (cheap TTS testing) -----
+# true → return a placeholder image instead of calling the image model.
+# Text/story/voice still run normally. Great for iterating on TTS.
+MOCK_IMAGE=false
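
As a reading aid for the TTS and MOCK_IMAGE settings above, the following is a minimal sketch of how a server might interpret them. Only the environment variable names and the `-voicedesign` / `-voiceclone` suffixes come from the config comments; `readTtsConfig`, `isMockImage`, the `TtsConfig` shape, the rule that any blank TTS value disables TTS, and the case-insensitive `true` check are assumptions made for illustration.

```typescript
// Hypothetical config reader; only the env keys and model suffixes come from the config.
type Env = Record<string, string | undefined>;

interface TtsConfig {
  baseUrl: string;
  apiKey: string;
  designModel: string; // TTS_SPEECH_MODEL + "-voicedesign"
  cloneModel: string; // TTS_SPEECH_MODEL + "-voiceclone"
}

// Returns null when TTS is disabled. Assumption: leaving ANY of the three
// TTS values blank disables the feature.
function readTtsConfig(env: Env): TtsConfig | null {
  const baseUrl = (env.TTS_BASE_URL || "").trim();
  const apiKey = (env.TTS_API_KEY || "").trim();
  const speechModel = (env.TTS_SPEECH_MODEL || "").trim();
  if (!baseUrl || !apiKey || !speechModel) return null;

  return {
    baseUrl,
    apiKey,
    designModel: `${speechModel}-voicedesign`,
    cloneModel: `${speechModel}-voiceclone`,
  };
}

// Assumption: only the literal value "true" (case-insensitive) enables the mock.
function isMockImage(env: Env): boolean {
  return (env.MOCK_IMAGE || "").trim().toLowerCase() === "true";
}
```

In a Node server this would typically be called as `readTtsConfig(process.env)`; a `null` result maps naturally onto the silent-beat behaviour described in the commit message.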