e261f4a346
Reduce median scene-load latency from ~30-80s to ~17-25s by switching image generation to Runware FLUX.2 [klein] 9B KV and moving per-beat TTS synthesis off the scene response into a new lazy /api/beat-audio endpoint with hard timeout + abort support.
- feat(image): migrate to Runware FLUX.2 [klein] 9B KV — task-array API, $0.001/image, sub-second inference.
- feat(tts): split /api/scene into directScene + image + voicedesign-provisioning; lazily synth per beat via /api/beat-audio with 15s hard timeout + AbortSignal threaded to MiMo so timed-out calls don't keep burning sockets/quota; client fans out per-beat fetches on scene-id change with abort + identity-check finally to prevent cross-scene beat-id collisions.
- refactor(tts): slim BeatAudioRequest to { beat, voice } — ~800KB per-beat upload dropped to ~160KB by sending only the speaker's voice instead of the full session.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
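
The two TTS changes above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: only the `{ beat, voice }` request shape, the 15s timeout, and the use of an `AbortSignal` come from the commit message. The `Beat` and `Voice` fields and the helper names `toBeatAudioRequest` and `withHardTimeout` are hypothetical.

```typescript
// Hypothetical field layouts; only the { beat, voice } pairing is from the commit.
interface Beat { id: string; speaker: string; text: string }
interface Voice { speaker: string; referenceAudioBase64: string }
interface BeatAudioRequest { beat: Beat; voice: Voice }

// Pick only the speaking character's voice instead of shipping every
// voice held in the session with each per-beat request.
function toBeatAudioRequest(beat: Beat, voices: Voice[]): BeatAudioRequest | null {
  const voice = voices.find((v) => v.speaker === beat.speaker);
  return voice ? { beat, voice } : null;
}

// Run `task` under a hard deadline. The signal is passed to the task so the
// underlying request can actually be cancelled when the deadline fires,
// rather than continuing in the background.
async function withHardTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`timed out after ${ms} ms`));
    }, ms);
  });
  try {
    // Whichever settles first wins: the task result or the deadline error.
    return await Promise.race([task(controller.signal), deadline]);
  } finally {
    clearTimeout(timer);
  }
}
```

A caller might wrap the synthesis call as `withHardTimeout((signal) => fetch(url, { method: "POST", body, signal }), 15_000)`, where `url` and `body` stand in for whatever the real endpoint expects.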
52 lines
2.3 KiB
Bash
# =============================================================
# 云梦 · AI visual novel
# Recommended setup: Xiaomi MiMo Token Plan for TEXT / VISION / TTS
# (one API key covers all three) + Runware for IMAGE (FLUX.2 [klein]).
#
# TEXT / VISION / TTS use OpenAI-compatible endpoints (any OpenAI-
# compatible host works: OpenRouter, OpenAI, Anthropic via proxy,
# Gemini, DeepSeek, Ollama, ...).
#
# IMAGE uses Runware's own task-array protocol (not OpenAI-compatible);
# the adapter posts an `imageInference` task to IMAGE_BASE_URL.
# =============================================================

# ---- 1. Text LLM · scene director ----------------------------------
# Recommended: MiMo V2.5 Pro (1M context, native JSON-mode, strong CN)
# Token Plan host: https://token-plan-sgp.xiaomimimo.com/v1
# Pay-as-you-go host: https://api.xiaomimimo.com/v1 (sk- keys)
TEXT_BASE_URL=https://token-plan-sgp.xiaomimimo.com/v1
TEXT_API_KEY=tp-xxx
TEXT_MODEL=mimo-v2.5-pro

# ---- 2. Image generator (renders the scene background) -------------
# Recommended: Runware + FLUX.2 [klein] 9B KV, a distilled 4-step model
# with sub-second inference at ~$0.0008/image. Sign up at https://runware.ai
# AIR ids for FLUX.2 [klein] variants:
#   runware:400@1 · 4B (smaller)
#   runware:400@6 · 9B KV (recommended; fastest at 16:9)
IMAGE_BASE_URL=https://api.runware.ai/v1
IMAGE_API_KEY=runware-xxx
IMAGE_MODEL=runware:400@6

# ---- 3. Vision model · multimodal click interpretation -------------
# Recommended: MiMo V2.5 omni (multimodal).
# ⚠️ DO NOT use mimo-v2.5-pro for this slot: Pro is text-only and
# rejects image_url content parts.
VISION_BASE_URL=https://token-plan-sgp.xiaomimimo.com/v1
VISION_API_KEY=tp-xxx
VISION_MODEL=mimo-v2.5

# ---- 4. TTS · Xiaomi MiMo (optional; leave blank to disable) -------
# Per-character voice design → clone, with per-line delivery direction.
# Voice identity = the reference audio kept in the session (no server expiry).
# The adapter appends -voicedesign / -voiceclone to TTS_SPEECH_MODEL.
TTS_BASE_URL=https://token-plan-sgp.xiaomimimo.com/v1
TTS_API_KEY=tp-xxx
TTS_SPEECH_MODEL=mimo-v2.5-tts

# ---- 5. MOCK_IMAGE · skip image generation (cheap TTS testing) -----
# true → return a placeholder image instead of calling the image model.
# Text/story/voice still run normally. Great for iterating on TTS.
MOCK_IMAGE=false
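
As a rough illustration of how an adapter might consume the settings above, the sketch below reads the `IMAGE_*`, `MOCK_IMAGE`, and `TTS_*` variables and builds a single-element `imageInference` task array. Several things here are assumptions rather than facts from this file: the helper names, the rule that the image variables are required only when `MOCK_IMAGE` is not `true`, the rule that TTS counts as disabled when any `TTS_*` value is blank, and the 1024x576 size chosen as a 16:9 example. The task field names follow Runware's documented `imageInference` task as far as I know and should be checked against the current API reference.

```typescript
type Env = Record<string, string | undefined>;

// Trimmed value of an env var, or "" when it is unset.
const read = (env: Env, key: string): string => (env[key] ?? "").trim();

interface ImageConfig { baseUrl: string; apiKey: string; model: string; mock: boolean }

// Assumption: IMAGE_* values are only mandatory when images are not mocked.
function loadImageConfig(env: Env): ImageConfig {
  const cfg: ImageConfig = {
    baseUrl: read(env, "IMAGE_BASE_URL"),
    apiKey: read(env, "IMAGE_API_KEY"),
    model: read(env, "IMAGE_MODEL"),
    mock: read(env, "MOCK_IMAGE").toLowerCase() === "true",
  };
  if (!cfg.mock && (!cfg.baseUrl || !cfg.apiKey || !cfg.model)) {
    throw new Error("IMAGE_BASE_URL, IMAGE_API_KEY and IMAGE_MODEL are required unless MOCK_IMAGE=true");
  }
  return cfg;
}

// Derive the two TTS model ids by appending the suffixes named in section 4.
// Assumption: a blank TTS_* value means TTS is disabled (returns null).
function ttsModels(env: Env): { design: string; clone: string } | null {
  const base = read(env, "TTS_SPEECH_MODEL");
  if (!base || !read(env, "TTS_BASE_URL") || !read(env, "TTS_API_KEY")) return null;
  return { design: `${base}-voicedesign`, clone: `${base}-voiceclone` };
}

// Build the task array that would be posted to IMAGE_BASE_URL.
// The caller supplies taskUUID (a UUID v4) so this function stays pure.
function buildImageTask(cfg: ImageConfig, prompt: string, taskUUID: string) {
  return [
    {
      taskType: "imageInference",
      taskUUID,
      model: cfg.model,
      positivePrompt: prompt,
      width: 1024, // 16:9 example size, an assumption
      height: 576,
      numberResults: 1,
    },
  ];
}
```

Keeping the builder free of network calls makes the payload easy to inspect in tests; sending it would be a JSON `POST` of the array to `cfg.baseUrl`, authenticated as the Runware documentation describes.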