feat: Runware FLUX.2 image + lazy per-beat TTS (#5)

Reduce median scene-load latency from ~30-80s to ~17-25s by switching image generation to Runware FLUX.2 [klein] 9B KV and moving per-beat TTS synthesis off the scene response into a new lazy /api/beat-audio endpoint with a hard timeout and abort support.

- feat(image): migrate to Runware FLUX.2 [klein] 9B KV: task-array API, $0.001/image, sub-second inference.
- feat(tts): split /api/scene into directScene + image + voicedesign-provisioning; lazily synth each beat via /api/beat-audio with a 15s hard timeout and an AbortSignal threaded to MiMo, so timed-out calls don't keep burning sockets/quota; the client fans out per-beat fetches on scene-id change, with abort plus an identity check in the finally block to prevent cross-scene beat-id collisions.
- refactor(tts): slim BeatAudioRequest to { beat, voice }: the per-beat upload drops from ~800KB to ~160KB by sending only the speaker's voice instead of the full session.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-05-28 23:43:51 +08:00
parent fcd4e6c1ab
commit e261f4a346
10 changed files with 431 additions and 214 deletions
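
Review note: the commit message describes a hard timeout whose AbortSignal is threaded through to the upstream TTS call. A minimal sketch of that pattern is below. The helper name `withHardTimeout` and its signature are hypothetical, not taken from the changed files; only the 15s figure comes from the commit message.

```typescript
// Hypothetical helper, for illustration only: not the code in this commit.
// Runs `run` with an AbortSignal and rejects if it has not settled within
// `timeoutMs`, aborting the signal so the upstream request is cancelled too.
async function withHardTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      // Reject first so the caller sees the timeout error, then abort
      // so the in-flight work releases its socket.
      reject(new Error(`timed out after ${timeoutMs}ms`));
      controller.abort();
    }, timeoutMs);
  });

  try {
    return await Promise.race([run(controller.signal), timeout]);
  } finally {
    // Always clear the timer so a fast success does not leave it pending.
    clearTimeout(timer);
  }
}
```

A route handler could then wrap its upstream call as `withHardTimeout((signal) => fetch(url, { signal }), 15_000)`, where `url` stands in for whatever TTS endpoint the adapter actually targets.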
@@ -1,12 +1,14 @@
 # =============================================================
 # 云梦 · AI visual novel
 # Recommended setup: Xiaomi MiMo Token Plan for TEXT / VISION / TTS
-# (one API key covers all three) + any image provider for IMAGE.
+# (one API key covers all three) + Runware for IMAGE (FLUX.2 [klein]).
 #
-# Any OpenAI-compatible endpoint works for any slot — OpenRouter,
-# OpenAI, Anthropic via OpenAI-compat proxy, Gemini, DeepSeek, etc.
-# Image generation uses the chat-completions + modalities API
-# (OpenRouter-style), NOT the legacy /images/generations endpoint.
+# TEXT / VISION / TTS use OpenAI-compatible endpoints (any OpenAI-
+# compatible host works: OpenRouter, OpenAI, Anthropic via proxy,
+# Gemini, DeepSeek, Ollama, ...).
+#
+# IMAGE uses Runware's own task-array protocol (not OpenAI-compatible);
+# the adapter posts an `imageInference` task to IMAGE_BASE_URL.
 # =============================================================

 # ---- 1. Text LLM · scene director ----------------------------------
@@ -18,10 +20,14 @@ TEXT_API_KEY=tp-xxx
 TEXT_MODEL=mimo-v2.5-pro

 # ---- 2. Image generator (renders the scene background) -------------
-# Any provider supporting chat-completions + modalities image output.
-IMAGE_BASE_URL=https://openrouter.ai/api/v1
-IMAGE_API_KEY=sk-or-v1-xxx
-IMAGE_MODEL=openai/gpt-5.4-image-2
+# Recommended: Runware + FLUX.2 [klein] 9B KV — distilled 4-step model,
+# sub-second inference at ~$0.0008/image. Sign up at https://runware.ai
+# AIR ids for FLUX.2 [klein] variants:
+#   runware:400@1  · 4B (smaller)
+#   runware:400@6  · 9B KV (recommended — fastest at 16:9)
+IMAGE_BASE_URL=https://api.runware.ai/v1
+IMAGE_API_KEY=runware-xxx
+IMAGE_MODEL=runware:400@6

 # ---- 3. Vision model · multimodal click interpretation -------------
 # Recommended: MiMo V2.5 omni — multimodal.
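
Review note: the new `.env` comments say the adapter posts an `imageInference` task to IMAGE_BASE_URL using a task-array protocol. A rough sketch of what such a payload might look like follows. The function name `buildImageTasks` is hypothetical, and the field names reflect my understanding of Runware's public task format rather than the adapter in this commit, so they should be checked against the current API reference. The 1024x576 size is an assumed 16:9 resolution, and `steps: 4` mirrors the "distilled 4-step model" comment above.

```typescript
// Sketch only: field names are assumptions based on Runware's documented
// imageInference task shape; verify before relying on them.
interface ImageInferenceTask {
  taskType: "imageInference";
  taskUUID: string;       // caller-supplied UUID used to match the response
  positivePrompt: string; // scene description from the text LLM
  model: string;          // AIR id, e.g. the IMAGE_MODEL value
  width: number;
  height: number;
  steps: number;
  numberResults: number;
}

// Builds the task array for one scene background.
// The protocol takes an array even when there is a single task.
function buildImageTasks(
  prompt: string,
  model: string,
  taskUUID: string,
): ImageInferenceTask[] {
  return [
    {
      taskType: "imageInference",
      taskUUID,
      positivePrompt: prompt,
      model,
      width: 1024, // assumed 16:9 size, not taken from the commit
      height: 576,
      steps: 4,    // matches the "distilled 4-step" note in the env comments
      numberResults: 1,
    },
  ];
}
```

The resulting array would presumably be sent as the JSON body of a POST to IMAGE_BASE_URL, with IMAGE_API_KEY supplied as a bearer token; the exact authentication mechanism is not shown in this diff.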