feat: Runware FLUX.2 image + lazy per-beat TTS (#5)
Reduce median scene-load latency from ~30-80s to ~17-25s by switching image generation to Runware FLUX.2 [klein] 9B KV and moving per-beat TTS synthesis off the scene response into a new lazy /api/beat-audio endpoint with hard timeout + abort support.
- feat(image): migrate to Runware FLUX.2 [klein] 9B KV — task-array API, $0.001/image, sub-second inference.
- feat(tts): split /api/scene into directScene + image + voicedesign-provisioning, and synthesize audio lazily per beat via /api/beat-audio.
  - Server: 15s hard timeout, with an AbortSignal threaded through to MiMo so timed-out calls don't keep burning sockets/quota.
  - Client: fans out per-beat fetches on scene-id change, with abort plus an identity check in `finally` to prevent cross-scene beat-id collisions.
- refactor(tts): slim BeatAudioRequest to { beat, voice } — ~800KB per-beat upload dropped to ~160KB by sending only the speaker's voice instead of the full session.
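
The hard timeout with a threaded AbortSignal described in the feat(tts) bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `withHardTimeout` is a hypothetical helper name, and the task is any function that accepts an AbortSignal (in the real handler it would presumably be the MiMo HTTP call).

```typescript
// Hypothetical helper (name and shape are assumptions, not taken from the commit).
// Runs `task` with an AbortSignal and rejects once `ms` elapses, aborting the
// signal so the underlying request can release its socket.
export async function withHardTimeout<T>(
  ms: number,
  task: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // tell the in-flight work to stop
      reject(new Error(`timed out after ${ms}ms`));
    }, ms);
  });

  try {
    // Whichever settles first wins; a late rejection from `task` is absorbed by the race.
    return await Promise.race([task(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // never leave the timer pending after we return
  }
}
```

A handler would wrap its upstream call as `withHardTimeout(15_000, (signal) => callUpstream(body, signal))`, where `callUpstream` stands in for whatever function forwards `signal` to the HTTP client.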
🤖 Generated with [Claude Code](https://claude.com/claude-code)
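
The client-side behaviour in the commit message (abort on scene change, plus an identity check so a stale scene cannot write audio for a reused beat id) might look like the sketch below. `BeatAudioLoader`, `Beat`, and `FetchBeat` are hypothetical names for illustration; the actual client code is not shown in this commit view.

```typescript
// All names here are assumptions made for the sketch.
type Beat = { id: string };
type FetchBeat = (beat: Beat, signal: AbortSignal) => Promise<string>;

export class BeatAudioLoader {
  private current: AbortController | null = null;
  readonly audio = new Map<string, string>(); // beat id -> audio URL

  async loadScene(beats: Beat[], fetchBeat: FetchBeat): Promise<void> {
    this.current?.abort(); // cancel fetches belonging to the previous scene
    const controller = new AbortController();
    this.current = controller;
    this.audio.clear();

    try {
      // Fan out one request per beat; a failure in one beat does not block the rest.
      await Promise.all(
        beats.map(async (beat) => {
          try {
            const url = await fetchBeat(beat, controller.signal);
            // Identity check: only the scene that is still current may write.
            if (this.current === controller) this.audio.set(beat.id, url);
          } catch {
            // aborted or failed; leave this beat without audio
          }
        }),
      );
    } finally {
      // Only clear the slot if a newer scene has not already replaced it.
      if (this.current === controller) this.current = null;
    }
  }
}
```

The identity comparison matters because beat ids can repeat across scenes: without it, a slow response from the old scene could overwrite the new scene's audio for the same id, even after the old controller was aborted.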
+15
-9
@@ -1,12 +1,14 @@
 # =============================================================
 # 云梦 — AI visual novel
 # Recommended setup: Xiaomi MiMo Token Plan for TEXT / VISION / TTS
-# (one API key covers all three) + any image provider for IMAGE.
+# (one API key covers all three) + Runware for IMAGE (FLUX.2 [klein]).
 #
-# Any OpenAI-compatible endpoint works for any slot — OpenRouter,
-# OpenAI, Anthropic via OpenAI-compat proxy, Gemini, DeepSeek, etc.
-# Image generation uses the chat-completions + modalities API
-# (OpenRouter-style), NOT the legacy /images/generations endpoint.
+# TEXT / VISION / TTS use OpenAI-compatible endpoints (any OpenAI-
+# compatible host works: OpenRouter, OpenAI, Anthropic via proxy,
+# Gemini, DeepSeek, Ollama, ...).
+#
+# IMAGE uses Runware's own task-array protocol (not OpenAI-compatible);
+# the adapter posts an `imageInference` task to IMAGE_BASE_URL.
 # =============================================================
 
 # ---- 1. Text LLM · scene director ----------------------------------
@@ -18,10 +20,14 @@ TEXT_API_KEY=tp-xxx
 TEXT_MODEL=mimo-v2.5-pro
 
 # ---- 2. Image generator (renders the scene background) -------------
-# Any provider supporting chat-completions + modalities image output.
-IMAGE_BASE_URL=https://openrouter.ai/api/v1
-IMAGE_API_KEY=sk-or-v1-xxx
-IMAGE_MODEL=openai/gpt-5.4-image-2
+# Recommended: Runware + FLUX.2 [klein] 9B KV — distilled 4-step model,
+# sub-second inference at ~$0.0008/image. Sign up at https://runware.ai
+# AIR ids for FLUX.2 [klein] variants:
+# runware:400@1 · 4B (smaller)
+# runware:400@6 · 9B KV (recommended — fastest at 16:9)
+IMAGE_BASE_URL=https://api.runware.ai/v1
+IMAGE_API_KEY=runware-xxx
+IMAGE_MODEL=runware:400@6
 
 # ---- 3. Vision model · multimodal click interpretation -------------
 # Recommended: MiMo V2.5 omni — multimodal.
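
To make the "task-array protocol" mentioned in the env comments concrete, here is a rough sketch of the JSON body an adapter might post to IMAGE_BASE_URL. The function name `buildImageInferenceTasks` is hypothetical, the field names follow my understanding of Runware's public API and should be checked against its documentation, and the 1024x576 size is only an assumed 16:9 default, not a value from this commit.

```typescript
// Assumed shape of a Runware image task; verify field names against Runware's docs.
interface ImageInferenceTask {
  taskType: "imageInference";
  taskUUID: string;       // caller-supplied UUID used to correlate the response
  model: string;          // AIR id, e.g. the IMAGE_MODEL value "runware:400@6"
  positivePrompt: string;
  width: number;
  height: number;
  numberResults: number;
}

// Hypothetical builder: the request body is an ARRAY of tasks, not a single
// chat-completions style object, which is why the adapter is not OpenAI-compatible.
export function buildImageInferenceTasks(opts: {
  prompt: string;
  model: string;
  taskUUID: string;
  width?: number;
  height?: number;
}): ImageInferenceTask[] {
  return [
    {
      taskType: "imageInference",
      taskUUID: opts.taskUUID,
      model: opts.model,
      positivePrompt: opts.prompt,
      width: opts.width ?? 1024,  // assumed 16:9 default
      height: opts.height ?? 576,
      numberResults: 1,
    },
  ];
}
```

The adapter would then send `JSON.stringify(tasks)` in a POST to IMAGE_BASE_URL, authenticating with IMAGE_API_KEY; whether the key travels in a bearer header or in a separate authentication task is a detail to confirm in Runware's documentation.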