feat: Runware FLUX.2 image + lazy per-beat TTS (#5)

Reduce median scene-load latency from ~30-80s to ~17-25s by switching image generation to Runware FLUX.2 [klein] 9B KV and moving per-beat TTS synthesis off the scene response into a new lazy /api/beat-audio endpoint with hard timeout + abort support.

- feat(image): migrate to Runware FLUX.2 [klein] 9B KV — task-array API, $0.001/image, sub-second inference.
- feat(tts): split /api/scene into directScene + image + voicedesign-provisioning, and lazily synthesize each beat via /api/beat-audio. The endpoint enforces a 15s hard timeout and threads an AbortSignal through to MiMo so timed-out calls don't keep burning sockets/quota. The client fans out per-beat fetches on scene-id change, aborts the previous scene's requests, and runs an identity check in its finally block to prevent cross-scene beat-id collisions.
- refactor(tts): slim BeatAudioRequest to { beat, voice } — ~800KB per-beat upload dropped to ~160KB by sending only the speaker's voice instead of the full session.
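
The hard timeout with a threaded AbortSignal described above could look roughly like the sketch below. This is not the code from this commit: `withHardTimeout` and its `run` callback are hypothetical names, and the callback stands in for whatever function actually calls MiMo. The sketch assumes a runtime with the standard `AbortController` (modern browsers, Node 15+).

```typescript
// Hypothetical helper (not from the commit): run an abortable task and give up
// after `ms` milliseconds, resolving to null so the caller can "play silent".
async function withHardTimeout<T>(
  ms: number,
  run: (signal: AbortSignal) => Promise<T>,
): Promise<T | null> {
  const controller = new AbortController();

  // Resolves to null the moment the controller aborts, even if `run` ignores the signal.
  const timedOut = new Promise<null>((resolve) => {
    controller.signal.addEventListener("abort", () => resolve(null), { once: true });
  });

  const timer = setTimeout(() => controller.abort(), ms);

  // Errors that arrive after the abort are a consequence of the timeout, so
  // they are mapped to null; errors before the abort are rethrown.
  const task = run(controller.signal).catch((err: unknown) => {
    if (controller.signal.aborted) return null;
    throw err;
  });

  try {
    return await Promise.race([task, timedOut]);
  } finally {
    clearTimeout(timer);
  }
}
```

Passing `controller.signal` into the upstream request is what lets a timed-out call release its socket instead of running to completion in the background. The race against `timedOut` keeps the deadline hard even when the upstream call is slow to honor the abort.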

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Zonghao Yuan
2026-05-28 23:43:51 +08:00
committed by GitHub
parent fcd4e6c1ab
commit e261f4a346
10 changed files with 431 additions and 214 deletions
+22 -5
@@ -145,10 +145,8 @@ export type StartResponse = {
   sessionId: string;
   scene: Scene;
   imageBase64: string;
-  /** Post-voice character registry (with provisioned voices). */
+  /** Character registry with voice references provisioned for new speakers. */
   characters: Character[];
-  /** Per-beat synthesized audio, keyed by beat.id. */
-  beatAudio?: Record<string, BeatAudio>;
 };

 // /api/scene — generates the next Scene, given session whose latest
@@ -162,7 +160,27 @@ export type SceneResponse = {
   scene: Scene;
   imageBase64: string;
   characters: Character[];
-  beatAudio?: Record<string, BeatAudio>;
 };
+
+// /api/beat-audio — lazily synthesize one beat's voice. Client fires this
+// per beat after a scene loads; server has a per-call timeout so MiMo
+// tail-latency cannot block the UI. A null audio response means "play silent."
+//
+// Payload deliberately slim: just the line to speak and the speaker's voice
+// reference. The client extracts the voice from its local session.characters
+// before posting — sending the full Session would force ~160KB of base64 per
+// OTHER speaker plus the entire scene history to ride along for nothing.
+export type BeatAudioRequest = {
+  beat: {
+    id: string;
+    line: string;
+    lineDelivery?: string;
+  };
+  voice: CharacterVoice;
+};
+
+export type BeatAudioResponse = {
+  audio: BeatAudio | null;
+};

 // /api/vision — interprets a background click on the current image and
@@ -197,5 +215,4 @@ export type InsertBeatPartial = {
 export type InsertBeatResponse = {
   partial: InsertBeatPartial;
   characters: Character[];
-  audio?: BeatAudio;
 };
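
To illustrate the slim `BeatAudioRequest` payload, here is a minimal sketch of how a client might pull only the speaker's voice out of its local character list before posting to /api/beat-audio. The shapes of `Character`, `CharacterVoice`, and `Beat` below are assumptions made for the example (the diff does not show them), as are the `speakerId` field and the `toBeatAudioRequest` helper. Returning null when the speaker has no voice is also an assumption, chosen to match the "play silent" convention.

```typescript
// Assumed minimal shapes; the real definitions are not shown in this diff.
type CharacterVoice = { referenceAudioBase64: string };
type Character = { id: string; voice?: CharacterVoice };
type Beat = { id: string; speakerId: string; line: string; lineDelivery?: string };

// Mirrors the BeatAudioRequest type added in this commit.
type BeatAudioRequest = {
  beat: { id: string; line: string; lineDelivery?: string };
  voice: CharacterVoice;
};

// Hypothetical helper: build the per-beat payload from local session state.
// Returns null when the speaker has no provisioned voice, so the caller can
// skip the request and play the beat silently.
function toBeatAudioRequest(beat: Beat, characters: Character[]): BeatAudioRequest | null {
  const voice = characters.find((c) => c.id === beat.speakerId)?.voice;
  if (!voice) return null;

  // Copy only the fields the endpoint needs; other speakers' voices and the
  // scene history never enter the payload.
  const slimBeat: BeatAudioRequest["beat"] = { id: beat.id, line: beat.line };
  if (beat.lineDelivery !== undefined) slimBeat.lineDelivery = beat.lineDelivery;

  return { beat: slimBeat, voice };
}
```

With five speakers in a session, a payload built this way carries one voice reference instead of five, which is consistent with the drop from ~800KB to ~160KB noted in the commit message.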