feat: Runware FLUX.2 image + lazy per-beat TTS (#5)

Reduce median scene-load latency from ~30-80s to ~17-25s by switching image generation to Runware FLUX.2 [klein] 9B KV and moving per-beat TTS synthesis off the scene response into a new lazy /api/beat-audio endpoint with hard timeout + abort support.

- feat(image): migrate to Runware FLUX.2 [klein] 9B KV: task-array API, $0.001/image, sub-second inference.
- feat(tts): split /api/scene into directScene + image + voicedesign-provisioning; lazily synth per beat via /api/beat-audio with 15s hard timeout + AbortSignal threaded to MiMo so timed-out calls don't keep burning sockets/quota; client fans out per-beat fetches on scene-id change with abort + identity-check finally to prevent cross-scene beat-id collisions.
- refactor(tts): slim BeatAudioRequest to { beat, voice }: ~800KB per-beat upload dropped to ~160KB by sending only the speaker's voice instead of the full session.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-05-28 23:43:51 +08:00
parent fcd4e6c1ab
commit e261f4a346
10 changed files with 431 additions and 214 deletions
@@ -145,10 +145,8 @@ export type StartResponse = {
  sessionId: string;
  scene: Scene;
  imageBase64: string;
-  /** Post-voice character registry (with provisioned voices). */
+  /** Character registry with voice references provisioned for new speakers. */
  characters: Character[];
-  /** Per-beat synthesized audio, keyed by beat.id. */
-  beatAudio?: Record<string, BeatAudio>;
 };

 // /api/scene — generates the next Scene, given session whose latest
@@ -162,7 +160,27 @@ export type SceneResponse = {
  scene: Scene;
  imageBase64: string;
  characters: Character[];
-  beatAudio?: Record<string, BeatAudio>;
+};
+
+// /api/beat-audio — lazily synthesize one beat's voice. Client fires this
+// per beat after a scene loads; server has a per-call timeout so MiMo
+// tail-latency cannot block the UI. A null audio response means "play silent."
+//
+// Payload deliberately slim: just the line to speak and the speaker's voice
+// reference. The client extracts the voice from its local session.characters
+// before posting — sending the full Session would force ~160KB of base64 per
+// OTHER speaker plus the entire scene history to ride along for nothing.
+export type BeatAudioRequest = {
+  beat: {
+    id: string;
+    line: string;
+    lineDelivery?: string;
+  };
+  voice: CharacterVoice;
+};
+
+export type BeatAudioResponse = {
+  audio: BeatAudio | null;
 };

 // /api/vision — interprets a background click on the current image and
@@ -197,5 +215,4 @@ export type InsertBeatPartial = {
 export type InsertBeatResponse = {
  partial: InsertBeatPartial;
  characters: Character[];
-  audio?: BeatAudio;
 };
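
The two ideas behind the new BeatAudioRequest / BeatAudioResponse types can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the shapes of CharacterVoice, Character, and Beat (including the speakerId and referenceAudioBase64 fields) and the helper names buildBeatAudioRequest and withHardTimeout are assumptions made for the example. The first helper shows the client picking out only the speaker's voice before posting; the second shows one way a server could enforce a hard timeout while passing an AbortSignal to the upstream TTS call and falling back to null ("play silent").

```typescript
// Sketch only. CharacterVoice, Character, and Beat are hypothetical stand-ins;
// the real types live elsewhere in the repo and are not part of this diff.
type CharacterVoice = { referenceAudioBase64: string };
type Character = { id: string; voice?: CharacterVoice };
type Beat = { id: string; speakerId: string; line: string; lineDelivery?: string };

type BeatAudioRequest = {
  beat: { id: string; line: string; lineDelivery?: string };
  voice: CharacterVoice;
};

// Client side: send only the speaker's voice, not the whole session.
// Returns null when the speaker has no provisioned voice (assumed here to
// mean "skip the request and play the beat silently").
function buildBeatAudioRequest(
  characters: Character[],
  beat: Beat,
): BeatAudioRequest | null {
  const speaker = characters.find((c) => c.id === beat.speakerId);
  if (!speaker || !speaker.voice) return null;
  const slimBeat: BeatAudioRequest["beat"] = { id: beat.id, line: beat.line };
  if (beat.lineDelivery !== undefined) slimBeat.lineDelivery = beat.lineDelivery;
  return { beat: slimBeat, voice: speaker.voice };
}

// Server side: race the upstream call against a timer. On timeout, resolve
// to null first and then abort, so the race settles with null even if the
// upstream promise rejects synchronously in response to the abort.
async function withHardTimeout<T>(
  ms: number,
  run: (signal: AbortSignal) => Promise<T>,
): Promise<T | null> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<null>((resolve) => {
    timer = setTimeout(() => {
      resolve(null);
      controller.abort(); // lets the upstream call release its socket
    }, ms);
  });
  try {
    return await Promise.race([run(controller.signal), timedOut]);
  } finally {
    clearTimeout(timer);
  }
}
```

In a handler, the upstream synthesis call would be wrapped as something like `withHardTimeout(15_000, (signal) => synthesize(req, signal))`, where synthesize is a placeholder for whatever function actually talks to MiMo and forwards the signal to its fetch call.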