feat: Runware FLUX.2 image + lazy per-beat TTS (#5)
Reduce median scene-load latency from ~30-80s to ~17-25s by switching image generation to Runware FLUX.2 [klein] 9B KV and moving per-beat TTS synthesis off the scene response into a new lazy /api/beat-audio endpoint with hard timeout + abort support.
- feat(image): migrate to Runware FLUX.2 [klein] 9B KV — task-array API, $0.001/image, sub-second inference.
- feat(tts): split /api/scene into directScene + image + voicedesign-provisioning, and synthesize each beat lazily via /api/beat-audio. The endpoint enforces a 15s hard timeout and threads an AbortSignal through to MiMo, so timed-out calls stop consuming sockets and quota. The client fans out per-beat fetches whenever the scene id changes, aborting in-flight requests and guarding its finally block with an identity check to prevent cross-scene beat-id collisions.
- refactor(tts): slim BeatAudioRequest to { beat, voice } — ~800KB per-beat upload dropped to ~160KB by sending only the speaker's voice instead of the full session.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
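
The hard timeout with AbortSignal described in the feat(tts) entry can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `withHardTimeout` and `fakeSynth` are hypothetical names, and `fakeSynth` only stands in for the real MiMo call, whose client API is not shown here.

```typescript
// Hypothetical helper: run an abortable task under a hard deadline.
// Resolves to null on timeout so the caller can fall back to "play silent".
async function withHardTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T | null> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<null>((resolve) => {
    timer = setTimeout(() => {
      controller.abort(); // tell the upstream call to release its socket
      resolve(null);
    }, timeoutMs);
  });
  try {
    return await Promise.race([task(controller.signal), deadline]);
  } catch (err) {
    // A task that honors the signal rejects once aborted; treat that as a timeout.
    if (controller.signal.aborted) return null;
    throw err;
  } finally {
    clearTimeout(timer);
  }
}

// Stand-in for the upstream TTS call (assumption): resolves after delayMs
// unless the signal fires first, in which case it cleans up and rejects.
function fakeSynth(delayMs: number, signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const t = setTimeout(() => resolve("audio"), delayMs);
    signal.addEventListener(
      "abort",
      () => {
        clearTimeout(t);
        reject(new Error("aborted"));
      },
      { once: true },
    );
  });
}
```

A handler could then call something like `withHardTimeout((signal) => fakeSynth(100, signal), 15_000)` and map a `null` result to `{ audio: null }`. Passing the signal into the task, rather than only racing against a timer, is what lets the abandoned call actually stop instead of running to completion in the background.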
@@ -145,10 +145,8 @@ export type StartResponse = {
   sessionId: string;
   scene: Scene;
   imageBase64: string;
-  /** Post-voice character registry (with provisioned voices). */
+  /** Character registry with voice references provisioned for new speakers. */
   characters: Character[];
-  /** Per-beat synthesized audio, keyed by beat.id. */
-  beatAudio?: Record<string, BeatAudio>;
 };
 
 // /api/scene — generates the next Scene, given session whose latest
@@ -162,7 +160,27 @@ export type SceneResponse = {
   scene: Scene;
   imageBase64: string;
   characters: Character[];
-  beatAudio?: Record<string, BeatAudio>;
 };
 
+// /api/beat-audio — lazily synthesize one beat's voice. Client fires this
+// per beat after a scene loads; server has a per-call timeout so MiMo
+// tail-latency cannot block the UI. A null audio response means "play silent."
+//
+// Payload deliberately slim: just the line to speak and the speaker's voice
+// reference. The client extracts the voice from its local session.characters
+// before posting — sending the full Session would force ~160KB of base64 per
+// OTHER speaker plus the entire scene history to ride along for nothing.
+export type BeatAudioRequest = {
+  beat: {
+    id: string;
+    line: string;
+    lineDelivery?: string;
+  };
+  voice: CharacterVoice;
+};
+
+export type BeatAudioResponse = {
+  audio: BeatAudio | null;
+};
+
 // /api/vision — interprets a background click on the current image and
@@ -197,5 +215,4 @@ export type InsertBeatPartial = {
 export type InsertBeatResponse = {
   partial: InsertBeatPartial;
   characters: Character[];
-  audio?: BeatAudio;
 };
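
On the client side, the per-beat fan-out with abort and an identity-checked finally block might look roughly like the sketch below. It is an assumption-laden illustration rather than the code from this commit: `BeatAudioLoader` and `BeatAudioFetcher` are hypothetical names, `CharacterVoice` and `BeatAudio` are aliased to `unknown` because their real definitions are not part of this diff, and the fetcher stands in for a POST to /api/beat-audio.

```typescript
// Placeholder aliases (assumption): the real types live elsewhere in the repo.
type CharacterVoice = unknown;
type BeatAudio = unknown;

// Request shape as introduced in the diff above.
type BeatAudioRequest = {
  beat: { id: string; line: string; lineDelivery?: string };
  voice: CharacterVoice;
};

// Hypothetical transport: posts one request and resolves to audio or null.
type BeatAudioFetcher = (
  req: BeatAudioRequest,
  signal: AbortSignal,
) => Promise<BeatAudio | null>;

class BeatAudioLoader {
  readonly audioByBeat = new Map<string, BeatAudio | null>();
  readonly pending = new Set<string>();
  private current: AbortController | null = null;
  private readonly fetchBeatAudio: BeatAudioFetcher;

  constructor(fetchBeatAudio: BeatAudioFetcher) {
    this.fetchBeatAudio = fetchBeatAudio;
  }

  // Call whenever the scene id changes, with one slim request per beat.
  async loadScene(requests: BeatAudioRequest[]): Promise<void> {
    this.current?.abort(); // cancel fetches that belong to the previous scene
    const controller = new AbortController();
    this.current = controller;
    this.audioByBeat.clear();
    this.pending.clear();

    const jobs = requests.map(async (req) => {
      this.pending.add(req.beat.id);
      try {
        const audio = await this.fetchBeatAudio(req, controller.signal);
        // Drop late results from a scene that is no longer current.
        if (this.current === controller) this.audioByBeat.set(req.beat.id, audio);
      } catch {
        // Aborted or failed: null means "play silent".
        if (this.current === controller) this.audioByBeat.set(req.beat.id, null);
      } finally {
        // Identity check: a stale scene must not clear the pending flag of a
        // newer scene that happens to reuse the same beat id.
        if (this.current === controller) this.pending.delete(req.beat.id);
      }
    });
    await Promise.all(jobs);
  }
}
```

The identity check matters because beat ids are only unique within a scene. Without it, a slow response for beat `b1` of the previous scene could overwrite the audio, or clear the loading state, of beat `b1` in the scene that replaced it.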