fix(web): reduce FOT by stripping redundant voice data from transport

Three transport-only optimizations that cut per-session Vercel Fast Origin Transfer (FOT) by ~50-60%:

P0 — Server strips voice.referenceAudioBase64 from already-known characters
in /api/scene and /api/insert-beat responses (defense-in-depth).
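A minimal sketch of the P0 idea, not the code in this commit. The Voice and
Character shapes, the knownNames set, and the stripKnownVoiceAudio helper are
all hypothetical names chosen for illustration; only referenceAudioBase64 comes
from the message above.

```typescript
// Hypothetical shapes; the real types in the codebase may differ.
interface Voice {
  voiceId?: string;
  referenceAudioBase64?: string;
}

interface Character {
  name: string;
  visualDescription: string;
  voice?: Voice;
}

// Drop the heavy reference audio for characters the client already holds.
// New characters (not in knownNames) keep their full voice payload.
function stripKnownVoiceAudio(
  characters: Character[],
  knownNames: ReadonlySet<string>,
): Character[] {
  return characters.map((c) => {
    if (!c.voice || !knownNames.has(c.name)) return c;
    const { referenceAudioBase64: _dropped, ...lightVoice } = c.voice;
    return { ...c, voice: lightVoice };
  });
}
```

Identifying known characters by name is an assumption; a stable id would be
safer if the real model has one.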

P1 — Client strips all voice data from session before sending to
/api/scene, /api/vision, and /api/insert-beat. Voices are retained locally
and re-merged from responses via mergeCharactersPreserveVoice(). The engine
only needs character names + visualDescriptions for scene generation.
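The strip-then-re-merge round trip in P1 could look roughly like the sketch
below. mergeCharactersPreserveVoice() is named above but its implementation is
not shown here, so mergePreservingVoice is a hypothetical stand-in, and the
type shapes and name-based matching are assumptions.

```typescript
// Hypothetical shapes; the real session types may differ.
interface CharacterVoice {
  referenceAudioBase64?: string;
}

interface SessionCharacter {
  name: string;
  visualDescription: string;
  voice?: CharacterVoice;
}

// Before a request: send characters without any voice data.
function stripVoicesForTransport(chars: SessionCharacter[]): SessionCharacter[] {
  return chars.map(({ voice: _voice, ...rest }) => rest);
}

// After a response: keep a voice the server sent, otherwise fall back to
// the voice retained locally for the character with the same name.
function mergePreservingVoice(
  local: SessionCharacter[],
  incoming: SessionCharacter[],
): SessionCharacter[] {
  const localByName = new Map<string, SessionCharacter>();
  for (const c of local) localByName.set(c.name, c);

  return incoming.map((c) => {
    const voice = c.voice ?? localByName.get(c.name)?.voice;
    return voice ? { ...c, voice } : c;
  });
}
```

With this shape, a response whose characters carry voice: undefined still ends
up with the locally retained voices after the merge.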

P3 — /api/beat-audio returns binary audio (Response with Content-Type)
instead of JSON-wrapped base64, saving ~33% encoding overhead. Client
converts to blob URLs; PlayCanvas accepts a single audioSrc prop.
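The ~33% figure in P3 follows from how base64 works: every 3 raw bytes become
4 ASCII characters. A small sketch of that arithmetic follows; the function
names are hypothetical, and JSON quoting and key overhead are ignored.

```typescript
// Length of the padded base64 encoding of `byteLength` raw bytes.
function base64Length(byteLength: number): number {
  return Math.ceil(byteLength / 3) * 4;
}

// Extra bytes on the wire relative to sending the raw binary body.
function base64Overhead(byteLength: number): number {
  if (byteLength === 0) return 0;
  return (base64Length(byteLength) - byteLength) / byteLength;
}

// Example: a 300,000-byte audio clip.
const rawBytes = 300_000;
const encodedBytes = base64Length(rawBytes); // 400,000 characters
const overhead = base64Overhead(rawBytes); // one third extra
console.log({ rawBytes, encodedBytes, overhead });
```

Returning the bytes directly with a Content-Type header avoids this expansion
before any HTTP compression is considered.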

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
yuanzonghao
2026-06-05 00:08:02 +08:00
parent c30d11d60b
commit e88e988de3
5 changed files with 118 additions and 47 deletions
+4 -1
@@ -26,7 +26,10 @@ export async function POST(req: Request) {
     // See StartRequest.clientTts — BYO clients synth in-browser, so drop server TTS.
     const config = body.clientTts === true ? { ...base, tts: undefined } : base;
     const result = await requestInsertBeat(config, body);
-    return NextResponse.json(result);
+    return NextResponse.json({
+      ...result,
+      characters: result.characters.map((c) => ({ ...c, voice: undefined })),
+    });
   } catch (err) {
     const message = err instanceof Error ? err.message : "Unknown error";
     return NextResponse.json({ error: message }, { status: 500 });