fix(web): reduce FOT by stripping redundant voice data from transport

Three transport-only optimizations that cut per-session Vercel FOT by ~50-60%:

P0: Server strips voice.referenceAudioBase64 from already-known characters in /api/scene and /api/insert-beat responses (defense-in-depth).

P1: Client strips all voice data from the session before sending it to /api/scene, /api/vision, and /api/insert-beat. Voices are retained locally and re-merged from responses via mergeCharactersPreserveVoice(). The engine only needs character names + visualDescriptions for scene generation.

P3: /api/beat-audio returns binary audio (Response with Content-Type) instead of JSON-wrapped base64, saving ~33% encoding overhead. Client converts to blob URLs; PlayCanvas accepts a single audioSrc prop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
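
For readers without the full diff, the P1 strip-and-re-merge flow might look roughly like the sketch below. The `Character` and `Voice` shapes, the helper names `stripVoices` and `mergeVoicesSketch`, and matching characters by name are all assumptions made for illustration; the real mergeCharactersPreserveVoice() in this commit may behave differently.

```typescript
// Hypothetical shapes: the real types in the repo may differ.
interface Voice {
  referenceAudioBase64?: string;
}

interface Character {
  name: string;
  visualDescription: string;
  voice?: Voice;
}

// Drop voice data before the session goes over the wire.
// Only names and visual descriptions are kept.
function stripVoices(characters: Character[]): Character[] {
  return characters.map(({ voice: _voice, ...rest }) => rest);
}

// Re-attach locally retained voices to characters returned by the server.
// Assumption: characters are matched by name, and a locally known voice
// wins over whatever the response carries for that character.
function mergeVoicesSketch(
  local: Character[],
  incoming: Character[],
): Character[] {
  const voices = new Map<string, Voice>();
  for (const c of local) {
    if (c.voice) voices.set(c.name, c.voice);
  }
  return incoming.map((c) => {
    const kept = voices.get(c.name);
    return kept ? { ...c, voice: kept } : c;
  });
}
```

Under these assumptions, a character the client already knows keeps its local voice, while a new character introduced by the response keeps the voice the server sent.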
2026-06-05 00:08:02 +08:00
parent c30d11d60b
commit e88e988de3
5 changed files with 118 additions and 47 deletions
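
The ~33% figure quoted for P3 follows from how base64 works: every 3 input bytes become 4 output characters, padded up to a multiple of 4. A small sketch of that arithmetic is below; the function names are illustrative and not part of this commit, and the JSON envelope (quotes, keys) would add a few more bytes on top.

```typescript
// Length in characters of the padded base64 encoding of `byteLength` bytes.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// Relative size increase of base64 over the raw bytes (about 1/3 for large inputs).
function base64Overhead(byteLength: number): number {
  return base64Length(byteLength) / byteLength - 1;
}

// A 300,000-byte clip occupies 400,000 characters once base64-encoded.
const encodedSize = base64Length(300_000);
const overhead = base64Overhead(300_000);
```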
@@ -26,7 +26,11 @@ export async function POST(req: Request) {
  try {
    const config = loadEngineConfig(req.headers);
    const result = await requestBeatAudio(config, body);
-    return NextResponse.json(result);
+    if (!result.audio) return new Response(null, { status: 204 });
+    const binary = Buffer.from(result.audio.base64, "base64");
+    return new Response(binary, {
+      headers: { "Content-Type": result.audio.mime },
+    });
  } catch (err) {
    // Engine already swallows synth errors and returns audio:null. Anything
    // that reaches here is config-level — surface so the client can log it.