feat(tts): StepFun voice selection via CharacterDesigner + provider-aware beat-audio

Make homepage cards and live sessions produce sound when the server is configured for StepFun TTS, instead of silently failing (the prebaked Xiaomi voice was useless on a StepFun server, and wasted ~220KB/beat in Fast Origin Transfer, FOT).

Three coordinated changes:

1. CharacterDesigner now picks a StepFun preset voice id directly from the 32-entry catalog in the SAME LLM call that designs the character: zero extra latency, LLM-grade match quality. The Xiaomi prompt path is byte-identical to history (verified programmatically), so cache hit rate and voice quality are preserved. pickStepfunVoiceId (keyword scorer) remains the fallback for orphan speakers and invalid LLM picks.

2. The 32-preset catalog moves to lib/tts-client/stepfun-voices.json as the single source of truth, shared by the scorer, the CharacterDesigner prompt, /api/tts-provider, and the offline enrich script.

3. A new GET /api/tts-provider endpoint lets the client probe the server's TTS provider at /play mount. fetchBeatAudio then shapes its request body: on a StepFun server it sends the lightweight stepfunVoiceId / voiceDescription and omits the ~220KB Xiaomi reference audio (saving ~13MB of FOT per protagonist per session on prebaked cards). requestBeatAudio re-provisions on a provider mismatch before synth, so audio never goes silent on a cross-provider replay or a mid-session provider flip.

New type fields are all optional and backward-compatible: Character.stepfunVoiceId and BeatAudioRequest.voiceDescription / characterName / stepfunVoiceId; BeatAudioRequest.voice is now optional.

AGENTS.md is updated for the new route, type fields, dependency map, and StepFun voice-selection flow.
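
The request shaping described in change 3 can be sketched roughly as follows. This is an illustration only, not the code in this commit: shapeBeatAudioBody, the VoiceFields interface, the text field, and the exact TtsProvider union are assumptions; only stepfunVoiceId, voiceDescription, characterName, and the optional voice field are named in the message above.

```typescript
// Assumed union; null stands for "probe failed, provider unknown".
type TtsProvider = "xiaomi" | "stepfun" | null;

// Hypothetical container for the bulky Xiaomi reference audio.
interface VoiceFields {
  referenceAudioBase64?: string;
}

// Hypothetical request shape; the real BeatAudioRequest may carry more fields.
interface BeatAudioRequest {
  text: string;
  characterName?: string;
  voiceDescription?: string;
  stepfunVoiceId?: string;
  voice?: VoiceFields;
}

// Returns a copy of the request shaped for the probed provider.
// - "stepfun": drop the heavy `voice` payload, keep the lightweight fields.
// - anything else (including null/unknown): send everything and let the
//   server normalize, mirroring the defensive fallback described above.
function shapeBeatAudioBody(
  full: BeatAudioRequest,
  provider: TtsProvider,
): BeatAudioRequest {
  const body: BeatAudioRequest = { ...full };
  if (provider === "stepfun") {
    delete body.voice;
  }
  return body;
}
```

Copying before deleting keeps the caller's object intact, so the same character data can still be reused if the provider later turns out to be Xiaomi.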
2026-06-15 12:49:25 +08:00
parent da191dd7a2
commit ca73a41a0b
15 changed files with 754 additions and 90 deletions
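
The validate-then-fall-back voice selection from change 1 of the commit message might look something like the sketch below. Everything here is hypothetical: the StepfunVoice entry shape, the two sample entries, and both helper functions are stand-ins, since the schema of stepfun-voices.json and the real pickStepfunVoiceId scorer are not shown in this excerpt.

```typescript
// Hypothetical catalog entry; the real JSON schema is not shown here.
interface StepfunVoice {
  id: string;
  keywords: string[];
}

// Two made-up entries standing in for the 32-preset catalog.
const SAMPLE_CATALOG: StepfunVoice[] = [
  { id: "voice-a", keywords: ["calm", "warm"] },
  { id: "voice-b", keywords: ["deep", "gruff"] },
];

// Stand-in keyword scorer: counts how many of an entry's keywords appear
// in the description and returns the id of the best-scoring entry.
// Ties go to the earlier entry; an empty catalog yields undefined.
function pickByKeywords(
  description: string,
  catalog: StepfunVoice[],
): string | undefined {
  const text = description.toLowerCase();
  let best: StepfunVoice | undefined;
  let bestScore = -1;
  for (const voice of catalog) {
    const score = voice.keywords.filter((k) => text.includes(k)).length;
    if (score > bestScore) {
      best = voice;
      bestScore = score;
    }
  }
  return best?.id;
}

// Accept the LLM's pick only if it is a real catalog id; otherwise
// (missing pick or unknown id) fall back to the keyword scorer.
function resolveVoiceId(
  llmPick: string | undefined,
  description: string,
  catalog: StepfunVoice[],
): string | undefined {
  if (llmPick !== undefined && catalog.some((v) => v.id === llmPick)) {
    return llmPick;
  }
  return pickByKeywords(description, catalog);
}
```

Checking the pick against the catalog before trusting it means a hallucinated id never reaches the TTS request; the scorer only runs when the check fails.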
@@ -22,6 +22,7 @@ import type {
  Session,
  StartRequest,
  StartResponse,
+  TtsProvider,
  VisionRequest,
  VisionResponse,
 } from "@infiplot/types";
@@ -60,6 +61,17 @@ async function postJson<T>(path: string, body: unknown): Promise<T> {
  return res.json() as Promise<T>;
 }

+// GET variant of postJson — same 401 → AuthRequiredError mapping. Used by
+// getTtsProvider (a tiny config probe, no body).
+async function getJson<T>(path: string): Promise<T> {
+  const res = await fetch(path, { method: "GET" });
+  if (!res.ok) {
+    if (res.status === 401) throw new AuthRequiredError();
+    throw new Error(`HTTP ${res.status}`);
+  }
+  return res.json() as Promise<T>;
+}
+
 // ── FOT reduction helpers (server-fallback path only) ─────────────────
 // The server-fallback POSTs send the whole Session over the wire. Voice
 // data is bulky (~160KB/character via referenceAudioBase64) and the
@@ -99,6 +111,29 @@ function mergeCharactersPreserveVoice(
 // Otherwise they fall back to the server-side API routes, which read
 // environment variables — useful for Vercel deploys that already supply keys.

+// Probe the server's TTS provider so fetchBeatAudio can shape its request body
+// (skip the ~220KB Xiaomi reference audio when the server runs StepFun).
+//
+// BYO precedence: when the browser has a client model config (BYO mode),
+// voice synthesis always runs locally against the user's own Xiaomi key, so
+// the server provider is irrelevant: return "xiaomi" immediately, with no
+// network round-trip. Non-BYO → GET /api/tts-provider. Non-auth errors degrade
+// to null (the caller then sends voice fields defensively; server normalizes).
+export async function getTtsProvider(): Promise<TtsProvider> {
+  if (getClientConfig()) return "xiaomi";
+  try {
+    const data = await getJson<{ provider: TtsProvider }>("/api/tts-provider");
+    return data.provider;
+  } catch (e) {
+    // AuthRequiredError (401) propagates so the caller's handleAuthError can
+    // surface the login modal; other errors (network, 5xx) → null = unknown,
+    // and fetchBeatAudio falls back to sending everything + server normalizes.
+    if (e instanceof AuthRequiredError) throw e;
+    console.warn("[getTtsProvider] probe failed, assuming unknown:", e);
+    return null;
+  }
+}
+
 export async function startSession(req: StartRequest): Promise<StartResponse> {
  const config = getClientConfig();
  if (config) {