feat(tts): StepFun voice selection via CharacterDesigner + provider-aware beat-audio

Make homepage cards and live sessions produce sound when the server is configured for StepFun TTS, instead of silently failing. The prebaked Xiaomi voice was useless on a StepFun server and wasted ~220KB/beat in Fast Origin Transfer (FOT).

Three coordinated changes:

1. CharacterDesigner now picks a StepFun preset voice id directly from the 32-entry catalog in the same LLM call that designs the character: zero extra latency, LLM-grade match quality. The Xiaomi prompt path is byte-identical to history (verified programmatically), so cache hit rate and voice quality are preserved. pickStepfunVoiceId (keyword scorer) remains the fallback for orphan speakers and invalid LLM picks.

2. The 32-preset catalog moves to lib/tts-client/stepfun-voices.json as the single source of truth, shared by the scorer, the CharacterDesigner prompt, /api/tts-provider, and the offline enrich script.

3. A new GET /api/tts-provider endpoint lets the client probe the server's TTS provider at /play mount. fetchBeatAudio then shapes its request body: on a StepFun server it sends the lightweight stepfunVoiceId / voiceDescription and omits the ~220KB Xiaomi reference audio (FOT saving ~13MB per protagonist per session on prebaked cards). requestBeatAudio re-provisions on a provider mismatch before synth, so audio never goes silent on a cross-provider replay or mid-session provider flip.

New type fields are all optional and backward-compatible: Character.stepfunVoiceId and BeatAudioRequest.voiceDescription / characterName / stepfunVoiceId. BeatAudioRequest.voice is now optional.

AGENTS.md is updated for the new route, type fields, dependency map, and StepFun voice-selection flow.
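
The request shaping described in change 3 could look roughly like the sketch below. It is illustrative only and not the code in app/play/page.tsx: shapeVoiceFields, SpeakerInfo, VoiceFields, and the referenceAudio field are hypothetical names, and the types are trimmed stand-ins for the ones touched by this commit.

```typescript
// Illustrative sketch only. Types are trimmed stand-ins for the real ones.
type TtsProvider = "stepfun" | "xiaomi" | null;

type CharacterVoice = {
  provider: "stepfun" | "xiaomi";
  referenceAudio?: string; // assumed name for the large Xiaomi payload
};

// Hypothetical: the subset of a character the client needs for beat audio.
type SpeakerInfo = {
  name: string;
  voice?: CharacterVoice;
  voiceDescription?: string;
  stepfunVoiceId?: string;
};

// Hypothetical: the voice-related part of a BeatAudioRequest body.
type VoiceFields = {
  voice?: CharacterVoice;
  voiceDescription?: string;
  characterName?: string;
  stepfunVoiceId?: string;
};

function shapeVoiceFields(provider: TtsProvider, speaker: SpeakerInfo): VoiceFields {
  // On a StepFun server, a missing or non-StepFun voice is not usable as-is,
  // so send only the lightweight fields and let the server re-provision.
  const needsLightweight =
    provider === "stepfun" && speaker.voice?.provider !== "stepfun";

  if (needsLightweight) {
    return {
      voiceDescription: speaker.voiceDescription,
      characterName: speaker.name,
      stepfunVoiceId: speaker.stepfunVoiceId,
    };
  }

  // Otherwise keep the provisioned voice, plus the fields the server would
  // need if it has to re-provision after a provider flip.
  return {
    voice: speaker.voice,
    voiceDescription: speaker.voiceDescription,
    characterName: speaker.name,
  };
}
```

With a prebaked Xiaomi voice and provider "stepfun", the returned object has no voice field, so the reference audio never leaves the client. With provider "xiaomi", the voice is passed through unchanged.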
2026-06-15 12:49:25 +08:00
parent da191dd7a2
commit ca73a41a0b
15 changed files with 754 additions and 90 deletions
@@ -208,6 +208,13 @@ export type Character = {
  basePortraitUrl?: string;
  /** Xiaomi MiMo voice reference audio. */
  voice?: CharacterVoice;
+  /** StepFun preset voice id (e.g. "cixingnansheng"). Only present on
+   *  characters designed while the server ran StepFun, OR on prebaked
+   *  homepage cards enriched with a StepFun voice id. Lets the client send a
+   *  lightweight beat-audio request (no ~220KB Xiaomi reference audio) when the
+   *  server runs StepFun, and lets the server normalize an off-provider voice
+   *  without a fresh provision. Validated against the catalog at synth time. */
+  stepfunVoiceId?: string;
 };

 /** A single beat's synthesized audio, attached to the response. */
@@ -359,6 +366,22 @@ export type TtsConfig = {
  speechModel: string;
 };

+/** Which TTS provider the server is configured for (inferred from TtsConfig's
+ *  base URL by lib/tts-client's isStepfun). Exposed to the client via the
+ *  /api/tts-provider route so the play page can send only the voice fields
+ *  the server actually needs — e.g. skip the ~220KB Xiaomi reference audio
+ *  when the server runs StepFun (saving Fast Origin Transfer bandwidth).
+ *  `null` means no server-side TTS (silent). BYO client TTS takes precedence
+ *  over this signal. */
+export type TtsProvider = "stepfun" | "xiaomi" | null;
+
+// /api/tts-provider — lightweight GET returning the server's TTS provider so
+// the client can shape beat-audio request bodies accordingly (see fetchBeatAudio
+// in app/play/page.tsx). Response is a few dozen bytes; runs once per session.
+export type TtsProviderResponse = {
+  provider: TtsProvider;
+};
+
 export type EngineConfig = {
  text: ProviderConfig;
  image: ProviderConfig;
@@ -461,7 +484,23 @@ export type BeatAudioRequest = {
    line: string;
    lineDelivery?: string;
  };
-  voice: CharacterVoice;
+  /** The speaker's already-provisioned voice. Optional now — when the server
+   *  runs a DIFFERENT provider than `voice.provider` (e.g. the client holds a
+   *  Xiaomi voice from a prebaked card but the server runs StepFun), the
+   *  client may omit `voice` and send `voiceDescription` + `stepfunVoiceId`
+   *  instead to save the ~220KB reference-audio transfer. The server then
+   *  re-provisions against its own provider before synthesizing. */
+  voice?: CharacterVoice;
+  /** Voice-design card (中文). Used by the server to re-provision when
+   *  `voice` is absent or its provider doesn't match the server's TTS. */
+  voiceDescription?: string;
+  /** Speaker name — used as the StepFun provision salt for archetype spreading
+   *  when the server falls back to pickStepfunVoiceId. */
+  characterName?: string;
+  /** Pre-selected StepFun preset id (from a live CharacterDesigner pick or a
+   *  prebaked card). Honored directly when the server runs StepFun, skipping
+   *  both the keyword scorer and a network provision. */
+  stepfunVoiceId?: string;
 };

 export type BeatAudioResponse = {