feat(tts): StepFun voice selection via CharacterDesigner + provider-aware beat-audio

Make homepage cards and live sessions produce sound when the server is
configured for StepFun TTS, instead of silently failing (the prebaked
Xiaomi voice was useless on a StepFun server, and wasted ~220KB/beat in
Fast Origin Transfer).

Three coordinated changes:

1. CharacterDesigner now picks a StepFun preset voice id directly from the
   32-entry catalog in the SAME LLM call that designs the character, so it
   adds no extra latency and gets LLM-grade match quality. The Xiaomi prompt
   path is byte-identical to the previous version (verified
   programmatically), so cache hit rate and voice quality are preserved.
   pickStepfunVoiceId (keyword scorer) remains the fallback for orphan
   speakers and invalid LLM picks.

2. The 32-preset catalog moves to lib/tts-client/stepfun-voices.json as the
   single source of truth, shared by the scorer, the CharacterDesigner
   prompt, /api/tts-provider, and the offline enrich script.

3. A new GET /api/tts-provider endpoint lets the client probe the server's
   TTS provider at /play mount. fetchBeatAudio then shapes its request body:
   on a StepFun server it sends the lightweight stepfunVoiceId /
   voiceDescription and omits the ~220KB Xiaomi reference audio (saving
   ~13MB of Fast Origin Transfer per protagonist per session on prebaked
   cards). requestBeatAudio re-provisions on a provider mismatch before
   synthesis, so audio never goes silent on a cross-provider replay or a
   mid-session provider flip.
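
The request shaping in change 3 can be sketched roughly as follows. This is a
minimal illustration, not the code in app/play/page.tsx: shapeVoiceFields, the
Speaker type, and the referenceAudio field name are hypothetical, and the
sketch assumes the client always drops `voice` on a StepFun server. The two
figures quoted above are consistent with roughly 60 beats per protagonist
(13MB / 220KB), which is an inference and not something the commit states.

```typescript
// Hypothetical, simplified shapes. The real types are in the diff below.
type TtsProvider = "stepfun" | "xiaomi" | null;

type CharacterVoice = {
  provider: "stepfun" | "xiaomi";
  referenceAudio?: string; // assumed name for the ~220KB Xiaomi payload
};

type Speaker = {
  name: string;
  voice?: CharacterVoice;
  voiceDescription?: string;
  stepfunVoiceId?: string;
};

type VoiceFields = {
  voice?: CharacterVoice;
  voiceDescription?: string;
  characterName?: string;
  stepfunVoiceId?: string;
};

// Pick only the voice fields the server's provider can use.
function shapeVoiceFields(provider: TtsProvider, speaker: Speaker): VoiceFields {
  // Lightweight hints are cheap, so they are sent in every case.
  const hints: VoiceFields = {
    voiceDescription: speaker.voiceDescription,
    characterName: speaker.name,
  };
  if (provider === "stepfun") {
    // StepFun server: send the preset id and leave out the heavy voice.
    return { ...hints, stepfunVoiceId: speaker.stepfunVoiceId };
  }
  // Xiaomi (or unknown) server: keep the provisioned voice as before.
  return { ...hints, voice: speaker.voice };
}
```

Because JSON.stringify drops properties whose value is undefined, the
serialized body for a StepFun server carries no `voice` key at all.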

New type fields are all optional and backward-compatible:
Character.stepfunVoiceId and BeatAudioRequest.voiceDescription /
characterName / stepfunVoiceId. BeatAudioRequest.voice becomes optional.
AGENTS.md is updated for the new route, type fields, dependency map, and
StepFun voice-selection flow.
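
One way a StepFun server could consume these optional fields is sketched
below: honor a pre-selected id only if it is in the catalog, otherwise fall
back to a scorer. resolveStepfunVoiceId, the Scorer signature, and the
two-entry catalog are assumptions for illustration; only "cixingnansheng"
appears in the commit, and the real catalog has 32 entries in
lib/tts-client/stepfun-voices.json.

```typescript
// Fields a beat-audio request may carry (subset, all optional).
type VoiceHints = {
  stepfunVoiceId?: string;
  voiceDescription?: string;
  characterName?: string;
};

// Stand-in for the keyword scorer (pickStepfunVoiceId in the commit).
// The (description, salt) signature is an assumption.
type Scorer = (description: string, salt: string) => string;

type Resolved = { id: string; source: "preselected" | "fallback" };

function resolveStepfunVoiceId(
  hints: VoiceHints,
  catalogIds: ReadonlySet<string>,
  scorer: Scorer,
): Resolved {
  const picked = hints.stepfunVoiceId;
  // Validate against the catalog so a stale or hallucinated id is ignored.
  if (picked !== undefined && catalogIds.has(picked)) {
    return { id: picked, source: "preselected" };
  }
  // No usable pick: score the description, salted by the speaker name.
  const id = scorer(hints.voiceDescription ?? "", hints.characterName ?? "");
  return { id, source: "fallback" };
}

// Placeholder catalog; the second id is made up for the example.
const demoCatalog: ReadonlySet<string> = new Set(["cixingnansheng", "placeholder-voice"]);
const demoScorer: Scorer = () => "placeholder-voice";

const fromPick = resolveStepfunVoiceId({ stepfunVoiceId: "cixingnansheng" }, demoCatalog, demoScorer);
const fromFallback = resolveStepfunVoiceId({ stepfunVoiceId: "not-in-catalog" }, demoCatalog, demoScorer);
```

A request that omits every new field still resolves through the fallback
branch, which is the sense in which the fields stay backward-compatible.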
yuanzonghao
2026-06-15 12:49:25 +08:00
parent da191dd7a2
commit ca73a41a0b
15 changed files with 754 additions and 90 deletions
+40 -1
@@ -208,6 +208,13 @@ export type Character = {
basePortraitUrl?: string;
/** Xiaomi MiMo voice reference audio. */
voice?: CharacterVoice;
/** StepFun preset voice id (e.g. "cixingnansheng"). Only present on
* characters designed while the server ran StepFun, OR on prebaked
* homepage cards enriched with a StepFun voice id. Lets the client send a
* lightweight beat-audio request (no ~220KB Xiaomi reference audio) when the
* server runs StepFun, and lets the server normalize an off-provider voice
* without a fresh provision. Validated against the catalog at synth time. */
stepfunVoiceId?: string;
};
/** A single beat's synthesized audio, attached to the response. */
@@ -359,6 +366,22 @@ export type TtsConfig = {
speechModel: string;
};
/** Which TTS provider the server is configured for (inferred from TtsConfig's
* base URL by lib/tts-client's isStepfun). Exposed to the client via the
* /api/tts-provider route so the play page can send only the voice fields
* the server actually needs — e.g. skip the ~220KB Xiaomi reference audio
* when the server runs StepFun (saving Fast Origin Transfer bandwidth).
* `null` means no server-side TTS (silent). BYO client TTS takes precedence
* over this signal. */
export type TtsProvider = "stepfun" | "xiaomi" | null;
// /api/tts-provider — lightweight GET returning the server's TTS provider so
// the client can shape beat-audio request bodies accordingly (see fetchBeatAudio
// in app/play/page.tsx). Response is a few dozen bytes; runs once per session.
export type TtsProviderResponse = {
provider: TtsProvider;
};
export type EngineConfig = {
text: ProviderConfig;
image: ProviderConfig;
@@ -461,7 +484,23 @@ export type BeatAudioRequest = {
line: string;
lineDelivery?: string;
};
voice: CharacterVoice;
/** The speaker's already-provisioned voice. Optional now — when the server
* runs a DIFFERENT provider than `voice.provider` (e.g. the client holds a
* Xiaomi voice from a prebaked card but the server runs StepFun), the
* client may omit `voice` and send `voiceDescription` + `stepfunVoiceId`
* instead to save the ~220KB reference-audio transfer. The server then
* re-provisions against its own provider before synthesizing. */
voice?: CharacterVoice;
/** Voice-design card (中文). Used by the server to re-provision when
* `voice` is absent or its provider doesn't match the server's TTS. */
voiceDescription?: string;
/** Speaker name — used as the StepFun provision salt for archetype spreading
* when the server falls back to pickStepfunVoiceId. */
characterName?: string;
/** Pre-selected StepFun preset id (from a live CharacterDesigner pick or a
* prebaked card). Honored directly when the server runs StepFun, skipping
* both the keyword scorer and a network provision. */
stepfunVoiceId?: string;
};
export type BeatAudioResponse = {