feat(tts): StepFun voice selection via CharacterDesigner + provider-aware beat-audio
Make homepage cards and live sessions produce sound when the server is configured for StepFun TTS, instead of silently failing. The prebaked Xiaomi voice was useless on a StepFun server and wasted ~220KB/beat in Fast Origin Transfer (FOT).

Three coordinated changes:

1. CharacterDesigner now picks a StepFun preset voice id directly from the 32-entry catalog in the same LLM call that designs the character: zero extra latency, LLM-grade match quality. The Xiaomi prompt path is byte-identical to history (verified programmatically), so cache hit rate and voice quality are preserved. pickStepfunVoiceId (the keyword scorer) remains the fallback for orphan speakers and invalid LLM picks.

2. The 32-preset catalog moves to lib/tts-client/stepfun-voices.json as the single source of truth, shared by the scorer, the CharacterDesigner prompt, /api/tts-provider, and the offline enrich script.

3. A new GET /api/tts-provider endpoint lets the client probe the server's TTS provider at /play mount. fetchBeatAudio then shapes its request body: on a StepFun server it sends the lightweight stepfunVoiceId / voiceDescription and omits the ~220KB Xiaomi reference audio (an FOT saving of ~13MB per protagonist per session on prebaked cards). requestBeatAudio re-provisions on a provider mismatch before synth, so audio never goes silent on a cross-provider replay or a mid-session provider flip.

New type fields are all optional and backward-compatible: Character.stepfunVoiceId and BeatAudioRequest.voiceDescription / characterName / stepfunVoiceId. BeatAudioRequest.voice is now optional.

AGENTS.md is updated for the new route, type fields, dependency map, and StepFun voice-selection flow.
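As a rough illustration of the request shaping described in change 3, the sketch below shows one way a client could choose which voice fields to send once it knows the server's provider. It is not the code from app/play/page.tsx: `shapeVoiceFields`, the `Speaker` and `VoiceFields` types, and the `referenceAudio` field name are all hypothetical, and only the optional field names (`voice`, `voiceDescription`, `characterName`, `stepfunVoiceId`) and the `TtsProvider` union come from this commit.

```typescript
// Provider union as introduced by this commit.
type TtsProvider = "stepfun" | "xiaomi" | null;

// Assumed, trimmed-down shape. The diff only tells us that `voice.provider`
// exists; `referenceAudio` is a hypothetical name for the ~220KB payload.
type CharacterVoice = {
  provider: "stepfun" | "xiaomi";
  referenceAudio?: string;
};

// Hypothetical client-side view of a speaker.
type Speaker = {
  name: string;
  voice?: CharacterVoice;
  voiceDescription?: string;
  stepfunVoiceId?: string;
};

// The voice-related subset of BeatAudioRequest (all optional after this commit).
type VoiceFields = {
  voice?: CharacterVoice;
  voiceDescription?: string;
  characterName?: string;
  stepfunVoiceId?: string;
};

// Hypothetical helper: decide which voice fields go into the request body.
function shapeVoiceFields(provider: TtsProvider, speaker: Speaker): VoiceFields {
  const voiceMatches =
    speaker.voice !== undefined && speaker.voice.provider === provider;

  if (provider === "stepfun" && !voiceMatches) {
    // The held voice is off-provider (or missing): send only the lightweight
    // fields and leave out `voice`, so the reference audio is never uploaded.
    return {
      voiceDescription: speaker.voiceDescription,
      characterName: speaker.name,
      stepfunVoiceId: speaker.stepfunVoiceId,
    };
  }

  // Otherwise keep the pre-existing behaviour and send the provisioned voice,
  // plus the description so the server can still re-provision if it needs to.
  return {
    voice: speaker.voice,
    voiceDescription: speaker.voiceDescription,
    characterName: speaker.name,
  };
}
```

Under these assumptions, a prebaked card holding a Xiaomi voice yields a body without `voice` when the probed provider is "stepfun", and the unchanged body when it is "xiaomi".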
+40
-1
@@ -208,6 +208,13 @@ export type Character = {
   basePortraitUrl?: string;
   /** Xiaomi MiMo voice reference audio. */
   voice?: CharacterVoice;
+  /** StepFun preset voice id (e.g. "cixingnansheng"). Only present on
+   * characters designed while the server ran StepFun, OR on prebaked
+   * homepage cards enriched with a StepFun voice id. Lets the client send a
+   * lightweight beat-audio request (no ~220KB Xiaomi reference audio) when the
+   * server runs StepFun, and lets the server normalize an off-provider voice
+   * without a fresh provision. Validated against the catalog at synth time. */
+  stepfunVoiceId?: string;
 };

 /** A single beat's synthesized audio, attached to the response. */
@@ -359,6 +366,22 @@ export type TtsConfig = {
   speechModel: string;
 };

+/** Which TTS provider the server is configured for (inferred from TtsConfig's
+ * base URL by lib/tts-client's isStepfun). Exposed to the client via the
+ * /api/tts-provider route so the play page can send only the voice fields
+ * the server actually needs — e.g. skip the ~220KB Xiaomi reference audio
+ * when the server runs StepFun (saving Fast Origin Transfer bandwidth).
+ * `null` means no server-side TTS (silent). BYO client TTS takes precedence
+ * over this signal. */
+export type TtsProvider = "stepfun" | "xiaomi" | null;
+
+// /api/tts-provider — lightweight GET returning the server's TTS provider so
+// the client can shape beat-audio request bodies accordingly (see fetchBeatAudio
+// in app/play/page.tsx). Response is a few dozen bytes; runs once per session.
+export type TtsProviderResponse = {
+  provider: TtsProvider;
+};
+
 export type EngineConfig = {
   text: ProviderConfig;
   image: ProviderConfig;
@@ -461,7 +484,23 @@ export type BeatAudioRequest = {
     line: string;
     lineDelivery?: string;
   };
-  voice: CharacterVoice;
+  /** The speaker's already-provisioned voice. Optional now — when the server
+   * runs a DIFFERENT provider than `voice.provider` (e.g. the client holds a
+   * Xiaomi voice from a prebaked card but the server runs StepFun), the
+   * client may omit `voice` and send `voiceDescription` + `stepfunVoiceId`
+   * instead to save the ~220KB reference-audio transfer. The server then
+   * re-provisions against its own provider before synthesizing. */
+  voice?: CharacterVoice;
+  /** Voice-design card (中文). Used by the server to re-provision when
+   * `voice` is absent or its provider doesn't match the server's TTS. */
+  voiceDescription?: string;
+  /** Speaker name — used as the StepFun provision salt for archetype spreading
+   * when the server falls back to pickStepfunVoiceId. */
+  characterName?: string;
+  /** Pre-selected StepFun preset id (from a live CharacterDesigner pick or a
+   * prebaked card). Honored directly when the server runs StepFun, skipping
+   * both the keyword scorer and a network provision. */
+  stepfunVoiceId?: string;
 };

 export type BeatAudioResponse = {
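The comments on `stepfunVoiceId` and `characterName` describe a resolution order: honor a pre-selected preset id if it is in the catalog, otherwise fall back to a keyword scorer salted by the speaker name. A minimal sketch of that order is given below, purely as an illustration. The real pickStepfunVoiceId and the contents of stepfun-voices.json are not shown in this diff, so `CATALOG`, its keyword lists, the id "placeholder-voice-b", `saltHash`, `scoreFallback`, and `resolveStepfunVoiceId` are all assumptions; only "cixingnansheng" appears in the source.

```typescript
// Hypothetical catalog entry shape; the real JSON schema is not in this diff.
type VoicePreset = { id: string; keywords: string[] };

// Stand-in for the 32-entry catalog. Only "cixingnansheng" is a real id from
// the commit; the second entry and all keywords are placeholders.
const CATALOG: VoicePreset[] = [
  { id: "cixingnansheng", keywords: ["男", "磁性"] },
  { id: "placeholder-voice-b", keywords: ["女", "温柔"] },
];

// Small deterministic string hash, used only to spread tied picks by speaker.
function saltHash(salt: string): number {
  let h = 0;
  for (let i = 0; i < salt.length; i++) {
    h = (h * 31 + salt.charCodeAt(i)) >>> 0;
  }
  return h;
}

// Assumed stand-in for the keyword scorer: count keyword hits in the voice
// description, then break ties with the salt so different speakers spread
// across equally good presets.
function scoreFallback(description: string, salt: string): string {
  const scored = CATALOG.map((p) => ({
    id: p.id,
    score: p.keywords.filter((k) => description.indexOf(k) !== -1).length,
  }));
  const top = Math.max(...scored.map((s) => s.score));
  const tied = scored.filter((s) => s.score === top);
  return tied[saltHash(salt) % tied.length].id;
}

// Resolution order sketched from the field comments: a valid pre-selected id
// wins; a missing or unknown id falls through to the scorer.
function resolveStepfunVoiceId(
  requested: string | undefined,
  voiceDescription: string | undefined,
  characterName: string | undefined,
): string {
  if (requested !== undefined && CATALOG.some((p) => p.id === requested)) {
    return requested;
  }
  return scoreFallback(voiceDescription ?? "", characterName ?? "");
}
```

Validating the id against the catalog before use keeps a stale or hallucinated pick from reaching the synth call, which is the behaviour the commit message attributes to the fallback path.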