feat(tts): add StepFun preset-voice provider, route by URL + voice tag

Add StepFun step-tts-mini / step-tts-2 / stepaudio-2.5-tts as an alternate TTS provider alongside Xiaomi MiMo. The provider is auto-detected from the TTS_BASE_URL host (contains `stepfun.com` → StepFun; otherwise → MiMo), mirroring how the image client infers Runware from `*.runware.ai`.

CharacterVoice becomes a discriminated union on `provider`:
- xiaomi: { referenceAudioBase64, mimeType } (unchanged)
- stepfun: { voiceId, model, mimeType } (preset voice ID + chosen model)

Provision dispatches on the current cfg's base URL; synthesis dispatches on the voice's own `provider` tag, so a session with mixed voices (e.g. after a provider switch mid-development) routes each beat through the correct protocol. xiaomiSynthesize now guards against being called with a non-xiaomi voice, so the bug surfaces as a clear runtime error rather than as a TypeScript narrowing violation at the access site.

StepFun has no voicedesign equivalent: it offers only preset voices and voice cloning from a reference audio upload. Cloning would require an extra asset per character, so v1 maps the LLM's Chinese voiceDescription to one of the 32 published preset IDs via gender + age + tone keyword scoring, with a deterministic hash spread across the top-3 candidates so that multiple characters with similar descriptions don't collapse onto the same preset. lineDelivery is accepted but not yet propagated to StepFun's voice_label.emotion / .style fields; that is left as a follow-up.

beat-audio route validation is relaxed from `voice.referenceAudioBase64` (xiaomi-shaped) to `voice.provider` (shape-agnostic), so stepfun voices pass the gate; provider-specific shape errors still surface from the synth function.

Observed latency on InfiPlot's dev loop: StepFun step-tts-mini has a median of ~2.3s per beat with 0% timeouts across the test session, vs MiMo's median of ~8s, with the long tail tripping the existing 15s synth budget on roughly 2 of 3 beats.

Pricing: step-tts-mini costs ¥0.9 per 10,000 characters (~¥0.14 per typical 50-beat session), vs MiMo TTS, which is currently free under the Token Plan creator incentive.

AGENTS.md provider matrix updated to describe both providers and the discriminated-union dispatch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
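
For readers skimming the message, the two dispatch points it describes (provision keyed on the base URL, synthesis keyed on the voice's own tag) could look roughly like the sketch below. This is not the code in this commit: `detectProvider`, `describeSynthRoute` and `requireXiaomiVoice` are hypothetical names, and the URL handling is an assumption based only on the "host contains `stepfun.com`" rule stated above.

```typescript
// Illustrative sketch only; function names here are hypothetical.
type CharacterVoice =
  | { provider: "xiaomi"; referenceAudioBase64: string; mimeType: string }
  | { provider: "stepfun"; voiceId: string; model: string; mimeType: string };

type XiaomiVoice = Extract<CharacterVoice, { provider: "xiaomi" }>;

// Provision-time routing: look only at the configured base URL.
function detectProvider(baseUrl: string): CharacterVoice["provider"] {
  let host: string;
  try {
    host = new URL(baseUrl).hostname.toLowerCase();
  } catch (_err) {
    // Assumption: fall back to the raw string if it is not a parseable URL.
    host = baseUrl.toLowerCase();
  }
  return host.includes("stepfun.com") ? "stepfun" : "xiaomi";
}

// Synth-time routing: look only at the tag stored on the voice itself,
// so voices provisioned under a different provider still route correctly.
function describeSynthRoute(voice: CharacterVoice): string {
  switch (voice.provider) {
    case "xiaomi":
      return `xiaomi clone from ${voice.referenceAudioBase64.length} base64 chars`;
    case "stepfun":
      return `stepfun preset ${voice.voiceId} via ${voice.model}`;
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a new
      // provider is added to the union without a case above.
      const exhaustive: never = voice;
      throw new Error(`unsupported voice: ${JSON.stringify(exhaustive)}`);
    }
  }
}

// Runtime guard in the spirit of the one added to xiaomiSynthesize.
function requireXiaomiVoice(voice: CharacterVoice): XiaomiVoice {
  if (voice.provider !== "xiaomi") {
    throw new Error(`xiaomi synth called with a ${voice.provider} voice`);
  }
  return voice; // narrowed to the xiaomi member after the check
}
```

Keying synthesis on the stored tag rather than on the current config is what lets a session keep working after TTS_BASE_URL is changed, at the cost of needing credentials for both providers if old voices are still in play.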
2026-06-08 17:15:02 +08:00
parent 75548ce005
commit 19bbee16fe
6 changed files with 250 additions and 10 deletions
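
The preset selection described in the commit message (keyword scoring, then a deterministic hash spread over the top-3 candidates) can be sketched as follows. Everything here is an assumption made for illustration: the catalogue is a placeholder (only `cixingnansheng` is named in this commit; the other IDs and all keyword lists are invented), the hash is an FNV-1a-style function over UTF-16 code units, and hashing the character name together with the description is a guess at what is used as the hash input.

```typescript
// Illustrative sketch only; not the scoring table or hash used in the commit.
type Preset = { id: string; keywords: string[] };

// Placeholder catalogue. Only "cixingnansheng" appears in the commit;
// the remaining IDs and every keyword list are made up for this example.
const PRESETS: Preset[] = [
  { id: "cixingnansheng", keywords: ["男", "磁性", "低沉"] },
  { id: "placeholder-female-young", keywords: ["女", "年轻", "清亮"] },
  { id: "placeholder-female-gentle", keywords: ["女", "温柔", "成熟"] },
  { id: "placeholder-male-young", keywords: ["男", "年轻", "阳光"] },
];

// Score = number of the preset's keywords that occur in the description.
function score(description: string, preset: Preset): number {
  return preset.keywords.filter((k) => description.includes(k)).length;
}

// FNV-1a-style 32-bit hash over UTF-16 code units (no randomness, no state).
function hash32(text: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0; // force an unsigned result so the modulo below is non-negative
}

function pickPreset(characterName: string, description: string): string {
  const ranked = PRESETS
    .map((preset, index) => ({ preset, index, score: score(description, preset) }))
    // Highest score first; catalogue order breaks ties so ranking is stable.
    .sort((a, b) => b.score - a.score || a.index - b.index);
  const top = ranked.slice(0, 3);
  // Same inputs always land on the same candidate; different characters
  // with near-identical descriptions may land on different ones.
  return top[hash32(characterName + "\n" + description) % top.length].preset.id;
}
```

Because no network call or random source is involved, re-provisioning the same character yields the same preset, which matches the "no network call on provision" note in the type comment below.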
@@ -160,12 +160,24 @@ export type WriterPlan = {
 //  Characters & voices (TTS)
 // ──────────────────────────────────────────────────────────────────────

-export type CharacterVoice = {
-  provider: "xiaomi";
-  /** Xiaomi MiMo design output stored as reference audio for later clones. */
-  referenceAudioBase64: string;
-  mimeType: string;
-};
+export type CharacterVoice =
+  | {
+      provider: "xiaomi";
+      /** Xiaomi MiMo design output stored as reference audio for later clones. */
+      referenceAudioBase64: string;
+      mimeType: string;
+    }
+  | {
+      provider: "stepfun";
+      /** StepFun preset voice ID (e.g. "cixingnansheng"). Selected by keyword
+       *  matching against the LLM-written voiceDescription — no network call
+       *  on provision (StepFun has no voicedesign endpoint), so this carries
+       *  only the picked preset, not a clip. */
+      voiceId: string;
+      /** TTS model used at synth time (step-tts-mini / step-tts-2 / stepaudio-2.5-tts). */
+      model: string;
+      mimeType: string;
+    };

 export type Character = {
  name: string;