Commit Graph

6 Commits

Author SHA1 Message Date
yuanzonghao ca73a41a0b feat(tts): StepFun voice selection via CharacterDesigner + provider-aware beat-audio
Make homepage cards and live sessions produce sound when the server is
configured for StepFun TTS, instead of silently failing (the prebaked
Xiaomi voice was useless on a StepFun server, and wasted ~220KB/beat in
Fast Origin Transfer).

Three coordinated changes:

1. CharacterDesigner now picks a StepFun preset voice id directly from the
   32-entry catalog in the SAME LLM call that designs the character — zero
   extra latency, LLM-grade match quality. The Xiaomi prompt path is
   byte-identical to history (verified programmatically) so cache hit rate
   and voice quality are preserved. pickStepfunVoiceId (keyword scorer)
   remains the fallback for orphan speakers / invalid LLM picks.

2. The 32-preset catalog moves to lib/tts-client/stepfun-voices.json as the
   single source of truth, shared by the scorer, the CharacterDesigner
   prompt, /api/tts-provider, and the offline enrich script.

3. A new GET /api/tts-provider endpoint lets the client probe the server's
   TTS provider at /play mount. fetchBeatAudio then shapes its request body:
   on a StepFun server it sends the lightweight stepfunVoiceId /
   voiceDescription and omits the ~220KB Xiaomi reference audio (FOT saving
   ~13MB per protagonist per session on prebaked cards). requestBeatAudio
   re-provisions on a provider mismatch before synth, so audio never goes
   silent on a cross-provider replay or mid-session provider flip.

New type fields are all optional and backward-compatible: Character.stepfunVoiceId,
BeatAudioRequest.voiceDescription/characterName/stepfunVoiceId, voice made
optional. AGENTS.md updated for the new route, type fields, dependency map,
and StepFun voice-selection flow.
2026-06-15 12:49:25 +08:00
baizhi958216 94973bc6c6 fix(tts): add non-null assertion in stepfun array access
Signed-off-by: baizhi958216 <1475289190@qq.com>
2026-06-11 12:15:14 +08:00
baizhi958216 2088bae311 fix(tts): replace Buffer.from with browser-compatible arrayBufferToBase64 in stepfun
Signed-off-by: baizhi958216 <1475289190@qq.com>
2026-06-11 12:15:13 +08:00
yuanzonghao 1a6238f8b8 fix(tts): harden StepFun provider integration
- Validate voice.provider against known whitelist (xiaomi|stepfun) in
  beat-audio route to return a clear 400 instead of falling through
- Move single-char pronouns (他/她) to weak-signal fallback in
  detectGender to avoid false positives on compounds like 其他
- Update .env.example with StepFun configuration examples

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-09 14:24:27 +08:00
DESKTOP-I1T6TF3\Q 04f22249c9 fix(tts): make stepfun preset pick case-stable and per-character
- Hash the lowercased description (matching the case-insensitive scoring)
  so the same archetype text picks the same preset regardless of case.
- Thread the character name through provisionVoice -> stepfunProvision as
  the hash salt, so two characters that share archetype keywords spread
  across the top-N candidate presets instead of collapsing on one voice.

Xiaomi path is unaffected (voicedesign mints a unique clip per call).
2026-06-09 09:14:44 +08:00
DESKTOP-I1T6TF3\Q 19bbee16fe feat(tts): add StepFun preset-voice provider, route by URL + voice tag
Add StepFun step-tts-mini / step-tts-2 / stepaudio-2.5-tts as an alternate
TTS provider alongside Xiaomi MiMo. Auto-detected from TTS_BASE_URL host
(contains `stepfun.com` → StepFun; otherwise → MiMo), mirroring how the
image client infers Runware from `*.runware.ai`.

CharacterVoice becomes a discriminated union on `provider`:
- xiaomi: { referenceAudioBase64, mimeType } — unchanged
- stepfun: { voiceId, model, mimeType } — preset voice ID + chosen model

Provision dispatches on the current cfg's base URL; synthesis dispatches
on the voice's own `provider` tag so a session with mixed voices (e.g. a
provider switch mid-development) routes each beat through the correct
protocol. xiaomiSynthesize now guards against being called with a non-
xiaomi voice, surfacing the bug as a clear runtime error instead of a
TypeScript narrow violation at the access site.

StepFun has no voicedesign equivalent — only preset voices + voice
cloning from a reference audio upload. Cloning would require an extra
asset per character, so v1 maps the LLM's Chinese voiceDescription to one
of the 32 published preset IDs via gender + age + tone keyword scoring,
with a deterministic hash spread across the top-3 candidates so multiple
characters with similar descriptions don't collapse onto the identical
preset. lineDelivery is accepted but not yet propagated to StepFun's
voice_label.emotion / .style fields — left as a follow-up.

beat-audio route validation relaxed from `voice.referenceAudioBase64`
(xiaomi-shaped) to `voice.provider` (shape-agnostic), so stepfun voices
pass the gate; provider-specific shape errors still surface from the
synth function.

Observed latency on InfiPlot's dev loop: StepFun step-tts-mini median
~2.3s per beat with 0% timeouts across the test session, vs MiMo's
median ~8s with the long tail tripping the existing 15s synth budget
on roughly 2 of 3 beats. Pricing: step-tts-mini ¥0.9/万字符 (~¥0.14
per typical 50-beat session) vs MiMo TTS currently free under the
Token Plan creator incentive.

AGENTS.md provider matrix updated to describe both providers and the
discriminated-union dispatch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-08 17:15:02 +08:00