19bbee16fe
Add StepFun step-tts-mini / step-tts-2 / stepaudio-2.5-tts as an alternate
TTS provider alongside Xiaomi MiMo. Auto-detected from TTS_BASE_URL host
(contains `stepfun.com` → StepFun; otherwise → MiMo), mirroring how the
image client infers Runware from `*.runware.ai`.

CharacterVoice becomes a discriminated union on `provider`:
- xiaomi: { referenceAudioBase64, mimeType } — unchanged
- stepfun: { voiceId, model, mimeType } — preset voice ID + chosen model
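
As a rough illustration of the union described above, the shapes might look like the following. This is a sketch inferred from the field lists in this message; the real definitions live in `@infiplot/types` and may carry more fields. `describeVoice` is a hypothetical helper that exists only to show narrowing on the tag.

```typescript
// Assumed shapes, reconstructed from the commit message only.
type XiaomiVoice = {
  provider: "xiaomi";
  referenceAudioBase64: string;
  mimeType: string;
};

type StepfunVoice = {
  provider: "stepfun";
  voiceId: string; // preset voice ID
  model: string; // chosen TTS model
  mimeType: string;
};

type CharacterVoice = XiaomiVoice | StepfunVoice;

// Hypothetical helper: switching on `provider` narrows the union, so each
// branch can read only the fields its provider actually defines.
function describeVoice(voice: CharacterVoice): string {
  switch (voice.provider) {
    case "xiaomi":
      return `xiaomi reference audio (${voice.referenceAudioBase64.length} base64 chars)`;
    case "stepfun":
      return `stepfun preset ${voice.voiceId} on ${voice.model}`;
  }
}
```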

Provisioning dispatches on the current cfg's base URL; synthesis dispatches
on the voice's own `provider` tag, so a session with mixed voices (e.g. after
a provider switch mid-development) routes each beat through the correct
protocol. xiaomiSynthesize now guards against being called with a
non-xiaomi voice, surfacing the bug as a clear runtime error instead of a
TypeScript narrowing violation at the access site.
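
The guard mentioned above could take roughly this form. It is a minimal sketch, not the code in `./xiaomi`: `requireXiaomiVoice` is a hypothetical name, the error text is invented for illustration, and the voice shapes are assumed from this message.

```typescript
// Assumed shapes, trimmed to what the guard needs.
type CharacterVoice =
  | { provider: "xiaomi"; referenceAudioBase64: string; mimeType: string }
  | { provider: "stepfun"; voiceId: string; model: string; mimeType: string };

// Hypothetical helper: fail fast with a descriptive message when a voice
// minted by another provider reaches the Xiaomi code path.
function requireXiaomiVoice(
  voice: CharacterVoice,
): Extract<CharacterVoice, { provider: "xiaomi" }> {
  if (voice.provider !== "xiaomi") {
    throw new Error(
      `xiaomiSynthesize received a "${voice.provider}" voice; expected "xiaomi"`,
    );
  }
  // After the check, the compiler has narrowed `voice` to the xiaomi member.
  return voice;
}
```

Calling such a helper at the top of the synth function keeps the failure at the boundary, before any provider-specific field is read.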

StepFun has no voicedesign equivalent, only preset voices and voice
cloning from a reference audio upload. Cloning would require an extra
asset per character, so v1 maps the LLM's Chinese voiceDescription to one
of the 32 published preset IDs via gender + age + tone keyword scoring,
with a deterministic hash spread across the top-3 candidates so that
multiple characters with similar descriptions don't collapse onto the
same preset. lineDelivery is accepted but not yet propagated to StepFun's
voice_label.emotion / .style fields; that is left as a follow-up.
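
One way to picture the scoring-plus-hash selection is the sketch below. Everything concrete in it is an assumption: the preset IDs and keyword tags are placeholders rather than StepFun's published catalogue, the hash is an FNV-1a-style function chosen for illustration, and hashing the description itself is a guess at what the real code seeds on.

```typescript
// Placeholder preset table; IDs and keywords are NOT StepFun's real ones.
type Preset = { id: string; keywords: string[] };

const PRESETS: Preset[] = [
  { id: "preset-a", keywords: ["女", "年轻", "温柔"] },
  { id: "preset-b", keywords: ["女", "年轻", "活泼"] },
  { id: "preset-c", keywords: ["女", "成熟", "温柔"] },
  { id: "preset-d", keywords: ["男", "年轻", "沉稳"] },
];

// Count how many of a preset's keywords occur in the description.
function score(preset: Preset, description: string): number {
  return preset.keywords.filter((k) => description.indexOf(k) !== -1).length;
}

// FNV-1a-style hash over UTF-16 code units: stable and non-cryptographic.
function hash32(input: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0; // force an unsigned 32-bit result
}

// Rank by score, keep the top three, then pick one deterministically.
function pickPreset(description: string): string {
  const top = PRESETS
    .map((preset) => ({ id: preset.id, s: score(preset, description) }))
    .sort((a, b) => b.s - a.s)
    .slice(0, 3);
  const chosen = top[hash32(description) % top.length];
  if (!chosen) throw new Error("preset table is empty");
  return chosen.id;
}
```

Because the index comes from a hash of the input, the same description always yields the same preset, while two near-identical descriptions will usually land on different members of the top three.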

beat-audio route validation relaxed from `voice.referenceAudioBase64`
(xiaomi-shaped) to `voice.provider` (shape-agnostic), so stepfun voices
pass the gate; provider-specific shape errors still surface from the
synth function.
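
A shape-agnostic gate of this kind might look like the sketch below. It is an assumption about the route, not a copy of it: `hasKnownProvider` is a hypothetical name, and checking the tag against a fixed list is stricter than a bare presence check on `voice.provider`.

```typescript
// Provider tags named in this commit message.
const KNOWN_PROVIDERS: readonly string[] = ["xiaomi", "stepfun"];

// Hypothetical validator: accept any voice carrying a known provider tag and
// leave field-level checks to the provider's own synth function.
function hasKnownProvider(voice: unknown): boolean {
  if (typeof voice !== "object" || voice === null) return false;
  const provider = (voice as { provider?: unknown }).provider;
  return typeof provider === "string" && KNOWN_PROVIDERS.indexOf(provider) !== -1;
}
```

Under this sketch a legacy payload that has `referenceAudioBase64` but no `provider` tag would be rejected, so whether such payloads exist in stored sessions is worth checking before adopting anything similar.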

Observed latency on InfiPlot's dev loop: StepFun step-tts-mini median
~2.3s per beat with 0% timeouts across the test session, vs MiMo's
median ~8s with the long tail tripping the existing 15s synth budget
on roughly 2 of 3 beats. Pricing: step-tts-mini ¥0.9 per 10,000
characters (~¥0.14 per typical 50-beat session) vs MiMo TTS, currently
free under the Token Plan creator incentive.
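
The per-session figure can be checked with a little arithmetic. The sketch below assumes about 31 characters per beat; that number is back-derived from the quoted ~¥0.14 and is not stated anywhere in this message. `sessionCostYuan` is a hypothetical helper.

```typescript
// step-tts-mini list price quoted above: ¥0.9 per 10,000 characters.
const YUAN_PER_10K_CHARS = 0.9;

// Cost of one session, given its beat count and average characters per beat.
function sessionCostYuan(beats: number, charsPerBeat: number): number {
  return (beats * charsPerBeat * YUAN_PER_10K_CHARS) / 10000;
}

// Assumption: ~31 characters per beat, so 50 beats is about 1,550 characters.
const typicalSession = sessionCostYuan(50, 31); // roughly 0.14 yuan
```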

AGENTS.md provider matrix updated to describe both providers and the
discriminated-union dispatch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
39 lines
1.5 KiB
TypeScript
import type { CharacterVoice, TtsConfig } from "@infiplot/types";
import { stepfunProvision, stepfunSynthesize } from "./stepfun";
import { xiaomiProvision, xiaomiSynthesize } from "./xiaomi";

// Provider auto-detection by base URL. Mirrors the image client convention
// of inferring Runware from *.runware.ai and falling back otherwise. Keeps
// the BYO client flow unchanged: TTS_PROVIDER env var stays unused, and
// browser-side keys (Xiaomi only today) keep working through the xiaomi path.
function isStepfun(cfg: TtsConfig): boolean {
  return /(^|[./])stepfun\.com\b/i.test(cfg.baseUrl);
}

export async function provisionVoice(
  cfg: TtsConfig,
  description: string,
): Promise<CharacterVoice> {
  return isStepfun(cfg)
    ? stepfunProvision(cfg, description)
    : xiaomiProvision(cfg, description);
}

// Dispatch by the voice's own provider tag, not by the current config. A
// session can outlive a provider switch (e.g. .env.local flip mid-game), and
// each voice must be synthesized via the protocol that minted it. The cfg
// still needs to point at the matching provider's endpoint; mismatch surfaces
// as a transparent network error, which `synthesizeBeat` already swallows.
export async function synthesize(
  cfg: TtsConfig,
  voice: CharacterVoice,
  text: string,
  delivery?: string,
  signal?: AbortSignal,
): Promise<{ audioBase64: string; mimeType: string }> {
  if (voice.provider === "stepfun") {
    return stepfunSynthesize(cfg, voice, text, delivery, signal);
  }
  return xiaomiSynthesize(cfg, voice, text, delivery, signal);
}
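
To see which base URLs the detection pattern accepts, it can be run against a few sample strings. The sketch below reuses the regular expression from `isStepfun` verbatim; the URLs are illustrative only and are not taken from the project's configuration.

```typescript
// Same pattern as isStepfun, applied to plain strings for illustration.
const STEPFUN_HOST = /(^|[./])stepfun\.com\b/i;

// Illustrative URLs only.
const samples: string[] = [
  "https://api.stepfun.com/v1", // subdomain: "." precedes "stepfun"
  "https://STEPFUN.com", // bare domain, matched case-insensitively
  "https://notstepfun.com/v1", // different domain: "t" precedes "stepfun"
  "https://example.com/v1", // unrelated host
  "https://example.com/stepfun.com", // path segment, not the host
];

const results: boolean[] = samples.map((url) => STEPFUN_HOST.test(url));
// results: [true, true, false, false, true]
```

The last sample matches because the pattern is tested against the whole URL string rather than the parsed hostname. If that matters for a deployment, extracting the hostname before testing would be a stricter alternative.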