Merge PR #79: feat(tts): StepFun voice selection via CharacterDesigner + provider-aware beat-audio

- StepFun voice selection: CharacterDesigner picks a preset voiceId from the 32-entry catalog (zero extra LLM call); pickStepfunVoiceId remains as fallback.
- Prebaked homepage cards enriched with stepfunVoiceId (147 characters, gemini model).
- /api/tts-provider endpoint + client probe: skip the ~220KB Xiaomi reference audio when the server runs StepFun (saves Fast Origin Transfer bandwidth).
- Server-side resolveVoice normalization: re-provisions on provider mismatch.
- Removed hardcoded 1.2x speech playback speed (was for slow MiMo voice).
- Hardened voice-provider validation per PR-agent review.

Xiaomi path prompt is byte-identical to history (prompt-cache-preserving).
2026-06-15 15:08:21 +08:00
parent 6e1ad55f1a 65b7daff0b
commit ba9f9c1342
122 changed files with 874 additions and 201 deletions
@@ -21,7 +21,7 @@ InfiPlot is a Next.js 16 / React 19 / TypeScript app for AI-driven interactive v
 - `lib/engine/agents/`: Architect, Writer, CharacterDesigner, Cinematographer, Painter.
 - `lib/engine/prompts.ts`: Agent prompts and prompt-cache-sensitive message builders.
 - `lib/ai-client/`: Text, image, vision, and retry wrappers.
- `lib/tts-client/`: TTS integration.
+- `lib/tts-client/`: TTS integration. `stepfun-voices.json` is the single source of truth for the 32 StepFun preset voices (shared by the scorer, CharacterDesigner prompt, `/api/tts-provider`, and the enrich script).
 - `lib/config.ts`: Server-side provider/environment loading.
 - `lib/presets.ts`, `lib/ttsPresets.ts`, `lib/options.ts`: Home-page presets and selectable options.
 - `scripts/`: Asset and preset generation helpers.
@@ -91,8 +91,9 @@ Common routes live under `app/api/`:
 - `POST /api/scene`: generates the next scene from an existing session.
 - `POST /api/vision`: interprets scene-image clicks.
 - `POST /api/insert-beat`: creates a transient beat without image generation.
- `POST /api/beat-audio`: lazy TTS for a displayed beat; returns binary audio, or `204` when silent.
+- `POST /api/beat-audio`: lazy TTS for a displayed beat; returns binary audio, or `204` when silent. `voice` is now OPTIONAL — when the server runs StepFun, the client omits the ~220KB Xiaomi reference audio and sends `stepfunVoiceId` / `voiceDescription` instead (saves Fast Origin Transfer bandwidth). The engine re-provisions on a provider mismatch before synthesizing.
 - `POST /api/parse-style-image`: extracts a style prompt from uploaded reference art.
+- `GET /api/tts-provider`: returns `{ provider: "stepfun" | "xiaomi" | null }` (the server's TTS provider, inferred from `TTS_BASE_URL`). Probed once at `/play` mount (non-BYO) so `fetchBeatAudio` can shape its request body — skip the ~220KB Xiaomi reference audio when the server runs StepFun. BYO client TTS takes precedence over this signal.
 - `POST /api/story-pack` / `POST /api/story-unpack`: stateless AES-GCM packing/unpacking for playable story share `.infiplot` files; uses `GALLERY_SECRET`.

 When changing public types or route payloads, update all route callers and client consumers in the same change.
@@ -114,6 +115,7 @@ Use pnpm with Node >=22. `pnpm-lock.yaml` is the source of truth; `package-lock.
 - `pnpm start`: run production server after building.
 - `pnpm lint`: Next.js built-in lint.
 - `pnpm typecheck`: `tsc --noEmit`.
+- `pnpm enrich:firstacts`: one-off enrichment of `public/home/firstact{,-portrait}/*.json` — adds `characters[i].stepfunVoiceId` via a TEXT-provider LLM call per character (uses `.env.local`). Idempotent; `--force` re-picks, `--only=f0,f1` filters, `--portrait` targets the portrait set.
 - `pnpm build:cf`: Cloudflare Workers build through OpenNext.
 - `pnpm preview:cf`: local Cloudflare preview.
 - `pnpm deploy:cf`: Cloudflare deploy.
@@ -139,7 +141,7 @@ Use `.env.example` as the source of truth. Never commit `.env.local`, API keys,
 - Text and Vision use `TEXT_*` and `VISION_*` over the `openai_compatible` protocol (the only supported text/vision protocol); Claude and Gemini are reached via their own OpenAI-compatible endpoints with the `*_PROVIDER` var unset.
 - Image uses `IMAGE_*`; supported protocols are `runware`, `openai_compatible`, and native `openai`. When `IMAGE_PROVIDER` is unset, Runware is inferred from `*.runware.ai` URLs and otherwise falls back to OpenAI-compatible image generations.
 - `IMAGE_TIMEOUT_MS` (per-attempt hard deadline) and `IMAGE_HEDGE_MS` (Painter scene-paint hedging: race a second request when the first is still pending after the threshold) are both OFF when unset — the default path must stay byte-identical to historical behavior. Hedging applies only to the Tier-A scene paint, never to portraits, and never fires after a fast failure (saturation guard). Client-side engine configs (`resolveEngineConfig`) intentionally do not set these fields.
- TTS supports Xiaomi MiMo (voicedesign + voiceclone) or StepFun (preset voices auto-selected by keyword scoring), inferred from `TTS_BASE_URL` (host containing `stepfun.com` → StepFun, otherwise → MiMo). `CharacterVoice` is a discriminated union on `provider`; synth dispatches on the voice's own tag so a session may carry both shapes through a provider switch. Blank config means silent mode.
+- TTS supports Xiaomi MiMo (voicedesign + voiceclone) or StepFun (preset voices), inferred from `TTS_BASE_URL` (host containing `stepfun.com` → StepFun, otherwise → MiMo). `CharacterVoice` is a discriminated union on `provider`; synth dispatches on the voice's own tag so a session may carry both shapes through a provider switch. Blank config means silent mode. StepFun voice selection: the CharacterDesigner LLM picks a preset id directly from the 32-entry catalog (`lib/tts-client/stepfun-voices.json`, rendered by `formatStepfunCatalogForPrompt`) when `config.tts` is StepFun — zero extra LLM call. `pickStepfunVoiceId` (keyword scorer) is the fallback for orphan speakers / invalid picks. Prebaked homepage cards are enriched with `Character.stepfunVoiceId` via `scripts/enrich-firstacts-stepfun.mjs` so a card works under either server provider.
 - `MOCK_IMAGE=true` skips image generation and returns a placeholder for cheap local iteration.
 - `NEXT_PUBLIC_IMAGE_PROXY_URL` and `NEXT_PUBLIC_IMAGE_PROXY_ALLOWED_HOSTS` opt into browser-side image proxying for allowed hosts.
 - Analytics uses optional Umami `NEXT_PUBLIC_UMAMI_*` values and must stay content-free/privacy-preserving.
@@ -148,7 +150,7 @@ Use `.env.example` as the source of truth. Never commit `.env.local`, API keys,

 ## File Dependency Map

-If modifying Writer, also check `director.ts`, `prompts.ts`, WriterPlan/StoryState types, and Cinematographer/Painter consumers. If modifying CharacterDesigner, check Director scheduling/merge logic, portrait prompts, voice provisioning, and Painter reference collection. If modifying Cinematographer or Painter, check Director, prompt builders, provider image options, orientation handling, and reference priority. If modifying Architect, check `orchestrator.ts`, `prompts.ts`, and StoryState patch rules. If modifying `lib/types/index.ts`, check all agents, Director, Orchestrator, API routes, and client consumers in `app/page.tsx`, `app/play/page.tsx`, and `components/PlayCanvas.tsx`. If modifying TTS, check server `beat-audio`, BYO client TTS, voice stripping/merging, and payload privacy. If modifying image delivery, check Painter, `lib/ai-client/image.ts`, mock images, orientation dimensions, preload/proxy logic, and style-reference validation.
+If modifying Writer, also check `director.ts`, `prompts.ts`, WriterPlan/StoryState types, and Cinematographer/Painter consumers. If modifying CharacterDesigner, check Director scheduling/merge logic, portrait prompts, voice provisioning, Painter reference collection, and (StepFun path) the `buildCharacterDesignerSystem` catalog injection + `stepfunVoiceId` validation. If modifying the StepFun voice catalog (`lib/tts-client/stepfun-voices.json`), also check `formatStepfunCatalogForPrompt`, `isValidStepfunVoiceId`, the CharacterDesigner system prompt, and the enrich script. If modifying Cinematographer or Painter, check Director, prompt builders, provider image options, orientation handling, and reference priority. If modifying Architect, check `orchestrator.ts`, `prompts.ts`, and StoryState patch rules. If modifying `lib/types/index.ts`, check all agents, Director, Orchestrator, API routes, and client consumers in `app/page.tsx`, `app/play/page.tsx`, and `components/PlayCanvas.tsx`. If modifying TTS, check server `beat-audio` (including the `resolveVoice` provider-mismatch normalization), `/api/tts-provider`, BYO client TTS, voice stripping/merging, payload privacy, and the StepFun voice-id flow (CharacterDesigner → provision → synth). If modifying image delivery, check Painter, `lib/ai-client/image.ts`, mock images, orientation dimensions, preload/proxy logic, and style-reference validation.

 ## Guide Maintenance

@@ -17,18 +17,26 @@ export async function POST(req: Request) {
    return NextResponse.json({ error: "Invalid JSON" }, { status: 400 });
  }

-  // Accept either provider's voice shape — xiaomi carries referenceAudioBase64,
-  // stepfun carries voiceId. We only check the discriminator + the line text;
-  // shape-specific validation lives in each provider's synth function.
+  // Voice is now optional — when the server runs StepFun, the client omits
+  // the ~220KB Xiaomi reference audio and sends stepfunVoiceId /
+  // voiceDescription instead (saves Fast Origin Transfer bandwidth). The
+  // engine's resolveVoice re-provisions on a provider mismatch. We only
+  // require the beat text + SOMETHING to synthesize from.
  const VALID_TTS_PROVIDERS = ["xiaomi", "stepfun"];
+  const hasInvalidVoiceProvider =
+    !!body.voice?.provider && !VALID_TTS_PROVIDERS.includes(body.voice.provider);
+  const hasVoice =
+    !!body.voice?.provider && VALID_TTS_PROVIDERS.includes(body.voice.provider);
+  const hasFallback =
+    !!body.stepfunVoiceId || !!body.voiceDescription;
  if (
    !body.beat?.id ||
    !body.beat?.line ||
-    !body.voice?.provider ||
-    !VALID_TTS_PROVIDERS.includes(body.voice.provider)
+    hasInvalidVoiceProvider ||
+    (!hasVoice && !hasFallback)
  ) {
    return NextResponse.json(
-      { error: "beat.id, beat.line and voice.provider (xiaomi|stepfun) are required" },
+      { error: "beat.id and beat.line are required, plus either voice.provider (xiaomi|stepfun) or stepfunVoiceId/voiceDescription" },
      { status: 400 },
    );
  }
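Reviewer note: the relaxed check in this hunk can be read as a pure predicate. The sketch below is a minimal restatement under assumptions: `BeatAudioBodySketch`, `checkBeatAudioBody`, and the per-case `reason` strings are hypothetical and not part of the PR, which returns a single 400 message.

```typescript
// Hypothetical, trimmed-down body shape (the real request type lives in @infiplot/types).
type BeatAudioBodySketch = {
  beat?: { id?: string; line?: string };
  voice?: { provider?: string };
  stepfunVoiceId?: string;
  voiceDescription?: string;
};

type CheckResult = { ok: true } | { ok: false; reason: string };

function isKnownProvider(p: string | undefined): p is "xiaomi" | "stepfun" {
  return p === "xiaomi" || p === "stepfun";
}

// Mirrors the route's condition: beat text is mandatory, an unknown provider
// is always rejected, and at least one synthesis source must be present.
function checkBeatAudioBody(body: BeatAudioBodySketch): CheckResult {
  if (!body.beat?.id || !body.beat?.line) {
    return { ok: false, reason: "missing beat.id or beat.line" };
  }
  const provider = body.voice?.provider;
  // A provider that is present but unknown fails even if fallback fields exist.
  if (provider && !isKnownProvider(provider)) {
    return { ok: false, reason: "unknown voice.provider" };
  }
  const hasVoice = isKnownProvider(provider);
  const hasFallback = !!body.stepfunVoiceId || !!body.voiceDescription;
  return hasVoice || hasFallback
    ? { ok: true }
    : { ok: false, reason: "nothing to synthesize from" };
}
```

Keeping the unknown-provider rejection separate from the "has a source" test is what stops a malformed `voice` from slipping through on the strength of the fallback fields.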
@@ -0,0 +1,25 @@
+import type { TtsProviderResponse } from "@infiplot/types";
+import { inferTtsProvider } from "@infiplot/tts-client";
+import { NextResponse } from "next/server";
+import { loadEngineConfig } from "@/lib/config";
+import { requireUser } from "@/lib/supabase/guard";
+
+export const runtime = "nodejs";
+
+// GET /api/tts-provider — tells the client which TTS provider the server is
+// configured for, so the play page can shape /api/beat-audio request bodies
+// accordingly (skip the ~220KB Xiaomi reference audio when the server runs
+// StepFun → saves Fast Origin Transfer bandwidth; the response itself is a
+// few dozen bytes). Runs once at /play mount; same auth as other routes so
+// the provider (a server-config fact, not user data) isn't leaked publicly.
+// BYO client TTS (clientTts:true) takes precedence and bypasses this signal.
+export async function GET() {
+  const auth = await requireUser();
+  if (auth instanceof NextResponse) return auth;
+
+  const cfg = loadEngineConfig();
+  const provider = cfg.tts ? inferTtsProvider(cfg.tts) : null;
+
+  const body: TtsProviderResponse = { provider };
+  return NextResponse.json(body);
+}
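Reviewer note: the route delegates to `inferTtsProvider`, whose body is not shown in this diff. The docs hunk describes the rule as "host containing `stepfun.com` → StepFun, otherwise → MiMo", with blank config meaning silent mode. A minimal sketch of that rule, assuming a hypothetical `inferProviderFromBaseUrl` that takes the raw base URL string (the real helper takes the loaded TTS config):

```typescript
type TtsProviderSketch = "stepfun" | "xiaomi" | null;

// Hypothetical helper, not the PR's inferTtsProvider.
function inferProviderFromBaseUrl(baseUrl: string | undefined): TtsProviderSketch {
  const trimmed = (baseUrl ?? "").trim();
  if (!trimmed) return null; // blank config: no TTS, client plays silent
  // Extract the authority from "scheme://host[:port]/path"; if the value does
  // not look like a URL, fall back to inspecting the raw string.
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^\/?#]+)/i.exec(trimmed);
  const host = (match ? match[1] : trimmed).toLowerCase();
  return /stepfun\.com/.test(host) ? "stepfun" : "xiaomi";
}
```

Matching on the host rather than the whole URL avoids a false positive when `stepfun.com` merely appears in a path or query string.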
@@ -35,6 +35,7 @@ import {
  visionDecide,
  classifyFreeform,
  requestInsertBeat,
+  getTtsProvider,
  AuthRequiredError,
 } from "@/lib/engineClient";
 import type {
@@ -49,6 +50,7 @@ import type {
  Session,
  StartResponse,
  TtsConfig,
+  TtsProvider,
 } from "@infiplot/types";
 import { track } from "@/lib/analytics";
 import { AUTH_ENABLED } from "@/lib/supabase/config";
@@ -779,6 +781,14 @@ function PlayInner() {
    loadClientTtsConfig(),
  );
  const byoTtsRef = useRef<TtsConfig | null>(byoTtsConfig);
+  // Server TTS provider (probed once at mount via /api/tts-provider). Used by
+  // fetchBeatAudio to decide which voice fields to send: when the server runs
+  // StepFun, omit the ~220KB Xiaomi `voice` and send stepfunVoiceId /
+  // voiceDescription instead (saves Fast Origin Transfer bandwidth). null =
+  // probe failed or server has no TTS; fetchBeatAudio then sends defensively
+  // and the server normalizes. Ignored entirely in BYO mode (byoTtsRef wins).
+  const [serverTtsProvider, setServerTtsProvider] = useState<TtsProvider>(null);
+  const serverTtsProviderRef = useRef<TtsProvider>(null);
  // BYO voice cache (see resolveByoVoice). Keyed by character name; persists
  // across scenes so each speaker is provisioned at most once per session.
  const provisionedVoicesRef = useRef<Map<string, Promise<CharacterVoice>>>(
@@ -853,10 +863,37 @@ function PlayInner() {
  useEffect(() => {
    phaseRef.current = phase;
  }, [phase]);
+  useEffect(() => {
+    serverTtsProviderRef.current = serverTtsProvider;
+  }, [serverTtsProvider]);
  useEffect(() => {
    setVisionClickEnabled(readStoredVisionClick());
  }, []);

+  // Probe the server's TTS provider ONCE at mount. Non-BYO users need this so
+  // fetchBeatAudio can skip the ~220KB Xiaomi reference audio when the server
+  // runs StepFun. BYO users never read this ref (byoTtsRef takes precedence),
+  // but the probe is harmless and cheap, so we run it unconditionally and let
+  // getTtsProvider short-circuit for BYO. AuthRequiredError is handled by the
+  // bootstrap flow's handleAuthError; other errors degrade to null silently.
+  useEffect(() => {
+    let cancelled = false;
+    getTtsProvider()
+      .then((p) => {
+        if (!cancelled) setServerTtsProvider(p);
+      })
+      .catch((e) => {
+        if (!cancelled && e instanceof AuthRequiredError) {
+          // Defer to the bootstrap effect's auth modal — leave provider null.
+          return;
+        }
+        // Non-auth errors already logged in getTtsProvider; null = unknown.
+      });
+    return () => {
+      cancelled = true;
+    };
+  }, []);
+
  function trackPlayError(source: ErrorSource, e: unknown, startMs: number, res?: Response) {
    const { kind, http_status } = classifyError(e, res);
    track("play_error", {
@@ -948,11 +985,23 @@ function PlayInner() {
      if (!speaker) return;

      const byo = byoTtsRef.current;
-      // Non-BYO relies on the server having provisioned speaker.voice. BYO
-      // skipped server TTS, so it needs a baked voice (prebaked card) or a
-      // voiceDescription to provision from in the browser.
-      if (!byo && !speaker.voice) return;
-      if (byo && !speaker.voice && !speaker.voiceDescription) return;
+      const serverProvider = serverTtsProviderRef.current;
+      // What we need to synthesize depends on the path:
+      //   - BYO (xiaomi): baked voice OR voiceDescription to provision locally.
+      //   - Server stepfun: stepfunVoiceId or voiceDescription — no Xiaomi
+      //     `voice` needed (saves the ~220KB reference-audio FOT).
+      //   - Server xiaomi / unknown (probe pending): accept ANY synthesizable
+      //     source. The null case covers the race where getTtsProvider hasn't
+      //     resolved before the first beat fetch fires — without this widening
+      //     a stepfun-only speaker (no Xiaomi voice) would be silently dropped.
+      //     The server resolves + normalizes regardless of which fields arrive.
+      if (byo) {
+        if (!speaker.voice && !speaker.voiceDescription) return;
+      } else if (serverProvider === "stepfun") {
+        if (!speaker.stepfunVoiceId && !speaker.voiceDescription) return;
+      } else {
+        if (!speaker.voice && !speaker.stepfunVoiceId && !speaker.voiceDescription) return;
+      }

      if (beatAudioAbortRef.current.has(beat.id)) return;
      const abort = new AbortController();
@@ -977,17 +1026,35 @@ function PlayInner() {
          );
          audioUrl = `data:${out.mimeType};base64,${out.audioBase64}`;
        } else {
-          // Server-side synth: POST just this beat + the speaker's voice (not
-          // the whole session) to /api/beat-audio. Returns 204 when the engine
-          // had nothing to say (no TTS configured / empty synth) and binary
-          // audio otherwise. Both 204 and !ok count as a silence strike so the
-          // nudge surfaces when the shared server key is being rate-limited.
+          // Server-side synth: shape the body by the probed provider so we don't
+          // waste Fast Origin Transfer bandwidth on the ~220KB Xiaomi reference
+          // audio when the server actually runs StepFun.
+          //   - stepfun → stepfunVoiceId + voiceDescription + characterName
+          //     (all lightweight; the server synths directly with the id).
+          //   - xiaomi / unknown → voice (the ~220KB reference audio the server
+          //     needs to clone), PLUS the lightweight fallback fields so the
+          //     server can still normalize on a provider mismatch (e.g. a prebaked
+          //     card holding a Xiaomi voice while the server runs StepFun).
+          const isStepfunServer = serverProvider === "stepfun";
          const res = await fetch("/api/beat-audio", {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({
              beat: { id: beat.id, line: beat.line, lineDelivery: beat.lineDelivery },
-              voice: speaker.voice,
+              ...(isStepfunServer
+                ? {
+                    stepfunVoiceId: speaker.stepfunVoiceId,
+                    voiceDescription: speaker.voiceDescription,
+                    characterName: speaker.name,
+                  }
+                : {
+                    voice: speaker.voice,
+                    // Defensive fallback fields (lightweight) — let the server
+                    // re-provision if speaker.voice.provider ≠ server provider.
+                    stepfunVoiceId: speaker.stepfunVoiceId,
+                    voiceDescription: speaker.voiceDescription,
+                    characterName: speaker.name,
+                  }),
            }),
            signal: abort.signal,
          });
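Reviewer note: the body shaping in this hunk could be factored into a pure helper, which would make the bandwidth-saving branch unit-testable. A minimal sketch, assuming hypothetical names (`buildBeatAudioBody` and the `*Sketch` types are illustrative and not in the PR):

```typescript
type ProviderSketch = "stepfun" | "xiaomi" | null;
type VoiceSketch = { provider: "xiaomi" | "stepfun" };
type SpeakerSketch = {
  name: string;
  voice?: VoiceSketch;
  stepfunVoiceId?: string;
  voiceDescription?: string;
};
type BeatSketch = { id: string; line: string; lineDelivery?: string };
type BeatAudioBodyOut = {
  beat: BeatSketch;
  voice?: VoiceSketch;
  stepfunVoiceId?: string;
  voiceDescription?: string;
  characterName: string;
};

function buildBeatAudioBody(
  serverProvider: ProviderSketch,
  speaker: SpeakerSketch,
  beat: BeatSketch,
): BeatAudioBodyOut {
  // Lightweight fields are sent on every path so the server can re-provision.
  const lightweight = {
    stepfunVoiceId: speaker.stepfunVoiceId,
    voiceDescription: speaker.voiceDescription,
    characterName: speaker.name,
  };
  return {
    beat: { id: beat.id, line: beat.line, lineDelivery: beat.lineDelivery },
    // Only a confirmed StepFun server drops the heavy `voice` payload;
    // xiaomi and unknown (null) keep it.
    ...(serverProvider === "stepfun"
      ? lightweight
      : { voice: speaker.voice, ...lightweight }),
  };
}
```

Treating `null` like the Xiaomi case matches the comment in the hunk: while the probe is pending, the client sends everything and lets the server normalize.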
@@ -19,9 +19,6 @@ const SHADOW =

 const DEFAULT_CHAR_MS = 28;
 const MIN_CHAR_MS = 30;
-// Voice playback speed multiplier. >1 speeds up the (somewhat slow) MiMo voice
-// while preserving pitch. Typewriter pacing is divided by the same factor.
-const SPEECH_RATE = 1.2;
 // If audio metadata never arrives within this window, give up waiting and
 // let the typewriter run at default speed.
 const AUDIO_WAIT_TIMEOUT_MS = 2500;
@@ -261,7 +258,6 @@ export function PlayCanvas({
    const el = audioRef.current;
    if (!el) return;
    el.muted = muted;
-    el.playbackRate = SPEECH_RATE;
    if (!muted && audioSrc && el.paused) {
      el.play().catch(() => {
        // autoplay blocked — silent until next interaction
@@ -272,11 +268,7 @@ export function PlayCanvas({
  function handleAudioMetadata() {
    const el = audioRef.current;
    if (!el) return;
-    el.playbackRate = SPEECH_RATE;
-    // Effective playback time is shorter once sped up — keep the typewriter in sync.
-    const ms = Number.isFinite(el.duration)
-      ? (el.duration * 1000) / SPEECH_RATE
-      : 0;
+    const ms = Number.isFinite(el.duration) ? el.duration * 1000 : 0;
    setAudioDurationMs(ms > 0 ? ms : 0);
    if (!muted) {
      el.play().catch(() => {
@@ -1,5 +1,10 @@
 import { chat, generateImage } from "@infiplot/ai-client";
-import { provisionVoice } from "@infiplot/tts-client";
+import {
+  isStepfun,
+  isValidStepfunVoiceId,
+  provisionVoice,
+  type ProvisionVoiceOptions,
+} from "@infiplot/tts-client";
 import type {
  Character,
  CharacterVoice,
@@ -9,7 +14,7 @@ import type {
 import { parseJsonLoose } from "../jsonParser";
 import { mockImageDataUri } from "../mockImage";
 import {
-  CHARACTER_DESIGNER_SYSTEM,
+  buildCharacterDesignerSystem,
  buildCharacterDesignerUserMessage,
  buildCharacterPortraitPrompt,
 } from "../prompts";
@@ -34,6 +39,10 @@ import {
 type CharacterDesignOutput = {
  visualDescription?: string;
  voiceDescription?: string;
+  /** Only present on the StepFun path (the system prompt asks for it when
+   *  stepfun:true). Hallucinated / out-of-catalog ids are dropped before
+   *  they reach provisioning, falling back to pickStepfunVoiceId. */
+  stepfunVoiceId?: string;
 };

 // TEMP: per-phase timing for latency diagnosis. Same convention as the
@@ -50,7 +59,7 @@ async function runDesignLLM(
  const raw = await chat(
    config.text,
    [
-      { role: "system", content: CHARACTER_DESIGNER_SYSTEM },
+      { role: "system", content: buildCharacterDesignerSystem({ stepfun: stepfunEnabled(config) }) },
      {
        role: "user",
        content: buildCharacterDesignerUserMessage(charName, session),
@@ -61,6 +70,13 @@ async function runDesignLLM(
  return parseJsonLoose<CharacterDesignOutput>(raw);
 }

+/** True when the server's TTS config points at StepFun (so the CharacterDesigner
+ *  should also pick a preset voice id). Returns false when TTS is off or on the
+ *  Xiaomi path — keeping the Xiaomi prompt byte-identical to history. */
+function stepfunEnabled(config: EngineConfig): boolean {
+  return !!config.tts && isStepfun(config.tts);
+}
+
 // Generate the per-character base portrait. The portrait is a "concept
 // sheet" — single character, neutral pose, plain background — so it works
 // well as a Runware referenceImages anchor for later scenes.
@@ -105,10 +121,11 @@ export async function provisionCharacterVoice(
  config: EngineConfig,
  voiceDescription: string,
  charName: string,
+  opts?: ProvisionVoiceOptions,
 ): Promise<CharacterVoice | undefined> {
  if (!config.tts) return undefined;
  try {
-    return await provisionVoice(config.tts, voiceDescription, charName);
+    return await provisionVoice(config.tts, voiceDescription, charName, opts);
  } catch (err) {
    const msg = err instanceof Error ? err.message : String(err);
    console.error(`[characterDesigner] voice provision failed for ${charName}: ${msg}`);
@@ -120,10 +137,18 @@ export async function provisionCharacterVoice(
 // call. The director then schedules renderCharacterPortrait /
 // provisionCharacterVoice around the Painter. Multiple new characters in the
 // same scene run this stage in parallel at the director level.
+//
+// On the StepFun path the same call ALSO yields stepfunVoiceId (the model
+// picks from the 32-preset catalog it sees in the system prompt). An invalid
+// pick is dropped here so the downstream provision falls back to the keyword
+// scorer — never trust an LLM-hallucinated id at the synth boundary.
 export type CharacterCard = {
  name: string;
  visualDescription?: string;
  voiceDescription: string;
+  /** Only set on the StepFun path AND only when the LLM picked a valid catalog
+   *  id. Threads through provisionCharacterVoice → stepfunProvision. */
+  stepfunVoiceId?: string;
 };

 export async function designCharacterCard(
@@ -135,12 +160,19 @@ export async function designCharacterCard(
  const design = await runDesignLLM(config, session, charName);
  tlog(`[charDesigner ${charName}] design LLM`, tDesign);

+  // Drop invalid catalog picks before they reach provision/synth. A hallucinated
+  // id would 4xx at synth time; better to fall back to pickStepfunVoiceId now.
+  const stepfunVoiceId = isValidStepfunVoiceId(design.stepfunVoiceId)
+    ? design.stepfunVoiceId
+    : undefined;
+
  return {
    name: charName,
    visualDescription: design.visualDescription?.trim() || undefined,
    voiceDescription:
      design.voiceDescription?.trim() ||
      `请根据角色名「${charName}」推断其性别、年龄与气质，生成最贴合的音色。所属世界观：${session.worldSetting}`,
+    stepfunVoiceId,
  };
 }
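Reviewer note: the "drop invalid picks" step depends on `isValidStepfunVoiceId`, which this diff imports but does not show. The sketch below illustrates the intended behavior under assumptions: the two ids are taken from the prompt text in this PR, the list is a stand-in for the 32-entry catalog, and exact case-sensitive matching is assumed because the prompt tells the model to copy ids verbatim.

```typescript
// Hypothetical stand-in for the 32-entry catalog.
const CATALOG_IDS_SKETCH: readonly string[] = ["lengyanyujie", "ruanmengnvsheng"];

// Type guard: accepts only strings that appear verbatim in the catalog.
function isValidVoiceIdSketch(id: unknown): id is string {
  return typeof id === "string" && CATALOG_IDS_SKETCH.indexOf(id) !== -1;
}

// An invalid or missing pick becomes undefined, so the downstream provision
// step can fall back to the keyword scorer.
function sanitizePick(pick: unknown): string | undefined {
  return isValidVoiceIdSketch(pick) ? pick : undefined;
}
```

Accepting `unknown` rather than `string | undefined` is deliberate in this sketch: the value comes from loosely parsed LLM JSON, so it may be a number, an object, or absent.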

@@ -305,12 +305,21 @@ export async function directScene(
  }

  // Kick off voice provisioning for every NEW char (never on the paint path).
+  // On the StepFun path, thread the LLM-selected stepfunVoiceId from the card
+  // into provision — it lets stepfunProvision honor the catalog pick instead
+  // of falling back to the keyword scorer (same network cost: still zero).
+  // ALSO persist it onto the Character so the client can echo it back on a
+  // StepFun server (where it skips the ~220KB voice payload) and the server
+  // resolveVoice honors the LLM pick at synth time instead of re-scoring.
  const voicePromises = cards.map((card) =>
-    provisionCharacterVoice(config, card.voiceDescription, card.name).then(
+    provisionCharacterVoice(config, card.voiceDescription, card.name, {
+      stepfunVoiceId: card.stepfunVoiceId,
+    }).then(
      (voice): Character => ({
        name: card.name,
        voiceDescription: card.voiceDescription,
        voice,
+        stepfunVoiceId: card.stepfunVoiceId,
      }),
    ),
  );
@@ -1,6 +1,7 @@
 import type {
  BeatAudioRequest,
  BeatAudioResponse,
+  CharacterVoice,
  EngineConfig,
  FreeformClassify,
  FreeformClassifyRequest,
@@ -17,6 +18,7 @@ import type {
 } from "@infiplot/types";
 import { coerceOrientation } from "@infiplot/types";
 import { chat } from "@infiplot/ai-client";
+import { isStepfun, isValidStepfunVoiceId, provisionVoice } from "@infiplot/tts-client";
 import { runArchitect } from "./agents/architect";
 import { selectStyle } from "./agents/styleSelector";
 import { directInsertBeat, directScene } from "./director";
@@ -241,11 +243,73 @@ export async function requestInsertBeat(
 //  timeout / failure / TTS disabled, so the client just plays silent.
 // ──────────────────────────────────────────────────────────────────────

+// Resolve a synth-ready voice for the request, normalizing provider
+// mismatches. The client usually sends a voice whose provider matches the
+// server's TTS (the common case). The mismatch case is mainly prebaked
+// homepage cards: they ship a Xiaomi voice baked at build time, but the
+// server may now run StepFun — so the client skips the ~220KB reference
+// audio (saving FOT) and sends stepfunVoiceId / voiceDescription instead.
+// We re-provision against the SERVER's provider so the right voice synth runs.
+// Returns undefined when there's nothing to synthesize from (caller plays
+// silent).
+async function resolveVoice(
+  config: EngineConfig,
+  req: BeatAudioRequest,
+): Promise<CharacterVoice | undefined> {
+  const serverStepfun = !!config.tts && isStepfun(config.tts);
+  const voiceProvider = req.voice?.provider;
+  const voiceMatchesServer =
+    (voiceProvider === "stepfun" && serverStepfun) ||
+    (voiceProvider === "xiaomi" && !serverStepfun);
+
+  // Fast path: the client sent a matching voice. (Also covers the legacy
+  // xiaomi card + xiaomi server case where the 220KB was unavoidable anyway.)
+  if (req.voice && voiceMatchesServer) {
+    return req.voice;
+  }
+
+  // Mismatch (or voice omitted). Re-provision against the server's provider.
+  if (!config.tts) return undefined;
+
+  // StepFun server: prefer an LLM-picked / prebaked id (zero-cost), else
+  // fall back to the keyword scorer over the voiceDescription.
+  if (serverStepfun) {
+    if (isValidStepfunVoiceId(req.stepfunVoiceId)) {
+      return provisionVoice(config.tts, req.voiceDescription ?? "", req.characterName, {
+        stepfunVoiceId: req.stepfunVoiceId,
+      });
+    }
+    if (req.voiceDescription) {
+      return provisionVoice(config.tts, req.voiceDescription, req.characterName);
+    }
+    return undefined;
+  }
+
+  // Xiaomi server but client sent a StepFun voice (or nothing). Re-design via
+  // voicedesign using the description; no description → can't synthesize.
+  //
+  // NOTE: this re-provision runs OUTSIDE synthesizeBeat's 15s withTimeout — a
+  // hung MiMo voicedesign tail (~30-70s) could hang /api/beat-audio until the
+  // platform timeout. Accepted because: (1) this path only fires on a rare
+  // cross-provider replay (.infiplot carrying a stepfun voice, opened on a
+  // Xiaomi-server deploy) or a mid-session provider flip — NOT the common
+  // prebaked-card + stepfun-server case, which is a pure-function provision
+  // with no network; (2) it degrades to silence rather than crashing. If it
+  // ever bites in practice, wrap resolve+synth in one withTimeout in voice.ts
+  // (requires threading an AbortSignal through provisionVoice → xiaomiProvision).
+  if (req.voiceDescription) {
+    return provisionVoice(config.tts, req.voiceDescription, req.characterName);
+  }
+  return undefined;
+}
+
 export async function requestBeatAudio(
  config: EngineConfig,
  req: BeatAudioRequest,
 ): Promise<BeatAudioResponse> {
  if (!config.tts) return { audio: null };
-  const audio = await synthesizeBeat(config.tts, req.voice, req.beat);
+  const voice = await resolveVoice(config, req);
+  if (!voice) return { audio: null };
+  const audio = await synthesizeBeat(config.tts, voice, req.beat);
  return { audio };
 }
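Reviewer note: `resolveVoice` plus the early return in `requestBeatAudio` amounts to a small decision table. The sketch below restates it as a synchronous planner that only names the chosen branch. `planVoiceResolution`, the `Resolution` labels, and the injected `isValidId` callback are hypothetical, and no provisioning or network call is modeled.

```typescript
type ServerProvider = "stepfun" | "xiaomi" | null;
type VoiceRequestSketch = {
  voice?: { provider: "xiaomi" | "stepfun" };
  stepfunVoiceId?: string;
  voiceDescription?: string;
};
type Resolution =
  | "use-sent-voice"
  | "provision-with-id"
  | "provision-from-description"
  | "silent";

function planVoiceResolution(
  server: ServerProvider,
  req: VoiceRequestSketch,
  isValidId: (id: string | undefined) => boolean,
): Resolution {
  // No TTS configured: requestBeatAudio returns silence before resolving.
  if (server === null) return "silent";
  // Fast path: the sent voice already matches the server's provider.
  if (req.voice && req.voice.provider === server) return "use-sent-voice";
  // StepFun server: a valid preset id wins over re-scoring the description.
  if (server === "stepfun" && isValidId(req.stepfunVoiceId)) return "provision-with-id";
  // Either provider: re-provision from the description, else nothing to do.
  return req.voiceDescription ? "provision-from-description" : "silent";
}
```

Only the last branch on a Xiaomi server involves a network round trip (voicedesign), which is the case the NOTE in the hunk flags as running outside the 15s timeout.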
@@ -7,6 +7,7 @@ import type {
  StoryState,
  WriterPlan,
 } from "@infiplot/types";
+import { formatStepfunCatalogForPrompt } from "@infiplot/tts-client";

 // ══════════════════════════════════════════════════════════════════════
 //  Multi-agent scene generation pipeline:
@@ -599,7 +600,14 @@ function collectPriorSceneKeys(session: Session): string[] {
 //  (e.g., gentle-looking character with energetic voice).
 // ──────────────────────────────────────────────────────────────────────

-export const CHARACTER_DESIGNER_SYSTEM = `你是视觉小说的「角色设定师」。给你一个**新登场角色的名字**，你要为这个角色同时设计两份卡片：
+// CHARACTER_DESIGNER_SYSTEM is split into a provider-agnostic CORE (visual +
+// voice-text rules) and a provider-specific TAIL (the JSON contract). When the
+// server runs StepFun, the tail additionally asks the model to pick a preset
+// voice id from the 32-entry catalog — so the SAME LLM call that designs the
+// character also selects its voice, at zero extra latency. When StepFun is
+// off (Xiaomi / no TTS), the tail is byte-identical to the historical prompt
+// (Xiaomi path is cache- and behavior-preserving).
+const CHARACTER_DESIGNER_SYSTEM_CORE = `你是视觉小说的「角色设定师」。给你一个**新登场角色的名字**，你要为这个角色同时设计两份卡片：
 1. **视觉设定卡（英文）**——给生图模型 FLUX 用，遵循 prompt engineering 风格
 2. **音色设定卡（中文）**——给小米 MiMo 配音设计用

@@ -652,7 +660,12 @@ export const CHARACTER_DESIGNER_SYSTEM = `你是视觉小说的「角色设定
 - 随后描述：年龄段（如「约17岁少女」「30 出头男性」）、音色质感、性格情绪基调、语速节奏、人设腔调、口音方言
 - 用中文，整段连续描述，不分段
 - 长度：50–80 个中文字为宜
- 例："女性，约17岁少女，音色清亮带点稚嫩甜美，性格开朗外向但容易害羞，语速偏快，标准普通话"
+- 例："女性，约17岁少女，音色清亮带点稚嫩甜美，性格开朗外向但容易害羞，语速偏快，标准普通话"`;
+
+// JSON-contract tail for the NON-stepfun path (Xiaomi voicedesign / no TTS).
+// Byte-identical to the historical prompt so the Xiaomi path keeps its cache
+// hit rate and voice quality unchanged.
+const CHARACTER_DESIGNER_TAIL_DEFAULT = `

 必须输出严格 JSON：
 {
@@ -662,6 +675,43 @@ export const CHARACTER_DESIGNER_SYSTEM = `你是视觉小说的「角色设定

 不要输出 JSON 以外的任何文本。`;

+// JSON-contract tail for the StepFun path. Same core output, plus the model
+// picks a preset voice id from the catalog. The id must match the SAME person
+// the voiceDescription describes (gender / age / vibe) — designed together so
+// appearance and voice stay coherent (the same invariant the CORE enforces).
+const CHARACTER_DESIGNER_TAIL_STEPFUN = `
+
+**StepFun 预设音色选择（必做）：**
+除 voiceDescription 外，你还必须从下列 StepFun 预设音色清单中，为本角色挑选一个与 voiceDescription 描绘的「同一个人」（性别 / 年龄段 / 气质都要一致）最贴合的预设，并把它的 id 填入 stepfunVoiceId。清单：
+${formatStepfunCatalogForPrompt()}
+
+挑选原则：
+- stepfunVoiceId 必须是上表里某个 id，原样复制（拼写、大小写、连字符都不能变）。
+- 必须与 voiceDescription 的性别一致（男声选 male 行，女声选 female 行）。
+- 年龄段尽量一致；拿不准时优先气质匹配（例如“冷艳御姐”选 lengyanyujie、“软萌萝莉”选 ruanmengnvsheng）。
+- 不允许编造清单外的 id，也不允许留空。
+
+必须输出严格 JSON：
+{
+  "visualDescription": "English visual card, comma-separated tags...",
+  "voiceDescription": "中文音色卡，以性别开头...",
+  "stepfunVoiceId": "清单内某个 id"
+}
+
+不要输出 JSON 以外的任何文本。`;
+
+/** Build the CharacterDesigner system prompt, provider-aware.
+ *  - stepfun:false → identical to the historical Xiaomi/no-TTS prompt.
+ *  - stepfun:true  → additionally asks the model to pick a StepFun preset
+ *    voice id from the 32-entry catalog (see formatStepfunCatalogForPrompt). */
+export function buildCharacterDesignerSystem(opts: {
+  stepfun: boolean;
+}): string {
+  return opts.stepfun
+    ? CHARACTER_DESIGNER_SYSTEM_CORE + CHARACTER_DESIGNER_TAIL_STEPFUN
+    : CHARACTER_DESIGNER_SYSTEM_CORE + CHARACTER_DESIGNER_TAIL_DEFAULT;
+}
+
 export function buildCharacterDesignerUserMessage(
  charName: string,
  session: Session,
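
Reviewer sketch (not part of the diff): one way the caller of the StepFun-path prompt might consume the model's JSON reply, dropping an id that is not in the catalog so that provisioning can fall back to the keyword scorer. `parseDesignerOutput` and the three-entry `CATALOG_IDS` set are hypothetical stand-ins; the real code would presumably call `isValidStepfunVoiceId` against the full 32-entry catalog.

```typescript
// Hypothetical subset of the catalog ids (the real list lives in stepfun-voices.json).
const CATALOG_IDS = new Set(["cixingnansheng", "lengyanyujie", "ruanmengnvsheng"]);

type DesignerOutput = {
  visualDescription: string;
  voiceDescription: string;
  stepfunVoiceId?: string;
};

// Parse the model's JSON reply and keep stepfunVoiceId only when it is a catalog id.
function parseDesignerOutput(raw: string): DesignerOutput {
  const parsed = JSON.parse(raw) as Record<string, unknown>;
  const id =
    typeof parsed.stepfunVoiceId === "string" ? parsed.stepfunVoiceId : undefined;
  return {
    visualDescription: String(parsed.visualDescription ?? ""),
    voiceDescription: String(parsed.voiceDescription ?? ""),
    // An out-of-catalog id is treated as absent rather than as an error.
    stepfunVoiceId: id !== undefined && CATALOG_IDS.has(id) ? id : undefined,
  };
}
```

Treating a bad id as absent (rather than throwing) matches the fallback described in the commit message, where `pickStepfunVoiceId` remains available when the LLM pick is unusable.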
@@ -22,6 +22,7 @@ import type {
  Session,
  StartRequest,
  StartResponse,
+  TtsProvider,
  VisionRequest,
  VisionResponse,
 } from "@infiplot/types";
@@ -60,6 +61,17 @@ async function postJson<T>(path: string, body: unknown): Promise<T> {
  return res.json() as Promise<T>;
 }

+// GET variant of postJson — same 401 → AuthRequiredError mapping. Used by
+// getTtsProvider (a tiny config probe, no body).
+async function getJson<T>(path: string): Promise<T> {
+  const res = await fetch(path, { method: "GET" });
+  if (!res.ok) {
+    if (res.status === 401) throw new AuthRequiredError();
+    throw new Error(`HTTP ${res.status}`);
+  }
+  return res.json() as Promise<T>;
+}
+
 // ── FOT reduction helpers (server-fallback path only) ─────────────────
 // The server-fallback POSTs send the whole Session over the wire. Voice
 // data is bulky (~160KB/character via referenceAudioBase64) and the
@@ -99,6 +111,29 @@ function mergeCharactersPreserveVoice(
 // Otherwise they fall back to the server-side API routes, which read
 // environment variables — useful for Vercel deploys that already supply keys.

+// Probe the server's TTS provider so fetchBeatAudio can shape its request body
+// (skip the ~220KB Xiaomi reference audio when the server runs StepFun).
+//
+// BYO precedence: when the browser has a client model config (BYO mode),
+// voice synthesis always runs locally against the user's own Xiaomi key, so
+// the server provider is irrelevant: return "xiaomi" immediately, without a
+// round-trip. Non-BYO → GET /api/tts-provider. A 401 propagates; any other
+// error degrades to null (the caller sends all voice fields, server normalizes).
+export async function getTtsProvider(): Promise<TtsProvider> {
+  if (getClientConfig()) return "xiaomi";
+  try {
+    const data = await getJson<{ provider: TtsProvider }>("/api/tts-provider");
+    return data.provider;
+  } catch (e) {
+    // AuthRequiredError (401) propagates so the caller's handleAuthError can
+    // surface the login modal; other errors (network, 5xx) → null = unknown,
+    // and fetchBeatAudio falls back to sending everything + server normalizes.
+    if (e instanceof AuthRequiredError) throw e;
+    console.warn("[getTtsProvider] probe failed, assuming unknown:", e);
+    return null;
+  }
+}
+
 export async function startSession(req: StartRequest): Promise<StartResponse> {
  const config = getClientConfig();
  if (config) {
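
Reviewer sketch (not part of the diff): how a caller such as `fetchBeatAudio` might use the probe result to shape the voice fields of a beat-audio request. `fetchBeatAudio` itself is not shown in this excerpt, so `shapeVoiceFields`, `Speaker`, and the simplified `CharacterVoice` shape below are assumptions made for illustration only.

```typescript
type TtsProvider = "stepfun" | "xiaomi" | null;

// Simplified stand-in for the real CharacterVoice type.
type CharacterVoice = {
  provider: "stepfun" | "xiaomi";
  voiceId?: string;
  referenceAudioBase64?: string;
};

type Speaker = {
  name: string;
  voiceDescription?: string;
  stepfunVoiceId?: string;
  voice?: CharacterVoice;
};

type VoiceFields = {
  voice?: CharacterVoice;
  voiceDescription?: string;
  characterName?: string;
  stepfunVoiceId?: string;
};

function shapeVoiceFields(provider: TtsProvider, speaker: Speaker): VoiceFields {
  const light: VoiceFields = {
    voiceDescription: speaker.voiceDescription,
    characterName: speaker.name,
    stepfunVoiceId: speaker.stepfunVoiceId,
  };
  // Server runs StepFun but the client holds a non-StepFun voice:
  // omit `voice` so the bulky reference audio is not uploaded.
  if (provider === "stepfun" && speaker.voice?.provider !== "stepfun") {
    return light;
  }
  // Matching provider or unknown probe result (null): send everything and
  // let the server normalize.
  return { ...light, voice: speaker.voice };
}
```

Sending the lightweight fields even when `voice` is included keeps the server able to re-provision if its provider differs from what the client assumed.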
@@ -1,15 +1,32 @@
-import type { CharacterVoice, TtsConfig } from "@infiplot/types";
-import { stepfunProvision, stepfunSynthesize } from "./stepfun";
+import type { CharacterVoice, TtsConfig, TtsProvider } from "@infiplot/types";
+import {
+  formatStepfunCatalogForPrompt,
+  isStepfun,
+  isValidStepfunVoiceId,
+  stepfunProvision,
+  type StepfunProvisionOptions,
+  stepfunSynthesize,
+} from "./stepfun";
 import { xiaomiProvision, xiaomiSynthesize } from "./xiaomi";

-// Provider auto-detection by base URL — mirrors the image client convention
-// of inferring Runware from *.runware.ai and falling back otherwise. Keeps
-// the BYO client flow unchanged: TTS_PROVIDER env var stays unused, and
-// browser-side keys (Xiaomi only today) keep working through the xiaomi path.
-function isStepfun(cfg: TtsConfig): boolean {
-  return /(^|[./])stepfun\.com\b/i.test(cfg.baseUrl);
+// Re-export so /api/tts-provider, orchestrator, CharacterDesigner prompt, and
+// the client all share ONE provider-detection rule + ONE catalog rendering +
+// ONE validity check with the synth path.
+export { isStepfun, isValidStepfunVoiceId, formatStepfunCatalogForPrompt };
+
+/** Map a configured TtsConfig to its provider tag. Single source of truth for
+ *  the inference rule (host contains stepfun.com → stepfun, else xiaomi) so
+ *  /api/tts-provider and resolveVoice can't drift when a third provider is
+ *  added. A PRESENT TtsConfig always maps to a concrete provider — `null`
+ *  (no TTS configured) is the caller's responsibility to handle separately. */
+export function inferTtsProvider(cfg: TtsConfig): Exclude<TtsProvider, null> {
+  return isStepfun(cfg) ? "stepfun" : "xiaomi";
 }

+// `opts.stepfunVoiceId` threads the CharacterDesigner's LLM-selected preset
+// down to stepfunProvision. Xiaomi ignores it. See StepfunProvisionOptions.
+export type ProvisionVoiceOptions = StepfunProvisionOptions;
+
 export async function provisionVoice(
  cfg: TtsConfig,
  description: string,
@@ -18,9 +35,10 @@ export async function provisionVoice(
  // clip per call regardless. Threading it through keeps the API uniform
  // and prevents archetype collisions on the StepFun path.
  salt?: string,
+  opts?: ProvisionVoiceOptions,
 ): Promise<CharacterVoice> {
  return isStepfun(cfg)
-    ? stepfunProvision(cfg, description, salt)
+    ? stepfunProvision(cfg, description, salt, opts)
    : xiaomiProvision(cfg, description);
 }
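
Reviewer sketch (not part of the diff): the commit message mentions a server-side `resolveVoice` that re-provisions on provider mismatch, but its body is not in this excerpt. The decision it has to make can be approximated as below; `planVoice` and its three outcomes are hypothetical, and the `"silent"` branch assumes the route answers with the `204` that the README describes for silent beats.

```typescript
type Provider = "stepfun" | "xiaomi";
type Voice = { provider: Provider };
type VoicePlan = "use" | "reprovision" | "silent";

// Decide what to do with the voice fields of an incoming beat-audio request.
function planVoice(
  server: Provider,
  voice: Voice | undefined,
  voiceDescription: string | undefined,
): VoicePlan {
  // The client's voice already belongs to the server's provider: use it as-is.
  if (voice && voice.provider === server) return "use";
  // Missing or off-provider voice, but a description is available: re-provision.
  if (voiceDescription) return "reprovision";
  // Nothing to synthesize from.
  return "silent";
}
```

On the StepFun path a re-provision costs no network call, since `stepfunProvision` only selects a preset id.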

@@ -0,0 +1,34 @@
+[
+  { "id": "cixingnansheng", "gender": "male", "age": "young", "tones": ["磁性", "成熟", "narrative"], "desc": "磁性成熟男声，沉稳有厚度，适合旁白/叙事/解说" },
+  { "id": "wenrounansheng", "gender": "male", "age": "young", "tones": ["温柔", "gentle", "supportive"], "desc": "温柔男声，暖系治愈，适合陪伴/安抚/暖男主" },
+  { "id": "wenrougongzi", "gender": "male", "age": "young", "tones": ["温柔", "公子", "tender"], "desc": "温柔公子型男声，清润书卷气，适合古风公子/儒雅青年" },
+  { "id": "yuanqinansheng", "gender": "male", "age": "teen", "tones": ["元气", "energetic", "阳光"], "desc": "元气阳光少年男声，明亮有活力，适合少年/阳光系男主" },
+  { "id": "zhengpaiqingnian", "gender": "male", "age": "young", "tones": ["正派", "正气", "earnest"], "desc": "正派正气青年男声，端庄坚定，适合正剧男主/英雄" },
+  { "id": "shuangkuainansheng", "gender": "male", "age": "young", "tones": ["爽快", "干脆", "brisk"], "desc": "爽快干脆男声，利落不拖沓，适合热血/爽文男主" },
+  { "id": "boyinnansheng", "gender": "male", "age": "middle", "tones": ["播音", "broadcast", "稳重"], "desc": "播音腔稳重男声，字正腔圆，适合新闻/旁白/中年男主" },
+  { "id": "ruyananshi", "gender": "male", "age": "middle", "tones": ["儒雅", "斯文", "refined"], "desc": "儒雅斯文中年男声，文气内敛，适合学者/师者/儒雅男性" },
+  { "id": "shenchennanyin", "gender": "male", "age": "middle", "tones": ["深沉", "低沉", "deep"], "desc": "深沉低沉男声，厚重磁性，适合成熟/权威/反派男主" },
+  { "id": "qingniandaxuesheng", "gender": "male", "age": "young", "tones": ["大学生", "青年", "student"], "desc": "大学生青年男声，自然清爽，适合校园男主/学生" },
+  { "id": "zixinnansheng", "gender": "male", "age": "young", "tones": ["自信", "confident"], "desc": "自信青年男声，有底气不张扬，适合精英/自信男主" },
+  { "id": "elegantgentle-female", "gender": "female", "age": "young", "tones": ["气质", "温婉", "professional"], "desc": "气质温婉女声，得体大方，适合职业女性/气质女主" },
+  { "id": "livelybreezy-female", "gender": "female", "age": "teen", "tones": ["活力", "轻快", "upbeat"], "desc": "活力轻快少女声，明快有节奏，适合元气少女" },
+  { "id": "jingdiannvsheng", "gender": "female", "age": "middle", "tones": ["经典", "classic", "成熟"], "desc": "经典成熟女声，圆润端庄，适合旁白/成熟女性" },
+  { "id": "wenroushunv", "gender": "female", "age": "middle", "tones": ["温柔", "熟女", "mature"], "desc": "温柔熟女声，成熟柔润，适合熟女/姐姐型角色" },
+  { "id": "tianmeinvsheng", "gender": "female", "age": "young", "tones": ["甜美", "sweet"], "desc": "甜美女声，甜润可爱，适合甜系女主/甜妹" },
+  { "id": "qingchunshaonv", "gender": "female", "age": "teen", "tones": ["清纯", "少女", "pure"], "desc": "清纯少女声，干净清澈，适合清纯少女/初恋感" },
+  { "id": "yuanqishaonv", "gender": "female", "age": "teen", "tones": ["元气", "少女", "活力", "energetic"], "desc": "元气活力少女声，明亮张扬，适合元气少女/活泼女主" },
+  { "id": "linjiajiejie", "gender": "female", "age": "young", "tones": ["邻家", "姐姐"], "desc": "邻家姐姐声，亲切自然，适合邻家姐姐/青梅竹马" },
+  { "id": "jilingshaonv", "gender": "female", "age": "teen", "tones": ["机灵", "灵动", "少女"], "desc": "机灵灵动少女声，俏皮跳脱，适合机灵少女/鬼马角色" },
+  { "id": "ruanmengnvsheng", "gender": "female", "age": "teen", "tones": ["软萌", "可爱", "稚嫩", "甜软"], "desc": "软萌可爱稚嫩女声，甜软奶气，适合萝莉/软萌角色" },
+  { "id": "youyanvsheng", "gender": "female", "age": "young", "tones": ["优雅", "elegant"], "desc": "优雅女声，从容矜持，适合优雅/淑女型角色" },
+  { "id": "lengyanyujie", "gender": "female", "age": "middle", "tones": ["冷艳", "御姐", "高冷"], "desc": "冷艳御姐声，高冷有气场，适合御姐/女王/高冷女主" },
+  { "id": "shuangkuaijiejie", "gender": "female", "age": "young", "tones": ["爽快", "姐姐", "干脆"], "desc": "爽快干脆姐姐声，利落飒爽，适合飒爽女主/大姐大" },
+  { "id": "wenjingxuejie", "gender": "female", "age": "young", "tones": ["文静", "学姐", "安静"], "desc": "文静学姐声，安静内敛，适合文静/学姐/内向女主" },
+  { "id": "linjiameimei", "gender": "female", "age": "teen", "tones": ["邻家", "妹妹"], "desc": "邻家妹妹声，稚气天真，适合妹妹型/天真少女" },
+  { "id": "zhixingjiejie", "gender": "female", "age": "young", "tones": ["知性", "姐姐", "聪慧"], "desc": "知性聪慧姐姐声，沉稳理性，适合知性女性/学姐" },
+  { "id": "ganliannvsheng", "gender": "female", "age": "middle", "tones": ["干练", "sharp", "professional"], "desc": "干练职业女声，利落专业，适合职场女性/女强人" },
+  { "id": "qinhenvsheng", "gender": "female", "age": "young", "tones": ["亲和", "warm", "亲切"], "desc": "亲和温暖女声，亲切易接近，适合亲和型/治愈系女主" },
+  { "id": "huolinvsheng", "gender": "female", "age": "young", "tones": ["活力", "lively", "活泼"], "desc": "活力活泼女声，热情外放，适合活泼女主/开朗角色" },
+  { "id": "qinqienvsheng", "gender": "female", "age": "middle", "tones": ["亲切", "温暖"], "desc": "亲切温暖中年女声，温厚母性，适合阿姨/母亲/温暖长辈" },
+  { "id": "wenrounvsheng", "gender": "female", "age": "young", "tones": ["温柔", "tender", "柔和"], "desc": "温柔柔和女声，轻柔不张扬，适合温柔女主/治愈系" }
+]
@@ -1,4 +1,28 @@
 import type { CharacterVoice, TtsConfig } from "@infiplot/types";
+import catalogData from "./stepfun-voices.json";
+
+// Preset voice record. The 32 presets live in stepfun-voices.json (the single
+// source of truth — shared with the CharacterDesigner prompt, /api/tts-provider
+// validity check, and the offline enrich script). gender/age are string-literal
+// unions so detectGender / detectAge scoring stays type-safe.
+export type PresetVoice = {
+  id: string;
+  gender: "male" | "female";
+  age: "teen" | "young" | "middle";
+  /** Keywords (Chinese or English) that, when present in the LLM's voice
+   *  description, boost this preset's score. Drawn from StepFun's published
+   *  voice name + recommended scenario. */
+  tones: string[];
+  /** Chinese persona phrase that tells the LLM (designer prompt / enrich script)
+   *  which character types the preset suits. pickStepfunVoiceId uses only tones. */
+  desc: string;
+};
+
+// JSON literals widen gender/age to `string`; cast back to the string-literal
+// unions. The catalog is a build-time-checked asset (touched rarely), and
+// pickStepfunVoiceId / isValidStepfunVoiceId tolerate anything we ship, so a
+// wrong entry surfaces as a bad voice pick rather than a crash.
+const PRESET_VOICES = catalogData as unknown as PresetVoice[];

 // StepFun TTS uses an OpenAI-compatible /v1/audio/speech endpoint with PRESET
 // voice IDs only — there is no "design a new voice from text description"
@@ -8,6 +32,14 @@ import type { CharacterVoice, TtsConfig } from "@infiplot/types";
 // top-N candidates so multiple similar characters don't collapse onto the
 // same voice. Provision is a pure function — no network call needed.

+/** Provider detection — shared by /api/tts-provider, orchestrator fallback,
+ *  and the client (via the route). StepFun is inferred from a *.stepfun.com
+ *  host in the base URL, matching lib/tts-client/index.ts. Exported so every
+ *  caller agrees on the same rule. */
+export function isStepfun(cfg: TtsConfig): boolean {
+  return /(^|[./])stepfun\.com\b/i.test(cfg.baseUrl);
+}
+
 function arrayBufferToBase64(buffer: ArrayBuffer): string {
  const bytes = new Uint8Array(buffer);
  let binary = "";
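
Reviewer sketch (not part of the diff): the detection regex applied to a few base URLs, to make its matching rule concrete. The URLs are illustrative values, and `inferProvider` is a local stand-in for `inferTtsProvider` that takes a plain string instead of a `TtsConfig`.

```typescript
// Same pattern as isStepfun in the hunk above.
const STEPFUN_HOST = /(^|[./])stepfun\.com\b/i;

const inferProvider = (baseUrl: string): "stepfun" | "xiaomi" =>
  STEPFUN_HOST.test(baseUrl) ? "stepfun" : "xiaomi";

// "stepfun.com" must start the string or follow "." or "/".
const samples: Array<[string, "stepfun" | "xiaomi"]> = [
  ["https://api.stepfun.com/v1", "stepfun"],
  ["https://STEPFUN.com", "stepfun"], // case-insensitive
  ["https://notstepfun.com/v1", "xiaomi"], // preceded by a letter, so no match
  ["https://tts.example.com/v1", "xiaomi"],
  // The pattern is not anchored to the host, so this also matches:
  ["https://api.stepfun.com.example.org/v1", "stepfun"],
];

for (const [url, expected] of samples) {
  console.log(url, inferProvider(url) === expected);
}
```

Because the base URL comes from server configuration rather than user input, the loose anchoring shown in the last sample is probably acceptable, but it is worth knowing if the rule is ever reused on untrusted values.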
@@ -21,53 +53,37 @@ function arrayBufferToBase64(buffer: ArrayBuffer): string {
 const OUTPUT_FORMAT = "mp3";
 const OUTPUT_MIME = "audio/mpeg";

-type PresetVoice = {
-  id: string;
-  gender: "male" | "female";
-  age: "teen" | "young" | "middle";
-  /** Keywords (中文 or English) that, when present in the LLM's voice
-   *  description, boost this preset's score. Drawn from StepFun's published
-   *  voice name + recommended scenario. */
-  tones: string[];
-};
-
 // Full catalog from StepFun's docs (32 presets across step-tts-mini /
-// step-tts-2 / stepaudio-2.5-tts). Adding more later is safe — the scorer
-// degrades gracefully when an unknown id is picked.
-const PRESET_VOICES: PresetVoice[] = [
-  { id: "cixingnansheng", gender: "male", age: "young", tones: ["磁性", "成熟", "narrative"] },
-  { id: "wenrounansheng", gender: "male", age: "young", tones: ["温柔", "gentle", "supportive"] },
-  { id: "wenrougongzi", gender: "male", age: "young", tones: ["温柔", "公子", "tender"] },
-  { id: "yuanqinansheng", gender: "male", age: "teen", tones: ["元气", "energetic", "阳光"] },
-  { id: "zhengpaiqingnian", gender: "male", age: "young", tones: ["正派", "正气", "earnest"] },
-  { id: "shuangkuainansheng", gender: "male", age: "young", tones: ["爽快", "干脆", "brisk"] },
-  { id: "boyinnansheng", gender: "male", age: "middle", tones: ["播音", "broadcast", "稳重"] },
-  { id: "ruyananshi", gender: "male", age: "middle", tones: ["儒雅", "斯文", "refined"] },
-  { id: "shenchennanyin", gender: "male", age: "middle", tones: ["深沉", "低沉", "deep"] },
-  { id: "qingniandaxuesheng", gender: "male", age: "young", tones: ["大学生", "青年", "student"] },
-  { id: "zixinnansheng", gender: "male", age: "young", tones: ["自信", "confident"] },
-  { id: "elegantgentle-female", gender: "female", age: "young", tones: ["气质", "温婉", "professional"] },
-  { id: "livelybreezy-female", gender: "female", age: "teen", tones: ["活力", "轻快", "upbeat"] },
-  { id: "jingdiannvsheng", gender: "female", age: "middle", tones: ["经典", "classic", "成熟"] },
-  { id: "wenroushunv", gender: "female", age: "middle", tones: ["温柔", "熟女", "mature"] },
-  { id: "tianmeinvsheng", gender: "female", age: "young", tones: ["甜美", "sweet"] },
-  { id: "qingchunshaonv", gender: "female", age: "teen", tones: ["清纯", "少女", "pure"] },
-  { id: "yuanqishaonv", gender: "female", age: "teen", tones: ["元气", "少女", "活力", "energetic"] },
-  { id: "linjiajiejie", gender: "female", age: "young", tones: ["邻家", "姐姐"] },
-  { id: "jilingshaonv", gender: "female", age: "teen", tones: ["机灵", "灵动", "少女"] },
-  { id: "ruanmengnvsheng", gender: "female", age: "teen", tones: ["软萌", "可爱", "稚嫩", "甜软"] },
-  { id: "youyanvsheng", gender: "female", age: "young", tones: ["优雅", "elegant"] },
-  { id: "lengyanyujie", gender: "female", age: "middle", tones: ["冷艳", "御姐", "高冷"] },
-  { id: "shuangkuaijiejie", gender: "female", age: "young", tones: ["爽快", "姐姐", "干脆"] },
-  { id: "wenjingxuejie", gender: "female", age: "young", tones: ["文静", "学姐", "安静"] },
-  { id: "linjiameimei", gender: "female", age: "teen", tones: ["邻家", "妹妹"] },
-  { id: "zhixingjiejie", gender: "female", age: "young", tones: ["知性", "姐姐", "聪慧"] },
-  { id: "ganliannvsheng", gender: "female", age: "middle", tones: ["干练", "sharp", "professional"] },
-  { id: "qinhenvsheng", gender: "female", age: "young", tones: ["亲和", "warm", "亲切"] },
-  { id: "huolinvsheng", gender: "female", age: "young", tones: ["活力", "lively", "活泼"] },
-  { id: "qinqienvsheng", gender: "female", age: "middle", tones: ["亲切", "温暖"] },
-  { id: "wenrounvsheng", gender: "female", age: "young", tones: ["温柔", "tender", "柔和"] },
-];
+// step-tts-2 / stepaudio-2.5-tts). The JSON is the single source of truth —
+// shared by the scorer here, the CharacterDesigner prompt (via
+// formatStepfunCatalogForPrompt), the /api/tts-provider route's validity
+// check, and the offline enrich script. Adding more later is safe — the
+// scorer degrades gracefully when an unknown id is picked.
+// (catalogData is cast to PresetVoice[] at the import above; kept as
+// PRESET_VOICES so existing references stay unchanged.)
+
+/** All valid preset voice ids — for validation by the CharacterDesigner
+ *  (discard an out-of-catalog LLM pick) and the enrich script. */
+export const STEPFUN_PRESET_VOICE_IDS: string[] = PRESET_VOICES.map(
+  (v) => v.id,
+);
+
+const STEPFUN_ID_SET = new Set(STEPFUN_PRESET_VOICE_IDS);
+
+/** True iff `id` is one of the 32 catalog presets. Used to drop LLM-hallucinated
+ *  ids before they reach StepFun (which would otherwise 4xx on synth). */
+export function isValidStepfunVoiceId(id: string | null | undefined): boolean {
+  return !!id && STEPFUN_ID_SET.has(id);
+}
+
+/** Render the catalog as a Chinese, prompt-friendly list, one line per preset,
+ *  so the CharacterDesigner and the enrich script can ask the LLM to pick a
+ *  matching voice id. Each line: `- id：desc（gender/age）`. */
+export function formatStepfunCatalogForPrompt(): string {
+  return PRESET_VOICES.map(
+    (v) => `- ${v.id}：${v.desc}（${v.gender}/${v.age}）`,
+  ).join("\n");
+}

 // Cheap deterministic 32-bit hash — used only to spread similar descriptions
 // across the top-N candidate voices so two "温柔女声" characters don't collide.
@@ -139,12 +155,28 @@ export function pickStepfunVoiceId(description: string, salt = ""): string {
 // We mirror xiaomiProvision's async signature so the router stays uniform.
 // The optional `salt` (character name) spreads two characters that share
 // archetype keywords across the top-N candidate presets.
+//
+// `opts.stepfunVoiceId` — when the CharacterDesigner already picked a preset
+// (it sees the same catalog via formatStepfunCatalogForPrompt), honor it if
+// valid; otherwise fall back to the keyword scorer. This keeps StepFun
+// provisioning a pure function (zero network cost) while lifting voice-id
+// selection quality to LLM-grade on the live path.
+export type StepfunProvisionOptions = {
+  /** LLM-selected preset id from the CharacterDesigner; validated against the
+   *  catalog and ignored when not found there (hallucination guard). */
+  stepfunVoiceId?: string;
+};
+
 export async function stepfunProvision(
  cfg: TtsConfig,
  description: string,
  salt?: string,
+  opts?: StepfunProvisionOptions,
 ): Promise<CharacterVoice> {
-  const voiceId = pickStepfunVoiceId(description, salt);
+  const voiceId =
+    opts && isValidStepfunVoiceId(opts.stepfunVoiceId)
+      ? opts.stepfunVoiceId!
+      : pickStepfunVoiceId(description, salt);
  return {
    provider: "stepfun",
    voiceId,
@@ -208,6 +208,13 @@ export type Character = {
  basePortraitUrl?: string;
  /** Xiaomi MiMo voice reference audio. */
  voice?: CharacterVoice;
+  /** StepFun preset voice id (e.g. "cixingnansheng"). Only present on
+   *  characters designed while the server ran StepFun, OR on prebaked
+   *  homepage cards enriched with a StepFun voice id. Lets the client send a
+   *  lightweight beat-audio request (no ~220KB Xiaomi reference audio) when the
+   *  server runs StepFun, and lets the server normalize an off-provider voice
+   *  without a fresh provision. Validated against the catalog at synth time. */
+  stepfunVoiceId?: string;
 };

 /** A single beat's synthesized audio, attached to the response. */
@@ -359,6 +366,22 @@ export type TtsConfig = {
  speechModel: string;
 };

+/** Which TTS provider the server is configured for (inferred from TtsConfig's
+ *  base URL by lib/tts-client's isStepfun). Exposed to the client via the
+ *  /api/tts-provider route so the play page can send only the voice fields
+ *  the server actually needs — e.g. skip the ~220KB Xiaomi reference audio
+ *  when the server runs StepFun (saving Fast Origin Transfer bandwidth).
+ *  `null` means no server-side TTS (silent). BYO client TTS takes precedence
+ *  over this signal. */
+export type TtsProvider = "stepfun" | "xiaomi" | null;
+
+// /api/tts-provider — lightweight GET returning the server's TTS provider so
+// the client can shape beat-audio request bodies accordingly (see fetchBeatAudio
+// in app/play/page.tsx). Response is a few dozen bytes; runs once per session.
+export type TtsProviderResponse = {
+  provider: TtsProvider;
+};
+
 export type EngineConfig = {
  text: ProviderConfig;
  image: ProviderConfig;
@@ -461,7 +484,23 @@ export type BeatAudioRequest = {
    line: string;
    lineDelivery?: string;
  };
-  voice: CharacterVoice;
+  /** The speaker's already-provisioned voice. Optional now — when the server
+   *  runs a DIFFERENT provider than `voice.provider` (e.g. the client holds a
+   *  Xiaomi voice from a prebaked card but the server runs StepFun), the
+   *  client may omit `voice` and send `voiceDescription` + `stepfunVoiceId`
+   *  instead to save the ~220KB reference-audio transfer. The server then
+   *  re-provisions against its own provider before synthesizing. */
+  voice?: CharacterVoice;
+  /** Voice-design card (in Chinese). Used by the server to re-provision when
+   *  `voice` is absent or its provider doesn't match the server's TTS. */
+  voiceDescription?: string;
+  /** Speaker name — used as the StepFun provision salt for archetype spreading
+   *  when the server falls back to pickStepfunVoiceId. */
+  characterName?: string;
+  /** Pre-selected StepFun preset id (from a live CharacterDesigner pick or a
+   *  prebaked card). Honored directly when the server runs StepFun, skipping
+   *  both the keyword scorer and a network provision. */
+  stepfunVoiceId?: string;
 };

 export type BeatAudioResponse = {
@@ -15,6 +15,7 @@
    "start": "next start",
    "lint": "next lint",
    "typecheck": "tsc --noEmit",
+    "enrich:firstacts": "node scripts/enrich-firstacts-stepfun.mjs",
    "build:cf": "opennextjs-cloudflare build",
    "preview:cf": "opennextjs-cloudflare preview",
    "deploy:cf": "opennextjs-cloudflare deploy"