feat(engine): multi-agent character consistency pipeline (#6)

* feat(types): Character.voiceDescription rename + visual fields + Scene.sceneKey

Prepares the type surface for the multi-agent scene pipeline:
- Character.description → voiceDescription (clearer pairing with the new
  visualDescription)
- Character gains visualDescription (English appearance card for Painter)
  + basePortraitBase64 + basePortraitUuid (for Runware referenceImages
  reuse)
- Scene gains sceneKey (English slug for cross-scene img2img continuity)
  + imageUuid (Runware UUID of the scene's rendered image, for cheap
  seedImage reuse on subsequent same-sceneKey calls)
- Beat gains activeCharacters[] so the Cinematographer can read which
  characters are on-screen, and their poses, when composing the
  establishing shot

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ai-client): generateImage img2img + multi-reference options + uploadImage

Extends the Runware adapter to support the two anchoring mechanisms
FLUX.2 [klein] 9B KV needs for character + scene visual consistency:
- generateImage gains optional { seedImage, referenceImages, strength }:
  seedImage drives img2img (single starting image, sceneKey continuity),
  referenceImages drives multi-reference anchoring (up to 4 character
  portraits, capped per Runware spec). Default strength is 0.85, since
  FLUX ignores strength < 0.8.
- uploadImage POSTs a base64 image to Runware's imageUpload taskType and
  returns the UUID, so portraits/scene snapshots can be referenced by UUID
  on subsequent calls instead of resending base64 every scene.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(engine): multi-agent scene pipeline (Writer→CharDesigner+Cinematographer→Painter)

Replaces the single-LLM directScene with a four-agent pipeline that
specializes each concern and parallelizes the slow parts.
Adopts the core idea from #4 (multi-agent dispatch + character visual
consistency) and grafts it onto the Scene/Beat architecture introduced
in #2.

Pipeline per Scene (~9-12s critical path with parallelization):

    Writer LLM (serial, ~3s)
      │  outputs: sceneSummary + sceneKey + beats[] (each beat carries
      │           activeCharacters[] with poses)
      │
      ├─ CharacterDesigner LLM × N new chars (parallel)
      │    │  outputs: { visualDescription (English appearance card),
      │    │             voiceDescription (Chinese voice card) }
      │    ├─ FLUX portrait gen → upload → UUID (parallel within agent)
      │    └─ Xiaomi MiMo voicedesign provision (parallel within agent)
      │
      └─ Cinematographer LLM (parallel with CharacterDesigner)
           outputs: { shotType, integratedPrompt (English composition +
                      camera position + character blocking) }

    Painter (FLUX img2img + referenceImages, ~1-3s)
      inputs: integratedPrompt + onStageCharacters' archetype block
              + (optional) prior sceneKey-hit scene as seedImage
              + (optional) character portrait UUIDs as referenceImages
      fallback chain: A) both anchors → B) refs only (preserves characters)
                      → C) seed only (preserves background) → D) pure t2i
      output uploaded → Scene.imageUuid for the next sceneKey hop

Why this carving:
- Writer focuses purely on narrative (drops the voice-design duty that
  staging's DIRECTOR_SYSTEM was carrying as a side concern).
- CharacterDesigner bundles visual + voice so the agent that thinks "who
  is this character" produces internally-consistent appearance + vocal
  personality (split agents tend to diverge).
- Cinematographer doesn't need character visualDescriptions (Painter
  appends archetypes afterwards), so it parallelizes with
  CharacterDesigner.
- sceneKey enables cross-scene backdrop continuity that Scene/Beat doesn't
  cover (Scene/Beat only reuses the backdrop WITHIN a scene's beats;
  sceneKey reuses it across scenes that share a location).

Other changes:
- voice.ts loses provisionVoicesForScene (moved into CharacterDesigner);
  keeps synthesizeBeat for the lazy per-beat /api/beat-audio path.
- renderer.ts deleted (replaced by agents/painter.ts).
- directInsertBeat (vision-driven in-scene exploration) stays single-LLM:
  it forbids new characters and produces no image, so multi-agent doesn't
  apply.

apps/web is unchanged: orchestrator.ts keeps the same exports
(startSession / requestScene / visionDecide / requestInsertBeat /
requestBeatAudio) with identical request/response shapes.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): Pattern B player POV + JSON repair + drop seedImage tier

Three hotfixes surfaced by manual end-to-end testing of the multi-agent
pipeline.

F1: Player viewpoint (galgame Pattern B)
- Writer accepts speaker="你" for player dialog (renders in the dialog box,
  never TTS'd because no Character record exists for "你"). POV variants
  (玩家/我/主角/protagonist/player/I/me/...) are filtered from
  activeCharacters so CharacterDesigner never wastes API calls on the
  player. Two-layer defense: explicit prompt rule in WRITER_SYSTEM + code
  normalization (POV_VARIANTS set, isPovName, normalizeSpeakerName).
- Cinematographer and Painter prompts gain a "player never in frame" rule
  so the player never appears in any rendered scene.
- Cinematographer gains a dynamic camera policy driven by the entry beat's
  speaker: NPC-speaker → close-up looking toward camera; "你"-speaker →
  medium shot of attentive NPC; no speaker → wide establishing shot.
- director.ts filters POV from orphanSpeakers so provisionVoiceForName
  never fires for "你".

F2: JSON parsing robustness
- parseJsonLoose gains a 4th repair tier: strip JS-style comments, strip
  trailing commas, insert missing commas between adjacent objects /
  arrays / quoted values. Logs the first 800 chars of raw LLM output when
  all repair attempts fail, so we can see what the model emitted.

F3: Drop seedImage, use referenceImages for prior scene
- FLUX.2 [klein] 9B KV does not support seedImage (img2img). Removed
  Tier A (seedImage+refs) and Tier C (seedImage only) from the Painter
  degradation chain.
  New layout: the prior scene's image slots into referenceImages[0] for
  spatial continuity, and character portraits fill slots 1-3 (Runware caps
  at 4 total). Cinematographer is instructed to emphasize continuity when
  sceneKey matches a prior scene.

All five package typechecks pass.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): address Copilot review feedback on #6

Three targeted fixes from the PR #6 Copilot review.

F4: Stale seedImage/img2img docstrings

Four locations still referenced the original img2img design after F3
switched to referenceImages-based spatial continuity:
- types/index.ts:57 Scene.sceneKey docstring
- types/index.ts:63 Scene.imageUuid docstring
- director.ts:34 pipeline diagram in module block comment
- director.ts:128 directScene JSDoc

Doc-only changes; misleading wording corrected to mention
referenceImages. (The design-rationale comment in pickPriorSceneReference
is kept: it explains WHY we don't use seedImage and is load-bearing
context.)

F5: Remove JS-comment stripping from the JSON repair pass

parseJsonLoose's repair tier previously stripped `// ...` and `/* ... */`
across the entire text, which would corrupt JSON string values containing
URLs (e.g. "https://example.com" → "https:"). Since LLMs in
`responseFormat: "json_object"` mode essentially never emit comments,
dropping the comment-stripping step is a net win for safety.
Trailing-comma and missing-comma repair (the high-frequency failures) are
kept.

F6: Pattern B parity on the insert-beat path

Previously: directInsertBeat's INSERT_BEAT_SYSTEM forbade any speaker not
in session.characters, and the orchestrator's unregistered-speaker guard
demoted such lines to narration. This meant the player could not speak via
speaker="你" in transient in-scene beats, which was inconsistent with the
Writer path.
Fix:
- INSERT_BEAT_SYSTEM prompt now allows speaker="你" (NPC name OR "你") and
  rejects other POV variants
- directInsertBeat applies normalizeSpeakerName to the LLM output, same as
  the Writer path, so POV variants collapse to "你"
- lineDelivery is dropped when speaker="你" (no TTS for player)
- orchestrator's unregistered-speaker guard adds a `speaker !== "你"`
  exception so Pattern B player dialog passes through

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(engine): drop "JS-style comments" from parseJsonLoose header

The function header listed JS-style comments as a step-4 repair, but F5
already removed comment stripping from `repairJsonString` because the
regex would corrupt URLs inside JSON string values. The inner function's
comment was updated then; this header was missed.

Doc-only sync from the second-round Copilot review on #6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: QiChen88 <2291969160@qq.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
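
Illustration of the F5 trade-off. This is a minimal sketch and not the code in jsonParser.ts: `stripJsComments`, `repairCommas`, and `parseWithRepair` are hypothetical names, and the regexes are assumptions about what such a repair tier might look like. It shows how a comment stripper that is not string-aware truncates a URL, and how the two repairs that were kept can be applied only after strict parsing fails.

```typescript
// Hypothetical sketch; NOT the actual parseJsonLoose / repairJsonString.

// The kind of step F5 removed: strips `// ...` and `/* ... */` anywhere in
// the text, including inside JSON string values.
function stripJsComments(text: string): string {
  return text.replace(/\/\/.*$/gm, "").replace(/\/\*[\s\S]*?\*\//g, "");
}

// The kind of steps F5 kept (assumed regexes): drop trailing commas, then
// insert a comma between adjacent objects / arrays / quoted values that are
// separated only by whitespace. These regexes are not string-aware either,
// which is why the sketch only uses them as a fallback.
function repairCommas(text: string): string {
  return text
    .replace(/,(\s*[}\]])/g, "$1")
    .replace(/([}\]"])(\s+)(["{\[])/g, "$1,$2$3");
}

// Try strict JSON first; only touch the text when that fails.
function parseWithRepair<T>(raw: string): T {
  try {
    return JSON.parse(raw) as T;
  } catch {
    return JSON.parse(repairCommas(raw)) as T;
  }
}

// The URL is cut at "//", matching the corruption described in F5.
console.log(stripJsComments('{"url": "https://example.com"}'));

// Trailing commas and a missing comma between objects are both repaired.
console.log(parseWithRepair('{"a": 1, "b": [1, 2,],}'));
console.log(parseWithRepair('[{"n": 1} {"n": 2}]'));
```

Running the repair only as a fallback means well-formed model output is never rewritten, so the remaining regex blind spots can only affect input that already failed strict parsing.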
2026-05-29 13:30:24 +08:00
parent e261f4a346
commit def1b25bd9
14 changed files with 1733 additions and 562 deletions
@@ -1,309 +1,294 @@
-import { chat } from "@yume/ai-client";
+import { chat, uploadImage } from "@yume/ai-client";
 import type {
-  Beat,
-  BeatChoice,
-  BeatChoiceEffect,
-  BeatNext,
  Character,
+  EngineConfig,
  InsertBeatPartial,
  ProviderConfig,
  Scene,
  Session,
 } from "@yume/types";
-import { parseJsonLoose } from "./jsonParser";
+import { designCharacter, provisionVoiceForName } from "./agents/characterDesigner";
+import { runCinematographer } from "./agents/cinematographer";
+import { runPainter } from "./agents/painter";
 import {
-  DIRECTOR_SYSTEM,
-  INSERT_BEAT_SYSTEM,
-  buildDirectorUserMessage,
-  buildInsertBeatUserMessage,
-} from "./prompts";
+  collectActiveCharacterNames,
+  isPovName,
+  normalizeSpeakerName,
+  POV_DISPLAY_NAME,
+  runWriter,
+} from "./agents/writer";
+import { parseJsonLoose } from "./jsonParser";
+import { INSERT_BEAT_SYSTEM, buildInsertBeatUserMessage } from "./prompts";

-// ──────────────────────────────────────────────────────────────────────
-//  Raw shape produced by the model — we coerce + validate into a Scene.
-// ──────────────────────────────────────────────────────────────────────
-
-type RawEffect = {
-  kind?: string;
-  targetBeatId?: string;
-  nextSceneSeed?: string;
-};
-
-type RawChoice = {
-  id?: string;
-  label?: string;
-  effect?: RawEffect;
-};
-
-type RawNext = {
-  type?: string;
-  nextBeatId?: string;
-  choices?: RawChoice[];
-};
-
-type RawBeat = {
-  id?: string;
-  narration?: string;
-  speaker?: string;
-  line?: string;
-  lineDelivery?: string;
-  next?: RawNext;
-};
-
-type RawCharacterUpdate = {
-  name?: string;
-  description?: string;
-};
-
-type RawScene = {
-  scenePrompt?: string;
-  entryBeatId?: string;
-  beats?: RawBeat[];
-  characterUpdates?: RawCharacterUpdate[];
-};
-
-function coerceEffect(raw: RawEffect | undefined): BeatChoiceEffect {
-  if (raw?.kind === "advance-beat" && raw.targetBeatId?.trim()) {
-    return { kind: "advance-beat", targetBeatId: raw.targetBeatId.trim() };
-  }
-  return {
-    kind: "change-scene",
-    nextSceneSeed: raw?.nextSceneSeed?.trim() || "未指定",
-  };
-}
-
-function coerceChoice(raw: RawChoice, idx: number): BeatChoice {
-  return {
-    id: raw.id?.trim() || `c${idx + 1}`,
-    label: raw.label?.trim() || `选项 ${idx + 1}`,
-    effect: coerceEffect(raw.effect),
-  };
-}
-
-function coerceNext(raw: RawNext | undefined, fallbackBeatId: string): BeatNext {
-  if (raw?.type === "choice" && Array.isArray(raw.choices) && raw.choices.length) {
-    return {
-      type: "choice",
-      choices: raw.choices.map((c, i) => coerceChoice(c, i)),
-    };
-  }
-  return {
-    type: "continue",
-    nextBeatId: raw?.nextBeatId?.trim() || fallbackBeatId,
-  };
-}
-
-function coerceBeat(raw: RawBeat, idx: number, totalBeats: number): Beat {
-  const id = raw.id?.trim() || `b${idx + 1}`;
-  // Non-last beats default their `continue` target to the following beat.
-  // The last beat gets an empty fallback on purpose: repairBeats() turns a
-  // last/dangling continue into a real scene-change exit so the player can
-  // never get stuck self-looping on it.
-  const fallback = idx + 1 < totalBeats ? `b${idx + 2}` : "";
-  const line = raw.line?.trim() || undefined;
-  return {
-    id,
-    narration: raw.narration?.trim() || undefined,
-    speaker: raw.speaker?.trim() || undefined,
-    line,
-    // lineDelivery only meaningful when there is a line to deliver.
-    lineDelivery: line ? raw.lineDelivery?.trim() || undefined : undefined,
-    next: coerceNext(raw.next, fallback),
-  };
-}
-
-function coerceCharacterUpdates(raw: RawCharacterUpdate[] | undefined): Character[] {
-  if (!Array.isArray(raw)) return [];
-  return raw
-    .map((c) => ({
-      name: c.name?.trim() ?? "",
-      description: c.description?.trim() ?? "",
-    }))
-    .filter((c) => c.name && c.description);
-}
-
-const FALLBACK_SEED = "故事继续推进";
-
-function fallbackExitChoice(beatId: string): BeatChoice {
-  return {
-    id: `${beatId}__exit`,
-    label: "继续",
-    effect: { kind: "change-scene", nextSceneSeed: FALLBACK_SEED },
-  };
-}
-
-// Beat ids are graph keys (the front-end's `beats.find(b => b.id === ...)`,
-// the session's `visitedBeatIds`, and `continue`/`advance-beat` targets). If
-// the model reuses an id across beats, the second occurrence becomes silently
-// unreachable and external references collapse to the first beat. Rename
-// duplicates; rewrite the renamed beat's OWN self-references (the most
-// natural interpretation of a duplicate id being referenced from inside that
-// same beat). External references stay pointing at the first occurrence.
-function ensureUniqueBeatIds(beats: Beat[]): Beat[] {
-  const seen = new Set<string>();
-  return beats.map((b): Beat => {
-    if (!seen.has(b.id)) {
-      seen.add(b.id);
-      return b;
-    }
-    const oldId = b.id;
-    let n = 2;
-    while (seen.has(`${oldId}_${n}`)) n += 1;
-    const newId = `${oldId}_${n}`;
-    seen.add(newId);
-
-    let next = b.next;
-    if (next.type === "continue" && next.nextBeatId === oldId) {
-      next = { type: "continue", nextBeatId: newId };
-    } else if (next.type === "choice") {
-      next = {
-        type: "choice",
-        choices: next.choices.map((c) =>
-          c.effect.kind === "advance-beat" && c.effect.targetBeatId === oldId
-            ? {
-                ...c,
-                effect: { kind: "advance-beat" as const, targetBeatId: newId },
-              }
-            : c,
-        ),
-      };
-    }
-    return { ...b, id: newId, next };
-  });
-}
-
-// Repairs referential integrity AND guarantees the scene is escapable:
-// - a `continue` to a missing/self id is repointed to the next beat in order;
-//   a last/dangling continue with nowhere to go becomes a scene-change exit
-//   (never a self-loop, which would strand the player on "click to advance")
-// - an `advance-beat` to a missing id is downgraded to a scene change
-// - if no change-scene exit exists anywhere, one is appended to the last beat
-function repairBeats(beats: Beat[]): Beat[] {
-  const ids = new Set(beats.map((b) => b.id));
-
-  const fixed: Beat[] = beats.map((b, idx): Beat => {
-    if (b.next.type === "continue") {
-      const target = b.next.nextBeatId;
-      if (ids.has(target) && target !== b.id) return b;
-      const nextByIndex = beats[idx + 1]?.id;
-      if (nextByIndex) {
-        return { ...b, next: { type: "continue", nextBeatId: nextByIndex } };
-      }
-      return { ...b, next: { type: "choice", choices: [fallbackExitChoice(b.id)] } };
-    }
-
-    const patched = b.next.choices.map((c) =>
-      c.effect.kind === "advance-beat" && !ids.has(c.effect.targetBeatId)
-        ? {
-            ...c,
-            effect: {
-              kind: "change-scene" as const,
-              nextSceneSeed: "未指定（导演引用不存在的 beat，已降级为换场）",
-            },
-          }
-        : c,
-    );
-    return { ...b, next: { type: "choice", choices: patched } };
-  });
-
-  const hasExit = fixed.some(
-    (b) =>
-      b.next.type === "choice" &&
-      b.next.choices.some((c) => c.effect.kind === "change-scene"),
-  );
-  if (!hasExit && fixed.length > 0) {
-    const lastIdx = fixed.length - 1;
-    const last = fixed[lastIdx]!;
-    const existing = last.next.type === "choice" ? last.next.choices : [];
-    fixed[lastIdx] = {
-      ...last,
-      next: { type: "choice", choices: [...existing, fallbackExitChoice(last.id)] },
-    };
-  }
-
-  return fixed;
-}
-
-// Choice ids are the keys the front-end uses to cache and consume prefetched
-// scenes. Two beats both defaulting to c1/c2 (or the model reusing ids across
-// beats) would make a transition reuse the WRONG prefetched scene — so force
-// every choice id to be unique within the scene.
-function ensureUniqueChoiceIds(beats: Beat[]): Beat[] {
-  const seen = new Set<string>();
-  for (const b of beats) {
-    if (b.next.type !== "choice") continue;
-    for (const c of b.next.choices) {
-      if (seen.has(c.id)) {
-        let n = 2;
-        while (seen.has(`${c.id}_${n}`)) n += 1;
-        c.id = `${c.id}_${n}`;
-      }
-      seen.add(c.id);
-    }
-  }
-  return beats;
-}
+// ══════════════════════════════════════════════════════════════════════
+//  director.ts — multi-agent orchestrator for one full Scene generation.
+//
+//  Critical path (per Scene call):
+//
+//    Writer LLM (~3s, serial)
+//      │
+//      ├─ CharacterDesigner LLM × N    (parallel per new char)
+//      │     │
+//      │     ├─ portrait gen + upload  (parallel within agent)
+//      │     └─ voice provisioning     (parallel within agent)
+//      │
+//      ├─ Cinematographer LLM          (parallel with all of the above)
+//      │
+//      └─ wait for all parallel branches
+//      │
+//      ▼
+//    Painter (FLUX referenceImages — two-tier degradation chain)
+//      │
+//      ▼
+//    upload final scene image → Scene.imageUuid
+//      │
+//      ▼
+//    return { scene, sceneImageBase64, characters }
+//
+//  The Cinematographer intentionally does NOT depend on CharacterDesigner
+//  output — it only positions named characters in the frame, not their
+//  appearance. This unlocks the parallelism that makes the full pipeline
+//  ~9-12s instead of ~15-18s serial.
+// ══════════════════════════════════════════════════════════════════════

 function newSceneId(): string {
  return `scene_${Date.now()}_${Math.random().toString(36).slice(2, 6)}`;
 }

-// ──────────────────────────────────────────────────────────────────────
-//  directScene — generates one Scene (multi-beat) for the player.
-//  Called both on real scene transitions AND on speculative prefetch.
-// ──────────────────────────────────────────────────────────────────────
+function tlog(label: string, t0: number): void {
+  console.log(`${label}: ${Date.now() - t0}ms`);
+}
+
+// Merge a freshly-designed Character into a registry, preserving any
+// previously-set voice/portrait that the new design didn't fill in (so
+// re-designing a known character can't silently drop their voice or wipe
+// out an already-generated portrait UUID). Match by name.
+export function mergeCharacters(
+  existing: Character[],
+  updates: Character[],
+): Character[] {
+  if (updates.length === 0) return existing;
+  const byName = new Map(existing.map((c) => [c.name, c]));
+  for (const u of updates) {
+    const prev = byName.get(u.name);
+    if (!prev) {
+      byName.set(u.name, u);
+      continue;
+    }
+    // Preserve any prior provisioned resource that the new design omitted.
+    byName.set(u.name, {
+      ...u,
+      voice: u.voice ?? prev.voice,
+      visualDescription: u.visualDescription ?? prev.visualDescription,
+      basePortraitBase64: u.basePortraitBase64 ?? prev.basePortraitBase64,
+      basePortraitUuid: u.basePortraitUuid ?? prev.basePortraitUuid,
+      voiceDescription: u.voiceDescription || prev.voiceDescription,
+    });
+  }
+  return Array.from(byName.values());
+}
+
+// Pick a reference to the prior scene image when sceneKey matches a prior
+// scene — used by the Painter as one of the `referenceImages` (NOT as a
+// seedImage, because FLUX.2 [klein] 9B KV does not support seedImage).
+//
+// Returns the UUID if available (cheap reference, ~36 chars over the wire),
+// else the base64 of the most recent matching scene's image. Returns
+// undefined when no prior scene shares the current sceneKey.
+function pickPriorSceneReference(
+  session: Session,
+  currentSceneKey: string | undefined,
+  priorImageBase64ByUuid: Map<string, string>,
+): { priorSceneReference?: string; priorSceneKey?: string } {
+  if (!currentSceneKey) return {};
+  for (let i = session.history.length - 1; i >= 0; i--) {
+    const prior = session.history[i]!.scene;
+    if (prior.sceneKey === currentSceneKey) {
+      if (prior.imageUuid) {
+        return {
+          priorSceneReference: prior.imageUuid,
+          priorSceneKey: prior.sceneKey,
+        };
+      }
+      const cached = priorImageBase64ByUuid.get(prior.id);
+      if (cached) {
+        return { priorSceneReference: cached, priorSceneKey: prior.sceneKey };
+      }
+    }
+  }
+  return {};
+}

 export type SceneResult = {
  scene: Scene;
-  characterUpdates: Character[];
+  sceneImageBase64: string;
+  characters: Character[];
 };

+// ──────────────────────────────────────────────────────────────────────
+//  directScene — the multi-agent pipeline. Used by orchestrator's
+//  startSession and requestScene.
+//
+//  priorImageBase64ByUuid: optional map from prior Scene.id → base64
+//  the caller has on-hand. If a sceneKey-hit scene's imageUuid is missing
+//  but the base64 is cached locally, we can still feed it as one of the
+//  Painter's referenceImages. Pass an empty map when caller has no cache
+//  (orchestrator does pass it for the start-session bootstrap).
+// ──────────────────────────────────────────────────────────────────────
+
 export async function directScene(
-  config: ProviderConfig,
+  config: EngineConfig,
  session: Session,
+  priorImageBase64ByUuid: Map<string, string> = new Map(),
 ): Promise<SceneResult> {
-  const raw = await chat(
-    config,
-    [
-      { role: "system", content: DIRECTOR_SYSTEM },
-      { role: "user", content: buildDirectorUserMessage(session) },
-    ],
-    { temperature: 0.9, responseFormat: "json_object" },
+  const tTotal = Date.now();
+
+  // Stage 1 — Writer (serial; everything downstream needs sceneSummary +
+  // beats[] to know who's on stage and what to compose around).
+  const tWriter = Date.now();
+  const writerOut = await runWriter(config.text, session);
+  tlog("[directScene] Writer", tWriter);
+
+  // Identify NEW characters introduced by this scene that need to be
+  // designed (LLM + portrait + voice). Existing characters in the registry
+  // are skipped — their cards / portraits / voices persist across scenes.
+  const allActiveNames = collectActiveCharacterNames(writerOut.beats);
+  const newCharNames = allActiveNames.filter(
+    (n) => !session.characters.some((c) => c.name === n),
  );

-  const parsed = parseJsonLoose<RawScene>(raw);
-  const rawBeats = Array.isArray(parsed.beats) ? parsed.beats : [];
-  if (rawBeats.length === 0) {
-    throw new Error("Director returned no beats");
+  // Find the entry beat for the Cinematographer (which characters are
+  // on-screen in the establishing shot).
+  const entryBeat = writerOut.beats.find((b) => b.id === writerOut.entryBeatId);
+  const entryBeatActive = entryBeat?.activeCharacters ?? [];
+
+  // For sceneKey-based visual continuity, look up the prior matching scene's
+  // image to slot into Painter's referenceImages (max 4 of which include
+  // character portraits too).
+  const { priorSceneReference, priorSceneKey } = pickPriorSceneReference(
+    session,
+    writerOut.sceneKey,
+    priorImageBase64ByUuid,
+  );
+
+  // Stage 2 — parallel: CharacterDesigner(s) and Cinematographer.
+  // Cinematographer doesn't need character visualDescriptions (those are
+  // appended at Painter stage), so it runs concurrently with chardesign.
+  const tParallel = Date.now();
+
+  const designPromises = newCharNames.map((name) =>
+    designCharacter(config, session, name).catch((err): Character => {
+      const msg = err instanceof Error ? err.message : String(err);
+      console.error(`[directScene] designCharacter(${name}) failed: ${msg}`);
+      // Last-resort fallback: register with name only so the speaker isn't
+      // unknown. Caller may try voice provisioning later or skip.
+      return {
+        name,
+        voiceDescription: `请根据角色名「${name}」推断其性别、年龄与气质。所属世界观：${session.worldSetting}`,
+      };
+    }),
+  );
+
+  const cinemaPromise = runCinematographer(config.text, {
+    sceneSummary: writerOut.sceneSummary,
+    styleGuide: session.styleGuide,
+    entryBeatActive,
+    entryBeatSpeaker: entryBeat?.speaker,
+    priorSceneKey,
+    currentSceneKey: writerOut.sceneKey,
+  });
+
+  const [designedChars, cinemaOut] = await Promise.all([
+    Promise.all(designPromises),
+    cinemaPromise,
+  ]);
+  tlog("[directScene] CharacterDesigner+Cinematographer parallel", tParallel);
+
+  // Merge new chars into a working registry that we'll pass to the Painter.
+  const characters = mergeCharacters(session.characters, designedChars);
+
+  // Edge case: a speaker referenced by the Writer might not have been in
+  // `activeCharacters` of any beat (LLM oversight), so they got skipped by
+  // newCharNames. Catch them here and at least provision a voice so the
+  // beat-audio path doesn't render silent. No portrait — they weren't
+  // visible in the scene, so visual consistency doesn't matter for them.
+  const speakerNames = new Set(
+    writerOut.beats.map((b) => b.speaker).filter((n): n is string => Boolean(n)),
+  );
+  const orphanSpeakers = [...speakerNames].filter(
+    // Pattern B: "你" (player) is a valid speaker but never gets a Character
+    // record — TTS is intentionally skipped on the client. Filter POV out so
+    // provisionVoiceForName isn't accidentally invoked for the player.
+    (n) => !isPovName(n) && !characters.some((c) => c.name === n),
+  );
+  if (orphanSpeakers.length > 0) {
+    const orphans = await Promise.all(
+      orphanSpeakers.map((n) => provisionVoiceForName(config, session, n)),
+    );
+    const merged = mergeCharacters(characters, orphans);
+    characters.splice(0, characters.length, ...merged);
  }

-  const beats = ensureUniqueChoiceIds(
-    repairBeats(
-      ensureUniqueBeatIds(
-        rawBeats.map((b, i) => coerceBeat(b, i, rawBeats.length)),
-      ),
-    ),
+  // Stage 3 — Painter (depends on cinemaOut + characters).
+  // On-stage characters for THIS scene are the ones in any beat — pass them
+  // all so the archetype block covers anyone the player might encounter.
+  const onStageCharacters = characters.filter((c) =>
+    allActiveNames.includes(c.name),
  );

-  const declaredEntry = parsed.entryBeatId?.trim();
-  const entryBeatId =
-    declaredEntry && beats.some((b) => b.id === declaredEntry)
-      ? declaredEntry
-      : beats[0]!.id;
-
-  return {
-    scene: {
-      id: newSceneId(),
-      scenePrompt: parsed.scenePrompt?.trim() || "an empty scene",
-      beats,
-      entryBeatId,
+  const tPainter = Date.now();
+  const sceneImageBase64 = await runPainter(
+    config,
+    {
+      integratedPrompt: cinemaOut.integratedPrompt,
+      styleGuide: session.styleGuide,
+      onStageCharacters,
+      priorSceneImage: priorSceneReference,
    },
-    characterUpdates: coerceCharacterUpdates(parsed.characterUpdates),
+    entryBeat,
+  );
+  tlog("[directScene] Painter", tPainter);
+
+  // Stage 4 — best-effort upload of the final scene image so the NEXT
+  // sceneKey-match call can reference its UUID instead of carrying base64.
+  // If upload fails, the scene still works; only loses cheap referencing
+  // on the next hop. Don't wait on mock images (static placeholder).
+  let imageUuid: string | undefined;
+  if (!config.mockImage) {
+    try {
+      const tUpload = Date.now();
+      imageUuid = await uploadImage(config.image, sceneImageBase64);
+      tlog("[directScene] image upload", tUpload);
+    } catch (err) {
+      const msg = err instanceof Error ? err.message : String(err);
+      console.warn(`[directScene] scene image upload failed: ${msg} — sceneKey reuse will need base64 fallback`);
+    }
+  }
+
+  const scene: Scene = {
+    id: newSceneId(),
+    // scenePrompt is the cinematographer's English compositional output;
+    // the Writer's sceneSummary stays in the session log via beats[]/
+    // history. Keeping the original field name preserves compat with
+    // anything that already reads scene.scenePrompt (e.g., insert-beat
+    // user prompt).
+    scenePrompt: cinemaOut.integratedPrompt,
+    beats: writerOut.beats,
+    entryBeatId: writerOut.entryBeatId,
+    sceneKey: writerOut.sceneKey,
+    imageUuid,
  };
+
+  tlog("[directScene] TOTAL", tTotal);
+
+  return { scene, sceneImageBase64, characters };
 }

 // ──────────────────────────────────────────────────────────────────────
-//  directInsertBeat — generates a one-off transient beat in response to
-//  a freeform vision action that stays in-scene. Used by /api/insert-beat.
+//  directInsertBeat — single-agent path for vision-driven in-scene
+//  exploration. Generates ONE transient beat with NO new image, NO new
+//  characters. Multi-agent pipeline doesn't apply here (no rendering, no
+//  character introduction allowed by the prompt).
 // ──────────────────────────────────────────────────────────────────────

 export async function directInsertBeat(
@@ -326,13 +311,17 @@ export async function directInsertBeat(
  const parsed = parseJsonLoose<InsertBeatPartial>(raw);

  const narration = parsed.narration?.trim() || undefined;
-  const speaker = parsed.speaker?.trim() || undefined;
+  const rawSpeaker = parsed.speaker?.trim() || undefined;
+  // Pattern B (mirrors Writer): normalize POV variants → "你"; NPCs pass through.
+  const speaker = rawSpeaker ? normalizeSpeakerName(rawSpeaker) : undefined;
  const line = parsed.line?.trim() || undefined;
-  const lineDelivery = line ? parsed.lineDelivery?.trim() || undefined : undefined;
+  // lineDelivery is only meaningful for NPC speakers (TTS). For POV ("你")
+  // TTS is intentionally skipped on the client, so lineDelivery is dropped.
+  const lineDelivery =
+    line && speaker !== POV_DISPLAY_NAME
+      ? parsed.lineDelivery?.trim() || undefined
+      : undefined;

-  // If the model returned nothing usable, supply a fallback narration so the
-  // frontend doesn't append a silent empty beat that renders no dialogue —
-  // which would make the click appear to do nothing.
  if (!narration && !speaker && !line) {
    return { narration: "（你停下脚步，环视片刻。）" };
  }
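
Note on the hunk above: the helpers it imports (`normalizeSpeakerName`, `isPovName`, `POV_DISPLAY_NAME`) live in agents/writer.ts, which is not shown here. The sketch below is an assumption about how they might behave, based only on the commit message: the variant list is the partial one quoted under F1, the case-insensitive matching is a guess, and `pickLineDelivery` is a hypothetical helper that restates the inline `lineDelivery` expression.

```typescript
// Hypothetical sketch; the real helpers are in agents/writer.ts.
const POV_DISPLAY_NAME = "你";

// Assumed subset of POV_VARIANTS, taken from the list in the commit message.
const POV_VARIANTS = new Set<string>([
  "你", "玩家", "我", "主角", "protagonist", "player", "i", "me",
]);

// True when the name refers to the player rather than an NPC.
function isPovName(name: string): boolean {
  return POV_VARIANTS.has(name.trim().toLowerCase());
}

// POV variants collapse to "你"; NPC names pass through (trimmed).
function normalizeSpeakerName(name: string): string {
  return isPovName(name) ? POV_DISPLAY_NAME : name.trim();
}

// Restates the rule from the hunk: lineDelivery survives only when there is
// a line AND the speaker is not the player (no TTS for "你").
function pickLineDelivery(
  speaker: string | undefined,
  line: string | undefined,
  lineDelivery: string | undefined,
): string | undefined {
  return line && speaker !== POV_DISPLAY_NAME
    ? lineDelivery?.trim() || undefined
    : undefined;
}

console.log(normalizeSpeakerName("Player"));
console.log(pickLineDelivery(normalizeSpeakerName("主角"), "hello", "softly"));
console.log(pickLineDelivery("Alice", "hello", " softly "));
```

Normalizing before the `lineDelivery` check matters: if the comparison ran on the raw speaker, a variant such as "player" would keep its delivery hint even though the line is never synthesized.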