feat(engine): multi-agent character consistency pipeline (#6)

* feat(types): Character.voiceDescription rename + visual fields + Scene.sceneKey

Prepares the type surface for the multi-agent scene pipeline:

- Character.description → voiceDescription (clearer pairing with new visualDescription)
- Character gains visualDescription (English appearance card for Painter) +
  basePortraitBase64 + basePortraitUuid (for Runware referenceImages reuse)
- Scene gains sceneKey (English slug for cross-scene img2img continuity) +
  imageUuid (Runware UUID of the scene's rendered image for cheap seedImage
  reuse on subsequent same-sceneKey calls)
- Beat gains activeCharacters[] so the Cinematographer can read which
  characters are on-screen + their poses when composing the establishing shot
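
As a rough illustration, the type surface described above might look like the sketch below. The field names come from the list; which fields are optional, and the `ActiveCharacter` entry shape (`name` plus `pose`), are assumptions made for the example rather than the actual `@yume/types` definitions. `sharesBackdrop` is a hypothetical helper that only shows how `sceneKey` is meant to be compared.

```typescript
// Sketch only: optionality and the ActiveCharacter shape are assumptions.
interface Character {
  name: string;
  voiceDescription: string; // renamed from `description`
  visualDescription?: string; // English appearance card for the Painter
  basePortraitBase64?: string;
  basePortraitUuid?: string; // Runware UUID reused via referenceImages
}

// Hypothetical entry shape for Beat.activeCharacters[].
interface ActiveCharacter {
  name: string;
  pose?: string;
}

interface Beat {
  id: string;
  speaker?: string;
  line?: string;
  activeCharacters?: ActiveCharacter[];
}

interface Scene {
  id: string;
  beats: Beat[];
  entryBeatId: string;
  sceneKey?: string; // English slug shared by scenes set in the same location
  imageUuid?: string; // Runware UUID of this scene's rendered image
}

// Hypothetical helper: two scenes can share a backdrop only when both
// carry the same, defined sceneKey.
function sharesBackdrop(a: Scene, b: Scene): boolean {
  return a.sceneKey !== undefined && a.sceneKey === b.sceneKey;
}

const morning: Scene = { id: "s1", beats: [], entryBeatId: "b1", sceneKey: "school-rooftop" };
const evening: Scene = { id: "s2", beats: [], entryBeatId: "b1", sceneKey: "school-rooftop" };
const untagged: Scene = { id: "s3", beats: [], entryBeatId: "b1" };

console.log(sharesBackdrop(morning, evening)); // true
console.log(sharesBackdrop(morning, untagged)); // false
```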

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ai-client): generateImage img2img + multi-reference options + uploadImage

Extends the Runware adapter to support the two anchoring mechanisms FLUX.2
[klein] 9B KV needs for character + scene visual consistency:

- generateImage gains optional { seedImage, referenceImages, strength }:
  seedImage drives img2img (single starting image, sceneKey continuity),
  referenceImages drives multi-reference anchoring (up to 4 character
  portraits, capped per Runware spec). Default strength 0.85 — FLUX
  ignores strength < 0.8.
- uploadImage POSTs a base64 to Runware's imageUpload taskType and
  returns the UUID, so portraits/scene snapshots can be referenced by
  UUID on subsequent calls instead of resending base64 every scene.
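
A minimal sketch of how the optional anchors could be normalized before a request is built. The option names, the cap of 4 and the 0.85 default come from the description above; `ImageAnchorOptions`, `NormalizedAnchorOptions` and `normalizeAnchorOptions` are hypothetical names, and the real adapter may structure this differently.

```typescript
// Hypothetical option shape; the real adapter's types may differ.
interface ImageAnchorOptions {
  seedImage?: string; // UUID or base64 of the img2img starting image
  referenceImages?: string[]; // UUIDs or base64 of reference images
  strength?: number;
}

interface NormalizedAnchorOptions {
  seedImage?: string;
  referenceImages: string[];
  strength: number;
}

const MAX_REFERENCE_IMAGES = 4; // cap stated in the commit message
const DEFAULT_STRENGTH = 0.85; // default stated in the commit message

function normalizeAnchorOptions(
  opts: ImageAnchorOptions = {},
): NormalizedAnchorOptions {
  return {
    // Only carry seedImage through when the caller supplied one.
    ...(opts.seedImage ? { seedImage: opts.seedImage } : {}),
    // Silently drop anything beyond the reference-image cap.
    referenceImages: (opts.referenceImages ?? []).slice(0, MAX_REFERENCE_IMAGES),
    strength: opts.strength ?? DEFAULT_STRENGTH,
  };
}

const normalized = normalizeAnchorOptions({
  referenceImages: ["uuid-1", "uuid-2", "uuid-3", "uuid-4", "uuid-5"],
});
console.log(normalized.referenceImages.length, normalized.strength);
```

Truncating rather than throwing keeps a scene renderable when more characters are on stage than there are reference slots; whether the real adapter truncates or rejects is not stated here.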

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(engine): multi-agent scene pipeline (Writer→CharDesigner+Cinematographer→Painter)

Replaces the single-LLM directScene with a four-agent pipeline that
specializes each concern and parallelizes the slow parts. Adopts the
core idea from #4 (multi-agent dispatch + character visual consistency)
and grafts it onto the Scene/Beat architecture introduced in #2.

Pipeline per Scene (~9-12s critical path with parallelization):

  Writer LLM (serial, ~3s)
    │ outputs: sceneSummary + sceneKey + beats[] (each beat carries
    │           activeCharacters[] with poses)
    │
    ├─ CharacterDesigner LLM × N new chars (parallel)
    │     │ outputs: { visualDescription (English appearance card),
    │     │            voiceDescription (Chinese voice card) }
    │     ├─ FLUX portrait gen → upload → UUID (parallel within agent)
    │     └─ Xiaomi MiMo voicedesign provision (parallel within agent)
    │
    └─ Cinematographer LLM (parallel with CharacterDesigner)
          outputs: { shotType, integratedPrompt (English composition +
                     camera position + character blocking) }

  Painter (FLUX img2img + referenceImages, ~1-3s)
    inputs: integratedPrompt + onStageCharacters' archetype block
            + (optional) prior sceneKey-hit scene as seedImage
            + (optional) character portrait UUIDs as referenceImages
    fallback chain: A) both anchors → B) refs only (preserves characters) →
                    C) seed only (preserves backdrop) → D) pure t2i
    output uploaded → Scene.imageUuid for the next sceneKey hop

Why this carving:
- Writer focuses purely on narrative (it drops the voice-design duty that
  DIRECTOR_SYSTEM on staging carried as a side concern).
- CharacterDesigner bundles visual + voice so the agent that thinks
  "who is this character" produces internally-consistent appearance +
  vocal personality (split agents tend to diverge).
- Cinematographer doesn't need character visualDescriptions —
  Painter appends archetypes after — so it parallelizes with
  CharacterDesigner.
- sceneKey enables cross-scene backdrop continuity that Scene/Beat
  doesn't cover (Scene/Beat only reuses backdrop WITHIN a scene's
  beats; sceneKey reuses across scenes that share a location).

Other changes:
- voice.ts loses provisionVoicesForScene (moved into CharacterDesigner);
  keeps synthesizeBeat for the lazy per-beat /api/beat-audio path.
- renderer.ts deleted (replaced by agents/painter.ts).
- directInsertBeat (vision-driven in-scene exploration) stays single-
  LLM — it forbids new characters and produces no image, so multi-
  agent doesn't apply.

apps/web is unchanged: orchestrator.ts keeps the same exports
(startSession / requestScene / visionDecide / requestInsertBeat /
requestBeatAudio) with identical request/response shapes.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): Pattern B player POV + JSON repair + drop seedImage tier

Three hotfixes surfaced by manual end-to-end testing of the multi-agent
pipeline.

F1 — Player viewpoint (galgame Pattern B):
  - Writer accepts speaker="你" for player dialog (renders in dialog box,
    never TTS'd because no Character record exists for "你"). Filter POV
    variants (玩家/我/主角/protagonist/player/I/me/...) from
    activeCharacters so CharacterDesigner never wastes API calls on the
    player. Two-layer defense: explicit prompt rule in WRITER_SYSTEM +
    code normalization (POV_VARIANTS set, isPovName, normalizeSpeakerName).
  - Cinematographer and Painter prompts gain "player never in frame" rule
    so the player never appears in any rendered scene.
  - Cinematographer gains dynamic camera policy driven by the entry beat's
    speaker: NPC-speaker → close-up looking toward camera; "你"-speaker →
    medium shot of attentive NPC; no speaker → wide establishing shot.
  - director.ts filters POV from orphanSpeakers so provisionVoiceForName
    never fires for "你".
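
The code-normalization layer and the camera policy could be sketched as follows. The names `POV_VARIANTS`, `isPovName`, `normalizeSpeakerName` and `POV_DISPLAY_NAME` appear in this change, but the bodies below are assumptions, the variant set is only the subset listed above, and `pickShotType` is a hypothetical helper standing in for a rule that actually lives in the Cinematographer prompt.

```typescript
const POV_DISPLAY_NAME = "你";

// Assumed subset of the real set; only the variants named above are listed.
const POV_VARIANTS = new Set([
  "你", "玩家", "我", "主角", "protagonist", "player", "i", "me",
]);

// True when a name refers to the player rather than an NPC.
function isPovName(name: string): boolean {
  return POV_VARIANTS.has(name.trim().toLowerCase());
}

// Collapse every POV variant to the single display name; NPCs pass through.
function normalizeSpeakerName(name: string): string {
  return isPovName(name) ? POV_DISPLAY_NAME : name.trim();
}

type ShotType = "close-up" | "medium" | "wide";

// Hypothetical encoding of the camera policy driven by the entry beat's speaker.
function pickShotType(entrySpeaker: string | undefined): ShotType {
  if (!entrySpeaker) return "wide"; // no speaker: establishing shot
  return isPovName(entrySpeaker)
    ? "medium" // player speaks: attentive NPC in a medium shot
    : "close-up"; // NPC speaks: close-up looking toward camera
}

// "Alice" is a made-up NPC name used only for this example.
const activeNames = ["Alice", "Player", "主角"].filter((n) => !isPovName(n));
console.log(activeNames, normalizeSpeakerName(" Player "), pickShotType("Alice"));
```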

F2 — JSON parsing robustness:
  - parseJsonLoose gains a 4th repair tier: strip JS-style comments, strip
    trailing commas, insert missing commas between adjacent objects /
    arrays / quoted values. Logs the first 800 chars of raw LLM output
    when all repair attempts fail, so we can see what the model emitted.
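
For illustration, the trailing-comma and missing-comma repairs might be approximated with two regexes, as in the sketch below. `repairJsonSketch` and `parseWithRepair` are hypothetical names, and the regexes are assumptions rather than the ones `parseJsonLoose` uses. They are not string-aware, so they can damage string values that happen to contain these patterns.

```typescript
// Naive, regex-based repair. Not string-aware: a string value containing
// ",]" or '" {' would be altered as well.
function repairJsonSketch(text: string): string {
  return (
    text
      // Drop a comma that directly precedes a closing brace or bracket.
      .replace(/,(\s*[}\]])/g, "$1")
      // Insert a comma between two values separated only by whitespace,
      // e.g. `} {`, `] [` or `"a" "b"`.
      .replace(/([}\]"])(\s+)(?=[{\["])/g, "$1,$2")
  );
}

// Try a strict parse first so well-formed input is never touched.
function parseWithRepair<T>(text: string): T {
  try {
    return JSON.parse(text) as T;
  } catch {
    return JSON.parse(repairJsonSketch(text)) as T;
  }
}

const trailing = parseWithRepair<{ a: number[] }>('{"a":[1,2,],}');
const adjacent = parseWithRepair<unknown[]>('[{"a":1} {"b":2}]');
console.log(trailing.a.length, adjacent.length);
```

Running the repair only after a strict `JSON.parse` has failed limits the damage these regexes can do, because valid output is returned unchanged.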

F3 — Drop seedImage, use referenceImages for prior scene:
  - FLUX.2 [klein] 9B KV does not support seedImage (img2img). Removed
    Tier A (seedImage+refs) and Tier C (seedImage only) from the Painter
    degradation chain. New layout: prior scene's image slots into
    referenceImages[0] for spatial continuity, character portraits fill
    slots 1-3 (Runware caps at 4 total). Cinematographer instructed to
    emphasize continuity when sceneKey matches a prior scene.
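
The slot layout can be sketched as a small pure function. `buildReferenceImages` and `PortraitSource` are hypothetical names; preferring the UUID over base64 follows the stated goal of not resending base64, and letting portraits use all four slots when there is no prior scene is an assumption, not something this change states.

```typescript
const MAX_REFERENCE_IMAGES = 4; // Runware cap stated above

// Minimal stand-in for the portrait fields on Character.
interface PortraitSource {
  basePortraitUuid?: string;
  basePortraitBase64?: string;
}

function buildReferenceImages(
  priorSceneImage: string | undefined,
  portraits: PortraitSource[],
): string[] {
  const refs: string[] = [];
  // Slot 0: the prior scene's image, when a sceneKey match exists.
  if (priorSceneImage) refs.push(priorSceneImage);
  // Remaining slots: character portraits, UUID preferred over base64.
  for (const p of portraits) {
    if (refs.length >= MAX_REFERENCE_IMAGES) break;
    const ref = p.basePortraitUuid ?? p.basePortraitBase64;
    if (ref) refs.push(ref);
  }
  return refs;
}

const refs = buildReferenceImages("scene-uuid", [
  { basePortraitUuid: "p1" },
  { basePortraitBase64: "base64-p2" },
  {}, // character with no portrait yet: skipped
  { basePortraitUuid: "p3" },
  { basePortraitUuid: "p4" }, // dropped: over the cap
]);
console.log(refs);
```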

All five package typechecks pass.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): address Copilot review feedback on #6

Three targeted fixes from PR #6 Copilot review.

F4 — Stale seedImage/img2img docstrings
  Four locations still referenced the original img2img design after F3
  switched to referenceImages-based spatial continuity:
  - types/index.ts:57   Scene.sceneKey docstring
  - types/index.ts:63   Scene.imageUuid docstring
  - director.ts:34      pipeline diagram in module block comment
  - director.ts:128     directScene JSDoc
  Doc-only changes; misleading wording corrected to mention referenceImages.
  (The design-rationale comment in pickPriorSceneReference is kept — it
  explains WHY we don't use seedImage and is load-bearing context.)

F5 — Remove JS-comment stripping from JSON repair pass
  parseJsonLoose's repair tier previously stripped `// ...` and
  `/* ... */` across the entire text, which would corrupt JSON string
  values containing URLs (e.g. "https://example.com" → "https:"). Since
  LLMs in `responseFormat: "json_object"` mode essentially never emit
  comments, dropping the comment-stripping step is a net win for safety.
  Trailing-comma and missing-comma repair (the high-frequency failures)
  are kept.
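
The failure mode is easy to reproduce. The sketch below uses assumed regexes of the kind described above (not necessarily the ones that were removed) and shows a URL inside a string value being cut at `//`.

```typescript
// Naive comment stripping applied to the whole text, strings included.
function stripCommentsNaively(text: string): string {
  return text
    .replace(/\/\*[\s\S]*?\*\//g, "") // block comments
    .replace(/\/\/.*$/gm, ""); // line comments, up to end of line
}

const input = '{"url": "https://example.com", "n": 1}';
const stripped = stripCommentsNaively(input);

// Everything from "//" to the end of the line is gone, so the result is
// no longer valid JSON.
let parseFailed = false;
try {
  JSON.parse(stripped);
} catch {
  parseFailed = true;
}
console.log(JSON.stringify(stripped), parseFailed);
```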

F6 — Pattern B parity on the insert-beat path
  Previously: directInsertBeat's INSERT_BEAT_SYSTEM forbade any speaker
  not in session.characters, and the orchestrator's unregistered-speaker
  guard demoted such lines to narration. This meant the player could not
  speak via speaker="你" in transient in-scene beats — inconsistent with
  the Writer path.
  Fix:
  - INSERT_BEAT_SYSTEM prompt now allows speaker="你" (NPC name OR "你")
    and rejects other POV variants
  - directInsertBeat applies normalizeSpeakerName to the LLM output, same
    as the Writer path, so POV variants collapse to "你"
  - lineDelivery is dropped when speaker="你" (no TTS for player)
  - orchestrator's unregistered-speaker guard adds a `speaker !== "你"`
    exception so Pattern B player dialog passes through
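
A minimal sketch of the guard with the Pattern B exception. `applySpeakerGuard` and the `InsertBeat` shape are hypothetical, and how the real orchestrator demotes a line to narration is an assumption; only the `speaker !== "你"` exception is taken from the description above.

```typescript
const POV_DISPLAY_NAME = "你";

// Hypothetical minimal shape of an inserted beat.
interface InsertBeat {
  narration?: string;
  speaker?: string;
  line?: string;
}

function applySpeakerGuard(
  beat: InsertBeat,
  registeredNames: string[],
): InsertBeat {
  const { speaker, line } = beat;
  // Pass through: no speaker, the player ("你"), or a registered NPC.
  if (
    !speaker ||
    speaker === POV_DISPLAY_NAME ||
    registeredNames.includes(speaker)
  ) {
    return beat;
  }
  // Unregistered NPC: fold the spoken line into narration (assumed behavior).
  const narration = [beat.narration, line].filter(Boolean).join(" ");
  return narration ? { narration } : {};
}

// "Alice" and "Stranger" are made-up names used only for this example.
const playerBeat = applySpeakerGuard({ speaker: "你", line: "Who's there?" }, ["Alice"]);
const strangerBeat = applySpeakerGuard({ speaker: "Stranger", line: "Hello." }, ["Alice"]);
console.log(playerBeat, strangerBeat);
```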

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(engine): drop "JS-style comments" from parseJsonLoose header

The function header listed JS-style comments as a step-4 repair, but F5
already removed comment stripping from `repairJsonString` because the
regex would corrupt URLs inside JSON string values. The inner function's
comment was updated then; this header was missed.

Doc-only sync from second-round Copilot review on #6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: QiChen88 <2291969160@qq.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Zonghao Yuan
2026-05-29 13:30:24 +08:00
committed by GitHub
parent e261f4a346
commit def1b25bd9
14 changed files with 1733 additions and 562 deletions
+267 -278
@@ -1,309 +1,294 @@
import { chat } from "@yume/ai-client";
import { chat, uploadImage } from "@yume/ai-client";
import type {
Beat,
BeatChoice,
BeatChoiceEffect,
BeatNext,
Character,
EngineConfig,
InsertBeatPartial,
ProviderConfig,
Scene,
Session,
} from "@yume/types";
import { parseJsonLoose } from "./jsonParser";
import { designCharacter, provisionVoiceForName } from "./agents/characterDesigner";
import { runCinematographer } from "./agents/cinematographer";
import { runPainter } from "./agents/painter";
import {
DIRECTOR_SYSTEM,
INSERT_BEAT_SYSTEM,
buildDirectorUserMessage,
buildInsertBeatUserMessage,
} from "./prompts";
collectActiveCharacterNames,
isPovName,
normalizeSpeakerName,
POV_DISPLAY_NAME,
runWriter,
} from "./agents/writer";
import { parseJsonLoose } from "./jsonParser";
import { INSERT_BEAT_SYSTEM, buildInsertBeatUserMessage } from "./prompts";
// ──────────────────────────────────────────────────────────────────────
// Raw shape produced by the model — we coerce + validate into a Scene.
// ──────────────────────────────────────────────────────────────────────
type RawEffect = {
kind?: string;
targetBeatId?: string;
nextSceneSeed?: string;
};
type RawChoice = {
id?: string;
label?: string;
effect?: RawEffect;
};
type RawNext = {
type?: string;
nextBeatId?: string;
choices?: RawChoice[];
};
type RawBeat = {
id?: string;
narration?: string;
speaker?: string;
line?: string;
lineDelivery?: string;
next?: RawNext;
};
type RawCharacterUpdate = {
name?: string;
description?: string;
};
type RawScene = {
scenePrompt?: string;
entryBeatId?: string;
beats?: RawBeat[];
characterUpdates?: RawCharacterUpdate[];
};
function coerceEffect(raw: RawEffect | undefined): BeatChoiceEffect {
if (raw?.kind === "advance-beat" && raw.targetBeatId?.trim()) {
return { kind: "advance-beat", targetBeatId: raw.targetBeatId.trim() };
}
return {
kind: "change-scene",
nextSceneSeed: raw?.nextSceneSeed?.trim() || "未指定",
};
}
function coerceChoice(raw: RawChoice, idx: number): BeatChoice {
return {
id: raw.id?.trim() || `c${idx + 1}`,
label: raw.label?.trim() || `选项 ${idx + 1}`,
effect: coerceEffect(raw.effect),
};
}
function coerceNext(raw: RawNext | undefined, fallbackBeatId: string): BeatNext {
if (raw?.type === "choice" && Array.isArray(raw.choices) && raw.choices.length) {
return {
type: "choice",
choices: raw.choices.map((c, i) => coerceChoice(c, i)),
};
}
return {
type: "continue",
nextBeatId: raw?.nextBeatId?.trim() || fallbackBeatId,
};
}
function coerceBeat(raw: RawBeat, idx: number, totalBeats: number): Beat {
const id = raw.id?.trim() || `b${idx + 1}`;
// Non-last beats default their `continue` target to the following beat.
// The last beat gets an empty fallback on purpose: repairBeats() turns a
// last/dangling continue into a real scene-change exit so the player can
// never get stuck self-looping on it.
const fallback = idx + 1 < totalBeats ? `b${idx + 2}` : "";
const line = raw.line?.trim() || undefined;
return {
id,
narration: raw.narration?.trim() || undefined,
speaker: raw.speaker?.trim() || undefined,
line,
// lineDelivery only meaningful when there is a line to deliver.
lineDelivery: line ? raw.lineDelivery?.trim() || undefined : undefined,
next: coerceNext(raw.next, fallback),
};
}
function coerceCharacterUpdates(raw: RawCharacterUpdate[] | undefined): Character[] {
if (!Array.isArray(raw)) return [];
return raw
.map((c) => ({
name: c.name?.trim() ?? "",
description: c.description?.trim() ?? "",
}))
.filter((c) => c.name && c.description);
}
const FALLBACK_SEED = "故事继续推进";
function fallbackExitChoice(beatId: string): BeatChoice {
return {
id: `${beatId}__exit`,
label: "继续",
effect: { kind: "change-scene", nextSceneSeed: FALLBACK_SEED },
};
}
// Beat ids are graph keys (the front-end's `beats.find(b => b.id === ...)`,
// the session's `visitedBeatIds`, and `continue`/`advance-beat` targets). If
// the model reuses an id across beats, the second occurrence becomes silently
// unreachable and external references collapse to the first beat. Rename
// duplicates; rewrite the renamed beat's OWN self-references (the most
// natural interpretation of a duplicate id being referenced from inside that
// same beat). External references stay pointing at the first occurrence.
function ensureUniqueBeatIds(beats: Beat[]): Beat[] {
const seen = new Set<string>();
return beats.map((b): Beat => {
if (!seen.has(b.id)) {
seen.add(b.id);
return b;
}
const oldId = b.id;
let n = 2;
while (seen.has(`${oldId}_${n}`)) n += 1;
const newId = `${oldId}_${n}`;
seen.add(newId);
let next = b.next;
if (next.type === "continue" && next.nextBeatId === oldId) {
next = { type: "continue", nextBeatId: newId };
} else if (next.type === "choice") {
next = {
type: "choice",
choices: next.choices.map((c) =>
c.effect.kind === "advance-beat" && c.effect.targetBeatId === oldId
? {
...c,
effect: { kind: "advance-beat" as const, targetBeatId: newId },
}
: c,
),
};
}
return { ...b, id: newId, next };
});
}
// Repairs referential integrity AND guarantees the scene is escapable:
// - a `continue` to a missing/self id is repointed to the next beat in order;
// a last/dangling continue with nowhere to go becomes a scene-change exit
// (never a self-loop, which would strand the player on "click to advance")
// - an `advance-beat` to a missing id is downgraded to a scene change
// - if no change-scene exit exists anywhere, one is appended to the last beat
function repairBeats(beats: Beat[]): Beat[] {
const ids = new Set(beats.map((b) => b.id));
const fixed: Beat[] = beats.map((b, idx): Beat => {
if (b.next.type === "continue") {
const target = b.next.nextBeatId;
if (ids.has(target) && target !== b.id) return b;
const nextByIndex = beats[idx + 1]?.id;
if (nextByIndex) {
return { ...b, next: { type: "continue", nextBeatId: nextByIndex } };
}
return { ...b, next: { type: "choice", choices: [fallbackExitChoice(b.id)] } };
}
const patched = b.next.choices.map((c) =>
c.effect.kind === "advance-beat" && !ids.has(c.effect.targetBeatId)
? {
...c,
effect: {
kind: "change-scene" as const,
nextSceneSeed: "未指定(导演引用不存在的 beat,已降级为换场)",
},
}
: c,
);
return { ...b, next: { type: "choice", choices: patched } };
});
const hasExit = fixed.some(
(b) =>
b.next.type === "choice" &&
b.next.choices.some((c) => c.effect.kind === "change-scene"),
);
if (!hasExit && fixed.length > 0) {
const lastIdx = fixed.length - 1;
const last = fixed[lastIdx]!;
const existing = last.next.type === "choice" ? last.next.choices : [];
fixed[lastIdx] = {
...last,
next: { type: "choice", choices: [...existing, fallbackExitChoice(last.id)] },
};
}
return fixed;
}
// Choice ids are the keys the front-end uses to cache and consume prefetched
// scenes. Two beats both defaulting to c1/c2 (or the model reusing ids across
// beats) would make a transition reuse the WRONG prefetched scene — so force
// every choice id to be unique within the scene.
function ensureUniqueChoiceIds(beats: Beat[]): Beat[] {
const seen = new Set<string>();
for (const b of beats) {
if (b.next.type !== "choice") continue;
for (const c of b.next.choices) {
if (seen.has(c.id)) {
let n = 2;
while (seen.has(`${c.id}_${n}`)) n += 1;
c.id = `${c.id}_${n}`;
}
seen.add(c.id);
}
}
return beats;
}
// ══════════════════════════════════════════════════════════════════════
// director.ts — multi-agent orchestrator for one full Scene generation.
//
// Critical path (per Scene call):
//
// Writer LLM (~3s, serial)
// │
// ├─ CharacterDesigner LLM × N (parallel per new char)
// │ │
// │ ├─ portrait gen + upload (parallel within agent)
// │ └─ voice provisioning (parallel within agent)
// │
// ├─ Cinematographer LLM (parallel with all of the above)
// │
// └─ wait for all parallel branches
// │
// ▼
// Painter (FLUX referenceImages — two-tier degradation chain)
// │
// ▼
// upload final scene image → Scene.imageUuid
// │
// ▼
// return { scene, sceneImageBase64, characters }
//
// The Cinematographer intentionally does NOT depend on CharacterDesigner
// output — it only positions named characters in the frame, not their
// appearance. This unlocks the parallelism that makes the full pipeline
// ~9-12s instead of ~15-18s serial.
// ══════════════════════════════════════════════════════════════════════
function newSceneId(): string {
return `scene_${Date.now()}_${Math.random().toString(36).slice(2, 6)}`;
}
// ──────────────────────────────────────────────────────────────────────
// directScene — generates one Scene (multi-beat) for the player.
// Called both on real scene transitions AND on speculative prefetch.
// ──────────────────────────────────────────────────────────────────────
function tlog(label: string, t0: number): void {
console.log(`${label}: ${Date.now() - t0}ms`);
}
// Merge a freshly-designed Character into a registry, preserving any
// previously-set voice/portrait that the new design didn't fill in (so
// re-designing a known character can't silently drop their voice or wipe
// out an already-generated portrait UUID). Match by name.
export function mergeCharacters(
existing: Character[],
updates: Character[],
): Character[] {
// Return a copy so callers can mutate the result without touching `existing`.
if (updates.length === 0) return [...existing];
const byName = new Map(existing.map((c) => [c.name, c]));
for (const u of updates) {
const prev = byName.get(u.name);
if (!prev) {
byName.set(u.name, u);
continue;
}
// Preserve any prior provisioned resource that the new design omitted.
byName.set(u.name, {
...u,
voice: u.voice ?? prev.voice,
visualDescription: u.visualDescription ?? prev.visualDescription,
basePortraitBase64: u.basePortraitBase64 ?? prev.basePortraitBase64,
basePortraitUuid: u.basePortraitUuid ?? prev.basePortraitUuid,
voiceDescription: u.voiceDescription || prev.voiceDescription,
});
}
return Array.from(byName.values());
}
// Pick a reference to the prior scene image when sceneKey matches a prior
// scene — used by the Painter as one of the `referenceImages` (NOT as a
// seedImage, because FLUX.2 [klein] 9B KV does not support seedImage).
//
// Returns the UUID if available (cheap reference, ~36 chars over the wire),
// else the base64 of the most recent matching scene's image. Returns
// undefined when no prior scene shares the current sceneKey.
function pickPriorSceneReference(
session: Session,
currentSceneKey: string | undefined,
priorImageBase64ByUuid: Map<string, string>,
): { priorSceneReference?: string; priorSceneKey?: string } {
if (!currentSceneKey) return {};
for (let i = session.history.length - 1; i >= 0; i--) {
const prior = session.history[i]!.scene;
if (prior.sceneKey === currentSceneKey) {
if (prior.imageUuid) {
return {
priorSceneReference: prior.imageUuid,
priorSceneKey: prior.sceneKey,
};
}
const cached = priorImageBase64ByUuid.get(prior.id);
if (cached) {
return { priorSceneReference: cached, priorSceneKey: prior.sceneKey };
}
}
}
return {};
}
export type SceneResult = {
scene: Scene;
characterUpdates: Character[];
sceneImageBase64: string;
characters: Character[];
};
// ──────────────────────────────────────────────────────────────────────
// directScene — the multi-agent pipeline. Used by orchestrator's
// startSession and requestScene.
//
// priorImageBase64ByUuid: optional map from prior Scene.id → base64
// the caller has on-hand. If a sceneKey-hit scene's imageUuid is missing
// but the base64 is cached locally, we can still feed it as one of the
// Painter's referenceImages. Pass an empty map when caller has no cache
// (orchestrator does pass it for the start-session bootstrap).
// ──────────────────────────────────────────────────────────────────────
export async function directScene(
config: ProviderConfig,
config: EngineConfig,
session: Session,
priorImageBase64ByUuid: Map<string, string> = new Map(),
): Promise<SceneResult> {
const raw = await chat(
config,
[
{ role: "system", content: DIRECTOR_SYSTEM },
{ role: "user", content: buildDirectorUserMessage(session) },
],
{ temperature: 0.9, responseFormat: "json_object" },
const tTotal = Date.now();
// Stage 1 — Writer (serial; everything downstream needs sceneSummary +
// beats[] to know who's on stage and what to compose around).
const tWriter = Date.now();
const writerOut = await runWriter(config.text, session);
tlog("[directScene] Writer", tWriter);
// Identify NEW characters introduced by this scene that need to be
// designed (LLM + portrait + voice). Existing characters in the registry
// are skipped — their cards / portraits / voices persist across scenes.
const allActiveNames = collectActiveCharacterNames(writerOut.beats);
const newCharNames = allActiveNames.filter(
(n) => !session.characters.some((c) => c.name === n),
);
const parsed = parseJsonLoose<RawScene>(raw);
const rawBeats = Array.isArray(parsed.beats) ? parsed.beats : [];
if (rawBeats.length === 0) {
throw new Error("Director returned no beats");
// Find the entry beat for the Cinematographer (which characters are
// on-screen in the establishing shot).
const entryBeat = writerOut.beats.find((b) => b.id === writerOut.entryBeatId);
const entryBeatActive = entryBeat?.activeCharacters ?? [];
// For sceneKey-based visual continuity, look up the prior matching scene's
// image to slot into Painter's referenceImages (capped at 4 in total,
// character portraits included).
const { priorSceneReference, priorSceneKey } = pickPriorSceneReference(
session,
writerOut.sceneKey,
priorImageBase64ByUuid,
);
// Stage 2 — parallel: CharacterDesigner(s) and Cinematographer.
// Cinematographer doesn't need character visualDescriptions (those are
// appended at Painter stage), so it runs concurrently with chardesign.
const tParallel = Date.now();
const designPromises = newCharNames.map((name) =>
designCharacter(config, session, name).catch((err): Character => {
const msg = err instanceof Error ? err.message : String(err);
console.error(`[directScene] designCharacter(${name}) failed: ${msg}`);
// Last-resort fallback: register with name only so the speaker isn't
// unknown. Caller may try voice provisioning later or skip.
return {
name,
voiceDescription: `请根据角色名「${name}」推断其性别、年龄与气质。所属世界观:${session.worldSetting}`,
};
}),
);
const cinemaPromise = runCinematographer(config.text, {
sceneSummary: writerOut.sceneSummary,
styleGuide: session.styleGuide,
entryBeatActive,
entryBeatSpeaker: entryBeat?.speaker,
priorSceneKey,
currentSceneKey: writerOut.sceneKey,
});
const [designedChars, cinemaOut] = await Promise.all([
Promise.all(designPromises),
cinemaPromise,
]);
tlog("[directScene] CharacterDesigner+Cinematographer parallel", tParallel);
// Merge new chars into a working registry that we'll pass to the Painter.
const characters = mergeCharacters(session.characters, designedChars);
// Edge case: a speaker referenced by the Writer might not have been in
// `activeCharacters` of any beat (LLM oversight), so they got skipped by
// newCharNames. Catch them here and at least provision a voice so the
// beat-audio path doesn't render silent. No portrait — they weren't
// visible in the scene, so visual consistency doesn't matter for them.
const speakerNames = new Set(
writerOut.beats.map((b) => b.speaker).filter((n): n is string => Boolean(n)),
);
const orphanSpeakers = [...speakerNames].filter(
// Pattern B: "你" (player) is a valid speaker but never gets a Character
// record — TTS is intentionally skipped on the client. Filter POV out so
// provisionVoiceForName isn't accidentally invoked for the player.
(n) => !isPovName(n) && !characters.some((c) => c.name === n),
);
if (orphanSpeakers.length > 0) {
const orphans = await Promise.all(
orphanSpeakers.map((n) => provisionVoiceForName(config, session, n)),
);
const merged = mergeCharacters(characters, orphans);
characters.splice(0, characters.length, ...merged);
}
const beats = ensureUniqueChoiceIds(
repairBeats(
ensureUniqueBeatIds(
rawBeats.map((b, i) => coerceBeat(b, i, rawBeats.length)),
),
),
// Stage 3 — Painter (depends on cinemaOut + characters).
// On-stage characters for THIS scene are the ones in any beat — pass them
// all so the archetype block covers anyone the player might encounter.
const onStageCharacters = characters.filter((c) =>
allActiveNames.includes(c.name),
);
const declaredEntry = parsed.entryBeatId?.trim();
const entryBeatId =
declaredEntry && beats.some((b) => b.id === declaredEntry)
? declaredEntry
: beats[0]!.id;
return {
scene: {
id: newSceneId(),
scenePrompt: parsed.scenePrompt?.trim() || "an empty scene",
beats,
entryBeatId,
const tPainter = Date.now();
const sceneImageBase64 = await runPainter(
config,
{
integratedPrompt: cinemaOut.integratedPrompt,
styleGuide: session.styleGuide,
onStageCharacters,
priorSceneImage: priorSceneReference,
},
characterUpdates: coerceCharacterUpdates(parsed.characterUpdates),
entryBeat,
);
tlog("[directScene] Painter", tPainter);
// Stage 4 — best-effort upload of the final scene image so the NEXT
// sceneKey-match call can reference its UUID instead of carrying base64.
// If upload fails, the scene still works; only loses cheap referencing
// on the next hop. Skip the upload for mock images (static placeholder).
let imageUuid: string | undefined;
if (!config.mockImage) {
try {
const tUpload = Date.now();
imageUuid = await uploadImage(config.image, sceneImageBase64);
tlog("[directScene] image upload", tUpload);
} catch (err) {
const msg = err instanceof Error ? err.message : String(err);
console.warn(`[directScene] scene image upload failed: ${msg} — sceneKey reuse will need base64 fallback`);
}
}
const scene: Scene = {
id: newSceneId(),
// scenePrompt is the cinematographer's English compositional output;
// the Writer's sceneSummary stays in the session log via beats[]/
// history. Keeping the original field name preserves compat with
// anything that already reads scene.scenePrompt (e.g., insert-beat
// user prompt).
scenePrompt: cinemaOut.integratedPrompt,
beats: writerOut.beats,
entryBeatId: writerOut.entryBeatId,
sceneKey: writerOut.sceneKey,
imageUuid,
};
tlog("[directScene] TOTAL", tTotal);
return { scene, sceneImageBase64, characters };
}
// ──────────────────────────────────────────────────────────────────────
// directInsertBeat — generates a one-off transient beat in response to
// a freeform vision action that stays in-scene. Used by /api/insert-beat.
// directInsertBeat — single-agent path for vision-driven in-scene
// exploration. Generates ONE transient beat with NO new image, NO new
// characters. Multi-agent pipeline doesn't apply here (no rendering, no
// character introduction allowed by the prompt).
// ──────────────────────────────────────────────────────────────────────
export async function directInsertBeat(
@@ -326,13 +311,17 @@ export async function directInsertBeat(
const parsed = parseJsonLoose<InsertBeatPartial>(raw);
const narration = parsed.narration?.trim() || undefined;
const speaker = parsed.speaker?.trim() || undefined;
const rawSpeaker = parsed.speaker?.trim() || undefined;
// Pattern B (mirrors Writer): normalize POV variants → "你"; NPCs pass through.
const speaker = rawSpeaker ? normalizeSpeakerName(rawSpeaker) : undefined;
const line = parsed.line?.trim() || undefined;
const lineDelivery = line ? parsed.lineDelivery?.trim() || undefined : undefined;
// lineDelivery is only meaningful for NPC speakers (TTS). For POV ("你")
// TTS is intentionally skipped on the client, so lineDelivery is dropped.
const lineDelivery =
line && speaker !== POV_DISPLAY_NAME
? parsed.lineDelivery?.trim() || undefined
: undefined;
// If the model returned nothing usable, supply a fallback narration so the
// frontend doesn't append a silent empty beat that renders no dialogue —
// which would make the click appear to do nothing.
if (!narration && !speaker && !line) {
return { narration: "(你停下脚步,环视片刻。)" };
}