feat(engine): multi-agent character consistency pipeline (#6)
* feat(types): Character.voiceDescription rename + visual fields + Scene.sceneKey

Prepares the type surface for the multi-agent scene pipeline:

- Character.description → voiceDescription (clearer pairing with the new
  visualDescription)
- Character gains visualDescription (English appearance card for the Painter)
  + basePortraitBase64 + basePortraitUuid (for Runware referenceImages reuse)
- Scene gains sceneKey (English slug for cross-scene img2img continuity)
  + imageUuid (Runware UUID of the scene's rendered image, for cheap seedImage
  reuse on subsequent same-sceneKey calls)
- Beat gains activeCharacters[] so the Cinematographer can read which
  characters are on screen, and their poses, when composing the establishing
  shot

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ai-client): generateImage img2img + multi-reference options + uploadImage

Extends the Runware adapter to support the two anchoring mechanisms that
FLUX.2 [klein] 9B KV needs for character + scene visual consistency:

- generateImage gains optional { seedImage, referenceImages, strength }:
  seedImage drives img2img (single starting image, sceneKey continuity);
  referenceImages drives multi-reference anchoring (up to 4 character
  portraits, capped per Runware spec). Default strength 0.85; FLUX ignores
  strength < 0.8.
- uploadImage POSTs a base64 image to Runware's imageUpload taskType and
  returns the UUID, so portraits and scene snapshots can be referenced by UUID
  on subsequent calls instead of resending base64 every scene.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(engine): multi-agent scene pipeline (Writer→CharDesigner+Cinematographer→Painter)

Replaces the single-LLM directScene with a four-agent pipeline that specializes
each concern and parallelizes the slow parts.
Adopts the core idea from #4 (multi-agent dispatch + character visual
consistency) and grafts it onto the Scene/Beat architecture introduced in #2.

Pipeline per Scene (~9-12s critical path with parallelization):

  Writer LLM (serial, ~3s)
  │   outputs: sceneSummary + sceneKey + beats[] (each beat carries
  │            activeCharacters[] with poses)
  │
  ├─ CharacterDesigner LLM × N new chars (parallel)
  │    │   outputs: { visualDescription (English appearance card),
  │    │              voiceDescription (Chinese voice card) }
  │    ├─ FLUX portrait gen → upload → UUID      (parallel within agent)
  │    └─ Xiaomi MiMo voicedesign provision      (parallel within agent)
  │
  └─ Cinematographer LLM (parallel with CharacterDesigner)
         outputs: { shotType, integratedPrompt (English composition +
                    camera position + character blocking) }

  Painter (FLUX img2img + referenceImages, ~1-3s)
    inputs: integratedPrompt + onStageCharacters' archetype block
            + (optional) prior sceneKey-hit scene as seedImage
            + (optional) character portrait UUIDs as referenceImages
    fallback chain: A) both anchors → B) refs only (preserves characters)
                    → C) seed only (preserves background) → D) pure t2i
    output uploaded → Scene.imageUuid for the next sceneKey hop

Why this carving:

- Writer focuses purely on narrative (drops the voice-design duty that
  staging's DIRECTOR_SYSTEM was carrying as a side concern).
- CharacterDesigner bundles visual + voice so the agent that thinks "who is
  this character" produces internally consistent appearance + vocal
  personality (split agents tend to diverge).
- Cinematographer doesn't need character visualDescriptions (the Painter
  appends archetypes afterwards), so it parallelizes with CharacterDesigner.
- sceneKey enables cross-scene backdrop continuity that Scene/Beat doesn't
  cover (Scene/Beat only reuses the backdrop WITHIN a scene's beats; sceneKey
  reuses it across scenes that share a location).

Other changes:

- voice.ts loses provisionVoicesForScene (moved into CharacterDesigner);
  keeps synthesizeBeat for the lazy per-beat /api/beat-audio path.
- renderer.ts deleted (replaced by agents/painter.ts).
- directInsertBeat (vision-driven in-scene exploration) stays single-LLM: it
  forbids new characters and produces no image, so the multi-agent pipeline
  doesn't apply.

apps/web is unchanged: orchestrator.ts keeps the same exports (startSession /
requestScene / visionDecide / requestInsertBeat / requestBeatAudio) with
identical request/response shapes.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): Pattern B player POV + JSON repair + drop seedImage tier

Three hotfixes surfaced by manual end-to-end testing of the multi-agent
pipeline.

F1. Player viewpoint (galgame Pattern B):

- Writer accepts speaker="你" ("you") for player dialog (rendered in the
  dialog box, never TTS'd because no Character record exists for "你").
  POV variants (玩家/我/主角/protagonist/player/I/me/..., i.e. "player", "I",
  "protagonist" and their English equivalents) are filtered from
  activeCharacters so CharacterDesigner never wastes API calls on the player.
  Two-layer defense: an explicit prompt rule in WRITER_SYSTEM plus code
  normalization (POV_VARIANTS set, isPovName, normalizeSpeakerName).
- Cinematographer and Painter prompts gain a "player never in frame" rule so
  the player never appears in any rendered scene.
- Cinematographer gains a dynamic camera policy driven by the entry beat's
  speaker: NPC speaker → close-up looking toward the camera; "你" speaker →
  medium shot of the attentive NPC; no speaker → wide establishing shot.
- director.ts filters POV names from orphanSpeakers so provisionVoiceForName
  never fires for "你".

F2. JSON parsing robustness:

- parseJsonLoose gains a 4th repair tier: strip JS-style comments, strip
  trailing commas, insert missing commas between adjacent objects / arrays /
  quoted values. It logs the first 800 chars of the raw LLM output when all
  repair attempts fail, so we can see what the model emitted.

F3. Drop seedImage, use referenceImages for prior scene:

- FLUX.2 [klein] 9B KV does not support seedImage (img2img). Removed Tier A
  (seedImage+refs) and Tier C (seedImage only) from the Painter degradation
  chain.
  New layout: the prior scene's image slots into referenceImages[0] for
  spatial continuity, and character portraits fill slots 1-3 (Runware caps at
  4 total). The Cinematographer is instructed to emphasize continuity when
  sceneKey matches a prior scene.

All five package typechecks pass.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): address Copilot review feedback on #6

Three targeted fixes from PR #6 Copilot review.

F4. Stale seedImage/img2img docstrings

Four locations still referenced the original img2img design after F3 switched
to referenceImages-based spatial continuity:

- types/index.ts:57   Scene.sceneKey docstring
- types/index.ts:63   Scene.imageUuid docstring
- director.ts:34      pipeline diagram in module block comment
- director.ts:128     directScene JSDoc

Doc-only changes; misleading wording corrected to mention referenceImages.
(The design-rationale comment in pickPriorSceneReference is kept: it explains
WHY we don't use seedImage and is load-bearing context.)

F5. Remove JS-comment stripping from JSON repair pass

parseJsonLoose's repair tier previously stripped `// ...` and `/* ... */`
across the entire text, which would corrupt JSON string values containing
URLs (e.g. "https://example.com" → "https:"). Since LLMs in
`responseFormat: "json_object"` mode essentially never emit comments, dropping
the comment-stripping step is a net win for safety. Trailing-comma and
missing-comma repair (the high-frequency failures) are kept.

F6. Pattern B parity on the insert-beat path

Previously: directInsertBeat's INSERT_BEAT_SYSTEM forbade any speaker not in
session.characters, and the orchestrator's unregistered-speaker guard demoted
such lines to narration. This meant the player could not speak via
speaker="你" in transient in-scene beats, which was inconsistent with the
Writer path.
Fix:

- INSERT_BEAT_SYSTEM prompt now allows speaker="你" (NPC name OR "你") and
  rejects other POV variants
- directInsertBeat applies normalizeSpeakerName to the LLM output, same as the
  Writer path, so POV variants collapse to "你"
- lineDelivery is dropped when speaker="你" (no TTS for the player)
- orchestrator's unregistered-speaker guard adds a `speaker !== "你"`
  exception so Pattern B player dialog passes through

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(engine): drop "JS-style comments" from parseJsonLoose header

The function header listed JS-style comments as a step-4 repair, but F5 already
removed comment stripping from `repairJsonString` because the regex would
corrupt URLs inside JSON string values. The inner function's comment was
updated then; this header was missed.

Doc-only sync from second-round Copilot review on #6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: QiChen88 <2291969160@qq.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
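
The F2/F5 repair pass (strip trailing commas, insert missing commas, leave string contents alone) can be sketched roughly as below. This is a minimal illustration, not the project's `parseJsonLoose` or `repairJsonString`: the names `repairJsonSketch` and `parseWithRepair` are hypothetical, and the sketch only handles missing commas between objects, arrays, and quoted values. It scans character by character and tracks whether it is inside a string, which is one way to avoid the URL corruption that F5 describes.

```typescript
// Hypothetical sketch; not the project's parseJsonLoose / repairJsonString.
function repairJsonSketch(text: string): string {
  let out = "";
  let inString = false;
  let lastSig = ""; // last significant character seen outside string bodies
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (inString) {
      out += ch;
      if (ch === "\\") {
        i += 1;
        out += text.charAt(i); // copy the escaped character untouched
      } else if (ch === '"') {
        inString = false;
        lastSig = '"';
      }
      continue;
    }
    if (/\s/.test(ch)) {
      out += ch;
      continue;
    }
    if (ch === ",") {
      // Trailing comma: drop it when the next significant char closes a container.
      let j = i + 1;
      while (j < text.length && /\s/.test(text.charAt(j))) j += 1;
      const nextSig = text.charAt(j);
      if (nextSig === "}" || nextSig === "]") continue;
    }
    // Missing comma: one value just ended and another one starts right away.
    const endsValue = lastSig === "}" || lastSig === "]" || lastSig === '"';
    const startsValue = ch === "{" || ch === "[" || ch === '"';
    if (endsValue && startsValue) out += ",";
    if (ch === '"') inString = true;
    out += ch;
    lastSig = ch;
  }
  return out;
}

function parseWithRepair(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch {
    try {
      return JSON.parse(repairJsonSketch(raw));
    } catch (err) {
      // Mirrors the diagnostic described in F2: show the head of the raw output.
      console.error("JSON repair failed; raw head:", raw.slice(0, 800));
      throw err;
    }
  }
}

const sample = '{"url": "https://example.com/a,}", "tags": ["x" "y",], }';
console.log(parseWithRepair(sample));
```

Because commas and slashes inside quoted strings are copied verbatim, the URL in `sample` survives the repair while the stray commas outside strings are fixed.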
+267
-278
@@ -1,309 +1,294 @@
import { chat } from "@yume/ai-client";
import { chat, uploadImage } from "@yume/ai-client";
import type {
  Beat,
  BeatChoice,
  BeatChoiceEffect,
  BeatNext,
  Character,
  EngineConfig,
  InsertBeatPartial,
  ProviderConfig,
  Scene,
  Session,
} from "@yume/types";
import { parseJsonLoose } from "./jsonParser";
import { designCharacter, provisionVoiceForName } from "./agents/characterDesigner";
import { runCinematographer } from "./agents/cinematographer";
import { runPainter } from "./agents/painter";
import {
  DIRECTOR_SYSTEM,
  INSERT_BEAT_SYSTEM,
  buildDirectorUserMessage,
  buildInsertBeatUserMessage,
} from "./prompts";
  collectActiveCharacterNames,
  isPovName,
  normalizeSpeakerName,
  POV_DISPLAY_NAME,
  runWriter,
} from "./agents/writer";
import { parseJsonLoose } from "./jsonParser";
import { INSERT_BEAT_SYSTEM, buildInsertBeatUserMessage } from "./prompts";

// ──────────────────────────────────────────────────────────────────────
// Raw shape produced by the model — we coerce + validate into a Scene.
// ──────────────────────────────────────────────────────────────────────

type RawEffect = {
  kind?: string;
  targetBeatId?: string;
  nextSceneSeed?: string;
};

type RawChoice = {
  id?: string;
  label?: string;
  effect?: RawEffect;
};

type RawNext = {
  type?: string;
  nextBeatId?: string;
  choices?: RawChoice[];
};

type RawBeat = {
  id?: string;
  narration?: string;
  speaker?: string;
  line?: string;
  lineDelivery?: string;
  next?: RawNext;
};

type RawCharacterUpdate = {
  name?: string;
  description?: string;
};

type RawScene = {
  scenePrompt?: string;
  entryBeatId?: string;
  beats?: RawBeat[];
  characterUpdates?: RawCharacterUpdate[];
};

function coerceEffect(raw: RawEffect | undefined): BeatChoiceEffect {
  if (raw?.kind === "advance-beat" && raw.targetBeatId?.trim()) {
    return { kind: "advance-beat", targetBeatId: raw.targetBeatId.trim() };
  }
  return {
    kind: "change-scene",
    nextSceneSeed: raw?.nextSceneSeed?.trim() || "未指定",
  };
}

function coerceChoice(raw: RawChoice, idx: number): BeatChoice {
  return {
    id: raw.id?.trim() || `c${idx + 1}`,
    label: raw.label?.trim() || `选项 ${idx + 1}`,
    effect: coerceEffect(raw.effect),
  };
}

function coerceNext(raw: RawNext | undefined, fallbackBeatId: string): BeatNext {
  if (raw?.type === "choice" && Array.isArray(raw.choices) && raw.choices.length) {
    return {
      type: "choice",
      choices: raw.choices.map((c, i) => coerceChoice(c, i)),
    };
  }
  return {
    type: "continue",
    nextBeatId: raw?.nextBeatId?.trim() || fallbackBeatId,
  };
}

function coerceBeat(raw: RawBeat, idx: number, totalBeats: number): Beat {
  const id = raw.id?.trim() || `b${idx + 1}`;
  // Non-last beats default their `continue` target to the following beat.
  // The last beat gets an empty fallback on purpose: repairBeats() turns a
  // last/dangling continue into a real scene-change exit so the player can
  // never get stuck self-looping on it.
  const fallback = idx + 1 < totalBeats ? `b${idx + 2}` : "";
  const line = raw.line?.trim() || undefined;
  return {
    id,
    narration: raw.narration?.trim() || undefined,
    speaker: raw.speaker?.trim() || undefined,
    line,
    // lineDelivery only meaningful when there is a line to deliver.
    lineDelivery: line ? raw.lineDelivery?.trim() || undefined : undefined,
    next: coerceNext(raw.next, fallback),
  };
}

function coerceCharacterUpdates(raw: RawCharacterUpdate[] | undefined): Character[] {
  if (!Array.isArray(raw)) return [];
  return raw
    .map((c) => ({
      name: c.name?.trim() ?? "",
      description: c.description?.trim() ?? "",
    }))
    .filter((c) => c.name && c.description);
}

const FALLBACK_SEED = "故事继续推进";

function fallbackExitChoice(beatId: string): BeatChoice {
  return {
    id: `${beatId}__exit`,
    label: "继续",
    effect: { kind: "change-scene", nextSceneSeed: FALLBACK_SEED },
  };
}

// Beat ids are graph keys (the front-end's `beats.find(b => b.id === ...)`,
// the session's `visitedBeatIds`, and `continue`/`advance-beat` targets). If
// the model reuses an id across beats, the second occurrence becomes silently
// unreachable and external references collapse to the first beat. Rename
// duplicates; rewrite the renamed beat's OWN self-references (the most
// natural interpretation of a duplicate id being referenced from inside that
// same beat). External references stay pointing at the first occurrence.
function ensureUniqueBeatIds(beats: Beat[]): Beat[] {
  const seen = new Set<string>();
  return beats.map((b): Beat => {
    if (!seen.has(b.id)) {
      seen.add(b.id);
      return b;
    }
    const oldId = b.id;
    let n = 2;
    while (seen.has(`${oldId}_${n}`)) n += 1;
    const newId = `${oldId}_${n}`;
    seen.add(newId);

    let next = b.next;
    if (next.type === "continue" && next.nextBeatId === oldId) {
      next = { type: "continue", nextBeatId: newId };
    } else if (next.type === "choice") {
      next = {
        type: "choice",
        choices: next.choices.map((c) =>
          c.effect.kind === "advance-beat" && c.effect.targetBeatId === oldId
            ? {
                ...c,
                effect: { kind: "advance-beat" as const, targetBeatId: newId },
              }
            : c,
        ),
      };
    }
    return { ...b, id: newId, next };
  });
}

// Repairs referential integrity AND guarantees the scene is escapable:
//  - a `continue` to a missing/self id is repointed to the next beat in order;
//    a last/dangling continue with nowhere to go becomes a scene-change exit
//    (never a self-loop, which would strand the player on "click to advance")
//  - an `advance-beat` to a missing id is downgraded to a scene change
//  - if no change-scene exit exists anywhere, one is appended to the last beat
function repairBeats(beats: Beat[]): Beat[] {
  const ids = new Set(beats.map((b) => b.id));

  const fixed: Beat[] = beats.map((b, idx): Beat => {
    if (b.next.type === "continue") {
      const target = b.next.nextBeatId;
      if (ids.has(target) && target !== b.id) return b;
      const nextByIndex = beats[idx + 1]?.id;
      if (nextByIndex) {
        return { ...b, next: { type: "continue", nextBeatId: nextByIndex } };
      }
      return { ...b, next: { type: "choice", choices: [fallbackExitChoice(b.id)] } };
    }

    const patched = b.next.choices.map((c) =>
      c.effect.kind === "advance-beat" && !ids.has(c.effect.targetBeatId)
        ? {
            ...c,
            effect: {
              kind: "change-scene" as const,
              nextSceneSeed: "未指定(导演引用不存在的 beat,已降级为换场)",
            },
          }
        : c,
    );
    return { ...b, next: { type: "choice", choices: patched } };
  });

  const hasExit = fixed.some(
    (b) =>
      b.next.type === "choice" &&
      b.next.choices.some((c) => c.effect.kind === "change-scene"),
  );
  if (!hasExit && fixed.length > 0) {
    const lastIdx = fixed.length - 1;
    const last = fixed[lastIdx]!;
    const existing = last.next.type === "choice" ? last.next.choices : [];
    fixed[lastIdx] = {
      ...last,
      next: { type: "choice", choices: [...existing, fallbackExitChoice(last.id)] },
    };
  }

  return fixed;
}

// Choice ids are the keys the front-end uses to cache and consume prefetched
// scenes. Two beats both defaulting to c1/c2 (or the model reusing ids across
// beats) would make a transition reuse the WRONG prefetched scene — so force
// every choice id to be unique within the scene.
function ensureUniqueChoiceIds(beats: Beat[]): Beat[] {
  const seen = new Set<string>();
  for (const b of beats) {
    if (b.next.type !== "choice") continue;
    for (const c of b.next.choices) {
      if (seen.has(c.id)) {
        let n = 2;
        while (seen.has(`${c.id}_${n}`)) n += 1;
        c.id = `${c.id}_${n}`;
      }
      seen.add(c.id);
    }
  }
  return beats;
}
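
The `_n` suffix rule used by the id de-duplication helpers above can be tried in isolation with a small sketch. `uniquifyIds` is a hypothetical name introduced only for illustration; unlike `ensureUniqueChoiceIds`, which mutates choice objects in place, it works on a plain array of strings and returns a new array.

```typescript
// Hypothetical helper: applies the same "_2, _3, ..." suffix rule to plain ids.
function uniquifyIds(ids: string[]): string[] {
  const seen = new Set<string>();
  return ids.map((id) => {
    let unique = id;
    if (seen.has(unique)) {
      // Find the first free suffix, starting at 2.
      let n = 2;
      while (seen.has(`${id}_${n}`)) n += 1;
      unique = `${id}_${n}`;
    }
    seen.add(unique);
    return unique;
  });
}

const dedupedIds = uniquifyIds(["c1", "c2", "c1", "c1"]);
console.log(dedupedIds); // ["c1", "c2", "c1_2", "c1_3"]
```

The first occurrence always keeps its original id, so anything that already points at `c1` still resolves to the first choice.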
// ══════════════════════════════════════════════════════════════════════
// director.ts — multi-agent orchestrator for one full Scene generation.
//
// Critical path (per Scene call):
//
//   Writer LLM (~3s, serial)
//        │
//        ├─ CharacterDesigner LLM × N (parallel per new char)
//        │     │
//        │     ├─ portrait gen + upload   (parallel within agent)
//        │     └─ voice provisioning      (parallel within agent)
//        │
//        ├─ Cinematographer LLM (parallel with all of the above)
//        │
//        └─ wait for all parallel branches
//             │
//             ▼
//   Painter (FLUX referenceImages — two-tier degradation chain)
//             │
//             ▼
//   upload final scene image → Scene.imageUuid
//             │
//             ▼
//   return { scene, sceneImageBase64, characters }
//
// The Cinematographer intentionally does NOT depend on CharacterDesigner
// output — it only positions named characters in the frame, not their
// appearance. This unlocks the parallelism that makes the full pipeline
// ~9-12s instead of ~15-18s serial.
// ══════════════════════════════════════════════════════════════════════

function newSceneId(): string {
  return `scene_${Date.now()}_${Math.random().toString(36).slice(2, 6)}`;
}

// ──────────────────────────────────────────────────────────────────────
// directScene — generates one Scene (multi-beat) for the player.
// Called both on real scene transitions AND on speculative prefetch.
// ──────────────────────────────────────────────────────────────────────
function tlog(label: string, t0: number): void {
  console.log(`${label}: ${Date.now() - t0}ms`);
}

// Merge a freshly-designed Character into a registry, preserving any
// previously-set voice/portrait that the new design didn't fill in (so
// re-designing a known character can't silently drop their voice or wipe
// out an already-generated portrait UUID). Match by name.
export function mergeCharacters(
  existing: Character[],
  updates: Character[],
): Character[] {
  if (updates.length === 0) return existing;
  const byName = new Map(existing.map((c) => [c.name, c]));
  for (const u of updates) {
    const prev = byName.get(u.name);
    if (!prev) {
      byName.set(u.name, u);
      continue;
    }
    // Preserve any prior provisioned resource that the new design omitted.
    byName.set(u.name, {
      ...u,
      voice: u.voice ?? prev.voice,
      visualDescription: u.visualDescription ?? prev.visualDescription,
      basePortraitBase64: u.basePortraitBase64 ?? prev.basePortraitBase64,
      basePortraitUuid: u.basePortraitUuid ?? prev.basePortraitUuid,
      voiceDescription: u.voiceDescription || prev.voiceDescription,
    });
  }
  return Array.from(byName.values());
}

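
A short usage sketch may make the merge semantics of `mergeCharacters` concrete. The `Character` type below is an assumed minimal shape (the real type lives in `@yume/types` and may differ, in particular `voice` is assumed to be a string here); the function body is copied from the diff so the example runs on its own.

```typescript
// Assumed minimal shape for illustration; the real Character type may differ.
type Character = {
  name: string;
  voiceDescription: string;
  voice?: string;
  visualDescription?: string;
  basePortraitBase64?: string;
  basePortraitUuid?: string;
};

function mergeCharacters(existing: Character[], updates: Character[]): Character[] {
  if (updates.length === 0) return existing;
  const byName = new Map(existing.map((c) => [c.name, c]));
  for (const u of updates) {
    const prev = byName.get(u.name);
    if (!prev) {
      byName.set(u.name, u);
      continue;
    }
    byName.set(u.name, {
      ...u,
      voice: u.voice ?? prev.voice,
      visualDescription: u.visualDescription ?? prev.visualDescription,
      basePortraitBase64: u.basePortraitBase64 ?? prev.basePortraitBase64,
      basePortraitUuid: u.basePortraitUuid ?? prev.basePortraitUuid,
      voiceDescription: u.voiceDescription || prev.voiceDescription,
    });
  }
  return Array.from(byName.values());
}

// A known character is re-designed without a voice or portrait.
const registry: Character[] = [
  { name: "Aoi", voiceDescription: "calm alto", voice: "voice-1", basePortraitUuid: "uuid-1" },
];
const redesign: Character[] = [
  { name: "Aoi", voiceDescription: "", visualDescription: "short black hair" },
];
const mergedRegistry = mergeCharacters(registry, redesign);
console.log(mergedRegistry[0]);
```

Note the two operators: optional fields fall back with `??` (only when the update is `null` or `undefined`), while `voiceDescription` falls back with `||`, so an empty string in the update also keeps the previous card. When `updates` is empty the function returns the `existing` array itself rather than a copy.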
// Pick a reference to the prior scene image when sceneKey matches a prior
// scene — used by the Painter as one of the `referenceImages` (NOT as a
// seedImage, because FLUX.2 [klein] 9B KV does not support seedImage).
//
// Returns the UUID if available (cheap reference, ~36 chars over the wire),
// else the base64 of the most recent matching scene's image. Returns
// undefined when no prior scene shares the current sceneKey.
function pickPriorSceneReference(
  session: Session,
  currentSceneKey: string | undefined,
  priorImageBase64ByUuid: Map<string, string>,
): { priorSceneReference?: string; priorSceneKey?: string } {
  if (!currentSceneKey) return {};
  for (let i = session.history.length - 1; i >= 0; i--) {
    const prior = session.history[i]!.scene;
    if (prior.sceneKey === currentSceneKey) {
      if (prior.imageUuid) {
        return {
          priorSceneReference: prior.imageUuid,
          priorSceneKey: prior.sceneKey,
        };
      }
      const cached = priorImageBase64ByUuid.get(prior.id);
      if (cached) {
        return { priorSceneReference: cached, priorSceneKey: prior.sceneKey };
      }
    }
  }
  return {};
}

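
The commit message describes the slot layout the Painter uses once a prior-scene reference is found: the prior scene goes into `referenceImages[0]` and character portraits fill the remaining slots, capped at 4 in total. A minimal sketch of that layout follows. `buildReferenceImages` is a hypothetical helper, not the code in `agents/painter.ts`, and letting portraits use all four slots when there is no prior scene is an assumption.

```typescript
const MAX_REFERENCE_IMAGES = 4; // cap stated in the commit message

// Hypothetical helper; the real Painter may assemble its references differently.
function buildReferenceImages(
  priorSceneReference: string | undefined,
  portraitRefs: string[],
): string[] {
  const refs: string[] = [];
  // Slot 0: prior scene image (UUID or base64) for spatial continuity.
  if (priorSceneReference) refs.push(priorSceneReference);
  // Remaining slots: character portraits, until the cap is reached.
  for (const portrait of portraitRefs) {
    if (refs.length >= MAX_REFERENCE_IMAGES) break;
    refs.push(portrait);
  }
  return refs;
}

console.log(buildReferenceImages("scene-uuid", ["p1", "p2", "p3", "p4"]));
```

With a prior scene present, only the first three portraits fit; the fourth is silently dropped, which is why the order of `portraitRefs` matters.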
export type SceneResult = {
  scene: Scene;
  characterUpdates: Character[];
  sceneImageBase64: string;
  characters: Character[];
};

// ──────────────────────────────────────────────────────────────────────
// directScene — the multi-agent pipeline. Used by orchestrator's
// startSession and requestScene.
//
// priorImageBase64ByUuid: optional map from prior Scene.id → base64
// the caller has on-hand. If a sceneKey-hit scene's imageUuid is missing
// but the base64 is cached locally, we can still feed it as one of the
// Painter's referenceImages. Pass an empty map when caller has no cache
// (orchestrator does pass it for the start-session bootstrap).
// ──────────────────────────────────────────────────────────────────────

export async function directScene(
  config: ProviderConfig,
  config: EngineConfig,
  session: Session,
  priorImageBase64ByUuid: Map<string, string> = new Map(),
): Promise<SceneResult> {
  const raw = await chat(
    config,
    [
      { role: "system", content: DIRECTOR_SYSTEM },
      { role: "user", content: buildDirectorUserMessage(session) },
    ],
    { temperature: 0.9, responseFormat: "json_object" },
  const tTotal = Date.now();

  // Stage 1 — Writer (serial; everything downstream needs sceneSummary +
  // beats[] to know who's on stage and what to compose around).
  const tWriter = Date.now();
  const writerOut = await runWriter(config.text, session);
  tlog("[directScene] Writer", tWriter);

  // Identify NEW characters introduced by this scene that need to be
  // designed (LLM + portrait + voice). Existing characters in the registry
  // are skipped — their cards / portraits / voices persist across scenes.
  const allActiveNames = collectActiveCharacterNames(writerOut.beats);
  const newCharNames = allActiveNames.filter(
    (n) => !session.characters.some((c) => c.name === n),
  );

  const parsed = parseJsonLoose<RawScene>(raw);
  const rawBeats = Array.isArray(parsed.beats) ? parsed.beats : [];
  if (rawBeats.length === 0) {
    throw new Error("Director returned no beats");
  // Find the entry beat for the Cinematographer (which characters are
  // on-screen in the establishing shot).
  const entryBeat = writerOut.beats.find((b) => b.id === writerOut.entryBeatId);
  const entryBeatActive = entryBeat?.activeCharacters ?? [];

  // For sceneKey-based visual continuity, look up the prior matching scene's
  // image to slot into Painter's referenceImages (max 4 of which include
  // character portraits too).
  const { priorSceneReference, priorSceneKey } = pickPriorSceneReference(
    session,
    writerOut.sceneKey,
    priorImageBase64ByUuid,
  );

  // Stage 2 — parallel: CharacterDesigner(s) and Cinematographer.
  // Cinematographer doesn't need character visualDescriptions (those are
  // appended at Painter stage), so it runs concurrently with chardesign.
  const tParallel = Date.now();

  const designPromises = newCharNames.map((name) =>
    designCharacter(config, session, name).catch((err): Character => {
      const msg = err instanceof Error ? err.message : String(err);
      console.error(`[directScene] designCharacter(${name}) failed: ${msg}`);
      // Last-resort fallback: register with name only so the speaker isn't
      // unknown. Caller may try voice provisioning later or skip.
      return {
        name,
        voiceDescription: `请根据角色名「${name}」推断其性别、年龄与气质。所属世界观:${session.worldSetting}`,
      };
    }),
  );

  const cinemaPromise = runCinematographer(config.text, {
    sceneSummary: writerOut.sceneSummary,
    styleGuide: session.styleGuide,
    entryBeatActive,
    entryBeatSpeaker: entryBeat?.speaker,
    priorSceneKey,
    currentSceneKey: writerOut.sceneKey,
  });

  const [designedChars, cinemaOut] = await Promise.all([
    Promise.all(designPromises),
    cinemaPromise,
  ]);
  tlog("[directScene] CharacterDesigner+Cinematographer parallel", tParallel);

  // Merge new chars into a working registry that we'll pass to the Painter.
  const characters = mergeCharacters(session.characters, designedChars);

  // Edge case: a speaker referenced by the Writer might not have been in
  // `activeCharacters` of any beat (LLM oversight), so they got skipped by
  // newCharNames. Catch them here and at least provision a voice so the
  // beat-audio path doesn't render silent. No portrait — they weren't
  // visible in the scene, so visual consistency doesn't matter for them.
  const speakerNames = new Set(
    writerOut.beats.map((b) => b.speaker).filter((n): n is string => Boolean(n)),
  );
  const orphanSpeakers = [...speakerNames].filter(
    // Pattern B: "你" (player) is a valid speaker but never gets a Character
    // record — TTS is intentionally skipped on the client. Filter POV out so
    // provisionVoiceForName isn't accidentally invoked for the player.
    (n) => !isPovName(n) && !characters.some((c) => c.name === n),
  );
  if (orphanSpeakers.length > 0) {
    const orphans = await Promise.all(
      orphanSpeakers.map((n) => provisionVoiceForName(config, session, n)),
    );
    const merged = mergeCharacters(characters, orphans);
    characters.splice(0, characters.length, ...merged);
  }

  const beats = ensureUniqueChoiceIds(
    repairBeats(
      ensureUniqueBeatIds(
        rawBeats.map((b, i) => coerceBeat(b, i, rawBeats.length)),
      ),
    ),
  // Stage 3 — Painter (depends on cinemaOut + characters).
  // On-stage characters for THIS scene are the ones in any beat — pass them
  // all so the archetype block covers anyone the player might encounter.
  const onStageCharacters = characters.filter((c) =>
    allActiveNames.includes(c.name),
  );

  const declaredEntry = parsed.entryBeatId?.trim();
  const entryBeatId =
    declaredEntry && beats.some((b) => b.id === declaredEntry)
      ? declaredEntry
      : beats[0]!.id;

  return {
    scene: {
      id: newSceneId(),
      scenePrompt: parsed.scenePrompt?.trim() || "an empty scene",
      beats,
      entryBeatId,
  const tPainter = Date.now();
  const sceneImageBase64 = await runPainter(
    config,
    {
      integratedPrompt: cinemaOut.integratedPrompt,
      styleGuide: session.styleGuide,
      onStageCharacters,
      priorSceneImage: priorSceneReference,
    },
    characterUpdates: coerceCharacterUpdates(parsed.characterUpdates),
    entryBeat,
  );
  tlog("[directScene] Painter", tPainter);

  // Stage 4 — best-effort upload of the final scene image so the NEXT
  // sceneKey-match call can reference its UUID instead of carrying base64.
  // If upload fails, the scene still works; only loses cheap referencing
  // on the next hop. Don't wait on mock images (static placeholder).
  let imageUuid: string | undefined;
  if (!config.mockImage) {
    try {
      const tUpload = Date.now();
      imageUuid = await uploadImage(config.image, sceneImageBase64);
      tlog("[directScene] image upload", tUpload);
    } catch (err) {
      const msg = err instanceof Error ? err.message : String(err);
      console.warn(`[directScene] scene image upload failed: ${msg} — sceneKey reuse will need base64 fallback`);
    }
  }

  const scene: Scene = {
    id: newSceneId(),
    // scenePrompt is the cinematographer's English compositional output;
    // the Writer's sceneSummary stays in the session log via beats[]/
    // history. Keeping the original field name preserves compat with
    // anything that already reads scene.scenePrompt (e.g., insert-beat
    // user prompt).
    scenePrompt: cinemaOut.integratedPrompt,
    beats: writerOut.beats,
    entryBeatId: writerOut.entryBeatId,
    sceneKey: writerOut.sceneKey,
    imageUuid,
  };

  tlog("[directScene] TOTAL", tTotal);

  return { scene, sceneImageBase64, characters };
}

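
The parallel stage of `directScene` (per-character design calls that each fall back on failure, run alongside the Cinematographer) can be sketched with stand-in agents. Everything named `fake...`, `DesignedCharacter`, and `runParallelStage` below is hypothetical; the real agents call LLM and image APIs and take different arguments.

```typescript
type DesignedCharacter = { name: string; voiceDescription: string };

// Stand-in agents (hypothetical). "Rin" simulates a failing upstream call.
async function fakeDesignCharacter(name: string): Promise<DesignedCharacter> {
  if (name === "Rin") throw new Error("upstream timeout");
  return { name, voiceDescription: `voice card for ${name}` };
}

async function fakeCinematographer(summary: string): Promise<string> {
  return `wide shot: ${summary}`;
}

async function runParallelStage(newNames: string[], summary: string) {
  // Each design promise carries its own catch, so one failure cannot reject
  // the whole Promise.all; the character is registered with a fallback card.
  const designPromises = newNames.map((name) =>
    fakeDesignCharacter(name).catch(
      (): DesignedCharacter => ({ name, voiceDescription: `fallback card for ${name}` }),
    ),
  );
  // The outer Promise.all waits for every designer and the cinematographer.
  const [designed, integratedPrompt] = await Promise.all([
    Promise.all(designPromises),
    fakeCinematographer(summary),
  ]);
  return { designed, integratedPrompt };
}

runParallelStage(["Aoi", "Rin"], "rooftop at dusk").then((out) => {
  console.log(out.designed.map((c) => c.voiceDescription), out.integratedPrompt);
});
```

The stage takes as long as its slowest branch rather than the sum of all branches, which is the source of the latency saving the module header describes. In this sketch the cinematographer promise has no catch, so a failure there still rejects the whole stage.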
// ──────────────────────────────────────────────────────────────────────
// directInsertBeat — generates a one-off transient beat in response to
// a freeform vision action that stays in-scene. Used by /api/insert-beat.
// directInsertBeat — single-agent path for vision-driven in-scene
// exploration. Generates ONE transient beat with NO new image, NO new
// characters. Multi-agent pipeline doesn't apply here (no rendering, no
// character introduction allowed by the prompt).
// ──────────────────────────────────────────────────────────────────────

export async function directInsertBeat(
@@ -326,13 +311,17 @@ export async function directInsertBeat(
  const parsed = parseJsonLoose<InsertBeatPartial>(raw);

  const narration = parsed.narration?.trim() || undefined;
  const speaker = parsed.speaker?.trim() || undefined;
  const rawSpeaker = parsed.speaker?.trim() || undefined;
  // Pattern B (mirrors Writer): normalize POV variants → "你"; NPCs pass through.
  const speaker = rawSpeaker ? normalizeSpeakerName(rawSpeaker) : undefined;
  const line = parsed.line?.trim() || undefined;
  const lineDelivery = line ? parsed.lineDelivery?.trim() || undefined : undefined;
  // lineDelivery is only meaningful for NPC speakers (TTS). For POV ("你")
  // TTS is intentionally skipped on the client, so lineDelivery is dropped.
  const lineDelivery =
    line && speaker !== POV_DISPLAY_NAME
      ? parsed.lineDelivery?.trim() || undefined
      : undefined;

  // If the model returned nothing usable, supply a fallback narration so the
  // frontend doesn't append a silent empty beat that renders no dialogue —
  // which would make the click appear to do nothing.
  if (!narration && !speaker && !line) {
    return { narration: "(你停下脚步,环视片刻。)" };
  }

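
The Pattern B handling in this hunk (collapse POV variants to "你", then drop `lineDelivery` for the player) can be exercised on its own with the sketch below. The variant set is an assumed subset taken from the list in the commit message, the case-insensitive match is an assumption, and `normalizeSpeakerNameSketch` and `normalizeInsertBeat` are hypothetical names; the real `normalizeSpeakerName` lives in `agents/writer` and may behave differently.

```typescript
const POV_DISPLAY_NAME = "你";
// Assumed subset of POV variants (from the commit message), stored lowercased.
const POV_VARIANTS = new Set(["你", "玩家", "我", "主角", "protagonist", "player", "i", "me"]);

// Hypothetical stand-in for normalizeSpeakerName.
function normalizeSpeakerNameSketch(name: string): string {
  const trimmed = name.trim();
  return POV_VARIANTS.has(trimmed.toLowerCase()) ? POV_DISPLAY_NAME : trimmed;
}

type InsertBeatFields = {
  speaker?: string | undefined;
  line?: string | undefined;
  lineDelivery?: string | undefined;
};

function normalizeInsertBeat(parsed: InsertBeatFields): InsertBeatFields {
  const rawSpeaker = parsed.speaker?.trim() || undefined;
  const speaker = rawSpeaker ? normalizeSpeakerNameSketch(rawSpeaker) : undefined;
  const line = parsed.line?.trim() || undefined;
  // Delivery hints only matter for NPC lines that will be synthesized.
  const lineDelivery =
    line && speaker !== POV_DISPLAY_NAME
      ? parsed.lineDelivery?.trim() || undefined
      : undefined;
  return { speaker, line, lineDelivery };
}

console.log(normalizeInsertBeat({ speaker: " Player ", line: "Hello", lineDelivery: "softly" }));
console.log(normalizeInsertBeat({ speaker: "Aoi", line: "Hi", lineDelivery: "cheerful" }));
```

The first call yields speaker "你" with no `lineDelivery`; the second leaves the NPC name and its delivery hint untouched.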