feat(engine): multi-agent character consistency pipeline (#6)
* feat(types): Character.voiceDescription rename + visual fields + Scene.sceneKey

  Prepares the type surface for the multi-agent scene pipeline:

  - Character.description → voiceDescription (clearer pairing with new
    visualDescription)
  - Character gains visualDescription (English appearance card for Painter)
    + basePortraitBase64 + basePortraitUuid (for Runware referenceImages
    reuse)
  - Scene gains sceneKey (English slug for cross-scene img2img continuity)
    + imageUuid (Runware UUID of the scene's rendered image for cheap
    seedImage reuse on subsequent same-sceneKey calls)
  - Beat gains activeCharacters[] so the Cinematographer can read which
    characters are on-screen + their poses when composing the establishing
    shot

  Co-Authored-By: QiChen88 <2291969160@qq.com>
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ai-client): generateImage img2img + multi-reference options + uploadImage

  Extends the Runware adapter to support the two anchoring mechanisms
  FLUX.2 [klein] 9B KV needs for character + scene visual consistency:

  - generateImage gains optional { seedImage, referenceImages, strength }:
    seedImage drives img2img (single starting image, sceneKey continuity),
    referenceImages drives multi-reference anchoring (up to 4 character
    portraits, capped per Runware spec). Default strength 0.85; FLUX
    ignores strength < 0.8.
  - uploadImage POSTs a base64 to Runware's imageUpload taskType and
    returns the UUID, so portraits/scene snapshots can be referenced by
    UUID on subsequent calls instead of resending base64 every scene.

  Co-Authored-By: QiChen88 <2291969160@qq.com>
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(engine): multi-agent scene pipeline (Writer→CharDesigner+Cinematographer→Painter)

  Replaces the single-LLM directScene with a four-agent pipeline that
  specializes each concern and parallelizes the slow parts.

  Adopts the core idea from #4 (multi-agent dispatch + character visual
  consistency) and grafts it onto the Scene/Beat architecture introduced
  in #2.

  Pipeline per Scene (~9-12s critical path with parallelization):

    Writer LLM (sequential, ~3s)
      │  outputs: sceneSummary + sceneKey + beats[] (each beat carries
      │           activeCharacters[] with poses)
      │
      ├─ CharacterDesigner LLM × N new chars (parallel)
      │    │  outputs: { visualDescription (English appearance card),
      │    │             voiceDescription (Chinese voice card) }
      │    ├─ FLUX portrait gen → upload → UUID (parallel within agent)
      │    └─ Xiaomi MiMo voicedesign provision (parallel within agent)
      │
      └─ Cinematographer LLM (parallel with CharacterDesigner)
           outputs: { shotType, integratedPrompt (English composition +
                      camera position + character blocking) }

    Painter (FLUX img2img + referenceImages, ~1-3s)
      inputs: integratedPrompt + onStageCharacters' archetype block
              + (optional) prior sceneKey-hit scene as seedImage
              + (optional) character portrait UUIDs as referenceImages
      fallback chain: A) both anchors → B) refs only (preserves characters)
                      → C) seed only (preserves backdrop) → D) pure t2i
      output uploaded → Scene.imageUuid for the next sceneKey hop

  Why this carving:

  - Writer focuses purely on narrative (drops the voice-design duty
    staging's DIRECTOR_SYSTEM was carrying as a side concern).
  - CharacterDesigner bundles visual + voice so the agent that thinks
    "who is this character" produces internally-consistent appearance +
    vocal personality (split agents tend to diverge).
  - Cinematographer doesn't need character visualDescriptions (Painter
    appends archetypes after), so it parallelizes with CharacterDesigner.
  - sceneKey enables cross-scene backdrop continuity that Scene/Beat
    doesn't cover (Scene/Beat only reuses backdrop WITHIN a scene's beats;
    sceneKey reuses across scenes that share a location).

  Other changes:

  - voice.ts loses provisionVoicesForScene (moved into CharacterDesigner);
    keeps synthesizeBeat for the lazy per-beat /api/beat-audio path.
  - renderer.ts deleted (replaced by agents/painter.ts).
  - directInsertBeat (vision-driven in-scene exploration) stays
    single-LLM: it forbids new characters and produces no image, so
    multi-agent doesn't apply.

  apps/web is unchanged: orchestrator.ts keeps the same exports
  (startSession / requestScene / visionDecide / requestInsertBeat /
  requestBeatAudio) with identical request/response shapes.

  Co-Authored-By: QiChen88 <2291969160@qq.com>
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): Pattern B player POV + JSON repair + drop seedImage tier

  Three hotfixes surfaced by manual end-to-end testing of the multi-agent
  pipeline.

  F1: Player viewpoint (galgame Pattern B):

  - Writer accepts speaker="你" ("you") for player dialog (renders in the
    dialog box, never TTS'd because no Character record exists for "你").
    Filter POV variants (玩家/我/主角/protagonist/player/I/me/...) from
    activeCharacters so CharacterDesigner never wastes API calls on the
    player. Two-layer defense: explicit prompt rule in WRITER_SYSTEM +
    code normalization (POV_VARIANTS set, isPovName, normalizeSpeakerName).
  - Cinematographer and Painter prompts gain a "player never in frame"
    rule so the player never appears in any rendered scene.
  - Cinematographer gains a dynamic camera policy driven by the entry
    beat's speaker: NPC-speaker → close-up looking toward camera;
    "你"-speaker → medium shot of attentive NPC; no speaker → wide
    establishing shot.
  - director.ts filters POV from orphanSpeakers so provisionVoiceForName
    never fires for "你".

  F2: JSON parsing robustness:

  - parseJsonLoose gains a 4th repair tier: strip JS-style comments, strip
    trailing commas, insert missing commas between adjacent objects /
    arrays / quoted values. Logs the first 800 chars of raw LLM output
    when all repair attempts fail, so we can see what the model emitted.

  F3: Drop seedImage, use referenceImages for prior scene:

  - FLUX.2 [klein] 9B KV does not support seedImage (img2img). Removed
    Tier A (seedImage+refs) and Tier C (seedImage only) from the Painter
    degradation chain. New layout: prior scene's image slots into
    referenceImages[0] for spatial continuity, character portraits fill
    slots 1-3 (Runware caps at 4 total). Cinematographer instructed to
    emphasize continuity when sceneKey matches a prior scene.

  All five package typechecks pass.

  Co-Authored-By: QiChen88 <2291969160@qq.com>
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): address Copilot review feedback on #6

  Three targeted fixes from PR #6 Copilot review.

  F4: Stale seedImage/img2img docstrings

  Four locations still referenced the original img2img design after F3
  switched to referenceImages-based spatial continuity:

  - types/index.ts:57 Scene.sceneKey docstring
  - types/index.ts:63 Scene.imageUuid docstring
  - director.ts:34 pipeline diagram in module block comment
  - director.ts:128 directScene JSDoc

  Doc-only changes; misleading wording corrected to mention
  referenceImages. (The design-rationale comment in
  pickPriorSceneReference is kept: it explains WHY we don't use seedImage
  and is load-bearing context.)

  F5: Remove JS-comment stripping from JSON repair pass

  parseJsonLoose's repair tier previously stripped `// ...` and
  `/* ... */` across the entire text, which would corrupt JSON string
  values containing URLs (e.g. "https://example.com" → "https:"). Since
  LLMs in `responseFormat: "json_object"` mode essentially never emit
  comments, dropping the comment-stripping step is a net win for safety.
  Trailing-comma and missing-comma repair (the high-frequency failures)
  are kept.

  F6: Pattern B parity on the insert-beat path

  Previously: directInsertBeat's INSERT_BEAT_SYSTEM forbade any speaker
  not in session.characters, and the orchestrator's unregistered-speaker
  guard demoted such lines to narration. This meant the player could not
  speak via speaker="你" in transient in-scene beats, which was
  inconsistent with the Writer path.

  Fix:

  - INSERT_BEAT_SYSTEM prompt now allows speaker="你" (NPC name OR "你")
    and rejects other POV variants
  - directInsertBeat applies normalizeSpeakerName to the LLM output, same
    as the Writer path, so POV variants collapse to "你"
  - lineDelivery is dropped when speaker="你" (no TTS for player)
  - orchestrator's unregistered-speaker guard adds a `speaker !== "你"`
    exception so Pattern B player dialog passes through

  Co-Authored-By: QiChen88 <2291969160@qq.com>
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(engine): drop "JS-style comments" from parseJsonLoose header

  The function header listed JS-style comments as a step-4 repair, but F5
  already removed comment stripping from `repairJsonString` because the
  regex would corrupt URLs inside JSON string values. The inner
  function's comment was updated then; this header was missed.

  Doc-only sync from second-round Copilot review on #6.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: QiChen88 <2291969160@qq.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
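The JSON repair behavior described in F2 and F5 can be illustrated with a minimal sketch. The real parseJsonLoose and repairJsonString are not part of this diff, so the names repairJsonSketch and parseJsonLooseSketch, the exact regexes, and the two-step (direct parse, then repaired parse) structure are all assumptions made for illustration. The sketch covers only trailing-comma and missing-comma repair, and shows why a naive line-comment regex is unsafe for URLs.

```typescript
// Hypothetical sketch of a loose JSON repair pass; not the project's code.
// Regex-based repair can still misfire on string values that happen to
// contain sequences like ",}" or "}{", so treat it as a last resort.
function repairJsonSketch(text: string): string {
  return text
    // Trailing comma before a closing brace/bracket: [1, 2,] -> [1, 2]
    .replace(/,\s*([}\]])/g, "$1")
    // Missing comma between adjacent objects/arrays: }{ -> },{
    .replace(/([}\]])\s*([{\[])/g, "$1,$2")
    // Missing comma between a finished value and a quoted token on the next line
    .replace(/(["}\]\d])[ \t\r]*\n(\s*")/g, "$1,\n$2");
}

// Try the raw text first, then the repaired text. On total failure, log the
// first 800 characters of the raw output (as the commit message describes).
function parseJsonLooseSketch<T>(raw: string): T | null {
  const candidates = [raw, repairJsonSketch(raw)];
  for (const candidate of candidates) {
    try {
      return JSON.parse(candidate) as T;
    } catch {
      // fall through to the next candidate
    }
  }
  console.warn(`[json] all repair attempts failed; raw head: ${raw.slice(0, 800)}`);
  return null;
}

// Sample LLM output with a missing comma after the URL and two trailing commas.
const sample = `{
  "speaker": "Aoi",
  "url": "https://example.com"
  "tags": ["calm", "night",],
}`;

type SampleShape = { speaker: string; url: string; tags: string[] };
const parsed = parseJsonLooseSketch<SampleShape>(sample);

// Why comment stripping is risky: an assumed naive line-comment regex also
// truncates the URL, leaving text that no longer parses.
const naive = '{"url": "https://example.com"}'.replace(/\/\/.*$/gm, "");
```

Keeping the repair pass limited to comma fixes leaves string contents such as URLs untouched in the common case, which matches the trade-off F5 describes.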
@@ -0,0 +1,145 @@
import { generateImage } from "@yume/ai-client";
import type { GenerateImageOptions } from "@yume/ai-client";
import type {
  Beat,
  Character,
  EngineConfig,
  ProviderConfig,
} from "@yume/types";
import { mockImageBase64 } from "../mockImage";
import { buildPainterPrompt } from "../prompts";

// ──────────────────────────────────────────────────────────────────────
// Painter: final image generation with multi-reference anchoring.
//
// FLUX.2 [klein] 9B KV does NOT support seedImage (img2img). Instead,
// visual continuity comes entirely from `referenceImages` (capped at 4),
// which the KV-optimized variant accelerates ~2.5× via key-value caching
// of reference latents.
//
// References are slotted in priority order (max 4):
//   1. Prior scene image: when sceneKey matched a previous scene, this
//      anchors the same physical space (lighting/layout/style continuity)
//   2. Entry beat's speaker portrait: the NPC the player is talking with
//      (most visually prominent)
//   3. Other on-stage NPCs' portraits: secondary characters in the frame
//
// Failure handling, two-tier degradation:
//   A. referenceImages call (preferred; full visual anchoring)
//   B. pure text-to-image fallback (last resort if Runware refs API errors)
// ──────────────────────────────────────────────────────────────────────

const MAX_REFERENCE_IMAGES = 4;

export type PainterInput = {
  integratedPrompt: string;
  styleGuide: string;
  onStageCharacters: Character[];
  /**
   * Prior scene's Runware UUID or base64. When set (= sceneKey hit a
   * prior scene), it slots into referenceImages[0] for spatial continuity.
   * Capacity-wise this displaces ONE character portrait: the slot is shared
   * with character refs, capped at 4 total per Runware spec.
   */
  priorSceneImage?: string;
};

// Pick the references we send to Runware as `referenceImages`. Priority:
//   slot 0:  priorSceneImage (if any; sceneKey continuity)
//   slot 1:  entry beat's speaker portrait (the NPC speaking to the player)
//   slot 2+: other on-stage NPCs from entry beat's activeCharacters
// Caps at 4 total. Returns the array exactly as it'll be sent: already
// truncated, already deduplicated.
export function collectReferenceImages(
  characters: Character[],
  entryBeat: Beat | undefined,
  priorSceneImage: string | undefined,
): string[] {
  const refs: string[] = [];
  const seen = new Set<string>();

  // Slot 0: prior scene image for spatial continuity. Goes first because
  // backdrop drift is the most jarring discontinuity across same-sceneKey
  // scenes; character drift is partially masked by character archetype text
  // in the prompt anyway.
  if (priorSceneImage) {
    refs.push(priorSceneImage);
  }

  // Slot 1+: character portraits, speaker-first.
  const speakerName = entryBeat?.speaker;
  if (speakerName) {
    const speaker = characters.find((c) => c.name === speakerName);
    const ref = speaker?.basePortraitUuid ?? speaker?.basePortraitBase64;
    if (ref && refs.length < MAX_REFERENCE_IMAGES) {
      refs.push(ref);
      seen.add(speakerName);
    }
  }

  for (const c of entryBeat?.activeCharacters ?? []) {
    if (refs.length >= MAX_REFERENCE_IMAGES) break;
    if (seen.has(c.name)) continue;
    const char = characters.find((x) => x.name === c.name);
    const ref = char?.basePortraitUuid ?? char?.basePortraitBase64;
    if (ref) {
      refs.push(ref);
      seen.add(c.name);
    }
  }

  return refs.slice(0, MAX_REFERENCE_IMAGES);
}

async function tryGenerate(
  config: ProviderConfig,
  prompt: string,
  options: GenerateImageOptions,
  label: string,
): Promise<string | null> {
  try {
    return await generateImage(config, prompt, options);
  } catch (err) {
    const msg = err instanceof Error ? err.message : String(err);
    console.warn(`[painter] ${label} failed: ${msg}`);
    return null;
  }
}

export async function runPainter(
  config: EngineConfig,
  input: PainterInput,
  entryBeat: Beat | undefined,
): Promise<string> {
  if (config.mockImage) return mockImageBase64();

  const prompt = buildPainterPrompt(
    input.integratedPrompt,
    input.styleGuide,
    input.onStageCharacters,
  );

  const refs = collectReferenceImages(
    input.onStageCharacters,
    entryBeat,
    input.priorSceneImage,
  );

  // Tier A: with referenceImages (priorSceneImage + character portraits).
  // FLUX.2 [klein] 9B KV's KV cache accelerates this multi-reference path
  // ~2.5× compared to the non-KV variant.
  if (refs.length > 0) {
    const r = await tryGenerate(
      config.image,
      prompt,
      { referenceImages: refs },
      `referenceImages (${refs.length})`,
    );
    if (r) return r;
  }

  // Tier B: pure text-to-image. Last resort, used when Tier A failed OR
  // there are no references to send (first scene with no characters yet).
  // Errors here propagate to the caller.
  return generateImage(config.image, prompt);
}
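The slot priority used by collectReferenceImages can be traced with a small standalone example. This is a sketch under stated assumptions: the Character and Beat types below are minimal stand-ins that model only the fields the slotting logic reads (the real @yume/types shapes are not shown in this diff), pickRefs is a condensed restatement written for illustration rather than the function above, and the character names and UUID strings are made up.

```typescript
// Minimal stand-in types (assumption): only the fields the slotting reads.
type Character = { name: string; basePortraitUuid?: string; basePortraitBase64?: string };
type Beat = { speaker?: string; activeCharacters?: { name: string }[] };

const MAX_REFERENCE_IMAGES = 4;

// Condensed, illustrative restatement of the slotting rules:
// prior scene image first, then the speaker, then the rest of the cast.
function pickRefs(characters: Character[], entryBeat?: Beat, priorSceneImage?: string): string[] {
  const portraitOf = (name: string): string | undefined => {
    const c = characters.find((x) => x.name === name);
    return c?.basePortraitUuid ?? c?.basePortraitBase64;
  };
  const names = [entryBeat?.speaker, ...(entryBeat?.activeCharacters ?? []).map((c) => c.name)]
    .filter((n): n is string => Boolean(n));
  // Deduplicate by character name, keeping the first occurrence (speaker-first).
  const uniqueNames = names.filter((n, i) => names.indexOf(n) === i);
  const portraits = uniqueNames.map(portraitOf).filter((r): r is string => Boolean(r));
  return [...(priorSceneImage ? [priorSceneImage] : []), ...portraits].slice(0, MAX_REFERENCE_IMAGES);
}

// Hypothetical cast: four NPCs, one of them with only a base64 portrait.
const cast: Character[] = [
  { name: "Aoi", basePortraitUuid: "uuid-aoi" },
  { name: "Ren", basePortraitBase64: "b64-ren" },
  { name: "Mio", basePortraitUuid: "uuid-mio" },
  { name: "Kai", basePortraitUuid: "uuid-kai" },
];

// Ren speaks; all four NPCs are on stage; the sceneKey matched a prior scene.
const npcBeat: Beat = {
  speaker: "Ren",
  activeCharacters: [{ name: "Aoi" }, { name: "Ren" }, { name: "Mio" }, { name: "Kai" }],
};
const refsWithPrior = pickRefs(cast, npcBeat, "uuid-prior-scene");
// refsWithPrior: ["uuid-prior-scene", "b64-ren", "uuid-aoi", "uuid-mio"]
// Kai is dropped because the prior scene image occupies one of the four slots.

// Player line (speaker "你"): no Character record exists, so no portrait is sent for it.
const playerBeat: Beat = { speaker: "你", activeCharacters: [{ name: "Aoi" }] };
const refsPlayer = pickRefs(cast, playerBeat, undefined);
// refsPlayer: ["uuid-aoi"]
```

The first case shows the capacity trade-off noted in the PainterInput docstring: with a prior scene image present, at most three character portraits fit.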