Files
infiplot-web/packages/engine/src/agents/painter.ts
T
Zonghao Yuan def1b25bd9 feat(engine): multi-agent character consistency pipeline (#6)
* feat(types): Character.voiceDescription rename + visual fields + Scene.sceneKey

Prepares the type surface for the multi-agent scene pipeline:

- Character.description → voiceDescription (clearer pairing with new visualDescription)
- Character gains visualDescription (English appearance card for Painter) +
  basePortraitBase64 + basePortraitUuid (for Runware referenceImages reuse)
- Scene gains sceneKey (English slug for cross-scene img2img continuity) +
  imageUuid (Runware UUID of the scene's rendered image for cheap seedImage
  reuse on subsequent same-sceneKey calls)
- Beat gains activeCharacters[] so the Cinematographer can read which
  characters are on-screen + their poses when composing the establishing shot

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ai-client): generateImage img2img + multi-reference options + uploadImage

Extends the Runware adapter to support the two anchoring mechanisms FLUX.2
[klein] 9B KV needs for character + scene visual consistency:

- generateImage gains optional { seedImage, referenceImages, strength }:
  seedImage drives img2img (single starting image, sceneKey continuity),
  referenceImages drives multi-reference anchoring (up to 4 character
  portraits, capped per Runware spec). Default strength 0.85 — FLUX
  ignores strength < 0.8.
- uploadImage POSTs a base64 to Runware's imageUpload taskType and
  returns the UUID, so portraits/scene snapshots can be referenced by
  UUID on subsequent calls instead of resending base64 every scene.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(engine): multi-agent scene pipeline (Writer→CharDesigner+Cinematographer→Painter)

Replaces the single-LLM directScene with a four-agent pipeline that
specializes each concern and parallelizes the slow parts. Adopts the
core idea from #4 (multi-agent dispatch + character visual consistency)
and grafts it onto the Scene/Beat architecture introduced in #2.

Pipeline per Scene (~9-12s critical path with parallelization):

  Writer LLM (序列, ~3s)
    │ outputs: sceneSummary + sceneKey + beats[] (each beat carries
    │           activeCharacters[] with poses)
    │
    ├─ CharacterDesigner LLM × N new chars (并行)
    │     │ outputs: { visualDescription (英文外貌卡), voiceDescription (中文音色卡) }
    │     ├─ FLUX portrait gen → upload → UUID (并行 within agent)
    │     └─ Xiaomi MiMo voicedesign provision (并行 within agent)
    │
    └─ Cinematographer LLM (并行 with CharacterDesigner)
          outputs: { shotType, integratedPrompt (英文构图+机位+人物站位) }

  Painter (FLUX img2img + referenceImages, ~1-3s)
    inputs: integratedPrompt + onStageCharacters' archetype block
            + (optional) prior sceneKey-hit scene as seedImage
            + (optional) character portrait UUIDs as referenceImages
    fallback chain: A) both anchors → B) refs only (保角色) →
                    C) seed only (保背景) → D) pure t2i
    output uploaded → Scene.imageUuid for the next sceneKey hop

Why this carving:
- Writer focuses purely on narrative (drops the voice-design duty
  staging's DIRECTOR_SYSTEM was carrying as a side concern).
- CharacterDesigner bundles visual + voice so the agent that thinks
  "who is this character" produces internally-consistent appearance +
  vocal personality (split agents tend to diverge).
- Cinematographer doesn't need character visualDescriptions —
  Painter appends archetypes after — so it parallelizes with
  CharacterDesigner.
- sceneKey enables cross-scene backdrop continuity that Scene/Beat
  doesn't cover (Scene/Beat only reuses backdrop WITHIN a scene's
  beats; sceneKey reuses across scenes that share a location).

Other changes:
- voice.ts loses provisionVoicesForScene (moved into CharacterDesigner);
  keeps synthesizeBeat for the lazy per-beat /api/beat-audio path.
- renderer.ts deleted (replaced by agents/painter.ts).
- directInsertBeat (vision-driven in-scene exploration) stays single-
  LLM — it forbids new characters and produces no image, so multi-
  agent doesn't apply.

apps/web is unchanged: orchestrator.ts keeps the same exports
(startSession / requestScene / visionDecide / requestInsertBeat /
requestBeatAudio) with identical request/response shapes.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): Pattern B player POV + JSON repair + drop seedImage tier

Three hotfixes surfaced by manual end-to-end testing of the multi-agent
pipeline.

F1 — Player viewpoint (galgame Pattern B):
  - Writer accepts speaker="你" for player dialog (renders in dialog box,
    never TTS'd because no Character record exists for "你"). Filter POV
    variants (玩家/我/主角/protagonist/player/I/me/...) from
    activeCharacters so CharacterDesigner never wastes API calls on the
    player. Two-layer defense: explicit prompt rule in WRITER_SYSTEM +
    code normalization (POV_VARIANTS set, isPovName, normalizeSpeakerName).
  - Cinematographer and Painter prompts gain "player never in frame" rule
    so the player never appears in any rendered scene.
  - Cinematographer gains dynamic camera policy driven by the entry beat's
    speaker: NPC-speaker → close-up looking toward camera; "你"-speaker →
    medium shot of attentive NPC; no speaker → wide establishing shot.
  - director.ts filters POV from orphanSpeakers so provisionVoiceForName
    never fires for "你".

F2 — JSON parsing robustness:
  - parseJsonLoose gains a 4th repair tier: strip JS-style comments, strip
    trailing commas, insert missing commas between adjacent objects /
    arrays / quoted values. Logs the first 800 chars of raw LLM output
    when all repair attempts fail, so we can see what the model emitted.

F3 — Drop seedImage, use referenceImages for prior scene:
  - FLUX.2 [klein] 9B KV does not support seedImage (img2img). Removed
    Tier A (seedImage+refs) and Tier C (seedImage only) from the Painter
    degradation chain. New layout: prior scene's image slots into
    referenceImages[0] for spatial continuity, character portraits fill
    slots 1-3 (Runware caps at 4 total). Cinematographer instructed to
    emphasize continuity when sceneKey matches a prior scene.

All five package typechecks pass.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): address Copilot review feedback on #6

Three targeted fixes from PR #6 Copilot review.

F4 — Stale seedImage/img2img docstrings
  Four locations still referenced the original img2img design after F3
  switched to referenceImages-based spatial continuity:
  - types/index.ts:57   Scene.sceneKey docstring
  - types/index.ts:63   Scene.imageUuid docstring
  - director.ts:34      pipeline diagram in module block comment
  - director.ts:128     directScene JSDoc
  Doc-only changes; misleading wording corrected to mention referenceImages.
  (The design-rationale comment in pickPriorSceneReference is kept — it
  explains WHY we don't use seedImage and is load-bearing context.)

F5 — Remove JS-comment stripping from JSON repair pass
  parseJsonLoose's repair tier previously stripped `// ...` and
  `/* ... */` across the entire text, which would corrupt JSON string
  values containing URLs (e.g. "https://example.com" → "https:"). Since
  LLMs in `responseFormat: "json_object"` mode essentially never emit
  comments, dropping the comment-stripping step is a net win for safety.
  Trailing-comma and missing-comma repair (the high-frequency failures)
  are kept.

F6 — Pattern B parity on the insert-beat path
  Previously: directInsertBeat's INSERT_BEAT_SYSTEM forbade any speaker
  not in session.characters, and the orchestrator's unregistered-speaker
  guard demoted such lines to narration. This meant the player could not
  speak via speaker="你" in transient in-scene beats — inconsistent with
  the Writer path.
  Fix:
  - INSERT_BEAT_SYSTEM prompt now allows speaker="你" (NPC name OR "你")
    and rejects other POV variants
  - directInsertBeat applies normalizeSpeakerName to the LLM output, same
    as the Writer path, so POV variants collapse to "你"
  - lineDelivery is dropped when speaker="你" (no TTS for player)
  - orchestrator's unregistered-speaker guard adds a `speaker !== "你"`
    exception so Pattern B player dialog passes through

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(engine): drop "JS-style comments" from parseJsonLoose header

The function header listed JS-style comments as a step-4 repair, but F5
already removed comment stripping from `repairJsonString` because the
regex would corrupt URLs inside JSON string values. The inner function's
comment was updated then; this header was missed.

Doc-only sync from second-round Copilot review on #6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: QiChen88 <2291969160@qq.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 13:30:24 +08:00

146 lines
5.1 KiB
TypeScript
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
import { generateImage } from "@yume/ai-client";
import type { GenerateImageOptions } from "@yume/ai-client";
import type {
Beat,
Character,
EngineConfig,
ProviderConfig,
} from "@yume/types";
import { mockImageBase64 } from "../mockImage";
import { buildPainterPrompt } from "../prompts";
// ──────────────────────────────────────────────────────────────────────
// Painter — final image generation with multi-reference anchoring.
//
// FLUX.2 [klein] 9B KV does NOT support seedImage (img2img). Instead,
// visual continuity comes entirely from `referenceImages` (capped at 4),
// which the KV-optimized variant accelerates ~2.5× via key-value caching
// of reference latents.
//
// References are slotted in priority order (max 4):
// 1. Prior scene image — when sceneKey matched a previous scene, this
// anchors the same physical space (lighting/layout/style continuity)
// 2. Entry beat's speaker portrait — the NPC the player is talking with
// (most visually prominent)
// 3. Other on-stage NPCs' portraits — secondary characters in the frame
//
// Failure handling — two-tier degradation:
// A. referenceImages call (preferred — full visual anchoring)
// B. pure text-to-image fallback (last resort if Runware refs API errors)
// ──────────────────────────────────────────────────────────────────────
const MAX_REFERENCE_IMAGES = 4;
export type PainterInput = {
integratedPrompt: string;
styleGuide: string;
onStageCharacters: Character[];
/**
* Prior scene's Runware UUID or base64. When set (= sceneKey hit a
* prior scene), it slots into referenceImages[0] for spatial continuity.
* Capacity-wise this displaces ONE character portrait — slot is shared
* with character refs, capped at 4 total per Runware spec.
*/
priorSceneImage?: string;
};
// Pick the references we send to Runware as `referenceImages`. Priority:
// slot 0: priorSceneImage (if any — sceneKey continuity)
// slot 1: entry beat's speaker portrait (the NPC speaking to the player)
// slot 2+: other on-stage NPCs from entry beat's activeCharacters
// Caps at 4 total. Returns the array exactly as it'll be sent — already
// truncated, already deduplicated.
export function collectReferenceImages(
characters: Character[],
entryBeat: Beat | undefined,
priorSceneImage: string | undefined,
): string[] {
const refs: string[] = [];
const seen = new Set<string>();
// Slot 0 — prior scene image for spatial continuity. Goes first because
// backdrop drift is the most jarring discontinuity across same-sceneKey
// scenes; character drift is partially masked by character archetype text
// in the prompt anyway.
if (priorSceneImage) {
refs.push(priorSceneImage);
}
// Slot 1+ — character portraits, speaker-first.
const speakerName = entryBeat?.speaker;
if (speakerName) {
const speaker = characters.find((c) => c.name === speakerName);
const ref = speaker?.basePortraitUuid ?? speaker?.basePortraitBase64;
if (ref && refs.length < MAX_REFERENCE_IMAGES) {
refs.push(ref);
seen.add(speakerName);
}
}
for (const c of entryBeat?.activeCharacters ?? []) {
if (refs.length >= MAX_REFERENCE_IMAGES) break;
if (seen.has(c.name)) continue;
const char = characters.find((x) => x.name === c.name);
const ref = char?.basePortraitUuid ?? char?.basePortraitBase64;
if (ref) {
refs.push(ref);
seen.add(c.name);
}
}
return refs.slice(0, MAX_REFERENCE_IMAGES);
}
async function tryGenerate(
config: ProviderConfig,
prompt: string,
options: GenerateImageOptions,
label: string,
): Promise<string | null> {
try {
return await generateImage(config, prompt, options);
} catch (err) {
const msg = err instanceof Error ? err.message : String(err);
console.warn(`[painter] ${label} failed: ${msg}`);
return null;
}
}
export async function runPainter(
config: EngineConfig,
input: PainterInput,
entryBeat: Beat | undefined,
): Promise<string> {
if (config.mockImage) return mockImageBase64();
const prompt = buildPainterPrompt(
input.integratedPrompt,
input.styleGuide,
input.onStageCharacters,
);
const refs = collectReferenceImages(
input.onStageCharacters,
entryBeat,
input.priorSceneImage,
);
// Tier A — with referenceImages (priorSceneImage + character portraits).
// FLUX.2 [klein] 9B KV's KV cache accelerates this multi-reference path
// ~2.5× compared to the non-KV variant.
if (refs.length > 0) {
const r = await tryGenerate(
config.image,
prompt,
{ referenceImages: refs },
`referenceImages (${refs.length})`,
);
if (r) return r;
}
// Tier B — pure text-to-image. Last resort, used when Tier A failed OR
// there are no references to send (first scene with no characters yet).
// Errors here propagate to the caller.
return generateImage(config.image, prompt);
}