feat(engine): multi-agent character consistency pipeline (#6)

* feat(types): Character.voiceDescription rename + visual fields + Scene.sceneKey

Prepares the type surface for the multi-agent scene pipeline:
- Character.description → voiceDescription (clearer pairing with new
  visualDescription)
- Character gains visualDescription (English appearance card for
  Painter) + basePortraitBase64 + basePortraitUuid (for Runware
  referenceImages reuse)
- Scene gains sceneKey (English slug for cross-scene img2img
  continuity) + imageUuid (Runware UUID of the scene's rendered image
  for cheap seedImage reuse on subsequent same-sceneKey calls)
- Beat gains activeCharacters[] so the Cinematographer can read which
  characters are on-screen + their poses when composing the
  establishing shot

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ai-client): generateImage img2img + multi-reference options + uploadImage

Extends the Runware adapter to support the two anchoring mechanisms
FLUX.2 [klein] 9B KV needs for character + scene visual consistency:
- generateImage gains optional { seedImage, referenceImages, strength }:
  seedImage drives img2img (single starting image, sceneKey continuity),
  referenceImages drives multi-reference anchoring (up to 4 character
  portraits, capped per Runware spec). Default strength 0.85 — FLUX
  ignores strength < 0.8.
- uploadImage POSTs a base64 to Runware's imageUpload taskType and
  returns the UUID, so portraits/scene snapshots can be referenced by
  UUID on subsequent calls instead of resending base64 every scene.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(engine): multi-agent scene pipeline (Writer→CharDesigner+Cinematographer→Painter)

Replaces the single-LLM directScene with a four-agent pipeline that
specializes each concern and parallelizes the slow parts.
Adopts the core idea from #4 (multi-agent dispatch + character visual
consistency) and grafts it onto the Scene/Beat architecture introduced
in #2.

Pipeline per Scene (~9-12s critical path with parallelization):

  Writer LLM (sequential, ~3s)
    │  outputs: sceneSummary + sceneKey + beats[] (each beat carries
    │           activeCharacters[] with poses)
    │
    ├─ CharacterDesigner LLM × N new chars (parallel)
    │    │  outputs: { visualDescription (English appearance card),
    │    │             voiceDescription (Chinese voice card) }
    │    ├─ FLUX portrait gen → upload → UUID (parallel within agent)
    │    └─ Xiaomi MiMo voicedesign provision (parallel within agent)
    │
    └─ Cinematographer LLM (parallel with CharacterDesigner)
         outputs: { shotType, integratedPrompt (English composition +
                    camera position + character blocking) }

  Painter (FLUX img2img + referenceImages, ~1-3s)
    inputs: integratedPrompt + onStageCharacters' archetype block
            + (optional) prior sceneKey-hit scene as seedImage
            + (optional) character portrait UUIDs as referenceImages
    fallback chain: A) both anchors → B) refs only (preserves characters)
                    → C) seed only (preserves backdrop) → D) pure t2i
    output uploaded → Scene.imageUuid for the next sceneKey hop

Why this carving:
- Writer focuses purely on narrative (drops the voice-design duty that
  staging's DIRECTOR_SYSTEM was carrying as a side concern).
- CharacterDesigner bundles visual + voice so the agent that thinks
  "who is this character" produces internally-consistent appearance +
  vocal personality (split agents tend to diverge).
- Cinematographer doesn't need character visualDescriptions — Painter
  appends archetypes after — so it parallelizes with CharacterDesigner.
- sceneKey enables cross-scene backdrop continuity that Scene/Beat
  doesn't cover (Scene/Beat only reuses backdrop WITHIN a scene's
  beats; sceneKey reuses across scenes that share a location).

Other changes:
- voice.ts loses provisionVoicesForScene (moved into
  CharacterDesigner); keeps synthesizeBeat for the lazy per-beat
  /api/beat-audio path.
- renderer.ts deleted (replaced by agents/painter.ts).
- directInsertBeat (vision-driven in-scene exploration) stays
  single-LLM — it forbids new characters and produces no image, so
  multi-agent doesn't apply.

apps/web is unchanged: orchestrator.ts keeps the same exports
(startSession / requestScene / visionDecide / requestInsertBeat /
requestBeatAudio) with identical request/response shapes.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): Pattern B player POV + JSON repair + drop seedImage tier

Three hotfixes surfaced by manual end-to-end testing of the multi-agent
pipeline.

F1 — Player viewpoint (galgame Pattern B):
- Writer accepts speaker="你" for player dialog (renders in dialog box,
  never TTS'd because no Character record exists for "你"). Filter POV
  variants (玩家/我/主角/protagonist/player/I/me/...) from
  activeCharacters so CharacterDesigner never wastes API calls on the
  player. Two-layer defense: explicit prompt rule in WRITER_SYSTEM +
  code normalization (POV_VARIANTS set, isPovName,
  normalizeSpeakerName).
- Cinematographer and Painter prompts gain "player never in frame" rule
  so the player never appears in any rendered scene.
- Cinematographer gains dynamic camera policy driven by the entry
  beat's speaker: NPC-speaker → close-up looking toward camera;
  "你"-speaker → medium shot of attentive NPC; no speaker → wide
  establishing shot.
- director.ts filters POV from orphanSpeakers so provisionVoiceForName
  never fires for "你".

F2 — JSON parsing robustness:
- parseJsonLoose gains a 4th repair tier: strip JS-style comments,
  strip trailing commas, insert missing commas between adjacent
  objects / arrays / quoted values. Logs the first 800 chars of raw LLM
  output when all repair attempts fail, so we can see what the model
  emitted.

F3 — Drop seedImage, use referenceImages for prior scene:
- FLUX.2 [klein] 9B KV does not support seedImage (img2img). Removed
  Tier A (seedImage+refs) and Tier C (seedImage only) from the Painter
  degradation chain.
  New layout: prior scene's image slots into referenceImages[0] for
  spatial continuity, character portraits fill slots 1-3 (Runware caps
  at 4 total). Cinematographer instructed to emphasize continuity when
  sceneKey matches a prior scene.

All five package typechecks pass.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): address Copilot review feedback on #6

Three targeted fixes from PR #6 Copilot review.

F4 — Stale seedImage/img2img docstrings

Four locations still referenced the original img2img design after F3
switched to referenceImages-based spatial continuity:
- types/index.ts:57 Scene.sceneKey docstring
- types/index.ts:63 Scene.imageUuid docstring
- director.ts:34 pipeline diagram in module block comment
- director.ts:128 directScene JSDoc

Doc-only changes; misleading wording corrected to mention
referenceImages. (The design-rationale comment in
pickPriorSceneReference is kept — it explains WHY we don't use
seedImage and is load-bearing context.)

F5 — Remove JS-comment stripping from JSON repair pass

parseJsonLoose's repair tier previously stripped `// ...` and
`/* ... */` across the entire text, which would corrupt JSON string
values containing URLs (e.g. "https://example.com" → "https:"). Since
LLMs in `responseFormat: "json_object"` mode essentially never emit
comments, dropping the comment-stripping step is a net win for safety.
Trailing-comma and missing-comma repair (the high-frequency failures)
are kept.

F6 — Pattern B parity on the insert-beat path

Previously: directInsertBeat's INSERT_BEAT_SYSTEM forbade any speaker
not in session.characters, and the orchestrator's unregistered-speaker
guard demoted such lines to narration. This meant the player could not
speak via speaker="你" in transient in-scene beats — inconsistent with
the Writer path.
Fix:
- INSERT_BEAT_SYSTEM prompt now allows speaker="你" (NPC name OR "你")
  and rejects other POV variants
- directInsertBeat applies normalizeSpeakerName to the LLM output, same
  as the Writer path, so POV variants collapse to "你"
- lineDelivery is dropped when speaker="你" (no TTS for player)
- orchestrator's unregistered-speaker guard adds a `speaker !== "你"`
  exception so Pattern B player dialog passes through

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(engine): drop "JS-style comments" from parseJsonLoose header

The function header listed JS-style comments as a step-4 repair, but F5
already removed comment stripping from `repairJsonString` because the
regex would corrupt URLs inside JSON string values. The inner
function's comment was updated then; this header was missed.

Doc-only sync from second-round Copilot review on #6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: QiChen88 <2291969160@qq.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
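
The speaker normalization described in F1 and F6 can be pictured roughly as below. This is a minimal sketch, not the code from the commit: the names POV_VARIANTS, isPovName and normalizeSpeakerName come from the message above, but their bodies, the Line type, the normalizeLine helper and the exact variant list are assumptions made for illustration (the message elides the full list with "...").

```typescript
// Hypothetical sketch: names follow the commit message, bodies are assumed.
const PLAYER_SPEAKER = "你";

// Only the variants the commit message spells out; the real set is longer.
const POV_VARIANTS = new Set([
  "你", "玩家", "我", "主角", "protagonist", "player", "i", "me",
]);

function isPovName(name: string): boolean {
  // Lowercasing is a no-op for the Chinese entries and folds "Player" / "ME".
  return POV_VARIANTS.has(name.trim().toLowerCase());
}

// Collapses every POV variant to the single canonical player speaker.
function normalizeSpeakerName(name: string | undefined): string | undefined {
  if (name === undefined) return undefined;
  const trimmed = name.trim();
  if (trimmed === "") return undefined;
  return isPovName(trimmed) ? PLAYER_SPEAKER : trimmed;
}

// Assumed line shape: narration has no speaker, NPC lines may carry delivery hints.
type Line = { speaker?: string; text: string; lineDelivery?: string };

function normalizeLine(line: Line, registered: Set<string>): Line {
  const speaker = normalizeSpeakerName(line.speaker);
  // No speaker: plain narration.
  if (speaker === undefined) return { text: line.text };
  // Player dialog passes the guard but never carries lineDelivery (no TTS).
  if (speaker === PLAYER_SPEAKER) return { speaker, text: line.text };
  // Unregistered NPC speakers are demoted to narration.
  if (!registered.has(speaker)) return { text: line.text };
  return { ...line, speaker };
}
```

Under these assumptions a line attributed to "me" comes out with speaker "你" and no lineDelivery, while a line from an unknown NPC loses its speaker and reads as narration.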
2026-05-29 13:30:24 +08:00
parent e261f4a346
commit def1b25bd9
14 changed files with 1733 additions and 562 deletions
@@ -4,8 +4,22 @@ import { fetchWithRetry } from "./fetchWithRetry";
 // Runware uses its own task-array protocol (not OpenAI-compatible).
 // POST <baseUrl> with [{ taskType: "imageInference", ... }]; errors come
 // back as a 200 with `errors[]`, so we have to inspect the body either way.
+
+// FLUX img2img specifics:
+// - strength < 0.8 has minimal-to-no visible effect on FLUX models (per
+//   Runware docs); we default to 0.85 which leaves room to deviate while
+//   still anchoring on the seed image's composition.
+// - referenceImages caps at 4 per request; the FLUX.2 [klein] 9B KV model
+//   (runware:400@6) accelerates multi-reference inference by ~2.5× via its
+//   KV cache for reference latents (cached only WITHIN one inference run —
+//   not persisted across API calls, hence the upload-once-then-reference
+//   strategy below).
+const DEFAULT_IMG2IMG_STRENGTH = 0.85;
+const MAX_REFERENCE_IMAGES = 4;
+
 type RunwareImageResult = {
  imageBase64Data?: string;
+  imageUUID?: string;
 };
 type RunwareError = {
  code?: string;
@@ -17,27 +31,58 @@ type RunwareResponse = {
  errors?: RunwareError[];
 };

+export type GenerateImageOptions = {
+  /**
+   * Reference image (UUID, plain base64, or data URI) to use as the
+   * img2img starting point. When set, FLUX preserves the seed image's
+   * composition and applies `strength` to allow deviation from it.
+   * Used for cross-scene visual continuity when sceneKey hits.
+   */
+  seedImage?: string;
+  /**
+   * Reference images (UUIDs or base64) to condition the generation on —
+   * typically character portraits to anchor identity / outfit / style
+   * across scenes. Runware caps at 4; we silently truncate beyond that.
+   */
+  referenceImages?: string[];
+  /** 0–1, FLUX needs ≥ 0.8 to actually have an effect. */
+  strength?: number;
+};
+
+// ──────────────────────────────────────────────────────────────────────
+//  generateImage — text-to-image (default) or img2img / multi-reference
+//  when seedImage / referenceImages are supplied. Returns base64.
+// ──────────────────────────────────────────────────────────────────────
+
 export async function generateImage(
  config: ProviderConfig,
  prompt: string,
+  options?: GenerateImageOptions,
 ): Promise<string> {
  const url = config.baseUrl.replace(/\/$/, "");

-  const body = [
-    {
-      taskType: "imageInference",
-      taskUUID: crypto.randomUUID(),
-      model: config.model,
-      positivePrompt: prompt,
-      width: 1792,
-      height: 1024,
-      steps: 4,
-      CFGScale: 3.5,
-      numberResults: 1,
-      outputType: "base64Data",
-      outputFormat: "PNG",
-    },
-  ];
+  const task: Record<string, unknown> = {
+    taskType: "imageInference",
+    taskUUID: crypto.randomUUID(),
+    model: config.model,
+    positivePrompt: prompt,
+    width: 1792,
+    height: 1024,
+    steps: 4,
+    CFGScale: 3.5,
+    numberResults: 1,
+    outputType: "base64Data",
+    outputFormat: "PNG",
+  };
+
+  if (options?.seedImage) {
+    task.seedImage = options.seedImage;
+    task.strength = options.strength ?? DEFAULT_IMG2IMG_STRENGTH;
+  }
+
+  if (options?.referenceImages?.length) {
+    task.referenceImages = options.referenceImages.slice(0, MAX_REFERENCE_IMAGES);
+  }

  const res = await fetchWithRetry(url, {
    method: "POST",
@@ -45,7 +90,7 @@ export async function generateImage(
      "Content-Type": "application/json",
      Authorization: `Bearer ${config.apiKey}`,
    },
-    body: JSON.stringify(body),
+    body: JSON.stringify([task]),
  });

  const text = await res.text();
@@ -66,9 +111,64 @@ export async function generateImage(

  const b64 = json.data?.[0]?.imageBase64Data;
  if (!b64) {
-    throw new Error(
-      `No image in Runware response: ${text.slice(0, 300)}`,
-    );
+    throw new Error(`No image in Runware response: ${text.slice(0, 300)}`);
  }
  return b64;
 }
+
+// ──────────────────────────────────────────────────────────────────────
+//  uploadImage — registers a base64 image on Runware and returns its
+//  UUID, so subsequent generateImage calls can pass the UUID in
+//  referenceImages / seedImage instead of resending the base64 payload
+//  every time. Character base portraits and scene snapshots both flow
+//  through this path.
+//
+//  Runware exposes the imageUpload taskType for exactly this purpose.
+//  Returns the UUID. Caller treats a thrown error as "fall back to
+//  sending base64 next time" — non-fatal.
+// ──────────────────────────────────────────────────────────────────────
+
+export async function uploadImage(
+  config: ProviderConfig,
+  base64: string,
+): Promise<string> {
+  const url = config.baseUrl.replace(/\/$/, "");
+
+  const body = [
+    {
+      taskType: "imageUpload",
+      taskUUID: crypto.randomUUID(),
+      image: `data:image/png;base64,${base64}`,
+    },
+  ];
+
+  const res = await fetchWithRetry(url, {
+    method: "POST",
+    headers: {
+      "Content-Type": "application/json",
+      Authorization: `Bearer ${config.apiKey}`,
+    },
+    body: JSON.stringify(body),
+  });
+
+  const text = await res.text();
+  let json: RunwareResponse;
+  try {
+    json = JSON.parse(text) as RunwareResponse;
+  } catch {
+    throw new Error(`Image upload API error ${res.status}: ${text.slice(0, 500)}`);
+  }
+
+  if (json.errors?.length) {
+    const e = json.errors[0]!;
+    throw new Error(
+      `Runware upload error [${e.code ?? "unknown"}]: ${e.message ?? "no message"}`,
+    );
+  }
+
+  const uuid = json.data?.[0]?.imageUUID;
+  if (!uuid) {
+    throw new Error(`No UUID in upload response: ${text.slice(0, 300)}`);
+  }
+  return uuid;
+}
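
One way a caller might assemble the referenceImages list for generateImage, following the layout in the commit message (prior scene image first, character portraits after, four entries at most) and the "fall back to sending base64" note on uploadImage. This is a sketch under stated assumptions: ImageHandle, toReference and buildReferenceImages are hypothetical names that do not appear in the diff, and letting portraits use all four slots when there is no prior scene is an assumption, not confirmed behaviour.

```typescript
// Same cap as MAX_REFERENCE_IMAGES in the diff above.
const MAX_REFERENCE_IMAGES = 4;

// Hypothetical handle: a Runware UUID if the upload succeeded, else raw base64.
type ImageHandle = { uuid?: string; base64?: string };

// Prefer the cheap UUID reference; fall back to the base64 payload.
function toReference(img: ImageHandle): string | undefined {
  return img.uuid || img.base64 || undefined;
}

// Slot 0 goes to the prior scene (when there is one); portraits fill the rest.
function buildReferenceImages(
  priorScene: ImageHandle | undefined,
  portraits: ImageHandle[],
): string[] {
  const refs: string[] = [];
  const prior = priorScene ? toReference(priorScene) : undefined;
  if (prior) refs.push(prior);
  for (const portrait of portraits) {
    if (refs.length >= MAX_REFERENCE_IMAGES) break;
    const ref = toReference(portrait);
    if (ref) refs.push(ref); // skip portraits with neither UUID nor base64
  }
  return refs;
}
```

The result would be passed as the referenceImages option of generateImage. Capping in the caller keeps the prior scene in slot 0 rather than relying on the adapter's silent slice(0, 4), which only drops whatever happens to come last.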