feat(engine): multi-agent character consistency pipeline (#6)

* feat(types): Character.voiceDescription rename + visual fields + Scene.sceneKey

  Prepares the type surface for the multi-agent scene pipeline:
  - Character.description → voiceDescription (clearer pairing with new
    visualDescription)
  - Character gains visualDescription (English appearance card for Painter)
    + basePortraitBase64 + basePortraitUuid (for Runware referenceImages
    reuse)
  - Scene gains sceneKey (English slug for cross-scene img2img continuity)
    + imageUuid (Runware UUID of the scene's rendered image for cheap
    seedImage reuse on subsequent same-sceneKey calls)
  - Beat gains activeCharacters[] so the Cinematographer can read which
    characters are on-screen + their poses when composing the establishing
    shot

  Co-Authored-By: QiChen88 <2291969160@qq.com>
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ai-client): generateImage img2img + multi-reference options + uploadImage

  Extends the Runware adapter to support the two anchoring mechanisms
  FLUX.2 [klein] 9B KV needs for character + scene visual consistency:
  - generateImage gains optional { seedImage, referenceImages, strength }:
    seedImage drives img2img (single starting image, sceneKey continuity),
    referenceImages drives multi-reference anchoring (up to 4 character
    portraits, capped per Runware spec). Default strength 0.85 (FLUX
    ignores strength < 0.8).
  - uploadImage POSTs a base64 image to Runware's imageUpload taskType and
    returns the UUID, so portraits/scene snapshots can be referenced by
    UUID on subsequent calls instead of resending base64 every scene.

  Co-Authored-By: QiChen88 <2291969160@qq.com>
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(engine): multi-agent scene pipeline (Writer→CharDesigner+Cinematographer→Painter)

  Replaces the single-LLM directScene with a four-agent pipeline that
  specializes each concern and parallelizes the slow parts.
  Adopts the core idea from #4 (multi-agent dispatch + character visual
  consistency) and grafts it onto the Scene/Beat architecture introduced
  in #2.

  Pipeline per Scene (~9-12s critical path with parallelization):

    Writer LLM (sequential, ~3s)
      │  outputs: sceneSummary + sceneKey + beats[] (each beat carries
      │           activeCharacters[] with poses)
      │
      ├─ CharacterDesigner LLM × N new chars (parallel)
      │    │  outputs: { visualDescription (English appearance card),
      │    │             voiceDescription (Chinese voice card) }
      │    ├─ FLUX portrait gen → upload → UUID (parallel within agent)
      │    └─ Xiaomi MiMo voicedesign provision (parallel within agent)
      │
      └─ Cinematographer LLM (parallel with CharacterDesigner)
           outputs: { shotType, integratedPrompt (English composition
                      + camera position + character blocking) }

    Painter (FLUX img2img + referenceImages, ~1-3s)
      inputs:  integratedPrompt + onStageCharacters' archetype block
               + (optional) prior sceneKey-hit scene as seedImage
               + (optional) character portrait UUIDs as referenceImages
      fallback chain: A) both anchors → B) refs only (keeps characters)
                      → C) seed only (keeps background) → D) pure t2i
      output uploaded → Scene.imageUuid for the next sceneKey hop

  Why this carving:
  - Writer focuses purely on narrative (drops the voice-design duty that
    staging's DIRECTOR_SYSTEM was carrying as a side concern).
  - CharacterDesigner bundles visual + voice so the agent that thinks
    "who is this character" produces internally-consistent appearance +
    vocal personality (split agents tend to diverge).
  - Cinematographer doesn't need character visualDescriptions (Painter
    appends archetypes afterwards), so it parallelizes with
    CharacterDesigner.
  - sceneKey enables cross-scene backdrop continuity that Scene/Beat
    doesn't cover (Scene/Beat only reuses the backdrop WITHIN a scene's
    beats; sceneKey reuses it across scenes that share a location).

  Other changes:
  - voice.ts loses provisionVoicesForScene (moved into CharacterDesigner);
    keeps synthesizeBeat for the lazy per-beat /api/beat-audio path.
  - renderer.ts deleted (replaced by agents/painter.ts).
  - directInsertBeat (vision-driven in-scene exploration) stays
    single-LLM: it forbids new characters and produces no image, so
    multi-agent doesn't apply.

  apps/web is unchanged: orchestrator.ts keeps the same exports
  (startSession / requestScene / visionDecide / requestInsertBeat /
  requestBeatAudio) with identical request/response shapes.

  Co-Authored-By: QiChen88 <2291969160@qq.com>
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): Pattern B player POV + JSON repair + drop seedImage tier

  Three hotfixes surfaced by manual end-to-end testing of the multi-agent
  pipeline.

  F1: Player viewpoint (galgame Pattern B):
  - Writer accepts speaker="你" for player dialog (renders in the dialog
    box, never TTS'd because no Character record exists for "你"). Filter
    POV variants (玩家/我/主角/protagonist/player/I/me/...) from
    activeCharacters so CharacterDesigner never wastes API calls on the
    player. Two-layer defense: explicit prompt rule in WRITER_SYSTEM +
    code normalization (POV_VARIANTS set, isPovName, normalizeSpeakerName).
  - Cinematographer and Painter prompts gain a "player never in frame"
    rule so the player never appears in any rendered scene.
  - Cinematographer gains a dynamic camera policy driven by the entry
    beat's speaker: NPC-speaker → close-up looking toward camera;
    "你"-speaker → medium shot of attentive NPC; no speaker → wide
    establishing shot.
  - director.ts filters POV from orphanSpeakers so provisionVoiceForName
    never fires for "你".

  F2: JSON parsing robustness:
  - parseJsonLoose gains a 4th repair tier: strip JS-style comments, strip
    trailing commas, insert missing commas between adjacent objects /
    arrays / quoted values. Logs the first 800 chars of raw LLM output
    when all repair attempts fail, so we can see what the model emitted.

  F3: Drop seedImage, use referenceImages for prior scene:
  - FLUX.2 [klein] 9B KV does not support seedImage (img2img). Removed
    Tier A (seedImage+refs) and Tier C (seedImage only) from the Painter
    degradation chain.
    New layout: the prior scene's image slots into referenceImages[0] for
    spatial continuity, character portraits fill slots 1-3 (Runware caps
    at 4 total). Cinematographer is instructed to emphasize continuity
    when sceneKey matches a prior scene.

  All five package typechecks pass.

  Co-Authored-By: QiChen88 <2291969160@qq.com>
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): address Copilot review feedback on #6

  Three targeted fixes from PR #6 Copilot review.

  F4: Stale seedImage/img2img docstrings
  Four locations still referenced the original img2img design after F3
  switched to referenceImages-based spatial continuity:
  - types/index.ts:57   Scene.sceneKey docstring
  - types/index.ts:63   Scene.imageUuid docstring
  - director.ts:34      pipeline diagram in module block comment
  - director.ts:128     directScene JSDoc
  Doc-only changes; misleading wording corrected to mention
  referenceImages. (The design-rationale comment in
  pickPriorSceneReference is kept: it explains WHY we don't use seedImage
  and is load-bearing context.)

  F5: Remove JS-comment stripping from JSON repair pass
  parseJsonLoose's repair tier previously stripped `// ...` and
  `/* ... */` across the entire text, which would corrupt JSON string
  values containing URLs (e.g. "https://example.com" → "https:"). Since
  LLMs in `responseFormat: "json_object"` mode essentially never emit
  comments, dropping the comment-stripping step is a net win for safety.
  Trailing-comma and missing-comma repair (the high-frequency failures)
  are kept.

  F6: Pattern B parity on the insert-beat path
  Previously: directInsertBeat's INSERT_BEAT_SYSTEM forbade any speaker
  not in session.characters, and the orchestrator's unregistered-speaker
  guard demoted such lines to narration. This meant the player could not
  speak via speaker="你" in transient in-scene beats, which was
  inconsistent with the Writer path.
  Fix:
  - INSERT_BEAT_SYSTEM prompt now allows speaker="你" (NPC name OR "你")
    and rejects other POV variants
  - directInsertBeat applies normalizeSpeakerName to the LLM output, same
    as the Writer path, so POV variants collapse to "你"
  - lineDelivery is dropped when speaker="你" (no TTS for player)
  - orchestrator's unregistered-speaker guard adds a `speaker !== "你"`
    exception so Pattern B player dialog passes through

  Co-Authored-By: QiChen88 <2291969160@qq.com>
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(engine): drop "JS-style comments" from parseJsonLoose header

  The function header listed JS-style comments as a step-4 repair, but F5
  already removed comment stripping from `repairJsonString` because the
  regex would corrupt URLs inside JSON string values. The inner function's
  comment was updated then; this header was missed.

  Doc-only sync from second-round Copilot review on #6.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: QiChen88 <2291969160@qq.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
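
The F5 rationale (comment stripping corrupts URLs, while trailing-comma repair is worth keeping) can be reproduced in a few lines. This is a minimal sketch under stated assumptions: `stripCommentsNaive` and `stripTrailingCommas` are hypothetical helpers written for illustration, not the repository's `repairJsonString`, whose actual regexes do not appear in this message.

```typescript
// Hypothetical illustration only; not the code in jsonParser.ts.

// Naive comment stripping of the kind F5 removed. The regex cannot tell a
// real comment from "//" inside a JSON string value, so it truncates URLs.
function stripCommentsNaive(text: string): string {
  return text
    .replace(/\/\/.*$/gm, "") // line comments: "//" through end of line
    .replace(/\/\*[\s\S]*?\*\//g, ""); // block comments
}

// Trailing-comma repair: drop a comma that directly precedes } or ].
// Caveat: this regex is also string-unaware, so a string value that
// literally contains ",}" would be altered as well.
function stripTrailingCommas(text: string): string {
  return text.replace(/,(\s*[}\]])/g, "$1");
}

// Returns true when the text parses as JSON.
function parses(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}
```

With these helpers, `stripCommentsNaive('{"link": "https://example.com", "n": 1}')` leaves `{"link": "https:`, which no longer parses, whereas `stripTrailingCommas('{"a": [1, 2,], "b": 3,}')` yields text that `JSON.parse` accepts.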
2026-05-29 13:30:24 +08:00
parent e261f4a346
commit def1b25bd9
14 changed files with 1733 additions and 562 deletions
@@ -0,0 +1,192 @@
+import { chat, generateImage, uploadImage } from "@yume/ai-client";
+import { provisionVoice } from "@yume/tts-client";
+import type {
+  Character,
+  CharacterVoice,
+  EngineConfig,
+  Session,
+} from "@yume/types";
+import { parseJsonLoose } from "../jsonParser";
+import { mockImageBase64 } from "../mockImage";
+import {
+  CHARACTER_DESIGNER_SYSTEM,
+  buildCharacterDesignerUserMessage,
+  buildCharacterPortraitPrompt,
+} from "../prompts";
+
+// ──────────────────────────────────────────────────────────────────────
+//  CharacterDesigner agent — designs ONE new character end-to-end.
+//
+//  Pipeline (per character, all the slow parts are parallelized):
+//
+//    1. LLM call — designs BOTH visual + voice cards in one shot
+//       (intentional: same agent thinks about who this character IS,
+//        which keeps appearance and vocal personality coherent)
+//
+//    2. In parallel:
+//       a. Image gen — base portrait from visualDescription + styleGuide
+//          then upload to Runware → get UUID for cheap re-reference
+//       b. Voice provisioning — Xiaomi MiMo voicedesign from voiceDescription
+//          → reference audio for later voiceclone synth
+//
+//    3. Returns merged Character ready to be added to session.characters
+//
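+//  The portrait and voice steps degrade gracefully: if image gen fails we
+//  return the character without a portrait; if voice gen fails we return
+//  it without a voice, so the game keeps running when either one fails.
+//  A failure in the step-1 LLM call is not caught in this file and
+//  propagates to the caller.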
+// ──────────────────────────────────────────────────────────────────────
+
+type CharacterDesignOutput = {
+  visualDescription?: string;
+  voiceDescription?: string;
+};
+
+// TEMP: per-phase timing for latency diagnosis. Same convention as the
+// orchestrator's tlog. Remove after we have data on real-world numbers.
+function tlog(label: string, t0: number): void {
+  console.log(`${label}: ${Date.now() - t0}ms`);
+}
+
+async function runDesignLLM(
+  config: EngineConfig,
+  session: Session,
+  charName: string,
+): Promise<CharacterDesignOutput> {
+  const raw = await chat(
+    config.text,
+    [
+      { role: "system", content: CHARACTER_DESIGNER_SYSTEM },
+      {
+        role: "user",
+        content: buildCharacterDesignerUserMessage(charName, session),
+      },
+    ],
+    { temperature: 0.7, responseFormat: "json_object" },
+  );
+  return parseJsonLoose<CharacterDesignOutput>(raw);
+}
+
+// Generate the per-character base portrait and upload it. The portrait is
+// a "concept sheet" — single character, neutral pose, plain background —
+// so it works well as a Runware referenceImages anchor for later scenes.
+//
+// Returns both the base64 (for client-side asset use, e.g., character
+// sprite entrance animations) and the Runware UUID (for cheap referencing
+// in subsequent Painter calls without resending the 100KB+ base64 each
+// time).
+//
+// The upload step is best-effort: if it fails, we still return the base64
+// so the next scene can pass it as a referenceImages entry directly (just
+// pays the bandwidth cost each call instead of once).
+async function renderAndUploadPortrait(
+  config: EngineConfig,
+  charName: string,
+  visualDescription: string,
+  styleGuide: string,
+): Promise<{ basePortraitBase64?: string; basePortraitUuid?: string }> {
+  let base64: string;
+  try {
+    if (config.mockImage) {
+      base64 = await mockImageBase64();
+    } else {
+      const prompt = buildCharacterPortraitPrompt(
+        charName,
+        visualDescription,
+        styleGuide,
+      );
+      base64 = await generateImage(config.image, prompt);
+    }
+  } catch (err) {
+    const msg = err instanceof Error ? err.message : String(err);
+    console.error(`[characterDesigner] portrait gen failed for ${charName}: ${msg}`);
+    return {}; // no portrait at all — degrade gracefully
+  }
+
+  // Skip upload in mock mode — the mock image is the same static SVG every
+  // time and uploading it gives us a UUID that points to a useless asset.
+  if (config.mockImage) {
+    return { basePortraitBase64: base64 };
+  }
+
+  try {
+    const uuid = await uploadImage(config.image, base64);
+    return { basePortraitBase64: base64, basePortraitUuid: uuid };
+  } catch (err) {
+    const msg = err instanceof Error ? err.message : String(err);
+    console.warn(
+      `[characterDesigner] portrait upload failed for ${charName}: ${msg} — will pass base64 in subsequent calls`,
+    );
+    return { basePortraitBase64: base64 };
+  }
+}
+
+async function provisionVoiceSafe(
+  config: EngineConfig,
+  voiceDescription: string,
+  charName: string,
+): Promise<CharacterVoice | undefined> {
+  if (!config.tts) return undefined;
+  try {
+    return await provisionVoice(config.tts, voiceDescription);
+  } catch (err) {
+    const msg = err instanceof Error ? err.message : String(err);
+    console.error(`[characterDesigner] voice provision failed for ${charName}: ${msg}`);
+    return undefined;
+  }
+}
+
+// Single-character design pipeline. Called by the orchestrator once per
+// NEW character name; multiple characters in the same scene run their
+// pipelines in parallel at the orchestrator level.
+export async function designCharacter(
+  config: EngineConfig,
+  session: Session,
+  charName: string,
+): Promise<Character> {
+  const tTotal = Date.now();
+
+  // Step 1 — LLM design (visual + voice). Must complete first.
+  const tDesign = Date.now();
+  const design = await runDesignLLM(config, session, charName);
+  tlog(`[charDesigner ${charName}] design LLM`, tDesign);
+
+  const visualDescription = design.visualDescription?.trim();
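+  // Fallback when the LLM returns no voice card. The Chinese prompt reads:
+  // "Infer the character's gender, age, and temperament from the name
+  // <charName> and generate the best-fitting voice. World setting: <...>".
+  const voiceDescription =
+    design.voiceDescription?.trim() ||
+    `请根据角色名「${charName}」推断其性别、年龄与气质，生成最贴合的音色。所属世界观：${session.worldSetting}`;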
+
+  // Step 2 — parallel: portrait + voice provisioning.
+  const tProvision = Date.now();
+  const portraitPromise = visualDescription
+    ? renderAndUploadPortrait(config, charName, visualDescription, session.styleGuide)
+    : Promise.resolve({} as Awaited<ReturnType<typeof renderAndUploadPortrait>>);
+  const voicePromise = provisionVoiceSafe(config, voiceDescription, charName);
+
+  const [portrait, voice] = await Promise.all([portraitPromise, voicePromise]);
+  tlog(`[charDesigner ${charName}] portrait+voice parallel`, tProvision);
+
+  tlog(`[charDesigner ${charName}] TOTAL`, tTotal);
+
+  return {
+    name: charName,
+    voiceDescription,
+    visualDescription,
+    basePortraitBase64: portrait.basePortraitBase64,
+    basePortraitUuid: portrait.basePortraitUuid,
+    voice,
+  };
+}
+
+// Provision a voice ONLY, for a speaker name that the Writer used without
+// the character having been designed (e.g., a name that was not listed in
+// `activeCharacters` but appeared as a speaker). Used by the
+// directInsertBeat path and as a safety net in directScene. No portrait is
+// generated for these characters; they get a name and a voice only.
+export async function provisionVoiceForName(
+  config: EngineConfig,
+  session: Session,
+  charName: string,
+): Promise<Character> {
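+  // Same Chinese fallback prompt as in designCharacter: infer gender, age,
+  // and temperament from the name, given the session's world setting.
+  const voiceDescription = `请根据角色名「${charName}」推断其性别、年龄与气质，生成最贴合的音色。所属世界观：${session.worldSetting}`;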
+  const voice = await provisionVoiceSafe(config, voiceDescription, charName);
+  return { name: charName, voiceDescription, voice };
+}
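
The comment on `designCharacter` says that multiple new characters in a scene run their pipelines in parallel at the orchestrator level, but the orchestrator is not part of this file. The following is a minimal sketch of how such a fan-out could look, assuming the caller already knows which names are registered. `designAllNew` and `CharacterLike` are hypothetical names introduced here, and the design function is passed in as a parameter so the sketch does not depend on the real `@yume/*` packages.

```typescript
// Hypothetical stand-in for the Character type from @yume/types.
type CharacterLike = { name: string };

// Runs one design pipeline per NEW name concurrently. A rejected pipeline
// is logged and skipped, so one failing character does not sink the scene.
async function designAllNew(
  names: string[],
  known: ReadonlySet<string>,
  designOne: (name: string) => Promise<CharacterLike>,
): Promise<CharacterLike[]> {
  // De-duplicate, then keep only names with no existing Character record.
  const fresh = Array.from(new Set(names)).filter((n) => !known.has(n));

  const results = await Promise.all(
    fresh.map((n) =>
      designOne(n).catch((err: unknown): undefined => {
        const msg = err instanceof Error ? err.message : String(err);
        console.error(`[sketch] design failed for ${n}: ${msg}`);
        return undefined;
      }),
    ),
  );

  // Drop the failed entries; the order of survivors follows `fresh`.
  return results.filter((c): c is CharacterLike => c !== undefined);
}
```

In the real engine the `designOne` argument would presumably wrap `designCharacter(config, session, name)`. Catching per pipeline rather than around `Promise.all` is a design choice of this sketch: it mirrors the file's degrade-gracefully approach, whereas a bare `Promise.all` would reject as soon as any single character failed.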