From def1b25bd96e51c417d1f72d05ec08904bd6f409 Mon Sep 17 00:00:00 2001
From: Zonghao Yuan <64521992+zonghaoyuan@users.noreply.github.com>
Date: Fri, 29 May 2026 13:30:24 +0800
Subject: [PATCH] feat(engine): multi-agent character consistency pipeline (#6)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* feat(types): Character.voiceDescription rename + visual fields + Scene.sceneKey

Prepares the type surface for the multi-agent scene pipeline:
- Character.description → voiceDescription (clearer pairing with new
  visualDescription)
- Character gains visualDescription (English appearance card for Painter)
  + basePortraitBase64 + basePortraitUuid (for Runware referenceImages reuse)
- Scene gains sceneKey (English slug for cross-scene img2img continuity)
  + imageUuid (Runware UUID of the scene's rendered image for cheap
  seedImage reuse on subsequent same-sceneKey calls)
- Beat gains activeCharacters[] so the Cinematographer can read which
  characters are on-screen + their poses when composing the establishing shot

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context)

* feat(ai-client): generateImage img2img + multi-reference options + uploadImage

Extends the Runware adapter to support the two anchoring mechanisms
FLUX.2 [klein] 9B KV needs for character + scene visual consistency:
- generateImage gains optional { seedImage, referenceImages, strength }:
  seedImage drives img2img (single starting image, sceneKey continuity),
  referenceImages drives multi-reference anchoring (up to 4 character
  portraits, capped per Runware spec). Default strength 0.85 — FLUX
  ignores strength < 0.8.
- uploadImage POSTs a base64 to Runware's imageUpload taskType and returns
  the UUID, so portraits/scene snapshots can be referenced by UUID on
  subsequent calls instead of resending base64 every scene.
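
As a rough illustration of the option handling described for generateImage, the sketch below builds only the request task object. `buildImageTask` is a hypothetical helper written for this note, not a function in the patch, and it omits the network call and most task fields; the 0.85 default and the cap of 4 are taken from the description above.

```typescript
// Hypothetical helper (not part of the patch): shows how the optional
// anchors could be folded into a Runware-style task object.
type GenerateImageOptions = {
  seedImage?: string;
  referenceImages?: string[];
  strength?: number;
};

const DEFAULT_IMG2IMG_STRENGTH = 0.85; // default named in the commit message
const MAX_REFERENCE_IMAGES = 4; // cap named in the commit message

function buildImageTask(
  prompt: string,
  options?: GenerateImageOptions,
): Record<string, unknown> {
  const task: Record<string, unknown> = {
    taskType: "imageInference",
    positivePrompt: prompt,
  };

  // seedImage switches the request to img2img; strength only matters then.
  if (options?.seedImage) {
    task.seedImage = options.seedImage;
    task.strength = options.strength ?? DEFAULT_IMG2IMG_STRENGTH;
  }

  // Extra references beyond the cap are silently dropped.
  if (options?.referenceImages?.length) {
    task.referenceImages = options.referenceImages.slice(0, MAX_REFERENCE_IMAGES);
  }

  return task;
}
```

With no options the task stays plain text-to-image, so existing callers keep their current behavior.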
Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context)

* feat(engine): multi-agent scene pipeline (Writer→CharDesigner+Cinematographer→Painter)

Replaces the single-LLM directScene with a four-agent pipeline that
specializes each concern and parallelizes the slow parts. Adopts the core
idea from #4 (multi-agent dispatch + character visual consistency) and
grafts it onto the Scene/Beat architecture introduced in #2.

Pipeline per Scene (~9-12s critical path with parallelization):

  Writer LLM (sequential, ~3s)
    │  outputs: sceneSummary + sceneKey + beats[] (each beat carries
    │           activeCharacters[] with poses)
    │
    ├─ CharacterDesigner LLM × N new chars (parallel)
    │    │  outputs: { visualDescription (English appearance card),
    │    │             voiceDescription (Chinese voice card) }
    │    ├─ FLUX portrait gen → upload → UUID (parallel within agent)
    │    └─ Xiaomi MiMo voicedesign provision (parallel within agent)
    │
    └─ Cinematographer LLM (parallel with CharacterDesigner)
         outputs: { shotType, integratedPrompt (English composition +
                    camera position + character blocking) }

  Painter (FLUX img2img + referenceImages, ~1-3s)
    inputs: integratedPrompt + onStageCharacters' archetype block
            + (optional) prior sceneKey-hit scene as seedImage
            + (optional) character portrait UUIDs as referenceImages
    fallback chain: A) both anchors → B) refs only (keeps characters) →
                    C) seed only (keeps backdrop) → D) pure t2i
    output uploaded → Scene.imageUuid for the next sceneKey hop

Why this carving:
- Writer focuses purely on narrative (drops the voice-design duty
  staging's DIRECTOR_SYSTEM was carrying as a side concern).
- CharacterDesigner bundles visual + voice so the agent that thinks
  "who is this character" produces internally-consistent appearance +
  vocal personality (split agents tend to diverge).
- Cinematographer doesn't need character visualDescriptions — Painter
  appends archetypes after — so it parallelizes with CharacterDesigner.
- sceneKey enables cross-scene backdrop continuity that Scene/Beat
  doesn't cover (Scene/Beat only reuses backdrop WITHIN a scene's beats;
  sceneKey reuses across scenes that share a location).
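
The dispatch order in the pipeline above can be sketched with stub agents. All four functions below are hypothetical stand-ins that return canned values instead of calling an LLM or FLUX, and their signatures are assumptions made for this illustration; only the ordering (Writer first, CharacterDesigner and Cinematographer in parallel, Painter last) follows the description.

```typescript
// Stub types and agents: hypothetical stand-ins, not the engine's real API.
type WriterOut = { sceneSummary: string; newCharacters: string[] };
type CharacterCard = { name: string; visualDescription: string };
type Shot = { shotType: string; integratedPrompt: string };

async function writer(seed: string): Promise<WriterOut> {
  return { sceneSummary: `scene about ${seed}`, newCharacters: ["Aya", "Ren"] };
}

async function designCharacter(name: string): Promise<CharacterCard> {
  return { name, visualDescription: `appearance card for ${name}` };
}

async function cinematographer(summary: string): Promise<Shot> {
  return { shotType: "wide", integratedPrompt: `wide shot: ${summary}` };
}

async function painter(shot: Shot, cast: CharacterCard[]): Promise<string> {
  // The Painter appends character archetypes after the composed prompt.
  const archetypes = cast.map((c) => c.visualDescription).join("; ");
  return `${shot.integratedPrompt} | ${archetypes}`;
}

async function runScenePipeline(seed: string): Promise<string> {
  // Writer runs first: both downstream agents need its output.
  const script = await writer(seed);

  // One CharacterDesigner per new character, all concurrent with each other
  // and with the Cinematographer, which never reads the character cards.
  const [cast, shot] = await Promise.all([
    Promise.all(script.newCharacters.map((n) => designCharacter(n))),
    cinematographer(script.sceneSummary),
  ]);

  return painter(shot, cast);
}
```

Because the Cinematographer does not depend on the character cards, the critical path is roughly Writer plus the slower of the two parallel branches plus Painter.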

Other changes:
- voice.ts loses provisionVoicesForScene (moved into CharacterDesigner);
  keeps synthesizeBeat for the lazy per-beat /api/beat-audio path.
- renderer.ts deleted (replaced by agents/painter.ts).
- directInsertBeat (vision-driven in-scene exploration) stays single-LLM
  — it forbids new characters and produces no image, so multi-agent
  doesn't apply.

apps/web is unchanged: orchestrator.ts keeps the same exports
(startSession / requestScene / visionDecide / requestInsertBeat /
requestBeatAudio) with identical request/response shapes.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context)

* fix(engine): Pattern B player POV + JSON repair + drop seedImage tier

Three hotfixes surfaced by manual end-to-end testing of the multi-agent
pipeline.

F1 — Player viewpoint (galgame Pattern B):
- Writer accepts speaker="你" for player dialog (renders in dialog box,
  never TTS'd because no Character record exists for "你"). Filter POV
  variants (玩家/我/主角/protagonist/player/I/me/...) from
  activeCharacters so CharacterDesigner never wastes API calls on the
  player. Two-layer defense: explicit prompt rule in WRITER_SYSTEM +
  code normalization (POV_VARIANTS set, isPovName, normalizeSpeakerName).
- Cinematographer and Painter prompts gain "player never in frame" rule
  so the player never appears in any rendered scene.
- Cinematographer gains dynamic camera policy driven by the entry beat's
  speaker: NPC-speaker → close-up looking toward camera; "你"-speaker →
  medium shot of attentive NPC; no speaker → wide establishing shot.
- director.ts filters POV from orphanSpeakers so provisionVoiceForName
  never fires for "你".

F2 — JSON parsing robustness:
- parseJsonLoose gains a 4th repair tier: strip JS-style comments, strip
  trailing commas, insert missing commas between adjacent objects /
  arrays / quoted values. Logs the first 800 chars of raw LLM output
  when all repair attempts fail, so we can see what the model emitted.
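
A minimal sketch of the comma-repair part of such a tier, assuming simple regex passes. `repairCommasSketch` is a hypothetical name and is not the patch's `repairJsonString`; it covers only trailing commas and missing commas between adjacent objects, and like any regex pass it could also rewrite matching text inside string values.

```typescript
// Hypothetical sketch, not the patch's implementation. Regex-based repair
// is a last resort: it does not understand JSON strings, so it should only
// run after a strict JSON.parse has already failed.
function repairCommasSketch(text: string): string {
  return text
    // Insert a comma between adjacent objects: `} {` becomes `},{`.
    .replace(/}\s*{/g, "},{")
    // Drop a trailing comma before a closing brace or bracket.
    .replace(/,\s*([}\]])/g, "$1");
}

function parseWithRepair(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch {
    // Second attempt on the repaired text; errors here propagate.
    return JSON.parse(repairCommasSketch(text));
  }
}
```

For example, `parseWithRepair('{"beats":[{"id":"b1"} {"id":"b2"},],}')` yields an object whose `beats` array has two entries.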

F3 — Drop seedImage, use referenceImages for prior scene:
- FLUX.2 [klein] 9B KV does not support seedImage (img2img). Removed
  Tier A (seedImage+refs) and Tier C (seedImage only) from the Painter
  degradation chain. New layout: prior scene's image slots into
  referenceImages[0] for spatial continuity, character portraits fill
  slots 1-3 (Runware caps at 4 total). Cinematographer instructed to
  emphasize continuity when sceneKey matches a prior scene.

All five package typechecks pass.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context)

* fix(engine): address Copilot review feedback on #6

Three targeted fixes from PR #6 Copilot review.

F4 — Stale seedImage/img2img docstrings
Four locations still referenced the original img2img design after F3
switched to referenceImages-based spatial continuity:
- types/index.ts:57   Scene.sceneKey docstring
- types/index.ts:63   Scene.imageUuid docstring
- director.ts:34      pipeline diagram in module block comment
- director.ts:128     directScene JSDoc
Doc-only changes; misleading wording corrected to mention
referenceImages. (The design-rationale comment in
pickPriorSceneReference is kept — it explains WHY we don't use
seedImage and is load-bearing context.)

F5 — Remove JS-comment stripping from JSON repair pass
parseJsonLoose's repair tier previously stripped `// ...` and
`/* ... */` across the entire text, which would corrupt JSON string
values containing URLs (e.g. "https://example.com" → "https:"). Since
LLMs in `responseFormat: "json_object"` mode essentially never emit
comments, dropping the comment-stripping step is a net win for safety.
Trailing-comma and missing-comma repair (the high-frequency failures)
are kept.

F6 — Pattern B parity on the insert-beat path
Previously: directInsertBeat's INSERT_BEAT_SYSTEM forbade any speaker
not in session.characters, and the orchestrator's unregistered-speaker
guard demoted such lines to narration. This meant the player could not
speak via speaker="你" in transient in-scene beats — inconsistent with
the Writer path.
Fix:
- INSERT_BEAT_SYSTEM prompt now allows speaker="你" (NPC name OR "你")
  and rejects other POV variants
- directInsertBeat applies normalizeSpeakerName to the LLM output, same
  as the Writer path, so POV variants collapse to "你"
- lineDelivery is dropped when speaker="你" (no TTS for player)
- orchestrator's unregistered-speaker guard adds a `speaker !== "你"`
  exception so Pattern B player dialog passes through

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context)

* docs(engine): drop "JS-style comments" from parseJsonLoose header

The function header listed JS-style comments as a step-4 repair, but F5
already removed comment stripping from `repairJsonString` because the
regex would corrupt URLs inside JSON string values. The inner function's
comment was updated then; this header was missed.

Doc-only sync from second-round Copilot review on #6.
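
The URL corruption that motivated removing comment stripping is easy to reproduce. The regexes below are an assumed, deliberately naive version of such a pass (the removed code is not shown in this patch), and `stripJsCommentsNaively` is a hypothetical name.

```typescript
// Assumed naive comment stripper, for illustration only. It has no notion
// of JSON string boundaries, so `//` inside a URL looks like a comment.
function stripJsCommentsNaively(text: string): string {
  return text
    .replace(/\/\/.*$/gm, "") // line comments: `//` to end of line
    .replace(/\/\*[\s\S]*?\*\//g, ""); // block comments
}

const rawJson = '{"link": "https://example.com"}';
const strippedJson = stripJsCommentsNaively(rawJson);

// The original parses fine; the stripped text is truncated at `https:`
// and no longer parses at all.
let strippedParses = true;
try {
  JSON.parse(strippedJson);
} catch {
  strippedParses = false;
}
```

A string-aware tokenizer could strip comments safely, but since JSON-mode output essentially never contains comments, removing the step is the simpler fix.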
Co-Authored-By: Claude Opus 4.7 (1M context) --------- Co-authored-by: QiChen88 <2291969160@qq.com> Co-authored-by: Claude Opus 4.7 (1M context) --- packages/ai-client/src/image.ts | 138 ++++- packages/ai-client/src/index.ts | 3 +- .../engine/src/agents/characterDesigner.ts | 192 ++++++ packages/engine/src/agents/cinematographer.ts | 86 +++ packages/engine/src/agents/painter.ts | 145 +++++ packages/engine/src/agents/writer.ts | 386 +++++++++++++ packages/engine/src/director.ts | 545 +++++++++--------- packages/engine/src/index.ts | 5 +- packages/engine/src/jsonParser.ts | 56 +- packages/engine/src/orchestrator.ts | 133 ++--- packages/engine/src/prompts.ts | 436 ++++++++++++-- packages/engine/src/renderer.ts | 12 - packages/engine/src/voice.ts | 93 +-- packages/types/src/index.ts | 65 ++- 14 files changed, 1733 insertions(+), 562 deletions(-) create mode 100644 packages/engine/src/agents/characterDesigner.ts create mode 100644 packages/engine/src/agents/cinematographer.ts create mode 100644 packages/engine/src/agents/painter.ts create mode 100644 packages/engine/src/agents/writer.ts delete mode 100644 packages/engine/src/renderer.ts diff --git a/packages/ai-client/src/image.ts b/packages/ai-client/src/image.ts index 837e8de..4710e16 100644 --- a/packages/ai-client/src/image.ts +++ b/packages/ai-client/src/image.ts @@ -4,8 +4,22 @@ import { fetchWithRetry } from "./fetchWithRetry"; // Runware uses its own task-array protocol (not OpenAI-compatible). // POST with [{ taskType: "imageInference", ... }]; errors come // back as a 200 with `errors[]`, so we have to inspect the body either way. + +// FLUX img2img specifics: +// - strength < 0.8 has minimal-to-no visible effect on FLUX models (per +// Runware docs); we default to 0.85 which leaves room to deviate while +// still anchoring on the seed image's composition. 
+// - referenceImages caps at 4 per request; the FLUX.2 [klein] 9B KV model +// (runware:400@6) accelerates multi-reference inference by ~2.5× via its +// KV cache for reference latents (cached only WITHIN one inference run — +// not persisted across API calls, hence the upload-once-then-reference +// strategy below). +const DEFAULT_IMG2IMG_STRENGTH = 0.85; +const MAX_REFERENCE_IMAGES = 4; + type RunwareImageResult = { imageBase64Data?: string; + imageUUID?: string; }; type RunwareError = { code?: string; @@ -17,27 +31,58 @@ type RunwareResponse = { errors?: RunwareError[]; }; +export type GenerateImageOptions = { + /** + * Reference image (UUID, plain base64, or data URI) to use as the + * img2img starting point. When set, FLUX preserves the seed image's + * composition and applies `strength` to allow deviation from it. + * Used for cross-scene visual continuity when sceneKey hits. + */ + seedImage?: string; + /** + * Reference images (UUIDs or base64) to condition the generation on — + * typically character portraits to anchor identity / outfit / style + * across scenes. Runware caps at 4; we silently truncate beyond that. + */ + referenceImages?: string[]; + /** 0–1, FLUX needs ≥ 0.8 to actually have an effect. */ + strength?: number; +}; + +// ────────────────────────────────────────────────────────────────────── +// generateImage — text-to-image (default) or img2img / multi-reference +// when seedImage / referenceImages are supplied. Returns base64. 
+// ────────────────────────────────────────────────────────────────────── + export async function generateImage( config: ProviderConfig, prompt: string, + options?: GenerateImageOptions, ): Promise { const url = config.baseUrl.replace(/\/$/, ""); - const body = [ - { - taskType: "imageInference", - taskUUID: crypto.randomUUID(), - model: config.model, - positivePrompt: prompt, - width: 1792, - height: 1024, - steps: 4, - CFGScale: 3.5, - numberResults: 1, - outputType: "base64Data", - outputFormat: "PNG", - }, - ]; + const task: Record = { + taskType: "imageInference", + taskUUID: crypto.randomUUID(), + model: config.model, + positivePrompt: prompt, + width: 1792, + height: 1024, + steps: 4, + CFGScale: 3.5, + numberResults: 1, + outputType: "base64Data", + outputFormat: "PNG", + }; + + if (options?.seedImage) { + task.seedImage = options.seedImage; + task.strength = options.strength ?? DEFAULT_IMG2IMG_STRENGTH; + } + + if (options?.referenceImages?.length) { + task.referenceImages = options.referenceImages.slice(0, MAX_REFERENCE_IMAGES); + } const res = await fetchWithRetry(url, { method: "POST", @@ -45,7 +90,7 @@ export async function generateImage( "Content-Type": "application/json", Authorization: `Bearer ${config.apiKey}`, }, - body: JSON.stringify(body), + body: JSON.stringify([task]), }); const text = await res.text(); @@ -66,9 +111,64 @@ export async function generateImage( const b64 = json.data?.[0]?.imageBase64Data; if (!b64) { - throw new Error( - `No image in Runware response: ${text.slice(0, 300)}`, - ); + throw new Error(`No image in Runware response: ${text.slice(0, 300)}`); } return b64; } + +// ────────────────────────────────────────────────────────────────────── +// uploadImage — registers a base64 image on Runware and returns its +// UUID, so subsequent generateImage calls can pass the UUID in +// referenceImages / seedImage instead of resending the base64 payload +// every time. 
Character base portraits and scene snapshots both flow +// through this path. +// +// Runware exposes the imageUpload taskType for exactly this purpose. +// Returns the UUID. Caller treats a thrown error as "fall back to +// sending base64 next time" — non-fatal. +// ────────────────────────────────────────────────────────────────────── + +export async function uploadImage( + config: ProviderConfig, + base64: string, +): Promise { + const url = config.baseUrl.replace(/\/$/, ""); + + const body = [ + { + taskType: "imageUpload", + taskUUID: crypto.randomUUID(), + image: `data:image/png;base64,${base64}`, + }, + ]; + + const res = await fetchWithRetry(url, { + method: "POST", + headers: { + "Content-Type": "application/json", + Authorization: `Bearer ${config.apiKey}`, + }, + body: JSON.stringify(body), + }); + + const text = await res.text(); + let json: RunwareResponse; + try { + json = JSON.parse(text) as RunwareResponse; + } catch { + throw new Error(`Image upload API error ${res.status}: ${text.slice(0, 500)}`); + } + + if (json.errors?.length) { + const e = json.errors[0]!; + throw new Error( + `Runware upload error [${e.code ?? "unknown"}]: ${e.message ?? 
"no message"}`, + ); + } + + const uuid = json.data?.[0]?.imageUUID; + if (!uuid) { + throw new Error(`No UUID in upload response: ${text.slice(0, 300)}`); + } + return uuid; +} diff --git a/packages/ai-client/src/index.ts b/packages/ai-client/src/index.ts index e8fed49..13fa290 100644 --- a/packages/ai-client/src/index.ts +++ b/packages/ai-client/src/index.ts @@ -1,4 +1,5 @@ export { chat } from "./chat"; -export { generateImage } from "./image"; +export { generateImage, uploadImage } from "./image"; +export type { GenerateImageOptions } from "./image"; export { interpretClick } from "./vision"; export type { ChatMessage } from "./chat"; diff --git a/packages/engine/src/agents/characterDesigner.ts b/packages/engine/src/agents/characterDesigner.ts new file mode 100644 index 0000000..ae5f505 --- /dev/null +++ b/packages/engine/src/agents/characterDesigner.ts @@ -0,0 +1,192 @@ +import { chat, generateImage, uploadImage } from "@yume/ai-client"; +import { provisionVoice } from "@yume/tts-client"; +import type { + Character, + CharacterVoice, + EngineConfig, + Session, +} from "@yume/types"; +import { parseJsonLoose } from "../jsonParser"; +import { mockImageBase64 } from "../mockImage"; +import { + CHARACTER_DESIGNER_SYSTEM, + buildCharacterDesignerUserMessage, + buildCharacterPortraitPrompt, +} from "../prompts"; + +// ────────────────────────────────────────────────────────────────────── +// CharacterDesigner agent — designs ONE new character end-to-end. +// +// Pipeline (per character, all the slow parts are parallelized): +// +// 1. LLM call — designs BOTH visual + voice cards in one shot +// (intentional: same agent thinks about who this character IS, +// which keeps appearance and vocal personality coherent) +// +// 2. In parallel: +// a. Image gen — base portrait from visualDescription + styleGuide +// then upload to Runware → get UUID for cheap re-reference +// b. 
Voice provisioning — Xiaomi MiMo voicedesign from voiceDescription +// → reference audio for later voiceclone synth +// +// 3. Returns merged Character ready to be added to session.characters +// +// Each step degrades gracefully — if image gen fails we return the +// character without a portrait; if voice gen fails we return without +// voice. The game keeps running even when sub-components fail. +// ────────────────────────────────────────────────────────────────────── + +type CharacterDesignOutput = { + visualDescription?: string; + voiceDescription?: string; +}; + +// TEMP: per-phase timing for latency diagnosis. Same convention as the +// orchestrator's tlog. Remove after we have data on real-world numbers. +function tlog(label: string, t0: number): void { + console.log(`${label}: ${Date.now() - t0}ms`); +} + +async function runDesignLLM( + config: EngineConfig, + session: Session, + charName: string, +): Promise { + const raw = await chat( + config.text, + [ + { role: "system", content: CHARACTER_DESIGNER_SYSTEM }, + { + role: "user", + content: buildCharacterDesignerUserMessage(charName, session), + }, + ], + { temperature: 0.7, responseFormat: "json_object" }, + ); + return parseJsonLoose(raw); +} + +// Generate the per-character base portrait and upload it. The portrait is +// a "concept sheet" — single character, neutral pose, plain background — +// so it works well as a Runware referenceImages anchor for later scenes. +// +// Returns both the base64 (for client-side asset use, e.g., 立绘登场 +// animations) and the Runware UUID (for cheap referencing in subsequent +// Painter calls without resending the 100KB+ base64 each time). +// +// The upload step is best-effort: if it fails, we still return the base64 +// so the next scene can pass it as a referenceImages entry directly (just +// pays the bandwidth cost each call instead of once). 
+async function renderAndUploadPortrait( + config: EngineConfig, + charName: string, + visualDescription: string, + styleGuide: string, +): Promise<{ basePortraitBase64?: string; basePortraitUuid?: string }> { + let base64: string; + try { + if (config.mockImage) { + base64 = await mockImageBase64(); + } else { + const prompt = buildCharacterPortraitPrompt( + charName, + visualDescription, + styleGuide, + ); + base64 = await generateImage(config.image, prompt); + } + } catch (err) { + const msg = err instanceof Error ? err.message : String(err); + console.error(`[characterDesigner] portrait gen failed for ${charName}: ${msg}`); + return {}; // no portrait at all — degrade gracefully + } + + // Skip upload in mock mode — the mock image is the same static SVG every + // time and uploading it gives us a UUID that points to a useless asset. + if (config.mockImage) { + return { basePortraitBase64: base64 }; + } + + try { + const uuid = await uploadImage(config.image, base64); + return { basePortraitBase64: base64, basePortraitUuid: uuid }; + } catch (err) { + const msg = err instanceof Error ? err.message : String(err); + console.warn( + `[characterDesigner] portrait upload failed for ${charName}: ${msg} — will pass base64 in subsequent calls`, + ); + return { basePortraitBase64: base64 }; + } +} + +async function provisionVoiceSafe( + config: EngineConfig, + voiceDescription: string, + charName: string, +): Promise { + if (!config.tts) return undefined; + try { + return await provisionVoice(config.tts, voiceDescription); + } catch (err) { + const msg = err instanceof Error ? err.message : String(err); + console.error(`[characterDesigner] voice provision failed for ${charName}: ${msg}`); + return undefined; + } +} + +// Single-character design pipeline. Called by the orchestrator once per +// NEW character name; multiple characters in the same scene run their +// pipelines in parallel at the orchestrator level. 
+export async function designCharacter( + config: EngineConfig, + session: Session, + charName: string, +): Promise { + const tTotal = Date.now(); + + // Step 1 — LLM design (visual + voice). Must complete first. + const tDesign = Date.now(); + const design = await runDesignLLM(config, session, charName); + tlog(`[charDesigner ${charName}] design LLM`, tDesign); + + const visualDescription = design.visualDescription?.trim(); + const voiceDescription = + design.voiceDescription?.trim() || + `请根据角色名「${charName}」推断其性别、年龄与气质,生成最贴合的音色。所属世界观:${session.worldSetting}`; + + // Step 2 — parallel: portrait + voice provisioning. + const tProvision = Date.now(); + const portraitPromise = visualDescription + ? renderAndUploadPortrait(config, charName, visualDescription, session.styleGuide) + : Promise.resolve({} as Awaited>); + const voicePromise = provisionVoiceSafe(config, voiceDescription, charName); + + const [portrait, voice] = await Promise.all([portraitPromise, voicePromise]); + tlog(`[charDesigner ${charName}] portrait+voice parallel`, tProvision); + + tlog(`[charDesigner ${charName}] TOTAL`, tTotal); + + return { + name: charName, + voiceDescription, + visualDescription, + basePortraitBase64: portrait.basePortraitBase64, + basePortraitUuid: portrait.basePortraitUuid, + voice, + }; +} + +// Provision voice ONLY for an existing character that the LLM mentioned +// without us having designed them yet (e.g., 编剧 referenced a name that +// wasn't in `activeCharacters` but appeared as a speaker). Used by +// directInsertBeat path and as a safety net in directScene. No portrait +// is generated for these — they get a name + voice only. 
+export async function provisionVoiceForName( + config: EngineConfig, + session: Session, + charName: string, +): Promise { + const voiceDescription = `请根据角色名「${charName}」推断其性别、年龄与气质,生成最贴合的音色。所属世界观:${session.worldSetting}`; + const voice = await provisionVoiceSafe(config, voiceDescription, charName); + return { name: charName, voiceDescription, voice }; +} diff --git a/packages/engine/src/agents/cinematographer.ts b/packages/engine/src/agents/cinematographer.ts new file mode 100644 index 0000000..df0acc2 --- /dev/null +++ b/packages/engine/src/agents/cinematographer.ts @@ -0,0 +1,86 @@ +import { chat } from "@yume/ai-client"; +import type { BeatActiveCharacter, ProviderConfig } from "@yume/types"; +import { parseJsonLoose } from "../jsonParser"; +import { + CINEMATOGRAPHER_SYSTEM, + buildCinematographerUserMessage, +} from "../prompts"; + +// ────────────────────────────────────────────────────────────────────── +// Cinematographer agent — translates the Writer's narrative scene +// summary into an English compositional prompt for FLUX. +// +// Reads: sceneSummary + entry beat's activeCharacters (poses) +// + prior sceneKey (for continuity hints) +// Writes: { shotType, integratedPrompt } +// +// Does NOT describe character APPEARANCE — that's appended at the +// Painter stage from session.characters[].visualDescription. The +// Cinematographer only positions named characters in the frame and +// describes the environment + lighting + camera framing. +// +// This separation lets the Cinematographer run IN PARALLEL with the +// CharacterDesigner — neither needs the other's output. They both +// feed independently into the Painter prompt. 
+// ────────────────────────────────────────────────────────────────────── + +export type CinematographerOutput = { + shotType: string; + integratedPrompt: string; +}; + +type RawCinematographerOutput = { + shotType?: string; + integratedPrompt?: string; +}; + +export type CinematographerInput = { + sceneSummary: string; + styleGuide: string; + entryBeatActive: BeatActiveCharacter[]; + /** Entry beat's speaker — drives the dynamic camera policy: + * NPC name → NPC looks toward camera (close-up) + * "你" → medium shot, NPC listens + * undefined → wide establishing shot */ + entryBeatSpeaker?: string; + priorSceneKey?: string; + currentSceneKey?: string; +}; + +export async function runCinematographer( + config: ProviderConfig, + input: CinematographerInput, +): Promise { + const raw = await chat( + config, + [ + { role: "system", content: CINEMATOGRAPHER_SYSTEM }, + { + role: "user", + content: buildCinematographerUserMessage( + input.sceneSummary, + input.styleGuide, + input.entryBeatActive, + input.entryBeatSpeaker, + input.priorSceneKey, + input.currentSceneKey, + ), + }, + ], + { temperature: 0.6, responseFormat: "json_object" }, + ); + + const parsed = parseJsonLoose(raw); + + // Fallback: if the LLM produced nothing usable, synthesize a minimal + // integratedPrompt from the Writer's sceneSummary so the Painter has + // SOMETHING to work with rather than blowing up the whole pipeline. + const integratedPrompt = + parsed.integratedPrompt?.trim() || + `A cinematic illustration depicting: ${input.sceneSummary}. 
Wide establishing shot, natural lighting, atmospheric mood.`; + + return { + shotType: parsed.shotType?.trim() || "medium shot", + integratedPrompt, + }; +} diff --git a/packages/engine/src/agents/painter.ts b/packages/engine/src/agents/painter.ts new file mode 100644 index 0000000..e9d6e00 --- /dev/null +++ b/packages/engine/src/agents/painter.ts @@ -0,0 +1,145 @@ +import { generateImage } from "@yume/ai-client"; +import type { GenerateImageOptions } from "@yume/ai-client"; +import type { + Beat, + Character, + EngineConfig, + ProviderConfig, +} from "@yume/types"; +import { mockImageBase64 } from "../mockImage"; +import { buildPainterPrompt } from "../prompts"; + +// ────────────────────────────────────────────────────────────────────── +// Painter — final image generation with multi-reference anchoring. +// +// FLUX.2 [klein] 9B KV does NOT support seedImage (img2img). Instead, +// visual continuity comes entirely from `referenceImages` (capped at 4), +// which the KV-optimized variant accelerates ~2.5× via key-value caching +// of reference latents. +// +// References are slotted in priority order (max 4): +// 1. Prior scene image — when sceneKey matched a previous scene, this +// anchors the same physical space (lighting/layout/style continuity) +// 2. Entry beat's speaker portrait — the NPC the player is talking with +// (most visually prominent) +// 3. Other on-stage NPCs' portraits — secondary characters in the frame +// +// Failure handling — two-tier degradation: +// A. referenceImages call (preferred — full visual anchoring) +// B. pure text-to-image fallback (last resort if Runware refs API errors) +// ────────────────────────────────────────────────────────────────────── + +const MAX_REFERENCE_IMAGES = 4; + +export type PainterInput = { + integratedPrompt: string; + styleGuide: string; + onStageCharacters: Character[]; + /** + * Prior scene's Runware UUID or base64. 
When set (= sceneKey hit a + * prior scene), it slots into referenceImages[0] for spatial continuity. + * Capacity-wise this displaces ONE character portrait — slot is shared + * with character refs, capped at 4 total per Runware spec. + */ + priorSceneImage?: string; +}; + +// Pick the references we send to Runware as `referenceImages`. Priority: +// slot 0: priorSceneImage (if any — sceneKey continuity) +// slot 1: entry beat's speaker portrait (the NPC speaking to the player) +// slot 2+: other on-stage NPCs from entry beat's activeCharacters +// Caps at 4 total. Returns the array exactly as it'll be sent — already +// truncated, already deduplicated. +export function collectReferenceImages( + characters: Character[], + entryBeat: Beat | undefined, + priorSceneImage: string | undefined, +): string[] { + const refs: string[] = []; + const seen = new Set(); + + // Slot 0 — prior scene image for spatial continuity. Goes first because + // backdrop drift is the most jarring discontinuity across same-sceneKey + // scenes; character drift is partially masked by character archetype text + // in the prompt anyway. + if (priorSceneImage) { + refs.push(priorSceneImage); + } + + // Slot 1+ — character portraits, speaker-first. + const speakerName = entryBeat?.speaker; + if (speakerName) { + const speaker = characters.find((c) => c.name === speakerName); + const ref = speaker?.basePortraitUuid ?? speaker?.basePortraitBase64; + if (ref && refs.length < MAX_REFERENCE_IMAGES) { + refs.push(ref); + seen.add(speakerName); + } + } + + for (const c of entryBeat?.activeCharacters ?? []) { + if (refs.length >= MAX_REFERENCE_IMAGES) break; + if (seen.has(c.name)) continue; + const char = characters.find((x) => x.name === c.name); + const ref = char?.basePortraitUuid ?? 
char?.basePortraitBase64; + if (ref) { + refs.push(ref); + seen.add(c.name); + } + } + + return refs.slice(0, MAX_REFERENCE_IMAGES); +} + +async function tryGenerate( + config: ProviderConfig, + prompt: string, + options: GenerateImageOptions, + label: string, +): Promise { + try { + return await generateImage(config, prompt, options); + } catch (err) { + const msg = err instanceof Error ? err.message : String(err); + console.warn(`[painter] ${label} failed: ${msg}`); + return null; + } +} + +export async function runPainter( + config: EngineConfig, + input: PainterInput, + entryBeat: Beat | undefined, +): Promise { + if (config.mockImage) return mockImageBase64(); + + const prompt = buildPainterPrompt( + input.integratedPrompt, + input.styleGuide, + input.onStageCharacters, + ); + + const refs = collectReferenceImages( + input.onStageCharacters, + entryBeat, + input.priorSceneImage, + ); + + // Tier A — with referenceImages (priorSceneImage + character portraits). + // FLUX.2 [klein] 9B KV's KV cache accelerates this multi-reference path + // ~2.5× compared to the non-KV variant. + if (refs.length > 0) { + const r = await tryGenerate( + config.image, + prompt, + { referenceImages: refs }, + `referenceImages (${refs.length})`, + ); + if (r) return r; + } + + // Tier B — pure text-to-image. Last resort, used when Tier A failed OR + // there are no references to send (first scene with no characters yet). + // Errors here propagate to the caller. 
+ return generateImage(config.image, prompt); +} diff --git a/packages/engine/src/agents/writer.ts b/packages/engine/src/agents/writer.ts new file mode 100644 index 0000000..5423b4b --- /dev/null +++ b/packages/engine/src/agents/writer.ts @@ -0,0 +1,386 @@ +import { chat } from "@yume/ai-client"; +import type { + Beat, + BeatActiveCharacter, + BeatChoice, + BeatChoiceEffect, + BeatNext, + ProviderConfig, + Session, +} from "@yume/types"; +import { parseJsonLoose } from "../jsonParser"; +import { WRITER_SYSTEM, buildWriterUserMessage } from "../prompts"; + +// ────────────────────────────────────────────────────────────────────── +// Writer agent — owns the narrative half of scene generation. +// +// Output: { sceneSummary, sceneKey, entryBeatId, beats[] } +// Each beat carries activeCharacters[] (names + poses) the +// Cinematographer reads when composing the establishing shot. +// +// Character DESIGN (visual + voice) is NOT this agent's job — +// it only names characters; the CharacterDesigner picks up any +// unknown name from beats[].activeCharacters. +// ────────────────────────────────────────────────────────────────────── + +export type WriterOutput = { + sceneSummary: string; + sceneKey?: string; + entryBeatId: string; + beats: Beat[]; +}; + +// Raw shapes — what the LLM produces before validation / coercion. 
+type RawActiveCharacter = { + name?: string; + pose?: string; +}; +type RawEffect = { + kind?: string; + targetBeatId?: string; + nextSceneSeed?: string; +}; +type RawChoice = { + id?: string; + label?: string; + effect?: RawEffect; +}; +type RawNext = { + type?: string; + nextBeatId?: string; + choices?: RawChoice[]; +}; +type RawBeat = { + id?: string; + narration?: string; + speaker?: string; + line?: string; + lineDelivery?: string; + activeCharacters?: RawActiveCharacter[]; + next?: RawNext; +}; +type RawScene = { + sceneSummary?: string; + sceneKey?: string; + entryBeatId?: string; + beats?: RawBeat[]; +}; + +// ────────────────────────────────────────────────────────────────────── +// POV (player viewpoint) handling — Pattern B (galgame standard): +// - speaker = "你" → ALLOWED (renders as dialog box, never TTS'd) +// - any other POV term → normalized to "你" (LLM slip-up safety net) +// - activeCharacters → POV is NEVER allowed (player has no body in-scene) +// - CharacterDesigner → never invoked for "你" or POV variants +// ────────────────────────────────────────────────────────────────────── + +const POV_DISPLAY_NAME = "你"; +const POV_VARIANTS = new Set([ + "玩家", + "我", + "主角", + "protagonist", + "Protagonist", + "player", + "Player", + "PLAYER", + "MC", + "mc", + "Mc", + "I", + "i", + "me", + "Me", + "ME", +]); + +function isPovName(name: string): boolean { + return name === POV_DISPLAY_NAME || POV_VARIANTS.has(name); +} + +// Normalize a speaker name: any POV variant collapses to "你"; an NPC name +// passes through unchanged. Caller passes already-trimmed input. +function normalizeSpeakerName(name: string): string { + return POV_VARIANTS.has(name) ? 
POV_DISPLAY_NAME : name; +} + +function coerceEffect(raw: RawEffect | undefined): BeatChoiceEffect { + if (raw?.kind === "advance-beat" && raw.targetBeatId?.trim()) { + return { kind: "advance-beat", targetBeatId: raw.targetBeatId.trim() }; + } + return { + kind: "change-scene", + nextSceneSeed: raw?.nextSceneSeed?.trim() || "未指定", + }; +} + +function coerceChoice(raw: RawChoice, idx: number): BeatChoice { + return { + id: raw.id?.trim() || `c${idx + 1}`, + label: raw.label?.trim() || `选项 ${idx + 1}`, + effect: coerceEffect(raw.effect), + }; +} + +function coerceNext(raw: RawNext | undefined, fallbackBeatId: string): BeatNext { + if (raw?.type === "choice" && Array.isArray(raw.choices) && raw.choices.length) { + return { + type: "choice", + choices: raw.choices.map((c, i) => coerceChoice(c, i)), + }; + } + return { + type: "continue", + nextBeatId: raw?.nextBeatId?.trim() || fallbackBeatId, + }; +} + +function coerceActiveCharacters( + raw: RawActiveCharacter[] | undefined, +): BeatActiveCharacter[] | undefined { + if (!Array.isArray(raw)) return undefined; + const out = raw + .map((c): BeatActiveCharacter | null => { + const name = c.name?.trim(); + if (!name) return null; + // POV is never IN the picture — strip the LLM's slip-up silently so + // CharacterDesigner doesn't end up generating a portrait for the player. + if (isPovName(name)) return null; + const pose = c.pose?.trim(); + return pose ? { name, pose } : { name }; + }) + .filter((c): c is BeatActiveCharacter => Boolean(c)); + return out.length > 0 ? out : undefined; +} + +function coerceBeat(raw: RawBeat, idx: number, totalBeats: number): Beat { + const id = raw.id?.trim() || `b${idx + 1}`; + // Non-last beats default their `continue` target to the following beat. + // The last beat gets an empty fallback on purpose: repairBeats() turns a + // last/dangling continue into a real scene-change exit so the player can + // never get stuck self-looping on it. + const fallback = idx + 1 < totalBeats ? 
`b${idx + 2}` : ""; + + const rawSpeaker = raw.speaker?.trim() || undefined; + // Normalize any POV variant (玩家/我/主角/protagonist/...) to "你". + // NPC names pass through unchanged. This means the LLM can slip and + // write "玩家" or "I" and we still render the dialog box correctly with + // speaker="你" — and TTS is automatically skipped because no Character + // record exists for "你". + const speaker = rawSpeaker ? normalizeSpeakerName(rawSpeaker) : undefined; + + const line = raw.line?.trim() || undefined; + return { + id, + narration: raw.narration?.trim() || undefined, + speaker, + line, + // lineDelivery is meaningful only for NPC speakers (TTS). For POV + // speaker ("你") TTS is skipped, so lineDelivery would never be used. + lineDelivery: + line && speaker !== POV_DISPLAY_NAME + ? raw.lineDelivery?.trim() || undefined + : undefined, + activeCharacters: coerceActiveCharacters(raw.activeCharacters), + next: coerceNext(raw.next, fallback), + }; +} + +const FALLBACK_SEED = "故事继续推进"; + +function fallbackExitChoice(beatId: string): BeatChoice { + return { + id: `${beatId}__exit`, + label: "继续", + effect: { kind: "change-scene", nextSceneSeed: FALLBACK_SEED }, + }; +} + +// Beat ids are graph keys (the front-end's `beats.find(b => b.id === ...)`, +// the session's `visitedBeatIds`, and `continue`/`advance-beat` targets). If +// the model reuses an id across beats, the second occurrence becomes silently +// unreachable and external references collapse to the first beat. Rename +// duplicates; rewrite the renamed beat's OWN self-references. External +// references stay pointing at the first occurrence. 
+function ensureUniqueBeatIds(beats: Beat[]): Beat[] { + const seen = new Set(); + return beats.map((b): Beat => { + if (!seen.has(b.id)) { + seen.add(b.id); + return b; + } + const oldId = b.id; + let n = 2; + while (seen.has(`${oldId}_${n}`)) n += 1; + const newId = `${oldId}_${n}`; + seen.add(newId); + + let next = b.next; + if (next.type === "continue" && next.nextBeatId === oldId) { + next = { type: "continue", nextBeatId: newId }; + } else if (next.type === "choice") { + next = { + type: "choice", + choices: next.choices.map((c) => + c.effect.kind === "advance-beat" && c.effect.targetBeatId === oldId + ? { + ...c, + effect: { kind: "advance-beat" as const, targetBeatId: newId }, + } + : c, + ), + }; + } + return { ...b, id: newId, next }; + }); +} + +// Repairs referential integrity AND guarantees the scene is escapable: +// - a `continue` to a missing/self id is repointed to the next beat in order; +// a last/dangling continue with nowhere to go becomes a scene-change exit +// - an `advance-beat` to a missing id is downgraded to a scene change +// - if no change-scene exit exists anywhere, one is appended to the last beat +function repairBeats(beats: Beat[]): Beat[] { + const ids = new Set(beats.map((b) => b.id)); + + const fixed: Beat[] = beats.map((b, idx): Beat => { + if (b.next.type === "continue") { + const target = b.next.nextBeatId; + if (ids.has(target) && target !== b.id) return b; + const nextByIndex = beats[idx + 1]?.id; + if (nextByIndex) { + return { ...b, next: { type: "continue", nextBeatId: nextByIndex } }; + } + return { ...b, next: { type: "choice", choices: [fallbackExitChoice(b.id)] } }; + } + + const patched = b.next.choices.map((c) => + c.effect.kind === "advance-beat" && !ids.has(c.effect.targetBeatId) + ? 
{ + ...c, + effect: { + kind: "change-scene" as const, + nextSceneSeed: "未指定(导演引用不存在的 beat,已降级为换场)", + }, + } + : c, + ); + return { ...b, next: { type: "choice", choices: patched } }; + }); + + const hasExit = fixed.some( + (b) => + b.next.type === "choice" && + b.next.choices.some((c) => c.effect.kind === "change-scene"), + ); + if (!hasExit && fixed.length > 0) { + const lastIdx = fixed.length - 1; + const last = fixed[lastIdx]!; + const existing = last.next.type === "choice" ? last.next.choices : []; + fixed[lastIdx] = { + ...last, + next: { type: "choice", choices: [...existing, fallbackExitChoice(last.id)] }, + }; + } + + return fixed; +} + +// Choice ids are keys the front-end uses to cache + consume prefetched +// scenes. Two beats both defaulting to c1/c2 would make a transition reuse +// the WRONG prefetched scene — so force every choice id to be unique within +// the scene. +function ensureUniqueChoiceIds(beats: Beat[]): Beat[] { + const seen = new Set(); + for (const b of beats) { + if (b.next.type !== "choice") continue; + for (const c of b.next.choices) { + if (seen.has(c.id)) { + let n = 2; + while (seen.has(`${c.id}_${n}`)) n += 1; + c.id = `${c.id}_${n}`; + } + seen.add(c.id); + } + } + return beats; +} + +// Normalize sceneKey to a safe lowercase-with-dashes English slug. If the +// model returns something weird (中文 / spaces / mixed case), best-effort +// fix; if it ends up empty, return undefined (the scene just won't be +// considered for img2img reuse). +function normalizeSceneKey(raw: string | undefined): string | undefined { + if (!raw) return undefined; + const slug = raw + .trim() + .toLowerCase() + .replace(/[^a-z0-9-]+/g, "-") + .replace(/-+/g, "-") + .replace(/^-|-$/g, ""); + return slug.length > 0 ? 
slug : undefined; +} + +export async function runWriter( + config: ProviderConfig, + session: Session, +): Promise<WriterOutput> { + const raw = await chat( + config, + [ + { role: "system", content: WRITER_SYSTEM }, + { role: "user", content: buildWriterUserMessage(session) }, + ], + { temperature: 0.9, responseFormat: "json_object" }, + ); + + const parsed = parseJsonLoose<RawScene>(raw); + const rawBeats = Array.isArray(parsed.beats) ? parsed.beats : []; + if (rawBeats.length === 0) { + throw new Error("Writer returned no beats"); + } + + const beats = ensureUniqueChoiceIds( + repairBeats( + ensureUniqueBeatIds( + rawBeats.map((b, i) => coerceBeat(b, i, rawBeats.length)), + ), + ), + ); + + const declaredEntry = parsed.entryBeatId?.trim(); + const entryBeatId = + declaredEntry && beats.some((b) => b.id === declaredEntry) + ? declaredEntry + : beats[0]!.id; + + return { + sceneSummary: parsed.sceneSummary?.trim() || "未指定场景概要", + sceneKey: normalizeSceneKey(parsed.sceneKey), + entryBeatId, + beats, + }; +} + +// Surface the set of character names introduced by this scene's beats, +// so the orchestrator can decide which ones need the CharacterDesigner to +// fire. Pulls names from both `speaker` fields AND `activeCharacters` +// (a character can be on-screen without speaking). +// +// Excludes POV ("你" / 玩家 / 主角 / ...) entirely — the player is never +// designed (no portrait, no voice, no archetype). +export function collectActiveCharacterNames(beats: Beat[]): string[] { + const seen = new Set<string>(); + for (const b of beats) { + if (b.speaker && !isPovName(b.speaker)) seen.add(b.speaker); + if (b.activeCharacters) { + for (const c of b.activeCharacters) { + if (!isPovName(c.name)) seen.add(c.name); + } + } + } + return Array.from(seen); +} + +// Re-export POV constants for downstream filters (director's orphanSpeakers). 
+export { POV_DISPLAY_NAME, POV_VARIANTS, isPovName, normalizeSpeakerName }; diff --git a/packages/engine/src/director.ts b/packages/engine/src/director.ts index 9555255..df7bde2 100644 --- a/packages/engine/src/director.ts +++ b/packages/engine/src/director.ts @@ -1,309 +1,294 @@ -import { chat } from "@yume/ai-client"; +import { chat, uploadImage } from "@yume/ai-client"; import type { - Beat, - BeatChoice, - BeatChoiceEffect, - BeatNext, Character, + EngineConfig, InsertBeatPartial, ProviderConfig, Scene, Session, } from "@yume/types"; -import { parseJsonLoose } from "./jsonParser"; +import { designCharacter, provisionVoiceForName } from "./agents/characterDesigner"; +import { runCinematographer } from "./agents/cinematographer"; +import { runPainter } from "./agents/painter"; import { - DIRECTOR_SYSTEM, - INSERT_BEAT_SYSTEM, - buildDirectorUserMessage, - buildInsertBeatUserMessage, -} from "./prompts"; + collectActiveCharacterNames, + isPovName, + normalizeSpeakerName, + POV_DISPLAY_NAME, + runWriter, +} from "./agents/writer"; +import { parseJsonLoose } from "./jsonParser"; +import { INSERT_BEAT_SYSTEM, buildInsertBeatUserMessage } from "./prompts"; -// ────────────────────────────────────────────────────────────────────── -// Raw shape produced by the model — we coerce + validate into a Scene. 
-// ────────────────────────────────────────────────────────────────────── - -type RawEffect = { - kind?: string; - targetBeatId?: string; - nextSceneSeed?: string; -}; - -type RawChoice = { - id?: string; - label?: string; - effect?: RawEffect; -}; - -type RawNext = { - type?: string; - nextBeatId?: string; - choices?: RawChoice[]; -}; - -type RawBeat = { - id?: string; - narration?: string; - speaker?: string; - line?: string; - lineDelivery?: string; - next?: RawNext; -}; - -type RawCharacterUpdate = { - name?: string; - description?: string; -}; - -type RawScene = { - scenePrompt?: string; - entryBeatId?: string; - beats?: RawBeat[]; - characterUpdates?: RawCharacterUpdate[]; -}; - -function coerceEffect(raw: RawEffect | undefined): BeatChoiceEffect { - if (raw?.kind === "advance-beat" && raw.targetBeatId?.trim()) { - return { kind: "advance-beat", targetBeatId: raw.targetBeatId.trim() }; - } - return { - kind: "change-scene", - nextSceneSeed: raw?.nextSceneSeed?.trim() || "未指定", - }; -} - -function coerceChoice(raw: RawChoice, idx: number): BeatChoice { - return { - id: raw.id?.trim() || `c${idx + 1}`, - label: raw.label?.trim() || `选项 ${idx + 1}`, - effect: coerceEffect(raw.effect), - }; -} - -function coerceNext(raw: RawNext | undefined, fallbackBeatId: string): BeatNext { - if (raw?.type === "choice" && Array.isArray(raw.choices) && raw.choices.length) { - return { - type: "choice", - choices: raw.choices.map((c, i) => coerceChoice(c, i)), - }; - } - return { - type: "continue", - nextBeatId: raw?.nextBeatId?.trim() || fallbackBeatId, - }; -} - -function coerceBeat(raw: RawBeat, idx: number, totalBeats: number): Beat { - const id = raw.id?.trim() || `b${idx + 1}`; - // Non-last beats default their `continue` target to the following beat. - // The last beat gets an empty fallback on purpose: repairBeats() turns a - // last/dangling continue into a real scene-change exit so the player can - // never get stuck self-looping on it. 
- const fallback = idx + 1 < totalBeats ? `b${idx + 2}` : ""; - const line = raw.line?.trim() || undefined; - return { - id, - narration: raw.narration?.trim() || undefined, - speaker: raw.speaker?.trim() || undefined, - line, - // lineDelivery only meaningful when there is a line to deliver. - lineDelivery: line ? raw.lineDelivery?.trim() || undefined : undefined, - next: coerceNext(raw.next, fallback), - }; -} - -function coerceCharacterUpdates(raw: RawCharacterUpdate[] | undefined): Character[] { - if (!Array.isArray(raw)) return []; - return raw - .map((c) => ({ - name: c.name?.trim() ?? "", - description: c.description?.trim() ?? "", - })) - .filter((c) => c.name && c.description); -} - -const FALLBACK_SEED = "故事继续推进"; - -function fallbackExitChoice(beatId: string): BeatChoice { - return { - id: `${beatId}__exit`, - label: "继续", - effect: { kind: "change-scene", nextSceneSeed: FALLBACK_SEED }, - }; -} - -// Beat ids are graph keys (the front-end's `beats.find(b => b.id === ...)`, -// the session's `visitedBeatIds`, and `continue`/`advance-beat` targets). If -// the model reuses an id across beats, the second occurrence becomes silently -// unreachable and external references collapse to the first beat. Rename -// duplicates; rewrite the renamed beat's OWN self-references (the most -// natural interpretation of a duplicate id being referenced from inside that -// same beat). External references stay pointing at the first occurrence. 
-function ensureUniqueBeatIds(beats: Beat[]): Beat[] { - const seen = new Set(); - return beats.map((b): Beat => { - if (!seen.has(b.id)) { - seen.add(b.id); - return b; - } - const oldId = b.id; - let n = 2; - while (seen.has(`${oldId}_${n}`)) n += 1; - const newId = `${oldId}_${n}`; - seen.add(newId); - - let next = b.next; - if (next.type === "continue" && next.nextBeatId === oldId) { - next = { type: "continue", nextBeatId: newId }; - } else if (next.type === "choice") { - next = { - type: "choice", - choices: next.choices.map((c) => - c.effect.kind === "advance-beat" && c.effect.targetBeatId === oldId - ? { - ...c, - effect: { kind: "advance-beat" as const, targetBeatId: newId }, - } - : c, - ), - }; - } - return { ...b, id: newId, next }; - }); -} - -// Repairs referential integrity AND guarantees the scene is escapable: -// - a `continue` to a missing/self id is repointed to the next beat in order; -// a last/dangling continue with nowhere to go becomes a scene-change exit -// (never a self-loop, which would strand the player on "click to advance") -// - an `advance-beat` to a missing id is downgraded to a scene change -// - if no change-scene exit exists anywhere, one is appended to the last beat -function repairBeats(beats: Beat[]): Beat[] { - const ids = new Set(beats.map((b) => b.id)); - - const fixed: Beat[] = beats.map((b, idx): Beat => { - if (b.next.type === "continue") { - const target = b.next.nextBeatId; - if (ids.has(target) && target !== b.id) return b; - const nextByIndex = beats[idx + 1]?.id; - if (nextByIndex) { - return { ...b, next: { type: "continue", nextBeatId: nextByIndex } }; - } - return { ...b, next: { type: "choice", choices: [fallbackExitChoice(b.id)] } }; - } - - const patched = b.next.choices.map((c) => - c.effect.kind === "advance-beat" && !ids.has(c.effect.targetBeatId) - ? 
{ - ...c, - effect: { - kind: "change-scene" as const, - nextSceneSeed: "未指定(导演引用不存在的 beat,已降级为换场)", - }, - } - : c, - ); - return { ...b, next: { type: "choice", choices: patched } }; - }); - - const hasExit = fixed.some( - (b) => - b.next.type === "choice" && - b.next.choices.some((c) => c.effect.kind === "change-scene"), - ); - if (!hasExit && fixed.length > 0) { - const lastIdx = fixed.length - 1; - const last = fixed[lastIdx]!; - const existing = last.next.type === "choice" ? last.next.choices : []; - fixed[lastIdx] = { - ...last, - next: { type: "choice", choices: [...existing, fallbackExitChoice(last.id)] }, - }; - } - - return fixed; -} - -// Choice ids are the keys the front-end uses to cache and consume prefetched -// scenes. Two beats both defaulting to c1/c2 (or the model reusing ids across -// beats) would make a transition reuse the WRONG prefetched scene — so force -// every choice id to be unique within the scene. -function ensureUniqueChoiceIds(beats: Beat[]): Beat[] { - const seen = new Set(); - for (const b of beats) { - if (b.next.type !== "choice") continue; - for (const c of b.next.choices) { - if (seen.has(c.id)) { - let n = 2; - while (seen.has(`${c.id}_${n}`)) n += 1; - c.id = `${c.id}_${n}`; - } - seen.add(c.id); - } - } - return beats; -} +// ══════════════════════════════════════════════════════════════════════ +// director.ts — multi-agent orchestrator for one full Scene generation. 
+// +// Critical path (per Scene call): +// +// Writer LLM (~3s, serial) +// │ +// ├─ CharacterDesigner LLM × N (parallel per new char) +// │ │ +// │ ├─ portrait gen + upload (parallel within agent) +// │ └─ voice provisioning (parallel within agent) +// │ +// ├─ Cinematographer LLM (parallel with all of the above) +// │ +// └─ wait for all parallel branches +// │ +// ▼ +// Painter (FLUX referenceImages — two-tier degradation chain) +// │ +// ▼ +// upload final scene image → Scene.imageUuid +// │ +// ▼ +// return { scene, sceneImageBase64, characters } +// +// The Cinematographer intentionally does NOT depend on CharacterDesigner +// output — it only positions named characters in the frame, not their +// appearance. This unlocks the parallelism that makes the full pipeline +// ~9-12s instead of ~15-18s serial. +// ══════════════════════════════════════════════════════════════════════ function newSceneId(): string { return `scene_${Date.now()}_${Math.random().toString(36).slice(2, 6)}`; } -// ────────────────────────────────────────────────────────────────────── -// directScene — generates one Scene (multi-beat) for the player. -// Called both on real scene transitions AND on speculative prefetch. -// ────────────────────────────────────────────────────────────────────── +function tlog(label: string, t0: number): void { + console.log(`${label}: ${Date.now() - t0}ms`); +} + +// Merge a freshly-designed Character into a registry, preserving any +// previously-set voice/portrait that the new design didn't fill in (so +// re-designing a known character can't silently drop their voice or wipe +// out an already-generated portrait UUID). Match by name. 
+export function mergeCharacters( + existing: Character[], + updates: Character[], +): Character[] { + if (updates.length === 0) return existing; + const byName = new Map(existing.map((c) => [c.name, c])); + for (const u of updates) { + const prev = byName.get(u.name); + if (!prev) { + byName.set(u.name, u); + continue; + } + // Preserve any prior provisioned resource that the new design omitted. + byName.set(u.name, { + ...u, + voice: u.voice ?? prev.voice, + visualDescription: u.visualDescription ?? prev.visualDescription, + basePortraitBase64: u.basePortraitBase64 ?? prev.basePortraitBase64, + basePortraitUuid: u.basePortraitUuid ?? prev.basePortraitUuid, + voiceDescription: u.voiceDescription || prev.voiceDescription, + }); + } + return Array.from(byName.values()); +} + +// Pick a reference to the prior scene image when sceneKey matches a prior +// scene — used by the Painter as one of the `referenceImages` (NOT as a +// seedImage, because FLUX.2 [klein] 9B KV does not support seedImage). +// +// Returns the UUID if available (cheap reference, ~36 chars over the wire), +// else the base64 of the most recent matching scene's image. Returns +// undefined when no prior scene shares the current sceneKey. 
+function pickPriorSceneReference( + session: Session, + currentSceneKey: string | undefined, + priorImageBase64ByUuid: Map<string, string>, +): { priorSceneReference?: string; priorSceneKey?: string } { + if (!currentSceneKey) return {}; + for (let i = session.history.length - 1; i >= 0; i--) { + const prior = session.history[i]!.scene; + if (prior.sceneKey === currentSceneKey) { + if (prior.imageUuid) { + return { + priorSceneReference: prior.imageUuid, + priorSceneKey: prior.sceneKey, + }; + } + const cached = priorImageBase64ByUuid.get(prior.id); + if (cached) { + return { priorSceneReference: cached, priorSceneKey: prior.sceneKey }; + } + } + } + return {}; +} export type SceneResult = { scene: Scene; - characterUpdates: Character[]; + sceneImageBase64: string; + characters: Character[]; }; +// ────────────────────────────────────────────────────────────────────── +// directScene — the multi-agent pipeline. Used by orchestrator's +// startSession and requestScene. +// +// priorImageBase64ByUuid: optional map from prior Scene.id → base64 +// the caller has on-hand (keyed by Scene.id, not by image UUID, despite +// the parameter name). If a sceneKey-hit scene's imageUuid is missing +// but the base64 is cached locally, we can still feed it as one of the +// Painter's referenceImages. Pass an empty map when caller has no cache +// (orchestrator does pass it for the start-session bootstrap). +// ────────────────────────────────────────────────────────────────────── + export async function directScene( - config: ProviderConfig, + config: EngineConfig, session: Session, + priorImageBase64ByUuid: Map<string, string> = new Map(), ): Promise<SceneResult> { - const raw = await chat( - config, - [ - { role: "system", content: DIRECTOR_SYSTEM }, - { role: "user", content: buildDirectorUserMessage(session) }, - ], - { temperature: 0.9, responseFormat: "json_object" }, + const tTotal = Date.now(); + + // Stage 1 — Writer (serial; everything downstream needs sceneSummary + + // beats[] to know who's on stage and what to compose around). 
+ const tWriter = Date.now(); + const writerOut = await runWriter(config.text, session); + tlog("[directScene] Writer", tWriter); + + // Identify NEW characters introduced by this scene that need to be + // designed (LLM + portrait + voice). Existing characters in the registry + // are skipped — their cards / portraits / voices persist across scenes. + const allActiveNames = collectActiveCharacterNames(writerOut.beats); + const newCharNames = allActiveNames.filter( + (n) => !session.characters.some((c) => c.name === n), ); - const parsed = parseJsonLoose(raw); - const rawBeats = Array.isArray(parsed.beats) ? parsed.beats : []; - if (rawBeats.length === 0) { - throw new Error("Director returned no beats"); + // Find the entry beat for the Cinematographer (which characters are + // on-screen in the establishing shot). + const entryBeat = writerOut.beats.find((b) => b.id === writerOut.entryBeatId); + const entryBeatActive = entryBeat?.activeCharacters ?? []; + + // For sceneKey-based visual continuity, look up the prior matching scene's + // image to slot into Painter's referenceImages (max 4 of which include + // character portraits too). + const { priorSceneReference, priorSceneKey } = pickPriorSceneReference( + session, + writerOut.sceneKey, + priorImageBase64ByUuid, + ); + + // Stage 2 — parallel: CharacterDesigner(s) and Cinematographer. + // Cinematographer doesn't need character visualDescriptions (those are + // appended at Painter stage), so it runs concurrently with chardesign. + const tParallel = Date.now(); + + const designPromises = newCharNames.map((name) => + designCharacter(config, session, name).catch((err): Character => { + const msg = err instanceof Error ? err.message : String(err); + console.error(`[directScene] designCharacter(${name}) failed: ${msg}`); + // Last-resort fallback: register with name only so the speaker isn't + // unknown. Caller may try voice provisioning later or skip. 
+ return { + name, + voiceDescription: `请根据角色名「${name}」推断其性别、年龄与气质。所属世界观:${session.worldSetting}`, + }; + }), + ); + + const cinemaPromise = runCinematographer(config.text, { + sceneSummary: writerOut.sceneSummary, + styleGuide: session.styleGuide, + entryBeatActive, + entryBeatSpeaker: entryBeat?.speaker, + priorSceneKey, + currentSceneKey: writerOut.sceneKey, + }); + + const [designedChars, cinemaOut] = await Promise.all([ + Promise.all(designPromises), + cinemaPromise, + ]); + tlog("[directScene] CharacterDesigner+Cinematographer parallel", tParallel); + + // Merge new chars into a working registry that we'll pass to the Painter. + const characters = mergeCharacters(session.characters, designedChars); + + // Edge case: a speaker referenced by the Writer might not have been in + // `activeCharacters` of any beat (LLM oversight), so they got skipped by + // newCharNames. Catch them here and at least provision a voice so the + // beat-audio path doesn't render silent. No portrait — they weren't + // visible in the scene, so visual consistency doesn't matter for them. + const speakerNames = new Set( + writerOut.beats.map((b) => b.speaker).filter((n): n is string => Boolean(n)), + ); + const orphanSpeakers = [...speakerNames].filter( + // Pattern B: "你" (player) is a valid speaker but never gets a Character + // record — TTS is intentionally skipped on the client. Filter POV out so + // provisionVoiceForName isn't accidentally invoked for the player. 
+ (n) => !isPovName(n) && !characters.some((c) => c.name === n), + ); + if (orphanSpeakers.length > 0) { + const orphans = await Promise.all( + orphanSpeakers.map((n) => provisionVoiceForName(config, session, n)), + ); + const merged = mergeCharacters(characters, orphans); + characters.splice(0, characters.length, ...merged); } - const beats = ensureUniqueChoiceIds( - repairBeats( - ensureUniqueBeatIds( - rawBeats.map((b, i) => coerceBeat(b, i, rawBeats.length)), - ), - ), + // Stage 3 — Painter (depends on cinemaOut + characters). + // On-stage characters for THIS scene are the ones in any beat — pass them + // all so the archetype block covers anyone the player might encounter. + const onStageCharacters = characters.filter((c) => + allActiveNames.includes(c.name), ); - const declaredEntry = parsed.entryBeatId?.trim(); - const entryBeatId = - declaredEntry && beats.some((b) => b.id === declaredEntry) - ? declaredEntry - : beats[0]!.id; - - return { - scene: { - id: newSceneId(), - scenePrompt: parsed.scenePrompt?.trim() || "an empty scene", - beats, - entryBeatId, + const tPainter = Date.now(); + const sceneImageBase64 = await runPainter( + config, + { + integratedPrompt: cinemaOut.integratedPrompt, + styleGuide: session.styleGuide, + onStageCharacters, + priorSceneImage: priorSceneReference, }, - characterUpdates: coerceCharacterUpdates(parsed.characterUpdates), + entryBeat, + ); + tlog("[directScene] Painter", tPainter); + + // Stage 4 — best-effort upload of the final scene image so the NEXT + // sceneKey-match call can reference its UUID instead of carrying base64. + // If upload fails, the scene still works; only loses cheap referencing + // on the next hop. Don't wait on mock images (static placeholder). 
+ let imageUuid: string | undefined; + if (!config.mockImage) { + try { + const tUpload = Date.now(); + imageUuid = await uploadImage(config.image, sceneImageBase64); + tlog("[directScene] image upload", tUpload); + } catch (err) { + const msg = err instanceof Error ? err.message : String(err); + console.warn(`[directScene] scene image upload failed: ${msg} — sceneKey reuse will need base64 fallback`); + } + } + + const scene: Scene = { + id: newSceneId(), + // scenePrompt is the cinematographer's English compositional output; + // the Writer's sceneSummary stays in the session log via beats[]/ + // history. Keeping the original field name preserves compat with + // anything that already reads scene.scenePrompt (e.g., insert-beat + // user prompt). + scenePrompt: cinemaOut.integratedPrompt, + beats: writerOut.beats, + entryBeatId: writerOut.entryBeatId, + sceneKey: writerOut.sceneKey, + imageUuid, }; + + tlog("[directScene] TOTAL", tTotal); + + return { scene, sceneImageBase64, characters }; } // ────────────────────────────────────────────────────────────────────── -// directInsertBeat — generates a one-off transient beat in response to -// a freeform vision action that stays in-scene. Used by /api/insert-beat. +// directInsertBeat — single-agent path for vision-driven in-scene +// exploration. Generates ONE transient beat with NO new image, NO new +// characters. Multi-agent pipeline doesn't apply here (no rendering, no +// character introduction allowed by the prompt). // ────────────────────────────────────────────────────────────────────── export async function directInsertBeat( @@ -326,13 +311,17 @@ export async function directInsertBeat( const parsed = parseJsonLoose(raw); const narration = parsed.narration?.trim() || undefined; - const speaker = parsed.speaker?.trim() || undefined; + const rawSpeaker = parsed.speaker?.trim() || undefined; + // Pattern B (mirrors Writer): normalize POV variants → "你"; NPCs pass through. + const speaker = rawSpeaker ? 
normalizeSpeakerName(rawSpeaker) : undefined; const line = parsed.line?.trim() || undefined; - const lineDelivery = line ? parsed.lineDelivery?.trim() || undefined : undefined; + // lineDelivery is only meaningful for NPC speakers (TTS). For POV ("你") + // TTS is intentionally skipped on the client, so lineDelivery is dropped. + const lineDelivery = + line && speaker !== POV_DISPLAY_NAME + ? parsed.lineDelivery?.trim() || undefined + : undefined; - // If the model returned nothing usable, supply a fallback narration so the - // frontend doesn't append a silent empty beat that renders no dialogue — - // which would make the click appear to do nothing. if (!narration && !speaker && !line) { return { narration: "(你停下脚步,环视片刻。)" }; } diff --git a/packages/engine/src/index.ts b/packages/engine/src/index.ts index 9d96a48..8162c30 100644 --- a/packages/engine/src/index.ts +++ b/packages/engine/src/index.ts @@ -6,7 +6,10 @@ export { requestBeatAudio, } from "./orchestrator"; export { annotateClick } from "./annotate"; -export { provisionVoicesForScene, synthesizeBeat } from "./voice"; +export { synthesizeBeat } from "./voice"; +export { mergeCharacters } from "./director"; export type { SceneResult } from "./director"; +export type { WriterOutput } from "./agents/writer"; +export type { CinematographerOutput } from "./agents/cinematographer"; export type { InsertBeatPartial } from "@yume/types"; export * from "./prompts"; diff --git a/packages/engine/src/jsonParser.ts b/packages/engine/src/jsonParser.ts index 60a422a..20130fc 100644 --- a/packages/engine/src/jsonParser.ts +++ b/packages/engine/src/jsonParser.ts @@ -1,3 +1,13 @@ +// Strict-then-forgiving JSON parser for LLM output. Tries in order: +// 1. Direct JSON.parse on the trimmed text. +// 2. Extract from ```json``` fenced block. +// 3. Slice between first { and last } and parse. +// 4. Apply best-effort regex repair (trailing commas, missing commas +// between adjacent values) and try again. 
+// +// On final failure, logs the first 800 chars of the raw model output so we +// can see what the LLM did wrong (the standard error message only shows +// the position, not the surrounding context). export function parseJsonLoose<T>(raw: string): T { const trimmed = raw.trim(); @@ -20,8 +30,52 @@ export function parseJsonLoose<T>(raw: string): T { const last = trimmed.lastIndexOf("}"); if (first !== -1 && last > first) { const slice = trimmed.slice(first, last + 1); - return JSON.parse(slice) as T; + try { + return JSON.parse(slice) as T; + } catch { + // Last resort: try repairing common LLM-output malformations. + const repaired = repairJsonString(slice); + try { + return JSON.parse(repaired) as T; + } catch (err) { + console.error( + `[parseJsonLoose] all strategies failed. Raw output (first 800 chars):\n${raw.slice(0, 800)}`, + ); + throw err; + } + } } + console.error( + `[parseJsonLoose] no { ... } found. Raw output (first 800 chars):\n${raw.slice(0, 800)}`, + ); throw new Error(`Failed to parse JSON from model output: ${raw.slice(0, 200)}`); } + +// Best-effort repair of LLM-typical JSON syntax errors. Targeted at the two +// most common failures we see in practice: +// 1. Trailing comma before } or ]. +// 2. Missing comma between two adjacent JSON values (the specific error +// mode we hit at position 3390). +// +// Deliberately conservative — does NOT try to fix unclosed strings, +// unbalanced braces, or strip JS-style comments. The comment-stripping +// path was previously included but would corrupt JSON string values +// containing `//` (e.g. URLs like "https://example.com"); since LLMs in +// `responseFormat: "json_object"` mode essentially never emit comments, +// dropping that step is a net win for safety. +function repairJsonString(s: string): string { + return s + // 1. Strip trailing commas before } or ]. + .replace(/,(\s*[}\]])/g, "$1") + // 2. Insert missing commas between two adjacent JSON values. 
The cases: + // } { → },{ ] [ → ],[ } [ → },[ ] { → ],{ + // "string" "key" "string" { "string" [ + // number then "key" / { / [ + // + // The regex looks for a closing token (} ] " or a digit) followed by + // a newline and an opening token ({ [ or "), and inserts a + // comma between them. Requires the newline (\s*\n\s*) so it only + // fires across line boundaries, never within a single-line value. + .replace(/(\}|\]|"|\d)(\s*\n\s*)(\{|\[|")/g, "$1,$2$3"); +} diff --git a/packages/engine/src/orchestrator.ts b/packages/engine/src/orchestrator.ts index d75e17b..87a8e6a 100644 --- a/packages/engine/src/orchestrator.ts +++ b/packages/engine/src/orchestrator.ts @@ -1,14 +1,12 @@ import type { BeatAudioRequest, BeatAudioResponse, - Character, EngineConfig, InsertBeatRequest, InsertBeatResponse, - Scene, + Session, SceneRequest, SceneResponse, - Session, StartRequest, StartResponse, VisionRequest, @@ -16,55 +14,24 @@ import type { } from "@yume/types"; import { annotateClick } from "./annotate"; import { directInsertBeat, directScene } from "./director"; -import { mockImageBase64 } from "./mockImage"; -import { render } from "./renderer"; +import { synthesizeBeat } from "./voice"; import { interpret } from "./vision"; -import { provisionVoicesForScene, synthesizeBeat } from "./voice"; function newSessionId(): string { return `s_${Date.now()}_${Math.random().toString(36).slice(2, 8)}`; } -// TEMP: per-phase timing for latency diagnosis. Remove after we have data. function tlog(label: string, t0: number): void { console.log(`${label}: ${Date.now() - t0}ms`); } -// Merge new character entries into the registry by name. If a name already -// exists we preserve the existing voice (so a description revision never -// silently re-provisions a voice the player has already heard). 
-function mergeCharacters(existing: Character[], updates: Character[]): Character[] { - if (updates.length === 0) return existing; - const byName = new Map(existing.map((c) => [c.name, c])); - for (const u of updates) { - const prev = byName.get(u.name); - byName.set(u.name, prev?.voice ? { ...u, voice: prev.voice } : u); - } - return Array.from(byName.values()); -} - -async function renderImage( - config: EngineConfig, - scene: Scene, - styleGuide: string, -): Promise { - if (config.mockImage) return mockImageBase64(); - return render(config.image, scene, styleGuide); -} - -async function provisionForScene( - config: EngineConfig, - session: Session, - scene: Scene, -): Promise<{ characters: Character[] }> { - if (!config.tts) return { characters: session.characters }; - return provisionVoicesForScene(config.tts, session, scene); -} - // ────────────────────────────────────────────────────────────────────── -// startSession — first scene + image + voice provisioning. The actual -// per-beat synth runs lazily via requestBeatAudio so MiMo's tail -// latency never blocks the UI. +// startSession — initial Scene via the multi-agent pipeline. +// +// directScene internally handles: Writer → (CharacterDesigner+ +// Cinematographer parallel) → Painter → upload. Voice provisioning and +// portrait generation happen inside CharacterDesigner per new character, +// so the orchestrator no longer needs to coordinate them separately. 
// ────────────────────────────────────────────────────────────────────── export async function startSession( @@ -72,6 +39,7 @@ export async function startSession( req: StartRequest, ): Promise { const tTotal = Date.now(); + const session: Session = { id: newSessionId(), createdAt: Date.now(), @@ -81,42 +49,20 @@ export async function startSession( characters: [], }; - const tDirect = Date.now(); - const { scene, characterUpdates } = await directScene(config.text, session); - tlog("[start] directScene", tDirect); - - const preVoiceSession: Session = { - ...session, - characters: mergeCharacters(session.characters, characterUpdates), - }; - - const tImage = Date.now(); - const tProv = Date.now(); - const imagePromise = renderImage(config, scene, preVoiceSession.styleGuide) - .then((r) => { - tlog("[start] renderImage", tImage); - return r; - }); - const provPromise = provisionForScene(config, preVoiceSession, scene) - .then((r) => { - tlog("[start] provisionForScene", tProv); - return r; - }); - const [imageBase64, provRes] = await Promise.all([imagePromise, provPromise]); + const { scene, sceneImageBase64, characters } = await directScene(config, session); tlog("[start] TOTAL", tTotal); return { sessionId: session.id, scene, - imageBase64, - characters: provRes.characters, + imageBase64: sceneImageBase64, + characters, }; } // ────────────────────────────────────────────────────────────────────── -// requestScene — generate the NEXT scene + image + voice provisioning. -// Used both on real scene transitions and on speculative prefetch. +// requestScene — next Scene from existing session. 
// ────────────────────────────────────────────────────────────────────── export async function requestScene( @@ -125,40 +71,24 @@ export async function requestScene( ): Promise { const tTotal = Date.now(); - const tDirect = Date.now(); - const { scene, characterUpdates } = await directScene(config.text, req.session); - tlog("[scene] directScene", tDirect); - - const preVoiceSession: Session = { - ...req.session, - characters: mergeCharacters(req.session.characters, characterUpdates), - }; - - const tImage = Date.now(); - const tProv = Date.now(); - const imagePromise = renderImage(config, scene, preVoiceSession.styleGuide) - .then((r) => { - tlog("[scene] renderImage", tImage); - return r; - }); - const provPromise = provisionForScene(config, preVoiceSession, scene) - .then((r) => { - tlog("[scene] provisionForScene", tProv); - return r; - }); - const [imageBase64, provRes] = await Promise.all([imagePromise, provPromise]); + const { scene, sceneImageBase64, characters } = await directScene( + config, + req.session, + ); tlog("[scene] TOTAL", tTotal); return { scene, - imageBase64, - characters: provRes.characters, + imageBase64: sceneImageBase64, + characters, }; } // ────────────────────────────────────────────────────────────────────── // visionDecide — interprets a background click into intent + classify. +// No change from staging — vision lives outside the scene-generation +// pipeline. // ────────────────────────────────────────────────────────────────────── export async function visionDecide( @@ -171,9 +101,9 @@ export async function visionDecide( } // ────────────────────────────────────────────────────────────────────── -// requestInsertBeat — generates a transient in-scene beat (no image -// regen, no voice). The client fires /api/beat-audio for the new beat -// after this returns. +// requestInsertBeat — single-agent transient beat (no image, no new +// characters). 
Stays single-LLM by design — the INSERT_BEAT prompt +// forbids new characters and there's nothing to render. // ────────────────────────────────────────────────────────────────────── export async function requestInsertBeat( @@ -182,19 +112,24 @@ export async function requestInsertBeat( ): Promise { const tTotal = Date.now(); - const tDirect = Date.now(); const partial = await directInsertBeat( config.text, req.session, req.freeformAction, ); - tlog("[insert-beat] directInsertBeat", tDirect); - // INSERT_BEAT prompt forbids new characters — promote disallowed-speaker - // lines to narration so the player still sees the text (the client only - // renders `line` when there is a `speaker`). + // INSERT_BEAT prompt forbids new NPCs — promote disallowed-speaker lines + // to narration so the player still sees the text (the client only renders + // `line` when there is a `speaker`). + // + // Exception (Pattern B): speaker = "你" is the player speaking. No + // Character record exists for "你" (intentional — TTS is skipped), so we + // must NOT demote it; the client renders the dialog box correctly. + // directInsertBeat already normalized POV variants to "你" before this + // guard, so a literal "你" here is always Pattern B player dialog. 
if ( partial.speaker && + partial.speaker !== "你" && !req.session.characters.some((c) => c.name === partial.speaker) ) { console.warn( diff --git a/packages/engine/src/prompts.ts b/packages/engine/src/prompts.ts index 57c1484..ac1449a 100644 --- a/packages/engine/src/prompts.ts +++ b/packages/engine/src/prompts.ts @@ -1,23 +1,47 @@ -import type { Scene, Session } from "@yume/types"; +import type { + BeatActiveCharacter, + Character, + Scene, + Session, +} from "@yume/types"; + +// ══════════════════════════════════════════════════════════════════════ +// Multi-agent scene generation pipeline: +// Writer (编剧) — narrative + sceneKey + beats[] + per-beat activeCharacters +// CharacterDesigner — per-new-character visual + voice cards +// Cinematographer (分镜导演) — shotType + English compositional prompt +// Painter (画师) — FLUX rendering with character archetypes +// +// Each LLM agent owns one system prompt + one user-message builder below; +// the Painter is a pure prompt-building function. All stages share the +// style guide, but each reads only the slice of session state it needs. +// ══════════════════════════════════════════════════════════════════════ // ────────────────────────────────────────────────────────────────────── -// Director — emits one Scene (background + a graph of beats) at a time. +// 1. Writer (编剧) — drives the narrative. +// +// Emits a full Scene: beats[] graph + entryBeatId + sceneKey hint + +// activeCharacters per beat. Does NOT design characters (that's the +// CharacterDesigner's job) — only names them in `activeCharacters`. +// The CharacterDesigner is invoked separately for any name not yet in +// session.characters. 
// ────────────────────────────────────────────────────────────────────── -export const DIRECTOR_SYSTEM = `你是一个交互视觉小说的「场景导演」。每次基于世界观、画风、玩家历史、已登记角色,输出**一个完整的场景**,并为每句台词配上细腻的配音导演指令。 +export const WRITER_SYSTEM = `你是一个交互视觉小说的「编剧」。每次基于世界观、画风、玩家历史、已登记角色,写出**一个完整场景的剧本**:场景背景概要 + 一组对话节拍 beats。你只负责**剧情和台词**——不设计角色形象、不写出图提示词、不做镜头调度,这些由其他 agent 完成。 一个场景包含: -- 一张背景图(你给出英文 scenePrompt) -- 一组对话节拍 beats,玩家会按顺序经历它们 -- 任何**首次登场**的角色,需在 characterUpdates 里登记其专属音色设计 +- sceneSummary:当前场景的中文概要(地点、时间、氛围、关键事件——给后续的分镜导演看) +- sceneKey:当前场景的英文 slug(如 "classroom-dusk"、"rooftop-night"、"rainy-street")——同一物理空间应沿用相同 slug +- beats[]:玩家依次经历的对话节拍 +- entryBeatId:玩家进入场景时落在哪个 beat 每个 beat 是玩家会看到的一段叙述 / 对话 / 选择。beat 之间通过 next 字段连接: -- "continue": 玩家点击图片背景 / 按继续,自然推进到下一个 beat -- "choice": 在此让玩家做选择,按所选 choice 的 effect 走向 +- "continue":玩家点击图片背景 / 按继续,自然推进到下一个 beat +- "choice":在此让玩家做选择,按所选 choice 的 effect 走向 choice 的 effect 有两种: -- "advance-beat": 玩家选了之后跳到**同场景内**的另一个 beat(不换背景图,速度极快) -- "change-scene": 玩家选了之后切换到**新场景**(视角变了 / 走到新地方 / 时间跳了) +- "advance-beat":玩家选了之后跳到**同场景内**的另一个 beat(不换背景图,速度极快) +- "change-scene":玩家选了之后切换到**新场景**(视角变了 / 走到新地方 / 时间跳了) 设计原则: - 同场景内 beat 数自由发挥,按剧情节奏自然给出(通常 2–6 个,可以更多) @@ -25,34 +49,60 @@ choice 的 effect 有两种: - advance-beat 适合处理对话分支(同一场景里换个话题、追问、撒娇) - change-scene 适合空间/时间跳跃(出门、转身看窗外、第二天清晨) - 一个场景至少要有一个 change-scene 出口(除非真到结局) -- 每个 change-scene 必须带 nextSceneSeed —— 一句中文简述「下一场是哪里、谁在、要发生什么」,用来引导下一次导演调用 +- 每个 change-scene 必须带 nextSceneSeed —— 一句中文简述「下一场是哪里、谁在、要发生什么」 - 同一场景的 beat id 互不重复 - next.nextBeatId 引用的 beat 必须存在 - choice 至少 2 个,至多 4 个,互不重复 +sceneKey 设计原则(重要 — 用于跨场景视觉一致性): +- 同一物理空间 + 同一时段 → 必须沿用**完全相同**的英文 slug +- 时段或空间变化时换 slug(如 "classroom-dusk" → "classroom-night","classroom-dusk" → "corridor-dusk") +- slug 规范:lowercase-with-dashes,2–4 个英文单词 +- 已登记的历史场景 sceneKey 会在用户消息里列出,请优先**复用**这些已有 slug + 文本风格约束: - narration / line 用中文(**纯净可显示文本**,绝不要写 (叹气)(语速快) 这类标注 —— 那是给配音的,会被玩家看见) -- scenePrompt / lineDelivery / characterUpdates 内的文字按下方专门说明 +- sceneSummary / lineDelivery / 
activeCharacters[].pose 内的文字也用中文 +- sceneKey 用英文 slug - 单个 beat 的 narration 与 line 加起来 ≤80 字 - 单个 choice label ≤15 字 -- scenePrompt 用英文,只描述画面里看到什么,不要描述 UI 配音相关字段: -- 每个有 line 的 beat **必须**给出 lineDelivery —— 自由中文的"配音导演指令",描述该句台词怎么念(情绪 / 语气 / 语速 / 气息 / 停顿 / 重音 / 音色起伏)。例:"鼓起勇气又害羞,声音发颤、偏小,句尾带一丝气声,语速偏慢"。平淡场合写"平静自然、语速适中"即可,但要贴当下情境。 -- characterUpdates 仅当**有新角色首次出现**时列出该新角色的音色设计;已登记的角色不要重复列出。 -- characterUpdates[].description **必须以明确性别开头**("女性,…" / "男性,…"),随后描述:年龄、音色质感、性格情绪基调、语速节奏、人设腔调、口音方言。例:"女性,约17岁少女,音色清亮带点稚嫩甜美,性格开朗,语速偏快,标准普通话"。 +- 每个有 line 的 beat **必须**给出 lineDelivery —— 自由中文的「配音导演指令」,描述该句台词怎么念(情绪 / 语气 / 语速 / 气息 / 停顿 / 重音 / 音色起伏)。例:"鼓起勇气又害羞,声音发颤、偏小,句尾带一丝气声,语速偏慢"。平淡场合写"平静自然、语速适中"即可,但要贴当下情境。 -角色与台词的硬性规则(影响配音正确性): -- 任何 beat 的 speaker 字段一旦填了名字,**该名字必须**:① 在"已登记角色"列表中存在,或 ② 本次输出的 characterUpdates 里登记。绝不允许 speaker 是个未登记的陌生名字。 +角色与台词的硬性规则: +- 任何 beat 的 speaker 字段一旦填了名字,**该名字必须**:① 是 "你"(玩家本人,见下方"玩家视角硬规则"),或 ② 在「已登记角色」列表中存在,或 ③ 出现在本场景的某个 beat 的 activeCharacters 里。 - speaker 名字必须与登记名**完全一致**,不要加「(回忆)」「学姐」之类后缀或别名。 +- 每个 beat 的 activeCharacters 列出**此时此刻画面里出现的 NPC 角色**及其当下姿态/神情(中文)。即使没人说话,画面里有谁在也要列出。 + +玩家视角硬规则(重要 — 违反这条会破坏整个 galgame): + +【画面规则 — 严格禁止】 +- 玩家是第二人称 POV,**永远不出现在任何 Scene 画面里** +- activeCharacters[].name 数组**绝不允许**包含任何下列名字(任何大小写、中英文变体): + 「玩家」「你」「我」「主角」「protagonist」「player」「Player」「MC」「I」「me」 +- 玩家不会被设计立绘、不会被设计音色 + +【对白规则 — galgame 标准做法(Pattern B)】 +- 玩家**可以正常说话**——当主角对 NPC 开口时: + speaker = "你"(**固定用这两个字,不要用其他变体**) + line = 实际说的话(如「学姐,下雨了」) + lineDelivery 可以留空(玩家对白不会被 TTS 合成) +- speaker 字段允许的取值**只有两种**:① NPC 真名(必须在 activeCharacters 里)② "你" +- 其它 POV 变体(玩家 / 我 / 主角 / protagonist / player / MC / I / me)**一律视为错误** + +【内心 vs 外显的区分】 +- 主角在心里想 / 在做某个动作 / 在观察 / 自己的体感 → 用 narration(speaker 留空) + 例:"你的心跳得很快,几乎听不见外面的雨声。" +- 主角真的开口对 NPC 说出来 → 用 speaker="你" + line + 例:speaker="你" line="学姐,这把伞你拿着。" +- 同一个 beat 可以同时有 narration(心理活动 / 动作)和 speaker="你" + line(说出口的话) 必须输出严格 JSON,结构如下: { - "scenePrompt": "english scene description, no UI", + "sceneSummary": "中文场景概要:地点+时间+氛围+关键事件", + 
"sceneKey": "classroom-dusk", "entryBeatId": "b1", - "characterUpdates": [ - { "name": "夏海", "description": "女性,约17岁少女,音色清亮带点稚嫩甜美…" } - ], "beats": [ { "id": "b1", @@ -60,6 +110,9 @@ choice 的 effect 有两种: "speaker": "可空", "line": "可空(纯净文本)", "lineDelivery": "line 非空时必填:配音导演指令", + "activeCharacters": [ + { "name": "夏海", "pose": "脸红害羞地绞着衣角,双眼躲闪" } + ], "next": { "type": "continue", "nextBeatId": "b2" } }, { @@ -67,13 +120,26 @@ choice 的 effect 有两种: "speaker": "夏海", "line": "学长,我有话想对你说。", "lineDelivery": "鼓起勇气,但又有点害羞,语速偏慢,句尾微微上扬", + "activeCharacters": [ + { "name": "夏海", "pose": "鼓起勇气直视对方,双手紧握" } + ], + "next": { "type": "continue", "nextBeatId": "b3" } + }, + { + "id": "b3", + "narration": "你下意识攥紧了书包带,喉咙有点干。", + "speaker": "你", + "line": "……你说。", + "activeCharacters": [ + { "name": "夏海", "pose": "鼓起勇气直视对方,双手紧握" } + ], "next": { "type": "choice", "choices": [ { "id": "c1", "label": "继续追问", - "effect": { "kind": "advance-beat", "targetBeatId": "b3" } + "effect": { "kind": "advance-beat", "targetBeatId": "b4" } }, { "id": "c2", @@ -88,18 +154,24 @@ choice 的 effect 有两种: 不要输出 JSON 以外的任何文本。`; -export function buildDirectorUserMessage(session: Session): string { +export function buildWriterUserMessage(session: Session): string { const parts: string[] = []; parts.push(`世界观:${session.worldSetting}`); parts.push(`画风:${session.styleGuide}`); if (session.characters.length > 0) { - parts.push("\n已登记角色(speaker 必须用这些名字之一,或在本次 characterUpdates 里登记新名):"); + parts.push("\n已登记角色(speaker 必须用这些名字之一,或本场景新引入):"); for (const c of session.characters) { - parts.push(`- ${c.name}:${c.description}`); + parts.push(`- ${c.name}`); } } + const priorKeys = collectPriorSceneKeys(session); + if (priorKeys.length > 0) { + parts.push("\n已使用的 sceneKey(同一物理空间请沿用,不要新造):"); + for (const k of priorKeys) parts.push(`- ${k}`); + } + if (session.history.length === 0) { parts.push("\n这是故事的开场。请生成第一个场景,严格以 JSON 格式返回。"); return parts.join("\n"); @@ -108,7 +180,7 @@ export function buildDirectorUserMessage(session: 
Session): string { parts.push("\n场景历史(按时间顺序):"); session.history.forEach((entry, idx) => { const lines: string[] = [`【场景 ${idx + 1}】`]; - lines.push(` scenePrompt: ${entry.scene.scenePrompt}`); + if (entry.scene.sceneKey) lines.push(` sceneKey: ${entry.scene.sceneKey}`); const visited = entry.visitedBeatIds.length ? entry.visitedBeatIds @@ -157,9 +229,274 @@ export function buildDirectorUserMessage(session: Session): string { return parts.join("\n"); } +function collectPriorSceneKeys(session: Session): string[] { + const seen = new Set<string>(); + for (const entry of session.history) { + const k = entry.scene.sceneKey; + if (k) seen.add(k); + } + return Array.from(seen); +} + +// ────────────────────────────────────────────────────────────────────── +// 2. CharacterDesigner (角色设定师) — designs one new character. +// +// Receives a character NAME (extracted from the Writer's activeCharacters) +// and produces BOTH the English visual card AND the Chinese voice card +// in a single LLM call. Bundling these two is intentional: a single agent +// that "knows who this character is" produces internally-consistent +// appearance + vocal personality, whereas split agents tend to diverge +// (e.g., gentle-looking character with energetic voice). +// ────────────────────────────────────────────────────────────────────── + +export const CHARACTER_DESIGNER_SYSTEM = `你是视觉小说的「角色设定师」。给你一个**新登场角色的名字**,你要为这个角色同时设计两份卡片: +1. **视觉设定卡(英文)**——给生图模型 FLUX 用,遵循 prompt engineering 风格 +2. 
**音色设定卡(中文)**——给小米 MiMo 配音设计用 + +两份卡片要描绘**同一个人**——外貌温柔的人不该被配上张扬聒噪的嗓音;冷酷干练的人不该用甜软糯的童声。先在心里想清楚这个人的整体气质,再分两面落笔。 + +视觉设定卡 visualDescription 规则: +- **必须完全用英文** +- 风格:用形容词 + 短语,**英文逗号分隔**,符合 FLUX/Stable Diffusion prompt 习惯 +- 包含:年龄段、发型发色、眼睛 / 神情基调、面部特征、标志性服饰(款式 + 配色 + 花纹)、整体气质 +- **不要写瞬时姿势或表情**(这些由编剧/分镜每帧实时控制) +- **必须融入全局画风** styleGuide 的美术指向(比如 styleGuide 是「赛博朋克」时,服饰要赛博朋克化) +- 长度:80–150 个英文词为宜 +- 不要包含背景环境(这不是场景图,是角色立绘卡) + +音色设定卡 voiceDescription 规则: +- **必须以明确性别开头**:"女性,…" / "男性,…" +- 随后描述:年龄段(如「约17岁少女」「30 出头男性」)、音色质感、性格情绪基调、语速节奏、人设腔调、口音方言 +- 用中文,整段连续描述,不分段 +- 长度:50–80 个中文字为宜 +- 例:"女性,约17岁少女,音色清亮带点稚嫩甜美,性格开朗外向但容易害羞,语速偏快,标准普通话" + +必须输出严格 JSON: +{ + "visualDescription": "English visual card, comma-separated tags...", + "voiceDescription": "中文音色卡,以性别开头..." +} + +不要输出 JSON 以外的任何文本。`; + +export function buildCharacterDesignerUserMessage( + charName: string, + session: Session, +): string { + const parts: string[] = []; + parts.push(`角色名:${charName}`); + parts.push(`世界观:${session.worldSetting}`); + parts.push(`全局美术画风:${session.styleGuide}`); + + const others = session.characters.filter((c) => c.visualDescription); + if (others.length > 0) { + parts.push("\n已设定角色(外貌应与他们有区分):"); + for (const c of others) { + parts.push(`- ${c.name}: ${c.visualDescription}`); + } + } + + parts.push( + "\n请为该角色同时设计 visualDescription(英文)和 voiceDescription(中文),严格以 JSON 格式返回。", + ); + return parts.join("\n"); +} + +// ────────────────────────────────────────────────────────────────────── +// 3. Cinematographer (分镜导演) — composes the visual frame. +// +// Reads the Writer's sceneSummary + active characters and produces the +// English compositional prompt fed to FLUX. Does NOT describe the +// characters themselves (those archetypes are appended at the Painter +// stage from session.characters.visualDescription). Only describes the +// ENVIRONMENT, lighting, camera framing, and how the characters are +// positioned within the frame. 
+// ────────────────────────────────────────────────────────────────────── + +export const CINEMATOGRAPHER_SYSTEM = `你是视觉小说的「分镜导演」。给你编剧的当前场景概要、活跃角色名单和他们在场景里的姿态描述,以及**入口 beat 的 speaker 信息**(用来决定镜头语言)。你的任务是**只用英文**写一段**纯环境+构图**的描述(integratedPrompt),交给画师作为出图主提示词。 + +你**不要**写角色的外貌细节——发色、服饰、脸型这些由其他 agent 提供,画师会把"角色档案卡"附加到你的 integratedPrompt 后面。你只关心: +- **环境**:地点、时间、天气、光线、空间细节(什么家具/植物/物件) +- **构图 / 镜头**:景别(wide shot / medium shot / close-up / over-the-shoulder)、机位、视角 +- **人物在画面中的位置和姿态**(不写脸 / 不写穿什么——只写"哪个角色站在哪儿、在做什么") +- **氛围**:情绪基调、色调、影调(warm dusk / cold neon / soft morning light) + +═══════════════════════════════════════════════════════════════════ +玩家视角硬规则(与画面相关,必须严格遵守) +═══════════════════════════════════════════════════════════════════ +- 玩家本人**永远不出现在画面里**——不画 player 的身体、手、肩膀、背影、剪影、脚、头发 +- integratedPrompt 中**绝对禁止**出现下列英文(或中文等价): + "first-person view" · "POV of the protagonist" · "player's hand / arm / shoulder / back" + "protagonist visible" · "from the player's perspective" · "MC" · "player's silhouette" +- 镜头是一个"隐形的观察者位置"——可以位于玩家的视角附近(NPC 像在看玩家),但**绝不画出玩家本身** + +═══════════════════════════════════════════════════════════════════ +动态镜头策略(根据入口 beat 的 speaker 字段选择镜头) +═══════════════════════════════════════════════════════════════════ +你会收到 entryBeatSpeaker 字段。按以下规则选镜头: + +【entryBeatSpeaker = 某个 NPC 名字】 → NPC 正在对玩家说话 +- 优先 **close-up 或 medium close-up**,NPC 看向画面外(= 看玩家) +- 关键英文:close-up / medium close-up, looking toward camera, eyes meeting the viewer, + direct gaze, lips parted mid-speech +- 制造"她正在对你说话"的代入感(galgame 经典直视镜头) + +【entryBeatSpeaker = "你"】 → 玩家正在对 NPC 说话 +- 优先 **medium shot**,NPC 居中,做"在听玩家说话"的姿态 +- 关键英文:medium shot, attentively listening, facing the camera, + head slightly tilted, expression of attention +- ❌ 不要写 over-the-shoulder(因为这会暗示画出玩家肩膀,违反 POV 规则) + +【entryBeatSpeaker 为空】 → 纯环境 / 旁白 beat +- 优先 **wide establishing shot**,展现环境氛围 +- 关键英文:wide establishing shot, atmospheric mood, environmental detail +- 如果有 NPC 在场,他们可以处于远处 / 中景 / 自然状态(不必看镜头) + 
+【entryBeatActive 有多个角色】 → 群像 +- 使用 **medium group shot 或 medium wide shot**,多人在一个框内 +- 关键英文:medium group shot, two-shot / three-shot, characters arranged in the frame + +═══════════════════════════════════════════════════════════════════ +输出 JSON 结构 +═══════════════════════════════════════════════════════════════════ +{ + "shotType": "close-up / medium shot / wide establishing / medium group shot / ...", + "integratedPrompt": "English. Environment + composition + character positioning + camera language. No dialogue boxes, no UI. 80-150 words." +} + +写作要求: +- integratedPrompt **必须英文**,遵循 FLUX prompt engineering 习惯(形容词 + 短语,英文逗号分隔,必要时短句) +- 提到具体角色时**只用其名字 + 动作**,例如 "Natsumi standing by the window, head slightly bowed"——绝不要写她长什么样 +- 不描述任何 UI、字幕、对话框、边框 +- 不描述图像之外的事情(不要写"this scene depicts..."这种 meta 句) +- 长度 80–150 英文词 + +不要输出 JSON 以外的任何文本。`; + +export function buildCinematographerUserMessage( + sceneSummary: string, + styleGuide: string, + entryBeatActive: BeatActiveCharacter[], + entryBeatSpeaker: string | undefined, + priorSceneKey: string | undefined, + currentSceneKey: string | undefined, +): string { + const parts: string[] = []; + parts.push(`全局美术画风:${styleGuide}`); + parts.push(`\n当前场景(来自编剧):${sceneSummary}`); + + if (entryBeatActive.length > 0) { + parts.push("\n开场画面里的角色及其姿态:"); + for (const c of entryBeatActive) { + parts.push(`- ${c.name}:${c.pose ?? "(无具体姿态描述)"}`); + } + } else { + parts.push("\n开场画面里没有角色(纯环境)。"); + } + + // entryBeatSpeaker drives the dynamic camera policy (see CINEMATOGRAPHER_SYSTEM). + // "你" means the player is speaking; an NPC name means an NPC is speaking; + // empty means no dialog (pure environment / narration beat). 
+ if (entryBeatSpeaker === "你") { + parts.push( + '\n开场 beat 是**玩家说话**(speaker = "你")——按动态镜头策略:medium shot,NPC 居中、做听玩家说话的姿态、看向画面外。**绝不要画出玩家**。', + ); + } else if (entryBeatSpeaker) { + parts.push( + `\n开场 beat 是 **${entryBeatSpeaker} 在对玩家说话**(speaker = "${entryBeatSpeaker}")——按动态镜头策略:close-up 或 medium close-up,${entryBeatSpeaker} 看向画面外(看玩家),眼神交流。`, + ); + } else { + parts.push( + "\n开场 beat 没有 speaker(纯旁白/环境)——按动态镜头策略:wide establishing shot 展现环境氛围。", + ); + } + + if (priorSceneKey && currentSceneKey && priorSceneKey === currentSceneKey) { + parts.push( + `\n注意:上一场和本场 sceneKey 都是 "${currentSceneKey}"——画师会把上一张场景图作为 referenceImages 之一锚定同一空间。你的 integratedPrompt 应该**强调连续性**,描述时段/情绪/构图的细微变化,而不是完全重新设定空间。`, + ); + } + + parts.push("\n请输出 shotType + integratedPrompt,严格以 JSON 格式返回。"); + return parts.join("\n"); +} + +// ────────────────────────────────────────────────────────────────────── +// 4. Painter (画师) — final image prompt assembly. +// +// Not an LLM agent — a pure prompt-building function that combines the +// Cinematographer's integratedPrompt with character archetype blocks +// (visual cards) and the standard FLUX constraints. +// ────────────────────────────────────────────────────────────────────── + +export function buildPainterPrompt( + integratedPrompt: string, + styleGuide: string, + characters: { name: string; visualDescription?: string }[], +): string { + const archetypeBlock = characters + .filter((c) => c.visualDescription) + .map((c) => `[CHARACTER: ${c.name}]\n${c.visualDescription}`) + .join("\n\n"); + + const archetypeSection = archetypeBlock + ? `\n\nCHARACTER ARCHETYPES (anchor identity, outfit, and style across scenes — keep each character visually identical to their archetype):\n${archetypeBlock}` + : ""; + + return `Generate a cinematic landscape background illustration, 16:9 widescreen (1792x1024). 
+ +ART STYLE: ${styleGuide} + +SCENE COMPOSITION (from cinematographer — environment + camera framing + character positioning): +${integratedPrompt}${archetypeSection} + +STRICT RULES — NEVER violate these: +- DO NOT draw any dialogue boxes, speech bubbles, text panels, or any rectangular overlay. +- DO NOT draw any buttons, choice options, menu items, or interactive UI elements. +- DO NOT render any Chinese or English text anywhere in the image. +- DO NOT add any HUD, interface chrome, or game UI elements. +- The image is a PURE BACKGROUND SCENE ONLY. All UI will be added as HTML on top. +- 16:9 LANDSCAPE orientation — wider than tall. No portrait or square output. +- Leave the bottom 35% of the frame relatively uncluttered (darker or softer) so overlaid UI panels remain readable. +- Characters or key scene elements should be positioned in the upper 65% of the frame. +- Maintain character identity exactly as specified in CHARACTER ARCHETYPES — same face, same hairstyle, same outfit across every scene. + +PLAYER POV RULES — the player / protagonist is the unseen viewer: +- The player / protagonist is NEVER visible in the frame — no body parts, no hands, no shoulders, no back of head, no silhouette, no feet, no hair. +- DO NOT use first-person POV that implies the player's body in frame. +- When an NPC is speaking to the player, they SHOULD look toward the camera (toward the player's implied position) — this creates eye contact without showing the player. +- The camera position represents the player's gaze; only NPCs, scenery, and objects are rendered.`; +} + +// Character portrait prompt — for the per-character base image generated +// once when the CharacterDesigner introduces a new character. The portrait +// is used both as a client-side asset (立绘登场) and as a referenceImages +// entry when rendering later scenes for visual consistency. 
+export function buildCharacterPortraitPrompt( + charName: string, + visualDescription: string, + styleGuide: string, +): string { + return `Character concept portrait sheet, single character, full-body or upper-body composition, neutral standing pose, looking toward camera, neutral expression, plain neutral background (no environment, no scenery). + +ART STYLE: ${styleGuide} + +CHARACTER (${charName}): +${visualDescription} + +STRICT RULES: +- ONE character only — no other people, no crowd, no background characters. +- Plain neutral background (off-white or soft gradient). NO environment, NO furniture, NO props beyond what's worn. +- Neutral, calm pose and expression — this is a reference sheet, not a dramatic shot. +- NO text, NO UI, NO watermark, NO border. +- The character should be clearly visible and centered, the pose natural and relaxed. +- 16:9 landscape orientation.`; +} + // ────────────────────────────────────────────────────────────────────── // Insert-Beat — given a freeform vision action that is judged to stay // *within* the current scene, generate one transient beat. +// Single-agent path; no character design / no rendering involved. // ────────────────────────────────────────────────────────────────────── export const INSERT_BEAT_SYSTEM = `你是视觉小说编剧。玩家在当前场景内做了一个**不会换场景的自由动作**(比如看一眼桌上的相框、想了想刚才那句话)。请基于此动作,写出一个**单独的、过渡性的 beat**:可以是旁白、角色台词、或两者结合。 @@ -169,8 +506,15 @@ export const INSERT_BEAT_SYSTEM = `你是视觉小说编剧。玩家在当前场 - narration 与 line 加起来 ≤80 字 - 不要打破当前场景的物理状态(玩家仍在原地、对面仍是同一个角色) - 不要生成选项或下一步指引 —— 玩家点击会自然回到原 beat -- 如果有 line,speaker 必须用**已登记角色**里的名字(绝不允许引入新角色) -- 如果有 line,**必须**给出 lineDelivery(配音导演指令,自由中文,描述这句话怎么念) + +speaker 字段允许的取值**只有两种**(与主路径 Writer 一致 — Pattern B galgame 标准): +1. **已登记角色**里的 NPC 真名(**绝不允许引入新角色**) +2. 
**"你"** — 玩家本人在自言自语 / 说一句过渡性的话(对白框显示,但不调 TTS) + +其它任何 POV 变体(玩家 / 我 / 主角 / protagonist / player / MC / I / me)**一律错误**,请用 "你" 代替。 + +- 如果有 line 且 speaker = NPC,**必须**给出 lineDelivery(配音导演指令) +- 如果有 line 且 speaker = "你",lineDelivery 可以留空(玩家对白不调 TTS) 必须输出严格 JSON: { @@ -198,9 +542,10 @@ export function buildInsertBeatUserMessage( const current = session.history.at(-1); if (current) { - parts.push(`\n当前场景:${current.scene.scenePrompt}`); - const lastBeatId = current.visitedBeatIds.at(-1) ?? current.scene.entryBeatId; - const lastBeat = current.scene.beats.find((b) => b.id === lastBeatId); + const scene: Scene = current.scene; + parts.push(`\n当前场景:${scene.scenePrompt}`); + const lastBeatId = current.visitedBeatIds.at(-1) ?? scene.entryBeatId; + const lastBeat = scene.beats.find((b) => b.id === lastBeatId); if (lastBeat) { const recent: string[] = []; if (lastBeat.narration) recent.push(`旁白:${lastBeat.narration}`); @@ -214,31 +559,10 @@ export function buildInsertBeatUserMessage( return parts.join("\n"); } -// ────────────────────────────────────────────────────────────────────── -// Image renderer -// ────────────────────────────────────────────────────────────────────── - -export function buildImagePrompt(scene: Scene, styleGuide: string): string { - return `Generate a cinematic landscape background illustration, 16:9 widescreen (1792x1024). - -ART STYLE: ${styleGuide} - -SCENE (fill the ENTIRE canvas — no UI elements, no text overlays): -${scene.scenePrompt} - -STRICT RULES — NEVER violate these: -- DO NOT draw any dialogue boxes, speech bubbles, text panels, or any rectangular overlay. -- DO NOT draw any buttons, choice options, menu items, or interactive UI elements. -- DO NOT render any Chinese or English text anywhere in the image. -- DO NOT add any HUD, interface chrome, or game UI elements. -- The image is a PURE BACKGROUND SCENE ONLY. All UI will be added as HTML on top. -- 16:9 LANDSCAPE orientation — wider than tall. No portrait or square output. 
-- Leave the bottom 35% of the frame relatively uncluttered (darker or softer) so overlaid UI panels remain readable. -- Characters or key scene elements should be positioned in the upper 65% of the frame.`; -} - // ────────────────────────────────────────────────────────────────────── // Vision — interprets a background click and classifies the action. +// Unchanged from staging (UI choices live in HTML, vision only judges +// background clicks). // ────────────────────────────────────────────────────────────────────── export const VISION_SYSTEM_PROMPT = `你是视觉理解助手。玩家在视觉小说的背景图上点击了红色圆点位置(HTML 上的选项按钮不会走到你这里)。你的任务是: @@ -265,3 +589,5 @@ export function buildVisionUserPrompt(scene: Scene | null): string { 红点位置即为玩家点击位置。请判断玩家意图与分类,以 JSON 格式返回。`; } + +export type PainterCharacterInput = Pick; diff --git a/packages/engine/src/renderer.ts b/packages/engine/src/renderer.ts deleted file mode 100644 index 27f4292..0000000 --- a/packages/engine/src/renderer.ts +++ /dev/null @@ -1,12 +0,0 @@ -import { generateImage } from "@yume/ai-client"; -import type { ProviderConfig, Scene } from "@yume/types"; -import { buildImagePrompt } from "./prompts"; - -export async function render( - config: ProviderConfig, - scene: Scene, - styleGuide: string, -): Promise { - const prompt = buildImagePrompt(scene, styleGuide); - return generateImage(config, prompt); -} diff --git a/packages/engine/src/voice.ts b/packages/engine/src/voice.ts index d61464b..64d54c3 100644 --- a/packages/engine/src/voice.ts +++ b/packages/engine/src/voice.ts @@ -1,25 +1,11 @@ -import { provisionVoice, synthesize } from "@yume/tts-client"; -import type { - BeatAudio, - Character, - CharacterVoice, - Scene, - Session, - TtsConfig, -} from "@yume/types"; +import { synthesize } from "@yume/tts-client"; +import type { BeatAudio, CharacterVoice, TtsConfig } from "@yume/types"; // Per-beat synth budget. MiMo's median synth is 3–7s; the tail can spike // to 30–70s under concurrent load. 
Capping here means a single bad beat // degrades to silent in <15s instead of blocking the whole UI flow. const SYNTH_TIMEOUT_MS = 15000; -// When the director references a speaker that was never registered, derive a -// description from the name + world so the voice's gender/temperament is at -// least inferred from the name — never borrowed from another character. -function inferredSpeakerDescription(name: string, session: Session): string { - return `请根据角色名「${name}」推断其性别、年龄与气质,生成最贴合的音色。所属世界观:${session.worldSetting}`; -} - // Race the work against a timer; on either outcome clear the timer (otherwise // the success path leaks a 15s-pending reject closure into Node's timer heap, // per-synth call). On timeout, abort the supplied controller so the underlying @@ -47,82 +33,15 @@ async function withTimeout( } } -// Provision voices for all unseen speakers in a scene, in parallel. -// Does NOT synthesize per-beat audio — that happens lazily via -// synthesizeBeat from the /api/beat-audio route. Returning the populated -// registry lets the client fire per-beat synth without re-provisioning. -// -// Why dedupe before fanning out: the SAME unseen speaker appearing in 3 -// beats must run voicedesign once; parallel design of the same speaker -// would burn three voices' worth of budget and pick whichever raced last. -export async function provisionVoicesForScene( - cfg: TtsConfig, - session: Session, - scene: Scene, -): Promise<{ characters: Character[] }> { - const tScene = Date.now(); - const speakingBeats = scene.beats.filter( - (b): b is typeof b & { speaker: string; line: string } => - Boolean(b.speaker && b.line), - ); - - let characters: Character[] = [...session.characters]; - const toProvision = new Map(); // name -> description - for (const b of speakingBeats) { - if (toProvision.has(b.speaker)) continue; - const existing = characters.find((c) => c.name === b.speaker); - if (existing?.voice) continue; - toProvision.set( - b.speaker, - existing?.description ?? 
inferredSpeakerDescription(b.speaker, session), - ); - } - - if (toProvision.size === 0) { - console.log( - `[voice] provisionVoicesForScene total=${Date.now() - tScene}ms (no new speakers)`, - ); - return { characters }; - } - - const tProvision = Date.now(); - const provisioned = await Promise.all( - Array.from(toProvision.entries()).map(async ([name, description]) => { - try { - const voice = await provisionVoice(cfg, description); - return { name, description, voice }; - } catch (err) { - const msg = err instanceof Error ? err.message : String(err); - console.error(`[voice] provision degraded for ${name}: ${msg}`); - return { name, description, voice: undefined }; - } - }), - ); - console.log( - `[voice] provision: ${toProvision.size} speakers parallel max=${Date.now() - tProvision}ms`, - ); - - for (const p of provisioned) { - if (!p.voice) continue; - const idx = characters.findIndex((c) => c.name === p.name); - if (idx === -1) { - characters.push({ name: p.name, description: p.description, voice: p.voice }); - } else { - characters[idx] = { ...characters[idx]!, voice: p.voice }; - } - } - - console.log( - `[voice] provisionVoicesForScene total=${Date.now() - tScene}ms`, - ); - return { characters }; -} - // Synthesize audio for one beat. Caller is expected to have already // resolved the speaker's voice (from session.characters in the client) — // passing it directly here keeps the /api/beat-audio payload small and // makes this function pure with respect to session state. // Returns null on error or timeout; caller treats null as "play silent." +// +// (Voice PROVISIONING — designing a voice for a new character from a +// voiceDescription — lives in agents/characterDesigner.ts now. This file +// only handles per-beat SYNTHESIS using an already-provisioned voice.) 
 export async function synthesizeBeat(
   cfg: TtsConfig,
   voice: CharacterVoice,
diff --git a/packages/types/src/index.ts b/packages/types/src/index.ts
index 5d0e86b..01be754 100644
--- a/packages/types/src/index.ts
+++ b/packages/types/src/index.ts
@@ -11,9 +11,21 @@ export type Beat = {
   line?: string;
   /** Free-form voice-acting direction for the line, sent to TTS only. Never displayed. */
   lineDelivery?: string;
+  /**
+   * Characters visible in this beat with their pose / expression for this moment.
+   * Read by the Cinematographer when composing the scene's establishing shot —
+   * the beat the entry beat lands in is the visual anchor for the image.
+   */
+  activeCharacters?: BeatActiveCharacter[];
   next: BeatNext;
 };
 
+export type BeatActiveCharacter = {
+  name: string;
+  /** Free-form 中文 description of pose / expression / what the character is doing. */
+  pose?: string;
+};
+
 export type BeatNext =
   | { type: "continue"; nextBeatId: string }
   | { type: "choice"; choices: BeatChoice[] };
@@ -39,6 +51,22 @@ export type Scene = {
   scenePrompt: string;
   beats: Beat[];
   entryBeatId: string;
+  /**
+   * Stable English slug identifying the visual scene's location + time,
+   * e.g. "classroom-dusk", "rooftop-night". When the next Scene shares this
+   * key, the Painter slots the previous Scene's image into Runware's
+   * `referenceImages` (alongside character portraits) so the same physical
+   * space stays visually consistent across cuts. (Originally planned as a
+   * seedImage / img2img anchor, but FLUX.2 [klein] 9B KV does not support
+   * seedImage — referenceImages serves the same purpose with the model.)
+   */
+  sceneKey?: string;
+  /**
+   * Runware UUID of this Scene's generated image — once uploaded, subsequent
+   * Scenes that match sceneKey can reference it via `referenceImages`
+   * without resending base64.
+   */
+  imageUuid?: string;
 };
 
 export type SceneExit =
@@ -69,8 +97,32 @@ export type CharacterVoice = {
 
 export type Character = {
   name: string;
-  /** Free-form voice design description; must begin with explicit gender. */
-  description: string;
+  /**
+   * 中文 voice-acting direction card. Must begin with explicit gender, then
+   * age / timbre / personality / speed / accent. Fed to Xiaomi MiMo's
+   * voicedesign endpoint when the voice is first provisioned.
+   */
+  voiceDescription: string;
+  /**
+   * English appearance card — comma-separated visual attributes following
+   * Runware/FLUX prompt-engineering convention. Fed to the Painter as a
+   * character archetype anchor so the same face/outfit/style stays consistent
+   * across every scene this character appears in.
+   */
+  visualDescription?: string;
+  /**
+   * Base portrait image generated by the CharacterDesigner once, then reused
+   * as a Runware `referenceImages` entry in every subsequent scene the
+   * character appears in. Stored as base64 for client display.
+   */
+  basePortraitBase64?: string;
+  /**
+   * Runware UUID for the base portrait. Once uploaded via the image-upload
+   * endpoint, subsequent Painter calls reference this UUID instead of
+   * resending the full base64 payload.
+   */
+  basePortraitUuid?: string;
+  /** Xiaomi MiMo voice reference audio. */
   voice?: CharacterVoice;
 };
 
@@ -90,7 +142,7 @@ export type Session = {
   worldSetting: string;
   styleGuide: string;
   history: SceneHistoryEntry[];
-  /** Character registry — accumulates across scenes; voices persist for reuse. */
+  /** Character registry — accumulates across scenes; voices + portraits persist for reuse. */
   characters: Character[];
 };
 
@@ -145,7 +197,7 @@ export type StartResponse = {
   sessionId: string;
   scene: Scene;
   imageBase64: string;
-  /** Character registry with voice references provisioned for new speakers. */
+  /** Character registry with voice references + visual cards provisioned. */
   characters: Character[];
 };
 
@@ -165,11 +217,6 @@ export type SceneResponse = {
 // /api/beat-audio — lazily synthesize one beat's voice. Client fires this
 // per beat after a scene loads; server has a per-call timeout so MiMo
 // tail-latency cannot block the UI. A null audio response means "play silent."
-//
-// Payload deliberately slim: just the line to speak and the speaker's voice
-// reference. The client extracts the voice from its local session.characters
-// before posting — sending the full Session would force ~160KB of base64 per
-// OTHER speaker plus the entire scene history to ride along for nothing.
 export type BeatAudioRequest = {
   beat: {
     id: string;