feat(engine): multi-agent character consistency pipeline (#6)
* feat(types): Character.voiceDescription rename + visual fields + Scene.sceneKey

Prepares the type surface for the multi-agent scene pipeline:
- Character.description → voiceDescription (clearer pairing with new
  visualDescription)
- Character gains visualDescription (English appearance card for Painter)
  + basePortraitBase64 + basePortraitUuid (for Runware referenceImages
  reuse)
- Scene gains sceneKey (English slug for cross-scene img2img continuity)
  + imageUuid (Runware UUID of the scene's rendered image for cheap
  seedImage reuse on subsequent same-sceneKey calls)
- Beat gains activeCharacters[] so the Cinematographer can read which
  characters are on-screen + their poses when composing the establishing
  shot

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ai-client): generateImage img2img + multi-reference options + uploadImage

Extends the Runware adapter to support the two anchoring mechanisms
FLUX.2 [klein] 9B KV needs for character + scene visual consistency:
- generateImage gains optional { seedImage, referenceImages, strength }:
  seedImage drives img2img (single starting image, sceneKey continuity),
  referenceImages drives multi-reference anchoring (up to 4 character
  portraits, capped per Runware spec). Default strength 0.85 — FLUX
  ignores strength < 0.8.
- uploadImage POSTs a base64 to Runware's imageUpload taskType and returns
  the UUID, so portraits/scene snapshots can be referenced by UUID on
  subsequent calls instead of resending base64 every scene.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(engine): multi-agent scene pipeline (Writer→CharDesigner+Cinematographer→Painter)

Replaces the single-LLM directScene with a four-agent pipeline that
specializes each concern and parallelizes the slow parts.
Adopts the core idea from #4 (multi-agent dispatch + character visual
consistency) and grafts it onto the Scene/Beat architecture introduced
in #2.

Pipeline per Scene (~9-12s critical path with parallelization):

  Writer LLM (sequential, ~3s)
    │  outputs: sceneSummary + sceneKey + beats[] (each beat carries
    │           activeCharacters[] with poses)
    │
    ├─ CharacterDesigner LLM × N new chars (parallel)
    │    │  outputs: { visualDescription (English appearance card),
    │    │             voiceDescription (Chinese voice card) }
    │    ├─ FLUX portrait gen → upload → UUID   (parallel within agent)
    │    └─ Xiaomi MiMo voicedesign provision   (parallel within agent)
    │
    └─ Cinematographer LLM (parallel with CharacterDesigner)
         outputs: { shotType, integratedPrompt (English composition +
                    camera position + character blocking) }

  Painter (FLUX img2img + referenceImages, ~1-3s)
    inputs: integratedPrompt + onStageCharacters' archetype block
            + (optional) prior sceneKey-hit scene as seedImage
            + (optional) character portrait UUIDs as referenceImages
    fallback chain: A) both anchors → B) refs only (keeps characters)
                    → C) seed only (keeps backdrop) → D) pure t2i
    output uploaded → Scene.imageUuid for the next sceneKey hop

Why this carving:
- Writer focuses purely on narrative (drops the voice-design duty that
  staging's DIRECTOR_SYSTEM was carrying as a side concern).
- CharacterDesigner bundles visual + voice so the agent that thinks "who
  is this character" produces internally-consistent appearance + vocal
  personality (split agents tend to diverge).
- Cinematographer doesn't need character visualDescriptions — Painter
  appends archetypes after — so it parallelizes with CharacterDesigner.
- sceneKey enables cross-scene backdrop continuity that Scene/Beat
  doesn't cover (Scene/Beat only reuses backdrop WITHIN a scene's beats;
  sceneKey reuses across scenes that share a location).

Other changes:
- voice.ts loses provisionVoicesForScene (moved into CharacterDesigner);
  keeps synthesizeBeat for the lazy per-beat /api/beat-audio path.
- renderer.ts deleted (replaced by agents/painter.ts).
- directInsertBeat (vision-driven in-scene exploration) stays single-LLM:
  it forbids new characters and produces no image, so multi-agent
  doesn't apply.

apps/web is unchanged: orchestrator.ts keeps the same exports
(startSession / requestScene / visionDecide / requestInsertBeat /
requestBeatAudio) with identical request/response shapes.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): Pattern B player POV + JSON repair + drop seedImage tier

Three hotfixes surfaced by manual end-to-end testing of the multi-agent
pipeline.

F1 — Player viewpoint (galgame Pattern B):
- Writer accepts speaker="你" for player dialog (renders in dialog box,
  never TTS'd because no Character record exists for "你"). Filter POV
  variants (玩家/我/主角/protagonist/player/I/me/...) from
  activeCharacters so CharacterDesigner never wastes API calls on the
  player. Two-layer defense: explicit prompt rule in WRITER_SYSTEM + code
  normalization (POV_VARIANTS set, isPovName, normalizeSpeakerName).
- Cinematographer and Painter prompts gain "player never in frame" rule
  so the player never appears in any rendered scene.
- Cinematographer gains dynamic camera policy driven by the entry beat's
  speaker: NPC-speaker → close-up looking toward camera; "你"-speaker →
  medium shot of attentive NPC; no speaker → wide establishing shot.
- director.ts filters POV from orphanSpeakers so provisionVoiceForName
  never fires for "你".

F2 — JSON parsing robustness:
- parseJsonLoose gains a 4th repair tier: strip JS-style comments, strip
  trailing commas, insert missing commas between adjacent objects /
  arrays / quoted values. Logs the first 800 chars of raw LLM output when
  all repair attempts fail, so we can see what the model emitted.

F3 — Drop seedImage, use referenceImages for prior scene:
- FLUX.2 [klein] 9B KV does not support seedImage (img2img). Removed
  Tier A (seedImage+refs) and Tier C (seedImage only) from the Painter
  degradation chain.
  New layout: prior scene's image slots into referenceImages[0] for
  spatial continuity, character portraits fill slots 1-3 (Runware caps
  at 4 total). Cinematographer instructed to emphasize continuity when
  sceneKey matches a prior scene.

All five package typechecks pass.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): address Copilot review feedback on #6

Three targeted fixes from PR #6 Copilot review.

F4 — Stale seedImage/img2img docstrings

Four locations still referenced the original img2img design after F3
switched to referenceImages-based spatial continuity:
- types/index.ts:57 Scene.sceneKey docstring
- types/index.ts:63 Scene.imageUuid docstring
- director.ts:34 pipeline diagram in module block comment
- director.ts:128 directScene JSDoc

Doc-only changes; misleading wording corrected to mention
referenceImages. (The design-rationale comment in
pickPriorSceneReference is kept — it explains WHY we don't use seedImage
and is load-bearing context.)

F5 — Remove JS-comment stripping from JSON repair pass

parseJsonLoose's repair tier previously stripped `// ...` and
`/* ... */` across the entire text, which would corrupt JSON string
values containing URLs (e.g. "https://example.com" → "https:"). Since
LLMs in `responseFormat: "json_object"` mode essentially never emit
comments, dropping the comment-stripping step is a net win for safety.
Trailing-comma and missing-comma repair (the high-frequency failures)
are kept.

F6 — Pattern B parity on the insert-beat path

Previously: directInsertBeat's INSERT_BEAT_SYSTEM forbade any speaker
not in session.characters, and the orchestrator's unregistered-speaker
guard demoted such lines to narration. This meant the player could not
speak via speaker="你" in transient in-scene beats — inconsistent with
the Writer path.
Fix:
- INSERT_BEAT_SYSTEM prompt now allows speaker="你" (NPC name OR "你")
  and rejects other POV variants
- directInsertBeat applies normalizeSpeakerName to the LLM output, same
  as the Writer path, so POV variants collapse to "你"
- lineDelivery is dropped when speaker="你" (no TTS for player)
- orchestrator's unregistered-speaker guard adds a `speaker !== "你"`
  exception so Pattern B player dialog passes through

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(engine): drop "JS-style comments" from parseJsonLoose header

The function header listed JS-style comments as a step-4 repair, but F5
already removed comment stripping from `repairJsonString` because the
regex would corrupt URLs inside JSON string values. The inner function's
comment was updated then; this header was missed.

Doc-only sync from second-round Copilot review on #6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: QiChen88 <2291969160@qq.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
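The F5 reasoning can be checked with a small sketch. The names below (`stripCommentsNaively`, `repairCommas`, `parseWithRepair`) are hypothetical; the bodies of the engine's `parseJsonLoose` / `repairJsonString` are not part of this diff, so the regexes are only one plausible way to do trailing-comma and missing-comma repair, applied after a strict `JSON.parse` has already failed.

```typescript
// Hypothetical sketch, not the engine's actual parseJsonLoose.

// The kind of comment stripper F5 removed: a regex cannot tell a `//`
// inside a JSON string value from a real comment.
function stripCommentsNaively(text: string): string {
  return text.replace(/\/\/.*$/gm, "").replace(/\/\*[\s\S]*?\*\//g, "");
}

// Comma-only repair: drop trailing commas, then insert commas that are
// missing between adjacent objects / arrays. Still regex-based, so it can
// touch string contents too; it is meant as a last resort only.
function repairCommas(text: string): string {
  return text
    .replace(/,\s*([}\]])/g, "$1") // {"a":1,}   -> {"a":1}
    .replace(/([}\]])\s*([{\[])/g, "$1,$2"); // [{} {}] -> [{},{}]
}

function parseWithRepair<T>(raw: string): T {
  try {
    return JSON.parse(raw) as T; // strict parse first
  } catch {
    return JSON.parse(repairCommas(raw)) as T; // throws if still invalid
  }
}

// Malformed LLM-style output: trailing commas plus a missing comma.
const raw = '{"beats":[{"id":"b1",} {"id":"b2","url":"https://example.com"}],}';
const parsed = parseWithRepair<{ beats: { id: string; url?: string }[] }>(raw);
console.log(parsed.beats.length, parsed.beats[1]?.url);

// The URL survives comma repair, but not naive comment stripping.
console.log(stripCommentsNaively('"https://example.com"')); // "https:
```

Keeping the repair step behind a failed strict parse limits the damage the regexes can do to well-formed output.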
+119 -19
@@ -4,8 +4,22 @@ import { fetchWithRetry } from "./fetchWithRetry";
 // Runware uses its own task-array protocol (not OpenAI-compatible).
 // POST <baseUrl> with [{ taskType: "imageInference", ... }]; errors come
 // back as a 200 with `errors[]`, so we have to inspect the body either way.
 
+// FLUX img2img specifics:
+//   - strength < 0.8 has minimal-to-no visible effect on FLUX models (per
+//     Runware docs); we default to 0.85 which leaves room to deviate while
+//     still anchoring on the seed image's composition.
+//   - referenceImages caps at 4 per request; the FLUX.2 [klein] 9B KV model
+//     (runware:400@6) accelerates multi-reference inference by ~2.5× via its
+//     KV cache for reference latents (cached only WITHIN one inference run —
+//     not persisted across API calls, hence the upload-once-then-reference
+//     strategy below).
+const DEFAULT_IMG2IMG_STRENGTH = 0.85;
+const MAX_REFERENCE_IMAGES = 4;
+
 type RunwareImageResult = {
   imageBase64Data?: string;
+  imageUUID?: string;
 };
 type RunwareError = {
   code?: string;
@@ -17,27 +31,58 @@ type RunwareResponse = {
   errors?: RunwareError[];
 };
 
+export type GenerateImageOptions = {
+  /**
+   * Reference image (UUID, plain base64, or data URI) to use as the
+   * img2img starting point. When set, FLUX preserves the seed image's
+   * composition and applies `strength` to allow deviation from it.
+   * Used for cross-scene visual continuity when sceneKey hits.
+   */
+  seedImage?: string;
+  /**
+   * Reference images (UUIDs or base64) to condition the generation on —
+   * typically character portraits to anchor identity / outfit / style
+   * across scenes. Runware caps at 4; we silently truncate beyond that.
+   */
+  referenceImages?: string[];
+  /** 0–1, FLUX needs ≥ 0.8 to actually have an effect. */
+  strength?: number;
+};
+
+// ──────────────────────────────────────────────────────────────────────
+// generateImage — text-to-image (default) or img2img / multi-reference
+// when seedImage / referenceImages are supplied. Returns base64.
+// ──────────────────────────────────────────────────────────────────────
+
 export async function generateImage(
   config: ProviderConfig,
   prompt: string,
+  options?: GenerateImageOptions,
 ): Promise<string> {
   const url = config.baseUrl.replace(/\/$/, "");
 
-  const body = [
-    {
-      taskType: "imageInference",
-      taskUUID: crypto.randomUUID(),
-      model: config.model,
-      positivePrompt: prompt,
-      width: 1792,
-      height: 1024,
-      steps: 4,
-      CFGScale: 3.5,
-      numberResults: 1,
-      outputType: "base64Data",
-      outputFormat: "PNG",
-    },
-  ];
+  const task: Record<string, unknown> = {
+    taskType: "imageInference",
+    taskUUID: crypto.randomUUID(),
+    model: config.model,
+    positivePrompt: prompt,
+    width: 1792,
+    height: 1024,
+    steps: 4,
+    CFGScale: 3.5,
+    numberResults: 1,
+    outputType: "base64Data",
+    outputFormat: "PNG",
+  };
+
+  if (options?.seedImage) {
+    task.seedImage = options.seedImage;
+    task.strength = options.strength ?? DEFAULT_IMG2IMG_STRENGTH;
+  }
+
+  if (options?.referenceImages?.length) {
+    task.referenceImages = options.referenceImages.slice(0, MAX_REFERENCE_IMAGES);
+  }
 
   const res = await fetchWithRetry(url, {
     method: "POST",
@@ -45,7 +90,7 @@ export async function generateImage(
       "Content-Type": "application/json",
       Authorization: `Bearer ${config.apiKey}`,
     },
-    body: JSON.stringify(body),
+    body: JSON.stringify([task]),
   });
 
   const text = await res.text();
@@ -66,9 +111,64 @@ export async function generateImage(
 
   const b64 = json.data?.[0]?.imageBase64Data;
   if (!b64) {
-    throw new Error(
-      `No image in Runware response: ${text.slice(0, 300)}`,
-    );
+    throw new Error(`No image in Runware response: ${text.slice(0, 300)}`);
   }
   return b64;
 }
+
+// ──────────────────────────────────────────────────────────────────────
+// uploadImage — registers a base64 image on Runware and returns its
+// UUID, so subsequent generateImage calls can pass the UUID in
+// referenceImages / seedImage instead of resending the base64 payload
+// every time. Character base portraits and scene snapshots both flow
+// through this path.
+//
+// Runware exposes the imageUpload taskType for exactly this purpose.
+// Returns the UUID. Caller treats a thrown error as "fall back to
+// sending base64 next time" — non-fatal.
+// ──────────────────────────────────────────────────────────────────────
+
+export async function uploadImage(
+  config: ProviderConfig,
+  base64: string,
+): Promise<string> {
+  const url = config.baseUrl.replace(/\/$/, "");
+
+  const body = [
+    {
+      taskType: "imageUpload",
+      taskUUID: crypto.randomUUID(),
+      image: `data:image/png;base64,${base64}`,
+    },
+  ];
+
+  const res = await fetchWithRetry(url, {
+    method: "POST",
+    headers: {
+      "Content-Type": "application/json",
+      Authorization: `Bearer ${config.apiKey}`,
+    },
+    body: JSON.stringify(body),
+  });
+
+  const text = await res.text();
+  let json: RunwareResponse;
+  try {
+    json = JSON.parse(text) as RunwareResponse;
+  } catch {
+    throw new Error(`Image upload API error ${res.status}: ${text.slice(0, 500)}`);
+  }
+
+  if (json.errors?.length) {
+    const e = json.errors[0]!;
+    throw new Error(
+      `Runware upload error [${e.code ?? "unknown"}]: ${e.message ?? "no message"}`,
+    );
+  }
+
+  const uuid = json.data?.[0]?.imageUUID;
+  if (!uuid) {
+    throw new Error(`No UUID in upload response: ${text.slice(0, 300)}`);
+  }
+  return uuid;
+}
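The upload-once-then-reference strategy described in the comments above can be sketched in isolation. Everything here is hypothetical: `Uploader` stands in for `uploadImage(config, base64)`, and `makeRefResolver` is not a function in this commit (the commit stores the UUID on the Character record instead). The sketch assumes, as the comments state, that a failed upload is non-fatal and the caller falls back to sending base64.

```typescript
// Hypothetical sketch; `Uploader` stands in for uploadImage(config, base64).
type Uploader = (base64: string) => Promise<string>;

// Returns a resolver that yields the value to place in referenceImages:
// the Runware UUID when the upload worked, otherwise the raw base64.
function makeRefResolver(upload: Uploader) {
  const refs = new Map<string, string>();

  return async function refFor(name: string, base64: string): Promise<string> {
    const known = refs.get(name);
    if (known !== undefined) return known; // resolved earlier: reuse it

    let ref: string;
    try {
      ref = await upload(base64); // pay the upload cost once
    } catch {
      ref = base64; // non-fatal: keep sending base64 on later calls
    }
    refs.set(name, ref);
    return ref;
  };
}

// Usage with a fake uploader that counts its calls.
let uploadCalls = 0;
const refFor = makeRefResolver(async () => {
  uploadCalls += 1;
  return "uuid-1";
});
const demo = (async () => {
  const first = await refFor("npc", "AAAA");
  const second = await refFor("npc", "AAAA");
  return { first, second, uploadCalls };
})();
```

The trade-off is the one the diff's comments name: a UUID reference costs one upload up front, while the base64 fallback resends the full payload with every request.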
@@ -1,4 +1,5 @@
 export { chat } from "./chat";
-export { generateImage } from "./image";
+export { generateImage, uploadImage } from "./image";
+export type { GenerateImageOptions } from "./image";
 export { interpretClick } from "./vision";
 export type { ChatMessage } from "./chat";
@@ -0,0 +1,192 @@
+import { chat, generateImage, uploadImage } from "@yume/ai-client";
+import { provisionVoice } from "@yume/tts-client";
+import type {
+  Character,
+  CharacterVoice,
+  EngineConfig,
+  Session,
+} from "@yume/types";
+import { parseJsonLoose } from "../jsonParser";
+import { mockImageBase64 } from "../mockImage";
+import {
+  CHARACTER_DESIGNER_SYSTEM,
+  buildCharacterDesignerUserMessage,
+  buildCharacterPortraitPrompt,
+} from "../prompts";
+
+// ──────────────────────────────────────────────────────────────────────
+// CharacterDesigner agent — designs ONE new character end-to-end.
+//
+// Pipeline (per character, all the slow parts are parallelized):
+//
+//   1. LLM call — designs BOTH visual + voice cards in one shot
+//      (intentional: same agent thinks about who this character IS,
+//      which keeps appearance and vocal personality coherent)
+//
+//   2. In parallel:
+//      a. Image gen — base portrait from visualDescription + styleGuide
+//         then upload to Runware → get UUID for cheap re-reference
+//      b. Voice provisioning — Xiaomi MiMo voicedesign from voiceDescription
+//         → reference audio for later voiceclone synth
+//
+//   3. Returns merged Character ready to be added to session.characters
+//
+// Each step degrades gracefully — if image gen fails we return the
+// character without a portrait; if voice gen fails we return without
+// voice. The game keeps running even when sub-components fail.
+// ──────────────────────────────────────────────────────────────────────
+
+type CharacterDesignOutput = {
+  visualDescription?: string;
+  voiceDescription?: string;
+};
+
+// TEMP: per-phase timing for latency diagnosis. Same convention as the
+// orchestrator's tlog. Remove after we have data on real-world numbers.
+function tlog(label: string, t0: number): void {
+  console.log(`${label}: ${Date.now() - t0}ms`);
+}
+
+async function runDesignLLM(
+  config: EngineConfig,
+  session: Session,
+  charName: string,
+): Promise<CharacterDesignOutput> {
+  const raw = await chat(
+    config.text,
+    [
+      { role: "system", content: CHARACTER_DESIGNER_SYSTEM },
+      {
+        role: "user",
+        content: buildCharacterDesignerUserMessage(charName, session),
+      },
+    ],
+    { temperature: 0.7, responseFormat: "json_object" },
+  );
+  return parseJsonLoose<CharacterDesignOutput>(raw);
+}
+
+// Generate the per-character base portrait and upload it. The portrait is
+// a "concept sheet" — single character, neutral pose, plain background —
+// so it works well as a Runware referenceImages anchor for later scenes.
+//
+// Returns both the base64 (for client-side asset use, e.g., character
+// sprite entrance animations) and the Runware UUID (for cheap referencing
+// in subsequent Painter calls without resending the 100KB+ base64 each time).
+//
+// The upload step is best-effort: if it fails, we still return the base64
+// so the next scene can pass it as a referenceImages entry directly (just
+// pays the bandwidth cost each call instead of once).
+async function renderAndUploadPortrait(
+  config: EngineConfig,
+  charName: string,
+  visualDescription: string,
+  styleGuide: string,
+): Promise<{ basePortraitBase64?: string; basePortraitUuid?: string }> {
+  let base64: string;
+  try {
+    if (config.mockImage) {
+      base64 = await mockImageBase64();
+    } else {
+      const prompt = buildCharacterPortraitPrompt(
+        charName,
+        visualDescription,
+        styleGuide,
+      );
+      base64 = await generateImage(config.image, prompt);
+    }
+  } catch (err) {
+    const msg = err instanceof Error ? err.message : String(err);
+    console.error(`[characterDesigner] portrait gen failed for ${charName}: ${msg}`);
+    return {}; // no portrait at all — degrade gracefully
+  }
+
+  // Skip upload in mock mode — the mock image is the same static SVG every
+  // time and uploading it gives us a UUID that points to a useless asset.
+  if (config.mockImage) {
+    return { basePortraitBase64: base64 };
+  }
+
+  try {
+    const uuid = await uploadImage(config.image, base64);
+    return { basePortraitBase64: base64, basePortraitUuid: uuid };
+  } catch (err) {
+    const msg = err instanceof Error ? err.message : String(err);
+    console.warn(
+      `[characterDesigner] portrait upload failed for ${charName}: ${msg} — will pass base64 in subsequent calls`,
+    );
+    return { basePortraitBase64: base64 };
+  }
+}
+
+async function provisionVoiceSafe(
+  config: EngineConfig,
+  voiceDescription: string,
+  charName: string,
+): Promise<CharacterVoice | undefined> {
+  if (!config.tts) return undefined;
+  try {
+    return await provisionVoice(config.tts, voiceDescription);
+  } catch (err) {
+    const msg = err instanceof Error ? err.message : String(err);
+    console.error(`[characterDesigner] voice provision failed for ${charName}: ${msg}`);
+    return undefined;
+  }
+}
+
+// Single-character design pipeline. Called by the orchestrator once per
+// NEW character name; multiple characters in the same scene run their
+// pipelines in parallel at the orchestrator level.
+export async function designCharacter(
+  config: EngineConfig,
+  session: Session,
+  charName: string,
+): Promise<Character> {
+  const tTotal = Date.now();
+
+  // Step 1 — LLM design (visual + voice). Must complete first.
+  const tDesign = Date.now();
+  const design = await runDesignLLM(config, session, charName);
+  tlog(`[charDesigner ${charName}] design LLM`, tDesign);
+
+  const visualDescription = design.visualDescription?.trim();
+  const voiceDescription =
+    design.voiceDescription?.trim() ||
+    `请根据角色名「${charName}」推断其性别、年龄与气质,生成最贴合的音色。所属世界观:${session.worldSetting}`;
+
+  // Step 2 — parallel: portrait + voice provisioning.
+  const tProvision = Date.now();
+  const portraitPromise = visualDescription
+    ? renderAndUploadPortrait(config, charName, visualDescription, session.styleGuide)
+    : Promise.resolve({} as Awaited<ReturnType<typeof renderAndUploadPortrait>>);
+  const voicePromise = provisionVoiceSafe(config, voiceDescription, charName);
+
+  const [portrait, voice] = await Promise.all([portraitPromise, voicePromise]);
+  tlog(`[charDesigner ${charName}] portrait+voice parallel`, tProvision);
+
+  tlog(`[charDesigner ${charName}] TOTAL`, tTotal);
+
+  return {
+    name: charName,
+    voiceDescription,
+    visualDescription,
+    basePortraitBase64: portrait.basePortraitBase64,
+    basePortraitUuid: portrait.basePortraitUuid,
+    voice,
+  };
+}
+
+// Provision voice ONLY for an existing character that the LLM mentioned
+// without us having designed them yet (e.g., the Writer referenced a name
+// that wasn't in `activeCharacters` but appeared as a speaker). Used by
+// directInsertBeat path and as a safety net in directScene. No portrait
+// is generated for these — they get a name + voice only.
+export async function provisionVoiceForName(
+  config: EngineConfig,
+  session: Session,
+  charName: string,
+): Promise<Character> {
+  const voiceDescription = `请根据角色名「${charName}」推断其性别、年龄与气质,生成最贴合的音色。所属世界观:${session.worldSetting}`;
+  const voice = await provisionVoiceSafe(config, voiceDescription, charName);
+  return { name: charName, voiceDescription, voice };
+}
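The "parallel, but each branch degrades on its own" shape used by designCharacter can be shown in a few lines. This is a minimal sketch under stated assumptions: `orUndefined`, `designWithDegradation`, and `DesignedCharacter` are hypothetical names, and the two task callbacks stand in for the portrait and voice steps; the diff's own wrappers are renderAndUploadPortrait and provisionVoiceSafe.

```typescript
// Hypothetical sketch of the pattern; not code from this commit.

// Turn a task that may reject into one that resolves to `undefined`.
async function orUndefined<T>(label: string, task: () => Promise<T>): Promise<T | undefined> {
  try {
    return await task();
  } catch (err) {
    const msg = err instanceof Error ? err.message : String(err);
    console.warn(`[sketch] ${label} failed: ${msg}`);
    return undefined;
  }
}

type DesignedCharacter = {
  name: string;
  portrait: string | undefined;
  voice: string | undefined;
};

async function designWithDegradation(
  name: string,
  makePortrait: () => Promise<string>,
  makeVoice: () => Promise<string>,
): Promise<DesignedCharacter> {
  // Both tasks start immediately. Promise.all cannot reject here because
  // each branch already swallows its own error.
  const [portrait, voice] = await Promise.all([
    orUndefined("portrait", makePortrait),
    orUndefined("voice", makeVoice),
  ]);
  return { name, portrait, voice };
}

// Usage: the voice step fails, the character is still returned.
const degraded = designWithDegradation(
  "npc",
  async () => "portrait-base64",
  async () => {
    throw new Error("tts unavailable");
  },
);
```

Wrapping each branch before `Promise.all` matters: a bare `Promise.all` rejects as soon as one branch fails, which would discard the portrait whenever voice provisioning breaks.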
@@ -0,0 +1,86 @@
+import { chat } from "@yume/ai-client";
+import type { BeatActiveCharacter, ProviderConfig } from "@yume/types";
+import { parseJsonLoose } from "../jsonParser";
+import {
+  CINEMATOGRAPHER_SYSTEM,
+  buildCinematographerUserMessage,
+} from "../prompts";
+
+// ──────────────────────────────────────────────────────────────────────
+// Cinematographer agent — translates the Writer's narrative scene
+// summary into an English compositional prompt for FLUX.
+//
+// Reads:  sceneSummary + entry beat's activeCharacters (poses)
+//         + prior sceneKey (for continuity hints)
+// Writes: { shotType, integratedPrompt }
+//
+// Does NOT describe character APPEARANCE — that's appended at the
+// Painter stage from session.characters[].visualDescription. The
+// Cinematographer only positions named characters in the frame and
+// describes the environment + lighting + camera framing.
+//
+// This separation lets the Cinematographer run IN PARALLEL with the
+// CharacterDesigner — neither needs the other's output. They both
+// feed independently into the Painter prompt.
+// ──────────────────────────────────────────────────────────────────────
+
+export type CinematographerOutput = {
+  shotType: string;
+  integratedPrompt: string;
+};
+
+type RawCinematographerOutput = {
+  shotType?: string;
+  integratedPrompt?: string;
+};
+
+export type CinematographerInput = {
+  sceneSummary: string;
+  styleGuide: string;
+  entryBeatActive: BeatActiveCharacter[];
+  /** Entry beat's speaker — drives the dynamic camera policy:
+   *    NPC name  → NPC looks toward camera (close-up)
+   *    "你"      → medium shot, NPC listens
+   *    undefined → wide establishing shot */
+  entryBeatSpeaker?: string;
+  priorSceneKey?: string;
+  currentSceneKey?: string;
+};
+
+export async function runCinematographer(
+  config: ProviderConfig,
+  input: CinematographerInput,
+): Promise<CinematographerOutput> {
+  const raw = await chat(
+    config,
+    [
+      { role: "system", content: CINEMATOGRAPHER_SYSTEM },
+      {
+        role: "user",
+        content: buildCinematographerUserMessage(
+          input.sceneSummary,
+          input.styleGuide,
+          input.entryBeatActive,
+          input.entryBeatSpeaker,
+          input.priorSceneKey,
+          input.currentSceneKey,
+        ),
+      },
+    ],
+    { temperature: 0.6, responseFormat: "json_object" },
+  );
+
+  const parsed = parseJsonLoose<RawCinematographerOutput>(raw);
+
+  // Fallback: if the LLM produced nothing usable, synthesize a minimal
+  // integratedPrompt from the Writer's sceneSummary so the Painter has
+  // SOMETHING to work with rather than blowing up the whole pipeline.
+  const integratedPrompt =
+    parsed.integratedPrompt?.trim() ||
+    `A cinematic illustration depicting: ${input.sceneSummary}. Wide establishing shot, natural lighting, atmospheric mood.`;
+
+  return {
+    shotType: parsed.shotType?.trim() || "medium shot",
+    integratedPrompt,
+  };
+}
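The dynamic camera policy documented on `entryBeatSpeaker` has three cases, which can be restated as a pure function. This is only an illustration: in the commit the policy is carried by the Cinematographer prompt, not by code, and `cameraPolicyFor`, `CameraPolicy`, and `PLAYER_SPEAKER` are hypothetical names.

```typescript
// Hypothetical restatement of the documented policy; the commit implements
// it as prompt instructions to the LLM, not as a function.
type CameraPolicy = { shot: string; note: string };

const PLAYER_SPEAKER = "你"; // the player's speaker label in Pattern B

function cameraPolicyFor(entryBeatSpeaker?: string): CameraPolicy {
  if (!entryBeatSpeaker) {
    // Nobody speaks in the entry beat: show the place.
    return { shot: "wide establishing shot", note: "no speaker" };
  }
  if (entryBeatSpeaker === PLAYER_SPEAKER) {
    // The player speaks but must never be in frame: show the listener.
    return { shot: "medium shot", note: "NPC listens attentively" };
  }
  // An NPC speaks to the player: face the camera.
  return { shot: "close-up", note: `${entryBeatSpeaker} looks toward the camera` };
}

console.log(cameraPolicyFor().shot);
console.log(cameraPolicyFor("你").shot);
console.log(cameraPolicyFor("Innkeeper").shot);
```

The name "Innkeeper" is made up for the example; any speaker other than "你" takes the NPC branch.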
@@ -0,0 +1,145 @@
+import { generateImage } from "@yume/ai-client";
+import type { GenerateImageOptions } from "@yume/ai-client";
+import type {
+  Beat,
+  Character,
+  EngineConfig,
+  ProviderConfig,
+} from "@yume/types";
+import { mockImageBase64 } from "../mockImage";
+import { buildPainterPrompt } from "../prompts";
+
+// ──────────────────────────────────────────────────────────────────────
+// Painter — final image generation with multi-reference anchoring.
+//
+// FLUX.2 [klein] 9B KV does NOT support seedImage (img2img). Instead,
+// visual continuity comes entirely from `referenceImages` (capped at 4),
+// which the KV-optimized variant accelerates ~2.5× via key-value caching
+// of reference latents.
+//
+// References are slotted in priority order (max 4):
+//   1. Prior scene image — when sceneKey matched a previous scene, this
+//      anchors the same physical space (lighting/layout/style continuity)
+//   2. Entry beat's speaker portrait — the NPC the player is talking with
+//      (most visually prominent)
+//   3. Other on-stage NPCs' portraits — secondary characters in the frame
+//
+// Failure handling — two-tier degradation:
+//   A. referenceImages call (preferred — full visual anchoring)
+//   B. pure text-to-image fallback (last resort if Runware refs API errors)
+// ──────────────────────────────────────────────────────────────────────
+
+const MAX_REFERENCE_IMAGES = 4;
+
+export type PainterInput = {
+  integratedPrompt: string;
+  styleGuide: string;
+  onStageCharacters: Character[];
+  /**
+   * Prior scene's Runware UUID or base64. When set (= sceneKey hit a
+   * prior scene), it slots into referenceImages[0] for spatial continuity.
+   * Capacity-wise this displaces ONE character portrait — slot is shared
+   * with character refs, capped at 4 total per Runware spec.
+   */
+  priorSceneImage?: string;
+};
+
+// Pick the references we send to Runware as `referenceImages`. Priority:
+//   slot 0:  priorSceneImage (if any — sceneKey continuity)
+//   slot 1:  entry beat's speaker portrait (the NPC speaking to the player)
+//   slot 2+: other on-stage NPCs from entry beat's activeCharacters
+// Caps at 4 total. Returns the array exactly as it'll be sent — already
+// truncated, already deduplicated.
+export function collectReferenceImages(
+  characters: Character[],
+  entryBeat: Beat | undefined,
+  priorSceneImage: string | undefined,
+): string[] {
+  const refs: string[] = [];
+  const seen = new Set<string>();
+
+  // Slot 0 — prior scene image for spatial continuity. Goes first because
+  // backdrop drift is the most jarring discontinuity across same-sceneKey
+  // scenes; character drift is partially masked by character archetype text
+  // in the prompt anyway.
+  if (priorSceneImage) {
+    refs.push(priorSceneImage);
+  }
+
+  // Slot 1+ — character portraits, speaker-first.
+  const speakerName = entryBeat?.speaker;
+  if (speakerName) {
+    const speaker = characters.find((c) => c.name === speakerName);
+    const ref = speaker?.basePortraitUuid ?? speaker?.basePortraitBase64;
+    if (ref && refs.length < MAX_REFERENCE_IMAGES) {
+      refs.push(ref);
+      seen.add(speakerName);
+    }
+  }
+
+  for (const c of entryBeat?.activeCharacters ?? []) {
+    if (refs.length >= MAX_REFERENCE_IMAGES) break;
+    if (seen.has(c.name)) continue;
+    const char = characters.find((x) => x.name === c.name);
+    const ref = char?.basePortraitUuid ?? char?.basePortraitBase64;
+    if (ref) {
+      refs.push(ref);
+      seen.add(c.name);
+    }
+  }
+
+  return refs.slice(0, MAX_REFERENCE_IMAGES);
+}
+
+async function tryGenerate(
+  config: ProviderConfig,
+  prompt: string,
+  options: GenerateImageOptions,
+  label: string,
+): Promise<string | null> {
+  try {
+    return await generateImage(config, prompt, options);
+  } catch (err) {
+    const msg = err instanceof Error ? err.message : String(err);
+    console.warn(`[painter] ${label} failed: ${msg}`);
+    return null;
+  }
+}
+
export async function runPainter(
|
||||
config: EngineConfig,
|
||||
input: PainterInput,
|
||||
entryBeat: Beat | undefined,
|
||||
): Promise<string> {
|
||||
if (config.mockImage) return mockImageBase64();
|
||||
|
||||
const prompt = buildPainterPrompt(
|
||||
input.integratedPrompt,
|
||||
input.styleGuide,
|
||||
input.onStageCharacters,
|
||||
);
|
||||
|
||||
const refs = collectReferenceImages(
|
||||
input.onStageCharacters,
|
||||
entryBeat,
|
||||
input.priorSceneImage,
|
||||
);
|
||||
|
||||
// Tier A — with referenceImages (priorSceneImage + character portraits).
|
||||
// FLUX.2 [klein] 9B KV's KV cache accelerates this multi-reference path
|
||||
// ~2.5× compared to the non-KV variant.
|
||||
if (refs.length > 0) {
|
||||
const r = await tryGenerate(
|
||||
config.image,
|
||||
prompt,
|
||||
{ referenceImages: refs },
|
||||
`referenceImages (${refs.length})`,
|
||||
);
|
||||
if (r) return r;
|
||||
}
|
||||
|
||||
// Tier B — pure text-to-image. Last resort, used when Tier A failed OR
|
||||
// there are no references to send (first scene with no characters yet).
|
||||
// Errors here propagate to the caller.
|
||||
return generateImage(config.image, prompt);
|
||||
}
|
||||
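The two-tier degradation in `runPainter` can be exercised without a real image provider. The sketch below is illustrative only: `paintWithFallback`, the `Generate` callback type, and the `flaky` generator are hypothetical stand-ins introduced here, not part of `@yume/ai-client`. It mirrors the control flow above: try the reference-image path, log and swallow a failure, then fall through to plain text-to-image, whose errors reach the caller.

```typescript
// Hypothetical stand-in for generateImage: resolves to base64 image data.
type Generate = (prompt: string, referenceImages?: string[]) => Promise<string>;

async function paintWithFallback(
  generate: Generate,
  prompt: string,
  refs: string[],
): Promise<string> {
  // Tier A: only attempted when there is at least one reference image.
  if (refs.length > 0) {
    try {
      return await generate(prompt, refs);
    } catch (err) {
      const msg = err instanceof Error ? err.message : String(err);
      console.warn(`[sketch] referenceImages (${refs.length}) failed: ${msg}`);
    }
  }
  // Tier B: pure text-to-image; a rejection here is not caught.
  return generate(prompt);
}

// A fake generator that rejects whenever references are supplied, so the
// fallback branch is the one that produces the result.
const flaky: Generate = async (prompt, referenceImages) => {
  if (referenceImages && referenceImages.length > 0) {
    throw new Error("reference path unavailable");
  }
  return `t2i:${prompt}`;
};
```

Calling `paintWithFallback(flaky, "harbor at dusk", ["uuid-1"])` logs one warning and still resolves, because only the last tier is allowed to fail loudly.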
@@ -0,0 +1,386 @@
import { chat } from "@yume/ai-client";
import type {
  Beat,
  BeatActiveCharacter,
  BeatChoice,
  BeatChoiceEffect,
  BeatNext,
  ProviderConfig,
  Session,
} from "@yume/types";
import { parseJsonLoose } from "../jsonParser";
import { WRITER_SYSTEM, buildWriterUserMessage } from "../prompts";

// ──────────────────────────────────────────────────────────────────────
// Writer agent — owns the narrative half of scene generation.
//
// Output: { sceneSummary, sceneKey, entryBeatId, beats[] }
// Each beat carries activeCharacters[] (names + poses) the
// Cinematographer reads when composing the establishing shot.
//
// Character DESIGN (visual + voice) is NOT this agent's job —
// it only names characters; the CharacterDesigner picks up any
// unknown name from beats[].activeCharacters.
// ──────────────────────────────────────────────────────────────────────

export type WriterOutput = {
  sceneSummary: string;
  sceneKey?: string;
  entryBeatId: string;
  beats: Beat[];
};

// Raw shapes — what the LLM produces before validation / coercion.
type RawActiveCharacter = {
  name?: string;
  pose?: string;
};
type RawEffect = {
  kind?: string;
  targetBeatId?: string;
  nextSceneSeed?: string;
};
type RawChoice = {
  id?: string;
  label?: string;
  effect?: RawEffect;
};
type RawNext = {
  type?: string;
  nextBeatId?: string;
  choices?: RawChoice[];
};
type RawBeat = {
  id?: string;
  narration?: string;
  speaker?: string;
  line?: string;
  lineDelivery?: string;
  activeCharacters?: RawActiveCharacter[];
  next?: RawNext;
};
type RawScene = {
  sceneSummary?: string;
  sceneKey?: string;
  entryBeatId?: string;
  beats?: RawBeat[];
};

// ──────────────────────────────────────────────────────────────────────
// POV (player viewpoint) handling — Pattern B (galgame standard):
//  - speaker = "你" → ALLOWED (renders as dialog box, never TTS'd)
//  - any other POV term → normalized to "你" (LLM slip-up safety net)
//  - activeCharacters → POV is NEVER allowed (player has no body in-scene)
//  - CharacterDesigner → never invoked for "你" or POV variants
// ──────────────────────────────────────────────────────────────────────

const POV_DISPLAY_NAME = "你";
const POV_VARIANTS = new Set([
  "玩家",
  "我",
  "主角",
  "protagonist",
  "Protagonist",
  "player",
  "Player",
  "PLAYER",
  "MC",
  "mc",
  "Mc",
  "I",
  "i",
  "me",
  "Me",
  "ME",
]);

function isPovName(name: string): boolean {
  return name === POV_DISPLAY_NAME || POV_VARIANTS.has(name);
}

// Normalize a speaker name: any POV variant collapses to "你"; an NPC name
// passes through unchanged. Caller passes already-trimmed input.
function normalizeSpeakerName(name: string): string {
  return POV_VARIANTS.has(name) ? POV_DISPLAY_NAME : name;
}
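The POV rules above treat the two fields differently: a POV speaker is renamed, while a POV entry in `activeCharacters` is dropped. A minimal sketch of that asymmetry follows. The names `POV_NAME`, `POV_ALIASES`, `normalizeSpeaker`, and `isPov` are local copies made for illustration, the alias set is deliberately shorter than the real one, and "Aiko" is a made-up NPC name.

```typescript
// Local, shortened copies of the writer's POV tables (illustration only).
const POV_NAME = "你";
const POV_ALIASES = new Set(["玩家", "我", "主角", "player", "MC", "I"]);

// Speaker rule: any POV alias collapses to the display name.
function normalizeSpeaker(name: string): string {
  return POV_ALIASES.has(name) ? POV_NAME : name;
}

// On-stage rule: the display name and every alias count as POV.
function isPov(name: string): boolean {
  return name === POV_NAME || POV_ALIASES.has(name);
}

// Speakers: aliases become "你"; NPC names pass through.
const speakers = ["玩家", "Aiko", "MC"].map(normalizeSpeaker);

// activeCharacters: POV entries are removed, never renamed.
const onStage = ["你", "Aiko", "player"].filter((n) => !isPov(n));
```

Here `speakers` ends up as `["你", "Aiko", "你"]` and `onStage` as `["Aiko"]`. The lookup is an exact, case-sensitive `Set` match, which is why the real table lists each casing separately and why callers are expected to trim first.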

function coerceEffect(raw: RawEffect | undefined): BeatChoiceEffect {
  if (raw?.kind === "advance-beat" && raw.targetBeatId?.trim()) {
    return { kind: "advance-beat", targetBeatId: raw.targetBeatId.trim() };
  }
  return {
    kind: "change-scene",
    nextSceneSeed: raw?.nextSceneSeed?.trim() || "未指定",
  };
}

function coerceChoice(raw: RawChoice, idx: number): BeatChoice {
  return {
    id: raw.id?.trim() || `c${idx + 1}`,
    label: raw.label?.trim() || `选项 ${idx + 1}`,
    effect: coerceEffect(raw.effect),
  };
}

function coerceNext(raw: RawNext | undefined, fallbackBeatId: string): BeatNext {
  if (raw?.type === "choice" && Array.isArray(raw.choices) && raw.choices.length) {
    return {
      type: "choice",
      choices: raw.choices.map((c, i) => coerceChoice(c, i)),
    };
  }
  return {
    type: "continue",
    nextBeatId: raw?.nextBeatId?.trim() || fallbackBeatId,
  };
}

function coerceActiveCharacters(
  raw: RawActiveCharacter[] | undefined,
): BeatActiveCharacter[] | undefined {
  if (!Array.isArray(raw)) return undefined;
  const out = raw
    .map((c): BeatActiveCharacter | null => {
      const name = c.name?.trim();
      if (!name) return null;
      // POV is never IN the picture — strip the LLM's slip-up silently so
      // CharacterDesigner doesn't end up generating a portrait for the player.
      if (isPovName(name)) return null;
      const pose = c.pose?.trim();
      return pose ? { name, pose } : { name };
    })
    .filter((c): c is BeatActiveCharacter => Boolean(c));
  return out.length > 0 ? out : undefined;
}

function coerceBeat(raw: RawBeat, idx: number, totalBeats: number): Beat {
  const id = raw.id?.trim() || `b${idx + 1}`;
  // Non-last beats default their `continue` target to the following beat.
  // The last beat gets an empty fallback on purpose: repairBeats() turns a
  // last/dangling continue into a real scene-change exit so the player can
  // never get stuck self-looping on it.
  const fallback = idx + 1 < totalBeats ? `b${idx + 2}` : "";

  const rawSpeaker = raw.speaker?.trim() || undefined;
  // Normalize any POV variant (玩家/我/主角/protagonist/...) to "你".
  // NPC names pass through unchanged. This means the LLM can slip and
  // write "玩家" or "I" and we still render the dialog box correctly with
  // speaker="你" — and TTS is automatically skipped because no Character
  // record exists for "你".
  const speaker = rawSpeaker ? normalizeSpeakerName(rawSpeaker) : undefined;

  const line = raw.line?.trim() || undefined;
  return {
    id,
    narration: raw.narration?.trim() || undefined,
    speaker,
    line,
    // lineDelivery is meaningful only for NPC speakers (TTS). For POV
    // speaker ("你") TTS is skipped, so lineDelivery would never be used.
    lineDelivery:
      line && speaker !== POV_DISPLAY_NAME
        ? raw.lineDelivery?.trim() || undefined
        : undefined,
    activeCharacters: coerceActiveCharacters(raw.activeCharacters),
    next: coerceNext(raw.next, fallback),
  };
}

const FALLBACK_SEED = "故事继续推进";

function fallbackExitChoice(beatId: string): BeatChoice {
  return {
    id: `${beatId}__exit`,
    label: "继续",
    effect: { kind: "change-scene", nextSceneSeed: FALLBACK_SEED },
  };
}

// Beat ids are graph keys (the front-end's `beats.find(b => b.id === ...)`,
// the session's `visitedBeatIds`, and `continue`/`advance-beat` targets). If
// the model reuses an id across beats, the second occurrence becomes silently
// unreachable and external references collapse to the first beat. Rename
// duplicates; rewrite the renamed beat's OWN self-references. External
// references stay pointing at the first occurrence.
function ensureUniqueBeatIds(beats: Beat[]): Beat[] {
  const seen = new Set<string>();
  return beats.map((b): Beat => {
    if (!seen.has(b.id)) {
      seen.add(b.id);
      return b;
    }
    const oldId = b.id;
    let n = 2;
    while (seen.has(`${oldId}_${n}`)) n += 1;
    const newId = `${oldId}_${n}`;
    seen.add(newId);

    let next = b.next;
    if (next.type === "continue" && next.nextBeatId === oldId) {
      next = { type: "continue", nextBeatId: newId };
    } else if (next.type === "choice") {
      next = {
        type: "choice",
        choices: next.choices.map((c) =>
          c.effect.kind === "advance-beat" && c.effect.targetBeatId === oldId
            ? {
                ...c,
                effect: { kind: "advance-beat" as const, targetBeatId: newId },
              }
            : c,
        ),
      };
    }
    return { ...b, id: newId, next };
  });
}

// Repairs referential integrity AND guarantees the scene is escapable:
//  - a `continue` to a missing/self id is repointed to the next beat in order;
//    a last/dangling continue with nowhere to go becomes a scene-change exit
//  - an `advance-beat` to a missing id is downgraded to a scene change
//  - if no change-scene exit exists anywhere, one is appended to the last beat
function repairBeats(beats: Beat[]): Beat[] {
  const ids = new Set(beats.map((b) => b.id));

  const fixed: Beat[] = beats.map((b, idx): Beat => {
    if (b.next.type === "continue") {
      const target = b.next.nextBeatId;
      if (ids.has(target) && target !== b.id) return b;
      const nextByIndex = beats[idx + 1]?.id;
      if (nextByIndex) {
        return { ...b, next: { type: "continue", nextBeatId: nextByIndex } };
      }
      return { ...b, next: { type: "choice", choices: [fallbackExitChoice(b.id)] } };
    }

    const patched = b.next.choices.map((c) =>
      c.effect.kind === "advance-beat" && !ids.has(c.effect.targetBeatId)
        ? {
            ...c,
            effect: {
              kind: "change-scene" as const,
              nextSceneSeed: "未指定(导演引用不存在的 beat,已降级为换场)",
            },
          }
        : c,
    );
    return { ...b, next: { type: "choice", choices: patched } };
  });

  const hasExit = fixed.some(
    (b) =>
      b.next.type === "choice" &&
      b.next.choices.some((c) => c.effect.kind === "change-scene"),
  );
  if (!hasExit && fixed.length > 0) {
    const lastIdx = fixed.length - 1;
    const last = fixed[lastIdx]!;
    const existing = last.next.type === "choice" ? last.next.choices : [];
    fixed[lastIdx] = {
      ...last,
      next: { type: "choice", choices: [...existing, fallbackExitChoice(last.id)] },
    };
  }

  return fixed;
}

// Choice ids are keys the front-end uses to cache + consume prefetched
// scenes. Two beats both defaulting to c1/c2 would make a transition reuse
// the WRONG prefetched scene — so force every choice id to be unique within
// the scene.
function ensureUniqueChoiceIds(beats: Beat[]): Beat[] {
  const seen = new Set<string>();
  for (const b of beats) {
    if (b.next.type !== "choice") continue;
    for (const c of b.next.choices) {
      if (seen.has(c.id)) {
        let n = 2;
        while (seen.has(`${c.id}_${n}`)) n += 1;
        c.id = `${c.id}_${n}`;
      }
      seen.add(c.id);
    }
  }
  return beats;
}
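The suffixing scheme shared by `ensureUniqueBeatIds` and `ensureUniqueChoiceIds` is easier to see on bare strings. The helper below, `dedupeIds`, is a hypothetical simplification written for this note: it applies the same "first occurrence wins, later ones get `_2`, `_3`, ..." rule but ignores the `next` rewiring that the beat version also performs.

```typescript
// Simplified: works on id strings only, not on Beat or BeatChoice objects.
function dedupeIds(ids: string[]): string[] {
  const seen = new Set<string>();
  return ids.map((id) => {
    let unique = id;
    if (seen.has(unique)) {
      // Probe id_2, id_3, ... until a suffix not used so far is found.
      let n = 2;
      while (seen.has(`${id}_${n}`)) n += 1;
      unique = `${id}_${n}`;
    }
    seen.add(unique);
    return unique;
  });
}

const deduped = dedupeIds(["c1", "c2", "c1", "c1", "c1_2"]);
```

With this input `deduped` is `["c1", "c2", "c1_2", "c1_3", "c1_2_2"]`. The last entry shows a side effect of checking only the ids seen so far: a model-supplied id that happens to equal an already generated suffix is itself renamed.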

// Normalize sceneKey to a safe lowercase-with-dashes English slug. If the
// model returns something weird (Chinese text / spaces / mixed case), make a
// best-effort fix; if it ends up empty, return undefined (the scene just
// won't be considered for cross-scene image reuse).
function normalizeSceneKey(raw: string | undefined): string | undefined {
  if (!raw) return undefined;
  const slug = raw
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9-]+/g, "-")
    .replace(/-+/g, "-")
    .replace(/^-|-$/g, "");
  return slug.length > 0 ? slug : undefined;
}
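A few sample inputs make the slug rules concrete. The function below is a local copy of the normalizer, renamed `slugifySceneKey` to mark it as a sketch; the sample location names are invented for illustration.

```typescript
// Local copy of the slug rules, for illustration only.
function slugifySceneKey(raw: string | undefined): string | undefined {
  if (!raw) return undefined;
  const slug = raw
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9-]+/g, "-") // runs outside a-z, 0-9, "-" become one "-"
    .replace(/-+/g, "-") // collapse repeated dashes
    .replace(/^-|-$/g, ""); // drop a leading and a trailing dash
  return slug.length > 0 ? slug : undefined;
}

const keys = [
  slugifySceneKey("  Old Library  "),
  slugifySceneKey("Cafe_2F (Night)"),
  slugifySceneKey("rooftop--at---dawn"),
  slugifySceneKey("教室"),
];
```

The first three normalize to `"old-library"`, `"cafe-2f-night"`, and `"rooftop-at-dawn"`. The last one contains no ASCII letters or digits, so it collapses to an empty slug and comes back as `undefined`, which opts that scene out of reuse. Matching against prior scenes is by strict equality, so `"old-library"` and `"library"` count as different locations.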

export async function runWriter(
  config: ProviderConfig,
  session: Session,
): Promise<WriterOutput> {
  const raw = await chat(
    config,
    [
      { role: "system", content: WRITER_SYSTEM },
      { role: "user", content: buildWriterUserMessage(session) },
    ],
    { temperature: 0.9, responseFormat: "json_object" },
  );

  const parsed = parseJsonLoose<RawScene>(raw);
  const rawBeats = Array.isArray(parsed.beats) ? parsed.beats : [];
  if (rawBeats.length === 0) {
    throw new Error("Writer returned no beats");
  }

  const beats = ensureUniqueChoiceIds(
    repairBeats(
      ensureUniqueBeatIds(
        rawBeats.map((b, i) => coerceBeat(b, i, rawBeats.length)),
      ),
    ),
  );

  const declaredEntry = parsed.entryBeatId?.trim();
  const entryBeatId =
    declaredEntry && beats.some((b) => b.id === declaredEntry)
      ? declaredEntry
      : beats[0]!.id;

  return {
    sceneSummary: parsed.sceneSummary?.trim() || "未指定场景概要",
    sceneKey: normalizeSceneKey(parsed.sceneKey),
    entryBeatId,
    beats,
  };
}

// Surface the set of character names introduced by this scene's beats,
// so the orchestrator can decide which ones need the CharacterDesigner to
// fire. Pulls names from both `speaker` fields AND `activeCharacters`
// (a character can be on-screen without speaking).
//
// Excludes POV ("你" / 玩家 / 主角 / ...) entirely — the player is never
// designed (no portrait, no voice, no archetype).
export function collectActiveCharacterNames(beats: Beat[]): string[] {
  const seen = new Set<string>();
  for (const b of beats) {
    if (b.speaker && !isPovName(b.speaker)) seen.add(b.speaker);
    if (b.activeCharacters) {
      for (const c of b.activeCharacters) {
        if (!isPovName(c.name)) seen.add(c.name);
      }
    }
  }
  return Array.from(seen);
}

// Re-export POV constants for downstream filters (director's orphanSpeakers).
export { POV_DISPLAY_NAME, POV_VARIANTS, isPovName, normalizeSpeakerName };
+267
-278
@@ -1,309 +1,294 @@
import { chat } from "@yume/ai-client";
import { chat, uploadImage } from "@yume/ai-client";
import type {
  Beat,
  BeatChoice,
  BeatChoiceEffect,
  BeatNext,
  Character,
  EngineConfig,
  InsertBeatPartial,
  ProviderConfig,
  Scene,
  Session,
} from "@yume/types";
import { parseJsonLoose } from "./jsonParser";
import { designCharacter, provisionVoiceForName } from "./agents/characterDesigner";
import { runCinematographer } from "./agents/cinematographer";
import { runPainter } from "./agents/painter";
import {
  DIRECTOR_SYSTEM,
  INSERT_BEAT_SYSTEM,
  buildDirectorUserMessage,
  buildInsertBeatUserMessage,
} from "./prompts";
  collectActiveCharacterNames,
  isPovName,
  normalizeSpeakerName,
  POV_DISPLAY_NAME,
  runWriter,
} from "./agents/writer";
import { parseJsonLoose } from "./jsonParser";
import { INSERT_BEAT_SYSTEM, buildInsertBeatUserMessage } from "./prompts";

// ──────────────────────────────────────────────────────────────────────
// Raw shape produced by the model — we coerce + validate into a Scene.
// ──────────────────────────────────────────────────────────────────────

type RawEffect = {
  kind?: string;
  targetBeatId?: string;
  nextSceneSeed?: string;
};

type RawChoice = {
  id?: string;
  label?: string;
  effect?: RawEffect;
};

type RawNext = {
  type?: string;
  nextBeatId?: string;
  choices?: RawChoice[];
};

type RawBeat = {
  id?: string;
  narration?: string;
  speaker?: string;
  line?: string;
  lineDelivery?: string;
  next?: RawNext;
};

type RawCharacterUpdate = {
  name?: string;
  description?: string;
};

type RawScene = {
  scenePrompt?: string;
  entryBeatId?: string;
  beats?: RawBeat[];
  characterUpdates?: RawCharacterUpdate[];
};

function coerceEffect(raw: RawEffect | undefined): BeatChoiceEffect {
  if (raw?.kind === "advance-beat" && raw.targetBeatId?.trim()) {
    return { kind: "advance-beat", targetBeatId: raw.targetBeatId.trim() };
  }
  return {
    kind: "change-scene",
    nextSceneSeed: raw?.nextSceneSeed?.trim() || "未指定",
  };
}

function coerceChoice(raw: RawChoice, idx: number): BeatChoice {
  return {
    id: raw.id?.trim() || `c${idx + 1}`,
    label: raw.label?.trim() || `选项 ${idx + 1}`,
    effect: coerceEffect(raw.effect),
  };
}

function coerceNext(raw: RawNext | undefined, fallbackBeatId: string): BeatNext {
  if (raw?.type === "choice" && Array.isArray(raw.choices) && raw.choices.length) {
    return {
      type: "choice",
      choices: raw.choices.map((c, i) => coerceChoice(c, i)),
    };
  }
  return {
    type: "continue",
    nextBeatId: raw?.nextBeatId?.trim() || fallbackBeatId,
  };
}

function coerceBeat(raw: RawBeat, idx: number, totalBeats: number): Beat {
  const id = raw.id?.trim() || `b${idx + 1}`;
  // Non-last beats default their `continue` target to the following beat.
  // The last beat gets an empty fallback on purpose: repairBeats() turns a
  // last/dangling continue into a real scene-change exit so the player can
  // never get stuck self-looping on it.
  const fallback = idx + 1 < totalBeats ? `b${idx + 2}` : "";
  const line = raw.line?.trim() || undefined;
  return {
    id,
    narration: raw.narration?.trim() || undefined,
    speaker: raw.speaker?.trim() || undefined,
    line,
    // lineDelivery only meaningful when there is a line to deliver.
    lineDelivery: line ? raw.lineDelivery?.trim() || undefined : undefined,
    next: coerceNext(raw.next, fallback),
  };
}

function coerceCharacterUpdates(raw: RawCharacterUpdate[] | undefined): Character[] {
  if (!Array.isArray(raw)) return [];
  return raw
    .map((c) => ({
      name: c.name?.trim() ?? "",
      description: c.description?.trim() ?? "",
    }))
    .filter((c) => c.name && c.description);
}

const FALLBACK_SEED = "故事继续推进";

function fallbackExitChoice(beatId: string): BeatChoice {
  return {
    id: `${beatId}__exit`,
    label: "继续",
    effect: { kind: "change-scene", nextSceneSeed: FALLBACK_SEED },
  };
}

// Beat ids are graph keys (the front-end's `beats.find(b => b.id === ...)`,
// the session's `visitedBeatIds`, and `continue`/`advance-beat` targets). If
// the model reuses an id across beats, the second occurrence becomes silently
// unreachable and external references collapse to the first beat. Rename
// duplicates; rewrite the renamed beat's OWN self-references (the most
// natural interpretation of a duplicate id being referenced from inside that
// same beat). External references stay pointing at the first occurrence.
function ensureUniqueBeatIds(beats: Beat[]): Beat[] {
  const seen = new Set<string>();
  return beats.map((b): Beat => {
    if (!seen.has(b.id)) {
      seen.add(b.id);
      return b;
    }
    const oldId = b.id;
    let n = 2;
    while (seen.has(`${oldId}_${n}`)) n += 1;
    const newId = `${oldId}_${n}`;
    seen.add(newId);

    let next = b.next;
    if (next.type === "continue" && next.nextBeatId === oldId) {
      next = { type: "continue", nextBeatId: newId };
    } else if (next.type === "choice") {
      next = {
        type: "choice",
        choices: next.choices.map((c) =>
          c.effect.kind === "advance-beat" && c.effect.targetBeatId === oldId
            ? {
                ...c,
                effect: { kind: "advance-beat" as const, targetBeatId: newId },
              }
            : c,
        ),
      };
    }
    return { ...b, id: newId, next };
  });
}

// Repairs referential integrity AND guarantees the scene is escapable:
//  - a `continue` to a missing/self id is repointed to the next beat in order;
//    a last/dangling continue with nowhere to go becomes a scene-change exit
//    (never a self-loop, which would strand the player on "click to advance")
//  - an `advance-beat` to a missing id is downgraded to a scene change
//  - if no change-scene exit exists anywhere, one is appended to the last beat
function repairBeats(beats: Beat[]): Beat[] {
  const ids = new Set(beats.map((b) => b.id));

  const fixed: Beat[] = beats.map((b, idx): Beat => {
    if (b.next.type === "continue") {
      const target = b.next.nextBeatId;
      if (ids.has(target) && target !== b.id) return b;
      const nextByIndex = beats[idx + 1]?.id;
      if (nextByIndex) {
        return { ...b, next: { type: "continue", nextBeatId: nextByIndex } };
      }
      return { ...b, next: { type: "choice", choices: [fallbackExitChoice(b.id)] } };
    }

    const patched = b.next.choices.map((c) =>
      c.effect.kind === "advance-beat" && !ids.has(c.effect.targetBeatId)
        ? {
            ...c,
            effect: {
              kind: "change-scene" as const,
              nextSceneSeed: "未指定(导演引用不存在的 beat,已降级为换场)",
            },
          }
        : c,
    );
    return { ...b, next: { type: "choice", choices: patched } };
  });

  const hasExit = fixed.some(
    (b) =>
      b.next.type === "choice" &&
      b.next.choices.some((c) => c.effect.kind === "change-scene"),
  );
  if (!hasExit && fixed.length > 0) {
    const lastIdx = fixed.length - 1;
    const last = fixed[lastIdx]!;
    const existing = last.next.type === "choice" ? last.next.choices : [];
    fixed[lastIdx] = {
      ...last,
      next: { type: "choice", choices: [...existing, fallbackExitChoice(last.id)] },
    };
  }

  return fixed;
}

// Choice ids are the keys the front-end uses to cache and consume prefetched
// scenes. Two beats both defaulting to c1/c2 (or the model reusing ids across
// beats) would make a transition reuse the WRONG prefetched scene — so force
// every choice id to be unique within the scene.
function ensureUniqueChoiceIds(beats: Beat[]): Beat[] {
  const seen = new Set<string>();
  for (const b of beats) {
    if (b.next.type !== "choice") continue;
    for (const c of b.next.choices) {
      if (seen.has(c.id)) {
        let n = 2;
        while (seen.has(`${c.id}_${n}`)) n += 1;
        c.id = `${c.id}_${n}`;
      }
      seen.add(c.id);
    }
  }
  return beats;
}
// ══════════════════════════════════════════════════════════════════════
// director.ts — multi-agent orchestrator for one full Scene generation.
//
// Critical path (per Scene call):
//
//   Writer LLM (~3s, serial)
//     │
//     ├─ CharacterDesigner LLM × N   (parallel per new char)
//     │     │
//     │     ├─ portrait gen + upload (parallel within agent)
//     │     └─ voice provisioning    (parallel within agent)
//     │
//     ├─ Cinematographer LLM         (parallel with all of the above)
//     │
//     └─ wait for all parallel branches
//             │
//             ▼
//   Painter (FLUX referenceImages — two-tier degradation chain)
//             │
//             ▼
//   upload final scene image → Scene.imageUuid
//             │
//             ▼
//   return { scene, sceneImageBase64, characters }
//
// The Cinematographer intentionally does NOT depend on CharacterDesigner
// output — it only positions named characters in the frame, not their
// appearance. This unlocks the parallelism that makes the full pipeline
// ~9-12s instead of ~15-18s serial.
// ══════════════════════════════════════════════════════════════════════

function newSceneId(): string {
  return `scene_${Date.now()}_${Math.random().toString(36).slice(2, 6)}`;
}

// ──────────────────────────────────────────────────────────────────────
// directScene — generates one Scene (multi-beat) for the player.
// Called both on real scene transitions AND on speculative prefetch.
// ──────────────────────────────────────────────────────────────────────
function tlog(label: string, t0: number): void {
  console.log(`${label}: ${Date.now() - t0}ms`);
}

// Merge a freshly-designed Character into a registry, preserving any
// previously-set voice/portrait that the new design didn't fill in (so
// re-designing a known character can't silently drop their voice or wipe
// out an already-generated portrait UUID). Match by name.
export function mergeCharacters(
  existing: Character[],
  updates: Character[],
): Character[] {
  if (updates.length === 0) return existing;
  const byName = new Map(existing.map((c) => [c.name, c]));
  for (const u of updates) {
    const prev = byName.get(u.name);
    if (!prev) {
      byName.set(u.name, u);
      continue;
    }
    // Preserve any prior provisioned resource that the new design omitted.
    byName.set(u.name, {
      ...u,
      voice: u.voice ?? prev.voice,
      visualDescription: u.visualDescription ?? prev.visualDescription,
      basePortraitBase64: u.basePortraitBase64 ?? prev.basePortraitBase64,
      basePortraitUuid: u.basePortraitUuid ?? prev.basePortraitUuid,
      voiceDescription: u.voiceDescription || prev.voiceDescription,
    });
  }
  return Array.from(byName.values());
}
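A small worked case shows what "preserving provisioned resources" means in practice. The sketch below repeats the merge logic against a local stand-in for `Character`; the real type lives in `@yume/types` and may carry more fields, and `voice` is simplified to a string here. All sample values ("Aiko", "Ren", the ids) are invented.

```typescript
// Local stand-in for Character: only the fields the merge touches.
type Character = {
  name: string;
  voiceDescription: string;
  voice?: string; // simplified; the real field may be a richer object
  visualDescription?: string;
  basePortraitBase64?: string;
  basePortraitUuid?: string;
};

function mergeCharacters(existing: Character[], updates: Character[]): Character[] {
  if (updates.length === 0) return existing;
  const byName = new Map(existing.map((c) => [c.name, c]));
  for (const u of updates) {
    const prev = byName.get(u.name);
    if (!prev) {
      byName.set(u.name, u);
      continue;
    }
    // Fields missing from the update fall back to the previous record.
    byName.set(u.name, {
      ...u,
      voice: u.voice ?? prev.voice,
      visualDescription: u.visualDescription ?? prev.visualDescription,
      basePortraitBase64: u.basePortraitBase64 ?? prev.basePortraitBase64,
      basePortraitUuid: u.basePortraitUuid ?? prev.basePortraitUuid,
      voiceDescription: u.voiceDescription || prev.voiceDescription,
    });
  }
  return Array.from(byName.values());
}

const registry: Character[] = [
  { name: "Aiko", voiceDescription: "calm alto", voice: "voice-001", basePortraitUuid: "portrait-uuid-1" },
];
const redesign: Character[] = [
  { name: "Aiko", voiceDescription: "", visualDescription: "short silver hair, navy coat" },
  { name: "Ren", voiceDescription: "bright tenor" },
];
const merged = mergeCharacters(registry, redesign);
```

After the merge, "Aiko" keeps `voice-001`, `portrait-uuid-1`, and the old `"calm alto"` description (the empty string in the update is falsy, so `||` falls back), gains the new `visualDescription`, and "Ren" is appended as a new entry. Note that an empty `updates` array returns the `existing` array itself rather than a copy.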

// Pick a reference to the prior scene image when sceneKey matches a prior
// scene — used by the Painter as one of the `referenceImages` (NOT as a
// seedImage, because FLUX.2 [klein] 9B KV does not support seedImage).
//
// Returns the UUID if available (cheap reference, ~36 chars over the wire),
// else the base64 of the most recent matching scene's image. Returns
// undefined when no prior scene shares the current sceneKey.
function pickPriorSceneReference(
  session: Session,
  currentSceneKey: string | undefined,
  priorImageBase64ByUuid: Map<string, string>,
): { priorSceneReference?: string; priorSceneKey?: string } {
  if (!currentSceneKey) return {};
  for (let i = session.history.length - 1; i >= 0; i--) {
    const prior = session.history[i]!.scene;
    if (prior.sceneKey === currentSceneKey) {
      if (prior.imageUuid) {
        return {
          priorSceneReference: prior.imageUuid,
          priorSceneKey: prior.sceneKey,
        };
      }
      const cached = priorImageBase64ByUuid.get(prior.id);
      if (cached) {
        return { priorSceneReference: cached, priorSceneKey: prior.sceneKey };
      }
    }
  }
  return {};
}

export type SceneResult = {
  scene: Scene;
  characterUpdates: Character[];
  sceneImageBase64: string;
  characters: Character[];
};

// ──────────────────────────────────────────────────────────────────────
// directScene — the multi-agent pipeline. Used by orchestrator's
// startSession and requestScene.
//
// priorImageBase64ByUuid: optional map from a prior Scene.id to the
// base64 image the caller has on hand (despite the name, the lookup key
// is Scene.id, not the Runware UUID). If a scene with a matching sceneKey
// has no imageUuid but its base64 is cached locally, it can still be fed
// in as one of the Painter's referenceImages. Callers without a cache can
// omit the argument; it defaults to an empty map. (The orchestrator does
// pass one for the start-session bootstrap.)
// ──────────────────────────────────────────────────────────────────────

||||
export async function directScene(
|
||||
config: ProviderConfig,
|
||||
config: EngineConfig,
|
||||
session: Session,
|
||||
priorImageBase64ByUuid: Map<string, string> = new Map(),
|
||||
): Promise<SceneResult> {
|
||||
const raw = await chat(
|
||||
config,
|
||||
[
|
||||
{ role: "system", content: DIRECTOR_SYSTEM },
|
||||
{ role: "user", content: buildDirectorUserMessage(session) },
|
||||
],
|
||||
{ temperature: 0.9, responseFormat: "json_object" },
|
||||
const tTotal = Date.now();
|
||||
|
||||
// Stage 1 — Writer (serial; everything downstream needs sceneSummary +
|
||||
// beats[] to know who's on stage and what to compose around).
|
||||
const tWriter = Date.now();
|
||||
const writerOut = await runWriter(config.text, session);
|
||||
tlog("[directScene] Writer", tWriter);
|
||||
|
||||
// Identify NEW characters introduced by this scene that need to be
|
||||
// designed (LLM + portrait + voice). Existing characters in the registry
|
||||
// are skipped — their cards / portraits / voices persist across scenes.
|
||||
const allActiveNames = collectActiveCharacterNames(writerOut.beats);
|
||||
const newCharNames = allActiveNames.filter(
|
||||
(n) => !session.characters.some((c) => c.name === n),
|
||||
);
|
||||
|
||||
const parsed = parseJsonLoose<RawScene>(raw);
|
||||
const rawBeats = Array.isArray(parsed.beats) ? parsed.beats : [];
|
||||
if (rawBeats.length === 0) {
|
||||
throw new Error("Director returned no beats");
|
||||
// Find the entry beat for the Cinematographer (which characters are
|
||||
// on-screen in the establishing shot).
|
||||
const entryBeat = writerOut.beats.find((b) => b.id === writerOut.entryBeatId);
|
||||
const entryBeatActive = entryBeat?.activeCharacters ?? [];
|
||||
|
||||
// For sceneKey-based visual continuity, look up the prior matching scene's
|
||||
// image to slot into Painter's referenceImages (max 4 of which include
|
||||
// character portraits too).
|
||||
const { priorSceneReference, priorSceneKey } = pickPriorSceneReference(
|
||||
session,
|
||||
writerOut.sceneKey,
|
||||
priorImageBase64ByUuid,
|
||||
);
|
||||
|
||||
+  // Stage 2 — parallel: CharacterDesigner(s) and Cinematographer.
+  // Cinematographer doesn't need character visualDescriptions (those are
+  // appended at Painter stage), so it runs concurrently with chardesign.
+  const tParallel = Date.now();
+
+  const designPromises = newCharNames.map((name) =>
+    designCharacter(config, session, name).catch((err): Character => {
+      const msg = err instanceof Error ? err.message : String(err);
+      console.error(`[directScene] designCharacter(${name}) failed: ${msg}`);
+      // Last-resort fallback: register with name only so the speaker isn't
+      // unknown. Caller may try voice provisioning later or skip.
+      return {
+        name,
+        voiceDescription: `请根据角色名「${name}」推断其性别、年龄与气质。所属世界观:${session.worldSetting}`,
+      };
+    }),
+  );
+
+  const cinemaPromise = runCinematographer(config.text, {
+    sceneSummary: writerOut.sceneSummary,
+    styleGuide: session.styleGuide,
+    entryBeatActive,
+    entryBeatSpeaker: entryBeat?.speaker,
+    priorSceneKey,
+    currentSceneKey: writerOut.sceneKey,
+  });
+
+  const [designedChars, cinemaOut] = await Promise.all([
+    Promise.all(designPromises),
+    cinemaPromise,
+  ]);
+  tlog("[directScene] CharacterDesigner+Cinematographer parallel", tParallel);
+
+  // Merge new chars into a working registry that we'll pass to the Painter.
+  const characters = mergeCharacters(session.characters, designedChars);
+
+  // Edge case: a speaker referenced by the Writer might not have been in
+  // `activeCharacters` of any beat (LLM oversight), so they got skipped by
+  // newCharNames. Catch them here and at least provision a voice so the
+  // beat-audio path doesn't render silent. No portrait — they weren't
+  // visible in the scene, so visual consistency doesn't matter for them.
+  const speakerNames = new Set(
+    writerOut.beats.map((b) => b.speaker).filter((n): n is string => Boolean(n)),
+  );
+  const orphanSpeakers = [...speakerNames].filter(
+    // Pattern B: "你" (player) is a valid speaker but never gets a Character
+    // record — TTS is intentionally skipped on the client. Filter POV out so
+    // provisionVoiceForName isn't accidentally invoked for the player.
+    (n) => !isPovName(n) && !characters.some((c) => c.name === n),
+  );
+  if (orphanSpeakers.length > 0) {
+    const orphans = await Promise.all(
+      orphanSpeakers.map((n) => provisionVoiceForName(config, session, n)),
+    );
+    const merged = mergeCharacters(characters, orphans);
+    characters.splice(0, characters.length, ...merged);
+  }
+
-  const beats = ensureUniqueChoiceIds(
-    repairBeats(
-      ensureUniqueBeatIds(
-        rawBeats.map((b, i) => coerceBeat(b, i, rawBeats.length)),
-      ),
-    ),
+  // Stage 3 — Painter (depends on cinemaOut + characters).
+  // On-stage characters for THIS scene are the ones in any beat — pass them
+  // all so the archetype block covers anyone the player might encounter.
+  const onStageCharacters = characters.filter((c) =>
+    allActiveNames.includes(c.name),
   );

-  const declaredEntry = parsed.entryBeatId?.trim();
-  const entryBeatId =
-    declaredEntry && beats.some((b) => b.id === declaredEntry)
-      ? declaredEntry
-      : beats[0]!.id;
-
-  return {
-    scene: {
-      id: newSceneId(),
-      scenePrompt: parsed.scenePrompt?.trim() || "an empty scene",
-      beats,
-      entryBeatId,
+  const tPainter = Date.now();
+  const sceneImageBase64 = await runPainter(
+    config,
+    {
+      integratedPrompt: cinemaOut.integratedPrompt,
+      styleGuide: session.styleGuide,
+      onStageCharacters,
+      priorSceneImage: priorSceneReference,
     },
-    characterUpdates: coerceCharacterUpdates(parsed.characterUpdates),
+    entryBeat,
+  );
+  tlog("[directScene] Painter", tPainter);
+
+  // Stage 4 — best-effort upload of the final scene image so the NEXT
+  // sceneKey-match call can reference its UUID instead of carrying base64.
+  // If upload fails, the scene still works; only loses cheap referencing
+  // on the next hop. Don't wait on mock images (static placeholder).
+  let imageUuid: string | undefined;
+  if (!config.mockImage) {
+    try {
+      const tUpload = Date.now();
+      imageUuid = await uploadImage(config.image, sceneImageBase64);
+      tlog("[directScene] image upload", tUpload);
+    } catch (err) {
+      const msg = err instanceof Error ? err.message : String(err);
+      console.warn(`[directScene] scene image upload failed: ${msg} — sceneKey reuse will need base64 fallback`);
+    }
+  }
+
+  const scene: Scene = {
+    id: newSceneId(),
+    // scenePrompt is the cinematographer's English compositional output;
+    // the Writer's sceneSummary stays in the session log via beats[]/
+    // history. Keeping the original field name preserves compat with
+    // anything that already reads scene.scenePrompt (e.g., insert-beat
+    // user prompt).
+    scenePrompt: cinemaOut.integratedPrompt,
+    beats: writerOut.beats,
+    entryBeatId: writerOut.entryBeatId,
+    sceneKey: writerOut.sceneKey,
+    imageUuid,
   };
+
+  tlog("[directScene] TOTAL", tTotal);
+
+  return { scene, sceneImageBase64, characters };
 }

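The commit message describes the Painter's fallback chain as A) both anchors, B) reference images only, C) seed image only, D) pure text-to-image. A minimal sketch of that ordering follows, assuming a hypothetical `AnchorOptions` shape and a caller-supplied `generate` function; `buildAttempts` and `paintWithFallback` are illustrative names, not the actual contents of `agents/painter.ts`.

```typescript
// Hypothetical option shape; the real generateImage options may differ.
interface AnchorOptions {
  seedImage?: string; // prior same-sceneKey image (img2img)
  referenceImages?: string[]; // character portrait UUIDs
}

// Cap of 4 reference images, as stated in the commit message.
const MAX_REFERENCE_IMAGES = 4;

// Build the ordered attempts: A) both, B) refs only, C) seed only, D) none.
function buildAttempts(
  seedImage: string | undefined,
  portraits: string[],
): AnchorOptions[] {
  const refs = portraits.slice(0, MAX_REFERENCE_IMAGES);
  const attempts: AnchorOptions[] = [];
  if (seedImage && refs.length > 0) {
    attempts.push({ seedImage, referenceImages: refs });
  }
  if (refs.length > 0) attempts.push({ referenceImages: refs });
  if (seedImage) attempts.push({ seedImage });
  attempts.push({}); // pure text-to-image, always available
  return attempts;
}

// Try each attempt in order and return the first image that renders.
async function paintWithFallback(
  generate: (opts: AnchorOptions) => Promise<string>,
  attempts: AnchorOptions[],
): Promise<string> {
  let lastError: unknown;
  for (const opts of attempts) {
    try {
      return await generate(opts);
    } catch (err) {
      lastError = err; // remember the failure and fall through to the next tier
    }
  }
  throw lastError;
}
```

Keeping the attempt list as plain data makes the tier order easy to log and test separately from the network call.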
 // ──────────────────────────────────────────────────────────────────────
-// directInsertBeat — generates a one-off transient beat in response to
-// a freeform vision action that stays in-scene. Used by /api/insert-beat.
+// directInsertBeat — single-agent path for vision-driven in-scene
+// exploration. Generates ONE transient beat with NO new image, NO new
+// characters. Multi-agent pipeline doesn't apply here (no rendering, no
+// character introduction allowed by the prompt).
 // ──────────────────────────────────────────────────────────────────────

 export async function directInsertBeat(
@@ -326,13 +311,17 @@ export async function directInsertBeat(
   const parsed = parseJsonLoose<InsertBeatPartial>(raw);

   const narration = parsed.narration?.trim() || undefined;
-  const speaker = parsed.speaker?.trim() || undefined;
+  const rawSpeaker = parsed.speaker?.trim() || undefined;
+  // Pattern B (mirrors Writer): normalize POV variants → "你"; NPCs pass through.
+  const speaker = rawSpeaker ? normalizeSpeakerName(rawSpeaker) : undefined;
   const line = parsed.line?.trim() || undefined;
-  const lineDelivery = line ? parsed.lineDelivery?.trim() || undefined : undefined;
+  // lineDelivery is only meaningful for NPC speakers (TTS). For POV ("你")
+  // TTS is intentionally skipped on the client, so lineDelivery is dropped.
+  const lineDelivery =
+    line && speaker !== POV_DISPLAY_NAME
+      ? parsed.lineDelivery?.trim() || undefined
+      : undefined;

-  // If the model returned nothing usable, supply a fallback narration so the
-  // frontend doesn't append a silent empty beat that renders no dialogue —
-  // which would make the click appear to do nothing.
   if (!narration && !speaker && !line) {
     return { narration: "(你停下脚步,环视片刻。)" };
   }

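The hunk above calls `normalizeSpeakerName` and compares against `POV_DISPLAY_NAME`, but neither implementation appears in this part of the diff. The sketch below is one plausible reading, assuming the variant list is the one spelled out in the Writer prompt and that matching is case-insensitive; the bodies here are hypothetical, not the PR's actual helpers.

```typescript
// Assumed canonical display name for the player (Pattern B).
const POV_DISPLAY_NAME = "你";

// Variants taken from the Writer prompt; stored lower-cased so that
// "Player", "MC", "I" and similar spellings all match.
const POV_VARIANTS = new Set([
  "玩家", "你", "我", "主角",
  "protagonist", "player", "mc", "i", "me",
]);

// True when the name refers to the player rather than an NPC.
function isPovName(name: string): boolean {
  return POV_VARIANTS.has(name.trim().toLowerCase());
}

// Collapse every POV variant to "你"; NPC names pass through trimmed.
function normalizeSpeakerName(name: string): string {
  const trimmed = name.trim();
  return isPovName(trimmed) ? POV_DISPLAY_NAME : trimmed;
}
```

With this shape, a downstream guard only ever needs to compare against the single literal `"你"`.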
@@ -6,7 +6,10 @@ export {
   requestBeatAudio,
 } from "./orchestrator";
 export { annotateClick } from "./annotate";
-export { provisionVoicesForScene, synthesizeBeat } from "./voice";
+export { synthesizeBeat } from "./voice";
+export { mergeCharacters } from "./director";
 export type { SceneResult } from "./director";
+export type { WriterOutput } from "./agents/writer";
+export type { CinematographerOutput } from "./agents/cinematographer";
 export type { InsertBeatPartial } from "@yume/types";
 export * from "./prompts";

@@ -1,3 +1,13 @@
+// Strict-then-forgiving JSON parser for LLM output. Tries in order:
+//   1. Direct JSON.parse on the trimmed text.
+//   2. Extract from ```json``` fenced block.
+//   3. Slice between first { and last } and parse.
+//   4. Apply best-effort regex repair (trailing commas, missing commas
+//      between adjacent values) and try again.
+//
+// On final failure, logs the first 800 chars of the raw model output so we
+// can see what the LLM did wrong (the standard error message only shows
+// the position, not the surrounding context).
 export function parseJsonLoose<T>(raw: string): T {
   const trimmed = raw.trim();

@@ -20,8 +30,52 @@ export function parseJsonLoose<T>(raw: string): T {
   const last = trimmed.lastIndexOf("}");
   if (first !== -1 && last > first) {
     const slice = trimmed.slice(first, last + 1);
-    return JSON.parse(slice) as T;
+    try {
+      return JSON.parse(slice) as T;
+    } catch {
+      // Last resort: try repairing common LLM-output malformations.
+      const repaired = repairJsonString(slice);
+      try {
+        return JSON.parse(repaired) as T;
+      } catch (err) {
+        console.error(
+          `[parseJsonLoose] all strategies failed. Raw output (first 800 chars):\n${raw.slice(0, 800)}`,
+        );
+        throw err;
+      }
+    }
   }

+  console.error(
+    `[parseJsonLoose] no { ... } found. Raw output (first 800 chars):\n${raw.slice(0, 800)}`,
+  );
   throw new Error(`Failed to parse JSON from model output: ${raw.slice(0, 200)}`);
 }
+
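The first three strategies listed in the header comment (direct parse, fenced block, brace slice) can be sketched as a list of candidates tried in order. This is a simplified illustration under assumptions, not the PR's `parseJsonLoose`: the name `parseLoose` and the fence regex are hypothetical, and the regex-repair step is left out.

```typescript
// Three backticks, built at runtime so this example can itself sit in a fence.
const FENCE = "`".repeat(3);

// Try progressively more forgiving candidates and return the first that parses.
function parseLoose<T>(raw: string): T {
  const trimmed = raw.trim();

  // 1. The whole trimmed text.
  const candidates: string[] = [trimmed];

  // 2. The body of the first fenced block, with an optional "json" tag.
  const fenced = new RegExp(
    FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE,
  ).exec(trimmed);
  if (fenced?.[1]) candidates.push(fenced[1]);

  // 3. Everything between the first "{" and the last "}".
  const first = trimmed.indexOf("{");
  const last = trimmed.lastIndexOf("}");
  if (first !== -1 && last > first) {
    candidates.push(trimmed.slice(first, last + 1));
  }

  for (const candidate of candidates) {
    try {
      return JSON.parse(candidate) as T;
    } catch {
      // Not valid JSON; fall through to the next candidate.
    }
  }
  throw new Error(`Failed to parse JSON from model output: ${raw.slice(0, 200)}`);
}
```

Ordering the candidates from strictest to loosest means well-formed model output never pays for the more permissive heuristics.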
+// Best-effort repair of LLM-typical JSON syntax errors. Targeted at the two
+// most common failures we see in practice:
+//   1. Trailing comma before } or ].
+//   2. Missing comma between two adjacent JSON values (the specific error
+//      mode we hit at position 3390).
+//
+// Deliberately conservative — does NOT try to fix unclosed strings,
+// unbalanced braces, or strip JS-style comments. The comment-stripping
+// path was previously included but would corrupt JSON string values
+// containing `//` (e.g. URLs like "https://example.com"); since LLMs in
+// `responseFormat: "json_object"` mode essentially never emit comments,
+// dropping that step is a net win for safety.
+function repairJsonString(s: string): string {
+  return s
+    // 1. Strip trailing commas before } or ].
+    .replace(/,(\s*[}\]])/g, "$1")
+    // 2. Insert missing commas between two adjacent JSON values. The cases:
+    //      } { → },{    ] [ → ],[    } [ → },[    ] { → ],{
+    //      "string" "key"    "string" {    "string" [
+    //      number then "key" / { / [
+    //
+    //    The regex looks for a closing token (} ] " or a digit) followed by
+    //    a newline and an opening token ({ [ or "), and inserts a comma
+    //    between them. Requires the newline (\s*\n\s*) so it only fires
+    //    across line boundaries, never within a single-line value.
+    .replace(/(\}|\]|"|\d)(\s*\n\s*)(\{|\[|")/g, "$1,$2$3");
+}
+
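To make the scope of the repair concrete, the snippet below copies the two regexes from `repairJsonString` and runs them on three small inputs. The third input illustrates a limitation that follows from the pattern itself: a value ending in `true`, `false`, or `null` is not a "closing token" for the second regex, so a missing comma after it is left alone.

```typescript
// Same two replacements as the diff's repairJsonString.
function repairJsonString(s: string): string {
  return s
    .replace(/,(\s*[}\]])/g, "$1")
    .replace(/(\}|\]|"|\d)(\s*\n\s*)(\{|\[|")/g, "$1,$2$3");
}

// Case 1: trailing comma before the closing brace is stripped.
const trailing = repairJsonString('{"a": 1,\n}');

// Case 2: a digit at end of line followed by a quoted key gets a comma.
const missing = repairJsonString('{\n"a": 1\n"b": 2\n}');

// Case 3: "true" ends in a letter, so the second regex does not fire
// and the text comes back unchanged (still invalid JSON).
const unrepaired = repairJsonString('{\n"a": true\n"b": 2\n}');

console.log(JSON.parse(trailing));
console.log(JSON.parse(missing));
console.log(unrepaired === '{\n"a": true\n"b": 2\n}');
```

Extending the first capture group to literals would close that gap, at the cost of a less conservative pattern; whether that trade is worth it is a judgment call for the maintainers.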
@@ -1,14 +1,12 @@
 import type {
   BeatAudioRequest,
   BeatAudioResponse,
-  Character,
   EngineConfig,
   InsertBeatRequest,
   InsertBeatResponse,
-  Scene,
-  Session,
   SceneRequest,
   SceneResponse,
+  Session,
   StartRequest,
   StartResponse,
   VisionRequest,
@@ -16,55 +14,24 @@ import type {
 } from "@yume/types";
 import { annotateClick } from "./annotate";
 import { directInsertBeat, directScene } from "./director";
-import { mockImageBase64 } from "./mockImage";
-import { render } from "./renderer";
+import { synthesizeBeat } from "./voice";
 import { interpret } from "./vision";
-import { provisionVoicesForScene, synthesizeBeat } from "./voice";

 function newSessionId(): string {
   return `s_${Date.now()}_${Math.random().toString(36).slice(2, 8)}`;
 }

 // TEMP: per-phase timing for latency diagnosis. Remove after we have data.
 function tlog(label: string, t0: number): void {
   console.log(`${label}: ${Date.now() - t0}ms`);
 }

-// Merge new character entries into the registry by name. If a name already
-// exists we preserve the existing voice (so a description revision never
-// silently re-provisions a voice the player has already heard).
-function mergeCharacters(existing: Character[], updates: Character[]): Character[] {
-  if (updates.length === 0) return existing;
-  const byName = new Map(existing.map((c) => [c.name, c]));
-  for (const u of updates) {
-    const prev = byName.get(u.name);
-    byName.set(u.name, prev?.voice ? { ...u, voice: prev.voice } : u);
-  }
-  return Array.from(byName.values());
-}
-
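The merge-by-name rule in `mergeCharacters` above (removed from the orchestrator here and re-exported from `./director` elsewhere in this PR) is easiest to see on a tiny registry. The sketch below reuses the function body shown in the diff with a cut-down `Character` type; treating `voice` as a plain string is an assumption made only for illustration.

```typescript
// Minimal stand-in for the real Character type (assumed fields only).
interface Character {
  name: string;
  voiceDescription: string;
  voice?: string; // assumed: an opaque voice handle once provisioned
}

// Same logic as the diff: updates win, except an existing voice is kept.
function mergeCharacters(existing: Character[], updates: Character[]): Character[] {
  if (updates.length === 0) return existing;
  const byName = new Map(existing.map((c) => [c.name, c]));
  for (const u of updates) {
    const prev = byName.get(u.name);
    byName.set(u.name, prev?.voice ? { ...u, voice: prev.voice } : u);
  }
  return Array.from(byName.values());
}

const registry: Character[] = [
  { name: "夏海", voiceDescription: "old card", voice: "voice-1" },
];
const merged = mergeCharacters(registry, [
  { name: "夏海", voiceDescription: "revised card" }, // revision, no voice
  { name: "学姐", voiceDescription: "new card" }, // brand-new character
]);
console.log(merged);
```

Note that an empty `updates` array returns the same array instance, so callers that later mutate the result in place would also mutate the input.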
-async function renderImage(
-  config: EngineConfig,
-  scene: Scene,
-  styleGuide: string,
-): Promise<string> {
-  if (config.mockImage) return mockImageBase64();
-  return render(config.image, scene, styleGuide);
-}
-
-async function provisionForScene(
-  config: EngineConfig,
-  session: Session,
-  scene: Scene,
-): Promise<{ characters: Character[] }> {
-  if (!config.tts) return { characters: session.characters };
-  return provisionVoicesForScene(config.tts, session, scene);
-}
-
 // ──────────────────────────────────────────────────────────────────────
-// startSession — first scene + image + voice provisioning. The actual
-// per-beat synth runs lazily via requestBeatAudio so MiMo's tail
-// latency never blocks the UI.
+// startSession — initial Scene via the multi-agent pipeline.
+//
+// directScene internally handles: Writer → (CharacterDesigner+
+// Cinematographer parallel) → Painter → upload. Voice provisioning and
+// portrait generation happen inside CharacterDesigner per new character,
+// so the orchestrator no longer needs to coordinate them separately.
 // ──────────────────────────────────────────────────────────────────────

 export async function startSession(
@@ -72,6 +39,7 @@ export async function startSession(
   req: StartRequest,
 ): Promise<StartResponse> {
   const tTotal = Date.now();
+
   const session: Session = {
     id: newSessionId(),
     createdAt: Date.now(),
@@ -81,42 +49,20 @@ export async function startSession(
     characters: [],
   };

-  const tDirect = Date.now();
-  const { scene, characterUpdates } = await directScene(config.text, session);
-  tlog("[start] directScene", tDirect);
-
-  const preVoiceSession: Session = {
-    ...session,
-    characters: mergeCharacters(session.characters, characterUpdates),
-  };
-
-  const tImage = Date.now();
-  const tProv = Date.now();
-  const imagePromise = renderImage(config, scene, preVoiceSession.styleGuide)
-    .then((r) => {
-      tlog("[start] renderImage", tImage);
-      return r;
-    });
-  const provPromise = provisionForScene(config, preVoiceSession, scene)
-    .then((r) => {
-      tlog("[start] provisionForScene", tProv);
-      return r;
-    });
-  const [imageBase64, provRes] = await Promise.all([imagePromise, provPromise]);
+  const { scene, sceneImageBase64, characters } = await directScene(config, session);

   tlog("[start] TOTAL", tTotal);

   return {
     sessionId: session.id,
     scene,
-    imageBase64,
-    characters: provRes.characters,
+    imageBase64: sceneImageBase64,
+    characters,
   };
 }

 // ──────────────────────────────────────────────────────────────────────
-// requestScene — generate the NEXT scene + image + voice provisioning.
-// Used both on real scene transitions and on speculative prefetch.
+// requestScene — next Scene from existing session.
 // ──────────────────────────────────────────────────────────────────────

 export async function requestScene(
@@ -125,40 +71,24 @@ export async function requestScene(
 ): Promise<SceneResponse> {
   const tTotal = Date.now();

-  const tDirect = Date.now();
-  const { scene, characterUpdates } = await directScene(config.text, req.session);
-  tlog("[scene] directScene", tDirect);
-
-  const preVoiceSession: Session = {
-    ...req.session,
-    characters: mergeCharacters(req.session.characters, characterUpdates),
-  };
-
-  const tImage = Date.now();
-  const tProv = Date.now();
-  const imagePromise = renderImage(config, scene, preVoiceSession.styleGuide)
-    .then((r) => {
-      tlog("[scene] renderImage", tImage);
-      return r;
-    });
-  const provPromise = provisionForScene(config, preVoiceSession, scene)
-    .then((r) => {
-      tlog("[scene] provisionForScene", tProv);
-      return r;
-    });
-  const [imageBase64, provRes] = await Promise.all([imagePromise, provPromise]);
+  const { scene, sceneImageBase64, characters } = await directScene(
+    config,
+    req.session,
+  );

   tlog("[scene] TOTAL", tTotal);

   return {
     scene,
-    imageBase64,
-    characters: provRes.characters,
+    imageBase64: sceneImageBase64,
+    characters,
   };
 }

 // ──────────────────────────────────────────────────────────────────────
 // visionDecide — interprets a background click into intent + classify.
+// No change from staging — vision lives outside the scene-generation
+// pipeline.
 // ──────────────────────────────────────────────────────────────────────

 export async function visionDecide(
@@ -171,9 +101,9 @@ export async function visionDecide(
 }

 // ──────────────────────────────────────────────────────────────────────
-// requestInsertBeat — generates a transient in-scene beat (no image
-// regen, no voice). The client fires /api/beat-audio for the new beat
-// after this returns.
+// requestInsertBeat — single-agent transient beat (no image, no new
+// characters). Stays single-LLM by design — the INSERT_BEAT prompt
+// forbids new characters and there's nothing to render.
 // ──────────────────────────────────────────────────────────────────────

 export async function requestInsertBeat(
@@ -182,19 +112,24 @@ export async function requestInsertBeat(
 ): Promise<InsertBeatResponse> {
   const tTotal = Date.now();

-  const tDirect = Date.now();
   const partial = await directInsertBeat(
     config.text,
     req.session,
     req.freeformAction,
   );
-  tlog("[insert-beat] directInsertBeat", tDirect);

-  // INSERT_BEAT prompt forbids new characters — promote disallowed-speaker
-  // lines to narration so the player still sees the text (the client only
-  // renders `line` when there is a `speaker`).
+  // INSERT_BEAT prompt forbids new NPCs — promote disallowed-speaker lines
+  // to narration so the player still sees the text (the client only renders
+  // `line` when there is a `speaker`).
+  //
+  // Exception (Pattern B): speaker = "你" is the player speaking. No
+  // Character record exists for "你" (intentional — TTS is skipped), so we
+  // must NOT demote it; the client renders the dialog box correctly.
+  // directInsertBeat already normalized POV variants to "你" before this
+  // guard, so a literal "你" here is always Pattern B player dialog.
   if (
     partial.speaker &&
+    partial.speaker !== "你" &&
     !req.session.characters.some((c) => c.name === partial.speaker)
   ) {
     console.warn(

+381
-55
@@ -1,23 +1,47 @@
-import type { Scene, Session } from "@yume/types";
+import type {
+  BeatActiveCharacter,
+  Character,
+  Scene,
+  Session,
+} from "@yume/types";

+// ══════════════════════════════════════════════════════════════════════
+// Multi-agent scene generation pipeline:
+//   Writer (编剧) — narrative + beats[] + per-beat activeCharacters
+//   CharacterDesigner — per-new-character visual + voice cards
+//   Cinematographer (分镜导演) — sceneKey + English compositional prompt
+//   Painter (画师) — FLUX rendering with character archetypes
+//
+// Each agent owns one system prompt + one user-message builder below.
+// All four agents see the same world / style guide, but each only reads
+// the slice of session state it needs to make its decision.
+// ══════════════════════════════════════════════════════════════════════
+
 // ──────────────────────────────────────────────────────────────────────
-// Director — emits one Scene (background + a graph of beats) at a time.
+// 1. Writer (编剧) — drives the narrative.
+//
+// Emits a full Scene: beats[] graph + entryBeatId + sceneKey hint +
+// activeCharacters per beat. Does NOT design characters (that's the
+// CharacterDesigner's job) — only names them in `activeCharacters`.
+// The CharacterDesigner is invoked separately for any name not yet in
+// session.characters.
 // ──────────────────────────────────────────────────────────────────────

-export const DIRECTOR_SYSTEM = `你是一个交互视觉小说的「场景导演」。每次基于世界观、画风、玩家历史、已登记角色,输出**一个完整的场景**,并为每句台词配上细腻的配音导演指令。
+export const WRITER_SYSTEM = `你是一个交互视觉小说的「编剧」。每次基于世界观、画风、玩家历史、已登记角色,写出**一个完整场景的剧本**:场景背景概要 + 一组对话节拍 beats。你只负责**剧情和台词**——不设计角色形象、不写出图提示词、不做镜头调度,这些由其他 agent 完成。

 一个场景包含:
-- 一张背景图(你给出英文 scenePrompt)
-- 一组对话节拍 beats,玩家会按顺序经历它们
-- 任何**首次登场**的角色,需在 characterUpdates 里登记其专属音色设计
+- sceneSummary:当前场景的中文概要(地点、时间、氛围、关键事件——给后续的分镜导演看)
+- sceneKey:当前场景的英文 slug(如 "classroom-dusk"、"rooftop-night"、"rainy-street")——同一物理空间应沿用相同 slug
+- beats[]:玩家依次经历的对话节拍
+- entryBeatId:玩家进入场景时落在哪个 beat

 每个 beat 是玩家会看到的一段叙述 / 对话 / 选择。beat 之间通过 next 字段连接:
-- "continue": 玩家点击图片背景 / 按继续,自然推进到下一个 beat
-- "choice": 在此让玩家做选择,按所选 choice 的 effect 走向
+- "continue":玩家点击图片背景 / 按继续,自然推进到下一个 beat
+- "choice":在此让玩家做选择,按所选 choice 的 effect 走向

 choice 的 effect 有两种:
-- "advance-beat": 玩家选了之后跳到**同场景内**的另一个 beat(不换背景图,速度极快)
-- "change-scene": 玩家选了之后切换到**新场景**(视角变了 / 走到新地方 / 时间跳了)
+- "advance-beat":玩家选了之后跳到**同场景内**的另一个 beat(不换背景图,速度极快)
+- "change-scene":玩家选了之后切换到**新场景**(视角变了 / 走到新地方 / 时间跳了)

 设计原则:
 - 同场景内 beat 数自由发挥,按剧情节奏自然给出(通常 2–6 个,可以更多)
@@ -25,34 +49,60 @@ choice 的 effect 有两种:
 - advance-beat 适合处理对话分支(同一场景里换个话题、追问、撒娇)
 - change-scene 适合空间/时间跳跃(出门、转身看窗外、第二天清晨)
 - 一个场景至少要有一个 change-scene 出口(除非真到结局)
-- 每个 change-scene 必须带 nextSceneSeed —— 一句中文简述「下一场是哪里、谁在、要发生什么」,用来引导下一次导演调用
+- 每个 change-scene 必须带 nextSceneSeed —— 一句中文简述「下一场是哪里、谁在、要发生什么」
 - 同一场景的 beat id 互不重复
 - next.nextBeatId 引用的 beat 必须存在
 - choice 至少 2 个,至多 4 个,互不重复

+sceneKey 设计原则(重要 — 用于跨场景视觉一致性):
+- 同一物理空间 + 同一时段 → 必须沿用**完全相同**的英文 slug
+- 时段或空间变化时换 slug(如 "classroom-dusk" → "classroom-night","classroom-dusk" → "corridor-dusk")
+- slug 规范:lowercase-with-dashes,2–4 个英文单词
+- 已登记的历史场景 sceneKey 会在用户消息里列出,请优先**复用**这些已有 slug
+
 文本风格约束:
 - narration / line 用中文(**纯净可显示文本**,绝不要写 (叹气)(语速快) 这类标注 —— 那是给配音的,会被玩家看见)
-- scenePrompt / lineDelivery / characterUpdates 内的文字按下方专门说明
+- sceneSummary / lineDelivery / activeCharacters[].pose 内的文字也用中文
+- sceneKey 用英文 slug
 - 单个 beat 的 narration 与 line 加起来 ≤80 字
 - 单个 choice label ≤15 字
-- scenePrompt 用英文,只描述画面里看到什么,不要描述 UI

 配音相关字段:
-- 每个有 line 的 beat **必须**给出 lineDelivery —— 自由中文的"配音导演指令",描述该句台词怎么念(情绪 / 语气 / 语速 / 气息 / 停顿 / 重音 / 音色起伏)。例:"鼓起勇气又害羞,声音发颤、偏小,句尾带一丝气声,语速偏慢"。平淡场合写"平静自然、语速适中"即可,但要贴当下情境。
-- characterUpdates 仅当**有新角色首次出现**时列出该新角色的音色设计;已登记的角色不要重复列出。
-- characterUpdates[].description **必须以明确性别开头**("女性,…" / "男性,…"),随后描述:年龄、音色质感、性格情绪基调、语速节奏、人设腔调、口音方言。例:"女性,约17岁少女,音色清亮带点稚嫩甜美,性格开朗,语速偏快,标准普通话"。
+- 每个有 line 的 beat **必须**给出 lineDelivery —— 自由中文的「配音导演指令」,描述该句台词怎么念(情绪 / 语气 / 语速 / 气息 / 停顿 / 重音 / 音色起伏)。例:"鼓起勇气又害羞,声音发颤、偏小,句尾带一丝气声,语速偏慢"。平淡场合写"平静自然、语速适中"即可,但要贴当下情境。

-角色与台词的硬性规则(影响配音正确性):
-- 任何 beat 的 speaker 字段一旦填了名字,**该名字必须**:① 在"已登记角色"列表中存在,或 ② 本次输出的 characterUpdates 里登记。绝不允许 speaker 是个未登记的陌生名字。
+角色与台词的硬性规则:
+- 任何 beat 的 speaker 字段一旦填了名字,**该名字必须**:① 是 "你"(玩家本人,见下方"玩家视角硬规则"),或 ② 在「已登记角色」列表中存在,或 ③ 出现在本场景的某个 beat 的 activeCharacters 里。
 - speaker 名字必须与登记名**完全一致**,不要加「(回忆)」「学姐」之类后缀或别名。
+- 每个 beat 的 activeCharacters 列出**此时此刻画面里出现的 NPC 角色**及其当下姿态/神情(中文)。即使没人说话,画面里有谁在也要列出。

+玩家视角硬规则(重要 — 违反这条会破坏整个 galgame):
+
+【画面规则 — 严格禁止】
+- 玩家是第二人称 POV,**永远不出现在任何 Scene 画面里**
+- activeCharacters[].name 数组**绝不允许**包含任何下列名字(任何大小写、中英文变体):
+  「玩家」「你」「我」「主角」「protagonist」「player」「Player」「MC」「I」「me」
+- 玩家不会被设计立绘、不会被设计音色
+
+【对白规则 — galgame 标准做法(Pattern B)】
+- 玩家**可以正常说话**——当主角对 NPC 开口时:
+    speaker = "你"(**固定用这两个字,不要用其他变体**)
+    line = 实际说的话(如「学姐,下雨了」)
+    lineDelivery 可以留空(玩家对白不会被 TTS 合成)
+- speaker 字段允许的取值**只有两种**:① NPC 真名(必须在 activeCharacters 里)② "你"
+- 其它 POV 变体(玩家 / 我 / 主角 / protagonist / player / MC / I / me)**一律视为错误**
+
+【内心 vs 外显的区分】
+- 主角在心里想 / 在做某个动作 / 在观察 / 自己的体感 → 用 narration(speaker 留空)
+  例:"你的心跳得很快,几乎听不见外面的雨声。"
+- 主角真的开口对 NPC 说出来 → 用 speaker="你" + line
+  例:speaker="你" line="学姐,这把伞你拿着。"
+- 同一个 beat 可以同时有 narration(心理活动 / 动作)和 speaker="你" + line(说出口的话)
+
 必须输出严格 JSON,结构如下:
 {
-  "scenePrompt": "english scene description, no UI",
+  "sceneSummary": "中文场景概要:地点+时间+氛围+关键事件",
+  "sceneKey": "classroom-dusk",
   "entryBeatId": "b1",
-  "characterUpdates": [
-    { "name": "夏海", "description": "女性,约17岁少女,音色清亮带点稚嫩甜美…" }
-  ],
   "beats": [
     {
       "id": "b1",
@@ -60,6 +110,9 @@ choice 的 effect 有两种:
       "speaker": "可空",
       "line": "可空(纯净文本)",
       "lineDelivery": "line 非空时必填:配音导演指令",
+      "activeCharacters": [
+        { "name": "夏海", "pose": "脸红害羞地绞着衣角,双眼躲闪" }
+      ],
       "next": { "type": "continue", "nextBeatId": "b2" }
     },
     {
@@ -67,13 +120,26 @@ choice 的 effect 有两种:
       "speaker": "夏海",
       "line": "学长,我有话想对你说。",
       "lineDelivery": "鼓起勇气,但又有点害羞,语速偏慢,句尾微微上扬",
+      "activeCharacters": [
+        { "name": "夏海", "pose": "鼓起勇气直视对方,双手紧握" }
+      ],
+      "next": { "type": "continue", "nextBeatId": "b3" }
+    },
+    {
+      "id": "b3",
+      "narration": "你下意识攥紧了书包带,喉咙有点干。",
+      "speaker": "你",
+      "line": "……你说。",
+      "activeCharacters": [
+        { "name": "夏海", "pose": "鼓起勇气直视对方,双手紧握" }
+      ],
       "next": {
         "type": "choice",
         "choices": [
           {
             "id": "c1",
             "label": "继续追问",
-            "effect": { "kind": "advance-beat", "targetBeatId": "b3" }
+            "effect": { "kind": "advance-beat", "targetBeatId": "b4" }
           },
           {
             "id": "c2",
@@ -88,18 +154,24 @@ choice 的 effect 有两种:

 不要输出 JSON 以外的任何文本。`;

-export function buildDirectorUserMessage(session: Session): string {
+export function buildWriterUserMessage(session: Session): string {
   const parts: string[] = [];
   parts.push(`世界观:${session.worldSetting}`);
   parts.push(`画风:${session.styleGuide}`);

   if (session.characters.length > 0) {
-    parts.push("\n已登记角色(speaker 必须用这些名字之一,或在本次 characterUpdates 里登记新名):");
+    parts.push("\n已登记角色(speaker 必须用这些名字之一,或本场景新引入):");
     for (const c of session.characters) {
-      parts.push(`- ${c.name}:${c.description}`);
+      parts.push(`- ${c.name}`);
     }
   }

+  const priorKeys = collectPriorSceneKeys(session);
+  if (priorKeys.length > 0) {
+    parts.push("\n已使用的 sceneKey(同一物理空间请沿用,不要新造):");
+    for (const k of priorKeys) parts.push(`- ${k}`);
+  }
+
   if (session.history.length === 0) {
     parts.push("\n这是故事的开场。请生成第一个场景,严格以 JSON 格式返回。");
     return parts.join("\n");
@@ -108,7 +180,7 @@ export function buildDirectorUserMessage(session: Session): string {
   parts.push("\n场景历史(按时间顺序):");
   session.history.forEach((entry, idx) => {
     const lines: string[] = [`【场景 ${idx + 1}】`];
-    lines.push(`  scenePrompt: ${entry.scene.scenePrompt}`);
+    if (entry.scene.sceneKey) lines.push(`  sceneKey: ${entry.scene.sceneKey}`);

     const visited = entry.visitedBeatIds.length
       ? entry.visitedBeatIds
@@ -157,9 +229,274 @@ export function buildDirectorUserMessage(session: Session): string {
   return parts.join("\n");
 }

+function collectPriorSceneKeys(session: Session): string[] {
+  const seen = new Set<string>();
+  for (const entry of session.history) {
+    const k = entry.scene.sceneKey;
+    if (k) seen.add(k);
+  }
+  return Array.from(seen);
+}
+
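The Writer prompt asks for `sceneKey` slugs that are lowercase-with-dashes and 2 to 4 English words, and the user message lists prior keys so they can be reused. Since the slug comes from an LLM, a light check before it is used for matching may be worthwhile. The following is only a sketch under assumptions: `normalizeSceneKey` and `isValidSceneKey` are hypothetical helpers that do not appear in the PR, and the pattern deliberately rejects digits, which the real rule may or may not allow.

```typescript
// 2 to 4 lowercase ASCII words joined by single dashes (assumed reading
// of the "lowercase-with-dashes, 2–4 words" rule in the Writer prompt).
const SCENE_KEY_PATTERN = /^[a-z]+(?:-[a-z]+){1,3}$/;

// Fold case and treat spaces / underscores as dashes, so near-miss output
// such as "Classroom_Dusk" still matches an earlier "classroom-dusk".
function normalizeSceneKey(raw: string): string {
  return raw.trim().toLowerCase().replace(/[\s_]+/g, "-");
}

// True when the key already follows the slug convention.
function isValidSceneKey(key: string): boolean {
  return SCENE_KEY_PATTERN.test(key);
}

console.log(normalizeSceneKey(" Classroom_Dusk "));
console.log(isValidSceneKey("classroom-dusk"));
console.log(isValidSceneKey("classroom"));
```

Normalizing before comparing against `collectPriorSceneKeys` output would make cross-scene reuse robust to casing drift, at the cost of hiding prompt non-compliance that might be worth logging.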
+// ──────────────────────────────────────────────────────────────────────
+// 2. CharacterDesigner (角色设定师) — designs one new character.
+//
+// Receives a character NAME (extracted by the Writer's activeCharacters)
+// and produces BOTH the English visual card AND the Chinese voice card
+// in a single LLM call. Bundling these two is intentional: a single agent
+// that "knows who this character is" produces internally-consistent
+// appearance + vocal personality, whereas split agents tend to diverge
+// (e.g., gentle-looking character with energetic voice).
+// ──────────────────────────────────────────────────────────────────────
+
+export const CHARACTER_DESIGNER_SYSTEM = `你是视觉小说的「角色设定师」。给你一个**新登场角色的名字**,你要为这个角色同时设计两份卡片:
+1. **视觉设定卡(英文)**——给生图模型 FLUX 用,遵循 prompt engineering 风格
+2. **音色设定卡(中文)**——给小米 MiMo 配音设计用
+
+两份卡片要描绘**同一个人**——外貌温柔的人不该被配上张扬聒噪的嗓音;冷酷干练的人不该用甜软糯的童声。先在心里想清楚这个人的整体气质,再分两面落笔。
+
+视觉设定卡 visualDescription 规则:
+- **必须完全用英文**
+- 风格:用形容词 + 短语,**英文逗号分隔**,符合 FLUX/Stable Diffusion prompt 习惯
+- 包含:年龄段、发型发色、眼睛 / 神情基调、面部特征、标志性服饰(款式 + 配色 + 花纹)、整体气质
+- **不要写瞬时姿势或表情**(这些由编剧/分镜每帧实时控制)
+- **必须融入全局画风** styleGuide 的美术指向(比如 styleGuide 是「赛博朋克」时,服饰要赛博朋克化)
+- 长度:80–150 个英文词为宜
+- 不要包含背景环境(这不是场景图,是角色立绘卡)
+
+音色设定卡 voiceDescription 规则:
+- **必须以明确性别开头**:"女性,…" / "男性,…"
+- 随后描述:年龄段(如「约17岁少女」「30 出头男性」)、音色质感、性格情绪基调、语速节奏、人设腔调、口音方言
+- 用中文,整段连续描述,不分段
+- 长度:50–80 个中文字为宜
+- 例:"女性,约17岁少女,音色清亮带点稚嫩甜美,性格开朗外向但容易害羞,语速偏快,标准普通话"
+
+必须输出严格 JSON:
+{
+  "visualDescription": "English visual card, comma-separated tags...",
+  "voiceDescription": "中文音色卡,以性别开头..."
+}
+
+不要输出 JSON 以外的任何文本。`;
+
+export function buildCharacterDesignerUserMessage(
+  charName: string,
+  session: Session,
+): string {
+  const parts: string[] = [];
+  parts.push(`角色名:${charName}`);
+  parts.push(`世界观:${session.worldSetting}`);
+  parts.push(`全局美术画风:${session.styleGuide}`);
+
+  const others = session.characters.filter((c) => c.visualDescription);
+  if (others.length > 0) {
+    parts.push("\n已设定角色(外貌应与他们有区分):");
+    for (const c of others) {
+      parts.push(`- ${c.name}: ${c.visualDescription}`);
+    }
+  }
+
+  parts.push(
+    "\n请为该角色同时设计 visualDescription(英文)和 voiceDescription(中文),严格以 JSON 格式返回。",
+  );
+  return parts.join("\n");
+}
+
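The CharacterDesigner prompt sets two rules that are easy to check mechanically: `visualDescription` must be entirely English, and `voiceDescription` must open with an explicit gender ("女性,…" or "男性,…"). The sketch below shows one way a caller might validate the parsed card before using it. `CharacterCard` and `validateCharacterCard` are hypothetical names, and the CJK test is a rough heuristic over the basic CJK Unified Ideographs range, not a full language detector.

```typescript
// Assumed shape of the CharacterDesigner's JSON output.
interface CharacterCard {
  visualDescription: string;
  voiceDescription: string;
}

// Return a list of rule violations; an empty list means the card passes.
function validateCharacterCard(card: CharacterCard): string[] {
  const problems: string[] = [];

  // Rule: the visual card must be English. Any CJK ideograph is a red flag.
  if (/[\u4e00-\u9fff]/.test(card.visualDescription)) {
    problems.push("visualDescription contains Chinese characters");
  }

  // Rule: the voice card must start with the gender, followed by a comma
  // (full-width or ASCII).
  if (!/^(女性|男性)[,,]/.test(card.voiceDescription.trim())) {
    problems.push("voiceDescription does not start with 女性, or 男性,");
  }

  return problems;
}

const ok = validateCharacterCard({
  visualDescription: "teenage girl, short black hair, navy sailor uniform",
  voiceDescription: "女性,约17岁少女,音色清亮带点稚嫩甜美,语速偏快,标准普通话",
});
const bad = validateCharacterCard({
  visualDescription: "少女, short black hair",
  voiceDescription: "约17岁少女,音色清亮",
});
console.log(ok, bad);
```

A failed check could trigger a retry of the designer call or a fall back to the name-only registration used when `designCharacter` throws; which is preferable depends on latency budget.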
+// ──────────────────────────────────────────────────────────────────────
+// 3. Cinematographer (分镜导演) — composes the visual frame.
+//
+// Reads the Writer's sceneSummary + active characters and produces the
+// English compositional prompt fed to FLUX. Does NOT describe the
+// characters themselves (those archetypes are appended at the Painter
+// stage from session.characters.visualDescription). Only describes the
+// ENVIRONMENT, lighting, camera framing, and how the characters are
+// positioned within the frame.
+// ──────────────────────────────────────────────────────────────────────
+
+export const CINEMATOGRAPHER_SYSTEM = `你是视觉小说的「分镜导演」。给你编剧的当前场景概要、活跃角色名单和他们在场景里的姿态描述,以及**入口 beat 的 speaker 信息**(用来决定镜头语言)。你的任务是**只用英文**写一段**纯环境+构图**的描述(integratedPrompt),交给画师作为出图主提示词。
+
+你**不要**写角色的外貌细节——发色、服饰、脸型这些由其他 agent 提供,画师会把"角色档案卡"附加到你的 integratedPrompt 后面。你只关心:
+- **环境**:地点、时间、天气、光线、空间细节(什么家具/植物/物件)
+- **构图 / 镜头**:景别(wide shot / medium shot / close-up / over-the-shoulder)、机位、视角
+- **人物在画面中的位置和姿态**(不写脸 / 不写穿什么——只写"哪个角色站在哪儿、在做什么")
+- **氛围**:情绪基调、色调、影调(warm dusk / cold neon / soft morning light)
+
+═══════════════════════════════════════════════════════════════════
+玩家视角硬规则(与画面相关,必须严格遵守)
+═══════════════════════════════════════════════════════════════════
+- 玩家本人**永远不出现在画面里**——不画 player 的身体、手、肩膀、背影、剪影、脚、头发
+- integratedPrompt 中**绝对禁止**出现下列英文(或中文等价):
+    "first-person view" · "POV of the protagonist" · "player's hand / arm / shoulder / back"
+    "protagonist visible" · "from the player's perspective" · "MC" · "player's silhouette"
+- 镜头是一个"隐形的观察者位置"——可以位于玩家的视角附近(NPC 像在看玩家),但**绝不画出玩家本身**
+
+═══════════════════════════════════════════════════════════════════
+动态镜头策略(根据入口 beat 的 speaker 字段选择镜头)
+═══════════════════════════════════════════════════════════════════
+你会收到 entryBeatSpeaker 字段。按以下规则选镜头:
+
+【entryBeatSpeaker = 某个 NPC 名字】 → NPC 正在对玩家说话
+  - 优先 **close-up 或 medium close-up**,NPC 看向画面外(= 看玩家)
+  - 关键英文:close-up / medium close-up, looking toward camera, eyes meeting the viewer,
+    direct gaze, lips parted mid-speech
+  - 制造"她正在对你说话"的代入感(galgame 经典直视镜头)
+
+【entryBeatSpeaker = "你"】 → 玩家正在对 NPC 说话
+  - 优先 **medium shot**,NPC 居中,做"在听玩家说话"的姿态
+  - 关键英文:medium shot, attentively listening, facing the camera,
+    head slightly tilted, expression of attention
+  - ❌ 不要写 over-the-shoulder(因为这会暗示画出玩家肩膀,违反 POV 规则)
+
+【entryBeatSpeaker 为空】 → 纯环境 / 旁白 beat
+  - 优先 **wide establishing shot**,展现环境氛围
+  - 关键英文:wide establishing shot, atmospheric mood, environmental detail
+  - 如果有 NPC 在场,他们可以处于远处 / 中景 / 自然状态(不必看镜头)
+
+【entryBeatActive 有多个角色】 → 群像
+  - 使用 **medium group shot 或 medium wide shot**,多人在一个框内
+  - 关键英文:medium group shot, two-shot / three-shot, characters arranged in the frame
+
+═══════════════════════════════════════════════════════════════════
+输出 JSON 结构
+═══════════════════════════════════════════════════════════════════
+{
+  "shotType": "close-up / medium shot / wide establishing / medium group shot / ...",
+  "integratedPrompt": "English. Environment + composition + character positioning + camera language. No dialogue boxes, no UI. 80-150 words."
|
||||
}
|
||||
|
||||
写作要求:
|
||||
- integratedPrompt **必须英文**,遵循 FLUX prompt engineering 习惯(形容词 + 短语,英文逗号分隔,必要时短句)
|
||||
- 提到具体角色时**只用其名字 + 动作**,例如 "Natsumi standing by the window, head slightly bowed"——绝不要写她长什么样
|
||||
- 不描述任何 UI、字幕、对话框、边框
|
||||
- 不描述图像之外的事情(不要写"this scene depicts..."这种 meta 句)
|
||||
- 长度 80–150 英文词
|
||||
|
||||
不要输出 JSON 以外的任何文本。`;
|
||||
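The system prompt above bans a set of player-POV phrases from integratedPrompt and asks for 80 to 150 English words. A small lint over the model's reply could catch violations before the prompt reaches the Painter. The sketch below is an assumption about how such a check might look; `checkIntegratedPrompt`, `PromptIssue`, and the phrase list constant are hypothetical and are not part of this PR.

```typescript
// Phrases the system prompt bans from integratedPrompt (lower-cased for matching).
const FORBIDDEN_POV_PHRASES: string[] = [
  "first-person view",
  "pov of the protagonist",
  "player's hand",
  "player's arm",
  "player's shoulder",
  "player's back",
  "protagonist visible",
  "from the player's perspective",
  "player's silhouette",
];

type PromptIssue = { kind: "forbidden-phrase" | "length"; detail: string };

// Returns every rule violation found; an empty array means the prompt passes.
function checkIntegratedPrompt(prompt: string): PromptIssue[] {
  const issues: PromptIssue[] = [];
  const lower = prompt.toLowerCase();
  for (const phrase of FORBIDDEN_POV_PHRASES) {
    if (lower.includes(phrase)) {
      issues.push({ kind: "forbidden-phrase", detail: phrase });
    }
  }
  // "MC" is matched case-sensitively as a whole word to avoid false hits.
  if (/\bMC\b/.test(prompt)) {
    issues.push({ kind: "forbidden-phrase", detail: "MC" });
  }
  // Word count by whitespace; the prompt asks for 80 to 150 English words.
  const words = prompt.trim().split(/\s+/).filter(Boolean).length;
  if (words < 80 || words > 150) {
    issues.push({ kind: "length", detail: `${words} words` });
  }
  return issues;
}
```

A caller could retry the Cinematographer once when issues are returned, or simply log them; which of the two fits better depends on the latency budget.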
|
||||
export function buildCinematographerUserMessage(
|
||||
sceneSummary: string,
|
||||
styleGuide: string,
|
||||
entryBeatActive: BeatActiveCharacter[],
|
||||
entryBeatSpeaker: string | undefined,
|
||||
priorSceneKey: string | undefined,
|
||||
currentSceneKey: string | undefined,
|
||||
): string {
|
||||
const parts: string[] = [];
|
||||
parts.push(`全局美术画风:${styleGuide}`);
|
||||
parts.push(`\n当前场景(来自编剧):${sceneSummary}`);
|
||||
|
||||
if (entryBeatActive.length > 0) {
|
||||
parts.push("\n开场画面里的角色及其姿态:");
|
||||
for (const c of entryBeatActive) {
|
||||
parts.push(`- ${c.name}:${c.pose ?? "(无具体姿态描述)"}`);
|
||||
}
|
||||
} else {
|
||||
parts.push("\n开场画面里没有角色(纯环境)。");
|
||||
}
|
||||
|
||||
// entryBeatSpeaker drives the dynamic camera policy (see CINEMATOGRAPHER_SYSTEM).
|
||||
// "你" means the player is speaking; an NPC name means an NPC is speaking;
|
||||
// empty means no dialog (pure environment / narration beat).
|
||||
if (entryBeatSpeaker === "你") {
|
||||
parts.push(
|
||||
'\n开场 beat 是**玩家说话**(speaker = "你")——按动态镜头策略:medium shot,NPC 居中、做听玩家说话的姿态、看向画面外。**绝不要画出玩家**。',
|
||||
);
|
||||
} else if (entryBeatSpeaker) {
|
||||
parts.push(
|
||||
`\n开场 beat 是 **${entryBeatSpeaker} 在对玩家说话**(speaker = "${entryBeatSpeaker}")——按动态镜头策略:close-up 或 medium close-up,${entryBeatSpeaker} 看向画面外(看玩家),眼神交流。`,
|
||||
);
|
||||
} else {
|
||||
parts.push(
|
||||
"\n开场 beat 没有 speaker(纯旁白/环境)——按动态镜头策略:wide establishing shot 展现环境氛围。",
|
||||
);
|
||||
}
|
||||
|
||||
if (priorSceneKey && currentSceneKey && priorSceneKey === currentSceneKey) {
|
||||
parts.push(
|
||||
`\n注意:上一场和本场 sceneKey 都是 "${currentSceneKey}"——画师会把上一张场景图作为 referenceImages 之一锚定同一空间。你的 integratedPrompt 应该**强调连续性**,描述时段/情绪/构图的细微变化,而不是完全重新设定空间。`,
|
||||
);
|
||||
}
|
||||
|
||||
parts.push("\n请输出 shotType + integratedPrompt,严格以 JSON 格式返回。");
|
||||
return parts.join("\n");
|
||||
}
|
||||
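The dynamic camera policy that this builder spells out in prose can also be read as a small pure function. The sketch below is only an illustration of that decision table. `pickShotPolicy` and `ShotPolicy` are hypothetical names, and giving the group rule precedence over the speaker rules is an assumption, since the prompt does not state which rule wins when both apply.

```typescript
type ShotPolicy =
  | "close-up"
  | "medium shot"
  | "wide establishing shot"
  | "medium group shot";

// Speaker value the prompts use for the player.
const PLAYER_SPEAKER = "你";

// Map the entry beat's speaker and on-screen character count to a shot type.
function pickShotPolicy(
  entryBeatSpeaker: string | undefined,
  activeCount: number,
): ShotPolicy {
  // Assumption: several characters on screen take precedence (group shot).
  if (activeCount > 1) return "medium group shot";
  // Player speaking: NPC centered and listening, player never drawn.
  if (entryBeatSpeaker === PLAYER_SPEAKER) return "medium shot";
  // NPC speaking to the player: direct-gaze close-up.
  if (entryBeatSpeaker) return "close-up";
  // No speaker: pure environment or narration beat.
  return "wide establishing shot";
}
```

Keeping the table in code as well as in the prompt would make it possible to compare the model's returned shotType against the expected policy.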
|
||||
// ──────────────────────────────────────────────────────────────────────
// 4. Painter (illustrator): final image prompt assembly.
//
// Not an LLM agent. A pure prompt-building function that combines the
// Cinematographer's integratedPrompt with character archetype blocks
// (visual cards) and the standard FLUX constraints.
// ──────────────────────────────────────────────────────────────────────
|
||||
|
||||
export function buildPainterPrompt(
|
||||
integratedPrompt: string,
|
||||
styleGuide: string,
|
||||
characters: { name: string; visualDescription?: string }[],
|
||||
): string {
|
||||
const archetypeBlock = characters
|
||||
.filter((c) => c.visualDescription)
|
||||
.map((c) => `[CHARACTER: ${c.name}]\n${c.visualDescription}`)
|
||||
.join("\n\n");
|
||||
|
||||
const archetypeSection = archetypeBlock
|
||||
? `\n\nCHARACTER ARCHETYPES (anchor identity, outfit, and style across scenes — keep each character visually identical to their archetype):\n${archetypeBlock}`
|
||||
: "";
|
||||
|
||||
return `Generate a cinematic landscape background illustration, 16:9 widescreen (1792x1024).
|
||||
|
||||
ART STYLE: ${styleGuide}
|
||||
|
||||
SCENE COMPOSITION (from cinematographer — environment + camera framing + character positioning):
|
||||
${integratedPrompt}${archetypeSection}
|
||||
|
||||
STRICT RULES — NEVER violate these:
|
||||
- DO NOT draw any dialogue boxes, speech bubbles, text panels, or any rectangular overlay.
|
||||
- DO NOT draw any buttons, choice options, menu items, or interactive UI elements.
|
||||
- DO NOT render any Chinese or English text anywhere in the image.
|
||||
- DO NOT add any HUD, interface chrome, or game UI elements.
|
||||
- The image is a PURE BACKGROUND SCENE ONLY. All UI will be added as HTML on top.
|
||||
- 16:9 LANDSCAPE orientation — wider than tall. No portrait or square output.
|
||||
- Leave the bottom 35% of the frame relatively uncluttered (darker or softer) so overlaid UI panels remain readable.
|
||||
- Characters or key scene elements should be positioned in the upper 65% of the frame.
|
||||
- Maintain character identity exactly as specified in CHARACTER ARCHETYPES — same face, same hairstyle, same outfit across every scene.
|
||||
|
||||
PLAYER POV RULES — the player / protagonist is the unseen viewer:
|
||||
- The player / protagonist is NEVER visible in the frame — no body parts, no hands, no shoulders, no back of head, no silhouette, no feet, no hair.
|
||||
- DO NOT use first-person POV that implies the player's body in frame.
|
||||
- When an NPC is speaking to the player, they SHOULD look toward the camera (toward the player's implied position) — this creates eye contact without showing the player.
|
||||
- The camera position represents the player's gaze; only NPCs, scenery, and objects are rendered.`;
|
||||
}
|
||||
|
||||
// Character portrait prompt: builds the per-character base image, generated
// once when the CharacterDesigner introduces a new character. The portrait
// is used both as a client-side asset (the character's sprite on first
// appearance) and as a referenceImages entry when rendering later scenes
// for visual consistency.
|
||||
export function buildCharacterPortraitPrompt(
|
||||
charName: string,
|
||||
visualDescription: string,
|
||||
styleGuide: string,
|
||||
): string {
|
||||
return `Character concept portrait sheet, single character, full-body or upper-body composition, neutral standing pose, looking toward camera, neutral expression, plain neutral background (no environment, no scenery).
|
||||
|
||||
ART STYLE: ${styleGuide}
|
||||
|
||||
CHARACTER (${charName}):
|
||||
${visualDescription}
|
||||
|
||||
STRICT RULES:
|
||||
- ONE character only — no other people, no crowd, no background characters.
|
||||
- Plain neutral background (off-white or soft gradient). NO environment, NO furniture, NO props beyond what's worn.
|
||||
- Neutral, calm pose and expression — this is a reference sheet, not a dramatic shot.
|
||||
- NO text, NO UI, NO watermark, NO border.
|
||||
- The character should be clearly visible and centered, the pose natural and relaxed.
|
||||
- 16:9 landscape orientation.`;
|
||||
}
|
||||
|
||||
// ──────────────────────────────────────────────────────────────────────
|
||||
// Insert-Beat — given a freeform vision action that is judged to stay
|
||||
// *within* the current scene, generate one transient beat.
|
||||
// Single-agent path; no character design / no rendering involved.
|
||||
// ──────────────────────────────────────────────────────────────────────
|
||||
|
||||
export const INSERT_BEAT_SYSTEM = `你是视觉小说编剧。玩家在当前场景内做了一个**不会换场景的自由动作**(比如看一眼桌上的相框、想了想刚才那句话)。请基于此动作,写出一个**单独的、过渡性的 beat**:可以是旁白、角色台词、或两者结合。
|
||||
@@ -169,8 +506,15 @@ export const INSERT_BEAT_SYSTEM = `你是视觉小说编剧。玩家在当前场
|
||||
- narration 与 line 加起来 ≤80 字
|
||||
- 不要打破当前场景的物理状态(玩家仍在原地、对面仍是同一个角色)
|
||||
- 不要生成选项或下一步指引 —— 玩家点击会自然回到原 beat
|
||||
- 如果有 line,speaker 必须用**已登记角色**里的名字(绝不允许引入新角色)
|
||||
- 如果有 line,**必须**给出 lineDelivery(配音导演指令,自由中文,描述这句话怎么念)
|
||||
|
||||
speaker 字段允许的取值**只有两种**(与主路径 Writer 一致 — Pattern B galgame 标准):
|
||||
1. **已登记角色**里的 NPC 真名(**绝不允许引入新角色**)
|
||||
2. **"你"** — 玩家本人在自言自语 / 说一句过渡性的话(对白框显示,但不调 TTS)
|
||||
|
||||
其它任何 POV 变体(玩家 / 我 / 主角 / protagonist / player / MC / I / me)**一律错误**,请用 "你" 代替。
|
||||
|
||||
- 如果有 line 且 speaker = NPC,**必须**给出 lineDelivery(配音导演指令)
|
||||
- 如果有 line 且 speaker = "你",lineDelivery 可以留空(玩家对白不调 TTS)
|
||||
|
||||
必须输出严格 JSON:
|
||||
{
|
||||
@@ -198,9 +542,10 @@ export function buildInsertBeatUserMessage(
|
||||
|
||||
const current = session.history.at(-1);
|
||||
if (current) {
|
||||
-    parts.push(`\n当前场景:${current.scene.scenePrompt}`);
-    const lastBeatId = current.visitedBeatIds.at(-1) ?? current.scene.entryBeatId;
-    const lastBeat = current.scene.beats.find((b) => b.id === lastBeatId);
+    const scene: Scene = current.scene;
+    parts.push(`\n当前场景:${scene.scenePrompt}`);
+    const lastBeatId = current.visitedBeatIds.at(-1) ?? scene.entryBeatId;
+    const lastBeat = scene.beats.find((b) => b.id === lastBeatId);
|
||||
if (lastBeat) {
|
||||
const recent: string[] = [];
|
||||
if (lastBeat.narration) recent.push(`旁白:${lastBeat.narration}`);
|
||||
@@ -214,31 +559,10 @@ export function buildInsertBeatUserMessage(
|
||||
return parts.join("\n");
|
||||
}
|
||||
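INSERT_BEAT_SYSTEM allows only two kinds of speaker value: a registered NPC name, or "你" for the player, with every other POV variant treated as an error to be replaced by "你". A defensive normalizer on the parsed reply could enforce this even when the model slips. The sketch below is hypothetical: `normalizeSpeaker` is not part of this PR, and dropping an unregistered speaker (returning `undefined`) is an assumed policy rather than the engine's documented behavior.

```typescript
const PLAYER_SPEAKER = "你";

// POV variants the prompt lists as invalid; stored lower-cased.
const POV_VARIANTS = new Set<string>([
  "玩家",
  "我",
  "主角",
  "protagonist",
  "player",
  "mc",
  "i",
  "me",
]);

// Returns "你" for the player, the name itself for a registered NPC,
// and undefined for anything else (including a missing speaker).
function normalizeSpeaker(
  raw: string | undefined,
  registered: string[],
): string | undefined {
  const speaker = raw?.trim();
  if (!speaker) return undefined;
  if (speaker === PLAYER_SPEAKER || POV_VARIANTS.has(speaker.toLowerCase())) {
    return PLAYER_SPEAKER;
  }
  // Assumption: an unregistered name is dropped rather than registered.
  return registered.includes(speaker) ? speaker : undefined;
}
```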
|
||||
// ──────────────────────────────────────────────────────────────────────
|
||||
// Image renderer
|
||||
// ──────────────────────────────────────────────────────────────────────
|
||||
|
||||
export function buildImagePrompt(scene: Scene, styleGuide: string): string {
|
||||
return `Generate a cinematic landscape background illustration, 16:9 widescreen (1792x1024).
|
||||
|
||||
ART STYLE: ${styleGuide}
|
||||
|
||||
SCENE (fill the ENTIRE canvas — no UI elements, no text overlays):
|
||||
${scene.scenePrompt}
|
||||
|
||||
STRICT RULES — NEVER violate these:
|
||||
- DO NOT draw any dialogue boxes, speech bubbles, text panels, or any rectangular overlay.
|
||||
- DO NOT draw any buttons, choice options, menu items, or interactive UI elements.
|
||||
- DO NOT render any Chinese or English text anywhere in the image.
|
||||
- DO NOT add any HUD, interface chrome, or game UI elements.
|
||||
- The image is a PURE BACKGROUND SCENE ONLY. All UI will be added as HTML on top.
|
||||
- 16:9 LANDSCAPE orientation — wider than tall. No portrait or square output.
|
||||
- Leave the bottom 35% of the frame relatively uncluttered (darker or softer) so overlaid UI panels remain readable.
|
||||
- Characters or key scene elements should be positioned in the upper 65% of the frame.`;
|
||||
}
|
||||
|
||||
// ──────────────────────────────────────────────────────────────────────
|
||||
// Vision — interprets a background click and classifies the action.
|
||||
// Unchanged from staging (UI choices live in HTML, vision only judges
|
||||
// background clicks).
|
||||
// ──────────────────────────────────────────────────────────────────────
|
||||
|
||||
export const VISION_SYSTEM_PROMPT = `你是视觉理解助手。玩家在视觉小说的背景图上点击了红色圆点位置(HTML 上的选项按钮不会走到你这里)。你的任务是:
|
||||
@@ -265,3 +589,5 @@ export function buildVisionUserPrompt(scene: Scene | null): string {
|
||||
|
||||
红点位置即为玩家点击位置。请判断玩家意图与分类,以 JSON 格式返回。`;
|
||||
}
|
||||
|
||||
export type PainterCharacterInput = Pick<Character, "name" | "visualDescription">;
|
||||
|
||||
@@ -1,12 +0,0 @@
|
||||
import { generateImage } from "@yume/ai-client";
|
||||
import type { ProviderConfig, Scene } from "@yume/types";
|
||||
import { buildImagePrompt } from "./prompts";
|
||||
|
||||
export async function render(
|
||||
config: ProviderConfig,
|
||||
scene: Scene,
|
||||
styleGuide: string,
|
||||
): Promise<string> {
|
||||
const prompt = buildImagePrompt(scene, styleGuide);
|
||||
return generateImage(config, prompt);
|
||||
}
|
||||
@@ -1,25 +1,11 @@
|
||||
-import { provisionVoice, synthesize } from "@yume/tts-client";
-import type {
-  BeatAudio,
-  Character,
-  CharacterVoice,
-  Scene,
-  Session,
-  TtsConfig,
-} from "@yume/types";
+import { synthesize } from "@yume/tts-client";
+import type { BeatAudio, CharacterVoice, TtsConfig } from "@yume/types";
|
||||
|
||||
// Per-beat synth budget. MiMo's median synth is 3–7s; the tail can spike
|
||||
// to 30–70s under concurrent load. Capping here means a single bad beat
|
||||
// degrades to silent in <15s instead of blocking the whole UI flow.
|
||||
const SYNTH_TIMEOUT_MS = 15000;
|
||||
|
||||
-// When the director references a speaker that was never registered, derive a
-// description from the name + world so the voice's gender/temperament is at
-// least inferred from the name, never borrowed from another character.
-function inferredSpeakerDescription(name: string, session: Session): string {
-  return `请根据角色名「${name}」推断其性别、年龄与气质,生成最贴合的音色。所属世界观:${session.worldSetting}`;
-}
|
||||
|
||||
// Race the work against a timer; on either outcome clear the timer (otherwise
|
||||
// the success path leaks a 15s-pending reject closure into Node's timer heap,
|
||||
// per-synth call). On timeout, abort the supplied controller so the underlying
|
||||
@@ -47,82 +33,15 @@ async function withTimeout<T>(
|
||||
}
|
||||
}
|
||||
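The comment above describes racing the synth call against a timer, clearing the timer on either outcome, and aborting the underlying request on timeout. The body of `withTimeout` is elided in this diff, so the following is only a sketch of that general pattern. `raceWithTimeout` and its callback signature are assumptions and may differ from the actual helper.

```typescript
// Run `work` with an AbortSignal and reject if it takes longer than `ms`.
async function raceWithTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Rejects after `ms` and aborts the in-flight work at the same moment.
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`timed out after ${ms}ms`));
    }, ms);
  });

  try {
    return await Promise.race([work(controller.signal), timeout]);
  } finally {
    // Clear on success and on failure so no pending timer outlives the call.
    clearTimeout(timer);
  }
}
```

A caller such as a per-beat synth wrapper could catch the rejection and return `null`, which matches the "play silent" contract described for `synthesizeBeat`.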
|
||||
// Provision voices for all unseen speakers in a scene, in parallel.
|
||||
// Does NOT synthesize per-beat audio — that happens lazily via
|
||||
// synthesizeBeat from the /api/beat-audio route. Returning the populated
|
||||
// registry lets the client fire per-beat synth without re-provisioning.
|
||||
//
|
||||
// Why dedupe before fanning out: the SAME unseen speaker appearing in 3
|
||||
// beats must run voicedesign once; parallel design of the same speaker
|
||||
// would burn three voices' worth of budget and pick whichever raced last.
|
||||
export async function provisionVoicesForScene(
|
||||
cfg: TtsConfig,
|
||||
session: Session,
|
||||
scene: Scene,
|
||||
): Promise<{ characters: Character[] }> {
|
||||
const tScene = Date.now();
|
||||
const speakingBeats = scene.beats.filter(
|
||||
(b): b is typeof b & { speaker: string; line: string } =>
|
||||
Boolean(b.speaker && b.line),
|
||||
);
|
||||
|
||||
let characters: Character[] = [...session.characters];
|
||||
const toProvision = new Map<string, string>(); // name -> description
|
||||
for (const b of speakingBeats) {
|
||||
if (toProvision.has(b.speaker)) continue;
|
||||
const existing = characters.find((c) => c.name === b.speaker);
|
||||
if (existing?.voice) continue;
|
||||
toProvision.set(
|
||||
b.speaker,
|
||||
existing?.description ?? inferredSpeakerDescription(b.speaker, session),
|
||||
);
|
||||
}
|
||||
|
||||
if (toProvision.size === 0) {
|
||||
console.log(
|
||||
`[voice] provisionVoicesForScene total=${Date.now() - tScene}ms (no new speakers)`,
|
||||
);
|
||||
return { characters };
|
||||
}
|
||||
|
||||
const tProvision = Date.now();
|
||||
const provisioned = await Promise.all(
|
||||
Array.from(toProvision.entries()).map(async ([name, description]) => {
|
||||
try {
|
||||
const voice = await provisionVoice(cfg, description);
|
||||
return { name, description, voice };
|
||||
} catch (err) {
|
||||
const msg = err instanceof Error ? err.message : String(err);
|
||||
console.error(`[voice] provision degraded for ${name}: ${msg}`);
|
||||
return { name, description, voice: undefined };
|
||||
}
|
||||
}),
|
||||
);
|
||||
console.log(
|
||||
`[voice] provision: ${toProvision.size} speakers parallel max=${Date.now() - tProvision}ms`,
|
||||
);
|
||||
|
||||
for (const p of provisioned) {
|
||||
if (!p.voice) continue;
|
||||
const idx = characters.findIndex((c) => c.name === p.name);
|
||||
if (idx === -1) {
|
||||
characters.push({ name: p.name, description: p.description, voice: p.voice });
|
||||
} else {
|
||||
characters[idx] = { ...characters[idx]!, voice: p.voice };
|
||||
}
|
||||
}
|
||||
|
||||
console.log(
|
||||
`[voice] provisionVoicesForScene total=${Date.now() - tScene}ms`,
|
||||
);
|
||||
return { characters };
|
||||
}
|
||||
|
||||
// Synthesize audio for one beat. Caller is expected to have already
|
||||
// resolved the speaker's voice (from session.characters in the client) —
|
||||
// passing it directly here keeps the /api/beat-audio payload small and
|
||||
// makes this function pure with respect to session state.
|
||||
// Returns null on error or timeout; caller treats null as "play silent."
|
||||
//
|
||||
// (Voice PROVISIONING — designing a voice for a new character from a
|
||||
// voiceDescription — lives in agents/characterDesigner.ts now. This file
|
||||
// only handles per-beat SYNTHESIS using an already-provisioned voice.)
|
||||
export async function synthesizeBeat(
|
||||
cfg: TtsConfig,
|
||||
voice: CharacterVoice,
|
||||
|
||||
@@ -11,9 +11,21 @@ export type Beat = {
|
||||
line?: string;
|
||||
/** Free-form voice-acting direction for the line, sent to TTS only. Never displayed. */
|
||||
lineDelivery?: string;
|
||||
  /**
   * Characters visible in this beat, with their pose / expression at this moment.
   * The Cinematographer reads this list from the scene's entry beat when
   * composing the establishing shot; the entry beat is the visual anchor for
   * the image.
   */
|
||||
activeCharacters?: BeatActiveCharacter[];
|
||||
next: BeatNext;
|
||||
};
|
||||
|
||||
export type BeatActiveCharacter = {
|
||||
name: string;
|
||||
  /** Free-form Chinese description of pose / expression / what the character is doing. */
|
||||
pose?: string;
|
||||
};
|
||||
|
||||
export type BeatNext =
|
||||
| { type: "continue"; nextBeatId: string }
|
||||
| { type: "choice"; choices: BeatChoice[] };
|
||||
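Since the Cinematographer works from the entry beat's speaker and `activeCharacters`, the pipeline needs a step that pulls those two values out of a Scene. The sketch below shows one way that lookup might be written. The `BeatLite` and `SceneLite` types are trimmed stand-ins for the real `Beat` and `Scene`, and `entryBeatContext` is a hypothetical helper name.

```typescript
type BeatActiveCharacter = { name: string; pose?: string };

// Trimmed stand-ins for the real Beat / Scene types (assumption).
type BeatLite = {
  id: string;
  speaker?: string;
  activeCharacters?: BeatActiveCharacter[];
};
type SceneLite = { entryBeatId: string; beats: BeatLite[] };

// Resolve the entry beat and return the inputs the Cinematographer needs.
function entryBeatContext(scene: SceneLite): {
  speaker: string | undefined;
  active: BeatActiveCharacter[];
} {
  const entry = scene.beats.find((b) => b.id === scene.entryBeatId);
  return {
    // An empty string is treated like a missing speaker.
    speaker: entry?.speaker || undefined,
    // activeCharacters is optional on Beat, so default to an empty list.
    active: entry?.activeCharacters ?? [],
  };
}
```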
@@ -39,6 +51,22 @@ export type Scene = {
|
||||
scenePrompt: string;
|
||||
beats: Beat[];
|
||||
entryBeatId: string;
|
||||
/**
|
||||
* Stable English slug identifying the visual scene's location + time,
|
||||
* e.g. "classroom-dusk", "rooftop-night". When the next Scene shares this
|
||||
* key, the Painter slots the previous Scene's image into Runware's
|
||||
* `referenceImages` (alongside character portraits) so the same physical
|
||||
* space stays visually consistent across cuts. (Originally planned as a
|
||||
* seedImage / img2img anchor, but FLUX.2 [klein] 9B KV does not support
|
||||
* seedImage — referenceImages serves the same purpose with the model.)
|
||||
*/
|
||||
sceneKey?: string;
|
||||
/**
|
||||
* Runware UUID of this Scene's generated image — once uploaded, subsequent
|
||||
* Scenes that match sceneKey can reference it via `referenceImages`
|
||||
* without resending base64.
|
||||
*/
|
||||
imageUuid?: string;
|
||||
};
|
||||
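The `sceneKey` and `imageUuid` comments describe reusing the previous Scene's image as a `referenceImages` entry alongside character portraits. A sketch of how that list might be assembled is given below. `collectReferenceImages` and the trimmed `SceneRef` / `PortraitRef` types are hypothetical. The cap of 4 comes from the commit message, while placing the scene image first and counting it toward the cap are assumptions.

```typescript
// Cap on reference images, per the commit message ("up to 4").
const MAX_REFERENCE_IMAGES = 4;

// Trimmed stand-ins for Scene and Character (assumption).
type SceneRef = { sceneKey?: string; imageUuid?: string };
type PortraitRef = { name: string; basePortraitUuid?: string };

// Build the referenceImages UUID list for the Painter call.
function collectReferenceImages(
  current: SceneRef,
  prior: SceneRef | undefined,
  onStage: PortraitRef[],
): string[] {
  const refs: string[] = [];
  // Reuse the prior scene image only when both scenes share a sceneKey.
  if (prior?.imageUuid && prior.sceneKey && prior.sceneKey === current.sceneKey) {
    refs.push(prior.imageUuid);
  }
  // Add each on-stage character's uploaded portrait, skipping missing ones.
  for (const character of onStage) {
    if (character.basePortraitUuid) refs.push(character.basePortraitUuid);
  }
  return refs.slice(0, MAX_REFERENCE_IMAGES);
}
```

With this ordering, a crowded scene drops the last portraits before it drops the backdrop anchor; the opposite priority would be equally easy to express by pushing the portraits first.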
|
||||
export type SceneExit =
|
||||
@@ -69,8 +97,32 @@ export type CharacterVoice = {
|
||||
|
||||
export type Character = {
|
||||
name: string;
|
||||
-  /** Free-form voice design description; must begin with explicit gender. */
-  description: string;
+  /**
+   * Chinese voice-acting direction card. Must begin with explicit gender, then
+   * age / timbre / personality / speed / accent. Fed to Xiaomi MiMo's
+   * voicedesign endpoint when the voice is first provisioned.
+   */
+  voiceDescription: string;
+  /**
+   * English appearance card: comma-separated visual attributes following
+   * Runware/FLUX prompt-engineering convention. Fed to the Painter as a
+   * character archetype anchor so the same face/outfit/style stays consistent
+   * across every scene this character appears in.
+   */
+  visualDescription?: string;
+  /**
+   * Base portrait image generated by the CharacterDesigner once, then reused
+   * as a Runware `referenceImages` entry in every subsequent scene the
+   * character appears in. Stored as base64 for client display.
+   */
+  basePortraitBase64?: string;
+  /**
+   * Runware UUID for the base portrait. Once uploaded via the image-upload
+   * endpoint, subsequent Painter calls reference this UUID instead of
+   * resending the full base64 payload.
+   */
+  basePortraitUuid?: string;
+  /** Xiaomi MiMo voice reference audio. */
|
||||
voice?: CharacterVoice;
|
||||
};
|
||||
|
||||
@@ -90,7 +142,7 @@ export type Session = {
|
||||
worldSetting: string;
|
||||
styleGuide: string;
|
||||
history: SceneHistoryEntry[];
|
||||
-  /** Character registry: accumulates across scenes; voices persist for reuse. */
+  /** Character registry: accumulates across scenes; voices + portraits persist for reuse. */
|
||||
characters: Character[];
|
||||
};
|
||||
|
||||
@@ -145,7 +197,7 @@ export type StartResponse = {
|
||||
sessionId: string;
|
||||
scene: Scene;
|
||||
imageBase64: string;
|
||||
-  /** Character registry with voice references provisioned for new speakers. */
+  /** Character registry with voice references + visual cards provisioned. */
|
||||
characters: Character[];
|
||||
};
|
||||
|
||||
@@ -165,11 +217,6 @@ export type SceneResponse = {
|
||||
// /api/beat-audio — lazily synthesize one beat's voice. Client fires this
|
||||
// per beat after a scene loads; server has a per-call timeout so MiMo
|
||||
// tail-latency cannot block the UI. A null audio response means "play silent."
|
||||
//
|
||||
// Payload deliberately slim: just the line to speak and the speaker's voice
|
||||
// reference. The client extracts the voice from its local session.characters
|
||||
// before posting — sending the full Session would force ~160KB of base64 per
|
||||
// OTHER speaker plus the entire scene history to ride along for nothing.
|
||||
export type BeatAudioRequest = {
|
||||
beat: {
|
||||
id: string;
|
||||
|
||||