
Zonghao Yuan def1b25bd9 feat(engine): multi-agent character consistency pipeline (#6)

* feat(types): Character.voiceDescription rename + visual fields + Scene.sceneKey

Prepares the type surface for the multi-agent scene pipeline:

- Character.description → voiceDescription (clearer pairing with new visualDescription)
- Character gains visualDescription (English appearance card for Painter) +
  basePortraitBase64 + basePortraitUuid (for Runware referenceImages reuse)
- Scene gains sceneKey (English slug for cross-scene img2img continuity) +
  imageUuid (Runware UUID of the scene's rendered image for cheap seedImage
  reuse on subsequent same-sceneKey calls)
- Beat gains activeCharacters[] so the Cinematographer can read which
  characters are on-screen + their poses when composing the establishing shot
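
The type changes listed above could look roughly like the sketch below. Only the field names come from this commit message; the field types, the optionality markers, the `ActiveCharacter` shape, and the `findPriorScene` helper are assumptions made for illustration, not the actual contents of packages/types.

```typescript
// Sketch only: field names are from the commit message, types are assumed.
interface Character {
  name: string;
  voiceDescription: string;      // renamed from `description`
  visualDescription?: string;    // English appearance card for the Painter
  basePortraitBase64?: string;
  basePortraitUuid?: string;     // Runware UUID, reusable in referenceImages
}

// Hypothetical shape for one entry of Beat.activeCharacters[].
interface ActiveCharacter {
  name: string;
  pose: string;
}

interface Beat {
  text: string;
  speaker?: string;
  activeCharacters: ActiveCharacter[];
}

interface Scene {
  sceneKey: string;    // English slug shared by scenes in the same location
  imageUuid?: string;  // Runware UUID of this scene's rendered image
  beats: Beat[];
}

// Hypothetical helper: find the most recent scene with the same sceneKey
// that already has an uploaded image, so its UUID can be reused.
function findPriorScene(history: Scene[], sceneKey: string): Scene | undefined {
  for (let i = history.length - 1; i >= 0; i--) {
    const scene = history[i];
    if (scene.sceneKey === sceneKey && scene.imageUuid !== undefined) {
      return scene;
    }
  }
  return undefined;
}
```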

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ai-client): generateImage img2img + multi-reference options + uploadImage

Extends the Runware adapter to support the two anchoring mechanisms FLUX.2
[klein] 9B KV needs for character + scene visual consistency:

- generateImage gains optional { seedImage, referenceImages, strength }:
  seedImage drives img2img (single starting image, sceneKey continuity),
  referenceImages drives multi-reference anchoring (up to 4 character
  portraits, capped per Runware spec). Default strength 0.85 — FLUX
  ignores strength < 0.8.
- uploadImage POSTs a base64 to Runware's imageUpload taskType and
  returns the UUID, so portraits/scene snapshots can be referenced by
  UUID on subsequent calls instead of resending base64 every scene.
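
A minimal sketch of how the new options might be normalized before a request is built. The option names, the cap of 4, and the 0.85 default are from the list above; the `GenerateImageOptions` type, the `normalizeImageOptions` helper, and the choice to attach `strength` only when a `seedImage` is present are assumptions, and nothing here is the real Runware adapter code.

```typescript
// Hypothetical option shape; only the field names come from the commit message.
interface GenerateImageOptions {
  seedImage?: string;          // starting image for img2img
  referenceImages?: string[];  // character portraits used as anchors
  strength?: number;
}

const MAX_REFERENCE_IMAGES = 4; // cap stated in the commit message
const DEFAULT_STRENGTH = 0.85;  // default stated in the commit message

function normalizeImageOptions(opts: GenerateImageOptions = {}): GenerateImageOptions {
  const out: GenerateImageOptions = {};
  if (opts.referenceImages && opts.referenceImages.length > 0) {
    // Keep the first four references and drop the rest.
    out.referenceImages = opts.referenceImages.slice(0, MAX_REFERENCE_IMAGES);
  }
  if (opts.seedImage) {
    out.seedImage = opts.seedImage;
    // Assumption: strength only matters when a seed image is supplied.
    out.strength = opts.strength ?? DEFAULT_STRENGTH;
  }
  return out;
}
```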

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(engine): multi-agent scene pipeline (Writer→CharDesigner+Cinematographer→Painter)

Replaces the single-LLM directScene with a four-agent pipeline that
specializes each concern and parallelizes the slow parts. Adopts the
core idea from #4 (multi-agent dispatch + character visual consistency)
and grafts it onto the Scene/Beat architecture introduced in #2.

Pipeline per Scene (~9-12s critical path with parallelization):

  Writer LLM (sequential, ~3s)
    │ outputs: sceneSummary + sceneKey + beats[] (each beat carries
    │           activeCharacters[] with poses)
    │
    ├─ CharacterDesigner LLM × N new chars (parallel)
    │     │ outputs: { visualDescription (English appearance card),
    │     │            voiceDescription (Chinese voice card) }
    │     ├─ FLUX portrait gen → upload → UUID (parallel within agent)
    │     └─ Xiaomi MiMo voicedesign provision (parallel within agent)
    │
    └─ Cinematographer LLM (parallel with CharacterDesigner)
          outputs: { shotType, integratedPrompt (English: composition +
                     camera position + character blocking) }

  Painter (FLUX img2img + referenceImages, ~1-3s)
    inputs: integratedPrompt + onStageCharacters' archetype block
            + (optional) prior sceneKey-hit scene as seedImage
            + (optional) character portrait UUIDs as referenceImages
    fallback chain: A) both anchors → B) refs only (preserves characters) →
                    C) seed only (preserves background) → D) pure t2i
    output uploaded → Scene.imageUuid for the next sceneKey hop
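
The dispatch order in the diagram can be sketched as follows. This is an illustration under assumptions: the `Agents` interface, its method names, and `runScenePipeline` are hypothetical stand-ins, and the real agents take and return richer data than shown here.

```typescript
// Hypothetical agent signatures, reduced to what the ordering needs.
interface WriterOutput {
  sceneKey: string;
  newCharacters: string[];
}

interface Agents {
  writer(input: string): Promise<WriterOutput>;
  designCharacter(name: string): Promise<{ name: string; portraitUuid: string }>;
  cinematographer(written: WriterOutput): Promise<{ integratedPrompt: string }>;
  painter(prompt: string, referenceImages: string[]): Promise<string>;
}

async function runScenePipeline(agents: Agents, input: string) {
  // Step 1: the Writer runs alone because everything else depends on it.
  const written = await agents.writer(input);

  // Step 2: one CharacterDesigner per new character, all in parallel
  // with each other and with the Cinematographer.
  const [designed, shot] = await Promise.all([
    Promise.all(written.newCharacters.map((name) => agents.designCharacter(name))),
    agents.cinematographer(written),
  ]);

  // Step 3: the Painter needs both results; references are capped at 4.
  const refs = designed.map((d) => d.portraitUuid).slice(0, 4);
  const imageUuid = await agents.painter(shot.integratedPrompt, refs);

  return { sceneKey: written.sceneKey, imageUuid };
}
```

With this shape the critical path is Writer, then the slower of the two parallel branches, then Painter, which is consistent with the ~9-12s estimate above.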

Why this carving:
- Writer focuses purely on narrative (drops the voice-design duty
  staging's DIRECTOR_SYSTEM was carrying as a side concern).
- CharacterDesigner bundles visual + voice so the agent that thinks
  "who is this character" produces internally-consistent appearance +
  vocal personality (split agents tend to diverge).
- Cinematographer doesn't need character visualDescriptions —
  Painter appends archetypes after — so it parallelizes with
  CharacterDesigner.
- sceneKey enables cross-scene backdrop continuity that Scene/Beat
  doesn't cover (Scene/Beat only reuses backdrop WITHIN a scene's
  beats; sceneKey reuses across scenes that share a location).

Other changes:
- voice.ts loses provisionVoicesForScene (moved into CharacterDesigner);
  keeps synthesizeBeat for the lazy per-beat /api/beat-audio path.
- renderer.ts deleted (replaced by agents/painter.ts).
- directInsertBeat (vision-driven in-scene exploration) stays single-
  LLM — it forbids new characters and produces no image, so multi-
  agent doesn't apply.

apps/web is unchanged: orchestrator.ts keeps the same exports
(startSession / requestScene / visionDecide / requestInsertBeat /
requestBeatAudio) with identical request/response shapes.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): Pattern B player POV + JSON repair + drop seedImage tier

Three hotfixes surfaced by manual end-to-end testing of the multi-agent
pipeline.

F1 — Player viewpoint (galgame Pattern B):
  - Writer accepts speaker="你" for player dialog (renders in dialog box,
    never TTS'd because no Character record exists for "你"). Filter POV
    variants (玩家/我/主角/protagonist/player/I/me/...) from
    activeCharacters so CharacterDesigner never wastes API calls on the
    player. Two-layer defense: explicit prompt rule in WRITER_SYSTEM +
    code normalization (POV_VARIANTS set, isPovName, normalizeSpeakerName).
  - Cinematographer and Painter prompts gain "player never in frame" rule
    so the player never appears in any rendered scene.
  - Cinematographer gains dynamic camera policy driven by the entry beat's
    speaker: NPC-speaker → close-up looking toward camera; "你"-speaker →
    medium shot of attentive NPC; no speaker → wide establishing shot.
  - director.ts filters POV from orphanSpeakers so provisionVoiceForName
    never fires for "你".
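
A rough sketch of the code-side half of the two-layer defense, plus the camera policy. `POV_VARIANTS`, `isPovName`, and `normalizeSpeakerName` are named in this commit, but their bodies below are guesses; the variant list contains only the names quoted above (the real set is longer), and `pickShot` with its `Shot` type is a hypothetical name.

```typescript
const PLAYER = "你";

// Only the variants quoted in the commit message; the real set has more.
const POV_VARIANTS = new Set([
  "你", "玩家", "我", "主角", "protagonist", "player", "i", "me",
]);

function isPovName(name: string): boolean {
  return POV_VARIANTS.has(name.trim().toLowerCase());
}

// Collapses every POV variant to "你"; empty names become undefined.
function normalizeSpeakerName(name: string | undefined): string | undefined {
  if (name === undefined) return undefined;
  const trimmed = name.trim();
  if (trimmed === "") return undefined;
  return isPovName(trimmed) ? PLAYER : trimmed;
}

// Removes the player from a character list so no portrait or voice
// is ever provisioned for them.
function withoutPlayer<T extends { name: string }>(chars: T[]): T[] {
  return chars.filter((c) => !isPovName(c.name));
}

type Shot = "close-up" | "medium" | "wide";

// Camera policy keyed on the entry beat's speaker.
function pickShot(entrySpeaker: string | undefined): Shot {
  const speaker = normalizeSpeakerName(entrySpeaker);
  if (speaker === undefined) return "wide";      // no speaker: establishing shot
  if (speaker === PLAYER) return "medium";       // player speaks: attentive NPC
  return "close-up";                             // NPC speaks: facing the camera
}
```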

F2 — JSON parsing robustness:
  - parseJsonLoose gains a 4th repair tier: strip JS-style comments, strip
    trailing commas, insert missing commas between adjacent objects /
    arrays / quoted values. Logs the first 800 chars of raw LLM output
    when all repair attempts fail, so we can see what the model emitted.

F3 — Drop seedImage, use referenceImages for prior scene:
  - FLUX.2 [klein] 9B KV does not support seedImage (img2img). Removed
    Tier A (seedImage+refs) and Tier C (seedImage only) from the Painter
    degradation chain. New layout: prior scene's image slots into
    referenceImages[0] for spatial continuity, character portraits fill
    slots 1-3 (Runware caps at 4 total). Cinematographer instructed to
    emphasize continuity when sceneKey matches a prior scene.
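
The new slot layout could be expressed as the small helper below. `buildReferenceImages` is a hypothetical name, and letting portraits use all four slots when there is no prior scene is an assumption; the commit message only specifies the case where a prior scene exists.

```typescript
const MAX_REFERENCE_IMAGES = 4; // Runware cap stated above

// Slot 0 holds the prior scene's image when one exists; character
// portraits take whatever slots remain, in order.
function buildReferenceImages(
  priorSceneUuid: string | undefined,
  portraitUuids: string[],
): string[] {
  const refs: string[] = [];
  if (priorSceneUuid) {
    refs.push(priorSceneUuid);
  }
  for (const uuid of portraitUuids) {
    if (refs.length >= MAX_REFERENCE_IMAGES) break;
    refs.push(uuid);
  }
  return refs;
}
```

An empty result would correspond to the pure text-to-image tier of the degradation chain.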

All five package typechecks pass.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): address Copilot review feedback on #6

Three targeted fixes from PR #6 Copilot review.

F4 — Stale seedImage/img2img docstrings
  Four locations still referenced the original img2img design after F3
  switched to referenceImages-based spatial continuity:
  - types/index.ts:57   Scene.sceneKey docstring
  - types/index.ts:63   Scene.imageUuid docstring
  - director.ts:34      pipeline diagram in module block comment
  - director.ts:128     directScene JSDoc
  Doc-only changes; misleading wording corrected to mention referenceImages.
  (The design-rationale comment in pickPriorSceneReference is kept — it
  explains WHY we don't use seedImage and is load-bearing context.)

F5 — Remove JS-comment stripping from JSON repair pass
  parseJsonLoose's repair tier previously stripped `// ...` and
  `/* ... */` across the entire text, which would corrupt JSON string
  values containing URLs (e.g. "https://example.com" → "https:"). Since
  LLMs in `responseFormat: "json_object"` mode essentially never emit
  comments, dropping the comment-stripping step is a net win for safety.
  Trailing-comma and missing-comma repair (the high-frequency failures)
  are kept.
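
For illustration, the repairs that remain after F5 might look like the sketch below. The function names match those mentioned in this PR, but the bodies are assumptions: the real `parseJsonLoose` has more tiers than the two shown, and here the first 800 characters of raw output go into the thrown error rather than a log line. The regexes are deliberately naive and can still misfire on string values that contain `}{` or similar sequences.

```typescript
// Naive structural repairs. Comment stripping is intentionally absent,
// so "https://..." inside string values is left alone.
function repairJsonString(text: string): string {
  return text
    // Trailing commas before a closing brace or bracket.
    .replace(/,(\s*[}\]])/g, "$1")
    // Missing comma between adjacent objects or arrays: "} {" or "] [".
    .replace(/([}\]])(\s*)(?=[{\[])/g, "$1,$2")
    // Missing comma between a quoted value and the next quoted token
    // on the following line.
    .replace(/"([ \t]*\r?\n\s*)"/g, '",$1"');
}

function parseJsonLoose(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch {
    // Fall through to the repair tier.
  }
  try {
    return JSON.parse(repairJsonString(raw));
  } catch {
    // Surface the head of the raw output so the failure can be inspected.
    throw new Error("JSON repair failed; raw output: " + raw.slice(0, 800));
  }
}
```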

F6 — Pattern B parity on the insert-beat path
  Previously: directInsertBeat's INSERT_BEAT_SYSTEM forbade any speaker
  not in session.characters, and the orchestrator's unregistered-speaker
  guard demoted such lines to narration. This meant the player could not
  speak via speaker="你" in transient in-scene beats — inconsistent with
  the Writer path.
  Fix:
  - INSERT_BEAT_SYSTEM prompt now allows speaker="你" (NPC name OR "你")
    and rejects other POV variants
  - directInsertBeat applies normalizeSpeakerName to the LLM output, same
    as the Writer path, so POV variants collapse to "你"
  - lineDelivery is dropped when speaker="你" (no TTS for player)
  - orchestrator's unregistered-speaker guard adds a `speaker !== "你"`
    exception so Pattern B player dialog passes through
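
  A sketch of the guard after this fix, assuming a simplified beat shape. `BeatLine` and `guardSpeaker` are hypothetical names, and the input is assumed to have already passed through normalizeSpeakerName.

```typescript
// Hypothetical, reduced beat shape.
interface BeatLine {
  text: string;
  speaker?: string;
  lineDelivery?: string;
}

const PLAYER = "你";

function guardSpeaker(line: BeatLine, registered: ReadonlySet<string>): BeatLine {
  if (line.speaker === undefined) {
    return line; // already narration
  }
  if (line.speaker === PLAYER) {
    // Player dialog passes through, without lineDelivery (no TTS).
    return { text: line.text, speaker: PLAYER };
  }
  if (!registered.has(line.speaker)) {
    // Unregistered speaker: demote the line to narration.
    return { text: line.text };
  }
  return line;
}
```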

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(engine): drop "JS-style comments" from parseJsonLoose header

The function header listed JS-style comments as a step-4 repair, but F5
already removed comment stripping from `repairJsonString` because the
regex would corrupt URLs inside JSON string values. The inner function's
comment was updated then; this header was missed.

Doc-only sync from second-round Copilot review on #6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: QiChen88 <2291969160@qq.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-29 13:30:24 +08:00

apps/web

feat: Runware FLUX.2 image + lazy per-beat TTS (#5)

2026-05-28 23:43:51 +08:00

packages

feat(engine): multi-agent character consistency pipeline (#6)

2026-05-29 13:30:24 +08:00

.gitignore

fix(web): tame Next.js 16 dev server CPU runaway

2026-05-10 10:12:54 +08:00

package.json

refactor: rename project DADA → 云梦 (slug: yume)

2026-05-24 10:14:14 +08:00

pnpm-lock.yaml

feat(tts): Xiaomi MiMo per-beat voice + MOCK_IMAGE testing aid (#3)

2026-05-28 20:45:21 +08:00

pnpm-workspace.yaml

Initial commit: AI-driven visual novel scaffold

2026-05-09 13:29:58 +08:00

README.md

feat: Runware FLUX.2 image + lazy per-beat TTS (#5)

2026-05-28 23:43:51 +08:00

tsconfig.base.json

Initial commit: AI-driven visual novel scaffold

2026-05-09 13:29:58 +08:00

vercel.json

feat: prefetch, vision split, provider adapter, UI polish

2026-05-12 19:38:03 +08:00

README.md

云梦

An AI-driven visual novel painted by an AI, one scene at a time. You talk and explore within a scene; when the story turns a corner, it paints the next. You click. It paints. The story unfolds.

How it works

The story unfolds as a sequence of scenes. Each scene is one AI-painted background plus a short tree of beats — moments of narration, dialogue, and the occasional choice. You tap through a scene's beats and the image stays put; only when a choice leads somewhere genuinely new — another place, a new point of view, a jump in time — does the AI paint the next scene.

entering a scene
        │
        ▼
1. Text LLM     directs the whole scene at once — a background prompt
                plus a tree of beats (narration / dialogue / choices)
        │
        ▼
2. Image model  paints the background once, 16:9, no UI baked in
        │
        ▼
[ tap through beats — no model calls, instant ]
        │
        ├─ in-scene choice ──────▶ jump to another beat (instant)
        │
        └─ scene-change choice ──▶ the next scene
                                   (usually pre-generated — see below)

While you're reading one scene, the engine speculatively generates the scenes your choices could lead to — and, for unavoidable next steps, the scene after that. By the time you pick a direction, its image is usually already painted, so the cut feels instant.
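
One common way to implement this kind of speculative generation is a cache of in-flight promises keyed by scene, sketched below. This is not the engine's actual code: `ScenePrefetcher` and its methods are hypothetical names, and the real engine also handles prefetch depth, which is omitted here.

```typescript
// Deduplicating prefetch cache: each scene is generated at most once,
// whether it was requested speculatively or by a real choice.
class ScenePrefetcher<T> {
  private readonly inflight = new Map<string, Promise<T>>();
  private readonly generate: (sceneId: string) => Promise<T>;

  constructor(generate: (sceneId: string) => Promise<T>) {
    this.generate = generate;
  }

  // Starts generation if needed and returns the shared promise.
  prefetch(sceneId: string): Promise<T> {
    let pending = this.inflight.get(sceneId);
    if (!pending) {
      pending = this.generate(sceneId);
      this.inflight.set(sceneId, pending);
      // Evict failures so a later request can retry.
      pending.catch(() => {
        this.inflight.delete(sceneId);
      });
    }
    return pending;
  }

  // Fire-and-forget warm-up for every scene a choice could lead to.
  prefetchAll(sceneIds: string[]): void {
    for (const id of sceneIds) {
      void this.prefetch(id);
    }
  }
}
```

When the player picks a direction, calling `prefetch` again with that scene's id returns the promise that is already running (or already resolved), which is what makes the cut feel instant.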

Clicking the background itself (not a button) routes through a vision model: it reads where you tapped and decides whether you're exploring the current scene (it inserts a beat — no new image) or moving on (a new scene).

There is no traditional game UI baked into the art. The AI paints the world in whatever style you pick — "stick figure on grid paper" or "cyberpunk noir" — and the dialogue panel and choice buttons are a light HTML layer drawn on top, tuned to sit over the scene.

One-click deploy

After deploy, set the nine environment variables (see below) in your Vercel project. That's it.

Environment variables

Three providers, all independently configurable. Text and Vision accept any OpenAI-compatible endpoint (OpenAI, Anthropic via OpenAI-compat proxy, Gemini, OpenRouter, DeepSeek, local Ollama, …). Image goes to Runware (its own task-array protocol, not OpenAI-compatible).

| Provider | Variables | Recommended |
| --- | --- | --- |
| Text · story director | `TEXT_BASE_URL` `TEXT_API_KEY` `TEXT_MODEL` | `claude-opus-4-7` via Anthropic |
| Image · UI renderer | `IMAGE_BASE_URL` `IMAGE_API_KEY` `IMAGE_MODEL` | `runware:400@6` (FLUX.2 [klein] 9B KV) via Runware |
| Vision · click reader | `VISION_BASE_URL` `VISION_API_KEY` `VISION_MODEL` | `gemini-3-flash` via Google |

See apps/web/.env.example for the exact shape.
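
As an illustration of how the nine variables group into three provider configs, here is a minimal loader sketch. The variable names are the ones in the table; `loadProviderConfigs`, `ProviderConfig`, and the fail-fast behavior are assumptions and not necessarily how the app reads its environment.

```typescript
type Env = Record<string, string | undefined>;

interface ProviderConfig {
  baseUrl: string;
  apiKey: string;
  model: string;
}

const PROVIDERS = ["TEXT", "IMAGE", "VISION"] as const;
type ProviderName = (typeof PROVIDERS)[number];

// Reads <PROVIDER>_BASE_URL, <PROVIDER>_API_KEY and <PROVIDER>_MODEL for
// each provider and reports every missing variable in a single error.
function loadProviderConfigs(env: Env): Record<ProviderName, ProviderConfig> {
  const missing: string[] = [];
  const read = (key: string): string => {
    const value = env[key];
    if (!value) {
      missing.push(key);
      return "";
    }
    return value;
  };

  const configs = {} as Record<ProviderName, ProviderConfig>;
  for (const provider of PROVIDERS) {
    configs[provider] = {
      baseUrl: read(`${provider}_BASE_URL`),
      apiKey: read(`${provider}_API_KEY`),
      model: read(`${provider}_MODEL`),
    };
  }

  if (missing.length > 0) {
    throw new Error("Missing environment variables: " + missing.join(", "));
  }
  return configs;
}
```

In a Node process this would be called as `loadProviderConfigs(process.env)`.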

Local development

Requires Node 20+ and pnpm 9+.

pnpm install
cp apps/web/.env.example apps/web/.env.local
# fill in the nine env vars
pnpm dev
# open http://localhost:3000

Project layout

yume/
├── apps/web/              Next.js 16 app — pages + API routes
└── packages/
    ├── types/             shared TypeScript types
    ├── ai-client/         unified OpenAI-compatible clients
    └── engine/            three-stage AI orchestration (open core)

packages/engine is the open core — pure TS, no Next.js or browser dependency. Import it directly to build your own visual-novel front-end (Tauri, Electron, CLI, anywhere).

Cost & limits

With the recommended trio, each scene is dominated by the text-LLM call. The FLUX.2 [klein] 9B KV image is roughly $0.001 per scene (1792×1024, 4 steps, sub-second); the text call is the rest. Tapping through a scene's beats is free. To keep transitions instant, the engine also pre-generates scenes you might pick but don't — so real spend runs somewhat higher than the scenes you actually see. There is no rate limiting or auth out of the box — if you make your deployment public, your bill will reflect that. Add limits (and consider lowering the prefetch depth) before sharing widely.
