Zonghao Yuan def1b25bd9 feat(engine): multi-agent character consistency pipeline (#6)
* feat(types): Character.voiceDescription rename + visual fields + Scene.sceneKey

Prepares the type surface for the multi-agent scene pipeline:

- Character.description → voiceDescription (clearer pairing with new visualDescription)
- Character gains visualDescription (English appearance card for Painter) +
  basePortraitBase64 + basePortraitUuid (for Runware referenceImages reuse)
- Scene gains sceneKey (English slug for cross-scene img2img continuity) +
  imageUuid (Runware UUID of the scene's rendered image for cheap seedImage
  reuse on subsequent same-sceneKey calls)
- Beat gains activeCharacters[] so the Cinematographer can read which
  characters are on-screen + their poses when composing the establishing shot
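
For illustration, the type surface described above might look roughly like this. Only the field names come from this commit message; the field types, optionality, the `name` and `text` fields, the `ActiveCharacter` shape, and both helper functions are assumptions.

```typescript
// Sketch only: field names are from the commit message, everything
// else (types, optionality, helpers) is assumed for illustration.
interface Character {
  name: string;                 // assumed identifier field
  voiceDescription: string;     // renamed from `description`
  visualDescription?: string;   // English appearance card for the Painter
  basePortraitBase64?: string;  // portrait image data
  basePortraitUuid?: string;    // Runware UUID, reusable via referenceImages
}

interface ActiveCharacter {     // hypothetical shape: who is on-screen, and how
  name: string;
  pose: string;
}

interface Beat {
  text: string;                 // assumed
  activeCharacters: ActiveCharacter[];
}

interface Scene {
  sceneKey: string;             // English slug shared by scenes at one location
  imageUuid?: string;           // Runware UUID of the rendered scene image
  beats: Beat[];
}

// True when a character has no uploaded portrait to reference by UUID yet.
function needsPortrait(character: Character): boolean {
  return character.basePortraitUuid === undefined;
}

// Unique names of everyone on-screen anywhere in the scene, in first-seen order.
function onScreenNames(scene: Scene): string[] {
  const names = new Set<string>();
  for (const beat of scene.beats) {
    for (const c of beat.activeCharacters) names.add(c.name);
  }
  return Array.from(names);
}
```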

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ai-client): generateImage img2img + multi-reference options + uploadImage

Extends the Runware adapter to support the two anchoring mechanisms FLUX.2
[klein] 9B KV needs for character + scene visual consistency:

- generateImage gains optional { seedImage, referenceImages, strength }:
  seedImage drives img2img (single starting image, sceneKey continuity),
  referenceImages drives multi-reference anchoring (up to 4 character
  portraits, capped per Runware spec). Default strength 0.85 — FLUX
  ignores strength < 0.8.
- uploadImage POSTs a base64 to Runware's imageUpload taskType and
  returns the UUID, so portraits/scene snapshots can be referenced by
  UUID on subsequent calls instead of resending base64 every scene.
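
A minimal sketch of how the optional anchoring parameters might be folded into a request. The `ImageOptions` and `ImageTask` shapes and the `buildImageTask` helper are hypothetical; only the option names, the cap of 4 reference images, and the default strength of 0.85 come from the text above. Attaching `strength` only when a `seedImage` is present is also an assumption, and the real Runware payload may differ.

```typescript
interface ImageOptions {
  seedImage?: string;          // starting image for img2img
  referenceImages?: string[];  // character portrait UUIDs
  strength?: number;
}

// Hypothetical subset of a request body, not the actual Runware schema.
interface ImageTask {
  positivePrompt: string;
  seedImage?: string;
  referenceImages?: string[];
  strength?: number;
}

const MAX_REFERENCE_IMAGES = 4; // cap stated in the commit message
const DEFAULT_STRENGTH = 0.85;  // default stated in the commit message

function buildImageTask(prompt: string, opts: ImageOptions = {}): ImageTask {
  const task: ImageTask = { positivePrompt: prompt };
  if (opts.referenceImages && opts.referenceImages.length > 0) {
    // Silently keep only the first four references.
    task.referenceImages = opts.referenceImages.slice(0, MAX_REFERENCE_IMAGES);
  }
  if (opts.seedImage) {
    task.seedImage = opts.seedImage;
    task.strength = opts.strength ?? DEFAULT_STRENGTH;
  }
  return task;
}
```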

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(engine): multi-agent scene pipeline (Writer→CharDesigner+Cinematographer→Painter)

Replaces the single-LLM directScene with a four-agent pipeline that
specializes each concern and parallelizes the slow parts. Adopts the
core idea from #4 (multi-agent dispatch + character visual consistency)
and grafts it onto the Scene/Beat architecture introduced in #2.

Pipeline per Scene (~9-12s critical path with parallelization):

  Writer LLM (sequential, ~3s)
    │ outputs: sceneSummary + sceneKey + beats[] (each beat carries
    │           activeCharacters[] with poses)
    │
    ├─ CharacterDesigner LLM × N new chars (parallel)
    │     │ outputs: { visualDescription (English appearance card), voiceDescription (Chinese voice card) }
    │     ├─ FLUX portrait gen → upload → UUID (parallel within agent)
    │     └─ Xiaomi MiMo voicedesign provision (parallel within agent)
    │
    └─ Cinematographer LLM (parallel with CharacterDesigner)
          outputs: { shotType, integratedPrompt (English composition + camera position + character blocking) }

  Painter (FLUX img2img + referenceImages, ~1-3s)
    inputs: integratedPrompt + onStageCharacters' archetype block
            + (optional) prior sceneKey-hit scene as seedImage
            + (optional) character portrait UUIDs as referenceImages
    fallback chain: A) both anchors → B) refs only (preserves characters) →
                    C) seed only (preserves background) → D) pure t2i
    output uploaded → Scene.imageUuid for the next sceneKey hop
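
The dispatch order in the diagram can be sketched as follows. The `Agents` interface and every signature in it are hypothetical; only the stage order and the parallelism (CharacterDesigner calls running alongside the Cinematographer) come from the diagram.

```typescript
// Hypothetical agent signatures, reduced to what the ordering needs.
interface WriterOutput { sceneKey: string; newCharacters: string[]; }
interface CharacterCard { name: string; portraitUuid: string; }
interface Shot { shotType: string; integratedPrompt: string; }

interface Agents {
  writer(input: string): Promise<WriterOutput>;
  designCharacter(name: string): Promise<CharacterCard>;
  cinematographer(script: WriterOutput): Promise<Shot>;
  painter(shot: Shot, cards: CharacterCard[]): Promise<string>;
}

async function runScenePipeline(agents: Agents, input: string): Promise<string> {
  // Stage 1: the Writer runs alone; every later stage depends on its output.
  const script = await agents.writer(input);

  // Stage 2: one CharacterDesigner call per new character, all running
  // in parallel with the Cinematographer.
  const [cards, shot] = await Promise.all([
    Promise.all(script.newCharacters.map((name) => agents.designCharacter(name))),
    agents.cinematographer(script),
  ]);

  // Stage 3: the Painter needs both the shot and the character cards.
  return agents.painter(shot, cards);
}
```

With this shape the critical path is Writer + max(CharacterDesigner, Cinematographer) + Painter rather than the sum of all four.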

Why this carving:
- Writer focuses purely on narrative (drops the voice-design duty
  staging's DIRECTOR_SYSTEM was carrying as a side concern).
- CharacterDesigner bundles visual + voice so the agent that thinks
  "who is this character" produces internally-consistent appearance +
  vocal personality (split agents tend to diverge).
- Cinematographer doesn't need character visualDescriptions —
  Painter appends archetypes after — so it parallelizes with
  CharacterDesigner.
- sceneKey enables cross-scene backdrop continuity that Scene/Beat
  doesn't cover (Scene/Beat only reuses backdrop WITHIN a scene's
  beats; sceneKey reuses across scenes that share a location).

Other changes:
- voice.ts loses provisionVoicesForScene (moved into CharacterDesigner);
  keeps synthesizeBeat for the lazy per-beat /api/beat-audio path.
- renderer.ts deleted (replaced by agents/painter.ts).
- directInsertBeat (vision-driven in-scene exploration) stays single-
  LLM — it forbids new characters and produces no image, so multi-
  agent doesn't apply.

apps/web is unchanged: orchestrator.ts keeps the same exports
(startSession / requestScene / visionDecide / requestInsertBeat /
requestBeatAudio) with identical request/response shapes.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): Pattern B player POV + JSON repair + drop seedImage tier

Three hotfixes surfaced by manual end-to-end testing of the multi-agent
pipeline.

F1 — Player viewpoint (galgame Pattern B):
  - Writer accepts speaker="你" for player dialog (renders in dialog box,
    never TTS'd because no Character record exists for "你"). Filter POV
    variants (玩家/我/主角/protagonist/player/I/me/...) from
    activeCharacters so CharacterDesigner never wastes API calls on the
    player. Two-layer defense: explicit prompt rule in WRITER_SYSTEM +
    code normalization (POV_VARIANTS set, isPovName, normalizeSpeakerName).
  - Cinematographer and Painter prompts gain "player never in frame" rule
    so the player never appears in any rendered scene.
  - Cinematographer gains dynamic camera policy driven by the entry beat's
    speaker: NPC-speaker → close-up looking toward camera; "你"-speaker →
    medium shot of attentive NPC; no speaker → wide establishing shot.
  - director.ts filters POV from orphanSpeakers so provisionVoiceForName
    never fires for "你".
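
A sketch of the code-side half of F1. The names `POV_VARIANTS`, `isPovName`, and `normalizeSpeakerName` appear above, but their bodies here are guesses, the variant list is only the part the message spells out (the rest is elided there), and `filterActiveCharacters` and `pickShot` are hypothetical helpers.

```typescript
const PLAYER = "你";

// Partial list: the commit message elides the remaining variants.
const POV_VARIANTS = new Set([
  "你", "玩家", "我", "主角", "protagonist", "player", "i", "me",
]);

function isPovName(name: string): boolean {
  return POV_VARIANTS.has(name.trim().toLowerCase());
}

// Collapse every POV variant to the canonical player speaker.
function normalizeSpeakerName(name: string): string {
  return isPovName(name) ? PLAYER : name.trim();
}

// Keep the player out of activeCharacters so no portrait or voice is
// ever provisioned for them.
function filterActiveCharacters(names: string[]): string[] {
  return names.filter((n) => !isPovName(n));
}

type ShotType = "close-up" | "medium" | "wide";

// Camera policy keyed on the entry beat's speaker.
function pickShot(entrySpeaker: string | undefined): ShotType {
  if (!entrySpeaker) return "wide";                        // no speaker: establishing shot
  return isPovName(entrySpeaker) ? "medium" : "close-up";  // player vs NPC
}
```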

F2 — JSON parsing robustness:
  - parseJsonLoose gains a 4th repair tier: strip JS-style comments, strip
    trailing commas, insert missing commas between adjacent objects /
    arrays / quoted values. Logs the first 800 chars of raw LLM output
    when all repair attempts fail, so we can see what the model emitted.

F3 — Drop seedImage, use referenceImages for prior scene:
  - FLUX.2 [klein] 9B KV does not support seedImage (img2img). Removed
    Tier A (seedImage+refs) and Tier C (seedImage only) from the Painter
    degradation chain. New layout: prior scene's image slots into
    referenceImages[0] for spatial continuity, character portraits fill
    slots 1-3 (Runware caps at 4 total). Cinematographer instructed to
    emphasize continuity when sceneKey matches a prior scene.
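
The slot layout described in F3 could be assembled like this. `buildReferenceImages` and `REFERENCE_SLOT_LIMIT` are hypothetical names; letting portraits start at slot 0 when there is no prior scene is an assumption.

```typescript
const REFERENCE_SLOT_LIMIT = 4; // Runware cap stated above

function buildReferenceImages(
  priorSceneUuid: string | undefined,
  portraitUuids: string[],
): string[] {
  const refs: string[] = [];
  // Slot 0: the prior scene sharing this sceneKey, for spatial continuity.
  if (priorSceneUuid) refs.push(priorSceneUuid);
  // Remaining slots: character portraits, until the cap is reached.
  for (const uuid of portraitUuids) {
    if (refs.length >= REFERENCE_SLOT_LIMIT) break;
    refs.push(uuid);
  }
  return refs;
}
```

For example, a prior scene plus four portraits yields the scene UUID followed by the first three portraits; the fourth portrait is dropped.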

All five package typechecks pass.

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): address Copilot review feedback on #6

Three targeted fixes from PR #6 Copilot review.

F4 — Stale seedImage/img2img docstrings
  Four locations still referenced the original img2img design after F3
  switched to referenceImages-based spatial continuity:
  - types/index.ts:57   Scene.sceneKey docstring
  - types/index.ts:63   Scene.imageUuid docstring
  - director.ts:34      pipeline diagram in module block comment
  - director.ts:128     directScene JSDoc
  Doc-only changes; misleading wording corrected to mention referenceImages.
  (The design-rationale comment in pickPriorSceneReference is kept — it
  explains WHY we don't use seedImage and is load-bearing context.)

F5 — Remove JS-comment stripping from JSON repair pass
  parseJsonLoose's repair tier previously stripped `// ...` and
  `/* ... */` across the entire text, which would corrupt JSON string
  values containing URLs (e.g. "https://example.com" → "https:"). Since
  LLMs in `responseFormat: "json_object"` mode essentially never emit
  comments, dropping the comment-stripping step is a net win for safety.
  Trailing-comma and missing-comma repair (the high-frequency failures)
  are kept.
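
To make the F5 trade-off concrete, here is a hypothetical reimplementation (not the project's actual `repairJsonString` or `parseJsonLoose`). `stripLineComments` reproduces the URL corruption; `repairCommas` keeps only the comma repairs; `parseWithRepair` collapses the earlier tiers into one strict parse and logs the first 800 characters on failure. The comma patterns are not string-aware either, which is why they only run after a strict parse has already failed.

```typescript
// The removed step: a naive comment stripper also eats URLs in strings.
const stripLineComments = (text: string): string => text.replace(/\/\/.*$/gm, "");

function repairCommas(text: string): string {
  return text
    // Drop a trailing comma before a closing brace or bracket.
    .replace(/,(\s*[}\]])/g, "$1")
    // Insert a comma between two values separated only by a line break.
    // Requiring the line break avoids matching inside string values,
    // which cannot contain raw newlines in valid JSON.
    .replace(/([}\]"])(\s*\n\s*)(["{\[])/g, "$1,$2$3");
}

function parseWithRepair(raw: string): unknown {
  try {
    return JSON.parse(raw); // fast path: already valid
  } catch {
    // fall through to the repair tier
  }
  try {
    return JSON.parse(repairCommas(raw));
  } catch (err) {
    console.error("JSON repair failed; raw output:", raw.slice(0, 800));
    throw err;
  }
}
```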

F6 — Pattern B parity on the insert-beat path
  Previously: directInsertBeat's INSERT_BEAT_SYSTEM forbade any speaker
  not in session.characters, and the orchestrator's unregistered-speaker
  guard demoted such lines to narration. This meant the player could not
  speak via speaker="你" in transient in-scene beats — inconsistent with
  the Writer path.
  Fix:
  - INSERT_BEAT_SYSTEM prompt now allows speaker="你" (NPC name OR "你")
    and rejects other POV variants
  - directInsertBeat applies normalizeSpeakerName to the LLM output, same
    as the Writer path, so POV variants collapse to "你"
  - lineDelivery is dropped when speaker="你" (no TTS for player)
  - orchestrator's unregistered-speaker guard adds a `speaker !== "你"`
    exception so Pattern B player dialog passes through
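
A sketch of the guard after this fix, assuming speaker names have already been normalized. `RawBeat` and `guardSpeaker` are hypothetical; the behaviors (player passes through without `lineDelivery`, unregistered speakers demoted to narration) are the ones listed above.

```typescript
const PLAYER_SPEAKER = "你";

interface RawBeat {
  speaker?: string;      // undefined means narration
  text: string;
  lineDelivery?: string; // TTS delivery hint
}

function guardSpeaker(beat: RawBeat, registered: Set<string>): RawBeat {
  if (!beat.speaker) return beat;
  if (beat.speaker === PLAYER_SPEAKER) {
    // Player dialog passes through, minus the TTS hint it can never use.
    return { speaker: PLAYER_SPEAKER, text: beat.text };
  }
  if (!registered.has(beat.speaker)) {
    // Unknown NPC name: demote the line to narration.
    return { text: beat.text };
  }
  return beat;
}
```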

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(engine): drop "JS-style comments" from parseJsonLoose header

The function header listed JS-style comments as a step-4 repair, but F5
already removed comment stripping from `repairJsonString` because the
regex would corrupt URLs inside JSON string values. The inner function's
comment was updated then; this header was missed.

Doc-only sync from second-round Copilot review on #6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: QiChen88 <2291969160@qq.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 13:30:24 +08:00

云梦

An AI-driven visual novel, painted one scene at a time. You talk and explore within a scene; when the story turns a corner, the AI paints the next one. You click. It paints. The story unfolds.


How it works

The story unfolds as a sequence of scenes. Each scene is one AI-painted background plus a short tree of beats — moments of narration, dialogue, and the occasional choice. You tap through a scene's beats and the image stays put; only when a choice leads somewhere genuinely new — another place, a new point of view, a jump in time — does the AI paint the next scene.

entering a scene
        │
        ▼
1. Text LLM     directs the whole scene at once — a background prompt
                plus a tree of beats (narration / dialogue / choices)
        │
        ▼
2. Image model  paints the background once, 16:9, no UI baked in
        │
        ▼
[ tap through beats — no model calls, instant ]
        │
        ├─ in-scene choice ──────▶ jump to another beat (instant)
        │
        └─ scene-change choice ──▶ the next scene
                                   (usually pre-generated — see below)
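
As a rough illustration of the flow above, a scene can be modeled as one background plus a map of beats, with choices that either jump within the scene or leave it. All type and field names here are hypothetical, not the engine's actual types.

```typescript
interface Choice {
  label: string;
  kind: "in-scene" | "scene-change";
  target: string; // beat id for in-scene, scene id for scene-change (assumed)
}

interface BeatNode {
  id: string;
  text: string;
  next?: string;       // following beat when there is no choice
  choices?: Choice[];
}

interface SceneNode {
  id: string;
  backgroundPrompt: string;
  beats: Record<string, BeatNode>;
}

type Step =
  | { type: "beat"; beat: BeatNode }     // same image, no model call
  | { type: "scene"; sceneId: string };  // a different scene must be shown

function choose(scene: SceneNode, choice: Choice): Step {
  if (choice.kind === "scene-change") {
    return { type: "scene", sceneId: choice.target };
  }
  const beat = scene.beats[choice.target];
  if (!beat) throw new Error(`unknown beat ${choice.target}`);
  return { type: "beat", beat };
}
```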

While you're reading one scene, the engine speculatively generates the scenes your choices could lead to — and, for unavoidable next steps, the scene after that. By the time you pick a direction, its image is usually already painted, so the cut feels instant.
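
One way to picture the speculative generation is a cache of in-flight promises keyed by choice. This is a sketch under assumptions: `ScenePrefetcher`, its methods, and the string stand-in for a generated scene are hypothetical, and the engine's real prefetch logic (including its depth handling) is not shown here.

```typescript
type SceneGenerator = (choiceId: string) => Promise<string>;

class ScenePrefetcher {
  private readonly cache = new Map<string, Promise<string>>();
  private readonly generate: SceneGenerator;

  constructor(generate: SceneGenerator) {
    this.generate = generate;
  }

  // Start generating every reachable scene that is not already cached.
  prefetch(choiceIds: string[]): void {
    for (const id of choiceIds) {
      if (this.cache.has(id)) continue;
      const pending = this.generate(id);
      // Forget failed attempts so a later request can retry.
      pending.catch(() => this.cache.delete(id));
      this.cache.set(id, pending);
    }
  }

  // Reuse the in-flight or finished result, or generate on demand.
  get(choiceId: string): Promise<string> {
    this.prefetch([choiceId]);
    return this.cache.get(choiceId)!;
  }
}
```

Calling `prefetch` while the player is still reading means `get` usually returns a promise that has already resolved, at the price of paying for branches that are never taken.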

Clicking the background itself (not a button) routes through a vision model: it reads where you tapped and decides whether you're exploring the current scene (it inserts a beat — no new image) or moving on (a new scene).

There is no traditional game UI baked into the art. The AI paints the world in whatever style you pick — "stick figure on grid paper" or "cyberpunk noir" — and the dialogue panel and choice buttons are a light HTML layer drawn on top, tuned to sit over the scene.


One-click deploy

Deploy with Vercel

After deploying, set the nine environment variables (see below) in your Vercel project. That's it.


Environment variables

Three providers, all independently configurable. Text and Vision accept any OpenAI-compatible endpoint (OpenAI, Anthropic via OpenAI-compat proxy, Gemini, OpenRouter, DeepSeek, local Ollama, …). Image goes to Runware (its own task-array protocol, not OpenAI-compatible).

Provider                 Variables                                        Recommended
Text · story director    TEXT_BASE_URL, TEXT_API_KEY, TEXT_MODEL          claude-opus-4-7 via Anthropic
Image · UI renderer      IMAGE_BASE_URL, IMAGE_API_KEY, IMAGE_MODEL       runware:400@6 (FLUX.2 [klein] 9B KV) via Runware
Vision · click reader    VISION_BASE_URL, VISION_API_KEY, VISION_MODEL    gemini-3-flash via Google

See apps/web/.env.example for the exact shape.
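
A minimal sketch of reading and validating the nine variables at startup. The variable names come from the table above; `loadProviderConfig` and `ProviderConfig` are hypothetical and not part of the project.

```typescript
const PROVIDERS = ["TEXT", "IMAGE", "VISION"] as const;
type Provider = (typeof PROVIDERS)[number];

interface ProviderConfig {
  baseUrl: string;
  apiKey: string;
  model: string;
}

function loadProviderConfig(
  env: Record<string, string | undefined>,
): Record<Provider, ProviderConfig> {
  const missing: string[] = [];
  const read = (name: string): string => {
    const value = env[name];
    if (!value) missing.push(name); // collect every gap before failing
    return value ?? "";
  };

  const config = {} as Record<Provider, ProviderConfig>;
  for (const p of PROVIDERS) {
    config[p] = {
      baseUrl: read(`${p}_BASE_URL`),
      apiKey: read(`${p}_API_KEY`),
      model: read(`${p}_MODEL`),
    };
  }
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return config;
}
```

In a Node process this would typically be called as `loadProviderConfig(process.env)`, so a misconfigured deployment fails at boot with the full list of missing names.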


Local development

Requires Node 20+ and pnpm 9+.

pnpm install
cp apps/web/.env.example apps/web/.env.local
# fill in the nine env vars
pnpm dev
# open http://localhost:3000

Project layout

yume/
├── apps/web/              Next.js 16 app — pages + API routes
└── packages/
    ├── types/             shared TypeScript types
    ├── ai-client/         unified OpenAI-compatible clients
    └── engine/            three-stage AI orchestration (open core)

packages/engine is the open core — pure TS, no Next.js or browser dependency. Import it directly to build your own visual-novel front-end (Tauri, Electron, CLI, anywhere).


Cost & limits

With the recommended trio, the cost of each scene is dominated by the text-LLM call. The FLUX.2 [klein] 9B KV image is roughly $0.001 per scene (1792×1024, 4 steps, sub-second); the text call is the rest. Tapping through a scene's beats is free. To keep transitions instant, the engine also pre-generates scenes you might pick but don't, so real spend runs somewhat higher than the scenes you actually see would suggest. There is no rate limiting or auth out of the box: if you make your deployment public, your bill will reflect that. Add limits (and consider lowering the prefetch depth) before sharing widely.
