Add a dedicated Architect LLM call at session start that expands the terse
world/style prompt into a persistent story bible (logline, genre, second-
person protagonist, cast, engineered opening hook). The bible seeds a
StoryState that the Writer reads and patches every scene. The state is
carried and merged across cuts (applyStoryStatePatch), so the story keeps
a spine from beat one instead of jumping between scenes.
- prompts: inject web-novel / short-drama / galgame craft into Writer +
Architect; Writer emits storyStatePatch to update the running bible
- director: parallelize voice + non-entry portraits with the Painter
(only entry-beat portraits block paint) to offset Architect latency
- architect: chat/parse guarded so a malformed response never aborts start
- types: StoryState / StoryStatePatch; required on Start/SceneResponse
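
A minimal sketch of how a patch merge like applyStoryStatePatch could work. The field names (logline, genre, cast, castUpserts) are hypothetical; the commit names the StoryState / StoryStatePatch types but not their shape.

```typescript
// Hypothetical shapes: only the type names come from the commit message.
interface CastMember {
  name: string;
  role: string;
}

interface StoryState {
  logline: string;
  genre: string;
  cast: CastMember[];
}

interface StoryStatePatch {
  logline?: string;
  genre?: string;
  castUpserts?: CastMember[]; // assumed: new or updated cast entries, keyed by name
}

// Returns a new state; the input is never mutated, so a failed scene
// can fall back to the previous bible unchanged.
function applyStoryStatePatch(
  state: StoryState,
  patch: StoryStatePatch | undefined,
): StoryState {
  if (!patch) return state;
  const upserts = patch.castUpserts ?? [];
  // Existing members keep their position; patched ones are replaced in place.
  const updated = state.cast.map(
    (member) => upserts.find((u) => u.name === member.name) ?? member,
  );
  // Members the state has never seen are appended.
  const added = upserts.filter(
    (u) => !state.cast.some((member) => member.name === u.name),
  );
  return {
    logline: patch.logline ?? state.logline,
    genre: patch.genre ?? state.genre,
    cast: [...updated, ...added],
  };
}
```

Treating a missing patch as a no-op matches the stated goal that a malformed model response should never abort the session.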
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
On mount the mute effect fired alongside the scene effect (both call
prefetchSceneAudio), so the initial /api/beat-audio batch was dispatched
twice — the first set aborted mid-flight. Track the previous muted value
in a ref and only re-prefetch on a real transition, leaving the mount-time
synthesis to the scene effect. Addresses Copilot review on PR #9.
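
The previous-value check can be sketched without React. The closure variable below stands in for the ref, and the function name is hypothetical:

```typescript
// Framework-free stand-in for the ref-based guard: `previous` plays the
// role of the ref's `.current`. Returns true only on a real mute transition.
function createMuteTransitionGate(initialMuted: boolean): (muted: boolean) => boolean {
  let previous = initialMuted;
  return (muted: boolean): boolean => {
    const changed = previous !== muted;
    previous = muted;
    return changed;
  };
}

// Usage: the first call models the mount-time effect run, which must not
// trigger a second prefetch.
const shouldRePrefetch = createMuteTransitionGate(false);
shouldRePrefetch(false); // mount: no transition, scene effect handles audio
shouldRePrefetch(true);  // user mutes: a real transition
```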
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Revises the InfiPlot homepage from the initial prototype pass.
Stories data model
- Replaces the artificial 7-hero + 16-gallery split with a flat
per-gender model: 30 preset stories each for the male-oriented (男性向)
and female-oriented (女性向) audiences.
- Renames assets hero*/gallery* → m{0..29} / f{0..29}; same index
shares aspect ratio across genders so the gender crossfade never
jumps card height.
- Fills in the missing female-oriented (女性向) set and expands both genders to 30.
Cards
- StoryCard measures aspect ratio at runtime from the loaded image
(onLoad → naturalWidth/Height), fixing the frosted-caption band
reflow on lazy image load. Drops ready/fallback props; single
masonry map over STORIES[gender].
Hero input
- Single-line <input> → auto-growing <textarea> (rows=1, resize-none)
so long prompts and long card seeds are fully visible. Enter submits,
Shift+Enter inserts a newline.
- lining-nums on the input so digits sit on the baseline instead of
Cormorant's default old-style figures.
Typography / styles
- layout.tsx: editorial fonts (Cormorant Garamond + Inter via
--font-serif / --font-sans) + Font Awesome; drops Patrick Hand /
Noto Sans SC and the hand-drawn SVG jitter filters.
- globals.css trimmed to the editorial base (paper grain, hairline,
num, ripple); play/page.tsx font/style follow-up.
Scripts
- generate-home-images.mjs reworked into a flat 2×30 idempotent
Runware FLUX.2 generator.
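
One way to read "idempotent" here: a rerun only generates assets that are not already on disk. A minimal sketch of that planning step, assuming `.webp` file names and an injected set of existing files (both assumptions, not taken from the script):

```typescript
type GenderPrefix = "m" | "f";

// Enumerates the flat 2×30 asset set: m0..m29 and f0..f29.
function plannedAssets(perGender: number = 30): string[] {
  const prefixes: GenderPrefix[] = ["m", "f"];
  const names: string[] = [];
  for (const prefix of prefixes) {
    for (let i = 0; i < perGender; i++) {
      names.push(`${prefix}${i}.webp`); // extension is an assumption
    }
  }
  return names;
}

// Idempotency: only assets missing from `existing` are scheduled,
// so a rerun after a partial failure resumes where it stopped.
function missingAssets(existing: ReadonlySet<string>, perGender: number = 30): string[] {
  return plannedAssets(perGender).filter((name) => !existing.has(name));
}
```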
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rebuilds the landing page from the prototype: 1900px scale-to-fit hero with
hand-drawn SVG-jitter frames, typewriter input + start button, 5 horizontal
collapsible category selectors (with style-picker modal), 7 scattered hero
cards over a 16-card masonry gallery, and project intro panel.
Each card is filled with a Runware FLUX.2 image, pre-generated and stored as
WebP (~2 MB total for 30 cards). Hero card content + image switch by
audience orientation (male-oriented 男性向 / female-oriented 女性向); gallery stays shared.
Hover overlay on every card shows title + outline in a bottom-up dark
gradient, matching the prior homepage's interaction style.
Bug fixes uncovered by tracing the form-state → engine pipeline:
- The "Voice acting: off" option (「语音配音:关闭」) was previously stuffed into
styleGuide (consumed only by FLUX, ignored by TTS). Now serialized as an
audioEnabled boolean in the sessionStorage payload; the play page's
fetchBeatAudio early-returns when false, so no /api/beat-audio request fires.
- The "Art style: auto" option (「绘画风格:自动」) used to pass the literal
Chinese phrase "由模型根据 prompt 自动判断画风" ("let the model infer the art
style from the prompt") to FLUX, which painted it as text. Now maps to the
default anime/galgame prompt.
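
The two fixes above amount to keeping UI labels out of the image prompt. A sketch under assumed names: HomeForm, SessionPayload, and the default prompt text are all hypothetical, and only audioEnabled and styleGuide appear in the commit.

```typescript
// Hypothetical form shape; the real form fields are not shown in the commit.
interface HomeForm {
  voiceActing: "on" | "off";
  artStyle: string; // "auto" or a concrete style prompt
}

interface SessionPayload {
  audioEnabled: boolean;
  styleGuide: string;
}

// Placeholder text, not the project's actual default prompt.
const DEFAULT_STYLE_PROMPT = "anime / galgame illustration style";

function buildSessionPayload(form: HomeForm): SessionPayload {
  return {
    // A real boolean instead of a label hidden inside styleGuide.
    audioEnabled: form.voiceActing === "on",
    // "auto" never reaches the image model as literal text.
    styleGuide: form.artStyle === "auto" ? DEFAULT_STYLE_PROMPT : form.artStyle,
  };
}

// Mirrors the early return described for fetchBeatAudio.
function shouldFetchBeatAudio(payload: SessionPayload): boolean {
  return payload.audioEnabled;
}
```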
Adds reusable scripts under apps/web/scripts/:
- generate-home-images.mjs — Runware FLUX.2 idempotent batch generator
- optimize-home-images.mjs — sharp WebP downscale + recompress
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- annotate.ts: add assertSafeUrl() to reject non-https/data URLs and
private/reserved IPs (SSRF prevention); cap response body to 10 MB
- jsonParser.ts: truncate raw model output in error log to first 800
chars to avoid flooding logs / leaking sensitive content
- Move vercel.json to apps/web/ with correct route paths; cap scene route
maxDuration 120→60s for Hobby. Root vercel.json removed. Vercel project's
Root Directory must be set to apps/web (Deploy button URL passes this).
- Switch image transport from base64-in-JSON to Runware-hosted URLs:
generateImage now uses outputType=URL and returns {imageUrl, imageUuid};
StartResponse/SceneResponse carry imageUrl; VisionRequest carries
prevImageUrl (server re-fetches the bytes for click annotation). This
eliminates the 4.5MB serverless body-size risk.
- Painter and director prefer URL over UUID for referenceImages — the UUID
returned by Runware imageInference isn't always recognized in the refs
pipeline (surfaces as `failedToTransferImage`).
- Client preloads scene images via `new Image().decode()` before committing
to React state, so URL transitions render instantly; prefetched scenes
also warm the HTTP cache.
- jsonParser uses the jsonrepair package (replaces hand-rolled repair) and
adds a targeted preRepair regex for the missing-key-close-quote pattern
that jsonrepair couldn't disambiguate. Full raw model output dumped on
failure for diagnostic visibility.
- Default text provider switched to DeepSeek v4-flash via direct API
(significantly more stable JSON than MiMo v2.5-pro). VISION/TTS stay on
MiMo (DeepSeek has no multimodal / TTS offerings).
- next.config: drop dead experimental.serverActions.bodySizeLimit (no
server actions used).
- README: real Deploy button URL (zonghaoyuan/yume + root-directory=apps/web
+ TTS/MOCK_IMAGE in env list); refreshed env vars table with optional
TTS section.
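
A rough sketch of the kind of check assertSafeUrl() in the list above performs. It reads "non-https/data" as "anything other than https: or data:", which is an assumption, and it inspects only the literal hostname. It does not resolve DNS, so it is not a complete SSRF defense by itself.

```typescript
// True for loopback, private, link-local, CGNAT, multicast, and reserved IPv4 ranges.
function isPrivateIPv4(host: string): boolean {
  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!match) return false;
  const a = Number(match[1]);
  const b = Number(match[2]);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 100 && b >= 64 && b <= 127) ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    a >= 224
  );
}

function assertSafeUrl(raw: string): URL {
  const url = new URL(raw); // throws on malformed input
  if (url.protocol === "data:") return url; // inline payload, no network fetch
  if (url.protocol !== "https:") {
    throw new Error(`Blocked protocol: ${url.protocol}`);
  }
  const host = url.hostname.toLowerCase();
  const blocked =
    host === "localhost" ||
    host.endsWith(".localhost") ||
    host.startsWith("[") || // conservative simplification: reject all IPv6 literals
    isPrivateIPv4(host);
  if (blocked) throw new Error(`Blocked host: ${host}`);
  return url;
}
```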
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add explicit check for empty choices array in both chat.ts and vision.ts
- Add optional chaining for message property access
- Throw descriptive error when API returns no content
- Use English comments consistent with project style
- Makes empty upstream responses fail loudly instead of surfacing as silent empty strings, which were hard to debug
Related to: chat.ts and vision.ts silent empty string return on malformed responses
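
A minimal sketch of the guard described above, assuming an OpenAI-style response shape. The ChatCompletionLike type and the extractContent name are illustrative, not the project's actual code.

```typescript
// Assumed response shape; every level may be missing on a malformed reply.
interface ChatCompletionLike {
  choices?: Array<{ message?: { content?: string | null } }>;
}

// Throws a descriptive error instead of returning "" on an empty response.
function extractContent(response: ChatCompletionLike, label: string): string {
  const choices = response.choices;
  if (!Array.isArray(choices) || choices.length === 0) {
    throw new Error(`${label}: upstream returned no choices`);
  }
  const content = choices[0]?.message?.content;
  if (typeof content !== "string" || content.trim() === "") {
    throw new Error(`${label}: upstream returned empty content`);
  }
  return content;
}
```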
* feat(types): Character.voiceDescription rename + visual fields + Scene.sceneKey
Prepares the type surface for the multi-agent scene pipeline:
- Character.description → voiceDescription (clearer pairing with new visualDescription)
- Character gains visualDescription (English appearance card for Painter) +
basePortraitBase64 + basePortraitUuid (for Runware referenceImages reuse)
- Scene gains sceneKey (English slug for cross-scene img2img continuity) +
imageUuid (Runware UUID of the scene's rendered image for cheap seedImage
reuse on subsequent same-sceneKey calls)
- Beat gains activeCharacters[] so the Cinematographer can read which
characters are on-screen + their poses when composing the establishing shot
Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(ai-client): generateImage img2img + multi-reference options + uploadImage
Extends the Runware adapter to support the two anchoring mechanisms FLUX.2
[klein] 9B KV needs for character + scene visual consistency:
- generateImage gains optional { seedImage, referenceImages, strength }:
seedImage drives img2img (single starting image, sceneKey continuity),
referenceImages drives multi-reference anchoring (up to 4 character
portraits, capped per Runware spec). Default strength 0.85 — FLUX
ignores strength < 0.8.
- uploadImage POSTs a base64 to Runware's imageUpload taskType and
returns the UUID, so portraits/scene snapshots can be referenced by
UUID on subsequent calls instead of resending base64 every scene.
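
The option handling described above could be sketched as follows. The normalizeImageOptions helper is hypothetical, and applying strength only when a seedImage is present is an assumption.

```typescript
interface ImageOptions {
  seedImage?: string;
  referenceImages?: string[];
  strength?: number;
}

const MAX_REFERENCE_IMAGES = 4; // cap stated in the commit
const DEFAULT_STRENGTH = 0.85; // default stated in the commit

// Hypothetical helper: fills defaults and enforces the reference cap
// before the options are turned into a Runware task.
function normalizeImageOptions(options: ImageOptions = {}): ImageOptions {
  const normalized: ImageOptions = {};
  if (options.seedImage) {
    normalized.seedImage = options.seedImage;
    // Assumption: strength is only meaningful for img2img.
    normalized.strength = options.strength ?? DEFAULT_STRENGTH;
  }
  if (options.referenceImages && options.referenceImages.length > 0) {
    normalized.referenceImages = options.referenceImages.slice(0, MAX_REFERENCE_IMAGES);
  }
  return normalized;
}
```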
Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(engine): multi-agent scene pipeline (Writer→CharDesigner+Cinematographer→Painter)
Replaces the single-LLM directScene with a four-agent pipeline that
specializes each concern and parallelizes the slow parts. Adopts the
core idea from #4 (multi-agent dispatch + character visual consistency)
and grafts it onto the Scene/Beat architecture introduced in #2.
Pipeline per Scene (~9-12s critical path with parallelization):
Writer LLM (sequential, ~3s)
│  outputs: sceneSummary + sceneKey + beats[] (each beat carries
│           activeCharacters[] with poses)
│
├─ CharacterDesigner LLM × N new chars (parallel)
│  │  outputs: { visualDescription (English appearance card), voiceDescription (Chinese voice card) }
│  ├─ FLUX portrait gen → upload → UUID (parallel within agent)
│  └─ Xiaomi MiMo voicedesign provision (parallel within agent)
│
└─ Cinematographer LLM (parallel with CharacterDesigner)
      outputs: { shotType, integratedPrompt (English composition + camera position + character blocking) }
Painter (FLUX img2img + referenceImages, ~1-3s)
   inputs: integratedPrompt + onStageCharacters' archetype block
           + (optional) prior sceneKey-hit scene as seedImage
           + (optional) character portrait UUIDs as referenceImages
   fallback chain: A) both anchors → B) refs only (keeps characters) →
                   C) seed only (keeps backdrop) → D) pure t2i
   output uploaded → Scene.imageUuid for the next sceneKey hop
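
The fan-out in the diagram above can be sketched with Promise.all. The Agents interface and its method names are hypothetical stand-ins for the real agents; only the ordering (Writer first, CharacterDesigner and Cinematographer in parallel, Painter last) comes from the diagram.

```typescript
interface WriterOutput {
  sceneKey: string;
  newCharacters: string[];
}

// Hypothetical agent surface, injected so the flow can be exercised with stubs.
interface Agents {
  writer(): Promise<WriterOutput>;
  designCharacter(name: string): Promise<string>; // resolves to a portrait UUID
  cinematographer(script: WriterOutput): Promise<string>; // resolves to integratedPrompt
  paint(integratedPrompt: string, portraits: string[]): Promise<string>;
}

async function runScenePipeline(agents: Agents): Promise<string> {
  // Step 1 is sequential: everything downstream needs the Writer's output.
  const script = await agents.writer();
  // Step 2 fans out: one CharacterDesigner per new character, alongside
  // the Cinematographer, which does not need the character designs.
  const [portraits, integratedPrompt] = await Promise.all([
    Promise.all(script.newCharacters.map((name) => agents.designCharacter(name))),
    agents.cinematographer(script),
  ]);
  // Step 3 joins both branches in the Painter.
  return agents.paint(integratedPrompt, portraits);
}
```

With this shape the critical path is Writer + max(CharacterDesigner, Cinematographer) + Painter rather than the sum of all four.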
Why this carving:
- Writer focuses purely on narrative (drops the voice-design duty
staging's DIRECTOR_SYSTEM was carrying as a side concern).
- CharacterDesigner bundles visual + voice so the agent that thinks
"who is this character" produces internally-consistent appearance +
vocal personality (split agents tend to diverge).
- Cinematographer doesn't need character visualDescriptions —
Painter appends archetypes after — so it parallelizes with
CharacterDesigner.
- sceneKey enables cross-scene backdrop continuity that Scene/Beat
doesn't cover (Scene/Beat only reuses backdrop WITHIN a scene's
beats; sceneKey reuses across scenes that share a location).
Other changes:
- voice.ts loses provisionVoicesForScene (moved into CharacterDesigner);
keeps synthesizeBeat for the lazy per-beat /api/beat-audio path.
- renderer.ts deleted (replaced by agents/painter.ts).
- directInsertBeat (vision-driven in-scene exploration) stays single-
LLM — it forbids new characters and produces no image, so multi-
agent doesn't apply.
apps/web is unchanged: orchestrator.ts keeps the same exports
(startSession / requestScene / visionDecide / requestInsertBeat /
requestBeatAudio) with identical request/response shapes.
Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): Pattern B player POV + JSON repair + drop seedImage tier
Three hotfixes surfaced by manual end-to-end testing of the multi-agent
pipeline.
F1 — Player viewpoint (galgame Pattern B):
- Writer accepts speaker="你" for player dialog (renders in dialog box,
never TTS'd because no Character record exists for "你"). Filter POV
variants (玩家/我/主角/protagonist/player/I/me/...) from
activeCharacters so CharacterDesigner never wastes API calls on the
player. Two-layer defense: explicit prompt rule in WRITER_SYSTEM +
code normalization (POV_VARIANTS set, isPovName, normalizeSpeakerName).
- Cinematographer and Painter prompts gain "player never in frame" rule
so the player never appears in any rendered scene.
- Cinematographer gains dynamic camera policy driven by the entry beat's
speaker: NPC-speaker → close-up looking toward camera; "你"-speaker →
medium shot of attentive NPC; no speaker → wide establishing shot.
- director.ts filters POV from orphanSpeakers so provisionVoiceForName
never fires for "你".
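
A sketch of the code-side normalization named in F1. The variant set contains only the names listed above (the source list is elided with "..."), and case-insensitive matching is an assumption.

```typescript
const PLAYER = "你";

// Only the variants spelled out in the commit; the full set is not shown there.
const POV_VARIANTS = new Set(["玩家", "我", "主角", "protagonist", "player", "i", "me"]);

// Assumption: matching ignores case and surrounding whitespace.
function isPovName(name: string): boolean {
  const key = name.trim().toLowerCase();
  return key === PLAYER || POV_VARIANTS.has(key);
}

// Collapses every POV variant to the canonical "你"; other names pass through.
function normalizeSpeakerName(name: string | undefined): string | undefined {
  if (name === undefined) return undefined;
  return isPovName(name) ? PLAYER : name.trim();
}

// Keeps the player out of activeCharacters so no designer call is spent on them.
function filterActiveCharacters(names: string[]): string[] {
  return names.filter((name) => !isPovName(name));
}
```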
F2 — JSON parsing robustness:
- parseJsonLoose gains a 4th repair tier: strip JS-style comments, strip
trailing commas, insert missing commas between adjacent objects /
arrays / quoted values. Logs the first 800 chars of raw LLM output
when all repair attempts fail, so we can see what the model emitted.
F3 — Drop seedImage, use referenceImages for prior scene:
- FLUX.2 [klein] 9B KV does not support seedImage (img2img). Removed
Tier A (seedImage+refs) and Tier C (seedImage only) from the Painter
degradation chain. New layout: prior scene's image slots into
referenceImages[0] for spatial continuity, character portraits fill
slots 1-3 (Runware caps at 4 total). Cinematographer instructed to
emphasize continuity when sceneKey matches a prior scene.
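
The F3 slot layout can be sketched as a small pure function. The buildReferenceImages name is hypothetical, and letting portraits use all four slots when there is no prior scene is an assumption.

```typescript
const MAX_REFS = 4; // Runware cap stated above

// Prior scene image goes first (slot 0) for spatial continuity;
// character portraits fill whatever slots remain.
function buildReferenceImages(
  priorSceneImage: string | undefined,
  portraits: string[],
): string[] {
  const refs: string[] = [];
  if (priorSceneImage) refs.push(priorSceneImage);
  for (const portrait of portraits) {
    if (refs.length >= MAX_REFS) break;
    refs.push(portrait);
  }
  return refs;
}
```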
All five package typechecks pass.
Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): address Copilot review feedback on #6
Three targeted fixes from PR #6 Copilot review.
F4 — Stale seedImage/img2img docstrings
Four locations still referenced the original img2img design after F3
switched to referenceImages-based spatial continuity:
- types/index.ts:57 Scene.sceneKey docstring
- types/index.ts:63 Scene.imageUuid docstring
- director.ts:34 pipeline diagram in module block comment
- director.ts:128 directScene JSDoc
Doc-only changes; misleading wording corrected to mention referenceImages.
(The design-rationale comment in pickPriorSceneReference is kept — it
explains WHY we don't use seedImage and is load-bearing context.)
F5 — Remove JS-comment stripping from JSON repair pass
parseJsonLoose's repair tier previously stripped `// ...` and
`/* ... */` across the entire text, which would corrupt JSON string
values containing URLs (e.g. "https://example.com" → "https:"). Since
LLMs in `responseFormat: "json_object"` mode essentially never emit
comments, dropping the comment-stripping step is a net win for safety.
Trailing-comma and missing-comma repair (the high-frequency failures)
are kept.
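
The corruption described in F5 is easy to reproduce. The regexes below are illustrative reconstructions, not the project's actual repairJsonString code:

```typescript
// Naive comment stripping: unaware of string boundaries, so it also
// eats the "//" inside URLs.
const stripComments = (text: string): string =>
  text.replace(/\/\/.*$/gm, "").replace(/\/\*[\s\S]*?\*\//g, "");

// Trailing-comma repair: drops a comma that directly precedes } or ].
const stripTrailingCommas = (text: string): string =>
  text.replace(/,\s*([}\]])/g, "$1");

const sample = '{"url": "https://example.com", "tags": ["a", "b",],}';

const corrupted = stripComments(sample); // '{"url": "https:'
const repaired = stripTrailingCommas(sample); // valid JSON, URL intact
```

The trailing-comma pattern is not string-aware either, but a `,}` or `,]` sequence inside a string value is far rarer than `//` inside a URL.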
F6 — Pattern B parity on the insert-beat path
Previously: directInsertBeat's INSERT_BEAT_SYSTEM forbade any speaker
not in session.characters, and the orchestrator's unregistered-speaker
guard demoted such lines to narration. This meant the player could not
speak via speaker="你" in transient in-scene beats — inconsistent with
the Writer path.
Fix:
- INSERT_BEAT_SYSTEM prompt now allows speaker="你" (NPC name OR "你")
and rejects other POV variants
- directInsertBeat applies normalizeSpeakerName to the LLM output, same
as the Writer path, so POV variants collapse to "你"
- lineDelivery is dropped when speaker="你" (no TTS for player)
- orchestrator's unregistered-speaker guard adds a `speaker !== "你"`
exception so Pattern B player dialog passes through

Co-Authored-By: QiChen88 <2291969160@qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(engine): drop "JS-style comments" from parseJsonLoose header
The function header listed JS-style comments as a step-4 repair, but F5
already removed comment stripping from `repairJsonString` because the
regex would corrupt URLs inside JSON string values. The inner function's
comment was updated then; this header was missed.
Doc-only sync from second-round Copilot review on #6.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: QiChen88 <2291969160@qq.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reduce median scene-load latency from ~30-80s to ~17-25s by switching image generation to Runware FLUX.2 [klein] 9B KV and moving per-beat TTS synthesis off the scene response into a new lazy /api/beat-audio endpoint with hard timeout + abort support.
- feat(image): migrate to Runware FLUX.2 [klein] 9B KV — task-array API, $0.001/image, sub-second inference.
- feat(tts): split /api/scene into directScene + image + voicedesign provisioning, and synthesize audio lazily per beat via /api/beat-audio. Each call has a 15s hard timeout, with an AbortSignal threaded through to MiMo so timed-out calls don't keep burning sockets/quota. The client fans out per-beat fetches on scene-id change, aborting stale requests and identity-checking in `finally` to prevent cross-scene beat-id collisions.
- refactor(tts): slim BeatAudioRequest to { beat, voice } — ~800KB per-beat upload dropped to ~160KB by sending only the speaker's voice instead of the full session.
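
A sketch of the hard-timeout pattern from the feat(tts) bullet above, using AbortController. The withHardTimeout name is hypothetical, and returning null on timeout (a silent beat) is an assumption based on the graceful-degradation behavior described elsewhere in this log.

```typescript
// Runs `run` with an AbortSignal that fires after `ms`. The callee must
// honor the signal (for example by passing it to fetch) so the upstream
// request is actually cancelled rather than merely ignored.
async function withHardTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T | null> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await run(controller.signal);
  } catch (err) {
    // A timeout degrades to "no audio"; any other failure still propagates.
    if (controller.signal.aborted) return null;
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```

A caller would pass something like `(signal) => fetch(url, { signal })` with a 15000 ms budget.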
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Adds optional Xiaomi MiMo TTS layer on top of the scene/beat engine and a MOCK_IMAGE flag for cheap local TTS iteration.
- Per-character voice provisioning via MiMo voice design → clone, reference audio persisted in session
- Per-line free-form delivery direction (Director writes instructions in the style of "鼓起勇气又害羞,声音发颤", i.e. "mustering up courage yet shy, voice trembling"; sent to MiMo's director channel, never read aloud)
- Per-beat audio served with the scene response; frontend plays via hidden <audio> with typewriter synced to audio duration; mute toggle persisted via localStorage lazy initializer
- Graceful degradation: any TTS step failing → silent beat, game continues
- MOCK_IMAGE=true returns a sharp-generated placeholder PNG so local TTS iteration doesn't burn image tokens
- Recommended config in .env.example: MiMo Token Plan covers TEXT/VISION/TTS with one key (mimo-v2.5-pro for text, mimo-v2.5 omni for vision, mimo-v2.5-tts for TTS)
Squashed from #3:
- feat(tts): Xiaomi MiMo per-beat voice acting + per-session character voices + free-text delivery direction
- feat(engine): MOCK_IMAGE placeholder image for easier local testing
- fix(tts): address Copilot review on PR #3
- fix(tts): Copilot round-2 review feedback
Known limitation: Session.characters carries the full WAV reference audio (~200-300KB/character base64) and round-trips through every /api/scene, /api/vision, /api/insert-beat request. This is intrinsic to MiMo's design→clone model (voice identity IS the audio, no server-side voiceId). Fixing requires server-side storage which is out of scope; documented for future hardening.
Replace the one-image-per-interaction model with scenes that hold multiple
dialogue beats. The image regenerates only on scene-change actions; tapping
through beats and in-scene choices are instant and zero-network.
Squashed from #2:
- feat: scene/beat architecture — decouple dialogue from image generation
- fix: harden LLM-output parsing, prefetch lifecycle, and typewriter (PR review)
- fix: dedupe beat ids; fallback narration on empty insert-beat (PR review #2)
HTML choice buttons now call /api/interact directly, bypassing the ~4s Vision roundtrip. Free-form background clicks still go through Vision as before.
- image prompt: vertical 9:16 → landscape 16:9 cinematic, scene fills
canvas with bottom dialogue band and horizontal choice row
- image-client: pass size=1792x1024 hint (provider honors it → output is
now exact 16:9 instead of the model's default 1.75:1)
- PlayCanvas: drop 560px cap, use object-contain into available space,
add fullViewport prop for chrome-less presentation rendering
- play page: F / Esc shortcuts + Fullscreen API + fullscreenchange
sync; chrome-less black-letterbox overlay (bg-black) suited for
screen recording
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace external <link> to fonts.googleapis.com with next/font/google
for Cormorant Garamond and Inter. Fonts are now downloaded at build time
and served from /_next/static/media, exposed via --font-serif and
--font-sans CSS variables that Tailwind's fontFamily reads.
Eliminates runtime dependency on Google Fonts CDN (helpful for offline
or region-restricted deploys), avoids FOUT through next/font's
size-adjusted fallback, and removes two render-blocking external
stylesheet requests on first load.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Project is now private; remove LICENSE file, README license
section, and "MIT · MMXXVI" footer tags. Root package.json
license set to UNLICENSED.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Engine
- Split /api/vision out from /api/interact so client can drive
prefetch + cache lookup independently of click interpretation
- Image client switched to chat-completions+modalities API (OpenRouter/
provider style), supporting markdown image URL responses
- annotateClick now resizes to 768w before composite to keep vision
payloads small and avoid CDN timeouts
- Prompts updated to mention "JSON" in user messages (required by
Gemini's strict JSON mode)
- Shared fetchWithRetry helper: 2 retries for chat/image, 0 for vision
(with 60s hard timeout)
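
A sketch of a retry helper in the spirit of fetchWithRetry. The withRetry name, the policy table, and the injected `attempt` callback are hypothetical. Only the retry counts and the 60s vision timeout come from the list above, so the chat/image timeout is left unset.

```typescript
interface RetryPolicy {
  retries: number; // extra attempts after the first one
  timeoutMs?: number; // per-attempt hard timeout, if any
}

const POLICIES: Record<"chat" | "image" | "vision", RetryPolicy> = {
  chat: { retries: 2 },
  image: { retries: 2 },
  vision: { retries: 0, timeoutMs: 60_000 },
};

async function withRetry<T>(
  attempt: (signal: AbortSignal) => Promise<T>,
  policy: RetryPolicy,
): Promise<T> {
  let lastError: unknown = new Error("no attempt made");
  for (let i = 0; i <= policy.retries; i++) {
    const controller = new AbortController();
    const timer =
      policy.timeoutMs === undefined
        ? undefined
        : setTimeout(() => controller.abort(), policy.timeoutMs);
    try {
      return await attempt(controller.signal);
    } catch (err) {
      lastError = err; // remember the failure and try again if attempts remain
    } finally {
      if (timer !== undefined) clearTimeout(timer);
    }
  }
  throw lastError;
}
```

Usage would look like `withRetry((signal) => fetch(url, { signal }), POLICIES.vision)`.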
Client
- Parallel prefetch of all three choice branches on each new frame
- Effect deliberately excludes phase from deps so user-click doesn't
abort in-flight prefetches
- Cache hit/miss/free-form fallback handled in handleClick
- PlayCanvas reads img naturalWidth/Height and adapts container to
whatever aspect AI returns (no more cropped third choice)
- max-width raised to 560px, max-height calc(100dvh - 200px)
Misc
- README env-path corrected to apps/web/.env.local
- users.md: BGM/TTS idea note
- .env.example moved into apps/web alongside next config
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Disable typed routes (default-on in Next 16, loops infinitely
with transpilePackages workspace setup, holding 500%+ CPU at idle)
- Pin turbopack.root to monorepo root so a stray ~/pnpm-lock.yaml
cannot misinfer the workspace boundary
- Commit pnpm-lock.yaml; ignore .claude/ local plugin state
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>