Replace the one-image-per-interaction model with scenes that hold multiple
dialogue beats. The image regenerates only on scene-change actions; tapping
through beats and in-scene choices are instant and zero-network.
Squashed from #2:
- feat: scene/beat architecture — decouple dialogue from image generation
- fix: harden LLM-output parsing, prefetch lifecycle, and typewriter (PR review)
- fix: dedupe beat ids; fallback narration on empty insert-beat (PR review #2)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Engine
- Split /api/vision out from /api/interact so client can drive
prefetch + cache lookup independently of click interpretation
- Image client switched to chat-completions+modalities API (OpenRouter/
provider style), supporting markdown image URL responses
- annotateClick now resizes to 768w before composite to keep vision
payloads small and avoid CDN timeouts
- Prompts updated to mention "JSON" in user messages (required by
Gemini's strict JSON mode)
- Shared fetchWithRetry helper: 2 retries for chat/image, 0 for vision
(with 60s hard timeout)
Client
- Parallel prefetch of all three choice branches on each new frame
- Effect deliberately excludes phase from deps so user-click doesn't
abort in-flight prefetches
- Cache hit/miss/free-form fallback handled in handleClick
- PlayCanvas reads img naturalWidth/Height and adapts container to
whatever aspect AI returns (no more cropped third choice)
- max-width raised to 560px, max-height calc(100dvh - 200px)
Misc
- README env-path corrected to apps/web/.env.local
- users.md: BGM/TTS idea note
- .env.example moved into apps/web alongside next config
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>