Addresses Copilot review on PR #9:
- /api/vision: add MAX_ANNOTATED_BYTES (3 MB) cap on annotatedImageBase64,
plus an explicit type/non-empty check. The browser annotator resizes to
768 px wide (typically 200-800 KB of base64), so a 3 MB cap leaves
headroom for legitimate requests while rejecting abusive direct-API
payloads that would otherwise inflate upstream vision LLM costs.
- annotateClient: replace `img.src = ""` on timeout with
`removeAttribute("src")` to avoid the legacy browser behavior of treating
an empty src as a request for the current document URL.
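
A minimal sketch of the /api/vision guard described in the first bullet, not the actual handler. Only MAX_ANNOTATED_BYTES and annotatedImageBase64 come from this message; the function name, the result type, the HTTP status codes, and the choice to measure the base64 string length (rather than the decoded byte count) are assumptions.

```typescript
// Assumption: "3 MB" is 3 * 1024 * 1024 and is compared against the base64 string length.
const MAX_ANNOTATED_BYTES = 3 * 1024 * 1024;

// Hypothetical result shape for illustration only.
type ValidationResult =
  | { ok: true; base64: string }
  | { ok: false; status: number; error: string };

function validateAnnotatedImage(value: unknown): ValidationResult {
  // Explicit type / non-empty check before any size comparison.
  if (typeof value !== "string" || value.length === 0) {
    return { ok: false, status: 400, error: "annotatedImageBase64 must be a non-empty string" };
  }
  // Base64 is ASCII, so string length equals the number of bytes on the wire.
  if (value.length > MAX_ANNOTATED_BYTES) {
    return { ok: false, status: 413, error: "annotatedImageBase64 exceeds MAX_ANNOTATED_BYTES" };
  }
  return { ok: true, base64: value };
}
```

Rejecting before the upstream call is what keeps an oversized direct-API payload from ever reaching the vision LLM.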
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The vision pipeline used sharp to draw a click marker on the scene image
server-side (engine/src/annotate.ts) and to render the MOCK_IMAGE
placeholder PNG (engine/src/mockImage.ts). Both moved off the runtime:
- annotateClick → apps/web/lib/annotateClient.ts (Canvas 2D in the
browser; toDataURL → raw PNG base64 forwarded to /api/vision). Saves
a server-side image re-fetch per click and frees the engine from
sharp's native binding (which doesn't run on Cloudflare Workers).
- mockImageDataUri → self-describing SVG data URI (no rendering needed).
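
One way a self-describing SVG data URI could be built without any image library is sketched below. The mockImageDataUri name comes from this message; its signature, the default dimensions, the colors, and the label text are assumptions made for illustration.

```typescript
// Hypothetical signature; the real mockImageDataUri may take different arguments.
function mockImageDataUri(label: string, width = 768, height = 512): string {
  // Escape characters that would otherwise break the XML.
  const safe = label
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  const svg =
    `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}" ` +
    `viewBox="0 0 ${width} ${height}">` +
    `<rect width="100%" height="100%" fill="#444"/>` +
    `<text x="50%" y="50%" fill="#fff" font-family="sans-serif" font-size="24" ` +
    `text-anchor="middle" dominant-baseline="middle">${safe}</text>` +
    `</svg>`;
  // Percent-encoding keeps the URI valid without needing a base64 step.
  return `data:image/svg+xml;charset=utf-8,${encodeURIComponent(svg)}`;
}
```

Because the result is plain text, it needs no native binding and can be returned directly wherever an image URL is expected.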
VisionRequest contract changes: prevImageUrl + click → annotatedImageBase64.
Server forwards the bytes straight to the vision LLM as image_url.
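
The new contract can be sketched roughly as follows. Only the annotatedImageBase64 field is taken from this message; the helper names are hypothetical, and the image_url content-part shape assumes an OpenAI-compatible chat API, which this message does not confirm.

```typescript
// Only the field named in this message is shown; other VisionRequest fields are omitted.
interface VisionRequest {
  annotatedImageBase64: string; // raw PNG base64, without a "data:" prefix
}

// Assumption: the upstream vision LLM accepts OpenAI-style image_url content parts.
type ImageUrlPart = { type: "image_url"; image_url: { url: string } };

// Client side (hypothetical helper): canvas.toDataURL() returns
// "data:image/png;base64,<payload>"; keep only the payload.
function stripDataUrlPrefix(dataUrl: string): string {
  return dataUrl.slice(dataUrl.indexOf(",") + 1);
}

// Server side (hypothetical helper): wrap the raw base64 back into a data URL
// and pass it along untouched, with no re-fetch or re-encode.
function toImageUrlPart(req: VisionRequest): ImageUrlPart {
  return {
    type: "image_url",
    image_url: { url: `data:image/png;base64,${req.annotatedImageBase64}` },
  };
}
```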
sharp is removed from packages/engine entirely and from next.config.ts's
serverExternalPackages. apps/web/package.json + lockfile cleanup ships
in the follow-up Cloudflare deployment commit.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds optional Xiaomi MiMo TTS layer on top of the scene/beat engine and a MOCK_IMAGE flag for cheap local TTS iteration.
- Per-character voice provisioning via MiMo voice design → clone, reference audio persisted in session
- Per-line free-form delivery direction (Director writes "mustering courage yet shy, voice trembling" style instructions; sent to MiMo's director channel, never read aloud)
- Per-beat audio served with the scene response; frontend plays via hidden <audio> with typewriter synced to audio duration; mute toggle persisted via localStorage lazy initializer
- Graceful degradation: any TTS step failing → silent beat, game continues
- MOCK_IMAGE=true returns a sharp-generated placeholder PNG so local TTS iteration doesn't burn image tokens
- Recommended config in .env.example: MiMo Token Plan covers TEXT/VISION/TTS with one key (mimo-v2.5-pro for text, mimo-v2.5 omni for vision, mimo-v2.5-tts for TTS)
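
The graceful-degradation and typewriter-sync bullets above could look roughly like this. Every name here (BeatAudio, Synthesize, synthesizeOrSilent, typewriterDelayMs) and the 40 ms fallback pace are hypothetical; the sketch does not call the MiMo API and only shows the control flow.

```typescript
// Hypothetical shape of per-beat audio returned with the scene response.
interface BeatAudio {
  audioBase64: string;
  durationMs: number;
}

// Hypothetical TTS call: the line text plus a free-form delivery direction.
type Synthesize = (text: string, direction: string) => Promise<BeatAudio>;

// Any TTS failure yields null (a silent beat) instead of failing the request.
async function synthesizeOrSilent(
  synthesize: Synthesize,
  text: string,
  direction: string,
): Promise<BeatAudio | null> {
  try {
    return await synthesize(text, direction);
  } catch {
    return null; // silent beat, game continues
  }
}

// Per-character delay so the typewriter finishes when the audio does.
// Falls back to a fixed pace (assumed 40 ms) for silent beats.
function typewriterDelayMs(
  text: string,
  durationMs: number | null,
  fallbackMsPerChar = 40,
): number {
  const chars = Array.from(text).length; // count code points, not UTF-16 units
  if (chars === 0) return 0;
  if (durationMs === null || durationMs <= 0) return fallbackMsPerChar;
  return durationMs / chars;
}
```

Returning null rather than throwing lets the frontend treat a failed beat exactly like a muted one.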
Squashed from #3:
- feat(tts): Xiaomi MiMo per-beat voice-over + per-session character voices + free-text delivery direction
- feat(engine): MOCK_IMAGE placeholder image for easier local testing
- fix(tts): address Copilot review on PR #3
- fix(tts): Copilot round-2 review feedback
Known limitation: Session.characters carries the full WAV reference audio (~200-300KB/character base64) and round-trips through every /api/scene, /api/vision, and /api/insert-beat request. This is intrinsic to MiMo's design→clone model (voice identity IS the audio; there is no server-side voiceId). Fixing this requires server-side storage, which is out of scope here; documented for future hardening.
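
To make the overhead concrete, the per-request cost can be estimated from the 200-300 KB figure quoted above. The function name and the example cast size are hypothetical, and the estimate ignores everything in the session except reference audio.

```typescript
// Per-character reference audio size in KB of base64, as quoted in the note above.
const REF_AUDIO_KB_RANGE = { min: 200, max: 300 };

// Rough reference-audio payload carried by each request for a given cast size.
function referenceAudioOverheadKB(characterCount: number): { min: number; max: number } {
  return {
    min: characterCount * REF_AUDIO_KB_RANGE.min,
    max: characterCount * REF_AUDIO_KB_RANGE.max,
  };
}
```

With a hypothetical cast of four voiced characters, each request would carry roughly 800-1200 KB of reference audio.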
🤖 Generated with [Claude Code](https://claude.com/claude-code)