Files
infiplot-web/README.md
T
yuanzonghao addbede929 feat: Vercel Hobby deploy readiness — image URLs, jsonrepair, DeepSeek
- Move vercel.json to apps/web/ with correct route paths; cap scene route
  maxDuration 120→60s for Hobby. Root vercel.json removed. Vercel project's
  Root Directory must be set to apps/web (Deploy button URL passes this).
- Switch image transport from base64-in-JSON to Runware-hosted URLs:
  generateImage now uses outputType=URL and returns {imageUrl, imageUuid};
  StartResponse/SceneResponse carry imageUrl; VisionRequest carries
  prevImageUrl (server re-fetches the bytes for click annotation). This
  eliminates the 4.5MB serverless body-size risk.
- Painter and director prefer URL over UUID for referenceImages — the UUID
  returned by Runware imageInference isn't always recognized in the refs
  pipeline (surfaces as `failedToTransferImage`).
- Client preloads scene images via `new Image().decode()` before committing
  to React state, so URL transitions render instantly; prefetched scenes
  also warm the HTTP cache.
- jsonParser uses the jsonrepair package (replaces hand-rolled repair) and
  adds a targeted preRepair regex for the missing-key-close-quote pattern
  that jsonrepair couldn't disambiguate. Full raw model output dumped on
  failure for diagnostic visibility.
- Default text provider switched to DeepSeek v4-flash via direct API
  (significantly more stable JSON than MiMo v2.5-pro). VISION/TTS stay on
  MiMo (DeepSeek has no multimodal / TTS offerings).
- next.config: drop dead experimental.serverActions.bodySizeLimit (no
  server actions used).
- README: real Deploy button URL (zonghaoyuan/yume + root-directory=apps/web
  + TTS/MOCK_IMAGE in env list); refreshed env vars table with optional
  TTS section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-01 16:04:13 +08:00

5.5 KiB
Raw Blame History

云梦

An AI-driven visual novel painted by an AI, one scene at a time. You talk and explore within a scene; when the story turns a corner, it paints the next. You click. It paints. The story unfolds.


How it works

The story unfolds as a sequence of scenes. Each scene is one AI-painted background plus a short tree of beats — moments of narration, dialogue, and the occasional choice. You tap through a scene's beats and the image stays put; only when a choice leads somewhere genuinely new — another place, a new point of view, a jump in time — does the AI paint the next scene.

entering a scene
        │
        ▼
1. Text LLM     directs the whole scene at once — a background prompt
                plus a tree of beats (narration / dialogue / choices)
        │
        ▼
2. Image model  paints the background once, 16:9, no UI baked in
        │
        ▼
[ tap through beats — no model calls, instant ]
        │
        ├─ in-scene choice ──────▶ jump to another beat (instant)
        │
        └─ scene-change choice ──▶ the next scene
                                   (usually pre-generated — see below)

While you're reading one scene, the engine speculatively generates the scenes your choices could lead to — and, for unavoidable next steps, the scene after that. By the time you pick a direction, its image is usually already painted, so the cut feels instant.

Clicking the background itself (not a button) routes through a vision model: it reads where you tapped and decides whether you're exploring the current scene (it inserts a beat — no new image) or moving on (a new scene).

There is no traditional game UI baked into the art. The AI paints the world in whatever style you pick — "stick figure on grid paper" or "cyberpunk noir" — and the dialogue panel and choice buttons are a light HTML layer drawn on top, tuned to sit over the scene.


One-click deploy

Deploy with Vercel

After deploy, set the environment variables (see below) in your Vercel project. Nine are required; TTS is optional (leave blank to run silently); MOCK_IMAGE=true skips image generation for cheap TTS-only testing. The Vercel project's Root Directory must be set to apps/web (the deploy button passes this; if you configure manually, set it in Project Settings).


Environment variables

Three required providers + optional TTS. Text, Vision, and TTS accept any OpenAI-compatible endpoint (OpenAI, Anthropic via OpenAI-compat proxy, Gemini, OpenRouter, DeepSeek, local Ollama, …). Image goes to Runware (its own task-array protocol, not OpenAI-compatible).

Provider Variables Required? Recommended
Text · story director TEXT_BASE_URL TEXT_API_KEY TEXT_MODEL claude-opus-4-7 via Anthropic
Image · UI renderer IMAGE_BASE_URL IMAGE_API_KEY IMAGE_MODEL runware:400@6 (FLUX.2 [klein] 9B KV) via Runware
Vision · click reader VISION_BASE_URL VISION_API_KEY VISION_MODEL gemini-3-flash via Google
TTS · per-character voice TTS_BASE_URL TTS_API_KEY TTS_SPEECH_MODEL optional — leave blank to run silently mimo-v2.5-tts via Xiaomi MiMo

There's also a flag for cheap testing:

Variable Effect
MOCK_IMAGE=true Skip image generation; the renderer returns a static placeholder. Story, voice, and choices still run normally. Great for iterating on TTS without burning Runware credits.

See apps/web/.env.example for the exact shape.


Local development

Requires Node 20+ and pnpm 9+.

pnpm install
cp apps/web/.env.example apps/web/.env.local
# fill in env vars (9 required + optional TTS/MOCK_IMAGE)
pnpm dev
# open http://localhost:3000

Project layout

yume/
├── apps/web/              Next.js 16 app — pages + API routes (Vercel root)
└── packages/
    ├── types/             shared TypeScript types
    ├── ai-client/         unified OpenAI-compatible clients + Runware adapter
    ├── tts-client/        Xiaomi MiMo TTS adapter
    └── engine/            multi-agent AI orchestration (open core)

packages/engine is the open core — pure TS, no Next.js or browser dependency. Import it directly to build your own visual-novel front-end (Tauri, Electron, CLI, anywhere).


Cost & limits

With the recommended trio, each scene is dominated by the text-LLM call. The FLUX.2 [klein] 9B KV image is roughly $0.001 per scene (1792×1024, 4 steps, sub-second); the text call is the rest. Tapping through a scene's beats is free. To keep transitions instant, the engine also pre-generates scenes you might pick but don't — so real spend runs somewhat higher than the scenes you actually see. There is no rate limiting or auth out of the box — if you make your deployment public, your bill will reflect that. Add limits (and consider lowering the prefetch depth) before sharing widely.