- Restructure overview with scannable bullet lists for capabilities
- Move screenshots up (after overview), reduce from 14 to 6
- Extract configuration guide to docs/configuration{,.en,.ja}.md
- Update Vercel deploy button envLink to point to new config docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5.6 KiB
Configuration guide
InfiPlot talks to four kinds of model providers. Text and Vision use any OpenAI-compatible endpoint, so you can mix and match freely — for Google Gemini, point *_BASE_URL at its OpenAI-compatible endpoint (https://generativelanguage.googleapis.com/v1beta/openai). For Anthropic Claude, a compatible gateway (e.g. LiteLLM) is recommended — Anthropic's official endpoint offers an OpenAI-compatible layer but no caching, which raises cost and latency. Image supports Runware (its own task-array protocol) and OpenAI (gpt-image). TTS supports Xiaomi MiMo (its own voice design / clone protocol — per-character voice design, clone, and per-line delivery direction; free) and StepFun (32 preset voices, auto-matched by AI; paid but better quality).
1. Choose your providers
| Provider | Variables | Required? | Recommended |
|---|---|---|---|
| Text · story director | TEXT_BASE_URL TEXT_API_KEY TEXT_MODEL |
✅ | deepseek-v4-flash via DeepSeek |
| Image · scene renderer | IMAGE_BASE_URL IMAGE_API_KEY IMAGE_MODEL |
✅ | runware:400@6 (FLUX.2 [klein] 9B KV) via Runware |
| Vision · click reader | VISION_BASE_URL VISION_API_KEY VISION_MODEL |
✅ | gemini-3.5-flash via Google |
| TTS · per-character voice | TTS_BASE_URL TTS_API_KEY TTS_SPEECH_MODEL |
optional — leave blank to run silently | mimo-v2.5-tts via Xiaomi MiMo (free); paid alternative: step-tts-2 via StepFun |
Optional · explicit protocol override: each provider slot accepts a
*_PROVIDERvariable (TEXT_PROVIDER/VISION_PROVIDER/IMAGE_PROVIDER) to force a specific protocol. Leave unset for backwards-compatible defaults — text/vision default to OpenAI-compatible, image auto-detects from*_BASE_URL(runware.ai→ Runware, otherwise OpenAI-compatible; models served via OpenAI protocol onrunware.ai— such asimage-2-vip— are handled as OpenAI-compatible; override withIMAGE_PROVIDERwhen needed).
Value Applies to Description openai_compatible(default)Text · Vision · Image OpenAI Chat Completions / /images/generationsopenaiImage OpenAI gpt-image, supports reference-image editingrunwareImage Runware task-array protocol Text and vision only support
openai_compatible. For Gemini, point*_BASE_URLat its OpenAI-compatible endpoint (https://generativelanguage.googleapis.com/v1beta/openai). For Claude, a compatible gateway (e.g. LiteLLM) is recommended — Anthropic's official endpoint offers an OpenAI-compatible layer but no caching, raising cost and latency.
*_BASE_URLworks with or without a trailing/v1(or even a trailing/chat/completions) — the engine normalizes automatically.
2. Set the environment variables
Nine variables are required; TTS is optional (leave blank to run silently). There's also a flag for cheap testing:
| Variable | Effect |
|---|---|
MOCK_IMAGE=true |
Skip image generation; the renderer returns a static placeholder. Story, voice, and choices still run normally. Great for iterating on TTS without burning Runware credits. |
Where to set them (see .env.example for the exact shape):
- Local dev —
.env.local - Vercel — Project Settings → Environment Variables
- Cloudflare Workers — from the repo root, run
wrangler secret put <NAME>for each variable, or set them in the dashboard (Workers → infiplot → Settings → Variables and Secrets). For a private staging instance, gate the Worker behind Cloudflare Access — zero-code email-whitelist auth in front of the Worker.
3. Mind the cost
With the recommended trio, each scene's cost comes mainly from the image generation model. The FLUX.2 [klein] 9B KV image is roughly $0.00078 per scene (1792×1024, 4 steps, sub-second); the text model uses deepseek-v4-flash, so text costs are negligible by comparison. Tapping through a scene's beats is free. To keep transitions instant, the engine also pre-generates scenes you might pick but ultimately don't — so real spend runs somewhat higher than the scenes you actually see.
4. Image proxy (optional)
By default the browser fetches images directly from the provider — no setup needed; leave NEXT_PUBLIC_IMAGE_PROXY_URL blank and you're completely unaffected. You only want this if you hit progressive "top-to-bottom" image loading (Chrome's ERR_QUIC_PROTOCOL_ERROR on some networks paints partial PNGs row by row): deploy a tiny Cloudflare Worker that re-fetches images server-side and serves them atomically over HTTP/2. One-click deploy at infiplot-image-proxy, then paste the workers.dev URL it prints into NEXT_PUBLIC_IMAGE_PROXY_URL.
5. Let players bring their own voice Key (optional, recommended)
Xiaomi rate-limits the TTS model by RPM/TPM. When a public deployment has many people playing at once through a single shared TTS_API_KEY, those limits are easy to hit — the symptom is story and visuals work fine, but there's no audio. To fix this, players can optionally enter their own Xiaomi MiMo key on the homepage (free to obtain). Synthesis then runs browser-direct to Xiaomi, the key stays in the player's browser and never touches your server, and they get stable voice with lower latency. It's purely additive: leave it blank and playback falls back to your server key exactly as before.
See the Bring-your-own voice Key guide for how to obtain and enter one.