docs(readme): restructure for star conversion optimization
- Restructure overview with scannable bullet lists for capabilities
- Move screenshots up (after overview), reduce from 14 to 6
- Extract configuration guide to docs/configuration{,.en,.ja}.md
- Update Vercel deploy button envLink to point to new config docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,52 @@
|
||||
# Configuration guide
|
||||
|
||||
InfiPlot talks to four kinds of model providers. **Text and Vision use any OpenAI-compatible endpoint**, so you can mix and match freely — for Google Gemini, point `*_BASE_URL` at its OpenAI-compatible endpoint (`https://generativelanguage.googleapis.com/v1beta/openai`). For Anthropic Claude, a compatible gateway (e.g. LiteLLM) is recommended — Anthropic's official endpoint offers an OpenAI-compatible layer but no caching, which raises cost and latency. **Image** supports **Runware** (its own task-array protocol) and **OpenAI** (`gpt-image`). **TTS** supports **Xiaomi MiMo** (its own voice design / clone protocol — per-character voice design, clone, and per-line delivery direction; free) and **StepFun** (32 preset voices, auto-matched by AI; paid but better quality).
|
||||
|
||||
## 1. Choose your providers
|
||||
|
||||
| Provider | Variables | Required? | Recommended |
|
||||
|---|---|---|---|
|
||||
| Text · story director | `TEXT_BASE_URL` `TEXT_API_KEY` `TEXT_MODEL` | ✅ | `deepseek-v4-flash` via DeepSeek |
|
||||
| Image · scene renderer | `IMAGE_BASE_URL` `IMAGE_API_KEY` `IMAGE_MODEL` | ✅ | `runware:400@6` (FLUX.2 [klein] 9B KV) via [Runware](https://runware.ai) |
|
||||
| Vision · click reader | `VISION_BASE_URL` `VISION_API_KEY` `VISION_MODEL` | ✅ | `gemini-3.5-flash` via Google |
|
||||
| TTS · per-character voice | `TTS_BASE_URL` `TTS_API_KEY` `TTS_SPEECH_MODEL` | optional — leave blank to run silently | `mimo-v2.5-tts` via Xiaomi MiMo (free); paid alternative: `step-tts-2` via [StepFun](https://www.stepfun.com) |
|
||||
|
||||
> **Optional · explicit protocol override**: each provider slot accepts a `*_PROVIDER` variable (`TEXT_PROVIDER` / `VISION_PROVIDER` / `IMAGE_PROVIDER`) to force a specific protocol. **Leave unset for backwards-compatible defaults** — text/vision default to OpenAI-compatible, image auto-detects from `*_BASE_URL` (`runware.ai` → Runware, otherwise OpenAI-compatible; models served via OpenAI protocol on `runware.ai` — such as `image-2-vip` — are handled as OpenAI-compatible; override with `IMAGE_PROVIDER` when needed).
|
||||
>
|
||||
> | Value | Applies to | Description |
|
||||
> |---|---|---|
|
||||
> | `openai_compatible` (default) | Text · Vision · Image | OpenAI Chat Completions / `/images/generations` |
|
||||
> | `openai` | Image | OpenAI `gpt-image`, supports reference-image editing |
|
||||
> | `runware` | Image | Runware task-array protocol |
|
||||
>
|
||||
> Text and vision **only** support `openai_compatible`. For Gemini, point `*_BASE_URL` at its OpenAI-compatible endpoint (`https://generativelanguage.googleapis.com/v1beta/openai`). For Claude, a compatible gateway (e.g. LiteLLM) is recommended — Anthropic's official endpoint offers an OpenAI-compatible layer but no caching, raising cost and latency.
|
||||
>
|
||||
> `*_BASE_URL` works with or without a trailing `/v1` (or even a trailing `/chat/completions`) — the engine normalizes automatically.
|
||||
|
||||
## 2. Set the environment variables
|
||||
|
||||
Nine variables are required; TTS is optional (leave blank to run silently). There's also a flag for cheap testing:
|
||||
|
||||
| Variable | Effect |
|
||||
|---|---|
|
||||
| `MOCK_IMAGE=true` | Skip image generation; the renderer returns a static placeholder. Story, voice, and choices still run normally. Great for iterating on TTS without burning Runware credits. |
|
||||
|
||||
Where to set them (see `.env.example` for the exact shape):
|
||||
|
||||
- **Local dev** — `.env.local`
|
||||
- **Vercel** — Project Settings → Environment Variables
|
||||
- **Cloudflare Workers** — from the repo root, run `wrangler secret put <NAME>` for each variable, or set them in the dashboard (Workers → infiplot → Settings → Variables and Secrets). For a private staging instance, gate the Worker behind [Cloudflare Access](https://developers.cloudflare.com/cloudflare-one/applications/) — zero-code email-whitelist auth in front of the Worker.
|
||||
|
||||
## 3. Mind the cost
|
||||
|
||||
With the recommended trio, each scene's cost comes mainly from the image generation model. The FLUX.2 [klein] 9B KV image is roughly **\$0.00078** per scene (1792×1024, 4 steps, sub-second); the text model uses `deepseek-v4-flash`, so text costs are negligible by comparison. Tapping through a scene's beats is free. To keep transitions instant, the engine also pre-generates scenes you might pick but ultimately don't — so real spend runs somewhat higher than the scenes you actually see.
|
||||
|
||||
## 4. Image proxy (optional)
|
||||
|
||||
By default the browser fetches images directly from the provider — no setup needed; leave `NEXT_PUBLIC_IMAGE_PROXY_URL` blank and you're completely unaffected. You only want this if you hit progressive "top-to-bottom" image loading (Chrome's `ERR_QUIC_PROTOCOL_ERROR` on some networks paints partial PNGs row by row): deploy a tiny Cloudflare Worker that re-fetches images server-side and serves them atomically over HTTP/2. One-click deploy at **[infiplot-image-proxy](https://github.com/zonghaoyuan/infiplot-image-proxy)**, then paste the `workers.dev` URL it prints into `NEXT_PUBLIC_IMAGE_PROXY_URL`.
|
||||
|
||||
## 5. Let players bring their own voice Key (optional, recommended)
|
||||
|
||||
Xiaomi rate-limits the TTS model by RPM/TPM. When a public deployment has many people playing at once through a single shared `TTS_API_KEY`, those limits are easy to hit — the symptom is **story and visuals work fine, but there's no audio**. To fix this, players can optionally enter **their own** Xiaomi MiMo key on the homepage (free to obtain). Synthesis then runs **browser-direct to Xiaomi**, the **key stays in the player's browser and never touches your server**, and they get stable voice with lower latency. It's purely additive: leave it blank and playback falls back to your server key exactly as before.
|
||||
|
||||
See the [Bring-your-own voice Key guide](xiaomi-tts-key.md) for how to obtain and enter one.
|
||||
Reference in New Issue
Block a user