<!-- BEGIN:nextjs-agent-rules -->
# This is NOT the Next.js you know

This version has breaking changes: APIs, conventions, and file structure may all differ from your training data. Read the relevant guide in `node_modules/next/dist/docs/` before writing any code. Heed deprecation notices.
<!-- END:nextjs-agent-rules -->

# Repository Guidelines

This is the primary working guide for AI coding agents and contributors. It summarizes the repo-specific rules and adds contributor workflow guidance. Prefer it over generic Next.js assumptions.

## Project Structure & First Reads

InfiPlot is a Next.js 16 / React 19 / TypeScript app for AI-driven interactive visual novels (galgame). The server is intentionally stateless: the client carries the full `Session` and sends it to API routes whenever new generation is needed.

- `app/`: App Router pages and API routes. Start here for request/response behavior.
- `app/page.tsx`: Home/custom-start flow, preset cards, style-image upload/parsing, and analytics.
- `app/play/page.tsx`: Client session runtime, speculative scene prefetch, voice retention/stripping, image preload/proxying, orientation locking, and API callers.
- `components/`: Client UI, especially `PlayCanvas.tsx`, `CustomForm.tsx`, `PresetCard.tsx`, `TtsKeyModal.tsx`, and `Analytics.tsx`.
- `lib/types/index.ts`: Shared domain contracts. Read this before changing payload shapes.
- `lib/engine/`: Core story engine. `director.ts` orchestrates scene generation.
- `lib/engine/agents/`: Architect, Writer, CharacterDesigner, Cinematographer, Painter.
- `lib/engine/prompts.ts`: Agent prompts and prompt-cache-sensitive message builders.
- `lib/ai-client/`: Text, image, vision, and retry wrappers.
- `lib/tts-client/`: TTS integration. `stepfun-voices.json` is the single source of truth for the 32 StepFun preset voices (shared by the scorer, CharacterDesigner prompt, `/api/tts-provider`, and the enrich script).
- `lib/config.ts`: Server-side provider/environment loading.
- `lib/presets.ts`, `lib/ttsPresets.ts`, `lib/options.ts`: Home-page presets and selectable options.
- `scripts/`: Asset and preset generation helpers.
- `public/`, `docs/`: Static assets and documentation imagery.

For engine work, read `lib/types/index.ts`, the target agent/orchestrator file, and the API route exposing the behavior. For UI work, inspect the component and the owning page.

## Core Architecture

The engine behaves like `Session + EngineConfig -> SceneResult`. The client appends returned scenes to `session.history`, replaces `session.characters` and `session.storyState`, and sends the updated `Session` back later. Do not introduce server-side session storage, hidden global game state, or persistence unless explicitly requested.
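
As a rough illustration of the client-side update described above, the sketch below applies a returned result to the carried session as a pure function. The interfaces are simplified stand-ins and the `SceneResult` field names are assumptions; the real contracts live in `lib/types/index.ts`.

```typescript
// Hypothetical, simplified shapes; not the repo's actual types.
interface Scene { id: string; imageUrl: string }
interface Character { name: string }
interface StoryState { synopsis: string }

interface Session {
  history: Scene[];
  characters: Character[];
  storyState: StoryState;
}

// Assumed result shape: one new scene plus the full replacement state.
interface SceneResult {
  scene: Scene;
  characters: Character[];
  storyState: StoryState;
}

// Pure update: append the scene, replace characters and story state.
// The input session is left untouched so the caller decides when to commit.
function applySceneResult(session: Session, result: SceneResult): Session {
  return {
    ...session,
    history: [...session.history, result.scene],
    characters: result.characters,
    storyState: result.storyState,
  };
}
```

Because the function has no side effects, the same update can be reused for speculative sessions that may later be discarded.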

The core pipeline is `directScene()` in `lib/engine/director.ts`. Writer is intentionally split into two phases so image generation can begin before full dialogue is ready:

1. Writer Phase A runs serially and produces `WriterPlan`: `sceneSummary`, `sceneKey`, `entryBeatId`, `cast`, `entryActiveCharacters`, and `entrySpeaker`.
2. Writer Phase B starts immediately and overlaps the image pipeline. It produces `beats[]` and `storyStatePatch`, constrained to honor the plan.
3. CharacterDesigner card LLMs and Cinematographer run in parallel from the plan.
4. Entry-beat portraits may block Painter because they become references.
5. Painter generates the scene background from Cinematographer `integratedPrompt` plus `referenceImages`.
6. Non-entry portraits and all voice provisioning should overlap with painting, then Phase B is awaited before scene assembly.

Do not add blocking calls between Writer Phase A completion and Painter start. Anything that can overlap with Phase B or painting should.
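
The overlap rules above can be sketched with plain promises. This is not the code in `director.ts`: `step` and `directSceneSketch` are hypothetical stand-ins that resolve immediately and only record the order in which each stage is started.

```typescript
type Log = string[];

// Stand-in for an agent call: records its start, then resolves with a value.
function step<T>(log: Log, name: string, value: T): Promise<T> {
  log.push(`start:${name}`);
  return Promise.resolve(value);
}

async function directSceneSketch(log: Log): Promise<{ image: string; beats: string[] }> {
  // Phase A is the only serial step: everything else depends on the plan.
  const plan = await step(log, "writerPlan", { cast: ["Aya"], sceneKey: "classroom-dusk" });

  // Phase B starts right away but is deliberately not awaited yet.
  const beatsPromise = step(log, "writerBeats", ["b1", "b2"]);

  // Character cards and Cinematographer run in parallel from the plan.
  const [cards, shot] = await Promise.all([
    step(log, "characterCards", plan.cast),
    step(log, "cinematographer", `prompt for ${plan.sceneKey}`),
  ]);

  // Entry-beat portraits may block Painter because they become references.
  const entryPortraits = await step(log, "entryPortraits", cards.map((c) => `${c}.png`));

  // Painting overlaps with non-entry portraits and voice provisioning.
  const [image] = await Promise.all([
    step(log, "painter", `${shot} + ${entryPortraits.length} refs`),
    step(log, "otherPortraits", [] as string[]),
    step(log, "voices", [] as string[]),
  ]);

  // Only now wait for Phase B, then assemble the scene.
  const beats = await beatsPromise;
  return { image, beats };
}
```

The point of the shape is that `beatsPromise` is created before any image work and awaited last, so no image stage waits on dialogue.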

At session start, `startSession()` runs Architect first to create `storyState`; subsequent scene requests must rely on the client-carried `Session`, not server memory.

## Domain Model Invariants

`Scene` is an image plus a graph of `Beat` nodes. `Beat.next` is either `continue` or `choice`. A scene should have at least one meaningful `change-scene` exit toward a new scene. Beat ids are graph keys; keep them unique and repair references when coercing LLM output.
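
To make the graph invariants concrete, here is a small checker under assumed shapes. The `BeatNext` layout, the `target` field, and the `CHANGE_SCENE` sentinel are hypothetical; consult `lib/types/index.ts` for the real `Beat` contract. It only reports problems and leaves the repair policy to the caller.

```typescript
// Hypothetical beat shapes for illustration only.
type BeatNext =
  | { kind: "continue"; target: string }
  | { kind: "choice"; options: { label: string; target: string }[] };

interface Beat { id: string; next: BeatNext }

const CHANGE_SCENE = "change-scene"; // assumed sentinel for a scene exit

interface GraphReport {
  duplicateIds: string[];
  danglingTargets: string[];
  hasSceneExit: boolean;
}

function inspectBeatGraph(beats: Beat[]): GraphReport {
  // Beat ids are graph keys, so duplicates are reported.
  const seen = new Set<string>();
  const duplicateIds: string[] = [];
  for (const beat of beats) {
    if (seen.has(beat.id)) duplicateIds.push(beat.id);
    seen.add(beat.id);
  }

  // Collect every outgoing edge, whether from a continue or a choice.
  const targets: string[] = [];
  for (const beat of beats) {
    if (beat.next.kind === "continue") {
      targets.push(beat.next.target);
    } else {
      for (const option of beat.next.options) targets.push(option.target);
    }
  }

  return {
    duplicateIds,
    danglingTargets: targets.filter((t) => t !== CHANGE_SCENE && !seen.has(t)),
    hasSceneExit: targets.indexOf(CHANGE_SCENE) !== -1,
  };
}
```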

`SceneHistoryEntry.storyStateAfter` snapshots the story memory after each scene is generated. Keep it when exporting/importing playable story JSON or replaying shared sessions so continuing from a replayed prefix uses the right narrative context.

`StoryState` has stable and volatile zones. Stable fields are set by Architect and must not be patched by Writer: `logline`, `genreTags`, `protagonist`, `castNotes`. Volatile fields may be rewritten every scene: `synopsis`, `openThreads`, `relationships`, `nextHook`. If adding a field, classify it and update `applyStoryStatePatch()` plus Writer coercion.
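
A minimal sketch of the stable/volatile rule follows. The field names come from the paragraph above, but their value types are assumptions, and `applyStoryStatePatchSketch` is an illustration rather than the repo's `applyStoryStatePatch()`.

```typescript
interface StoryState {
  // Stable zone: set by Architect, never patched by Writer.
  logline: string;
  genreTags: string[];
  protagonist: string;
  castNotes: string;
  // Volatile zone: may be rewritten every scene.
  synopsis: string;
  openThreads: string[];
  relationships: string;
  nextHook: string;
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Accepts an untrusted patch and copies over only well-typed volatile fields.
// Stable keys in the patch are ignored rather than rejected.
function applyStoryStatePatchSketch(state: StoryState, patch: Record<string, unknown>): StoryState {
  const next: StoryState = { ...state };

  const synopsis = patch["synopsis"];
  if (typeof synopsis === "string") next.synopsis = synopsis;

  const openThreads = patch["openThreads"];
  if (isStringArray(openThreads)) next.openThreads = [...openThreads];

  const relationships = patch["relationships"];
  if (typeof relationships === "string") next.relationships = relationships;

  const nextHook = patch["nextHook"];
  if (typeof nextHook === "string") next.nextHook = nextHook;

  return next;
}
```

Listing the volatile fields explicitly means a newly added field is not patchable until someone classifies it.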

Characters are identified by `name`. `mergeCharacters()` preserves existing portrait and voice fields when a later design omits them. Do not casually change character matching without checking Writer, Director, and Painter reference handling.
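
The merge rule can be pictured as follows. This is a sketch, not the repo's `mergeCharacters()`; the `portraitUrl` field name and the `voice` shape are assumptions made for the example.

```typescript
// Hypothetical character shape; only `name` as the identity key is from the guide.
interface Character {
  name: string;
  description: string;
  portraitUrl?: string;
  voice?: { provider: string };
}

function mergeCharactersSketch(existing: Character[], incoming: Character[]): Character[] {
  const byName = new Map<string, Character>();
  for (const character of existing) byName.set(character.name, character);

  for (const character of incoming) {
    const previous = byName.get(character.name);
    byName.set(character.name, {
      ...previous,
      ...character,
      // A later design that omits portrait or voice must not erase them.
      portraitUrl: character.portraitUrl ?? previous?.portraitUrl,
      voice: character.voice ?? previous?.voice,
    });
  }
  return Array.from(byName.values());
}
```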

The player POV is hardcoded as second-person Chinese `"你"`. The player should not appear in `activeCharacters`, images, portraits, or TTS. Preserve normalization in Writer and InsertBeat flows.

`orientation` is session-wide and locked at start (`"portrait"` for upright touch devices, otherwise `"landscape"`). It controls prompt framing, generated dimensions, mock images, and `PlayCanvas` layout; preserve back-compat by coercing missing/invalid values to `"landscape"`.
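
The back-compat coercion is small enough to show in full. The helper name `coerceOrientation` is an assumption; the repo may implement the same rule under a different name or inline.

```typescript
type Orientation = "portrait" | "landscape";

// Anything that is not exactly one of the two known values becomes "landscape",
// which keeps sessions saved before the field existed playable.
function coerceOrientation(value: unknown): Orientation {
  return value === "portrait" || value === "landscape" ? value : "landscape";
}
```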

`styleReferenceImage` is an optional client-resized `data:image/...` reference stored in the carried `Session`. It can make request bodies large, so keep validation limits and client resizing intact.

## Agent Output & Error Handling

Agent outputs should follow the existing pattern:

1. Raw LLM type accepts optional and variant fields.
2. Coercion normalizes names, defaults, and malformed values.
3. Repair fixes structural issues.
4. Fallback returns a safe value instead of throwing at the agent boundary.
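
The four steps above might look like this for a trimmed-down plan object. Everything here is illustrative: the `summary` variant key, the fallback values, and the helper names are assumptions, and only a subset of the `WriterPlan` fields is modeled.

```typescript
interface WriterPlan {
  sceneSummary: string;
  sceneKey: string;
  entryBeatId: string;
  cast: string[];
}

const PLAYER = "你";
// Assumed safe defaults; the real fallback values may differ.
const FALLBACK_PLAN: WriterPlan = { sceneSummary: "", sceneKey: "unknown-scene", entryBeatId: "b1", cast: [] };

function asString(value: unknown, fallback: string): string {
  return typeof value === "string" && value.trim() !== "" ? value.trim() : fallback;
}

function coerceWriterPlan(raw: unknown): WriterPlan {
  // Step 4, fallback: return a safe value instead of throwing.
  if (typeof raw !== "object" || raw === null) return { ...FALLBACK_PLAN, cast: [] };

  // Step 1, raw type: every field optional, plus a hypothetical variant key.
  const r = raw as { sceneSummary?: unknown; summary?: unknown; sceneKey?: unknown; entryBeatId?: unknown; cast?: unknown };

  // Step 2, coercion: normalize names and drop malformed entries.
  const rawCast: unknown[] = Array.isArray(r.cast) ? r.cast : [];
  const cast: string[] = [];
  for (const entry of rawCast) {
    if (typeof entry !== "string") continue;
    const name = entry.trim();
    // Step 3, repair: the player never belongs in the cast, and names stay unique.
    if (name !== "" && name !== PLAYER && cast.indexOf(name) === -1) cast.push(name);
  }

  return {
    sceneSummary: asString(r.sceneSummary ?? r.summary, FALLBACK_PLAN.sceneSummary),
    sceneKey: asString(r.sceneKey, FALLBACK_PLAN.sceneKey),
    entryBeatId: asString(r.entryBeatId, FALLBACK_PLAN.entryBeatId),
    cast,
  };
}
```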

Never use direct `JSON.parse()` on core agent LLM output. Use `parseJsonLoose()` from `lib/engine/jsonParser.ts`, which attempts direct parse, fenced JSON extraction, object slicing, and `jsonrepair`. Narrow utility routes may parse first only when they also have a safe fallback, as `/api/parse-style-image` does.
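
For intuition, the first three strategies can be approximated in a few lines. This sketch is not the implementation in `lib/engine/jsonParser.ts`: it omits the `jsonrepair` step so that it stays dependency-free, and the name `parseJsonLooseSketch` is made up for the example.

```typescript
// Matches a fenced block (three backticks, optional "json" tag) without
// writing a literal fence inside this example.
const FENCE_PATTERN = new RegExp("`{3}(?:json)?\\s*([\\s\\S]*?)`{3}", "i");

function parseJsonLooseSketch(text: string): unknown {
  // Strategy 1: the text as-is.
  const candidates: string[] = [text];

  // Strategy 2: the contents of the first fenced block, if any.
  const fenced = FENCE_PATTERN.exec(text);
  const fencedBody = fenced ? fenced[1] : undefined;
  if (fencedBody !== undefined) candidates.push(fencedBody);

  // Strategy 3: slice from the first "{" to the last "}".
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start !== -1 && end > start) candidates.push(text.slice(start, end + 1));

  for (const candidate of candidates) {
    try {
      return JSON.parse(candidate);
    } catch {
      // Fall through to the next strategy.
    }
  }
  // The real helper would try jsonrepair here; this sketch just gives up.
  return undefined;
}
```

Callers still need the coercion and fallback steps afterward, because a successful parse says nothing about the shape of the result.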

Maintain graceful degradation. Existing flows tolerate malformed AI JSON, failed character cards, failed portraits, failed TTS, failed image references, optional analytics, and provider timeouts. Do not convert optional provider failures into hard crashes.

## Visual Continuity & Prompt Caching

`sceneKey` identifies a physical space such as `"classroom-dusk"`. If a new scene shares a key with prior history, the prior scene image should be reused as a reference. Character portraits are also references.

Runware allows at most 4 references. Preserve the priority: style reference image, prior scene, speaker portrait, then other NPCs. Prefer image URLs for `referenceImages` when needed because Runware can fail to recognize UUIDs. The native OpenAI image path (gpt-image) can also accept references via `images.edit`, but returns data URIs and synthetic UUIDs, so repeated session transport is heavier than Runware's URL/UUID loop.
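
The priority order and the cap of 4 can be expressed as a short selection function. The `ReferenceCandidates` shape and the function name are assumptions for illustration; Painter's actual reference collection may be structured differently.

```typescript
// Hypothetical input: one optional slot per priority tier, plus the remaining NPCs.
interface ReferenceCandidates {
  styleReference?: string;
  priorSceneImage?: string;
  speakerPortrait?: string;
  otherNpcPortraits: string[];
}

const MAX_REFERENCES = 4;

function pickReferenceImages(candidates: ReferenceCandidates): string[] {
  // Order encodes priority: style, prior scene, speaker, then other NPCs.
  const ordered = [
    candidates.styleReference,
    candidates.priorSceneImage,
    candidates.speakerPortrait,
    ...candidates.otherNpcPortraits,
  ];

  const picked: string[] = [];
  for (const url of ordered) {
    // Skip empty slots and duplicates so a slot is never wasted.
    if (url && picked.indexOf(url) === -1) picked.push(url);
    if (picked.length === MAX_REFERENCES) break;
  }
  return picked;
}
```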

Writer prompt caching depends on `buildWriterPlanUserMessage()` and `buildWriterBeatsUserMessage()` keeping their stable prefixes intact: world, style, story spine, archived history, known scene keys, and character list. The dynamic suffix contains current state, last beat, exit hint, and the current plan. Do not reorder or reformat stable prefix sections casually; it can destroy cache hit rates.
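
The reason ordering matters is that provider-side prompt caches match on a leading run of identical text. The sketch below, with hypothetical section labels and type names, shows a builder in which two requests for the same session share a byte-identical prefix and differ only in the suffix.

```typescript
interface StableContext {
  world: string;
  style: string;
  storySpine: string;
  archivedHistory: string[];
  knownSceneKeys: string[];
  characters: string[];
}

interface DynamicContext {
  currentState: string;
  lastBeat: string;
  exitHint: string;
  currentPlan: string;
}

// Stable sections come first, always in the same order and format.
function buildStablePrefix(stable: StableContext): string {
  return [
    "## World", stable.world,
    "## Style", stable.style,
    "## Story spine", stable.storySpine,
    "## Archived history", stable.archivedHistory.join("\n"),
    "## Known scene keys", stable.knownSceneKeys.join(", "),
    "## Characters", stable.characters.join(", "),
  ].join("\n");
}

// Per-request content goes last so it never disturbs the cached prefix.
function buildWriterMessageSketch(stable: StableContext, dynamic: DynamicContext): string {
  const suffix = [
    "## Current state", dynamic.currentState,
    "## Last beat", dynamic.lastBeat,
    "## Exit hint", dynamic.exitHint,
    "## Current plan", dynamic.currentPlan,
  ].join("\n");
  return buildStablePrefix(stable) + "\n" + suffix;
}
```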

## API Flow

Common routes live under `app/api/`:

- `POST /api/start`: starts a session via Architect then `directScene()`.
- `POST /api/scene`: generates the next scene from an existing session.
- `POST /api/vision`: interprets scene-image clicks.
- `POST /api/insert-beat`: creates a transient beat without image generation.
- `POST /api/beat-audio`: lazy TTS for a displayed beat; returns binary audio, or `204` when silent. `voice` is optional: when the server runs StepFun, the client omits the ~220KB Xiaomi reference audio and sends `stepfunVoiceId` / `voiceDescription` instead (saves Fast Origin Transfer bandwidth). The engine re-provisions on a provider mismatch before synthesizing.
- `POST /api/parse-style-image`: extracts a style prompt from uploaded reference art.
- `GET /api/tts-provider`: returns `{ provider: "stepfun" | "xiaomi" | null }` (the server's TTS provider, inferred from `TTS_BASE_URL`). Probed once at `/play` mount (non-BYO) so `fetchBeatAudio` can shape its request body, skipping the ~220KB Xiaomi reference audio when the server runs StepFun. BYO client TTS takes precedence over this signal.
- `POST /api/story-pack` / `POST /api/story-unpack`: stateless AES-GCM packing/unpacking for playable story share `.infiplot` files; uses `GALLERY_SECRET`.
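
The request shaping described for `/api/beat-audio` and `/api/tts-provider` can be sketched as a pure function. The `text` field, the `CharacterLike` shape, and the function name are assumptions; only `voice`, `stepfunVoiceId`, `voiceDescription`, and `characterName` are named by this guide or its types.

```typescript
type TtsProvider = "stepfun" | "xiaomi" | null;

// Hypothetical client-side view of a character's voice data.
interface CharacterLike {
  name: string;
  voiceDescription?: string;
  stepfunVoiceId?: string;
  voice?: { referenceAudioBase64: string };
}

interface BeatAudioBody {
  text: string;
  characterName?: string;
  voiceDescription?: string;
  stepfunVoiceId?: string;
  voice?: { referenceAudioBase64: string };
}

function shapeBeatAudioBody(provider: TtsProvider, text: string, character: CharacterLike): BeatAudioBody {
  if (provider === "stepfun") {
    // Lightweight hints only: the heavy reference audio is left out entirely.
    return {
      text,
      characterName: character.name,
      voiceDescription: character.voiceDescription,
      stepfunVoiceId: character.stepfunVoiceId,
    };
  }
  // Xiaomi or unknown provider: keep sending the retained voice as before.
  return { text, characterName: character.name, voice: character.voice };
}
```

Treating an unknown provider like the legacy path keeps the client working against a server that predates the probe route.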

When changing public types or route payloads, update all route callers and client consumers in the same change.

All API routes currently run on `runtime = "nodejs"`. Keep Cloudflare implications in mind before adding Node-only dependencies to code that should also work in browser/client or OpenNext builds.

The client deliberately strips `voice.referenceAudioBase64` from `Session` before `/api/scene`, `/api/vision`, and `/api/insert-beat` transport, then merges voices back locally. Server responses strip already-known voices to reduce payload size. Preserve this first-load/request-size behavior when changing character or TTS flow.
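
A sketch of the strip-then-merge round trip is shown below. The `Voice` and `Character` shapes are simplified assumptions and the two function names are hypothetical; the real logic lives in `app/play/page.tsx`.

```typescript
interface Voice { provider: string; referenceAudioBase64?: string }
interface Character { name: string; voice?: Voice }

// Before transport: drop the heavy audio but keep the rest of the voice record.
function stripVoiceAudio(characters: Character[]): Character[] {
  return characters.map((character) => {
    if (!character.voice) return character;
    const { referenceAudioBase64: _dropped, ...lightVoice } = character.voice;
    return { ...character, voice: lightVoice };
  });
}

// After the response: restore locally retained voices where the server copy has no audio.
function mergeVoicesBack(fromServer: Character[], local: Character[]): Character[] {
  const localVoices = new Map<string, Voice>();
  for (const character of local) {
    if (character.voice) localVoices.set(character.name, character.voice);
  }
  return fromServer.map((character) => {
    const kept = localVoices.get(character.name);
    const serverHasAudio = Boolean(character.voice && character.voice.referenceAudioBase64);
    return kept && !serverHasAudio ? { ...character, voice: kept } : character;
  });
}
```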

`clientTts: true` means the browser owns Xiaomi TTS keys and provisions/synthesizes voices locally; routes must drop `config.tts` so server-side TTS is skipped and user keys never touch the server.

`app/play/page.tsx` speculatively prefetches future `/api/scene` responses up to `PREFETCH_MAX_DEPTH`. If scene/session shape changes, update speculative session construction, cache re-rooting, abort logic, and voice/image preload handling together.

## Build, Test, and Development Commands

Use pnpm with Node >=22. `pnpm-lock.yaml` is the source of truth; `package-lock.json` is legacy and should not be updated unless requested.

- `pnpm dev`: local Next.js dev server.
- `pnpm build`: production build for Vercel/default target.
- `pnpm start`: run production server after building.
- `pnpm lint`: Next.js built-in lint.
- `pnpm typecheck`: `tsc --noEmit`.
- `pnpm enrich:firstacts`: one-off enrichment of `public/home/firstact{,-portrait}/*.json` that adds `characters[i].stepfunVoiceId` via a TEXT-provider LLM call per character (uses `.env.local`). Idempotent; `--force` re-picks, `--only=f0,f1` filters, `--portrait` targets the portrait set.
- `pnpm build:cf`: Cloudflare Workers build through OpenNext.
- `pnpm preview:cf`: local Cloudflare preview.
- `pnpm deploy:cf`: Cloudflare deploy.

There is no dedicated test framework, no Prettier config, and no standalone ESLint config. Before handing off code changes, run `pnpm typecheck` and `pnpm lint`; run `pnpm build` for routing, deployment, or provider initialization changes.

## Coding Style & Imports

Write TypeScript with 2-space indentation, double quotes, semicolons, and ESM imports. Prefer named exports for shared helpers and components when practical.

Use aliases from `tsconfig.json`: `@/*`, `@infiplot/engine`, `@infiplot/ai-client`, `@infiplot/tts-client`, and `@infiplot/types`. Avoid deep relative import chains when an alias exists.

React components use PascalCase. Hooks, helpers, variables, and functions use camelCase. Types and interfaces use PascalCase. Route folders follow Next.js App Router conventions. UI work should follow the existing Tailwind-heavy visual language.

Modal/dialog UI should be extracted into dedicated components instead of being inlined inside large page or canvas components. Keep the host responsible for open/close state and domain data, and keep the modal component responsible for dialog layout, overlay behavior, keyboard close handling, scroll containers, and modal-specific styling.

Comment only non-obvious sequencing, provider quirks, fallback behavior, or architectural invariants.

## Configuration & Providers

Use `.env.example` as the source of truth. Never commit `.env.local`, API keys, uploaded user content, or generated secrets.

- Text and Vision use `TEXT_*` and `VISION_*` over the `openai_compatible` protocol (the only supported text/vision protocol); Claude and Gemini are reached via their own OpenAI-compatible endpoints with the `*_PROVIDER` var unset.
- Image uses `IMAGE_*`; supported protocols are `runware`, `openai_compatible`, and native `openai`. When `IMAGE_PROVIDER` is unset, Runware is inferred from `*.runware.ai` URLs and otherwise falls back to OpenAI-compatible image generations.
- `IMAGE_TIMEOUT_MS` (per-attempt hard deadline) and `IMAGE_HEDGE_MS` (Painter scene-paint hedging: race a second request when the first is still pending after the threshold) are both OFF when unset; the default path must stay byte-identical to historical behavior. Hedging applies only to the Tier-A scene paint, never to portraits, and never fires after a fast failure (saturation guard). Client-side engine configs (`resolveEngineConfig`) intentionally do not set these fields.
- TTS supports Xiaomi MiMo (voicedesign + voiceclone) or StepFun (preset voices), inferred from `TTS_BASE_URL` (host containing `stepfun.com` → StepFun, otherwise → MiMo). `CharacterVoice` is a discriminated union on `provider`; synth dispatches on the voice's own tag so a session may carry both shapes through a provider switch. Blank config means silent mode. StepFun voice selection: when `config.tts` is StepFun, the CharacterDesigner LLM picks a preset id directly from the 32-entry catalog (`lib/tts-client/stepfun-voices.json`, rendered by `formatStepfunCatalogForPrompt`) in the same call that designs the character, so no extra LLM call is needed. `pickStepfunVoiceId` (keyword scorer) is the fallback for orphan speakers and invalid picks. Prebaked homepage cards are enriched with `Character.stepfunVoiceId` via `scripts/enrich-firstacts-stepfun.mjs` so a card works under either server provider.
- `MOCK_IMAGE=true` skips image generation and returns a placeholder for cheap local iteration.
- `NEXT_PUBLIC_IMAGE_PROXY_URL` and `NEXT_PUBLIC_IMAGE_PROXY_ALLOWED_HOSTS` opt into browser-side image proxying for allowed hosts.
- Analytics uses optional Umami `NEXT_PUBLIC_UMAMI_*` values and must stay content-free/privacy-preserving.
- `GALLERY_SECRET` enables encrypted `.infiplot` share files for gallery and playable story export/import.
- `NEXT_PUBLIC_*` values are inlined at build time.
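
The TTS bullet above describes two small decisions that are easy to get subtly wrong: inferring the provider from `TTS_BASE_URL`, and falling back when the LLM returns an id that is not in the catalog. The sketch below illustrates both under stated assumptions. The function names are hypothetical, the host is extracted with a simple regular expression rather than whatever `lib/config.ts` actually does, and the voice ids used with it are placeholders rather than real catalog entries.

```typescript
type TtsProvider = "stepfun" | "xiaomi" | null;

// Captures the host part of an absolute URL such as "https://host:port/path".
const HOST_PATTERN = /^[a-z][a-z0-9+.-]*:\/\/([^\/?#:]+)/i;

function inferTtsProvider(baseUrl: string | undefined): TtsProvider {
  // Blank config means silent mode.
  if (!baseUrl || baseUrl.trim() === "") return null;
  const match = HOST_PATTERN.exec(baseUrl.trim());
  const host = (match ? match[1] : baseUrl).toLowerCase();
  // Host containing "stepfun.com" selects StepFun; anything else selects MiMo.
  return host.indexOf("stepfun.com") !== -1 ? "stepfun" : "xiaomi";
}

// Accept the LLM's pick only if it is a known catalog id; otherwise defer to
// a fallback such as a keyword scorer.
function resolveStepfunVoiceId(llmPick: unknown, catalogIds: string[], fallback: () => string): string {
  return typeof llmPick === "string" && catalogIds.indexOf(llmPick) !== -1 ? llmPick : fallback();
}
```

Validating against the catalog at the boundary means a hallucinated id degrades to the scorer's choice instead of a failed synth call.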

## File Dependency Map

- If modifying Writer, also check `director.ts`, `prompts.ts`, WriterPlan/StoryState types, and Cinematographer/Painter consumers.
- If modifying CharacterDesigner, check Director scheduling/merge logic, portrait prompts, voice provisioning, Painter reference collection, and (StepFun path) the `buildCharacterDesignerSystem` catalog injection + `stepfunVoiceId` validation.
- If modifying the StepFun voice catalog (`lib/tts-client/stepfun-voices.json`), also check `formatStepfunCatalogForPrompt`, `isValidStepfunVoiceId`, the CharacterDesigner system prompt, and the enrich script.
- If modifying Cinematographer or Painter, check Director, prompt builders, provider image options, orientation handling, and reference priority.
- If modifying Architect, check `orchestrator.ts`, `prompts.ts`, and StoryState patch rules.
- If modifying `lib/types/index.ts`, check all agents, Director, Orchestrator, API routes, and client consumers in `app/page.tsx`, `app/play/page.tsx`, and `components/PlayCanvas.tsx`.
- If modifying TTS, check server `beat-audio` (including the `resolveVoice` provider-mismatch normalization), `/api/tts-provider`, BYO client TTS, voice stripping/merging, payload privacy, and the StepFun voice-id flow (CharacterDesigner → provision → synth).
- If modifying image delivery, check Painter, `lib/ai-client/image.ts`, mock images, orientation dimensions, preload/proxy logic, and style-reference validation.

## Guide Maintenance

After any refactor, architecture change, provider-client rewrite, public type change, new route, payload-shape change, or major UI flow change, reread the affected files and compare them against this `AGENTS.md`. Update `AGENTS.md` in the same change if the architecture, commands, invariants, dependency map, environment variables, or "What Not To Do" list drifted. The canonical filename is `AGENTS.md`; treat mentions like `AGETNS.md` as typos and repair the real file.

## Commit & Pull Request Guidelines

Follow observed Conventional Commit style: `feat(web): ...`, `fix(play): ...`, `perf(engine): ...`, `chore(engine): ...`.

PRs should include a short behavior summary, validation commands run, linked issues when relevant, screenshots or recordings for UI changes, and notes for environment, provider, deployment, or payload-shape changes.

## What Not To Do

- Do not make the server stateful.
- Do not generate images, portraits, or TTS for `"你"`.
- Do not let Writer patch stable `StoryState` fields.
- Do not reorder the Writer stable prompt prefix without a clear cache-aware reason.
- Do not assume Runware UUID references always work.
- Do not remove fallbacks, timeout handling, analytics privacy constraints, or reference priority rules.
- Do not leak browser-provided TTS keys to the server or send retained voice audio through scene/vision/insert-beat session payloads.
- Do not break session-locked orientation or style-reference propagation when changing start/play flows.
- Do not regenerate large assets in `public/` unless the user requested asset work.
- Do not mix prompt refactors, provider-client rewrites, UI restyling, and deployment changes in one narrow task.