Merge pull request #36 from zonghaoyuan/staging

Release staging to production
2026-06-06 18:39:44 +08:00
parent 1b1d5ce1c5 aed05a0512
commit 8cfb2d2860
36 changed files with 2347 additions and 401 deletions
@@ -3,14 +3,18 @@
 # Recommended setup: Xiaomi MiMo Token Plan for TEXT / VISION / TTS
 # (one API key covers all three) + Runware for IMAGE (FLUX.2 [klein]).
 #
-# TEXT / VISION use any OpenAI-compatible endpoint (any OpenAI-
-# compatible host works: OpenRouter, OpenAI, Anthropic via proxy,
-# Gemini, DeepSeek, Ollama, ...).
+# TEXT / VISION default to any OpenAI-compatible endpoint, and can switch to
+# native Anthropic or Google Gemini via TEXT_PROVIDER / VISION_PROVIDER.
 # TTS uses Xiaomi MiMo's own voice design / clone protocol
 # (not OpenAI-compatible; appends -voicedesign / -voiceclone).
 #
-# IMAGE uses Runware's own task-array protocol (not OpenAI-compatible);
-# the adapter posts an `imageInference` task to IMAGE_BASE_URL.
+# IMAGE supports Runware (its own task-array protocol), OpenAI (gpt-image),
+# and Google Gemini (Nano Banana) via IMAGE_PROVIDER.
+#
+# *_PROVIDER (optional) selects the wire protocol; leave unset for the
+# OpenAI-compatible default (image is auto-detected from the URL). Base URLs
+# tolerate a missing or extra /v1 (or a trailing /chat/completions) — the
+# engine normalizes them.
 # =============================================================
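
A minimal sketch of the base-URL normalization described in the header comment above, assuming `/v1` is the version prefix the adapter re-attaches. `normalizeBaseUrl` and `chatCompletionsUrl` are hypothetical names; the engine's actual normalizer is not part of this hunk and may differ.

```typescript
// Hypothetical helper (not the engine's real code): reduce a user-supplied
// base URL to its bare origin + path prefix, so a missing /v1, an extra /v1,
// or a pasted /chat/completions all end up in the same canonical form.
function normalizeBaseUrl(raw: string): string {
  let url = raw.trim().replace(/\/+$/, ""); // drop trailing slashes
  url = url.replace(/\/chat\/completions$/i, ""); // drop a pasted endpoint path
  url = url.replace(/\/+$/, "");
  url = url.replace(/\/v1$/i, ""); // drop an explicit /v1
  return url;
}

// Assumption: the OpenAI-compatible adapter always calls /v1/chat/completions.
function chatCompletionsUrl(rawBase: string): string {
  return `${normalizeBaseUrl(rawBase)}/v1/chat/completions`;
}
```

Under these assumptions, `https://api.deepseek.com`, `https://api.deepseek.com/v1`, and `https://api.deepseek.com/v1/chat/completions` all resolve to the same request URL.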

 # ---- 1. Text LLM · scene director ----------------------------------
@@ -26,6 +30,10 @@
 TEXT_BASE_URL=https://api.deepseek.com/v1
 TEXT_API_KEY=sk-xxx
 TEXT_MODEL=deepseek-v4-flash
+# TEXT_PROVIDER: openai_compatible (default) | anthropic | google
+#   anthropic → TEXT_BASE_URL=https://api.anthropic.com  TEXT_MODEL=claude-sonnet-4-6
+#   google    → TEXT_BASE_URL=https://generativelanguage.googleapis.com  TEXT_MODEL=gemini-3.5-flash
+# TEXT_PROVIDER=openai_compatible

 # ---- 2. Image generator (renders the scene background) -------------
 # Recommended: Runware + FLUX.2 [klein] 9B KV — distilled 4-step model,
@@ -36,12 +44,27 @@ TEXT_MODEL=deepseek-v4-flash
 IMAGE_BASE_URL=https://api.runware.ai/v1
 IMAGE_API_KEY=runware-xxx
 IMAGE_MODEL=runware:400@6
+# IMAGE_PROVIDER: runware (auto-detected for runware.ai) | openai_compatible
+#                 | openai | google
+#   openai → gpt-image, supports referenceImages (character/scene continuity).
+#            IMAGE_BASE_URL=https://api.openai.com  IMAGE_MODEL=gpt-image-1
+#   google → Gemini "Nano Banana" (Imagen is EOL 2026-06-24, do not use it).
+#            IMAGE_BASE_URL=https://generativelanguage.googleapis.com
+#            IMAGE_MODEL=gemini-2.5-flash-image
+# NOTE: openai/google return raw bytes → inlined as a data: URI for the session
+# (heavier per-call transport than Runware's UUID re-reference loop). Runware
+# stays fastest + cheapest for the scene-by-scene flow.
+# IMAGE_PROVIDER=runware
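
A rough sketch of how `IMAGE_PROVIDER` could be resolved, assuming an explicit value always wins and the fallback looks only at the host of `IMAGE_BASE_URL`. `resolveImageProvider` is a hypothetical name, and the sketch does not model any per-model exceptions the engine may apply on `runware.ai`.

```typescript
type ImageProvider = "runware" | "openai_compatible" | "openai" | "google";

const IMAGE_PROVIDERS: readonly ImageProvider[] = [
  "runware",
  "openai_compatible",
  "openai",
  "google",
];

// Hypothetical resolver: an explicit IMAGE_PROVIDER wins; otherwise a
// runware.ai host selects Runware and everything else is OpenAI-compatible.
function resolveImageProvider(baseUrl: string, explicit?: string): ImageProvider {
  const wanted = explicit?.trim().toLowerCase();
  if (wanted) {
    const match = IMAGE_PROVIDERS.find((p) => p === wanted);
    if (!match) throw new Error(`Unknown IMAGE_PROVIDER: ${explicit}`);
    return match;
  }
  // Extract the host with plain string operations (scheme stripped, then cut
  // at the first path, port, query, or fragment delimiter).
  const withoutScheme = baseUrl.trim().replace(/^[a-z][a-z0-9+.-]*:\/\//i, "");
  const [first = ""] = withoutScheme.split(/[/:?#]/);
  const host = first.toLowerCase();
  const isRunware = host === "runware.ai" || host.endsWith(".runware.ai");
  return isRunware ? "runware" : "openai_compatible";
}
```

Leaving the variable unset keeps existing `.env` files working, while setting it explicitly allows OpenAI or Google image endpoints without relying on host detection.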

 # ---- 3. Vision model · multimodal click interpretation -------------
 # Recommended: MiMo V2.5 — multimodal, accepts image_url content parts.
 VISION_BASE_URL=https://token-plan-sgp.xiaomimimo.com/v1
 VISION_API_KEY=tp-xxx
 VISION_MODEL=mimo-v2.5
+# VISION_PROVIDER: openai_compatible (default) | anthropic | google
+#   anthropic → VISION_BASE_URL=https://api.anthropic.com  VISION_MODEL=claude-sonnet-4-6
+#   google    → VISION_BASE_URL=https://generativelanguage.googleapis.com  VISION_MODEL=gemini-3.5-flash
+# VISION_PROVIDER=openai_compatible

 # ---- 4. TTS · Xiaomi MiMo (optional — leave blank to disable) ------
 # Per-character voice design → clone, with per-line delivery direction.
@@ -159,6 +159,12 @@ With the recommended trio, each scene's cost comes mainly from the image generat

 By default the browser fetches images directly from the provider — no setup needed; leave `NEXT_PUBLIC_IMAGE_PROXY_URL` blank and you're completely unaffected. You only want this if you hit progressive "top-to-bottom" image loading (Chrome's `ERR_QUIC_PROTOCOL_ERROR` on some networks paints partial PNGs row by row): deploy a tiny Cloudflare Worker that re-fetches images server-side and serves them atomically over HTTP/2. One-click deploy at **[infiplot-image-proxy](https://github.com/zonghaoyuan/infiplot-image-proxy)**, then paste the `workers.dev` URL it prints into `NEXT_PUBLIC_IMAGE_PROXY_URL`.

+**5. Let players bring their own voice Key (optional, recommended)**
+
+Xiaomi rate-limits the TTS model by RPM/TPM. When a public deployment has many people playing at once through a single shared `TTS_API_KEY`, those limits are easy to hit — the symptom is **story and visuals work fine, but there's no audio**. To fix this, players can optionally enter **their own** Xiaomi MiMo key on the homepage (free to obtain). Synthesis then runs **browser-direct to Xiaomi**, the **key stays in the player's browser and never touches your server**, and they get stable voice with lower latency. It's purely additive: leave it blank and playback falls back to your server key exactly as before.
+
+See the [Bring-your-own voice Key guide](docs/xiaomi-tts-key.md) for how to obtain and enter one.
+
 ---

 ## Roadmap
@@ -158,6 +158,12 @@ InfiPlot は 4 種類のモデルプロバイダと通信します。**テキス

 By default the browser accesses the image provider directly, so no setup is needed: leave `NEXT_PUBLIC_IMAGE_PROXY_URL` blank and you are entirely unaffected. You only need this if images render "top to bottom" (on some networks, Chrome's `ERR_QUIC_PROTOCOL_ERROR` causes PNGs to be painted row by row). Deploying a small Cloudflare Worker re-fetches images server-side and returns them in one piece over HTTP/2. For one-click deploy see **[infiplot-image-proxy](https://github.com/zonghaoyuan/infiplot-image-proxy)**, then set the `workers.dev` URL it prints as `NEXT_PUBLIC_IMAGE_PROXY_URL`.

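+**5. Players' own voice Key (optional, recommended)**
+
+Xiaomi imposes RPM/TPM limits on the TTS model. When many players on a public deployment play at the same time while sharing a single `TTS_API_KEY`, those limits are easy to reach, and the symptom is that **story and images work fine but there is no audio**. As a remedy, players can optionally enter **their own** Xiaomi MiMo Key (free to obtain) on the top page. Synthesis then runs **directly from the browser to Xiaomi**, and the **Key is stored only in the player's browser and never passes through your server**. This gives stable voice and low latency. It is purely additive: if nothing is entered, playback falls back to the server-side Key as before.
+
+For how to obtain and enter one, see the [Bring-your-own voice Key guide](docs/xiaomi-tts-key.md).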
+
 ---

 ## Roadmap
@@ -125,7 +125,7 @@ InfiPlot 同时支持部署到 Vercel 与 Cloudflare Workers。Cloudflare 部署

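 ## Configuration guide

-InfiPlot talks to four kinds of model providers. **Text and Vision both use OpenAI-compatible endpoints** and can be mixed freely. **Image** currently uses **Runware** (its own task-array protocol, not OpenAI-compatible). **TTS** uses **Xiaomi MiMo**'s own voice design/clone protocol, with per-character voice design, cloning, and per-line delivery direction.
+InfiPlot talks to four kinds of model providers. **Text and Vision** default to OpenAI-compatible endpoints and can also switch natively to **Anthropic** or **Google Gemini**. **Image** supports **Runware** (its own task-array protocol), **OpenAI** (`gpt-image`), and **Google Gemini** (Nano Banana). **TTS** uses **Xiaomi MiMo**'s own voice design/clone protocol, with per-character voice design, cloning, and per-line delivery direction.

 **1. Choose your providers**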

@@ -136,6 +136,18 @@ InfiPlot 会与四类模型供应商通信。**文本（Text）和视觉（Visio
 | Vision · click interpretation  | `VISION_BASE_URL` `VISION_API_KEY` `VISION_MODEL`  | ✅ | Google's `gemini-3.5-flash` |
 | TTS · character voice | `TTS_BASE_URL` `TTS_API_KEY` `TTS_SPEECH_MODEL` | Optional: leave blank to run silently | Xiaomi MiMo's `mimo-v2.5-tts` |

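+> **Optional · choose the wire protocol**: each model type accepts a `*_PROVIDER` variable (`TEXT_PROVIDER` / `VISION_PROVIDER` / `IMAGE_PROVIDER`) that explicitly selects the wire protocol. **Leaving it unset stays backward compatible**: Text/Vision default to the OpenAI-compatible interface, and Image is auto-detected from `*_BASE_URL` (`runware.ai` → Runware, otherwise OpenAI-compatible; the few models served on `runware.ai` over the OpenAI protocol, such as `image-2-vip`, are treated as OpenAI-compatible, and you can override this explicitly with `IMAGE_PROVIDER` when needed).
+>
+> | Value | Applies to | Notes |
+> |---|---|---|
+> | `openai_compatible` (default) | Text · Vision · Image | OpenAI Chat Completions / `/images/generations` |
+> | `anthropic` | Text · Vision | Native Anthropic Messages API |
+> | `google` | Text · Vision · Image | Native Gemini; for images use the Nano Banana family (e.g. `gemini-2.5-flash-image`; **do not use Imagen (deprecated, shut down on 2026-06-24)**) |
+> | `openai` | Image | OpenAI `gpt-image`, supports reference-image editing |
+> | `runware` | Image | Runware task-array protocol |
+>
+> In addition, `*_BASE_URL` works with or without `/v1` (even with a trailing `/chat/completions`): the engine normalizes it automatically.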
+
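 **2. Fill in the environment variables**

 Nine variables are required; TTS is optional (leave blank to run silently). There is also one switch for low-cost testing: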
@@ -158,6 +170,12 @@ InfiPlot 会与四类模型供应商通信。**文本（Text）和视觉（Visio

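 By default the browser connects directly to the image provider with no configuration needed: just leave `NEXT_PUBLIC_IMAGE_PROXY_URL` blank and you are entirely unaffected. You only need it if images load "layer by layer" (on some networks, Chrome's `ERR_QUIC_PROTOCOL_ERROR` causes PNGs to render row by row): deploy a tiny Cloudflare Worker that relays images server-side and returns them atomically over HTTP/2. For one-click deploy see **[infiplot-image-proxy](https://github.com/zonghaoyuan/infiplot-image-proxy)**, then put the `workers.dev` address it gives you into `NEXT_PUBLIC_IMAGE_PROXY_URL`.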

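+**5. Players bring their own voice Key (optional, recommended)**
+
+Xiaomi applies RPM/TPM limits to the TTS model. When your public deployment has many people playing at once and sharing a single `TTS_API_KEY`, the limits are easy to hit, and the symptom is **story and visuals work fine, but there is no sound**. To address this, players can optionally enter **their own** Xiaomi MiMo Key on the homepage (free to apply for). Voice requests are then made **browser-direct to Xiaomi**, and **the Key stays on the player's device and never passes through your server**, giving stable voice and lower latency. It is a pure enhancement: if left blank, the server Key you deployed is used as before, with no change in behavior.
+
+For how to apply for and enter one, see the [Bring-your-own voice Key tutorial](docs/xiaomi-tts-key.md).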
+
 ---

 ## Roadmap
@@ -4,9 +4,6 @@ import { NextResponse } from "next/server";
 import { loadEngineConfig } from "@/lib/config";

 export const runtime = "nodejs";
-// The synth itself has a 15s per-call ceiling in the engine. 30s here just
-// covers JSON parsing + outbound network buffer.
-export const maxDuration = 30;

 export async function POST(req: Request) {
  let body: BeatAudioRequest;
@@ -26,7 +23,11 @@ export async function POST(req: Request) {
  try {
    const config = loadEngineConfig();
    const result = await requestBeatAudio(config, body);
-    return NextResponse.json(result);
+    if (!result.audio) return new Response(null, { status: 204 });
+    const binary = Buffer.from(result.audio.base64, "base64");
+    return new Response(binary, {
+      headers: { "Content-Type": result.audio.mime },
+    });
  } catch (err) {
    // Engine already swallows synth errors and returns audio:null. Anything
    // that reaches here is config-level — surface so the client can log it.
@@ -4,7 +4,6 @@ import { NextResponse } from "next/server";
 import { loadEngineConfig } from "@/lib/config";

 export const runtime = "nodejs";
-export const maxDuration = 60;

 export async function POST(req: Request) {
  let body: InsertBeatRequest;
@@ -22,9 +21,14 @@ export async function POST(req: Request) {
  }

  try {
-    const config = loadEngineConfig();
+    const base = loadEngineConfig();
+    // See StartRequest.clientTts — BYO clients synth in-browser, so drop server TTS.
+    const config = body.clientTts === true ? { ...base, tts: undefined } : base;
    const result = await requestInsertBeat(config, body);
-    return NextResponse.json(result);
+    return NextResponse.json({
+      ...result,
+      characters: result.characters.map((c) => ({ ...c, voice: undefined })),
+    });
  } catch (err) {
    const message = err instanceof Error ? err.message : "Unknown error";
    return NextResponse.json({ error: message }, { status: 500 });
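
The `clientTts` switch used by the route handlers above can be isolated as a pure function. This is a minimal sketch with hypothetical stand-in types (`DemoEngineConfig`, `DemoTtsSettings`); the real `loadEngineConfig()` result has more fields than shown here.

```typescript
// Hypothetical stand-ins for the engine config; only the `tts` field matters.
interface DemoTtsSettings {
  baseUrl: string;
  apiKey: string;
}
interface DemoEngineConfig {
  textModel: string;
  tts?: DemoTtsSettings | undefined;
}

// Returns a copy without TTS only when the client sent the literal boolean
// `true`. Anything else (missing, "true", 1) keeps the server-side TTS config.
function configForRequest(base: DemoEngineConfig, clientTts: unknown): DemoEngineConfig {
  return clientTts === true ? { ...base, tts: undefined } : base;
}
```

The strict `=== true` comparison means a malformed request body cannot disable server TTS by accident, and the spread leaves the shared base config untouched.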
@@ -7,7 +7,6 @@ import { NextResponse } from "next/server";
 import { loadEngineConfig } from "@/lib/config";

 export const runtime = "nodejs";
-export const maxDuration = 60;

 // Same rationale as /api/vision: the client resizes to 512px max-dim webp
 // (~30-80KB base64 typical) before upload, so 3 MB is generous headroom
@@ -1,14 +1,18 @@
 import { requestScene } from "@infiplot/engine";
-import type { SceneRequest } from "@infiplot/types";
+import type { Character, SceneRequest } from "@infiplot/types";
 import { NextResponse } from "next/server";
 import { loadEngineConfig } from "@/lib/config";

+function stripKnownVoices(
+  characters: Character[],
+  knownNames: Set<string>,
+): Character[] {
+  return characters.map((c) =>
+    knownNames.has(c.name) ? { ...c, voice: undefined } : c,
+  );
+}
+
 export const runtime = "nodejs";
-// Capped at 60 for Vercel Hobby (300 allowed on Pro). The scene pipeline is
-// Writer + CharDesigner×N + Cinematographer + Painter — happy path 9–12s; the
-// tail (cold provider, multiple new characters) can push 30–45s, so 60 is a
-// reasonable headroom on Hobby.
-export const maxDuration = 60;

 export async function POST(req: Request) {
  let body: SceneRequest;
@@ -23,9 +27,17 @@ export async function POST(req: Request) {
  }

  try {
-    const config = loadEngineConfig();
+    const base = loadEngineConfig();
+    // See StartRequest.clientTts — BYO clients synth in-browser, so drop server TTS.
+    const config = body.clientTts === true ? { ...base, tts: undefined } : base;
    const result = await requestScene(config, body);
-    return NextResponse.json(result);
+    const knownNames = new Set(
+      (body.session.characters ?? []).map((c) => c.name),
+    );
+    return NextResponse.json({
+      ...result,
+      characters: stripKnownVoices(result.characters, knownNames),
+    });
  } catch (err) {
    const message = err instanceof Error ? err.message : "Unknown error";
    return NextResponse.json({ error: message }, { status: 500 });
@@ -4,7 +4,6 @@ import { NextResponse } from "next/server";
 import { loadEngineConfig } from "@/lib/config";

 export const runtime = "nodejs";
-export const maxDuration = 60;

 // Matches /api/vision and /api/parse-style-image — the user's resized 512px
 // webp is ~30-80 KB; this caps pathological direct-API payloads (which would
@@ -41,7 +40,11 @@ export async function POST(req: Request) {
  }

  try {
-    const config = loadEngineConfig();
+    const base = loadEngineConfig();
+    // BYO key: the browser provisions + synths voices directly against Xiaomi
+    // (key never reaches us), so strip server-side TTS so the engine skips all
+    // provisioning + synth. See StartRequest.clientTts.
+    const config = body.clientTts === true ? { ...base, tts: undefined } : base;
    const result = await startSession(config, body);
    return NextResponse.json(result);
  } catch (err) {
@@ -4,7 +4,6 @@ import { NextResponse } from "next/server";
 import { loadEngineConfig } from "@/lib/config";

 export const runtime = "nodejs";
-export const maxDuration = 60;

 // Browser annotator resizes to 768 wide → typically 200-800 KB base64.
 // 3 MB caps abusive direct-API payloads (which would inflate upstream
@@ -1,4 +1,4 @@
-import type { Metadata } from "next";
+import type { Metadata, Viewport } from "next";
 import { Cormorant_Garamond, Inter } from "next/font/google";
 import { Analytics } from "@/components/Analytics";
 import "./globals.css";
@@ -25,6 +25,15 @@ export const metadata: Metadata = {
  description: "InfiPlot 是一款用 AI 实时生成图片、语音与剧情分支的交互式剧情游戏 Demo。",
 };

+// viewportFit:cover lets the immersive /play portrait layout extend under the
+// iOS notch / home-indicator and exposes env(safe-area-inset-*) to the
+// floating controls. device-width + initialScale keep mobile rendering 1:1.
+export const viewport: Viewport = {
+  width: "device-width",
+  initialScale: 1,
+  viewportFit: "cover",
+};
+
 export default function RootLayout({
  children,
 }: {
@@ -10,14 +10,8 @@ import {
  PLOT_STYLES,
  type Gender,
 } from "@/lib/options";
-
-/* ============================================================================
-   InfiPlot · Home page (editorial visual style · centered composition, echoing the lo-fi prototype)
-   - Top header: serif wordmark logo in the top-left corner
-"use client";
-
-import { useRouter } from "next/navigation";
-import { useEffect, useRef, useState } from "react";
+import { readStoredTtsConfig } from "@/lib/clientTtsConfig";
+import { TtsKeyModal } from "@/components/TtsKeyModal";

 /* ============================================================================
   InfiPlot · Home page (editorial visual style · centered composition, echoing the lo-fi prototype)
@@ -1394,7 +1388,12 @@ export default function HomePage() {
  // Top usage hint: shown by default; the user can click × to dismiss it permanently (localStorage:infiplot:hintClosed).
  const [hintClosed, setHintClosed] = useState(false);

+  // Bring-your-own TTS key modal: an optional enhancement; the key is stored only in the browser and never passes through the server.
+  const [ttsOpen, setTtsOpen] = useState(false);
+  const [ttsConfigured, setTtsConfigured] = useState(false);
+
  const styleRow = OPTS.findIndex((o) => o.modal);
+  const voiceRow = OPTS.findIndex((o) => o.label === "语音配音");
  const genderIndex = sel[0] ?? 0;
  const gender = (OPTS[0]!.items[genderIndex] as Gender) ?? "男性向";
  const phrases = EXAMPLE_PHRASES[gender];
@@ -1436,6 +1435,11 @@ export default function HomePage() {
    }
  }, []);

+  // On mount, restore the "enabled" badge: read localStorage to check whether the user has already saved a key.
+  useEffect(() => {
+    setTtsConfigured(readStoredTtsConfig() != null);
+  }, []);
+
  // Auto-grow the input with its content so long text stays fully visible (covers both typing and card-click fill-in).
  useEffect(() => {
    const el = inputRef.current;
@@ -1661,6 +1665,30 @@ export default function HomePage() {
            ))}
          </div>

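+          {/* Bring-your-own TTS key entry: the shared voice model has RPM/TPM limits, so high concurrency easily leads to silence;
+              entering your own Xiaomi MiMo key (free) gives stable voice and lower latency, and the key is stored only locally. */}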
+          <div className="mt-5 flex justify-center">
+            <button
+              type="button"
+              onClick={() => setTtsOpen(true)}
+              className={
+                "inline-flex items-center gap-2 rounded-full border px-4 py-1.5 font-sans text-xs md:text-[13px] transition-colors " +
+                (ttsConfigured
+                  ? "border-ember-500/40 bg-ember-500/5 text-ember-500 hover:bg-ember-500/10"
+                  : "border-clay-900/15 text-clay-500 hover:border-clay-900/30 hover:text-clay-700")
+              }
+            >
+              <i
+                className={
+                  ttsConfigured
+                    ? "fa-solid fa-circle-check text-[11px]"
+                    : "fa-solid fa-microphone-lines text-[11px]"
+                }
+              />
+              {ttsConfigured ? "自带配音 Key · 已启用" : "经常没声音？自带配音 Key（可选）"}
+            </button>
+          </div>
+
          {/* Usage hint: the user can dismiss it permanently (localStorage:infiplot:hintClosed) */}
          {!hintClosed && (
            <div className="relative mx-auto mt-10 md:mt-12 max-w-[640px] rounded-sm border border-clay-900/10 bg-cream-100/50 px-8 py-3.5">
@@ -1826,6 +1854,21 @@ export default function HomePage() {
          setCustomStyleRefImage={setCustomStyleRefImage}
        />
      )}
+      {ttsOpen && (
+        <TtsKeyModal
+          onClose={() => setTtsOpen(false)}
+          onSaved={(configured) => {
+            setTtsConfigured(configured);
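+            // When a BYO key is enabled, also switch 「语音配音」 (voice) to 「开启」 (on); otherwise the user has
+            // configured a key but still hears nothing, which is self-contradictory. On disable, leave the selection alone to respect the user's original preference.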
+            if (configured && voiceRow >= 0) {
+              const onIdx = OPTS[voiceRow]!.items.indexOf("开启");
+              if (onIdx >= 0)
+                setSel((s) => s.map((v, j) => (j === voiceRow ? onIdx : v)));
+            }
+          }}
+        />
+      )}
    </div>
  );
 }
@@ -6,30 +6,87 @@ import {
  Suspense,
  useCallback,
  useEffect,
+  useLayoutEffect,
  useMemo,
  useRef,
  useState,
 } from "react";
 import { PlayCanvas, type Phase } from "@/components/PlayCanvas";
+import { TtsKeyModal } from "@/components/TtsKeyModal";
 import { annotateClick } from "@/lib/annotateClient";
+import { loadClientTtsConfig } from "@/lib/clientTtsConfig";
 import { PRESETS } from "@/lib/presets";
+import { provisionVoice, synthesize } from "@infiplot/tts-client";
 import type {
  Beat,
-  BeatAudio,
-  BeatAudioResponse,
  BeatChoice,
+  Character,
+  CharacterVoice,
  InsertBeatResponse,
+  Orientation,
  Scene,
  SceneExit,
  SceneResponse,
  Session,
  StartResponse,
+  TtsConfig,
  VisionResponse,
 } from "@infiplot/types";
 import { track } from "@/lib/analytics";

 const MUTED_STORAGE_KEY = "infiplot:muted";

+// ── FOT reduction helpers ──────────────────────────────────────────────
+// Strip bulky voice.referenceAudioBase64 from the session before sending it to
+// the server. The engine only needs character names + visualDescriptions for
+// scene generation; voice data is only used by /api/beat-audio (which receives
+// the voice directly, not via session). The client retains voices locally and
+// re-merges them from the response via mergeCharactersPreserveVoice.
+function stripVoicesForTransport(session: Session): Session {
+  return {
+    ...session,
+    characters: session.characters.map((c) => ({ ...c, voice: undefined })),
+  };
+}
+
+// Merge server-returned characters with locally-held voices. The server strips
+// voice from already-known characters (P0), so only NEW characters carry voice.
+// For existing characters, re-attach the voice the client already holds.
+function mergeCharactersPreserveVoice(
+  local: Character[],
+  remote: Character[],
+): Character[] {
+  const localByName = new Map(local.map((c) => [c.name, c]));
+  return remote.map((c) => {
+    const prev = localByName.get(c.name);
+    if (!prev) return c;
+    return { ...c, voice: c.voice ?? prev.voice };
+  });
+}
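
A self-contained sketch of the strip-then-merge round trip that the helpers above and the `/api/scene` route implement together. The `Demo*` types and the `demoRoundTrip` scenario are illustrative assumptions; in particular, the engine output is faked here.

```typescript
interface DemoVoice {
  referenceAudioBase64: string;
}
interface DemoCharacter {
  name: string;
  visualDescription: string;
  voice?: DemoVoice | undefined;
}

// Client → server: drop every voice before the session is serialized.
function stripAllVoices(chars: DemoCharacter[]): DemoCharacter[] {
  return chars.map((c) => ({ ...c, voice: undefined }));
}

// Server → client: drop voices only for characters the client already knows.
function stripKnown(chars: DemoCharacter[], known: Set<string>): DemoCharacter[] {
  return chars.map((c) => (known.has(c.name) ? { ...c, voice: undefined } : c));
}

// Client: re-attach locally held voices, matching characters by name.
function mergeKeepingVoice(local: DemoCharacter[], remote: DemoCharacter[]): DemoCharacter[] {
  const byName = new Map<string, DemoCharacter>(
    local.map((c): [string, DemoCharacter] => [c.name, c]),
  );
  return remote.map((c) => {
    const prev = byName.get(c.name);
    return prev ? { ...c, voice: c.voice ?? prev.voice } : c;
  });
}

function demoRoundTrip(): DemoCharacter[] {
  const local: DemoCharacter[] = [
    { name: "Alice", visualDescription: "red coat", voice: { referenceAudioBase64: "AAAA" } },
  ];
  const sent = stripAllVoices(local); // what goes over the wire
  // Pretend the engine returns Alice with a voice attached plus a new character.
  const engineOut: DemoCharacter[] = [
    { name: "Alice", visualDescription: "red coat", voice: { referenceAudioBase64: "AAAA" } },
    { name: "Bob", visualDescription: "grey hat", voice: { referenceAudioBase64: "BBBB" } },
  ];
  const wire = stripKnown(engineOut, new Set(sent.map((c) => c.name)));
  return mergeKeepingVoice(local, wire); // Alice regains her local voice
}
```

Because the merge is keyed by `name`, two characters sharing a name would receive the same voice; the sketch inherits that property from the name-based lookup.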
+
+// Consecutive silent (no-audio) beats before we surface the BYO-key nudge to a
+// non-BYO, unmuted player. Set high enough that one transient miss won't trip
+// it, low enough to catch a scene that's clearly being rate-limited.
+const SILENCE_NUDGE_THRESHOLD = 3;
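
A small sketch of how the strike counter and the nudge condition could fit together. `nextStrikes` mirrors the `setSilenceStrikes` updates in this commit; `shouldShowNudge` and its inputs are an assumption pieced together from the comments (non-BYO, unmuted, not dismissed), since the actual render condition is not shown in this hunk.

```typescript
const NUDGE_THRESHOLD = 3; // same value as SILENCE_NUDGE_THRESHOLD above

type SynthOutcome = "audio" | "silent";

// A successful synth resets the streak; a miss increments it (capped at 99).
function nextStrikes(prev: number, outcome: SynthOutcome): number {
  return outcome === "audio" ? 0 : Math.min(prev + 1, 99);
}

interface NudgeInputs {
  strikes: number;
  byo: boolean; // player already uses their own key
  muted: boolean;
  dismissed: boolean; // nudge closed earlier in this session
}

// Assumed condition: only nudge players who could actually benefit.
function shouldShowNudge(s: NudgeInputs): boolean {
  return s.strikes >= NUDGE_THRESHOLD && !s.byo && !s.muted && !s.dismissed;
}

// One transient miss never trips the nudge; three in a row does.
const outcomes: SynthOutcome[] = ["silent", "silent", "audio", "silent", "silent", "silent"];
const finalStrikes = outcomes.reduce(nextStrikes, 0);
```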
+
+// Mobile-portrait users get a 9:16 scene image painted for them; everyone else
+// (desktop, tablet, mobile-landscape) keeps the 16:9 landscape image. Only a
+// touch device (coarse pointer) held upright counts as "portrait" — a mouse
+// device is always landscape. Detected once and locked for the whole session.
+function detectOrientation(): Orientation {
+  if (typeof window === "undefined") return "landscape";
+  const portrait = window.matchMedia("(orientation: portrait)").matches;
+  const coarse = window.matchMedia("(pointer: coarse)").matches;
+  return portrait && coarse ? "portrait" : "landscape";
+}
+
+// useLayoutEffect runs before the browser paints (so it can correct first-frame
+// state without a visible flash), but React warns when it is used during SSR.
+// PlayInner only ever renders on the client (/play prerenders the Suspense
+// fallback), but we still fall back to useEffect on the server to avoid the warning.
+const useIsomorphicLayoutEffect =
+  typeof window !== "undefined" ? useLayoutEffect : useEffect;
+
 // Cap how long we wait for the browser to download + decode a scene image
 // before giving up and rendering anyway. Runware's CDN is usually <2s for a
 // 1792×1024 PNG, but over slow links / VPN / strict corp networks the same
@@ -257,6 +314,7 @@ function prefetchScenePath(
  baseSession: Session,
  steps: ScenePathStep[],
  depth: number,
+  clientTts: boolean,
 ): void {
  if (depth >= PREFETCH_MAX_DEPTH) return;
  const key = pathKey(steps);
@@ -267,8 +325,10 @@ function prefetchScenePath(
  const promise = (async () => {
    const res = await fetch("/api/scene", {
      method: "POST",
-      headers: { "Content-Type": "application/json" },
-      body: JSON.stringify({ session: specSession }),
+      headers: {
+        "Content-Type": "application/json",
+      },
+      body: JSON.stringify({ session: stripVoicesForTransport(specSession), clientTts }),
      signal: abort.signal,
    });
    if (!res.ok) {
@@ -283,6 +343,12 @@ function prefetchScenePath(
    // transition path awaits the same cached promise via getOrCreateBlobUrl.
    void getOrCreateBlobUrl(data.imageUrl);

+    // Re-attach locally-held voices the server stripped from known characters.
+    data.characters = mergeCharactersPreserveVoice(
+      baseSession.characters,
+      data.characters,
+    );
+
    // Recursive: if the resulting scene has exactly one change-scene exit,
    // it is a must-pass node — prefetch its child too.
    if (depth + 1 < PREFETCH_MAX_DEPTH) {
@@ -307,7 +373,13 @@ function prefetchScenePath(
          characters: data.characters,
          storyState: data.storyState,
        };
-        prefetchScenePath(pool, carriedBase, [...steps, nextStep], depth + 1);
+        prefetchScenePath(
+          pool,
+          carriedBase,
+          [...steps, nextStep],
+          depth + 1,
+          clientTts,
+        );
      }
    }

@@ -342,6 +414,44 @@ function clearPool(pool: Map<string, PrefetchEntry>): void {
  pool.clear();
 }

+// ──────────────────────────────────────────────────────────────────────
+//  BYO voice resolution (client-direct Xiaomi TTS).
+//
+//  In BYO mode the server skips all TTS (clientTts:true), so the browser must
+//  obtain each speaker's reference audio itself. `cache` is keyed by character
+//  NAME and persists for the whole session, so a voice locked in on a
+//  character's first speaking beat stays identical across every later scene —
+//  even though /api/scene returns its characters without `.voice`. Storing the
+//  in-flight Promise (not the resolved value) dedupes the burst of concurrent
+//  beats by the same speaker into ONE voicedesign call, which matters because
+//  Xiaomi rate-limits voicedesign hard.
+// ──────────────────────────────────────────────────────────────────────
+
+async function resolveByoVoice(
+  cache: Map<string, Promise<CharacterVoice>>,
+  cfg: TtsConfig,
+  speaker: Character,
+): Promise<CharacterVoice | null> {
+  const cached = cache.get(speaker.name);
+  if (cached) return cached;
+  // Prebaked cards ship baked reference audio — reuse it directly (cross-key
+  // synth with the user's key works), keeping the prebaked voice identical.
+  if (speaker.voice) {
+    const ready = Promise.resolve(speaker.voice);
+    cache.set(speaker.name, ready);
+    return ready;
+  }
+  if (!speaker.voiceDescription) return null;
+  const p = provisionVoice(cfg, speaker.voiceDescription);
+  cache.set(speaker.name, p);
+  try {
+    return await p;
+  } catch (e) {
+    cache.delete(speaker.name); // failed provision — let a later beat retry
+    throw e;
+  }
+}
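
The in-flight Promise cache used by `resolveByoVoice` can be written generically. This is a minimal sketch, not the commit's code: `getOrProvision` is a hypothetical non-async variant that returns the cached Promise itself and evicts failed entries through a `.catch` handler, and `fakeVoiceDesign` stands in for the real `provisionVoice` call.

```typescript
// Concurrent callers for the same key share ONE pending Promise, so the
// expensive provisioning call runs at most once per key.
function getOrProvision<T>(
  cache: Map<string, Promise<T>>,
  key: string,
  provision: () => Promise<T>,
): Promise<T> {
  const cached = cache.get(key);
  if (cached) return cached;
  const pending = provision();
  cache.set(key, pending);
  // Evict on failure so a later call can retry. Callers still observe the
  // rejection through `pending`; this handler only cleans up the cache.
  pending.catch(() => {
    if (cache.get(key) === pending) cache.delete(key);
  });
  return pending;
}

// Demo with a fake provisioner that counts how often it is invoked.
let designCalls = 0;
const voiceCache = new Map<string, Promise<string>>();
const fakeVoiceDesign = (): Promise<string> => {
  designCalls += 1;
  return Promise.resolve("reference-audio");
};

// Two beats by the same speaker arrive back to back.
const firstRequest = getOrProvision(voiceCache, "Alice", fakeVoiceDesign);
const secondRequest = getOrProvision(voiceCache, "Alice", fakeVoiceDesign);
```

Caching the Promise rather than the resolved value is what closes the race: the second caller finds an entry even though the first call has not finished yet.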
+
 // ──────────────────────────────────────────────────────────────────────
 //  Component
 // ──────────────────────────────────────────────────────────────────────
@@ -355,7 +465,7 @@ function PlayInner() {
  const [currentScene, setCurrentScene] = useState<Scene | null>(null);
  const [currentBeatId, setCurrentBeatId] = useState<string | null>(null);
  const [imageUrl, setImageUrl] = useState<string | null>(null);
-  const [beatAudioMap, setBeatAudioMap] = useState<Record<string, BeatAudio>>({});
+  const [beatAudioMap, setBeatAudioMap] = useState<Record<string, string>>({});
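  // Lazy-initialize priority: this session's choice (the homepage's 「语音配音」 voice option, saved to sessionStorage:infiplot:custom)
  // > the sticky preference from the last session (localStorage:infiplot:muted) > default unmuted.
  // So choosing 「关闭」 (off) on the homepage starts the game muted; choosing 「开启」 (on) starts it unmuted; once on the play page, the user's own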
@@ -381,7 +491,20 @@ function PlayInner() {
  } | null>(null);
  const [error, setError] = useState<string | null>(null);
  const [presentation, setPresentation] = useState(false);
+  // Session-locked image orientation (see detectOrientation). "portrait" makes
+  // the whole play surface render full-bleed vertical on phones.
+  const [orientation, setOrientation] = useState<Orientation>("landscape");
  const [lastExitLabel, setLastExitLabel] = useState<string | null>(null);
+  // Consecutive server-side TTS misses (null audio / failed /api/beat-audio).
+  // Climbs when the shared server key is rate-limited by MiMo — the exact pain
+  // BYO fixes — so the play page can nudge non-BYO users to add their own key.
+  // Reset to 0 on any successful synth. Only the server path touches it.
+  const [silenceStrikes, setSilenceStrikes] = useState(0);
+  // Once the player dismisses the silence nudge, keep it gone for this session.
+  const [nudgeDismissed, setNudgeDismissed] = useState(false);
+  // The in-place BYO-key modal, opened from the silence nudge so the player can
+  // add a key without leaving the play page.
+  const [ttsModalOpen, setTtsModalOpen] = useState(false);

  const startedRef = useRef(false);
  const poolRef = useRef<Map<string, PrefetchEntry>>(new Map());
@@ -396,6 +519,21 @@ function PlayInner() {
  // No longer maintain a separate audioEnabledRef: a single source of truth keeps two flags from drifting apart.
  const mutedRef = useRef<boolean>(muted);

+  // Resolved bring-your-own Xiaomi TTS config (region preset + key), read once
+  // from localStorage. When non-null, the browser provisions + synths voices
+  // directly against Xiaomi — the key never touches our server — and every
+  // start/scene/insert-beat request carries clientTts:true so the engine skips
+  // server-side TTS. null = user hasn't opted in (server default / silent).
+  const [byoTtsConfig, setByoTtsConfig] = useState<TtsConfig | null>(() =>
+    loadClientTtsConfig(),
+  );
+  const byoTtsRef = useRef<TtsConfig | null>(byoTtsConfig);
+  // BYO voice cache (see resolveByoVoice). Keyed by character name; persists
+  // across scenes so each speaker is provisioned at most once per session.
+  const provisionedVoicesRef = useRef<Map<string, Promise<CharacterVoice>>>(
+    new Map(),
+  );
+
  // Mirrors for use inside async handlers (closure-stable)
  const sessionRef = useRef<Session | null>(null);
  const currentSceneRef = useRef<Scene | null>(null);
@@ -411,9 +549,7 @@ function PlayInner() {
    return currentScene.beats.find((b) => b.id === currentBeatId) ?? null;
  }, [currentScene, currentBeatId]);

-  const currentBeatAudio = currentBeat ? beatAudioMap[currentBeat.id] : undefined;
-  const audioBase64 = currentBeatAudio?.base64 ?? null;
-  const audioMime = currentBeatAudio?.mime ?? null;
+  const audioSrc = (currentBeat ? beatAudioMap[currentBeat.id] : undefined) ?? null;

  useEffect(() => {
    sessionRef.current = session;
@@ -476,31 +612,73 @@ function PlayInner() {
      // Choosing 「关闭」 (off) on the homepage takes this path too: muted is already initialized to true at bootstrap.
      if (!beat.speaker || !beat.line) return;
      const speaker = sess.characters.find((c) => c.name === beat.speaker);
-      if (!speaker?.voice) return; // not yet provisioned — server can't synth anyway
+      if (!speaker) return;
+
+      const byo = byoTtsRef.current;
+      // Non-BYO relies on the server having provisioned speaker.voice. BYO
+      // skipped server TTS, so it needs a baked voice (prebaked card) or a
+      // voiceDescription to provision from in the browser.
+      if (!byo && !speaker.voice) return;
+      if (byo && !speaker.voice && !speaker.voiceDescription) return;
+
      if (beatAudioAbortRef.current.has(beat.id)) return;
      const abort = new AbortController();
      beatAudioAbortRef.current.set(beat.id, abort);
      try {
-        const res = await fetch("/api/beat-audio", {
-          method: "POST",
-          headers: { "Content-Type": "application/json" },
-          body: JSON.stringify({
-            beat: { id: beat.id, line: beat.line, lineDelivery: beat.lineDelivery },
-            voice: speaker.voice,
-          }),
-          signal: abort.signal,
-        });
-        if (!res.ok) return;
-        const json = (await res.json()) as BeatAudioResponse;
-        // Skip the state write if we've been aborted between the .ok check and
+        let audioUrl: string | null = null;
+        if (byo) {
+          // Client-direct: provision (once per speaker, cached) + synth against
+          // Xiaomi with the user's own key — no /api/beat-audio round-trip and
+          // the key never touches our server.
+          const voice = await resolveByoVoice(
+            provisionedVoicesRef.current,
+            byo,
+            speaker,
+          );
+          if (!voice || abort.signal.aborted) return;
+          const out = await synthesize(
+            byo,
+            voice,
+            beat.line,
+            beat.lineDelivery,
+            abort.signal,
+          );
+          audioUrl = `data:${out.mimeType};base64,${out.audioBase64}`;
+        } else {
+          const res = await fetch("/api/beat-audio", {
+            method: "POST",
+            headers: {
+              "Content-Type": "application/json",
+            },
+            body: JSON.stringify({
+              beat: { id: beat.id, line: beat.line, lineDelivery: beat.lineDelivery },
+              voice: speaker.voice,
+            }),
+            signal: abort.signal,
+          });
+          // A 204 (no audio) and a non-2xx response both count as a silent miss.
+          if (res.status === 204 || !res.ok) {
+            setSilenceStrikes((n) => Math.min(n + 1, 99));
+            return;
+          }
+          const blob = await res.blob();
+          audioUrl = URL.createObjectURL(blob);
+          setSilenceStrikes(0);
+        }
+        // Skip the state write if we've been aborted between the await and
        // here — beat ids are scene-local, so a late arrival from a prior
        // scene would otherwise overwrite the current scene's audio under the
        // same id.
-        if (json.audio && !abort.signal.aborted) {
-          setBeatAudioMap((m) => ({ ...m, [beat.id]: json.audio as BeatAudio }));
+        if (audioUrl && !abort.signal.aborted) {
+          setBeatAudioMap((m) => ({ ...m, [beat.id]: audioUrl }));
+        } else if (audioUrl?.startsWith("blob:")) {
+          URL.revokeObjectURL(audioUrl);
        }
      } catch {
-        // aborted or network error — silent fallback
+        // aborted / network / Xiaomi rate-limit — silent fallback (no audio)
      } finally {
        // Only clear the slot if it's still ours. An aborted prior fetch
        // running its finally late could otherwise delete the controller of a
@@ -536,7 +714,12 @@ function PlayInner() {
  // scenes) so a late arrival would land under the wrong beat otherwise.
  useEffect(() => {
    cancelBeatAudioFetches();
-    setBeatAudioMap({});
+    setBeatAudioMap((prev) => {
+      for (const url of Object.values(prev)) {
+        if (url.startsWith("blob:")) URL.revokeObjectURL(url);
+      }
+      return {};
+    });
    prefetchSceneAudio();
  }, [currentScene?.id, prefetchSceneAudio]);
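
Review note: the revoke-then-clear updater in this hunk is repeated verbatim in the unmute effect and in `handleByoSaved`. A minimal sketch of a shared helper is below. The names `clearAudioMap` and `BeatAudioMap`, and the injectable `revoke` parameter, are hypothetical and not part of this PR.

```typescript
// Hypothetical helper (not in the PR): release every blob: URL held in the
// map and return a fresh empty map for the state setter.
type BeatAudioMap = Record<string, string>;

function clearAudioMap(
  prev: BeatAudioMap,
  // Injected so the helper can run outside a browser; in the component this
  // would be (url) => URL.revokeObjectURL(url).
  revoke: (url: string) => void,
): BeatAudioMap {
  for (const url of Object.values(prev)) {
    // Only object URLs pin a Blob in memory. data: URIs (the BYO path) are
    // plain strings and need no cleanup.
    if (url.startsWith("blob:")) revoke(url);
  }
  return {};
}
```

Under these assumptions each call site would shrink to `setBeatAudioMap((prev) => clearAudioMap(prev, (u) => URL.revokeObjectURL(u)))`.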

@@ -571,10 +754,41 @@ function PlayInner() {
    if (prev === muted) return;
    cancelBeatAudioFetches();
    if (muted) return;
-    setBeatAudioMap({});
+    setBeatAudioMap((prev) => {
+      for (const url of Object.values(prev)) {
+        if (url.startsWith("blob:")) URL.revokeObjectURL(url);
+      }
+      return {};
+    });
    prefetchSceneAudio();
  }, [muted, prefetchSceneAudio]);

+  // ── BYO key enabled/disabled from the play page (silence nudge → modal) ─
+  // On enable: point the synth path at the user's key and immediately
+  // re-synthesize the current scene in-browser, so the voices the player just
+  // missed come back without a reload (their characters already carry
+  // server-provisioned `voice`, which resolveByoVoice reuses with the new key).
+  // On disable: just stop using it; later scenes fall back to the server.
+  const handleByoSaved = useCallback(
+    (configured: boolean) => {
+      const cfg = configured ? loadClientTtsConfig() : null;
+      byoTtsRef.current = cfg;
+      setByoTtsConfig(cfg);
+      if (cfg) {
+        setSilenceStrikes(0);
+        cancelBeatAudioFetches();
+        setBeatAudioMap((prev) => {
+          for (const url of Object.values(prev)) {
+            if (url.startsWith("blob:")) URL.revokeObjectURL(url);
+          }
+          return {};
+        });
+        prefetchSceneAudio();
+      }
+    },
+    [prefetchSceneAudio],
+  );
+
  // ── Presentation mode toggle ─────────────────────────────────────────
  const togglePresentation = useCallback(async () => {
    const entering = !presentation;
@@ -619,6 +833,16 @@ function PlayInner() {
    };
  }, [togglePresentation, presentation]);

+  // Lock the visible orientation BEFORE the first paint, so portrait phones
+  // never flash the landscape loading chrome. The state inits to "landscape"
+  // for SSR-safety; this corrects it pre-paint (no-op re-render on landscape
+  // devices). Prebaked cards (decision C) stay landscape-baked regardless of
+  // device. The bootstrap effect below re-derives the same value for the
+  // /api/start payload.
+  useIsomorphicLayoutEffect(() => {
+    setOrientation(params.get("card") ? "landscape" : detectOrientation());
+  }, [params]);
+
  // ── Bootstrap: start session ─────────────────────────────────────────
  useEffect(() => {
    if (startedRef.current) return;
@@ -638,6 +862,7 @@ function PlayInner() {
      worldSetting: string;
      styleGuide: string;
      styleReferenceImage?: string;
+      orientation?: Orientation;
    } | null = null;
    if (!cardName) {
      if (presetId) {
@@ -666,6 +891,16 @@ function PlayInner() {
      }
    }

+    // Lock orientation for the whole session. Prebaked cards (decision C) are
+    // landscape-baked, so they stay landscape regardless of device; only the
+    // live /api/start path requests a portrait paint when the phone is upright.
+    // The visible state is already set pre-paint by the layout effect above;
+    // here we only need the value for the /api/start payload.
+    const sessionOrientation: Orientation = cardName
+      ? "landscape"
+      : detectOrientation();
+    if (livePayload) livePayload.orientation = sessionOrientation;
+
    if (!cardName && !livePayload) {
      router.replace("/");
      return;
@@ -693,8 +928,13 @@ function PlayInner() {
        )
      : fetch("/api/start", {
          method: "POST",
-          headers: { "Content-Type": "application/json" },
-          body: JSON.stringify(livePayload),
+          headers: {
+            "Content-Type": "application/json",
+          },
+          body: JSON.stringify({
+            ...livePayload,
+            clientTts: !!byoTtsRef.current,
+          }),
        }).then(async (r) => {
          if (!r.ok) {
            const j = (await r.json().catch(() => ({}))) as { error?: string };
@@ -734,6 +974,7 @@ function PlayInner() {
          characters: data.characters,
          storyState: data.storyState,
          styleReferenceImage: data.styleReferenceImage,
+          orientation: data.scene.orientation ?? sessionOrientation,
        };
        visitedBeatsRef.current = [data.scene.entryBeatId];
        setSession(initial);
@@ -767,7 +1008,7 @@ function PlayInner() {
          nextSceneSeed: choice.effect.nextSceneSeed,
        },
      };
-      prefetchScenePath(poolRef.current, s, [step], 0);
+      prefetchScenePath(poolRef.current, s, [step], 0, !!byoTtsRef.current);
    }
  }, [currentScene?.id, session?.id]);

@@ -844,7 +1085,10 @@ function PlayInner() {
            visitedBeatIds: [result.scene.entryBeatId],
          },
        ],
-        characters: result.characters,
+        characters: mergeCharactersPreserveVoice(
+          base.characters,
+          result.characters,
+        ),
        storyState: result.storyState,
      };
      visitedBeatsRef.current = [result.scene.entryBeatId];
@@ -918,8 +1162,13 @@ function PlayInner() {
    const promise = (async () => {
      const res = await fetch("/api/scene", {
        method: "POST",
-        headers: { "Content-Type": "application/json" },
-        body: JSON.stringify({ session: specSession }),
+        headers: {
+          "Content-Type": "application/json",
+        },
+        body: JSON.stringify({
+          session: stripVoicesForTransport(specSession),
+          clientTts: !!byoTtsRef.current,
+        }),
      });
      if (!res.ok) {
        const j = (await res.json().catch(() => ({}))) as { error?: string };
@@ -940,8 +1189,10 @@ function PlayInner() {
      const annotatedImageBase64 = await annotateClick(imageUrl, click);
      const visionRes = await fetch("/api/vision", {
        method: "POST",
-        headers: { "Content-Type": "application/json" },
-        body: JSON.stringify({ session, annotatedImageBase64 }),
+        headers: {
+          "Content-Type": "application/json",
+        },
+        body: JSON.stringify({ session: stripVoicesForTransport(session), annotatedImageBase64 }),
      });
      if (!visionRes.ok) {
        const j = (await visionRes.json().catch(() => ({}))) as {
@@ -956,10 +1207,13 @@ function PlayInner() {
        setPhase("inserting-beat");
        const insertRes = await fetch("/api/insert-beat", {
          method: "POST",
-          headers: { "Content-Type": "application/json" },
+          headers: {
+            "Content-Type": "application/json",
+          },
          body: JSON.stringify({
-            session,
+            session: stripVoicesForTransport(session),
            freeformAction: decision.intent.freeformAction,
+            clientTts: !!byoTtsRef.current,
          }),
        });
        if (!insertRes.ok) {
@@ -995,7 +1249,10 @@ function PlayInner() {
          history: session.history.map((h, i, arr) =>
            i === arr.length - 1 ? { ...h, scene: patched } : h,
          ),
-          characters: insertChars,
+          characters: mergeCharactersPreserveVoice(
+            session.characters,
+            insertChars,
+          ),
        };
        setSession(nextSession);
        setCurrentScene(patched);
@@ -1036,8 +1293,13 @@ function PlayInner() {
        const promise = (async () => {
          const res = await fetch("/api/scene", {
            method: "POST",
-            headers: { "Content-Type": "application/json" },
-            body: JSON.stringify({ session: specSession }),
+            headers: {
+              "Content-Type": "application/json",
+            },
+            body: JSON.stringify({
+              session: stripVoicesForTransport(specSession),
+              clientTts: !!byoTtsRef.current,
+            }),
          });
          if (!res.ok) {
            const j = (await res.json().catch(() => ({}))) as {
@@ -1071,12 +1333,12 @@ function PlayInner() {
          <p className="text-[10px] smallcaps text-clay-500 mb-6">
            出 · 了 · 点 · 状 · 况
          </p>
-          <p className="font-serif italic text-clay-900 text-lg leading-[1.7] mb-10">
+          <p className="font-serif italic text-clay-900 text-lg leading-[1.7] mb-6">
            {error}
          </p>
          <Link
            href="/"
-            className="text-[10px] smallcaps text-clay-700 hover:text-ember-500 transition-colors inline-flex items-center gap-3"
+            className="mt-4 text-[10px] smallcaps text-clay-700 hover:text-ember-500 transition-colors inline-flex items-center gap-3"
          >
            <i className="fa-solid fa-arrow-left text-[9px]" />
            返 回
@@ -1086,13 +1348,18 @@ function PlayInner() {
    );
  }

-  if (presentation) {
+  // Mobile portrait renders full-bleed by default — it sidesteps the iOS
+  // Safari Fullscreen API (unsupported on iPhone) with a CSS full-viewport
+  // layout instead. Desktop "presentation" mode shares the same immersive
+  // canvas, toggled via the F key.
+  const immersive = presentation || orientation === "portrait";
+
+  if (immersive) {
    return (
      <div className="fixed inset-0 bg-black flex items-center justify-center z-50">
        <PlayCanvas
          imageUrl={imageUrl}
-          audioBase64={audioBase64}
-          audioMime={audioMime}
+          audioSrc={audioSrc}
          muted={muted}
          phase={phase}
          beat={currentBeat}
@@ -1100,8 +1367,33 @@ function PlayInner() {
          onBackgroundClick={onBackgroundClick}
          onAdvance={onAdvance}
          onSelectChoice={onSelectChoice}
+          orientation={orientation}
          fullViewport
        />
+        {orientation === "portrait" && (
+          <div
+            className="absolute inset-x-0 top-0 z-10 flex items-center justify-between px-4 pointer-events-none"
+            style={{ paddingTop: "max(0.5rem, env(safe-area-inset-top))" }}
+          >
+            <Link
+              href="/"
+              className="pointer-events-auto flex h-9 w-9 items-center justify-center rounded-full bg-black/40 text-white/80 backdrop-blur-sm transition-colors hover:text-white"
+              aria-label="返回"
+            >
+              <i className="fa-solid fa-arrow-left text-[13px]" />
+            </Link>
+            <button
+              type="button"
+              onClick={toggleMuted}
+              className="pointer-events-auto flex h-9 w-9 items-center justify-center rounded-full bg-black/40 text-white/80 backdrop-blur-sm transition-colors hover:text-white"
+              aria-label={muted ? "取消静音" : "静音"}
+            >
+              <i
+                className={`fa-solid ${muted ? "fa-volume-xmark" : "fa-volume-high"} text-[13px]`}
+              />
+            </button>
+          </div>
+        )}
      </div>
    );
  }
@@ -1109,6 +1401,16 @@ function PlayInner() {
  const sceneCount = session?.history.length ?? 0;
  const beatCount = visitedBeatsRef.current.length;

+  // Surface the BYO-key nudge only to an unmuted, non-BYO player whose last few
+  // beats came back silent (shared key rate-limited) — the exact pain BYO fixes.
+  // Dismissible for the session.
+  const showSilenceNudge =
+    phase === "ready" &&
+    !muted &&
+    !byoTtsConfig &&
+    !nudgeDismissed &&
+    silenceStrikes >= SILENCE_NUDGE_THRESHOLD;
+
  return (
    <div className="min-h-screen flex flex-col">
      <header className="px-5 md:px-12 pt-6 md:pt-8 flex items-center justify-between">
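
Review note: the nudge condition and the strike counter can be read as two small pure functions, sketched below. The function names and the `NudgeInputs` shape are hypothetical. The threshold is passed in because `SILENCE_NUDGE_THRESHOLD` is defined outside this hunk and its value is not shown here.

```typescript
// Hypothetical pure form of the showSilenceNudge expression.
interface NudgeInputs {
  phase: string;
  muted: boolean;
  hasByoConfig: boolean;
  dismissed: boolean;
  silenceStrikes: number;
}

function shouldShowSilenceNudge(input: NudgeInputs, threshold: number): boolean {
  return (
    input.phase === "ready" &&
    !input.muted &&
    !input.hasByoConfig &&
    !input.dismissed &&
    input.silenceStrikes >= threshold
  );
}

// Strike counter as used on the server fetch path: +1 (capped at 99) for a
// silent response, reset to 0 once audio arrives.
function nextStrikes(current: number, gotAudio: boolean): number {
  return gotAudio ? 0 : Math.min(current + 1, 99);
}
```

Keeping the predicate pure would make the "unmuted, non-BYO, not dismissed" rule testable without rendering the page.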
@@ -1131,8 +1433,7 @@ function PlayInner() {
      <main className="flex-1 flex flex-col items-center justify-center px-4 md:px-8 py-6 md:py-10">
        <PlayCanvas
          imageUrl={imageUrl}
-          audioBase64={audioBase64}
-          audioMime={audioMime}
+          audioSrc={audioSrc}
          muted={muted}
          phase={phase}
          beat={currentBeat}
@@ -1140,6 +1441,7 @@ function PlayInner() {
          onBackgroundClick={onBackgroundClick}
          onAdvance={onAdvance}
          onSelectChoice={onSelectChoice}
+          orientation={orientation}
          aboveCanvas={
            <button
              type="button"
@@ -1153,18 +1455,46 @@ function PlayInner() {
            </button>
          }
          aboveCanvasLeft={
-            <button
-              type="button"
-              onClick={toggleMuted}
-              className="text-[10px] smallcaps text-clay-500 hover:text-ember-500 transition-colors flex items-center gap-2"
-              aria-label={muted ? "取消静音" : "静音"}
-              title={muted ? "取消静音" : "静音"}
-            >
-              <i
-                className={`fa-solid ${muted ? "fa-volume-xmark" : "fa-volume-high"} text-[10px]`}
-              />
-              {muted ? "静 · 音" : "有 · 声"}
-            </button>
+            <>
+              <button
+                type="button"
+                onClick={toggleMuted}
+                className="text-[10px] smallcaps text-clay-500 hover:text-ember-500 transition-colors flex items-center gap-2"
+                aria-label={muted ? "取消静音" : "静音"}
+                title={muted ? "取消静音" : "静音"}
+              >
+                <i
+                  className={`fa-solid ${muted ? "fa-volume-xmark" : "fa-volume-high"} text-[10px]`}
+                />
+                {muted ? "静 · 音" : "有 · 声"}
+              </button>
+
+              {/* Silence nudge — a compact pill right beside the mute toggle.
+                  Clicking opens the BYO-key modal in place (no trip to the
+                  homepage). The × dismisses it for the session. */}
+              {showSilenceNudge && (
+                <span className="flex items-center gap-1 animate-fade-in">
+                  <button
+                    type="button"
+                    onClick={() => setTtsModalOpen(true)}
+                    className="inline-flex items-center gap-1.5 rounded-full border border-ember-500/40 bg-ember-500/10 px-2.5 py-1 text-[10px] text-ember-500 hover:bg-ember-500/20 transition-colors"
+                    title="经常没声音？填入你自己的小米 MiMo Key（免费），配音更稳定"
+                  >
+                    <i className="fa-solid fa-volume-xmark text-[9px]" />
+                    经常没声音？自带 Key
+                  </button>
+                  <button
+                    type="button"
+                    onClick={() => setNudgeDismissed(true)}
+                    aria-label="关闭提示"
+                    title="关闭"
+                    className="text-clay-400 hover:text-clay-700 transition-colors"
+                  >
+                    <i className="fa-solid fa-xmark text-[10px]" />
+                  </button>
+                </span>
+              )}
+            </>
          }
        />

@@ -1181,7 +1511,16 @@ function PlayInner() {
            </p>
          )}
        </div>
+
      </main>
+
+      {ttsModalOpen && (
+        <TtsKeyModal
+          onClose={() => setTtsModalOpen(false)}
+          onSaved={handleByoSaved}
+          footerNote="保存后会立即用这把 Key 在你的浏览器里合成当前这一幕的配音；本设备后续游玩也会自动使用此 Key。"
+        />
+      )}
    </div>
  );
 }
@@ -1,7 +1,7 @@
 "use client";

 import { useCallback, useEffect, useRef, useState, type ReactNode } from "react";
-import type { Beat, BeatChoice } from "@infiplot/types";
+import type { Beat, BeatChoice, Orientation } from "@infiplot/types";

 export type Phase =
  | "loading-first"        // first scene not yet rendered
@@ -109,11 +109,13 @@ function ChoiceButton({
  index,
  label,
  disabled,
+  vertical,
  onClick,
 }: {
  index: number;
  label: string;
  disabled: boolean;
+  vertical: boolean;
  onClick: () => void;
 }) {
  return (
@@ -121,8 +123,8 @@ function ChoiceButton({
      type="button"
      disabled={disabled}
      onClick={onClick}
-      className="group relative flex-1 min-w-0 px-4 py-3 text-left transition-all duration-200
-        disabled:opacity-50 disabled:cursor-wait"
+      className={`group relative ${vertical ? "w-full" : "flex-1 min-w-0"} px-4 py-3 text-left transition-all duration-200
+        disabled:opacity-50 disabled:cursor-wait`}
      style={{
        background: "rgba(20, 14, 8, 0.68)",
        border: "1.5px solid rgba(180, 140, 80, 0.65)",
@@ -141,13 +143,13 @@ function ChoiceButton({
      />
      <span className="relative flex items-baseline gap-2">
        <span
-          className="shrink-0 font-serif text-[11px] num"
+          className={`shrink-0 font-serif num ${vertical ? "text-[13px]" : "text-[11px]"}`}
          style={{ color: "rgba(195,155,75,0.9)" }}
        >
          {index + 1}.
        </span>
        <span
-          className="font-serif text-[13px] md:text-[14px] leading-snug"
+          className={`font-serif leading-snug ${vertical ? "text-[15px]" : "text-[13px] md:text-[14px]"}`}
          style={{ color: "rgba(245,235,210,0.95)" }}
        >
          {label}
@@ -160,8 +162,7 @@ function ChoiceButton({
 // ── Main component ─────────────────────────────────────────────────────
 export function PlayCanvas({
  imageUrl,
-  audioBase64,
-  audioMime,
+  audioSrc,
  muted,
  phase,
  beat,
@@ -170,12 +171,12 @@ export function PlayCanvas({
  onAdvance,
  onSelectChoice,
  fullViewport = false,
+  orientation = "landscape",
  aboveCanvas,
  aboveCanvasLeft,
 }: {
  imageUrl: string | null;
-  audioBase64: string | null;
-  audioMime: string | null;
+  audioSrc: string | null;
  muted: boolean;
  phase: Phase;
  beat: Beat | null;
@@ -184,6 +185,8 @@ export function PlayCanvas({
  onAdvance: () => void;
  onSelectChoice: (choice: BeatChoice) => void;
  fullViewport?: boolean;
+  // Session-locked image orientation. In "portrait" the image fills the viewport (object-fit:cover), choices stack vertically, and font sizes are larger.
+  orientation?: Orientation;
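  // Slot rendered directly above the image, right-aligned (outside the frame, flush with the top-right corner).
  aboveCanvas?: ReactNode;
  // Slot rendered directly above the image, left-aligned (outside the frame, flush with the top-left corner); horizontally mirrors aboveCanvas.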
@@ -204,7 +207,7 @@ export function PlayCanvas({
  const { shown: typedBody, done: typingDone, skip: skipTypewriter } =
    useTypewriter(displayBody, beat?.id ?? "", {
      targetDurationMs: audioDurationMs,
-      waitForAudio: Boolean(audioBase64),
+      waitForAudio: Boolean(audioSrc),
    });

  // ── Audio source change ──────────────────────────────────────────────
@@ -212,12 +215,12 @@ export function PlayCanvas({
  // unblock the typewriter via timeout so text doesn't stall.
  useEffect(() => {
    setAudioDurationMs(undefined);
-    if (!audioBase64) return;
+    if (!audioSrc) return;
    const timer = setTimeout(() => {
      setAudioDurationMs((prev) => prev ?? 0);
    }, AUDIO_WAIT_TIMEOUT_MS);
    return () => clearTimeout(timer);
-  }, [audioBase64]);
+  }, [audioSrc]);

  // ── Mute toggle ───────────────────────────────────────────────────────
  useEffect(() => {
@@ -225,12 +228,12 @@ export function PlayCanvas({
    if (!el) return;
    el.muted = muted;
    el.playbackRate = SPEECH_RATE;
-    if (!muted && audioBase64 && el.paused) {
+    if (!muted && audioSrc && el.paused) {
      el.play().catch(() => {
        // autoplay blocked — silent until next interaction
      });
    }
-  }, [muted, audioBase64]);
+  }, [muted, audioSrc]);

  function handleAudioMetadata() {
    const el = audioRef.current;
@@ -255,9 +258,27 @@ export function PlayCanvas({

  function handleImageClick(e: React.MouseEvent<HTMLImageElement>) {
    if (phase !== "ready" || !imgRef.current || !beat) return;
-    const rect = imgRef.current.getBoundingClientRect();
-    const x = (e.clientX - rect.left) / rect.width;
-    const y = (e.clientY - rect.top) / rect.height;
+    const el = imgRef.current;
+    const rect = el.getBoundingClientRect();
+    // Portrait renders with object-fit:cover, which scales the 9:16 image to
+    // FILL the box and crops the overflow — so the rendered box ≠ the full
+    // image. Map the click from box-space back into full-image-space via the
+    // cover geometry so the marker lands where the user tapped. Landscape's box
+    // matches the image aspect (no crop), so it keeps simple normalization.
+    let x: number;
+    let y: number;
+    if (orientation === "portrait") {
+      const nw = el.naturalWidth || 1024;
+      const nh = el.naturalHeight || 1792;
+      const scale = Math.max(rect.width / nw, rect.height / nh);
+      const dispW = nw * scale;
+      const dispH = nh * scale;
+      x = (e.clientX - rect.left + (dispW - rect.width) / 2) / dispW;
+      y = (e.clientY - rect.top + (dispH - rect.height) / 2) / dispH;
+    } else {
+      x = (e.clientX - rect.left) / rect.width;
+      y = (e.clientY - rect.top) / rect.height;
+    }
    // If the typewriter is still printing, a click completes it instantly
    // (standard VN affordance) — the page never sees this click.
    if (!typingDone) {
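
Review note: the portrait branch above inverts `object-fit: cover` geometry. A standalone sketch of the same mapping follows, assuming the default centered `object-position`. The names `mapCoverClick` and `Box` are illustrative and do not appear in the PR.

```typescript
// Box-space -> image-space mapping for object-fit: cover, assuming the
// image is centered in the box (the CSS default object-position).
interface Box {
  left: number;
  top: number;
  width: number;
  height: number;
}

function mapCoverClick(
  clientX: number,
  clientY: number,
  box: Box,
  naturalW: number,
  naturalH: number,
): { x: number; y: number } {
  // cover scales by the larger ratio so the image fills the box on both axes.
  const scale = Math.max(box.width / naturalW, box.height / naturalH);
  const dispW = naturalW * scale;
  const dispH = naturalH * scale;
  // The overflow is cropped equally on both sides, so half of it is the
  // offset of the box origin inside the scaled image.
  const offsetX = (dispW - box.width) / 2;
  const offsetY = (dispH - box.height) / 2;
  return {
    x: (clientX - box.left + offsetX) / dispW,
    y: (clientY - box.top + offsetY) / dispH,
  };
}
```

For a 100×200 image shown in a 100×100 box, a tap on the box's top-left corner maps to `y = 0.25`, not `0`, because the top quarter of the image is cropped away.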
@@ -291,13 +312,26 @@ export function PlayCanvas({
  const interactive = phase === "ready" && !!imageUrl;
  const dimmed = phase === "transitioning";

-  const sizeStyle = fullViewport
-    ? { maxWidth: "100vw", maxHeight: "100dvh" }
-    : { maxWidth: "96vw", maxHeight: "calc(100dvh - 200px)" };
+  const portrait = orientation === "portrait";
+  const intrinsicW = portrait ? 1024 : 1792;
+  const intrinsicH = portrait ? 1792 : 1024;

-  const placeholderWidth = fullViewport
-    ? "min(100vw, calc(100dvh * 16 / 9))"
-    : "min(96vw, calc((100dvh - 200px) * 16 / 9))";
+  // Portrait (mobile) always fills the whole viewport with object-fit:cover so
+  // the 9:16 image matches the exact device/window — no letterbox. Landscape
+  // keeps the prior contain-style sizing so the full 16:9 frame stays visible.
+  const sizeStyle: React.CSSProperties = portrait
+    ? { width: "100vw", height: "100dvh", objectFit: "cover" }
+    : fullViewport
+      ? { maxWidth: "100vw", maxHeight: "100dvh" }
+      : { maxWidth: "96vw", maxHeight: "calc(100dvh - 200px)" };
+
+  const placeholderStyle: React.CSSProperties = portrait
+    ? { width: "100vw", height: "100dvh" }
+    : {
+        width: fullViewport
+          ? "min(100vw, calc(100dvh * 16 / 9))"
+          : "min(96vw, calc((100dvh - 200px) * 16 / 9))",
+      };


  return (
@@ -305,11 +339,11 @@ export function PlayCanvas({
      className={`flex flex-col items-center ${fullViewport ? "w-full h-full justify-center" : "w-full"}`}
    >
      {/* Hidden audio element — voice playback for the current beat */}
-      {audioBase64 && (
+      {audioSrc && (
        <audio
-          key={audioBase64.slice(-48)}
+          key={audioSrc.slice(-48)}
          ref={audioRef}
-          src={`data:${audioMime ?? "audio/wav"};base64,${audioBase64}`}
+          src={audioSrc}
          preload="auto"
          onLoadedMetadata={handleAudioMetadata}
          onError={handleAudioError}
@@ -323,22 +357,23 @@ export function PlayCanvas({
          style={{ boxShadow: fullViewport ? "none" : SHADOW }}
        >
          {/* Background image — Runware CDN URL or data URI (mock mode).
-              The width/height attributes are NOT rendered dimensions (w-auto
-              h-auto + the maxWidth/maxHeight in sizeStyle still drive the
-              final layout); they give the browser an intrinsic aspect ratio
-              so that, while the bytes are still arriving from the CDN, the
-              <img> reserves a 1792:1024 box instead of collapsing to a
-              one-pixel sliver; fixes the "long wait → thin line → image suddenly appears" jank. */}
+              The width/height attributes give the browser the intrinsic aspect
+              ratio (1792:1024 landscape / 1024:1792 portrait) so that, while the
+              bytes are still arriving from the CDN, the <img> reserves the right
+              box instead of collapsing to a one-pixel sliver — fixes the
+              "long wait → thin line → image suddenly appears" jank. Landscape uses w-auto/h-auto +
+              maxWidth/maxHeight (contain); portrait switches sizeStyle to
+              100vw×100dvh with object-fit:cover (full-bleed, no letterbox). */}
          <img
            key={imageUrl.slice(-48)}
            ref={imgRef}
            src={imageUrl}
-            width={1792}
-            height={1024}
+            width={intrinsicW}
+            height={intrinsicH}
            alt="Generated scene"
            onClick={handleImageClick}
            draggable={false}
-            className={`block w-auto h-auto select-none animate-fade-in transition-opacity duration-700 ease-out ${
+            className={`block ${portrait ? "" : "w-auto h-auto"} select-none animate-fade-in transition-opacity duration-700 ease-out ${
              interactive ? "cursor-pointer" : "cursor-wait"
            } ${dimmed ? "opacity-40" : "opacity-100"}`}
            style={sizeStyle}
@@ -361,15 +396,29 @@ export function PlayCanvas({
          )}

          {beat && (
-            <div className="absolute inset-0 flex flex-col justify-end pointer-events-none select-none">
+            <div
+              className="absolute inset-0 flex flex-col justify-end pointer-events-none select-none"
+              style={
+                portrait
+                  ? { paddingBottom: "env(safe-area-inset-bottom)" }
+                  : undefined
+              }
+            >
              {choices.length > 0 && (
-                <div className="pointer-events-auto px-[3%] pb-[1.5%] flex gap-[1.5%] items-stretch">
+                <div
+                  className={`pointer-events-auto px-[3%] pb-[1.5%] flex items-stretch ${
+                    portrait
+                      ? "flex-col gap-2 max-h-[45dvh] overflow-y-auto"
+                      : "gap-[1.5%]"
+                  }`}
+                >
                  {choices.map((choice, i) => (
                    <ChoiceButton
                      key={choice.id}
                      index={i}
                      label={choice.label}
                      disabled={phase !== "ready"}
+                      vertical={portrait}
                      onClick={() => onSelectChoice(choice)}
                    />
                  ))}
@@ -407,7 +456,9 @@ export function PlayCanvas({

                  {beat.speaker && (
                    <p
-                      className="font-serif text-[11px] md:text-[12px] smallcaps mb-[0.6em]"
+                      className={`font-serif smallcaps mb-[0.6em] ${
+                        portrait ? "text-[13px]" : "text-[11px] md:text-[12px]"
+                      }`}
                      style={{ color: "rgba(205,165,90,0.92)" }}
                    >
                      {beat.speaker}
@@ -415,15 +466,17 @@ export function PlayCanvas({
                  )}

                  <p
-                    className="font-serif leading-[1.85] text-[13px] md:text-[15px]"
+                    className={`font-serif leading-[1.85] ${
+                      portrait ? "text-[16px]" : "text-[13px] md:text-[15px]"
+                    }`}
                    style={{ color: "rgba(245,235,210,0.95)" }}
                  >
                    {typedBody}
                    {beat.speaker && beat.narration && (
                      <span
-                        className={`block mt-[0.5em] italic text-[12px] md:text-[13px] transition-opacity duration-300 ${
-                          typingDone ? "opacity-100" : "opacity-0"
-                        }`}
+                        className={`block mt-[0.5em] italic transition-opacity duration-300 ${
+                          portrait ? "text-[14px]" : "text-[12px] md:text-[13px]"
+                        } ${typingDone ? "opacity-100" : "opacity-0"}`}
                        style={{ color: "rgba(200,185,155,0.78)" }}
                        aria-hidden={!typingDone}
                      >
@@ -488,11 +541,10 @@ export function PlayCanvas({
        </div>
      ) : (
        <div
-          className="relative aspect-video bg-cream-200 flex flex-col items-center justify-center gap-4"
-          style={{
-            width: placeholderWidth,
-            boxShadow: fullViewport ? "none" : SHADOW,
-          }}
+          className={`relative bg-cream-200 flex flex-col items-center justify-center gap-4 ${
+            portrait ? "" : "aspect-video"
+          }`}
+          style={{ ...placeholderStyle, boxShadow: fullViewport ? "none" : SHADOW }}
        >
          <div className="w-1.5 h-1.5 bg-clay-500 rounded-full animate-slow-pulse" />
          <p className="text-[9px] smallcaps text-clay-500 animate-slow-pulse">
@@ -0,0 +1,271 @@
+"use client";
+
+// Bring-your-own Xiaomi MiMo TTS key modal — shared by the homepage and the
+// play page. Two-step picker (key family → region for Token Plan only), key
+// stored CLIENT-SIDE ONLY (see lib/clientTtsConfig). `onSaved(configured)`
+// fires after a save/disable so each host can react (homepage flips the
+// 语音配音 toggle; the play page re-synthesizes the current scene in-browser).
+// `footerNote` lets the host tailor the closing hint to its own context.
+
+import { type ReactNode, useEffect, useState } from "react";
+import {
+  clearStoredTtsConfig,
+  readStoredTtsConfig,
+  writeStoredTtsConfig,
+} from "@/lib/clientTtsConfig";
+import {
+  findTtsPreset,
+  PAYG_PRESET_ID,
+  TTS_KEY_DOC_URL,
+  TTS_REGION_PRESETS,
+} from "@/lib/ttsPresets";
+
+const DEFAULT_FOOTER_NOTE: ReactNode =
+  "提示：需将上方「语音配音」设为「开启」配音才会生效。保存后本设备后续游玩会自动使用此 Key。";
+
+export function TtsKeyModal({
+  onClose,
+  onSaved,
+  footerNote = DEFAULT_FOOTER_NOTE,
+}: {
+  onClose: () => void;
+  onSaved: (configured: boolean) => void;
+  footerNote?: ReactNode;
+}) {
+  // Read storage once; useState initializers ignore later renders, so local
+  // edits aren't clobbered and we don't re-hit localStorage every render.
+  const [initial] = useState(() => readStoredTtsConfig());
+  // Two-step picker: choose key family first, then — only for Token Plan — a
+  // region. Pay-as-you-go (`sk-`) keys hit one fixed endpoint, so no region.
+  const initialKind = findTtsPreset(initial?.presetId)?.kind ?? "token-plan";
+  const [keyType, setKeyType] = useState<"token-plan" | "payg">(initialKind);
+  const [regionId, setRegionId] = useState<string>(
+    initialKind === "token-plan"
+      ? (initial?.presetId ?? TTS_REGION_PRESETS[0]!.id)
+      : TTS_REGION_PRESETS[0]!.id,
+  );
+  const [apiKey, setApiKey] = useState<string>(initial?.apiKey ?? "");
+  const [showKey, setShowKey] = useState(false);
+  const [shown, setShown] = useState(false);
+  const alreadyConfigured = initial != null;
+  // Soft guard: tp- keys belong to Token Plan, sk- to pay-as-you-go. A
+  // mismatched pairing hits the wrong endpoint → guaranteed auth failure →
+  // silent playback (the very symptom BYO exists to kill). Warn, but never
+  // block: prefix conventions could change and a hard gate would lock out an
+  // otherwise-valid key.
+  const expectedPrefix = keyType === "payg" ? "sk-" : "tp-";
+  const prefixMismatch =
+    apiKey.trim().length > 0 && !apiKey.trim().startsWith(expectedPrefix);
+
+  useEffect(() => {
+    const id = requestAnimationFrame(() => setShown(true));
+    return () => cancelAnimationFrame(id);
+  }, []);
+
+  const close = () => {
+    setShown(false);
+    setTimeout(onClose, 280);
+  };
+  const save = () => {
+    const key = apiKey.trim();
+    if (!key) return;
+    const presetId = keyType === "payg" ? PAYG_PRESET_ID : regionId;
+    writeStoredTtsConfig({ presetId, apiKey: key });
+    onSaved(true);
+    close();
+  };
+  const disable = () => {
+    clearStoredTtsConfig();
+    onSaved(false);
+    close();
+  };
+
+  return (
+    <div
+      onMouseDown={close}
+      className={
+        "fixed inset-0 z-[60] flex items-center justify-center p-6 md:p-10 transition-all duration-300 " +
+        (shown
+          ? "bg-clay-900/30 backdrop-blur-md"
+          : "bg-clay-900/0 backdrop-blur-0")
+      }
+    >
+      <div
+        onMouseDown={(e) => e.stopPropagation()}
+        className={
+          "flex w-[560px] max-w-[94vw] max-h-[88vh] flex-col overflow-hidden rounded-sm border border-clay-900/15 bg-cream-50 shadow-2xl shadow-clay-900/25 transition-all duration-300 " +
+          (shown ? "opacity-100 scale-100" : "opacity-0 scale-95")
+        }
+      >
+        <div className="flex items-center gap-5 px-6 md:px-8 py-5 border-b border-clay-900/10">
+          <div className="flex flex-col">
+            <span className="font-serif text-xl md:text-2xl text-clay-900">
+              自带配音 Key
+            </span>
+            <span className="text-[11px] text-clay-500 mt-1 tracking-wide">
+              可选 · 用你自己的小米 MiMo 免费额度，配音更稳定、延迟更低
+            </span>
+          </div>
+          <button
+            type="button"
+            onClick={close}
+            aria-label="关闭"
+            className="ml-auto text-xl leading-none text-clay-500 hover:text-clay-900 transition-colors"
+          >
+            <i className="fa-solid fa-xmark" />
+          </button>
+        </div>
+
+        <div className="flex flex-col gap-6 overflow-y-auto px-6 md:px-8 py-6">
+          <p className="text-[13px] leading-relaxed text-clay-600">
+            经常没有声音？公共语音模型有调用频率限额（RPM / TPM），同时游玩的人多时很容易撞到限额而静音。填入你自己的小米 MiMo API Key 后，配音将
+            <span className="text-clay-900">直接在你的浏览器里合成</span>
+            、使用你自己的免费额度 ——{" "}
+            <span className="text-clay-900">Key 只保存在本地浏览器、绝不经过我们的服务器</span>
+            。
+          </p>
+
+          <div className="flex flex-col gap-2">
+            <span className="text-[10px] smallcaps text-clay-500">K e y · 类 型</span>
+            <div className="grid grid-cols-2 gap-2">
+              {(
+                [
+                  { kind: "token-plan", label: "套餐 Token Plan", sub: "tp- 开头" },
+                  { kind: "payg", label: "按量付费 Pay-as-you-go", sub: "sk- 开头" },
+                ] as const
+              ).map((t) => {
+                const active = keyType === t.kind;
+                return (
+                  <button
+                    key={t.kind}
+                    type="button"
+                    onClick={() => setKeyType(t.kind)}
+                    className={
+                      "flex flex-col gap-0.5 rounded-sm border px-3 py-2.5 text-left transition-all " +
+                      (active
+                        ? "border-ember-500 bg-ember-500/5 text-clay-900"
+                        : "border-clay-900/12 text-clay-600 hover:border-clay-900/35 hover:bg-cream-100")
+                    }
+                  >
+                    <span className="text-[13px]">{t.label}</span>
+                    <span className="text-[10px] text-clay-400">{t.sub}</span>
+                  </button>
+                );
+              })}
+            </div>
+          </div>
+
+          {keyType === "token-plan" ? (
+            <div className="flex flex-col gap-2">
+              <span className="text-[10px] smallcaps text-clay-500">区 域 节 点</span>
+              <div className="grid grid-cols-1 gap-2 sm:grid-cols-3">
+                {TTS_REGION_PRESETS.map((p) => {
+                  const active = p.id === regionId;
+                  return (
+                    <button
+                      key={p.id}
+                      type="button"
+                      onClick={() => setRegionId(p.id)}
+                      className={
+                        "rounded-sm border px-3 py-2.5 text-left text-[13px] transition-all " +
+                        (active
+                          ? "border-ember-500 bg-ember-500/5 text-clay-900"
+                          : "border-clay-900/12 text-clay-600 hover:border-clay-900/35 hover:bg-cream-100")
+                      }
+                    >
+                      {p.label}
+                    </button>
+                  );
+                })}
+              </div>
+              <span className="text-[11px] text-clay-400">
+                选择与你的套餐订阅地区一致的节点（通常也是延迟最低的那个）。
+              </span>
+            </div>
+          ) : (
+            <div className="flex items-start gap-2 rounded-sm border border-clay-900/10 bg-cream-100/60 px-3.5 py-2.5">
+              <i className="fa-solid fa-circle-info mt-0.5 text-[11px] text-clay-400" />
+              <span className="text-[11px] leading-relaxed text-clay-500">
+                按量付费使用统一端点{" "}
+                <span className="text-clay-700">api.xiaomimimo.com</span>
+                ，无需选择区域。
+              </span>
+            </div>
+          )}
+
+          <div className="flex flex-col gap-2">
+            <span className="text-[10px] smallcaps text-clay-500">
+              A P I · K e y
+            </span>
+            <div className="relative">
+              <input
+                value={apiKey}
+                onChange={(e) => setApiKey(e.target.value)}
+                type={showKey ? "text" : "password"}
+                autoComplete="off"
+                spellCheck={false}
+                placeholder={
+                  keyType === "payg"
+                    ? "粘贴 sk- 开头的按量 Key"
+                    : "粘贴 tp- 开头的套餐 Key"
+                }
+                className="h-11 w-full rounded-sm border border-clay-900/15 bg-cream-100 pl-4 pr-11 font-sans text-sm text-clay-900 outline-none transition-colors focus:border-ember-500 placeholder:text-clay-400"
+              />
+              <button
+                type="button"
+                onClick={() => setShowKey((v) => !v)}
+                aria-label={showKey ? "隐藏" : "显示"}
+                className="absolute right-3 top-1/2 -translate-y-1/2 text-clay-400 hover:text-clay-700 transition-colors"
+              >
+                <i
+                  className={`fa-solid ${showKey ? "fa-eye-slash" : "fa-eye"} text-sm`}
+                />
+              </button>
+            </div>
+            {prefixMismatch && (
+              <span className="flex items-start gap-1.5 text-[11px] leading-relaxed text-ember-500">
+                <i className="fa-solid fa-triangle-exclamation mt-0.5 text-[10px]" />
+                此 Key 不是 {expectedPrefix} 开头，可能与所选「
+                {keyType === "payg" ? "按量付费 Pay-as-you-go" : "套餐 Token Plan"}
+                」类型不符，请确认是否填错。
+              </span>
+            )}
+            <a
+              href={TTS_KEY_DOC_URL}
+              target="_blank"
+              rel="noopener noreferrer"
+              className="inline-flex items-center gap-1.5 text-[11px] text-ember-500 hover:text-ember-400 transition-colors"
+            >
+              <i className="fa-brands fa-github text-[11px]" />
+              如何免费申请 Key？查看图文教程
+            </a>
+          </div>
+
+          <p className="text-[11px] leading-relaxed text-clay-400">{footerNote}</p>
+        </div>
+
+        <div className="flex items-center gap-3 border-t border-clay-900/10 px-6 md:px-8 py-4">
+          {alreadyConfigured && (
+            <button
+              type="button"
+              onClick={disable}
+              className="inline-flex items-center gap-2 rounded-sm border border-clay-900/15 px-4 py-2 font-sans text-sm text-clay-600 transition-colors hover:border-clay-900/35 hover:text-clay-900"
+            >
+              <i className="fa-solid fa-rotate-left text-xs" />
+              停用并清除
+            </button>
+          )}
+          <button
+            type="button"
+            onClick={save}
+            disabled={!apiKey.trim()}
+            className="ml-auto inline-flex items-center gap-2 rounded-sm bg-clay-900 px-5 py-2.5 font-sans text-sm text-cream-50 transition-colors hover:bg-ember-500 disabled:cursor-not-allowed disabled:opacity-40"
+          >
+            <i className="fa-solid fa-check text-xs" />
+            保存并启用
+          </button>
+        </div>
+      </div>
+    </div>
+  );
+}
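The soft prefix guard in the dialog above can be read in isolation as a pure function. The sketch below is illustrative only: `KeyType`, `expectedPrefixFor`, and `hasPrefixMismatch` are hypothetical names, not exports of the component. Only the `tp-` / `sk-` convention is taken from the component itself.

```typescript
// Hypothetical standalone version of the dialog's soft prefix guard.
type KeyType = "token-plan" | "payg";

// tp- keys belong to Token Plan, sk- keys to pay-as-you-go.
function expectedPrefixFor(keyType: KeyType): string {
  return keyType === "payg" ? "sk-" : "tp-";
}

// True only when a non-empty key does not start with the expected prefix.
// Empty or whitespace-only input never warns; the Save button is disabled
// in that case instead.
function hasPrefixMismatch(keyType: KeyType, apiKey: string): boolean {
  const key = apiKey.trim();
  return key.length > 0 && !key.startsWith(expectedPrefixFor(keyType));
}
```

Because the result only drives a warning, a key with an unexpected prefix can still be saved, which matches the "warn, but never block" intent stated in the component's comment.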
@@ -0,0 +1,106 @@
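+# Bring-Your-Own Voice Key Guide (Xiaomi MiMo TTS)
+
+InfiPlot's character voices are synthesized in real time by Xiaomi's **MiMo-V2.5-TTS** model. This guide walks you through applying for your own free API Key.
+Once you enter it in InfiPlot you get **more stable voice-over and lower latency**, and the Key is **stored only in your browser and never passes through our servers**.
+
+> This guide is maintained alongside the repository, so the link stays valid long term.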
+
+---
+
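+## Why bring your own Key?
+
+By default, InfiPlot uses a single **public Key** to provide voice-over for all users. Xiaomi enforces **RPM (requests per minute) / TPM (tokens per minute)** rate limits on its speech models, and the public Key's quota is shared by every user. When many people are online at once, the public Key easily hits the limit, which shows up as:
+
+- story and visuals work normally, but there is **no sound at all** (silence);
+- or the voice-over stutters or takes a long time to arrive.
+
+Once you enter your **own** Key you draw on a separate quota and are no longer affected by other users:
+
+- ✅ **Stable voice-over**, with no more random silence;
+- ✅ **Lower latency** (Token Plan Keys can also pick a nearby regional node);
+- ✅ **Completely free**: MiMo-V2.5-TTS is currently **free** for a limited time and does not consume plan quota.
+
+This is an **optional enhancement**. You can play normally without it; you are just more likely to hit silence at peak times.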
+
+---
+
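+## 1. Apply for a free API Key
+
+1. Open the Xiaomi MiMo open platform and sign up / log in: <https://platform.xiaomimimo.com>
+
+2. **Recommended: get a pay-as-you-go Key (starts with `sk-`)**
+   - Go to **Console (控制台) → API Keys**: <https://platform.xiaomimimo.com/console/api-keys>
+   - Create or copy your API Key on that page (it looks like `sk-xxxxxxxx`).
+   - A pay-as-you-go Key works right after registration, with **no plan purchase required**, and suits most users.
+
+3. **Alternative: get a Token Plan Key (starts with `tp-`)**
+   - If you have already purchased a Token Plan, go to **Console (控制台) → Plan Management (套餐管理)**: <https://platform.xiaomimimo.com/console/plan-manage>
+   - Copy your plan API Key on that page (it looks like `tp-xxxxxxxx`).
+
+4. Keep your Key safe and **do not share it publicly**.
+
+> The MiMo-V2.5-TTS series is currently **free** for a limited time (it does not consume plan Credits), so voice-over should incur essentially no cost. The platform's announcements are authoritative.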
+
+---
+
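+## 2. Choose the Key type (Token Plan also needs a region)
+
+Xiaomi issues **two kinds of Key**, each tied to a different service address. In InfiPlot you must **choose the Key type first**. The Key's prefix tells you which it is: `sk-` is pay-as-you-go, `tp-` is Token Plan, and the two are not interchangeable.
+
+**① Pay-as-you-go (`sk-` prefix)**: uses the single service address `https://api.xiaomimimo.com/v1`, with **no region to choose**; just paste the Key.
+
+**② Token Plan (`tp-` prefix)**: also requires choosing a **regional node**, matching the regions where Xiaomi deploys its Token Plan service:
+
+| Region | Notes | Service address |
+| --- | --- | --- |
+| Singapore | Recommended for Asia-Pacific | `https://token-plan-sgp.xiaomimimo.com/v1` |
+| China (mainland) | Recommended for mainland China | `https://token-plan-cn.xiaomimimo.com/v1` |
+| Europe · Amsterdam | Recommended for Europe | `https://token-plan-ams.xiaomimimo.com/v1` |
+
+Choose the node that **matches the region of your plan subscription** (usually also the closest one, with the lowest latency).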
+
+---
+
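+## 3. Enter it in InfiPlot
+
+1. Go back to the InfiPlot **home page** and, below the options area, click **「经常没声音？自带配音 Key（可选）」** ("No sound often? Bring your own voice Key (optional)").
+2. In the dialog:
+   - **Choose the Key type** (pay-as-you-go / Token Plan); "Token Plan" also requires you to **choose a region**, while "pay-as-you-go" does not;
+   - **Paste your API Key**.
+3. Click **「保存并启用」** ("Save and enable"). The button changes to **「自带配音 Key · 已启用」** ("Bring-your-own voice Key · enabled"), and 「语音配音」 ("Voice-over") switches to 「开启」 ("On") automatically.
+4. Start playing: voice-over is now handled by your browser **connecting directly to Xiaomi's service**.
+
+To turn it off, open the dialog again and click **「停用并清除」** ("Disable and clear"); the locally stored Key is deleted as well.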
+
+---
+
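+## 4. Privacy
+
+- Your API Key is stored **only in the current browser's `localStorage`** (key name `infiplot:tts`).
+- Once enabled, voice-over requests are sent **directly from your browser to Xiaomi's** corresponding service address, carrying your Key.
+- Our servers take **no part** in this path and **neither see nor log** your Key.
+- After switching devices or browsers, or clearing browser data, you need to enter the Key again. This is expected behavior.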
+
+---
+
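+## 5. FAQ
+
+**Q: I entered a Key but there is still no sound?**
+- Check that 「语音配音」 ("Voice-over") is set to 「开启」 ("On");
+- Check that the **Key type is correct**: choose "pay-as-you-go" for `sk-` and "Token Plan" for `tp-`; the wrong type causes authentication to fail;
+- Check that the Key has no typos or stray spaces and still has quota available;
+- With a Token Plan Key, try switching the **region** (a region that does not match your subscription can also cause failures);
+- Open the Network panel in your browser's developer tools and inspect the error returned by requests to `*.xiaomimimo.com`.
+
+**Q: Will I be charged?**
+- MiMo-V2.5-TTS is currently free for a limited time, and voice-over during normal play does not consume plan quota. Xiaomi's billing announcements are authoritative.
+
+**Q: Should I use `sk-` or `tp-`?**
+- `sk-` (pay-as-you-go) is recommended: it works right after registration with no plan purchase. If you already have a Token Plan, you can use `tp-` (Token Plan Key) instead. The two are not interchangeable, and choosing the wrong type causes authentication to fail.
+
+**Q: Is my Key safe?**
+- Yes. The Key exists only in your local browser, is sent only to Xiaomi's official service addresses, and never passes through InfiPlot's servers. Even so, do not publish your Key or share it with others.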
+
+---
+
+Questions and feedback are welcome on [GitHub Issues](https://github.com/zonghaoyuan/infiplot/issues).
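The key-type and region choice described in the tutorial boils down to picking one base URL. The sketch below shows that mapping under stated assumptions: the region ids (`sgp`, `cn`, `ams`), the `KeyChoice` type, and `resolveTtsBaseUrl` are hypothetical names chosen for illustration, and the real preset ids live in `ttsPresets`, which is not shown in this diff. The service addresses are the ones listed in the tutorial's region table.

```typescript
// Hypothetical region ids; the real preset ids are defined in ttsPresets.
type Region = "sgp" | "cn" | "ams";

// Pay-as-you-go keys carry no region; Token Plan keys always do.
type KeyChoice = { kind: "payg" } | { kind: "token-plan"; region: Region };

// Service addresses as listed in the tutorial.
const PAYG_BASE_URL = "https://api.xiaomimimo.com/v1";
const TOKEN_PLAN_BASE_URLS: Record<Region, string> = {
  sgp: "https://token-plan-sgp.xiaomimimo.com/v1",
  cn: "https://token-plan-cn.xiaomimimo.com/v1",
  ams: "https://token-plan-ams.xiaomimimo.com/v1",
};

// Map the user's choice to the endpoint the browser should call directly.
function resolveTtsBaseUrl(choice: KeyChoice): string {
  return choice.kind === "payg"
    ? PAYG_BASE_URL
    : TOKEN_PLAN_BASE_URLS[choice.region];
}
```

Modelling the choice as a discriminated union makes "pay-as-you-go with a region" unrepresentable, which mirrors the dialog hiding the region picker for `sk-` keys.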
@@ -1,5 +1,10 @@
-import type { ProviderConfig } from "@infiplot/types";
+import { generateText } from "ai";
+import type { LanguageModelUsage, ModelMessage } from "ai";
+import { createAnthropic } from "@ai-sdk/anthropic";
+import { createGoogleGenerativeAI } from "@ai-sdk/google";
+import type { ProviderConfig, ProviderProtocol } from "@infiplot/types";
 import { fetchWithRetry } from "./fetchWithRetry";
+import { normalizeBaseUrl } from "./normalizeUrl";

 export type ChatMessage = {
  role: "system" | "user" | "assistant";
@@ -57,6 +62,31 @@ function summarizeUsage(tag: string, usage: Usage | undefined): string {
  return `[cache] ${tag} prompt=${prompt} completion=${completion} (provider didn't report cache stats)`;
 }

+// AI SDK 6 unifies cache stats across providers into usage.inputTokenDetails,
+// so a single shape covers Anthropic + Gemini (no per-provider probing).
+function summarizeSdkUsage(
+  tag: string,
+  usage: LanguageModelUsage | undefined,
+): string {
+  if (!usage) return `[cache] ${tag} no-usage`;
+  const input = usage.inputTokens ?? 0;
+  const output = usage.outputTokens ?? 0;
+  const read = usage.inputTokenDetails?.cacheReadTokens;
+  const write = usage.inputTokenDetails?.cacheWriteTokens;
+  if (typeof read === "number" || typeof write === "number") {
+    const hit = read ?? 0;
+    const create = write ?? 0;
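+    // Keep the "%" inside the computed value so the no-input case logs
+    // "rate=n/a" rather than "rate=n/a%".
+    const rate = input > 0 ? `${((hit / input) * 100).toFixed(1)}%` : "n/a";
+    return `[cache] ${tag} hit=${hit} create=${create} input=${input} rate=${rate} completion=${output}`;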
+  }
+  return `[cache] ${tag} input=${input} completion=${output} (provider didn't report cache stats)`;
+}
+
+// text/vision default to the OpenAI-compatible wire protocol when unset.
+function resolveTextProtocol(config: ProviderConfig): ProviderProtocol {
+  return config.provider ?? "openai_compatible";
+}
+
 export async function chat(
  config: ProviderConfig,
  messages: ChatMessage[],
@@ -66,7 +96,63 @@ export async function chat(
    tag?: string;
  },
 ): Promise<string> {
-  const url = `${config.baseUrl.replace(/\/$/, "")}/chat/completions`;
+  const protocol = resolveTextProtocol(config);
+  if (protocol === "anthropic" || protocol === "google") {
+    return chatViaAiSdk(config, messages, opts, protocol);
+  }
+  return chatOpenAiCompatible(config, messages, opts);
+}
+
+// Native Anthropic / Gemini via the Vercel AI SDK. response_format is not sent
+// (Anthropic has no JSON mode); the engine relies on parseJsonLoose downstream,
+// matching how it already tolerates loose JSON from every provider.
+async function chatViaAiSdk(
+  config: ProviderConfig,
+  messages: ChatMessage[],
+  opts: { temperature?: number; tag?: string } | undefined,
+  protocol: "anthropic" | "google",
+): Promise<string> {
+  const baseURL = normalizeBaseUrl(config.baseUrl, protocol);
+  const model =
+    protocol === "anthropic"
+      ? createAnthropic({ apiKey: config.apiKey, baseURL })(config.model)
+      : createGoogleGenerativeAI({ apiKey: config.apiKey, baseURL })(
+          config.model,
+        );
+
+  const system = messages.find((m) => m.role === "system")?.content;
+  const convo: ModelMessage[] = messages
+    .filter((m) => m.role !== "system")
+    .map((m) => ({
+      role: m.role as "user" | "assistant",
+      content: m.content,
+    }));
+
+  const { text, usage } = await generateText({
+    model,
+    system,
+    messages: convo,
+    temperature: opts?.temperature ?? 0.9,
+  });
+
+  console.log(summarizeSdkUsage(opts?.tag ?? "chat", usage));
+
+  if (typeof text !== "string" || text.length === 0) {
+    throw new Error(`Chat API (AI SDK ${protocol}) returned no content.`);
+  }
+  return text;
+}
+
+async function chatOpenAiCompatible(
+  config: ProviderConfig,
+  messages: ChatMessage[],
+  opts?: {
+    temperature?: number;
+    responseFormat?: "json_object" | "text";
+    tag?: string;
+  },
+): Promise<string> {
+  const url = `${normalizeBaseUrl(config.baseUrl, "openai_compatible")}/chat/completions`;
  const body: Record<string, unknown> = {
    model: config.model,
    messages,
@@ -5,6 +5,7 @@ export async function fetchWithRetry(
  init: RetryInit,
 ): Promise<Response> {
  const { retries = 2, retryDelayMs = 1500, ...fetchInit } = init;
+  if (!fetchInit.redirect) fetchInit.redirect = "manual";

  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
@@ -1,5 +1,9 @@
-import type { ProviderConfig } from "@infiplot/types";
+import { generateImage as generateImageSdk } from "ai";
+import { createOpenAI } from "@ai-sdk/openai";
+import { createGoogleGenerativeAI } from "@ai-sdk/google";
+import type { Orientation, ProviderConfig, ProviderProtocol } from "@infiplot/types";
 import { fetchWithRetry } from "./fetchWithRetry";
+import { normalizeBaseUrl } from "./normalizeUrl";

 // Runware uses its own task-array protocol (not OpenAI-compatible).
 // POST <baseUrl> with [{ taskType: "imageInference", ... }]; errors come
@@ -38,30 +42,71 @@ export type GenerateImageOptions = {
   * Reference image (UUID, public URL, or base64) for img2img. When set,
   * FLUX preserves the seed image's composition and applies `strength` to
   * deviate. NOTE: FLUX.2 [klein] 9B KV does NOT support seedImage — use
-   * `referenceImages` for visual continuity instead.
+   * `referenceImages` for visual continuity instead. Runware-only.
   */
  seedImage?: string;
  /**
   * Reference images (UUIDs, URLs, or base64) to condition generation on —
   * typically character portraits + the prior scene image. Runware caps at 4;
-   * we silently truncate beyond that.
+   * we silently truncate beyond that. On the OpenAI/Gemini AI SDK paths these
+   * map to `prompt.images` (the SDK accepts public URLs or data URLs).
   */
  referenceImages?: string[];
-  /** 0–1, FLUX needs ≥ 0.8 to actually have an effect. */
+  /** 0–1, FLUX needs ≥ 0.8 to actually have an effect. Runware-only. */
  strength?: number;
+  /**
+   * Output aspect, locked per session. "portrait" → 9:16 vertical for mobile;
+   * default/"landscape" → 16:9 widescreen. Mapped to each provider's nearest
+   * supported size: Runware 1024×1792, OpenAI-compatible REST 1024x1792,
+   * native gpt-image 1024x1536, Gemini aspectRatio 9:16.
+   */
+  orientation?: Orientation;
 };

 export type GenerateImageResult = {
-  /** Public CDN URL of the generated image (Runware-hosted). */
+  /**
+   * Image the client can render directly. A Runware CDN URL on the Runware
+   * path; a `data:<mime>;base64,...` URI on the AI SDK paths (OpenAI/Gemini
+   * return raw bytes, not a hosted URL).
+   */
  imageUrl: string;
-  /** Stable UUID for cheap re-reference in later `referenceImages`. */
+  /**
+   * Stable handle for cheap re-reference in later `referenceImages`. A real
+   * Runware UUID on the Runware path; a synthetic UUID on other paths (those
+   * re-reference via the URL/data-URL form instead).
+   */
  imageUuid: string;
 };

+// Match the Runware host by parsed hostname (exact match or subdomain), not a
+// bare substring — otherwise `notrunware.ai` or `api.runware.ai.evil.com` would
+// misroute to the Runware protocol. Falls back to false on an unparseable URL.
+function isRunwareHost(baseUrl: string): boolean {
+  try {
+    const host = new URL(baseUrl).hostname.toLowerCase();
+    return host === "runware.ai" || host.endsWith(".runware.ai");
+  } catch {
+    return false;
+  }
+}
+
+// Image roles support more protocols than text/vision. When IMAGE_PROVIDER is
+// unset we keep the historical URL-based inference so existing deployments
+// (Runware, or an OpenAI-compatible gateway) behave exactly as before.
+function inferImageProtocol(config: ProviderConfig): ProviderProtocol {
+  const isOpenAiCompat =
+    !isRunwareHost(config.baseUrl) || config.model === "image-2-vip";
+  return isOpenAiCompat ? "openai_compatible" : "runware";
+}
+
+function resolveImageProtocol(config: ProviderConfig): ProviderProtocol {
+  return config.provider ?? inferImageProtocol(config);
+}
+
 // ──────────────────────────────────────────────────────────────────────
 //  generateImage — text-to-image (default) or referenceImages-conditioned.
-//  Returns both the public URL (for client display + future references)
-//  and the UUID (cheapest reference form for subsequent calls).
+//  Returns both a renderable image URL and a re-reference handle (see
+//  GenerateImageResult). Dispatches on the resolved wire protocol.
 // ──────────────────────────────────────────────────────────────────────

 export async function generateImage(
@@ -69,58 +114,135 @@ export async function generateImage(
  prompt: string,
  options?: GenerateImageOptions,
 ): Promise<GenerateImageResult> {
-  const url = config.baseUrl.replace(/\/$/, "");
+  const protocol = resolveImageProtocol(config);
+  switch (protocol) {
+    case "openai":
+    case "google":
+      return generateImageViaAiSdk(config, prompt, options, protocol);
+    case "runware":
+      return generateImageRunware(config, prompt, options);
+    case "anthropic":
+      throw new Error(
+        'IMAGE_PROVIDER "anthropic" does not generate images. Use "openai", "google", "runware", or "openai_compatible".',
+      );
+    case "openai_compatible":
+    default:
+      return generateImageOpenAiCompatible(config, prompt, options);
+  }
+}

-  // 1. OpenAI-compatible route (GPTGod, DALL-E, etc.)
-  const isOpenAi = !url.includes("runware.ai") || config.model === "image-2-vip";
-  if (isOpenAi) {
-    const endpoint = url.endsWith("/images/generations") ? url : `${url}/images/generations`;
-    console.log(`[ai-client] Calling OpenAI-compatible image generations at: ${endpoint} with model: ${config.model}`);
-    
-    const res = await fetchWithRetry(endpoint, {
-      method: "POST",
-      headers: {
-        "Content-Type": "application/json",
-        Authorization: `Bearer ${config.apiKey}`,
-      },
-      body: JSON.stringify({
-        model: config.model,
-        prompt: prompt,
-        n: 1,
-        size: "1792x1024", // Use horizontal size (16:9)
-      }),
-    });
+// Native OpenAI (gpt-image) / Gemini (Nano Banana) via the Vercel AI SDK.
+// Unlike the fetch path, this supports reference-image editing via
+// `prompt.images`. The SDK returns raw bytes (no hosted URL), so we hand the
+// client a data URI and synthesize a UUID; continuity references reuse the
+// data URI rather than a provider UUID.
+async function generateImageViaAiSdk(
+  config: ProviderConfig,
+  prompt: string,
+  options: GenerateImageOptions | undefined,
+  protocol: "openai" | "google",
+): Promise<GenerateImageResult> {
+  const baseURL = normalizeBaseUrl(config.baseUrl, protocol);
+  const imageModel =
+    protocol === "openai"
+      ? createOpenAI({ apiKey: config.apiKey, baseURL }).image(config.model)
+      : createGoogleGenerativeAI({ apiKey: config.apiKey, baseURL }).image(
+          config.model,
+        );

-    const text = await res.text();
-    let json: any;
-    try {
-      json = JSON.parse(text);
-    } catch {
-      throw new Error(`OpenAI Image API error ${res.status}: ${text.slice(0, 500)}`);
-    }
+  const refs = (options?.referenceImages ?? []).slice(0, MAX_REFERENCE_IMAGES);
+  const promptArg =
+    refs.length > 0 ? { text: prompt, images: refs } : prompt;

-    if (json.error) {
-      throw new Error(`OpenAI Image API error: ${json.error.message || JSON.stringify(json.error)}`);
-    }
+  // Session-locked aspect. gpt-image takes an explicit `size` (portrait /
+  // landscape options are 1024x1536 / 1536x1024); Gemini takes an `aspectRatio`.
+  const portrait = options?.orientation === "portrait";
+  const { image } = await generateImageSdk({
+    model: imageModel,
+    prompt: promptArg,
+    ...(protocol === "openai"
+      ? { size: (portrait ? "1024x1536" : "1536x1024") as `${number}x${number}` }
+      : { aspectRatio: (portrait ? "9:16" : "16:9") as `${number}:${number}` }),
+  });

-    const data = json.data?.[0];
-    const imageUrl = data?.url;
-    if (!imageUrl) {
-      throw new Error(`No image URL in OpenAI response: ${text.slice(0, 300)}`);
-    }
-    // Generate a mock UUID since OpenAI compatible endpoint doesn't have UUIDs
-    const imageUuid = crypto.randomUUID();
-    return { imageUrl, imageUuid };
+  return {
+    imageUrl: `data:${image.mediaType};base64,${image.base64}`,
+    imageUuid: crypto.randomUUID(),
+  };
+}
+
+// OpenAI-compatible REST route (GPTGod, DALL-E proxies, etc.). Basic
+// text-to-image only — no reference images on this path; for editing/anchoring
+// set IMAGE_PROVIDER=openai (or google) to take the AI SDK path above.
+async function generateImageOpenAiCompatible(
+  config: ProviderConfig,
+  prompt: string,
+  options?: GenerateImageOptions,
+): Promise<GenerateImageResult> {
+  const base = normalizeBaseUrl(config.baseUrl, "openai_compatible");
+  const endpoint = `${base}/images/generations`;
+  console.log(
+    `[ai-client] Calling OpenAI-compatible image generations at: ${endpoint} with model: ${config.model}`,
+  );
+
+  const res = await fetchWithRetry(endpoint, {
+    method: "POST",
+    headers: {
+      "Content-Type": "application/json",
+      Authorization: `Bearer ${config.apiKey}`,
+    },
+    body: JSON.stringify({
+      model: config.model,
+      prompt,
+      n: 1,
+      // Session-locked aspect (16:9 default, 9:16 portrait for mobile).
+      size: options?.orientation === "portrait" ? "1024x1792" : "1792x1024",
+    }),
+  });
+
+  const text = await res.text();
+  let json: any;
+  try {
+    json = JSON.parse(text);
+  } catch {
+    throw new Error(`OpenAI Image API error ${res.status}: ${text.slice(0, 500)}`);
  }

-  // 2. Runware task-array route
+  if (json.error) {
+    throw new Error(`OpenAI Image API error: ${json.error.message || JSON.stringify(json.error)}`);
+  }
+
+  const data = json.data?.[0];
+  const imageUrl = data?.url;
+  if (!imageUrl) {
+    throw new Error(`No image URL in OpenAI response: ${text.slice(0, 300)}`);
+  }
+  // OpenAI-compatible endpoints don't return a UUID, so synthesize one.
+  const imageUuid = crypto.randomUUID();
+  return { imageUrl, imageUuid };
+}
+
+// Runware task-array route — self-implemented to preserve the UUID/URL closed
+// loop (the official @runware/ai-sdk-provider drops both).
+async function generateImageRunware(
+  config: ProviderConfig,
+  prompt: string,
+  options?: GenerateImageOptions,
+): Promise<GenerateImageResult> {
+  const url = normalizeBaseUrl(config.baseUrl, "runware");
+
+  // Session-locked output aspect. Image models emit a FIXED pixel size; CSS
+  // object-fit on the client adapts this frame to the exact device/window. Both
+  // dimensions stay a multiple of 64 as FLUX requires.
+  const portrait = options?.orientation === "portrait";
+
  const task: Record<string, unknown> = {
    taskType: "imageInference",
    taskUUID: crypto.randomUUID(),
    model: config.model,
    positivePrompt: prompt,
-    width: 1792,
-    height: 1024,
+    width: portrait ? 1024 : 1792,
+    height: portrait ? 1792 : 1024,
    steps: 4,
    CFGScale: 3.5,
    numberResults: 1,
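To illustrate why the Runware check in this file parses the hostname instead of using a substring test, the sketch below restates `isRunwareHost` in self-contained form and runs it against a few lookalike URLs. The sample URLs are made up for the demonstration; the function body follows the one added in this diff.

```typescript
// Same logic as isRunwareHost in this diff, restated so it runs standalone.
function isRunwareHost(baseUrl: string): boolean {
  try {
    const host = new URL(baseUrl).hostname.toLowerCase();
    // Exact apex match, or a true subdomain (note the leading dot).
    return host === "runware.ai" || host.endsWith(".runware.ai");
  } catch {
    // Unparseable input is never treated as Runware.
    return false;
  }
}

// Illustrative inputs only.
const samples: Array<[string, boolean]> = [
  ["https://api.runware.ai/v1", true],        // real subdomain
  ["https://RUNWARE.AI", true],               // apex, case-insensitive
  ["https://notrunware.ai/v1", false],        // substring match, wrong host
  ["https://api.runware.ai.evil.com", false], // Runware name used as a prefix
  ["not a url", false],                       // parse failure
];

for (const [url, expected] of samples) {
  console.log(url, isRunwareHost(url) === expected ? "ok" : "MISMATCH");
}
```

A plain `url.includes("runware.ai")`, as in the removed code, would have returned true for the third and fourth samples and sent those requests down the Runware protocol.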
@@ -0,0 +1,66 @@
+import type { ProviderProtocol } from "@infiplot/types";
+
+// ──────────────────────────────────────────────────────────────────────
+//  Base-URL normalization — tolerate whatever shape the user pastes.
+//
+//  The README never specified whether the base URL needs a `/v1` suffix,
+//  so users provide all of these for the same endpoint:
+//      https://api.deepseek.com
+//      https://api.deepseek.com/v1
+//      https://api.deepseek.com/v1/chat/completions
+//  We normalize to a canonical base the adapter can safely append its own
+//  endpoint path to. This also fixes the pre-existing double-suffix bug
+//  where a pasted `.../chat/completions` became `.../chat/completions/chat/completions`.
+//
+//  Strategy (bare-host-only version append):
+//    1. strip trailing slashes
+//    2. strip a trailing known endpoint suffix (chat/completions, messages, …)
+//    3. only when the URL the user gave is a BARE host (scheme://host[:port]
+//       with no path) do we append the protocol's default version segment.
+//       Any path the user wrote (/v1, /beta, /zen/go, /chat/completions, …) is
+//       treated as an explicit location and left intact — so we never turn
+//       `/beta` into `/beta/v1`, and a version-less `/chat/completions`
+//       endpoint is preserved.
+// ──────────────────────────────────────────────────────────────────────
+
+// Endpoint paths an adapter appends itself — stripped so we keep only the base.
+const ENDPOINT_SUFFIX =
+  /\/(chat\/completions|completions|responses|messages|images\/(generations|edits))\/?$/i;
+
+// Default version segment to append per protocol for a bare host.
+const DEFAULT_VERSION_SEGMENT: Record<ProviderProtocol, string | null> = {
+  openai_compatible: "v1",
+  openai: "v1",
+  anthropic: "v1",
+  google: "v1beta",
+  // Runware posts to the bare base URL with no version-pathed sub-resource,
+  // so never inject a segment for it.
+  runware: null,
+};
+
+// True when `raw` is just scheme://host[:port] with no meaningful path — the
+// only shape where we infer a default version segment. A lone "/" counts as
+// bare. Falls back to a scheme-anchored regex if the URL can't be parsed.
+function isBareHost(raw: string): boolean {
+  try {
+    const { pathname } = new URL(raw);
+    return pathname === "" || pathname === "/";
+  } catch {
+    return !/^[a-z][a-z0-9+.-]*:\/\/[^/]+\/.+/i.test(raw);
+  }
+}
+
+export function normalizeBaseUrl(
+  raw: string,
+  protocol: ProviderProtocol,
+): string {
+  const trimmed = raw.trim();
+  let u = trimmed.replace(/\/+$/, "");
+  u = u.replace(ENDPOINT_SUFFIX, "").replace(/\/+$/, "");
+
+  const seg = DEFAULT_VERSION_SEGMENT[protocol];
+  if (seg && isBareHost(trimmed)) {
+    u = `${u}/${seg}`;
+  }
+  return u;
+}
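To make the bare-host rule concrete, the sketch below restates the normalizer from this file in compact, self-contained form and lists what a few pasted inputs resolve to. The local `Protocol` alias stands in for `ProviderProtocol` from `@infiplot/types`; the example URLs other than the DeepSeek ones from the file's own comment are illustrative.

```typescript
// Local stand-in for ProviderProtocol from @infiplot/types.
type Protocol = "openai_compatible" | "openai" | "anthropic" | "google" | "runware";

const ENDPOINT_SUFFIX =
  /\/(chat\/completions|completions|responses|messages|images\/(generations|edits))\/?$/i;

const DEFAULT_VERSION_SEGMENT: Record<Protocol, string | null> = {
  openai_compatible: "v1",
  openai: "v1",
  anthropic: "v1",
  google: "v1beta",
  runware: null,
};

// True only for scheme://host[:port] with no meaningful path.
function isBareHost(raw: string): boolean {
  try {
    const { pathname } = new URL(raw);
    return pathname === "" || pathname === "/";
  } catch {
    return !/^[a-z][a-z0-9+.-]*:\/\/[^/]+\/.+/i.test(raw);
  }
}

function normalizeBaseUrl(raw: string, protocol: Protocol): string {
  const trimmed = raw.trim();
  // 1. strip trailing slashes, 2. strip a known endpoint suffix.
  let u = trimmed.replace(/\/+$/, "");
  u = u.replace(ENDPOINT_SUFFIX, "").replace(/\/+$/, "");
  // 3. append the default version segment only for a bare host.
  const seg = DEFAULT_VERSION_SEGMENT[protocol];
  if (seg && isBareHost(trimmed)) u = `${u}/${seg}`;
  return u;
}

// All three spellings of the same endpoint collapse to one base.
console.log(normalizeBaseUrl("https://api.deepseek.com", "openai_compatible"));
// → https://api.deepseek.com/v1
console.log(normalizeBaseUrl("https://api.deepseek.com/v1", "openai_compatible"));
// → https://api.deepseek.com/v1
console.log(normalizeBaseUrl("https://api.deepseek.com/v1/chat/completions", "openai_compatible"));
// → https://api.deepseek.com/v1

// A bare Gemini host gets v1beta; an explicit path is left alone.
console.log(normalizeBaseUrl("https://generativelanguage.googleapis.com/", "google"));
// → https://generativelanguage.googleapis.com/v1beta
console.log(normalizeBaseUrl("https://example.com/beta", "anthropic"));
// → https://example.com/beta
```

Note that the bare-host test runs on the original input rather than the stripped one, so a pasted `https://host/chat/completions` is reduced to `https://host` without gaining a `/v1`, which is the version-less case the file's header comment describes.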
@@ -1,5 +1,12 @@
-import type { ProviderConfig } from "@infiplot/types";
+import { generateText } from "ai";
+import type { ModelMessage } from "ai";
+import { createAnthropic } from "@ai-sdk/anthropic";
+import { createGoogleGenerativeAI } from "@ai-sdk/google";
+import type { ProviderConfig, ProviderProtocol } from "@infiplot/types";
 import { fetchWithRetry } from "./fetchWithRetry";
+import { normalizeBaseUrl } from "./normalizeUrl";
+
+const VISION_TIMEOUT_MS = 60_000;

 export async function interpretClick(
  config: ProviderConfig,
@@ -16,6 +23,11 @@ export async function interpretClick(
  );
 }

+// text/vision default to the OpenAI-compatible wire protocol when unset.
+function resolveVisionProtocol(config: ProviderConfig): ProviderProtocol {
+  return config.provider ?? "openai_compatible";
+}
+
 /**
 * General single-image vision call. Accepts a complete data URL (preserves
 * the source mime type, e.g. webp/jpeg) and lets the caller opt out of
@@ -27,7 +39,65 @@ export async function analyzeImageDataUrl(
  prompt: string,
  opts: { responseFormat?: "json_object" | "text" } = {},
 ): Promise<string> {
-  const url = `${config.baseUrl.replace(/\/$/, "")}/chat/completions`;
+  const protocol = resolveVisionProtocol(config);
+  if (protocol === "anthropic" || protocol === "google") {
+    return analyzeViaAiSdk(config, imageDataUrl, prompt, protocol);
+  }
+  return analyzeOpenAiCompatible(config, imageDataUrl, prompt, opts);
+}
+
+// Native Anthropic / Gemini multimodal via the AI SDK. The image part takes
+// the full data URL directly; the SDK decodes it. response_format is not sent
+// (no JSON mode on Anthropic) — the engine's parseJsonLoose handles output.
+async function analyzeViaAiSdk(
+  config: ProviderConfig,
+  imageDataUrl: string,
+  prompt: string,
+  protocol: "anthropic" | "google",
+): Promise<string> {
+  const baseURL = normalizeBaseUrl(config.baseUrl, protocol);
+  const model =
+    protocol === "anthropic"
+      ? createAnthropic({ apiKey: config.apiKey, baseURL })(config.model)
+      : createGoogleGenerativeAI({ apiKey: config.apiKey, baseURL })(
+          config.model,
+        );
+
+  const messages: ModelMessage[] = [
+    {
+      role: "user",
+      content: [
+        { type: "text", text: prompt },
+        { type: "image", image: imageDataUrl },
+      ],
+    },
+  ];
+
+  const timeoutCtrl = new AbortController();
+  const timeoutId = setTimeout(() => timeoutCtrl.abort(), VISION_TIMEOUT_MS);
+  try {
+    const { text } = await generateText({
+      model,
+      messages,
+      temperature: 0.2,
+      abortSignal: timeoutCtrl.signal,
+    });
+    if (typeof text !== "string" || text.length === 0) {
+      throw new Error(`Vision API (AI SDK ${protocol}) returned no content.`);
+    }
+    return text;
+  } finally {
+    clearTimeout(timeoutId);
+  }
+}
+
+async function analyzeOpenAiCompatible(
+  config: ProviderConfig,
+  imageDataUrl: string,
+  prompt: string,
+  opts: { responseFormat?: "json_object" | "text" } = {},
+): Promise<string> {
+  const url = `${normalizeBaseUrl(config.baseUrl, "openai_compatible")}/chat/completions`;

  const body: Record<string, unknown> = {
    model: config.model,
@@ -47,7 +117,7 @@ export async function analyzeImageDataUrl(
  }

  const timeoutCtrl = new AbortController();
-  const timeoutId = setTimeout(() => timeoutCtrl.abort(), 60_000);
+  const timeoutId = setTimeout(() => timeoutCtrl.abort(), VISION_TIMEOUT_MS);

  let res: Response;
  try {
@@ -0,0 +1,86 @@
+// Bring-your-own Xiaomi MiMo TTS key — stored CLIENT-SIDE ONLY.
+//
+// When a user supplies their own key, we persist {presetId, apiKey} in
+// localStorage and the browser talks to Xiaomi directly (see lib/tts-client).
+// The key is therefore never sent to our server: no request body, no header,
+// no log. resolveTtsConfig() turns the stored pair into the TtsConfig shape the
+// tts-client adapter expects, mapping the chosen endpoint preset to its baseUrl.
+
+import type { TtsConfig } from "@infiplot/types";
+import { DEFAULT_TTS_SPEECH_MODEL, findTtsPreset } from "./ttsPresets";
+
+const STORAGE_KEY = "infiplot:tts";
+
+/** Exactly what we persist — endpoint choice + raw key. Resolved to a full
+ *  TtsConfig (with baseUrl + model) at read time so a renamed/removed preset
+ *  can't leave a stale baseUrl baked into storage. */
+export type StoredTtsConfig = {
+  presetId: string;
+  apiKey: string;
+};
+
+/** Read + validate the persisted BYO config. Returns null when running on the
+ *  server, when nothing is stored, on parse failure, or when the stored shape
+ *  is no longer valid (unknown preset / empty key). */
+export function readStoredTtsConfig(): StoredTtsConfig | null {
+  if (typeof window === "undefined") return null;
+  try {
+    const raw = window.localStorage.getItem(STORAGE_KEY);
+    if (!raw) return null;
+    const parsed = JSON.parse(raw) as Partial<StoredTtsConfig>;
+    const presetId = typeof parsed.presetId === "string" ? parsed.presetId : "";
+    const apiKey = typeof parsed.apiKey === "string" ? parsed.apiKey : "";
+    if (!findTtsPreset(presetId)) return null;
+    if (!apiKey.trim()) return null;
+    return { presetId, apiKey };
+  } catch {
+    return null;
+  }
+}
+
+/** Persist the BYO config. Trims the key so trailing whitespace from a paste
+ *  never breaks the `api-key` header. */
+export function writeStoredTtsConfig(config: StoredTtsConfig): void {
+  if (typeof window === "undefined") return;
+  try {
+    const payload: StoredTtsConfig = {
+      presetId: config.presetId,
+      apiKey: config.apiKey.trim(),
+    };
+    window.localStorage.setItem(STORAGE_KEY, JSON.stringify(payload));
+  } catch {
+    // Storage disabled / quota / private mode — BYO simply stays off.
+  }
+}
+
+export function clearStoredTtsConfig(): void {
+  if (typeof window === "undefined") return;
+  try {
+    window.localStorage.removeItem(STORAGE_KEY);
+  } catch {
+    // ignore
+  }
+}
+
+/** Map a stored pair to the adapter-ready TtsConfig, resolving the endpoint
+ *  preset to its baseUrl. Returns null when the preset is unknown or the key
+ *  is blank — callers treat null as "no BYO; use server default / silent". */
+export function resolveTtsConfig(
+  stored: StoredTtsConfig | null,
+): TtsConfig | null {
+  if (!stored) return null;
+  const preset = findTtsPreset(stored.presetId);
+  if (!preset) return null;
+  const apiKey = stored.apiKey.trim();
+  if (!apiKey) return null;
+  return {
+    baseUrl: preset.baseUrl,
+    apiKey,
+    speechModel: DEFAULT_TTS_SPEECH_MODEL,
+  };
+}
+
+/** Convenience: read storage and resolve in one step. */
+export function loadClientTtsConfig(): TtsConfig | null {
+  return resolveTtsConfig(readStoredTtsConfig());
+}
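Review note: the validate-on-read behavior of the new storage module can be exercised outside a browser. The sketch below is a minimal, hypothetical reduction: `KeyValueStore`, `createMemoryStore`, the `PRESETS` map, and the `example-preset` entry are assumptions standing in for `window.localStorage` and `./ttsPresets`, which are not shown in this diff.

```typescript
// Hypothetical stand-ins for window.localStorage and the real preset table.
type StoredTtsConfig = { presetId: string; apiKey: string };

interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const STORAGE_KEY = "infiplot:tts";

// Assumed preset table; the real one lives in ./ttsPresets.
const PRESETS = new Map<string, { baseUrl: string }>([
  ["example-preset", { baseUrl: "https://tts.example.invalid/v1" }],
]);

function createMemoryStore(): KeyValueStore {
  const data = new Map<string, string>();
  return {
    getItem: (key) => data.get(key) ?? null,
    setItem: (key, value) => {
      data.set(key, value);
    },
  };
}

// Trim on write so pasted whitespace never reaches a request header.
function writeStored(store: KeyValueStore, config: StoredTtsConfig): void {
  const payload: StoredTtsConfig = {
    presetId: config.presetId,
    apiKey: config.apiKey.trim(),
  };
  store.setItem(STORAGE_KEY, JSON.stringify(payload));
}

// Re-validate on every read: unknown preset, blank key, or bad JSON all
// collapse to null, which callers treat as "no bring-your-own key".
function readStored(store: KeyValueStore): StoredTtsConfig | null {
  try {
    const raw = store.getItem(STORAGE_KEY);
    if (!raw) return null;
    const parsed = JSON.parse(raw) as Partial<StoredTtsConfig>;
    const presetId = typeof parsed.presetId === "string" ? parsed.presetId : "";
    const apiKey = typeof parsed.apiKey === "string" ? parsed.apiKey : "";
    if (!PRESETS.has(presetId)) return null;
    if (!apiKey.trim()) return null;
    return { presetId, apiKey };
  } catch {
    return null;
  }
}
```

Storing only the preset id and resolving the base URL at read time is what lets a removed preset invalidate old entries instead of leaving a stale URL behind.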
@@ -1,4 +1,16 @@
-import type { EngineConfig, TtsConfig } from "@infiplot/types";
+import type {
+  EngineConfig,
+  ProviderProtocol,
+  TtsConfig,
+} from "@infiplot/types";
+
+const VALID_PROTOCOLS = [
+  "openai_compatible",
+  "anthropic",
+  "google",
+  "openai",
+  "runware",
+] as const;

 function readVar(name: string): string {
  const v = process.env[name];
@@ -11,6 +23,21 @@ function readOptionalVar(name: string): string | undefined {
  return v && v.length > 0 ? v : undefined;
 }

+// Optional *_PROVIDER selector. Unset → undefined, and each ai-client adapter
+// applies its own default (text/vision → openai_compatible; image → inferred
+// from the base URL). Validated eagerly so a typo fails fast at boot rather
+// than mid-request.
+function readProvider(name: string): ProviderProtocol | undefined {
+  const v = readOptionalVar(name)?.trim().toLowerCase();
+  if (!v) return undefined;
+  if ((VALID_PROTOCOLS as readonly string[]).includes(v)) {
+    return v as ProviderProtocol;
+  }
+  throw new Error(
+    `Invalid ${name}: "${v}". Must be one of: ${VALID_PROTOCOLS.join(", ")}`,
+  );
+}
+
 function loadTtsConfig(): TtsConfig | undefined {
  const baseUrl = readOptionalVar("TTS_BASE_URL");
  const apiKey = readOptionalVar("TTS_API_KEY");
@@ -28,16 +55,19 @@ export function loadEngineConfig(): EngineConfig {
      baseUrl: readVar("TEXT_BASE_URL"),
      apiKey: readVar("TEXT_API_KEY"),
      model: readVar("TEXT_MODEL"),
+      provider: readProvider("TEXT_PROVIDER"),
    },
    image: {
      baseUrl: readVar("IMAGE_BASE_URL"),
      apiKey: readVar("IMAGE_API_KEY"),
      model: readVar("IMAGE_MODEL"),
+      provider: readProvider("IMAGE_PROVIDER"),
    },
    vision: {
      baseUrl: readVar("VISION_BASE_URL"),
      apiKey: readVar("VISION_API_KEY"),
      model: readVar("VISION_MODEL"),
+      provider: readProvider("VISION_PROVIDER"),
    },
    tts: loadTtsConfig(),
    mockImage: readOptionalVar("MOCK_IMAGE") === "true",
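Review note: the fail-fast contract of `readProvider` (unset means "use the adapter default", a typo throws at boot) can be sketched in isolation. In this hypothetical version the environment is passed in as a plain record named `env`, which is an assumption made for testability; the source reads `process.env` through `readOptionalVar`.

```typescript
const VALID_PROTOCOLS = [
  "openai_compatible",
  "anthropic",
  "google",
  "openai",
  "runware",
] as const;

type ProviderProtocol = (typeof VALID_PROTOCOLS)[number];

// Hypothetical variant of readProvider that takes the environment explicitly.
function parseProvider(
  env: Record<string, string | undefined>,
  name: string,
): ProviderProtocol | undefined {
  const v = env[name]?.trim().toLowerCase();
  // Unset or blank: let each adapter apply its own default.
  if (!v) return undefined;
  // find() returns the matched literal, so no type assertion is needed.
  const match = VALID_PROTOCOLS.find((p) => p === v);
  if (match) return match;
  throw new Error(
    `Invalid ${name}: "${v}". Must be one of: ${VALID_PROTOCOLS.join(", ")}`,
  );
}
```

Validating here rather than inside each adapter means a misspelled value such as `antropic` stops the process at startup instead of surfacing on the first request.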
@@ -4,6 +4,7 @@ import type {
  Beat,
  Character,
  EngineConfig,
+  Orientation,
  ProviderConfig,
 } from "@infiplot/types";
 import { mockImageDataUri } from "../mockImage";
@@ -54,6 +55,11 @@ export type PainterInput = {
   * session paints — even before any priorScene exists.
   */
  styleReferenceImage?: string;
+  /**
+   * Session-locked output aspect. Drives both the Painter prompt's framing
+   * rules and the generated image's pixel dimensions. Default "landscape".
+   */
+  orientation?: Orientation;
 };

 // Pick the references we send to Runware as `referenceImages`. Priority:
@@ -142,13 +148,14 @@ export async function runPainter(
  entryBeat: Beat | undefined,
 ): Promise<PainterResult> {
  if (config.mockImage) {
-    return { kind: "mock", imageUrl: await mockImageDataUri() };
+    return { kind: "mock", imageUrl: await mockImageDataUri(input.orientation) };
  }

  const prompt = buildPainterPrompt(
    input.integratedPrompt,
    input.styleGuide,
    input.onStageCharacters,
+    input.orientation,
  );

  const refs = collectReferenceImages(
@@ -165,7 +172,7 @@ export async function runPainter(
    const r = await tryGenerate(
      config.image,
      prompt,
-      { referenceImages: refs },
+      { referenceImages: refs, orientation: input.orientation },
      `referenceImages (${refs.length})`,
    );
    if (r) return { kind: "real", imageUrl: r.imageUrl, imageUuid: r.imageUuid };
@@ -174,6 +181,8 @@ export async function runPainter(
  // Tier B — pure text-to-image. Last resort, used when Tier A failed OR
  // there are no references to send (first scene with no characters yet).
  // Errors here propagate to the caller.
-  const r = await generateImage(config.image, prompt);
+  const r = await generateImage(config.image, prompt, {
+    orientation: input.orientation,
+  });
  return { kind: "real", imageUrl: r.imageUrl, imageUuid: r.imageUuid };
 }
@@ -8,26 +8,30 @@ import type {
  ProviderConfig,
  Session,
  StoryStatePatch,
+  WriterPlan,
 } from "@infiplot/types";
 import { parseJsonLoose } from "../jsonParser";
-import { WRITER_SYSTEM, buildWriterUserMessage } from "../prompts";
+import {
+  WRITER_BEATS_SYSTEM,
+  WRITER_PLAN_SYSTEM,
+  buildWriterBeatsUserMessage,
+  buildWriterPlanUserMessage,
+} from "../prompts";

 // ──────────────────────────────────────────────────────────────────────
-//  Writer agent — owns the narrative half of scene generation.
+//  Writer agent — owns the narrative half of scene generation, in TWO phases.
 //
-//  Output: { sceneSummary, sceneKey, entryBeatId, beats[] }
-//  Each beat carries activeCharacters[] (names + poses) the
-//  Cinematographer reads when composing the establishing shot.
+//  Phase A — runWriterPlan: the scene skeleton (WriterPlan) the image pipeline
+//    needs (sceneSummary + sceneKey + entry roster + full cast). No dialogue,
+//    so it returns fast and unblocks the Cinematographer + character design.
+//  Phase B — runWriterBeats: the full beats[] graph + storyStatePatch, written
+//    to honor the plan and overlapped with the (longer) image pipeline.
 //
-//  Character DESIGN (visual + voice) is NOT this agent's job —
-//  it only names characters; the CharacterDesigner picks up any
-//  unknown name from beats[].activeCharacters.
+//  Character DESIGN (visual + voice) is NOT this agent's job — it only NAMES
+//  characters (Phase A's cast); the CharacterDesigner picks up unknown names.
 // ──────────────────────────────────────────────────────────────────────

-export type WriterOutput = {
-  sceneSummary: string;
-  sceneKey?: string;
-  entryBeatId: string;
+export type WriterBeatsOutput = {
  beats: Beat[];
  /** Rewritten volatile story memory — merged onto the carried StoryState by
   *  the director. Absent when the model omitted it (rare; bible just stales). */
@@ -69,10 +73,17 @@ type RawStoryStatePatch = {
  relationships?: unknown;
  nextHook?: unknown;
 };
-type RawScene = {
+// Phase A raw shape (skeleton only — no beats).
+type RawPlan = {
  sceneSummary?: string;
  sceneKey?: string;
  entryBeatId?: string;
+  cast?: unknown;
+  entrySpeaker?: string;
+  entryActiveCharacters?: RawActiveCharacter[];
+};
+// Phase B raw shape (beats + memory only — plan fields come from runWriterPlan).
+type RawBeats = {
  beats?: RawBeat[];
  storyStatePatch?: RawStoryStatePatch;
 };
@@ -359,26 +370,119 @@ function coerceStoryStatePatch(
  return Object.keys(patch).length > 0 ? patch : undefined;
 }

-export async function runWriter(
+// Phase A — dedupe + clean the planned cast. Drops the POV player (never
+// designed) and any blank/duplicate name. Order is preserved.
+function coerceCast(raw: unknown): string[] {
+  if (!Array.isArray(raw)) return [];
+  const seen = new Set<string>();
+  const out: string[] = [];
+  for (const x of raw) {
+    const name = typeof x === "string" ? x.trim() : "";
+    if (!name || isPovName(name) || seen.has(name)) continue;
+    seen.add(name);
+    out.push(name);
+  }
+  return out;
+}
+
+// Rename one beat's id and repoint every INTERNAL reference (continue targets,
+// advance-beat targets) so the graph stays intact. Only called when `to` is
+// absent from the scene, so it can't introduce a duplicate id.
+function renameBeatId(beats: Beat[], from: string, to: string): Beat[] {
+  if (from === to) return beats;
+  return beats.map((b): Beat => {
+    const id = b.id === from ? to : b.id;
+    let next = b.next;
+    if (next.type === "continue" && next.nextBeatId === from) {
+      next = { type: "continue", nextBeatId: to };
+    } else if (next.type === "choice") {
+      next = {
+        type: "choice",
+        choices: next.choices.map((c) =>
+          c.effect.kind === "advance-beat" && c.effect.targetBeatId === from
+            ? { ...c, effect: { kind: "advance-beat" as const, targetBeatId: to } }
+            : c,
+        ),
+      };
+    }
+    return { ...b, id, next };
+  });
+}
+
+// ── Phase A — plan the scene skeleton. Fast (small output): just enough for
+// the Cinematographer + character design + Painter to start before the
+// dialogue exists. The cast is unioned with the entry roster/speaker so a
+// character named in the entry but omitted from `cast` still gets designed.
+export async function runWriterPlan(
  config: ProviderConfig,
  session: Session,
-): Promise<WriterOutput> {
+): Promise<WriterPlan> {
  const raw = await chat(
    config,
    [
-      { role: "system", content: WRITER_SYSTEM },
-      { role: "user", content: buildWriterUserMessage(session) },
+      { role: "system", content: WRITER_PLAN_SYSTEM },
+      { role: "user", content: buildWriterPlanUserMessage(session) },
    ],
-    { temperature: 0.9, responseFormat: "json_object", tag: "writer" },
+    { temperature: 0.9, responseFormat: "json_object", tag: "writer-plan" },
  );

-  const parsed = parseJsonLoose<RawScene>(raw);
+  const parsed = parseJsonLoose<RawPlan>(raw);
+
+  const entryActiveCharacters =
+    coerceActiveCharacters(parsed.entryActiveCharacters) ?? [];
+
+  // Normalize POV variants → "你"; NPC names pass through. "你" is a valid entry
+  // speaker (Pattern B — player talking), but is never a designed cast member.
+  const rawEntrySpeaker = parsed.entrySpeaker?.trim() || undefined;
+  const entrySpeaker = rawEntrySpeaker
+    ? normalizeSpeakerName(rawEntrySpeaker)
+    : undefined;
+
+  const cast = coerceCast(parsed.cast);
+  const castSet = new Set(cast);
+  const addToCast = (name: string): void => {
+    if (!isPovName(name) && !castSet.has(name)) {
+      castSet.add(name);
+      cast.push(name);
+    }
+  };
+  for (const c of entryActiveCharacters) addToCast(c.name);
+  if (entrySpeaker) addToCast(entrySpeaker);
+
+  return {
+    sceneSummary: parsed.sceneSummary?.trim() || "未指定场景概要",
+    sceneKey: normalizeSceneKey(parsed.sceneKey),
+    entryBeatId: parsed.entryBeatId?.trim() || "b1",
+    cast,
+    entryActiveCharacters,
+    entrySpeaker,
+  };
+}
+
+// ── Phase B — expand the plan into the full beats[] graph + storyStatePatch.
+// Overlapped with the image pipeline by the director. The plan's entry id is
+// pinned onto a real beat so the already-painted entry frame resolves.
+export async function runWriterBeats(
+  config: ProviderConfig,
+  session: Session,
+  plan: WriterPlan,
+): Promise<WriterBeatsOutput> {
+  const raw = await chat(
+    config,
+    [
+      { role: "system", content: WRITER_BEATS_SYSTEM },
+      { role: "user", content: buildWriterBeatsUserMessage(session, plan) },
+    ],
+    { temperature: 0.9, responseFormat: "json_object", tag: "writer-beats" },
+  );
+
+  const parsed = parseJsonLoose<RawBeats>(raw);
  const rawBeats = Array.isArray(parsed.beats) ? parsed.beats : [];
  if (rawBeats.length === 0) {
-    throw new Error("Writer returned no beats");
+    throw new Error("Writer (beats) returned no beats");
  }

-  const beats = ensureUniqueChoiceIds(
+  let beats = ensureUniqueChoiceIds(
    repairBeats(
      ensureUniqueBeatIds(
        rawBeats.map((b, i) => coerceBeat(b, i, rawBeats.length)),
@@ -386,40 +490,45 @@ export async function runWriter(
    ),
  );

-  const declaredEntry = parsed.entryBeatId?.trim();
-  const entryBeatId =
-    declaredEntry && beats.some((b) => b.id === declaredEntry)
-      ? declaredEntry
-      : beats[0]!.id;
+  // The Painter already composed the entry frame from plan.entryBeatId + its
+  // roster, so the scene's entry MUST resolve to that id. If Phase B ignored
+  // it, rename the first beat to it (no collision — id is absent by the guard).
+  if (!beats.some((b) => b.id === plan.entryBeatId)) {
+    beats = renameBeatId(beats, beats[0]!.id, plan.entryBeatId);
+  }
+
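+  // Pin the entry beat's roster to the plan's (same reason as the id above): the
+  // Painter composed plan.entryActiveCharacters into the frame, so the runtime entry
+  // beat must show them. speaker is NOT pinned: it is tied to line/TTS; overriding it mismatches the dialogue.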
+  const entryRoster =
+    plan.entryActiveCharacters.length > 0 ? plan.entryActiveCharacters : undefined;
+  beats = beats.map((b) =>
+    b.id === plan.entryBeatId ? { ...b, activeCharacters: entryRoster } : b,
+  );

  return {
-    sceneSummary: parsed.sceneSummary?.trim() || "未指定场景概要",
-    sceneKey: normalizeSceneKey(parsed.sceneKey),
-    entryBeatId,
    beats,
    storyStatePatch: coerceStoryStatePatch(parsed.storyStatePatch),
  };
 }

-// Surface the set of character names introduced by this scene's beats,
-// so the orchestrator can decide which ones need the CharacterDesigner to
-// fire. Pulls names from both `speaker` fields AND `activeCharacters`
-// (a character can be on-screen without speaking).
-//
-// Excludes POV ("你" / 玩家 / 主角 / ...) entirely — the player is never
-// designed (no portrait, no voice, no archetype).
-export function collectActiveCharacterNames(beats: Beat[]): string[] {
-  const seen = new Set<string>();
-  for (const b of beats) {
-    if (b.speaker && !isPovName(b.speaker)) seen.add(b.speaker);
-    if (b.activeCharacters) {
-      for (const c of b.activeCharacters) {
-        if (!isPovName(c.name)) seen.add(c.name);
-      }
-    }
-  }
-  return Array.from(seen);
+// Phase B fallback — when runWriterBeats fails entirely, keep the scene
+// playable with a single entry beat synthesized from the plan: narrate the
+// planned summary and offer one change-scene exit so the player can advance.
+export function synthesizeFallbackBeats(plan: WriterPlan): Beat[] {
+  const id = plan.entryBeatId || "b1";
+  return [
+    {
+      id,
+      narration: plan.sceneSummary,
+      activeCharacters:
+        plan.entryActiveCharacters.length > 0
+          ? plan.entryActiveCharacters
+          : undefined,
+      next: { type: "choice", choices: [fallbackExitChoice(id)] },
+    },
+  ];
 }

-// Re-export POV constants for downstream filters (director's orphanSpeakers).
+// Re-export POV constants for downstream filters (director's orphan voices).
 export { POV_DISPLAY_NAME, POV_VARIANTS, isPovName, normalizeSpeakerName };
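Review note: the entry-id pinning in `runWriterBeats` relies on `renameBeatId` repointing every internal reference. The sketch below shows the idea on a deliberately simplified `Beat` shape; the flattened `targetBeatId` on choices and the `pinEntryId` wrapper are assumptions for illustration, since the real type nests the target under `effect` and the real pinning is inline.

```typescript
// Simplified, hypothetical beat graph (the real Beat carries more fields).
type Next =
  | { type: "continue"; nextBeatId: string }
  | { type: "choice"; choices: { label: string; targetBeatId: string }[] };

type Beat = { id: string; next: Next };

// Rename one id and repoint every reference to it, returning new objects.
function renameBeatId(beats: Beat[], from: string, to: string): Beat[] {
  if (from === to) return beats;
  const swap = (id: string): string => (id === from ? to : id);
  return beats.map((b): Beat => {
    const next: Next =
      b.next.type === "continue"
        ? { type: "continue", nextBeatId: swap(b.next.nextBeatId) }
        : {
            type: "choice",
            choices: b.next.choices.map((c) => ({
              ...c,
              targetBeatId: swap(c.targetBeatId),
            })),
          };
    return { id: swap(b.id), next };
  });
}

// If no beat carries the planned entry id, give it to the first beat.
// The guard guarantees the new id is absent, so no duplicate can appear.
function pinEntryId(beats: Beat[], plannedEntryId: string): Beat[] {
  const first = beats[0];
  if (first === undefined || beats.some((b) => b.id === plannedEntryId)) {
    return beats;
  }
  return renameBeatId(beats, first.id, plannedEntryId);
}
```

For example, pinning `"b1"` onto a graph whose first beat is `"intro"` also rewrites a later choice that jumps back to `"intro"`, so the loop still resolves.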
@@ -1,5 +1,7 @@
 import { chat } from "@infiplot/ai-client";
+import { coerceOrientation } from "@infiplot/types";
 import type {
+  Beat,
  Character,
  EngineConfig,
  InsertBeatPartial,
@@ -8,6 +10,7 @@ import type {
  Session,
  StoryState,
  StoryStatePatch,
+  WriterPlan,
 } from "@infiplot/types";
 import type { CharacterCard } from "./agents/characterDesigner";
 import {
@@ -18,12 +21,14 @@ import {
 } from "./agents/characterDesigner";
 import { runCinematographer } from "./agents/cinematographer";
 import { runPainter } from "./agents/painter";
+import type { WriterBeatsOutput } from "./agents/writer";
 import {
-  collectActiveCharacterNames,
  isPovName,
  normalizeSpeakerName,
  POV_DISPLAY_NAME,
-  runWriter,
+  runWriterBeats,
+  runWriterPlan,
+  synthesizeFallbackBeats,
 } from "./agents/writer";
 import { parseJsonLoose } from "./jsonParser";
 import { INSERT_BEAT_SYSTEM, buildInsertBeatUserMessage } from "./prompts";
@@ -33,25 +38,25 @@ import { INSERT_BEAT_SYSTEM, buildInsertBeatUserMessage } from "./prompts";
 //
 //  Critical path (per Scene call):
 //
-//    Writer LLM (~3s, serial)
+//    Writer PHASE A — plan LLM (scene skeleton only, serial)
 //      │
-//      ├─ CharacterCard LLM × N        (parallel per new char — TEXT only)
-//      ├─ Cinematographer LLM          (parallel with the cards)
-//      │
-//      └─ wait for cards + cinema
-//      │
-//      ├─ entry-beat portraits   ──┐  (block the Painter — its refs)
-//      ▼                           │
-//    Painter — generateImage       │  (overlapped, NOT on the paint path):
-//      with referenceImages        ├─ non-entry-beat portraits
-//      │                           └─ ALL voice provisioning + orphan voices
+//      ├──────────────────────────┬───────────────────────────────────────┐
+//      ▼                           ▼                                       │
+//    Writer PHASE B            image pipeline (concurrent):                 │
+//    beats LLM                   CharacterCard LLM × N ∥ Cinematographer    │
+//    (full dialogue,             → entry-beat portraits (block Painter)     │
+//     overlapped)                → Painter (generateImage w/ refs)          │
+//      │                         → await overlapped: rest portraits+voices  │
+//      └──────────────────────────► await Phase B ◄────────────────────────┘
 //      ▼
-//    await the overlapped work, fold into the registry
-//      │
-//      ▼
-//    return { scene, sceneImageUrl, characters, storyState }
+//    assemble Scene → { scene, sceneImageUrl, characters, storyState }
 //
-//  Two deliberate decouplings unlock the parallelism:
+//  Why split the Writer (the latency win): the image pipeline only needs the
+//  scene SUMMARY + entry roster + cast (Phase A) — NOT the dialogue (Phase B).
+//  Writing beats used to sit serially in FRONT of the image; now it overlaps
+//  it, so the floor is max(beats, image) instead of beats + image.
+//
+//  The decouplings that unlock the rest of the parallelism:
 //   1. The Cinematographer only POSITIONS named characters, so it needs no
 //      visualDescription and runs alongside the card LLMs.
 //   2. The Painter only needs visualDescription TEXT (all on-stage) + the
@@ -163,31 +168,60 @@ export async function directScene(
 ): Promise<SceneResult> {
  const tTotal = Date.now();

-  // Stage 1 — Writer (serial; everything downstream needs sceneSummary +
-  // beats[] to know who's on stage and what to compose around).
-  const tWriter = Date.now();
-  const writerOut = await runWriter(config.text, session);
-  tlog("[directScene] Writer", tWriter);
+  // ── Phase A — Writer PLAN (serial). The image pipeline needs the scene
+  // summary + entry roster + cast to start, but NOT the dialogue beats. This
+  // call is small (skeleton only), so it returns fast and unblocks everything.
+  const tPlan = Date.now();
+  const plan = await runWriterPlan(config.text, session);
+  tlog("[directScene] Phase A (plan)", tPlan);

-  // Identify NEW characters introduced by this scene that need to be
-  // designed (LLM + portrait + voice). Existing characters in the registry
-  // are skipped — their cards / portraits / voices persist across scenes.
-  const allActiveNames = collectActiveCharacterNames(writerOut.beats);
-  const newCharNames = allActiveNames.filter(
+  // ── Phase B — Writer BEATS, launched NOW so its (longer) output overlaps the
+  // ENTIRE image pipeline below. Only needed to assemble the final Scene, so we
+  // await it last. A failure degrades to a single playable beat from the plan.
+  const tBeats = Date.now();
+  const beatsPromise: Promise<WriterBeatsOutput> = runWriterBeats(
+    config.text,
+    session,
+    plan,
+  )
+    .then((out) => {
+      tlog("[directScene] Phase B (beats)", tBeats);
+      return out;
+    })
+    .catch((err): WriterBeatsOutput => {
+      const msg = err instanceof Error ? err.message : String(err);
+      console.error(
+        `[directScene] Phase B (beats) failed, using fallback: ${msg}`,
+      );
+      return { beats: synthesizeFallbackBeats(plan), storyStatePatch: undefined };
+    });
+
+  // NEW characters to design come from the PLAN's cast (so design fires in
+  // parallel with Phase B, not after the beats are written). Existing
+  // characters keep their cards / portraits / voices across scenes.
+  const newCharNames = plan.cast.filter(
    (n) => !session.characters.some((c) => c.name === n),
  );

-  // Find the entry beat for the Cinematographer (which characters are
-  // on-screen in the establishing shot).
-  const entryBeat = writerOut.beats.find((b) => b.id === writerOut.entryBeatId);
-  const entryBeatActive = entryBeat?.activeCharacters ?? [];
+  // Entry-beat composition is the PLAN's (Phase B is constrained to honor it).
+  // The Painter needs a Beat-shaped object for reference collection, but the
+  // real beat isn't written until Phase B — so synthesize one from the plan
+  // (collectReferenceImages only reads speaker + activeCharacters).
+  const entryBeatActive = plan.entryActiveCharacters;
+  const entryBeatSpeaker = plan.entrySpeaker;
+  const entryBeatForPaint: Beat = {
+    id: plan.entryBeatId,
+    speaker: entryBeatSpeaker,
+    activeCharacters: entryBeatActive.length > 0 ? entryBeatActive : undefined,
+    next: { type: "continue", nextBeatId: plan.entryBeatId },
+  };

  // For sceneKey-based visual continuity, look up the prior matching scene's
  // image to slot into Painter's referenceImages (max 4 of which include
  // character portraits too).
  const { priorSceneReference, priorSceneKey } = pickPriorSceneReference(
    session,
-    writerOut.sceneKey,
+    plan.sceneKey,
  );

  // ── Stage 2 — character cards (LLM) ∥ Cinematographer ──────────────────
@@ -211,12 +245,12 @@ export async function directScene(
  );

  const cinemaPromise = runCinematographer(config.text, {
-    sceneSummary: writerOut.sceneSummary,
+    sceneSummary: plan.sceneSummary,
    styleGuide: session.styleGuide,
    entryBeatActive,
-    entryBeatSpeaker: entryBeat?.speaker,
+    entryBeatSpeaker,
    priorSceneKey,
-    currentSceneKey: writerOut.sceneKey,
+    currentSceneKey: plan.sceneKey,
  });

  const [cards, cinemaOut] = await Promise.all([
@@ -242,8 +276,8 @@ export async function directScene(
  // Entry-beat character names: the ONLY portraits the Painter references
  // (collectReferenceImages slots in the entry beat's speaker + activeChars).
  const entryNames = new Set<string>();
-  if (entryBeat?.speaker && !isPovName(entryBeat.speaker)) {
-    entryNames.add(entryBeat.speaker);
+  if (entryBeatSpeaker && !isPovName(entryBeatSpeaker)) {
+    entryNames.add(entryBeatSpeaker);
  }
  for (const c of entryBeatActive) {
    if (!isPovName(c.name)) entryNames.add(c.name);
@@ -281,24 +315,6 @@ export async function directScene(
    ),
  );

-  // Edge case: a speaker the Writer referenced without listing in any beat's
-  // activeCharacters. collectActiveCharacterNames already includes speakers,
-  // so this is a rare defensive net. Provision a voice only (never on-screen).
-  const speakerNames = new Set(
-    writerOut.beats.map((b) => b.speaker).filter((n): n is string => Boolean(n)),
-  );
-  const orphanSpeakers = [...speakerNames].filter(
-    // Pattern B: "你" (player) is a valid speaker but never gets a Character
-    // record — TTS is intentionally skipped on the client.
-    (n) =>
-      !isPovName(n) &&
-      !characters.some((c) => c.name === n) &&
-      !cards.some((c) => c.name === n),
-  );
-  const orphanPromises = orphanSpeakers.map((n) =>
-    provisionVoiceForName(config, session, n),
-  );
-
  // Block the Painter ONLY on entry-beat portraits (its referenceImages).
  const entryPortraits = await Promise.all(entryPortraitPromises);
  characters = mergeCharacters(
@@ -313,11 +329,13 @@ export async function directScene(
  tlog("[directScene] entry-beat portraits", tProvision);

  // ── Stage 4 — Painter (depends on cinemaOut + on-stage visual cards +
-  // entry portraits). On-stage = everyone named in any beat, so the archetype
-  // block covers anyone the player might encounter in this scene.
-  const onStageCharacters = characters.filter((c) =>
-    allActiveNames.includes(c.name),
-  );
+  // entry portraits). On-stage = the plan's cast (everyone who'll appear),
+  // filtered to those now in the registry, so the archetype block covers them.
+  const onStageCharacters = characters.filter((c) => plan.cast.includes(c.name));
+
+  // Session-locked orientation (set at session start). Threads into both the
+  // Painter prompt's framing rules and the generated image's pixel dimensions.
+  const orientation = coerceOrientation(session.orientation);

  const tPainter = Date.now();
  const painted = await runPainter(
@@ -328,19 +346,19 @@ export async function directScene(
      onStageCharacters,
      priorSceneImage: priorSceneReference,
      styleReferenceImage: session.styleReferenceImage,
+      orientation,
    },
-    entryBeat,
+    entryBeatForPaint,
  );
  tlog("[directScene] Painter", tPainter);

-  // Fold in the work that overlapped the paint: remaining portraits, all
-  // voices, and any orphan-speaker voices. Awaited before returning so the
-  // session the client persists is fully provisioned for later scenes.
+  // Fold in the work that overlapped the paint: remaining portraits + all
+  // voices. Awaited before returning so the session the client persists is
+  // fully provisioned for later scenes.
  const tOverlap = Date.now();
-  const [restPortraits, voicedChars, orphanChars] = await Promise.all([
+  const [restPortraits, voicedChars] = await Promise.all([
    Promise.all(restPortraitPromises),
    Promise.all(voicePromises),
-    Promise.all(orphanPromises),
  ]);
  characters = mergeCharacters(
    characters,
@@ -352,10 +370,31 @@ export async function directScene(
    })),
  );
  characters = mergeCharacters(characters, voicedChars);
-  if (orphanChars.length > 0) {
+  tlog("[directScene] overlapped portraits+voices", tOverlap);
+
+  // ── Await Phase B — it overlapped the whole image pipeline above. ──────
+  const beatsOut = await beatsPromise;
+  const beats = beatsOut.beats;
+
+  // entryBeatId is guaranteed present (runWriterBeats pins it onto a beat), but
+  // keep the defensive fallback for the synthesized-fallback path.
+  const entryBeatId = beats.some((b) => b.id === plan.entryBeatId)
+    ? plan.entryBeatId
+    : beats[0]!.id;
+
+  // Orphan-speaker voices: a beat speaker Phase B used that isn't in the
+  // registry. Should be rare — the prompt constrains speakers to the cast, and
+  // every cast member was provisioned above — so this is a defensive net,
+  // serial but skipped entirely (zero latency) in the common case.
+  const orphanSpeakers = [
+    ...new Set(beats.map((b) => b.speaker).filter((n): n is string => Boolean(n))),
+  ].filter((n) => !isPovName(n) && !characters.some((c) => c.name === n));
+  if (orphanSpeakers.length > 0) {
+    const orphanChars = await Promise.all(
+      orphanSpeakers.map((n) => provisionVoiceForName(config, session, n)),
+    );
    characters = mergeCharacters(characters, orphanChars);
  }
-  tlog("[directScene] overlapped portraits+voices", tOverlap);

  const scene: Scene = {
    id: newSceneId(),
@@ -365,11 +404,12 @@ export async function directScene(
    // anything that already reads scene.scenePrompt (e.g., insert-beat
    // user prompt).
    scenePrompt: cinemaOut.integratedPrompt,
-    beats: writerOut.beats,
-    entryBeatId: writerOut.entryBeatId,
-    sceneKey: writerOut.sceneKey,
+    beats,
+    entryBeatId,
+    sceneKey: plan.sceneKey,
    imageUuid: painted.kind === "real" ? painted.imageUuid : undefined,
    imageUrl: painted.imageUrl,
+    orientation,
  };

  // Merge the Writer's volatile memory rewrite onto the carried bible so the
@@ -377,7 +417,7 @@ export async function directScene(
  // client persists it back into the session).
  const storyState = applyStoryStatePatch(
    session.storyState,
-    writerOut.storyStatePatch,
+    beatsOut.storyStatePatch,
  );

  tlog("[directScene] TOTAL", tTotal);
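Review note: the latency argument for splitting the Writer (the floor becomes max(beats, image) rather than beats + image) comes down to starting Phase B before the image pipeline and awaiting it last, with the fallback attached up front. A minimal sketch of that shape, where `runOverlapped` and `delay` are hypothetical helpers and not part of the engine:

```typescript
// Hypothetical helper: resolves with `value` after `ms` milliseconds.
function delay<T>(ms: number, value: T): Promise<T> {
  return new Promise((resolve) => setTimeout(() => resolve(value), ms));
}

// Start the slow task, then run the pipeline, then await the slow task last.
async function runOverlapped<A, B>(
  startSlow: () => Promise<A>,
  fallback: () => A,
  runPipeline: () => Promise<B>,
): Promise<[A, B]> {
  // Attach the fallback immediately: a rejection is handled even while the
  // pipeline is still being awaited, so it never becomes an unhandled rejection.
  const slow = startSlow().catch(() => fallback());
  const piped = await runPipeline();
  return [await slow, piped];
}
```

Because both tasks are started before either is awaited, total time tracks the slower of the two. A failure in the slow task degrades to the fallback value instead of rejecting the whole call, which mirrors how `directScene` substitutes `synthesizeFallbackBeats(plan)`.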
@@ -9,7 +9,7 @@ export { synthesizeBeat } from "./voice";
 export { mergeCharacters } from "./director";
 export type { SceneResult } from "./director";
 export { runArchitect } from "./agents/architect";
-export type { WriterOutput } from "./agents/writer";
+export type { WriterBeatsOutput } from "./agents/writer";
 export type { CinematographerOutput } from "./agents/cinematographer";
 export type { InsertBeatPartial } from "@infiplot/types";
 export * from "./prompts";
@@ -1,3 +1,5 @@
+import type { Orientation } from "@infiplot/types";
+
 // Static SVG placeholder used when MOCK_IMAGE=true, so we can exercise the
 // TTS path without paying for image generation. Returned as a data URI so the
 // rest of the pipeline can treat it as an `imageUrl` interchangeably with
@@ -9,17 +11,23 @@
 // data URI so the engine has zero Node-native dependencies and runs on
 // Cloudflare Workers. SVG also stays crisp at any display size.

-const W = 1792;
-const H = 1024;
-const SVG = `<svg xmlns="http://www.w3.org/2000/svg" width="${W}" height="${H}">
-  <rect width="${W}" height="${H}" fill="#161109"/>
-  <rect x="2" y="2" width="${W - 4}" height="${H - 4}" fill="none" stroke="#5a4628" stroke-width="3" stroke-dasharray="14 10"/>
+function buildDataUri(w: number, h: number): string {
+  const svg = `<svg xmlns="http://www.w3.org/2000/svg" width="${w}" height="${h}">
+  <rect width="${w}" height="${h}" fill="#161109"/>
+  <rect x="2" y="2" width="${w - 4}" height="${h - 4}" fill="none" stroke="#5a4628" stroke-width="3" stroke-dasharray="14 10"/>
  <text x="50%" y="45%" fill="#b88f4a" font-family="Georgia, serif" font-size="72" letter-spacing="6" text-anchor="middle">MOCK IMAGE</text>
  <text x="50%" y="53%" fill="#6e5430" font-family="Georgia, serif" font-size="30" letter-spacing="3" text-anchor="middle">TTS TEST — image generation skipped</text>
 </svg>`;
-
-const DATA_URI = `data:image/svg+xml;charset=utf-8,${encodeURIComponent(SVG)}`;
-
-export async function mockImageDataUri(): Promise<string> {
-  return DATA_URI;
+  return `data:image/svg+xml;charset=utf-8,${encodeURIComponent(svg)}`;
+}
+
+// Mirror the real Painter's dimensions per orientation so mock mode exercises
+// the same portrait/landscape layout the client renders for real images.
+const LANDSCAPE = buildDataUri(1792, 1024);
+const PORTRAIT = buildDataUri(1024, 1792);
+
+export async function mockImageDataUri(
+  orientation: Orientation = "landscape",
+): Promise<string> {
+  return orientation === "portrait" ? PORTRAIT : LANDSCAPE;
 }
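Review note: the orientation plumbing in this file can be exercised in isolation. The following is a minimal sketch, not the engine's code: `dimensionsFor` and `mockImageFor` are hypothetical names, `buildDataUri` is a trimmed stand-in for the function above, and `coerceOrientation` is copied from the helper this PR adds to `@infiplot/types`.

```typescript
type Orientation = "portrait" | "landscape";

// Same fallback rule as coerceOrientation in this PR: only the exact
// string "portrait" is honored, everything else becomes "landscape".
function coerceOrientation(value: unknown): Orientation {
  return value === "portrait" ? "portrait" : "landscape";
}

// Hypothetical helper: the pixel sizes are the ones used in the diff.
function dimensionsFor(o: Orientation): { width: number; height: number } {
  return o === "portrait"
    ? { width: 1024, height: 1792 }
    : { width: 1792, height: 1024 };
}

// Trimmed stand-in for buildDataUri (background rect only).
function buildDataUri(w: number, h: number): string {
  const svg = `<svg xmlns="http://www.w3.org/2000/svg" width="${w}" height="${h}"><rect width="${w}" height="${h}" fill="#161109"/></svg>`;
  return `data:image/svg+xml;charset=utf-8,${encodeURIComponent(svg)}`;
}

// Hypothetical entry point: accepts an untrusted value, e.g. a request body field.
function mockImageFor(raw: unknown): string {
  const { width, height } = dimensionsFor(coerceOrientation(raw));
  return buildDataUri(width, height);
}
```

Unknown values such as `"sideways"` or `undefined` fall back to the landscape placeholder, which matches the back-compat rule stated for sessions that predate the field.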
@@ -12,6 +12,7 @@ import type {
  VisionRequest,
  VisionResponse,
 } from "@infiplot/types";
+import { coerceOrientation } from "@infiplot/types";
 import { runArchitect } from "./agents/architect";
 import { directInsertBeat, directScene } from "./director";
 import { synthesizeBeat } from "./voice";
@@ -48,6 +49,7 @@ export async function startSession(
    history: [],
    characters: [],
    styleReferenceImage: req.styleReferenceImage?.trim() || undefined,
+    orientation: coerceOrientation(req.orientation),
  };

  // Stage 0 — Architect: expand the terse world/style prompt into a story
@@ -1,9 +1,11 @@
 import type {
  BeatActiveCharacter,
  Character,
+  Orientation,
  Scene,
  Session,
  StoryState,
+  WriterPlan,
 } from "@infiplot/types";

 // ══════════════════════════════════════════════════════════════════════
@@ -137,16 +139,77 @@ export function buildArchitectUserMessage(session: Session): string {
 }

 // ──────────────────────────────────────────────────────────────────────
-//  1. Writer (编剧) — drives the narrative.
+//  1. Writer (编剧) — drives the narrative, in TWO phases.
 //
-//  Emits a full Scene: beats[] graph + entryBeatId + sceneKey hint +
-//  activeCharacters per beat. Does NOT design characters (that's the
-//  CharacterDesigner's job) — only names them in `activeCharacters`.
-//  The CharacterDesigner is invoked separately for any name not yet in
-//  session.characters.
+//  Phase A (WRITER_PLAN_SYSTEM): plans the scene SKELETON only — sceneSummary
+//    + sceneKey + entry-beat roster + the full cast. No dialogue. Its output
+//    is enough for the Cinematographer + character design + Painter to start.
+//  Phase B (WRITER_BEATS_SYSTEM): expands the plan into the full beats[] graph
+//    + storyStatePatch, overlapped with the (longer) image pipeline.
+//
+//  Neither phase designs characters (that's the CharacterDesigner's job) —
+//  Phase A only NAMES them in `cast` / `entryActiveCharacters`; the
+//  CharacterDesigner is invoked for any name not yet in session.characters.
 // ──────────────────────────────────────────────────────────────────────

-export const WRITER_SYSTEM = `你是一部交互视觉小说的「编剧」。每次基于【故事档案 / 主线记忆】、世界观、画风、玩家历史、已登记角色，写出**一个完整场景的剧本**：场景背景概要 + 一组对话节拍 beats，并在最后更新主线记忆。你只负责**剧情和台词**——不设计角色形象、不写出图提示词、不做镜头调度，这些由其他 agent 完成。
+export const WRITER_PLAN_SYSTEM = `你是一部交互视觉小说的「编剧」。这是**两步生成中的第一步——场景规划**。你只产出本场景的「骨架」，**不要写任何 beat 台词**。你的产出会被立刻送去配图（分镜导演 + 生图），所以要快、要准、画面感要强。
+
+═══════════════════════════════════════════════════════════════════
+爆款心法（要在规划阶段就立住，后续展开才好看）
+═══════════════════════════════════════════════════════════════════
+- **进场即钩子**：这一场开场就要抛出新信息 / 悬念 / 冲突 / 情绪冲击，别铺陈。把这个抓人的瞬间写进 sceneSummary。
+- **兑现情绪**：按题材给观众想要的情绪（甜宠的心动、暗恋的拉扯、逆袭的扬眉、悬疑的真相一角）。
+- **人设有反差**：每个角色一个强标签 + 一个反差面。
+
+═══════════════════════════════════════════════════════════════════
+连贯性铁律（跨场景切换不能跳戏 —— 最重要）
+═══════════════════════════════════════════════════════════════════
+- 你会收到【故事档案 / 主线记忆】和上一场的结尾。**新场景必须从上一刻自然承接**——承接情绪、地点逻辑、人物状态与未收的悬念。
+- 若给了「转场种子 nextSceneSeed」，把它当作"下一场的命题"去兑现，开场要让玩家感到"这正是我上一步的结果"。
+- 沿用主线记忆里的人物关系与情绪温度，别让刚告白的人下一场形同陌路。
+
+本步你要规划（如实产出，缺一不可）：
+- **sceneSummary**：当前场景的中文概要——地点 + 时间 + 氛围 + 关键事件 + 那个抓人的开场瞬间。这是分镜导演构图的**唯一依据**，要画面感强、信息足（2–4 句）。
+- **sceneKey**：当前场景的英文 slug（如 "classroom-dusk"、"rooftop-night"）。
+- **entryBeatId**：玩家进入场景时落在哪个 beat 的 id（通常就是 "b1"）。
+- **cast**：本场景**会出场的全部 NPC 角色名**（字符串数组）。第二步写 beats 时**只能用这里列出的名字**，所以现在必须一次想全——谁会说话、谁会在画面里露面，全部列出。名字要与「已登记角色」**完全一致**；新角色起符合世界观的真名（不要"神秘女子"这种占位）。**绝不**包含玩家（你 / 我 / 主角 / protagonist / player / MC...）。
+- **entrySpeaker**：入口 beat 由谁开口 —— 取值只有三种：① 某个 NPC 真名（必须在 cast 里）② "你"（玩家本人开口）③ 留空（纯旁白 / 环境开场）。这决定镜头语言，要选准。
+- **entryActiveCharacters**：入口画面里**此刻出现的 NPC** 及其当下姿态 / 神情（中文 pose）。即使没人说话，画面里有谁也要列。**绝不**包含玩家。
+
+sceneKey 设计原则（用于跨场景视觉一致性）：
+- 同一物理空间 + 同一时段 → 必须沿用**完全相同**的英文 slug
+- 时段 / 空间变化时换 slug（"classroom-dusk" → "classroom-night" / "corridor-dusk"）
+- slug 规范：lowercase-with-dashes，2–4 个英文单词
+- 用户消息会列出已用过的 sceneKey，请优先**复用**这些已有 slug
+
+玩家视角硬规则（违反会破坏整个 galgame）：
+- 玩家是第二人称 POV，**永远不出现在任何画面里**——entryActiveCharacters 的 name **绝不允许**是「玩家 / 你 / 我 / 主角 / protagonist / player / Player / MC / I / me」任何变体。
+- entrySpeaker 只能是 NPC 真名 / "你" / 留空；其它 POV 变体一律视为错误。
+
+必须输出严格 JSON：
+{
+  "sceneSummary": "黄昏的天台，风很大。夏海背对你站在栏杆边，手里攥着一张揉皱的成绩单——她把你单独叫上来，却迟迟不开口。",
+  "sceneKey": "rooftop-dusk",
+  "entryBeatId": "b1",
+  "cast": ["夏海"],
+  "entrySpeaker": "夏海",
+  "entryActiveCharacters": [
+    { "name": "夏海", "pose": "背对你倚着栏杆，侧脸绷着，手里攥着揉皱的纸" }
+  ]
+}
+
+不要输出 JSON 以外的任何文本。`;
+
+// ──────────────────────────────────────────────────────────────────────
+//  Phase B — expands the plan into the full beats[] + storyStatePatch.
+// ──────────────────────────────────────────────────────────────────────
+
+export const WRITER_BEATS_SYSTEM = `你是一部交互视觉小说的「编剧」。这是**两步生成中的第二步——把已规划好的场景展开成完整剧本**。你会收到本场景的「规划」（场景概要 sceneSummary、sceneKey、入口 beat 的 id / speaker / 登场角色、以及本场景允许出场的角色名单 cast）。你的任务：基于规划写出玩家依次经历的对话节拍 beats，并在最后更新主线记忆。你只负责**剧情和台词**——不设计角色形象、不写出图提示词、不做镜头调度，这些由其他 agent 完成。
+
+你必须严格遵守收到的规划：
+- 必须存在一个 id 等于规划 entryBeatId 的 beat，作为玩家入口。
+- 该入口 beat 的 speaker 与登场角色（activeCharacters）要与规划一致（姿态措辞可微调，但**人物身份必须一致**）。
+- speaker 与 activeCharacters 里的 NPC 名字**只能来自规划的 cast**（或玩家 "你"）——**不要引入规划之外的新角色**。

 ═══════════════════════════════════════════════════════════════════
 爆款心法（番茄网文 / 红果短剧 / galgame 的叙事手感）—— 必须贯彻
@@ -167,11 +230,7 @@ export const WRITER_SYSTEM = `你是一部交互视觉小说的「编剧」。
 - 沿用主线记忆里的人物关系与情绪温度——别让刚告白的人下一场形同陌路，也别凭空遗忘已埋的伏笔。
 - 推进、但别重置：每一场都让主线问题往前走一点（关系变化 / 真相揭露一角 / 新悬念浮现）。

-一个场景包含：
- sceneSummary：当前场景的中文概要（地点、时间、氛围、关键事件——给后续的分镜导演看）
- sceneKey：当前场景的英文 slug（如 "classroom-dusk"、"rooftop-night"、"rainy-street"）——同一物理空间应沿用相同 slug
- beats[]：玩家依次经历的对话节拍
- entryBeatId：玩家进入场景时落在哪个 beat
+本步你只产出两样：**beats[]**（玩家依次经历的对话节拍）和 **storyStatePatch**（主线记忆更新）。sceneSummary / sceneKey / entryBeatId 已由规划给定，**不要再输出**它们。

 每个 beat 是玩家会看到的一段叙述 / 对话 / 选择。beat 之间通过 next 字段连接：
 - "continue"：玩家点击图片背景 / 按继续，自然推进到下一个 beat
@@ -183,6 +242,7 @@ choice 的 effect 有两种：

 设计原则：
 - 同场景内 beat 数自由发挥，按剧情节奏自然给出（通常 2–6 个，可以更多）
+- 入口 beat 的 id 必须等于规划给定的 entryBeatId；其余 beat id 依次自取且互不重复
 - 多用 continue，少用 choice — 选择只应出现在「真正的岔路口」
 - advance-beat 适合处理对话分支（同一场景里换个话题、追问、撒娇）
 - change-scene 适合空间/时间跳跃（出门、转身看窗外、第二天清晨）
@@ -192,12 +252,6 @@ choice 的 effect 有两种：
 - next.nextBeatId 引用的 beat 必须存在
 - choice 至少 2 个，至多 4 个，互不重复

-sceneKey 设计原则（重要 — 用于跨场景视觉一致性）：
- 同一物理空间 + 同一时段 → 必须沿用**完全相同**的英文 slug
- 时段或空间变化时换 slug（如 "classroom-dusk" → "classroom-night"，"classroom-dusk" → "corridor-dusk"）
- slug 规范：lowercase-with-dashes，2–4 个英文单词
- 已登记的历史场景 sceneKey 会在用户消息里列出，请优先**复用**这些已有 slug
-
 文本风格约束：
 - narration / line 用中文（**纯净可显示文本**，绝不要写 (叹气)(语速快) 这类标注 —— 那是给配音的，会被玩家看见）
 - sceneSummary / lineDelivery / activeCharacters[].pose 内的文字也用中文
@@ -243,11 +297,8 @@ sceneKey 设计原则（重要 — 用于跨场景视觉一致性）：
 - nextHook：基于这一场的结尾，下一场应往哪走（给"下一次的你"一个明确命题，接住本场留下的扣子）
 这些字段是写给"未来的你"的连贯性记忆，请认真写。

-必须输出严格 JSON，结构如下：
+必须输出严格 JSON，结构如下（**只含 beats 与 storyStatePatch**；sceneSummary / sceneKey / entryBeatId 由规划给定，不要输出。下例入口 beat 的 id "b1" 即规划的 entryBeatId）：
 {
-  "sceneSummary": "中文场景概要：地点+时间+氛围+关键事件",
-  "sceneKey": "classroom-dusk",
-  "entryBeatId": "b1",
  "beats": [
    {
      "id": "b1",
@@ -343,29 +394,28 @@ function renderHistoryEntry(
  return lines.join("\n");
 }

-export function buildWriterUserMessage(session: Session): string {
-  // ─── STABLE PREFIX ────────────────────────────────────────────────────
-  // Everything in this section is invariant across consecutive Writer calls
-  // within the session (or monotonically grows in a way that keeps the
-  // earlier bytes byte-identical). Always emit every section header — even
-  // when empty — so positions don't shift between calls.
-  //
-  // Order optimized for DeepSeek/MiMo prefix caching (64-token chunks):
-  //   1. session-immutable scalars (world / style)
-  //   2. story bible spine (Architect-set, never patched)
-  //   3. monotonically-growing lists (characters, sceneKeys)
-  //   4. history entries 0..N-2 (the last entry is what THIS call must
-  //      react to, so it lives in the dynamic suffix instead)
-  //
-  // ─── DYNAMIC SUFFIX ───────────────────────────────────────────────────
-  // Everything below changes on (almost) every call:
-  //   5. story bible dynamic patch (synopsis/threads/relationships/nextHook)
-  //   6. the just-completed entry (history[-1]) — same render format as the
-  //      stable history blocks, just preceded by a "just completed" header
-  //   7. last-beat snippet (the exact emotional cliffhanger)
-  //   8. lastExit hint
-  //   9. format reminder tail
-
+// Shared narrative context for BOTH Writer phases. Returns the message parts
+// from the cacheable STABLE PREFIX (sections 1-4) through the dynamic
+// transition hint (section 7), but WITHOUT the trailing phase-specific
+// instruction — each phase appends its own. Building this once and reusing it
+// keeps EACH phase's prompt prefix byte-stable across scenes for DeepSeek
+// prompt caching (Phase A and Phase B cache independently since their system
+// prompts differ, but each shares its own prefix across consecutive calls).
+//
+// ─── STABLE PREFIX ──────────────────────────────────────────────────────
+// Invariant across consecutive Writer calls within the session (or grows in a
+// way that keeps earlier bytes byte-identical). Always emit every section
+// header — even when empty — so positions don't shift between calls.
+//   1. session-immutable scalars (world / style)
+//   2. story bible spine (Architect-set, never patched)
+//   3. monotonically-growing lists (characters, sceneKeys)
+//   4. history entries 0..N-2 (the last entry is what THIS call must react
+//      to, so it lives in the dynamic suffix instead)
+// ─── DYNAMIC SUFFIX ─────────────────────────────────────────────────────
+//   5. story bible dynamic patch (synopsis/threads/relationships/nextHook)
+//   6. last-beat snippet (the exact emotional cliffhanger)
+//   7. transition hint (opening cold-open directive OR lastExit carry-over)
+function buildWriterContextParts(session: Session): string[] {
  const parts: string[] = [];

  // ── 1. session scalars ────────────────────────────────────────────────
@@ -423,8 +473,7 @@ export function buildWriterUserMessage(session: Session): string {
  // ── 6. last-beat snippet (the exact emotional cliffhanger) ──
  // The full last entry is already in the stable history block above; here
  // we only re-emit the very last beat to sharply focus the Writer on the
-  // emotional moment to continue from. Skip the duplicate full-entry render
-  // that was here previously — it wasted ~200-500 tokens of dynamic suffix.
+  // emotional moment to continue from.
  const last = session.history.at(-1);
  if (last) {
    const lastBeatId = last.visitedBeatIds.at(-1) ?? last.scene.entryBeatId;
@@ -441,14 +490,14 @@ export function buildWriterUserMessage(session: Session): string {
    }
  }

+  // ── 7. transition hint ────────────────────────────────────────────────
  if (session.history.length === 0) {
    parts.push(
-      "\n这是故事的开场。请按【故事档案】里的 nextHook 把第一幕的冷开场写出来——开场即抓人，别花笔墨铺垫世界观。写完后更新 storyStatePatch。严格以 JSON 格式返回。",
+      "\n这是故事的开场。请按【故事档案】里的 nextHook 把第一幕的冷开场设计出来——开场即抓人，别花笔墨铺垫世界观。",
    );
-    return parts.join("\n");
+    return parts;
  }

-  // ── 8. lastExit hint ──────────────────────────────────────────────────
  const lastExit = last?.exit;
  if (lastExit) {
    if (lastExit.kind === "choice") {
@@ -464,8 +513,59 @@ export function buildWriterUserMessage(session: Session): string {
    parts.push("\n无缝续写下一个场景，延续上一刻的情绪。");
  }

-  // ── 9. format reminder tail ───────────────────────────────────────────
-  parts.push("写完后别忘了更新 storyStatePatch。严格以 JSON 格式返回。");
+  return parts;
+}
+
+// Phase A — plan the scene skeleton (no beats). Shares the cacheable context;
+// appends a plan-only instruction tail.
+export function buildWriterPlanUserMessage(session: Session): string {
+  const parts = buildWriterContextParts(session);
+  parts.push(
+    '\n现在**只规划本场景的骨架**（不要写 beats 台词）：给出 sceneSummary（画面感强、含开场钩子）、sceneKey、entryBeatId、本场景会出场的全部角色 cast、以及入口 beat 的 entrySpeaker 与 entryActiveCharacters。严格以 JSON 格式返回。',
+  );
+  return parts.join("\n");
+}
+
+// Phase B — expand the plan into full beats[] + storyStatePatch. The plan is
+// dynamic per scene, so it goes AFTER the cacheable context (keeping Phase B's
+// prefix stable across scenes).
+export function buildWriterBeatsUserMessage(
+  session: Session,
+  plan: WriterPlan,
+): string {
+  const parts = buildWriterContextParts(session);
+
+  parts.push("");
+  parts.push("━━━ 本场景规划（上一步已定，必须严格遵守）━━━");
+  parts.push(`场景概要 sceneSummary：${plan.sceneSummary}`);
+  if (plan.sceneKey) parts.push(`sceneKey：${plan.sceneKey}`);
+  parts.push(
+    `入口 beat 的 id（entryBeatId，必须有一个此 id 的 beat 作为入口）：${plan.entryBeatId}`,
+  );
+  parts.push(
+    `入口 beat 的 speaker：${plan.entrySpeaker ? plan.entrySpeaker : "（空 —— 纯旁白 / 环境开场）"}`,
+  );
+  parts.push("入口 beat 的登场角色 activeCharacters（人物身份须一致，姿态可微调）：");
+  if (plan.entryActiveCharacters.length === 0) {
+    parts.push("（无 —— 入口画面没有 NPC）");
+  } else {
+    for (const c of plan.entryActiveCharacters) {
+      parts.push(`- ${c.name}${c.pose ? `：${c.pose}` : ""}`);
+    }
+  }
+  parts.push(
+    '本场景允许出现的角色名 cast（speaker / activeCharacters 只能用这些名字或 "你"，不要新增角色）：',
+  );
+  if (plan.cast.length === 0) {
+    parts.push("（无 NPC —— 仅旁白与玩家）");
+  } else {
+    for (const n of plan.cast) parts.push(`- ${n}`);
+  }
+  parts.push("━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━");
+
+  parts.push(
+    "\n把上面的规划展开成完整的 beats[]（入口 beat 用规划的 entryBeatId / speaker / 登场角色），写完后更新 storyStatePatch。严格以 JSON 格式返回。",
+  );
  return parts.join("\n");
 }
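Review note: the shared-context design above keeps each phase's prompt prefix identical up to the first phase-specific line. A rough sketch of that idea follows, under simplifying assumptions: `MiniSession`, the English section labels, and `sharedPrefixLength` are hypothetical and only illustrate the layout, not the real message format.

```typescript
// Hypothetical, trimmed-down session; the real Session type has more fields.
type MiniSession = { world: string; style: string; history: string[] };

// Shared context: stable sections first, the just-completed entry last.
function buildContextParts(s: MiniSession): string[] {
  const parts: string[] = [`WORLD: ${s.world}`, `STYLE: ${s.style}`, "HISTORY:"];
  // Headers are always emitted, even when the list under them is empty.
  for (const h of s.history.slice(0, -1)) parts.push(`- ${h}`);
  const last = s.history[s.history.length - 1];
  parts.push(last ? `JUST COMPLETED: ${last}` : "OPENING SCENE");
  return parts;
}

// Phase A: context plus a plan-only instruction tail.
function buildPlanMessage(s: MiniSession): string {
  return [...buildContextParts(s), "TASK: plan the scene skeleton only."].join("\n");
}

// Phase B: the per-scene plan goes AFTER the shared context.
function buildBeatsMessage(s: MiniSession, planSummary: string): string {
  return [
    ...buildContextParts(s),
    `PLAN: ${planSummary}`,
    "TASK: expand the plan into beats.",
  ].join("\n");
}

// Length of the common leading substring of two messages.
function sharedPrefixLength(a: string, b: string): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}
```

Because the plan is appended after the context, two Phase B messages for the same session differ only from the plan text onward, which is the property a prefix cache would rely on.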

@@ -704,6 +804,7 @@ export function buildPainterPrompt(
  integratedPrompt: string,
  styleGuide: string,
  characters: { name: string; visualDescription?: string }[],
+  orientation: Orientation = "landscape",
 ): string {
  const archetypeBlock = characters
    .filter((c) => c.visualDescription)
@@ -714,7 +815,15 @@ export function buildPainterPrompt(
    ? `\n\nCHARACTER ARCHETYPES (anchor identity, outfit, and style across scenes — keep each character visually identical to their archetype):\n${archetypeBlock}`
    : "";

-  return `Generate a cinematic landscape background illustration, 16:9 widescreen (1792x1024).
+  const portrait = orientation === "portrait";
+  const header = portrait
+    ? "Generate a cinematic vertical (portrait) background illustration, 9:16 tall format (1024x1792)."
+    : "Generate a cinematic landscape background illustration, 16:9 widescreen (1792x1024).";
+  const orientationRule = portrait
+    ? "- 9:16 PORTRAIT orientation — taller than wide. No landscape or square output."
+    : "- 16:9 LANDSCAPE orientation — wider than tall. No portrait or square output.";
+
+  return `${header}

 ART STYLE: ${styleGuide}

@@ -727,7 +836,7 @@ STRICT RULES — NEVER violate these:
 - DO NOT render any Chinese or English text anywhere in the image.
 - DO NOT add any HUD, interface chrome, or game UI elements.
 - The image is a PURE BACKGROUND SCENE ONLY. All UI will be added as HTML on top.
- 16:9 LANDSCAPE orientation — wider than tall. No portrait or square output.
+${orientationRule}
 - Leave the bottom 35% of the frame relatively uncluttered (darker or softer) so overlaid UI panels remain readable.
 - Characters or key scene elements should be positioned in the upper 65% of the frame.
 - Maintain character identity exactly as specified in CHARACTER ARCHETYPES — same face, same hairstyle, same outfit across every scene.
@@ -0,0 +1,77 @@
+// Xiaomi MiMo TTS endpoint presets.
+//
+// Xiaomi issues two independent key types, each with its own base URL:
+//   - Token Plan (套餐, `tp-` key): per-region endpoints token-plan-{sgp,cn,ams}.
+//   - Pay-as-you-go (按量, `sk-` key): the single unified endpoint api.xiaomimimo.com.
+//
+// Used CLIENT-SIDE ONLY: when a user supplies their own key, the browser calls
+// one of these endpoints directly (all return permissive CORS allowing the
+// `api-key` header), so the key never transits our server. Every endpoint
+// serves the same `mimo-v2.5-tts` family; Token Plan users pick the region
+// matching their subscription (also the closest hop → lower synth latency),
+// pay-as-you-go users have no region to choose. See docs/xiaomi-tts-key.md.
+
+export type TtsPreset = {
+  id: string;
+  /** Which key family this endpoint serves — drives the two-step picker UI. */
+  kind: "token-plan" | "payg";
+  /** Human label shown in the picker (region for Token Plan, type for payg). */
+  label: string;
+  /** OpenAI-style base; the TTS adapter appends `/chat/completions`. */
+  baseUrl: string;
+};
+
+/** Base model name; the adapter derives `-voicedesign` / `-voiceclone`. */
+export const DEFAULT_TTS_SPEECH_MODEL = "mimo-v2.5-tts";
+
+/**
+ * In-repo tutorial for getting a free Xiaomi MiMo key + picking a region.
+ * Points at the default branch so it resolves once this lands on main (which
+ * is what production serves). Linked from the homepage BYO modal, the play
+ * page's silence nudge, and the README.
+ */
+export const TTS_KEY_DOC_URL =
+  "https://github.com/zonghaoyuan/infiplot/blob/main/docs/xiaomi-tts-key.md";
+
+export const TTS_PRESETS: TtsPreset[] = [
+  {
+    id: "sgp",
+    kind: "token-plan",
+    label: "新加坡 · Singapore",
+    baseUrl: "https://token-plan-sgp.xiaomimimo.com/v1",
+  },
+  {
+    id: "cn",
+    kind: "token-plan",
+    label: "中国大陆 · China",
+    baseUrl: "https://token-plan-cn.xiaomimimo.com/v1",
+  },
+  {
+    id: "ams",
+    kind: "token-plan",
+    label: "欧洲 · Amsterdam",
+    baseUrl: "https://token-plan-ams.xiaomimimo.com/v1",
+  },
+  {
+    id: "payg",
+    kind: "payg",
+    label: "按量付费 · Pay-as-you-go",
+    baseUrl: "https://api.xiaomimimo.com/v1",
+  },
+];
+
+/** Token Plan endpoints only — the region sub-options shown once the user
+ *  picks the "套餐" key type. */
+export const TTS_REGION_PRESETS = TTS_PRESETS.filter(
+  (p) => p.kind === "token-plan",
+);
+
+/** The single pay-as-you-go preset id (`sk-` keys have no region). */
+export const PAYG_PRESET_ID = "payg";
+
+export function findTtsPreset(
+  id: string | null | undefined,
+): TtsPreset | undefined {
+  if (!id) return undefined;
+  return TTS_PRESETS.find((p) => p.id === id);
+}
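Review note: a possible way to consume these presets on the client, shown as a hedged sketch. The `tp-` / `sk-` prefixes and the `/chat/completions` suffix come from the comments in this file; `keyKind`, `chatCompletionsUrl`, and `resolveEndpoint` are hypothetical helpers that do not exist in the PR.

```typescript
type TtsPreset = {
  id: string;
  kind: "token-plan" | "payg";
  label: string;
  baseUrl: string;
};

// Copied from the presets in this file.
const TTS_PRESETS: TtsPreset[] = [
  { id: "sgp", kind: "token-plan", label: "新加坡 · Singapore", baseUrl: "https://token-plan-sgp.xiaomimimo.com/v1" },
  { id: "cn", kind: "token-plan", label: "中国大陆 · China", baseUrl: "https://token-plan-cn.xiaomimimo.com/v1" },
  { id: "ams", kind: "token-plan", label: "欧洲 · Amsterdam", baseUrl: "https://token-plan-ams.xiaomimimo.com/v1" },
  { id: "payg", kind: "payg", label: "按量付费 · Pay-as-you-go", baseUrl: "https://api.xiaomimimo.com/v1" },
];

function findTtsPreset(id: string | null | undefined): TtsPreset | undefined {
  if (!id) return undefined;
  return TTS_PRESETS.find((p) => p.id === id);
}

// Hypothetical: infer the key family from its prefix.
function keyKind(apiKey: string): TtsPreset["kind"] | undefined {
  if (apiKey.startsWith("tp-")) return "token-plan";
  if (apiKey.startsWith("sk-")) return "payg";
  return undefined;
}

// Hypothetical: the comment says the adapter appends /chat/completions.
function chatCompletionsUrl(preset: TtsPreset): string {
  return `${preset.baseUrl.replace(/\/+$/, "")}/chat/completions`;
}

// Hypothetical: pay-as-you-go keys ignore the region; Token Plan keys need one.
function resolveEndpoint(apiKey: string, regionId?: string): string | undefined {
  const kind = keyKind(apiKey);
  if (!kind) return undefined;
  const preset = kind === "payg" ? findTtsPreset("payg") : findTtsPreset(regionId);
  if (!preset || preset.kind !== kind) return undefined;
  return chatCompletionsUrl(preset);
}
```

Returning `undefined` for a mismatched key and region lets the picker UI surface the problem before any request is sent.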
@@ -40,6 +40,23 @@ export type BeatChoiceEffect =
  | { kind: "advance-beat"; targetBeatId: string }
  | { kind: "change-scene"; nextSceneSeed: string };

+// ──────────────────────────────────────────────────────────────────────
+//  Orientation — session-wide image aspect, locked at session start.
+//  "landscape" → 16:9 (1792×1024), the default for desktop / mobile-landscape.
+//  "portrait"  → 9:16 (1024×1792), painted for mobile users holding the phone
+//  upright so the scene fills the screen instead of letterboxing a widescreen
+//  image. CSS object-fit then adapts the 9:16 frame to the exact device size.
+// ──────────────────────────────────────────────────────────────────────
+
+export type Orientation = "portrait" | "landscape";
+
+/** Normalize an untrusted orientation value (from a request body, or a
+ *  persisted session that predates the field) to a valid Orientation.
+ *  Anything other than "portrait" falls back to "landscape" (back-compat). */
+export function coerceOrientation(value: unknown): Orientation {
+  return value === "portrait" ? "portrait" : "landscape";
+}
+
 // ──────────────────────────────────────────────────────────────────────
 //  Scene — one background image + a graph of beats.
 //  The Director emits an entire Scene per call; the player navigates
@@ -75,6 +92,12 @@ export type Scene = {
   * Runware URL — the client renders both forms transparently.
   */
  imageUrl?: string;
+  /**
+   * Orientation this scene's image was painted in. Mirrors the session's
+   * locked orientation; recorded per-scene so the client can pick the right
+   * intrinsic dimensions / object-fit even across legacy or mixed history.
+   */
+  orientation?: Orientation;
 };

 export type SceneExit =
@@ -92,6 +115,43 @@ export type SceneHistoryEntry = {
  exit?: SceneExit;
 };

+// ──────────────────────────────────────────────────────────────────────
+//  Writer two-phase split
+//
+//  The Writer runs as TWO LLM calls so scene-image generation can begin
+//  before the dialogue is fully written:
+//    Phase A (WriterPlan) — the minimal skeleton the image pipeline needs:
+//                           sceneSummary + sceneKey + the entry beat's
+//                           on-stage roster + the full cast to design.
+//    Phase B (beats)      — the full beats[] graph + storyStatePatch, written
+//                           to honor the plan, overlapped with image gen.
+//  The Cinematographer + character design + Painter all run off the Plan, so
+//  Phase B's (longer) output is hidden behind the image pipeline.
+// ──────────────────────────────────────────────────────────────────────
+
+export type WriterPlan = {
+  /** Chinese scene synopsis (location + time + mood + key event + opening hook).
+   *  The sole input the Cinematographer composes the establishing shot from. */
+  sceneSummary: string;
+  /** English location+time slug for cross-scene visual continuity. */
+  sceneKey?: string;
+  /** Beat id the player lands on when entering the scene. Phase B must emit a
+   *  beat with this id (reconciled if it doesn't). */
+  entryBeatId: string;
+  /** Every NPC name that appears anywhere in this scene. Drives character
+   *  design (card + portrait + voice) IN PARALLEL with Phase B beat writing, so
+   *  the whole cast is provisioned by the time the scene returns. Phase B may
+   *  only use names from this list (plus the POV "你"). Never includes the player. */
+  cast: string[];
+  /** The entry beat's on-stage roster (who's visible + pose when the player
+   *  lands). Drives the Cinematographer's framing and the entry-beat portraits
+   *  the Painter anchors to. Never includes the POV player. */
+  entryActiveCharacters: BeatActiveCharacter[];
+  /** The entry beat's speaker — an NPC name, "你" (player speaking), or
+   *  undefined for a pure narration/environment entry. Drives shot selection. */
+  entrySpeaker?: string;
+};
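Review note: the overlap described above can be pictured with a small orchestration sketch. Everything here is a stand-in: `planScene`, `writeBeats`, and `paintScene` are hypothetical stubs for the LLM and image calls, and `offCastSpeakers` is one possible way to check the "cast only" rule, not the reconciliation the engine actually performs.

```typescript
type Plan = { sceneSummary: string; entryBeatId: string; cast: string[] };
type Beat = { id: string; speaker?: string };

// Stubbed Phase A: returns the skeleton the image pipeline needs.
async function planScene(): Promise<Plan> {
  return { sceneSummary: "rooftop at dusk", entryBeatId: "b1", cast: ["夏海"] };
}

// Stubbed Phase B: expands the plan into beats.
async function writeBeats(plan: Plan): Promise<Beat[]> {
  return [{ id: plan.entryBeatId, speaker: plan.cast[0] }, { id: "b2", speaker: "你" }];
}

// Stubbed image pipeline (Cinematographer + Painter), driven by the plan only.
async function paintScene(plan: Plan): Promise<string> {
  return `mock://image/${encodeURIComponent(plan.sceneSummary)}`;
}

// Speakers that are neither in the cast nor the POV player "你".
function offCastSpeakers(beats: Beat[], plan: Plan): string[] {
  return beats
    .map((b) => b.speaker)
    .filter((s): s is string => !!s && s !== "你" && plan.cast.indexOf(s) < 0);
}

async function directSceneSketch(): Promise<{ beats: Beat[]; imageUrl: string }> {
  const plan = await planScene(); // Phase A must finish first
  // Phase B and image generation both depend only on the plan, so they overlap.
  const [beats, imageUrl] = await Promise.all([writeBeats(plan), paintScene(plan)]);
  if (!beats.some((b) => b.id === plan.entryBeatId)) {
    throw new Error("Phase B did not emit the planned entry beat");
  }
  return { beats, imageUrl };
}
```

With this shape the scene's wall-clock time is roughly Phase A plus the slower of Phase B and image generation, instead of the sum of all three.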
+
 // ──────────────────────────────────────────────────────────────────────
 //  Characters & voices (TTS)
 // ──────────────────────────────────────────────────────────────────────
@@ -214,6 +274,12 @@ export type Session = {
   * payload small for /api/scene round-trips.
   */
  styleReferenceImage?: string;
+  /**
+   * Session-wide image orientation, locked at session start from the client's
+   * device + orientation and carried on every /api/scene call so all scenes
+   * share one aspect ratio. Absent → "landscape" (back-compat).
+   */
+  orientation?: Orientation;
 };

 // ──────────────────────────────────────────────────────────────────────
@@ -231,10 +297,41 @@ export type VisionClassify = "insert-beat" | "change-scene";
 //  Provider config
 // ──────────────────────────────────────────────────────────────────────

+/**
+ * Wire protocol used to talk to a model provider. Which values are valid
+ * depends on the model role — each ai-client adapter accepts its own subset
+ * and falls back to a sensible default for anything else:
+ *
+ *   openai_compatible  text / vision / image  — OpenAI Chat Completions +
+ *                      `/images/generations` (self-implemented fetch; the
+ *                      default for text/vision when unset)
+ *   anthropic          text / vision          — native Anthropic Messages (AI SDK)
+ *   google             text / vision / image  — native Gemini (AI SDK); image
+ *                      uses the Nano Banana family
+ *   openai             image only             — OpenAI gpt-image via AI SDK,
+ *                      unlocks reference-image editing (for text/vision use
+ *                      openai_compatible, which already speaks OpenAI's format)
+ *   runware            image only             — Runware task-array protocol
+ *                      (self-implemented; the default for runware.ai URLs)
+ */
+export type ProviderProtocol =
+  | "openai_compatible"
+  | "anthropic"
+  | "google"
+  | "openai"
+  | "runware";
+
 export type ProviderConfig = {
  baseUrl: string;
  apiKey: string;
  model: string;
+  /**
+   * Wire protocol. When unset, callers apply a role-specific default:
+   * text/vision → "openai_compatible"; image → inferred from baseUrl
+   * (runware.ai → "runware", otherwise "openai_compatible") so existing
+   * deployments keep working without setting *_PROVIDER.
+   */
+  provider?: ProviderProtocol;
 };
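Review note: the role-specific defaulting described in the `provider` comment could look roughly like the sketch below. This is an illustration only: `Role`, `ALLOWED`, `defaultProtocol`, and `resolveProtocol` are hypothetical names, and the `runware.ai` check is an assumed regex, since the diff does not show how the adapters inspect the URL.

```typescript
type ProviderProtocol = "openai_compatible" | "anthropic" | "google" | "openai" | "runware";
type ProviderConfig = { baseUrl: string; apiKey: string; model: string; provider?: ProviderProtocol };
type Role = "text" | "vision" | "image";

// Per-role subsets, transcribed from the ProviderProtocol doc comment.
const ALLOWED: Record<Role, ProviderProtocol[]> = {
  text: ["openai_compatible", "anthropic", "google"],
  vision: ["openai_compatible", "anthropic", "google"],
  image: ["openai_compatible", "google", "openai", "runware"],
};

// Default when *_PROVIDER is unset (or not valid for the role).
function defaultProtocol(role: Role, baseUrl: string): ProviderProtocol {
  if (role !== "image") return "openai_compatible";
  // Assumed detection: any host at or under runware.ai.
  return /(^|[./])runware\.ai(\/|:|$)/i.test(baseUrl) ? "runware" : "openai_compatible";
}

// Honor an explicit provider only if the role accepts it.
function resolveProtocol(role: Role, cfg: ProviderConfig): ProviderProtocol {
  if (cfg.provider && ALLOWED[role].indexOf(cfg.provider) >= 0) return cfg.provider;
  return defaultProtocol(role, cfg.baseUrl);
}
```

Falling back rather than throwing mirrors the comment's promise that existing deployments keep working without setting `*_PROVIDER`.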

 export type TtsConfig = {
@@ -263,6 +360,18 @@ export type StartRequest = {
  styleGuide: string;
  /** Optional user-uploaded style reference image — see Session.styleReferenceImage. */
  styleReferenceImage?: string;
+  /**
+   * When true, the client supplied its own Xiaomi TTS key and will provision and
+   * synthesize voices in the browser (key never touches our server). The route then
+   * drops `config.tts` so the engine skips all server-side TTS work.
+   */
+  clientTts?: boolean;
+  /**
+   * Device orientation chosen at session start. "portrait" makes the engine
+   * paint 9:16 vertical scene images (mobile, held upright); "landscape"
+   * (default) keeps 16:9 widescreen. Locked for the whole session.
+   */
+  orientation?: Orientation;
 };

 // /api/parse-style-image — vision LLM extracts a textual painting-style
@@ -295,6 +404,8 @@ export type StartResponse = {
 // (frontend synthesizes a speculative exit).
 export type SceneRequest = {
  session: Session;
+  /** See StartRequest.clientTts — drops server-side TTS for BYO-key clients. */
+  clientTts?: boolean;
 };

 export type SceneResponse = {
@@ -352,6 +463,8 @@ export type VisionResponse = {
 export type InsertBeatRequest = {
  session: Session;
  freeformAction: string;
+  /** See StartRequest.clientTts — drops server-side TTS for BYO-key clients. */
+  clientTts?: boolean;
 };

 /** Partial beat fields produced by the insert-beat director. */
@@ -1,6 +1,6 @@
 /// <reference types="next" />
 /// <reference types="next/image-types/global" />
-import "./.next/types/routes.d.ts";
+import "./.next/dev/types/routes.d.ts";

 // NOTE: This file should not be edited
 // see https://nextjs.org/docs/app/api-reference/config/typescript for more information.
@@ -20,6 +20,10 @@
    "deploy:cf": "opennextjs-cloudflare deploy"
  },
  "dependencies": {
+    "@ai-sdk/anthropic": "^3.0.81",
+    "@ai-sdk/google": "^3.0.80",
+    "@ai-sdk/openai": "^3.0.67",
+    "ai": "^6.0.196",
    "jsonrepair": "^3.14.0",
    "next": "^16.0.0",
    "react": "^19.0.0",
@@ -8,12 +8,24 @@ importers:

  .:
    dependencies:
+      '@ai-sdk/anthropic':
+        specifier: ^3.0.81
+        version: 3.0.81(zod@4.4.3)
+      '@ai-sdk/google':
+        specifier: ^3.0.80
+        version: 3.0.80(zod@4.4.3)
+      '@ai-sdk/openai':
+        specifier: ^3.0.67
+        version: 3.0.67(zod@4.4.3)
+      ai:
+        specifier: ^6.0.196
+        version: 6.0.196(zod@4.4.3)
      jsonrepair:
        specifier: ^3.14.0
        version: 3.14.0
      next:
        specifier: ^16.0.0
-        version: 16.2.7(react-dom@19.2.7(react@19.2.7))(react@19.2.7)
+        version: 16.2.7(@opentelemetry/api@1.9.1)(react-dom@19.2.7(react@19.2.7))(react@19.2.7)
      react:
        specifier: ^19.0.0
        version: 19.2.7
@@ -23,7 +35,7 @@ importers:
    devDependencies:
      '@opennextjs/cloudflare':
        specifier: ^1.19.11
-        version: 1.19.11(next@16.2.7(react-dom@19.2.7(react@19.2.7))(react@19.2.7))(wrangler@4.97.0)
+        version: 1.19.11(next@16.2.7(@opentelemetry/api@1.9.1)(react-dom@19.2.7(react@19.2.7))(react@19.2.7))(wrangler@4.97.0)
      '@types/node':
        specifier: ^22.9.0
        version: 22.19.19
@@ -54,6 +66,40 @@ importers:

 packages:

+  '@ai-sdk/anthropic@3.0.81':
+    resolution: {integrity: sha512-B1JDd9Ugq9R5AgIaW3674lhGCMMYJcPUxnrZh8fzbGojgg4QvHFRv6eZahGQAUsmGHbcf74G9bdSBDLWQGY2GA==}
+    engines: {node: '>=18'}
+    peerDependencies:
+      zod: ^3.25.76 || ^4.1.8
+
+  '@ai-sdk/gateway@3.0.124':
+    resolution: {integrity: sha512-h8CrmbSG+8X0C+M/E1M4oiDHYevqwbzAPN+uLRHS0eJaatF2MZ+juNtOHXNOjk7Bsk9mD2RjYMjJO9dFkb9I7Q==}
+    engines: {node: '>=18'}
+    peerDependencies:
+      zod: ^3.25.76 || ^4.1.8
+
+  '@ai-sdk/google@3.0.80':
+    resolution: {integrity: sha512-5ORbm/yFUPO0MEvZsxBMN0cdKw2+lwU/wVn5KN3KF8Dmk1LughuDuUohMh/7iU/XFTiyB0OvmTW/tdV/J7O9zg==}
+    engines: {node: '>=18'}
+    peerDependencies:
+      zod: ^3.25.76 || ^4.1.8
+
+  '@ai-sdk/openai@3.0.67':
+    resolution: {integrity: sha512-oAiGC9eWG7IgtdsdS74bOCnAAHarAfTJhWN9x5INwnWPekL802AvF+0I5DvLzIF1MIRmNw4N8mPSL/GUVbX9Mw==}
+    engines: {node: '>=18'}
+    peerDependencies:
+      zod: ^3.25.76 || ^4.1.8
+
+  '@ai-sdk/provider-utils@4.0.27':
+    resolution: {integrity: sha512-ubkAJ+xODouwtmN1tYlvTPphH1hPOBfZaEQe8U7skGvFAnIRs9PPpsq57bC2+Ky/MB4yzhd6YOsxTAx9sGpazw==}
+    engines: {node: '>=18'}
+    peerDependencies:
+      zod: ^3.25.76 || ^4.1.8
+
+  '@ai-sdk/provider@3.0.10':
+    resolution: {integrity: sha512-Q3BZ27qfpYqnCYGvE3vt+Qi6LGOF9R5Nmzn+9JoM1lCRsD9mYaIhfJLkSunN48nfGXJ6n+XNV0J/XVpqGQl7Dw==}
+    engines: {node: '>=18'}
+
  '@alloc/quick-lru@5.2.0':
    resolution: {integrity: sha512-UrcABB+4bUrFABwbluTIBErXwvbsU/V7TZWfmbgJfbkwiBuziS9gxdODUyuiecfdGQ85jglMW6juS3+z5TsKLw==}
    engines: {node: '>=10'}
@@ -1036,6 +1082,10 @@ packages:
      next: '>=15.5.18 <16 || >=16.2.6'
      wrangler: ^4.86.0

+  '@opentelemetry/api@1.9.1':
+    resolution: {integrity: sha512-gLyJlPHPZYdAk1JENA9LeHejZe1Ti77/pTeFm/nMXmQH/HFZlcS/O2XJB+L8fkbrNSqhdtlvjBVjxwUYanNH5Q==}
+    engines: {node: '>=8.0.0'}
+
  '@poppinss/colors@4.1.6':
    resolution: {integrity: sha512-H9xkIdFswbS8n1d6vmRd8+c10t2Qe+rZITbbDHHkQixH5+2x1FDGmi/0K+WgWiqQFKPSlIYB7jlH6Kpfn6Fleg==}

@@ -1204,6 +1254,9 @@ packages:
  '@speed-highlight/core@1.2.15':
    resolution: {integrity: sha512-BMq1K3DsElxDWawkX6eLg9+CKJrTVGCBAWVuHXVUV2u0s2711qiChLSId6ikYPfxhdYocLNt3wWwSvDiTvFabw==}

+  '@standard-schema/spec@1.1.0':
+    resolution: {integrity: sha512-l2aFy5jALhniG5HgqrD6jXLi/rUWrKvqN/qJx6yoJsgKhblVd+iqqU4RCXavm/jPityDo5TCvKMnpjKnOriy0w==}
+
  '@swc/helpers@0.5.15':
    resolution: {integrity: sha512-JQ5TuMi45Owi4/BIMAJBoSQoOJu12oOk/gADqlcUL9JEdHB8vyjUSsxqeNXnmXHjYKMi2WcYtezGEEhqUI/E2g==}

@@ -1227,6 +1280,10 @@ packages:
  '@types/react@19.2.16':
    resolution: {integrity: sha512-esJiCAnl0kfpNdE69f3So4WJUXy95dLZydX0KwK46riIHDzHM7O9Vtf9xCHW0PXIqvgqNrswl522kA/5yx+F4w==}

+  '@vercel/oidc@3.2.0':
+    resolution: {integrity: sha512-UycprH3T6n3jH0k44NHMa7pnFHGu/N05MjojYr+Mc6I7obkoLIJujSWwin1pCvdy/eOxrI/l3uDLQsmcrOb4ug==}
+    engines: {node: '>= 20'}
+
  abort-controller@3.0.0:
    resolution: {integrity: sha512-h8lQ8tacZYnR3vNQTgibj+tODHI5/+l06Au2Pcriv/Gmet0eaj4TwWH41sO9wnHDiQsEj19q0drzdWdeAHtweg==}
    engines: {node: '>=6.5'}
@@ -1244,6 +1301,12 @@ packages:
    resolution: {integrity: sha512-kja8j7PjmncONqaTsB8fQ+wE2mSU2DJ9D4XKoJ5PFWIdRMa6SLSN1ff4mOr4jCbfRSsxR4keIiySJU0N9T5hIQ==}
    engines: {node: '>= 8.0.0'}

+  ai@6.0.196:
+    resolution: {integrity: sha512-2T45UeqKL4a11KQ14I5i1YYHOvCFrMF478E1k6PVjlQSGUvXSv4xrxIaQbUL4qgv91DADSbddwv3oR49pPAK3g==}
+    engines: {node: '>=18'}
+    peerDependencies:
+      zod: ^3.25.76 || ^4.1.8
+
  ansi-colors@4.1.3:
    resolution: {integrity: sha512-/6w/C21Pm1A7aZitlI5Ni/2J6FFQN8i1Cvz3kHABAAbw93v/NlvKdVOqz7CCWz/3iv/JplRSEEZ83XION15ovw==}
    engines: {node: '>=6'}
@@ -1549,6 +1612,10 @@ packages:
    resolution: {integrity: sha512-i/2XbnSz/uxRCU6+NdVJgKWDTM427+MqYbkQzD321DuCQJUqOuJKIA0IM2+W2xtYHdKOmZ4dR6fExsd4SXL+WQ==}
    engines: {node: '>=6'}

+  eventsource-parser@3.1.0:
+    resolution: {integrity: sha512-kJezFj9YFAMLeORyi7aCLxLbD5/qWMQnoMVlVPyHIll7lgRJCc3JVln9Vgl9nwQi0YkMnhdGTMNn7CkRRAptMg==}
+    engines: {node: '>=18.0.0'}
+
  execa@5.1.1:
    resolution: {integrity: sha512-8uSpZZocAZRBAPIEINJj3Lo9HyGitllczc27Eh5YYojjMFMn8yHMDMaUHE2Jqfq05D/wucwI4JGURyXt1vchyg==}
    engines: {node: '>=10'}
@@ -1754,6 +1821,9 @@ packages:
    resolution: {integrity: sha512-/imKNG4EbWNrVjoNC/1H5/9GFy+tqjGBHCaSsN+P2RnPqjsLmv6UD3Ej+Kj8nBWaRAwyk7kK5ZUc+OEatnTR3A==}
    hasBin: true

+  json-schema@0.4.0:
+    resolution: {integrity: sha512-es94M3nTIfsEPisRafak+HDLfHXnKBhV3vU5eqPcS3flIWqcxJWgXHXiey3YrpaNsanY5ei1VoYEbOzijuq9BA==}
+
  jsonrepair@3.14.0:
    resolution: {integrity: sha512-tWPGKMZf/8UPim+fcW2EfcQ/d/7aKUrP6IECz9G3Tu6Q5dX0orSleqJ9z6sSw7qrQkjF8/Edo4DvsWBZ8H+HNg==}
    hasBin: true
@@ -2384,8 +2454,47 @@ packages:
  youch@4.1.0-beta.10:
    resolution: {integrity: sha512-rLfVLB4FgQneDr0dv1oddCVZmKjcJ6yX6mS4pU82Mq/Dt9a3cLZQ62pDBL4AUO+uVrCvtWz3ZFUL2HFAFJ/BXQ==}

+  zod@4.4.3:
+    resolution: {integrity: sha512-ytENFjIJFl2UwYglde2jchW2Hwm4GJFLDiSXWdTrJQBIN9Fcyp7n4DhxJEiWNAJMV1/BqWfW/kkg71UDcHJyTQ==}
+
 snapshots:

+  '@ai-sdk/anthropic@3.0.81(zod@4.4.3)':
+    dependencies:
+      '@ai-sdk/provider': 3.0.10
+      '@ai-sdk/provider-utils': 4.0.27(zod@4.4.3)
+      zod: 4.4.3
+
+  '@ai-sdk/gateway@3.0.124(zod@4.4.3)':
+    dependencies:
+      '@ai-sdk/provider': 3.0.10
+      '@ai-sdk/provider-utils': 4.0.27(zod@4.4.3)
+      '@vercel/oidc': 3.2.0
+      zod: 4.4.3
+
+  '@ai-sdk/google@3.0.80(zod@4.4.3)':
+    dependencies:
+      '@ai-sdk/provider': 3.0.10
+      '@ai-sdk/provider-utils': 4.0.27(zod@4.4.3)
+      zod: 4.4.3
+
+  '@ai-sdk/openai@3.0.67(zod@4.4.3)':
+    dependencies:
+      '@ai-sdk/provider': 3.0.10
+      '@ai-sdk/provider-utils': 4.0.27(zod@4.4.3)
+      zod: 4.4.3
+
+  '@ai-sdk/provider-utils@4.0.27(zod@4.4.3)':
+    dependencies:
+      '@ai-sdk/provider': 3.0.10
+      '@standard-schema/spec': 1.1.0
+      eventsource-parser: 3.1.0
+      zod: 4.4.3
+
+  '@ai-sdk/provider@3.0.10':
+    dependencies:
+      json-schema: 0.4.0
+
  '@alloc/quick-lru@5.2.0': {}

  '@ast-grep/napi-darwin-arm64@0.40.5':
@@ -3446,7 +3555,7 @@ snapshots:
      '@nodelib/fs.scandir': 2.1.5
      fastq: 1.20.1

-  '@opennextjs/aws@4.0.2(next@16.2.7(react-dom@19.2.7(react@19.2.7))(react@19.2.7))':
+  '@opennextjs/aws@4.0.2(next@16.2.7(@opentelemetry/api@1.9.1)(react-dom@19.2.7(react@19.2.7))(react@19.2.7))':
    dependencies:
      '@ast-grep/napi': 0.40.5
      '@aws-sdk/client-cloudfront': 3.984.0
@@ -3462,24 +3571,24 @@ snapshots:
      cookie: 1.1.1
      esbuild: 0.25.4
      express: 5.2.1
-      next: 16.2.7(react-dom@19.2.7(react@19.2.7))(react@19.2.7)
+      next: 16.2.7(@opentelemetry/api@1.9.1)(react-dom@19.2.7(react@19.2.7))(react@19.2.7)
      path-to-regexp: 6.3.0
      urlpattern-polyfill: 10.1.0
      yaml: 2.9.0
    transitivePeerDependencies:
      - supports-color

-  '@opennextjs/cloudflare@1.19.11(next@16.2.7(react-dom@19.2.7(react@19.2.7))(react@19.2.7))(wrangler@4.97.0)':
+  '@opennextjs/cloudflare@1.19.11(next@16.2.7(@opentelemetry/api@1.9.1)(react-dom@19.2.7(react@19.2.7))(react@19.2.7))(wrangler@4.97.0)':
    dependencies:
      '@ast-grep/napi': 0.40.5
      '@dotenvx/dotenvx': 1.31.0
-      '@opennextjs/aws': 4.0.2(next@16.2.7(react-dom@19.2.7(react@19.2.7))(react@19.2.7))
+      '@opennextjs/aws': 4.0.2(next@16.2.7(@opentelemetry/api@1.9.1)(react-dom@19.2.7(react@19.2.7))(react@19.2.7))
      ci-info: 4.4.0
      cloudflare: 4.5.0
      comment-json: 4.6.2
      enquirer: 2.4.1
      glob: 12.0.0
-      next: 16.2.7(react-dom@19.2.7(react@19.2.7))(react@19.2.7)
+      next: 16.2.7(@opentelemetry/api@1.9.1)(react-dom@19.2.7(react@19.2.7))(react@19.2.7)
      ts-tqdm: 0.8.6
      wrangler: 4.97.0
      yargs: 18.0.0
@@ -3487,6 +3596,8 @@ snapshots:
      - encoding
      - supports-color

+  '@opentelemetry/api@1.9.1': {}
+
  '@poppinss/colors@4.1.6':
    dependencies:
      kleur: 4.1.5
@@ -3697,6 +3808,8 @@ snapshots:

  '@speed-highlight/core@1.2.15': {}

+  '@standard-schema/spec@1.1.0': {}
+
  '@swc/helpers@0.5.15':
    dependencies:
      tslib: 2.8.1
@@ -3724,6 +3837,8 @@ snapshots:
    dependencies:
      csstype: 3.2.3

+  '@vercel/oidc@3.2.0': {}
+
  abort-controller@3.0.0:
    dependencies:
      event-target-shim: 5.0.1
@@ -3739,6 +3854,14 @@ snapshots:
    dependencies:
      humanize-ms: 1.2.1

+  ai@6.0.196(zod@4.4.3):
+    dependencies:
+      '@ai-sdk/gateway': 3.0.124(zod@4.4.3)
+      '@ai-sdk/provider': 3.0.10
+      '@ai-sdk/provider-utils': 4.0.27(zod@4.4.3)
+      '@opentelemetry/api': 1.9.1
+      zod: 4.4.3
+
  ansi-colors@4.1.3: {}

  ansi-regex@5.0.1: {}
@@ -4052,6 +4175,8 @@ snapshots:

  event-target-shim@5.0.1: {}

+  eventsource-parser@3.1.0: {}
+
  execa@5.1.1:
    dependencies:
      cross-spawn: 7.0.6
@@ -4293,6 +4418,8 @@ snapshots:

  jiti@1.21.7: {}

+  json-schema@0.4.0: {}
+
  jsonrepair@3.14.0: {}

  kleur@4.1.5: {}
@@ -4376,7 +4503,7 @@ snapshots:

  negotiator@1.0.0: {}

-  next@16.2.7(react-dom@19.2.7(react@19.2.7))(react@19.2.7):
+  next@16.2.7(@opentelemetry/api@1.9.1)(react-dom@19.2.7(react@19.2.7))(react@19.2.7):
    dependencies:
      '@next/env': 16.2.7
      '@swc/helpers': 0.5.15
@@ -4395,6 +4522,7 @@ snapshots:
      '@next/swc-linux-x64-musl': 16.2.7
      '@next/swc-win32-arm64-msvc': 16.2.7
      '@next/swc-win32-x64-msvc': 16.2.7
+      '@opentelemetry/api': 1.9.1
      sharp: 0.34.5
    transitivePeerDependencies:
      - '@babel/core'
@@ -4928,3 +5056,5 @@ snapshots:
      '@speed-highlight/core': 1.2.15
      cookie: 1.1.1
      youch-core: 0.3.3
+
+  zod@4.4.3: {}
@@ -1,11 +1,4 @@
 {
  "$schema": "https://openapi.vercel.sh/vercel.json",
-  "framework": "nextjs",
-  "functions": {
-    "app/api/start/route.ts":       { "maxDuration": 60 },
-    "app/api/scene/route.ts":       { "maxDuration": 60 },
-    "app/api/vision/route.ts":      { "maxDuration": 60 },
-    "app/api/insert-beat/route.ts": { "maxDuration": 60 },
-    "app/api/beat-audio/route.ts":  { "maxDuration": 30 }
-  }
+  "framework": "nextjs"
 }