feat: scene/beat architecture — decouple dialogue from image generation (#2)

Replace the one-image-per-interaction model with scenes that hold multiple dialogue beats. The image regenerates only on scene-change actions; tapping through beats and in-scene choices are instant and zero-network.

Squashed from #2:
- feat: scene/beat architecture — decouple dialogue from image generation
- fix: harden LLM-output parsing, prefetch lifecycle, and typewriter (PR review)
- fix: dedupe beat ids; fallback narration on empty insert-beat (PR review #2)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-05-28 15:20:12 +08:00
parent d116c2e3b5
commit d1f13d51a3
13 changed files with 1275 additions and 402 deletions
@@ -1,31 +1,37 @@
 # 云梦

-> An AI-driven visual novel where every frame — scenes, dialogue, choices — is rendered by an AI, one frame at a time. You click. It paints. The story unfolds.
+> An AI-driven visual novel painted by an AI, one scene at a time. You talk and explore within a scene; when the story turns a corner, it paints the next. You click. It paints. The story unfolds.

 ---

 ## How it works

-Each turn is three model calls:
+The story unfolds as a sequence of **scenes**. Each scene is one AI-painted background plus a short tree of **beats** — moments of narration, dialogue, and the occasional choice. You tap through a scene's beats and the image stays put; only when a choice leads somewhere genuinely new — another place, a new point of view, a jump in time — does the AI paint the next scene.

 ```
-[user clicks somewhere on the image]
+entering a scene
        │
        ▼
-1. Vision model    interprets the click against the visible UI
+1. Text LLM     directs the whole scene at once — a background prompt
+                plus a tree of beats (narration / dialogue / choices)
        │
        ▼
-2. Text LLM        writes the next frame (narration, dialogue, choices)
+2. Image model  paints the background once, 16:9, no UI baked in
        │
        ▼
-3. Image model     renders the entire next UI screen — scene, dialogue,
-                   buttons, all of it — as one painted frame
+[ tap through beats — no model calls, instant ]
        │
-        ▼
-[new image is shown; repeat]
+        ├─ in-scene choice ──────▶ jump to another beat (instant)
+        │
+        └─ scene-change choice ──▶ the next scene
+                                   (usually pre-generated — see below)
 ```

-There is no traditional UI. There is only the image. The AI chooses the layout, the colors, the typography, the buttons. Pick "stick figure on grid paper" as your style and you'll get hand-drawn UI. Pick "cyberpunk noir" and you'll get neon HUDs. Whatever fits the world.
+While you're reading one scene, the engine **speculatively generates the scenes your choices could lead to** — and, for unavoidable next steps, the scene after that. By the time you pick a direction, its image is usually already painted, so the cut feels instant.
+
+Clicking the background itself (not a button) routes through a **vision** model: it reads where you tapped and decides whether you're exploring the current scene (it inserts a beat — no new image) or moving on (a new scene).
+
+There is no traditional game UI baked into the art. The AI paints the world in whatever style you pick — "stick figure on grid paper" or "cyberpunk noir" — and the dialogue panel and choice buttons are a light HTML layer drawn on top, tuned to sit over the scene.

 ---
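
Review note: the scene/beat model in the README text above can be pictured with a small sketch. The type shapes are inferred from how `page.tsx` uses them in this commit; the real definitions in `@yume/types` are not shown here and may differ. `step` and `demoScene` are hypothetical names used only for illustration.

```typescript
// Shapes inferred from usage in this commit; the real @yume/types may differ.
type BeatChoice = {
  id: string;
  label: string;
  effect:
    | { kind: "advance-beat"; targetBeatId: string }
    | { kind: "change-scene"; nextSceneSeed: string };
};

type Beat = {
  id: string;
  narration?: string;
  next:
    | { type: "continue"; nextBeatId: string }
    | { type: "choice"; choices: BeatChoice[] };
};

type Scene = { id: string; entryBeatId: string; beats: Beat[] };

// Outcome of one player action: stay in the scene (no network) or leave it.
type StepResult =
  | { kind: "beat"; beatId: string }
  | { kind: "scene-change"; seed: string };

// Hypothetical helper: resolve a plain tap (no choiceId) or a choice click.
// Returns null when the action does not apply to the current beat.
function step(scene: Scene, beatId: string, choiceId?: string): StepResult | null {
  const beat = scene.beats.find((b) => b.id === beatId);
  if (!beat) return null;
  if (beat.next.type === "continue") {
    return choiceId === undefined
      ? { kind: "beat", beatId: beat.next.nextBeatId }
      : null;
  }
  const choice = beat.next.choices.find((c) => c.id === choiceId);
  if (!choice) return null;
  return choice.effect.kind === "advance-beat"
    ? { kind: "beat", beatId: choice.effect.targetBeatId }
    : { kind: "scene-change", seed: choice.effect.nextSceneSeed };
}

// Illustrative scene: one background, three beats, one exit.
const demoScene: Scene = {
  id: "s1",
  entryBeatId: "b1",
  beats: [
    { id: "b1", narration: "Rain on the pier.", next: { type: "continue", nextBeatId: "b2" } },
    {
      id: "b2",
      next: {
        type: "choice",
        choices: [
          { id: "C1", label: "Ask about the boat", effect: { kind: "advance-beat", targetBeatId: "b3" } },
          { id: "C2", label: "Leave", effect: { kind: "change-scene", nextSceneSeed: "the harbor at dawn" } },
        ],
      },
    },
    { id: "b3", narration: "She shrugs.", next: { type: "continue", nextBeatId: "b2" } },
  ],
};
```

Only the `scene-change` result would need a model call; every `beat` result is a local lookup, which is why tapping through a scene costs nothing.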

@@ -82,4 +88,4 @@ yume/

 ## Cost & limits

-Each turn costs roughly **\$0.15–0.25** in API fees with the recommended model trio. A 30-turn session is **\~\$5–8**. There is no rate limiting or auth out of the box — if you make your deployment public, your bill will reflect that. Add limits before sharing widely.
+Each **scene** costs roughly **\$0.15–0.25** in API fees with the recommended model trio (one text + one image call); tapping through a scene's beats is free. To keep transitions instant, the engine also **pre-generates scenes you might pick but don't** — so real spend runs somewhat higher than the scenes you actually see. There is no rate limiting or auth out of the box — if you make your deployment public, your bill will reflect that. Add limits (and consider lowering the prefetch depth) before sharing widely.
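
Review note: a rough way to see how prefetching inflates spend is to count scenes generated rather than scenes seen. The sketch below uses the README's per-scene range; the `exitsPerScene` parameter and the assumption that every exit is prefetched once and billed in full (ignoring the recursive must-pass prefetch and any savings from aborted requests) are illustrative, not measured.

```typescript
type CostRange = { low: number; high: number };

// Per-scene range quoted in the README (one text call + one image call).
const PER_SCENE_USD: CostRange = { low: 0.15, high: 0.25 };

// Assumption: every change-scene exit of each visited scene is prefetched
// once and billed in full, even if the request is later aborted.
function scenesGenerated(scenesSeen: number, exitsPerScene: number): number {
  if (scenesSeen <= 0) return 0;
  // 1 for the opening scene, then every exit of every visited scene.
  return 1 + scenesSeen * exitsPerScene;
}

function estimateSessionCost(scenesSeen: number, exitsPerScene: number): CostRange {
  const n = scenesGenerated(scenesSeen, exitsPerScene);
  // Round to whole cents to hide floating-point noise.
  const toCents = (usd: number): number => Math.round(usd * 100) / 100;
  return {
    low: toCents(n * PER_SCENE_USD.low),
    high: toCents(n * PER_SCENE_USD.high),
  };
}
```

Under this model, a session of 30 visited scenes with two exits each generates 61 scenes, about twice the visible count, which is the gap the README warns about.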
@@ -1,5 +1,5 @@
-import { takeTurn } from "@yume/engine";
-import type { InteractRequest } from "@yume/types";
+import { requestInsertBeat } from "@yume/engine";
+import type { InsertBeatRequest } from "@yume/types";
 import { NextResponse } from "next/server";
 import { loadEngineConfig } from "@/lib/config";

@@ -7,23 +7,23 @@ export const runtime = "nodejs";
 export const maxDuration = 60;

 export async function POST(req: Request) {
-  let body: InteractRequest;
+  let body: InsertBeatRequest;
  try {
-    body = (await req.json()) as InteractRequest;
+    body = (await req.json()) as InsertBeatRequest;
  } catch {
    return NextResponse.json({ error: "Invalid JSON" }, { status: 400 });
  }

-  if (!body.session || !body.intent) {
+  if (!body.session || !body.freeformAction) {
    return NextResponse.json(
-      { error: "session and intent are required" },
+      { error: "session and freeformAction are required" },
      { status: 400 },
    );
  }

  try {
    const config = loadEngineConfig();
-    const result = await takeTurn(config, body);
+    const result = await requestInsertBeat(config, body);
    return NextResponse.json(result);
  } catch (err) {
    const message = err instanceof Error ? err.message : "Unknown error";
@@ -0,0 +1,29 @@
+import { requestScene } from "@yume/engine";
+import type { SceneRequest } from "@yume/types";
+import { NextResponse } from "next/server";
+import { loadEngineConfig } from "@/lib/config";
+
+export const runtime = "nodejs";
+export const maxDuration = 120;
+
+export async function POST(req: Request) {
+  let body: SceneRequest;
+  try {
+    body = (await req.json()) as SceneRequest;
+  } catch {
+    return NextResponse.json({ error: "Invalid JSON" }, { status: 400 });
+  }
+
+  if (!body.session) {
+    return NextResponse.json({ error: "session is required" }, { status: 400 });
+  }
+
+  try {
+    const config = loadEngineConfig();
+    const result = await requestScene(config, body);
+    return NextResponse.json(result);
+  } catch (err) {
+    const message = err instanceof Error ? err.message : "Unknown error";
+    return NextResponse.json({ error: message }, { status: 500 });
+  }
+}
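
Review note: the route above casts the parsed JSON to `SceneRequest` and only checks that `session` is truthy, so a body like `{ "session": "x" }` reaches the engine. A minimal sketch of a runtime guard follows. The fields it checks (`id` as a string, a non-empty `history` array) are inferred from how `page.tsx` builds sessions in this commit; the name `isSceneRequestBody` is hypothetical and the real `SceneRequest` type may require more.

```typescript
// Minimal structural view of a session, inferred from page.tsx usage.
type SessionLike = { id: string; history: unknown[] };

// Hypothetical guard: narrows an unknown JSON body before it reaches the engine.
function isSceneRequestBody(value: unknown): value is { session: SessionLike } {
  if (typeof value !== "object" || value === null) return false;
  const session = (value as { session?: unknown }).session;
  if (typeof session !== "object" || session === null) return false;
  const s = session as { id?: unknown; history?: unknown };
  // Assumption: the engine needs at least one history entry (the current scene).
  return typeof s.id === "string" && Array.isArray(s.history) && s.history.length > 0;
}
```

A guard like this could replace the `!body.session` check and return the same 400 response when it fails.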
@@ -1,4 +1,4 @@
-import { visionTurn } from "@yume/engine";
+import { visionDecide } from "@yume/engine";
 import type { VisionRequest } from "@yume/types";
 import { NextResponse } from "next/server";
 import { loadEngineConfig } from "@/lib/config";
@@ -23,7 +23,7 @@ export async function POST(req: Request) {

  try {
    const config = loadEngineConfig();
-    const result = await visionTurn(config, body);
+    const result = await visionDecide(config, body);
    return NextResponse.json(result);
  } catch (err) {
    const message = err instanceof Error ? err.message : "Unknown error";
@@ -2,39 +2,236 @@

 import Link from "next/link";
 import { useRouter, useSearchParams } from "next/navigation";
-import { Suspense, useCallback, useEffect, useRef, useState } from "react";
+import {
+  Suspense,
+  useCallback,
+  useEffect,
+  useMemo,
+  useRef,
+  useState,
+} from "react";
 import { PlayCanvas, type Phase } from "@/components/PlayCanvas";
 import { PRESETS } from "@/lib/presets";
 import type {
-  ClickIntent,
-  InteractResponse,
+  Beat,
+  BeatChoice,
+  InsertBeatResponse,
+  Scene,
+  SceneExit,
+  SceneResponse,
  Session,
  StartResponse,
-  StoryFrame,
  VisionResponse,
 } from "@yume/types";

+// ──────────────────────────────────────────────────────────────────────
+//  Prefetch pool — speculative SceneResponses keyed by choice path.
+//
+//  Key format: "C1" → reached by choosing C1 from current scene.
+//              "C1/C2" → after C1, then C2 (recursive must-pass prefetch).
+//
+//  When the player picks a change-scene choice, we keep that key's
+//  descendants (re-rooted) and abort the rest.
+// ──────────────────────────────────────────────────────────────────────
+
+const PREFETCH_MAX_DEPTH = 3;
+
+type PrefetchEntry = {
+  promise: Promise<SceneResponse>;
+  abort: AbortController;
+};
+
+type ScenePathStep = {
+  fromScene: Scene;
+  fromVisitedBeats: string[];
+  exit: { choiceId: string; label: string; nextSceneSeed: string };
+};
+
+function pathKey(steps: ScenePathStep[]): string {
+  return steps.map((s) => s.exit.choiceId).join("/");
+}
+
+function buildSpeculativeSession(
+  base: Session,
+  steps: ScenePathStep[],
+): Session {
+  // Drop base's current (last) entry and re-add each step's `fromScene` with
+  // its exit set. The result has `history.length === base.history.length - 1 + steps.length`.
+  const newHistory = [...base.history.slice(0, -1)];
+  for (const step of steps) {
+    newHistory.push({
+      scene: step.fromScene,
+      visitedBeatIds: step.fromVisitedBeats,
+      exit: {
+        kind: "choice",
+        choiceId: step.exit.choiceId,
+        label: step.exit.label,
+        nextSceneSeed: step.exit.nextSceneSeed,
+      },
+    });
+  }
+  return { ...base, history: newHistory };
+}
+
+function findAllChangeSceneChoices(scene: Scene): BeatChoice[] {
+  const result: BeatChoice[] = [];
+  const seen = new Set<string>();
+  for (const b of scene.beats) {
+    if (b.next.type === "choice") {
+      for (const c of b.next.choices) {
+        if (c.effect.kind === "change-scene" && !seen.has(c.id)) {
+          seen.add(c.id);
+          result.push(c);
+        }
+      }
+    }
+  }
+  return result;
+}
+
+function findSoleChangeSceneChoice(scene: Scene): BeatChoice | null {
+  const all = findAllChangeSceneChoices(scene);
+  return all.length === 1 ? all[0]! : null;
+}
+
+function prefetchScenePath(
+  pool: Map<string, PrefetchEntry>,
+  baseSession: Session,
+  steps: ScenePathStep[],
+  depth: number,
+): void {
+  if (depth >= PREFETCH_MAX_DEPTH) return;
+  const key = pathKey(steps);
+  if (pool.has(key)) return;
+
+  const specSession = buildSpeculativeSession(baseSession, steps);
+  const abort = new AbortController();
+  const promise = (async () => {
+    const res = await fetch("/api/scene", {
+      method: "POST",
+      headers: { "Content-Type": "application/json" },
+      body: JSON.stringify({ session: specSession }),
+      signal: abort.signal,
+    });
+    if (!res.ok) {
+      const j = (await res.json().catch(() => ({}))) as { error?: string };
+      throw new Error(j.error ?? res.statusText);
+    }
+    const data = (await res.json()) as SceneResponse;
+
+    // Recursive: if the resulting scene has exactly one change-scene exit,
+    // it is a must-pass node — prefetch its child too.
+    if (depth + 1 < PREFETCH_MAX_DEPTH) {
+      const sole = findSoleChangeSceneChoice(data.scene);
+      if (sole && sole.effect.kind === "change-scene") {
+        const nextStep: ScenePathStep = {
+          fromScene: data.scene,
+          fromVisitedBeats: [data.scene.entryBeatId],
+          exit: {
+            choiceId: sole.id,
+            label: sole.label,
+            nextSceneSeed: sole.effect.nextSceneSeed,
+          },
+        };
+        prefetchScenePath(pool, baseSession, [...steps, nextStep], depth + 1);
+      }
+    }
+
+    return data;
+  })();
+
+  promise.catch(() => {});
+  pool.set(key, { promise, abort });
+}
+
+function consumeChoice(
+  pool: Map<string, PrefetchEntry>,
+  choiceId: string,
+): PrefetchEntry | undefined {
+  const my = pool.get(choiceId);
+  const survivors = new Map<string, PrefetchEntry>();
+  for (const [key, entry] of pool) {
+    if (key === choiceId) continue;
+    if (key.startsWith(choiceId + "/")) {
+      survivors.set(key.slice(choiceId.length + 1), entry);
+    } else {
+      entry.abort.abort();
+    }
+  }
+  pool.clear();
+  for (const [k, e] of survivors) pool.set(k, e);
+  return my;
+}
+
+function clearPool(pool: Map<string, PrefetchEntry>): void {
+  for (const e of pool.values()) e.abort.abort();
+  pool.clear();
+}
+
+// ──────────────────────────────────────────────────────────────────────
+//  Component
+// ──────────────────────────────────────────────────────────────────────
+
 function PlayInner() {
  const router = useRouter();
  const params = useSearchParams();

  const [phase, setPhase] = useState<Phase>("loading-first");
  const [session, setSession] = useState<Session | null>(null);
+  const [currentScene, setCurrentScene] = useState<Scene | null>(null);
+  const [currentBeatId, setCurrentBeatId] = useState<string | null>(null);
  const [imageBase64, setImageBase64] = useState<string | null>(null);
-  const [frame, setFrame] = useState<StoryFrame | null>(null);
-  const [intent, setIntent] = useState<ClickIntent | null>(null);
  const [pendingClick, setPendingClick] = useState<{
    x: number;
    y: number;
  } | null>(null);
-  const [turnNum, setTurnNum] = useState(0);
  const [error, setError] = useState<string | null>(null);
  const [presentation, setPresentation] = useState(false);
+  const [lastExitLabel, setLastExitLabel] = useState<string | null>(null);

  const startedRef = useRef(false);
-  const prefetchAbortRef = useRef<AbortController | null>(null);
-  const prefetchRef = useRef<Record<string, Promise<InteractResponse>>>({});
+  const poolRef = useRef<Map<string, PrefetchEntry>>(new Map());

+  // Mirrors for use inside async handlers (closure-stable)
+  const sessionRef = useRef<Session | null>(null);
+  const currentSceneRef = useRef<Scene | null>(null);
+  const currentBeatRef = useRef<Beat | null>(null);
+  const visitedBeatsRef = useRef<string[]>([]);
+
+  const currentBeat = useMemo<Beat | null>(() => {
+    if (!currentScene || !currentBeatId) return null;
+    return currentScene.beats.find((b) => b.id === currentBeatId) ?? null;
+  }, [currentScene, currentBeatId]);
+
+  useEffect(() => {
+    sessionRef.current = session;
+  }, [session]);
+  useEffect(() => {
+    currentSceneRef.current = currentScene;
+  }, [currentScene]);
+  useEffect(() => {
+    currentBeatRef.current = currentBeat;
+  }, [currentBeat]);
+
+  // Whenever currentBeatId changes, append it to visited (skip consecutive dups)
+  useEffect(() => {
+    if (!currentBeatId) return;
+    if (visitedBeatsRef.current.at(-1) === currentBeatId) return;
+    visitedBeatsRef.current = [...visitedBeatsRef.current, currentBeatId];
+    setSession((s) => {
+      if (!s) return s;
+      return {
+        ...s,
+        history: s.history.map((h, i, arr) =>
+          i === arr.length - 1
+            ? { ...h, visitedBeatIds: [...visitedBeatsRef.current] }
+            : h,
+        ),
+      };
+    });
+  }, [currentBeatId]);
+
+  // ── Presentation mode toggle ─────────────────────────────────────────
  const togglePresentation = useCallback(async () => {
    const entering = !presentation;
    if (entering) {
@@ -43,14 +240,12 @@ function PlayInner() {
          await document.documentElement.requestFullscreen();
        }
      } catch {
-        // Browser may refuse fullscreen — still enter chrome-less mode
+        // ignore — fall through to chrome-less mode anyway
      }
      setPresentation(true);
    } else {
      try {
-        if (document.fullscreenElement) {
-          await document.exitFullscreen();
-        }
+        if (document.fullscreenElement) await document.exitFullscreen();
      } catch {
        // ignore
      }
@@ -69,10 +264,7 @@ function PlayInner() {
      }
    }
    function onFullscreenChange() {
-      // Sync if user exited browser fullscreen via Esc / system gesture
-      if (!document.fullscreenElement && presentation) {
-        setPresentation(false);
-      }
+      if (!document.fullscreenElement && presentation) setPresentation(false);
    }
    window.addEventListener("keydown", onKey);
    document.addEventListener("fullscreenchange", onFullscreenChange);
@@ -82,6 +274,7 @@ function PlayInner() {
    };
  }, [togglePresentation, presentation]);

+  // ── Bootstrap: start session ─────────────────────────────────────────
  useEffect(() => {
    if (startedRef.current) return;
    startedRef.current = true;
@@ -91,9 +284,7 @@ function PlayInner() {

    if (presetId) {
      const p = PRESETS.find((x) => x.id === presetId);
-      if (p) {
-        payload = { worldSetting: p.worldSetting, styleGuide: p.styleGuide };
-      }
+      if (p) payload = { worldSetting: p.worldSetting, styleGuide: p.styleGuide };
    } else if (params.get("custom") === "1") {
      const stored = sessionStorage.getItem("yume:custom");
      if (stored) {
@@ -122,151 +313,176 @@ function PlayInner() {
          const j = (await r.json().catch(() => ({}))) as { error?: string };
          throw new Error(j.error ?? r.statusText);
        }
-        return r.json() as Promise<StartResponse>;
+        return (await r.json()) as StartResponse;
      })
      .then((data) => {
-        setSession({
+        const initial: Session = {
          id: data.sessionId,
          createdAt: Date.now(),
          worldSetting: finalPayload.worldSetting,
          styleGuide: finalPayload.styleGuide,
-          history: [{ frame: data.frame }],
-          characters: [],
-        });
-        setFrame(data.frame);
+          history: [
+            {
+              scene: data.scene,
+              visitedBeatIds: [data.scene.entryBeatId],
+            },
+          ],
+        };
+        visitedBeatsRef.current = [data.scene.entryBeatId];
+        setSession(initial);
+        setCurrentScene(data.scene);
+        setCurrentBeatId(data.scene.entryBeatId);
        setImageBase64(data.imageBase64);
        setPhase("ready");
-        setTurnNum(1);
      })
      .catch((e) => setError(String(e)));
  }, [params, router]);

-  // Prefetch next-frame candidates whenever current frame becomes ready.
-  // All three fire in parallel for fastest cache fill. NOT depending on
-  // `phase` — we don't want to abort in-flight prefetches just because
-  // the user clicked. They should continue so handleClick can await them.
+  // ── Prefetch on scene entry: L1 + recursive L2/L3 for must-pass ──────
  useEffect(() => {
-    if (!session || !frame) return;
+    const s = session;
+    const scene = currentScene;
+    if (!s || !scene) return;

-    prefetchAbortRef.current?.abort();
-    const ctrl = new AbortController();
-    prefetchAbortRef.current = ctrl;
-
-    const choices = frame.uiElements.filter((e) => e.kind === "choice");
-    const promises: Record<string, Promise<InteractResponse>> = {};
-
-    for (const choice of choices) {
-      const syntheticIntent: ClickIntent = {
-        targetId: choice.id,
-        targetLabel: choice.label,
-        reasoning: "prefetch",
+    const exits = findAllChangeSceneChoices(scene);
+    for (const choice of exits) {
+      if (choice.effect.kind !== "change-scene") continue;
+      const step: ScenePathStep = {
+        fromScene: scene,
+        // Snapshot of visited beats at prefetch start. Slight drift is OK.
+        fromVisitedBeats: [...visitedBeatsRef.current],
+        exit: {
+          choiceId: choice.id,
+          label: choice.label,
+          nextSceneSeed: choice.effect.nextSceneSeed,
+        },
      };
-      const p = fetch("/api/interact", {
-        method: "POST",
-        headers: { "Content-Type": "application/json" },
-        body: JSON.stringify({ session, intent: syntheticIntent }),
-        signal: ctrl.signal,
-      }).then(async (r) => {
-        if (!r.ok) {
-          const j = (await r.json().catch(() => ({}))) as { error?: string };
-          throw new Error(j.error ?? r.statusText);
-        }
-        return r.json() as Promise<InteractResponse>;
-      });
-      p.catch(() => {});
-      promises[choice.id] = p;
+      prefetchScenePath(poolRef.current, s, [step], 0);
    }
+  }, [currentScene?.id, session?.id]);

-    prefetchRef.current = promises;
-
+  // Abort all in-flight speculative prefetches when the page unmounts, so we
+  // stop paying for background scene/image generation. Empty deps → fires only
+  // on unmount; it must NOT run on scene transitions, which rely on
+  // consumeChoice keeping the re-rooted survivor prefetches alive.
+  useEffect(() => {
+    const pool = poolRef.current;
    return () => {
-      ctrl.abort();
+      clearPool(pool);
    };
-  }, [frame?.id, session?.id]);
+  }, []);

-  // ── Shared result applier ────────────────────────────────────────────
-  async function applyInteractResult(
-    resultPromise: Promise<InteractResponse>,
-    clickIntent: ClickIntent,
-    click?: { x: number; y: number },
-  ) {
-    const result = await resultPromise;
-    // Overwrite synthetic prefetch intent with the real click intent
-    const lastIdx = result.session.history.length - 1;
-    const patched: InteractResponse = {
-      ...result,
-      intent: clickIntent,
-      session: {
-        ...result.session,
-        history: result.session.history.map((entry, idx) =>
-          idx === lastIdx ? { ...entry, click, intent: clickIntent } : entry,
-        ),
-      },
-    };
-    const updatedHistory = [
-      ...patched.session.history,
-      { frame: patched.frame },
-    ];
-    setSession({ ...patched.session, history: updatedHistory });
-    setFrame(patched.frame);
-    setImageBase64(patched.imageBase64);
-    setIntent(clickIntent);
-    setPendingClick(null);
-    setTurnNum((t) => t + 1);
-    setPhase("ready");
+  // ── Handlers ──────────────────────────────────────────────────────────
+
+  function onAdvance() {
+    if (phase !== "ready") return;
+    const beat = currentBeatRef.current;
+    if (!beat || beat.next.type !== "continue") return;
+    setCurrentBeatId(beat.next.nextBeatId);
  }

-  // ── HTML button click — bypasses Vision entirely ──────────────────────
-  async function handleChoiceSelect(choiceId: string, label: string) {
-    if (phase !== "ready" || !session) return;
-    setPhase("interacting");
-    setIntent(null);
-
-    const clickIntent: ClickIntent = {
-      targetId: choiceId,
-      targetLabel: label,
-      reasoning: "direct-button-click",
-    };
-
-    const cacheSnapshot = prefetchRef.current;
-    const cached = cacheSnapshot[choiceId];
-
+  async function performSceneTransition(
+    source: PrefetchEntry | Promise<SceneResponse>,
+    exit: SceneExit,
+    visitedForCurrent: string[],
+    exitLabel: string,
+  ) {
+    setPhase("transitioning");
+    setPendingClick(null);
    try {
-      if (cached) {
-        // Cache hit — zero extra wait
-        await applyInteractResult(cached, clickIntent);
-      } else {
-        // Cache miss — call interact directly (no Vision roundtrip)
-        prefetchAbortRef.current?.abort();
-        const res = await fetch("/api/interact", {
-          method: "POST",
-          headers: { "Content-Type": "application/json" },
-          body: JSON.stringify({ session, intent: clickIntent }),
-        });
-        if (!res.ok) {
-          const j = (await res.json().catch(() => ({}))) as { error?: string };
-          throw new Error(j.error ?? res.statusText);
-        }
-        await applyInteractResult(
-          res.json() as Promise<InteractResponse>,
-          clickIntent,
-        );
-      }
+      const result = await ("promise" in source ? source.promise : source);
+
+      const base = sessionRef.current;
+      if (!base) throw new Error("Session lost mid-transition");
+
+      const closedHistory = base.history.map((h, i, arr) =>
+        i === arr.length - 1
+          ? { ...h, visitedBeatIds: visitedForCurrent, exit }
+          : h,
+      );
+      const newSession: Session = {
+        ...base,
+        history: [
+          ...closedHistory,
+          {
+            scene: result.scene,
+            visitedBeatIds: [result.scene.entryBeatId],
+          },
+        ],
+      };
+      visitedBeatsRef.current = [result.scene.entryBeatId];
+      setSession(newSession);
+      setCurrentScene(result.scene);
+      setCurrentBeatId(result.scene.entryBeatId);
+      setImageBase64(result.imageBase64);
+      setLastExitLabel(exitLabel);
+      setPhase("ready");
    } catch (e) {
+      if ((e as { name?: string }).name === "AbortError") {
+        setPhase("ready");
+        return;
+      }
      setError(String(e));
-      setPendingClick(null);
      setPhase("ready");
    }
  }

-  // ── Background / free-form click — still uses Vision ─────────────────
-  async function handleClick(click: { x: number; y: number }) {
-    if (phase !== "ready" || !session || !imageBase64) return;
-    setPhase("interacting");
-    setPendingClick(click);
-    setIntent(null);
+  function onSelectChoice(choice: BeatChoice) {
+    if (phase !== "ready" || !session || !currentScene) return;

-    const cacheSnapshot = prefetchRef.current;
+    if (choice.effect.kind === "advance-beat") {
+      // Pure local jump. No network. No pool changes.
+      setCurrentBeatId(choice.effect.targetBeatId);
+      return;
+    }
+
+    const visited = [...visitedBeatsRef.current];
+    const exit: SceneExit = {
+      kind: "choice",
+      choiceId: choice.id,
+      label: choice.label,
+      nextSceneSeed: choice.effect.nextSceneSeed,
+    };
+
+    const cached = consumeChoice(poolRef.current, choice.id);
+    if (cached) {
+      void performSceneTransition(cached, exit, visited, choice.label);
+      return;
+    }
+
+    // Cold path — start a fresh fetch
+    const step: ScenePathStep = {
+      fromScene: currentScene,
+      fromVisitedBeats: visited,
+      exit: {
+        choiceId: choice.id,
+        label: choice.label,
+        nextSceneSeed: choice.effect.nextSceneSeed,
+      },
+    };
+    const specSession = buildSpeculativeSession(session, [step]);
+    clearPool(poolRef.current);
+
+    const promise = (async () => {
+      const res = await fetch("/api/scene", {
+        method: "POST",
+        headers: { "Content-Type": "application/json" },
+        body: JSON.stringify({ session: specSession }),
+      });
+      if (!res.ok) {
+        const j = (await res.json().catch(() => ({}))) as { error?: string };
+        throw new Error(j.error ?? res.statusText);
+      }
+      return (await res.json()) as SceneResponse;
+    })();
+
+    void performSceneTransition(promise, exit, visited, choice.label);
+  }
+
+  async function onBackgroundClick(click: { x: number; y: number }) {
+    if (phase !== "ready" || !session || !currentScene || !imageBase64) return;
+    setPhase("vision-thinking");
+    setPendingClick(click);

    try {
      const visionRes = await fetch("/api/vision", {
@@ -280,32 +496,99 @@ function PlayInner() {
        };
        throw new Error(j.error ?? visionRes.statusText);
      }
-      const { intent: clickIntent } =
-        (await visionRes.json()) as VisionResponse;
+      const decision = (await visionRes.json()) as VisionResponse;

-      const cached = clickIntent.targetId
-        ? cacheSnapshot[clickIntent.targetId]
-        : undefined;
-
-      if (cached) {
-        await applyInteractResult(cached, clickIntent, click);
-      } else {
-        prefetchAbortRef.current?.abort();
-        const liveRes = await fetch("/api/interact", {
+      if (decision.classify === "insert-beat") {
+        setPhase("inserting-beat");
+        const insertRes = await fetch("/api/insert-beat", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
-          body: JSON.stringify({ session, intent: clickIntent, click }),
+          body: JSON.stringify({
+            session,
+            freeformAction: decision.intent.freeformAction,
+          }),
        });
-        if (!liveRes.ok) {
-          const j = (await liveRes.json().catch(() => ({}))) as {
+        if (!insertRes.ok) {
+          const j = (await insertRes.json().catch(() => ({}))) as {
            error?: string;
          };
-          throw new Error(j.error ?? liveRes.statusText);
+          throw new Error(j.error ?? insertRes.statusText);
        }
-        await applyInteractResult(
-          liveRes.json() as Promise<InteractResponse>,
-          clickIntent,
-          click,
+        const { partial } = (await insertRes.json()) as InsertBeatResponse;
+
+        const fromBeatId =
+          currentBeatRef.current?.id ?? currentScene.entryBeatId;
+        const newBeatId = `b_ins_${Date.now()}_${Math.random()
+          .toString(36)
+          .slice(2, 6)}`;
+        const newBeat: Beat = {
+          id: newBeatId,
+          narration: partial.narration,
+          speaker: partial.speaker,
+          line: partial.line,
+          next: { type: "continue", nextBeatId: fromBeatId },
+        };
+
+        const patched: Scene = {
+          ...currentScene,
+          beats: [...currentScene.beats, newBeat],
+        };
+
+        setSession((s) =>
+          s
+            ? {
+                ...s,
+                history: s.history.map((h, i, arr) =>
+                  i === arr.length - 1 ? { ...h, scene: patched } : h,
+                ),
+              }
+            : s,
+        );
+        setCurrentScene(patched);
+        setCurrentBeatId(newBeatId);
+        setLastExitLabel(decision.intent.freeformAction);
+        setPhase("ready");
+        setPendingClick(null);
+      } else {
+        const exit: SceneExit = {
+          kind: "freeform",
+          action: decision.intent.freeformAction,
+        };
+        const visited = [...visitedBeatsRef.current];
+        const base = sessionRef.current;
+        if (!base) {
+          setPhase("ready");
+          setPendingClick(null);
+          return;
+        }
+        const specSession: Session = {
+          ...base,
+          history: base.history.map((h, i, arr) =>
+            i === arr.length - 1 ? { ...h, visitedBeatIds: visited, exit } : h,
+          ),
+        };
+        clearPool(poolRef.current);
+
+        const promise = (async () => {
+          const res = await fetch("/api/scene", {
+            method: "POST",
+            headers: { "Content-Type": "application/json" },
+            body: JSON.stringify({ session: specSession }),
+          });
+          if (!res.ok) {
+            const j = (await res.json().catch(() => ({}))) as {
+              error?: string;
+            };
+            throw new Error(j.error ?? res.statusText);
+          }
+          return (await res.json()) as SceneResponse;
+        })();
+
+        await performSceneTransition(
+          promise,
+          exit,
+          visited,
+          decision.intent.freeformAction,
        );
      }
    } catch (e) {
@@ -315,6 +598,8 @@ function PlayInner() {
    }
  }

+  // ── Render ────────────────────────────────────────────────────────────
+
  if (error) {
    return (
      <div className="min-h-screen flex flex-col items-center justify-center px-8">
@@ -343,16 +628,20 @@ function PlayInner() {
        <PlayCanvas
          imageBase64={imageBase64}
          phase={phase}
-          frame={frame}
+          beat={currentBeat}
          pendingClick={pendingClick}
-          onClick={handleClick}
-          onSelectChoice={handleChoiceSelect}
+          onBackgroundClick={onBackgroundClick}
+          onAdvance={onAdvance}
+          onSelectChoice={onSelectChoice}
          fullViewport
        />
      </div>
    );
  }

+  const sceneCount = session?.history.length ?? 0;
+  const beatCount = visitedBeatsRef.current.length;
+
  return (
    <div className="min-h-screen flex flex-col">
      <header className="px-5 md:px-12 pt-6 md:pt-8 flex items-center justify-between">
@@ -364,7 +653,9 @@ function PlayInner() {
          云梦
        </Link>
        <div className="flex items-center gap-3 text-[10px] smallcaps text-clay-500 num">
-          <span>第 · {String(turnNum).padStart(3, "0")} · 帧</span>
+          <span>第 · {String(sceneCount).padStart(3, "0")} · 幕</span>
+          <span className="text-clay-300">·</span>
+          <span>{String(beatCount).padStart(3, "0")} · 拍</span>
          <span className="text-clay-300">·</span>
          <span className="hidden sm:inline truncate max-w-[180px]">
            {session?.id.slice(2, 14) ?? "—"}
@@ -376,22 +667,23 @@ function PlayInner() {
        <PlayCanvas
          imageBase64={imageBase64}
          phase={phase}
-          frame={frame}
+          beat={currentBeat}
          pendingClick={pendingClick}
-          onClick={handleClick}
-          onSelectChoice={handleChoiceSelect}
+          onBackgroundClick={onBackgroundClick}
+          onAdvance={onAdvance}
+          onSelectChoice={onSelectChoice}
        />

        <div className="mt-4 max-w-md w-full text-center min-h-[28px] flex items-center justify-center">
          {phase === "loading-first" && (
            <p className="text-[10px] smallcaps text-clay-500 animate-slow-pulse">
-              正 · 在 · 唤 · 起 · 第 · 一 · 帧
+              正 · 在 · 唤 · 起 · 第 · 一 · 幕
            </p>
          )}
-          {phase === "ready" && intent?.targetLabel && (
+          {phase === "ready" && lastExitLabel && (
            <p className="text-[9px] smallcaps text-clay-400 animate-fade-in">
              <span className="mr-2">上 · 一 · 步 ·</span>
-              <span className="text-clay-600">{intent.targetLabel}</span>
+              <span className="text-clay-600">{lastExitLabel}</span>
            </p>
          )}
        </div>
@@ -1,34 +1,70 @@
 "use client";

-import { useEffect, useRef, useState } from "react";
-import type { StoryFrame } from "@yume/types";
+import { useCallback, useEffect, useRef, useState } from "react";
+import type { Beat, BeatChoice } from "@yume/types";

-export type Phase = "loading-first" | "ready" | "interacting";
+export type Phase =
+  | "loading-first"        // first scene not yet rendered
+  | "ready"                // current beat is interactive
+  | "vision-thinking"      // background click → waiting on vision verdict
+  | "inserting-beat"       // vision-driven beat being generated
+  | "transitioning";       // changing scenes (cache miss or speculative wait)

 const SHADOW =
  "0 1px 0 rgba(45,24,16,0.05), 0 36px 64px -28px rgba(45,24,16,0.25), 0 8px 18px -6px rgba(45,24,16,0.10)";

 // ── Typewriter hook ────────────────────────────────────────────────────
-function useTypewriter(text: string, speed = 28): string {
+// Returns the progressively-revealed text, a `done` flag, and a `skip()` that
+// instantly completes the current text. Reset is keyed by `resetKey` (the beat
+// id) rather than the text, so a new beat whose line happens to match the
+// previous one still replays from scratch. `done` is derived synchronously
+// (not from a post-paint effect) so a stale "done" frame never paints.
+function useTypewriter(
+  text: string,
+  resetKey: string,
+  speed = 28,
+): { shown: string; done: boolean; skip: () => void } {
  const [displayed, setDisplayed] = useState("");
-  const textRef = useRef(text);
+  const [prevKey, setPrevKey] = useState(resetKey);
+  const timer = useRef<ReturnType<typeof setInterval> | null>(null);
+
+  // Render-phase reset (React "adjust state on prop change" pattern): when the
+  // beat changes, drop the old progress before this render commits.
+  if (resetKey !== prevKey) {
+    setPrevKey(resetKey);
+    setDisplayed("");
+  }

  useEffect(() => {
-    // Reset immediately when the text changes
-    setDisplayed("");
-    textRef.current = text;
    if (!text) return;
-
    let i = 0;
-    const id = setInterval(() => {
+    timer.current = setInterval(() => {
      i += 1;
      setDisplayed(text.slice(0, i));
-      if (i >= text.length) clearInterval(id);
+      if (i >= text.length && timer.current) {
+        clearInterval(timer.current);
+        timer.current = null;
+      }
    }, speed);
-    return () => clearInterval(id);
-  }, [text, speed]);
+    return () => {
+      if (timer.current) clearInterval(timer.current);
+      timer.current = null;
+    };
+  }, [resetKey, text, speed]);

-  return displayed;
+  const skip = useCallback(() => {
+    if (timer.current) {
+      clearInterval(timer.current);
+      timer.current = null;
+    }
+    setDisplayed(text);
+  }, [text]);
+
+  // During the throwaway render where the beat just changed, `displayed` still
+  // holds the previous beat's text — coerce it to empty so nothing stale shows.
+  const shown = resetKey === prevKey ? displayed : "";
+  const done = text.length === 0 || shown.length >= text.length;
+  return { shown, done, skip };
 }
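
The last three lines of `useTypewriter` compute `shown` and `done` during render instead of in an effect. A minimal sketch of that derivation as a pure function may make the stale-frame argument easier to check. `deriveTypewriterView` and `TypewriterView` are hypothetical names for illustration and are not part of this commit.

```typescript
// Hypothetical helper, not part of the commit: the derivation useTypewriter
// performs at the end of each render, written as a pure function.
type TypewriterView = { shown: string; done: boolean };

function deriveTypewriterView(
  text: string, // full line for the beat being rendered
  displayed: string, // progress state, possibly left over from the previous beat
  resetKey: string, // id of the beat being rendered
  prevKey: string, // id the progress state was recorded for
): TypewriterView {
  // If the keys differ, the progress belongs to another beat, so show nothing.
  const shown = resetKey === prevKey ? displayed : "";
  const done = text.length === 0 || shown.length >= text.length;
  return { shown, done };
}

// Two consecutive beats with identical text: stale progress must not count as done.
const staleView = deriveTypewriterView("你好", "你好", "b2", "b1");
const finishedView = deriveTypewriterView("你好", "你好", "b2", "b2");
console.log(staleView, finishedView);
```

Keying on the beat id rather than the text is what lets the first call report `done: false` even though the leftover progress happens to equal the new line.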

 // ── Choice button ──────────────────────────────────────────────────────
@@ -59,7 +95,6 @@ function ChoiceButton({
        boxShadow: "0 2px 12px rgba(0,0,0,0.4), inset 0 1px 0 rgba(200,165,90,0.12)",
      }}
    >
-      {/* Hover shimmer overlay */}
      <span
        className="absolute inset-0 rounded-[5px] opacity-0 group-hover:opacity-100 transition-opacity duration-200 pointer-events-none"
        style={{
@@ -89,49 +124,59 @@ function ChoiceButton({
 export function PlayCanvas({
  imageBase64,
  phase,
-  frame,
+  beat,
  pendingClick,
-  onClick,
+  onBackgroundClick,
+  onAdvance,
  onSelectChoice,
  fullViewport = false,
 }: {
  imageBase64: string | null;
  phase: Phase;
-  frame: StoryFrame | null;
+  beat: Beat | null;
  pendingClick: { x: number; y: number } | null;
-  onClick: (click: { x: number; y: number }) => void;
-  onSelectChoice?: (choiceId: string, label: string) => void;
+  onBackgroundClick: (click: { x: number; y: number }) => void;
+  onAdvance: () => void;
+  onSelectChoice: (choice: BeatChoice) => void;
  fullViewport?: boolean;
 }) {
  const imgRef = useRef<HTMLImageElement>(null);
  const [dims, setDims] = useState<{ w: number; h: number } | null>(null);

-  const choices = frame?.uiElements.filter((e) => e.kind === "choice") ?? [];
-  const dialogueText = frame
-    ? [frame.speaker ? `${frame.speaker}：${frame.line ?? ""}` : frame.line, frame.narration]
-        .filter(Boolean)
-        .join("\n")
-    : "";
-  const narrationOnly = !frame?.speaker && !frame?.line && !!frame?.narration;
-  const displayBody = frame?.speaker
-    ? frame.line ?? ""
-    : frame?.narration ?? "";
+  const isChoiceBeat = beat?.next.type === "choice";
+  const choices: BeatChoice[] = isChoiceBeat
+    ? (beat!.next as { type: "choice"; choices: BeatChoice[] }).choices
+    : [];

-  const typedBody = useTypewriter(displayBody, 30);
+  const displayBody = beat?.speaker ? beat.line ?? "" : beat?.narration ?? "";
+  const { shown: typedBody, done: typingDone, skip: skipTypewriter } =
+    useTypewriter(displayBody, beat?.id ?? "", 30);

-  function handleClick(e: React.MouseEvent<HTMLImageElement>) {
-    if (phase !== "ready" || !imgRef.current) return;
+  function handleImageClick(e: React.MouseEvent<HTMLImageElement>) {
+    if (phase !== "ready" || !imgRef.current || !beat) return;
    const rect = imgRef.current.getBoundingClientRect();
    const x = (e.clientX - rect.left) / rect.width;
    const y = (e.clientY - rect.top) / rect.height;
-    onClick({
+    // If the typewriter is still printing, a click completes it instantly
+    // (standard VN affordance) — the page never sees this click.
+    if (!typingDone) {
+      skipTypewriter();
+      return;
+    }
+    // On a continue-type beat, an image click advances; on a choice beat,
+    // it goes through vision and is treated as a freeform action.
+    if (beat.next.type === "continue") {
+      onAdvance();
+      return;
+    }
+    onBackgroundClick({
      x: Math.max(0, Math.min(1, x)),
      y: Math.max(0, Math.min(1, y)),
    });
  }

  const interactive = phase === "ready" && !!imageBase64;
-  const dimmed = phase === "interacting";
+  const dimmed = phase === "transitioning";

  const sizeStyle = fullViewport
    ? { maxWidth: "100vw", maxHeight: "100dvh" }
@@ -141,6 +186,13 @@ export function PlayCanvas({
    ? "min(100vw, calc(100dvh * 16 / 9))"
    : "min(96vw, calc((100dvh - 200px) * 16 / 9))";

+  const footerHint =
+    phase === "ready"
+      ? isChoiceBeat
+        ? "选 · 择 · 一 · 项"
+        : "点 · 击 · 推 · 进"
+      : "···";
+
  return (
    <div
      className={`flex flex-col items-center ${fullViewport ? "w-full h-full justify-center" : "w-full"}`}
@@ -150,13 +202,13 @@ export function PlayCanvas({
          className="relative inline-block"
          style={{ boxShadow: fullViewport ? "none" : SHADOW }}
        >
-          {/* ── Background image ── */}
+          {/* Background image */}
          <img
            key={imageBase64.slice(-48)}
            ref={imgRef}
            src={`data:image/png;base64,${imageBase64}`}
-            alt="Generated frame"
-            onClick={handleClick}
+            alt="Generated scene"
+            onClick={handleImageClick}
            onLoad={(e) => {
              const img = e.currentTarget;
              setDims({ w: img.naturalWidth, h: img.naturalHeight });
@@ -168,37 +220,27 @@ export function PlayCanvas({
            style={sizeStyle}
          />

-          {/* ── Top/bottom gradient vignette ── */}
          {!fullViewport && (
-            <>
-              <div className="absolute inset-x-0 top-0 h-10 bg-gradient-to-b from-clay-900/12 to-transparent pointer-events-none" />
-            </>
+            <div className="absolute inset-x-0 top-0 h-10 bg-gradient-to-b from-clay-900/12 to-transparent pointer-events-none" />
          )}

-          {/* ══════════════════════════════════════════════════════════
-              PREFAB UI OVERLAY — rendered on top of image
-          ══════════════════════════════════════════════════════════ */}
-          {frame && (
+          {beat && (
            <div className="absolute inset-0 flex flex-col justify-end pointer-events-none select-none">
-              {/* ── Choices row ── */}
              {choices.length > 0 && (
-                <div
-                  className="pointer-events-auto px-[3%] pb-[1.5%] flex gap-[1.5%] items-stretch"
-                >
+                <div className="pointer-events-auto px-[3%] pb-[1.5%] flex gap-[1.5%] items-stretch">
                  {choices.map((choice, i) => (
                    <ChoiceButton
                      key={choice.id}
                      index={i}
                      label={choice.label}
                      disabled={phase !== "ready"}
-                      onClick={() => onSelectChoice?.(choice.id, choice.label)}
+                      onClick={() => onSelectChoice(choice)}
                    />
                  ))}
                </div>
              )}

-              {/* ── Dialogue / narration box ── */}
-              {(frame.narration || frame.line) && (
+              {(beat.narration || beat.line) && (
                <div
                  className="pointer-events-none mx-[2%] mb-[2%] px-[3%] py-[2.2%] relative"
                  style={{
@@ -211,7 +253,6 @@ export function PlayCanvas({
                      "0 4px 24px rgba(0,0,0,0.55), inset 0 1px 0 rgba(200,165,90,0.10)",
                  }}
                >
-                  {/* Inner golden corner decoration */}
                  <span
                    className="absolute top-[6px] left-[8px] text-[10px] opacity-40 pointer-events-none"
                    style={{ color: "rgba(195,155,75,1)" }}
@@ -227,56 +268,57 @@ export function PlayCanvas({
                    ✦
                  </span>

-                  {/* Speaker name tag */}
-                  {frame.speaker && (
+                  {beat.speaker && (
                    <p
                      className="font-serif text-[11px] md:text-[12px] smallcaps mb-[0.6em]"
                      style={{ color: "rgba(205,165,90,0.92)" }}
                    >
-                      {frame.speaker}
+                      {beat.speaker}
                    </p>
                  )}

-                  {/* Main text */}
                  <p
                    className="font-serif leading-[1.85] text-[13px] md:text-[15px]"
                    style={{ color: "rgba(245,235,210,0.95)" }}
                  >
                    {typedBody}
-                    {/* Narration only — also show secondary line */}
-                    {frame.speaker && frame.narration && (
+                    {beat.speaker && beat.narration && (
                      <span
-                        className="block mt-[0.5em] italic text-[12px] md:text-[13px]"
+                        className={`block mt-[0.5em] italic text-[12px] md:text-[13px] transition-opacity duration-300 ${
+                          typingDone ? "opacity-100" : "opacity-0"
+                        }`}
                        style={{ color: "rgba(200,185,155,0.78)" }}
+                        aria-hidden={!typingDone}
                      >
-                        {frame.narration}
+                        {beat.narration}
                      </span>
                    )}
                  </p>

-                  {/* Scroll hint ▼ */}
-                  <span
-                    className="absolute bottom-[6px] right-[10px] text-[10px] animate-slow-pulse"
-                    style={{ color: "rgba(195,155,75,0.7)" }}
-                    aria-hidden
-                  >
-                    ▼
-                  </span>
+                  {typingDone && beat.next.type === "continue" && (
+                    <span
+                      className="absolute bottom-[6px] right-[10px] text-[10px] animate-slow-pulse"
+                      style={{ color: "rgba(195,155,75,0.7)" }}
+                      aria-hidden
+                    >
+                      ▼
+                    </span>
+                  )}
                </div>
              )}
            </div>
          )}

-          {/* Loading/interacting dim overlay */}
-          {phase === "interacting" && (
+          {(phase === "transitioning" || phase === "inserting-beat") && (
            <div className="absolute inset-0 flex items-center justify-center pointer-events-none">
              <p className="text-[10px] smallcaps text-cream-50/70 animate-slow-pulse">
-                AI · 正 · 在 · 描 · 画 · 下 · 一 · 刻
+                {phase === "transitioning"
+                  ? "AI · 正 · 在 · 描 · 画 · 下 · 一 · 幕"
+                  : "AI · 正 · 在 · 想 · 你 · 看 · 到 · 了 · 什 · 么"}
              </p>
            </div>
          )}

-          {/* Click ripple indicator */}
          {pendingClick && (
            <>
              <div
@@ -317,7 +359,7 @@ export function PlayCanvas({
        >
          <div className="w-1.5 h-1.5 bg-clay-500 rounded-full animate-slow-pulse" />
          <p className="text-[9px] smallcaps text-clay-500 animate-slow-pulse">
-            正 · 在 · 绘 · 制 · 第 · 一 · 帧
+            正 · 在 · 绘 · 制 · 第 · 一 · 幕
          </p>
        </div>
      )}
@@ -330,9 +372,7 @@ export function PlayCanvas({
          <span className="text-[9px] smallcaps text-clay-400 num">
            {dims ? `${dims.w} × ${dims.h} · png` : "—"}
          </span>
-          <span className="text-[9px] smallcaps text-clay-400">
-            {phase === "ready" ? (choices.length > 0 ? "选 · 择 · 一 · 项" : "任 · 意 · 点 · 击") : "···"}
-          </span>
+          <span className="text-[9px] smallcaps text-clay-400">{footerHint}</span>
        </div>
      )}
    </div>
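
The routing in `handleImageClick` above has four outcomes that depend on the phase, the typewriter state, and the beat's `next.type`. The sketch below restates that decision order as a pure function so it can be tested without a DOM. `routeImageClick`, `ClickOutcome`, and `clamp01` are hypothetical names, and the function is an illustration rather than code from this commit.

```typescript
// Hypothetical pure restatement of the decision order in handleImageClick.
type NextKind = "continue" | "choice";

type ClickOutcome =
  | { kind: "ignored" }
  | { kind: "skip-typewriter" }
  | { kind: "advance" }
  | { kind: "background-click"; x: number; y: number };

const clamp01 = (v: number): number => Math.max(0, Math.min(1, v));

function routeImageClick(input: {
  ready: boolean; // phase === "ready" and the image ref is mounted
  typingDone: boolean; // typewriter has revealed the full line
  next: NextKind | null; // null when there is no current beat
  x: number; // normalized click position, may fall slightly outside 0..1
  y: number;
}): ClickOutcome {
  if (!input.ready || input.next === null) return { kind: "ignored" };
  // A click during typing only completes the text.
  if (!input.typingDone) return { kind: "skip-typewriter" };
  // Continue beats advance locally, with no network call.
  if (input.next === "continue") return { kind: "advance" };
  // Choice beats hand the clamped coordinates to the vision path.
  return { kind: "background-click", x: clamp01(input.x), y: clamp01(input.y) };
}

const midTyping = routeImageClick({ ready: true, typingDone: false, next: "continue", x: 0.5, y: 0.5 });
const onChoice = routeImageClick({ ready: true, typingDone: true, next: "choice", x: 1.2, y: -0.1 });
console.log(midTyping.kind, onChoice);
```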
@@ -1,20 +1,239 @@
 import { chat } from "@yume/ai-client";
-import type { ProviderConfig, Session, StoryFrame, UIElement } from "@yume/types";
+import type {
+  Beat,
+  BeatChoice,
+  BeatChoiceEffect,
+  BeatNext,
+  ProviderConfig,
+  Scene,
+  Session,
+} from "@yume/types";
 import { parseJsonLoose } from "./jsonParser";
-import { DIRECTOR_SYSTEM, buildDirectorUserMessage } from "./prompts";
+import {
+  DIRECTOR_SYSTEM,
+  INSERT_BEAT_SYSTEM,
+  buildDirectorUserMessage,
+  buildInsertBeatUserMessage,
+} from "./prompts";

-type DirectorOutput = {
+// ──────────────────────────────────────────────────────────────────────
+//  Raw shape produced by the model — we coerce + validate into a Scene.
+// ──────────────────────────────────────────────────────────────────────
+
+type RawEffect = {
+  kind?: string;
+  targetBeatId?: string;
+  nextSceneSeed?: string;
+};
+
+type RawChoice = {
+  id?: string;
+  label?: string;
+  effect?: RawEffect;
+};
+
+type RawNext = {
+  type?: string;
+  nextBeatId?: string;
+  choices?: RawChoice[];
+};
+
+type RawBeat = {
+  id?: string;
  narration?: string;
  speaker?: string;
  line?: string;
-  scenePrompt: string;
-  uiElements: UIElement[];
+  next?: RawNext;
 };

-export async function direct(
+type RawScene = {
+  scenePrompt?: string;
+  entryBeatId?: string;
+  beats?: RawBeat[];
+};
+
+function coerceEffect(raw: RawEffect | undefined): BeatChoiceEffect {
+  if (raw?.kind === "advance-beat" && raw.targetBeatId?.trim()) {
+    return { kind: "advance-beat", targetBeatId: raw.targetBeatId.trim() };
+  }
+  return {
+    kind: "change-scene",
+    nextSceneSeed: raw?.nextSceneSeed?.trim() || "未指定",
+  };
+}
+
+function coerceChoice(raw: RawChoice, idx: number): BeatChoice {
+  return {
+    id: raw.id?.trim() || `c${idx + 1}`,
+    label: raw.label?.trim() || `选项 ${idx + 1}`,
+    effect: coerceEffect(raw.effect),
+  };
+}
+
+function coerceNext(raw: RawNext | undefined, fallbackBeatId: string): BeatNext {
+  if (raw?.type === "choice" && Array.isArray(raw.choices) && raw.choices.length) {
+    return {
+      type: "choice",
+      choices: raw.choices.map((c, i) => coerceChoice(c, i)),
+    };
+  }
+  return {
+    type: "continue",
+    nextBeatId: raw?.nextBeatId?.trim() || fallbackBeatId,
+  };
+}
+
+function coerceBeat(raw: RawBeat, idx: number, totalBeats: number): Beat {
+  const id = raw.id?.trim() || `b${idx + 1}`;
+  // Non-last beats default their `continue` target to the following beat.
+  // The last beat gets an empty fallback on purpose: repairBeats() turns a
+  // last/dangling continue into a real scene-change exit so the player can
+  // never get stuck self-looping on it.
+  const fallback = idx + 1 < totalBeats ? `b${idx + 2}` : "";
+  return {
+    id,
+    narration: raw.narration?.trim() || undefined,
+    speaker: raw.speaker?.trim() || undefined,
+    line: raw.line?.trim() || undefined,
+    next: coerceNext(raw.next, fallback),
+  };
+}
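
The coercers above reach `.trim()` through optional chaining, which covers a missing field but assumes that a present field is a string. Unless `parseJsonLoose` validates field types (this diff does not show whether it does), a numeric `id` from the model would make `raw.id?.trim()` throw. A minimal sketch of a stricter guard follows. `asTrimmed` is a hypothetical helper, not something this commit defines.

```typescript
// Hypothetical guard, not part of the commit: accept only non-empty strings.
function asTrimmed(value: unknown): string | undefined {
  if (typeof value !== "string") return undefined;
  const trimmed = value.trim();
  return trimmed.length > 0 ? trimmed : undefined;
}

// Loosely typed model output, as it might arrive after JSON parsing.
const rawBeat: { id?: unknown; line?: unknown } = { id: 7, line: "  你来了。 " };

// Same fallback style as coerceBeat, but safe against non-string values.
const beatId = asTrimmed(rawBeat.id) ?? "b1";
const beatLine = asTrimmed(rawBeat.line);
console.log(beatId, beatLine);
```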
+
+const FALLBACK_SEED = "故事继续推进";
+
+function fallbackExitChoice(beatId: string): BeatChoice {
+  return {
+    id: `${beatId}__exit`,
+    label: "继续",
+    effect: { kind: "change-scene", nextSceneSeed: FALLBACK_SEED },
+  };
+}
+
+// Beat ids are graph keys (the front-end's `beats.find(b => b.id === ...)`,
+// the session's `visitedBeatIds`, and `continue`/`advance-beat` targets). If
+// the model reuses an id across beats, the second occurrence becomes silently
+// unreachable and external references collapse to the first beat. Rename
+// duplicates; rewrite the renamed beat's OWN self-references (the most
+// natural interpretation of a duplicate id being referenced from inside that
+// same beat). External references stay pointing at the first occurrence.
+function ensureUniqueBeatIds(beats: Beat[]): Beat[] {
+  const seen = new Set<string>();
+  return beats.map((b): Beat => {
+    if (!seen.has(b.id)) {
+      seen.add(b.id);
+      return b;
+    }
+    const oldId = b.id;
+    let n = 2;
+    while (seen.has(`${oldId}_${n}`)) n += 1;
+    const newId = `${oldId}_${n}`;
+    seen.add(newId);
+
+    let next = b.next;
+    if (next.type === "continue" && next.nextBeatId === oldId) {
+      next = { type: "continue", nextBeatId: newId };
+    } else if (next.type === "choice") {
+      next = {
+        type: "choice",
+        choices: next.choices.map((c) =>
+          c.effect.kind === "advance-beat" && c.effect.targetBeatId === oldId
+            ? {
+                ...c,
+                effect: { kind: "advance-beat" as const, targetBeatId: newId },
+              }
+            : c,
+        ),
+      };
+    }
+    return { ...b, id: newId, next };
+  });
+}
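
The renaming scheme used by `ensureUniqueBeatIds` (first occurrence keeps its id, later ones get `_2`, `_3`, and so on, skipping suffixes already taken) can be seen in isolation in the sketch below. `uniquifyIds` is a hypothetical helper that works on bare strings and leaves out the self-reference rewriting done above.

```typescript
// Hypothetical helper, not part of the commit: the suffixing scheme on plain ids.
function uniquifyIds(ids: string[]): string[] {
  const seen = new Set<string>();
  return ids.map((id) => {
    if (!seen.has(id)) {
      seen.add(id);
      return id;
    }
    // Find the first free suffix, starting at _2.
    let n = 2;
    while (seen.has(`${id}_${n}`)) n += 1;
    const renamed = `${id}_${n}`;
    seen.add(renamed);
    return renamed;
  });
}

// The last entry collides with an id that an earlier rename already produced.
const renamedIds = uniquifyIds(["b1", "b2", "b1", "b1", "b1_2"]);
console.log(renamedIds);
```

Because renamed ids are added to the same `seen` set, a later beat whose original id matches an earlier rename is itself renamed instead of colliding.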
+
+// Repairs referential integrity AND guarantees the scene is escapable:
+// - a `continue` to a missing/self id is repointed to the next beat in order;
+//   a last/dangling continue with nowhere to go becomes a scene-change exit
+//   (never a self-loop, which would strand the player on "click to advance")
+// - an `advance-beat` to a missing id is downgraded to a scene change
+// - if no change-scene exit exists anywhere, one is appended to the last beat
+function repairBeats(beats: Beat[]): Beat[] {
+  const ids = new Set(beats.map((b) => b.id));
+
+  const fixed: Beat[] = beats.map((b, idx): Beat => {
+    if (b.next.type === "continue") {
+      const target = b.next.nextBeatId;
+      if (ids.has(target) && target !== b.id) return b;
+      const nextByIndex = beats[idx + 1]?.id;
+      if (nextByIndex) {
+        return { ...b, next: { type: "continue", nextBeatId: nextByIndex } };
+      }
+      return { ...b, next: { type: "choice", choices: [fallbackExitChoice(b.id)] } };
+    }
+
+    const patched = b.next.choices.map((c) =>
+      c.effect.kind === "advance-beat" && !ids.has(c.effect.targetBeatId)
+        ? {
+            ...c,
+            effect: {
+              kind: "change-scene" as const,
+              nextSceneSeed: "未指定（导演引用不存在的 beat，已降级为换场）",
+            },
+          }
+        : c,
+    );
+    return { ...b, next: { type: "choice", choices: patched } };
+  });
+
+  const hasExit = fixed.some(
+    (b) =>
+      b.next.type === "choice" &&
+      b.next.choices.some((c) => c.effect.kind === "change-scene"),
+  );
+  if (!hasExit && fixed.length > 0) {
+    const lastIdx = fixed.length - 1;
+    const last = fixed[lastIdx]!;
+    const existing = last.next.type === "choice" ? last.next.choices : [];
+    fixed[lastIdx] = {
+      ...last,
+      next: { type: "choice", choices: [...existing, fallbackExitChoice(last.id)] },
+    };
+  }
+
+  return fixed;
+}
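
The guarantees listed in the comment on `repairBeats` can also be written as a checker, which is handy in a unit test that runs after repair. The sketch below is an assumption-laden illustration: the `Beat`, `Next`, `Choice`, and `Effect` types are inferred from how this diff uses them, not copied from `@yume/types`, and `findIntegrityProblems` is a hypothetical name.

```typescript
// Types inferred from usage in this diff; the real definitions live in @yume/types.
type Effect =
  | { kind: "advance-beat"; targetBeatId: string }
  | { kind: "change-scene"; nextSceneSeed: string };
type Choice = { id: string; label: string; effect: Effect };
type Next =
  | { type: "continue"; nextBeatId: string }
  | { type: "choice"; choices: Choice[] };
type Beat = { id: string; next: Next };

// Hypothetical checker: lists every violation of the invariants repairBeats targets.
function findIntegrityProblems(beats: Beat[]): string[] {
  const ids = new Set(beats.map((b) => b.id));
  const problems: string[] = [];
  let hasExit = false;
  for (const b of beats) {
    if (b.next.type === "continue") {
      const target = b.next.nextBeatId;
      if (target === b.id) problems.push(`${b.id}: continue loops to itself`);
      else if (!ids.has(target)) problems.push(`${b.id}: continue targets missing beat`);
      continue;
    }
    for (const c of b.next.choices) {
      if (c.effect.kind === "change-scene") hasExit = true;
      else if (!ids.has(c.effect.targetBeatId)) problems.push(`${b.id}/${c.id}: advance-beat targets missing beat`);
    }
  }
  if (!hasExit) problems.push("scene has no change-scene exit");
  return problems;
}

const brokenScene: Beat[] = [
  { id: "b1", next: { type: "continue", nextBeatId: "b9" } },
  { id: "b2", next: { type: "continue", nextBeatId: "b2" } },
];
// The shape repairBeats would be expected to produce for the input above.
const repairedScene: Beat[] = [
  { id: "b1", next: { type: "continue", nextBeatId: "b2" } },
  {
    id: "b2",
    next: {
      type: "choice",
      choices: [{ id: "b2__exit", label: "继续", effect: { kind: "change-scene", nextSceneSeed: "故事继续推进" } }],
    },
  },
];
console.log(findIntegrityProblems(brokenScene), findIntegrityProblems(repairedScene));
```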
+
+// Choice ids are the keys the front-end uses to cache and consume prefetched
+// scenes. Two beats both defaulting to c1/c2 (or the model reusing ids across
+// beats) would make a transition reuse the WRONG prefetched scene — so force
+// every choice id to be unique within the scene.
+function ensureUniqueChoiceIds(beats: Beat[]): Beat[] {
+  const seen = new Set<string>();
+  for (const b of beats) {
+    if (b.next.type !== "choice") continue;
+    for (const c of b.next.choices) {
+      if (seen.has(c.id)) {
+        let n = 2;
+        while (seen.has(`${c.id}_${n}`)) n += 1;
+        c.id = `${c.id}_${n}`;
+      }
+      seen.add(c.id);
+    }
+  }
+  return beats;
+}
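
The comment on `ensureUniqueChoiceIds` says choice ids key the front-end's prefetched scenes. The pool itself is not part of this section, so the sketch below is only an assumed shape (a map from choice id to a prefetched value) meant to show why a repeated id would hand back the wrong scene. `createPrefetchPool` and its methods are hypothetical, and strings stand in for scene responses.

```typescript
// Hypothetical pool, assumed shape only: one prefetched value per choice id.
function createPrefetchPool<T>() {
  const entries = new Map<string, T>();
  return {
    put(choiceId: string, value: T): void {
      entries.set(choiceId, value);
    },
    // Consume the entry so a prefetched value is used at most once.
    take(choiceId: string): T | undefined {
      const value = entries.get(choiceId);
      entries.delete(choiceId);
      return value;
    },
    size(): number {
      return entries.size;
    },
  };
}

// Two beats that both defaulted their first choice to "c1": the second put wins.
const collidingPool = createPrefetchPool<string>();
collidingPool.put("c1", "scene for beat b2, choice 1");
collidingPool.put("c1", "scene for beat b4, choice 1");
const wrongScene = collidingPool.take("c1");

// After renaming the duplicate to "c1_2", each choice finds its own scene.
const uniquePool = createPrefetchPool<string>();
uniquePool.put("c1", "scene for beat b2, choice 1");
uniquePool.put("c1_2", "scene for beat b4, choice 1");
console.log(wrongScene, uniquePool.take("c1"), uniquePool.take("c1_2"));
```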
+
+function newSceneId(): string {
+  return `scene_${Date.now()}_${Math.random().toString(36).slice(2, 6)}`;
+}
+
+// ──────────────────────────────────────────────────────────────────────
+//  directScene — generates one Scene (multi-beat) for the player.
+//  Called both on real scene transitions AND on speculative prefetch.
+// ──────────────────────────────────────────────────────────────────────
+
+export async function directScene(
  config: ProviderConfig,
  session: Session,
-): Promise<StoryFrame> {
+): Promise<Scene> {
  const raw = await chat(
    config,
    [
@@ -24,14 +243,71 @@ export async function direct(
    { temperature: 0.9, responseFormat: "json_object" },
  );

-  const parsed = parseJsonLoose<DirectorOutput>(raw);
+  const parsed = parseJsonLoose<RawScene>(raw);
+  const rawBeats = Array.isArray(parsed.beats) ? parsed.beats : [];
+  if (rawBeats.length === 0) {
+    throw new Error("Director returned no beats");
+  }
+
+  const beats = ensureUniqueChoiceIds(
+    repairBeats(
+      ensureUniqueBeatIds(
+        rawBeats.map((b, i) => coerceBeat(b, i, rawBeats.length)),
+      ),
+    ),
+  );
+
+  const declaredEntry = parsed.entryBeatId?.trim();
+  const entryBeatId =
+    declaredEntry && beats.some((b) => b.id === declaredEntry)
+      ? declaredEntry
+      : beats[0]!.id;

  return {
-    id: `frame_${Date.now()}`,
-    narration: parsed.narration?.trim() || undefined,
-    speaker: parsed.speaker?.trim() || undefined,
-    line: parsed.line?.trim() || undefined,
-    scenePrompt: parsed.scenePrompt,
-    uiElements: parsed.uiElements ?? [],
+    id: newSceneId(),
+    scenePrompt: parsed.scenePrompt?.trim() || "an empty scene",
+    beats,
+    entryBeatId,
  };
 }
+
+// ──────────────────────────────────────────────────────────────────────
+//  directInsertBeat — generates a one-off transient beat in response to
+//  a freeform vision action that stays in-scene. Used by /api/insert-beat.
+// ──────────────────────────────────────────────────────────────────────
+
+export async function directInsertBeat(
+  config: ProviderConfig,
+  session: Session,
+  freeformAction: string,
+): Promise<{ narration?: string; speaker?: string; line?: string }> {
+  const raw = await chat(
+    config,
+    [
+      { role: "system", content: INSERT_BEAT_SYSTEM },
+      {
+        role: "user",
+        content: buildInsertBeatUserMessage(session, freeformAction),
+      },
+    ],
+    { temperature: 0.9, responseFormat: "json_object" },
+  );
+
+  const parsed = parseJsonLoose<{
+    narration?: string;
+    speaker?: string;
+    line?: string;
+  }>(raw);
+
+  const narration = parsed.narration?.trim() || undefined;
+  const speaker = parsed.speaker?.trim() || undefined;
+  const line = parsed.line?.trim() || undefined;
+
+  // If the model returned nothing usable, supply a fallback narration so the
+  // frontend doesn't append a silent empty beat that renders no dialogue —
+  // which would make the click appear to do nothing.
+  if (!narration && !speaker && !line) {
+    return { narration: "（你停下脚步，环视片刻。）" };
+  }
+  return { narration, speaker, line };
+}
@@ -1,3 +1,8 @@
-export { startSession, takeTurn, visionTurn } from "./orchestrator";
+export {
+  startSession,
+  requestScene,
+  visionDecide,
+  requestInsertBeat,
+} from "./orchestrator";
 export { annotateClick } from "./annotate";
 export * from "./prompts";
@@ -1,8 +1,9 @@
 import type {
-  ClickIntent,
  EngineConfig,
-  InteractRequest,
-  InteractResponse,
+  InsertBeatRequest,
+  InsertBeatResponse,
+  SceneRequest,
+  SceneResponse,
  Session,
  StartRequest,
  StartResponse,
@@ -10,7 +11,7 @@ import type {
  VisionResponse,
 } from "@yume/types";
 import { annotateClick } from "./annotate";
-import { direct } from "./director";
+import { directInsertBeat, directScene } from "./director";
 import { render } from "./renderer";
 import { interpret } from "./vision";

@@ -18,6 +19,10 @@ function newSessionId(): string {
  return `s_${Date.now()}_${Math.random().toString(36).slice(2, 8)}`;
 }

+// ──────────────────────────────────────────────────────────────────────
+//  startSession — first scene + image
+// ──────────────────────────────────────────────────────────────────────
+
 export async function startSession(
  config: EngineConfig,
  req: StartRequest,
@@ -30,51 +35,56 @@ export async function startSession(
    history: [],
  };

-  const frame = await direct(config.text, session);
-  const imageBase64 = await render(config.image, frame, session.styleGuide);
+  const scene = await directScene(config.text, session);
+  const imageBase64 = await render(config.image, scene, session.styleGuide);

  return {
    sessionId: session.id,
-    frame,
+    scene,
    imageBase64,
  };
 }

-export async function visionTurn(
+// ──────────────────────────────────────────────────────────────────────
+//  requestScene — generate the NEXT scene + image.
+//  Frontend passes a session whose latest history entry has `exit` set.
+//  Also used for prefetch speculation (frontend synthesizes the exit).
+// ──────────────────────────────────────────────────────────────────────
+
+export async function requestScene(
+  config: EngineConfig,
+  req: SceneRequest,
+): Promise<SceneResponse> {
+  const scene = await directScene(config.text, req.session);
+  const imageBase64 = await render(config.image, scene, req.session.styleGuide);
+  return { scene, imageBase64 };
+}
+
+// ──────────────────────────────────────────────────────────────────────
+//  visionDecide — interprets a background click into an intent plus a classification.
+// ──────────────────────────────────────────────────────────────────────
+
+export async function visionDecide(
  config: EngineConfig,
  req: VisionRequest,
 ): Promise<VisionResponse> {
  const annotated = await annotateClick(req.prevImageBase64, req.click);
-  const lastFrame = req.session.history.at(-1)?.frame;
-  const uiElements = lastFrame?.uiElements ?? [];
-  const intent = await interpret(config.vision, annotated, uiElements);
-  return { intent };
+  const current = req.session.history.at(-1)?.scene ?? null;
+  return interpret(config.vision, annotated, current);
 }

-export async function takeTurn(
+// ──────────────────────────────────────────────────────────────────────
+//  requestInsertBeat — generates a transient in-scene beat (no image regen)
+// ──────────────────────────────────────────────────────────────────────
+
+export async function requestInsertBeat(
  config: EngineConfig,
-  req: InteractRequest,
-): Promise<InteractResponse> {
-  const updatedSession: Session = {
-    ...req.session,
-    history: req.session.history.map((entry, idx, arr) =>
-      idx === arr.length - 1
-        ? { ...entry, click: req.click, intent: req.intent }
-        : entry,
-    ),
-  };
-
-  const nextFrame = await direct(config.text, updatedSession);
-  const nextImage = await render(
-    config.image,
-    nextFrame,
-    updatedSession.styleGuide,
+  req: InsertBeatRequest,
+): Promise<InsertBeatResponse> {
+  const partial = await directInsertBeat(
+    config.text,
+    req.session,
+    req.freeformAction,
  );
-
-  return {
-    session: updatedSession,
-    frame: nextFrame,
-    imageBase64: nextImage,
-    intent: req.intent,
-  };
+  return { partial };
 }
@@ -1,28 +1,76 @@
-import type { Character, Session, StoryFrame, UIElement } from "@yume/types";
+import type { Scene, Session } from "@yume/types";

+// ──────────────────────────────────────────────────────────────────────
+//  Director — emits one Scene (background + a graph of beats) at a time.
+// ──────────────────────────────────────────────────────────────────────

-export const DIRECTOR_SYSTEM = `你是一个交互视觉小说的编剧导演。每次根据世界观、画风和历史，输出当前画面要呈现的内容。
+export const DIRECTOR_SYSTEM = `你是一个交互视觉小说的「场景导演」。每次基于世界观、画风、玩家历史，输出**一个完整的场景**。
+
+一个场景包含：
+- 一张背景图（你给出英文 scenePrompt）
+- 一组对话节拍 beats，玩家会按顺序经历它们
+
+每个 beat 是玩家会看到的一段叙述 / 对话 / 选择。beat 之间通过 next 字段连接：
+- "continue": 玩家点击图片背景 / 按继续，自然推进到下一个 beat
+- "choice": 在此让玩家做选择，按所选 choice 的 effect 走向
+
+choice 的 effect 有两种：
+- "advance-beat": 玩家选了之后跳到**同场景内**的另一个 beat（不换背景图，速度极快）
+- "change-scene": 玩家选了之后切换到**新场景**（视角变了 / 走到新地方 / 时间跳了）
+
+设计原则：
+- 同场景内 beat 数自由发挥，按剧情节奏自然给出（通常 2–6 个，可以更多）
+- 多用 continue，少用 choice — 选择只应出现在「真正的岔路口」
+- advance-beat 适合处理对话分支（同一场景里换个话题、追问、撒娇）
+- change-scene 适合空间/时间跳跃（出门、转身看窗外、第二天清晨）
+- 一个场景至少要有一个 change-scene 出口（除非真到结局）
+- 每个 change-scene 必须带 nextSceneSeed —— 一句中文简述「下一场是哪里、谁在、要发生什么」，用来引导下一次导演调用
+- 同一场景的 beat id 互不重复
+- next.nextBeatId 引用的 beat 必须存在
+- choice 至少 2 个，至多 4 个，互不重复
+
+文本风格约束：
+- narration / line 用中文，scenePrompt 用英文
+- 单个 beat 的 narration 与 line 加起来 ≤80 字
+- 单个 choice label ≤15 字
+- scenePrompt 只描述画面里看到什么，不要描述 UI

 必须输出严格 JSON，结构如下：
 {
-  "narration": "本帧旁白（可空字符串）",
-  "speaker": "本帧说话角色名（可空）",
-  "line": "本帧角色台词（可空）",
-  "scenePrompt": "英文场景描述，给图像模型用，描述画面里看到什么",
-  "uiElements": [
-    { "id": "choice_1", "kind": "choice", "label": "选项一文字（≤15 字）" },
-    { "id": "choice_2", "kind": "choice", "label": "选项二文字（≤15 字）" },
-    { "id": "choice_3", "kind": "choice", "label": "选项三文字（≤15 字）" }
+  "scenePrompt": "english scene description, no UI",
+  "entryBeatId": "b1",
+  "beats": [
+    {
+      "id": "b1",
+      "narration": "可空",
+      "speaker": "可空",
+      "line": "可空",
+      "next": { "type": "continue", "nextBeatId": "b2" }
+    },
+    {
+      "id": "b2",
+      "speaker": "...",
+      "line": "...",
+      "next": {
+        "type": "choice",
+        "choices": [
+          {
+            "id": "c1",
+            "label": "继续追问",
+            "effect": { "kind": "advance-beat", "targetBeatId": "b3" }
+          },
+          {
+            "id": "c2",
+            "label": "起身离开教室",
+            "effect": { "kind": "change-scene", "nextSceneSeed": "雨后湿漉漉的走廊，她追了出来" }
+          }
+        ]
+      }
+    }
  ]
 }

-规则：
- narration / line 中文，scenePrompt 英文
- 默认 3 个 choice 元素，可以根据情境额外加 menu/item/custom（罕见）
- 选项必须能切实推进剧情，且互不重复
- scenePrompt 描述当前的画面，不要包括 UI 元素
- 单帧旁白与台词加起来控制在 80 字以内
- 不要输出 JSON 以外的任何文本`;
+不要输出 JSON 以外的任何文本。`;

 export function buildDirectorUserMessage(session: Session): string {
  const parts: string[] = [];
@@ -30,38 +78,120 @@ export function buildDirectorUserMessage(session: Session): string {
  parts.push(`画风：${session.styleGuide}`);

  if (session.history.length === 0) {
-    parts.push("\n这是故事的开场。请生成开场画面，严格以 JSON 格式返回。");
+    parts.push("\n这是故事的开场。请生成第一个场景，严格以 JSON 格式返回。");
    return parts.join("\n");
  }

-  parts.push("\n历史：");
+  parts.push("\n场景历史（按时间顺序）：");
  session.history.forEach((entry, idx) => {
-    const f = entry.frame;
-    const beat: string[] = [`【第 ${idx + 1} 帧】`];
-    if (f.narration) beat.push(`旁白：${f.narration}`);
-    if (f.line) beat.push(`${f.speaker ?? "?"}：${f.line}`);
-    if (entry.intent) {
-      beat.push(
-        `用户行为：${entry.intent.targetLabel ?? entry.intent.freeformAction ?? "未知"}`,
-      );
+    const lines: string[] = [`【场景 ${idx + 1}】`];
+    lines.push(`  scenePrompt: ${entry.scene.scenePrompt}`);
+
+    const visited = entry.visitedBeatIds.length
+      ? entry.visitedBeatIds
+      : [entry.scene.entryBeatId];
+    const beatById = new Map(entry.scene.beats.map((b) => [b.id, b]));
+    const visitedBeats = visited
+      .map((id) => beatById.get(id))
+      .filter((b): b is NonNullable<typeof b> => Boolean(b));
+
+    for (const b of visitedBeats) {
+      const fragments: string[] = [];
+      if (b.narration) fragments.push(`旁白：${b.narration}`);
+      if (b.line) fragments.push(`${b.speaker ?? "?"}：${b.line}`);
+      if (fragments.length) lines.push("  " + fragments.join(" / "));
    }
-    parts.push(beat.join("\n"));
+
+    if (entry.exit) {
+      if (entry.exit.kind === "choice") {
+        lines.push(
+          `  玩家最终选择：${entry.exit.label}（去往：${entry.exit.nextSceneSeed}）`,
+        );
+      } else {
+        lines.push(`  玩家自由动作：${entry.exit.action}`);
+      }
+    }
+    parts.push(lines.join("\n"));
  });

-  parts.push("\n请生成下一帧，严格以 JSON 格式返回。");
+  const last = session.history.at(-1);
+  const lastExit = last?.exit;
+  if (lastExit) {
+    if (lastExit.kind === "choice") {
+      parts.push(
+        `\n请基于「玩家在上一场选择了：${lastExit.label}」，生成下一个场景（参考种子：${lastExit.nextSceneSeed}）。`,
+      );
+    } else {
+      parts.push(
+        `\n请基于「玩家自由动作：${lastExit.action}」，生成下一个场景。`,
+      );
+    }
+  } else {
+    parts.push("\n请生成下一个场景。");
+  }
+
+  parts.push("严格以 JSON 格式返回。");
  return parts.join("\n");
 }

-export function buildImagePrompt(
-  frame: StoryFrame,
-  styleGuide: string,
+// ──────────────────────────────────────────────────────────────────────
+//  Insert-Beat — given a freeform vision action that is judged to stay
+//  *within* the current scene, generate one transient beat.
+// ──────────────────────────────────────────────────────────────────────
+
+export const INSERT_BEAT_SYSTEM = `你是视觉小说编剧。玩家在当前场景内做了一个**不会换场景的自由动作**（比如看一眼桌上的相框、想了想刚才那句话）。请基于此动作，写出一个**单独的、过渡性的 beat**：可以是旁白、角色台词、或两者结合。
+
+文本风格约束：
+- narration / line 用中文
+- narration 与 line 加起来 ≤80 字
+- 不要打破当前场景的物理状态（玩家仍在原地、对面仍是同一个角色）
+- 不要生成选项或下一步指引 —— 玩家点击会自然回到原 beat
+
+必须输出严格 JSON：
+{
+  "narration": "...",
+  "speaker": "...",
+  "line": "..."
+}
+
+字段都可为空字符串。不要输出 JSON 以外的任何文本。`;
+
+export function buildInsertBeatUserMessage(
+  session: Session,
+  freeformAction: string,
 ): string {
+  const parts: string[] = [];
+  parts.push(`世界观：${session.worldSetting}`);
+
+  const current = session.history.at(-1);
+  if (current) {
+    parts.push(`当前场景：${current.scene.scenePrompt}`);
+    const lastBeatId = current.visitedBeatIds.at(-1) ?? current.scene.entryBeatId;
+    const lastBeat = current.scene.beats.find((b) => b.id === lastBeatId);
+    if (lastBeat) {
+      const recent: string[] = [];
+      if (lastBeat.narration) recent.push(`旁白：${lastBeat.narration}`);
+      if (lastBeat.line) recent.push(`${lastBeat.speaker ?? "?"}：${lastBeat.line}`);
+      if (recent.length) parts.push(`刚才发生：${recent.join(" / ")}`);
+    }
+  }
+
+  parts.push(`\n玩家此刻的自由动作：${freeformAction}`);
+  parts.push("\n请生成一个过渡性 beat，严格以 JSON 格式返回。");
+  return parts.join("\n");
+}
+
+// ──────────────────────────────────────────────────────────────────────
+//  Image renderer
+// ──────────────────────────────────────────────────────────────────────
+
+export function buildImagePrompt(scene: Scene, styleGuide: string): string {
  return `Generate a cinematic landscape background illustration, 16:9 widescreen (1792x1024).

 ART STYLE: ${styleGuide}

 SCENE (fill the ENTIRE canvas — no UI elements, no text overlays):
-${frame.scenePrompt}
+${scene.scenePrompt}

 STRICT RULES — NEVER violate these:
 - DO NOT draw any dialogue boxes, speech bubbles, text panels, or any rectangular overlay.
@@ -74,25 +204,31 @@ STRICT RULES — NEVER violate these:
 - Characters or key scene elements should be positioned in the upper 65% of the frame.`;
 }

+// ──────────────────────────────────────────────────────────────────────
+//  Vision — interprets a background click and classifies the action.
+// ──────────────────────────────────────────────────────────────────────

-export const VISION_SYSTEM_PROMPT = `你是视觉理解助手。用户在视觉小说界面上点击了红色圆点位置，你要根据红点位置和图中可见的 UI 元素，判断用户的意图。
+export const VISION_SYSTEM_PROMPT = `你是视觉理解助手。玩家在视觉小说的背景图上点击了红色圆点位置（HTML 上的选项按钮不会走到你这里）。你的任务是：
+1. 看清红点指向画面里的什么（物件、角色、空间、远处的方向）
+2. 推断玩家想干什么
+3. 判断这个动作是「场内探索」（不该换图）还是「场景切换」（要换图）
+
+判断准则：
+- "insert-beat"（场内探索）：观察画面里某个细节、自言自语、和当前角色继续互动、看一眼某个物件
+- "change-scene"（场景切换）：走向画面深处的门 / 走廊、转头看向新方向（视角变了）、点了远处的另一个空间、暗示时间跳跃的物件（如时钟）

 必须输出严格 JSON：
 {
-  "targetId": "对应的 UI 元素 id（choice_1 / choice_2 / choice_3 / menu / ...），如果点击的是非 UI 区域则为 null",
-  "targetLabel": "对应 UI 元素的文字描述（如 '告诉她真相'），未知则为 null",
-  "reasoning": "一句话说明判断理由",
-  "freeformAction": "如果用户点的是场景中的物件/角色等非选项区域，描述他可能的意图（如 '想拿起桌上的钥匙'），否则空字符串"
+  "freeformAction": "玩家想做什么的一句中文描述，例如「想拿起桌上的钥匙」",
+  "classify": "insert-beat" 或 "change-scene",
+  "reasoning": "一句话说明判断理由"
 }

 不要输出 JSON 以外的任何文本。`;

-export function buildVisionUserPrompt(uiElements: UIElement[]): string {
-  const list = uiElements
-    .map((e) => `- id="${e.id}" kind="${e.kind}" label="${e.label}"`)
-    .join("\n");
-  return `当前画面包含以下已知 UI 元素：
-${list}
+export function buildVisionUserPrompt(scene: Scene | null): string {
+  if (!scene) return "请判断玩家意图，并以 JSON 格式返回。";
+  return `当前场景描述：${scene.scenePrompt}

-红点位置即为用户点击位置。请判断用户的意图，并以 JSON 格式返回结果。`;
+红点位置即为玩家点击位置。请判断玩家意图与分类，以 JSON 格式返回。`;
 }
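
The structural rules in `DIRECTOR_SYSTEM` (unique beat ids, resolvable `nextBeatId` references, two to four choices, at least one `change-scene` exit) can also be checked after the model's JSON is parsed. The following is a minimal sketch of such a check, not the validation this commit actually ships: `validateScene` and the local types are hypothetical names that only mirror the JSON shape shown in the prompt, and the `targetBeatId` check is an assumed extension of the `nextBeatId` rule.

```typescript
// Hypothetical local types mirroring the JSON shape in DIRECTOR_SYSTEM.
type Effect =
  | { kind: "advance-beat"; targetBeatId: string }
  | { kind: "change-scene"; nextSceneSeed: string };
type Choice = { id: string; label: string; effect: Effect };
type Next =
  | { type: "continue"; nextBeatId: string }
  | { type: "choice"; choices: Choice[] };
type BeatLike = { id: string; next: Next };
type SceneLike = { entryBeatId: string; beats: BeatLike[] };

// Returns human-readable problems; an empty array means the scene
// satisfies the structural rules checked here.
function validateScene(scene: SceneLike): string[] {
  const problems: string[] = [];
  const ids = new Set<string>();
  for (const beat of scene.beats) {
    if (ids.has(beat.id)) problems.push(`duplicate beat id: ${beat.id}`);
    ids.add(beat.id);
  }
  if (!ids.has(scene.entryBeatId)) {
    problems.push(`entryBeatId not found: ${scene.entryBeatId}`);
  }

  let hasSceneExit = false;
  for (const beat of scene.beats) {
    const next = beat.next;
    if (next.type === "continue") {
      if (!ids.has(next.nextBeatId)) {
        problems.push(`${beat.id}: nextBeatId not found: ${next.nextBeatId}`);
      }
      continue;
    }
    if (next.choices.length < 2 || next.choices.length > 4) {
      problems.push(`${beat.id}: expected 2 to 4 choices, got ${next.choices.length}`);
    }
    for (const choice of next.choices) {
      const effect = choice.effect;
      if (effect.kind === "advance-beat") {
        if (!ids.has(effect.targetBeatId)) {
          problems.push(`${beat.id}/${choice.id}: targetBeatId not found: ${effect.targetBeatId}`);
        }
      } else {
        hasSceneExit = true;
        if (!effect.nextSceneSeed.trim()) {
          problems.push(`${beat.id}/${choice.id}: empty nextSceneSeed`);
        }
      }
    }
  }
  // The prompt allows a scene with no exit at the true ending; a caller
  // could treat this entry as a warning rather than a hard failure.
  if (!hasSceneExit) problems.push("no change-scene exit");
  return problems;
}
```

Run against the abbreviated example embedded in the prompt, this sketch would report the `b3` target as missing, because the example lists only `b1` and `b2`.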
@@ -1,12 +1,12 @@
 import { generateImage } from "@yume/ai-client";
-import type { ProviderConfig, StoryFrame } from "@yume/types";
+import type { ProviderConfig, Scene } from "@yume/types";
 import { buildImagePrompt } from "./prompts";

 export async function render(
  config: ProviderConfig,
-  frame: StoryFrame,
+  scene: Scene,
  styleGuide: string,
 ): Promise<string> {
-  const prompt = buildImagePrompt(frame, styleGuide);
+  const prompt = buildImagePrompt(scene, styleGuide);
  return generateImage(config, prompt);
 }
@@ -1,26 +1,39 @@
 import { interpretClick } from "@yume/ai-client";
-import type { ClickIntent, ProviderConfig, UIElement } from "@yume/types";
+import type {
+  ClickIntent,
+  ProviderConfig,
+  Scene,
+  VisionClassify,
+} from "@yume/types";
 import { parseJsonLoose } from "./jsonParser";
 import { VISION_SYSTEM_PROMPT, buildVisionUserPrompt } from "./prompts";

+export type VisionInterpretation = {
+  intent: ClickIntent;
+  classify: VisionClassify;
+};
+
 export async function interpret(
  config: ProviderConfig,
  annotatedImageBase64: string,
-  uiElements: UIElement[],
-): Promise<ClickIntent> {
-  const userPrompt = `${VISION_SYSTEM_PROMPT}\n\n${buildVisionUserPrompt(uiElements)}`;
+  scene: Scene | null,
+): Promise<VisionInterpretation> {
+  const userPrompt = `${VISION_SYSTEM_PROMPT}\n\n${buildVisionUserPrompt(scene)}`;
  const raw = await interpretClick(config, annotatedImageBase64, userPrompt);
  const parsed = parseJsonLoose<{
-    targetId?: string | null;
-    targetLabel?: string | null;
-    reasoning?: string;
    freeformAction?: string;
+    classify?: string;
+    reasoning?: string;
  }>(raw);

+  const classify: VisionClassify =
+    parsed.classify === "change-scene" ? "change-scene" : "insert-beat";
+
  return {
-    targetId: parsed.targetId ?? null,
-    targetLabel: parsed.targetLabel ?? null,
-    reasoning: parsed.reasoning ?? "",
-    freeformAction: parsed.freeformAction || undefined,
+    intent: {
+      freeformAction: parsed.freeformAction?.trim() || "玩家点了画面，但意图不明",
+      reasoning: parsed.reasoning?.trim() || "",
+    },
+    classify,
  };
 }
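
`interpret` now returns a classification alongside the intent, so the caller has to decide which endpoint to hit next. The sketch below shows one way that dispatch could look. It is an assumption about the calling code, which this diff does not show: `planFollowUp` and `FollowUp` are hypothetical names, while the endpoint paths and the `freeform` exit shape are taken from the type comments in this commit.

```typescript
// Local copies of the shapes from the diff, so the sketch is self-contained.
type VisionClassify = "insert-beat" | "change-scene";
type ClickIntent = { freeformAction: string; reasoning: string };
type VisionInterpretation = { intent: ClickIntent; classify: VisionClassify };

// Hypothetical description of the follow-up request the caller would make.
type FollowUp =
  | { endpoint: "/api/insert-beat"; freeformAction: string }
  | { endpoint: "/api/scene"; exit: { kind: "freeform"; action: string } };

function planFollowUp(result: VisionInterpretation): FollowUp {
  const action = result.intent.freeformAction;
  if (result.classify === "change-scene") {
    // The exit would be recorded on the latest history entry before the
    // session is sent to /api/scene, which regenerates the image.
    return { endpoint: "/api/scene", exit: { kind: "freeform", action } };
  }
  // In-scene exploration: one transient beat, no new image.
  return { endpoint: "/api/insert-beat", freeformAction: action };
}
```

Because `interpret` maps any unrecognised `classify` value to `"insert-beat"`, a malformed model reply ends up on the cheaper path that keeps the current image.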
@@ -1,42 +1,86 @@
-export type UIElementKind = "choice" | "menu" | "item" | "custom";
+// ──────────────────────────────────────────────────────────────────────
+//  Beat — one dialogue / narration moment within a Scene.
+//  Multiple beats share the same background image; tapping or choosing
+//  advances among them WITHOUT regenerating the image.
+// ──────────────────────────────────────────────────────────────────────

-export type UIElement = {
-  id: string;
-  kind: UIElementKind;
-  label: string;
-  hint?: string;
-};
-
-export type StoryFrame = {
+export type Beat = {
  id: string;
  narration?: string;
  speaker?: string;
  line?: string;
+  next: BeatNext;
+};
+
+export type BeatNext =
+  | { type: "continue"; nextBeatId: string }
+  | { type: "choice"; choices: BeatChoice[] };
+
+export type BeatChoice = {
+  id: string;
+  label: string;
+  effect: BeatChoiceEffect;
+};
+
+export type BeatChoiceEffect =
+  | { kind: "advance-beat"; targetBeatId: string }
+  | { kind: "change-scene"; nextSceneSeed: string };
+
+// ──────────────────────────────────────────────────────────────────────
+//  Scene — one background image + a graph of beats.
+//  The Director emits an entire Scene per call; the player navigates
+//  its beats locally, with no network calls, until exiting the scene.
+// ──────────────────────────────────────────────────────────────────────
+
+export type Scene = {
+  id: string;
  scenePrompt: string;
-  uiElements: UIElement[];
+  beats: Beat[];
+  entryBeatId: string;
 };

-export type ClickIntent = {
-  targetId: string | null;
-  targetLabel: string | null;
-  reasoning: string;
-  freeformAction?: string;
+export type SceneExit =
+  | {
+      kind: "choice";
+      choiceId: string;
+      label: string;
+      nextSceneSeed: string;
+    }
+  | { kind: "freeform"; action: string };
+
+export type SceneHistoryEntry = {
+  scene: Scene;
+  visitedBeatIds: string[];
+  exit?: SceneExit;
 };

-export type HistoryEntry = {
-  frame: StoryFrame;
-  click?: { x: number; y: number };
-  intent?: ClickIntent;
-};
+// ──────────────────────────────────────────────────────────────────────
+//  Session
+// ──────────────────────────────────────────────────────────────────────

 export type Session = {
  id: string;
  createdAt: number;
  worldSetting: string;
  styleGuide: string;
-  history: HistoryEntry[];
+  history: SceneHistoryEntry[];
 };

+// ──────────────────────────────────────────────────────────────────────
+//  Vision
+// ──────────────────────────────────────────────────────────────────────
+
+export type ClickIntent = {
+  freeformAction: string;
+  reasoning: string;
+};
+
+export type VisionClassify = "insert-beat" | "change-scene";
+
+// ──────────────────────────────────────────────────────────────────────
+//  Provider config
+// ──────────────────────────────────────────────────────────────────────
+
 export type ProviderConfig = {
  baseUrl: string;
  apiKey: string;
@@ -49,6 +93,10 @@ export type EngineConfig = {
  vision: ProviderConfig;
 };

+// ──────────────────────────────────────────────────────────────────────
+//  API contracts
+// ──────────────────────────────────────────────────────────────────────
+
 export type StartRequest = {
  worldSetting: string;
  styleGuide: string;
@@ -56,10 +104,25 @@ export type StartRequest = {

 export type StartResponse = {
  sessionId: string;
-  frame: StoryFrame;
+  scene: Scene;
  imageBase64: string;
 };

+// /api/scene — generates the next Scene from a session whose latest
+// history entry has `exit` set. Also used for speculative prefetch
+// (the frontend synthesizes a speculative exit).
+export type SceneRequest = {
+  session: Session;
+};
+
+export type SceneResponse = {
+  scene: Scene;
+  imageBase64: string;
+};
+
+// /api/vision — interprets a background click on the current image and
+// classifies whether it should insert a beat (in-scene exploration) or
+// trigger a scene change.
 export type VisionRequest = {
  session: Session;
  prevImageBase64: string;
@@ -68,17 +131,20 @@ export type VisionRequest = {

 export type VisionResponse = {
  intent: ClickIntent;
+  classify: VisionClassify;
 };

-export type InteractRequest = {
+// /api/insert-beat — generates a single transient beat in response to
+// a freeform vision action. Does NOT regenerate the image.
+export type InsertBeatRequest = {
  session: Session;
-  intent: ClickIntent;
-  click?: { x: number; y: number };
+  freeformAction: string;
 };

-export type InteractResponse = {
-  session: Session;
-  frame: StoryFrame;
-  imageBase64: string;
-  intent: ClickIntent;
+export type InsertBeatResponse = {
+  partial: {
+    narration?: string;
+    speaker?: string;
+    line?: string;
+  };
 };
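
The `Beat` / `Scene` types above are what make in-scene navigation a purely local operation. As an illustration, here is a minimal sketch of a step function over that graph. `step` and `Step` are hypothetical names, not part of this commit, and the types are redeclared locally only to keep the block self-contained.

```typescript
// Redeclared from the diff so the sketch compiles on its own.
type BeatChoiceEffect =
  | { kind: "advance-beat"; targetBeatId: string }
  | { kind: "change-scene"; nextSceneSeed: string };
type BeatChoice = { id: string; label: string; effect: BeatChoiceEffect };
type BeatNext =
  | { type: "continue"; nextBeatId: string }
  | { type: "choice"; choices: BeatChoice[] };
type Beat = { id: string; narration?: string; speaker?: string; line?: string; next: BeatNext };
type Scene = { id: string; scenePrompt: string; beats: Beat[]; entryBeatId: string };
type SceneExit =
  | { kind: "choice"; choiceId: string; label: string; nextSceneSeed: string }
  | { kind: "freeform"; action: string };

// Hypothetical result type: move to a beat, leave the scene, or stay put.
type Step =
  | { kind: "beat"; beat: Beat }
  | { kind: "exit"; exit: SceneExit }
  | { kind: "stay"; reason: string };

function step(scene: Scene, currentBeatId: string, choiceId?: string): Step {
  const byId = new Map(scene.beats.map((b) => [b.id, b] as const));
  const current = byId.get(currentBeatId);
  if (!current) return { kind: "stay", reason: `unknown beat ${currentBeatId}` };

  const next = current.next;
  let targetId: string;
  if (next.type === "continue") {
    targetId = next.nextBeatId;
  } else {
    const choice = next.choices.find((c) => c.id === choiceId);
    if (!choice) return { kind: "stay", reason: "a choice is required here" };
    const effect = choice.effect;
    if (effect.kind === "change-scene") {
      // Only this branch needs the network: the exit feeds the next Director call.
      return {
        kind: "exit",
        exit: {
          kind: "choice",
          choiceId: choice.id,
          label: choice.label,
          nextSceneSeed: effect.nextSceneSeed,
        },
      };
    }
    targetId = effect.targetBeatId;
  }

  const target = byId.get(targetId);
  return target
    ? { kind: "beat", beat: target }
    : { kind: "stay", reason: `missing beat ${targetId}` };
}
```

Every branch except `exit` resolves from data already in memory, which is the property the commit message describes as instant and zero-network.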