feat(ai-client): multi-provider compat — native Anthropic/Google + URL tolerance

- TEXT/VISION: add native Anthropic & Google Gemini paths via Vercel AI SDK,
  selectable through TEXT_PROVIDER / VISION_PROVIDER (default openai_compatible)
- IMAGE: expand to openai (gpt-image) / google (Nano Banana) via AI SDK
  alongside the existing Runware task-array and OpenAI-compatible REST paths
- normalizeBaseUrl: tolerate URLs with/without /v1 (or /chat/completions);
  append the per-protocol version segment only for bare hosts
- config: readProvider() reads *_PROVIDER; types: ProviderProtocol + provider?
- deps: @ai-sdk/anthropic, @ai-sdk/google; docs in .env.example + README

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-04 15:51:53 +08:00
parent a4dc57a1b6
commit 83fd5717e7
10 changed files with 614 additions and 67 deletions
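
The URL tolerance described in the commit message could look roughly like the sketch below. This is not the commit's actual normalizeBaseUrl: the function body, the VERSION_SEGMENT table, and its per-protocol values (/v1 for openai_compatible and anthropic, /v1beta for google) are assumptions made for illustration, and only the ProviderProtocol name and the three protocol values come from the commit.

```typescript
// Hypothetical sketch of the behavior described above; the real
// normalizeBaseUrl in this commit may differ.
type ProviderProtocol = "openai_compatible" | "anthropic" | "google";

// Assumed version segment per protocol (illustrative values only).
const VERSION_SEGMENT: Record<ProviderProtocol, string> = {
  openai_compatible: "/v1",
  anthropic: "/v1",
  google: "/v1beta",
};

function normalizeBaseUrl(
  raw: string,
  protocol: ProviderProtocol = "openai_compatible",
): string {
  const url = new URL(raw.trim());
  // Drop trailing slashes, then a pasted-in /chat/completions suffix.
  let path = url.pathname
    .replace(/\/+$/, "")
    .replace(/\/chat\/completions$/, "");
  // Bare host: nothing left in the path, so append the version segment.
  if (path === "") path = VERSION_SEGMENT[protocol];
  // Any other path (for example an existing /v1) is kept as given.
  return url.origin + path;
}
```

Under these assumptions, a URL that already carries a version segment passes through unchanged, while a bare host such as https://api.anthropic.com picks up the segment for its protocol. Query strings and fragments are dropped in this sketch.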
@@ -3,14 +3,18 @@
 # Recommended setup: Xiaomi MiMo Token Plan for TEXT / VISION / TTS
 # (one API key covers all three) + Runware for IMAGE (FLUX.2 [klein]).
 #
-# TEXT / VISION use any OpenAI-compatible endpoint (any OpenAI-
-# compatible host works: OpenRouter, OpenAI, Anthropic via proxy,
-# Gemini, DeepSeek, Ollama, ...).
+# TEXT / VISION default to any OpenAI-compatible endpoint, and can switch to
+# native Anthropic or Google Gemini via TEXT_PROVIDER / VISION_PROVIDER.
 # TTS uses Xiaomi MiMo's own voice design / clone protocol
 # (not OpenAI-compatible; appends -voicedesign / -voiceclone).
 #
-# IMAGE uses Runware's own task-array protocol (not OpenAI-compatible);
-# the adapter posts an `imageInference` task to IMAGE_BASE_URL.
+# IMAGE supports Runware (its own task-array protocol), OpenAI (gpt-image),
+# and Google Gemini (Nano Banana) via IMAGE_PROVIDER.
+#
+# *_PROVIDER (optional) selects the wire protocol; leave unset for the
+# OpenAI-compatible default (image is auto-detected from the URL). Base URLs
+# tolerate a missing or extra /v1 (or a trailing /chat/completions) — the
+# engine normalizes them.
 # =============================================================

 # ---- 1. Text LLM · scene director ----------------------------------
@@ -26,6 +30,10 @@
 TEXT_BASE_URL=https://api.deepseek.com/v1
 TEXT_API_KEY=sk-xxx
 TEXT_MODEL=deepseek-v4-flash
+# TEXT_PROVIDER: openai_compatible (default) | anthropic | google
+#   anthropic → TEXT_BASE_URL=https://api.anthropic.com  TEXT_MODEL=claude-sonnet-4-6
+#   google    → TEXT_BASE_URL=https://generativelanguage.googleapis.com  TEXT_MODEL=gemini-3.5-flash
+# TEXT_PROVIDER=openai_compatible

 # ---- 2. Image generator (renders the scene background) -------------
 # Recommended: Runware + FLUX.2 [klein] 9B KV — distilled 4-step model,
@@ -36,12 +44,27 @@ TEXT_MODEL=deepseek-v4-flash
 IMAGE_BASE_URL=https://api.runware.ai/v1
 IMAGE_API_KEY=runware-xxx
 IMAGE_MODEL=runware:400@6
+# IMAGE_PROVIDER: runware (auto-detected for runware.ai) | openai_compatible
+#                 | openai | google
+#   openai → gpt-image, supports referenceImages (character/scene continuity).
+#            IMAGE_BASE_URL=https://api.openai.com  IMAGE_MODEL=gpt-image-1
+#   google → Gemini "Nano Banana" (Imagen is EOL 2026-06-24, do not use it).
+#            IMAGE_BASE_URL=https://generativelanguage.googleapis.com
+#            IMAGE_MODEL=gemini-2.5-flash-image
+# NOTE: openai/google return raw bytes → inlined as a data: URI for the session
+# (heavier per-call transport than Runware's UUID re-reference loop). Runware
+# stays fastest + cheapest for the scene-by-scene flow.
+# IMAGE_PROVIDER=runware

 # ---- 3. Vision model · multimodal click interpretation -------------
 # Recommended: MiMo V2.5 — multimodal, accepts image_url content parts.
 VISION_BASE_URL=https://token-plan-sgp.xiaomimimo.com/v1
 VISION_API_KEY=tp-xxx
 VISION_MODEL=mimo-v2.5
+# VISION_PROVIDER: openai_compatible (default) | anthropic | google
+#   anthropic → VISION_BASE_URL=https://api.anthropic.com  VISION_MODEL=claude-sonnet-4-6
+#   google    → VISION_BASE_URL=https://generativelanguage.googleapis.com  VISION_MODEL=gemini-3.5-flash
+# VISION_PROVIDER=openai_compatible

 # ---- 4. TTS · Xiaomi MiMo (optional — leave blank to disable) ------
 # Per-character voice design → clone, with per-line delivery direction.