feat(ai-client): multi-provider compat — native Anthropic/Google + URL tolerance

- TEXT/VISION: add native Anthropic & Google Gemini paths via Vercel AI SDK,
  selectable through TEXT_PROVIDER / VISION_PROVIDER (default openai_compatible)
- IMAGE: expand to openai (gpt-image) / google (Nano Banana) via AI SDK
  alongside the existing Runware task-array and OpenAI-compatible REST paths
- normalizeBaseUrl: tolerate URLs with/without /v1 (or /chat/completions);
  append the per-protocol version segment only for bare hosts
- config: readProvider() reads *_PROVIDER; types: ProviderProtocol + provider?
- deps: @ai-sdk/anthropic, @ai-sdk/google; docs in .env.example + README

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
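
Editor's sketch of the `normalizeBaseUrl` behaviour described in the commit message: accept a base URL with or without a version segment, strip a pasted `/chat/completions` suffix, and append a version segment only when the host is bare. This is a minimal illustration, not the commit's actual implementation. The function signature, the `VERSION_SEGMENT` table, and the choice of `/v1` for OpenAI-compatible and Anthropic and `/v1beta` for Gemini are assumptions.

```typescript
// Hypothetical sketch; the real normalizeBaseUrl in this commit may differ.
type ProviderProtocol = "openai_compatible" | "anthropic" | "google";

// Assumed per-protocol version segments (not taken from the commit).
const VERSION_SEGMENT: Record<ProviderProtocol, string> = {
  openai_compatible: "/v1",
  anthropic: "/v1",
  google: "/v1beta",
};

function normalizeBaseUrl(
  raw: string,
  protocol: ProviderProtocol = "openai_compatible",
): string {
  const url = new URL(raw.trim());
  // Drop trailing slashes, then a pasted full-endpoint suffix.
  let path = url.pathname
    .replace(/\/+$/, "")
    .replace(/\/chat\/completions$/, "");
  // Bare host: nothing left in the path, so append the version segment.
  if (path === "") {
    path = VERSION_SEGMENT[protocol];
  }
  // Any other path (for example /api/v1 behind a proxy) is kept as given.
  return url.origin + path;
}
```

Under these assumptions, `https://api.deepseek.com`, `https://api.deepseek.com/v1/`, and `https://api.deepseek.com/v1/chat/completions` all resolve to the same base URL.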
yuanzonghao
2026-06-04 15:51:53 +08:00
parent a4dc57a1b6
commit 83fd5717e7
10 changed files with 614 additions and 67 deletions
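
Editor's sketch of how the `readProvider()` mentioned in the commit message could resolve a `*_PROVIDER` variable, following the rules documented in the env comments in the diff: an explicit value wins, an unset IMAGE provider is detected from the URL, and everything else falls back to `openai_compatible`. The `Slot` and `Provider` types, the `ALLOWED` table, the `isRunwareUrl` helper, and the decision to throw on an unknown value are all assumptions; the commit's own code may behave differently.

```typescript
// Hypothetical sketch; names and error handling are not from the commit.
type Slot = "TEXT" | "VISION" | "IMAGE";
type Provider =
  | "openai_compatible"
  | "anthropic"
  | "google"
  | "openai"
  | "runware";

// Values listed per slot in the env comments of this commit.
const ALLOWED: Record<Slot, readonly Provider[]> = {
  TEXT: ["openai_compatible", "anthropic", "google"],
  VISION: ["openai_compatible", "anthropic", "google"],
  IMAGE: ["runware", "openai_compatible", "openai", "google"],
};

function isRunwareUrl(baseUrl: string | undefined): boolean {
  if (!baseUrl) return false;
  try {
    const host = new URL(baseUrl).hostname;
    return host === "runware.ai" || host.endsWith(".runware.ai");
  } catch {
    return false; // unparsable URL: no auto-detection
  }
}

function readProvider(
  slot: Slot,
  env: Record<string, string | undefined>,
): Provider {
  const raw = (env[`${slot}_PROVIDER`] ?? "").trim().toLowerCase();
  if (raw !== "") {
    const match = ALLOWED[slot].find((p) => p === raw);
    if (match === undefined) {
      // Assumption: reject unknown values instead of silently defaulting.
      throw new Error(
        `${slot}_PROVIDER must be one of: ${ALLOWED[slot].join(", ")}`,
      );
    }
    return match;
  }
  // Unset: only IMAGE is auto-detected from its base URL.
  if (slot === "IMAGE" && isRunwareUrl(env["IMAGE_BASE_URL"])) {
    return "runware";
  }
  return "openai_compatible";
}
```

Passing the environment in as a plain object (for example `readProvider("TEXT", process.env)` in Node) keeps the function easy to test without touching global state.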
+28 -5
@@ -3,14 +3,18 @@
 # Recommended setup: Xiaomi MiMo Token Plan for TEXT / VISION / TTS
 # (one API key covers all three) + Runware for IMAGE (FLUX.2 [klein]).
 #
-# TEXT / VISION use any OpenAI-compatible endpoint (any OpenAI-
-# compatible host works: OpenRouter, OpenAI, Anthropic via proxy,
-# Gemini, DeepSeek, Ollama, ...).
+# TEXT / VISION default to any OpenAI-compatible endpoint, and can switch to
+# native Anthropic or Google Gemini via TEXT_PROVIDER / VISION_PROVIDER.
 # TTS uses Xiaomi MiMo's own voice design / clone protocol
 # (not OpenAI-compatible; appends -voicedesign / -voiceclone).
 #
-# IMAGE uses Runware's own task-array protocol (not OpenAI-compatible);
-# the adapter posts an `imageInference` task to IMAGE_BASE_URL.
+# IMAGE supports Runware (its own task-array protocol), OpenAI (gpt-image),
+# and Google Gemini (Nano Banana) via IMAGE_PROVIDER.
+#
+# *_PROVIDER (optional) selects the wire protocol; leave unset for the
+# OpenAI-compatible default (image is auto-detected from the URL). Base URLs
+# tolerate a missing or extra /v1 (or a trailing /chat/completions) — the
+# engine normalizes them.
 # =============================================================
 # ---- 1. Text LLM · scene director ----------------------------------
@@ -26,6 +30,10 @@
 TEXT_BASE_URL=https://api.deepseek.com/v1
 TEXT_API_KEY=sk-xxx
 TEXT_MODEL=deepseek-v4-flash
+# TEXT_PROVIDER: openai_compatible (default) | anthropic | google
+# anthropic → TEXT_BASE_URL=https://api.anthropic.com TEXT_MODEL=claude-sonnet-4-6
+# google → TEXT_BASE_URL=https://generativelanguage.googleapis.com TEXT_MODEL=gemini-3.5-flash
+# TEXT_PROVIDER=openai_compatible
 # ---- 2. Image generator (renders the scene background) -------------
 # Recommended: Runware + FLUX.2 [klein] 9B KV — distilled 4-step model,
@@ -36,12 +44,27 @@ TEXT_MODEL=deepseek-v4-flash
 IMAGE_BASE_URL=https://api.runware.ai/v1
 IMAGE_API_KEY=runware-xxx
 IMAGE_MODEL=runware:400@6
+# IMAGE_PROVIDER: runware (auto-detected for runware.ai) | openai_compatible
+# | openai | google
+# openai → gpt-image, supports referenceImages (character/scene continuity).
+# IMAGE_BASE_URL=https://api.openai.com IMAGE_MODEL=gpt-image-1
+# google → Gemini "Nano Banana" (Imagen is EOL 2026-06-24, do not use it).
+# IMAGE_BASE_URL=https://generativelanguage.googleapis.com
+# IMAGE_MODEL=gemini-2.5-flash-image
+# NOTE: openai/google return raw bytes → inlined as a data: URI for the session
+# (heavier per-call transport than Runware's UUID re-reference loop). Runware
+# stays fastest + cheapest for the scene-by-scene flow.
+# IMAGE_PROVIDER=runware
 # ---- 3. Vision model · multimodal click interpretation -------------
 # Recommended: MiMo V2.5 — multimodal, accepts image_url content parts.
 VISION_BASE_URL=https://token-plan-sgp.xiaomimimo.com/v1
 VISION_API_KEY=tp-xxx
 VISION_MODEL=mimo-v2.5
+# VISION_PROVIDER: openai_compatible (default) | anthropic | google
+# anthropic → VISION_BASE_URL=https://api.anthropic.com VISION_MODEL=claude-sonnet-4-6
+# google → VISION_BASE_URL=https://generativelanguage.googleapis.com VISION_MODEL=gemini-3.5-flash
+# VISION_PROVIDER=openai_compatible
 # ---- 4. TTS · Xiaomi MiMo (optional — leave blank to disable) ------
 # Per-character voice design → clone, with per-line delivery direction.