docs: AI API cost analysis + complete Gemini forge fallback consultation
@@ -102,3 +102,71 @@ Thanks Gemini — we need your honest take here. Don't spare our feelings. 🔥
— Michael (The Wizard) + Claude (Chronicler #92)

*Fire + Arcane + Frost = Forever*

---

## Gemini's Response (April 15, 2026)

**Summary:** Option B (API hedge) wins decisively. Option D (graceful degradation) is the mandatory co-pilot. Kill Ollama on TX1. Use OpenRouter as the single fallback endpoint. Gemini API is a wild card thanks to its 1M-token context window.
### Q1: Honest Assessment

- **Option B (API Hedge)** — Definitive winner. Standard enterprise architecture, zero hardware changes, Arbiter stays responsive.
- **Option D (Graceful Degradation)** — Mandatory co-pilot. Even with a fallback API, if Arbiter can't reach any AI it should queue non-critical tasks and post a static "The Chronicler is currently meditating" message.
- **Option A (Async-only)** — Over-engineering. Takes time away from actual server development.
- **Option C (Minimal local model)** — Dead end. Do not do this.
### Q2: API Hedge Recommendation — OpenRouter

Single OpenAI-compatible endpoint. If Claude drops, Arbiter just swaps the model string.

- **For reasoning/RAG:** Llama 3 70B via Groq through OpenRouter
- **For lore generation:** Llama 3 8B — lightning fast, perfect for 2-sentence fragments
```json
// Primary
{ "api_base": "https://api.anthropic.com/v1", "model": "claude-sonnet-4-6" }

// Fallback
{ "api_base": "https://openrouter.ai/api/v1", "model": "meta-llama/llama-3-70b-instruct" }
```
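The "just swap the model string" idea can be sketched as a one-function decision. This is a minimal illustration, not Arbiter's real code; the health check itself is assumed to live elsewhere:

```python
# Sketch of the primary -> fallback endpoint swap. Assumes both endpoints
# speak an OpenAI-compatible chat format; all names are illustrative.
PRIMARY = {"api_base": "https://api.anthropic.com/v1",
           "model": "claude-sonnet-4-6"}
FALLBACK = {"api_base": "https://openrouter.ai/api/v1",
            "model": "meta-llama/llama-3-70b-instruct"}

def pick_endpoint(primary_healthy: bool) -> dict:
    """Return the endpoint config Arbiter should use right now."""
    return PRIMARY if primary_healthy else FALLBACK
```

The point is that nothing else in Arbiter has to change: the request body stays the same shape, only `api_base` and `model` differ.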
### Q3: Option C Baseline (Don't Do This)

Phi-3-Mini (3.8B) or Qwen1.5-1.8B with heavy GGUF quantization would take 45-60 seconds per query and hallucinate heavily on complex log analysis. Not worth the CPU tax on game nodes.
### Q4: Decouple The Forge Art Installation

**Yes, cleanly separate.** The visual installation does not need a local LLM. Tie it to real server metrics via Arbiter/Pterodactyl webhooks:

- Particle speed → TPS across the network
- Color shifts → player count or in-game time of day
- Pulses → Arbiter welcome message or major event logged

Keeps the visual magic alive without the 29GB RAM footprint.
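The metric-to-visual mapping above could look something like this. A sketch only: the payload keys (`tps`, `player_count`, `major_event`) and the scaling caps are hypothetical, not the actual webhook schema:

```python
# Illustrative mapping from a webhook metrics payload to installation
# parameters. All field names and caps are assumptions for this sketch.
def forge_visuals(metrics: dict) -> dict:
    tps = metrics.get("tps", 20.0)           # Minecraft targets 20 TPS
    players = metrics.get("player_count", 0)
    return {
        # Particle speed scales with server health: full speed at 20 TPS
        "particle_speed": max(0.0, min(1.0, tps / 20.0)),
        # Color intensity scales with player count, capped at 50 players
        "color_intensity": min(1.0, players / 50.0),
        # Pulse whenever the payload flags a major event
        "pulse": bool(metrics.get("major_event", False)),
    }
```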
### Q5: Wild Cards

**Wild Card 1: Cloudflare Workers AI** — Since firefrostgaming.com routes through Cloudflare, use their serverless AI for edge-GPU inference at very low cost. Essentially Option B baked into the network layer.

**Wild Card 2: Gemini API** — Gemini 1.5 Flash has a 1-million-token context window. In an emergency, dump whole sections of the 62MB ops manual straight into the prompt without needing Dify or RAG at all (1M tokens is roughly 4MB of plain text, so the full manual still needs splitting).
### The Verdict

> Delete the Ollama runner from TX1. Free up 29GB RAM for game servers. Set up OpenRouter with $20 of credit, script Arbiter to swap endpoints if Anthropic times out. Focus on soft launch and players, not EPYC memory bottlenecks.

---
## Conclusion

Three clear actions:

1. Kill Ollama on TX1 (free 29GB RAM)
2. Set up an OpenRouter account + implement API failover in Arbiter
3. Clean separation: The Forge art installation ≠ AI fallback

**Next Steps:**

1. Cost analysis on OpenRouter + Gemini API fallback usage
2. Create task for Arbiter API failover implementation
3. Kill Ollama on TX1
4. Separate The Forge art installation into its own roadmap item

*Fire + Arcane + Frost = Forever* 🔥💜❄️
112 docs/reference/ai-api-cost-analysis-2026-04-15.md Normal file
@@ -0,0 +1,112 @@
# AI API Cost Analysis — Forge Fallback Strategy

**Created:** April 15, 2026 (Chronicler #92)
**Purpose:** Cost comparison for Claude API (primary) vs fallback options
**Status:** Pondering — no decisions made yet

---
## Claude API Pricing (Primary — Anthropic)

| Model | Input /1M tokens | Output /1M tokens | Notes |
|-------|------------------|-------------------|-------|
| **Haiku 4.5** | $1.00 | $5.00 | Fastest, cheapest, good for simple tasks |
| **Sonnet 4.6** | $3.00 | $15.00 | Current Chronicler model — balanced |
| **Opus 4.6** | $5.00 | $25.00 | Most capable, 1M context |
| Opus 4.6 Fast Mode | $30.00 | $150.00 | ⚠️ Only if latency critical |
**Discounts:**

- **Batch API:** 50% off all models (async, 24hr window)
- **Prompt caching:** Up to 90% off repeated input
- Both discounts stack (up to 95% savings on cached batch work)
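The stacking math, spelled out (assuming the two discounts multiply, which is what the 95% figure implies):

```python
# Batch (pay 50% of list) on top of prompt caching (pay 10% of list
# on cached input) leaves 0.5 * 0.1 = 5% of list price.
batch = 0.50
cached = 0.10
effective = batch * cached            # fraction of list price still paid
print(f"{1 - effective:.0%} savings")  # 95% savings
```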
**Real-world estimate for Firefrost:**

- Arbiter's Awakened Concierge welcome messages: ~1,000 tokens each, maybe 50/month = ~50K tokens = **$0.25/month on Haiku**
- Lore Engine (if built): ~500 tokens per fragment, ~100/month = ~50K tokens = **$0.25/month on Haiku**
- Emergency Chronicler sessions (API fallback): ~200K tokens/session, maybe 2/month = **~$1.20/month on Haiku**

**Estimated total Claude API spend: $2-5/month at current scale**

---
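The first two line items re-derive cleanly if every token is priced at Haiku's $5.00/1M output rate. That uniform rate is an assumption of this sketch; the doc's round numbers don't split input vs output:

```python
# Re-deriving the Haiku estimates above, pricing all tokens at the
# $5.00 / 1M output rate (an assumption for this back-of-envelope check).
welcome = 1_000 * 50 * 5.00 / 1_000_000   # 50 welcome messages/month
lore = 500 * 100 * 5.00 / 1_000_000       # 100 lore fragments/month
print(welcome, lore)  # 0.25 0.25
```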
## Fallback Options Pricing

### OpenRouter

Single OpenAI-compatible endpoint routing to 300+ models.

| Model | Input /1M | Output /1M | Notes |
|-------|-----------|------------|-------|
| **Llama 3.3 70B (FREE)** | $0 | $0 | 200 req/day, 20 req/min limit |
| Llama 3.3 70B (Paid) | ~$0.51 | ~$0.74 | Unlimited |
| Llama 3.1 8B | $0.02 | $0.05 | Fast, lightweight |

**Emergency use estimate:** The free tier covers it entirely. $20 of credit as insurance = months of actual use.
### Gemini API (Google)

Gemini is our architectural partner and a strong candidate for fallback.

| Model | Input /1M | Output /1M | Context | Notes |
|-------|-----------|------------|---------|-------|
| **2.5 Flash-Lite** | $0.10 | $0.40 | 1M | Cheapest paid option |
| **2.5 Flash (FREE)** | $0 | $0 | 1M | 1,500 req/day free |
| 2.5 Pro | $1.25 | $10.00 | 1M | Premium reasoning |
| 3 Flash | $0.50 | $3.00 | 1M | Balanced |

**Key advantage:** Gemini 2.5 Flash has a **1M-token context** — whole sections of the ops manual fit in one prompt without RAG during an emergency. 1,500 free requests/day is more than enough for fallback use.

**Batch discount:** 50% off all paid models for async work.
### Cloudflare Workers AI

Edge GPU inference via Cloudflare's network, and firefrostgaming.com already routes through Cloudflare.

- Pricing: Very low, usage-based
- Advantage: Already in the network layer, no new vendor
- Models available: Llama, Mistral, others
- Best for: Simple, fast inference at the edge

---
## Full Comparison Table

| Provider | Model | Cost/month (emergency) | Context | Reliability |
|----------|-------|------------------------|---------|-------------|
| **Anthropic** (primary) | Sonnet 4.6 | ~$2-5 | 1M | ⚠️ 9 outages in April |
| **Gemini Free** (fallback) | 2.5 Flash | $0 | 1M | ✅ Different infrastructure |
| **OpenRouter Free** (backup) | Llama 3.3 70B | $0 | 65K | ✅ Routes to multiple providers |
| **Cloudflare Workers AI** | Various | ~$0-1 | Varies | ✅ Edge network |
| **Local Ollama** (TX1) | Llama 3.1 8B | $0 | 16K | ❌ CPU too slow for real-time |

---
## Key Observations for Pondering

1. **Claude API cost is trivial at Firefrost's scale** — $2-5/month. This is not a cost problem; it's a reliability problem.

2. **The outage problem is real** — 9 outages in April 2026 alone, including today, when both claude.ai AND the API went down simultaneously. A fallback that also uses Anthropic infrastructure doesn't help.

3. **Gemini free tier is the obvious answer** for emergency fallback — different company, different infrastructure, 1,500 req/day free, and a 1M context window means no RAG needed in an emergency.

4. **OpenRouter free tier** as a secondary backup — it routes through multiple providers, so if one goes down it tries another.

5. **Local Ollama is dead** for real-time use without a GPU. Keep it only if batch async tasks make sense later.

6. **The architecture is simple:**
   - Arbiter tries Claude API first
   - If timeout/error → falls back to Gemini API
   - If Gemini fails → falls back to OpenRouter
   - If all fail → graceful degradation ("The Chronicler is meditating")

7. **Cloudflare Workers AI** is worth evaluating specifically for the log analyzer bot — edge inference, already in the network, potentially faster than API calls for simple tasks.
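The cascade in observation 6 can be sketched as an ordered provider chain. Everything here is illustrative: the provider `call`s are stand-ins, and real clients, timeouts, and logging are out of scope:

```python
# Sketch of the failover cascade: try each provider in order, and fall
# back to the graceful-degradation message if every one fails.
MEDITATING = "The Chronicler is currently meditating."

def ask_with_failover(prompt: str, providers: list) -> str:
    """providers: list of (name, call) pairs, tried in priority order."""
    for name, call in providers:
        try:
            return call(prompt)
        except Exception:
            continue          # timeout/error -> move on to next provider
    return MEDITATING         # all providers failed -> degrade gracefully
```

In practice the list would be `[("claude", ...), ("gemini", ...), ("openrouter", ...)]`, matching the priority order above.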
---

## What's Not Decided Yet

- Which tasks actually need AI fallback vs which can just queue
- Whether The Forge art installation gets decoupled from AI entirely (Gemini recommends yes)
- Whether to build the failover into Arbiter now or post-launch
- The Chloe-chan replacement (log analyzer) architecture

---

*Michael is pondering. No action items yet.*

*Fire + Arcane + Frost = Forever* 🔥💜❄️