docs: AI API cost analysis + complete Gemini forge fallback consultation

Claude
2026-04-16 00:46:44 +00:00
parent 6d1fcac283
commit 16ea1f94f3
2 changed files with 180 additions and 0 deletions

View File

@@ -102,3 +102,71 @@ Thanks Gemini — we need your honest take here. Don't spare our feelings. 🔥
— Michael (The Wizard) + Claude (Chronicler #92)
*Fire + Arcane + Frost = Forever*
---
## Gemini's Response (April 15, 2026)
**Summary:** Option B (API hedge) wins decisively. Option D (graceful degradation) is mandatory co-pilot. Kill Ollama on TX1. Use OpenRouter as the single fallback endpoint. Gemini API is a wild card thanks to its 1M-token context window.
### Q1: Honest Assessment
- **Option B (API Hedge)** — Definitive winner. Standard enterprise architecture, zero hardware changes, Arbiter stays responsive.
- **Option D (Graceful Degradation)** — Mandatory co-pilot. Even with fallback API, if Arbiter can't reach any AI it should queue non-critical tasks and post a static "The Chronicler is currently meditating" message.
- **Option A (Async-only)** — Over-engineering. Takes time away from actual server development.
- **Option C (Minimal local model)** — Dead end. Do not do this.
### Q2: API Hedge Recommendation — OpenRouter
Single OpenAI-compatible endpoint. If Claude drops, Arbiter just swaps the model string.
- **For reasoning/RAG:** Llama 3 70B via Groq through OpenRouter
- **For lore generation:** Llama 3 8B — lightning fast, perfect for 2-sentence fragments
```json
// Primary
{ "api_base": "https://api.anthropic.com/v1", "model": "claude-sonnet-4-6" }
// Fallback
{ "api_base": "https://openrouter.ai/api/v1", "model": "meta-llama/llama-3-70b-instruct" }
```
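A minimal sketch of that swap, assuming Arbiter wraps its HTTP client in a single `call_model` helper (a hypothetical stand-in, not an existing Arbiter function):

```python
# Hypothetical endpoint-swap sketch: try the primary config, fall back
# to OpenRouter if the primary call raises a timeout or error.
# `call_model` is a stand-in for whatever HTTP client Arbiter uses.

PRIMARY = {"api_base": "https://api.anthropic.com/v1",
           "model": "claude-sonnet-4-6"}
FALLBACK = {"api_base": "https://openrouter.ai/api/v1",
            "model": "meta-llama/llama-3-70b-instruct"}

def complete(prompt, call_model, configs=(PRIMARY, FALLBACK)):
    """Try each config in order; return the first successful response."""
    last_error = None
    for cfg in configs:
        try:
            return call_model(cfg["api_base"], cfg["model"], prompt)
        except Exception as exc:  # timeout, 5xx, rate limit, etc.
            last_error = exc
    raise RuntimeError("all endpoints failed") from last_error
```

Because both endpoints speak an OpenAI-compatible protocol, only the `api_base` and model string change; the prompt and response handling stay identical.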
### Q3: Option C Baseline (Don't Do This)
Phi-3-Mini (3.8B) or Qwen1.5-1.8B with heavy GGUF quantization would take 45-60 seconds per query and hallucinate heavily on complex log analysis. Not worth the CPU tax on game nodes.
### Q4: Decouple The Forge Art Installation
**Yes, cleanly separate.** The visual installation does not need a local LLM. Tie it to real server metrics via Arbiter/Pterodactyl webhooks:
- Particle speed → TPS across the network
- Color shifts → player count or in-game time of day
- Pulses → Arbiter welcome message or major event logged
Keeps the visual magic alive without the 29GB RAM footprint.
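As a sketch of that wiring — field names like `tps` and `player_count` are assumptions about the webhook payload, not a documented Arbiter schema:

```python
# Hypothetical mapping from live server metrics to visual parameters
# for The Forge installation. Payload field names are assumptions.

def forge_params(metrics):
    """Translate a metrics webhook payload into particle-system settings."""
    tps = max(0.0, min(metrics.get("tps", 20.0), 20.0))
    players = metrics.get("player_count", 0)
    return {
        # Full speed at a healthy 20 TPS, slowing as the server lags.
        "particle_speed": tps / 20.0,
        # Shift hue toward warm colors as the server fills (cap at 100).
        "hue_shift": min(players, 100) / 100.0,
        # Fire a pulse whenever a major event was just logged.
        "pulse": bool(metrics.get("major_event", False)),
    }
```

The installation only ever reads a small dict of floats — no LLM, no RAM footprint, just webhook plumbing.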
### Q5: Wild Cards
**Wild Card 1: Cloudflare Workers AI** — Since firefrostgaming.com routes through Cloudflare, use Workers AI, their serverless edge-GPU inference, at very low cost. Essentially Option B baked into the network layer.
**Wild Card 2: Gemini API** — Gemini 1.5 Flash has a 1-million token context window. In an emergency, dump the entire 62MB ops manual into the prompt without needing Dify or RAG at all.
### The Verdict
> Delete the Ollama runner from TX1. Free up 29GB RAM for game servers. Set up OpenRouter with $20 of credit, script Arbiter to swap endpoints if Anthropic times out. Focus on soft launch and players, not EPYC memory bottlenecks.
---
## Conclusion
Three clear actions:
1. Kill Ollama on TX1 (free 29GB RAM)
2. Set up OpenRouter account + implement API failover in Arbiter
3. Clean separation: The Forge art installation ≠ AI fallback
**Next Steps:**
1. Cost analysis on OpenRouter + Gemini API fallback usage
2. Create task for Arbiter API failover implementation
3. Kill Ollama on TX1
4. Separate The Forge art installation into its own roadmap item
*Fire + Arcane + Frost = Forever* 🔥💜❄️

View File

@@ -0,0 +1,112 @@
# AI API Cost Analysis — Forge Fallback Strategy
**Created:** April 15, 2026 (Chronicler #92)
**Purpose:** Cost comparison for Claude API (primary) vs fallback options
**Status:** Pondering — no decisions made yet
---
## Claude API Pricing (Primary — Anthropic)
| Model | Input /1M tokens | Output /1M tokens | Notes |
|-------|-----------------|-------------------|-------|
| **Haiku 4.5** | $1.00 | $5.00 | Fastest, cheapest, good for simple tasks |
| **Sonnet 4.6** | $3.00 | $15.00 | Current Chronicler model — balanced |
| **Opus 4.6** | $5.00 | $25.00 | Most capable, 1M context |
| Opus 4.6 Fast Mode | $30.00 | $150.00 | ⚠️ Only if latency critical |
**Discounts:**
- **Batch API:** 50% off all models (async, 24hr window)
- **Prompt caching:** Up to 90% off repeated input
- Both discounts stack (up to 95% savings on cached batch work)
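A quick sanity check on the stacking claim: batch halves the price and caching cuts cached input to 10% of list, so cached batch input costs 0.5 × 0.1 = 5% of list price — the 95% figure:

```python
# Sanity check on the stacked-discount claim: a 50% batch discount
# on top of a 90% prompt-caching discount on cached input tokens.
batch = 0.5      # batch API: pay 50% of list price
caching = 0.1    # cached input: pay 10% of list price
effective = batch * caching
print(f"cached batch input costs {effective:.0%} of list price")  # 5%
print(f"savings: {1 - effective:.0%}")                            # 95%
```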
**Real-world estimate for Firefrost:**
- Arbiter's Awakened Concierge welcome messages: ~1,000 tokens each, maybe 50/month = ~50K tokens = **$0.25/month on Haiku**
- Lore Engine (if built): ~500 tokens per fragment, ~100/month = ~50K tokens = **$0.25/month on Haiku**
- Emergency Chronicler sessions (API fallback): ~200K tokens/session, maybe 2/month = **~$1.20/month on Haiku**
**Estimated total Claude API spend: $2-5/month at current scale**
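The line items above can be reproduced with a quick calculation at Haiku list prices; the input/output split per task is an assumption — welcome messages and lore fragments are treated as pure output, emergency sessions as an even mix:

```python
# Rough monthly-cost calculator for the estimates above, using
# Haiku list prices ($1/M input, $5/M output). Token splits per
# task are assumptions, not measured usage.

HAIKU_IN, HAIKU_OUT = 1.00, 5.00  # USD per 1M tokens

def monthly_cost(in_tokens, out_tokens):
    return in_tokens / 1e6 * HAIKU_IN + out_tokens / 1e6 * HAIKU_OUT

welcome = monthly_cost(0, 50 * 1_000)       # 50 msgs x 1K output tokens
lore = monthly_cost(0, 100 * 500)           # 100 fragments x 500 output
emergency = monthly_cost(200_000, 200_000)  # 2 x 200K sessions, even split
print(round(welcome, 2), round(lore, 2), round(emergency, 2))
# -> 0.25 0.25 1.2
```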
---
## Fallback Options Pricing
### OpenRouter
Single OpenAI-compatible endpoint routing to 300+ models.
| Model | Input /1M | Output /1M | Notes |
|-------|-----------|------------|-------|
| **Llama 3.3 70B (FREE)** | $0 | $0 | 200 req/day, 20 req/min limit |
| Llama 3.3 70B (Paid) | ~$0.51 | ~$0.74 | Unlimited |
| Llama 3.1 8B | $0.02 | $0.05 | Fast, lightweight |
**Emergency use estimate:** Free tier covers it entirely. $20 credit as insurance = months of actual use.
### Gemini API (Google)
Gemini is our architectural partner. Strong candidate for fallback.
| Model | Input /1M | Output /1M | Context | Notes |
|-------|-----------|------------|---------|-------|
| **2.5 Flash-Lite** | $0.10 | $0.40 | 1M | Cheapest paid option |
| **2.5 Flash (FREE)** | $0 | $0 | 1M | 1,500 req/day free |
| 2.5 Pro | $1.25 | $10.00 | 1M | Premium reasoning |
| 3 Flash | $0.50 | $3.00 | 1M | Balanced |
**Key advantage:** Gemini 2.5 Flash has a **1M token context** — roughly 4MB of plain text, enough to drop huge slices of the ops manual into a single prompt without RAG during an emergency (the full 62MB manual would still need light chunking). 1,500 free requests/day is more than enough for fallback use.
**Batch discount:** 50% off all paid models for async work.
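One caveat worth pondering before leaning on the "whole manual in one prompt" idea: at a rough ~4 characters per token (a common heuristic for English text, and the stated assumption here), a 1M-token window holds only about 4MB of plain text. A crude budget check:

```python
# Rough token-budget check for the single-prompt idea.
# Assumes ~4 characters per token, a heuristic for English text.

CHARS_PER_TOKEN = 4
CONTEXT_WINDOW = 1_000_000  # Gemini 2.5 Flash

def fits_in_context(text_bytes, reserve_tokens=50_000):
    """Return (estimated_tokens, fits), leaving room for the reply."""
    est_tokens = text_bytes // CHARS_PER_TOKEN
    return est_tokens, est_tokens <= CONTEXT_WINDOW - reserve_tokens

print(fits_in_context(62 * 1024 * 1024))  # 62MB manual: ~16.3M tokens, False
print(fits_in_context(3 * 1024 * 1024))   # ~3MB excerpt: ~786K tokens, True
```

So the emergency play is "dump the relevant chapter", not the whole manual — still a huge simplification over running Dify + RAG.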
### Cloudflare Workers AI
Edge GPU inference via Cloudflare's network — a natural fit, since firefrostgaming.com already routes through Cloudflare.
- Pricing: Very low, usage-based
- Advantage: Already in the network layer, no new vendor
- Models available: Llama, Mistral, others
- Best for: Simple, fast inference at the edge
---
## Full Comparison Table
| Provider | Model | Cost/month (emergency) | Context | Reliability |
|----------|-------|----------------------|---------|-------------|
| **Anthropic** (primary) | Sonnet 4.6 | ~$2-5 | 1M | ⚠️ 9 outages in April |
| **Gemini Free** (fallback) | 2.5 Flash | $0 | 1M | ✅ Different infrastructure |
| **OpenRouter Free** (backup) | Llama 3.3 70B | $0 | 65K | ✅ Routes to multiple providers |
| **Cloudflare Workers AI** | Various | ~$0-1 | Varies | ✅ Edge network |
| **Local Ollama** (TX1) | Llama 3.1 8B | $0 | 16K | ❌ CPU too slow for real-time |
---
## Key Observations for Pondering
1. **Claude API cost is trivial at Firefrost's scale** — $2-5/month. Not a cost problem, a reliability problem.
2. **The outage problem is real** — 9 outages in April 2026 alone, including today, when both claude.ai AND the API went down simultaneously. A fallback that also runs on Anthropic infrastructure doesn't help.
3. **Gemini free tier is the obvious answer** for emergency fallback — different company, different infrastructure, 1,500 req/day free, 1M context window means no RAG needed in an emergency.
4. **OpenRouter free tier** as a secondary backup — routes through multiple providers, if one goes down it tries another.
5. **Local Ollama is dead** for real-time use without GPU. Keep it only if batch async tasks make sense later.
6. **The architecture is simple:**
- Arbiter tries Claude API first
- If timeout/error → falls back to Gemini API
- If Gemini fails → falls back to OpenRouter
- If all fail → graceful degradation ("The Chronicler is meditating")
7. **Cloudflare Workers AI** is worth evaluating specifically for the log analyzer bot — edge inference, already in the network, potentially faster than API calls for simple tasks.
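A minimal sketch of the chain in observation 6, assuming each provider client is wrapped in a callable (names and error handling here are hypothetical, not existing Arbiter code):

```python
# Sketch of the failover chain: Claude -> Gemini -> OpenRouter ->
# graceful degradation. Each provider entry is a stand-in callable
# for the real API client.

MEDITATING = "The Chronicler is currently meditating."

def ask_chronicler(prompt, providers):
    """providers: ordered list of (name, callable) pairs.

    Returns the first successful answer, or the graceful-degradation
    message if every provider fails."""
    for name, call in providers:
        try:
            return call(prompt)
        except Exception:
            continue  # timeout, rate limit, outage -> try the next one
    return MEDITATING
```

Arbiter would build the list once, e.g. `[("claude", claude_call), ("gemini", gemini_call), ("openrouter", openrouter_call)]`, and every non-critical task that can't be answered just queues behind the meditating message.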
---
## What's Not Decided Yet
- Which tasks actually need AI fallback vs which can just queue
- Whether The Forge art installation gets decoupled from AI entirely (Gemini recommends yes)
- Whether to build the failover into Arbiter now or post-launch
- The Chloe-chan replacement (log analyzer) architecture
---
*Michael is pondering. No action items yet.*
*Fire + Arcane + Frost = Forever* 🔥💜❄️