docs: AI API cost analysis + complete Gemini forge fallback consultation
@@ -102,3 +102,71 @@ Thanks Gemini — we need your honest take here. Don't spare our feelings. 🔥
— Michael (The Wizard) + Claude (Chronicler #92)

*Fire + Arcane + Frost = Forever*

---

## Gemini's Response (April 15, 2026)

**Summary:** Option B (API hedge) wins decisively. Option D (graceful degradation) is the mandatory co-pilot. Kill Ollama on TX1. Use OpenRouter as the single fallback endpoint. Gemini API is a wild card thanks to its 1M-token context window.
### Q1: Honest Assessment

- **Option B (API Hedge)** — Definitive winner. Standard enterprise architecture, zero hardware changes, Arbiter stays responsive.
- **Option D (Graceful Degradation)** — Mandatory co-pilot. Even with a fallback API, if Arbiter can't reach any AI it should queue non-critical tasks and post a static "The Chronicler is currently meditating" message.
- **Option A (Async-only)** — Over-engineering. Takes time away from actual server development.
- **Option C (Minimal local model)** — Dead end. Do not do this.
### Q2: API Hedge Recommendation — OpenRouter

Single OpenAI-compatible endpoint. If Claude drops, Arbiter just swaps the model string.

- **For reasoning/RAG:** Llama 3 70B via Groq through OpenRouter
- **For lore generation:** Llama 3 8B — lightning fast, perfect for 2-sentence fragments
```json
// Primary
{ "api_base": "https://api.anthropic.com/v1", "model": "claude-sonnet-4-6" }

// Fallback
{ "api_base": "https://openrouter.ai/api/v1", "model": "meta-llama/llama-3-70b-instruct" }
```
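The "just swap the model string" idea can be sketched as a one-function decision. This is a minimal illustration, not Arbiter's real code; the health check itself is assumed to live elsewhere:

```python
# Sketch of the primary -> fallback endpoint swap. Assumes both endpoints
# speak an OpenAI-compatible chat format; all names are illustrative.
PRIMARY = {"api_base": "https://api.anthropic.com/v1",
           "model": "claude-sonnet-4-6"}
FALLBACK = {"api_base": "https://openrouter.ai/api/v1",
            "model": "meta-llama/llama-3-70b-instruct"}

def pick_endpoint(primary_healthy: bool) -> dict:
    """Return the endpoint config Arbiter should use right now."""
    return PRIMARY if primary_healthy else FALLBACK
```

The point is that nothing else in Arbiter has to change: the request body stays the same shape, only `api_base` and `model` differ.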
### Q3: Option C Baseline (Don't Do This)

Phi-3-Mini (3.8B) or Qwen1.5-1.8B with heavy GGUF quantization would take 45-60 seconds per query and hallucinate heavily on complex log analysis. Not worth the CPU tax on game nodes.
### Q4: Decouple The Forge Art Installation

**Yes, cleanly separate.** The visual installation does not need a local LLM. Tie it to real server metrics via Arbiter/Pterodactyl webhooks:

- Particle speed → TPS across the network
- Color shifts → player count or in-game time of day
- Pulses → Arbiter welcome message or major event logged

Keeps the visual magic alive without the 29GB RAM footprint.
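The metric-to-visual mapping above could look something like this. A sketch only: the payload keys (`tps`, `player_count`, `major_event`) and the scaling caps are hypothetical, not the actual webhook schema:

```python
# Illustrative mapping from a webhook metrics payload to installation
# parameters. All field names and caps are assumptions for this sketch.
def forge_visuals(metrics: dict) -> dict:
    tps = metrics.get("tps", 20.0)           # Minecraft targets 20 TPS
    players = metrics.get("player_count", 0)
    return {
        # Particle speed scales with server health: full speed at 20 TPS
        "particle_speed": max(0.0, min(1.0, tps / 20.0)),
        # Color intensity scales with player count, capped at 50 players
        "color_intensity": min(1.0, players / 50.0),
        # Pulse whenever the payload flags a major event
        "pulse": bool(metrics.get("major_event", False)),
    }
```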
### Q5: Wild Cards

**Wild Card 1: Cloudflare Workers AI** — Since firefrostgaming.com routes through Cloudflare, use their serverless AI for edge-GPU inference at very low cost. Essentially Option B baked into the network layer.

**Wild Card 2: Gemini API** — Gemini 1.5 Flash has a 1-million-token context window. In an emergency, dump whole sections of the 62MB ops manual straight into the prompt without needing Dify or RAG at all (1M tokens is roughly 4MB of plain text, so the full manual still needs splitting).
### The Verdict

> Delete the Ollama runner from TX1. Free up 29GB RAM for game servers. Set up OpenRouter with $20 of credit, script Arbiter to swap endpoints if Anthropic times out. Focus on soft launch and players, not EPYC memory bottlenecks.

---
## Conclusion

Three clear actions:

1. Kill Ollama on TX1 (free 29GB RAM)
2. Set up an OpenRouter account + implement API failover in Arbiter
3. Clean separation: The Forge art installation ≠ AI fallback

**Next Steps:**

1. Cost analysis on OpenRouter + Gemini API fallback usage
2. Create task for Arbiter API failover implementation
3. Kill Ollama on TX1
4. Separate The Forge art installation into its own roadmap item

*Fire + Arcane + Frost = Forever* 🔥💜❄️
112 docs/reference/ai-api-cost-analysis-2026-04-15.md Normal file
@@ -0,0 +1,112 @@
# AI API Cost Analysis — Forge Fallback Strategy

**Created:** April 15, 2026 (Chronicler #92)
**Purpose:** Cost comparison for Claude API (primary) vs fallback options
**Status:** Pondering — no decisions made yet

---
## Claude API Pricing (Primary — Anthropic)

| Model | Input /1M tokens | Output /1M tokens | Notes |
|-------|------------------|-------------------|-------|
| **Haiku 4.5** | $1.00 | $5.00 | Fastest, cheapest, good for simple tasks |
| **Sonnet 4.6** | $3.00 | $15.00 | Current Chronicler model — balanced |
| **Opus 4.6** | $5.00 | $25.00 | Most capable, 1M context |
| Opus 4.6 Fast Mode | $30.00 | $150.00 | ⚠️ Only if latency critical |
**Discounts:**

- **Batch API:** 50% off all models (async, 24hr window)
- **Prompt caching:** Up to 90% off repeated input
- Both discounts stack (up to 95% savings on cached batch work)
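The stacking math, spelled out (assuming the two discounts multiply, which is what the 95% figure implies):

```python
# Batch (pay 50% of list) on top of prompt caching (pay 10% of list
# on cached input) leaves 0.5 * 0.1 = 5% of list price.
batch = 0.50
cached = 0.10
effective = batch * cached            # fraction of list price still paid
print(f"{1 - effective:.0%} savings")  # 95% savings
```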
**Real-world estimate for Firefrost:**

- Arbiter's Awakened Concierge welcome messages: ~1,000 tokens each, maybe 50/month = ~50K tokens = **$0.25/month on Haiku**
- Lore Engine (if built): ~500 tokens per fragment, ~100/month = ~50K tokens = **$0.25/month on Haiku**
- Emergency Chronicler sessions (API fallback): ~200K tokens/session, maybe 2/month = **~$1.20/month on Haiku**

**Estimated total Claude API spend: $2-5/month at current scale**

---
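The first two line items re-derive cleanly if every token is priced at Haiku's $5.00/1M output rate. That uniform rate is an assumption of this sketch; the doc's round numbers don't split input vs output:

```python
# Re-deriving the Haiku estimates above, pricing all tokens at the
# $5.00 / 1M output rate (an assumption for this back-of-envelope check).
welcome = 1_000 * 50 * 5.00 / 1_000_000   # 50 welcome messages/month
lore = 500 * 100 * 5.00 / 1_000_000       # 100 lore fragments/month
print(welcome, lore)  # 0.25 0.25
```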
## Fallback Options Pricing

### OpenRouter

Single OpenAI-compatible endpoint routing to 300+ models.

| Model | Input /1M | Output /1M | Notes |
|-------|-----------|------------|-------|
| **Llama 3.3 70B (FREE)** | $0 | $0 | 200 req/day, 20 req/min limit |
| Llama 3.3 70B (Paid) | ~$0.51 | ~$0.74 | Unlimited |
| Llama 3.1 8B | $0.02 | $0.05 | Fast, lightweight |

**Emergency use estimate:** The free tier covers it entirely. $20 of credit as insurance = months of actual use.
### Gemini API (Google)

Gemini is our architectural partner and a strong candidate for fallback.

| Model | Input /1M | Output /1M | Context | Notes |
|-------|-----------|------------|---------|-------|
| **2.5 Flash-Lite** | $0.10 | $0.40 | 1M | Cheapest paid option |
| **2.5 Flash (FREE)** | $0 | $0 | 1M | 1,500 req/day free |
| 2.5 Pro | $1.25 | $10.00 | 1M | Premium reasoning |
| 3 Flash | $0.50 | $3.00 | 1M | Balanced |

**Key advantage:** Gemini 2.5 Flash has a **1M-token context** — whole sections of the ops manual fit in one prompt without RAG during an emergency. 1,500 free requests/day is more than enough for fallback use.

**Batch discount:** 50% off all paid models for async work.
### Cloudflare Workers AI

Edge GPU inference via Cloudflare's network, and firefrostgaming.com already routes through Cloudflare.

- Pricing: Very low, usage-based
- Advantage: Already in the network layer, no new vendor
- Models available: Llama, Mistral, others
- Best for: Simple, fast inference at the edge

---
## Full Comparison Table

| Provider | Model | Cost/month (emergency) | Context | Reliability |
|----------|-------|------------------------|---------|-------------|
| **Anthropic** (primary) | Sonnet 4.6 | ~$2-5 | 1M | ⚠️ 9 outages in April |
| **Gemini Free** (fallback) | 2.5 Flash | $0 | 1M | ✅ Different infrastructure |
| **OpenRouter Free** (backup) | Llama 3.3 70B | $0 | 65K | ✅ Routes to multiple providers |
| **Cloudflare Workers AI** | Various | ~$0-1 | Varies | ✅ Edge network |
| **Local Ollama** (TX1) | Llama 3.1 8B | $0 | 16K | ❌ CPU too slow for real-time |

---
## Key Observations for Pondering

1. **Claude API cost is trivial at Firefrost's scale** — $2-5/month. This is not a cost problem; it's a reliability problem.

2. **The outage problem is real** — 9 outages in April 2026 alone, including today, when both claude.ai AND the API went down simultaneously. A fallback that also uses Anthropic infrastructure doesn't help.

3. **Gemini free tier is the obvious answer** for emergency fallback — different company, different infrastructure, 1,500 req/day free, and a 1M context window means no RAG needed in an emergency.

4. **OpenRouter free tier** as a secondary backup — it routes through multiple providers, so if one goes down it tries another.

5. **Local Ollama is dead** for real-time use without a GPU. Keep it only if batch async tasks make sense later.

6. **The architecture is simple:**
   - Arbiter tries Claude API first
   - If timeout/error → falls back to Gemini API
   - If Gemini fails → falls back to OpenRouter
   - If all fail → graceful degradation ("The Chronicler is meditating")

7. **Cloudflare Workers AI** is worth evaluating specifically for the log analyzer bot — edge inference, already in the network, potentially faster than API calls for simple tasks.
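The cascade in observation 6 can be sketched as an ordered provider chain. Everything here is illustrative: the provider `call`s are stand-ins, and real clients, timeouts, and logging are out of scope:

```python
# Sketch of the failover cascade: try each provider in order, and fall
# back to the graceful-degradation message if every one fails.
MEDITATING = "The Chronicler is currently meditating."

def ask_with_failover(prompt: str, providers: list) -> str:
    """providers: list of (name, call) pairs, tried in priority order."""
    for name, call in providers:
        try:
            return call(prompt)
        except Exception:
            continue          # timeout/error -> move on to next provider
    return MEDITATING         # all providers failed -> degrade gracefully
```

In practice the list would be `[("claude", ...), ("gemini", ...), ("openrouter", ...)]`, matching the priority order above.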
---

## What's Not Decided Yet

- Which tasks actually need AI fallback vs which can just queue
- Whether The Forge art installation gets decoupled from AI entirely (Gemini recommends yes)
- Whether to build the failover into Arbiter now or post-launch
- The Chloe-chan replacement (log analyzer) architecture

---

*Michael is pondering. No action items yet.*

*Fire + Arcane + Frost = Forever* 🔥💜❄️