AI API Cost Analysis — Forge Fallback Strategy
Created: April 15, 2026 (Chronicler #92)
Purpose: Cost comparison for Claude API (primary) vs fallback options
Status: Pondering — no decisions made yet
Claude API Pricing (Primary — Anthropic)
| Model | Input /1M tokens | Output /1M tokens | Notes |
|---|---|---|---|
| Haiku 4.5 | $1.00 | $5.00 | Fastest, cheapest, good for simple tasks |
| Sonnet 4.6 | $3.00 | $15.00 | Current Chronicler model — balanced |
| Opus 4.6 | $5.00 | $25.00 | Most capable, 1M context |
| Opus 4.6 Fast Mode | $30.00 | $150.00 | ⚠️ Only if latency critical |
Discounts:
- Batch API: 50% off all models (async, 24hr window)
- Prompt caching: Up to 90% off repeated input
- Both discounts stack (up to 95% savings on cached batch work)
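A minimal sketch of how the stacking works out, assuming the two discounts multiply (50% batch × 90% caching → pay 5% of list, i.e. up to 95% off). The function name and structure are illustrative, not an Anthropic API:

```python
BATCH_DISCOUNT = 0.50    # Batch API: 50% off all models
CACHE_DISCOUNT = 0.90    # Prompt caching: up to 90% off repeated input

def effective_rate(base_per_1m: float, batch: bool = False, cached: bool = False) -> float:
    """Effective $/1M tokens after stacking the applicable discounts
    (assumption: discounts compound multiplicatively)."""
    rate = base_per_1m
    if batch:
        rate *= 1 - BATCH_DISCOUNT
    if cached:
        rate *= 1 - CACHE_DISCOUNT
    return rate

# e.g. Sonnet input at $3.00/1M with both discounts: 3.00 * 0.5 * 0.1 = $0.15/1M
```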
Real-world estimate for Firefrost:
- Arbiter's Awakened Concierge welcome messages: ~1,000 tokens each, maybe 50/month = ~50K tokens = $0.25/month on Haiku
- Lore Engine (if built): ~500 tokens per fragment, ~100/month = ~50K tokens = $0.25/month on Haiku
- Emergency Chronicler sessions (API fallback): ~200K tokens/session, maybe 2/month = ~400K tokens = **$1.20/month on Haiku**
Estimated total Claude API spend: $2-5/month at current scale
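The line items above reduce to one formula. A quick sketch of the arithmetic, assuming the emergency-session figure splits its 400K monthly tokens evenly between input and output (an assumption — the note doesn't give the split, but it's the split that reproduces $1.20 at Haiku rates):

```python
def monthly_cost(tokens_in: int, tokens_out: int, in_per_1m: float, out_per_1m: float) -> float:
    """Monthly spend given token volumes and per-1M-token prices."""
    return tokens_in / 1e6 * in_per_1m + tokens_out / 1e6 * out_per_1m

# Haiku 4.5: $1.00/1M input, $5.00/1M output.
# Emergency sessions: 2 x 200K tokens/month, assumed 50/50 input/output:
#   monthly_cost(200_000, 200_000, 1.00, 5.00) -> $1.20
```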
Fallback Options Pricing
OpenRouter
Single OpenAI-compatible endpoint routing to 300+ models.
| Model | Input /1M | Output /1M | Notes |
|---|---|---|---|
| Llama 3.3 70B (FREE) | $0 | $0 | 200 req/day, 20 req/min limit |
| Llama 3.3 70B (Paid) | ~$0.51 | ~$0.74 | Unlimited |
| Llama 3.1 8B | $0.02 | $0.05 | Fast, lightweight |
Emergency use estimate: Free tier covers it entirely. $20 credit as insurance = months of actual use.
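If the free tier is load-bearing, a client-side guard keeps Arbiter from tripping the limits. A sketch assuming the table's figures (20 req/min, 200 req/day); the class and its defaults are hypothetical, not an OpenRouter SDK:

```python
import time
from collections import deque

class FreeTierLimiter:
    """Client-side guard for the free-tier limits (assumed: 20 req/min, 200 req/day)."""

    def __init__(self, per_min: int = 20, per_day: int = 200):
        self.per_min, self.per_day = per_min, per_day
        self.minute = deque()  # request timestamps within the last 60 s
        self.day = deque()     # request timestamps within the last 24 h

    def allow(self, now=None) -> bool:
        """Return True (and record the request) if both windows have room."""
        now = time.monotonic() if now is None else now
        # Evict timestamps that have aged out of each window.
        while self.minute and now - self.minute[0] >= 60.0:
            self.minute.popleft()
        while self.day and now - self.day[0] >= 86_400.0:
            self.day.popleft()
        if len(self.minute) >= self.per_min or len(self.day) >= self.per_day:
            return False
        self.minute.append(now)
        self.day.append(now)
        return True
```

On rejection the caller can queue the request or drop to the next fallback rather than burn a failed call against the daily cap.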
Gemini API (Google)
Gemini is our architectural partner. Strong candidate for fallback.
| Model | Input /1M | Output /1M | Context | Notes |
|---|---|---|---|---|
| 2.5 Flash-Lite | $0.10 | $0.40 | 1M | Cheapest paid option |
| 2.5 Flash | $0.15 | $0.60 | 1M | ⚠️ "Free" tier now ~250 req/day, prepay required since Apr 1 2026 |
| 2.5 Pro | $1.25 | $10.00 | 1M | Premium reasoning |
| 3 Flash | $0.50 | $3.00 | 1M | Balanced |
Key advantage: Gemini 2.5 Flash's 1M-token context means the entire ops manual fits in one prompt without RAG during an emergency. Caveat (per the table above): the free tier dropped to ~250 req/day as of April 2026 with prepay billing required, but the paid tier at $0.15/$0.60 per 1M tokens is still very cheap.
Batch discount: 50% off all paid models for async work.
Cloudflare Workers AI
Edge GPU inference via Cloudflare's network; relevant because firefrostgaming.com already routes through Cloudflare.
- Pricing: Very low, usage-based
- Advantage: Already in the network layer, no new vendor
- Models available: Llama, Mistral, others
- Best for: Simple, fast inference at the edge
Full Comparison Table
| Provider | Model | Cost/month (emergency) | Context | Reliability |
|---|---|---|---|---|
| Anthropic (primary) | Sonnet 4.6 | ~$2-5 | 1M | ⚠️ 9 outages in April |
| Gemini Paid (fallback) | 2.5 Flash | ~$0.50/month | 1M | ✅ Different infrastructure, cheap paid tier |
| OpenRouter Free (backup) | Llama 3.3 70B | $0 | 65K | ✅ Routes to multiple providers |
| Cloudflare Workers AI | Various | ~$0-1 | Varies | ✅ Edge network |
| Local Ollama (TX1) | Llama 3.1 8B | $0 | 16K | ❌ CPU too slow for real-time |
Key Observations for Pondering
- Claude API cost is trivial at Firefrost's scale ($2-5/month). This is not a cost problem; it's a reliability problem.
- The outage problem is real: 9 outages in April 2026 alone, including today, when both claude.ai AND the API went down simultaneously. A fallback that also runs on Anthropic infrastructure doesn't help.
- Gemini's paid tier is still very cheap for emergency fallback: $0.15/$0.60 per 1M tokens, different infrastructure from Anthropic, 1M context. The "free" tier is misleading: prepay has been required since April 1, 2026, and it allows only ~250 req/day.
- OpenRouter's free tier works as a secondary backup: it routes through multiple providers, so if one goes down it tries another.
- Local Ollama is dead for real-time use without a GPU. Keep it only if batch async tasks make sense later.
- The architecture is simple:
  - Arbiter tries the Claude API first
  - On timeout/error → fall back to the Gemini API
  - If Gemini fails → fall back to OpenRouter
  - If all fail → graceful degradation ("The Chronicler is meditating")
- Cloudflare Workers AI is worth evaluating specifically for the log analyzer bot: edge inference, already in the network, potentially faster than API calls for simple tasks.
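The failover chain above fits in a few lines. A minimal sketch: the provider callables are placeholders standing in for real SDK calls (Claude → Gemini → OpenRouter), and the error handling is deliberately coarse:

```python
FALLBACK_MESSAGE = "The Chronicler is meditating."

def ask_with_fallback(prompt: str, providers) -> str:
    """Try each (name, callable) provider in order; on any exception move to
    the next, and degrade gracefully if every provider fails."""
    for name, call in providers:
        try:
            return call(prompt)
        except Exception:
            continue  # timeout/error -> next provider in the chain
    return FALLBACK_MESSAGE
```

In Arbiter the placeholders would become real client calls with per-provider timeouts, and the bare `except Exception` would narrow to timeout/HTTP-error types so genuine bugs still surface.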
What's Not Decided Yet
- Which tasks actually need AI fallback vs which can just queue
- Whether The Forge art installation gets decoupled from AI entirely (Gemini recommends yes)
- Whether to build the failover into Arbiter now or post-launch
- The Chloe-chan replacement (log analyzer) architecture
Michael is pondering. No action items yet.
Fire + Arcane + Frost = Forever 🔥💜❄️