
AI API Cost Analysis — Forge Fallback Strategy

Created: April 15, 2026 (Chronicler #92)
Purpose: Cost comparison for Claude API (primary) vs fallback options
Status: Pondering — no decisions made yet


Claude API Pricing (Primary — Anthropic)

| Model | Input /1M tokens | Output /1M tokens | Notes |
|---|---|---|---|
| Haiku 4.5 | $1.00 | $5.00 | Fastest, cheapest, good for simple tasks |
| Sonnet 4.6 | $3.00 | $15.00 | Current Chronicler model — balanced |
| Opus 4.6 | $5.00 | $25.00 | Most capable, 1M context |
| Opus 4.6 Fast Mode | $30.00 | $150.00 | ⚠️ Only if latency critical |

Discounts:

  • Batch API: 50% off all models (async, 24hr window)
  • Prompt caching: Up to 90% off repeated input
  • Both discounts stack: batch's 50% applies on top of caching's ~90%, so cached batch input can bill at roughly 0.5 × 0.1 = 5% of list price (up to 95% savings)
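
For reference, a minimal sketch of how the caching discount actually gets claimed on the Anthropic Messages API: mark the big, repeated prefix (e.g. the ops manual) with `cache_control` so repeat calls bill it at the cached-input rate. The model id `claude-haiku-4-5`, the env var, and `OPS_MANUAL_TEXT` are placeholder assumptions, not verified values.

```ts
// Hedged sketch: prompt caching on the Anthropic Messages API.
const OPS_MANUAL_TEXT = "..."; // placeholder for the large, unchanging prefix

async function draftWelcome(userPrompt: string): Promise<string> {
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY!, // assumed env var
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-haiku-4-5", // placeholder id for Haiku 4.5
      max_tokens: 512,
      system: [
        {
          type: "text",
          text: OPS_MANUAL_TEXT,
          // Mark the unchanging prefix cacheable; repeat calls within the
          // cache window bill this part at the discounted cached-input rate.
          cache_control: { type: "ephemeral" },
        },
      ],
      messages: [{ role: "user", content: userPrompt }],
    }),
  });
  if (!res.ok) throw new Error(`Anthropic API ${res.status}`);
  const data = await res.json();
  return data.content[0].text;
}
```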

Real-world estimate for Firefrost:

  • Arbiter's Awakened Concierge welcome messages: ~1,000 tokens each, maybe 50/month = ~50K tokens = $0.25/month on Haiku
  • Lore Engine (if built): ~500 tokens per fragment, ~100/month = ~50K tokens = $0.25/month on Haiku
  • Emergency Chronicler sessions (API fallback): 200K tokens/session, maybe 2/month = $1.20/month on Haiku

Estimated total Claude API spend: $2-5/month at current scale
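
To make the arithmetic reproducible, here is a small cost sketch. The 50/50 input/output split for emergency sessions is an assumption (the estimates above only give total tokens per call); it happens to reproduce the $1.20 figure.

```ts
// Haiku 4.5 list prices from the table above, USD per 1M tokens.
const HAIKU = { inPerMTok: 1.0, outPerMTok: 5.0 };

// Monthly cost for a workload given per-call token counts and call volume.
function monthlyUsd(
  inTokensPerCall: number,
  outTokensPerCall: number,
  callsPerMonth: number,
  price: { inPerMTok: number; outPerMTok: number },
): number {
  const perCall =
    (inTokensPerCall * price.inPerMTok + outTokensPerCall * price.outPerMTok) / 1_000_000;
  return perCall * callsPerMonth;
}

// Emergency Chronicler: 200K tokens/session, 2 sessions/month,
// assumed 50/50 input/output split.
console.log(monthlyUsd(100_000, 100_000, 2, HAIKU).toFixed(2)); // "1.20"
```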


Fallback Options Pricing

OpenRouter

Single OpenAI-compatible endpoint routing to 300+ models.

| Model | Input /1M | Output /1M | Notes |
|---|---|---|---|
| Llama 3.3 70B (FREE) | $0 | $0 | 200 req/day, 20 req/min limit |
| Llama 3.3 70B (Paid) | ~$0.51 | ~$0.74 | Unlimited |
| Llama 3.1 8B | $0.02 | $0.05 | Fast, lightweight |

Emergency use estimate: the free tier covers it entirely. A $20 credit held as insurance would fund months of actual use.
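
An emergency call through OpenRouter's OpenAI-compatible endpoint would look roughly like this sketch. The free-tier model slug and env var name are assumptions; confirm slugs on openrouter.ai before relying on them.

```ts
// Hedged sketch: chat completion via OpenRouter's OpenAI-compatible API.
async function askOpenRouter(prompt: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`, // assumed env var
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "meta-llama/llama-3.3-70b-instruct:free", // assumed free-tier slug
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`OpenRouter ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```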

Gemini API (Google)

Gemini is our architectural partner and a strong candidate for fallback.

| Model | Input /1M | Output /1M | Context | Notes |
|---|---|---|---|---|
| 2.5 Flash-Lite | $0.10 | $0.40 | 1M | Cheapest paid option |
| 2.5 Flash | $0.15 | $0.60 | 1M | ⚠️ "Free" tier now ~250 req/day, prepay required since Apr 1 2026 |
| 2.5 Pro | $1.25 | $10.00 | 1M | Premium reasoning |
| 3 Flash | $0.50 | $3.00 | 1M | Balanced |

Key advantage: Gemini 2.5 Flash has a 1M-token context window, so the entire ops manual fits in one prompt with no RAG needed during an emergency. ⚠️ The free tier was reduced to ~250 req/day as of April 2026 and prepay billing is required. Paid pricing is $0.15/$0.60 per 1M tokens, still very cheap.

Batch discount: 50% off all paid models for async work.
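
A minimal sketch of leaning on that 1M context in an emergency: push the whole ops manual plus the question into one `generateContent` call, no RAG. The model id string `gemini-2.5-flash` and the env var are assumptions to check against Google's current docs.

```ts
// Hedged sketch: whole-manual prompt to Gemini over the REST API.
async function askGemini(opsManual: string, question: string): Promise<string> {
  const url =
    "https://generativelanguage.googleapis.com/v1beta/models/" +
    `gemini-2.5-flash:generateContent?key=${process.env.GEMINI_API_KEY}`; // assumed id
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      contents: [
        // 1M-token window: the entire manual fits as inline context.
        { role: "user", parts: [{ text: `${opsManual}\n\n${question}` }] },
      ],
    }),
  });
  if (!res.ok) throw new Error(`Gemini API ${res.status}`);
  const data = await res.json();
  return data.candidates[0].content.parts[0].text;
}
```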

Cloudflare Workers AI

Edge GPU inference via Cloudflare's network. Worth considering because firefrostgaming.com already routes through Cloudflare.

  • Pricing: Very low, usage-based
  • Advantage: Already in the network layer, no new vendor
  • Models available: Llama, Mistral, others
  • Best for: Simple, fast inference at the edge
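
As a sketch of what edge inference could look like for something like the log analyzer: a Worker with an AI binding calling a hosted Llama model. The binding name `AI` and the model id are assumptions to verify against the wrangler config and Cloudflare's model catalog.

```ts
// Hedged Workers AI sketch: inference inside a Cloudflare Worker.
// Assumes a wrangler.toml binding named "AI" and types from @cloudflare/workers-types.
export interface Env {
  AI: Ai;
}

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const logLine = await req.text();
    const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      prompt: `Classify this log line as INFO/WARN/ERROR and explain briefly:\n${logLine}`,
    });
    return Response.json(result); // text models return { response: "..." }
  },
};
```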

Full Comparison Table

| Provider | Model | Cost/month (emergency) | Context | Reliability |
|---|---|---|---|---|
| Anthropic (primary) | Sonnet 4.6 | ~$2-5 | 1M | ⚠️ 9 outages in April |
| Gemini Paid (fallback) | 2.5 Flash | ~$0.50 | 1M | Different infrastructure, cheap paid tier |
| OpenRouter Free (backup) | Llama 3.3 70B | $0 | 65K | Routes to multiple providers |
| Cloudflare Workers AI | Various | ~$0-1 | Varies | Edge network |
| Local Ollama (TX1) | Llama 3.1 8B | $0 | 16K | CPU too slow for real-time |

Key Observations for Pondering

  1. Claude API cost is trivial at Firefrost's scale — $2-5/month. Not a cost problem, a reliability problem.

  2. The outage problem is real — 9 outages in April 2026 alone, including today where both claude.ai AND the API went down simultaneously. A fallback that also uses Anthropic infrastructure doesn't help.

  3. Gemini paid tier is still very cheap for emergency fallback — $0.15/$0.60 per 1M tokens, different infrastructure from Anthropic, 1M context. "Free" tier is misleading — prepay required since April 1 2026, only 250 req/day.

  4. OpenRouter free tier as a secondary backup — routes through multiple providers; if one goes down, it tries another.

  5. Local Ollama is dead for real-time use without a GPU. Keep it only in case batch/async tasks make sense later.

  6. The architecture is simple (a sketch follows this list):

    • Arbiter tries Claude API first
    • If timeout/error → falls back to Gemini API
    • If Gemini fails → falls back to OpenRouter
    • If all fail → graceful degradation ("The Chronicler is meditating")
  7. Cloudflare Workers AI is worth evaluating specifically for the log analyzer bot — edge inference, already in the network, potentially faster than API calls for simple tasks.
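
A minimal sketch of the chain from observation 6, assuming each provider is wrapped in a common `(prompt) => Promise<string>` function (like the `askGemini` and `askOpenRouter` sketches above; `askClaude` is assumed) and that 15 seconds is an acceptable per-provider budget:

```ts
type Provider = (prompt: string) => Promise<string>;

// Reject if a provider hangs past the budget so the chain can move on.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) => setTimeout(() => reject(new Error("timeout")), ms)),
  ]);
}

// Try each provider in order; fall through on timeout or API error.
async function askWithFallback(prompt: string, chain: Provider[]): Promise<string> {
  for (const provider of chain) {
    try {
      return await withTimeout(provider(prompt), 15_000); // assumed budget
    } catch {
      // Timeout or error: try the next provider in the chain.
    }
  }
  return "The Chronicler is meditating."; // graceful degradation
}

// Usage: askWithFallback(prompt, [askClaude, askGemini, askOpenRouter]);
```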


What's Not Decided Yet

  • Which tasks actually need AI fallback vs which can just queue
  • Whether The Forge art installation gets decoupled from AI entirely (Gemini recommends yes)
  • Whether to build the failover into Arbiter now or post-launch
  • The Chloe-chan replacement (log analyzer) architecture

Michael is pondering. No action items yet.
Fire + Arcane + Frost = Forever 🔥💜❄️