
AI API Cost Analysis — Forge Fallback Strategy

Created: April 15, 2026 (Chronicler #92)
Purpose: Cost comparison for Claude API (primary) vs fallback options
Status: Pondering — no decisions made yet


Claude API Pricing (Primary — Anthropic)

| Model | Input /1M tokens | Output /1M tokens | Notes |
|---|---|---|---|
| Haiku 4.5 | $1.00 | $5.00 | Fastest, cheapest, good for simple tasks |
| Sonnet 4.6 | $3.00 | $15.00 | Current Chronicler model — balanced |
| Opus 4.6 | $5.00 | $25.00 | Most capable, 1M context |
| Opus 4.6 Fast Mode | $30.00 | $150.00 | ⚠️ Only if latency critical |

Discounts:

  • Batch API: 50% off all models (async, 24hr window)
  • Prompt caching: Up to 90% off repeated input
  • Both discounts stack: batch's 50% applies on top of caching's ~90%, so cached batch input can bill at roughly 0.5 × 0.1 = 5% of list price (up to 95% savings)
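
For reference, a minimal sketch of how the caching discount actually gets claimed on the Anthropic Messages API: mark the big, repeated prefix (e.g. the ops manual) with `cache_control` so repeat calls bill it at the cached-input rate. The model id `claude-haiku-4-5`, the env var, and `OPS_MANUAL_TEXT` are placeholder assumptions, not verified values.

```ts
// Hedged sketch: prompt caching on the Anthropic Messages API.
const OPS_MANUAL_TEXT = "..."; // placeholder for the large, unchanging prefix

async function draftWelcome(userPrompt: string): Promise<string> {
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY!, // assumed env var
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-haiku-4-5", // placeholder id for Haiku 4.5
      max_tokens: 512,
      system: [
        {
          type: "text",
          text: OPS_MANUAL_TEXT,
          // Mark the unchanging prefix cacheable; repeat calls within the
          // cache window bill this part at the discounted cached-input rate.
          cache_control: { type: "ephemeral" },
        },
      ],
      messages: [{ role: "user", content: userPrompt }],
    }),
  });
  if (!res.ok) throw new Error(`Anthropic API ${res.status}`);
  const data = await res.json();
  return data.content[0].text;
}
```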

Real-world estimate for Firefrost:

  • Arbiter's Awakened Concierge welcome messages: ~1,000 tokens each, maybe 50/month = ~50K tokens = $0.25/month on Haiku
  • Lore Engine (if built): ~500 tokens per fragment, ~100/month = ~50K tokens = $0.25/month on Haiku
  • Emergency Chronicler sessions (API fallback): 200K tokens/session, maybe 2/month = $1.20/month on Haiku

Estimated total Claude API spend: $2-5/month at current scale
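
To make the arithmetic reproducible, here is a small cost sketch. The 50/50 input/output split for emergency sessions is an assumption (the estimates above only give total tokens per call); it happens to reproduce the $1.20 figure.

```ts
// Haiku 4.5 list prices from the table above, USD per 1M tokens.
const HAIKU = { inPerMTok: 1.0, outPerMTok: 5.0 };

// Monthly cost for a workload given per-call token counts and call volume.
function monthlyUsd(
  inTokensPerCall: number,
  outTokensPerCall: number,
  callsPerMonth: number,
  price: { inPerMTok: number; outPerMTok: number },
): number {
  const perCall =
    (inTokensPerCall * price.inPerMTok + outTokensPerCall * price.outPerMTok) / 1_000_000;
  return perCall * callsPerMonth;
}

// Emergency Chronicler: 200K tokens/session, 2 sessions/month,
// assumed 50/50 input/output split.
console.log(monthlyUsd(100_000, 100_000, 2, HAIKU).toFixed(2)); // "1.20"
```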


Fallback Options Pricing

OpenRouter

Single OpenAI-compatible endpoint routing to 300+ models.

| Model | Input /1M | Output /1M | Notes |
|---|---|---|---|
| Llama 3.3 70B (FREE) | $0 | $0 | 200 req/day, 20 req/min limit |
| Llama 3.3 70B (Paid) | ~$0.51 | ~$0.74 | Unlimited |
| Llama 3.1 8B | $0.02 | $0.05 | Fast, lightweight |

Emergency use estimate: the free tier covers it entirely. A $20 credit held as insurance would fund months of actual use.
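
An emergency call through OpenRouter's OpenAI-compatible endpoint would look roughly like this sketch. The free-tier model slug and env var name are assumptions; confirm slugs on openrouter.ai before relying on them.

```ts
// Hedged sketch: chat completion via OpenRouter's OpenAI-compatible API.
async function askOpenRouter(prompt: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`, // assumed env var
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "meta-llama/llama-3.3-70b-instruct:free", // assumed free-tier slug
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`OpenRouter ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```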

Gemini API (Google)

Gemini is our architectural partner and a strong candidate for fallback.

| Model | Input /1M | Output /1M | Context | Notes |
|---|---|---|---|---|
| 2.5 Flash-Lite | $0.10 | $0.40 | 1M | Cheapest paid option |
| 2.5 Flash | $0.15 | $0.60 | 1M | ⚠️ "Free" tier now ~250 req/day, prepay required since Apr 1 2026 |
| 2.5 Pro | $1.25 | $10.00 | 1M | Premium reasoning |
| 3 Flash | $0.50 | $3.00 | 1M | Balanced |

Key advantage: Gemini 2.5 Flash has a 1M-token context window, so the entire ops manual fits in one prompt with no RAG needed during an emergency. ⚠️ The free tier was reduced to ~250 req/day as of April 2026 and prepay billing is required. Paid pricing is $0.15/$0.60 per 1M tokens, still very cheap.

Batch discount: 50% off all paid models for async work.
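
A minimal sketch of leaning on that 1M context in an emergency: push the whole ops manual plus the question into one `generateContent` call, no RAG. The model id string `gemini-2.5-flash` and the env var are assumptions to check against Google's current docs.

```ts
// Hedged sketch: whole-manual prompt to Gemini over the REST API.
async function askGemini(opsManual: string, question: string): Promise<string> {
  const url =
    "https://generativelanguage.googleapis.com/v1beta/models/" +
    `gemini-2.5-flash:generateContent?key=${process.env.GEMINI_API_KEY}`; // assumed id
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      contents: [
        // 1M-token window: the entire manual fits as inline context.
        { role: "user", parts: [{ text: `${opsManual}\n\n${question}` }] },
      ],
    }),
  });
  if (!res.ok) throw new Error(`Gemini API ${res.status}`);
  const data = await res.json();
  return data.candidates[0].content.parts[0].text;
}
```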

Cloudflare Workers AI

Edge GPU inference via Cloudflare's network. Worth considering because firefrostgaming.com already routes through Cloudflare.

  • Pricing: Very low, usage-based
  • Advantage: Already in the network layer, no new vendor
  • Models available: Llama, Mistral, others
  • Best for: Simple, fast inference at the edge
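
As a sketch of what edge inference could look like for something like the log analyzer: a Worker with an AI binding calling a hosted Llama model. The binding name `AI` and the model id are assumptions to verify against the wrangler config and Cloudflare's model catalog.

```ts
// Hedged Workers AI sketch: inference inside a Cloudflare Worker.
// Assumes a wrangler.toml binding named "AI" and types from @cloudflare/workers-types.
export interface Env {
  AI: Ai;
}

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const logLine = await req.text();
    const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      prompt: `Classify this log line as INFO/WARN/ERROR and explain briefly:\n${logLine}`,
    });
    return Response.json(result); // text models return { response: "..." }
  },
};
```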

Full Comparison Table

| Provider | Model | Cost/month (emergency) | Context | Reliability |
|---|---|---|---|---|
| Anthropic (primary) | Sonnet 4.6 | ~$2-5 | 1M | ⚠️ 9 outages in April |
| Gemini Paid (fallback) | 2.5 Flash | ~$0.50 | 1M | Different infrastructure, cheap paid tier |
| OpenRouter Free (backup) | Llama 3.3 70B | $0 | 65K | Routes to multiple providers |
| Cloudflare Workers AI | Various | ~$0-1 | Varies | Edge network |
| Local Ollama (TX1) | Llama 3.1 8B | $0 | 16K | CPU too slow for real-time |

Key Observations for Pondering

  1. Claude API cost is trivial at Firefrost's scale — $2-5/month. Not a cost problem, a reliability problem.

  2. The outage problem is real — 9 outages in April 2026 alone, including today where both claude.ai AND the API went down simultaneously. A fallback that also uses Anthropic infrastructure doesn't help.

  3. Gemini paid tier is still very cheap for emergency fallback — $0.15/$0.60 per 1M tokens, different infrastructure from Anthropic, 1M context. "Free" tier is misleading — prepay required since April 1 2026, only 250 req/day.

  4. OpenRouter free tier as a secondary backup — routes through multiple providers; if one goes down, it tries another.

  5. Local Ollama is dead for real-time use without a GPU. Keep it only in case batch/async tasks make sense later.

  6. The architecture is simple (a sketch follows this list):

    • Arbiter tries Claude API first
    • If timeout/error → falls back to Gemini API
    • If Gemini fails → falls back to OpenRouter
    • If all fail → graceful degradation ("The Chronicler is meditating")
  7. Cloudflare Workers AI is worth evaluating specifically for the log analyzer bot — edge inference, already in the network, potentially faster than API calls for simple tasks.
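
A minimal sketch of the chain from observation 6, assuming each provider is wrapped in a common `(prompt) => Promise<string>` function (like the `askGemini` and `askOpenRouter` sketches above; `askClaude` is assumed) and that 15 seconds is an acceptable per-provider budget:

```ts
type Provider = (prompt: string) => Promise<string>;

// Reject if a provider hangs past the budget so the chain can move on.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) => setTimeout(() => reject(new Error("timeout")), ms)),
  ]);
}

// Try each provider in order; fall through on timeout or API error.
async function askWithFallback(prompt: string, chain: Provider[]): Promise<string> {
  for (const provider of chain) {
    try {
      return await withTimeout(provider(prompt), 15_000); // assumed budget
    } catch {
      // Timeout or error: try the next provider in the chain.
    }
  }
  return "The Chronicler is meditating."; // graceful degradation
}

// Usage: askWithFallback(prompt, [askClaude, askGemini, askOpenRouter]);
```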


What's Not Decided Yet

  • Which tasks actually need AI fallback vs which can just queue
  • Whether The Forge art installation gets decoupled from AI entirely (Gemini recommends yes)
  • Whether to build the failover into Arbiter now or post-launch
  • The Chloe-chan replacement (log analyzer) architecture

Michael is pondering. No action items yet.
Fire + Arcane + Frost = Forever 🔥💜❄️