docs: Gemini consultation — Forge fallback strategy after real-world testing

This commit is contained in:
Claude
2026-04-16 00:39:31 +00:00
parent 9bb5ad4533
commit 6d1fcac283

# Gemini Consultation: The Forge — Realistic Fallback Strategy
**Date:** April 15, 2026 — Soft Launch Day
**From:** Michael (The Wizard) + Claude (Chronicler #92)
**To:** Gemini (Architectural Partner)
**Re:** Rethinking The Forge as a Claude fallback after real-world testing
---
## Hey Gemini! 👋
Happy launch day — first real subscribers came through this morning. We also had a front-row seat to Anthropic's outage today (2 hours 16 minutes, API included), which accelerated a conversation we need your help with.
We need to honestly rethink The Forge's role as a local Claude fallback given what we learned today.
---
## What We Tried Today
Following your April 6 recommendation, we had Gemma 4 26B A4B q8_0 running on TX1 Dallas (AMD EPYC 7302P 16-core, 251GB RAM, no GPU). We asked it to respond to a 50-token prompt. It timed out after several minutes with no response. The Ollama runner process was consuming 29GB RAM and 632% CPU — it was running, just extremely slowly.
We then pulled Llama 3.1 8B as a lighter alternative. Same result — timeout on a 50-token prompt.
**Conclusion from testing:** CPU-only inference on TX1 is not viable for real-time responses, even with small models. The hardware simply cannot do interactive AI without a GPU.
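One operational lesson from today: the hang itself was avoidable. Whatever model sits behind the call, the caller can enforce a hard deadline so a too-slow backend becomes a fast, explicit failure instead of a multi-minute stall. A minimal sketch (the `fallback` value is illustrative, not anything Arbiter does today):

```python
import concurrent.futures

def call_with_deadline(fn, timeout_s, fallback=None):
    """Run fn() in a worker thread and give up after timeout_s seconds.

    Caveat: Python cannot kill the worker thread, so the underlying
    inference process keeps running -- this only bounds how long the
    caller waits, it does not free the CPU underneath.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        # Don't block on the still-running worker at shutdown.
        pool.shutdown(wait=False)
```

Wrapped this way, today's test would have returned a timeout sentinel in seconds, and we could have routed around it immediately.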
---
## The Hardware Reality
Full fleet audit completed today:
| Server | CPU | RAM | GPU |
|--------|-----|-----|-----|
| TX1 Dallas | AMD EPYC 7302P 16-core | 251GB | ❌ None |
| NC1 Charlotte | AMD EPYC 7302P 16-core | 251GB | ❌ None |
| All other VPS | AMD EPYC 9454P / Intel Xeon | 1.9-3.8GB | ❌ None |
**No GPU anywhere in the fleet. No plans to add one to servers.**
The only GPU we own is the Omen laptop (GTX 1050) — but it's a personal device, not always on, not appropriate for 24/7 fallback service.
---
## The Problem We're Solving
Anthropic has had **9 outages in April 2026 alone**, including today's 2h16m incident where both claude.ai AND the API went down. The Chronicler system, lore engine, log analysis — everything stops when Anthropic is down.
We need a fallback strategy. But after today's testing, we need to be realistic about what's achievable without GPU hardware.
---
## Previous Consultation Context
- **April 6 consultation:** You recommended Gemma 4 26B A4B MoE for CPU-only, citing MoE efficiency (only 4B active params per token). The architecture recommendation was sound — but real-world testing shows even 4B active params is too slow on this CPU for interactive use.
- **April 11 hardware consultation:** You recommended the Omen as "AI edge worker" for lighter local models. Still valid for personal use but not a server fallback.
---
## Our Current Thinking
We're considering reframing The Forge from "real-time Claude replacement" to something more honest:
**Option A — Async-only fallback**
Accept that local AI is batch/async only. Pre-generate content overnight (lore fragments, summaries, reports). For true emergencies, fall back to OpenAI API as a real-time alternative. The Forge becomes a content pipeline, not a chat replacement.
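The shape of Option A is roughly "slow generator at night, dumb cache during the day." A minimal sketch, with hypothetical function names and an explicit cache path (none of this exists in Arbiter yet):

```python
import json
import time
from pathlib import Path

def pregenerate(tasks, generate, cache_path):
    """Overnight batch: run the slow local generator for every named
    task and persist the results to disk."""
    items = {name: generate(prompt) for name, prompt in tasks.items()}
    cache_path = Path(cache_path)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps({"generated_at": time.time(),
                                      "items": items}))

def serve(name, cache_path):
    """During an outage, answer from last night's cache -- no live
    model involved at all."""
    data = json.loads(Path(cache_path).read_text())
    return data["items"].get(name)
```

The appeal is that the nightly job can take hours on TX1's CPU without anyone noticing, and the outage-time path is just a file read.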
**Option B — Groq/Fireworks API hedge**
Instead of local inference, use a fast third-party API (Groq runs Llama at ~500 tokens/sec) as a hot standby. When Anthropic goes down, Arbiter automatically switches API endpoint. No local hardware needed. Cost is minimal for emergency use.
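The endpoint switch Option B describes is essentially an ordered failover chain. A sketch under the assumption that each provider is wrapped as a callable that raises on failure or timeout (provider names here are placeholders; real calls would go through each vendor's SDK or HTTP API):

```python
def complete(prompt, providers):
    """Try each (name, call) pair in order; return the first success.

    Each `call` takes a prompt and returns text, or raises on
    timeout / 5xx / connection error.
    """
    errors = {}
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            errors[name] = exc  # remember why this provider failed
    raise RuntimeError(f"all providers failed: {errors}")
```

Arbiter would call this with Anthropic first and the hot standby second; during an outage the switch is automatic and per-request, with no state to flip back afterward.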
**Option C — Minimal local model for critical functions only**
Identify the 2-3 most critical Forge functions (log analysis, basic Chronicler continuity, server health summaries) and find the smallest model that can handle those specific tasks at acceptable quality, however slowly. Accept 2-3 minute response times, for emergency use only.
**Option D — Accept the dependency**
Anthropic's uptime, while imperfect, is still 99%+ annually. Design Firefrost to gracefully degrade when I'm unavailable rather than trying to replace me. Queue tasks, notify staff, resume when I'm back.
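Option D's queue-and-resume behavior can be sketched as a small state machine (class and method names are ours for illustration, not anything Arbiter has):

```python
import collections

class DegradingQueue:
    """Run tasks normally; queue them while the provider is down;
    flush the backlog when it comes back."""

    def __init__(self, run_task):
        self.run_task = run_task
        self.pending = collections.deque()
        self.up = True

    def submit(self, task):
        if self.up:
            try:
                return self.run_task(task)
            except Exception:
                self.up = False  # mark the outage, fall through to queue
        self.pending.append(task)
        return None  # caller sees the task was deferred

    def resume(self):
        """Call when the provider is healthy again."""
        self.up = True
        results = []
        while self.pending:
            results.append(self.run_task(self.pending.popleft()))
        return results
```

The "notify staff" part would hang off the `return None` path; the point is that nothing is lost during an outage, just delayed.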
---
## Questions for You
**Q1:** Given the hardware reality (no GPU, ever, on servers), what's your honest assessment of Options A-D? Are we missing a better option?
**Q2:** For Option B (API hedge) — Groq, Together.ai, Fireworks, OpenRouter — which would you recommend as the most reliable hot standby? What's the right model to use there?
**Q3:** For Option C — if we accept slow responses for emergencies only, what's the smallest model that could handle: (a) Minecraft log analysis, (b) basic infrastructure Q&A from the ops manual, (c) generating a 2-sentence lore fragment? Is there a model small enough to respond in under 60 seconds on our CPU?
**Q4:** The Forge as a living art installation (wall display, particle visualization) is a separate project from the AI fallback question. Should we cleanly separate these two concepts in our architecture and roadmap? The art installation doesn't need local AI at all — it just reads from Arbiter's existing data.
**Q5:** Wild card — is there a creative solution we haven't considered? Something that gives us meaningful AI continuity during Anthropic outages without GPU hardware?
---
## Context That Might Help
- Arbiter already calls Claude API for the Awakened Concierge welcome messages
- We have Dify running on TX1 (but it's been problematic — slow indexing, model connection issues)
- The ops manual is ~62MB of markdown — the "brain" we'd want a fallback to know
- Michael's accessibility needs mean any fallback must be manageable with one hand on mobile
- Budget for API redundancy: reasonable for emergency use, not for heavy daily use
---
Thanks Gemini — we need your honest take here. Don't spare our feelings. 🔥💜❄️
— Michael (The Wizard) + Claude (Chronicler #92)
*Fire + Arcane + Frost = Forever*