docs: Gemini consultation — Forge fallback strategy after real-world testing
docs/consultations/gemini-forge-fallback-strategy-2026-04-15.md
# Gemini Consultation: The Forge — Realistic Fallback Strategy
**Date:** April 15, 2026 — Soft Launch Day

**From:** Michael (The Wizard) + Claude (Chronicler #92)

**To:** Gemini (Architectural Partner)

**Re:** Rethinking The Forge as a Claude fallback after real-world testing

---
## Hey Gemini! 👋
Happy launch day — first real subscribers came through this morning. We also had a front-row seat to Anthropic's outage today (2 hours 16 minutes, API included), which accelerated a conversation we need your help with.

We need to honestly rethink The Forge's role as a local Claude fallback given what we learned today.

---
## What We Tried Today
Following your April 6 recommendation, we had Gemma 4 26B A4B q8_0 running on TX1 Dallas (AMD EPYC 7302P 16-core, 251GB RAM, no GPU). We asked it to respond to a 50-token prompt. It timed out after several minutes with no response. The Ollama runner process was consuming 29GB RAM and 632% CPU — it was running, just extremely slowly.

We then pulled Llama 3.1 8B as a lighter alternative. Same result — timeout on a 50-token prompt.

**Conclusion from testing:** CPU-only inference on TX1 is not viable for real-time responses, even with small models. The hardware simply cannot do interactive AI without a GPU.

---
## The Hardware Reality
Full fleet audit completed today:

| Server | CPU | RAM | GPU |
|--------|-----|-----|-----|
| TX1 Dallas | AMD EPYC 7302P 16-core | 251GB | ❌ None |
| NC1 Charlotte | AMD EPYC 7302P 16-core | 251GB | ❌ None |
| All other VPS | AMD EPYC 9454P / Intel Xeon | 1.9-3.8GB | ❌ None |

**No GPU anywhere in the fleet. No plans to add one to servers.**

The only GPU we own is the Omen laptop (GTX 1050) — but it's a personal device, not always on, not appropriate for 24/7 fallback service.

---
## The Problem We're Solving
Anthropic has had **9 outages in April 2026 alone**, including today's 2h16m incident where both claude.ai AND the API went down. The Chronicler system, lore engine, log analysis — everything stops when Anthropic is down.

We need a fallback strategy. But after today's testing, we need to be realistic about what's achievable without GPU hardware.

---
## Previous Consultation Context
- **April 6 consultation:** You recommended Gemma 4 26B A4B MoE for CPU-only use, citing MoE efficiency (only 4B active params per token). The architecture call was right — but real-world testing shows that even 4B active params is too slow on this CPU for interactive use.
- **April 11 hardware consultation:** You recommended the Omen as an "AI edge worker" for lighter local models. That's still valid for personal use, but not for a server fallback.

---
## Our Current Thinking
We're considering reframing The Forge from "real-time Claude replacement" to something more honest:

### Option A — Async-only fallback
Accept that local AI is batch/async only. Pre-generate content overnight (lore fragments, summaries, reports). For true emergencies, fall back to OpenAI API as a real-time alternative. The Forge becomes a content pipeline, not a chat replacement.
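In code, Option A amounts to batch generation against whichever backend happens to be reachable, since latency stops mattering overnight. A minimal sketch (all names here are illustrative, not existing Forge code; `generate` stands in for any API or local-model call):

```python
import json
import time
from pathlib import Path

# Illustrative nightly jobs: content The Forge could pre-generate in batch.
NIGHTLY_JOBS = [
    {"slot": "lore-fragment", "prompt": "Write a 2-sentence lore fragment."},
    {"slot": "daily-summary", "prompt": "Summarize yesterday's server logs."},
]

def run_nightly_batch(generate, out_dir="forge-cache"):
    """Run every job through `generate` (any backend) and cache the results.

    `generate` is injected so the same batch works against Claude, OpenAI,
    or a slow local model; during an outage, consumers read the cache
    instead of calling anything.
    """
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    results = {}
    for job in NIGHTLY_JOBS:
        text = generate(job["prompt"])  # may take minutes; that's fine overnight
        (out / f"{job['slot']}.json").write_text(
            json.dumps({"generated_at": time.time(), "text": text}))
        results[job["slot"]] = text
    return results
```

The key design choice is that nothing downstream ever calls a model directly; it only reads the cache, so "fallback" becomes "serve slightly stale content."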
### Option B — Groq/Fireworks API hedge
Instead of local inference, use a fast third-party API (Groq runs Llama at ~500 tokens/sec) as a hot standby. When Anthropic goes down, Arbiter automatically switches API endpoint. No local hardware needed. Cost is minimal for emergency use.
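The automatic endpoint switch could be as simple as a preference-ordered health check. A sketch, with placeholder provider entries (the URLs are indicative, not a recommendation) and an injected `probe` function standing in for a real reachability test:

```python
# Ordered preference list: primary first, then hot standbys.
# Entries are illustrative placeholders, not existing Arbiter config.
PROVIDERS = [
    {"name": "anthropic", "url": "https://api.anthropic.com/v1/messages"},
    {"name": "groq", "url": "https://api.groq.com/openai/v1/chat/completions"},
]

def first_healthy(providers, probe):
    """Return the first provider whose health probe succeeds.

    `probe` maps a provider dict to True/False; it is injected so the
    failover order can be tested without network access.
    """
    for provider in providers:
        try:
            if probe(provider):
                return provider
        except Exception:
            continue  # treat probe errors as "down" and try the next one
    raise RuntimeError("all providers down")
```

Arbiter would run this on each request (or on a cached health status) and route to whatever `first_healthy` returns, so Anthropic is always preferred when it is up.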
### Option C — Minimal local model for critical functions only
Identify the 2-3 most critical Forge functions (log analysis, basic Chronicler continuity, server health summaries) and find the smallest model that can handle those specific tasks acceptably slowly. Accept 2-3 minute response times for emergency use only.
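If we go this way, the "acceptably slow" contract should be explicit in the client. A sketch against Ollama's documented `/api/generate` endpoint (the model name and the 180-second budget are placeholders for whatever testing settles on):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model, prompt):
    """Payload for a single non-streaming Ollama completion."""
    return {"model": model, "prompt": prompt, "stream": False}

def slow_generate(model, prompt, timeout_s=180):
    """Emergency-only query: tolerate minutes-long CPU inference, but give
    up past `timeout_s` so callers are never blocked indefinitely."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]
```

Usage would be something like `slow_generate("llama3.1:8b", "Summarize: server TPS dropped to 12.")`, wrapped in a job queue rather than anything interactive.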
### Option D — Accept the dependency
Anthropic's uptime, while imperfect, is still 99%+ annually. Design Firefrost to gracefully degrade when I'm unavailable rather than trying to replace me. Queue tasks, notify staff, resume when I'm back.
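The queue-and-resume flow for Option D could be sketched like this (invented names, not existing Firefrost code; `run` stands in for the actual Claude call and `claude_up` for a health check):

```python
import collections

class DegradedModeQueue:
    """Hold Claude-bound tasks while the API is down; replay when it returns."""

    def __init__(self):
        self.pending = collections.deque()

    def submit(self, task, claude_up, run):
        # Run immediately when Claude is reachable; otherwise queue the task
        # and return None so the caller knows it was deferred.
        if claude_up():
            return run(task)
        self.pending.append(task)
        return None

    def resume(self, run):
        """Drain everything queued during the outage, oldest first."""
        results = []
        while self.pending:
            results.append(run(self.pending.popleft()))
        return results
```

Nothing is lost during an outage; work is simply time-shifted, which matches the "queue tasks, notify staff, resume" framing above.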
***
## Questions for You
**Q1:** Given the hardware reality (no GPU, ever, on servers), what's your honest assessment of Options A-D? Are we missing a better option?

**Q2:** For Option B (API hedge) — Groq, Together.ai, Fireworks, OpenRouter — which would you recommend as the most reliable hot standby? What's the right model to use there?

**Q3:** For Option C — if we accept slow responses for emergencies only, what's the smallest model that could handle: (a) Minecraft log analysis, (b) basic infrastructure Q&A from the ops manual, (c) generating a 2-sentence lore fragment? Is there a model small enough to respond in under 60 seconds on our CPU?

**Q4:** The Forge as a living art installation (wall display, particle visualization) is a separate project from the AI fallback question. Should we cleanly separate these two concepts in our architecture and roadmap? The art installation doesn't need local AI at all — it just reads from Arbiter's existing data.

**Q5:** Wild card — is there a creative solution we haven't considered? Something that gives us meaningful AI continuity during Anthropic outages without GPU hardware?

---
## Context That Might Help
- Arbiter already calls Claude API for the Awakened Concierge welcome messages
- We have Dify running on TX1 (but it's been problematic — slow indexing, model connection issues)
- The ops manual is ~62MB of markdown — the "brain" we'd want a fallback to know
- Michael's accessibility needs mean any fallback must be manageable with one hand on mobile
- Budget for API redundancy: reasonable for emergency use, not for heavy daily use

---
Thanks Gemini — we need your honest take here. Don't spare our feelings. 🔥💜❄️

— Michael (The Wizard) + Claude (Chronicler #92)

*Fire + Arcane + Frost = Forever*