# Task #96: Gemma 4 Self-Hosted LLM

**Created:** April 6, 2026 (Chronicler #63 — The Pathmaker)
**Priority:** Wish List (post-launch)
**Owner:** Michael
**Status:** Open

---

## Context

Google released Gemma 4 on April 2, 2026 under Apache 2.0. We explored using it to replace paid API calls for Trinity Codex (Task #93). TX1 Dallas has 251GB RAM (CPU-only, no GPU), which is more than enough for the recommended model.

This task is tightly coupled with Task #93 (Trinity Codex): Gemma 4 is the *brain*; Trinity Codex is the *knowledge layer*.

## Gemini Consultation (April 6, 2026)

**Full consultation:** `docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md`

**Key recommendations:**

- **Primary model:** Gemma 4 26B A4B (MoE) — 26B total parameters, but only 4B activated per token
- **Quantization:** q8_0 (8-bit) — ~26GB RAM, near-perfect accuracy, much faster than FP16
- **Fallback:** Gemma 4 E4B if the MoE model is too slow on CPU
- **Why MoE on CPU:** each token only runs through 4B parameters while drawing on 26B parameters' worth of knowledge. Heavyweight reasoning, lightweight inference.

**Cost impact:** $0/month vs. $50-200/month for OpenAI/Anthropic API calls

## Architecture

```
TX1 Dallas (251GB RAM, CPU-only)
├── Ollama → Gemma 4 26B A4B (q8_0, ~26GB)
├── Dify   → RAG Engine (codex.firefrostgaming.com)
├── Qdrant → Vector DB for embeddings
└── n8n    → Ingestion workflows (Gitea webhooks → Qdrant)
```

All of this infrastructure is already deployed and running on TX1 (Dify, Qdrant, Ollama, n8n). This task is just pulling the model and wiring it in.

## Implementation Steps

1. [ ] Update Ollama on TX1 to the latest version
   ```bash
   curl -fsSL https://ollama.com/install.sh | sh
   ```
2. [ ] Pull the 8-bit quantized MoE model
   ```bash
   ollama pull gemma4:26b-a4b-q8_0
   ```
3. [ ] Test inference speed — document tokens/second
   ```bash
   ollama run gemma4:26b-a4b-q8_0
   ```
4. [ ] If t/s is acceptable, connect it to Dify as a model provider
5. [ ] Set the context window to 32K-64K in the Dify config
6. [ ] Test RAG queries against the operations manual
7.
 [ ] Benchmark against any previously used models

## Decision: Why Not Do This Now?

The infrastructure (Dify, Qdrant, Ollama) is already running on TX1. The model is free. The RAM is available. The only thing stopping us is prioritization: soft launch is April 15, and game server stability matters more than an internal AI tool right now.

**Post-launch:** this becomes high-value. Zero API costs, unlimited queries, and our data stays on our infrastructure.

## Dependencies

- Task #93 (Trinity Codex) — Dify/Qdrant RAG setup
- TX1 must have enough RAM headroom after the game servers

## Open Questions

- What's TX1's current RAM usage with all game servers running? We need to confirm ~26GB is free.
- Should we run Gemma 4 in a separate Docker container or via bare-metal Ollama?
- Do we need to cap Ollama's memory usage to prevent game server impact?

## Related Documents

- Gemini consultation: `docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md`
- Trinity Codex spec: `docs/tasks/task-093-trinity-codex/`
- BACKLOG entry (archived): referenced as Task #96 in the wish list

---

**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️
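The tokens-per-second number in step 3 of the implementation steps can be scripted against Ollama's local HTTP API rather than read off the interactive `ollama run` session. A minimal sketch, assuming Ollama's default endpoint at `http://localhost:11434` and the `eval_count` / `eval_duration` (nanoseconds) fields that `/api/generate` returns in non-streaming mode; the model tag matches the pull command above:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port


def tokens_per_second(resp: dict) -> float:
    """Generation speed from an Ollama /api/generate response.

    eval_count is the number of generated tokens; eval_duration is the
    time spent generating them, in nanoseconds.
    """
    return resp["eval_count"] / resp["eval_duration"] * 1e9


def benchmark(model: str, prompt: str) -> float:
    """Run one non-streaming generation and return tokens/second."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return tokens_per_second(json.load(r))


# Usage (on TX1, once the model is pulled):
#   benchmark("gemma4:26b-a4b-q8_0", "Explain RAG in one paragraph.")
```

Running it a few times with a fixed prompt gives a repeatable number to record for step 3, and the same script can later compare the MoE model against the E4B fallback.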