firefrost-operations-manual/docs/tasks/task-096-gemma4-self-hosted-llm/README.md
Commit e3be9a1dd1 (Chronicler #78, 2026-04-11): docs: Task #96 spec — Gemma 4 Self-Hosted LLM

# Task #96: Gemma 4 Self-Hosted LLM
**Created:** April 6, 2026 (Chronicler #63 — The Pathmaker)
**Priority:** Wish List (post-launch)
**Owner:** Michael
**Status:** Open
---
## Context
Google released Gemma 4 on April 2, 2026 under the Apache 2.0 license. We explored using it to replace paid API calls for Trinity Codex (Task #93). TX1 Dallas has 251GB of RAM (CPU-only, no GPU), which is more than enough for the recommended model.
This task is tightly coupled with Task #93 (Trinity Codex): Gemma 4 is the *brain*; Trinity Codex is the *knowledge layer*.
## Gemini Consultation (April 6, 2026)
**Full consultation:** `docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md`
**Key recommendations:**
- **Primary model:** Gemma 4 26B A4B (MoE) — 26B total params but only activates 4B per token
- **Quantization:** q8_0 (8-bit) — ~26GB RAM, near-perfect accuracy, much faster than FP16
- **Fallback:** Gemma 4 E4B if MoE is too slow on CPU
- **Why MoE on CPU:** Only crunches 4B params per token while having 26B knowledge. Heavyweight reasoning, lightweight inference.
**Cost impact:** $0/month vs $50-200/month for OpenAI/Anthropic API calls
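The ~26GB figure follows from simple quantization arithmetic. A back-of-envelope sketch (the 26B parameter count comes from the recommendation above; ~1 byte per weight is an approximation for q8_0, which in practice adds a small per-block scale overhead):

```bash
# q8_0 stores roughly 1 byte per weight, so total weight memory is
# approximately the parameter count expressed in bytes.
PARAMS=26000000000   # 26B total parameters (MoE total, not active)
awk -v p="$PARAMS" 'BEGIN { printf "%.0f GB (%.1f GiB)\n", p/1e9, p/2^30 }'
```

Note this covers weights only; KV cache and runtime overhead come on top, so actual RAM headroom should be planned with a margin.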
## Architecture
```
TX1 Dallas (251GB RAM, CPU-only)
├── Ollama → Gemma 4 26B A4B (q8_0, ~26GB)
├── Dify → RAG Engine (codex.firefrostgaming.com)
├── Qdrant → Vector DB for embeddings
└── n8n → Ingestion workflows (Gitea webhooks → Qdrant)
```
All of this infrastructure is already deployed and running on TX1 (Dify, Qdrant, Ollama, n8n). What remains for this task is pulling the model and wiring it into the stack.
## Implementation Steps
1. [ ] Update Ollama on TX1 to the latest version

   ```bash
   curl -fsSL https://ollama.com/install.sh | sh
   ```

2. [ ] Pull the 8-bit quantized MoE model

   ```bash
   ollama pull gemma4:26b-a4b-q8_0
   ```

3. [ ] Test inference speed and document tokens/second

   ```bash
   ollama run gemma4:26b-a4b-q8_0
   ```
4. [ ] If t/s is acceptable, connect to Dify as model provider
5. [ ] Set context window to 32K-64K in Dify config
6. [ ] Test RAG queries against operations manual
7. [ ] Benchmark quality and speed against any models previously used for these queries
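For step 3, the tokens/second figure can be computed from the stats Ollama returns on a non-streaming `/api/generate` call (`eval_count` is tokens generated, `eval_duration` is nanoseconds spent generating them). A minimal sketch — the sample numbers below are fabricated for illustration, not a real benchmark:

```bash
# With a live server, the two values come from a call like:
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model":"gemma4:26b-a4b-q8_0","prompt":"hello","stream":false}'
# and appear in the JSON response as eval_count and eval_duration.
EVAL_COUNT=120              # fabricated sample: tokens generated
EVAL_DURATION_NS=60000000000  # fabricated sample: 60s of generation
awk -v c="$EVAL_COUNT" -v d="$EVAL_DURATION_NS" \
    'BEGIN { printf "%.1f tokens/s\n", c / (d / 1e9) }'
```

Recording this number in the task notes gives a concrete go/no-go input for step 4.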
## Decision: Why Not Do This Now?
The infrastructure (Dify, Qdrant, Ollama) is running on TX1. The model is free. The RAM is available. The only thing stopping us is prioritization — soft launch is April 15, and game server stability matters more than an internal AI tool right now.
**Post-launch:** This becomes high-value. Zero API costs, unlimited queries, data stays on our infrastructure.
## Dependencies
- Task #93 (Trinity Codex) — Dify/Qdrant RAG setup
- TX1 must have enough RAM headroom after game servers
## Open Questions
- What's TX1's current RAM usage with all game servers running? Need to confirm 26GB is available.
- Should we run Gemma 4 in a separate Docker container, or via bare-metal Ollama?
- Do we need to restrict Ollama's memory usage to prevent game server impact?
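On the last two questions, one option is a compose-level memory cap on the Ollama container. A hypothetical fragment — the service name, image tag, and the 40g figure are assumptions for illustration, not the actual TX1 config:

```yaml
services:
  ollama:
    image: ollama/ollama
    mem_limit: 40g                   # hard cap: ~26GB of weights + KV-cache headroom
    environment:
      - OLLAMA_MAX_LOADED_MODELS=1   # keep only one model resident at a time
      - OLLAMA_KEEP_ALIVE=5m         # unload the model after idle to free RAM
```

A hard cap like this protects the game servers from an Ollama memory spike, at the cost of the model being OOM-killed rather than swapping if the limit is undersized.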
## Related Documents
- Gemini consultation: `docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md`
- Trinity Codex spec: `docs/tasks/task-093-trinity-codex/`
- BACKLOG entry (archived): Referenced as Task #96 in wish list
---
**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️