> Full context from Gemini consultation (April 6, 2026). Gemma 4 26B A4B MoE recommended for TX1 (251GB RAM, CPU-only). ~26GB at q8_0 quantization, zero monthly API cost. Tightly coupled with Task #93 (Trinity Codex). (Chronicler #78 | firefrost-operations-manual)
# Task #96: Gemma 4 Self-Hosted LLM
**Created:** April 6, 2026 (Chronicler #63 — The Pathmaker)

**Priority:** Wish List (post-launch)

**Owner:** Michael

**Status:** Open

---
## Context
Google released Gemma 4 on April 2, 2026 under Apache 2.0. We explored using it to replace paid API calls for Trinity Codex (Task #93). TX1 Dallas has 251GB RAM (CPU-only, no GPU), which is more than enough for the recommended model.

This task is tightly coupled with Task #93 (Trinity Codex): Gemma 4 is the *brain*, Trinity Codex is the *knowledge layer*.
## Gemini Consultation (April 6, 2026)
**Full consultation:** `docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md`

**Key recommendations:**

- **Primary model:** Gemma 4 26B A4B (MoE) — 26B total params, but only 4B activated per token
- **Quantization:** q8_0 (8-bit) — ~26GB RAM, near-perfect accuracy, much faster than FP16
- **Fallback:** Gemma 4 E4B if the MoE is too slow on CPU
- **Why MoE on CPU:** it only crunches 4B params per token while carrying 26B params of knowledge — heavyweight reasoning, lightweight inference

**Cost impact:** $0/month vs. $50–200/month for OpenAI/Anthropic API calls
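The ~26GB figure follows from simple arithmetic: q8_0 stores roughly one byte per parameter (a rule of thumb, not an exact format spec), so a 26B-parameter model needs about 26GB just for weights; KV cache and runtime overhead come on top. A minimal sketch:

```python
# Back-of-envelope weight-memory estimate. bytes_per_param is a rule of
# thumb per format (q8_0 ≈ 1 B/param, FP16 = 2 B/param); KV cache and
# runtime overhead are NOT counted here.

def weight_ram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB: (params * 1e9 * bytes) / 1e9."""
    return params_billions * bytes_per_param

print(weight_ram_gb(26, 1.0))  # q8_0 -> 26.0 GB
print(weight_ram_gb(26, 2.0))  # FP16 -> 52.0 GB
```

This is also why the FP16 variant was never on the table: ~52GB of weights would eat a fifth of TX1's RAM before the first token.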
## Architecture
```
TX1 Dallas (251GB RAM, CPU-only)
├── Ollama → Gemma 4 26B A4B (q8_0, ~26GB)
├── Dify   → RAG Engine (codex.firefrostgaming.com)
├── Qdrant → Vector DB for embeddings
└── n8n    → Ingestion workflows (Gitea webhooks → Qdrant)
```
All infrastructure is already deployed on TX1 (Dify, Qdrant, Ollama, and n8n are all running). This task is just a matter of pulling the model and wiring it into the stack.
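To make the n8n → Qdrant ingestion leg concrete, here is a minimal, dependency-free sketch of the chunking step that precedes embedding. The function name, chunk size, and overlap are illustrative assumptions; the real workflow lives in n8n and writes to Qdrant.

```python
# Illustrative sketch only: how a doc arriving via Gitea webhook might be
# split into overlapping chunks before embedding and upserting into Qdrant.
# size/overlap are assumed values, not the production settings.

def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into ~`size`-word chunks, each sharing `overlap` words with the previous one."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

print(len(chunk_text("word " * 500)))  # -> 3 overlapping chunks
```

Overlap matters for RAG: a fact that straddles a chunk boundary would otherwise be invisible to retrieval.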
## Implementation Steps
1. [ ] Update Ollama on TX1 to the latest version

   ```bash
   curl -fsSL https://ollama.com/install.sh | sh
   ```

2. [ ] Pull the 8-bit quantized MoE model

   ```bash
   ollama pull gemma4:26b-a4b-q8_0
   ```

3. [ ] Test inference speed — document tokens/second

   ```bash
   ollama run gemma4:26b-a4b-q8_0
   ```

4. [ ] If t/s is acceptable, connect to Dify as a model provider
5. [ ] Set the context window to 32K–64K in the Dify config
6. [ ] Test RAG queries against the operations manual
7. [ ] Benchmark against any previous models
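Step 3 can be scripted rather than eyeballed. A sketch, assuming Ollama's default port and the model tag above: it calls Ollama's `/api/generate` endpoint, whose non-streaming JSON response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds), and derives tokens/second from them.

```python
# Sketch of a t/s benchmark against a local Ollama instance.
# Model tag and prompt are assumptions; port 11434 is Ollama's default.
import json
import urllib.request

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's timing fields (duration is in nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str = "gemma4:26b-a4b-q8_0",
              prompt: str = "Explain RAG in one paragraph.") -> float:
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return tokens_per_second(body["eval_count"], body["eval_duration"])

# Usage (on TX1, with Ollama running):
#   print(f"{benchmark():.1f} tokens/sec")
```

Logging the number from a script makes the step-7 comparison against previous models reproducible instead of anecdotal.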
## Decision: Why Not Do This Now?
The infrastructure (Dify, Qdrant, Ollama) is running on TX1. The model is free. The RAM is available. The only thing stopping us is prioritization — soft launch is April 15, and game server stability matters more than an internal AI tool right now.

**Post-launch:** This becomes high-value. Zero API costs, unlimited queries, data stays on our infrastructure.
## Dependencies
- Task #93 (Trinity Codex) — Dify/Qdrant RAG setup
- TX1 must have enough RAM headroom after game servers
## Open Questions
- What's TX1's current RAM usage with all game servers running? Need to confirm 26GB is available.
- Should we run Gemma 4 in a separate Docker container or via bare-metal Ollama?
- Do we need to restrict Ollama's memory usage to prevent game server impact?
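For the third question, one option — sketched under the assumption that Ollama runs as a systemd service, which is what the official install script sets up — is a systemd drop-in using the standard `MemoryMax` resource-control directive. The 40G budget is an illustrative number, not a measured one:

```ini
# /etc/systemd/system/ollama.service.d/memory.conf
# Illustrative cap; pick a real budget after measuring game-server RAM usage.
[Service]
MemoryMax=40G
```

After adding the drop-in, `systemctl daemon-reload && systemctl restart ollama` applies it. The Docker question has the same lever: `docker run --memory=40g` caps a containerized Ollama the same way.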
## Related Documents
- Gemini consultation: `docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md`
- Trinity Codex spec: `docs/tasks/task-093-trinity-codex/`
- BACKLOG entry (archived): referenced as Task #96 in the wish list

---
**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️