Status update (Chronicler #78 | firefrost-operations-manual):
- Ollama 0.20.5 (updated from 0.16.2, fixed Docker networking)
- Model: gemma4:26b-a4b-it-q8_0 (28GB, q8_0 quantization)
- Speed: 14.4 tokens/sec on CPU-only
- RAM: 93GB/251GB used, 157GB available for game servers
- Remaining: Connect to Dify as model provider (web UI step)
Task #96: Gemma 4 Self-Hosted LLM
Created: April 6, 2026 (Chronicler #63 — The Pathmaker)
Priority: Wish List (post-launch)
Owner: Michael
Status: Open
Context
Google released Gemma 4 on April 2, 2026 under Apache 2.0. We explored using it to replace paid API calls for Trinity Codex (Task #93). TX1 Dallas has 251GB of RAM (CPU-only, no GPU), which is more than enough for the recommended model.
This task is tightly coupled with Task #93 (Trinity Codex). Gemma 4 is the brain, Trinity Codex is the knowledge layer.
Gemini Consultation (April 6, 2026)
Full consultation: docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md
Key recommendations:
- Primary model: Gemma 4 26B A4B (MoE) — 26B total params but only activates 4B per token
- Quantization: q8_0 (8-bit) — ~26GB RAM, near-perfect accuracy, much faster than FP16
- Fallback: Gemma 4 E4B if MoE is too slow on CPU
- Why MoE on CPU: Only crunches 4B params per token while having 26B knowledge. Heavyweight reasoning, lightweight inference.
Cost impact: $0/month vs $50-200/month for OpenAI/Anthropic API calls
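The "~26GB for q8_0" figure can be sanity-checked with back-of-envelope arithmetic. A sketch, assuming the GGUF q8_0 layout (blocks of 32 int8 weights plus one fp16 scale, i.e. 34 bytes per 32 weights):

```python
def q8_0_size_gib(n_params: float) -> float:
    """Approximate q8_0 weight size in GiB: each block of 32 weights
    stores 32 int8 values plus one fp16 scale (34 bytes per block)."""
    bytes_per_weight = 34 / 32  # ~1.0625 bytes per weight
    return n_params * bytes_per_weight / 2**30

# 26B total parameters -> roughly 26 GiB of weights, before KV cache
# and runtime overhead (which push actual usage toward the 28GB observed).
print(round(q8_0_size_gib(26e9), 1))  # → 25.7
```

The gap between ~26 GiB of weights and the 28GB on disk/in RAM is expected overhead (metadata, KV cache, runtime buffers).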
Architecture
TX1 Dallas (251GB RAM, CPU-only)
├── Ollama → Gemma 4 26B A4B (q8_0, ~26GB)
├── Dify → RAG Engine (codex.firefrostgaming.com)
├── Qdrant → Vector DB for embeddings
└── n8n → Ingestion workflows (Gitea webhooks → Qdrant)
All infrastructure is already deployed on TX1 (Dify, Qdrant, Ollama, n8n all running). This task covers pulling the model and connecting it.
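Since everything is already running, a quick way to confirm Ollama is reachable and the model is pulled is its `/api/tags` endpoint. A sketch (the model name comes from this task; the port is Ollama's default, adjust if the Docker mapping differs):

```python
import json
from urllib.request import urlopen

def model_present(tags: dict, name: str) -> bool:
    """Check an Ollama /api/tags response for a model whose name starts with `name`."""
    return any(m["name"].startswith(name) for m in tags.get("models", []))

def check_tx1() -> bool:
    # Ollama listens on 11434 by default.
    with urlopen("http://localhost:11434/api/tags") as r:
        return model_present(json.load(r), "gemma4:26b-a4b")

# Run check_tx1() on TX1 itself; model_present() can be exercised anywhere:
sample = {"models": [{"name": "gemma4:26b-a4b-q8_0"}]}
print(model_present(sample, "gemma4:26b-a4b"))  # → True
```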
Implementation Steps
- Update Ollama on TX1 to the latest version:
  ```shell
  curl -fsSL https://ollama.com/install.sh | sh
  ```
- Pull the 8-bit quantized MoE model:
  ```shell
  ollama pull gemma4:26b-a4b-q8_0
  ```
- Test inference speed — document tokens/second:
  ```shell
  ollama run gemma4:26b-a4b-q8_0
  ```
- If t/s is acceptable, connect to Dify as model provider
- Set context window to 32K-64K in Dify config
- Test RAG queries against operations manual
- Benchmark against any previous models
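The "document tokens/second" step doesn't need a stopwatch: Ollama's generate API returns `eval_count` and `eval_duration` (nanoseconds) on non-streaming requests. A sketch against the default local endpoint (prompt text is arbitrary):

```python
import json
from urllib.request import Request, urlopen

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Ollama reports generated-token count and generation time (ns) per response."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str = "gemma4:26b-a4b-q8_0") -> float:
    body = json.dumps({"model": model,
                       "prompt": "Explain RAG in one sentence.",
                       "stream": False}).encode()
    req = Request("http://localhost:11434/api/generate", data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as r:
        resp = json.load(r)
    return tokens_per_second(resp["eval_count"], resp["eval_duration"])

# The math matches the logged figure: 144 tokens in 10s -> 14.4 tok/s
print(tokens_per_second(144, 10_000_000_000))  # → 14.4
```

Run `benchmark()` on TX1 and record the result alongside the status notes.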
Decision: Why Not Do This Now?
The infrastructure (Dify, Qdrant, Ollama) is running on TX1. The model is free. The RAM is available. The only thing stopping us is prioritization — soft launch is April 15, and game server stability matters more than an internal AI tool right now.
Post-launch: This becomes high-value. Zero API costs, unlimited queries, data stays on our infrastructure.
Dependencies
- Task #93 (Trinity Codex) — Dify/Qdrant RAG setup
- TX1 must have enough RAM headroom after game servers
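The RAM-headroom dependency can be checked on the host itself by reading `MemAvailable` from `/proc/meminfo`. A Linux-only sketch (the 40 GiB threshold is an assumption: ~28GB of weights plus KV-cache headroom, not a measured requirement):

```python
def mem_available_gib(meminfo_text: str) -> float:
    """Parse MemAvailable (reported in kB) out of /proc/meminfo, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / 2**20  # kB -> GiB
    raise ValueError("MemAvailable not found")

def headroom_ok(path: str = "/proc/meminfo", need_gib: float = 40.0) -> bool:
    # need_gib is an assumed threshold, tune after load-testing game servers.
    with open(path) as f:
        return mem_available_gib(f.read()) >= need_gib

# The parser itself can be exercised with a canned sample:
sample = "MemTotal:       263192576 kB\nMemAvailable:   164626432 kB\n"
print(round(mem_available_gib(sample)))  # → 157
```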
Open Questions
- What's TX1's current RAM usage with all game servers running? Need to confirm 26GB is available.
- Should we run Gemma 4 in a separate Docker container or via bare-metal Ollama?
- Do we need to restrict Ollama's memory usage to prevent game server impact?
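On the last question: if Ollama stays in Docker, the container can be given a hard memory cap so a runaway load can never starve the game servers. A sketch of picking that cap (the reservation figure and the `ollama` container name are assumptions for illustration):

```python
def ollama_memory_cap_gib(total_gib: float, game_server_reserve_gib: float,
                          model_gib: float) -> float:
    """Cap Ollama below what's left after the game-server reservation,
    but leave ~50% headroom above the model weights for KV cache."""
    leftover = total_gib - game_server_reserve_gib
    return min(leftover, model_gib * 1.5)

# Hypothetical numbers: 251 GiB total, 160 GiB reserved for game servers,
# 28 GiB model -> cap at 42 GiB.
cap = ollama_memory_cap_gib(251, 160, 28)
print(f"docker update --memory={cap:.0f}g --memory-swap={cap:.0f}g ollama")
```

`docker update --memory` applies the limit to a running container; setting `--memory-swap` to the same value disables swap for it.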
Related Documents
- Gemini consultation: docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md
- Trinity Codex spec: docs/tasks/task-093-trinity-codex/
- BACKLOG entry (archived): referenced as Task #96 in wish list
Fire + Frost + Foundation = Where Love Builds Legacy 💙🔥❄️