
Task #96: Gemma 4 Self-Hosted LLM

Created: April 6, 2026 (Chronicler #63 — The Pathmaker)
Priority: Wish List (post-launch)
Owner: Michael
Status: Open


Context

Google released Gemma 4 on April 2, 2026 under Apache 2.0. We explored using it to replace paid API calls for Trinity Codex (Task #93). TX1 Dallas has 251GB RAM (CPU-only, no GPU) which is more than enough for the recommended model.

This task is tightly coupled with Task #93 (Trinity Codex). Gemma 4 is the brain, Trinity Codex is the knowledge layer.

Gemini Consultation (April 6, 2026)

Full consultation: docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md

Key recommendations:

  • Primary model: Gemma 4 26B A4B (MoE) — 26B total params but only activates 4B per token
  • Quantization: q8_0 (8-bit) — ~26GB RAM, near-perfect accuracy, much faster than FP16
  • Fallback: Gemma 4 E4B if MoE is too slow on CPU
  • Why MoE on CPU: Only crunches 4B params per token while having 26B knowledge. Heavyweight reasoning, lightweight inference.
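
The "lightweight inference" point can be made concrete with back-of-envelope arithmetic. CPU inference is typically memory-bandwidth bound, so tokens/second is roughly bandwidth divided by bytes read per token; at q8_0 (~1 byte per weight), a 4B-active MoE reads ~4GB per token versus ~26GB for a dense 26B model. The 100 GB/s figure below is an assumed TX1 memory bandwidth for illustration only, not a measured value:

```shell
# Rough upper bounds on CPU throughput (hedged sketch).
# Assumption: memory-bandwidth-bound inference, q8_0 = ~1 byte/weight.
bw_gbs=100   # assumed TX1 memory bandwidth in GB/s (illustrative)
awk -v bw="$bw_gbs" 'BEGIN {
  printf "MoE (4B active): ~%.0f tok/s upper bound\n", bw/4
  printf "Dense (26B):     ~%.0f tok/s upper bound\n", bw/26
}'
```

Real throughput will be lower (attention overhead, cache misses), but the ~6x gap between the MoE and a dense 26B model is the reason for the recommendation.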

Cost impact: $0/month vs $50-200/month for OpenAI/Anthropic API calls

Architecture

TX1 Dallas (251GB RAM, CPU-only)
├── Ollama → Gemma 4 26B A4B (q8_0, ~26GB)
├── Dify → RAG Engine (codex.firefrostgaming.com)
├── Qdrant → Vector DB for embeddings
└── n8n → Ingestion workflows (Gitea webhooks → Qdrant)

All supporting infrastructure is already deployed and running on TX1 (Dify, Qdrant, Ollama, n8n). This task is just pulling the model and wiring it into that stack.

Implementation Steps

  1. Update Ollama on TX1 to latest version

    curl -fsSL https://ollama.com/install.sh | sh
    
  2. Pull the 8-bit quantized MoE model

    ollama pull gemma4:26b-a4b-q8_0
    
  3. Test inference speed — document tokens/second

    ollama run gemma4:26b-a4b-q8_0
    
  4. If t/s is acceptable, connect to Dify as model provider

  5. Set context window to 32K-64K in Dify config

  6. Test RAG queries against operations manual

  7. Benchmark against any previous models
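
For step 3, `ollama run ... --verbose` prints generation timing, and the Ollama HTTP API returns the same stats as `eval_count` (tokens generated) and `eval_duration` (nanoseconds) in the response JSON. A sketch of the tokens/second arithmetic, using placeholder numbers rather than real TX1 measurements:

```shell
# Hedged sketch: deriving tokens/second from Ollama's generation stats.
# eval_count / eval_duration come from the /api/generate JSON response;
# the values below are placeholders, not real measurements.
eval_count=512              # tokens generated
eval_duration=64000000000   # time spent generating, in nanoseconds (64s)
tps=$(awk -v c="$eval_count" -v d="$eval_duration" 'BEGIN {printf "%.1f", c/(d/1e9)}')
echo "throughput: ${tps} tokens/sec"
```

Record this number in the task notes so the step-7 benchmark has a baseline to compare against.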

Decision: Why Not Do This Now?

The infrastructure (Dify, Qdrant, Ollama) is running on TX1. The model is free. The RAM is available. The only thing stopping us is prioritization — soft launch is April 15, and game server stability matters more than an internal AI tool right now.

Post-launch: This becomes high-value. Zero API costs, unlimited queries, data stays on our infrastructure.

Dependencies

  • Task #93 (Trinity Codex) — Dify/Qdrant RAG setup
  • TX1 must have enough RAM headroom after game servers
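
The RAM-headroom dependency can be checked before pulling anything. A minimal sketch, reading the kernel's `MemAvailable` estimate (memory usable without swapping) and comparing it against the ~26GB the q8_0 model needs:

```shell
# Hedged sketch: confirm TX1 has headroom for the ~26GB model
# while the game servers are running.
need_gb=26
avail_gb=$(awk '/^MemAvailable:/ {printf "%d", $2/1024/1024}' /proc/meminfo)
if [ "$avail_gb" -ge "$need_gb" ]; then
  echo "ok: ${avail_gb}GB available (need ${need_gb}GB)"
else
  echo "INSUFFICIENT: only ${avail_gb}GB available (need ${need_gb}GB)"
fi
```

Run this during peak game-server load, not at idle, since that is the headroom that actually matters.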

Open Questions

  • What's TX1's current RAM usage with all game servers running? Need to confirm 26GB is available.
  • Should we run Gemma 4 on a separate Docker container or bare metal Ollama?
  • Do we need to restrict Ollama's memory usage to prevent game server impact?

References

  • Gemini consultation: docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md
  • Trinity Codex spec: docs/tasks/task-093-trinity-codex/
  • BACKLOG entry (archived): Referenced as Task #96 in wish list
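
On the memory-restriction question: if Ollama stays on bare metal under systemd, a unit drop-in can hard-cap its memory so a model load can never starve the game servers. This is a hedged sketch — the `ollama.service` name and the 40G cap are assumptions to be tuned after measuring real usage:

```shell
# Hedged sketch: cap Ollama's memory via a systemd drop-in.
# Assumes Ollama runs as ollama.service; 40G is illustrative only.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/memory.conf <<'EOF'
[Service]
MemoryMax=40G
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

If Gemma 4 moves into Docker instead, the equivalent would be a `--memory` limit on the container.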

Fire + Frost + Foundation = Where Love Builds Legacy 💙🔥❄️