firefrost-operations-manual/docs/tasks/task-096-gemma4-self-hosted-llm/README.md
Commit e3be9a1dd1 (Chronicler #78, 2026-04-11): docs: Task #96 spec — Gemma 4 Self-Hosted LLM

# Task #96: Gemma 4 Self-Hosted LLM
**Created:** April 6, 2026 (Chronicler #63 — The Pathmaker)
**Priority:** Wish List (post-launch)
**Owner:** Michael
**Status:** Open
---
## Context
Google released Gemma 4 on April 2, 2026 under the Apache 2.0 license. We explored using it to replace paid API calls for Trinity Codex (Task #93). TX1 Dallas has 251GB of RAM (CPU-only, no GPU), which is more than enough for the recommended model.
This task is tightly coupled with Task #93 (Trinity Codex): Gemma 4 is the *brain*; Trinity Codex is the *knowledge layer*.
## Gemini Consultation (April 6, 2026)
**Full consultation:** `docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md`
**Key recommendations:**
- **Primary model:** Gemma 4 26B A4B (MoE) — 26B total params but only activates 4B per token
- **Quantization:** q8_0 (8-bit) — ~26GB RAM, near-perfect accuracy, much faster than FP16
- **Fallback:** Gemma 4 E4B if MoE is too slow on CPU
- **Why MoE on CPU:** Only crunches 4B params per token while having 26B knowledge. Heavyweight reasoning, lightweight inference.
**Cost impact:** $0/month vs $50-200/month for OpenAI/Anthropic API calls
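The ~26GB figure follows from simple quantization arithmetic. A back-of-envelope sketch (the 26B parameter count comes from the recommendation above; ~1 byte per weight is an approximation for q8_0, which in practice adds a small per-block scale overhead):

```bash
# q8_0 stores roughly 1 byte per weight, so total weight memory is
# approximately the parameter count expressed in bytes.
PARAMS=26000000000   # 26B total parameters (MoE total, not active)
awk -v p="$PARAMS" 'BEGIN { printf "%.0f GB (%.1f GiB)\n", p/1e9, p/2^30 }'
```

Note this covers weights only; KV cache and runtime overhead come on top, so actual RAM headroom should be planned with a margin.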
## Architecture
```
TX1 Dallas (251GB RAM, CPU-only)
├── Ollama → Gemma 4 26B A4B (q8_0, ~26GB)
├── Dify → RAG Engine (codex.firefrostgaming.com)
├── Qdrant → Vector DB for embeddings
└── n8n → Ingestion workflows (Gitea webhooks → Qdrant)
```
All of this infrastructure is already deployed and running on TX1 (Dify, Qdrant, Ollama, n8n). What remains for this task is pulling the model and wiring it into the stack.
## Implementation Steps
1. [ ] Update Ollama on TX1 to the latest version

   ```bash
   curl -fsSL https://ollama.com/install.sh | sh
   ```

2. [ ] Pull the 8-bit quantized MoE model

   ```bash
   ollama pull gemma4:26b-a4b-q8_0
   ```

3. [ ] Test inference speed and document tokens/second

   ```bash
   ollama run gemma4:26b-a4b-q8_0
   ```
4. [ ] If t/s is acceptable, connect to Dify as model provider
5. [ ] Set context window to 32K-64K in Dify config
6. [ ] Test RAG queries against operations manual
7. [ ] Benchmark quality and speed against any models previously used for these queries
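For step 3, the tokens/second figure can be computed from the stats Ollama returns on a non-streaming `/api/generate` call (`eval_count` is tokens generated, `eval_duration` is nanoseconds spent generating them). A minimal sketch — the sample numbers below are fabricated for illustration, not a real benchmark:

```bash
# With a live server, the two values come from a call like:
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model":"gemma4:26b-a4b-q8_0","prompt":"hello","stream":false}'
# and appear in the JSON response as eval_count and eval_duration.
EVAL_COUNT=120              # fabricated sample: tokens generated
EVAL_DURATION_NS=60000000000  # fabricated sample: 60s of generation
awk -v c="$EVAL_COUNT" -v d="$EVAL_DURATION_NS" \
    'BEGIN { printf "%.1f tokens/s\n", c / (d / 1e9) }'
```

Recording this number in the task notes gives a concrete go/no-go input for step 4.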
## Decision: Why Not Do This Now?
The infrastructure (Dify, Qdrant, Ollama) is running on TX1. The model is free. The RAM is available. The only thing stopping us is prioritization — soft launch is April 15, and game server stability matters more than an internal AI tool right now.
**Post-launch:** This becomes high-value. Zero API costs, unlimited queries, data stays on our infrastructure.
## Dependencies
- Task #93 (Trinity Codex) — Dify/Qdrant RAG setup
- TX1 must have enough RAM headroom after game servers
## Open Questions
- What's TX1's current RAM usage with all game servers running? Need to confirm 26GB is available.
- Should we run Gemma 4 in a separate Docker container, or via bare-metal Ollama?
- Do we need to restrict Ollama's memory usage to prevent game server impact?
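On the last two questions, one option is a compose-level memory cap on the Ollama container. A hypothetical fragment — the service name, image tag, and the 40g figure are assumptions for illustration, not the actual TX1 config:

```yaml
services:
  ollama:
    image: ollama/ollama
    mem_limit: 40g                   # hard cap: ~26GB of weights + KV-cache headroom
    environment:
      - OLLAMA_MAX_LOADED_MODELS=1   # keep only one model resident at a time
      - OLLAMA_KEEP_ALIVE=5m         # unload the model after idle to free RAM
```

A hard cap like this protects the game servers from an Ollama memory spike, at the cost of the model being OOM-killed rather than swapping if the limit is undersized.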
## Related Documents
- Gemini consultation: `docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md`
- Trinity Codex spec: `docs/tasks/task-093-trinity-codex/`
- BACKLOG entry (archived): Referenced as Task #96 in wish list
---
**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️