
Task #96: Gemma 4 Self-Hosted LLM

Created: April 6, 2026 (Chronicler #63 — The Pathmaker)
Priority: Wish List (post-launch)
Owner: Michael
Status: Open


Context

Google released Gemma 4 on April 2, 2026 under Apache 2.0. We explored using it to replace paid API calls for Trinity Codex (Task #93). TX1 Dallas has 251GB RAM (CPU-only, no GPU) which is more than enough for the recommended model.

This task is tightly coupled with Task #93 (Trinity Codex). Gemma 4 is the brain, Trinity Codex is the knowledge layer.

Gemini Consultation (April 6, 2026)

Full consultation: docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md

Key recommendations:

  • Primary model: Gemma 4 26B A4B (MoE) — 26B total params but only activates 4B per token
  • Quantization: q8_0 (8-bit) — ~26GB RAM, near-perfect accuracy, much faster than FP16
  • Fallback: Gemma 4 E4B if MoE is too slow on CPU
  • Why MoE on CPU: Only crunches 4B params per token while having 26B knowledge. Heavyweight reasoning, lightweight inference.
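
The "lightweight inference" point can be made concrete with back-of-envelope arithmetic. CPU inference is typically memory-bandwidth bound, so tokens/second is roughly bandwidth divided by bytes read per token; at q8_0 (~1 byte per weight), a 4B-active MoE reads ~4GB per token versus ~26GB for a dense 26B model. The 100 GB/s figure below is an assumed TX1 memory bandwidth for illustration only, not a measured value:

```shell
# Rough upper bounds on CPU throughput (hedged sketch).
# Assumption: memory-bandwidth-bound inference, q8_0 = ~1 byte/weight.
bw_gbs=100   # assumed TX1 memory bandwidth in GB/s (illustrative)
awk -v bw="$bw_gbs" 'BEGIN {
  printf "MoE (4B active): ~%.0f tok/s upper bound\n", bw/4
  printf "Dense (26B):     ~%.0f tok/s upper bound\n", bw/26
}'
```

Real throughput will be lower (attention overhead, cache misses), but the ~6x gap between the MoE and a dense 26B model is the reason for the recommendation.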

Cost impact: $0/month vs $50-200/month for OpenAI/Anthropic API calls

Architecture

TX1 Dallas (251GB RAM, CPU-only)
├── Ollama → Gemma 4 26B A4B (q8_0, ~26GB)
├── Dify → RAG Engine (codex.firefrostgaming.com)
├── Qdrant → Vector DB for embeddings
└── n8n → Ingestion workflows (Gitea webhooks → Qdrant)

All supporting infrastructure is already deployed and running on TX1 (Dify, Qdrant, Ollama, n8n). This task is just pulling the model and wiring it into that stack.

Implementation Steps

  1. Update Ollama on TX1 to latest version

    curl -fsSL https://ollama.com/install.sh | sh
    
  2. Pull the 8-bit quantized MoE model

    ollama pull gemma4:26b-a4b-q8_0
    
  3. Test inference speed — document tokens/second

    ollama run gemma4:26b-a4b-q8_0
    
  4. If t/s is acceptable, connect to Dify as model provider

  5. Set context window to 32K-64K in Dify config

  6. Test RAG queries against operations manual

  7. Benchmark against any previous models
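
For step 3, `ollama run ... --verbose` prints generation timing, and the Ollama HTTP API returns the same stats as `eval_count` (tokens generated) and `eval_duration` (nanoseconds) in the response JSON. A sketch of the tokens/second arithmetic, using placeholder numbers rather than real TX1 measurements:

```shell
# Hedged sketch: deriving tokens/second from Ollama's generation stats.
# eval_count / eval_duration come from the /api/generate JSON response;
# the values below are placeholders, not real measurements.
eval_count=512              # tokens generated
eval_duration=64000000000   # time spent generating, in nanoseconds (64s)
tps=$(awk -v c="$eval_count" -v d="$eval_duration" 'BEGIN {printf "%.1f", c/(d/1e9)}')
echo "throughput: ${tps} tokens/sec"
```

Record this number in the task notes so the step-7 benchmark has a baseline to compare against.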

Decision: Why Not Do This Now?

The infrastructure (Dify, Qdrant, Ollama) is running on TX1. The model is free. The RAM is available. The only thing stopping us is prioritization — soft launch is April 15, and game server stability matters more than an internal AI tool right now.

Post-launch: This becomes high-value. Zero API costs, unlimited queries, data stays on our infrastructure.

Dependencies

  • Task #93 (Trinity Codex) — Dify/Qdrant RAG setup
  • TX1 must have enough RAM headroom after game servers
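
The RAM-headroom dependency can be checked before pulling anything. A minimal sketch, reading the kernel's `MemAvailable` estimate (memory usable without swapping) and comparing it against the ~26GB the q8_0 model needs:

```shell
# Hedged sketch: confirm TX1 has headroom for the ~26GB model
# while the game servers are running.
need_gb=26
avail_gb=$(awk '/^MemAvailable:/ {printf "%d", $2/1024/1024}' /proc/meminfo)
if [ "$avail_gb" -ge "$need_gb" ]; then
  echo "ok: ${avail_gb}GB available (need ${need_gb}GB)"
else
  echo "INSUFFICIENT: only ${avail_gb}GB available (need ${need_gb}GB)"
fi
```

Run this during peak game-server load, not at idle, since that is the headroom that actually matters.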

Open Questions

  • What's TX1's current RAM usage with all game servers running? Need to confirm 26GB is available.
  • Should we run Gemma 4 on a separate Docker container or bare metal Ollama?
  • Do we need to restrict Ollama's memory usage to prevent game server impact?

References

  • Gemini consultation: docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md
  • Trinity Codex spec: docs/tasks/task-093-trinity-codex/
  • BACKLOG entry (archived): Referenced as Task #96 in wish list
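
On the memory-restriction question: if Ollama stays on bare metal under systemd, a unit drop-in can hard-cap its memory so a model load can never starve the game servers. This is a hedged sketch — the `ollama.service` name and the 40G cap are assumptions to be tuned after measuring real usage:

```shell
# Hedged sketch: cap Ollama's memory via a systemd drop-in.
# Assumes Ollama runs as ollama.service; 40G is illustrative only.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/memory.conf <<'EOF'
[Service]
MemoryMax=40G
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

If Gemma 4 moves into Docker instead, the equivalent would be a `--memory` limit on the container.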

Fire + Frost + Foundation = Where Love Builds Legacy 💙🔥❄️