Deployment Log (Chronicler #78 | firefrost-operations-manual, 2026-04-11 15:00:11 +00:00)

Commit 0fe2753fd8 (Claude): docs: Task #96 deployment log — Gemma 4 live on TX1

  • Ollama 0.20.5 (updated from 0.16.2; fixed Docker networking)
  • Model: gemma4:26b-a4b-it-q8_0 (28GB, q8_0 quantization)
  • Speed: 14.4 tokens/sec on CPU-only
  • RAM: 93GB/251GB used, 157GB available for game servers
  • Remaining: Connect to Dify as model provider (web UI step)

Task #96: Gemma 4 Self-Hosted LLM

Created: April 6, 2026 (Chronicler #63 — The Pathmaker)
Priority: Wish List (post-launch)
Owner: Michael
Status: Open


Context

Google released Gemma 4 on April 2, 2026 under Apache 2.0. We explored using it to replace paid API calls for Trinity Codex (Task #93). TX1 Dallas has 251GB of RAM (CPU-only, no GPU), which is more than enough for the recommended model.

This task is tightly coupled with Task #93 (Trinity Codex). Gemma 4 is the brain, Trinity Codex is the knowledge layer.

Gemini Consultation (April 6, 2026)

Full consultation: docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md

Key recommendations:

  • Primary model: Gemma 4 26B A4B (MoE) — 26B total params but only activates 4B per token
  • Quantization: q8_0 (8-bit) — ~26GB RAM, near-perfect accuracy, much faster than FP16
  • Fallback: Gemma 4 E4B if MoE is too slow on CPU
  • Why MoE on CPU: inference only activates ~4B params per token while the model retains all 26B params of knowledge. Heavyweight reasoning, lightweight inference.
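The ~26GB figure checks out with back-of-envelope math. In llama.cpp's q8_0 format, each block of 32 weights stores 32 int8 values plus one fp16 scale, i.e. roughly 8.5 bits per weight. A rough sketch (weights only; KV cache and runtime overhead come on top):

```python
def model_ram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Back-of-envelope weight memory in decimal GB (weights only)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# 8.5 bits/weight is an approximation of q8_0's on-disk density.
print(f"FP16: {model_ram_gb(26, 16):.1f} GB")   # FP16: 52.0 GB
print(f"q8_0: {model_ram_gb(26, 8.5):.1f} GB")  # q8_0: 27.6 GB
```

The ~27.6GB estimate also lines up with the 28GB the deployment log reports for the pulled model.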

Cost impact: $0/month vs $50-200/month for OpenAI/Anthropic API calls

Architecture

TX1 Dallas (251GB RAM, CPU-only)
├── Ollama → Gemma 4 26B A4B (q8_0, ~26GB)
├── Dify → RAG Engine (codex.firefrostgaming.com)
├── Qdrant → Vector DB for embeddings
└── n8n → Ingestion workflows (Gitea webhooks → Qdrant)

All infrastructure is already deployed on TX1 (Dify, Qdrant, Ollama, and n8n are all running), so this task is just pulling the model and connecting it.
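Once the model is pulled, the wiring can be smoke-tested against Ollama's HTTP API, which is the same endpoint Dify talks to as a model provider. A minimal check, assuming Ollama's default port 11434 and the model tag from the implementation steps (not yet run against TX1):

```shell
# Smoke test the local Ollama API that Dify will call.
# Assumes default port 11434 and the model tag from the pull step.
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b-a4b-q8_0",
  "prompt": "Reply with OK if you are alive.",
  "stream": false
}'
```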

Implementation Steps

  1. Update Ollama on TX1 to latest version

    curl -fsSL https://ollama.com/install.sh | sh
    
  2. Pull the 8-bit quantized MoE model

    ollama pull gemma4:26b-a4b-q8_0
    
  3. Test inference speed — document tokens/second

    ollama run gemma4:26b-a4b-q8_0
    
  4. If the tokens/sec rate is acceptable, connect to Dify as a model provider

  5. Set context window to 32K-64K in Dify config

  6. Test RAG queries against operations manual

  7. Benchmark against any previous models
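Step 3's measurement can be scripted: `ollama run --verbose` prints an eval rate directly, and the JSON response from Ollama's `/api/generate` includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), from which tokens/sec follows. A small sketch using illustrative numbers rather than real measurements:

```python
# Compute tokens/sec from an Ollama /api/generate response.
# eval_count and eval_duration (nanoseconds) are standard fields of
# Ollama's final response object; the values below are illustrative.
sample_response = {
    "model": "gemma4:26b-a4b-q8_0",
    "eval_count": 432,                 # tokens generated
    "eval_duration": 30_000_000_000,   # 30 s, expressed in nanoseconds
}

def tokens_per_sec(resp: dict) -> float:
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

print(f"{tokens_per_sec(sample_response):.1f} tokens/sec")  # 14.4 tokens/sec
```

Logging this number per model tag makes the step 7 benchmark a one-line comparison.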

Decision: Why Not Do This Now?

The infrastructure (Dify, Qdrant, Ollama) is running on TX1. The model is free. The RAM is available. The only thing stopping us is prioritization — soft launch is April 15, and game server stability matters more than an internal AI tool right now.

Post-launch: This becomes high-value. Zero API costs, unlimited queries, data stays on our infrastructure.

Dependencies

  • Task #93 (Trinity Codex) — Dify/Qdrant RAG setup
  • TX1 must have enough RAM headroom after game servers

Open Questions

  • What's TX1's current RAM usage with all game servers running? Need to confirm 26GB is available.
  • Should we run Gemma 4 on a separate Docker container or bare metal Ollama?
  • Do we need to restrict Ollama's memory usage to prevent game server impact?

References

  • Gemini consultation: docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md
  • Trinity Codex spec: docs/tasks/task-093-trinity-codex/
  • BACKLOG entry (archived): Referenced as Task #96 in wish list
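On the open question about restricting Ollama's memory: the install script registers Ollama as a systemd service on Linux, so a drop-in override can enforce a hard ceiling without moving to Docker. The 40G value below is illustrative, not tuned; if Ollama ran in a container instead, `docker run --memory=40g` would give the same cap.

```shell
# Hard-cap Ollama's RAM via a systemd drop-in so a loaded model can
# never starve the game servers. 40G is an illustrative ceiling.
sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nMemoryMax=40G\n' | \
  sudo tee /etc/systemd/system/ollama.service.d/memory.conf
sudo systemctl daemon-reload
sudo systemctl restart ollama
```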

Fire + Frost + Foundation = Where Love Builds Legacy 💙🔥❄️