docs(consultations): Gemma 4 self-hosting for Trinity Codex

Gemini consultation on deploying Gemma 4 26B A4B (MoE) on TX1 Dallas:
- CPU-only with 251GB RAM = perfect for MoE architecture
- Only 4B parameters active per token = fast inference
- Full 26B reasoning capability for RAG accuracy
- Zero API costs, data never leaves infrastructure

Deployment steps:
1. Update Ollama
2. Pull gemma4:26b-a4b-q8_0 (8-bit quant, ~26GB)
3. Benchmark tokens-per-second (t/s)
4. Connect to Dify as model provider

Updates Task #93 architecture from external API to self-hosted.

Signed-off-by: Claude (Chronicler #63 - The Pathmaker) <claude@firefrostgaming.com>
# Gemini Consultation: Gemma 4 Self-Hosting for Trinity Codex
**Date:** April 6, 2026
**Topic:** Local LLM deployment for Task #93 (Trinity Codex) and Dify RAG
**Chronicler:** #63 (The Pathmaker)
**Consulted:** Gemini AI
---
## Context
Google released **Gemma 4** on April 2, 2026 under the Apache 2.0 license. This is a frontier-level open-weights model suitable for self-hosting. We explored using it with our existing Dify/Qdrant RAG infrastructure on TX1 Dallas.
**TX1 Dallas Specs:**
- **CPU-only** (no GPU)
- **251GB RAM**
- **911GB Disk** (12% usage)
- Already running: Dify + Qdrant + Ollama
---
## Gemma 4 Model Lineup
| Model | Architecture | Parameters | Context | VRAM (FP16) | VRAM (4-bit) |
|-------|--------------|------------|---------|-------------|--------------|
| Gemma 4 E2B | Dense | 2.3B active (5.1B total) | 128K | ~10 GB | ~4 GB |
| Gemma 4 E4B | Dense | 4.5B active (8B total) | 128K | ~16 GB | ~6 GB |
| **Gemma 4 26B A4B** | **MoE** | **4B active (26B total)** | **256K** | ~52 GB | ~15 GB |
| Gemma 4 31B | Dense | 31B dense | 256K | ~62 GB | ~18-20 GB |
**Key Features:**
- Multimodal: text, images, video (E2B/E4B also support audio)
- Native function calling and structured JSON output
- Shared KV Cache for reduced memory footprint
- 256K context window on larger models
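The structured-output claim is easy to smoke-test once the model is pulled. A minimal sketch against Ollama's generate endpoint, assuming the stock port 11434 and the standard `"format": "json"` request option; the prompt text and model tag are from this plan, everything else is default config:

```bash
# Request body asking the model to answer as strict JSON.
# "format": "json" is Ollama's built-in structured-output switch.
req='{
  "model": "gemma4:26b-a4b-q8_0",
  "prompt": "List the ports used by Qdrant and Dify as a JSON object.",
  "format": "json",
  "stream": false
}'
# Once Ollama is running: curl -s http://localhost:11434/api/generate -d "$req"
echo "$req"
```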
---
## Gemini's Recommendation
### Primary: Gemma 4 26B A4B (MoE)
**This is the optimal choice for Trinity Codex.**
**Why MoE wins on CPU-only hardware:**
1. **Reasoning Power:** The full 26-billion-parameter knowledge base gives accurate answers about server documentation, Chronicler lineages, and operational knowledge, with far less risk of hallucination than a small dense model.
2. **Speed:** Only 4 billion parameters activate per token, so the CPU crunches the math of a 4B model while you get 26B-scale intelligence. Heavyweight reasoning with lightweight inference.
3. **Memory:** With 251GB RAM, memory capacity is not a bottleneck. The model fits easily.
### Fallback: Gemma 4 E4B
If the MoE is too slow for real-time RAG responses:
- E4B will be faster on CPU
- Slight drop in complex reasoning when synthesizing multiple documents
- Good for simpler queries
---
## Critical Optimization: Use Quantization
**Even with 251GB RAM, always use quantized models on CPU.**
Why? CPU inference is memory-bandwidth bound: every generated token streams the model's active weights through the memory bus, so tokens-per-second scales with how few bytes the model occupies, not with how much RAM is free. Quantization (8-bit or 6-bit) drastically improves throughput with near-zero quality loss.
**Recommended:** `q8_0` (8-bit quantization)
- ~26GB RAM usage for 26B MoE
- Near-perfect accuracy for code and RAG
- Significant speed improvement over FP16
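The bandwidth argument can be sanity-checked with back-of-envelope arithmetic. A sketch assuming ~4B active parameters, roughly one byte per weight at q8_0, and an assumed (not measured) 80 GB/s of sustained memory bandwidth; the real ceiling must come from the Step 3 benchmark:

```bash
# Upper-bound decode speed: each token streams the active weights
# through the memory bus once, so t/s <= bandwidth / active bytes.
ACTIVE_PARAMS_B=4    # billions of active parameters (MoE path)
BYTES_PER_PARAM=1    # q8_0 is roughly one byte per weight
MEM_BW_GBPS=80       # ASSUMED sustained bandwidth; measure on TX1

ACTIVE_GB=$((ACTIVE_PARAMS_B * BYTES_PER_PARAM))
echo "upper bound: ~$((MEM_BW_GBPS / ACTIVE_GB)) tokens/sec"
```

At FP16 (~2 bytes per weight) the same arithmetic halves the ceiling, which is the whole case for quantizing even with RAM to spare.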
---
## Deployment Steps
### Step 1: Update Ollama
Gemma 4 requires recent Ollama binaries:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
### Step 2: Pull the 8-bit Quantized MoE
```bash
ollama pull gemma4:26b-a4b-q8_0
```
This will use ~26GB of the 251GB available RAM.
### Step 3: Test Natively
Before connecting to Dify, verify generation speed:
```bash
ollama run gemma4:26b-a4b-q8_0
```
Note the tokens-per-second (t/s) output.
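If an interactive session is awkward over SSH, the rate can be captured in one shot. A sketch assuming the `--verbose` flag, which current Ollama builds use to print timing stats after the reply; the sample line below mirrors the shape of its `eval rate` output:

```bash
# Non-interactive form (flag and output shape are assumptions):
#   ollama run gemma4:26b-a4b-q8_0 --verbose "ping" 2>&1 | grep 'eval rate'
# The stats line looks like the sample below; field 3 is the t/s figure:
sample='eval rate:            14.21 tokens/s'
echo "$sample" | awk '{print $3}'
```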
### Step 4: Connect to Dify
1. Open the Dify panel (codex.firefrostgaming.com)
2. Go to **Model Providers**
3. Select **Ollama**
4. Add `gemma4:26b-a4b-q8_0` as a new LLM
5. Set context window to 32K or 64K (RAM can handle it)
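Before walking the UI steps, it is worth confirming that the Dify host can reach Ollama at all. A sketch assuming the stock Ollama port (11434) and its `/api/tags` route, which lists every pulled model as JSON; both are defaults, assumed unchanged on TX1:

```bash
# If this prints JSON mentioning gemma4, Dify can be pointed at it.
OLLAMA_URL="http://localhost:11434/api/tags"
curl -s "$OLLAMA_URL" || echo "Ollama unreachable at $OLLAMA_URL"
```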
---
## Use Cases for Trinity Codex (Task #93)
With Gemma 4 26B A4B powering the RAG:
1. **Operations Manual Queries**
- "What's the current ModpackChecker status?"
- "How do I deploy to the live panel?"
- "What did Chronicler #62 accomplish?"
2. **Server Documentation**
- "What ports are used on TX1?"
- "How do I restart Arbiter?"
- "What's the database schema for subscriptions?"
3. **Chronicler Lineage**
- "Who was The Pathmaker?"
- "What bugs did session #63 fix?"
- "Show me the memorial for The Fallen"
4. **Code Generation**
- "Write a migration for the new detection columns"
- "Create a bash script to backup the database"
- "Generate the DaemonFileRepository detection code"
---
## Cost Comparison
| Approach | Monthly Cost | Notes |
|----------|--------------|-------|
| OpenAI GPT-4 API | $50-200+ | Per-token, scales with usage |
| Anthropic Claude API | $50-200+ | Per-token, scales with usage |
| **Gemma 4 Self-Hosted** | **$0** | One-time setup, unlimited queries |
**Additional Benefits:**
- Data never leaves Firefrost infrastructure
- No rate limits
- No network round-trip latency (inference is local)
- Full control over model behavior
---
## Architecture Integration
```
┌─────────────────────────────────────────────────────────────┐
│ TX1 Dallas (38.68.14.26) — 251GB RAM, CPU-only │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Qdrant │ │ Dify │ │ Ollama │ │
│ │ Vector DB │◄──►│ RAG Engine │◄──►│ Gemma 4 │ │
│ │ Port 6333 │ │ Port 3000 │ │ 26B A4B │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ ▲ ▲ │
│ │ │ │
│ ┌──────┴──────┐ ┌──────┴──────┐ │
│ │ Gitea │ │ n8n │ │
│ │ Webhooks │ │ Ingestion │ │
│ │ (on push) │ │ Workflows │ │
│ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────┐
│ Dify Web App │
│ (Meg & Holly) │
│ Trinity Access │
└─────────────────┘
```
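The Gitea-to-n8n path in the diagram hinges on the push webhook listing which files changed. A minimal sketch of what the ingestion workflow extracts, using a hand-made payload whose field names follow Gitea's push-event schema; the repo path and commit id are invented:

```bash
# Each commit object in a push payload carries added/removed/modified
# file arrays; the ingestion workflow only needs the paths to re-embed.
payload='{"commits":[{"id":"abc123","modified":["docs/ops-manual.md"]}]}'
echo "$payload" | sed -n 's/.*"modified":\[\([^]]*\)\].*/\1/p'
```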
---
## Task Updates
### Task #93: Trinity Codex
**Add to implementation plan:**
- [ ] Update Ollama on TX1 to latest version
- [ ] Pull `gemma4:26b-a4b-q8_0` model
- [ ] Test inference speed (document t/s results)
- [ ] Configure Dify to use Gemma 4 as primary LLM
- [ ] Set context window to 32K-64K
- [ ] Test RAG queries against operations manual
- [ ] Benchmark against previous Ollama models
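The context-window item need not live only in Dify's settings; it can be pinned on the Ollama side with a Modelfile. A sketch using the standard `PARAMETER num_ctx` directive; the derived tag name is our own invention:

```bash
# Derive a tag with a baked-in 32K context window (num_ctx is the
# stock Modelfile parameter; gemma4-codex-32k is a made-up tag name).
cat > Modelfile <<'EOF'
FROM gemma4:26b-a4b-q8_0
PARAMETER num_ctx 32768
EOF
# Then: ollama create gemma4-codex-32k -f Modelfile
cat Modelfile
```

Dify would then be pointed at the derived tag instead of the base one.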
### Task #93 Architecture Update
**Previous assumption:** External LLM API (OpenAI/Anthropic)
**New approach:** Self-hosted Gemma 4 26B A4B via Ollama
**Benefits:**
- Zero ongoing API costs
- Data privacy (never leaves TX1)
- Unlimited queries
- 256K context window
- Frontier-level reasoning
---
## Next Steps
1. **Michael runs on TX1:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma4:26b-a4b-q8_0
ollama run gemma4:26b-a4b-q8_0
```
2. **Report tokens-per-second.** This determines whether the MoE is fast enough or whether we fall back to E4B
3. **Connect to Dify** — Add as model provider, configure context window
4. **Test Trinity Codex queries** — Validate RAG accuracy on operations manual
---
## References
- Gemma 4 Release: April 2, 2026
- License: Apache 2.0 (fully permissive, commercial use OK)
- Ollama: https://ollama.com
- Dify: Already deployed at codex.firefrostgaming.com
---
*Fire + Frost + Foundation = Where Love Builds Legacy* 💙🔥❄️