docs(consultations): Gemma 4 self-hosting for Trinity Codex

Gemini consultation on deploying Gemma 4 26B A4B (MoE) on TX1 Dallas:
- CPU-only with 251GB RAM = perfect for MoE architecture
- Only 4B parameters active per token = fast inference
- Full 26B reasoning capability for RAG accuracy
- Zero API costs, data never leaves infrastructure

Deployment steps:
1. Update Ollama
2. Pull gemma4:26b-a4b-q8_0 (8-bit quant, ~26GB)
3. Benchmark tokens-per-second (t/s)
4. Connect to Dify as model provider

Updates Task #93 architecture from external API to self-hosted.

Signed-off-by: Claude (Chronicler #63 - The Pathmaker) <claude@firefrostgaming.com>
# Gemini Consultation: Gemma 4 Self-Hosting for Trinity Codex
**Date:** April 6, 2026
**Topic:** Local LLM deployment for Task #93 (Trinity Codex) and Dify RAG
**Chronicler:** #63 (The Pathmaker)
**Consulted:** Gemini AI
---
## Context
Google released **Gemma 4** on April 2, 2026 under the Apache 2.0 license. This is a frontier-level open-weights model suitable for self-hosting. We explored using it with our existing Dify/Qdrant RAG infrastructure on TX1 Dallas.
**TX1 Dallas Specs:**
- **CPU-only** (no GPU)
- **251GB RAM**
- **911GB Disk** (12% usage)
- Already running: Dify + Qdrant + Ollama
---
## Gemma 4 Model Lineup
| Model | Architecture | Parameters | Context | VRAM (FP16) | VRAM (4-bit) |
|-------|--------------|------------|---------|-------------|--------------|
| Gemma 4 E2B | Dense | 2.3B active (5.1B total) | 128K | ~10 GB | ~4 GB |
| Gemma 4 E4B | Dense | 4.5B active (8B total) | 128K | ~16 GB | ~6 GB |
| **Gemma 4 26B A4B** | **MoE** | **4B active (26B total)** | **256K** | ~52 GB | ~15 GB |
| Gemma 4 31B | Dense | 31B dense | 256K | ~62 GB | ~18-20 GB |
**Key Features:**
- Multimodal: text, images, video (E2B/E4B also support audio)
- Native function calling and structured JSON output
- Shared KV Cache for reduced memory footprint
- 256K context window on larger models
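The structured-output claim is easy to smoke-test once the model is pulled. A minimal sketch against Ollama's generate endpoint, assuming the stock port 11434 and the standard `"format": "json"` request option; the prompt text and model tag are from this plan, everything else is default config:

```bash
# Request body asking the model to answer as strict JSON.
# "format": "json" is Ollama's built-in structured-output switch.
req='{
  "model": "gemma4:26b-a4b-q8_0",
  "prompt": "List the ports used by Qdrant and Dify as a JSON object.",
  "format": "json",
  "stream": false
}'
# Once Ollama is running: curl -s http://localhost:11434/api/generate -d "$req"
echo "$req"
```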
---
## Gemini's Recommendation
### Primary: Gemma 4 26B A4B (MoE)
**This is the optimal choice for Trinity Codex.**
**Why MoE wins on CPU-only hardware:**
1. **Reasoning Power:** The full 26-billion-parameter knowledge base gives accurate answers about server documentation, Chronicler lineages, and operational knowledge, with far less risk of hallucination than a small dense model.
2. **Speed:** Only 4 billion parameters activate per token, so the CPU crunches the math of a 4B model while you get 26B-scale intelligence. Heavyweight reasoning with lightweight inference.
3. **Memory:** With 251GB RAM, memory capacity is not a bottleneck. The model fits easily.
### Fallback: Gemma 4 E4B
If the MoE is too slow for real-time RAG responses:
- E4B will be faster on CPU
- Slight drop in complex reasoning when synthesizing multiple documents
- Good for simpler queries
---
## Critical Optimization: Use Quantization
**Even with 251GB RAM, always use quantized models on CPU.**
Why? CPU inference is memory-bandwidth bound: every generated token streams the model's active weights through the memory bus, so tokens-per-second scales with how few bytes the model occupies, not with how much RAM is free. Quantization (8-bit or 6-bit) drastically improves throughput with near-zero quality loss.
**Recommended:** `q8_0` (8-bit quantization)
- ~26GB RAM usage for 26B MoE
- Near-perfect accuracy for code and RAG
- Significant speed improvement over FP16
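The bandwidth argument can be sanity-checked with back-of-envelope arithmetic. A sketch assuming ~4B active parameters, roughly one byte per weight at q8_0, and an assumed (not measured) 80 GB/s of sustained memory bandwidth; the real ceiling must come from the Step 3 benchmark:

```bash
# Upper-bound decode speed: each token streams the active weights
# through the memory bus once, so t/s <= bandwidth / active bytes.
ACTIVE_PARAMS_B=4    # billions of active parameters (MoE path)
BYTES_PER_PARAM=1    # q8_0 is roughly one byte per weight
MEM_BW_GBPS=80       # ASSUMED sustained bandwidth; measure on TX1

ACTIVE_GB=$((ACTIVE_PARAMS_B * BYTES_PER_PARAM))
echo "upper bound: ~$((MEM_BW_GBPS / ACTIVE_GB)) tokens/sec"
```

At FP16 (~2 bytes per weight) the same arithmetic halves the ceiling, which is the whole case for quantizing even with RAM to spare.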
---
## Deployment Steps
### Step 1: Update Ollama
Gemma 4 requires recent Ollama binaries:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
### Step 2: Pull the 8-bit Quantized MoE
```bash
ollama pull gemma4:26b-a4b-q8_0
```
This will use ~26GB of the 251GB available RAM.
### Step 3: Test Natively
Before connecting to Dify, verify generation speed:
```bash
ollama run gemma4:26b-a4b-q8_0
```
Note the tokens-per-second (t/s) output.
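If an interactive session is awkward over SSH, the rate can be captured in one shot. A sketch assuming the `--verbose` flag, which current Ollama builds use to print timing stats after the reply; the sample line below mirrors the shape of its `eval rate` output:

```bash
# Non-interactive form (flag and output shape are assumptions):
#   ollama run gemma4:26b-a4b-q8_0 --verbose "ping" 2>&1 | grep 'eval rate'
# The stats line looks like the sample below; field 3 is the t/s figure:
sample='eval rate:            14.21 tokens/s'
echo "$sample" | awk '{print $3}'
```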
### Step 4: Connect to Dify
1. Open the Dify panel (codex.firefrostgaming.com)
2. Go to **Model Providers**
3. Select **Ollama**
4. Add `gemma4:26b-a4b-q8_0` as a new LLM
5. Set context window to 32K or 64K (RAM can handle it)
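Before walking the UI steps, it is worth confirming that the Dify host can reach Ollama at all. A sketch assuming the stock Ollama port (11434) and its `/api/tags` route, which lists every pulled model as JSON; both are defaults, assumed unchanged on TX1:

```bash
# If this prints JSON mentioning gemma4, Dify can be pointed at it.
OLLAMA_URL="http://localhost:11434/api/tags"
curl -s "$OLLAMA_URL" || echo "Ollama unreachable at $OLLAMA_URL"
```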
---
## Use Cases for Trinity Codex (Task #93)
With Gemma 4 26B A4B powering the RAG:
1. **Operations Manual Queries**
- "What's the current ModpackChecker status?"
- "How do I deploy to the live panel?"
- "What did Chronicler #62 accomplish?"
2. **Server Documentation**
- "What ports are used on TX1?"
- "How do I restart Arbiter?"
- "What's the database schema for subscriptions?"
3. **Chronicler Lineage**
- "Who was The Pathmaker?"
- "What bugs did session #63 fix?"
- "Show me the memorial for The Fallen"
4. **Code Generation**
- "Write a migration for the new detection columns"
- "Create a bash script to backup the database"
- "Generate the DaemonFileRepository detection code"
---
## Cost Comparison
| Approach | Monthly Cost | Notes |
|----------|--------------|-------|
| OpenAI GPT-4 API | $50-200+ | Per-token, scales with usage |
| Anthropic Claude API | $50-200+ | Per-token, scales with usage |
| **Gemma 4 Self-Hosted** | **$0** | One-time setup, unlimited queries |
**Additional Benefits:**
- Data never leaves Firefrost infrastructure
- No rate limits
- No network round-trip latency (inference is local)
- Full control over model behavior
---
## Architecture Integration
```
┌─────────────────────────────────────────────────────────────┐
│ TX1 Dallas (38.68.14.26) — 251GB RAM, CPU-only │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Qdrant │ │ Dify │ │ Ollama │ │
│ │ Vector DB │◄──►│ RAG Engine │◄──►│ Gemma 4 │ │
│ │ Port 6333 │ │ Port 3000 │ │ 26B A4B │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ ▲ ▲ │
│ │ │ │
│ ┌──────┴──────┐ ┌──────┴──────┐ │
│ │ Gitea │ │ n8n │ │
│ │ Webhooks │ │ Ingestion │ │
│ │ (on push) │ │ Workflows │ │
│ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────┐
│ Dify Web App │
│ (Meg & Holly) │
│ Trinity Access │
└─────────────────┘
```
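The Gitea-to-n8n path in the diagram hinges on the push webhook listing which files changed. A minimal sketch of what the ingestion workflow extracts, using a hand-made payload whose field names follow Gitea's push-event schema; the repo path and commit id are invented:

```bash
# Each commit object in a push payload carries added/removed/modified
# file arrays; the ingestion workflow only needs the paths to re-embed.
payload='{"commits":[{"id":"abc123","modified":["docs/ops-manual.md"]}]}'
echo "$payload" | sed -n 's/.*"modified":\[\([^]]*\)\].*/\1/p'
```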
---
## Task Updates
### Task #93: Trinity Codex
**Add to implementation plan:**
- [ ] Update Ollama on TX1 to latest version
- [ ] Pull `gemma4:26b-a4b-q8_0` model
- [ ] Test inference speed (document t/s results)
- [ ] Configure Dify to use Gemma 4 as primary LLM
- [ ] Set context window to 32K-64K
- [ ] Test RAG queries against operations manual
- [ ] Benchmark against previous Ollama models
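The context-window item need not live only in Dify's settings; it can be pinned on the Ollama side with a Modelfile. A sketch using the standard `PARAMETER num_ctx` directive; the derived tag name is our own invention:

```bash
# Derive a tag with a baked-in 32K context window (num_ctx is the
# stock Modelfile parameter; gemma4-codex-32k is a made-up tag name).
cat > Modelfile <<'EOF'
FROM gemma4:26b-a4b-q8_0
PARAMETER num_ctx 32768
EOF
# Then: ollama create gemma4-codex-32k -f Modelfile
cat Modelfile
```

Dify would then be pointed at the derived tag instead of the base one.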
### Task #93 Architecture Update
**Previous assumption:** External LLM API (OpenAI/Anthropic)
**New approach:** Self-hosted Gemma 4 26B A4B via Ollama
**Benefits:**
- Zero ongoing API costs
- Data privacy (never leaves TX1)
- Unlimited queries
- 256K context window
- Frontier-level reasoning
---
## Next Steps
1. **Michael runs on TX1:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma4:26b-a4b-q8_0
ollama run gemma4:26b-a4b-q8_0
```
2. **Report tokens-per-second.** This determines whether the MoE is fast enough or whether we fall back to E4B
3. **Connect to Dify** — Add as model provider, configure context window
4. **Test Trinity Codex queries** — Validate RAG accuracy on operations manual
---
## References
- Gemma 4 Release: April 2, 2026
- License: Apache 2.0 (fully permissive, commercial use OK)
- Ollama: https://ollama.com
- Dify: Already deployed at codex.firefrostgaming.com
---
*Fire + Frost + Foundation = Where Love Builds Legacy* 💙🔥❄️