docs(consultations): Gemma 4 self-hosting for Trinity Codex

Gemini consultation on deploying Gemma 4 26B A4B (MoE) on TX1 Dallas:

- CPU-only with 251GB RAM = perfect for MoE architecture
- Only 4B parameters active per token = fast inference
- Full 26B reasoning capability for RAG accuracy
- Zero API costs; data never leaves infrastructure

Deployment steps:

1. Update Ollama
2. Pull gemma4:26b-a4b-q8_0 (8-bit quant, ~26GB)
3. Test t/s speed
4. Connect to Dify as model provider

Updates Task #93 architecture from external API to self-hosted.

Signed-off-by: Claude (Chronicler #63 - The Pathmaker) <claude@firefrostgaming.com>

New file: docs/consultations/gemini-gemma4-selfhosting-2026-04-06.md (240 lines)

# Gemini Consultation: Gemma 4 Self-Hosting for Trinity Codex

**Date:** April 6, 2026
**Topic:** Local LLM deployment for Task #93 (Trinity Codex) and Dify RAG
**Chronicler:** #63 (The Pathmaker)
**Consulted:** Gemini AI

---

## Context

Google released **Gemma 4** on April 2, 2026 under the Apache 2.0 license. This is a frontier-level open-weights model suitable for self-hosting. We explored using it with our existing Dify/Qdrant RAG infrastructure on TX1 Dallas.

**TX1 Dallas Specs:**
- **CPU-only** (no GPU)
- **251GB RAM**
- **911GB Disk** (12% usage)
- Already running: Dify + Qdrant + Ollama

---

## Gemma 4 Model Lineup

| Model | Architecture | Parameters | Context | VRAM (FP16) | VRAM (4-bit) |
|-------|--------------|------------|---------|-------------|--------------|
| Gemma 4 E2B | Dense | 2.3B active (5.1B total) | 128K | ~10 GB | ~4 GB |
| Gemma 4 E4B | Dense | 4.5B active (8B total) | 128K | ~16 GB | ~6 GB |
| **Gemma 4 26B A4B** | **MoE** | **4B active (26B total)** | **256K** | ~52 GB | ~15 GB |
| Gemma 4 31B | Dense | 31B dense | 256K | ~62 GB | ~18-20 GB |

**Key Features:**
- Multimodal: text, images, video (E2B/E4B also support audio)
- Native function calling and structured JSON output
- Shared KV cache for reduced memory footprint
- 256K context window on larger models

---

## Gemini's Recommendation

### Primary: Gemma 4 26B A4B (MoE)

**This is the optimal choice for Trinity Codex.**

**Why MoE wins on CPU-only hardware:**

1. **Reasoning power:** The full 26-billion-parameter knowledge base parses server documentation, Chronicler lineages, and operational knowledge accurately, reducing the risk of hallucination.

2. **Speed:** Only ~4 billion parameters are active per token, so the CPU only does the math of a 4B model while delivering 26B-scale intelligence. Heavyweight reasoning with lightweight inference.

3. **Memory:** With 251GB RAM, memory capacity is not a bottleneck. The model fits easily.
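The speed claim in point 2 can be made concrete with back-of-envelope arithmetic. This is a rough sketch using the parameter counts from the model lineup table; it ignores routing overhead and any shared layers, so treat it as an intuition pump, not a benchmark:

```shell
# Per-token work, MoE vs dense (rough sketch, counts from the lineup table).
# A dense 31B model streams all 31B weights for every generated token;
# the 26B A4B MoE streams only its ~4B active weights while still routing
# within 26B parameters of total knowledge.
DENSE_B=31
MOE_ACTIVE_B=4
RATIO=$(awk -v d="$DENSE_B" -v a="$MOE_ACTIVE_B" 'BEGIN { printf "%.2f", d / a }')
echo "MoE does roughly ${RATIO}x less per-token work than the dense 31B"
```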

### Fallback: Gemma 4 E4B

If the MoE is too slow for real-time RAG responses:
- E4B will be faster on CPU
- Slight drop in complex reasoning when synthesizing multiple documents
- Good for simpler queries

---

## Critical Optimization: Use Quantization

**Even with 251GB RAM, always use quantized models on CPU.**

Why? CPU inference speed is bound by the physical model size that must move through the memory bus for every token. Quantization (8-bit or 6-bit) drastically improves tokens-per-second with near-zero intelligence loss.

**Recommended:** `q8_0` (8-bit quantization)
- ~26GB RAM usage for the 26B MoE
- Near-perfect accuracy for code and RAG
- Significant speed improvement over FP16
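The bandwidth argument can be quantified with a throughput-ceiling sketch. The ~100 GB/s figure below is an assumed server-class memory bandwidth, not a measured TX1 number:

```shell
# Upper bound on tokens/s for memory-bound CPU inference:
# each generated token must stream all active weights through RAM once.
BW_GBS=100          # assumed sustained memory bandwidth (GB/s), not measured
ACTIVE_B=4          # ~4B active parameters (Gemma 4 26B A4B)
BYTES_PER_W=1       # q8_0 is roughly 1 byte per weight

TPS=$(awk -v bw="$BW_GBS" -v p="$ACTIVE_B" -v b="$BYTES_PER_W" \
  'BEGIN { printf "%.0f", bw / (p * b) }')
echo "ceiling: ~${TPS} tokens/s at q8_0"
# At FP16 (2 bytes/weight) the same bandwidth halves the ceiling,
# which is why quantization directly buys speed on CPU.
```

Real throughput lands well below this ceiling (attention, KV cache, and expert routing all cost extra), but the ratio between FP16 and q8_0 holds, which is the point of quantizing.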

---

## Deployment Steps

### Step 1: Update Ollama

Gemma 4 requires recent Ollama binaries:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

### Step 2: Pull the 8-bit Quantized MoE

```bash
ollama pull gemma4:26b-a4b-q8_0
```

This will use ~26GB of the 251GB available RAM.

### Step 3: Test Natively

Before connecting to Dify, verify generation speed:

```bash
ollama run gemma4:26b-a4b-q8_0
```

Note the tokens-per-second (t/s) output.
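The interactive run is enough for an eyeball check, but the measurement is also scriptable: Ollama's HTTP API reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds) in its non-streaming `/api/generate` response. A sketch (the sample JSON at the end is invented for illustration):

```shell
# tps: derive tokens/s from an Ollama /api/generate JSON response.
tps() {
  python3 -c 'import json, sys; r = json.load(sys.stdin); print(round(r["eval_count"] / (r["eval_duration"] / 1e9), 1))'
}

# Live usage against the local Ollama API (default port 11434):
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model": "gemma4:26b-a4b-q8_0", "prompt": "Explain RAG in one sentence.", "stream": false}' | tps
#
# Sanity check on a hand-made sample response (512 tokens in 25.6s):
RATE=$(echo '{"eval_count": 512, "eval_duration": 25600000000}' | tps)
echo "${RATE} tokens/s"
```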

### Step 4: Connect to Dify

1. Open the Dify panel (codex.firefrostgaming.com)
2. Go to **Model Providers**
3. Select **Ollama**
4. Add `gemma4:26b-a4b-q8_0` as a new LLM
5. Set the context window to 32K or 64K (RAM can handle it)
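The provider form looks roughly like the following. The field names approximate Dify's Ollama provider dialog, and the `host.docker.internal` URL is an assumption that applies only if Dify runs in Docker, where `localhost` would resolve to the Dify container rather than the host running Ollama:

```
Model Name:   gemma4:26b-a4b-q8_0
Base URL:     http://host.docker.internal:11434   (Dify in Docker)
              http://localhost:11434              (Dify on the host)
Model Type:   Chat
Context Size: 32768
```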

---

## Use Cases for Trinity Codex (Task #93)

With Gemma 4 26B A4B powering the RAG:

1. **Operations Manual Queries**
   - "What's the current ModpackChecker status?"
   - "How do I deploy to the live panel?"
   - "What did Chronicler #62 accomplish?"

2. **Server Documentation**
   - "What ports are used on TX1?"
   - "How do I restart Arbiter?"
   - "What's the database schema for subscriptions?"

3. **Chronicler Lineage**
   - "Who was The Pathmaker?"
   - "What bugs did session #63 fix?"
   - "Show me the memorial for The Fallen"

4. **Code Generation**
   - "Write a migration for the new detection columns"
   - "Create a bash script to backup the database"
   - "Generate the DaemonFileRepository detection code"
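Queries like these can also be exercised from the command line through Dify's service API (`/v1/chat-messages` is Dify's documented chat endpoint). A sketch, where the base URL is assumed to be the existing Dify host, `DIFY_API_KEY` is a placeholder for a real app key, and the sample response at the end is invented:

```shell
# answer: pull the "answer" field out of a Dify blocking-mode chat response.
answer() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["answer"])'
}

# Live usage (assumed host; DIFY_API_KEY is a placeholder):
#   curl -s https://codex.firefrostgaming.com/v1/chat-messages \
#     -H "Authorization: Bearer $DIFY_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d '{"inputs": {}, "query": "What ports are used on TX1?", "response_mode": "blocking", "user": "chronicler"}' | answer
#
# Sanity check on a hand-made sample response:
ANS=$(echo '{"answer": "Qdrant listens on 6333."}' | answer)
echo "$ANS"
```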

---

## Cost Comparison

| Approach | Monthly Cost | Notes |
|----------|--------------|-------|
| OpenAI GPT-4 API | $50-200+ | Per-token, scales with usage |
| Anthropic Claude API | $50-200+ | Per-token, scales with usage |
| **Gemma 4 Self-Hosted** | **$0** | One-time setup, unlimited queries |

**Additional Benefits:**
- Data never leaves Firefrost infrastructure
- No rate limits
- No API latency (local inference)
- Full control over model behavior

---

## Architecture Integration

```
┌─────────────────────────────────────────────────────────────┐
│  TX1 Dallas (38.68.14.26) — 251GB RAM, CPU-only             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   Qdrant    │    │    Dify     │    │   Ollama    │      │
│  │  Vector DB  │◄──►│ RAG Engine  │◄──►│  Gemma 4    │      │
│  │  Port 6333  │    │  Port 3000  │    │  26B A4B    │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│         ▲                  ▲                                │
│         │                  │                                │
│  ┌──────┴──────┐    ┌──────┴──────┐                         │
│  │    Gitea    │    │     n8n     │                         │
│  │  Webhooks   │    │  Ingestion  │                         │
│  │  (on push)  │    │  Workflows  │                         │
│  └─────────────┘    └─────────────┘                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │  Dify Web App   │
                    │  (Meg & Holly)  │
                    │ Trinity Access  │
                    └─────────────────┘
```

---

## Task Updates

### Task #93: Trinity Codex

**Add to implementation plan:**
- [ ] Update Ollama on TX1 to latest version
- [ ] Pull `gemma4:26b-a4b-q8_0` model
- [ ] Test inference speed (document t/s results)
- [ ] Configure Dify to use Gemma 4 as primary LLM
- [ ] Set context window to 32K-64K
- [ ] Test RAG queries against operations manual
- [ ] Benchmark against previous Ollama models

### Task #93 Architecture Update

**Previous assumption:** External LLM API (OpenAI/Anthropic)
**New approach:** Self-hosted Gemma 4 26B A4B via Ollama

**Benefits:**
- Zero ongoing API costs
- Data privacy (never leaves TX1)
- Unlimited queries
- 256K context window
- Frontier-level reasoning

---

## Next Steps

1. **Michael runs on TX1:**

   ```bash
   curl -fsSL https://ollama.com/install.sh | sh
   ollama pull gemma4:26b-a4b-q8_0
   ollama run gemma4:26b-a4b-q8_0
   ```

2. **Report tokens-per-second** — this determines whether the MoE is fast enough or we fall back to E4B

3. **Connect to Dify** — add as model provider, configure context window

4. **Test Trinity Codex queries** — validate RAG accuracy against the operations manual

---

## References

- Gemma 4 Release: April 2, 2026
- License: Apache 2.0 (fully permissive, commercial use OK)
- Ollama: https://ollama.com
- Dify: Already deployed at codex.firefrostgaming.com

---

*Fire + Frost + Foundation = Where Love Builds Legacy* 💙🔥❄️