# AI Stack Resource Requirements

**Server:** TX1 Dallas (38.68.14.26)
**Purpose:** Resource allocation planning
**Last Updated:** 2026-02-18

---

## TX1 Server Specifications

**CPU:** 32 vCPU
**RAM:** 256GB
**Storage:** 1TB NVMe SSD
**Location:** Dallas, TX
**Network:** 1Gbps

**Current Usage (before AI stack):**
- Game servers: 6 Minecraft instances
- Management services: Minimal overhead
- Available for AI: Significant capacity

---

## Storage Requirements

### Component Breakdown

| Component | Size | Purpose |
|-----------|------|---------|
| **Qwen 2.5 Coder 72B** | ~40GB | Infrastructure/coding model |
| **Llama 3.3 70B** | ~40GB | General reasoning model |
| **Llama 3.2 Vision 11B** | ~7GB | Image analysis model |
| **Dify Services** | ~5GB | Docker containers, databases |
| **Knowledge Base** | ~5GB | Indexed docs, embeddings |
| **Logs & Temp** | ~2GB | Operational overhead |
| **Total** | **~99GB** | ✅ Well under 1TB limit |

### Storage Growth Estimate

**Year 1:**
- Models: 87GB (static, no growth unless upgrading)
- Knowledge base: 5GB → 8GB (as docs grow)
- Logs: 2GB → 5GB (6-month rotation)
- **Total Year 1:** ~100GB

**Storage is NOT a concern.**

---

## RAM Requirements

### Scenario 1: Normal Operations (Claude Available)

| Component | RAM Usage |
|-----------|-----------|
| **Dify Services** | ~4GB |
| **PostgreSQL** | ~2GB |
| **Redis** | ~1GB |
| **Ollama (idle)** | <1GB |
| **Total (idle)** | **~8GB** ✅ |

**Game servers have ~248GB available** (256GB - 8GB)

---

### Scenario 2: DERP Activated (Claude Down, Emergency)

**Load ONE large model at a time:**

| Component | RAM Usage |
|-----------|-----------|
| **Qwen 2.5 Coder 72B** OR **Llama 3.3 70B** | ~80GB |
| **Dify Services** | ~4GB |
| **PostgreSQL** | ~2GB |
| **Redis** | ~1GB |
| **Ollama Runtime** | ~2GB |
| **OS Overhead** | ~3GB |
| **Total (active DERP)** | **~92GB** ✅ |

**Game servers have ~164GB available** (256GB - 92GB)

**Critical:** DO NOT load both large models simultaneously (~160GB combined would impact game servers)
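
The warning above can be enforced with a small guard before any manual model load. This is a sketch: the helper names are ours, and it assumes `ollama ps` prints each loaded model's tag in its output.

```bash
# Guard: refuse to load a second 70B-class model while one is resident.
# Sketch only -- assumes `ollama ps` lists loaded model tags in its output.

large_model_loaded() {
  # $1: output of `ollama ps`; succeeds if a large model is already in RAM
  echo "$1" | grep -qE 'qwen2\.5-coder:72b|llama3\.3:70b'
}

load_large_model() {
  local model="$1"
  if large_model_loaded "$(ollama ps)"; then
    echo "REFUSED: a large model is already loaded; run 'ollama stop <model>' first" >&2
    return 1
  fi
  ollama run "$model" ""   # empty prompt just loads the model into RAM
}
```

Wiring Discord-bot or Dify-triggered loads through `load_large_model` keeps an accidental double load (and the resulting game-server starvation) from happening.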

---

### Scenario 3: Vision Model Only (Screenshot Analysis)

| Component | RAM Usage |
|-----------|-----------|
| **Llama 3.2 Vision 11B** | ~7GB |
| **Dify Services** | ~4GB |
| **Other Services** | ~3GB |
| **Total** | **~14GB** ✅ |

**Very lightweight; can run alongside game servers with no impact.**

---

## CPU Requirements

### Model Inference Performance

**TX1 has 32 vCPU (shared among all services).**

**Expected Inference Times:**

| Model | Token Generation Speed | Typical Response |
|-------|------------------------|------------------|
| **Qwen 2.5 Coder 72B** | ~3-5 tokens/second | 30-120 seconds |
| **Llama 3.3 70B** | ~3-5 tokens/second | 30-120 seconds |
| **Llama 3.2 Vision 11B** | ~8-12 tokens/second | 10-45 seconds |

**For comparison:**
- Claude API: 20-40 tokens/second
- **DERP is 5-10× slower** (this is expected and acceptable for emergency use)

**CPU Impact on Game Servers:**
- During DERP inference: ~70-80% CPU usage (temporary spikes)
- Game servers may experience brief lag during AI responses
- **Acceptable for emergency use** (not for normal operations)

---

## Network Requirements

### Initial Model Downloads (One-Time)

| Model | Size | Download Time (1Gbps) |
|-------|------|-----------------------|
| **Qwen 2.5 Coder 72B** | ~40GB | 5-10 minutes |
| **Llama 3.3 70B** | ~40GB | 5-10 minutes |
| **Llama 3.2 Vision 11B** | ~7GB | 1-2 minutes |
| **Total** | **~87GB** | **15-25 minutes** |

**Reality:** Download speeds vary; budget 2-4 hours for all models.

**Recommendation:** Download overnight to avoid impacting game server traffic.
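
The overnight download can be scripted as a sequential pull (a sketch: the model tags match the table above, and the `OLLAMA` override exists only so the script can be dry-run; sequential pulls avoid saturating the link while game servers are live):

```bash
# Sequential overnight pull of the three DERP models.
# OLLAMA can be overridden for dry runs (e.g. OLLAMA=echo).

pull_all_models() {
  local ollama_cmd="${OLLAMA:-ollama}"
  local model
  for model in qwen2.5-coder:72b llama3.3:70b llama3.2-vision:11b; do
    echo "pulling ${model}"
    "$ollama_cmd" pull "$model" || return 1
  done
  echo "done"
}
```

Run it detached so it survives the SSH session, e.g. `nohup bash -c 'source pull-models.sh; pull_all_models' > pull.log 2>&1 &`.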

---

### Ongoing Bandwidth

**Dify Web Interface:**
- Minimal (text-based queries)
- ~1-5 KB per query
- Negligible impact

**Discord Bot:**
- Text-based queries only
- ~1-5 KB per query
- Negligible impact

**Model Updates:**
- Infrequent (quarterly at most)
- Same as initial download (~87GB)
- Schedule during low-traffic periods

---

## Resource Allocation Strategy

### Priority Levels

**Priority 1 (Always):** Game Servers
**Priority 2 (Normal):** Management Services (Pterodactyl, Gitea, etc.)
**Priority 3 (Emergency Only):** DERP AI Stack

### RAM Allocation Rules

**Normal Operations:**
- Game servers: Up to 240GB
- Management: ~8GB
- AI Stack (idle): ~8GB
- **Total: 256GB** ✅

**DERP Emergency:**
- Game servers: Temporarily limited to 160GB
- Management: ~8GB
- AI Stack (active): ~92GB
- **Total: 260GB** ⚠️ (4GB overcommit acceptable for brief periods)

**If RAM pressure occurs during DERP:**
1. Unload one game server temporarily
2. Run AI query
3. Reload game server
4. **Total downtime per query: <5 minutes**
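
The steps above can be wrapped in one script. This is a sketch: `mc-survival.service` is a placeholder unit name (substitute however the game servers are actually managed, e.g. via Pterodactyl), and the `SYSTEMCTL`/`OLLAMA` overrides exist only to allow dry runs.

```bash
# Free RAM for one DERP query, then restore the game server.
# "mc-survival.service" is a placeholder unit name.

derp_query() {
  local prompt="$1"
  local systemctl_cmd="${SYSTEMCTL:-systemctl}"
  local ollama_cmd="${OLLAMA:-ollama}"

  "$systemctl_cmd" stop mc-survival.service      # 1. unload one game server
  "$ollama_cmd" run qwen2.5-coder:72b "$prompt"  # 2. run the AI query
  "$ollama_cmd" stop qwen2.5-coder:72b           #    free the model's ~80GB
  "$systemctl_cmd" start mc-survival.service     # 3. reload the game server
}
```

Announcing the brief restart in Discord before running it keeps the <5 minute downtime from surprising players.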

---

## Monitoring & Alerts

### Critical Thresholds

**RAM Usage:**
- **Warning:** >220GB used (~86%)
- **Critical:** >240GB used (~94%)
- **Action:** Defer DERP usage or unload a game server

**CPU Usage:**
- **Warning:** >80% sustained for >5 minutes
- **Critical:** >90% sustained for >2 minutes
- **Action:** Pause AI inference, prioritize game servers

**Storage:**
- **Warning:** >800GB used (80%)
- **Critical:** >900GB used (90%)
- **Action:** Clean up old logs and model cache
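
These thresholds can be wired into a cron-able check. A sketch: the cutoffs mirror the numbers above; the alerting hook (e.g. a Discord webhook) is left out, and the classifiers take used-GB arguments so the logic is testable in isolation.

```bash
# Classify current RAM/storage usage against the DERP thresholds.

ram_level() {
  local used_gb="$1"
  if   [ "$used_gb" -gt 240 ]; then echo critical
  elif [ "$used_gb" -gt 220 ]; then echo warning
  else echo ok
  fi
}

storage_level() {
  local used_gb="$1"
  if   [ "$used_gb" -gt 900 ]; then echo critical
  elif [ "$used_gb" -gt 800 ]; then echo warning
  else echo ok
  fi
}

# Live check: used RAM (GB) from `free`, used storage (GB) from `df`.
check_now() {
  local ram_used storage_used
  ram_used="$(free -g | awk '/^Mem:/ {print $3}')"
  storage_used="$(df -BG --output=used / | tail -1 | tr -dc '0-9')"
  echo "RAM: $(ram_level "$ram_used") (${ram_used}GB used)"
  echo "Storage: $(storage_level "$storage_used") (${storage_used}GB used)"
}
```

A cron entry such as `*/5 * * * * /opt/scripts/check-thresholds.sh` (path illustrative) would run `check_now` every five minutes.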

### Monitoring Commands

```bash
# Check RAM
free -h

# Check CPU
htop

# Check storage
df -h /

# Check Ollama status
ollama list
ollama ps   # Shows loaded models

# Check Dify
cd /opt/dify
docker-compose ps
docker stats   # Real-time resource usage
```

---

## Resource Optimization

### Unload Models When Not Needed

```bash
# Unload models individually (frees RAM)
ollama stop qwen2.5-coder:72b
ollama stop llama3.3:70b
ollama stop llama3.2-vision:11b

# Verify RAM was freed
free -h
```

### Preload Models for Faster Response

```bash
# Preload a model (takes ~30 seconds)
ollama run qwen2.5-coder:72b ""

# Model is now in RAM; queries will be faster
```
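
By default Ollama evicts an idle model after a few minutes, which would put a long reload in front of every query during an incident. Raising the keep-alive window for the duration of a DERP activation avoids that. `OLLAMA_KEEP_ALIVE` is Ollama's standard setting; the systemd drop-in assumes Ollama runs as the stock `ollama` service.

```bash
# Keep models resident during an active incident instead of letting
# Ollama evict them after its idle timeout (default: a few minutes).
sudo systemctl edit ollama
# ...then add under [Service]:
#   Environment="OLLAMA_KEEP_ALIVE=2h"
sudo systemctl restart ollama
```

Remember to revert (or set a short value) once the incident is over so the ~80GB model doesn't sit in RAM indefinitely.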

### Schedule Maintenance Windows

**Best time for model downloads/updates:**
- Tuesday/Wednesday 2-6 AM CST (lowest traffic)
- Announce in Discord 24 hours ahead
- Expected downtime: <10 minutes

---

## Capacity Planning

### Current State (Feb 2026)
- **Game servers:** 6 active
- **RAM available:** 256GB
- **Storage available:** 1TB
- **AI stack:** Fits comfortably

### Growth Scenarios

**Scenario 1: Add 6 more game servers (12 total)**
- Additional RAM needed: ~60GB
- Available for AI (normal): 248GB → 188GB ✅
- Available for AI (DERP): 164GB → 104GB ✅
- **Status:** Still viable

**Scenario 2: Add 12 more game servers (18 total)**
- Additional RAM needed: ~120GB
- Available for AI (normal): 248GB → 128GB ✅
- Available for AI (DERP): 164GB → 44GB ⚠️
- **Status:** DERP would require unloading 2 game servers

**Scenario 3: Upgrade to larger models (theoretical)**
- Qwen 3.0 Coder 170B: ~180GB RAM
- **Status:** Would NOT fit alongside game servers
- **Recommendation:** Stick with 72B models

### Upgrade Path

**If TX1 reaches capacity:**

**Option A: Add second dedicated AI server**
- Move AI stack to separate VPS
- TX1 focuses only on game servers
- Cost: ~$100-200/month (NOT DERP-compliant)

**Option B: Upgrade TX1 RAM**
- 256GB → 512GB
- Cost: Contact Hetzner for pricing
- **Preferred:** Maintains DERP compliance

**Option C: Use smaller AI models**
- Qwen 2.5 Coder 32B (~35GB RAM)
- Llama 3.1 8B (~8GB RAM)
- **Tradeoff:** Lower quality, but more capacity

---

## Disaster Recovery

### Backup Strategy

**What to backup:**
- ✅ Dify configuration files
- ✅ Knowledge base data
- ✅ Discord bot code
- ❌ Models (can re-download)

**Backup location:**
- Git repository (for configs/code)
- NC1 Charlotte (for knowledge base)

**Backup frequency:**
- Configurations: After every change
- Knowledge base: Weekly
- Models: No backup needed
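
The weekly knowledge-base backup can be sketched as follows. The source path and the NC1 destination are illustrative placeholders; configs and bot code go to Git separately, as noted above.

```bash
# Weekly knowledge-base backup to NC1 Charlotte.
# SRC default and the rsync destination are illustrative -- adjust to the real layout.

backup_knowledge_base() {
  local src="${1:-/opt/dify/volumes}"
  local archive="/tmp/dify-kb-$(date +%F).tar.gz"

  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
  echo "created $archive"
  # Ship offsite (NC1 hostname is a placeholder):
  # rsync -az "$archive" backup@nc1.example:/backups/dify/
}
```

A weekly cron entry running this (with the rsync line enabled) satisfies the frequency table above.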

### Recovery Procedure

**If TX1 fails completely:**

1. Deploy Dify on NC1 (temporary)
2. Restore knowledge base from backup
3. Re-download models (~4 hours)
4. Point Discord bot to NC1
5. **Downtime: 4-6 hours**

**Note:** This is acceptable for DERP (emergency-only system)

---

## Cost Analysis

### One-Time Costs
- Setup time: 6-8 hours (Michael's time)
- Model downloads: Bandwidth usage (included in hosting)
- **Total: $0** (sweat equity only)

### Monthly Costs
- Hosting: $0 (using existing TX1)
- Bandwidth: $0 (included in hosting)
- Maintenance: ~1 hour/month (Michael's time)
- **Total: $0/month** ✅

### Opportunity Cost
- RAM reserved for AI: ~8GB (idle) or ~92GB (active DERP)
- Could host 1-2 more game servers in that space
- **Acceptable tradeoff:** DERP independence is worth more than 2 game servers

---

**Fire + Frost + Foundation + DERP = True Independence** 💙🔥❄️

**TX1 has the capacity. Resources are allocated wisely. $0 monthly cost maintained.**