# AI Stack Resource Requirements

**Server:** TX1 Dallas (38.68.14.26)
**Purpose:** Resource allocation planning
**Last Updated:** 2026-02-18
## TX1 Server Specifications

- **CPU:** 32 vCPU
- **RAM:** 256GB
- **Storage:** 1TB NVMe SSD
- **Location:** Dallas, TX
- **Network:** 1Gbps

Current usage (before the AI stack):

- Game servers: 6 Minecraft instances
- Management services: minimal overhead
- Available for AI: significant capacity
## Storage Requirements

### Component Breakdown
| Component | Size | Purpose |
|---|---|---|
| Qwen 2.5 Coder 72B | ~40GB | Infrastructure/coding model |
| Llama 3.3 70B | ~40GB | General reasoning model |
| Llama 3.2 Vision 11B | ~7GB | Image analysis model |
| Dify Services | ~5GB | Docker containers, databases |
| Knowledge Base | ~5GB | Indexed docs, embeddings |
| Logs & Temp | ~2GB | Operational overhead |
| Total | ~99GB | ✅ Well under 1TB limit |
### Storage Growth Estimate

Year 1:

- Models: 87GB (static; no growth unless upgrading)
- Knowledge base: 5GB → 8GB (as docs grow)
- Logs: 2GB → 5GB (6-month rotation)
- Total Year 1: ~100GB

Storage is NOT a concern.
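The component totals above can be sanity-checked with plain shell arithmetic (figures copied from the breakdown table):

```shell
# Storage budget check, using the estimates from the component table
models=$((40 + 40 + 7))     # Qwen 72B + Llama 70B + Vision 11B (GB)
services=$((5 + 5 + 2))     # Dify + knowledge base + logs/temp (GB)
total=$((models + services))
echo "models=${models}GB total=${total}GB"
```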
## RAM Requirements

### Scenario 1: Normal Operations (Claude Available)
| Component | RAM Usage |
|---|---|
| Dify Services | ~4GB |
| PostgreSQL | ~2GB |
| Redis | ~1GB |
| Ollama (idle) | <1GB |
| Total (idle) | ~8GB ✅ |
Game servers retain ~248GB (256GB - 8GB).

### Scenario 2: DERP Activated (Claude Down, Emergency)

Load ONE large model at a time:
| Component | RAM Usage |
|---|---|
| Qwen 2.5 Coder 72B OR Llama 3.3 70B | ~80GB |
| Dify Services | ~4GB |
| PostgreSQL | ~2GB |
| Redis | ~1GB |
| Ollama Runtime | ~2GB |
| OS Overhead | ~3GB |
| Total (active DERP) | ~92GB ✅ |
Game servers retain ~164GB (256GB - 92GB).

**Critical:** do NOT load both large models simultaneously. Together they need ~160GB and would starve the game servers.
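One way to enforce this rule is a pre-flight guard before any model load. This is a minimal sketch: `can_load_model` is a hypothetical helper, and the `free -g` wiring in the comment assumes a standard Linux environment.

```shell
# Hypothetical guard: only load a 70B-class model when enough RAM is free.
# Takes GB values as arguments so the policy itself is testable without
# reading /proc.
can_load_model() {
  local avail_gb=$1 required_gb=$2
  if [ "$avail_gb" -ge "$required_gb" ]; then
    echo ok
  else
    echo insufficient
  fi
}

# Example wiring (Linux): the "available" column of `free -g` is field 7.
#   avail=$(free -g | awk '/^Mem:/ {print $7}')
#   [ "$(can_load_model "$avail" 92)" = ok ] && ollama run qwen2.5-coder:72b ""
```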
### Scenario 3: Vision Model Only (Screenshot Analysis)
| Component | RAM Usage |
|---|---|
| Llama 3.2 Vision 11B | ~7GB |
| Dify Services | ~4GB |
| Other Services | ~3GB |
| Total | ~14GB ✅ |
Very lightweight; this can run alongside the game servers with no noticeable impact.
## CPU Requirements

### Model Inference Performance

TX1 has 32 vCPU, shared among all services.

Expected inference times:
| Model | Token Generation Speed | Typical Response |
|---|---|---|
| Qwen 2.5 Coder 72B | ~3-5 tokens/second | 30-120 seconds |
| Llama 3.3 70B | ~3-5 tokens/second | 30-120 seconds |
| Llama 3.2 Vision 11B | ~8-12 tokens/second | 10-45 seconds |
For comparison, the Claude API streams at roughly 20-40 tokens/second, so DERP is about 5-10× slower. This is expected and acceptable for emergency use.

CPU impact on game servers:

- During DERP inference: ~70-80% CPU usage (temporary spikes)
- Game servers may experience brief lag during AI responses
- Acceptable for emergency use (not for normal operations)
## Network Requirements

### Initial Model Downloads (One-Time)
| Model | Size | Download Time (1Gbps) |
|---|---|---|
| Qwen 2.5 Coder 72B | ~40GB | 5-10 minutes |
| Llama 3.3 70B | ~40GB | 5-10 minutes |
| Llama 3.2 Vision 11B | ~7GB | 1-2 minutes |
| Total | ~87GB | 15-25 minutes |
In practice, download speeds vary; budget 2-4 hours for all three models, and download overnight to avoid impacting game-server traffic.
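For reference, the best-case math behind the table, assuming a fully sustained 1Gbps link (real downloads rarely sustain line rate, hence the 2-4 hour budget):

```shell
# Best-case download time for all three models at line rate
size_gb=87
rate_mb_per_s=$((1000 / 8))                  # 1Gbps is roughly 125 MB/s
seconds=$((size_gb * 1024 / rate_mb_per_s))  # GB -> MB, divided by rate
echo "~$((seconds / 60)) minutes at full line rate"
```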
### Ongoing Bandwidth

Dify web interface:

- Minimal (text-based queries)
- ~1-5 KB per query
- Negligible impact

Discord bot:

- Text-based queries only
- ~1-5 KB per query
- Negligible impact

Model updates:

- Infrequent (quarterly at most)
- Same as the initial download (~87GB)
- Schedule during low-traffic periods
## Resource Allocation Strategy

### Priority Levels

1. **Priority 1 (always):** game servers
2. **Priority 2 (normal):** management services (Pterodactyl, Gitea, etc.)
3. **Priority 3 (emergency only):** DERP AI stack
### RAM Allocation Rules

Normal operations:

- Game servers: up to 240GB
- Management: ~8GB
- AI stack (idle): ~8GB
- Total: 256GB ✅

DERP emergency:

- Game servers: temporarily limited to 160GB
- Management: ~8GB
- AI stack (active): ~92GB
- Total: 260GB ⚠️ (a 4GB overcommit is acceptable for brief periods)

If RAM pressure occurs during DERP:

1. Unload one game server temporarily
2. Run the AI query
3. Reload the game server

Total downtime per query: <5 minutes.
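The unload-query-reload steps above can be sketched as a small wrapper. The stop/start commands are placeholders for whatever your panel (e.g. Pterodactyl) exposes; the wrapper's only job is to guarantee the game server comes back even if the query fails.

```shell
# Sketch: free RAM, run the AI query, then restore the game server.
# All three commands are caller-supplied placeholders.
run_with_headroom() {
  local stop_cmd=$1 start_cmd=$2 query_cmd=$3 rc
  $stop_cmd || return 1   # abort if the unload itself fails
  $query_cmd              # run the (slow) AI query
  rc=$?
  $start_cmd              # always bring the game server back
  return $rc
}

# Example (placeholder server ID and commands):
#   run_with_headroom "panel stop mc-survival" "panel start mc-survival" \
#       "ollama run llama3.3:70b 'diagnose this crash log'"
```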
## Monitoring & Alerts

### Critical Thresholds

RAM usage:

- Warning: >220GB used (85%)
- Critical: >240GB used (93%)
- Action: defer DERP usage or unload a game server

CPU usage:

- Warning: >80% sustained for >5 minutes
- Critical: >90% sustained for >2 minutes
- Action: pause AI inference, prioritize game servers

Storage:

- Warning: >800GB used (80%)
- Critical: >900GB used (90%)
- Action: clean up old logs and the model cache
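The RAM thresholds can be encoded in a small helper suitable for a cron-driven alert script. This is a sketch; the `free -g` wiring in the comment is an assumption about a standard Linux environment.

```shell
# Mirror of the RAM thresholds above: warning >220GB, critical >240GB.
ram_alert_level() {
  local used_gb=$1
  if   [ "$used_gb" -gt 240 ]; then echo critical
  elif [ "$used_gb" -gt 220 ]; then echo warning
  else echo ok
  fi
}

# Example wiring: used=$(free -g | awk '/^Mem:/ {print $3}')
#                 ram_alert_level "$used"
```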
### Monitoring Commands

```bash
# Check RAM
free -h

# Check CPU
htop

# Check storage
df -h /

# Check Ollama status
ollama list
ollama ps            # shows currently loaded models

# Check Dify
cd /opt/dify
docker-compose ps
docker stats         # real-time container resource usage
```
## Resource Optimization

### Unload Models When Not Needed

```bash
# Unload all models (frees RAM)
ollama stop qwen2.5-coder:72b
ollama stop llama3.3:70b
ollama stop llama3.2-vision:11b

# Verify the RAM was freed
free -h
```
### Preload Models for Faster Response

```bash
# Preload a model (takes ~30 seconds); the empty prompt loads it into RAM
ollama run qwen2.5-coder:72b ""

# The model is now resident, so subsequent queries respond faster
```
### Schedule Maintenance Windows

Best time for model downloads/updates:

- Tuesday/Wednesday 2-6 AM CST (lowest traffic)
- Announce in Discord 24 hours ahead
- Expected downtime: <10 minutes
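A maintenance window like this could be automated with a cron entry. This is a hypothetical example (`crontab -e`): the `ollama` binary path assumes the default install location, and it presumes the system timezone is CST and the update was announced beforehand.

```
# Pull a model update at 3 AM on Tuesdays, logging output (hypothetical)
0 3 * * 2 /usr/local/bin/ollama pull qwen2.5-coder:72b >> /var/log/ollama-update.log 2>&1
```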
## Capacity Planning

### Current State (Feb 2026)

- Game servers: 6 active
- RAM: 256GB
- Storage: 1TB
- AI stack: fits comfortably
### Growth Scenarios

**Scenario 1: add 6 more game servers (12 total)**

- Additional RAM needed: ~60GB
- Available for AI (normal): 248GB → 188GB ✅
- Available for AI (DERP): 164GB → 104GB ✅
- Status: still viable

**Scenario 2: add 12 more game servers (18 total)**

- Additional RAM needed: ~120GB
- Available for AI (normal): 248GB → 128GB ✅
- Available for AI (DERP): 164GB → 44GB ⚠️
- Status: DERP would require unloading two game servers

**Scenario 3: upgrade to larger models (theoretical)**

- Qwen 3.0 Coder 170B: ~180GB RAM
- Status: would NOT fit alongside the game servers
- Recommendation: stick with the 72B models
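The growth math above reduces to a one-line calculation. This sketch assumes ~10GB per additional Minecraft instance (the estimate implied by "6 more servers ≈ 60GB"); `derp_headroom_gb` is a hypothetical helper name.

```shell
# RAM left for game servers during active DERP, given N servers added
# beyond today's six (assumes ~10GB per extra Minecraft instance).
derp_headroom_gb() {
  local extra_servers=$1
  local total=256 derp_stack=92
  echo $((total - derp_stack - extra_servers * 10))
}
```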
### Upgrade Path

If TX1 reaches capacity:

**Option A: add a second dedicated AI server**

- Move the AI stack to a separate VPS
- TX1 focuses only on game servers
- Cost: ~$100-200/month (NOT DERP-compliant)

**Option B: upgrade TX1 RAM**

- 256GB → 512GB
- Cost: contact Hetzner for pricing
- Preferred: maintains DERP compliance

**Option C: use smaller AI models**

- Qwen 2.5 Coder 32B (~35GB RAM)
- Llama 3.2 8B (~8GB RAM)
- Tradeoff: lower quality, but more capacity
## Disaster Recovery

### Backup Strategy

What to back up:

- ✅ Dify configuration files
- ✅ Knowledge base data
- ✅ Discord bot code
- ❌ Models (can be re-downloaded)

Backup locations:

- Git repository (for configs/code)
- NC1 Charlotte (for the knowledge base)

Backup frequency:

- Configurations: after every change
- Knowledge base: weekly
- Models: no backup needed
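A weekly knowledge-base backup could look like the sketch below. The `/opt/dify/volumes` path is an assumption about where a docker-compose Dify install keeps its data, and the NC1 host/user in the comment are placeholders; adjust both to your layout.

```shell
# Archive a directory into a dated tarball (sketch; paths are assumptions)
backup_kb() {
  local src=$1 dest_dir=$2
  tar czf "$dest_dir/dify-kb-$(date +%F).tar.gz" \
      -C "$(dirname "$src")" "$(basename "$src")"
}

# Weekly job: archive locally, then ship to NC1 (placeholder host/user):
#   backup_kb /opt/dify/volumes /backups
#   rsync -az /backups/ backup@nc1:/backups/dify/
```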
### Recovery Procedure

If TX1 fails completely:

1. Deploy Dify on NC1 (temporary)
2. Restore the knowledge base from backup
3. Re-download the models (~4 hours)
4. Point the Discord bot to NC1

Total downtime: 4-6 hours. This is acceptable for DERP, an emergency-only system.
## Cost Analysis

### One-Time Costs

- Setup time: 6-8 hours (Michael's time)
- Model downloads: bandwidth usage (included in hosting)
- Total: $0 (sweat equity only)

### Monthly Costs

- Hosting: $0 (using the existing TX1)
- Bandwidth: $0 (included in hosting)
- Maintenance: ~1 hour/month (Michael's time)
- Total: $0/month ✅

### Opportunity Cost

- RAM reserved for AI: ~8GB (idle) or ~92GB (active DERP)
- That space could host 1-2 more game servers
- Acceptable tradeoff: DERP independence is worth more than two game servers
Fire + Frost + Foundation + DERP = True Independence 💙🔥❄️
TX1 has the capacity. Resources are allocated wisely. $0 monthly cost maintained.