# AI Stack Resource Requirements

**Server:** TX1 Dallas (38.68.14.26)
**Purpose:** Resource allocation planning
**Last Updated:** 2026-02-18
## TX1 Server Specifications

- **CPU:** 32 vCPU
- **RAM:** 256GB
- **Storage:** 1TB NVMe SSD
- **Location:** Dallas, TX
- **Network:** 1Gbps

Current usage (before the AI stack):

- Game servers: 6 Minecraft instances
- Management services: minimal overhead
- Available for AI: significant capacity
## Storage Requirements

### Component Breakdown
| Component | Size | Purpose |
|---|---|---|
| Qwen 2.5 Coder 72B | ~40GB | Infrastructure/coding model |
| Llama 3.3 70B | ~40GB | General reasoning model |
| Llama 3.2 Vision 11B | ~7GB | Image analysis model |
| Dify Services | ~5GB | Docker containers, databases |
| Knowledge Base | ~5GB | Indexed docs, embeddings |
| Logs & Temp | ~2GB | Operational overhead |
| Total | ~99GB | ✅ Well under 1TB limit |
### Storage Growth Estimate

Year 1:

- Models: 87GB (static; no growth unless upgrading)
- Knowledge base: 5GB → 8GB (as docs grow)
- Logs: 2GB → 5GB (6-month rotation)
- Total Year 1: ~100GB

Storage is NOT a concern.
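The component totals above can be sanity-checked with plain shell arithmetic (figures copied from the breakdown table):

```shell
# Storage budget check, using the estimates from the component table
models=$((40 + 40 + 7))     # Qwen 72B + Llama 70B + Vision 11B (GB)
services=$((5 + 5 + 2))     # Dify + knowledge base + logs/temp (GB)
total=$((models + services))
echo "models=${models}GB total=${total}GB"
```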
## RAM Requirements

### Scenario 1: Normal Operations (Claude Available)
| Component | RAM Usage |
|---|---|
| Dify Services | ~4GB |
| PostgreSQL | ~2GB |
| Redis | ~1GB |
| Ollama (idle) | <1GB |
| Total (idle) | ~8GB ✅ |
Game servers retain ~248GB (256GB - 8GB).

### Scenario 2: DERP Activated (Claude Down, Emergency)

Load ONE large model at a time:
| Component | RAM Usage |
|---|---|
| Qwen 2.5 Coder 72B OR Llama 3.3 70B | ~80GB |
| Dify Services | ~4GB |
| PostgreSQL | ~2GB |
| Redis | ~1GB |
| Ollama Runtime | ~2GB |
| OS Overhead | ~3GB |
| Total (active DERP) | ~92GB ✅ |
Game servers retain ~164GB (256GB - 92GB).

**Critical:** do NOT load both large models simultaneously. Together they need ~160GB and would starve the game servers.
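One way to enforce this rule is a pre-flight guard before any model load. This is a minimal sketch: `can_load_model` is a hypothetical helper, and the `free -g` wiring in the comment assumes a standard Linux environment.

```shell
# Hypothetical guard: only load a 70B-class model when enough RAM is free.
# Takes GB values as arguments so the policy itself is testable without
# reading /proc.
can_load_model() {
  local avail_gb=$1 required_gb=$2
  if [ "$avail_gb" -ge "$required_gb" ]; then
    echo ok
  else
    echo insufficient
  fi
}

# Example wiring (Linux): the "available" column of `free -g` is field 7.
#   avail=$(free -g | awk '/^Mem:/ {print $7}')
#   [ "$(can_load_model "$avail" 92)" = ok ] && ollama run qwen2.5-coder:72b ""
```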
### Scenario 3: Vision Model Only (Screenshot Analysis)
| Component | RAM Usage |
|---|---|
| Llama 3.2 Vision 11B | ~7GB |
| Dify Services | ~4GB |
| Other Services | ~3GB |
| Total | ~14GB ✅ |
Very lightweight; this can run alongside the game servers with no noticeable impact.
## CPU Requirements

### Model Inference Performance

TX1 has 32 vCPU, shared among all services.

Expected inference times:
| Model | Token Generation Speed | Typical Response |
|---|---|---|
| Qwen 2.5 Coder 72B | ~3-5 tokens/second | 30-120 seconds |
| Llama 3.3 70B | ~3-5 tokens/second | 30-120 seconds |
| Llama 3.2 Vision 11B | ~8-12 tokens/second | 10-45 seconds |
For comparison, the Claude API streams at roughly 20-40 tokens/second, so DERP is about 5-10× slower. This is expected and acceptable for emergency use.

CPU impact on game servers:

- During DERP inference: ~70-80% CPU usage (temporary spikes)
- Game servers may experience brief lag during AI responses
- Acceptable for emergency use (not for normal operations)
## Network Requirements

### Initial Model Downloads (One-Time)
| Model | Size | Download Time (1Gbps) |
|---|---|---|
| Qwen 2.5 Coder 72B | ~40GB | 5-10 minutes |
| Llama 3.3 70B | ~40GB | 5-10 minutes |
| Llama 3.2 Vision 11B | ~7GB | 1-2 minutes |
| Total | ~87GB | 15-25 minutes |
In practice, download speeds vary; budget 2-4 hours for all three models, and download overnight to avoid impacting game-server traffic.
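For reference, the best-case math behind the table, assuming a fully sustained 1Gbps link (real downloads rarely sustain line rate, hence the 2-4 hour budget):

```shell
# Best-case download time for all three models at line rate
size_gb=87
rate_mb_per_s=$((1000 / 8))                  # 1Gbps is roughly 125 MB/s
seconds=$((size_gb * 1024 / rate_mb_per_s))  # GB -> MB, divided by rate
echo "~$((seconds / 60)) minutes at full line rate"
```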
### Ongoing Bandwidth

Dify web interface:

- Minimal (text-based queries)
- ~1-5 KB per query
- Negligible impact

Discord bot:

- Text-based queries only
- ~1-5 KB per query
- Negligible impact

Model updates:

- Infrequent (quarterly at most)
- Same as the initial download (~87GB)
- Schedule during low-traffic periods
## Resource Allocation Strategy

### Priority Levels

1. **Priority 1 (always):** game servers
2. **Priority 2 (normal):** management services (Pterodactyl, Gitea, etc.)
3. **Priority 3 (emergency only):** DERP AI stack
### RAM Allocation Rules

Normal operations:

- Game servers: up to 240GB
- Management: ~8GB
- AI stack (idle): ~8GB
- Total: 256GB ✅

DERP emergency:

- Game servers: temporarily limited to 160GB
- Management: ~8GB
- AI stack (active): ~92GB
- Total: 260GB ⚠️ (a 4GB overcommit is acceptable for brief periods)

If RAM pressure occurs during DERP:

1. Unload one game server temporarily
2. Run the AI query
3. Reload the game server

Total downtime per query: <5 minutes.
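The unload-query-reload steps above can be sketched as a small wrapper. The stop/start commands are placeholders for whatever your panel (e.g. Pterodactyl) exposes; the wrapper's only job is to guarantee the game server comes back even if the query fails.

```shell
# Sketch: free RAM, run the AI query, then restore the game server.
# All three commands are caller-supplied placeholders.
run_with_headroom() {
  local stop_cmd=$1 start_cmd=$2 query_cmd=$3 rc
  $stop_cmd || return 1   # abort if the unload itself fails
  $query_cmd              # run the (slow) AI query
  rc=$?
  $start_cmd              # always bring the game server back
  return $rc
}

# Example (placeholder server ID and commands):
#   run_with_headroom "panel stop mc-survival" "panel start mc-survival" \
#       "ollama run llama3.3:70b 'diagnose this crash log'"
```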
## Monitoring & Alerts

### Critical Thresholds

RAM usage:

- Warning: >220GB used (85%)
- Critical: >240GB used (93%)
- Action: defer DERP usage or unload a game server

CPU usage:

- Warning: >80% sustained for >5 minutes
- Critical: >90% sustained for >2 minutes
- Action: pause AI inference, prioritize game servers

Storage:

- Warning: >800GB used (80%)
- Critical: >900GB used (90%)
- Action: clean up old logs and the model cache
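The RAM thresholds can be encoded in a small helper suitable for a cron-driven alert script. This is a sketch; the `free -g` wiring in the comment is an assumption about a standard Linux environment.

```shell
# Mirror of the RAM thresholds above: warning >220GB, critical >240GB.
ram_alert_level() {
  local used_gb=$1
  if   [ "$used_gb" -gt 240 ]; then echo critical
  elif [ "$used_gb" -gt 220 ]; then echo warning
  else echo ok
  fi
}

# Example wiring: used=$(free -g | awk '/^Mem:/ {print $3}')
#                 ram_alert_level "$used"
```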
### Monitoring Commands

```bash
# Check RAM
free -h

# Check CPU
htop

# Check storage
df -h /

# Check Ollama status
ollama list
ollama ps            # shows currently loaded models

# Check Dify
cd /opt/dify
docker-compose ps
docker stats         # real-time container resource usage
```
## Resource Optimization

### Unload Models When Not Needed

```bash
# Unload all models (frees RAM)
ollama stop qwen2.5-coder:72b
ollama stop llama3.3:70b
ollama stop llama3.2-vision:11b

# Verify the RAM was freed
free -h
```
### Preload Models for Faster Response

```bash
# Preload a model (takes ~30 seconds); the empty prompt loads it into RAM
ollama run qwen2.5-coder:72b ""

# The model is now resident, so subsequent queries respond faster
```
### Schedule Maintenance Windows

Best time for model downloads/updates:

- Tuesday/Wednesday 2-6 AM CST (lowest traffic)
- Announce in Discord 24 hours ahead
- Expected downtime: <10 minutes
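A maintenance window like this could be automated with a cron entry. This is a hypothetical example (`crontab -e`): the `ollama` binary path assumes the default install location, and it presumes the system timezone is CST and the update was announced beforehand.

```
# Pull a model update at 3 AM on Tuesdays, logging output (hypothetical)
0 3 * * 2 /usr/local/bin/ollama pull qwen2.5-coder:72b >> /var/log/ollama-update.log 2>&1
```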
## Capacity Planning

### Current State (Feb 2026)

- Game servers: 6 active
- RAM: 256GB
- Storage: 1TB
- AI stack: fits comfortably
### Growth Scenarios

**Scenario 1: add 6 more game servers (12 total)**

- Additional RAM needed: ~60GB
- Available for AI (normal): 248GB → 188GB ✅
- Available for AI (DERP): 164GB → 104GB ✅
- Status: still viable

**Scenario 2: add 12 more game servers (18 total)**

- Additional RAM needed: ~120GB
- Available for AI (normal): 248GB → 128GB ✅
- Available for AI (DERP): 164GB → 44GB ⚠️
- Status: DERP would require unloading two game servers

**Scenario 3: upgrade to larger models (theoretical)**

- Qwen 3.0 Coder 170B: ~180GB RAM
- Status: would NOT fit alongside the game servers
- Recommendation: stick with the 72B models
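The growth math above reduces to a one-line calculation. This sketch assumes ~10GB per additional Minecraft instance (the estimate implied by "6 more servers ≈ 60GB"); `derp_headroom_gb` is a hypothetical helper name.

```shell
# RAM left for game servers during active DERP, given N servers added
# beyond today's six (assumes ~10GB per extra Minecraft instance).
derp_headroom_gb() {
  local extra_servers=$1
  local total=256 derp_stack=92
  echo $((total - derp_stack - extra_servers * 10))
}
```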
### Upgrade Path

If TX1 reaches capacity:

**Option A: add a second dedicated AI server**

- Move the AI stack to a separate VPS
- TX1 focuses only on game servers
- Cost: ~$100-200/month (NOT DERP-compliant)

**Option B: upgrade TX1 RAM**

- 256GB → 512GB
- Cost: contact Hetzner for pricing
- Preferred: maintains DERP compliance

**Option C: use smaller AI models**

- Qwen 2.5 Coder 32B (~35GB RAM)
- Llama 3.2 8B (~8GB RAM)
- Tradeoff: lower quality, but more capacity
## Disaster Recovery

### Backup Strategy

What to back up:

- ✅ Dify configuration files
- ✅ Knowledge base data
- ✅ Discord bot code
- ❌ Models (can be re-downloaded)

Backup locations:

- Git repository (for configs/code)
- NC1 Charlotte (for the knowledge base)

Backup frequency:

- Configurations: after every change
- Knowledge base: weekly
- Models: no backup needed
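A weekly knowledge-base backup could look like the sketch below. The `/opt/dify/volumes` path is an assumption about where a docker-compose Dify install keeps its data, and the NC1 host/user in the comment are placeholders; adjust both to your layout.

```shell
# Archive a directory into a dated tarball (sketch; paths are assumptions)
backup_kb() {
  local src=$1 dest_dir=$2
  tar czf "$dest_dir/dify-kb-$(date +%F).tar.gz" \
      -C "$(dirname "$src")" "$(basename "$src")"
}

# Weekly job: archive locally, then ship to NC1 (placeholder host/user):
#   backup_kb /opt/dify/volumes /backups
#   rsync -az /backups/ backup@nc1:/backups/dify/
```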
### Recovery Procedure

If TX1 fails completely:

1. Deploy Dify on NC1 (temporary)
2. Restore the knowledge base from backup
3. Re-download the models (~4 hours)
4. Point the Discord bot to NC1

Total downtime: 4-6 hours. This is acceptable for DERP, an emergency-only system.
## Cost Analysis

### One-Time Costs

- Setup time: 6-8 hours (Michael's time)
- Model downloads: bandwidth usage (included in hosting)
- Total: $0 (sweat equity only)

### Monthly Costs

- Hosting: $0 (using the existing TX1)
- Bandwidth: $0 (included in hosting)
- Maintenance: ~1 hour/month (Michael's time)
- Total: $0/month ✅

### Opportunity Cost

- RAM reserved for AI: ~8GB (idle) or ~92GB (active DERP)
- That space could host 1-2 more game servers
- Acceptable tradeoff: DERP independence is worth more than two game servers
Fire + Frost + Foundation + DERP = True Independence 💙🔥❄️
TX1 has the capacity. Resources are allocated wisely. $0 monthly cost maintained.