diff --git a/README-INDEX.md b/README-INDEX.md new file mode 100644 index 0000000..dc32598 --- /dev/null +++ b/README-INDEX.md @@ -0,0 +1,324 @@ +# 📚 Firefrost Gaming Operations Manual - Complete Index + +**Last Updated:** 2026-02-17 +**Version:** Starfleet Grade +**Status:** PRODUCTION READY + +--- + +## 🚀 QUICK START + +**New to the repository?** Start here: +1. `docs/planning/mission-statement.md` - Understand our philosophy +2. `docs/core/infrastructure-manifest.md` - See what we run +3. `docs/quick-reference/common-operations.md` - Daily operations +4. `docs/emergency-protocols/` - Emergency procedures + +--- + +## 📁 DIRECTORY STRUCTURE + +``` +firefrost-operations-manual/ +├── deployments/ # Production-ready deployment packages +│ ├── whitelist-manager/ # Flask web app (3 files) +│ ├── staggered-restart/ # Python automation (1 file) +│ └── world-backup/ # Backup automation (3 files) +│ +├── docs/ +│ ├── core/ # Critical infrastructure docs (17 files) +│ ├── diagrams/ # Visual network/system diagrams (4 files) +│ ├── emergency-protocols/ # Red/Yellow Alert procedures (2 files) +│ ├── metrics/ # SLAs and performance targets (1 file) +│ ├── planning/ # Strategic documents (14 files) +│ ├── quick-reference/ # One-page operation guides (1 file) +│ ├── reference/ # Technical references (17 files) +│ ├── sessions/ # Session summaries (2 files) +│ ├── tasks/ # 28 task directories +│ └── training/ # Staff training curriculum (1 file) +│ +└── README.md # Repository overview +``` + +--- + +## 🎯 BY USE CASE + +### I need to... + +**Deploy a new service:** +1. Check `docs/tasks/[service-name]/deployment-plan.md` +2. Review `docs/core/infrastructure-manifest.md` +3. Follow step-by-step guide +4. Update manifest when complete + +**Handle an emergency:** +1. Assess severity (Red or Yellow Alert) +2. Follow `docs/emergency-protocols/RED-ALERT-*.md` or `YELLOW-ALERT-*.md` +3. Communicate per protocol +4. Document in post-mortem + +**Perform daily operations:** +1. 
Use `docs/quick-reference/common-operations.md` +2. Check `docs/metrics/sla-definitions-and-targets.md` for targets +3. Monitor via Uptime Kuma +4. Log any issues + +**Train a new staff member:** +1. Follow `docs/training/staff-training-curriculum.md` +2. Provide access per `docs/tasks/department-structure/README.md` +3. Assign role-specific reading +4. Track progress + +**Understand the infrastructure:** +1. Read `docs/core/infrastructure-manifest.md` +2. View `docs/diagrams/complete-infrastructure-map.mermaid` +3. Review `docs/diagrams/frostwall-network-topology.mermaid` +4. Check `docs/core/project-scope.md` + +--- + +## 📋 CORE DOCUMENTS (17 files) + +| Document | Purpose | Priority | +|----------|---------|----------| +| `infrastructure-manifest.md` | Complete infrastructure inventory | CRITICAL | +| `project-scope.md` | Project vision and roadmap | HIGH | +| `tasks.md` | All tasks and priorities | HIGH | +| `workflow-guide.md` | How to work with Claude | HIGH | +| `session-handoff.md` | Session continuity protocol | HIGH | +| `SESSION-START-PROMPT.md` | Quick session start | MEDIUM | +| `DERP.md` | Emergency recovery procedures | CRITICAL | +| `EMERGENCY-GIT-ACCESS.md` | Git access recovery | CRITICAL | +| `GITEA-API-PATTERNS.md` | API usage patterns | MEDIUM | +| `revision-control-standard.md` | Git commit standards (FFG-STD-001) | HIGH | +| `memorial-completion-task.md` | End-of-session protocol | MEDIUM | +| `API-EFFICIENCY-PROTOCOL.md` | Optimize API usage | MEDIUM | +| Others | Various operational docs | MEDIUM | + +--- + +## 🎨 DIAGRAMS (4 files) + +| Diagram | Type | View With | +|---------|------|-----------| +| `frostwall-network-topology.mermaid` | Network security architecture | Mermaid viewer | +| `complete-infrastructure-map.mermaid` | All services overview | Mermaid viewer | +| `task-prioritization-flowchart.mermaid` | Decision tree for tasks | Mermaid viewer | +| (More in `docs/reference/diagrams/`) | Legacy diagrams | Various | + +**How to view 
Mermaid diagrams:**
+- Paste into https://mermaid.live
+- Use VS Code Mermaid extension
+- GitHub/Gitea render automatically
+
+---
+
+## 🚨 EMERGENCY PROTOCOLS (2 files)
+
+| Protocol | When to Use | Response Time |
+|----------|-------------|---------------|
+| `RED-ALERT-complete-failure.md` | All services down | 5 min acknowledge |
+| `YELLOW-ALERT-partial-degradation.md` | Single service down | 15 min acknowledge |
+
+**Escalation ladder:**
+- Minor issue → Daily operations
+- Single service → Yellow Alert
+- Multiple services → Red Alert
+
+---
+
+## 📊 METRICS & SLAs (1 file)
+
+| Document | Contents |
+|----------|----------|
+| `sla-definitions-and-targets.md` | Uptime targets, performance metrics, costs, capacity planning |
+
+**Key SLAs:**
+- Overall uptime: 99.5% monthly
+- Game server TPS: 19.5-20.0 target
+- Player connection latency: <100ms target
+
+---
+
+## 🎓 TRAINING (1 file)
+
+| Document | Purpose |
+|----------|---------|
+| `staff-training-curriculum.md` | 4-level onboarding program |
+
+**Training Levels:**
+1. Orientation (Days 1-3)
+2. Core Skills (Week 1)
+3. Advanced Skills (Week 2-3)
+4. Specialization (Week 4+)
+
+---
+
+## 📋 TASKS (28 directories)
+
+### Tier 0 - Immediate Wins (3 tasks)
+1. `whitelist-manager/` - ✅ READY TO DEPLOY
+2. `command-center-cleanup/` - ✅ READY
+3. `staff-recruitment-launch/` - ✅ COMPLETE DOCS
+
+### Tier 1 - Security Foundation (4 tasks)
+4. `vaultwarden-setup/` - ✅ CONFIG GUIDE
+5. `frostwall-protocol/` - ✅ COMPLETE (4 files)
+6. `command-center-security/` - ✅ DEPLOYMENT GUIDE
+7. `scoped-gitea-token/` - ✅ DEPLOYMENT GUIDE
+
+### Tier 2 - Major Infrastructure (5 tasks documented)
+8. `self-hosted-ai-stack-on-tx1/` - Blocked (medical)
+9. `mailcow-email-server-on-nc1/` - Blocked (Frostwall)
+10. `netdata-deployment/` - ✅ DEPLOYMENT GUIDE
+11. `department-structure/` - ✅ COMPLETE
+12. `mkdocs-decommission/` - ✅ DEPLOYMENT GUIDE
+
+### Tier 3 - Documentation & Optimization (16 tasks)
+13. 
`fix-frostwall-vs-firefrost-naming/` - ✅ COMPLETE +14. `scope-document-corrections/` - ✅ COMPLETE +15. `workflow-guide-review-&-trim/` - Ready +16. `terraria-branding-training-arc/` - Active Phase 1 +17. `paymenter-theme-installation-citadel-theme/` - Ready +18. `consultant-photo-processing/` - Ongoing +19. `nextcloud-upload-portal-for-meg/` - Ready +20. `coming-soon-video-creation-(capcut)/` - Planning +21. `staggered-server-restart-system/` - ✅ COMPLETE +22. `game-server-startup-script-audit-&-optimization/` - ✅ OPTIMIZATION GUIDE +23. `luckperms-mysql-backend/` - Ready +24. `world-backup-automation/` - ✅ COMPLETE +25. `blueprint-extension-installation-node-usage-status/` - Ready +26. `discord-server-complete-reorganization/` - ✅ DEPLOYMENT PLAN +27. `flagship-modpack-eternal-skyforge/` - ✅ DESIGN DOC +28. `among-us-weekly-events-(phase-2-expansion)/` - Planning + +--- + +## 🚀 DEPLOYMENT PACKAGES (3 packages) + +| Package | Status | Deployment Time | +|---------|--------|-----------------| +| `whitelist-manager/` | Production-ready | 30-45 min | +| `staggered-restart/` | Production-ready | 2 hours | +| `world-backup/` | Production-ready | 1-2 hours | + +All include: +- Complete code +- Configuration examples +- Deployment scripts +- Documentation + +--- + +## 📖 PLANNING DOCUMENTS (14 files) + +Strategic and design documents: +- `mission-statement.md` - Core philosophy +- `path-philosophy.md` - Fire vs Frost +- `subscription-tiers.md` - Pricing strategy +- `design-bible.md` - Visual/brand guidelines +- `ideas-backlog.md` - Future features +- And 9 more... + +--- + +## 📚 REFERENCE DOCUMENTS (17 files) + +Technical references: +- `task-directory-audit-2026-02-17.md` - Complete audit +- `complete-repository-audit-2026-02-17.md` - Full repo audit +- `incident-post-mortem-template.md` - Post-incident template +- `terminology-guide.md` - Firefrost vocabulary +- `visual-assets-guide.md` - Brand assets +- And 12 more... 
+ +--- + +## 🔍 SEARCH SHORTCUTS + +**By topic:** +- **Security:** Search for "Frostwall", "security", "hardening" +- **Automation:** Search for "restart", "backup", "automation" +- **Emergency:** Look in `docs/emergency-protocols/` +- **Metrics:** Check `docs/metrics/` +- **Training:** Start with `docs/training/` + +**By file type:** +- **Diagrams:** `.mermaid` files in `docs/diagrams/` +- **Guides:** `deployment-guide.md` or `deployment-plan.md` +- **Templates:** Files ending in `-template.md` +- **Protocols:** Files starting with uppercase (RED-ALERT, etc.) + +--- + +## 📈 VERSION HISTORY + +**v1.0 (Starfleet Grade) - 2026-02-17** +- Added visual diagrams (4 files) +- Added emergency protocols (2 files) +- Added metrics & SLAs (1 file) +- Added training curriculum (1 file) +- Added quick reference (1 file) +- Complete repository audit +- Perfect organization + +**v0.9 (Enterprise-D) - 2026-02-17** +- 28 task directories documented +- 3 deployment packages ready +- Core docs updated +- Infrastructure manifest v2.0 + +--- + +## 🎯 NEXT STEPS + +**For new users:** +1. Read this index +2. Review mission statement +3. Check infrastructure manifest +4. Access training curriculum + +**For operators:** +1. Bookmark quick reference +2. Know emergency protocols +3. Monitor SLAs +4. Use deployment guides + +**For developers:** +1. Follow revision control standard +2. Update documentation with changes +3. Test deployments thoroughly +4. Document lessons learned + +--- + +## 🤝 CONTRIBUTING + +**When updating documentation:** +1. Follow FFG-STD-001 (commit standards) +2. Follow FFG-STD-002 (task documentation) +3. Update this index if adding new sections +4. Test procedures before documenting +5. 
Use templates where available + +--- + +## 🔗 EXTERNAL RESOURCES + +- **Gitea:** git.firefrostgaming.com +- **Panel:** panel.firefrostgaming.com +- **Status:** status.firefrostgaming.com +- **Vault:** vault.firefrostgaming.com +- **Docs:** docs.firefrostgaming.com + +--- + +**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️ + +--- + +**Index Status:** CURRENT +**Maintained By:** The Auditor (Chronicler lineage) +**Last Updated:** 2026-02-17 +**Next Review:** Monthly diff --git a/docs/diagrams/complete-infrastructure-map.mermaid b/docs/diagrams/complete-infrastructure-map.mermaid new file mode 100644 index 0000000..0f0041a --- /dev/null +++ b/docs/diagrams/complete-infrastructure-map.mermaid @@ -0,0 +1,66 @@ +--- +title: Firefrost Gaming - Complete Infrastructure Map +--- +graph TB + subgraph External["🌐 EXTERNAL SERVICES"] + DNS["📡 DNS
Cloudflare"] + Users["👥 Users
Players & Staff"] + end + + subgraph VPS_Tier["💻 VPS TIER - Management Services"] + CC["🛡️ Command Center
Dallas, TX
63.143.34.217

Services:
• Gitea
• Uptime Kuma
• Code-Server
• Automation
• Vaultwarden"] + + Panel["🎛️ Panel
Charlotte, NC
45.94.168.138

Pterodactyl Control"] + + Billing["💳 Billing
Chicago, IL
38.68.14.188

Services:
• Paymenter
• Whitelist Manager"] + + Ghost["📚 Ghost
Chicago, IL
64.50.188.14

Services:
• Wiki.js (Sub)
• Wiki.js (Staff)
• NextCloud
• MkDocs"] + end + + subgraph Dedicated["🖥️ DEDICATED TIER - Game Servers"] + TX1["🎮 TX1 Dallas
38.68.14.26
32 vCPU, 256GB RAM

Servers (5):
• Reclamation
• Stoneblock 4
• Society
• Vanilla
• All The Mons"] + + NC1["🎮 NC1 Charlotte
216.239.104.130
32 vCPU, 256GB RAM

Servers (6):
• Ember Project
• MC: C&C
• ATM10
• Homestead
• EMC Subterra
• Hytale"] + end + + subgraph Automation["🤖 AUTOMATION SYSTEMS"] + Restart["⏰ Staggered Restart
Daily 4:00 AM"] + Backup["💾 World Backup
Daily 3:30 AM"] + Monitor["📊 Frostwall Monitor
Every 5 min"] + end + + Users -->|"Web Traffic"| DNS + DNS -->|"Route to Services"| CC + DNS -->|"Route to Services"| Ghost + DNS -->|"Route to Services"| Billing + + Users -->|"Game Traffic"| CC + CC -->|"Frostwall GRE"| TX1 + CC -->|"Frostwall GRE"| NC1 + + Panel -.->|"Controls"| TX1 + Panel -.->|"Controls"| NC1 + + CC -->|"Monitors"| TX1 + CC -->|"Monitors"| NC1 + + Restart -.->|"Restarts"| TX1 + Restart -.->|"Restarts"| NC1 + + Backup -.->|"Backs Up"| TX1 + Backup -.->|"Backs Up"| NC1 + Backup -->|"Stores"| Ghost + + Monitor -.->|"Health Checks"| CC + Monitor -.->|"Health Checks"| TX1 + Monitor -.->|"Health Checks"| NC1 + + style CC fill:#1e3a8a,stroke:#3b82f6,stroke-width:3px,color:#fff + style Panel fill:#7c2d12,stroke:#f97316,stroke-width:3px,color:#fff + style Billing fill:#065f46,stroke:#10b981,stroke-width:3px,color:#fff + style Ghost fill:#4c1d95,stroke:#8b5cf6,stroke-width:3px,color:#fff + style TX1 fill:#0c4a6e,stroke:#0ea5e9,stroke-width:4px,color:#fff + style NC1 fill:#0c4a6e,stroke:#0ea5e9,stroke-width:4px,color:#fff + + classDef automation fill:#581c87,stroke:#a855f7,stroke-width:2px,color:#fff + class Restart,Backup,Monitor automation diff --git a/docs/diagrams/frostwall-network-topology.mermaid b/docs/diagrams/frostwall-network-topology.mermaid new file mode 100644 index 0000000..679a712 --- /dev/null +++ b/docs/diagrams/frostwall-network-topology.mermaid @@ -0,0 +1,52 @@ +--- +title: Frostwall Protocol - Network Topology +--- +graph TB + subgraph Internet["🌐 INTERNET"] + Players["👥 Players
Game Clients"] + DDoS["⚠️ DDoS Attacks
(Mitigated)"] + end + + subgraph CommandCenter["🛡️ COMMAND CENTER (Dallas)
63.143.34.217
Scrubbing Layer"] + CC_Physical["Physical Interface
63.143.34.217"] + CC_GRE_TX1["GRE Tunnel to TX1
10.0.1.1/30"] + CC_GRE_NC1["GRE Tunnel to NC1
10.0.2.1/30"] + CC_NAT["NAT/Port Forwarding
All Game Ports"] + end + + subgraph TX1["🎮 TX1 DALLAS
38.68.14.26
Backend Protected"] + TX1_Physical["Physical Interface
38.68.14.26
(BLOCKED by Iron Wall)"] + TX1_GRE["GRE Tunnel from CC
10.0.1.2/30"] + TX1_Servers["5 Game Servers
Reclamation, Stoneblock,
Society, Vanilla, All The Mons"] + end + + subgraph NC1["🎮 NC1 CHARLOTTE
216.239.104.130
Backend Protected"] + NC1_Physical["Physical Interface
216.239.104.130
(BLOCKED by Iron Wall)"] + NC1_GRE["GRE Tunnel from CC
10.0.2.2/30"] + NC1_Servers["6 Game Servers
Ember Project, MC:C&C,
ATM10, Homestead,
EMC Subterra, Hytale"] + end + + Players -->|"Connect to
game.firefrostgaming.com"| CC_Physical + DDoS -.->|"Absorbed by
Command Center"| CC_Physical + + CC_Physical --> CC_NAT + CC_NAT -->|"GRE Encapsulation"| CC_GRE_TX1 + CC_NAT -->|"GRE Encapsulation"| CC_GRE_NC1 + + CC_GRE_TX1 <==>|"Encrypted Tunnel"| TX1_GRE + CC_GRE_NC1 <==>|"Encrypted Tunnel"| NC1_GRE + + TX1_GRE --> TX1_Servers + NC1_GRE --> NC1_Servers + + TX1_Physical -.->|"BLOCKED
by UFW"| TX1_Servers + NC1_Physical -.->|"BLOCKED
by UFW"| NC1_Servers + + style CommandCenter fill:#1e3a8a,stroke:#3b82f6,stroke-width:4px,color:#fff + style TX1 fill:#065f46,stroke:#10b981,stroke-width:3px,color:#fff + style NC1 fill:#065f46,stroke:#10b981,stroke-width:3px,color:#fff + style Players fill:#7c3aed,stroke:#a78bfa,stroke-width:2px,color:#fff + style DDoS fill:#991b1b,stroke:#ef4444,stroke-width:2px,color:#fff + + classDef tunnel fill:#0369a1,stroke:#0ea5e9,stroke-width:2px,color:#fff + class CC_GRE_TX1,CC_GRE_NC1,TX1_GRE,NC1_GRE tunnel diff --git a/docs/diagrams/task-prioritization-flowchart.mermaid b/docs/diagrams/task-prioritization-flowchart.mermaid new file mode 100644 index 0000000..7c501af --- /dev/null +++ b/docs/diagrams/task-prioritization-flowchart.mermaid @@ -0,0 +1,57 @@ +--- +title: Task Prioritization Decision Tree +--- +flowchart TD + Start([New Task or Issue]) + + Start --> Critical{Is it
CRITICAL?} + + Critical -->|YES| RedAlert{All services
down?} + Critical -->|NO| Urgent{Is it
URGENT?} + + RedAlert -->|YES| RA[🚨 RED ALERT
Follow emergency protocol
Drop everything] + RedAlert -->|NO| YA[⚠️ YELLOW ALERT
Single service/degradation
Respond in 15 min] + + Urgent -->|YES| Revenue{Revenue
impacting?} + Urgent -->|NO| Important{Important but
not urgent?} + + Revenue -->|YES| Tier0[⭐ TIER 0
Immediate action
Fix within 1 hour] + Revenue -->|NO| Security{Security
related?} + + Security -->|YES| Tier1[🔒 TIER 1
Security Foundation
High priority] + Security -->|NO| Infrastructure{Major
infrastructure?} + + Infrastructure -->|YES| Tier2[🏗️ TIER 2
Infrastructure
Schedule within 2 weeks] + Infrastructure -->|NO| Tier3[📋 TIER 3
Optimization
Schedule this month] + + Important -->|YES| HasDeps{Blocks other
tasks?} + Important -->|NO| CanWait[📅 BACKLOG
Nice to have
Do when time allows] + + HasDeps -->|YES| Tier1 + HasDeps -->|NO| Quick{Can be done
in <1 hour?} + + Quick -->|YES| QuickWin[✨ QUICK WIN
Do now if available] + Quick -->|NO| Tier3 + + RA --> Execute[Execute
Immediately] + YA --> Execute + Tier0 --> Execute + Tier1 --> Schedule1[Schedule
This Week] + Tier2 --> Schedule2[Schedule
Next 2 Weeks] + Tier3 --> Schedule3[Schedule
This Month] + QuickWin --> Execute + CanWait --> Backlog[Add to
Backlog] + + Execute --> Done([Task Complete]) + Schedule1 --> Done + Schedule2 --> Done + Schedule3 --> Done + Backlog --> Review[Review
Quarterly] + + style RA fill:#991b1b,stroke:#ef4444,stroke-width:4px,color:#fff + style YA fill:#92400e,stroke:#f59e0b,stroke-width:3px,color:#fff + style Tier0 fill:#1e3a8a,stroke:#3b82f6,stroke-width:3px,color:#fff + style Tier1 fill:#065f46,stroke:#10b981,stroke-width:3px,color:#fff + style Tier2 fill:#4c1d95,stroke:#8b5cf6,stroke-width:2px,color:#fff + style Tier3 fill:#0c4a6e,stroke:#0ea5e9,stroke-width:2px,color:#fff + style QuickWin fill:#15803d,stroke:#22c55e,stroke-width:2px,color:#fff diff --git a/docs/emergency-protocols/RED-ALERT-complete-failure.md b/docs/emergency-protocols/RED-ALERT-complete-failure.md new file mode 100644 index 0000000..698ef07 --- /dev/null +++ b/docs/emergency-protocols/RED-ALERT-complete-failure.md @@ -0,0 +1,374 @@ +# 🚨 RED ALERT - Complete Infrastructure Failure Protocol + +**Status:** Emergency Response Procedure +**Alert Level:** RED ALERT +**Priority:** CRITICAL +**Last Updated:** 2026-02-17 + +--- + +## 🚨 RED ALERT DEFINITION + +**Complete infrastructure failure affecting multiple critical systems:** +- All game servers down +- Management services inaccessible +- Revenue/billing systems offline +- No user access to any services + +**This is a business-critical emergency requiring immediate action.** + +--- + +## ⏱️ RESPONSE TIMELINE + +**0-5 minutes:** Initial assessment and communication +**5-15 minutes:** Emergency containment +**15-60 minutes:** Restore critical services +**1-4 hours:** Full recovery +**24-48 hours:** Post-mortem and prevention + +--- + +## 📞 IMMEDIATE ACTIONS (First 5 Minutes) + +### Step 1: CONFIRM RED ALERT (60 seconds) + +**Check multiple indicators:** +- [ ] Uptime Kuma shows all services down +- [ ] Cannot SSH to Command Center +- [ ] Cannot access panel.firefrostgaming.com +- [ ] Multiple player reports in Discord +- [ ] Email/SMS alerts from hosting provider + +**If 3+ indicators confirm → RED ALERT CONFIRMED** + +--- + +### Step 2: NOTIFY STAKEHOLDERS (2 minutes) + +**Communication hierarchy:** 
+ +1. **Michael (The Wizard)** - Primary incident commander + - Text/Call immediately + - Use emergency contact if needed + +2. **Meg (The Emissary)** - Community management + - Brief on situation + - Prepare community message + +3. **Discord Announcement** (if accessible): +``` +🚨 RED ALERT - ALL SERVICES DOWN + +We are aware of a complete service outage affecting all Firefrost servers. Our team is investigating and working on restoration. + +ETA: Updates every 15 minutes +Status: https://status.firefrostgaming.com (if available) + +We apologize for the inconvenience. +- The Firefrost Team +``` + +4. **Social Media** (Twitter/X): +``` +⚠️ Service Alert: Firefrost Gaming is experiencing a complete service outage. We're working on restoration. Updates to follow. +``` + +--- + +### Step 3: INITIAL TRIAGE (2 minutes) + +**Determine failure scope:** + +**Check hosting provider status:** +- Hetzner status page +- Provider support ticket system +- Email from provider? + +**Likely causes (priority order):** +1. **Provider-wide outage** → Wait for provider +2. **DDoS attack** → Enable DDoS mitigation +3. **Network failure** → Check Frostwall tunnels +4. **Payment/billing issue** → Check accounts +5. **Configuration error** → Review recent changes +6. **Hardware failure** → Provider intervention needed + +--- + +## 🔧 EMERGENCY RECOVERY PROCEDURES + +### Scenario A: Provider-Wide Outage + +**If Hetzner/provider has known outage:** + +1. **DO NOT PANIC** - This is out of your control +2. **Monitor provider status page** - Get ETAs +3. **Update community every 15 minutes** +4. **Document timeline** for compensation claims +5. 
**Prepare communication** for when services return + +**Actions:** +- [ ] Check Hetzner status: https://status.hetzner.com +- [ ] Open support ticket (if not provider-wide) +- [ ] Monitor Discord for player questions +- [ ] Document downtime duration + +**Recovery:** Services will restore when provider resolves issue + +--- + +### Scenario B: DDoS Attack + +**If traffic volume is abnormally high:** + +1. **Enable Cloudflare DDoS protection** (if not already) +2. **Contact hosting provider** for mitigation help +3. **Check Command Center** for abnormal traffic +4. **Review UFW logs** for attack patterns + +**Actions:** +- [ ] Check traffic graphs in provider dashboard +- [ ] Enable Cloudflare "I'm Under Attack" mode +- [ ] Contact provider NOC for emergency mitigation +- [ ] Document attack source IPs (if visible) + +**Recovery:** 15-60 minutes depending on attack severity + +--- + +### Scenario C: Frostwall/Network Failure + +**If GRE tunnels are down:** + +1. **SSH to Command Center** (if accessible) +2. **Check tunnel status:** +```bash +ip link show | grep gre +ping 10.0.1.2 # TX1 tunnel +ping 10.0.2.2 # NC1 tunnel +``` + +3. **Restart tunnels:** +```bash +systemctl restart networking +# Or manually: +/etc/network/if-up.d/frostwall-tunnels +``` + +4. **Verify UFW rules** aren't blocking traffic + +**Actions:** +- [ ] Check GRE tunnel status +- [ ] Restart network services +- [ ] Verify routing tables +- [ ] Test game server connectivity + +**Recovery:** 5-15 minutes + +--- + +### Scenario D: Payment/Billing Failure + +**If services suspended for non-payment:** + +1. **Check email** for suspension notices +2. **Log into provider billing** portal +3. **Make immediate payment** if overdue +4. 
**Contact provider support** for expedited restoration
+
+**Actions:**
+- [ ] Check all provider invoices
+- [ ] Verify payment methods current
+- [ ] Make emergency payment if needed
+- [ ] Request immediate service restoration
+
+**Recovery:** 30-120 minutes (depending on provider response)
+
+---
+
+### Scenario E: Configuration Error
+
+**If recent changes caused failure:**
+
+1. **Identify last change** (check git log, command history)
+2. **Rollback configuration:**
+```bash
+# Restore from backup
+cd /opt/config-backups
+ls -lt | head -5  # Find recent backup
+tar -tzf backup-YYYYMMDD.tar.gz | head  # Inspect the archive before extracting
+tar -xzf backup-YYYYMMDD.tar.gz -C /  # Extract over / (assumes archive paths are relative to /)
+systemctl restart [affected-service]
+```
+
+3. **Test services incrementally**
+
+**Actions:**
+- [ ] Review git commit log
+- [ ] Check command history: `history | tail -50`
+- [ ] Restore previous working config
+- [ ] Test each service individually
+
+**Recovery:** 15-30 minutes
+
+---
+
+### Scenario F: Hardware Failure
+
+**If physical hardware failed:**
+
+1. **Open EMERGENCY ticket** with provider
+2. **Request hardware replacement/migration**
+3. **Prepare for potential data loss**
+4. **Activate disaster recovery plan**
+
+**Actions:**
+- [ ] Contact provider emergency support
+- [ ] Request server health diagnostics
+- [ ] Prepare to restore from backups
+- [ ] Estimate RTO (Recovery Time Objective)
+
+**Recovery:** 2-24 hours (provider dependent)
+
+---
+
+## 📊 RESTORATION PRIORITY ORDER
+
+**Restore in this sequence:**
+
+### Phase 1: CRITICAL (0-15 minutes)
+1. **Command Center** - Management hub
+2. **Pterodactyl Panel** - Control plane
+3. **Uptime Kuma** - Monitoring
+4. **Frostwall tunnels** - Network security
+
+### Phase 2: REVENUE (15-30 minutes)
+5. **Paymenter/Billing** - Financial systems
+6. **Whitelist Manager** - Player access
+7. **Top 3 game servers** - ATM10, Ember, MC:C&C
+
+### Phase 3: SERVICES (30-60 minutes)
+8. **Remaining game servers**
+9. **Wiki.js** - Documentation
+10. 
**NextCloud** - File storage + +### Phase 4: SECONDARY (1-2 hours) +11. **Gitea** - Version control +12. **Discord bots** - Community tools +13. **Code-Server** - Development + +--- + +## ✅ RECOVERY VERIFICATION CHECKLIST + +**Before declaring "all clear":** + +- [ ] All servers accessible via SSH +- [ ] All game servers online in Pterodactyl +- [ ] Players can connect to servers +- [ ] Uptime Kuma shows all green +- [ ] Website/billing accessible +- [ ] No error messages in logs +- [ ] Network performance normal +- [ ] All automation systems running + +--- + +## 📢 RECOVERY COMMUNICATION + +**When services are restored:** + +### Discord Announcement: +``` +✅ ALL CLEAR - Services Restored + +All Firefrost services have been restored and are operating normally. + +Total downtime: [X] hours [Y] minutes +Cause: [Brief explanation] + +We apologize for the disruption and thank you for your patience. + +Compensation: [If applicable] +- [Details of any compensation for subscribers] + +Full post-mortem will be published within 48 hours. + +- The Firefrost Team +``` + +### Twitter/X: +``` +✅ Service Alert Resolved: All Firefrost Gaming services are now operational. Thank you for your patience during the outage. Full details: [link] +``` + +--- + +## 📝 POST-INCIDENT REQUIREMENTS + +**Within 24 hours:** + +1. **Create timeline** of events (minute-by-minute) +2. **Document root cause** +3. **Identify what worked well** +4. **Identify what failed** +5. **List action items** for prevention + +**Within 48 hours:** + +6. **Publish post-mortem** (public or staff-only) +7. **Implement immediate fixes** +8. **Update emergency procedures** if needed +9. **Test recovery procedures** +10. **Review disaster recovery plan** + +**Post-Mortem Template:** `docs/reference/incident-post-mortem-template.md` + +--- + +## 🎯 PREVENTION MEASURES + +**After RED ALERT, implement:** + +1. **Enhanced monitoring** - More comprehensive alerts +2. **Redundancy** - Eliminate single points of failure +3. 
**Automated health checks** - Self-healing where possible +4. **Regular drills** - Test emergency procedures quarterly +5. **Documentation updates** - Capture lessons learned + +--- + +## 📞 EMERGENCY CONTACTS + +**Primary:** +- Michael (The Wizard): [Emergency contact method] +- Meg (The Emissary): [Emergency contact method] + +**Providers:** +- Hetzner Emergency Support: [Support number] +- Cloudflare Support: [Support number] +- Discord Support: [Support email] + +**Escalation:** +- If Michael unavailable: Meg takes incident command +- If both unavailable: [Designated backup contact] + +--- + +## 🔐 CREDENTIALS EMERGENCY ACCESS + +**If Vaultwarden is down:** +- Emergency credential sheet: [Physical location] +- Backup password manager: [Alternative access] +- Provider console access: [Direct login method] + +--- + +**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️ + +--- + +**Protocol Status:** ACTIVE +**Last Drill:** [Date of last test] +**Next Review:** Monthly +**Version:** 1.0 diff --git a/docs/emergency-protocols/YELLOW-ALERT-partial-degradation.md b/docs/emergency-protocols/YELLOW-ALERT-partial-degradation.md new file mode 100644 index 0000000..cb473e7 --- /dev/null +++ b/docs/emergency-protocols/YELLOW-ALERT-partial-degradation.md @@ -0,0 +1,382 @@ +# ⚠️ YELLOW ALERT - Partial Service Degradation Protocol + +**Status:** Elevated Response Procedure +**Alert Level:** YELLOW ALERT +**Priority:** HIGH +**Last Updated:** 2026-02-17 + +--- + +## ⚠️ YELLOW ALERT DEFINITION + +**Partial service degradation or single critical system failure:** +- One or more game servers down (but not all) +- Single management service unavailable +- Performance degradation (high latency, low TPS) +- Single node failure (TX1 or NC1 affected) +- Non-critical but user-impacting issues + +**This requires prompt attention but is not business-critical.** + +--- + +## 📊 YELLOW ALERT TRIGGERS + +**Automatic triggers:** +- Any game server offline for >15 minutes +- TPS below 15 
on any server for >30 minutes
+- Panel/billing system inaccessible for >10 minutes
+- More than 5 player complaints in 15 minutes
+- Uptime Kuma shows red status for any service
+- Memory usage >90% for >20 minutes
+
+---
+
+## 📞 RESPONSE PROCEDURE (15-30 minutes)
+
+### Step 1: ASSESS SITUATION (5 minutes)
+
+**Determine scope:**
+- [ ] Which services are affected?
+- [ ] How many players impacted?
+- [ ] Is degradation worsening?
+- [ ] Any revenue impact?
+- [ ] Can it wait, or does it need immediate action?
+
+**Quick checks:**
+```bash
+# Check server status
+ssh root@63.143.34.217 "systemctl status"
+
+# Check game servers in Pterodactyl (the client API rejects unauthenticated
+# requests; PTERO_API_KEY is a placeholder for your own client API key)
+curl -H "Authorization: Bearer $PTERO_API_KEY" -H "Accept: application/json" https://panel.firefrostgaming.com/api/client
+
+# Check resource usage (htop needs an interactive terminal, so use batch-mode top)
+ssh root@38.68.14.26 "top -bn1 | head -20"
+```
+
+---
+
+### Step 2: COMMUNICATE (3 minutes)
+
+**If user-facing impact:**
+
+Discord #server-status:
+```
+⚠️ SERVICE NOTICE
+
+We're experiencing issues with [specific service/server].
+
+Affected: [Server name(s)]
+Status: Investigating
+ETA: [Estimate]
+
+Players on unaffected servers: No action needed
+Players on affected server: Please stand by
+
+Updates will be posted here.
+```
+
+**If internal only:**
+- Post in #staff-lounge
+- No public announcement needed
+
+---
+
+### Step 3: DIAGNOSE & FIX (10-20 minutes)
+
+See scenario-specific procedures below.
+
+---
+
+## 🔧 COMMON YELLOW ALERT SCENARIOS
+
+### Scenario 1: Single Game Server Down
+
+**Quick diagnostics:**
+```bash
+# Via Pterodactyl panel
+1. Check server status in panel
+2. View console for errors
+3. Check resource usage graphs
+
+# Common causes:
+- Out of memory (OOM)
+- Crash from mod conflict
+- World corruption
+- Java process died
+```
+
+**Resolution:**
+```bash
+# Restart server via panel
+1. Stop server
+2. Wait 30 seconds
+3. Start server
+4. Monitor console for successful startup
+5. 
Test player connection
+```
+
+**If restart fails:**
+- Check logs for error messages
+- Restore from backup if the world is corrupted
+- Roll back recent mod changes
+- Allocate more RAM if OOM
+
+**Recovery time:** 5-15 minutes
+
+---
+
+### Scenario 2: Low TPS / Server Lag
+
+**Diagnostics:**
+```bash
+# In-game
+/tps
+/forge tps
+
+# Via SSH
+top -u minecraft
+htop
+iostat
+```
+
+**Common causes:**
+- Chunk loading lag
+- Redstone contraptions
+- Mob farms
+- Memory pressure
+- Disk I/O bottleneck
+
+**Quick fixes:**
+```bash
+# Clear dropped items (a common lag source)
+/kill @e[type=item]
+# Avoid /kill @e[type=!player]: it also removes villagers, pets,
+# item frames, and armor stands
+
+# Reduce view distance temporarily
+# (via server.properties or Pterodactyl)
+
+# Restart server during low-traffic time
+```
+
+**Long-term solutions:**
+- Optimize JVM flags (see optimization guide)
+- Add more RAM
+- Limit chunk loading
+- Remove lag-causing builds
+
+**Recovery time:** 10-30 minutes
+
+---
+
+### Scenario 3: Pterodactyl Panel Inaccessible
+
+**Quick checks:**
+```bash
+# Panel server (45.94.168.138)
+ssh root@45.94.168.138
+
+# Check panel service
+systemctl status pteroq
+systemctl status wings
+
+# Check Nginx
+systemctl status nginx
+
+# Check database
+systemctl status mariadb
+```
+
+**Common fixes:**
+```bash
+# Restart panel services
+systemctl restart pteroq wings nginx
+
+# Check disk space (common cause)
+df -h
+
+# If database issue
+systemctl restart mariadb
+```
+
+**Recovery time:** 5-10 minutes
+
+---
+
+### Scenario 4: Billing/Whitelist Manager Down
+
+**Impact:** Players cannot subscribe or whitelist
+
+**Diagnostics:**
+```bash
+# Billing VPS (38.68.14.188)
+ssh root@38.68.14.188
+
+# Check services
+systemctl status paymenter
+systemctl status whitelist-manager
+systemctl status nginx
+```
+
+**Quick fix:**
+```bash
+systemctl restart [affected-service]
+```
+
+**Recovery time:** 2-5 minutes
+
+---
+
+### Scenario 5: Frostwall Tunnel Degraded
+
+**Symptoms:**
+- High latency on specific node
+- Packet loss
+- Intermittent disconnections
+
+**Diagnostics:** 
+```bash +# On Command Center +ping 10.0.1.2 # TX1 tunnel +ping 10.0.2.2 # NC1 tunnel + +# Check tunnel interface +ip link show gre-tx1 +ip link show gre-nc1 + +# Check routing +ip route show +``` + +**Quick fix:** +```bash +# Restart specific tunnel +ip link set gre-tx1 down +ip link set gre-tx1 up + +# Or restart all networking +systemctl restart networking +``` + +**Recovery time:** 5-10 minutes + +--- + +### Scenario 6: High Memory Usage (Pre-OOM) + +**Warning signs:** +- Memory >90% on any server +- Swap usage increasing +- JVM GC warnings in logs + +**Immediate action:** +```bash +# Identify memory hog +htop +ps aux --sort=-%mem | head + +# If game server: +# Schedule restart during low-traffic + +# If other service: +systemctl restart [service] +``` + +**Prevention:** +- Enable swap if not present +- Right-size RAM allocation +- Schedule regular restarts + +**Recovery time:** 5-20 minutes + +--- + +### Scenario 7: Discord Bot Offline + +**Impact:** Automated features unavailable + +**Quick fix:** +```bash +# Restart bot container/service +docker restart [bot-name] +# or +systemctl restart [bot-service] + +# Check bot token hasn't expired +``` + +**Recovery time:** 2-5 minutes + +--- + +## ✅ RESOLUTION VERIFICATION + +**Before downgrading from Yellow Alert:** + +- [ ] Affected service operational +- [ ] Players can connect/use service +- [ ] No error messages in logs +- [ ] Performance metrics normal +- [ ] Root cause identified +- [ ] Temporary or permanent fix applied +- [ ] Monitoring in place for recurrence + +--- + +## 📢 RESOLUTION COMMUNICATION + +**Public (if announced):** +``` +✅ RESOLVED + +[Service/Server] is now operational. + +Cause: [Brief explanation] +Duration: [X minutes] + +Thank you for your patience! 
+``` + +**Staff-only:** +``` +Yellow Alert cleared: [Service] +Cause: [Details] +Fix: [What was done] +Prevention: [Next steps] +``` + +--- + +## 📊 ESCALATION TO RED ALERT + +**Escalate if:** +- Multiple services failing simultaneously +- Fix attempts unsuccessful after 30 minutes +- Issue worsening despite interventions +- Provider reports hardware failure +- Security breach suspected + +**When escalating:** +- Follow RED ALERT protocol immediately +- Document what was tried +- Preserve logs/state for diagnosis + +--- + +## 🔄 POST-INCIDENT TASKS + +**For significant Yellow Alerts:** + +1. **Document incident** (brief summary) +2. **Update monitoring** (prevent recurrence) +3. **Review capacity** (if resource-related) +4. **Schedule preventive maintenance** (if needed) + +--- + +**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️ + +--- + +**Protocol Status:** ACTIVE +**Version:** 1.0 diff --git a/docs/metrics/sla-definitions-and-targets.md b/docs/metrics/sla-definitions-and-targets.md new file mode 100644 index 0000000..8b86949 --- /dev/null +++ b/docs/metrics/sla-definitions-and-targets.md @@ -0,0 +1,343 @@ +# 📊 Service Metrics & SLA Definitions + +**Status:** Operational Standards +**Owner:** Michael "The Wizard" Krause +**Last Updated:** 2026-02-17 + +--- + +## 🎯 SERVICE LEVEL AGREEMENTS (SLAs) + +### Overall Infrastructure SLA + +**Target Uptime:** 99.5% monthly +**Allowed Downtime:** ~3.6 hours per month +**Measurement:** Uptime Kuma historical data + +--- + +## 📈 PERFORMANCE TARGETS + +### Game Servers + +**TPS (Ticks Per Second):** +- **Target:** 19.5-20.0 TPS +- **Acceptable:** 18.0-19.5 TPS +- **Degraded:** 15.0-18.0 TPS +- **Critical:** <15.0 TPS (Yellow Alert) + +**Player Connection:** +- **Target:** <100ms latency +- **Acceptable:** 100-200ms latency +- **Degraded:** 200-300ms latency +- **Critical:** >300ms latency + +**Server Uptime:** +- **Target:** 99.5% per server monthly +- **Scheduled Maintenance:** 30 minutes daily (4:00 AM restart) 
+- **Unplanned Downtime:** <2 hours monthly per server + +--- + +### Management Services + +**Pterodactyl Panel:** +- **Uptime Target:** 99.9% monthly +- **Response Time:** <2 seconds page load +- **API Response:** <500ms per request + +**Billing (Paymenter):** +- **Uptime Target:** 99.9% monthly (revenue-critical) +- **Payment Processing:** <30 seconds +- **Page Load:** <3 seconds + +**Wiki/Documentation:** +- **Uptime Target:** 99.0% monthly +- **Search Response:** <1 second +- **Page Load:** <2 seconds + +--- + +## 💾 BACKUP METRICS + +**World Backups:** +- **Frequency:** Daily at 3:30 AM +- **Retention:** 7 daily, 4 weekly, 12 monthly +- **Success Rate Target:** 100% (all 11 servers) +- **Recovery Time Objective (RTO):** 30 minutes +- **Recovery Point Objective (RPO):** 24 hours (daily backups) + +**Configuration Backups:** +- **Frequency:** On every change + daily +- **Retention:** 30 days +- **Storage:** Git repository + off-server + +--- + +## 🌐 NETWORK METRICS + +**Frostwall Tunnels:** +- **Uptime Target:** 99.9% per tunnel +- **Latency:** <10ms additional overhead +- **Packet Loss:** <0.1% +- **Health Check:** Every 5 minutes + +**Bandwidth Usage:** +- **TX1 Node:** ~500GB/month baseline +- **NC1 Node:** ~800GB/month baseline +- **Alert Threshold:** >80% of allocated bandwidth + +--- + +## 🔒 SECURITY METRICS + +**Fail2Ban:** +- **SSH Ban Threshold:** 3 failed attempts +- **Ban Duration:** 1 hour (first offense) +- **Monitoring:** Check banned IPs daily + +**Firewall:** +- **Blocked Attempts:** Monitor daily +- **Rule Changes:** Logged and reviewed +- **Audit Frequency:** Weekly + +**Vulnerability Scans:** +- **Frequency:** Monthly +- **Critical Patches:** Within 48 hours +- **Security Updates:** Within 7 days + +--- + +## 💰 COST METRICS + +### Infrastructure Costs (Monthly) + +**Dedicated Servers:** +- TX1 Dallas: ~$150/month +- NC1 Charlotte: ~$150/month +- **Total Dedicated:** ~$300/month + +**VPS Services:** +- Command Center: ~$20/month +- Panel: 
~$15/month +- Billing VPS: ~$10/month +- Ghost VPS: ~$15/month +- **Total VPS:** ~$60/month + +**Additional Services:** +- Domain registration: ~$15/year +- Cloudflare: $0 (free tier) +- Backups/Storage: ~$10/month + +**Total Monthly Infrastructure:** ~$370/month + +--- + +### Revenue Metrics + +**Subscription Tiers:** +- Sovereign: $99/month +- Consular: $49/month +- Community: Free + +**Targets:** +- **Break-even:** 4 Sovereign OR 8 Consular subscribers +- **Profit Target:** 10+ paying subscribers +- **Growth Rate:** +2 subscribers per month + +--- + +## 📊 CAPACITY PLANNING + +### Current Capacity (Feb 2026) + +**TX1 Dallas:** +- CPU: 32 vCPUs (avg 40% usage) +- RAM: 256GB (avg 60% usage - 150GB) +- Disk: 2TB (40% usage - 800GB) +- **Headroom:** 5 more servers possible + +**NC1 Charlotte:** +- CPU: 32 vCPUs (avg 50% usage) +- RAM: 256GB (avg 70% usage - 180GB) +- Disk: 2TB (45% usage - 900GB) +- **Headroom:** 3-4 more servers possible + +**Scaling Triggers:** +- RAM usage sustained >80%: Add more RAM or migrate servers +- CPU usage sustained >70%: Optimize or add node +- Disk usage >80%: Add storage or implement cleanup + +--- + +### Growth Projections + +**Q1 2026 (Current):** +- 11 game servers +- ~50 active players +- ~5 paying subscribers (projected) + +**Q2 2026 (Target):** +- 13-15 game servers +- ~100 active players +- ~12 paying subscribers + +**Q3 2026 (Growth):** +- 15-18 game servers +- ~150 active players +- ~20 paying subscribers + +**Capacity Limit (Current Infrastructure):** +- Maximum: ~20 servers across both nodes +- Need 3rd node if exceeding 20 servers + +--- + +## ⏱️ RESPONSE TIME TARGETS + +**Incident Response:** +- **Critical (Red Alert):** Acknowledge in 5 min, resolve in 1 hour +- **High (Yellow Alert):** Acknowledge in 15 min, resolve in 30 min +- **Medium:** Respond in 1 hour, resolve in 4 hours +- **Low:** Respond in 24 hours, resolve in 1 week + +**Support Tickets:** +- **Urgent:** Response in 2 hours +- **Normal:** Response in 12 
hours +- **Low Priority:** Response in 48 hours + +--- + +## 🎮 PLAYER EXPERIENCE METRICS + +**Connection Success Rate:** +- **Target:** >99% of connection attempts succeed +- **Measurement:** Player reports + server logs + +**Server Stability:** +- **Target:** <1 crash per server per month +- **Measurement:** Pterodactyl crash reports + +**Player Retention:** +- **Target:** >60% monthly active players return +- **Measurement:** Login tracking + +**Support Satisfaction:** +- **Target:** >90% positive feedback +- **Measurement:** Player surveys + +--- + +## 📉 FAILURE METRICS + +**Mean Time Between Failures (MTBF):** +- **Target:** >720 hours (30 days) per service +- **Current:** Track and improve monthly + +**Mean Time To Repair (MTTR):** +- **Critical Services:** <30 minutes +- **Game Servers:** <15 minutes +- **Non-critical:** <2 hours + +**Change Success Rate:** +- **Target:** >95% of changes deploy without incident +- **Measurement:** Track deployments vs rollbacks + +--- + +## 📋 MONITORING DASHBOARDS + +**Uptime Kuma:** +- All services monitored +- Status page: status.firefrostgaming.com +- Alert thresholds configured + +**Netdata (Planned):** +- Real-time performance metrics +- Historical data retention: 7 days +- Alert integration with Discord + +**Pterodactyl:** +- Server resource usage graphs +- Player connection logs +- Crash reports + +--- + +## 🔔 ALERT THRESHOLDS + +**Uptime Kuma Alerts:** +- Service down >5 minutes → Discord notification +- Service down >15 minutes → Email alert +- Service down >30 minutes → SMS/Call escalation + +**Resource Alerts:** +- CPU >80% for 10 min → Warning +- RAM >90% for 5 min → Critical +- Disk >90% → Critical +- Network down → Critical immediate + +**Performance Alerts:** +- TPS <15 for 15 min → Warning +- TPS <10 for 5 min → Critical +- Latency >300ms for 10 min → Warning + +--- + +## 📊 REPORTING SCHEDULE + +**Daily:** +- Automated backup success/failure report +- Critical alerts summary + +**Weekly:** +- Uptime summary 
(per service) +- Performance trends +- Failed login attempts +- Bandwidth usage + +**Monthly:** +- SLA compliance report +- Cost analysis +- Capacity utilization +- Growth metrics +- Incident post-mortems + +**Quarterly:** +- Infrastructure review +- Capacity planning update +- Security audit summary +- Financial performance + +--- + +## 🎯 SUCCESS METRICS + +**Infrastructure:** +- ✅ 99.5% uptime achieved +- ✅ All backups successful +- ✅ Zero data loss incidents +- ✅ Response times within SLA + +**Business:** +- ✅ Revenue > costs (profitability) +- ✅ Subscriber growth on track +- ✅ Player retention >60% +- ✅ Positive community sentiment + +**Operations:** +- ✅ Incidents resolved within targets +- ✅ Change success rate >95% +- ✅ Security posture maintained +- ✅ Documentation complete and current + +--- + +**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️ + +--- + +**Document Status:** ACTIVE +**Review Schedule:** Monthly +**Next Review:** 2026-03-17 +**Version:** 1.0 diff --git a/docs/quick-reference/common-operations.md b/docs/quick-reference/common-operations.md new file mode 100644 index 0000000..8aeae4f --- /dev/null +++ b/docs/quick-reference/common-operations.md @@ -0,0 +1,377 @@ +# 🚀 QUICK REFERENCE - Common Operations + +**One-page quick reference for daily operations** +**Print and keep handy!** + +--- + +## 🔐 EMERGENCY CREDENTIALS ACCESS + +**Vaultwarden:** vault.firefrostgaming.com +**If Vaultwarden down:** Check emergency credential sheet + +--- + +## 🖥️ SERVER ACCESS + +```bash +# Command Center (Dallas hub) +ssh root@63.143.34.217 + +# TX1 (Dallas game servers) +ssh root@38.68.14.26 + +# NC1 (Charlotte game servers) +ssh root@216.239.104.130 + +# Panel (Control plane) +ssh root@45.94.168.138 + +# Billing VPS +ssh root@38.68.14.188 + +# Ghost VPS (Docs/Wiki) +ssh root@64.50.188.14 +``` + +--- + +## 🎮 RESTART SINGLE SERVER + +**Via Pterodactyl Panel:** +1. Go to panel.firefrostgaming.com +2. Select server +3. Click "Restart" button +4. 
Wait 2-3 minutes +5. Verify server online + +**Via API:** +```bash +curl -X POST "https://panel.firefrostgaming.com/api/client/servers/{uuid}/power" \ + -H "Authorization: Bearer YOUR_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{"signal":"restart"}' +``` + +--- + +## 🔄 RESTART ALL SERVERS (Staggered) + +**Manual (when automation down):** +```bash +# On Command Center +python3 /opt/automation/staggered-restart/staggered-restart.py +``` + +**Scheduled (cron):** +- Runs automatically at 4:00 AM daily +- Check logs: `tail -f /var/log/staggered-restart.log` + +--- + +## 💾 MANUAL BACKUP + +**Single server world:** +```bash +# On Command Center +python3 /opt/automation/world-backup/world-backup.py --server "ATM10" +``` + +**All servers:** +```bash +python3 /opt/automation/world-backup/world-backup.py +``` + +**Check backup status:** +- NextCloud: downloads.firefrostgaming.com/backups/worlds/ + +--- + +## 📊 CHECK SERVER HEALTH + +**TPS (in-game):** +``` +/tps +/forge tps +``` + +**Resource usage (SSH):** +```bash +# Quick overview +htop + +# Memory +free -h + +# Disk space +df -h + +# Network +iftop +``` + +**Via Pterodactyl:** +- View server → Graphs tab + +--- + +## 🔥 PERFORMANCE ISSUES + +**High CPU:** +```bash +# Find process +top +# Kill if needed +kill [PID] +``` + +**High Memory:** +```bash +# Check usage +free -h +# Restart server if critical +``` + +**Low TPS:** +``` +# In-game: clear entities (caution: removes ALL non-player entities, including item frames, armor stands, villagers, and pets) +/kill @e[type=!player] +# Then restart server +``` + +**High Disk I/O:** +```bash +iostat -x 1 +# Check what's writing +iotop +``` + +--- + +## 🌐 FROSTWALL TUNNEL CHECK + +**Command Center:** +```bash +# Check tunnel status +ip link show | grep gre + +# Test connectivity +ping 10.0.1.2 # TX1 +ping 10.0.2.2 # NC1 + +# Restart if needed +systemctl restart networking +``` + +--- + +## 🚨 CHECK SERVICE STATUS + +```bash +# Any systemd service +systemctl status [service-name] + +# Common services +systemctl status nginx +systemctl status gitea +systemctl status 
vaultwarden +systemctl status netdata +``` + +--- + +## 📝 VIEW LOGS + +```bash +# Service logs (last 50 lines) +journalctl -u [service] -n 50 + +# Follow logs live +journalctl -u [service] -f + +# All system logs +journalctl -xe + +# Specific log files +tail -f /var/log/[logfile] +``` + +--- + +## 🔧 RESTART SERVICES + +```bash +# Restart service +systemctl restart [service] + +# Restart web server +systemctl restart nginx + +# Restart all Pterodactyl +systemctl restart pteroq wings + +# Restart automation +systemctl restart staggered-restart +``` + +--- + +## 🎯 WHITELIST PLAYER + +**Via Web Dashboard:** +1. Go to whitelist.firefrostgaming.com +2. Enter Minecraft username +3. Select server +4. Click "Add to Whitelist" + +**Manual (in-game console):** +``` +/whitelist add [username] +/whitelist reload +``` + +--- + +## 👥 ADD STAFF PERMISSIONS + +**LuckPerms (in-game):** +``` +/lp user [username] parent set admin +/lp user [username] permission set [perm] true +``` + +**Pterodactyl Panel:** +1. Users → Create User +2. Assign to servers +3. 
Set permissions + +--- + +## 📈 CHECK UPTIME + +**Uptime Kuma:** +- Go to status.firefrostgaming.com +- View all service status + +**Manual check:** +```bash +uptime +systemctl status [service] +``` + +--- + +## 💬 DISCORD NOTIFICATIONS + +**Server Status:** +- Posted automatically to #server-status +- Configured via webhooks + +**Manual notification:** +```bash +curl -X POST [DISCORD_WEBHOOK_URL] \ + -H "Content-Type: application/json" \ + -d '{"content":"[Your message]"}' +``` + +--- + +## 🗄️ DATABASE ACCESS + +**MySQL (if needed):** +```bash +mysql -u root -p +SHOW DATABASES; +USE [database]; +SHOW TABLES; +``` + +**Pterodactyl database:** +```bash +mysql -u pterodactyl -p pterodactyl +``` + +--- + +## 🔐 SECURITY QUICK CHECKS + +**Check for attacks:** +```bash +# Failed SSH attempts +grep "Failed password" /var/log/auth.log | tail -20 + +# Fail2Ban status +fail2ban-client status sshd + +# UFW status +ufw status +``` + +--- + +## 📦 UPDATE SYSTEM + +```bash +# Update packages +apt update && apt upgrade -y + +# Check what's outdated +apt list --upgradable + +# Security updates only +unattended-upgrades +``` + +--- + +## 🆘 EMERGENCY STOP + +**Stop specific server:** +- Pterodactyl panel → Stop button + +**Stop all game servers:** +```bash +# Via Pterodactyl API (script) +for uuid in [server-uuids]; do + curl -X POST ".../power" -d '{"signal":"stop"}' +done +``` + +**Stop critical service:** +```bash +systemctl stop [service] +``` + +--- + +## 📞 WHEN TO ESCALATE + +**Yellow Alert (⚠️):** +- Single server down >15 min +- Performance degraded >30 min +- Any revenue system affected + +**Red Alert (🚨):** +- Multiple services down +- All game servers unreachable +- Provider outage +- Security breach + +**See:** `docs/emergency-protocols/` + +--- + +## 🔗 QUICK LINKS + +- **Panel:** panel.firefrostgaming.com +- **Status:** status.firefrostgaming.com +- **Vault:** vault.firefrostgaming.com +- **Docs:** docs.firefrostgaming.com +- **Git:** git.firefrostgaming.com + +--- + 
+**Fire + Frost + Foundation** 💙🔥❄️ + +**Print Date:** 2026-02-17 +**Version:** 1.0 diff --git a/docs/reference/incident-post-mortem-template.md b/docs/reference/incident-post-mortem-template.md new file mode 100644 index 0000000..40e0dda --- /dev/null +++ b/docs/reference/incident-post-mortem-template.md @@ -0,0 +1,187 @@ +# 🔍 Incident Post-Mortem Template + +**Incident ID:** [YYYY-MM-DD-###] +**Severity:** [Red Alert / Yellow Alert / Info] +**Date:** [Date of incident] +**Author:** [Name] +**Status:** [Draft / Under Review / Published] + +--- + +## 📊 INCIDENT SUMMARY + +**In plain language, what happened?** + +[2-3 sentence summary that anyone can understand] + +**Impact:** +- **Services Affected:** [List] +- **Users Impacted:** [Number/percentage] +- **Duration:** [X hours Y minutes] +- **Revenue Impact:** [Yes/No, details if yes] + +--- + +## ⏱️ TIMELINE + +**All times in Central Time (America/Chicago)** + +| Time | Event | Action Taken | By Whom | +|------|-------|--------------|---------| +| HH:MM | [What happened] | [What was done] | [Who] | +| HH:MM | [Next event] | [Next action] | [Who] | +| HH:MM | [Next event] | [Next action] | [Who] | + +**Example:** +| Time | Event | Action Taken | By Whom | +|------|-------|--------------|---------| +| 03:47 | ATM10 server crashed | Alert received in Discord | Automated | +| 03:52 | Investigated crash logs | SSH to NC1, checked logs | Michael | +| 04:05 | Root cause identified (OOM) | Increased RAM allocation | Michael | +| 04:12 | Server restarted | Restart via panel | Michael | +| 04:15 | Verified functionality | Test player connection | Michael | +| 04:20 | All clear | Posted update in Discord | Meg | + +--- + +## 🔍 ROOT CAUSE ANALYSIS + +### What was the root cause? + +[Detailed technical explanation] + +### Why did it happen? + +[Contributing factors] + +### Why didn't we catch it earlier? 
+ +[Monitoring gaps, if any] + +--- + +## 🛡️ WHAT WENT WELL + +**Things that worked as expected:** +- [ ] [Monitoring detected issue quickly] +- [ ] [Team responded within SLA] +- [ ] [Emergency protocols followed] +- [ ] [Communication was clear] +- [ ] [Recovery was successful] + +[Expand on each point] + +--- + +## 🚨 WHAT WENT WRONG + +**Things that didn't work as expected:** +- [ ] [Issue that caused incident] +- [ ] [Monitoring didn't catch X] +- [ ] [Response was delayed because...] +- [ ] [Communication breakdown in...] + +[Expand on each point] + +--- + +## 🎯 ACTION ITEMS + +**Immediate (Within 24 hours):** +- [ ] [Action 1] - Assigned to: [Person] - Due: [Date] +- [ ] [Action 2] - Assigned to: [Person] - Due: [Date] + +**Short-term (Within 1 week):** +- [ ] [Action 1] - Assigned to: [Person] - Due: [Date] +- [ ] [Action 2] - Assigned to: [Person] - Due: [Date] + +**Long-term (Within 1 month):** +- [ ] [Action 1] - Assigned to: [Person] - Due: [Date] +- [ ] [Action 2] - Assigned to: [Person] - Due: [Date] + +--- + +## 📚 LESSONS LEARNED + +**What did we learn?** +1. [Lesson 1] +2. [Lesson 2] +3. 
[Lesson 3] + +**How will we prevent this from happening again?** +- [Prevention measure 1] +- [Prevention measure 2] +- [Prevention measure 3] + +**What documentation needs to be updated?** +- [ ] [Document 1 - link] +- [ ] [Document 2 - link] +- [ ] [Procedure 3 - link] + +--- + +## 💰 COST IMPACT + +**Direct Costs:** +- Lost revenue: $[amount] +- Emergency support costs: $[amount] +- Overtime/after-hours work: [hours] + +**Indirect Costs:** +- Player churn (estimated): [number] +- Reputation impact: [assessment] +- Time investment: [person-hours] + +**Total Estimated Impact:** $[amount] + +--- + +## 🔄 FOLLOW-UP + +**30-Day Follow-Up:** +- [ ] Verify all action items completed +- [ ] Check if similar incidents occurred +- [ ] Measure effectiveness of changes + +**90-Day Follow-Up:** +- [ ] Review long-term prevention measures +- [ ] Assess if incident type has recurred +- [ ] Update procedures based on experience + +--- + +## 📎 SUPPORTING MATERIALS + +**Logs:** +- Link to server logs: [path/link] +- Link to monitoring data: [path/link] +- Screenshots: [path/link] + +**Communications:** +- Discord announcements: [links] +- Staff communications: [links] +- Player feedback: [links] + +--- + +## ✅ APPROVAL & PUBLICATION + +**Reviewed by:** +- [ ] Technical Lead: [Name] - [Date] +- [ ] Management: [Name] - [Date] + +**Publication:** +- [ ] Internal (staff only) +- [ ] Public (redacted version) + +**Published:** [Date] +**Location:** [docs/reference/post-mortems/YYYY-MM-DD-###.md] + +--- + +**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️ + +--- + +**Template Version:** 1.0 +**Last Updated:** 2026-02-17 diff --git a/docs/training/staff-training-curriculum.md b/docs/training/staff-training-curriculum.md new file mode 100644 index 0000000..da605cf --- /dev/null +++ b/docs/training/staff-training-curriculum.md @@ -0,0 +1,460 @@ +# 🎓 Staff Training Curriculum + +**Purpose:** Comprehensive onboarding and skill development +**Target:** New Firefrost Gaming staff 
members +**Duration:** 2-4 weeks (self-paced) +**Last Updated:** 2026-02-17 + +--- + +## 📋 TRAINING OVERVIEW + +**Training Philosophy:** +- **Fire:** Passion-driven, hands-on learning +- **Frost:** Systematic, precise skill building +- **Foundation:** Building for the long term + +**Training Levels:** +1. **Level 1:** Orientation (Days 1-3) +2. **Level 2:** Core Skills (Week 1) +3. **Level 3:** Advanced Skills (Week 2-3) +4. **Level 4:** Specialization (Week 4+) + +--- + +## LEVEL 1: ORIENTATION (Days 1-3) + +### Day 1: Welcome & Philosophy + +**Topics:** +- [ ] Fire + Frost + Foundation philosophy +- [ ] Company mission and values +- [ ] Fire vs Frost player paths +- [ ] "For children not yet born" vision +- [ ] Team structure and roles + +**Materials:** +- `docs/planning/mission-statement.md` +- `docs/planning/path-philosophy.md` +- `docs/planning/design-bible.md` + +**Activities:** +- Introduction meeting with Michael & Meg +- Tour of all services (play on servers) +- Read Fire + Frost philosophy +- Join Discord and introduce yourself + +**Checkpoint:** Can you explain Fire + Frost philosophy? + +--- + +### Day 2: Infrastructure Overview + +**Topics:** +- [ ] Complete infrastructure map +- [ ] All 11 game servers (what they run) +- [ ] VPS tier services +- [ ] Dedicated tier architecture +- [ ] Frostwall Protocol basics + +**Materials:** +- `docs/core/infrastructure-manifest.md` +- `docs/diagrams/complete-infrastructure-map.mermaid` +- `docs/diagrams/frostwall-network-topology.mermaid` + +**Activities:** +- View infrastructure diagrams +- SSH to each server (read-only access) +- Join each game server as player +- Review Pterodactyl panel + +**Checkpoint:** Can you name all 11 game servers and their locations? 
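
For the "SSH to each server" activity above, a minimal sketch that walks the host list and reports which servers accept a non-interactive login. The addresses are the ones listed in `docs/quick-reference/common-operations.md`; the short host labels, the `list_hosts` and `check_all` helper names, and the `SSH_USER` variable are illustrative assumptions and do not exist in the repository.

```bash
#!/bin/sh
# Sketch only: labels, helper names, and SSH_USER are assumptions.
# IPs are copied from docs/quick-reference/common-operations.md.

# Print one "label ip" pair per line.
list_hosts() {
  cat <<'EOF'
command-center 63.143.34.217
tx1 38.68.14.26
nc1 216.239.104.130
panel 45.94.168.138
billing 38.68.14.188
ghost 64.50.188.14
EOF
}

# Attempt a login on every host and print OK or FAIL per host.
# BatchMode=yes makes ssh fail instead of prompting for a password;
# -n stops ssh from swallowing the host list on stdin.
check_all() {
  list_hosts | while read -r label ip; do
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 \
        "${SSH_USER:-root}@${ip}" true 2>/dev/null; then
      echo "OK   ${label} (${ip})"
    else
      echo "FAIL ${label} (${ip})"
    fi
  done
}
```

Source the file and run `check_all` (optionally with `SSH_USER` set to your own account). A `FAIL` line only means the key-based login did not succeed, which can be a missing key, a firewall rule, or a host that is down, so follow up by hand before reporting an outage.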
+ +--- + +### Day 3: Tools & Access + +**Topics:** +- [ ] Vaultwarden (password manager) +- [ ] Pterodactyl Panel +- [ ] Discord roles and channels +- [ ] Wiki.js (documentation) +- [ ] Gitea (version control) + +**Materials:** +- `docs/tasks/vaultwarden-setup/configuration-guide.md` +- `docs/quick-reference/common-operations.md` + +**Activities:** +- Get Vaultwarden account +- Get credentials for assigned services +- Set up 2FA +- Practice common operations +- Review quick reference card + +**Checkpoint:** Can you access all tools assigned to your role? + +--- + +## LEVEL 2: CORE SKILLS (Week 1) + +### Week 1, Day 1-2: Server Management Basics + +**Topics:** +- [ ] Starting/stopping servers +- [ ] Reading server console +- [ ] Basic troubleshooting +- [ ] Player whitelisting +- [ ] Common server issues + +**Materials:** +- `docs/quick-reference/common-operations.md` +- Pterodactyl documentation +- Server-specific READMEs + +**Hands-on Practice:** +- Restart a test server +- Whitelist yourself +- Read console logs +- Identify a simulated issue + +**Checkpoint:** Can you restart a server and verify it's online? + +--- + +### Week 1, Day 3-4: Discord & Community + +**Topics:** +- [ ] Discord server structure +- [ ] Fire vs Frost channels +- [ ] Community moderation basics +- [ ] Player support workflows +- [ ] Escalation procedures + +**Materials:** +- `docs/tasks/discord-server-complete-reorganization/deployment-plan.md` +- `docs/planning/emissary-social-media-handbook.md` + +**Activities:** +- Shadow Meg for community management +- Practice responding to player questions +- Learn Discord bot commands +- Review moderation guidelines + +**Checkpoint:** Can you handle a basic support request? 
+ +--- + +### Week 1, Day 5: Emergency Procedures + +**Topics:** +- [ ] Red Alert protocol +- [ ] Yellow Alert protocol +- [ ] When to escalate +- [ ] Communication procedures +- [ ] Emergency contacts + +**Materials:** +- `docs/emergency-protocols/RED-ALERT-complete-failure.md` +- `docs/emergency-protocols/YELLOW-ALERT-partial-degradation.md` + +**Simulation:** +- Walk through Red Alert scenario (tabletop) +- Practice Yellow Alert response +- Draft emergency Discord message + +**Checkpoint:** Can you identify when to call Red/Yellow Alert? + +--- + +## LEVEL 3: ADVANCED SKILLS (Week 2-3) + +### Week 2: Role-Specific Training + +#### For Builders: + +**Topics:** +- [ ] Modpack installation +- [ ] Server configuration +- [ ] Mod compatibility +- [ ] Performance optimization +- [ ] World management + +**Materials:** +- `docs/tasks/game-server-startup-script-audit-&-optimization/` +- Modpack-specific documentation + +**Projects:** +- Set up a test modpack server +- Optimize JVM flags +- Create spawn area for new server +- Document your build process + +--- + +#### For Social Media Helper: + +**Topics:** +- [ ] Content calendar +- [ ] Brand voice (Fire + Frost) +- [ ] Platform-specific strategies +- [ ] Community engagement +- [ ] Analytics tracking + +**Materials:** +- `docs/planning/emissary-social-media-handbook.md` +- `docs/planning/gemini-social-media-calendar.md` + +**Projects:** +- Create 1 week of social media content +- Draft announcement for new server +- Design promotional graphic +- Schedule posts + +--- + +#### For Moderators: + +**Topics:** +- [ ] Conflict resolution +- [ ] Rule enforcement +- [ ] Player reports +- [ ] Ban procedures +- [ ] Community building + +**Materials:** +- Discord server rules +- Moderation guidelines +- Escalation matrix + +**Projects:** +- Shadow senior moderator +- Handle simulated conflicts +- Document 3 case studies +- Create moderation report + +--- + +### Week 3: Systems & Automation + +**Topics:** +- [ ] Staggered restart 
system +- [ ] World backup automation +- [ ] Monitoring (Uptime Kuma, Netdata) +- [ ] Performance metrics +- [ ] SLA understanding + +**Materials:** +- `docs/tasks/staggered-server-restart-system/deployment-plan.md` +- `docs/tasks/world-backup-automation/deployment-plan.md` +- `docs/metrics/sla-definitions-and-targets.md` + +**Activities:** +- Review automation logs +- Verify backup completion +- Check monitoring dashboards +- Understand SLA targets + +**Checkpoint:** Can you verify automation systems are working? + +--- + +## LEVEL 4: SPECIALIZATION (Week 4+) + +### Advanced Builder Track + +**Topics:** +- [ ] Custom modpack creation +- [ ] Server performance tuning +- [ ] Advanced world editing +- [ ] Plugin development (if applicable) +- [ ] Infrastructure expansion planning + +**Projects:** +- Design new flagship modpack +- Optimize existing server +- Create custom builds +- Document best practices + +--- + +### Advanced Social Media Track + +**Topics:** +- [ ] Video content creation (CapCut) +- [ ] Streaming setup +- [ ] Community growth strategies +- [ ] Partnership outreach +- [ ] Analytics deep-dive + +**Projects:** +- Create "Coming Soon" video +- Plan content series +- Grow follower base +- Launch campaign + +--- + +### Advanced Operations Track + +**Topics:** +- [ ] Infrastructure as Code +- [ ] Advanced security hardening +- [ ] Disaster recovery testing +- [ ] Capacity planning +- [ ] Cost optimization + +**Projects:** +- Deploy new service +- Run disaster recovery drill +- Create infrastructure diagram +- Optimize costs + +--- + +## 📚 RECOMMENDED READING ORDER + +**Week 1:** +1. Mission Statement & Philosophy +2. Infrastructure Manifest +3. Quick Reference - Common Operations +4. Emergency Protocols (both) + +**Week 2:** +5. Department Structure & Access Control +6. Discord Server Organization +7. Role-specific task documentation + +**Week 3:** +8. Automation system documentation +9. Metrics & SLA definitions +10. 
Advanced topics (role-dependent) + +**Week 4+:** +11. Deep-dive into specialty areas +12. Contribute to documentation updates +13. Propose improvements + +--- + +## ✅ CERTIFICATION CHECKPOINTS + +**Level 1 Complete:** +- [ ] Understands Fire + Frost philosophy +- [ ] Can access all assigned tools +- [ ] Knows infrastructure layout +- [ ] Has completed orientation + +**Level 2 Complete:** +- [ ] Can perform common operations independently +- [ ] Can handle basic support requests +- [ ] Knows emergency procedures +- [ ] Shadow period complete + +**Level 3 Complete:** +- [ ] Proficient in role-specific skills +- [ ] Can work independently +- [ ] Understands automation systems +- [ ] Can train others on basics + +**Level 4 Complete:** +- [ ] Expert in specialty area +- [ ] Can lead projects +- [ ] Contributes to improvements +- [ ] Mentors newer staff + +--- + +## 🎯 SKILLS ASSESSMENT + +**After each level, assess:** + +**Knowledge (Can explain):** +- Fire + Frost philosophy +- Infrastructure architecture +- Emergency procedures +- Role responsibilities + +**Skills (Can demonstrate):** +- Common operations +- Problem solving +- Communication +- Tool proficiency + +**Attitude (Exhibits):** +- Passion for mission +- Attention to detail +- Team collaboration +- Continuous learning + +--- + +## 📝 TRAINING RECORDS + +**Track for each staff member:** +- Start date +- Level completion dates +- Checkpoint results +- Skills assessments +- Certification achieved +- Specialization chosen +- Ongoing development goals + +**Template:** `docs/reference/staff-training-record-template.md` + +--- + +## 🔄 ONGOING DEVELOPMENT + +**After initial training:** + +**Monthly:** +- Review new documentation +- Learn about new features +- Attend team meetings +- Share knowledge + +**Quarterly:** +- Advanced skill development +- Cross-training opportunities +- Leadership development +- Innovation projects + +**Annually:** +- Full infrastructure review +- Disaster recovery drill participation +- 
Career development planning +- Contribution recognition + +--- + +## 🎓 TRAINING RESOURCES + +**Internal:** +- Complete operations manual (this repository) +- Wiki.js documentation +- Staff Discord channels +- Shadow senior team members + +**External:** +- Minecraft server optimization guides +- Discord community management +- Social media marketing courses +- Infrastructure/DevOps tutorials + +**Hands-on:** +- Test server for experimentation +- Simulated emergencies +- Real-world shadowing +- Project-based learning + +--- + +**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️ + +--- + +**Curriculum Status:** ACTIVE +**Review Schedule:** Quarterly +**Next Review:** 2026-05-17 +**Version:** 1.0