# 🚨 RED ALERT - Complete Infrastructure Failure Protocol

**Status:** Emergency Response Procedure
**Alert Level:** RED ALERT
**Priority:** CRITICAL
**Last Updated:** 2026-02-17

---

## 🚨 RED ALERT DEFINITION

**Complete infrastructure failure affecting multiple critical systems:**

- All game servers down
- Management services inaccessible
- Revenue/billing systems offline
- No user access to any services

**This is a business-critical emergency requiring immediate action.**

---

## ⏱️ RESPONSE TIMELINE

- **0-5 minutes:** Initial assessment and communication
- **5-15 minutes:** Emergency containment
- **15-60 minutes:** Restore critical services
- **1-4 hours:** Full recovery
- **24-48 hours:** Post-mortem and prevention

---

## 📞 IMMEDIATE ACTIONS (First 5 Minutes)

### Step 1: CONFIRM RED ALERT (60 seconds)

**Check multiple indicators:**

- [ ] Uptime Kuma shows all services down
- [ ] Cannot SSH to Command Center
- [ ] Cannot access panel.firefrostgaming.com
- [ ] Multiple player reports in Discord
- [ ] Email/SMS alerts from hosting provider

**If 3+ indicators confirm → RED ALERT CONFIRMED**

---

### Step 2: NOTIFY STAKEHOLDERS (2 minutes)

**Communication hierarchy:**

1. **Michael (The Wizard)** - Primary incident commander
   - Text/Call immediately
   - Use emergency contact if needed
2. **Meg (The Emissary)** - Community management
   - Brief on situation
   - Prepare community message
3. **Discord Announcement** (if accessible):

```
🚨 RED ALERT - ALL SERVICES DOWN

We are aware of a complete service outage affecting all Firefrost
servers. Our team is investigating and working on restoration.

ETA: Updates every 15 minutes
Status: https://status.firefrostgaming.com (if available)

We apologize for the inconvenience.

- The Firefrost Team
```

4. **Social Media** (Twitter/X):

```
⚠️ Service Alert: Firefrost Gaming is experiencing a complete
service outage. We're working on restoration. Updates to follow.
```
---

### Step 3: INITIAL TRIAGE (2 minutes)

**Determine failure scope - check hosting provider status:**

- Hetzner status page
- Provider support ticket system
- Email from provider?

**Likely causes (priority order):**

1. **Provider-wide outage** → Wait for provider
2. **DDoS attack** → Enable DDoS mitigation
3. **Network failure** → Check Frostwall tunnels
4. **Payment/billing issue** → Check accounts
5. **Configuration error** → Review recent changes
6. **Hardware failure** → Provider intervention needed

---

## 🔧 EMERGENCY RECOVERY PROCEDURES

### Scenario A: Provider-Wide Outage

**If Hetzner/provider has a known outage:**

1. **DO NOT PANIC** - This is out of your control
2. **Monitor the provider status page** - Get ETAs
3. **Update the community every 15 minutes**
4. **Document the timeline** for compensation claims
5. **Prepare communication** for when services return

**Actions:**

- [ ] Check Hetzner status: https://status.hetzner.com
- [ ] Open a support ticket (if not provider-wide)
- [ ] Monitor Discord for player questions
- [ ] Document downtime duration

**Recovery:** Services will restore when the provider resolves the issue

---

### Scenario B: DDoS Attack

**If traffic volume is abnormally high:**

1. **Enable Cloudflare DDoS protection** (if not already active)
2. **Contact the hosting provider** for mitigation help
3. **Check the Command Center** for abnormal traffic
4. **Review UFW logs** for attack patterns

**Actions:**

- [ ] Check traffic graphs in the provider dashboard
- [ ] Enable Cloudflare "I'm Under Attack" mode
- [ ] Contact the provider NOC for emergency mitigation
- [ ] Document attack source IPs (if visible)

**Recovery:** 15-60 minutes depending on attack severity

---

### Scenario C: Frostwall/Network Failure

**If GRE tunnels are down:**

1. **SSH to the Command Center** (if accessible)
2. **Check tunnel status:**

```bash
ip link show | grep gre
ping 10.0.1.2  # TX1 tunnel
ping 10.0.2.2  # NC1 tunnel
```
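Between checking and restarting, it helps to record exactly which tunnel interfaces exist and in what state, so the restart can be verified afterwards. A minimal sketch (the interface names `gre-tx1` and `gre-nc1` are assumptions, not the real Frostwall names; substitute your own):

```shell
#!/usr/bin/env bash
# Sketch: report each GRE tunnel as UP/DOWN/MISSING before restarting.
# Interface names below are assumptions, not the real Frostwall names.
check_tunnel() {
  local t="$1"
  if ip link show "$t" >/dev/null 2>&1; then
    # Print "name: STATE" (e.g. "gre-tx1: UP")
    ip -br link show "$t" | awk '{print $1": "$2}'
  else
    echo "$t: MISSING"
  fi
}

for t in gre-tx1 gre-nc1; do
  check_tunnel "$t"
done
```

Run it once before and once after the restart; a tunnel that stays `MISSING` usually points at the tunnel-creation script rather than at routing.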
3. **Restart tunnels:**

```bash
systemctl restart networking
# Or manually:
/etc/network/if-up.d/frostwall-tunnels
```

4. **Verify UFW rules** aren't blocking traffic

**Actions:**

- [ ] Check GRE tunnel status
- [ ] Restart network services
- [ ] Verify routing tables
- [ ] Test game server connectivity

**Recovery:** 5-15 minutes

---

### Scenario D: Payment/Billing Failure

**If services are suspended for non-payment:**

1. **Check email** for suspension notices
2. **Log into the provider billing portal**
3. **Make an immediate payment** if overdue
4. **Contact provider support** for expedited restoration

**Actions:**

- [ ] Check all provider invoices
- [ ] Verify payment methods are current
- [ ] Make an emergency payment if needed
- [ ] Request immediate service restoration

**Recovery:** 30-120 minutes (depending on provider response)

---

### Scenario E: Configuration Error

**If recent changes caused the failure:**

1. **Identify the last change** (check the git log and command history)
2. **Roll back the configuration:**

```bash
# Restore from the most recent config backup
cd /opt/config-backups
ls -lt | head -5                      # Find the most recent backup
tar -xzf backup-YYYYMMDD.tar.gz -C /  # Extract over the live config
systemctl restart [affected-service]
```

3. **Test services incrementally**

**Actions:**

- [ ] Review the git commit log
- [ ] Check command history: `history | tail -50`
- [ ] Restore the previous working config
- [ ] Test each service individually

**Recovery:** 15-30 minutes

---

### Scenario F: Hardware Failure

**If physical hardware has failed:**

1. **Open an EMERGENCY ticket** with the provider
2. **Request hardware replacement/migration**
3. **Prepare for potential data loss**
4. **Activate the disaster recovery plan**

**Actions:**

- [ ] Contact provider emergency support
- [ ] Request server health diagnostics
- [ ] Prepare to restore from backups
- [ ] Estimate RTO (Recovery Time Objective)

**Recovery:** 2-24 hours (provider dependent)

---

## 📊 RESTORATION PRIORITY ORDER

**Restore in this sequence:**

### Phase 1: CRITICAL (0-15 minutes)
1. **Command Center** - Management hub
2. **Pterodactyl Panel** - Control plane
3. **Uptime Kuma** - Monitoring
4. **Frostwall tunnels** - Network security

### Phase 2: REVENUE (15-30 minutes)

5. **Paymenter/Billing** - Financial systems
6. **Whitelist Manager** - Player access
7. **Top 3 game servers** - ATM10, Ember, MC:C&C

### Phase 3: SERVICES (30-60 minutes)

8. **Remaining game servers**
9. **Wiki.js** - Documentation
10. **NextCloud** - File storage

### Phase 4: SECONDARY (1-2 hours)

11. **Gitea** - Version control
12. **Discord bots** - Community tools
13. **Code-Server** - Development

---

## ✅ RECOVERY VERIFICATION CHECKLIST

**Before declaring "all clear":**

- [ ] All servers accessible via SSH
- [ ] All game servers online in Pterodactyl
- [ ] Players can connect to servers
- [ ] Uptime Kuma shows all green
- [ ] Website/billing accessible
- [ ] No error messages in logs
- [ ] Network performance normal
- [ ] All automation systems running

---

## 📢 RECOVERY COMMUNICATION

**When services are restored:**

### Discord Announcement:

```
✅ ALL CLEAR - Services Restored

All Firefrost services have been restored and are operating normally.

Total downtime: [X] hours [Y] minutes
Cause: [Brief explanation]

We apologize for the disruption and thank you for your patience.

Compensation: [If applicable]
- [Details of any compensation for subscribers]

Full post-mortem will be published within 48 hours.

- The Firefrost Team
```

### Twitter/X:

```
✅ Service Alert Resolved: All Firefrost Gaming services are now
operational. Thank you for your patience during the outage.
Full details: [link]
```

---

## 📝 POST-INCIDENT REQUIREMENTS

**Within 24 hours:**

1. **Create a timeline** of events (minute-by-minute)
2. **Document the root cause**
3. **Identify what worked well**
4. **Identify what failed**
5. **List action items** for prevention

**Within 48 hours:**

6. **Publish the post-mortem** (public or staff-only)
7. **Implement immediate fixes**
8. **Update emergency procedures** if needed
9. **Test recovery procedures**
10. **Review the disaster recovery plan**

**Post-Mortem Template:** `docs/reference/incident-post-mortem-template.md`

---

## 🎯 PREVENTION MEASURES

**After a RED ALERT, implement:**

1. **Enhanced monitoring** - More comprehensive alerts
2. **Redundancy** - Eliminate single points of failure
3. **Automated health checks** - Self-healing where possible
4. **Regular drills** - Test emergency procedures quarterly
5. **Documentation updates** - Capture lessons learned

---

## 📞 EMERGENCY CONTACTS

**Primary:**

- Michael (The Wizard): [Emergency contact method]
- Meg (The Emissary): [Emergency contact method]

**Providers:**

- Hetzner Emergency Support: [Support number]
- Cloudflare Support: [Support number]
- Discord Support: [Support email]

**Escalation:**

- If Michael is unavailable: Meg takes incident command
- If both are unavailable: [Designated backup contact]

---

## 🔐 CREDENTIALS EMERGENCY ACCESS

**If Vaultwarden is down:**

- Emergency credential sheet: [Physical location]
- Backup password manager: [Alternative access]
- Provider console access: [Direct login method]

---

**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️

---

**Protocol Status:** ACTIVE
**Last Drill:** [Date of last test]
**Next Review:** Monthly
**Version:** 1.0
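---

## 🧪 APPENDIX: SCRIPTED REACHABILITY SWEEP (SKETCH)

Part of the recovery verification checklist can be automated. The sketch below only tests TCP reachability; the hostnames and ports are placeholders, not the real Firefrost inventory (the billing hostname in particular is an assumption), and the Pterodactyl and Uptime Kuma checks remain manual:

```shell
#!/usr/bin/env bash
# Sketch: post-recovery TCP reachability sweep.
# All hostnames/ports below are placeholders - adapt to the real inventory.
check_port() {
  # Succeed if host:port accepts a TCP connection within 5 seconds.
  local host="$1" port="$2"
  timeout 5 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null
}

report() {
  local name="$1" host="$2" port="$3"
  if check_port "$host" "$port"; then
    echo "OK   ${name} (${host}:${port})"
  else
    echo "FAIL ${name} (${host}:${port})"
  fi
}

report "Panel"   "panel.firefrostgaming.com"   443
report "Billing" "billing.firefrostgaming.com" 443
report "Status"  "status.firefrostgaming.com"  443
```

Any `FAIL` line means the checklist item for that service cannot be ticked yet.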