firefrost-operations-manual/docs/emergency-protocols/RED-ALERT-complete-failure.md

# 🚨 RED ALERT - Complete Infrastructure Failure Protocol

**Status:** Emergency Response Procedure
**Alert Level:** RED ALERT
**Priority:** CRITICAL
**Last Updated:** 2026-02-17

---

## 🚨 RED ALERT DEFINITION

**Complete infrastructure failure affecting multiple critical systems:**
- All game servers down
- Management services inaccessible
- Revenue/billing systems offline
- No user access to any services

**This is a business-critical emergency requiring immediate action.**

---

## ⏱️ RESPONSE TIMELINE

**0-5 minutes:** Initial assessment and communication
**5-15 minutes:** Emergency containment
**15-60 minutes:** Restore critical services
**1-4 hours:** Full recovery
**24-48 hours:** Post-mortem and prevention

---

## 📞 IMMEDIATE ACTIONS (First 5 Minutes)

### Step 1: CONFIRM RED ALERT (60 seconds)

**Check multiple indicators:**
- [ ] Uptime Kuma shows all services down
- [ ] Cannot SSH to Command Center
- [ ] Cannot access panel.firefrostgaming.com
- [ ] Multiple player reports in Discord
- [ ] Email/SMS alerts from hosting provider

**If 3+ indicators confirm → RED ALERT CONFIRMED**

---

### Step 2: NOTIFY STAKEHOLDERS (2 minutes)

**Communication hierarchy:**

1. **Michael (The Wizard)** - Primary incident commander
   - Text/Call immediately
   - Use emergency contact if needed

2. **Meg (The Emissary)** - Community management
   - Brief on situation
   - Prepare community message

3. **Discord Announcement** (if accessible):
```
🚨 RED ALERT - ALL SERVICES DOWN

We are aware of a complete service outage affecting all Firefrost servers. Our team is investigating and working on restoration.

ETA: Updates every 15 minutes
Status: https://status.firefrostgaming.com (if available)

We apologize for the inconvenience.
- The Firefrost Team
```

4. **Social Media** (Twitter/X):
```
⚠️ Service Alert: Firefrost Gaming is experiencing a complete service outage. We're working on restoration. Updates to follow.
```

---

### Step 3: INITIAL TRIAGE (2 minutes)

**Determine failure scope:**

**Check hosting provider status:**
- Hetzner status page
- Provider support ticket system
- Email from provider?

**Likely causes (priority order):**
1. **Provider-wide outage** → Wait for provider
2. **DDoS attack** → Enable DDoS mitigation
3. **Network failure** → Check Frostwall tunnels
4. **Payment/billing issue** → Check accounts
5. **Configuration error** → Review recent changes
6. **Hardware failure** → Provider intervention needed

---

## 🔧 EMERGENCY RECOVERY PROCEDURES

### Scenario A: Provider-Wide Outage

**If Hetzner/provider has known outage:**

1. **DO NOT PANIC** - This is out of your control
2. **Monitor provider status page** - Get ETAs
3. **Update community every 15 minutes**
4. **Document timeline** for compensation claims
5. **Prepare communication** for when services return

**Actions:**
- [ ] Check Hetzner status: https://status.hetzner.com
- [ ] Open support ticket (if not provider-wide)
- [ ] Monitor Discord for player questions
- [ ] Document downtime duration

**Recovery:** Services will restore when provider resolves issue

---

### Scenario B: DDoS Attack

**If traffic volume is abnormally high:**

1. **Enable Cloudflare DDoS protection** (if not already)
2. **Contact hosting provider** for mitigation help
3. **Check Command Center** for abnormal traffic
4. **Review UFW logs** for attack patterns

**Actions:**
- [ ] Check traffic graphs in provider dashboard
- [ ] Enable Cloudflare "I'm Under Attack" mode
- [ ] Contact provider NOC for emergency mitigation
- [ ] Document attack source IPs (if visible)

**Recovery:** 15-60 minutes depending on attack severity

---

### Scenario C: Frostwall/Network Failure

**If GRE tunnels are down:**

1. **SSH to Command Center** (if accessible)
2. **Check tunnel status:**
```bash
ip link show | grep gre
ping 10.0.1.2  # TX1 tunnel
ping 10.0.2.2  # NC1 tunnel
```

3. **Restart tunnels:**
```bash
systemctl restart networking
# Or manually:
/etc/network/if-up.d/frostwall-tunnels
```

4. **Verify UFW rules** aren't blocking traffic

**Actions:**
- [ ] Check GRE tunnel status
- [ ] Restart network services
- [ ] Verify routing tables
- [ ] Test game server connectivity

**Recovery:** 5-15 minutes

---

### Scenario D: Payment/Billing Failure

**If services suspended for non-payment:**

1. **Check email** for suspension notices
2. **Log into provider billing** portal
3. **Make immediate payment** if overdue
4. **Contact provider support** for expedited restoration

**Actions:**
- [ ] Check all provider invoices
- [ ] Verify payment methods current
- [ ] Make emergency payment if needed
- [ ] Request immediate service restoration

**Recovery:** 30-120 minutes (depending on provider response)

---

### Scenario E: Configuration Error

**If recent changes caused failure:**

1. **Identify last change** (check git log, command history)
2. **Rollback configuration:**
```bash
# Restore from backup
cd /opt/config-backups
ls -lt | head -5  # Find recent backup
cp backup-YYYYMMDD.tar.gz /
tar -xzf backup-YYYYMMDD.tar.gz
systemctl restart [affected-service]
```

3. **Test services incrementally**

**Actions:**
- [ ] Review git commit log
- [ ] Check command history: `history | tail -50`
- [ ] Restore previous working config
- [ ] Test each service individually

**Recovery:** 15-30 minutes

---

### Scenario F: Hardware Failure

**If physical hardware failed:**

1. **Open EMERGENCY ticket** with provider
2. **Request hardware replacement/migration**
3. **Prepare for potential data loss**
4. **Activate disaster recovery plan**

**Actions:**
- [ ] Contact provider emergency support
- [ ] Request server health diagnostics
- [ ] Prepare to restore from backups
- [ ] Estimate RTO (Recovery Time Objective)

**Recovery:** 2-24 hours (provider dependent)

---

## 📊 RESTORATION PRIORITY ORDER

**Restore in this sequence:**

### Phase 1: CRITICAL (0-15 minutes)
1. **Command Center** - Management hub
2. **Pterodactyl Panel** - Control plane
3. **Uptime Kuma** - Monitoring
4. **Frostwall tunnels** - Network security

### Phase 2: REVENUE (15-30 minutes)
5. **Paymenter/Billing** - Financial systems
6. **Whitelist Manager** - Player access
7. **Top 3 game servers** - ATM10, Ember, MC:C&C

### Phase 3: SERVICES (30-60 minutes)
8. **Remaining game servers**
9. **Wiki.js** - Documentation
10. **NextCloud** - File storage

### Phase 4: SECONDARY (1-2 hours)
11. **Gitea** - Version control
12. **Discord bots** - Community tools
13. **Code-Server** - Development

---

## ✅ RECOVERY VERIFICATION CHECKLIST

**Before declaring "all clear":**

- [ ] All servers accessible via SSH
- [ ] All game servers online in Pterodactyl
- [ ] Players can connect to servers
- [ ] Uptime Kuma shows all green
- [ ] Website/billing accessible
- [ ] No error messages in logs
- [ ] Network performance normal
- [ ] All automation systems running

---

## 📢 RECOVERY COMMUNICATION

**When services are restored:**

### Discord Announcement:
```
✅ ALL CLEAR - Services Restored

All Firefrost services have been restored and are operating normally.

Total downtime: [X] hours [Y] minutes
Cause: [Brief explanation]

We apologize for the disruption and thank you for your patience.

Compensation: [If applicable]
- [Details of any compensation for subscribers]

Full post-mortem will be published within 48 hours.

- The Firefrost Team
```

### Twitter/X:
```
✅ Service Alert Resolved: All Firefrost Gaming services are now operational. Thank you for your patience during the outage. Full details: [link]
```

---

## 📝 POST-INCIDENT REQUIREMENTS

**Within 24 hours:**

1. **Create timeline** of events (minute-by-minute)
2. **Document root cause**
3. **Identify what worked well**
4. **Identify what failed**
5. **List action items** for prevention

**Within 48 hours:**

6. **Publish post-mortem** (public or staff-only)
7. **Implement immediate fixes**
8. **Update emergency procedures** if needed
9. **Test recovery procedures**
10. **Review disaster recovery plan**

**Post-Mortem Template:** `docs/reference/incident-post-mortem-template.md`

---

## 🎯 PREVENTION MEASURES

**After RED ALERT, implement:**

1. **Enhanced monitoring** - More comprehensive alerts
2. **Redundancy** - Eliminate single points of failure
3. **Automated health checks** - Self-healing where possible
4. **Regular drills** - Test emergency procedures quarterly
5. **Documentation updates** - Capture lessons learned

---

## 📞 EMERGENCY CONTACTS

**Primary:**
- Michael (The Wizard): [Emergency contact method]
- Meg (The Emissary): [Emergency contact method]

**Providers:**
- Hetzner Emergency Support: [Support number]
- Cloudflare Support: [Support number]
- Discord Support: [Support email]

**Escalation:**
- If Michael unavailable: Meg takes incident command
- If both unavailable: [Designated backup contact]

---

## 🔐 CREDENTIALS EMERGENCY ACCESS

**If Vaultwarden is down:**
- Emergency credential sheet: [Physical location]
- Backup password manager: [Alternative access]
- Provider console access: [Direct login method]

---

**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️

---

**Protocol Status:** ACTIVE
**Last Drill:** [Date of last test]
**Next Review:** Monthly
**Version:** 1.0