firefrost-operations-manual/docs/emergency-protocols/YELLOW-ALERT-partial-degradation.md

# ⚠️ YELLOW ALERT - Partial Service Degradation Protocol

**Status:** Elevated Response Procedure
**Alert Level:** YELLOW ALERT
**Priority:** HIGH
**Last Updated:** 2026-02-17

---

## ⚠️ YELLOW ALERT DEFINITION

**Partial service degradation or single critical system failure:**
- One or more game servers down (but not all)
- Single management service unavailable
- Performance degradation (high latency, low TPS)
- Single node failure (TX1 or NC1 affected)
- Non-critical but user-impacting issues

**This requires prompt attention but is not business-critical.**

---

## 📊 YELLOW ALERT TRIGGERS

**Automatic triggers:**
- Any game server offline for >15 minutes
- TPS below 15 on any server for >30 minutes
- Panel/billing system inaccessible for >10 minutes
- More than 5 player complaints in 15 minutes
- Uptime Kuma shows red status for any service
- Memory usage >90% for >20 minutes

---

## 📞 RESPONSE PROCEDURE (15-30 minutes)

### Step 1: ASSESS SITUATION (5 minutes)

**Determine scope:**
- [ ] Which services are affected?
- [ ] How many players impacted?
- [ ] Is degradation worsening?
- [ ] Any revenue impact?
- [ ] Can it wait or needs immediate action?

**Quick checks:**
```bash
# Check server status
ssh root@63.143.34.217 "systemctl status"

# Check game servers in Pterodactyl
curl https://panel.firefrostgaming.com/api/client

# Check resource usage
ssh root@38.68.14.26 "htop"
```

---

### Step 2: COMMUNICATE (3 minutes)

**If user-facing impact:**

Discord #server-status:
```
⚠️ SERVICE NOTICE

We're experiencing issues with [specific service/server].

Affected: [Server name(s)]
Status: Investigating
ETA: [Estimate]

Players on unaffected servers: No action needed
Players on affected server: Please standby

Updates will be posted here.
```

**If internal only:**
- Post in #staff-lounge
- No public announcement needed

---

### Step 3: DIAGNOSE & FIX (10-20 minutes)

See scenario-specific procedures below.

---

## 🔧 COMMON YELLOW ALERT SCENARIOS

### Scenario 1: Single Game Server Down

**Quick diagnostics:**
```bash
# Via Pterodactyl panel
1. Check server status in panel
2. View console for errors
3. Check resource usage graphs

# Common causes:
- Out of memory (OOM)
- Crash from mod conflict
- World corruption
- Java process died
```

**Resolution:**
```bash
# Restart server via panel
1. Stop server
2. Wait 30 seconds
3. Start server
4. Monitor console for successful startup
5. Test player connection
```

**If restart fails:**
- Check logs for error messages
- Restore from backup if world corrupted
- Rollback recent mod changes
- Allocate more RAM if OOM

**Recovery time:** 5-15 minutes

---

### Scenario 2: Low TPS / Server Lag

**Diagnostics:**
```bash
# In-game
/tps
/forge tps

# Via SSH
top -u minecraft
htop
iostat
```

**Common causes:**
- Chunk loading lag
- Redstone contraptions
- Mob farms
- Memory pressure
- Disk I/O bottleneck

**Quick fixes:**
```bash
# Clear entities
/kill @e[type=!player]

# Reduce view distance temporarily
# (via server.properties or Pterodactyl)

# Restart server during low-traffic time
```

**Long-term solutions:**
- Optimize JVM flags (see optimization guide)
- Add more RAM
- Limit chunk loading
- Remove lag-causing builds

**Recovery time:** 10-30 minutes

---

### Scenario 3: Pterodactyl Panel Inaccessible

**Quick checks:**
```bash
# Panel server (45.94.168.138)
ssh root@45.94.168.138

# Check panel service
systemctl status pteroq
systemctl status wings

# Check Nginx
systemctl status nginx

# Check database
systemctl status mariadb
```

**Common fixes:**
```bash
# Restart panel services
systemctl restart pteroq wings nginx

# Check disk space (common cause)
df -h

# If database issue
systemctl restart mariadb
```

**Recovery time:** 5-10 minutes

---

### Scenario 4: Billing/Whitelist Manager Down

**Impact:** Players cannot subscribe or whitelist

**Diagnostics:**
```bash
# Billing VPS (38.68.14.188)
ssh root@38.68.14.188

# Check services
systemctl status paymenter
systemctl status whitelist-manager
systemctl status nginx
```

**Quick fix:**
```bash
systemctl restart [affected-service]
```

**Recovery time:** 2-5 minutes

---

### Scenario 5: Frostwall Tunnel Degraded

**Symptoms:**
- High latency on specific node
- Packet loss
- Intermittent disconnections

**Diagnostics:**
```bash
# On Command Center
ping 10.0.1.2  # TX1 tunnel
ping 10.0.2.2  # NC1 tunnel

# Check tunnel interface
ip link show gre-tx1
ip link show gre-nc1

# Check routing
ip route show
```

**Quick fix:**
```bash
# Restart specific tunnel
ip link set gre-tx1 down
ip link set gre-tx1 up

# Or restart all networking
systemctl restart networking
```

**Recovery time:** 5-10 minutes

---

### Scenario 6: High Memory Usage (Pre-OOM)

**Warning signs:**
- Memory >90% on any server
- Swap usage increasing
- JVM GC warnings in logs

**Immediate action:**
```bash
# Identify memory hog
htop
ps aux --sort=-%mem | head

# If game server:
# Schedule restart during low-traffic

# If other service:
systemctl restart [service]
```

**Prevention:**
- Enable swap if not present
- Right-size RAM allocation
- Schedule regular restarts

**Recovery time:** 5-20 minutes

---

### Scenario 7: Discord Bot Offline

**Impact:** Automated features unavailable

**Quick fix:**
```bash
# Restart bot container/service
docker restart [bot-name]
# or
systemctl restart [bot-service]

# Check bot token hasn't expired
```

**Recovery time:** 2-5 minutes

---

## ✅ RESOLUTION VERIFICATION

**Before downgrading from Yellow Alert:**

- [ ] Affected service operational
- [ ] Players can connect/use service
- [ ] No error messages in logs
- [ ] Performance metrics normal
- [ ] Root cause identified
- [ ] Temporary or permanent fix applied
- [ ] Monitoring in place for recurrence

---

## 📢 RESOLUTION COMMUNICATION

**Public (if announced):**
```
✅ RESOLVED

[Service/Server] is now operational.

Cause: [Brief explanation]
Duration: [X minutes]

Thank you for your patience!
```

**Staff-only:**
```
Yellow Alert cleared: [Service]
Cause: [Details]
Fix: [What was done]
Prevention: [Next steps]
```

---

## 📊 ESCALATION TO RED ALERT

**Escalate if:**
- Multiple services failing simultaneously
- Fix attempts unsuccessful after 30 minutes
- Issue worsening despite interventions
- Provider reports hardware failure
- Security breach suspected

**When escalating:**
- Follow RED ALERT protocol immediately
- Document what was tried
- Preserve logs/state for diagnosis

---

## 🔄 POST-INCIDENT TASKS

**For significant Yellow Alerts:**

1. **Document incident** (brief summary)
2. **Update monitoring** (prevent recurrence)
3. **Review capacity** (if resource-related)
4. **Schedule preventive maintenance** (if needed)

---

**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️

---

**Protocol Status:** ACTIVE
**Version:** 1.0