# ⚠️ YELLOW ALERT - Partial Service Degradation Protocol

**Status:** Elevated Response Procedure
**Alert Level:** YELLOW ALERT
**Priority:** HIGH
**Last Updated:** 2026-02-17

---

## ⚠️ YELLOW ALERT DEFINITION

**Partial service degradation or single critical system failure:**

- One or more game servers down (but not all)
- Single management service unavailable
- Performance degradation (high latency, low TPS)
- Single node failure (TX1 or NC1 affected)
- Non-critical but user-impacting issues

**This requires prompt attention but is not business-critical.**

---

## 📊 YELLOW ALERT TRIGGERS

**Automatic triggers:**

- Any game server offline for >15 minutes
- TPS below 15 on any server for >30 minutes
- Panel/billing system inaccessible for >10 minutes
- More than 5 player complaints in 15 minutes
- Uptime Kuma shows red status for any service
- Memory usage >90% for >20 minutes

---

## 📞 RESPONSE PROCEDURE (15-30 minutes)

### Step 1: ASSESS SITUATION (5 minutes)

**Determine scope:**

- [ ] Which services are affected?
- [ ] How many players are impacted?
- [ ] Is the degradation worsening?
- [ ] Is there any revenue impact?
- [ ] Can it wait, or does it need immediate action?

**Quick checks:**

```bash
# Check for failed units on the Command Center
ssh root@63.143.34.217 "systemctl --failed"

# Check game servers via the Pterodactyl client API
# (requires a client API key; $PTERO_API_KEY is a placeholder)
curl -H "Authorization: Bearer $PTERO_API_KEY" \
     https://panel.firefrostgaming.com/api/client

# Check resource usage (htop needs a TTY, hence -t)
ssh -t root@38.68.14.26 htop
```

---

### Step 2: COMMUNICATE (3 minutes)

**If there is user-facing impact**, post in Discord #server-status:

```
⚠️ SERVICE NOTICE

We're experiencing issues with [specific service/server].

Affected: [Server name(s)]
Status: Investigating
ETA: [Estimate]

Players on unaffected servers: No action needed
Players on affected server: Please stand by

Updates will be posted here.
```

**If internal only:**

- Post in #staff-lounge
- No public announcement needed

---

### Step 3: DIAGNOSE & FIX (10-20 minutes)

See the scenario-specific procedures below.
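The automatic trigger thresholds listed earlier can be folded into a small shell helper for monitoring cron jobs — a minimal sketch only; `classify_trigger`, the metric names, and the argument order are illustrative, not part of any existing Fire + Frost tooling.

```shell
#!/usr/bin/env bash
# classify_trigger METRIC VALUE DURATION_MINUTES
# Prints YELLOW when a reading crosses one of the automatic Yellow Alert
# thresholds above, OK otherwise. Metric names are illustrative.
classify_trigger() {
  local metric=$1 value=$2 duration=$3
  case "$metric" in
    server_offline)   # any game server offline for >15 minutes
      [ "$duration" -gt 15 ] && echo YELLOW || echo OK ;;
    tps)              # TPS below 15 for >30 minutes
      [ "$value" -lt 15 ] && [ "$duration" -gt 30 ] && echo YELLOW || echo OK ;;
    panel_down)       # panel/billing inaccessible for >10 minutes
      [ "$duration" -gt 10 ] && echo YELLOW || echo OK ;;
    complaints)       # more than 5 player complaints in 15 minutes
      [ "$value" -gt 5 ] && echo YELLOW || echo OK ;;
    memory_pct)       # memory usage >90% for >20 minutes
      [ "$value" -gt 90 ] && [ "$duration" -gt 20 ] && echo YELLOW || echo OK ;;
    *)
      echo UNKNOWN ;;
  esac
}
```

A monitoring job could feed it readings scraped from Uptime Kuma or `ps` and post to #staff-lounge whenever anything returns `YELLOW`.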
---

## 🔧 COMMON YELLOW ALERT SCENARIOS

### Scenario 1: Single Game Server Down

**Quick diagnostics (via the Pterodactyl panel):**

1. Check server status in the panel
2. View the console for errors
3. Check resource usage graphs

**Common causes:**

- Out of memory (OOM)
- Crash from a mod conflict
- World corruption
- Java process died

**Resolution (restart via the panel):**

1. Stop the server
2. Wait 30 seconds
3. Start the server
4. Monitor the console for a successful startup
5. Test a player connection

**If the restart fails:**

- Check logs for error messages
- Restore from backup if the world is corrupted
- Roll back recent mod changes
- Allocate more RAM if OOM

**Recovery time:** 5-15 minutes

---

### Scenario 2: Low TPS / Server Lag

**Diagnostics:**

```bash
# In-game
/tps
/forge tps

# Via SSH
top -u minecraft
htop
iostat
```

**Common causes:**

- Chunk loading lag
- Redstone contraptions
- Mob farms
- Memory pressure
- Disk I/O bottleneck

**Quick fixes:**

```bash
# Clear non-player entities (in-game; destructive - this also removes
# item frames, armor stands, and dropped items)
/kill @e[type=!player]

# Reduce view distance temporarily
# (via server.properties or Pterodactyl)

# Restart the server during a low-traffic window
```

**Long-term solutions:**

- Optimize JVM flags (see optimization guide)
- Add more RAM
- Limit chunk loading
- Remove lag-causing builds

**Recovery time:** 10-30 minutes

---

### Scenario 3: Pterodactyl Panel Inaccessible

**Quick checks:**

```bash
# Connect to the panel server (45.94.168.138)
ssh root@45.94.168.138

# Check panel services
systemctl status pteroq
systemctl status wings

# Check Nginx
systemctl status nginx

# Check the database
systemctl status mariadb
```

**Common fixes:**

```bash
# Restart panel services
systemctl restart pteroq wings nginx

# Check disk space (a common cause)
df -h

# If it is a database issue
systemctl restart mariadb
```

**Recovery time:** 5-10 minutes

---

### Scenario 4: Billing/Whitelist Manager Down

**Impact:** Players cannot subscribe or whitelist.

**Diagnostics:**

```bash
# Connect to the billing VPS (38.68.14.188)
ssh root@38.68.14.188

# Check services
systemctl status paymenter
systemctl status whitelist-manager
systemctl status nginx
```

**Quick fix:**

```bash
systemctl restart [affected-service]
```

**Recovery time:** 2-5 minutes

---

### Scenario 5: Frostwall Tunnel Degraded

**Symptoms:**

- High latency on a specific node
- Packet loss
- Intermittent disconnections

**Diagnostics:**

```bash
# On the Command Center
ping 10.0.1.2   # TX1 tunnel
ping 10.0.2.2   # NC1 tunnel

# Check tunnel interfaces
ip link show gre-tx1
ip link show gre-nc1

# Check routing
ip route show
```

**Quick fix:**

```bash
# Restart a specific tunnel
ip link set gre-tx1 down
ip link set gre-tx1 up

# Or restart all networking
systemctl restart networking
```

**Recovery time:** 5-10 minutes

---

### Scenario 6: High Memory Usage (Pre-OOM)

**Warning signs:**

- Memory >90% on any server
- Swap usage increasing
- JVM GC warnings in logs

**Immediate action:**

```bash
# Identify the memory hog
htop
ps aux --sort=-%mem | head

# If it is a game server:
#   schedule a restart during low traffic

# If it is another service:
systemctl restart [service]
```

**Prevention:**

- Enable swap if not present
- Right-size RAM allocations
- Schedule regular restarts

**Recovery time:** 5-20 minutes

---

### Scenario 7: Discord Bot Offline

**Impact:** Automated features unavailable.

**Quick fix:**

```bash
# Restart the bot container/service
docker restart [bot-name]
# or
systemctl restart [bot-service]

# Also confirm the bot token has not expired
```

**Recovery time:** 2-5 minutes

---

## ✅ RESOLUTION VERIFICATION

**Before downgrading from Yellow Alert:**

- [ ] Affected service operational
- [ ] Players can connect to/use the service
- [ ] No error messages in logs
- [ ] Performance metrics normal
- [ ] Root cause identified
- [ ] Temporary or permanent fix applied
- [ ] Monitoring in place for recurrence

---

## 📢 RESOLUTION COMMUNICATION

**Public (if the incident was announced):**

```
✅ RESOLVED

[Service/Server] is now operational.

Cause: [Brief explanation]
Duration: [X minutes]

Thank you for your patience!
```

**Staff-only:**

```
Yellow Alert cleared: [Service]
Cause: [Details]
Fix: [What was done]
Prevention: [Next steps]
```

---

## 📊 ESCALATION TO RED ALERT

**Escalate if:**

- Multiple services are failing simultaneously
- Fix attempts are unsuccessful after 30 minutes
- The issue is worsening despite interventions
- A provider reports a hardware failure
- A security breach is suspected

**When escalating:**

- Follow the RED ALERT protocol immediately
- Document what was tried
- Preserve logs/state for diagnosis

---

## 🔄 POST-INCIDENT TASKS

**For significant Yellow Alerts:**

1. **Document the incident** (brief summary)
2. **Update monitoring** (prevent recurrence)
3. **Review capacity** (if resource-related)
4. **Schedule preventive maintenance** (if needed)

---

**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️

---

**Protocol Status:** ACTIVE
**Version:** 1.0