# Frostwall Protocol - Troubleshooting Guide
**Purpose:** Diagnose and resolve common Frostwall issues
**Last Updated:** 2026-02-17
**Status:** Ready for use
---
## Quick Diagnostics Checklist
When something's not working, run through this checklist first:
- [ ] Are GRE tunnels up? (`ip tunnel show`)
- [ ] Can you ping tunnel endpoints? (`ping 10.0.1.2`, `ping 10.0.2.2`)
- [ ] Is UFW blocking necessary traffic? (`ufw status verbose`)
- [ ] Are NAT rules present? (`iptables -t nat -L -n -v`)
- [ ] Is IP forwarding enabled? (`cat /proc/sys/net/ipv4/ip_forward`)
- [ ] Is the self-healing monitor running? (`crontab -l`)
- [ ] Did the server reboot recently? (tunnels may need manual restart)
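
The checklist can also be run in one pass. The sketch below is one possible way to script it and is not part of the deployed tooling: `check`, `forwarding_enabled`, and `run_checklist` are hypothetical names, the interface names and tunnel IPs are the ones used in this guide, and it assumes a root shell on the Command Center.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Frostwall): run one command quietly
# and print a single OK/FAIL line for it.
check() {
    local label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf '[ OK ] %s\n' "$label"
    else
        printf '[FAIL] %s\n' "$label"
        return 1
    fi
}

# True when the kernel's IPv4 forwarding switch reads "1".
# The optional file argument exists only so the function can be tested.
forwarding_enabled() {
    [ "$(cat "${1:-/proc/sys/net/ipv4/ip_forward}" 2>/dev/null)" = "1" ]
}

# Walk the checklist; returns non-zero if any item failed.
run_checklist() {
    local failed=0
    check "gre-tx1 interface exists"   ip link show gre-tx1    || failed=1
    check "gre-nc1 interface exists"   ip link show gre-nc1    || failed=1
    check "10.0.1.2 answers ping"      ping -c 2 -W 2 10.0.1.2 || failed=1
    check "10.0.2.2 answers ping"      ping -c 2 -W 2 10.0.2.2 || failed=1
    check "IP forwarding enabled"      forwarding_enabled      || failed=1
    check "monitor cron entry present" sh -c 'crontab -l | grep -q frostwall' || failed=1
    return "$failed"
}

# Usage (as root): source this file, then call run_checklist
```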
---
## Problem: Tunnel Won't Come Up
### Symptoms
- `ip tunnel show` shows no tunnel interface
- Cannot create tunnel with `ip tunnel add`
- Error: "RTNETLINK answers: File exists" or similar
### Diagnosis
**Step 1: Check if GRE module is loaded**
```bash
lsmod | grep gre
```
**Expected output:**
```
ip_gre 28672 0
gre 16384 1 ip_gre
```
**If not loaded:**
```bash
modprobe ip_gre
```
**Step 2: Check if tunnel already exists**
```bash
ip tunnel show
```
**If tunnel exists but is down:**
```bash
ip link set gre-tx1 up # or gre-nc1, gre-hub as appropriate
```
**Step 3: Verify remote endpoint is reachable**
```bash
ping 38.68.14.26 # TX1 physical IP
ping 216.239.104.130 # NC1 physical IP
```
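If the physical IPs aren't reachable, no traffic can cross the tunnel. GRE is stateless, so the tunnel interface may still come up, but pings to the tunnel IP will fail until the underlying path is fixed.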
### Solution
**Delete and recreate tunnel:**
```bash
# If tunnel exists
ip link set gre-tx1 down
ip tunnel del gre-tx1
# Recreate
ip tunnel add gre-tx1 mode gre remote 38.68.14.26 local 63.143.34.217 ttl 255
ip addr add 10.0.1.1/30 dev gre-tx1
ip link set gre-tx1 up
# Test
ping 10.0.1.2
```
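The same delete-and-recreate sequence can be wrapped in a function so that it is safe to run whether or not the tunnel currently exists. This is a sketch, not the deployed persistence script; `recreate_gre` is a hypothetical name, and the example arguments mirror the TX1 values above.

```bash
#!/usr/bin/env bash
# recreate_gre NAME REMOTE_IP LOCAL_IP TUNNEL_ADDR
# Hypothetical helper: remove the tunnel if it exists, then build it again.
recreate_gre() {
    local name=$1 remote=$2 local_ip=$3 addr=$4
    # Only tear down when the interface is actually there, so a missing
    # tunnel does not produce an error.
    if ip link show "$name" >/dev/null 2>&1; then
        ip link set "$name" down
        ip tunnel del "$name"
    fi
    ip tunnel add "$name" mode gre remote "$remote" local "$local_ip" ttl 255
    ip addr add "$addr" dev "$name"
    ip link set "$name" up
}

# Usage (as root on the Command Center):
#   recreate_gre gre-tx1 38.68.14.26 63.143.34.217 10.0.1.1/30 && ping -c 4 10.0.1.2
```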
---
## Problem: Can't Ping Tunnel IP
### Symptoms
- Tunnel shows as "UP" in `ip link show`
- `ping 10.0.1.2` times out or fails
- No response from remote tunnel endpoint
### Diagnosis
**Step 1: Verify tunnel interface is actually up**
```bash
ip link show gre-tx1
```
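Look for `UP` and `LOWER_UP` in the interface flags. GRE interfaces often report `state UNKNOWN` even when healthy, so rely on the flags rather than the `state` field.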
**Step 2: Check if UFW is blocking GRE**
```bash
ufw status verbose | grep -i gre
```
**Expected:**
```
Anywhere ALLOW Anywhere # allow GRE
47 ALLOW Anywhere
```
**Step 3: Check routing table**
```bash
ip route show
```
You should see routes for tunnel IPs:
```
10.0.1.0/30 dev gre-tx1 proto kernel scope link src 10.0.1.1
```
**Step 4: Check if remote server has tunnel up**
```bash
# SSH to remote server
ssh root@38.68.14.26
# Check tunnel
ip link show gre-hub
ping 10.0.1.1
```
### Solution
**On both Command Center and remote node:**
```bash
# Restart both ends of the tunnel
# Command Center:
ip link set gre-tx1 down
sleep 2
ip link set gre-tx1 up
# Remote (TX1):
ip link set gre-hub down
sleep 2
ip link set gre-hub up
# Test from both sides
ping 10.0.1.2 # From Command Center
ping 10.0.1.1 # From TX1
```
**If UFW is blocking:**
```bash
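# On Command Center: allow GRE from each remote node
# (UFW's full rule syntax expects a from/to clause)
ufw allow proto gre from 38.68.14.26      # TX1
ufw allow proto gre from 216.239.104.130  # NC1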
# On TX1/NC1
ufw allow from 63.143.34.217 proto gre
```
---
## Problem: Port Forwarding Not Working
### Symptoms
- Players can't connect to game servers
- `telnet 63.143.34.217 25565` times out or refuses
- Tunnel is up and pingable, but game traffic doesn't flow
### Diagnosis
**Step 1: Check if NAT rules exist**
```bash
iptables -t nat -L PREROUTING -n -v
```
**Expected output (example for port 25565):**
```
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:25565 to:10.0.1.2:25565
```
**Step 2: Check FORWARD chain**
```bash
iptables -L FORWARD -n -v
```
**Expected:**
```
ACCEPT tcp -- 0.0.0.0/0 10.0.1.2 tcp dpt:25565
```
**Step 3: Verify IP forwarding is enabled**
```bash
cat /proc/sys/net/ipv4/ip_forward
```
**Expected:** `1`
**Step 4: Test from Command Center itself**
```bash
# From Command Center, test connection to tunnel IP
telnet 10.0.1.2 25565
```
If this works but external connections don't, the problem is on the Command Center: check the NAT and FORWARD rules.
**Step 5: Check if game server is actually listening**
```bash
# SSH to TX1
ssh root@38.68.14.26
# Check if Minecraft is listening
netstat -tuln | grep 25565
```
**Expected:**
```
tcp6 0 0 :::25565 :::* LISTEN
```
### Solution
**Add missing NAT rules:**
```bash
# On Command Center
iptables -t nat -A PREROUTING -p tcp --dport 25565 -j DNAT --to-destination 10.0.1.2:25565
iptables -t nat -A PREROUTING -p udp --dport 25565 -j DNAT --to-destination 10.0.1.2:25565
iptables -A FORWARD -p tcp -d 10.0.1.2 --dport 25565 -j ACCEPT
iptables -A FORWARD -p udp -d 10.0.1.2 --dport 25565 -j ACCEPT
# Add masquerading if not present
iptables -t nat -A POSTROUTING -o gre-tx1 -j MASQUERADE
# Save rules
iptables-save > /etc/iptables/rules.v4
```
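Because `iptables -A` appends unconditionally, re-running the commands above leaves duplicate rules. One way to avoid that is to test with `-C` (check) before appending. The sketch below assumes the rules are managed with the plain `iptables` command; `ensure_rule` is a hypothetical helper, not something Frostwall ships.

```bash
#!/usr/bin/env bash
# Hypothetical helper: append an iptables rule only if an identical one
# is not already present.
# Usage: ensure_rule [-t TABLE] CHAIN RULE-SPEC...
ensure_rule() {
    local table=filter
    if [ "$1" = "-t" ]; then
        table=$2
        shift 2
    fi
    local chain=$1
    shift
    # -C exits 0 when a matching rule exists, non-zero otherwise.
    if iptables -t "$table" -C "$chain" "$@" 2>/dev/null; then
        echo "already present: $table/$chain $*"
    else
        iptables -t "$table" -A "$chain" "$@"
        echo "added: $table/$chain $*"
    fi
}

# Usage (as root on the Command Center):
#   ensure_rule -t nat PREROUTING -p tcp --dport 25565 -j DNAT --to-destination 10.0.1.2:25565
#   ensure_rule FORWARD -p tcp -d 10.0.1.2 --dport 25565 -j ACCEPT
#   ensure_rule -t nat POSTROUTING -o gre-tx1 -j MASQUERADE
```

The helper prints whether each rule was added or already present, which makes it easy to see what a repeated run actually changed.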
**Enable IP forwarding if disabled:**
```bash
echo 1 > /proc/sys/net/ipv4/ip_forward
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf
sysctl -p
```
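Appending to `/etc/sysctl.conf` with `>>` adds another copy of the line every time the fix is repeated. A small guard avoids that. The sketch below relies only on standard `grep`; `persist_line` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: append LINE to FILE only if FILE does not
# already contain exactly that line.
persist_line() {
    local line=$1 file=$2
    # -x: match the whole line, -F: treat LINE as a literal string
    if grep -qxF -e "$line" "$file" 2>/dev/null; then
        return 0
    fi
    printf '%s\n' "$line" >> "$file"
}

# Usage (as root):
#   persist_line "net.ipv4.ip_forward=1" /etc/sysctl.conf && sysctl -p
```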
---
## Problem: Tunnel Works But Breaks After Reboot
### Symptoms
- Tunnel works fine until server reboots
- After reboot, tunnel doesn't come back up
- Must manually recreate tunnel every time
### Diagnosis
**Check if persistence script exists:**
```bash
ls -la /etc/network/if-up.d/frostwall-*
```
**Check if it's executable:**
```bash
ls -la /etc/network/if-up.d/frostwall-tunnels
```
Should show: `-rwxr-xr-x`
**Check if script is being called on boot:**
```bash
# Check recent boot logs
journalctl -b | grep frostwall
```
### Solution
**Create or fix persistence script:**
See deployment-plan.md Phase 1.3 for full scripts.
**Make sure it's executable:**
```bash
chmod +x /etc/network/if-up.d/frostwall-tunnels
```
**Test the script manually:**
```bash
# Bring tunnel down first
ip link set gre-tx1 down
# Run the script
/etc/network/if-up.d/frostwall-tunnels
# Check if tunnel came back up
ip tunnel show
ping 10.0.1.2
```
**Alternative: Use systemd service**
If if-up.d hooks don't work, create a systemd service:
`/etc/systemd/system/frostwall-tunnels.service`:
```
[Unit]
Description=Frostwall GRE Tunnels
Wants=network-online.target
After=network-online.target
[Service]
Type=oneshot
ExecStart=/etc/network/if-up.d/frostwall-tunnels
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
```bash
systemctl daemon-reload
systemctl enable frostwall-tunnels
systemctl start frostwall-tunnels
```
---
## Problem: Self-Healing Monitor Not Running
### Symptoms
- Tunnels go down and don't auto-recover
- No entries in `/var/log/frostwall-monitor.log`
- Cron job not running
### Diagnosis
**Check if cron job is scheduled:**
```bash
crontab -l | grep frostwall
```
**Expected:**
```
*/5 * * * * /usr/local/bin/frostwall-monitor.sh
```
**Check if script exists:**
```bash
ls -la /usr/local/bin/frostwall-monitor.sh
```
**Check if it's executable:**
```bash
test -x /usr/local/bin/frostwall-monitor.sh && echo "executable" || echo "NOT executable"
```
**Run script manually to test:**
```bash
/usr/local/bin/frostwall-monitor.sh
```
**Check logs:**
```bash
cat /var/log/frostwall-monitor.log
```
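A script that works when run by hand can still fail under cron, because cron starts jobs with a much smaller environment. On many systems its default `PATH` is only `/usr/bin:/bin`, which can hide tools installed under `/sbin` or `/usr/sbin`. The sketch below re-runs a command in a cron-like environment to expose that kind of failure. `run_like_cron` is a hypothetical helper, and the exact default `PATH` on your distribution is an assumption worth checking in `crontab(5)`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command line roughly the way cron would,
# with an emptied environment and a minimal PATH (assumed value).
run_like_cron() {
    env -i HOME="${HOME:-/root}" SHELL=/bin/sh PATH=/usr/bin:/bin \
        /bin/sh -c "$1"
}

# Usage:
#   run_like_cron /usr/local/bin/frostwall-monitor.sh; echo "exit status: $?"
```

If the script succeeds interactively but fails here with "command not found", using absolute paths inside the script or setting `PATH` at the top of the crontab are the usual remedies.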
### Solution
**Add cron job if missing:**
```bash
crontab -e
```
Add:
```
*/5 * * * * /usr/local/bin/frostwall-monitor.sh
```
**Fix script permissions:**
```bash
chmod +x /usr/local/bin/frostwall-monitor.sh
```
**Create log file if it doesn't exist:**
```bash
touch /var/log/frostwall-monitor.log
chmod 644 /var/log/frostwall-monitor.log
```
**Test the monitor:**
```bash
# Bring down a tunnel manually
ip link set gre-tx1 down
# Wait for the next cron run (up to 5 minutes, plus a margin)
sleep 330
# Check if it auto-recovered
ping 10.0.1.2
# Check logs
tail /var/log/frostwall-monitor.log
```
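For orientation while debugging, the core of a self-healing check can be very small. The following is NOT the deployed `frostwall-monitor.sh` (see deployment-plan.md for that); it is a minimal sketch of the general technique. The `heal_tunnel` and `log` functions and the `LOG_FILE` variable are introduced here purely for illustration.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; the real monitor is defined in deployment-plan.md.
LOG_FILE=${LOG_FILE:-/var/log/frostwall-monitor.log}

# Append a timestamped message to the log file.
log() {
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$LOG_FILE"
}

# heal_tunnel IFACE PEER_IP
# Pings the peer's tunnel IP; if that fails, bounces the interface and
# pings again. Returns 0 if the tunnel answers in the end, 1 otherwise.
heal_tunnel() {
    local iface=$1 peer=$2
    if ping -c 3 -W 2 "$peer" >/dev/null 2>&1; then
        return 0
    fi
    log "$iface: $peer unreachable, restarting interface"
    ip link set "$iface" down
    sleep 2
    ip link set "$iface" up
    if ping -c 3 -W 2 "$peer" >/dev/null 2>&1; then
        log "$iface: recovered"
        return 0
    fi
    log "$iface: still down after restart"
    return 1
}

# Usage (as root, e.g. from cron):
#   heal_tunnel gre-tx1 10.0.1.2
```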
---
## Problem: High Latency or Packet Loss Through Tunnel
### Symptoms
- Players experience lag
- `ping 10.0.1.2` shows high latency or packet loss
- Game connections are unstable
### Diagnosis
**Test latency:**
```bash
# From Command Center
ping -c 100 10.0.1.2 | tail -5
```
Look for:
- Average latency (should be <2ms for TX1, ~30-40ms for NC1)
- Packet loss (should be 0%)
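**Test MTU size:**
```bash
# -M do sets the Don't Fragment bit; -s is the ICMP payload size.
# Packet size = payload + 28 bytes (20 IP + 8 ICMP).
ping -M do -c 4 -s 1448 10.0.1.2 # 1476-byte packet: fills the typical GRE MTU (1500 - 24)
ping -M do -c 4 -s 1372 10.0.1.2 # 1400-byte packet: fits a reduced MTU of 1400
```
If the larger packet fails but the smaller one succeeds, MTU is the issue.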
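
The sizes used in an MTU probe follow from header arithmetic. A basic GRE tunnel adds 24 bytes (a 20-byte outer IPv4 header plus a 4-byte GRE header), and `ping -s` counts only the ICMP payload, which travels inside 28 bytes of IPv4 and ICMP headers. The sketch below encodes that arithmetic and binary-searches for the largest payload that passes with the Don't Fragment bit set. The function names are hypothetical, and the 24-byte figure assumes GRE without keys, checksums, or sequence numbers.

```bash
#!/usr/bin/env bash
# Tunnel MTU for a given underlying link MTU (plain GRE over IPv4).
gre_mtu() { echo $(( $1 - 24 )); }

# Largest `ping -s` payload that fits in a given MTU (IPv4 + ICMP = 28 bytes).
ping_payload() { echo $(( $1 - 28 )); }

# One probe with Don't Fragment set; succeeds if a reply came back.
probe() { ping -M do -c 1 -W 1 -s "$1" "$2" >/dev/null 2>&1; }

# max_payload HOST [UPPER]
# Binary search for the largest payload in 0..UPPER that gets through.
# Prints 0 if no probe succeeds at all.
max_payload() {
    local host=$1 lo=0 hi=${2:-1472} mid
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if probe "$mid" "$host"; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$lo"
}

# Usage:
#   p=$(max_payload 10.0.1.2); echo "usable MTU through tunnel: $((p + 28))"
```

On a link that is already dropping packets, a single lost probe can make the result look smaller than it really is, so repeat the search before changing the MTU.
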
**Check CPU load:**
```bash
top
```
High CPU on either end of tunnel could cause performance issues.
**Check bandwidth:**
```bash
# Install iperf3 if not present
apt install iperf3
# On TX1:
iperf3 -s
# On Command Center:
iperf3 -c 10.0.1.2
```
### Solution
**Adjust MTU if needed:**
```bash
# On tunnel interface
ip link set gre-tx1 mtu 1400
```
**Add to persistence script:**
```bash
# In /etc/network/if-up.d/frostwall-tunnels
ip link set gre-tx1 mtu 1400
```
**If packet loss persists:**
- Check the physical network between nodes
- Contact the datacenter if the problem continues
- Verify that no other services are saturating bandwidth
---
## Problem: UFW Blocking Legitimate Traffic
### Symptoms
- Can't SSH to server
- Specific ports not working despite NAT rules
- Connection refused or timeout
### Diagnosis
**Check UFW status:**
```bash
ufw status verbose
```
**Check UFW logs:**
```bash
tail -100 /var/log/ufw.log
```
Look for BLOCK entries for the port/IP you're trying to reach.
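
Reading raw log lines is slow when there are many of them. Assuming the usual kernel log format, in which blocked packets carry a `[UFW BLOCK]` tag followed by `SRC=`, `PROTO=`, and `DPT=` fields, the sketch below counts blocks per source address and destination port. `summarize_ufw_blocks` is a hypothetical helper. Protocols without ports, such as GRE, appear with `?` in place of the port.

```bash
#!/usr/bin/env bash
# Hypothetical helper: summarize "[UFW BLOCK]" entries as
#   COUNT SOURCE_IP PROTO/DEST_PORT
# sorted with the most frequent first.
# Usage: summarize_ufw_blocks [LOGFILE]   (default: /var/log/ufw.log)
summarize_ufw_blocks() {
    awk '/\[UFW BLOCK\]/ {
        src = dpt = proto = "?"
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^SRC=/)   src   = substr($i, 5)
            if ($i ~ /^DPT=/)   dpt   = substr($i, 5)
            if ($i ~ /^PROTO=/) proto = substr($i, 7)
        }
        count[src " " proto "/" dpt]++
    }
    END { for (k in count) printf "%d %s\n", count[k], k }' \
        "${1:-/var/log/ufw.log}" | sort -rn
}

# Usage:
#   summarize_ufw_blocks | head -20
```
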
**Test with UFW temporarily disabled:**
```bash
ufw disable
# Try connection
# Re-enable immediately
ufw enable
```
**⚠️ WARNING:** Disable UFW only for brief testing, and re-enable it immediately afterwards.
### Solution
**Add specific rule for your management IP:**
```bash
ufw allow from MANAGEMENT_IP to any port 22 proto tcp
```
**Allow traffic on tunnel interfaces:**
```bash
ufw allow in on gre-tx1
ufw allow in on gre-nc1
ufw allow in on gre-hub # On TX1/NC1
```
**Check rule order:**
```bash
ufw status numbered
```
UFW evaluates rules in order and stops at the first match, so allow rules must come before any deny rule that would match the same traffic.
**Delete and re-add rules if needed:**
```bash
# Delete rule by number
ufw delete 5
# Re-add in correct order
ufw insert 1 allow from MANAGEMENT_IP to any port 22
```
---
## Emergency Recovery Procedures
### Complete Tunnel Failure
**If all troubleshooting fails, rebuild from scratch:**
```bash
# On Command Center
ip link set gre-tx1 down
ip link set gre-nc1 down
ip tunnel del gre-tx1
ip tunnel del gre-nc1
# On TX1
ip link set gre-hub down
ip tunnel del gre-hub
# On NC1
ip link set gre-hub down
ip tunnel del gre-hub
# Then follow deployment-plan.md Phase 1 to rebuild
```
### Lost SSH Access
**If locked out due to UFW misconfiguration:**
1. Access server via provider's console (IPMI, VNC, etc.)
2. Log in as root
3. Disable UFW: `ufw disable`
4. Fix rules, re-enable carefully
5. Test SSH before closing console session
### Complete Frostwall Removal (Rollback)
**If you need to remove Frostwall entirely:**
```bash
# Stop monitoring
crontab -e # Remove frostwall-monitor line
# Remove tunnels
ip link set gre-tx1 down
ip link set gre-nc1 down
ip tunnel del gre-tx1
ip tunnel del gre-nc1
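# Remove NAT rules
# WARNING: these flush ALL iptables rules, not only Frostwall's.
# Set default policies to ACCEPT first; flushing under a DROP policy cuts off SSH.
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -t nat -F
iptables -F
# Reset UFW to installation defaults (this disables UFW and backs up the old rule files)
ufw --force reset
# Re-add basic rules (at minimum SSH), then run: ufw enable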
# Remove persistence scripts
rm /etc/network/if-up.d/frostwall-*
rm /usr/local/bin/frostwall-monitor.sh
# Update DNS to point directly to server IPs
```
---
## Common Error Messages
### "RTNETLINK answers: File exists"
**Meaning:** Tunnel with that name already exists
**Solution:**
```bash
ip tunnel del gre-tx1 # Delete existing
# Then recreate
```
### "RTNETLINK answers: Network is unreachable"
**Meaning:** Can't reach remote endpoint
**Solution:**
- Verify remote IP is correct
- Check if physical network to remote is up
- Ping remote physical IP
### "GRE: DF set but fragmentation needed"
**Meaning:** MTU mismatch, packet too large
**Solution:**
```bash
ip link set gre-tx1 mtu 1400
```
### "Operation not permitted"
**Meaning:** Not running as root or module not loaded
**Solution:**
```bash
sudo su # Become root
modprobe ip_gre # Load module
```
---
## Monitoring and Health Checks
**Daily health check commands:**
```bash
# Check all tunnels are up
ip tunnel show
# Ping all tunnel endpoints
ping -c 4 10.0.1.2
ping -c 4 10.0.2.2
# Check monitor log
tail -20 /var/log/frostwall-monitor.log
# Verify NAT rules
iptables -t nat -L -n -v | head -20
```
**Set up alerts (optional):**
```bash
# Add to monitor script to send email on failure
# Requires mail configured
echo "Tunnel failure detected" | mail -s "ALERT: Frostwall Tunnel Down" admin@firefrostgaming.com
```
---
## Getting Help
If none of these troubleshooting steps resolve your issue:
1. **Gather diagnostics:**
```bash
ip tunnel show > /tmp/frostwall-diag.txt
ip addr show >> /tmp/frostwall-diag.txt
ip route show >> /tmp/frostwall-diag.txt
iptables -t nat -L -n -v >> /tmp/frostwall-diag.txt
ufw status verbose >> /tmp/frostwall-diag.txt
tail -100 /var/log/frostwall-monitor.log >> /tmp/frostwall-diag.txt
```
2. **Document symptoms:**
- What were you trying to do?
- What happened instead?
- When did it start?
- What changed recently?
3. **Check documentation:**
- Review deployment-plan.md
- Review ip-hierarchy.md
4. **Ask The Chronicler** (future Claude session) with full diagnostics
---
**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️
---
**Document Status:** TROUBLESHOOTING GUIDE
**Update When:** New issues discovered, solutions found, error messages encountered