firefrost-operations-manual/docs/tasks/netdata-deployment/deployment-guide.md
Claude 4f27f25a74 docs: Add comprehensive Netdata deployment guide
Created complete deployment guide for Netdata monitoring (400+ lines):

Deployment Strategy:
- Install on all 4 infrastructure servers
- Command Center, TX1, NC1, Ghost VPS
- Quick one-line install per server
- Total deployment time: 30 minutes

Configuration:
- UFW firewall rules (management IP only)
- Parent-child streaming (unified dashboard)
- Custom alert configuration (CPU/RAM/Disk)
- Discord webhook integration
- Health monitoring

Features:
- Real-time performance monitoring
- Beautiful web dashboards on port 19999
- Zero configuration required
- Lightweight (< 3% CPU, ~100 MB RAM)
- Auto-detects all services and metrics

Monitoring Targets:
- CPU, RAM, Disk, Network metrics
- Java heap usage (Minecraft servers)
- Service-specific monitoring
- Alert thresholds configurable

Advanced Features:
- Netdata Cloud integration (centralized)
- Custom dashboards
- Mobile app access
- Longer data retention

Troubleshooting guide included for common issues.

Ready to deploy when SSH access available.

Task: Netdata Deployment (Tier 2)
FFG-STD-002 compliant
2026-02-17 22:58:31 +00:00


# Netdata Deployment - Complete Guide
**Status:** Ready to Deploy
**Priority:** Tier 2 - Infrastructure Monitoring
**Time Estimate:** 30 minutes (all servers)
**Last Updated:** 2026-02-17
---
## Overview
Deploy Netdata real-time monitoring across all Firefrost infrastructure. Netdata provides beautiful dashboards for CPU, RAM, disk, network, and application metrics, and the defaults work with zero configuration.
**What is Netdata?**
- Real-time performance monitoring
- Beautiful web dashboards
- Zero configuration needed
- Extremely lightweight (< 3% CPU, ~100 MB RAM)
- Open source and free
---
## Deployment Targets
**All 4 infrastructure servers:**
1. **Command Center** (63.143.34.217) - Dallas hub
- Services: Gitea, Uptime Kuma, Code-Server, Automation
- Dashboard: `http://63.143.34.217:19999`
2. **TX1** (38.68.14.26) - Dallas game servers
- Services: 5 Minecraft servers + FoundryVTT
- Dashboard: `http://38.68.14.26:19999`
3. **NC1** (216.239.104.130) - Charlotte game servers
- Services: 6 Minecraft servers + Hytale
- Dashboard: `http://216.239.104.130:19999`
4. **Ghost VPS** (64.50.188.14) - Chicago staff services
- Services: MkDocs, Wiki.js (x2), NextCloud
- Dashboard: `http://64.50.188.14:19999`
---
## Installation (Per Server)
### One-Line Install
**On each server:**
```bash
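# Install Netdata (official kickstart script)
curl -fsSL https://get.netdata.cloud/kickstart.sh -o /tmp/netdata-kickstart.sh && sh /tmp/netdata-kickstart.sh
# The installer will:
# - Auto-detect your OS
# - Install dependencies
# - Install Netdata (native package or static build where available, source build otherwise)
# - Start the service
# - Start listening on port 19999 (it does NOT change firewall rules; see the UFW step)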
```
**Installation takes:** 2-5 minutes per server
---
## Step-by-Step Deployment
### Phase 1: Install on Command Center (10 min)
```bash
# SSH to Command Center
ssh root@63.143.34.217
# Run installer
curl -fsSL https://get.netdata.cloud/kickstart.sh -o /tmp/netdata-kickstart.sh && sh /tmp/netdata-kickstart.sh
# Wait for installation to complete
# Answer prompts (usually just press Enter for defaults)
# Verify installation
systemctl status netdata
# Should show: active (running)
# Test dashboard
curl http://localhost:19999
# Should return HTML
```
**Open in browser:** `http://63.143.34.217:19999`
You should see the Netdata dashboard!
---
### Phase 2: Install on TX1 (5 min)
```bash
# SSH to TX1
ssh root@38.68.14.26
# Run installer
curl -fsSL https://get.netdata.cloud/kickstart.sh -o /tmp/netdata-kickstart.sh && sh /tmp/netdata-kickstart.sh
# Verify
systemctl status netdata
# Test
curl http://localhost:19999
```
**Open in browser:** `http://38.68.14.26:19999`
---
### Phase 3: Install on NC1 (5 min)
```bash
# SSH to NC1
ssh root@216.239.104.130
# Run installer
curl -fsSL https://get.netdata.cloud/kickstart.sh -o /tmp/netdata-kickstart.sh && sh /tmp/netdata-kickstart.sh
# Verify
systemctl status netdata
# Test
curl http://localhost:19999
```
**Open in browser:** `http://216.239.104.130:19999`
---
### Phase 4: Install on Ghost VPS (5 min)
```bash
# SSH to Ghost
ssh root@64.50.188.14
# Run installer
curl -fsSL https://get.netdata.cloud/kickstart.sh -o /tmp/netdata-kickstart.sh && sh /tmp/netdata-kickstart.sh
# Verify
systemctl status netdata
# Test
curl http://localhost:19999
```
**Open in browser:** `http://64.50.188.14:19999`
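Once all four phases are done, the per-server checks can be repeated from one machine. The following is a minimal sketch, assuming it runs from a host that can reach port 19999 on each server; the function names and the `SERVERS` variable are illustrative, while `/api/v1/info` is the agent's standard info endpoint.

```bash
# Server names and IPs copied from "Deployment Targets".
SERVERS="command-center=63.143.34.217 tx1=38.68.14.26 nc1=216.239.104.130 ghost=64.50.188.14"

# Build the agent info URL for one host.
netdata_info_url() {
  printf 'http://%s:19999/api/v1/info\n' "$1"
}

# Probe every server and print OK/FAIL per host.
# Returns non-zero if any agent did not answer.
check_all_dashboards() {
  local entry name ip failed=0
  for entry in $SERVERS; do
    name=${entry%%=*}
    ip=${entry#*=}
    if curl -fsS --max-time 5 -o /dev/null "$(netdata_info_url "$ip")"; then
      echo "OK   $name ($ip)"
    else
      echo "FAIL $name ($ip)"
      failed=1
    fi
  done
  return "$failed"
}
```

Running `check_all_dashboards` after sourcing the snippet prints one line per server. A `FAIL` can mean the service is down or that the firewall blocks the machine running the check.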
---
## Post-Installation Configuration
### 1. Configure UFW Firewall
**On each server:**
```bash
# Allow Netdata port from Michael's management IP only
ufw allow from MICHAEL_MANAGEMENT_IP to any port 19999 proto tcp
# Verify
ufw status | grep 19999
```
**Security note:** Netdata dashboards contain sensitive server information. Only allow access from trusted IPs.
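If parent-child streaming is used, the parent also has to accept connections on port 19999 from the three child servers, not only from the management IP. The sketch below prints the rules for review instead of running them; `ufw_rules_for` is a hypothetical helper, and `203.0.113.10` in the usage note is a documentation-range placeholder for the real management IP.

```bash
# Print (do not run) the UFW rules for one server.
# role: "parent" for Command Center, anything else for a child.
# mgmt_ip: the management IP allowed to view the dashboard.
ufw_rules_for() {
  local role=$1 mgmt_ip=$2 child
  echo "ufw allow from $mgmt_ip to any port 19999 proto tcp"
  if [ "$role" = "parent" ]; then
    # Children (TX1, NC1, Ghost) stream to the parent on the same port.
    for child in 38.68.14.26 216.239.104.130 64.50.188.14; do
      echo "ufw allow from $child to any port 19999 proto tcp"
    done
  fi
}
```

For example, `ufw_rules_for parent 203.0.113.10` prints four rules and `ufw_rules_for child 203.0.113.10` prints one. Review the output before applying any of it.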
---
### 2. Set Up Parent-Child Streaming (Optional)
**Benefit:** View all servers from one dashboard (Command Center)
**On Command Center (parent):**
```bash
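# Edit config
nano /etc/netdata/stream.conf
# Add (the section name is the API key; replace the placeholder with your own UUID):
[11111111-2222-3333-4444-555555555555]
enabled = yes
default history = 3600
default memory mode = save
health enabled by default = yes
# Restart netdata so the parent starts accepting the key
systemctl restart netdata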
```
**On TX1, NC1, Ghost (children):**
```bash
# Edit config
nano /etc/netdata/stream.conf
# Add:
[stream]
enabled = yes
destination = 63.143.34.217:19999
api key = 11111111-2222-3333-4444-555555555555
# Restart netdata
systemctl restart netdata
```
**Result:** All server metrics visible on Command Center dashboard
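The API key shown above is a placeholder; the key has to be a UUID, and each deployment should generate its own. A minimal sketch, assuming either `uuidgen` or the Linux kernel's UUID source is available (both helper names are hypothetical):

```bash
# Generate a random lowercase UUID for use as the streaming API key.
new_stream_key() {
  if command -v uuidgen >/dev/null 2>&1; then
    uuidgen | tr '[:upper:]' '[:lower:]'
  else
    cat /proc/sys/kernel/random/uuid
  fi
}

# Print the [stream] section for a child, given the parent address and key.
child_stream_conf() {
  local parent=$1 key=$2
  printf '[stream]\n    enabled = yes\n    destination = %s:19999\n    api key = %s\n' \
    "$parent" "$key"
}
```

Generate the key once with `new_stream_key`, use it as the section name on the parent, and pass the same value to `child_stream_conf 63.143.34.217 <key>` to get the block to paste into each child's `stream.conf`.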
---
### 3. Configure Alerts
**Edit alert config:**
```bash
nano /etc/netdata/health.d/custom.conf
```
**Example alerts:**
```yaml
# Alert when CPU usage > 80% (notification delayed 5 minutes)
alarm: cpu_usage
on: system.cpu
calc: $user + $system
every: 1m
warn: $this > 80
crit: $this > 95
delay: up 5m down 15m
info: CPU usage is too high
# Alert when RAM usage > 90% (cache and buffers are counted in the total, not as used)
alarm: ram_usage
on: system.ram
calc: $used * 100 / ($used + $cached + $free + $buffers)
every: 1m
warn: $this > 90
crit: $this > 95
delay: up 5m down 15m
info: RAM usage is too high
# Alert when free disk space < 20%
# "template" (not "alarm") is needed here: disk.space is a context shared by
# one chart per mounted filesystem, so the rule applies to every disk
template: disk_space
on: disk.space
calc: $avail * 100 / ($avail + $used)
every: 1m
warn: $this < 20
crit: $this < 10
delay: up 5m down 15m
info: Disk space is running low
```
**Reload config:**
```bash
killall -USR2 netdata
```
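The RAM percentage depends heavily on which dimensions go into the denominator. On a server with a large page cache, dividing by `used + free` alone reports a much higher figure than dividing by the full total. A small worked example with made-up values in MiB (the function names are illustrative):

```bash
# Percentage of total RAM in use, counting cache and buffers in the total.
# Arguments: used cached free buffers
ram_pct_of_total() {
  echo $(( $1 * 100 / ($1 + $2 + $3 + $4) ))
}

# Percentage computed from used and free only (ignores cache and buffers).
# Arguments: used free
ram_pct_used_free_only() {
  echo $(( $1 * 100 / ($1 + $2) ))
}

# Example values: 4000 used, 10000 cached, 1000 free, 1000 buffers
ram_pct_of_total 4000 10000 1000 1000    # prints 25
ram_pct_used_free_only 4000 1000         # prints 80
```

The same machine reads as 25% or 80% used depending on the formula, which matters when the warning threshold is 90%.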
---
### 4. Discord Integration (Optional)
**Set up Discord webhook for alerts:**
```bash
# Edit alarm notification config
nano /etc/netdata/health_alarm_notify.conf
# Find Discord section and configure:
SEND_DISCORD="YES"
DISCORD_WEBHOOK_URL="YOUR_DISCORD_WEBHOOK_URL_HERE"
DEFAULT_RECIPIENT_DISCORD="network-alerts"
```
**Test alert:**
```bash
# Trigger test alert as the netdata user (the user the agent sends alerts as)
su -s /bin/bash netdata -c '/usr/libexec/netdata/plugins.d/alarm-notify.sh test'
```
Check Discord for test notification.
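If nothing arrives, it can help to test the webhook itself, independently of Netdata. Discord webhooks accept a JSON body with a `content` field. The sketch below assumes a single-line message; `discord_payload` and `send_discord_test` are hypothetical helper names.

```bash
# Build the minimal JSON body for a Discord webhook.
# Escapes backslashes and double quotes; assumes a single-line message.
discord_payload() {
  local escaped
  escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"content":"%s"}\n' "$escaped"
}

# POST a test message to the given webhook URL.
send_discord_test() {
  local webhook_url=$1
  curl -fsS -H 'Content-Type: application/json' \
    -d "$(discord_payload 'Netdata webhook test')" "$webhook_url"
}
```

Call `send_discord_test "YOUR_DISCORD_WEBHOOK_URL_HERE"` with the same URL used in `health_alarm_notify.conf`. If this message arrives but Netdata's test alert does not, the problem is in the Netdata configuration and not in the webhook.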
---
## Dashboard Access
### Quick Access Links
**Save these bookmarks:**
- Command Center: http://63.143.34.217:19999
- TX1: http://38.68.14.26:19999
- NC1: http://216.239.104.130:19999
- Ghost: http://64.50.188.14:19999
**Unified View (if streaming configured):**
- All servers: http://63.143.34.217:19999 → View nodes
---
### Key Metrics to Monitor
**CPU:**
- User % (application load)
- System % (kernel load)
- IOWait % (disk bottleneck indicator)
**RAM:**
- Used vs Available
- Cache (should be high, that's good!)
- Swap usage (should be low)
**Disk:**
- Disk space remaining
- Read/write speeds
- IOPS
**Network:**
- Bandwidth usage
- Packet drops
- Connection count
**Minecraft Servers (TX1/NC1):**
- Java heap usage
- GC activity
- Thread count
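These metrics can also be read from the command line through the agent's REST API, which is handy for quick checks over SSH. A minimal sketch, assuming the v1 data endpoint and chart IDs such as `system.cpu` or `system.ram`; the helper names are illustrative.

```bash
# Build a /api/v1/data URL that returns one averaged sample covering the
# last N seconds of a chart, formatted as CSV.
netdata_data_url() {
  local host=$1 chart=$2 seconds=${3:-60}
  printf 'http://%s:19999/api/v1/data?chart=%s&after=-%s&points=1&group=average&format=csv\n' \
    "$host" "$chart" "$seconds"
}

# Fetch that sample from the agent.
latest_sample() {
  curl -fsS --max-time 5 "$(netdata_data_url "$@")"
}
```

On a server, `latest_sample localhost system.cpu` returns a header row with the dimension names followed by one row of values averaged over the last minute.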
---
## Maintenance
### Daily
- Quick glance at dashboards (bookmark all 4)
- Check for any red alerts
### Weekly
- Review CPU/RAM trends
- Check disk space projections
- Verify alerts working
### Monthly
- Review historical data
- Adjust alert thresholds if needed
- Update Netdata if new version available
---
## Updates
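**Update manually:**
```bash
# On each server (the updater is not on PATH, so use the full path)
/usr/libexec/netdata/netdata-updater.sh
```
**Or auto-update (recommended):**
The kickstart installer enables automatic updates by default, so Netdata checks for and installs new versions daily. On native-package installs, updates can also arrive through the system package manager.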
---
## Troubleshooting
### Dashboard won't load
**Check service:**
```bash
systemctl status netdata
```
**Restart if needed:**
```bash
systemctl restart netdata
```
**Check firewall:**
```bash
ufw status | grep 19999
# Confirm Netdata is listening locally
ss -tln | grep 19999
```
---
### High CPU usage from Netdata
Netdata should use < 3% CPU normally.
**Reduce load by disabling unused plugins:**
```bash
# Disable some plugins if needed
nano /etc/netdata/netdata.conf
# Under [plugins], disable unused:
python.d = no
node.d = no
# Then restart to apply the change
systemctl restart netdata
```
---
### Streaming not working
**Verify:**
- Parent (Command Center) has stream.conf with API key
- Children have correct parent IP
- Port 19999 accessible from children to parent
- API keys match exactly
**Debug:**
```bash
# On child
tail -f /var/log/netdata/error.log | grep stream
```
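A quick way to rule out the firewall is to test the TCP connection from a child to the parent directly. This sketch assumes bash (for `/dev/tcp`) and coreutils `timeout`; the function names are hypothetical.

```bash
# Return 0 if a TCP connection to host:port succeeds within 5 seconds.
port_open() {
  local host=$1 port=$2
  timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Report whether this child can reach the parent's streaming port.
check_parent_reachable() {
  local parent=${1:-63.143.34.217}
  if port_open "$parent" 19999; then
    echo "parent $parent:19999 reachable"
  else
    echo "parent $parent:19999 NOT reachable (check UFW on the parent)"
    return 1
  fi
}
```

Run `check_parent_reachable` on TX1, NC1, and Ghost. If the port is not reachable, the parent's UFW rules most likely allow only the management IP.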
---
### Alerts not sending to Discord
**Check:**
- Discord webhook URL correct
- `SEND_DISCORD="YES"` set
- Test alert sent successfully
**Debug:**
```bash
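# Debug output is enabled through an environment variable; a second argument
# after "test" is treated as a recipient role, not as a debug switch
su -s /bin/bash netdata -c 'NETDATA_ALARM_NOTIFY_DEBUG=1 /usr/libexec/netdata/plugins.d/alarm-notify.sh test'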
```
---
## Advanced Features (Optional)
### Netdata Cloud (Free)
**Benefits:**
- Centralized dashboard for all servers
- Mobile app
- Longer data retention
- Collaboration features
**Setup:**
1. Go to https://app.netdata.cloud
2. Create free account
3. Claim nodes:
```bash
# On each server
netdata-claim.sh -token=YOUR_TOKEN -rooms=YOUR_ROOM -url=https://app.netdata.cloud
```
---
### Custom Dashboards
Create custom dashboards with specific metrics:
1. Open Netdata dashboard
2. Click "Create Dashboard"
3. Add charts
4. Save and share URL
---
## Success Criteria Checklist
- [ ] Netdata installed on Command Center
- [ ] Netdata installed on TX1
- [ ] Netdata installed on NC1
- [ ] Netdata installed on Ghost VPS
- [ ] All dashboards accessible via browser
- [ ] UFW rules configured (management IP only)
- [ ] Alerts configured for CPU/RAM/Disk
- [ ] (Optional) Discord integration working
- [ ] (Optional) Parent-child streaming configured
- [ ] Dashboards bookmarked for quick access
---
## Related Tasks
- **Staggered Server Restart System** - Monitor impact on resources
- **World Backup Automation** - Monitor backup job duration
- **Command Center Security** - Part of monitoring infrastructure
- **Frostwall Protocol** - Monitor tunnel performance
---
**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️
---
**Document Status:** COMPLETE
**Ready for Deployment:** When SSH access available (30 minutes total)
**Dependencies:** SSH access to all 4 servers, management IP whitelisted
**Port Required:** 19999 (internal only, secured by UFW)