docs: Add comprehensive Netdata deployment guide

Created complete deployment guide for Netdata monitoring (400+ lines):

Deployment Strategy:
- Install on all 4 infrastructure servers
- Command Center, TX1, NC1, Ghost VPS
- Quick one-line install per server
- Total deployment time: 30 minutes

Configuration:
- UFW firewall rules (management IP only)
- Parent-child streaming (unified dashboard)
- Custom alert configuration (CPU/RAM/Disk)
- Discord webhook integration
- Health monitoring

Features:
- Real-time performance monitoring
- Beautiful web dashboards on port 19999
- Zero configuration required
- Lightweight (< 3% CPU, ~100 MB RAM)
- Auto-detects all services and metrics

Monitoring Targets:
- CPU, RAM, Disk, Network metrics
- Java heap usage (Minecraft servers)
- Service-specific monitoring
- Alert thresholds configurable

Advanced Features:
- Netdata Cloud integration (centralized)
- Custom dashboards
- Mobile app access
- Longer data retention

Troubleshooting guide included for common issues.
Ready to deploy when SSH access available.

Task: Netdata Deployment (Tier 2)
FFG-STD-002 compliant
503
docs/tasks/netdata-deployment/deployment-guide.md
Normal file
@@ -0,0 +1,503 @@
# Netdata Deployment - Complete Guide

**Status:** Ready to Deploy
**Priority:** Tier 2 - Infrastructure Monitoring
**Time Estimate:** 30 minutes (all servers)
**Last Updated:** 2026-02-17

---

## Overview

Deploy Netdata real-time monitoring across all Firefrost infrastructure. Netdata provides beautiful dashboards for CPU, RAM, disk, network, and application metrics, and auto-detects most services with no manual configuration.

**What is Netdata?**
- Real-time performance monitoring
- Beautiful web dashboards
- Zero configuration needed
- Extremely lightweight (< 3% CPU, ~100 MB RAM)
- Open source and free

---

## Deployment Targets

**All 4 infrastructure servers:**

1. **Command Center** (63.143.34.217) - Dallas hub
   - Services: Gitea, Uptime Kuma, Code-Server, Automation
   - Dashboard: `http://63.143.34.217:19999`

2. **TX1** (38.68.14.26) - Dallas game servers
   - Services: 5 Minecraft servers + FoundryVTT
   - Dashboard: `http://38.68.14.26:19999`

3. **NC1** (216.239.104.130) - Charlotte game servers
   - Services: 6 Minecraft servers + Hytale
   - Dashboard: `http://216.239.104.130:19999`

4. **Ghost VPS** (64.50.188.14) - Chicago staff services
   - Services: MkDocs, Wiki.js (x2), NextCloud
   - Dashboard: `http://64.50.188.14:19999`

---

## Installation (Per Server)

### One-Line Install

**On each server:**

```bash
# Install Netdata
bash <(curl -Ss https://my-netdata.io/kickstart.sh)

# The installer will:
# - Auto-detect your OS
# - Install dependencies
# - Install Netdata (native package or static build where available,
#   compiling from source only as a fallback)
# - Start the service, which listens on port 19999
#   (the installer does not change firewall rules)
```

**Installation takes:** 2-5 minutes per server

---

## Step-by-Step Deployment

### Phase 1: Install on Command Center (10 min)

```bash
# SSH to Command Center
ssh root@63.143.34.217

# Run installer
bash <(curl -Ss https://my-netdata.io/kickstart.sh)

# Wait for installation to complete
# Answer prompts (usually just press Enter for defaults)

# Verify installation
systemctl status netdata

# Should show: active (running)

# Test dashboard
curl http://localhost:19999

# Should return HTML
```

**Open in browser:** `http://63.143.34.217:19999`

You should see the Netdata dashboard!

---

### Phase 2: Install on TX1 (5 min)

```bash
# SSH to TX1
ssh root@38.68.14.26

# Run installer
bash <(curl -Ss https://my-netdata.io/kickstart.sh)

# Verify
systemctl status netdata

# Test
curl http://localhost:19999
```

**Open in browser:** `http://38.68.14.26:19999`

---

### Phase 3: Install on NC1 (5 min)

```bash
# SSH to NC1
ssh root@216.239.104.130

# Run installer
bash <(curl -Ss https://my-netdata.io/kickstart.sh)

# Verify
systemctl status netdata

# Test
curl http://localhost:19999
```

**Open in browser:** `http://216.239.104.130:19999`

---

### Phase 4: Install on Ghost VPS (5 min)

```bash
# SSH to Ghost
ssh root@64.50.188.14

# Run installer
bash <(curl -Ss https://my-netdata.io/kickstart.sh)

# Verify
systemctl status netdata

# Test
curl http://localhost:19999
```

**Open in browser:** `http://64.50.188.14:19999`
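
The per-server checks in the phases above can be repeated from a management machine in one pass. The following is a minimal sketch, not part of the original procedure: `check_dashboard` and `check_all` are made-up helper names, and it assumes the agent's `/api/v1/info` endpoint is reachable from wherever the script runs (once UFW is configured, that means the management IP).

```bash
# Hypothetical helpers: report which Netdata dashboards answer over HTTP.
# Server list copied from the Deployment Targets section.
SERVERS="63.143.34.217 38.68.14.26 216.239.104.130 64.50.188.14"
PORT=19999

check_dashboard() {
    # $1 = server IP. Succeeds if the agent answers within 5 seconds.
    curl -fsS -o /dev/null --max-time 5 "http://$1:$PORT/api/v1/info"
}

check_all() {
    # Prints one "<ip> OK" or "<ip> FAIL" line per server and
    # returns non-zero if any server failed.
    failed=0
    for ip in $SERVERS; do
        if check_dashboard "$ip"; then
            echo "$ip OK"
        else
            echo "$ip FAIL"
            failed=1
        fi
    done
    return "$failed"
}
```

Running `check_all` after Phase 4 gives a one-line status per server; the block itself only defines the functions and makes no network calls until one of them is invoked.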

---

## Post-Installation Configuration

### 1. Configure UFW Firewall

**On each server:**

```bash
# Allow Netdata port from Michael's management IP only
# (replace MICHAEL_MANAGEMENT_IP with the real address)
ufw allow from MICHAEL_MANAGEMENT_IP to any port 19999 proto tcp

# Verify
ufw status | grep 19999
```

**Security note:** Netdata dashboards contain sensitive server information. Only allow access from trusted IPs. The allow rule restricts access only if UFW is enabled and its default incoming policy is deny; confirm both with `ufw status verbose`.
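
If parent-child streaming is used, Command Center also has to accept connections on port 19999 from the three child servers, not only from the management IP. The sketch below prints the corresponding `ufw` commands for review rather than running them. `print_ufw_rules` is a hypothetical helper name, and `203.0.113.10` in the usage note is a documentation placeholder, not the real management IP.

```bash
# Hypothetical helper: print (not execute) the ufw allow rules for port 19999.
# Child IPs copied from the Deployment Targets section.
CHILDREN="38.68.14.26 216.239.104.130 64.50.188.14"

print_ufw_rules() {
    # $1 = role ("parent" for Command Center, "child" otherwise)
    # $2 = management IP
    role=$1
    mgmt_ip=$2
    echo "ufw allow from $mgmt_ip to any port 19999 proto tcp"
    if [ "$role" = "parent" ]; then
        # The parent must also accept streaming connections from each child.
        for child in $CHILDREN; do
            echo "ufw allow from $child to any port 19999 proto tcp"
        done
    fi
}
```

For example, `print_ufw_rules parent 203.0.113.10` prints four rules for Command Center, while `print_ufw_rules child 203.0.113.10` prints the single management rule for TX1, NC1, or Ghost. Review the output before piping it to a shell.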

---

### 2. Set Up Parent-Child Streaming (Optional)

**Benefit:** View all servers from one dashboard (Command Center)

**On Command Center (parent):**

```bash
# Edit config
nano /etc/netdata/stream.conf

# Add (the section name is the API key; replace this placeholder
# with your own UUID, e.g. the output of `uuidgen`):
[11111111-2222-3333-4444-555555555555]
enabled = yes
default history = 3600
default memory mode = save
health enabled by default = yes

# Restart netdata so the parent starts accepting streams
systemctl restart netdata

# UFW on this server must also allow each child IP on port 19999
```

**On TX1, NC1, Ghost (children):**

```bash
# Edit config
nano /etc/netdata/stream.conf

# Add (the API key must match the parent's section name exactly):
[stream]
enabled = yes
destination = 63.143.34.217:19999
api key = 11111111-2222-3333-4444-555555555555

# Restart netdata
systemctl restart netdata
```

**Result:** All server metrics visible on Command Center dashboard
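
Because the API key has to match exactly on both sides, it can help to generate both stanzas from a single key. This sketch only prints configuration text and writes nothing to disk. `parent_stanza` and `child_stanza` are illustrative names, the parent stanza is reduced to the `enabled` line (add the remaining options from the block above as needed), and `uuidgen` is assumed to be installed.

```bash
# Hypothetical helpers: print matching stream.conf stanzas for one API key.

parent_stanza() {
    # $1 = API key (a UUID). On the parent, the key is the section name.
    printf '[%s]\n    enabled = yes\n' "$1"
}

child_stanza() {
    # $1 = parent address (host:port), $2 = the same API key
    printf '[stream]\n    enabled = yes\n    destination = %s\n    api key = %s\n' "$1" "$2"
}
```

For example, `KEY=$(uuidgen); parent_stanza "$KEY"; child_stanza 63.143.34.217:19999 "$KEY"` prints both sides with the same key, ready to paste into the respective `stream.conf` files.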

---

### 3. Configure Alerts

**Edit alert config:**

```bash
nano /etc/netdata/health.d/custom.conf
```

**Example alerts:**

```yaml
# Alert when CPU usage > 80% (notification delayed 5 minutes)
alarm: cpu_usage
on: system.cpu
calc: $user + $system
every: 1m
warn: $this > 80
crit: $this > 95
delay: up 5m down 15m
info: CPU usage is too high

# Alert when RAM usage > 90%
# (cached and buffers count toward the total, not toward "used")
alarm: ram_usage
on: system.ram
calc: $used * 100 / ($used + $cached + $free + $buffers)
every: 1m
warn: $this > 90
crit: $this > 95
delay: up 5m down 15m
info: RAM usage is too high

# Alert when free disk space < 20%
# (template: applies the check to every chart in the disk.space context,
#  i.e. every mounted filesystem)
template: disk_space
on: disk.space
calc: $avail * 100 / ($avail + $used)
every: 1m
warn: $this < 20
crit: $this < 10
delay: up 5m down 15m
info: Disk space is running low
```
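
How the RAM percentage is computed matters more than it may seem, because Linux keeps a large page cache. As a worked example with made-up numbers (4096 MB used, 10240 MB cached, 2048 MB free, 0 MB buffers), the sketch below compares dividing by `used + free` alone against dividing by all four `system.ram` dimensions. The function names are illustrative only.

```bash
# Worked arithmetic for a RAM usage alert; all figures are example values in MB.

# Percentage when the total ignores cache and buffers.
ram_pct_used_free() {
    # $1 = used, $2 = free
    awk -v u="$1" -v f="$2" 'BEGIN { printf "%.1f\n", u * 100 / (u + f) }'
}

# Percentage when the total is used + cached + free + buffers.
ram_pct_all_dims() {
    # $1 = used, $2 = cached, $3 = free, $4 = buffers
    awk -v u="$1" -v c="$2" -v f="$3" -v b="$4" \
        'BEGIN { printf "%.1f\n", u * 100 / (u + c + f + b) }'
}

ram_pct_used_free 4096 2048          # prints 66.7
ram_pct_all_dims 4096 10240 2048 0   # prints 25.0
```

On a server with a warm cache, the first calculation can approach an alert threshold while most of the memory is still reclaimable, which is why the choice of denominator deserves a second look when tuning thresholds.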

**Reload config:**

```bash
killall -USR2 netdata
# or, on versions that ship netdatacli:
# netdatacli reload-health
```

---

### 4. Discord Integration (Optional)

**Set up Discord webhook for alerts:**

```bash
# Edit alarm notification config
nano /etc/netdata/health_alarm_notify.conf

# Find Discord section and configure:
SEND_DISCORD="YES"
DISCORD_WEBHOOK_URL="YOUR_DISCORD_WEBHOOK_URL_HERE"
DEFAULT_RECIPIENT_DISCORD="network-alerts"
```

**Test alert:**

```bash
# Trigger test alert
/usr/libexec/netdata/plugins.d/alarm-notify.sh test
```

Check Discord for test notification.
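
If the test notification never arrives, it can help to rule out the webhook itself before digging into Netdata. Discord webhooks accept an HTTP POST with a JSON body containing a `content` field; the sketch below builds that body and posts it. `discord_payload` and `discord_send` are hypothetical helper names, and the escaping is deliberately minimal (backslashes and double quotes only).

```bash
# Hypothetical helpers for testing a Discord webhook independently of Netdata.

discord_payload() {
    # $1 = message text. Escapes backslashes and double quotes only,
    # so keep test messages on a single line.
    msg=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    printf '{"content": "%s"}\n' "$msg"
}

discord_send() {
    # $1 = webhook URL, $2 = message text
    curl -fsS -H "Content-Type: application/json" \
        -d "$(discord_payload "$2")" "$1"
}
```

Calling `discord_send "YOUR_DISCORD_WEBHOOK_URL_HERE" "Netdata webhook test"` with the real URL should make the message appear in the channel. If it does, the webhook works and the problem is on the Netdata side.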

---

## Dashboard Access

### Quick Access Links

**Save these bookmarks:**

- Command Center: http://63.143.34.217:19999
- TX1: http://38.68.14.26:19999
- NC1: http://216.239.104.130:19999
- Ghost: http://64.50.188.14:19999

**Unified View (if streaming configured):**
- All servers: http://63.143.34.217:19999 → View nodes

---

### Key Metrics to Monitor

**CPU:**
- User % (application load)
- System % (kernel load)
- IOWait % (disk bottleneck indicator)

**RAM:**
- Used vs Available
- Cache (should be high, that's good!)
- Swap usage (should be low)

**Disk:**
- Disk space remaining
- Read/write speeds
- IOPS

**Network:**
- Bandwidth usage
- Packet drops
- Connection count

**Minecraft Servers (TX1/NC1):**
- Java heap usage
- GC activity
- Thread count
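
The metrics shown on the dashboard can also be read from the agent's REST API, which is handy for spot checks from a terminal. A sketch, assuming the v1 `data` endpoint with its `chart`, `after`, `points`, `group`, and `format` query parameters; `metric_url` and `fetch_metric` are illustrative names.

```bash
# Hypothetical helpers: read the last minute of a chart, averaged into one
# row, from a Netdata agent's v1 data API.

metric_url() {
    # $1 = host, $2 = chart id (e.g. system.cpu, system.ram)
    printf 'http://%s:19999/api/v1/data?chart=%s&after=-60&points=1&group=average&format=csv\n' "$1" "$2"
}

fetch_metric() {
    # $1 = host, $2 = chart id. Prints the CSV returned by the agent.
    curl -fsS --max-time 5 "$(metric_url "$1" "$2")"
}
```

For example, `fetch_metric 63.143.34.217 system.cpu` run from the management IP returns the averaged CPU dimensions for the last 60 seconds on Command Center.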

---

## Maintenance

### Daily

- Quick glance at dashboards (bookmark all 4)
- Check for any red alerts

### Weekly

- Review CPU/RAM trends
- Check disk space projections
- Verify alerts working

### Monthly

- Review historical data
- Adjust alert thresholds if needed
- Update Netdata if new version available

---

## Updates

**Check for updates:**

```bash
# On each server (the updater is not on PATH; static installs keep it
# under /opt/netdata/usr/libexec/netdata/ instead)
/usr/libexec/netdata/netdata-updater.sh
```

**Or auto-update (recommended):**

The kickstart installer enables automatic updates by default: Netdata checks for a new version daily and installs it without intervention.

---

## Troubleshooting

### Dashboard won't load

**Check service:**
```bash
systemctl status netdata
```

**Restart if needed:**
```bash
systemctl restart netdata
```

**Check firewall and listener:**
```bash
# Is there a UFW rule for the port?
ufw status | grep 19999
# Is Netdata listening on the port?
ss -tln | grep 19999
```

---

### High CPU usage from Netdata

Netdata should use < 3% CPU normally.

**Disable unused plugins to reduce load:**
```bash
# Disable some plugins if needed
nano /etc/netdata/netdata.conf

# Under [plugins], disable unused:
python.d = no
node.d = no
```

---

### Streaming not working

**Verify:**
- Parent (Command Center) has stream.conf with API key
- Children have correct parent IP
- Port 19999 accessible from children to parent
- API keys match exactly

**Debug:**
```bash
# On child
tail -f /var/log/netdata/error.log | grep stream
```

---

### Alerts not sending to Discord

**Check:**
- Discord webhook URL correct
- `SEND_DISCORD="YES"` set
- Test alert sent successfully

**Debug:**
```bash
# Run the test with debug output enabled
# (the second argument to "test" is a recipient role, not a debug switch)
NETDATA_ALARM_NOTIFY_DEBUG=1 /usr/libexec/netdata/plugins.d/alarm-notify.sh test
```

---

## Advanced Features (Optional)

### Netdata Cloud (Free)

**Benefits:**
- Centralized dashboard for all servers
- Mobile app
- Longer data retention
- Collaboration features

**Setup:**

1. Go to https://app.netdata.cloud
2. Create free account
3. Claim nodes:

```bash
# On each server
netdata-claim.sh -token=YOUR_TOKEN -rooms=YOUR_ROOM -url=https://app.netdata.cloud
```

---

### Custom Dashboards

Create custom dashboards with specific metrics:

1. Open Netdata dashboard
2. Click "Create Dashboard"
3. Add charts
4. Save and share URL

---

## Success Criteria Checklist

- [ ] Netdata installed on Command Center
- [ ] Netdata installed on TX1
- [ ] Netdata installed on NC1
- [ ] Netdata installed on Ghost VPS
- [ ] All dashboards accessible via browser
- [ ] UFW rules configured (management IP only)
- [ ] Alerts configured for CPU/RAM/Disk
- [ ] (Optional) Discord integration working
- [ ] (Optional) Parent-child streaming configured
- [ ] Dashboards bookmarked for quick access

---

## Related Tasks

- **Staggered Server Restart System** - Monitor impact on resources
- **World Backup Automation** - Monitor backup job duration
- **Command Center Security** - Part of monitoring infrastructure
- **Frostwall Protocol** - Monitor tunnel performance

---

**Fire + Frost + Foundation = Where Love Builds Legacy** 💙🔥❄️

---

**Document Status:** COMPLETE
**Ready for Deployment:** When SSH access available (30 minutes total)
**Dependencies:** SSH access to all 4 servers, management IP whitelisted
**Port Required:** 19999 (restricted by UFW to the management IP)