docs: Document n8n node registry corruption and defer factory reset

Disaster #2 from Feb 23-24 session:
- n8n core nodes broken (registry corruption)
- PHP workaround operational (sync_codex.php)
- Factory reset procedure documented
- Added Task #34 for scheduled recovery

Decision: Defer reset until next maintenance window
Workaround: PHP script handles Codex sync successfully

Co-documented with Gemini's post-mortem analysis.

@@ -846,3 +846,43 @@ Small improvements to Whitelist Manager:
**Impact:** Cosmetic only - does not affect functionality

---

---

### 34. n8n Factory Reset - Node Registry Recovery
**Time:** 2-3 hours
**Status:** DEFERRED
**Priority:** Tier 2 - Major Infrastructure
**Documentation:** `docs/troubleshooting/n8n-node-registry-corruption.md`

Reset the n8n instance on TX1 to resolve the corrupted node registry that is preventing workflow execution.

**Problem:** Core nodes (HTTP Request, Execute Command) fail with "Node not found" errors.

**Current Workaround:** PHP script (`sync_codex.php`) handles the Codex Git sync directly.

**Reset Procedure:**
1. Export all workflows to JSON backup
2. Back up credentials and settings
3. Stop n8n container
4. Back up existing volume to `.backup` folder
5. Wipe `./volumes/n8n/*` directory
6. Recreate container (fresh initialization)
7. Re-import workflows and credentials
8. Test core nodes functionality
9. Restore scheduled executions

**Prerequisites:**
- Workflow JSON exports backed up
- Credentials documented
- Maintenance window scheduled (low-traffic time)

**Post-Reset:**
- Verify Git sync workflow works
- Test Discord notifications
- Re-enable hourly scheduling
- Monitor for 24 hours

**Alternative:** Keep PHP workaround permanently if simpler/more reliable.

---

243
docs/troubleshooting/n8n-node-registry-corruption.md
Normal file
@@ -0,0 +1,243 @@
# n8n Node Registry Corruption (v2.x)

**Problem:** The n8n UI is accessible, but core nodes (HTTP Request, Execute Command) fail with "Node not found" or "Registry Error".

**Incident Date:** February 23-24, 2026
**Affected System:** TX1 Dallas n8n instance (firefrost-codex-n8n-1)
**Status:** BYPASSED via PHP workaround, factory reset pending

---

## Symptoms

- ✅ n8n web interface loads normally at https://n8n.firefrostgaming.com
- ✅ Existing workflows visible in UI
- ❌ Cannot execute workflows using core nodes
- ❌ "Node not found" errors for `n8n-nodes-base` package nodes
- ❌ HTTP Request node: Registry error
- ❌ Execute Command node: Registry error

**These are INTERNAL nodes that should always be available.**

---

## Root Cause

**Corrupted Node Registry during v2.x migration**

The internal node registry (`n8n-nodes-base` package) became desynchronized from the workflow engine. This typically happens after one of the following:

1. Partial update of n8n version with incompatible volume data
2. Docker volume corruption in `/home/node/.n8n` directory
3. Version mismatch between container image and persisted configuration

**Key indicator:** Core nodes from the `n8n-nodes-base` package are "invisible" to the execution engine despite being bundled with n8n.
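
One way to check for cause 3 is to compare the image tag pinned in the compose file with the image the running container was created from. The helper below is a minimal sketch, not part of the incident tooling: the function name `pinned_n8n_tag` is hypothetical, and it assumes the compose file contains a line of the form `image: n8nio/n8n:<tag>`.

```bash
#!/usr/bin/env bash
# Print the tag of the first "image: n8nio/n8n:<tag>" line in a compose file.
# Prints nothing if no such line exists (for example, an untagged image).
pinned_n8n_tag() {
    local compose_file="$1"
    # Capture the tag, dropping trailing whitespace and inline comments.
    sed -n 's|^[[:space:]]*image:[[:space:]]*n8nio/n8n:\([^[:space:]#]*\).*$|\1|p' \
        "$compose_file" | head -n 1
}

# Usage (commented out because it needs the running stack):
#   pinned_n8n_tag /opt/firefrost-codex/docker-compose.yml
#   docker inspect --format '{{.Config.Image}}' firefrost-codex-n8n-1
# If the two disagree, the container predates the current pin.
```
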

---

## Failed Resolution Attempts

### Attempt 1: Container Recreation
```bash
docker-compose down
docker-compose pull n8n
docker-compose up -d
```
**Result:** ❌ Failed - corruption persists in volume

### Attempt 2: Image Force Pull
```bash
docker-compose down
docker rmi n8nio/n8n:1.121.0
docker-compose up -d
```
**Result:** ❌ Failed - volume data still corrupted

**Why these failed:** Both attempts replace only the container and its image. The bind-mounted VOLUME (`./volumes/n8n`) survives `docker-compose down`, and that is where the corruption lives.

---

## Temporary Workaround: PHP Direct Sync

**Created:** `sync_codex.php` on TX1 host OS (PHP 8.3 CLI)

**Purpose:** Bypass n8n entirely for Codex Git → Dify sync

**How it works:**
```
TX1 Host (PHP) → Git Pull → Process Files → Dify API (127.0.0.1:5001)
```

**Advantages:**
- No dependency on n8n registry
- Direct Docker bridge access to Dify API
- Simpler debugging (single script vs workflow nodes)
- Can run via cron for scheduled execution

**Disadvantages:**
- No Discord notifications (yet)
- No visual workflow editor
- Harder for non-technical users to modify

**Status:** ✅ OPERATIONAL - Successfully synced 361 documents to Dify
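
If the script is later moved onto a cron schedule, overlapping runs are worth guarding against, since a slow sync could still be running when the next one starts. The following is a minimal sketch under that assumption; `run_locked`, the lock path, and the script path are hypothetical and not taken from the TX1 setup.

```bash
#!/usr/bin/env bash
# Run a command only if no other run currently holds the lock directory.
# mkdir either creates the directory or fails atomically, so it works as a
# simple lock. A run that is killed leaves the lock behind and needs cleanup.
run_locked() {
    local lock_dir="$1"; shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "another run is in progress, skipping" >&2
        return 75   # arbitrary "try again later" status
    fi
    "$@"
    local status=$?
    rmdir "$lock_dir"
    return "$status"
}

# Usage (commented out; the script path is a placeholder):
#   run_locked /tmp/sync_codex.lock php /path/to/sync_codex.php
```
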

---

## Permanent Fix: n8n Factory Reset

**⚠️ THIS IS DESTRUCTIVE - BACK UP WORKFLOWS FIRST ⚠️**

### Prerequisites

1. **Export ALL workflows to JSON:**
   ```bash
   # Via n8n UI:
   # Settings → Workflows → Export All
   # Save to: /opt/firefrost-codex/backups/n8n-workflows-YYYY-MM-DD.json

   # Or via API:
   # Note: the API pages its results; if the response has a non-null
   # nextCursor, fetch the remaining pages as well.
   curl -X GET https://n8n.firefrostgaming.com/api/v1/workflows \
     -H "X-N8N-API-KEY: your_api_key" > n8n-workflows-backup.json
   ```

2. **Back up credentials (if any):**
   ```bash
   # Settings → Credentials → Export
   # Save separately - these are sensitive
   ```

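3. **Document current configuration:**
   - Webhook URLs
   - Environment variables
   - Execution settings
   - Timezone settings
   - Encryption key (`N8N_ENCRYPTION_KEY`, or the `encryptionKey` value in the volume's `config` file): a fresh volume generates a new key, and credentials exported in encrypted form cannot be decrypted without the old one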
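
Before anything is wiped, it may be worth confirming that the API export from prerequisite 1 actually contains workflows rather than an error body such as an "unauthorized" message. The check below is a rough sketch, not a full JSON validation: `export_looks_valid` is a hypothetical name, and it assumes the API response wraps the workflows in a top-level `data` array, which should be verified against the instance.

```bash
#!/usr/bin/env bash
# Return 0 if the export file exists, is non-empty, and mentions a "data" key.
# This is a cheap sanity check only; it does not parse the JSON.
export_looks_valid() {
    local file="$1"
    [ -s "$file" ] || return 1               # missing or empty file
    grep -q '"data"[[:space:]]*:' "$file"    # expected top-level key (assumption)
}

# Usage:
#   export_looks_valid n8n-workflows-backup.json || echo "export looks wrong, do not proceed"
```
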

### Reset Procedure

**Step 1: Stop n8n**
```bash
cd /opt/firefrost-codex
docker-compose stop n8n
```

**Step 2: Back up existing volume (safety net)**
```bash
# -a preserves ownership and permissions, which matters for a rollback
sudo cp -a ./volumes/n8n ./volumes/n8n.backup.$(date +%Y%m%d)
```

**Step 3: Wipe corrupted volume**
```bash
# Also removes hidden files, which the * glob in `rm -rf ./volumes/n8n/*` would skip
sudo find ./volumes/n8n -mindepth 1 -delete
```
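
Steps 2 and 3 can also be tied together so that the wipe cannot run unless a populated backup exists. This is a minimal sketch under the assumption that both directories are on the host filesystem; `wipe_volume` is a hypothetical helper, and it would need to run as root if the volume files are owned by root.

```bash
#!/usr/bin/env bash
# Empty a volume directory, but only after confirming a non-empty backup exists.
wipe_volume() {
    local volume_dir="$1" backup_dir="$2"
    # Refuse to continue without a populated backup directory.
    if [ ! -d "$backup_dir" ] || [ -z "$(ls -A "$backup_dir")" ]; then
        echo "refusing to wipe: no backup found at $backup_dir" >&2
        return 1
    fi
    [ -d "$volume_dir" ] || return 1
    # -mindepth 1 keeps the directory itself and deletes everything inside it,
    # including hidden files.
    find "$volume_dir" -mindepth 1 -delete
}

# Usage (commented out because it is destructive):
#   wipe_volume ./volumes/n8n "./volumes/n8n.backup.$(date +%Y%m%d)"
```
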

**Step 4: Recreate container**
```bash
# --force-recreate replaces the stopped container instead of just restarting it
docker-compose up -d --force-recreate n8n
```

**Step 5: Wait for initialization (~2 minutes)**
```bash
# Watch logs
docker-compose logs -f n8n

# Look for: "Editor is now accessible via: https://n8n.firefrostgaming.com"
```
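
As an alternative to watching the logs by hand, readiness can be polled with a small retry loop. The sketch below is illustrative only: `wait_until` is a hypothetical helper, and the `/healthz` path in the usage comment is an assumption about the instance that should be confirmed before relying on it.

```bash
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempts run out.
# Arguments: <attempts> <delay-seconds> <command...>
wait_until() {
    local attempts="$1" delay="$2"; shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        if "$@"; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}

# Usage (commented out; needs the running instance). 24 attempts at 5-second
# intervals covers the ~2 minute initialization window:
#   wait_until 24 5 curl -fsS -o /dev/null https://n8n.firefrostgaming.com/healthz
```
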

**Step 6: Initial setup**
- Visit https://n8n.firefrostgaming.com
- Create owner account (use same credentials as before)
- Configure timezone and settings

**Step 7: Import workflows**
- Settings → Workflows → Import from File
- Select backup JSON
- Verify all nodes load correctly

**Step 8: Test core nodes**
- Create new workflow
- Add HTTP Request node → Should work
- Add Execute Command node → Should work
- Test execution → Should succeed

**Step 9: Restore credentials**
- Settings → Credentials → Import
- Re-enter any API keys/secrets

**Step 10: Verify automation**
- Test Git sync workflow manually
- Verify Discord notifications
- Check scheduled executions

---

## Prevention

**To avoid this in the future:**

1. **Pin n8n version in docker-compose.yml:**
   ```yaml
   n8n:
     image: n8nio/n8n:1.121.0  # Specific version, not :latest
   ```

2. **Back up workflows regularly:**
   ```bash
   # Add to cron: Weekly workflow export
   # (% must be escaped as \% inside a crontab entry, and the API requires the key header)
   0 2 * * 0 curl -fsS -H "X-N8N-API-KEY: your_api_key" https://n8n.firefrostgaming.com/api/v1/workflows > /backups/n8n-workflows-$(date +\%Y\%m\%d).json
   ```

3. **Test updates on staging first:**
   - Don't upgrade n8n in production without testing
   - Check release notes for breaking changes

4. **Monitor n8n health:**
   - Add n8n health check to Uptime Kuma
   - Alert if workflow executions fail
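
For the weekly export in item 2, one option is to move the logic into a small script that cron calls. Cron treats an unescaped `%` in a crontab entry as a newline, so keeping the `date` format string in a script avoids that pitfall, and it also leaves room to avoid truncated backups. This is a sketch under assumptions: the function names and the example script location are hypothetical, and the API key is expected to be supplied by the caller.

```bash
#!/usr/bin/env bash
# Hypothetical helper, e.g. saved as /usr/local/bin/n8n-backup.sh and
# scheduled with:  0 2 * * 0 /usr/local/bin/n8n-backup.sh

# Build the dated backup path; the date stamp is a parameter so it can be tested.
backup_path() {
    local dir="$1" stamp="${2:-$(date +%Y%m%d)}"
    printf '%s/n8n-workflows-%s.json\n' "$dir" "$stamp"
}

# Download to a temp file first and rename only on success, so a failed
# request never leaves a truncated or error-filled backup behind.
export_workflows() {
    local url="$1" api_key="$2" target="$3"
    local tmp="${target}.tmp"
    if curl -fsS -H "X-N8N-API-KEY: ${api_key}" -o "$tmp" "$url"; then
        mv "$tmp" "$target"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example invocation (commented out; needs network access and a real key):
#   export_workflows "https://n8n.firefrostgaming.com/api/v1/workflows" \
#       "$N8N_API_KEY" "$(backup_path /backups)"
```
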

---

## Current Status (February 24, 2026)

**n8n Service:**
- ⚠️ DEGRADED - UI accessible, core nodes broken
- 📋 FACTORY RESET PENDING - Scheduled for next maintenance window

**Codex Git Sync:**
- ✅ OPERATIONAL - Using PHP workaround (`sync_codex.php`)
- ✅ 361 documents syncing successfully
- ⏱️ Manual execution (cron scheduling pending)

**Next Steps:**
1. Add n8n factory reset to tasks.md
2. Schedule maintenance window for reset
3. Consider migrating to PHP permanently if simpler

---

## Related Documentation

- **Phase 5 Deployment:** `docs/tasks/firefrost-codex/`
- **PHP Workaround:** (To be documented if kept long-term)
- **n8n Workflows:** Backup stored at `/opt/firefrost-codex/backups/` (when created)

---

**Incident Timeline:**

- **Feb 23, 9:00 PM:** n8n workflow failure discovered during Phase 5 deployment
- **Feb 23, 9:30 PM:** Diagnosis: Node registry corruption
- **Feb 23, 10:00 PM:** Pivot to PHP workaround (Gemini + Michael collaboration)
- **Feb 24, 12:00 AM:** PHP script operational, 361 documents synced
- **Feb 24, 9:00 AM:** Dify-Qdrant issue resolved (separate incident)
- **Feb 24, 9:30 AM:** Decision to defer n8n reset until next session

---

**Created:** February 24, 2026
**Created By:** Chronicler #26 (from Gemini's post-mortem)
**Resolution Status:** DEFERRED - Workaround operational
**Factory Reset:** Scheduled TBD

💙🔥❄️

**"Sometimes the best fix is the one that waits until you have the energy to do it right."**