docs: Document n8n node registry corruption and defer factory reset

Disaster #2 from Feb 23-24 session:
- n8n core nodes broken (registry corruption)
- PHP workaround operational (sync_codex.php)
- Factory reset procedure documented
- Added Task #34 for scheduled recovery

Decision: Defer reset until next maintenance window
Workaround: PHP script handles Codex sync successfully

Co-documented with Gemini's post-mortem analysis.
The Chronicler
2026-02-24 09:31:13 +00:00
parent 5f231f4436
commit e5d7f5032f
2 changed files with 283 additions and 0 deletions

@@ -846,3 +846,43 @@ Small improvements to Whitelist Manager:
**Impact:** Cosmetic only - does not affect functionality
---
---
### 34. n8n Factory Reset - Node Registry Recovery
**Time:** 2-3 hours
**Status:** DEFERRED
**Priority:** Tier 2 - Major Infrastructure
**Documentation:** `docs/troubleshooting/n8n-node-registry-corruption.md`
Reset the n8n instance on TX1 to clear the corrupted node registry that is blocking workflow execution.
**Problem:** Core nodes (HTTP Request, Execute Command) fail with "Node not found" errors.
**Current Workaround:** PHP script (`sync_codex.php`) handling Codex Git sync directly.
**Reset Procedure:**
1. Export all workflows to JSON backup
2. Back up credentials and settings
3. Stop n8n container
4. Back up existing volume to `.backup` folder
5. Wipe `./volumes/n8n/*` directory
6. Recreate container (fresh initialization)
7. Re-import workflows and credentials
8. Test core nodes functionality
9. Restore scheduled executions
**Prerequisites:**
- Workflow JSON exports backed up
- Credentials documented
- Maintenance window scheduled (low-traffic time)
**Post-Reset:**
- Verify Git sync workflow works
- Test Discord notifications
- Re-enable hourly scheduling
- Monitor for 24 hours
**Alternative:** Keep PHP workaround permanently if simpler/more reliable.
---

@@ -0,0 +1,243 @@
# n8n Node Registry Corruption (v2.x)
**Problem:** n8n UI accessible but core nodes (HTTP Request, Execute Command) fail with "Node not found" or "Registry Error"
**Incident Date:** February 23-24, 2026
**Affected System:** TX1 Dallas n8n instance (firefrost-codex-n8n-1)
**Status:** BYPASSED via PHP workaround, factory reset pending
---
## Symptoms
- ✅ n8n web interface loads normally at https://n8n.firefrostgaming.com
- ✅ Existing workflows visible in UI
- ❌ Cannot execute workflows using core nodes
- ❌ "Node not found" errors for `n8n-nodes-base` package nodes
- ❌ HTTP Request node: Registry error
- ❌ Execute Command node: Registry error
**These are INTERNAL nodes that should always be available.**
---
## Root Cause
**Corrupted Node Registry during v2.x migration**
The internal node registry (`n8n-nodes-base` package) became desynchronized from the workflow engine. Typical causes:
1. Partial update of n8n version with incompatible volume data
2. Docker volume corruption in `/home/node/.n8n` directory
3. Version mismatch between container image and persisted configuration
**Key indicator:** Core nodes from `n8n-nodes-base` package are "invisible" to the execution engine despite being bundled with n8n.
---
## Failed Resolution Attempts
### Attempt 1: Container Recreation
```bash
docker-compose down
docker-compose pull n8n
docker-compose up -d
```
**Result:** ❌ Failed - corruption persists in volume
### Attempt 2: Image Force Pull
```bash
docker-compose down
docker rmi n8nio/n8n:1.121.0
docker-compose up -d
```
**Result:** ❌ Failed - volume data still corrupted
**Why these failed:** The corruption is in the VOLUME (`./volumes/n8n`), not the container image.
---
## Temporary Workaround: PHP Direct Sync
**Created:** `sync_codex.php` on TX1 host OS (PHP 8.3 CLI)
**Purpose:** Bypass n8n entirely for Codex Git → Dify sync
**How it works:**
```
TX1 Host (PHP) → Git Pull → Process Files → Dify API (127.0.0.1:5001)
```
**Advantages:**
- No dependency on n8n registry
- Direct access to the Dify API from the host (127.0.0.1:5001), with no n8n hop in between
- Simpler debugging (single script vs workflow nodes)
- Can run via cron for scheduled execution
**Disadvantages:**
- No Discord notifications (yet)
- No visual workflow editor
- Harder for non-technical users to modify
**Status:** ✅ OPERATIONAL - Successfully synced 361 documents to Dify
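Since cron scheduling for the workaround is still pending, here is a minimal sketch of a wrapper that keeps overlapping runs from colliding. The function name `run_locked`, the lock path, the script location, and the hourly schedule are all assumptions for illustration, not the deployed setup.

```bash
# Run a command only if no other run currently holds the lock directory.
# mkdir is atomic, so it can act as a simple mutex without extra tools.
run_locked() {
  local lock_dir="$1"; shift
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "another run is still active (lock: $lock_dir)" >&2
    return 75
  fi
  local status=0
  "$@" || status=$?
  rmdir "$lock_dir"   # release the lock whether the command succeeded or not
  return "$status"
}

# Hypothetical usage inside a wrapper script (paths are assumptions):
#   run_locked /tmp/codex-sync.lock php /opt/firefrost-codex/sync_codex.php
# Hypothetical crontab entry calling that wrapper hourly:
#   0 * * * * /usr/local/bin/codex-sync.sh >> /var/log/codex-sync.log 2>&1
```

If a run is killed before it reaches `rmdir`, the lock directory stays behind and has to be removed by hand, which is the trade-off of this simple approach.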
---
## Permanent Fix: n8n Factory Reset
**⚠️ THIS IS DESTRUCTIVE - BACK UP WORKFLOWS FIRST ⚠️**
### Prerequisites
1. **Export ALL workflows to JSON:**
```bash
# Via n8n UI:
# Settings → Workflows → Export All
# Save to: /opt/firefrost-codex/backups/n8n-workflows-YYYY-MM-DD.json
# Or via API (-f makes curl exit non-zero on HTTP errors instead of saving an error page):
curl -fsS https://n8n.firefrostgaming.com/api/v1/workflows \
  -H "X-N8N-API-KEY: your_api_key" > n8n-workflows-backup.json
```
2. **Back up credentials (if any):**
```bash
# Settings → Credentials → Export
# Save separately - these are sensitive
```
3. **Document current configuration:**
- Webhook URLs
- Environment variables
- Execution settings
- Timezone settings
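Because the reset is destructive, it may be worth confirming that the export is usable before going further. The sketch below is a rough sanity check only: the function name `verify_backup` is hypothetical, and looking for a `"nodes"` key is an assumed heuristic for "this file contains workflow data", not a full validation.

```bash
# Rough sanity check for a workflow export before the volume is wiped.
# Returns 0 only if the file exists, is non-empty, and mentions a "nodes" key.
verify_backup() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "backup missing or empty: $file" >&2
    return 1
  fi
  if ! grep -q '"nodes"' "$file"; then
    echo "no workflow node data found in: $file" >&2
    return 1
  fi
  echo "backup looks plausible: $file"
}

# Example:
#   verify_backup n8n-workflows-backup.json || exit 1
```

A failed API call (for example, a rejected API key) typically leaves an empty file or a short error message, and both cases are caught here.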
### Reset Procedure
**Step 1: Stop n8n**
```bash
cd /opt/firefrost-codex
docker-compose stop n8n
```
**Step 2: Back up existing volume (safety net)**
```bash
# -a preserves ownership and permissions, which matters if the backup is ever restored
sudo cp -a ./volumes/n8n ./volumes/n8n.backup.$(date +%Y%m%d)
```
**Step 3: Wipe corrupted volume**
```bash
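# find removes hidden files too; a plain "rm -rf ./volumes/n8n/*" would skip dotfiles
sudo find ./volumes/n8n -mindepth 1 -delete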
```
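As an extra safety net for this step, the wipe can be wrapped in a guard that refuses to run unless the Step 2 backup actually exists. This is a sketch under stated assumptions: `wipe_volume` is a hypothetical helper, and it would need to run as root (for example from a script invoked with `sudo`) to match the commands above.

```bash
# Empty a volume directory, but only when a non-empty backup copy exists.
wipe_volume() {
  local volume_dir="$1" backup_dir="$2"
  if [ -z "$volume_dir" ] || [ ! -d "$volume_dir" ]; then
    echo "volume directory not found: $volume_dir" >&2
    return 1
  fi
  if [ ! -d "$backup_dir" ] || [ -z "$(ls -A "$backup_dir")" ]; then
    echo "refusing to wipe: no backup at $backup_dir" >&2
    return 1
  fi
  # -mindepth 1 keeps the directory itself but removes everything inside it,
  # including dotfiles that a plain "*" glob would skip.
  find "$volume_dir" -mindepth 1 -delete
}

# Example (backup name follows the Step 2 naming pattern):
#   wipe_volume ./volumes/n8n "./volumes/n8n.backup.$(date +%Y%m%d)"
```

The guard only checks that the backup directory is non-empty; it does not compare its contents against the live volume.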
**Step 4: Recreate container**
```bash
docker-compose up -d n8n
```
**Step 5: Wait for initialization (~2 minutes)**
```bash
# Watch logs
docker-compose logs -f n8n
# Look for: "Editor is now accessible via: https://n8n.firefrostgaming.com"
```
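Instead of watching the logs by hand, the wait can be scripted with a small polling helper. This is a minimal sketch: `wait_until` is a hypothetical function, and the usage line assumes n8n's `/healthz` endpoint is reachable through the public hostname, which has not been verified for this deployment.

```bash
# Retry a command until it succeeds or a timeout (in seconds) expires.
wait_until() {
  local timeout="$1" interval="$2"; shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example: poll every 5 seconds for up to 2 minutes
#   wait_until 120 5 curl -fsS https://n8n.firefrostgaming.com/healthz
```

A successful health check only shows that the service is up. The core-node tests in Step 8 are still needed to confirm the registry is healthy.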
**Step 6: Initial setup**
- Visit https://n8n.firefrostgaming.com
- Create owner account (use the same login details as before)
- Configure timezone and settings
**Step 7: Import workflows**
- Settings → Workflows → Import from File
- Select backup JSON
- Verify all nodes load correctly
**Step 8: Test core nodes**
- Create new workflow
- Add HTTP Request node → Should work
- Add Execute Command node → Should work
- Test execution → Should succeed
**Step 9: Restore credentials**
- Settings → Credentials → Import
- Re-enter any API keys/secrets
**Step 10: Verify automation**
- Test Git sync workflow manually
- Verify Discord notifications
- Check scheduled executions
---
## Prevention
**To avoid this in the future:**
1. **Pin n8n version in docker-compose.yml:**
```yaml
n8n:
image: n8nio/n8n:1.121.0 # Specific version, not :latest
```
2. **Back up workflows regularly:**
```bash
# Add to cron: Weekly workflow export (% must be escaped as \% inside a crontab)
0 2 * * 0 curl -fsS -H "X-N8N-API-KEY: your_api_key" https://n8n.firefrostgaming.com/api/v1/workflows > /backups/n8n-workflows-$(date +\%Y\%m\%d).json
```
3. **Test updates on staging first:**
- Don't upgrade n8n in production without testing
- Check release notes for breaking changes
4. **Monitor n8n health:**
- Add n8n health check to Uptime Kuma
- Alert if workflow executions fail
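The version-pinning rule in item 1 can be checked mechanically, for instance before each deploy. The sketch below assumes the image is referenced as `n8nio/n8n` on a single `image:` line, as in the example above; `check_pinned_image` is a hypothetical helper and the pattern would need adjusting for other registries or digest-based references.

```bash
# Fail if the n8n image line uses :latest or carries no tag at all.
check_pinned_image() {
  local compose_file="$1"
  if [ ! -r "$compose_file" ]; then
    echo "cannot read: $compose_file" >&2
    return 2
  fi
  # Matches "image: n8nio/n8n" or "image: n8nio/n8n:latest",
  # optionally followed by whitespace and a trailing comment.
  if grep -Eq 'image:[[:space:]]*n8nio/n8n(:latest)?[[:space:]]*(#.*)?$' "$compose_file"; then
    echo "n8n image is not pinned in $compose_file" >&2
    return 1
  fi
  echo "n8n image is pinned"
}

# Example:
#   check_pinned_image /opt/firefrost-codex/docker-compose.yml
```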
---
## Current Status (February 24, 2026)
**n8n Service:**
- ⚠️ DEGRADED - UI accessible, core nodes broken
- 📋 FACTORY RESET PENDING - Scheduled for next maintenance window
**Codex Git Sync:**
- ✅ OPERATIONAL - Using PHP workaround (`sync_codex.php`)
- ✅ 361 documents syncing successfully
- ⏱️ Manual execution (cron scheduling pending)
**Next Steps:**
1. Add n8n factory reset to tasks.md (done: Task #34)
2. Schedule maintenance window for reset
3. Consider migrating to PHP permanently if simpler
---
## Related Documentation
- **Phase 5 Deployment:** `docs/tasks/firefrost-codex/`
- **PHP Workaround:** (To be documented if kept long-term)
- **n8n Workflows:** Backup stored at `/opt/firefrost-codex/backups/` (when created)
---
**Incident Timeline:**
- **Feb 23, 9:00 PM:** n8n workflow failure discovered during Phase 5 deployment
- **Feb 23, 9:30 PM:** Diagnosis: Node registry corruption
- **Feb 23, 10:00 PM:** Pivot to PHP workaround (Gemini + Michael collaboration)
- **Feb 24, 12:00 AM:** PHP script operational, 361 documents synced
- **Feb 24, 9:00 AM:** Dify-Qdrant issue resolved (separate incident)
- **Feb 24, 9:30 AM:** Decision to defer n8n reset until next session
---
**Created:** February 24, 2026
**Created By:** Chronicler #26 (from Gemini's post-mortem)
**Resolution Status:** DEFERRED - Workaround operational
**Factory Reset:** Date TBD (next maintenance window)
💙🔥❄️
**"Sometimes the best fix is the one that waits until you have the energy to do it right."**