New skills covering 10 categories: **Security & Audit**: 007 (STRIDE/PASTA/OWASP), cred-omega (secrets management) **AI Personas**: Karpathy, Hinton, Sutskever, LeCun (4 sub-skills), Altman, Musk, Gates, Jobs, Buffett **Multi-agent Orchestration**: agent-orchestrator, task-intelligence, multi-advisor **Code Analysis**: matematico-tao (Terence Tao-inspired mathematical code analysis) **Social & Messaging**: Instagram Graph API, Telegram Bot, WhatsApp Cloud API, social-orchestrator **Image Generation**: AI Studio (Gemini), Stability AI, ComfyUI Gateway, image-studio router **Brazilian Domain**: 6 auction specialist modules, 2 legal advisors, auctioneers data scraper **Product & Growth**: design, invention, monetization, analytics, growth engine **DevOps & LLM Ops**: Docker/CI-CD/AWS, RAG/embeddings/fine-tuning **Skill Governance**: installer, sentinel auditor, context management Each skill includes: - Standardized YAML frontmatter (name, description, risk, source, tags, tools) - Structured sections (Overview, When to Use, How it Works, Best Practices) - Python scripts and reference documentation where applicable - Cross-platform compatibility (Claude Code, Antigravity, Cursor, Gemini CLI, Codex CLI) Co-authored-by: ProgramadorBrasil <214873561+ProgramadorBrasil@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
29 KiB
ComfyUI Gateway -- Troubleshooting Guide
Comprehensive troubleshooting reference for diagnosing and resolving issues with the ComfyUI Gateway. Every section follows the Symptom -> Cause -> Solution format with concrete commands you can run immediately.
Table of Contents
- ComfyUI Not Reachable
- OOM (Out of Memory) Errors
- Slow Generation
- Webhook Failures
- Redis Connection Issues
- Storage Errors
- Database Issues
- Job Stuck in "running"
- Rate Limiting Issues
- Authentication Problems
1. ComfyUI Not Reachable
The gateway returns COMFYUI_UNREACHABLE and the /health endpoint shows
comfyui.reachable: false.
1a. Wrong COMFYUI_URL
Symptom: Gateway starts fine but every job fails with COMFYUI_UNREACHABLE.
The health endpoint returns { ok: false, comfyui: { reachable: false } }.
Cause: The COMFYUI_URL in .env does not point to a running ComfyUI instance.
Solution:
# 1. Verify what you have configured
grep COMFYUI_URL .env
# 2. Test connectivity from the gateway host
curl -s http://127.0.0.1:8188/
# Expected: HTML page or JSON from ComfyUI
# 3. If ComfyUI is on a different port or host, update .env
# Example: COMFYUI_URL=http://192.168.1.50:8188
# 4. Restart the gateway after changing .env
npm run dev
1b. Firewall Blocking the Port
Symptom: curl to the ComfyUI URL times out or returns Connection refused,
but ComfyUI is confirmed running on that machine.
Cause: A host firewall (Windows Defender, iptables, ufw) is blocking the port.
Solution:
# Linux (ufw)
sudo ufw allow 8188/tcp
sudo ufw reload
# Linux (iptables)
sudo iptables -A INPUT -p tcp --dport 8188 -j ACCEPT
# Windows (PowerShell, run as Admin)
New-NetFirewallRule -DisplayName "ComfyUI" -Direction Inbound -Port 8188 -Protocol TCP -Action Allow
# Verify the port is listening
# Linux
ss -tlnp | grep 8188
# Windows
netstat -an | findstr 8188
1c. Docker Networking
Symptom: Gateway running inside Docker cannot reach ComfyUI on 127.0.0.1:8188.
Cause: 127.0.0.1 inside a Docker container refers to the container itself,
not the host machine.
Solution:
# Option A: Use Docker's special host DNS (Linux + Docker Desktop)
COMFYUI_URL=http://host.docker.internal:8188
# Option B: Use the host network mode
docker run --network host comfyui-gateway
# Option C: Put both containers on the same Docker network
docker network create comfy-net
docker run --name comfyui --network comfy-net ...
docker run --name gateway --network comfy-net -e COMFYUI_URL=http://comfyui:8188 ...
# Verify from inside the gateway container
docker exec -it gateway sh -c "wget -qO- http://comfyui:8188/ || echo FAIL"
1d. WSL2 Networking
Symptom: Gateway running on Windows/WSL2 cannot reach ComfyUI running on the other side (host vs WSL or vice-versa).
Cause: WSL2 uses a virtual network adapter. The WSL2 guest and Windows host have different IP addresses.
Solution:
# From WSL2, get the Windows host IP
cat /etc/resolv.conf | grep nameserver | awk '{print $2}'
# Example output: 172.25.192.1
# Set COMFYUI_URL to that IP
COMFYUI_URL=http://172.25.192.1:8188
# Alternatively, if ComfyUI runs inside WSL2 and the gateway is on Windows:
# Find WSL2 IP
wsl hostname -I
# Example output: 172.25.198.5
# Set: COMFYUI_URL=http://172.25.198.5:8188
# Make sure ComfyUI is listening on 0.0.0.0, not just 127.0.0.1
# Launch ComfyUI with: python main.py --listen 0.0.0.0
1e. ComfyUI Not Started or Crashed
Symptom: Port is not listening at all.
Cause: ComfyUI process is not running.
Solution:
# Check if the process is running
# Linux
ps aux | grep "main.py"
# Windows
tasklist | findstr python
# Start ComfyUI
cd /path/to/ComfyUI
python main.py --listen 0.0.0.0 --port 8188
# Check logs for startup errors
python main.py --listen 0.0.0.0 --port 8188 2>&1 | tail -50
# Verify it is accepting connections
curl -s http://127.0.0.1:8188/ && echo "OK" || echo "NOT REACHABLE"
2. OOM (Out of Memory) Errors
The gateway classifies these as COMFYUI_OOM with retryable: false.
2a. Resolution or Batch Size Too Large
Symptom: Job fails with error containing "CUDA out of memory", "allocator backend out of memory", or "failed to allocate".
Cause: The requested image dimensions or batch size exceeds available VRAM.
Solution:
# 1. Reduce resolution in your job request
# Instead of 2048x2048, try 1024x1024 or 768x768
curl -X POST http://localhost:3000/jobs \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"workflowId": "sdxl_realism_v1",
"inputs": {
"prompt": "a mountain landscape",
"width": 1024,
"height": 1024
}
}'
# 2. Reduce batch size to 1
# Set in your job inputs: "batch_size": 1
# 3. Lower the gateway-level limits in .env
MAX_IMAGE_SIZE=1024
MAX_BATCH_SIZE=2
2b. Too Many Steps
Symptom: OOM occurs mid-generation, not immediately at submission.
Cause: The sampler accumulates intermediate tensors over many steps.
Solution:
# Reduce steps in the job inputs
# Instead of 50 steps, try 20-30
curl -X POST http://localhost:3000/jobs \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"workflowId": "sdxl_realism_v1",
"inputs": {
"prompt": "a portrait photo",
"steps": 20,
"width": 1024,
"height": 1024
}
}'
2c. Model Quantization
Symptom: Even at low resolution, OOM errors occur because the model is too large for the GPU (common on 8 GB VRAM cards with SDXL).
Cause: Full-precision (fp32) or half-precision (fp16) model weights exceed available VRAM.
Solution:
# In ComfyUI, use fp8 or quantized checkpoints
# Update your workflow template to use a quantized model:
# e.g., "ckpt_name": "sdxl_base_1.0_fp8.safetensors"
# Or add --fp8_e4m3fn-unet flag when starting ComfyUI
python main.py --listen 0.0.0.0 --fp8_e4m3fn-unet
# Monitor VRAM usage
nvidia-smi -l 2
2d. VAE Tiling
Symptom: OOM happens during the VAE decode step (after sampling completes).
Cause: The VAE decoder processes the entire latent at once, which can be very memory-intensive at high resolutions.
Solution:
Enable VAE tiling in your ComfyUI workflow by adding a "VAEDecodeTiled" node
instead of "VAEDecode". Tile size of 512 is a good default.
In the workflow JSON template:
{
"10": {
"class_type": "VAEDecodeTiled",
"inputs": {
"samples": ["3", 0],
"vae": ["4", 2],
"tile_size": 512
}
}
}
3. Slow Generation
3a. GPU Not Being Utilized
Symptom: Jobs complete but take much longer than expected. GPU utilization stays near 0%.
Cause: ComfyUI is falling back to CPU inference, or the wrong GPU is selected.
Solution:
# 1. Check GPU utilization during a job
nvidia-smi -l 1
# Look for "GPU-Util" column -- should be 80-100% during sampling
# 2. Verify CUDA is available in ComfyUI
# Check ComfyUI startup logs for "Using device: cuda"
# 3. Force GPU selection (multi-GPU systems)
CUDA_VISIBLE_DEVICES=0 python main.py --listen 0.0.0.0
# 4. Verify PyTorch sees the GPU
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
3b. Model Loading on Every Job
Symptom: First job is slow, subsequent jobs with the same workflow are faster, but switching workflows causes long delays.
Cause: ComfyUI loads the model from disk each time a different checkpoint is requested. This can take 10-30 seconds per model load.
Solution:
# 1. Increase ComfyUI's model cache
# Start ComfyUI with a larger cache (default is 1 model):
python main.py --listen 0.0.0.0 --cache-size 3
# 2. Use the same checkpoint across workflows when possible
# Standardize on one checkpoint (e.g., sdxl_base_1.0.safetensors)
# 3. Place models on an SSD, not an HDD
# Move ComfyUI/models/ to an NVMe drive for faster load times
3c. Queue Depth / Concurrency
Symptom: Jobs are queued for a long time before starting.
The job stays in status: "queued" for minutes.
Cause: The worker concurrency is set to 1 (default) and multiple jobs are queued, or the single slot is occupied by a long-running job.
Solution:
# 1. Check current queue state
curl -s http://localhost:3000/jobs?status=queued | jq '.count'
curl -s http://localhost:3000/jobs?status=running | jq '.count'
# 2. Increase concurrency if your GPU can handle it (multi-batch)
# Edit .env:
MAX_CONCURRENCY=2
# WARNING: Only increase if you have enough VRAM for parallel jobs.
# Two concurrent 1024x1024 SDXL jobs need ~20+ GB VRAM.
# 3. For multi-GPU setups, run multiple worker processes
# Terminal 1: CUDA_VISIBLE_DEVICES=0 npm run start:worker
# Terminal 2: CUDA_VISIBLE_DEVICES=1 npm run start:worker
# Both connect to the same Redis queue
3d. ComfyUI Startup Time
Symptom: The very first job after starting ComfyUI takes 30-60 seconds even for a simple generation.
Cause: ComfyUI performs initialization (loading nodes, compiling, warming up CUDA) on the first prompt.
Solution:
# 1. Send a warm-up job immediately after starting ComfyUI
# This is a tiny 64x64 generation that forces initialization
curl -X POST http://localhost:3000/jobs \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"workflowId": "sdxl_realism_v1",
"inputs": {
"prompt": "test",
"width": 64,
"height": 64,
"steps": 1
}
}'
# 2. Increase the gateway timeout to account for cold starts
COMFYUI_TIMEOUT_MS=600000
4. Webhook Failures
Webhook errors appear in logs as WEBHOOK_DELIVERY_FAILED.
4a. DNS Resolution Failure
Symptom: Webhook fails with "getaddrinfo ENOTFOUND" or "DNS lookup failed".
Cause: The callback URL hostname cannot be resolved.
Solution:
# 1. Test DNS resolution from the gateway host
nslookup your-webhook-domain.com
dig your-webhook-domain.com
# 2. If using a local hostname (e.g., within Docker), make sure it is resolvable
# Add to /etc/hosts if needed:
echo "192.168.1.50 my-webhook-server" | sudo tee -a /etc/hosts
# 3. Verify the callback URL is correct in your job request
curl -X POST http://localhost:3000/jobs \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"workflowId": "sdxl_realism_v1",
"inputs": { "prompt": "test" },
"callbackUrl": "https://your-valid-domain.com/webhook"
}'
4b. SSL Certificate Errors
Symptom: Webhook fails with "self signed certificate", "CERT_HAS_EXPIRED", or "unable to verify the first certificate".
Cause: The webhook receiver uses an invalid, expired, or self-signed SSL certificate.
Solution:
# 1. Test the certificate manually
openssl s_client -connect your-webhook-domain.com:443 -servername your-webhook-domain.com < /dev/null 2>&1 | head -20
# 2. Check expiration
echo | openssl s_client -connect your-webhook-domain.com:443 2>/dev/null | openssl x509 -noout -dates
# 3. For development with self-signed certs, set NODE_TLS_REJECT_UNAUTHORIZED
# WARNING: Do NOT use this in production
NODE_TLS_REJECT_UNAUTHORIZED=0 npm run dev
# 4. For production, fix the certificate (use Let's Encrypt or a valid CA)
4c. Webhook Timeout
Symptom: Webhook logs show "AbortError" or "Webhook POST timed out".
Cause: The webhook receiver takes longer than 10 seconds to respond. The gateway has a hardcoded 10-second timeout per webhook attempt with 3 retries and exponential backoff.
Solution:
# 1. Ensure your webhook receiver responds quickly
# The receiver should return 200 immediately and process asynchronously
# BAD: app.post("/webhook", async (req, res) => { await longProcess(); res.send("ok"); })
# GOOD: app.post("/webhook", (req, res) => { res.send("ok"); enqueueWork(req.body); })
# 2. Test receiver response time
time curl -s -o /dev/null -w "%{time_total}" -X POST https://your-webhook.com/callback \
-H "Content-Type: application/json" -d '{"test": true}'
# Should be < 2 seconds
4d. Domain Not in Allowlist
Symptom: Job creation fails with Callback domain "example.com" is not in the allowed domains list.
Cause: WEBHOOK_ALLOWED_DOMAINS is configured and does not include the
callback URL's domain.
Solution:
# 1. Check current setting
grep WEBHOOK_ALLOWED_DOMAINS .env
# 2. Add the domain (comma-separated list)
WEBHOOK_ALLOWED_DOMAINS=your-app.com,n8n.your-domain.com,*.internal.company.com
# 3. Or allow all domains (less secure, suitable for development)
WEBHOOK_ALLOWED_DOMAINS=*
# 4. Restart the gateway
npm run dev
4e. HMAC Signature Mismatch
Symptom: Your webhook receiver receives the POST but HMAC validation fails on your end.
Cause: The WEBHOOK_SECRET configured in the gateway does not match the secret
your receiver uses to validate signatures, or the signature computation differs.
Solution:
# 1. Verify the WEBHOOK_SECRET matches on both sides
grep WEBHOOK_SECRET .env
# 2. The gateway sends: X-Signature: sha256=<hex>
# Computed as: HMAC-SHA256(secret, raw_body_string)
# Verify in Node.js:
node -e "
const crypto = require('crypto');
const secret = 'your-webhook-secret';
const body = '{\"jobId\":\"test\",\"status\":\"succeeded\"}';
const sig = crypto.createHmac('sha256', secret).update(body, 'utf8').digest('hex');
console.log('Expected header: sha256=' + sig);
"
# 3. Common mistakes:
# - Parsing the body before computing HMAC (must use raw string)
# - Using different encodings (gateway uses utf8)
# - Comparing strings case-sensitively (hex is lowercase)
5. Redis Connection Issues
5a. Cannot Connect to Redis
Symptom: Gateway crashes at startup with "Redis connection error" or "ECONNREFUSED" targeting the Redis port.
Cause: Redis server is not running, or the REDIS_URL is wrong.
Solution:
# 1. Check if Redis is running
redis-cli ping
# Expected: PONG
# 2. Verify the URL format
# Correct formats:
# redis://localhost:6379
# redis://:yourpassword@redis-host:6379/0
# rediss://user:password@host:6380/0 (TLS)
# 3. Test connectivity
redis-cli -u "redis://localhost:6379" ping
# 4. If Redis is not needed, remove REDIS_URL to use in-memory queue
# Edit .env:
REDIS_URL=
# The gateway falls back to an in-memory queue automatically
5b. Redis Authentication Failure
Symptom: Error message contains "NOAUTH Authentication required" or "ERR invalid password".
Cause: Redis requires a password but REDIS_URL does not include one,
or the password is wrong.
Solution:
# 1. Include the password in the URL
REDIS_URL=redis://:your_redis_password@localhost:6379/0
# 2. Test with redis-cli
redis-cli -a "your_redis_password" ping
# 3. Check Redis config for requirepass
redis-cli CONFIG GET requirepass
5c. Fallback to In-Memory Queue
Symptom: Logs show "No Redis URL configured, using in-memory queue" and you expected BullMQ.
Cause: REDIS_URL is empty or not set in .env.
Solution:
# 1. Set REDIS_URL in .env
REDIS_URL=redis://localhost:6379
# 2. Verify Redis is running
redis-cli ping
# 3. Restart the gateway
npm run dev
# 4. Confirm in logs: should show "Redis URL configured, using BullMQ worker"
Note
: The in-memory queue is fine for single-instance development deployments. For production with multiple workers or durability requirements, use Redis + BullMQ.
6. Storage Errors
6a. Local Disk Permission Denied
Symptom: Job fails at the output storage step with "EACCES: permission denied"
or STORAGE_READ_ERROR.
Cause: The gateway process does not have write permissions to STORAGE_LOCAL_PATH.
Solution:
# 1. Check the configured path
grep STORAGE_LOCAL_PATH .env
# Default: ./data/outputs
# 2. Ensure the directory exists and is writable
mkdir -p ./data/outputs
chmod 755 ./data/outputs
# 3. Check ownership
ls -la ./data/
# 4. If running as a different user (e.g., in Docker)
chown -R node:node ./data/outputs
# 5. For Docker, mount a volume with correct permissions
# docker run -v /host/path/outputs:/app/data/outputs ...
6b. S3 Credentials Invalid
Symptom: Job fails with STORAGE_S3_PUT_ERROR and the underlying error
mentions "InvalidAccessKeyId", "SignatureDoesNotMatch", or "AccessDenied".
Cause: The S3_ACCESS_KEY / S3_SECRET_KEY are wrong, expired, or the
IAM policy does not grant s3:PutObject permission.
Solution:
# 1. Verify credentials are set
grep S3_ACCESS_KEY .env
grep S3_SECRET_KEY .env
grep S3_BUCKET .env
# 2. Test with AWS CLI
aws s3 ls s3://your-bucket/ \
--endpoint-url http://your-minio:9000 \
--region us-east-1
# 3. Test a put operation
echo "test" > /tmp/test.txt
aws s3 cp /tmp/test.txt s3://your-bucket/test.txt \
--endpoint-url http://your-minio:9000
# 4. Minimum IAM policy for the gateway:
# {
# "Version": "2012-10-17",
# "Statement": [{
# "Effect": "Allow",
# "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject", "s3:ListBucket"],
# "Resource": ["arn:aws:s3:::your-bucket", "arn:aws:s3:::your-bucket/*"]
# }]
# }
6c. MinIO Configuration
Symptom: S3 storage fails with "socket hang up", "ECONNREFUSED", or "Bucket does not exist".
Cause: MinIO endpoint is wrong, the bucket has not been created, or
forcePathStyle is not enabled (handled automatically by the gateway).
Solution:
# 1. Verify MinIO is running
curl http://localhost:9000/minio/health/live
# Expected: HTTP 200
# 2. Set the correct endpoint in .env
S3_ENDPOINT=http://localhost:9000
S3_BUCKET=comfyui-outputs
S3_ACCESS_KEY=minioadmin
S3_SECRET_KEY=minioadmin
S3_REGION=us-east-1
# 3. Create the bucket if it does not exist
# Using mc (MinIO Client)
mc alias set local http://localhost:9000 minioadmin minioadmin
mc mb local/comfyui-outputs
# Or using AWS CLI
aws s3 mb s3://comfyui-outputs --endpoint-url http://localhost:9000
7. Database Issues
7a. SQLite WAL Lock Errors
Symptom: Intermittent "SQLITE_BUSY" or "database is locked" errors under concurrent load.
Cause: Multiple processes or threads are writing to the SQLite database simultaneously. SQLite WAL mode supports concurrent readers but only one writer.
Solution:
# 1. The gateway already sets optimal pragmas:
# journal_mode = WAL
# synchronous = NORMAL
# busy_timeout = 5000 (5 seconds)
# 2. If running multiple gateway instances, switch to Postgres
DATABASE_URL=postgresql://user:password@localhost:5432/comfyui_gateway
# 3. If you must use SQLite with a single instance, increase busy timeout
# (requires code change or env override):
# The default 5000ms should be sufficient for most single-instance use cases
# 4. Check for stuck WAL files
ls -la ./data/gateway.db*
# You should see: gateway.db, gateway.db-wal, gateway.db-shm
# 5. If the database is corrupted, try recovery
sqlite3 ./data/gateway.db "PRAGMA integrity_check;"
# If it reports errors, back up and recreate:
cp ./data/gateway.db ./data/gateway.db.bak
sqlite3 ./data/gateway.db ".recover" | sqlite3 ./data/gateway_recovered.db
7b. Postgres Connection Pooling
Symptom: Errors like "too many clients already", "remaining connection slots are reserved", or intermittent "Connection terminated unexpectedly".
Cause: The gateway opens too many connections to Postgres, exceeding
max_connections, or connections are not being properly returned to the pool.
Solution:
# 1. Check current connections in Postgres
psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'comfyui_gateway';"
# 2. Check max_connections setting
psql -c "SHOW max_connections;"
# 3. Use a connection pooler like PgBouncer
# Install PgBouncer and point DATABASE_URL to it
DATABASE_URL=postgresql://user:password@localhost:6432/comfyui_gateway
# 4. If running multiple gateway instances, ensure the total pool size
# across all instances does not exceed Postgres max_connections
7c. Database URL Format
Symptom: Gateway crashes at startup with "Invalid connection string" or uses SQLite when you intended Postgres.
Cause: The DATABASE_URL format is wrong. The gateway checks if the URL
starts with postgres:// or postgresql:// to select the Postgres backend.
Solution:
# SQLite formats (all valid):
DATABASE_URL=./data/gateway.db
DATABASE_URL=/absolute/path/to/gateway.db
# Postgres formats (must start with postgres:// or postgresql://):
DATABASE_URL=postgresql://user:password@localhost:5432/comfyui_gateway
DATABASE_URL=postgres://user:password@host:5432/dbname?sslmode=require
8. Job Stuck in "running"
8a. ComfyUI Crashed During Execution
Symptom: A job shows status: "running" indefinitely. No progress updates.
The gateway health endpoint may show comfyui.reachable: false.
Cause: ComfyUI crashed (segfault, CUDA error, killed by OOM killer) while processing the job, and the gateway's WebSocket connection was severed.
Solution:
# 1. Check job status
curl -s http://localhost:3000/jobs/<jobId> | jq '.status'
# 2. Check if ComfyUI is still running
curl -s http://localhost:3000/health | jq '.comfyui.reachable'
# 3. If ComfyUI crashed, restart it
cd /path/to/ComfyUI
python main.py --listen 0.0.0.0
# 4. The stuck job will eventually time out (COMFYUI_TIMEOUT_MS, default 5 min)
# and be marked as failed with COMFYUI_TIMEOUT
# 5. To immediately cancel the stuck job
curl -X POST http://localhost:3000/jobs/<jobId>/cancel \
-H "X-API-Key: your-key"
# 6. To reduce timeout for faster failure detection
COMFYUI_TIMEOUT_MS=120000
8b. WebSocket Disconnection
Symptom: Job stays "running" but ComfyUI is actually done. The output exists in ComfyUI's history.
Cause: The WebSocket connection dropped mid-execution, and the polling fallback failed to pick up the result.
Solution:
# 1. Check ComfyUI history directly
curl -s http://127.0.0.1:8188/history | jq 'keys | length'
# 2. The gateway automatically falls back to HTTP polling if WebSocket fails.
# If polling also fails, the job times out.
# 3. Restart the gateway to reset connections
npm run dev
# 4. Check network stability between gateway and ComfyUI
ping -c 10 <comfyui-host>
8c. Restart Recovery
Symptom: After restarting the gateway, jobs that were "running" remain in that state permanently.
Cause: The in-memory queue loses track of running jobs when the process restarts. There is no automatic recovery for in-memory jobs.
Solution:
# 1. For production, use Redis (BullMQ) for durable job queues
REDIS_URL=redis://localhost:6379
# 2. Manually fail stuck jobs via the database
sqlite3 ./data/gateway.db \
"UPDATE jobs SET status='failed', errorJson='{\"code\":\"GATEWAY_RESTART\",\"message\":\"Job interrupted by gateway restart\"}', completedAt=datetime('now') WHERE status='running';"
# 3. Verify
sqlite3 ./data/gateway.db "SELECT id, status FROM jobs WHERE status='running';"
9. Rate Limiting Issues
9a. Identifying You Are Being Rate Limited
Symptom: API returns HTTP 429 with body { "error": "RATE_LIMITED" } and
a Retry-After header.
Cause: You exceeded RATE_LIMIT_MAX requests within the RATE_LIMIT_WINDOW_MS
window. Limits are applied per API key or per IP.
Solution:
# 1. Check the response headers
curl -v http://localhost:3000/health -H "X-API-Key: your-key" 2>&1 | grep -i "x-ratelimit"
# X-RateLimit-Limit: 100
# X-RateLimit-Remaining: 0
# Retry-After: 42
# 2. Wait for the Retry-After period, then retry
# 3. Implement exponential backoff in your client
9b. Adjusting Rate Limits
Symptom: Legitimate usage is being throttled.
Cause: Default limits (100 requests/minute) are too low for your workload.
Solution:
# 1. Increase the limit in .env
RATE_LIMIT_MAX=500
RATE_LIMIT_WINDOW_MS=60000
# 2. For burst workloads, widen the window
RATE_LIMIT_MAX=1000
RATE_LIMIT_WINDOW_MS=300000
# 3. Restart the gateway
npm run dev
# 4. Note: Rate limits are per API key (if authenticated) or per IP.
# Different API keys have independent counters.
9c. Rate Limit Per API Key vs Per IP
Symptom: Different clients sharing the same IP are interfering with each other's rate limits.
Cause: Without API keys, all requests from the same IP share a single rate-limit bucket.
Solution:
# 1. Assign unique API keys to each client
API_KEYS=client1-key:user,client2-key:user,admin-key:admin
# 2. Each client uses its own X-API-Key header
# Client 1: -H "X-API-Key: client1-key"
# Client 2: -H "X-API-Key: client2-key"
# 3. Each key gets its own independent rate-limit counter
10. Authentication Problems
10a. API Key Not Accepted
Symptom: Every request returns HTTP 401 with { "error": "AUTH_FAILED", "message": "Invalid API key" }.
Cause: The X-API-Key header value does not match any entry in API_KEYS.
Solution:
# 1. Check configured keys
grep API_KEYS .env
# Format: key1:admin,key2:user
# 2. Ensure your request uses the exact key (no extra whitespace)
curl -H "X-API-Key: mykey123" http://localhost:3000/health
# 3. Keys are case-sensitive and matched exactly
# 4. If API_KEYS is empty, authentication is DISABLED (development mode)
# All requests are treated as admin. Set keys for production:
API_KEYS=sk-prod-abc123:admin,sk-user-xyz789:user
10b. JWT Token Expired
Symptom: Request returns { "error": "AUTH_FAILED", "message": "JWT token has expired" }.
Cause: The JWT exp claim is in the past.
Solution:
# 1. Decode the JWT to check expiration (without verification)
echo "<your-token>" | cut -d'.' -f2 | base64 -d 2>/dev/null | jq '.exp'
# 2. Compare with current time
date +%s
# 3. Generate a new token with a longer TTL
# Example using Node.js:
node -e "
const crypto = require('crypto');
const secret = 'your-jwt-secret';
const header = Buffer.from(JSON.stringify({alg:'HS256',typ:'JWT'})).toString('base64url');
const payload = Buffer.from(JSON.stringify({
sub: 'user-1',
role: 'admin',
iat: Math.floor(Date.now()/1000),
exp: Math.floor(Date.now()/1000) + 86400 // 24 hours
})).toString('base64url');
const sig = crypto.createHmac('sha256', secret).update(header+'.'+payload).digest('base64url');
console.log(header+'.'+payload+'.'+sig);
"
10c. JWT Signature Invalid
Symptom: Request returns { "error": "AUTH_FAILED", "message": "Invalid JWT signature" }.
Cause: The JWT was signed with a different secret than what is configured in
JWT_SECRET.
Solution:
# 1. Verify the secret matches on token-issuer side and gateway side
grep JWT_SECRET .env
# 2. The gateway uses HMAC-SHA256 (HS256) exclusively
# Make sure your token issuer also uses HS256 with the same secret
# 3. Re-generate the token using the correct secret
10d. No Authentication Header Provided
Symptom: Request returns { "error": "AUTH_FAILED", "message": "Authentication required. Provide X-API-Key header or Authorization: Bearer token." }.
Cause: The request has no X-API-Key header and no Authorization: Bearer
header, and authentication is enabled (API_KEYS or JWT_SECRET is set).
Solution:
# Option A: Use API Key
curl -H "X-API-Key: your-key" http://localhost:3000/health
# Option B: Use JWT Bearer token
curl -H "Authorization: Bearer your.jwt.token" http://localhost:3000/health
# Option C: Disable auth for development (NOT for production)
# Remove all values from API_KEYS and JWT_SECRET in .env:
API_KEYS=
JWT_SECRET=
10e. Insufficient Permissions (Forbidden)
Symptom: Request returns HTTP 403 with { "error": "FORBIDDEN", "message": "Admin role required for this operation" }.
Cause: You are using a user role key to perform an admin-only action
(workflow CRUD).
Solution:
# 1. Check which role your key has
grep API_KEYS .env
# Example: sk-user-key:user,sk-admin-key:admin
# 2. Use the admin key for workflow management
curl -H "X-API-Key: sk-admin-key" -X POST http://localhost:3000/workflows ...
# 3. User role can: create jobs, read own jobs, view health/capabilities
# Admin role can: everything the user can + workflow CRUD + view all jobs
Quick Diagnostic Commands
# Gateway health
curl -s http://localhost:3000/health | jq .
# ComfyUI direct connectivity
curl -s http://127.0.0.1:8188/ | head -5
# Queue status
curl -s http://localhost:3000/jobs?status=queued -H "X-API-Key: KEY" | jq '.count'
curl -s http://localhost:3000/jobs?status=running -H "X-API-Key: KEY" | jq '.count'
# GPU memory
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
# Redis connectivity
redis-cli -u "$REDIS_URL" ping
# SQLite integrity
sqlite3 ./data/gateway.db "PRAGMA integrity_check;"
# Logs (if using pino-pretty)
npm run dev 2>&1 | npx pino-pretty
# Check all configured environment variables
grep -v '^#' .env | grep -v '^$'