firefrost-gaming/antigravity-skills-reference

ProgramadorBrasil 61ec71c5c7 feat: add 52 specialized AI agent skills (#217)

New skills covering 10 categories:

**Security & Audit**: 007 (STRIDE/PASTA/OWASP), cred-omega (secrets management)
**AI Personas**: Karpathy, Hinton, Sutskever, LeCun (4 sub-skills), Altman, Musk, Gates, Jobs, Buffett
**Multi-agent Orchestration**: agent-orchestrator, task-intelligence, multi-advisor
**Code Analysis**: matematico-tao (Terence Tao-inspired mathematical code analysis)
**Social & Messaging**: Instagram Graph API, Telegram Bot, WhatsApp Cloud API, social-orchestrator
**Image Generation**: AI Studio (Gemini), Stability AI, ComfyUI Gateway, image-studio router
**Brazilian Domain**: 6 auction specialist modules, 2 legal advisors, auctioneers data scraper
**Product & Growth**: design, invention, monetization, analytics, growth engine
**DevOps & LLM Ops**: Docker/CI-CD/AWS, RAG/embeddings/fine-tuning
**Skill Governance**: installer, sentinel auditor, context management

Each skill includes:
- Standardized YAML frontmatter (name, description, risk, source, tags, tools)
- Structured sections (Overview, When to Use, How it Works, Best Practices)
- Python scripts and reference documentation where applicable
- Cross-platform compatibility (Claude Code, Antigravity, Cursor, Gemini CLI, Codex CLI)

Co-authored-by: ProgramadorBrasil <214873561+ProgramadorBrasil@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-07 10:04:07 +01:00

ComfyUI Gateway -- Troubleshooting Guide

Comprehensive troubleshooting reference for diagnosing and resolving issues with the ComfyUI Gateway. Every section follows the Symptom -> Cause -> Solution format with concrete commands you can run immediately.

Table of Contents

1. ComfyUI Not Reachable
2. OOM (Out of Memory) Errors
3. Slow Generation
4. Webhook Failures
5. Redis Connection Issues
6. Storage Errors
7. Database Issues
8. Job Stuck in "running"
9. Rate Limiting Issues
10. Authentication Problems
Quick Diagnostic Commands

1. ComfyUI Not Reachable

The gateway returns COMFYUI_UNREACHABLE and the /health endpoint shows comfyui.reachable: false.

1a. Wrong COMFYUI_URL

Symptom: Gateway starts fine but every job fails with COMFYUI_UNREACHABLE. The health endpoint returns { ok: false, comfyui: { reachable: false } }.

Cause: The COMFYUI_URL in .env does not point to a running ComfyUI instance.

Solution:

# 1. Verify what you have configured
grep COMFYUI_URL .env

# 2. Test connectivity from the gateway host
curl -s http://127.0.0.1:8188/
# Expected: HTML page or JSON from ComfyUI

# 3. If ComfyUI is on a different port or host, update .env
# Example: COMFYUI_URL=http://192.168.1.50:8188

# 4. Restart the gateway after changing .env
npm run dev
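
The same checks can be scripted. The following Node.js sketch reads COMFYUI_URL out of .env-style text and probes it with a short timeout. The helper names (readEnvValue, probeComfyUi) and the 3-second timeout are illustrative assumptions, not part of the gateway; fetch and AbortSignal.timeout need Node.js 18 or later.

```javascript
const fs = require("node:fs");

// Hypothetical helper: return the value of KEY=value from .env-style text, or null.
function readEnvValue(envText, key) {
  for (const rawLine of envText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // skip blanks and comments
    const eq = line.indexOf("=");
    if (eq === -1) continue;
    if (line.slice(0, eq).trim() === key) {
      // Strip one pair of matching surrounding quotes, if present.
      return line.slice(eq + 1).trim().replace(/^(['"])(.*)\1$/, "$2");
    }
  }
  return null;
}

// Hypothetical helper: resolves to true when the URL answers with a non-5xx status.
async function probeComfyUi(url, timeoutMs = 3000) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return res.status < 500;
  } catch {
    return false; // connection refused, timeout, DNS failure, invalid URL, ...
  }
}

// Not invoked automatically; call main() from the gateway directory.
async function main() {
  const url = readEnvValue(fs.readFileSync(".env", "utf8"), "COMFYUI_URL");
  if (!url) {
    console.log("COMFYUI_URL is not set in .env");
    return;
  }
  console.log(url, (await probeComfyUi(url)) ? "reachable" : "NOT reachable");
}
```

Running main() from the gateway host tests exactly the URL the gateway will use, which catches a stale or mistyped value that a hand-typed curl to 127.0.0.1:8188 would miss.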

1b. Firewall Blocking the Port

Symptom: curl to the ComfyUI URL times out or returns Connection refused, but ComfyUI is confirmed running on that machine.

Cause: A host firewall (Windows Defender, iptables, ufw) is blocking the port.

Solution:

# Linux (ufw)
sudo ufw allow 8188/tcp
sudo ufw reload

# Linux (iptables)
sudo iptables -A INPUT -p tcp --dport 8188 -j ACCEPT

# Windows (PowerShell, run as Admin)
New-NetFirewallRule -DisplayName "ComfyUI" -Direction Inbound -Port 8188 -Protocol TCP -Action Allow

# Verify the port is listening
# Linux
ss -tlnp | grep 8188
# Windows
netstat -an | findstr 8188

1c. Docker Networking

Symptom: Gateway running inside Docker cannot reach ComfyUI on 127.0.0.1:8188.

Cause: 127.0.0.1 inside a Docker container refers to the container itself, not the host machine.

Solution:

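# Option A: Use Docker's special host DNS name
# (built in on Docker Desktop; on Linux Docker Engine 20.10+ start the container
#  with --add-host=host.docker.internal:host-gateway)
COMFYUI_URL=http://host.docker.internal:8188

# Option B: Use the host network mode (native on Linux; Docker Desktop needs host networking enabled in its settings)
docker run --network host comfyui-gateway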

# Option C: Put both containers on the same Docker network
docker network create comfy-net
docker run --name comfyui --network comfy-net ...
docker run --name gateway --network comfy-net -e COMFYUI_URL=http://comfyui:8188 ...

# Verify from inside the gateway container
docker exec -it gateway sh -c "wget -qO- http://comfyui:8188/ || echo FAIL"

1d. WSL2 Networking

Symptom: Gateway running on Windows/WSL2 cannot reach ComfyUI running on the other side (host vs WSL or vice-versa).

Cause: WSL2 uses a virtual network adapter. The WSL2 guest and Windows host have different IP addresses.

Solution:

# From WSL2, get the Windows host IP
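grep nameserver /etc/resolv.conf | awk '{print $2}'
# If that prints a DNS address rather than the host (newer WSL networking modes), use the default gateway instead:
# ip route show default | awk '{print $3}'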
# Example output: 172.25.192.1

# Set COMFYUI_URL to that IP
COMFYUI_URL=http://172.25.192.1:8188

# Alternatively, if ComfyUI runs inside WSL2 and the gateway is on Windows:
# Find WSL2 IP
wsl hostname -I
# Example output: 172.25.198.5
# Set: COMFYUI_URL=http://172.25.198.5:8188

# Make sure ComfyUI is listening on 0.0.0.0, not just 127.0.0.1
# Launch ComfyUI with: python main.py --listen 0.0.0.0

1e. ComfyUI Not Started or Crashed

Symptom: Port is not listening at all.

Cause: ComfyUI process is not running.

Solution:

# Check if the process is running
# Linux
pgrep -af "main.py"
# Windows
tasklist | findstr python

# Start ComfyUI
cd /path/to/ComfyUI
python main.py --listen 0.0.0.0 --port 8188

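# If ComfyUI exits during startup, capture the last lines of its output
# (tail prints only after the process ends, so this shows nothing while ComfyUI keeps running)
python main.py --listen 0.0.0.0 --port 8188 2>&1 | tail -50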

# Verify it is accepting connections
curl -s http://127.0.0.1:8188/ && echo "OK" || echo "NOT REACHABLE"

2. OOM (Out of Memory) Errors

The gateway classifies these as COMFYUI_OOM with retryable: false.

2a. Resolution or Batch Size Too Large

Symptom: Job fails with error containing "CUDA out of memory", "allocator backend out of memory", or "failed to allocate".

Cause: The requested image dimensions or batch size exceeds available VRAM.

Solution:

# 1. Reduce resolution in your job request
# Instead of 2048x2048, try 1024x1024 or 768x768
curl -X POST http://localhost:3000/jobs \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-key" \
  -d '{
    "workflowId": "sdxl_realism_v1",
    "inputs": {
      "prompt": "a mountain landscape",
      "width": 1024,
      "height": 1024
    }
  }'

# 2. Reduce batch size to 1
# Set in your job inputs: "batch_size": 1

# 3. Lower the gateway-level limits in .env
MAX_IMAGE_SIZE=1024
MAX_BATCH_SIZE=2
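
A client can also shrink oversized requests before submitting them. The sketch below assumes MAX_IMAGE_SIZE caps each side in pixels and MAX_BATCH_SIZE caps batch_size; the gateway itself may reject rather than clamp, and clampJobInputs is a hypothetical helper name.

```javascript
// Hypothetical client-side guard: fit a request inside the configured limits.
// Missing width/height default to the limit and a missing batch_size defaults to 1.
function clampJobInputs(inputs, { maxImageSize = 1024, maxBatchSize = 1 } = {}) {
  const { width = maxImageSize, height = maxImageSize, batch_size = 1 } = inputs;
  // Scale both sides by the same factor so the aspect ratio is preserved.
  const scale = Math.min(1, maxImageSize / Math.max(width, height));
  // Round down to a multiple of 8, since latent diffusion models work on 1/8-scale latents.
  const snap = (n) => Math.max(8, Math.floor((n * scale) / 8) * 8);
  return {
    ...inputs,
    width: snap(width),
    height: snap(height),
    batch_size: Math.min(batch_size, maxBatchSize),
  };
}

console.log(
  clampJobInputs({ prompt: "a mountain landscape", width: 2048, height: 1536, batch_size: 4 })
);
```

With the defaults shown, the 2048x1536 request with batch_size 4 becomes 1024x768 with batch_size 1.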

2b. Too Many Steps

Symptom: OOM occurs mid-generation, not immediately at submission.

Cause: The sampler accumulates intermediate tensors over many steps.

Solution:

# Reduce steps in the job inputs
# Instead of 50 steps, try 20-30
curl -X POST http://localhost:3000/jobs \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-key" \
  -d '{
    "workflowId": "sdxl_realism_v1",
    "inputs": {
      "prompt": "a portrait photo",
      "steps": 20,
      "width": 1024,
      "height": 1024
    }
  }'

2c. Model Quantization

Symptom: Even at low resolution, OOM errors occur because the model is too large for the GPU (common on 8 GB VRAM cards with SDXL).

Cause: Full-precision (fp32) or half-precision (fp16) model weights exceed available VRAM.

Solution:

# In ComfyUI, use fp8 or quantized checkpoints
# Update your workflow template to use a quantized model:
# e.g., "ckpt_name": "sdxl_base_1.0_fp8.safetensors"

# Or add --fp8_e4m3fn-unet flag when starting ComfyUI
python main.py --listen 0.0.0.0 --fp8_e4m3fn-unet

# Monitor VRAM usage
nvidia-smi -l 2

2d. VAE Tiling

Symptom: OOM happens during the VAE decode step (after sampling completes).

Cause: The VAE decoder processes the entire latent at once, which can be very memory-intensive at high resolutions.

Solution:

Enable VAE tiling in your ComfyUI workflow by adding a "VAEDecodeTiled" node
instead of "VAEDecode". Tile size of 512 is a good default.

In the workflow JSON template:
{
  "10": {
    "class_type": "VAEDecodeTiled",
    "inputs": {
      "samples": ["3", 0],
      "vae": ["4", 2],
      "tile_size": 512
    }
  }
}
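
If several workflow templates need the same change, the swap can be applied programmatically. This is a minimal sketch that assumes API-format workflow JSON (node id mapped to class_type and inputs, as in the template above); useTiledVaeDecode is a hypothetical helper, and depending on the ComfyUI version VAEDecodeTiled may expect further inputs, so check the node definition in your install.

```javascript
// Return a copy of an API-format workflow with every VAEDecode node replaced
// by VAEDecodeTiled. tileSize mirrors the 512 default suggested above.
function useTiledVaeDecode(workflow, tileSize = 512) {
  const patched = structuredClone(workflow); // global in Node.js 17+
  for (const node of Object.values(patched)) {
    if (node.class_type === "VAEDecode") {
      node.class_type = "VAEDecodeTiled";
      node.inputs = { ...node.inputs, tile_size: tileSize };
    }
  }
  return patched;
}

const exampleWorkflow = {
  "10": { class_type: "VAEDecode", inputs: { samples: ["3", 0], vae: ["4", 2] } },
};
console.log(JSON.stringify(useTiledVaeDecode(exampleWorkflow), null, 2));
```

The original object is left untouched, so the untiled template remains available for small resolutions where tiling only adds overhead.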

3. Slow Generation

3a. GPU Not Being Utilized

Symptom: Jobs complete but take much longer than expected. GPU utilization stays near 0%.

Cause: ComfyUI is falling back to CPU inference, or the wrong GPU is selected.

Solution:

# 1. Check GPU utilization during a job
nvidia-smi -l 1
# Look for "GPU-Util" column -- should be 80-100% during sampling

# 2. Verify CUDA is available in ComfyUI
# Check ComfyUI startup logs for "Using device: cuda"

# 3. Force GPU selection (multi-GPU systems)
CUDA_VISIBLE_DEVICES=0 python main.py --listen 0.0.0.0

# 4. Verify PyTorch sees the GPU
python -c "import torch; ok = torch.cuda.is_available(); print(ok); print(torch.cuda.get_device_name(0) if ok else 'no CUDA device')"

3b. Model Loading on Every Job

Symptom: First job is slow, subsequent jobs with the same workflow are faster, but switching workflows causes long delays.

Cause: ComfyUI loads the model from disk each time a different checkpoint is requested. This can take 10-30 seconds per model load.

Solution:

# 1. Increase ComfyUI's model cache
# Start ComfyUI with a larger cache (default is 1 model).
# Caching flags differ between ComfyUI versions; confirm this one with: python main.py --help
python main.py --listen 0.0.0.0 --cache-size 3

# 2. Use the same checkpoint across workflows when possible
# Standardize on one checkpoint (e.g., sdxl_base_1.0.safetensors)

# 3. Place models on an SSD, not an HDD
# Move ComfyUI/models/ to an NVMe drive for faster load times

3c. Queue Depth / Concurrency

Symptom: Jobs are queued for a long time before starting. The job stays in status: "queued" for minutes.

Cause: The worker concurrency is set to 1 (default) and multiple jobs are queued, or the single slot is occupied by a long-running job.

Solution:

# 1. Check current queue state
curl -s "http://localhost:3000/jobs?status=queued" -H "X-API-Key: your-key" | jq '.count'
curl -s "http://localhost:3000/jobs?status=running" -H "X-API-Key: your-key" | jq '.count'

# 2. Increase concurrency if your GPU can handle it (multi-batch)
# Edit .env:
MAX_CONCURRENCY=2

# WARNING: Only increase if you have enough VRAM for parallel jobs.
# Two concurrent 1024x1024 SDXL jobs need ~20+ GB VRAM.

# 3. For multi-GPU setups, run multiple worker processes
# Terminal 1: CUDA_VISIBLE_DEVICES=0 npm run start:worker
# Terminal 2: CUDA_VISIBLE_DEVICES=1 npm run start:worker
# Both connect to the same Redis queue

3d. ComfyUI Startup Time

Symptom: The very first job after starting ComfyUI takes 30-60 seconds even for a simple generation.

Cause: ComfyUI performs initialization (loading nodes, compiling, warming up CUDA) on the first prompt.

Solution:

# 1. Send a warm-up job immediately after starting ComfyUI
# This is a tiny 64x64 generation that forces initialization
curl -X POST http://localhost:3000/jobs \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-key" \
  -d '{
    "workflowId": "sdxl_realism_v1",
    "inputs": {
      "prompt": "test",
      "width": 64,
      "height": 64,
      "steps": 1
    }
  }'

# 2. Increase the gateway timeout to account for cold starts (600000 ms = 10 minutes)
COMFYUI_TIMEOUT_MS=600000

4. Webhook Failures

Webhook errors appear in logs as WEBHOOK_DELIVERY_FAILED.

4a. DNS Resolution Failure

Symptom: Webhook fails with "getaddrinfo ENOTFOUND" or "DNS lookup failed".

Cause: The callback URL hostname cannot be resolved.

Solution:

# 1. Test DNS resolution from the gateway host
nslookup your-webhook-domain.com
dig your-webhook-domain.com

# 2. If using a local hostname (e.g., within Docker), make sure it is resolvable
# Add to /etc/hosts if needed:
echo "192.168.1.50 my-webhook-server" | sudo tee -a /etc/hosts

# 3. Verify the callback URL is correct in your job request
curl -X POST http://localhost:3000/jobs \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-key" \
  -d '{
    "workflowId": "sdxl_realism_v1",
    "inputs": { "prompt": "test" },
    "callbackUrl": "https://your-valid-domain.com/webhook"
  }'

4b. SSL Certificate Errors

Symptom: Webhook fails with "self signed certificate", "CERT_HAS_EXPIRED", or "unable to verify the first certificate".

Cause: The webhook receiver uses an invalid, expired, or self-signed SSL certificate.

Solution:

# 1. Test the certificate manually
openssl s_client -connect your-webhook-domain.com:443 -servername your-webhook-domain.com < /dev/null 2>&1 | head -20

# 2. Check expiration
echo | openssl s_client -connect your-webhook-domain.com:443 -servername your-webhook-domain.com 2>/dev/null | openssl x509 -noout -dates

# 3. For development with self-signed certs, set NODE_TLS_REJECT_UNAUTHORIZED
# WARNING: Do NOT use this in production
NODE_TLS_REJECT_UNAUTHORIZED=0 npm run dev

# 4. For production, fix the certificate (use Let's Encrypt or a valid CA)

4c. Webhook Timeout

Symptom: Webhook logs show "AbortError" or "Webhook POST timed out".

Cause: The webhook receiver takes longer than 10 seconds to respond. The gateway has a hardcoded 10-second timeout per webhook attempt with 3 retries and exponential backoff.

Solution:

# 1. Ensure your webhook receiver responds quickly
# The receiver should return 200 immediately and process asynchronously
# BAD:  app.post("/webhook", async (req, res) => { await longProcess(); res.send("ok"); })
# GOOD: app.post("/webhook", (req, res) => { res.send("ok"); enqueueWork(req.body); })

# 2. Test receiver response time
curl -s -o /dev/null -w "%{time_total}s\n" -X POST https://your-webhook.com/callback \
  -H "Content-Type: application/json" -d '{"test": true}'
# Should be < 2 seconds
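
To see why a slow receiver is costly, it helps to model the sender. The sketch below follows the description above (10-second timeout per attempt, 3 retries, exponential backoff) but is not the gateway's code: the 1-second base delay, the reading of "3 retries" as one initial attempt plus three more, and the names backoffDelayMs and deliverWebhook are assumptions.

```javascript
// Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ...
// The 1-second base is an assumption; the guide does not state the real value.
function backoffDelayMs(attempt, baseMs = 1000) {
  return baseMs * 2 ** (attempt - 1);
}

// Hypothetical sender: one initial attempt plus up to `retries` retries,
// each attempt aborted after `timeoutMs` (fetch needs Node.js 18+).
async function deliverWebhook(url, payload, { retries = 3, timeoutMs = 10_000, baseMs = 1000 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    if (attempt > 0) {
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt, baseMs)));
    }
    try {
      const res = await fetch(url, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(payload),
        signal: AbortSignal.timeout(timeoutMs), // gives up on a slow receiver
      });
      if (res.ok) return true;
    } catch {
      // Timeout or network error: fall through to the next attempt.
    }
  }
  return false; // the case the logs report as WEBHOOK_DELIVERY_FAILED
}

console.log([1, 2, 3].map((n) => backoffDelayMs(n))); // [ 1000, 2000, 4000 ]
```

Under these assumptions, a receiver that always times out occupies the sender for about 47 seconds (four 10-second timeouts plus 1 + 2 + 4 seconds of backoff), which is why the receiver should acknowledge first and process afterwards.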

4d. Domain Not in Allowlist

Symptom: Job creation fails with the message: Callback domain "example.com" is not in the allowed domains list.

Cause: WEBHOOK_ALLOWED_DOMAINS is configured and does not include the callback URL's domain.

Solution:

# 1. Check current setting
grep WEBHOOK_ALLOWED_DOMAINS .env

# 2. Add the domain (comma-separated list)
WEBHOOK_ALLOWED_DOMAINS=your-app.com,n8n.your-domain.com,*.internal.company.com

# 3. Or allow all domains (less secure, suitable for development)
WEBHOOK_ALLOWED_DOMAINS=*

# 4. Restart the gateway
npm run dev
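
Before editing .env, it can help to test a callback URL against a candidate allowlist. The matcher below is one plausible reading of the list format shown above (exact hostnames, `*.` prefix wildcards, and a bare `*`); the gateway's actual matching rules are not documented here, and isCallbackAllowed is a hypothetical name.

```javascript
// Hypothetical matcher for a comma-separated allowlist such as
// "your-app.com,n8n.your-domain.com,*.internal.company.com" or "*".
function isCallbackAllowed(callbackUrl, allowedDomains) {
  const host = new URL(callbackUrl).hostname.toLowerCase();
  const patterns = allowedDomains
    .split(",")
    .map((p) => p.trim().toLowerCase())
    .filter(Boolean);
  return patterns.some((pattern) => {
    if (pattern === "*") return true; // allow everything
    if (pattern.startsWith("*.")) {
      // "*.a.com" matches "x.a.com" but not "a.com" or "evila.com"
      return host.endsWith(pattern.slice(1));
    }
    return host === pattern; // exact hostname match
  });
}

const allowlist = "your-app.com,n8n.your-domain.com,*.internal.company.com";
console.log(isCallbackAllowed("https://your-app.com/webhook", allowlist)); // true
console.log(isCallbackAllowed("https://api.internal.company.com/hook", allowlist)); // true
console.log(isCallbackAllowed("https://example.com/webhook", allowlist)); // false
```

Note that the comparison is on the hostname only, so the port and path of the callback URL do not affect the result in this sketch.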

4e. HMAC Signature Mismatch

Symptom: Your webhook receiver receives the POST but HMAC validation fails on your end.

Cause: The WEBHOOK_SECRET configured in the gateway does not match the secret your receiver uses to validate signatures, or the signature computation differs.

Solution:

# 1. Verify the WEBHOOK_SECRET matches on both sides
grep WEBHOOK_SECRET .env

# 2. The gateway sends: X-Signature: sha256=<hex>
# Computed as: HMAC-SHA256(secret, raw_body_string)
# Verify in Node.js:
node -e "
const crypto = require('crypto');
const secret = 'your-webhook-secret';
const body = '{\"jobId\":\"test\",\"status\":\"succeeded\"}';
const sig = crypto.createHmac('sha256', secret).update(body, 'utf8').digest('hex');
console.log('Expected header: sha256=' + sig);
"

# 3. Common mistakes:
# - Parsing the body before computing HMAC (must use raw string)
# - Using different encodings (gateway uses utf8)
# - Comparing strings case-sensitively (hex is lowercase)
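
On the receiving side, the check can be written so that it avoids the mistakes listed above. This is a minimal sketch assuming the header format described here (X-Signature: sha256=<hex> over the raw body); verifyWebhookSignature is a hypothetical helper, and the constant-time comparison is an addition on top of what the guide describes.

```javascript
const crypto = require("node:crypto");

// Verify an "X-Signature: sha256=<hex>" header against the raw request body.
// rawBody must be the exact text received, before any JSON parsing.
function verifyWebhookSignature(rawBody, signatureHeader, secret) {
  const expected =
    "sha256=" + crypto.createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
  const a = Buffer.from(expected, "utf8");
  const b = Buffer.from(String(signatureHeader || "").toLowerCase(), "utf8");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}

const body = '{"jobId":"test","status":"succeeded"}';
const header =
  "sha256=" + crypto.createHmac("sha256", "your-webhook-secret").update(body, "utf8").digest("hex");
console.log(verifyWebhookSignature(body, header, "your-webhook-secret")); // true
console.log(verifyWebhookSignature(body + " ", header, "your-webhook-secret")); // false
```

The second call fails because of a single trailing space, which is what happens when a framework parses and re-serializes the JSON before the check; capture the raw body instead (in Express, for example, with express.raw({ type: "application/json" })).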

5. Redis Connection Issues

5a. Cannot Connect to Redis

Symptom: Gateway crashes at startup with "Redis connection error" or "ECONNREFUSED" targeting the Redis port.

Cause: Redis server is not running, or the REDIS_URL is wrong.

Solution:

# 1. Check if Redis is running
redis-cli ping
# Expected: PONG

# 2. Verify the URL format
# Correct formats:
#   redis://localhost:6379
#   redis://:yourpassword@redis-host:6379/0
#   rediss://user:password@host:6380/0  (TLS)

# 3. Test connectivity
redis-cli -u "redis://localhost:6379" ping

# 4. If Redis is not needed, remove REDIS_URL to use in-memory queue
# Edit .env:
REDIS_URL=
# The gateway falls back to an in-memory queue automatically
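
A malformed REDIS_URL is easy to spot by parsing it before starting the gateway. The sketch below uses Node's built-in URL class to break the formats listed above into their parts; describeRedisUrl is a hypothetical helper, and the defaults (port 6379, database 0) are the usual Redis conventions rather than something read from the gateway.

```javascript
// Break a Redis URL into the parts a client needs.
// Returns null for an empty value, the case where the in-memory queue is used.
function describeRedisUrl(redisUrl) {
  if (!redisUrl) return null;
  const u = new URL(redisUrl); // throws TypeError on a malformed URL
  if (u.protocol !== "redis:" && u.protocol !== "rediss:") {
    throw new Error(`Unsupported scheme "${u.protocol}" (expected redis:// or rediss://)`);
  }
  return {
    tls: u.protocol === "rediss:",
    host: u.hostname,
    port: Number(u.port || 6379),
    db: Number(u.pathname.replace(/^\//, "") || 0),
    hasPassword: u.password !== "",
  };
}

console.log(describeRedisUrl("redis://localhost:6379"));
console.log(describeRedisUrl("redis://:yourpassword@redis-host:6379/0"));
console.log(describeRedisUrl("rediss://user:password@host:6380/0"));
```

The output reports only whether a password is present, not the password itself, so it is safe to paste into a bug report.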

5b. Redis Authentication Failure

Symptom: Error message contains "NOAUTH Authentication required" or "ERR invalid password".

Cause: Redis requires a password but REDIS_URL does not include one, or the password is wrong.

Solution:

# 1. Include the password in the URL
REDIS_URL=redis://:your_redis_password@localhost:6379/0

# 2. Test with redis-cli
redis-cli -a "your_redis_password" ping

# 3. Check Redis config for requirepass (needs an authenticated connection when a password is set)
redis-cli -a "your_redis_password" CONFIG GET requirepass

5c. Fallback to In-Memory Queue

Symptom: Logs show "No Redis URL configured, using in-memory queue" and you expected BullMQ.

Cause: REDIS_URL is empty or not set in .env.

Solution:

# 1. Set REDIS_URL in .env
REDIS_URL=redis://localhost:6379

# 2. Verify Redis is running
redis-cli ping

# 3. Restart the gateway
npm run dev

# 4. Confirm in logs: should show "Redis URL configured, using BullMQ worker"

Note: The in-memory queue is fine for single-instance development deployments. For production with multiple workers or durability requirements, use Redis + BullMQ.

6. Storage Errors

6a. Local Disk Permission Denied

Symptom: Job fails at the output storage step with "EACCES: permission denied" or STORAGE_READ_ERROR.

Cause: The gateway process does not have write permissions to STORAGE_LOCAL_PATH.

Solution:

# 1. Check the configured path
grep STORAGE_LOCAL_PATH .env
# Default: ./data/outputs

# 2. Ensure the directory exists and is writable
mkdir -p ./data/outputs
chmod 755 ./data/outputs

# 3. Check ownership
ls -la ./data/

# 4. If running as a different user (e.g., in Docker)
chown -R node:node ./data/outputs

# 5. For Docker, mount a volume with correct permissions
# docker run -v /host/path/outputs:/app/data/outputs ...

6b. S3 Credentials Invalid

Symptom: Job fails with STORAGE_S3_PUT_ERROR and the underlying error mentions "InvalidAccessKeyId", "SignatureDoesNotMatch", or "AccessDenied".

Cause: The S3_ACCESS_KEY / S3_SECRET_KEY are wrong, expired, or the IAM policy does not grant s3:PutObject permission.

Solution:

# 1. Verify credentials are set
grep S3_ACCESS_KEY .env
grep S3_SECRET_KEY .env
grep S3_BUCKET .env

# 2. Test with AWS CLI
aws s3 ls s3://your-bucket/ \
  --endpoint-url http://your-minio:9000 \
  --region us-east-1

# 3. Test a put operation
echo "test" > /tmp/test.txt
aws s3 cp /tmp/test.txt s3://your-bucket/test.txt \
  --endpoint-url http://your-minio:9000

# 4. Minimum IAM policy for the gateway:
# {
#   "Version": "2012-10-17",
#   "Statement": [{
#     "Effect": "Allow",
#     "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject", "s3:ListBucket"],
#     "Resource": ["arn:aws:s3:::your-bucket", "arn:aws:s3:::your-bucket/*"]
#   }]
# }

6c. MinIO Configuration

Symptom: S3 storage fails with "socket hang up", "ECONNREFUSED", or "Bucket does not exist".

Cause: MinIO endpoint is wrong, the bucket has not been created, or forcePathStyle is not enabled (handled automatically by the gateway).

Solution:

# 1. Verify MinIO is running
curl http://localhost:9000/minio/health/live
# Expected: HTTP 200

# 2. Set the correct endpoint in .env
S3_ENDPOINT=http://localhost:9000
S3_BUCKET=comfyui-outputs
S3_ACCESS_KEY=minioadmin
S3_SECRET_KEY=minioadmin
S3_REGION=us-east-1

# 3. Create the bucket if it does not exist
# Using mc (MinIO Client)
mc alias set local http://localhost:9000 minioadmin minioadmin
mc mb local/comfyui-outputs

# Or using AWS CLI
aws s3 mb s3://comfyui-outputs --endpoint-url http://localhost:9000

7. Database Issues

7a. SQLite WAL Lock Errors

Symptom: Intermittent "SQLITE_BUSY" or "database is locked" errors under concurrent load.

Cause: Multiple processes or threads are writing to the SQLite database simultaneously. SQLite WAL mode supports concurrent readers but only one writer.

Solution:

# 1. The gateway already sets optimal pragmas:
#    journal_mode = WAL
#    synchronous = NORMAL
#    busy_timeout = 5000 (5 seconds)

# 2. If running multiple gateway instances, switch to Postgres
DATABASE_URL=postgresql://user:password@localhost:5432/comfyui_gateway

# 3. If you stay on SQLite with a single instance, the default 5000 ms busy timeout
# should be sufficient for most use cases. Raising it requires a code change or
# env override.

# 4. Check for stuck WAL files
ls -la ./data/gateway.db*
# You should see: gateway.db, gateway.db-wal, gateway.db-shm

# 5. If the database is corrupted, try recovery
sqlite3 ./data/gateway.db "PRAGMA integrity_check;"
# If it reports errors, back up and recreate:
cp ./data/gateway.db ./data/gateway.db.bak
sqlite3 ./data/gateway.db ".recover" | sqlite3 ./data/gateway_recovered.db

7b. Postgres Connection Pooling

Symptom: Errors like "too many clients already", "remaining connection slots are reserved", or intermittent "Connection terminated unexpectedly".

Cause: The gateway opens too many connections to Postgres, exceeding max_connections, or connections are not being properly returned to the pool.

Solution:

# 1. Check current connections in Postgres
psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'comfyui_gateway';"

# 2. Check max_connections setting
psql -c "SHOW max_connections;"

# 3. Use a connection pooler like PgBouncer
# Install PgBouncer and point DATABASE_URL to it
DATABASE_URL=postgresql://user:password@localhost:6432/comfyui_gateway

# 4. If running multiple gateway instances, ensure the total pool size
# across all instances does not exceed Postgres max_connections

7c. Database URL Format

Symptom: Gateway crashes at startup with "Invalid connection string" or uses SQLite when you intended Postgres.

Cause: The DATABASE_URL format is wrong. The gateway checks if the URL starts with postgres:// or postgresql:// to select the Postgres backend.

Solution:

# SQLite formats (all valid):
DATABASE_URL=./data/gateway.db
DATABASE_URL=/absolute/path/to/gateway.db

# Postgres formats (must start with postgres:// or postgresql://):
DATABASE_URL=postgresql://user:password@localhost:5432/comfyui_gateway
DATABASE_URL=postgres://user:password@host:5432/dbname?sslmode=require
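
The selection rule described above is simple enough to reproduce, which makes it easy to check why a given value ends up on SQLite. The sketch below implements only the prefix test from the Cause paragraph; selectDatabaseBackend is a hypothetical name, and whether the gateway trims whitespace or quotes first is unknown, so the sketch does neither.

```javascript
// A postgres:// or postgresql:// prefix selects Postgres;
// anything else is treated as a SQLite file path.
function selectDatabaseBackend(databaseUrl) {
  const value = databaseUrl ?? "";
  return value.startsWith("postgres://") || value.startsWith("postgresql://")
    ? "postgres"
    : "sqlite";
}

const candidates = [
  "./data/gateway.db",
  "postgresql://user:password@localhost:5432/comfyui_gateway",
  "pgsql://user:password@localhost:5432/comfyui_gateway", // wrong scheme
  '"postgres://user:password@host:5432/dbname"', // quotes kept as part of the value
];
for (const url of candidates) {
  console.log(selectDatabaseBackend(url).padEnd(8), url);
}
```

The last two candidates fall through to SQLite under a strict prefix test, which matches the symptom of the gateway silently using SQLite when Postgres was intended.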

8. Job Stuck in "running"

8a. ComfyUI Crashed During Execution

Symptom: A job shows status: "running" indefinitely. No progress updates. The gateway health endpoint may show comfyui.reachable: false.

Cause: ComfyUI crashed (segfault, CUDA error, killed by OOM killer) while processing the job, and the gateway's WebSocket connection was severed.

Solution:

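# 1. Check job status (replace <jobId>; the quotes stop the shell from treating < and > as redirection)
curl -s "http://localhost:3000/jobs/<jobId>" -H "X-API-Key: your-key" | jq '.status'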

# 2. Check if ComfyUI is still running
curl -s http://localhost:3000/health | jq '.comfyui.reachable'

# 3. If ComfyUI crashed, restart it
cd /path/to/ComfyUI
python main.py --listen 0.0.0.0

# 4. The stuck job will eventually time out (COMFYUI_TIMEOUT_MS, default 5 min)
# and be marked as failed with COMFYUI_TIMEOUT

# 5. To immediately cancel the stuck job
curl -X POST "http://localhost:3000/jobs/<jobId>/cancel" \
  -H "X-API-Key: your-key"

# 6. To reduce timeout for faster failure detection
COMFYUI_TIMEOUT_MS=120000

8b. WebSocket Disconnection

Symptom: Job stays "running" but ComfyUI is actually done. The output exists in ComfyUI's history.

Cause: The WebSocket connection dropped mid-execution, and the polling fallback failed to pick up the result.

Solution:

# 1. Check ComfyUI history directly
curl -s http://127.0.0.1:8188/history | jq 'keys | length'

# 2. The gateway automatically falls back to HTTP polling if WebSocket fails.
# If polling also fails, the job times out.

# 3. Restart the gateway to reset connections
npm run dev

# 4. Check network stability between gateway and ComfyUI (replace comfyui-host with the real hostname or IP)
ping -c 10 comfyui-host

8c. Restart Recovery

Symptom: After restarting the gateway, jobs that were "running" remain in that state permanently.

Cause: The in-memory queue loses track of running jobs when the process restarts. There is no automatic recovery for in-memory jobs.

Solution:

# 1. For production, use Redis (BullMQ) for durable job queues
REDIS_URL=redis://localhost:6379

# 2. Manually fail stuck jobs via the database
sqlite3 ./data/gateway.db \
  "UPDATE jobs SET status='failed', errorJson='{\"code\":\"GATEWAY_RESTART\",\"message\":\"Job interrupted by gateway restart\"}', completedAt=datetime('now') WHERE status='running';"

# 3. Verify
sqlite3 ./data/gateway.db "SELECT id, status FROM jobs WHERE status='running';"

9. Rate Limiting Issues

9a. Identifying You Are Being Rate Limited

Symptom: API returns HTTP 429 with body { "error": "RATE_LIMITED" } and a Retry-After header.

Cause: You exceeded RATE_LIMIT_MAX requests within the RATE_LIMIT_WINDOW_MS window. Limits are applied per API key or per IP.

Solution:

# 1. Check the response headers
curl -sv http://localhost:3000/health -H "X-API-Key: your-key" 2>&1 | grep -iE "x-ratelimit|retry-after"
# X-RateLimit-Limit: 100
# X-RateLimit-Remaining: 0
# Retry-After: 42

# 2. Wait for the Retry-After period, then retry

# 3. Implement exponential backoff in your client
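
Steps 2 and 3 can be combined in one client helper: honor Retry-After when the server sends it and fall back to capped exponential backoff otherwise. This is a sketch, not part of the gateway; retryDelayMs, fetchWithRetry, the 500 ms base, the 30-second cap, and the 5-attempt limit are all illustrative choices.

```javascript
// Decide how long to wait before retrying a request that returned HTTP 429.
// Prefers a numeric Retry-After (seconds); otherwise exponential backoff with a cap.
function retryDelayMs(retryAfterHeader, attempt, { baseMs = 500, maxMs = 30_000 } = {}) {
  const seconds = Number(retryAfterHeader);
  const usable =
    retryAfterHeader != null && retryAfterHeader !== "" && Number.isFinite(seconds) && seconds >= 0;
  if (usable) return seconds * 1000;
  // No header, or an HTTP-date form this sketch does not parse: back off instead.
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical wrapper around fetch (Node.js 18+) that retries only on 429.
async function fetchWithRetry(url, options = {}, maxAttempts = 5) {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, options);
    if (res.status !== 429 || attempt + 1 >= maxAttempts) return res;
    const delay = retryDelayMs(res.headers.get("retry-after"), attempt);
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
}

console.log(retryDelayMs("42", 0)); // 42000
console.log(retryDelayMs(null, 3)); // 4000
```

A call such as fetchWithRetry("http://localhost:3000/jobs", { headers: { "X-API-Key": "your-key" } }) then waits out the limit instead of hammering the gateway and extending the throttling.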

9b. Adjusting Rate Limits

Symptom: Legitimate usage is being throttled.

Cause: Default limits (100 requests/minute) are too low for your workload.

Solution:

# 1. Increase the limit in .env
RATE_LIMIT_MAX=500
RATE_LIMIT_WINDOW_MS=60000

# 2. For burst workloads, widen the window
RATE_LIMIT_MAX=1000
RATE_LIMIT_WINDOW_MS=300000

# 3. Restart the gateway
npm run dev

# 4. Note: Rate limits are per API key (if authenticated) or per IP.
# Different API keys have independent counters.

9c. Rate Limit Per API Key vs Per IP

Symptom: Different clients sharing the same IP are interfering with each other's rate limits.

Cause: Without API keys, all requests from the same IP share a single rate-limit bucket.

Solution:

# 1. Assign unique API keys to each client
API_KEYS=client1-key:user,client2-key:user,admin-key:admin

# 2. Each client uses its own X-API-Key header
# Client 1: -H "X-API-Key: client1-key"
# Client 2: -H "X-API-Key: client2-key"

# 3. Each key gets its own independent rate-limit counter

10. Authentication Problems

10a. API Key Not Accepted

Symptom: Every request returns HTTP 401 with { "error": "AUTH_FAILED", "message": "Invalid API key" }.

Cause: The X-API-Key header value does not match any entry in API_KEYS.

Solution:

# 1. Check configured keys
grep API_KEYS .env
# Format: key1:admin,key2:user

# 2. Ensure your request uses the exact key (no extra whitespace)
curl -H "X-API-Key: mykey123" http://localhost:3000/health

# 3. Keys are case-sensitive and matched exactly

# 4. If API_KEYS is empty, authentication is DISABLED (development mode)
# All requests are treated as admin. Set keys for production:
API_KEYS=sk-prod-abc123:admin,sk-user-xyz789:user
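
Stray whitespace or a missing role in API_KEYS is a common reason a key is rejected. The sketch below parses the key:role format shown above and reports malformed entries by position, without echoing the key. parseApiKeys is a hypothetical helper; that entries are trimmed and that admin and user are the only roles are assumptions based on the examples in this guide.

```javascript
// Parse "key1:admin,key2:user" into a Map of key -> role.
// An empty value yields an empty Map (authentication disabled, per the note above).
function parseApiKeys(value) {
  const keys = new Map();
  const entries = (value || "").split(",");
  entries.forEach((entry, index) => {
    const trimmed = entry.trim();
    if (trimmed === "") return; // tolerate trailing commas
    const sep = trimmed.lastIndexOf(":");
    if (sep <= 0 || sep === trimmed.length - 1) {
      throw new Error(`API_KEYS entry ${index + 1} is malformed (expected key:role)`);
    }
    const role = trimmed.slice(sep + 1);
    if (role !== "admin" && role !== "user") {
      throw new Error(`API_KEYS entry ${index + 1} has unknown role "${role}"`);
    }
    keys.set(trimmed.slice(0, sep), role);
  });
  return keys;
}

const parsed = parseApiKeys("sk-prod-abc123:admin,sk-user-xyz789:user");
console.log(parsed.get("sk-prod-abc123")); // admin
console.log(parsed.has("SK-PROD-ABC123")); // false (keys are case-sensitive)
```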

10b. JWT Token Expired

Symptom: Request returns { "error": "AUTH_FAILED", "message": "JWT token has expired" }.

Cause: The JWT exp claim is in the past.

Solution:

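# 1. Decode the JWT to check expiration (without verification)
# JWT segments are unpadded base64url, which plain `base64 -d` often rejects, so decode with Node.js:
node -e "console.log(JSON.parse(Buffer.from(process.argv[1].split('.')[1], 'base64url').toString()).exp)" "<your-token>"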

# 2. Compare with current time
date +%s

# 3. Generate a new token with a longer TTL
# Example using Node.js:
node -e "
const crypto = require('crypto');
const secret = 'your-jwt-secret';
const header = Buffer.from(JSON.stringify({alg:'HS256',typ:'JWT'})).toString('base64url');
const payload = Buffer.from(JSON.stringify({
  sub: 'user-1',
  role: 'admin',
  iat: Math.floor(Date.now()/1000),
  exp: Math.floor(Date.now()/1000) + 86400  // 24 hours
})).toString('base64url');
const sig = crypto.createHmac('sha256', secret).update(header+'.'+payload).digest('base64url');
console.log(header+'.'+payload+'.'+sig);
"

10c. JWT Signature Invalid

Symptom: Request returns { "error": "AUTH_FAILED", "message": "Invalid JWT signature" }.

Cause: The JWT was signed with a different secret than what is configured in JWT_SECRET.

Solution:

# 1. Verify the secret matches on token-issuer side and gateway side
grep JWT_SECRET .env

# 2. The gateway uses HMAC-SHA256 (HS256) exclusively
# Make sure your token issuer also uses HS256 with the same secret

# 3. Re-generate the token using the correct secret
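
To tell an expired token from a badly signed one without sending it to the gateway, the two checks can be run locally. The sketch below verifies an HS256 signature with Node's crypto module and then inspects exp. diagnoseJwt and its return strings are hypothetical, and the order of checks is an assumption rather than the gateway's documented behavior.

```javascript
const crypto = require("node:crypto");

// Returns "ok", "malformed", "unsupported alg", "invalid signature", or "expired".
function diagnoseJwt(token, secret, nowSeconds = Math.floor(Date.now() / 1000)) {
  const parts = String(token).split(".");
  if (parts.length !== 3) return "malformed";
  const [header, payload, signature] = parts;
  let alg, claims;
  try {
    alg = JSON.parse(Buffer.from(header, "base64url").toString("utf8")).alg;
    claims = JSON.parse(Buffer.from(payload, "base64url").toString("utf8"));
  } catch {
    return "malformed";
  }
  if (alg !== "HS256") return "unsupported alg";
  // Recompute the signature over "<header>.<payload>" with the shared secret.
  const expected = crypto.createHmac("sha256", secret).update(`${header}.${payload}`).digest("base64url");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  if (a.length !== b.length || !crypto.timingSafeEqual(a, b)) return "invalid signature";
  if (typeof claims.exp === "number" && claims.exp <= nowSeconds) return "expired";
  return "ok";
}

// Demo token built the same way as the generator in 10b, valid for 60 seconds.
const b64 = (obj) => Buffer.from(JSON.stringify(obj)).toString("base64url");
const issuedAt = Math.floor(Date.now() / 1000);
const unsigned =
  b64({ alg: "HS256", typ: "JWT" }) + "." + b64({ sub: "user-1", role: "admin", iat: issuedAt, exp: issuedAt + 60 });
const demoToken =
  unsigned + "." + crypto.createHmac("sha256", "your-jwt-secret").update(unsigned).digest("base64url");

console.log(diagnoseJwt(demoToken, "your-jwt-secret")); // ok
console.log(diagnoseJwt(demoToken, "some-other-secret")); // invalid signature
console.log(diagnoseJwt(demoToken, "your-jwt-secret", issuedAt + 120)); // expired
```

If the local check says "ok" with the value from JWT_SECRET but the gateway still rejects the token, the secret loaded by the running gateway probably differs from the one in .env (for example, because the process was not restarted).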

10d. No Authentication Header Provided

Symptom: Request returns { "error": "AUTH_FAILED", "message": "Authentication required. Provide X-API-Key header or Authorization: Bearer token." }.

Cause: The request has no X-API-Key header and no Authorization: Bearer header, and authentication is enabled (API_KEYS or JWT_SECRET is set).

Solution:

# Option A: Use API Key
curl -H "X-API-Key: your-key" http://localhost:3000/health

# Option B: Use JWT Bearer token
curl -H "Authorization: Bearer your.jwt.token" http://localhost:3000/health

# Option C: Disable auth for development (NOT for production)
# Remove all values from API_KEYS and JWT_SECRET in .env:
API_KEYS=
JWT_SECRET=

10e. Insufficient Permissions (Forbidden)

Symptom: Request returns HTTP 403 with { "error": "FORBIDDEN", "message": "Admin role required for this operation" }.

Cause: You are using a user role key to perform an admin-only action (workflow CRUD).

Solution:

# 1. Check which role your key has
grep API_KEYS .env
# Example: sk-user-key:user,sk-admin-key:admin

# 2. Use the admin key for workflow management
curl -H "X-API-Key: sk-admin-key" -X POST http://localhost:3000/workflows ...

# 3. User role can: create jobs, read own jobs, view health/capabilities
# Admin role can: everything the user can + workflow CRUD + view all jobs

Quick Diagnostic Commands

# Gateway health
curl -s http://localhost:3000/health | jq .

# ComfyUI direct connectivity
curl -s http://127.0.0.1:8188/ | head -5

# Queue status
curl -s "http://localhost:3000/jobs?status=queued" -H "X-API-Key: KEY" | jq '.count'
curl -s "http://localhost:3000/jobs?status=running" -H "X-API-Key: KEY" | jq '.count'

# GPU memory
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader

# Redis connectivity
redis-cli -u "$REDIS_URL" ping

# SQLite integrity
sqlite3 ./data/gateway.db "PRAGMA integrity_check;"

# Logs (if using pino-pretty)
npm run dev 2>&1 | npx pino-pretty

# Check all configured environment variables
grep -v '^#' .env | grep -v '^$'
