fix(engineering): improve runbook-generator - add scripts + extract references

2026-03-11 20:24:23 +01:00
parent 6f55bc4fd6
commit dc61de798d
3 changed files with 199 additions and 370 deletions
--- a/engineering/runbook-generator/SKILL.md
+++ b/engineering/runbook-generator/SKILL.md
@@ -7,409 +7,70 @@ description: "Runbook Generator"

 **Tier:** POWERFUL  
 **Category:** Engineering  
-**Domain:** DevOps / Site Reliability Engineering  
+**Domain:** DevOps / Site Reliability Engineering

 ---

 ## Overview

-Analyze a codebase and generate production-grade operational runbooks. Detects your stack (CI/CD, database, hosting, containers), then produces step-by-step runbooks with copy-paste commands, verification checks, rollback procedures, escalation paths, and time estimates. Keeps runbooks fresh with staleness detection linked to config file modification dates.
-
---
+Generate operational runbooks quickly from a service name, then customize for deployment, incident response, maintenance, and rollback workflows.

 ## Core Capabilities

- **Stack detection** — auto-identify CI/CD, database, hosting, orchestration from repo files
- **Runbook types** — deployment, incident response, database maintenance, scaling, monitoring setup
- **Format discipline** — numbered steps, copy-paste commands, ✅ verification checks, time estimates
- **Escalation paths** — L1 → L2 → L3 with contact info and decision criteria
- **Rollback procedures** — every deployment step has a corresponding undo
- **Staleness detection** — runbook sections reference config files; flag when source changes
- **Testing methodology** — dry-run framework for staging validation, quarterly review cadence
+- Runbook skeleton generation from a CLI
+- Standard sections for start/stop/health/rollback
+- Structured escalation and incident handling placeholders
+- Reference templates for deployment and incident playbooks

 ---

 ## When to Use

-Use when:
- A codebase has no runbooks and you need to bootstrap them fast
- Existing runbooks are outdated or incomplete (point at the repo, regenerate)
- Onboarding a new engineer who needs clear operational procedures
- Preparing for an incident response drill or audit
- Setting up monitoring and on-call rotation from scratch
-
-Skip when:
- The system is too early-stage to have stable operational patterns
- Runbooks already exist and only need minor updates (edit directly)
+- A service has no runbook and needs a baseline immediately
+- Existing runbooks are inconsistent across teams
+- On-call onboarding requires standardized operations docs
+- You need repeatable runbook scaffolding for new services

 ---

-## Stack Detection
-
-When given a repo, scan for these signals before writing a single runbook line:
+## Quick Start

 ```bash
-# CI/CD
-ls .github/workflows/     → GitHub Actions
-ls .gitlab-ci.yml         → GitLab CI
-ls Jenkinsfile            → Jenkins
-ls .circleci/             → CircleCI
-ls bitbucket-pipelines.yml → Bitbucket Pipelines
+# Print runbook to stdout
+python3 scripts/runbook_generator.py payments-api

-# Database
-grep -r "postgresql\|postgres\|pg" package.json pyproject.toml → PostgreSQL
-grep -r "mysql\|mariadb"           package.json               → MySQL
-grep -r "mongodb\|mongoose"        package.json               → MongoDB
-grep -r "redis"                    package.json               → Redis
-ls prisma/schema.prisma            → Prisma ORM (check provider field)
-ls drizzle.config.*                → Drizzle ORM
-
-# Hosting
-ls vercel.json                     → Vercel
-ls railway.toml                    → Railway
-ls fly.toml                        → Fly.io
-ls .ebextensions/                  → AWS Elastic Beanstalk
-ls terraform/  ls *.tf             → Custom AWS/GCP/Azure (check provider)
-ls kubernetes/ ls k8s/             → Kubernetes
-ls docker-compose.yml              → Docker Compose
-
-# Framework
-ls next.config.*                   → Next.js
-ls nuxt.config.*                   → Nuxt
-ls svelte.config.*                 → SvelteKit
-cat package.json | jq '.scripts'   → Check build/start commands
-```
-
-Map detected stack → runbook templates. A Next.js + PostgreSQL + Vercel + GitHub Actions repo needs:
- Deployment runbook (Vercel + GitHub Actions)
- Database runbook (PostgreSQL backup, migration, vacuum)
- Incident response (with Vercel logs + pg query debugging)
- Monitoring setup (Vercel Analytics, pg_stat, alerting)
-
---
-
-## Runbook Types
-
-### 1. Deployment Runbook
-
-```markdown
-# Deployment Runbook — [App Name]
-**Stack:** Next.js 14 + PostgreSQL 15 + Vercel  
-**Last verified:** 2025-03-01  
-**Source configs:** vercel.json (modified: git log -1 --format=%ci -- vercel.json)  
-**Owner:** Platform Team  
-**Est. total time:** 15–25 min  
-
---
-
-## Pre-deployment Checklist
- [ ] All PRs merged to main
- [ ] CI passing on main (GitHub Actions green)
- [ ] Database migrations tested in staging
- [ ] Rollback plan confirmed
-
-## Steps
-
-### Step 1 — Run CI checks locally (3 min)
-```bash
-pnpm test
-pnpm lint
-pnpm build
-```
-✅ Expected: All pass with 0 errors. Build output in `.next/`
-
-### Step 2 — Apply database migrations (5 min)
-```bash
-# Staging first
-DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy
-```
-✅ Expected: `All migrations have been successfully applied.`
-
-```bash
-# Verify migration applied
-psql $STAGING_DATABASE_URL -c "\d" | grep -i migration
-```
-✅ Expected: Migration table shows new entry with today's date
-
-### Step 3 — Deploy to production (5 min)
-```bash
-git push origin main
-# OR trigger manually:
-vercel --prod
-```
-✅ Expected: Vercel dashboard shows deployment in progress. URL format:
-`https://app-name-<hash>-team.vercel.app`
-
-### Step 4 — Smoke test production (5 min)
-```bash
-# Health check
-curl -sf https://your-app.vercel.app/api/health | jq .
-
-# Critical path
-curl -sf https://your-app.vercel.app/api/users/me \
-  -H "Authorization: Bearer $TEST_TOKEN" | jq '.id'
-```
-✅ Expected: health returns `{"status":"ok","db":"connected"}`. Users API returns valid ID.
-
-### Step 5 — Monitor for 10 min
- Check Vercel Functions log for errors: `vercel logs --since=10m`
- Check error rate in Vercel Analytics: < 1% 5xx
- Check DB connection pool: `SELECT count(*) FROM pg_stat_activity;` (< 80% of max_connections)
-
---
-
-## Rollback
-
-If smoke tests fail or error rate spikes:
-
-```bash
-# Instant rollback via Vercel (preferred — < 30 sec)
-vercel rollback [previous-deployment-url]
-
-# Database rollback (only if migration was applied)
-DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate reset --skip-seed
-# WARNING: This resets to previous migration. Confirm data impact first.
-```
-
-✅ Expected after rollback: Previous deployment URL becomes active. Verify with smoke test.
-
---
-
-## Escalation
- **L1 (on-call engineer):** Check Vercel logs, run smoke tests, attempt rollback
- **L2 (platform lead):** DB issues, data loss risk, rollback failed — Slack: @platform-lead
- **L3 (CTO):** Production down > 30 min, data breach — PagerDuty: #critical-incidents
+# Write runbook file
+python3 scripts/runbook_generator.py payments-api --owner platform --output docs/runbooks/payments-api.md
 ```

 ---

-### 2. Incident Response Runbook
+## Recommended Workflow

-```markdown
-# Incident Response Runbook
-**Severity levels:** P1 (down), P2 (degraded), P3 (minor)  
-**Est. total time:** P1: 30–60 min, P2: 1–4 hours  
-
-## Phase 1 — Triage (5 min)
-
-### Confirm the incident
-```bash
-# Is the app responding?
-curl -sw "%{http_code}" https://your-app.vercel.app/api/health -o /dev/null
-
-# Check Vercel function errors (last 15 min)
-vercel logs --since=15m | grep -i "error\|exception\|5[0-9][0-9]"
-```
-✅ 200 = app up. 5xx or timeout = incident confirmed.
-
-Declare severity:
- Site completely down → P1 — page L2/L3 immediately
- Partial degradation / slow responses → P2 — notify team channel
- Single feature broken → P3 — create ticket, fix in business hours
+1. Generate the initial skeleton with `scripts/runbook_generator.py`.
+2. Fill in service-specific commands and URLs.
+3. Add verification checks and rollback triggers.
+4. Dry-run in staging.
+5. Store runbook in version control near service code.

 ---

-## Phase 2 — Diagnose (10–15 min)
+## Reference Docs

-```bash
-# Recent deployments — did something just ship?
-vercel ls --limit=5
-
-# Database health
-psql $DATABASE_URL -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state != 'idle' LIMIT 20;"
-
-# Long-running queries (> 30 sec)
-psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds';"
-
-# Connection pool saturation
-psql $DATABASE_URL -c "SELECT count(*), max_conn FROM pg_stat_activity, (SELECT setting::int AS max_conn FROM pg_settings WHERE name='max_connections') t GROUP BY max_conn;"
-```
-
-Diagnostic decision tree:
- Recent deploy + new errors → rollback (see Deployment Runbook)
- DB query timeout / pool saturation → kill long queries, scale connections
- External dependency failing → check status pages, add circuit breaker
- Memory/CPU spike → check Vercel function logs for infinite loops
-
---
-
-## Phase 3 — Mitigate (variable)
-
-```bash
-# Kill a runaway DB query
-psql $DATABASE_URL -c "SELECT pg_terminate_backend(<pid>);"
-
-# Scale DB connections (Supabase/Neon — adjust pool size)
-# Vercel → Settings → Environment Variables → update DATABASE_POOL_MAX
-
-# Enable maintenance mode (if you have a feature flag)
-vercel env add MAINTENANCE_MODE true production
-vercel --prod  # redeploy with flag
-```
-
---
-
-## Phase 4 — Resolve & Postmortem
-
-After incident is resolved, within 24 hours:
-
-1. Write incident timeline (what happened, when, who noticed, what fixed it)
-2. Identify root cause (5-Whys)
-3. Define action items with owners and due dates
-4. Update this runbook if a step was missing or wrong
-5. Add monitoring/alert that would have caught this earlier
-
-**Postmortem template:** `docs/postmortems/YYYY-MM-DD-incident-title.md`
-
---
-
-## Escalation Path
-
-| Level | Who | When | Contact |
-|-------|-----|------|---------|
-| L1 | On-call engineer | Always first | PagerDuty rotation |
-| L2 | Platform lead | DB issues, rollback needed | Slack @platform-lead |
-| L3 | CTO/VP Eng | P1 > 30 min, data loss | Phone + PagerDuty |
-```
-
---
-
-### 3. Database Maintenance Runbook
-
-```markdown
-# Database Maintenance Runbook — PostgreSQL
-**Schedule:** Weekly vacuum (automated), monthly manual review  
-
-## Backup
-
-```bash
-# Full backup
-pg_dump $DATABASE_URL \
-  --format=custom \
-  --compress=9 \
-  --file="backup-$(date +%Y%m%d-%H%M%S).dump"
-```
-✅ Expected: File created, size > 0. `pg_restore --list backup.dump | head -20` shows tables.
-
-Verify backup is restorable (test monthly):
-```bash
-pg_restore --dbname=$STAGING_DATABASE_URL backup.dump
-psql $STAGING_DATABASE_URL -c "SELECT count(*) FROM users;"
-```
-✅ Expected: Row count matches production.
-
-## Migration
-
-```bash
-# Always test in staging first
-DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy
-# Verify, then:
-DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate deploy
-```
-✅ Expected: `All migrations have been successfully applied.`
-
-⚠️ For large table migrations (> 1M rows), use `pg_repack` or add column with DEFAULT separately to avoid table locks.
-
-## Vacuum & Reindex
-
-```bash
-# Check bloat before deciding
-psql $DATABASE_URL -c "
-SELECT schemaname, tablename, 
-       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
-       n_dead_tup, n_live_tup,
-       ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_ratio
-FROM pg_stat_user_tables
-ORDER BY n_dead_tup DESC LIMIT 10;"
-
-# Vacuum high-bloat tables (non-blocking)
-psql $DATABASE_URL -c "VACUUM ANALYZE users;"
-psql $DATABASE_URL -c "VACUUM ANALYZE events;"
-
-# Reindex (use CONCURRENTLY to avoid locks)
-psql $DATABASE_URL -c "REINDEX INDEX CONCURRENTLY users_email_idx;"
-```
-✅ Expected: dead_ratio drops below 5% after vacuum.
-```
-
---
-
-## Staleness Detection
-
-Add a staleness header to every runbook:
-
-```markdown
-## Staleness Check
-This runbook references the following config files. If they've changed since the
-"Last verified" date, review the affected steps.
-
-| Config File | Last Modified | Affects Steps |
-|-------------|--------------|---------------|
-| vercel.json | `git log -1 --format=%ci -- vercel.json` | Step 3, Rollback |
-| prisma/schema.prisma | `git log -1 --format=%ci -- prisma/schema.prisma` | Step 2, DB Maintenance |
-| .github/workflows/deploy.yml | `git log -1 --format=%ci -- .github/workflows/deploy.yml` | Step 1, Step 3 |
-| docker-compose.yml | `git log -1 --format=%ci -- docker-compose.yml` | All scaling steps |
-```
-
-**Automation:** Add a CI job that runs weekly and comments on the runbook doc if any referenced file was modified more recently than the runbook's "Last verified" date.
-
---
-
-## Runbook Testing Methodology
-
-### Dry-Run in Staging
-
-Before trusting a runbook in production, validate every step in staging:
-
-```bash
-# 1. Create a staging environment mirror
-vercel env pull .env.staging
-source .env.staging
-
-# 2. Run each step with staging credentials
-# Replace all $DATABASE_URL with $STAGING_DATABASE_URL
-# Replace all production URLs with staging URLs
-
-# 3. Verify expected outputs match
-# Document any discrepancies and update the runbook
-
-# 4. Time each step — update estimates in the runbook
-time npx prisma migrate deploy
-```
-
-### Quarterly Review Cadence
-
-Schedule a 1-hour review every quarter:
-
-1. **Run each command** in staging — does it still work?
-2. **Check config drift** — compare "Last Modified" dates vs "Last verified"
-3. **Test rollback procedures** — actually roll back in staging
-4. **Update contact info** — L1/L2/L3 may have changed
-5. **Add new failure modes** discovered in the past quarter
-6. **Update "Last verified" date** at top of runbook
+- `references/runbook-templates.md`

 ---

 ## Common Pitfalls

-| Pitfall | Fix |
-|---|---|
-| Commands that require manual copy of dynamic values | Use env vars — `$DATABASE_URL` not `postgres://user:pass@host/db` |
-| No expected output specified | Add ✅ with exact expected string after every verification step |
-| Rollback steps missing | Every destructive step needs a corresponding undo |
-| Runbooks that never get tested | Schedule quarterly staging dry-runs in team calendar |
-| L3 escalation contact is the former CTO | Review contact info every quarter |
-| Migration runbook doesn't mention table locks | Call out lock risk for large table operations explicitly |
-
---
+- Missing rollback triggers or rollback commands
+- Steps without expected output checks
+- Stale ownership/escalation contacts
+- Runbooks never tested outside of incidents

 ## Best Practices

-1. **Every command must be copy-pasteable** — no placeholder text, use env vars
-2. **✅ after every step** — explicit expected output, not "it should work"
-3. **Time estimates are mandatory** — engineers need to know if they have time to fix before SLA breach
-4. **Rollback before you deploy** — plan the undo before executing
-5. **Runbooks live in the repo** — `docs/runbooks/`, versioned with the code they describe
-6. **Postmortem → runbook update** — every incident should improve a runbook
-7. **Link, don't duplicate** — reference the canonical config file, don't copy its contents into the runbook
-8. **Test runbooks like you test code** — untested runbooks are worse than no runbooks (false confidence)
+1. Keep every command copy-pasteable.
+2. Include health checks after every critical step.
+3. Validate runbooks on a fixed review cadence.
+4. Update runbook content after incidents and postmortems.
--- a/engineering/runbook-generator/references/runbook-templates.md
+++ b/engineering/runbook-generator/references/runbook-templates.md
@@ -0,0 +1,40 @@
+# Runbook Templates
+
+## Deployment Runbook Template
+
+- Pre-deployment checks
+- Deploy steps with expected output
+- Smoke tests
+- Rollback plan with explicit triggers
+- Escalation and communication notes
+
+## Incident Response Template
+
+- Triage phase (first 5 minutes)
+- Diagnosis phase (logs, metrics, recent deploys)
+- Mitigation phase (containment and restoration)
+- Resolution and postmortem actions
+
+## Database Maintenance Template
+
+- Backup and restore verification
+- Migration sequencing and lock-risk notes
+- Vacuum/reindex routines
+- Verification queries and performance checks
+
+## Staleness Detection Template
+
+Track referenced config files and update runbooks whenever these change:
+
+- deployment config (`vercel.json`, Helm charts, Terraform)
+- CI pipelines (`.github/workflows/*`, `.gitlab-ci.yml`)
+- data schema/migration definitions
+- service runtime/env configuration
+
+## Quarterly Validation Checklist
+
+1. Execute commands in staging.
+2. Validate expected outputs.
+3. Test rollback paths.
+4. Confirm contact/escalation ownership.
+5. Update `Last verified` date.
--- a/engineering/runbook-generator/scripts/runbook_generator.py
+++ b/engineering/runbook-generator/scripts/runbook_generator.py
@@ -0,0 +1,128 @@
+#!/usr/bin/env python3
+"""Generate an operational runbook skeleton for a service."""
+
+from __future__ import annotations
+
+import argparse
+from datetime import date
+from pathlib import Path
+
+
+def build_runbook(service: str, owner: str, environment: str) -> str:
+    today = date.today().isoformat()
+    return f"""# Runbook - {service}
+
+- Service: {service}
+- Owner: {owner}
+- Environment: {environment}
+- Last verified: {today}
+
+## Overview
+
+Describe the service purpose, dependencies, and critical user impact.
+
+## Preconditions
+
+- Access to deployment platform
+- Access to logs/metrics
+- Access to secret/config manager
+
+## Start Procedure
+
+1. Pull latest config/secrets.
+2. Start service process.
+3. Confirm process is healthy.
+
+```bash
+# Example
+# systemctl start {service}
+```
+
+## Stop Procedure
+
+1. Drain traffic if applicable.
+2. Stop service process.
+3. Confirm no active workers remain.
+
+```bash
+# Example
+# systemctl stop {service}
+```
+
+## Health Checks
+
+- HTTP health endpoint
+- Dependency connectivity checks
+- Error-rate and latency checks
+
+```bash
+# Example
+# curl -sf https://{service}.example.com/health
+```
+
+## Deployment Checklist
+
+1. Verify CI status and artifact integrity.
+2. Apply migrations (if required) in safe order.
+3. Deploy service revision.
+4. Run smoke checks.
+5. Observe metrics for 10-15 minutes.
+
+## Rollback
+
+1. Identify last known good release.
+2. Re-deploy previous version.
+3. Re-run health checks.
+4. Communicate rollback status to stakeholders.
+
+```bash
+# Example
+# deployctl rollback --service {service}
+```
+
+## Incident Response
+
+1. Classify severity.
+2. Contain user impact.
+3. Triage likely failing component.
+4. Escalate if SLA risk is high.
+
+## Escalation
+
+- L1: On-call engineer
+- L2: Service owner ({owner})
+- L3: Platform/Engineering leadership
+
+## Post-Incident
+
+1. Write timeline and root cause.
+2. Define corrective actions with owners.
+3. Update this runbook with missing steps.
+"""
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Generate a markdown runbook skeleton.")
+    parser.add_argument("service", help="Service name")
+    parser.add_argument("--owner", default="platform-team", help="Service owner label")
+    parser.add_argument("--environment", default="production", help="Primary environment")
+    parser.add_argument("--output", help="Optional output path (prints to stdout if omitted)")
+    return parser.parse_args()
+
+
+def main() -> int:
+    args = parse_args()
+    markdown = build_runbook(args.service, owner=args.owner, environment=args.environment)
+
+    if args.output:
+        path = Path(args.output)
+        path.parent.mkdir(parents=True, exist_ok=True)
+        path.write_text(markdown, encoding="utf-8")
+        print(f"Wrote runbook skeleton to {path}")
+    else:
+        print(markdown)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())