From dc61de798d8def1bd75d33e6a50e368502cfde20 Mon Sep 17 00:00:00 2001 From: Leo Date: Wed, 11 Mar 2026 20:24:23 +0100 Subject: [PATCH] fix(engineering): improve runbook-generator - add scripts + extract references --- engineering/runbook-generator/SKILL.md | 401 ++---------------- .../references/runbook-templates.md | 40 ++ .../scripts/runbook_generator.py | 128 ++++++ 3 files changed, 199 insertions(+), 370 deletions(-) create mode 100644 engineering/runbook-generator/references/runbook-templates.md create mode 100755 engineering/runbook-generator/scripts/runbook_generator.py diff --git a/engineering/runbook-generator/SKILL.md b/engineering/runbook-generator/SKILL.md index 53da23e..cd331c1 100644 --- a/engineering/runbook-generator/SKILL.md +++ b/engineering/runbook-generator/SKILL.md @@ -7,409 +7,70 @@ description: "Runbook Generator" **Tier:** POWERFUL **Category:** Engineering -**Domain:** DevOps / Site Reliability Engineering +**Domain:** DevOps / Site Reliability Engineering --- ## Overview -Analyze a codebase and generate production-grade operational runbooks. Detects your stack (CI/CD, database, hosting, containers), then produces step-by-step runbooks with copy-paste commands, verification checks, rollback procedures, escalation paths, and time estimates. Keeps runbooks fresh with staleness detection linked to config file modification dates. - ---- +Generate operational runbooks quickly from a service name, then customize for deployment, incident response, maintenance, and rollback workflows. 
## Core Capabilities -- **Stack detection** — auto-identify CI/CD, database, hosting, orchestration from repo files -- **Runbook types** — deployment, incident response, database maintenance, scaling, monitoring setup -- **Format discipline** — numbered steps, copy-paste commands, ✅ verification checks, time estimates -- **Escalation paths** — L1 → L2 → L3 with contact info and decision criteria -- **Rollback procedures** — every deployment step has a corresponding undo -- **Staleness detection** — runbook sections reference config files; flag when source changes -- **Testing methodology** — dry-run framework for staging validation, quarterly review cadence +- Runbook skeleton generation from a CLI +- Standard sections for start/stop/health/rollback +- Structured escalation and incident handling placeholders +- Reference templates for deployment and incident playbooks --- ## When to Use -Use when: -- A codebase has no runbooks and you need to bootstrap them fast -- Existing runbooks are outdated or incomplete (point at the repo, regenerate) -- Onboarding a new engineer who needs clear operational procedures -- Preparing for an incident response drill or audit -- Setting up monitoring and on-call rotation from scratch - -Skip when: -- The system is too early-stage to have stable operational patterns -- Runbooks already exist and only need minor updates (edit directly) +- A service has no runbook and needs a baseline immediately +- Existing runbooks are inconsistent across teams +- On-call onboarding requires standardized operations docs +- You need repeatable runbook scaffolding for new services --- -## Stack Detection - -When given a repo, scan for these signals before writing a single runbook line: +## Quick Start ```bash -# CI/CD -ls .github/workflows/ → GitHub Actions -ls .gitlab-ci.yml → GitLab CI -ls Jenkinsfile → Jenkins -ls .circleci/ → CircleCI -ls bitbucket-pipelines.yml → Bitbucket Pipelines +# Print runbook to stdout +python3 scripts/runbook_generator.py 
payments-api -# Database -grep -r "postgresql\|postgres\|pg" package.json pyproject.toml → PostgreSQL -grep -r "mysql\|mariadb" package.json → MySQL -grep -r "mongodb\|mongoose" package.json → MongoDB -grep -r "redis" package.json → Redis -ls prisma/schema.prisma → Prisma ORM (check provider field) -ls drizzle.config.* → Drizzle ORM - -# Hosting -ls vercel.json → Vercel -ls railway.toml → Railway -ls fly.toml → Fly.io -ls .ebextensions/ → AWS Elastic Beanstalk -ls terraform/ ls *.tf → Custom AWS/GCP/Azure (check provider) -ls kubernetes/ ls k8s/ → Kubernetes -ls docker-compose.yml → Docker Compose - -# Framework -ls next.config.* → Next.js -ls nuxt.config.* → Nuxt -ls svelte.config.* → SvelteKit -cat package.json | jq '.scripts' → Check build/start commands -``` - -Map detected stack → runbook templates. A Next.js + PostgreSQL + Vercel + GitHub Actions repo needs: -- Deployment runbook (Vercel + GitHub Actions) -- Database runbook (PostgreSQL backup, migration, vacuum) -- Incident response (with Vercel logs + pg query debugging) -- Monitoring setup (Vercel Analytics, pg_stat, alerting) - ---- - -## Runbook Types - -### 1. Deployment Runbook - -```markdown -# Deployment Runbook — [App Name] -**Stack:** Next.js 14 + PostgreSQL 15 + Vercel -**Last verified:** 2025-03-01 -**Source configs:** vercel.json (modified: git log -1 --format=%ci -- vercel.json) -**Owner:** Platform Team -**Est. total time:** 15–25 min - ---- - -## Pre-deployment Checklist -- [ ] All PRs merged to main -- [ ] CI passing on main (GitHub Actions green) -- [ ] Database migrations tested in staging -- [ ] Rollback plan confirmed - -## Steps - -### Step 1 — Run CI checks locally (3 min) -```bash -pnpm test -pnpm lint -pnpm build -``` -✅ Expected: All pass with 0 errors. 
Build output in `.next/` - -### Step 2 — Apply database migrations (5 min) -```bash -# Staging first -DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy -``` -✅ Expected: `All migrations have been successfully applied.` - -```bash -# Verify migration applied -psql $STAGING_DATABASE_URL -c "\d" | grep -i migration -``` -✅ Expected: Migration table shows new entry with today's date - -### Step 3 — Deploy to production (5 min) -```bash -git push origin main -# OR trigger manually: -vercel --prod -``` -✅ Expected: Vercel dashboard shows deployment in progress. URL format: -`https://app-name--team.vercel.app` - -### Step 4 — Smoke test production (5 min) -```bash -# Health check -curl -sf https://your-app.vercel.app/api/health | jq . - -# Critical path -curl -sf https://your-app.vercel.app/api/users/me \ - -H "Authorization: Bearer $TEST_TOKEN" | jq '.id' -``` -✅ Expected: health returns `{"status":"ok","db":"connected"}`. Users API returns valid ID. - -### Step 5 — Monitor for 10 min -- Check Vercel Functions log for errors: `vercel logs --since=10m` -- Check error rate in Vercel Analytics: < 1% 5xx -- Check DB connection pool: `SELECT count(*) FROM pg_stat_activity;` (< 80% of max_connections) - ---- - -## Rollback - -If smoke tests fail or error rate spikes: - -```bash -# Instant rollback via Vercel (preferred — < 30 sec) -vercel rollback [previous-deployment-url] - -# Database rollback (only if migration was applied) -DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate reset --skip-seed -# WARNING: This resets to previous migration. Confirm data impact first. -``` - -✅ Expected after rollback: Previous deployment URL becomes active. Verify with smoke test. 
- ---- - -## Escalation -- **L1 (on-call engineer):** Check Vercel logs, run smoke tests, attempt rollback -- **L2 (platform lead):** DB issues, data loss risk, rollback failed — Slack: @platform-lead -- **L3 (CTO):** Production down > 30 min, data breach — PagerDuty: #critical-incidents +# Write runbook file +python3 scripts/runbook_generator.py payments-api --owner platform --output docs/runbooks/payments-api.md ``` --- -### 2. Incident Response Runbook +## Recommended Workflow -```markdown -# Incident Response Runbook -**Severity levels:** P1 (down), P2 (degraded), P3 (minor) -**Est. total time:** P1: 30–60 min, P2: 1–4 hours - -## Phase 1 — Triage (5 min) - -### Confirm the incident -```bash -# Is the app responding? -curl -sw "%{http_code}" https://your-app.vercel.app/api/health -o /dev/null - -# Check Vercel function errors (last 15 min) -vercel logs --since=15m | grep -i "error\|exception\|5[0-9][0-9]" -``` -✅ 200 = app up. 5xx or timeout = incident confirmed. - -Declare severity: -- Site completely down → P1 — page L2/L3 immediately -- Partial degradation / slow responses → P2 — notify team channel -- Single feature broken → P3 — create ticket, fix in business hours +1. Generate the initial skeleton with `scripts/runbook_generator.py`. +2. Fill in service-specific commands and URLs. +3. Add verification checks and rollback triggers. +4. Dry-run in staging. +5. Store runbook in version control near service code. --- -## Phase 2 — Diagnose (10–15 min) +## Reference Docs -```bash -# Recent deployments — did something just ship? 
-vercel ls --limit=5 - -# Database health -psql $DATABASE_URL -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state != 'idle' LIMIT 20;" - -# Long-running queries (> 30 sec) -psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds';" - -# Connection pool saturation -psql $DATABASE_URL -c "SELECT count(*), max_conn FROM pg_stat_activity, (SELECT setting::int AS max_conn FROM pg_settings WHERE name='max_connections') t GROUP BY max_conn;" -``` - -Diagnostic decision tree: -- Recent deploy + new errors → rollback (see Deployment Runbook) -- DB query timeout / pool saturation → kill long queries, scale connections -- External dependency failing → check status pages, add circuit breaker -- Memory/CPU spike → check Vercel function logs for infinite loops - ---- - -## Phase 3 — Mitigate (variable) - -```bash -# Kill a runaway DB query -psql $DATABASE_URL -c "SELECT pg_terminate_backend();" - -# Scale DB connections (Supabase/Neon — adjust pool size) -# Vercel → Settings → Environment Variables → update DATABASE_POOL_MAX - -# Enable maintenance mode (if you have a feature flag) -vercel env add MAINTENANCE_MODE true production -vercel --prod # redeploy with flag -``` - ---- - -## Phase 4 — Resolve & Postmortem - -After incident is resolved, within 24 hours: - -1. Write incident timeline (what happened, when, who noticed, what fixed it) -2. Identify root cause (5-Whys) -3. Define action items with owners and due dates -4. Update this runbook if a step was missing or wrong -5. 
Add monitoring/alert that would have caught this earlier - -**Postmortem template:** `docs/postmortems/YYYY-MM-DD-incident-title.md` - ---- - -## Escalation Path - -| Level | Who | When | Contact | -|-------|-----|------|---------| -| L1 | On-call engineer | Always first | PagerDuty rotation | -| L2 | Platform lead | DB issues, rollback needed | Slack @platform-lead | -| L3 | CTO/VP Eng | P1 > 30 min, data loss | Phone + PagerDuty | -``` - ---- - -### 3. Database Maintenance Runbook - -```markdown -# Database Maintenance Runbook — PostgreSQL -**Schedule:** Weekly vacuum (automated), monthly manual review - -## Backup - -```bash -# Full backup -pg_dump $DATABASE_URL \ - --format=custom \ - --compress=9 \ - --file="backup-$(date +%Y%m%d-%H%M%S).dump" -``` -✅ Expected: File created, size > 0. `pg_restore --list backup.dump | head -20` shows tables. - -Verify backup is restorable (test monthly): -```bash -pg_restore --dbname=$STAGING_DATABASE_URL backup.dump -psql $STAGING_DATABASE_URL -c "SELECT count(*) FROM users;" -``` -✅ Expected: Row count matches production. - -## Migration - -```bash -# Always test in staging first -DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy -# Verify, then: -DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate deploy -``` -✅ Expected: `All migrations have been successfully applied.` - -⚠️ For large table migrations (> 1M rows), use `pg_repack` or add column with DEFAULT separately to avoid table locks. 
- -## Vacuum & Reindex - -```bash -# Check bloat before deciding -psql $DATABASE_URL -c " -SELECT schemaname, tablename, - pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size, - n_dead_tup, n_live_tup, - ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_ratio -FROM pg_stat_user_tables -ORDER BY n_dead_tup DESC LIMIT 10;" - -# Vacuum high-bloat tables (non-blocking) -psql $DATABASE_URL -c "VACUUM ANALYZE users;" -psql $DATABASE_URL -c "VACUUM ANALYZE events;" - -# Reindex (use CONCURRENTLY to avoid locks) -psql $DATABASE_URL -c "REINDEX INDEX CONCURRENTLY users_email_idx;" -``` -✅ Expected: dead_ratio drops below 5% after vacuum. -``` - ---- - -## Staleness Detection - -Add a staleness header to every runbook: - -```markdown -## Staleness Check -This runbook references the following config files. If they've changed since the -"Last verified" date, review the affected steps. - -| Config File | Last Modified | Affects Steps | -|-------------|--------------|---------------| -| vercel.json | `git log -1 --format=%ci -- vercel.json` | Step 3, Rollback | -| prisma/schema.prisma | `git log -1 --format=%ci -- prisma/schema.prisma` | Step 2, DB Maintenance | -| .github/workflows/deploy.yml | `git log -1 --format=%ci -- .github/workflows/deploy.yml` | Step 1, Step 3 | -| docker-compose.yml | `git log -1 --format=%ci -- docker-compose.yml` | All scaling steps | -``` - -**Automation:** Add a CI job that runs weekly and comments on the runbook doc if any referenced file was modified more recently than the runbook's "Last verified" date. - ---- - -## Runbook Testing Methodology - -### Dry-Run in Staging - -Before trusting a runbook in production, validate every step in staging: - -```bash -# 1. Create a staging environment mirror -vercel env pull .env.staging -source .env.staging - -# 2. 
Run each step with staging credentials -# Replace all $DATABASE_URL with $STAGING_DATABASE_URL -# Replace all production URLs with staging URLs - -# 3. Verify expected outputs match -# Document any discrepancies and update the runbook - -# 4. Time each step — update estimates in the runbook -time npx prisma migrate deploy -``` - -### Quarterly Review Cadence - -Schedule a 1-hour review every quarter: - -1. **Run each command** in staging — does it still work? -2. **Check config drift** — compare "Last Modified" dates vs "Last verified" -3. **Test rollback procedures** — actually roll back in staging -4. **Update contact info** — L1/L2/L3 may have changed -5. **Add new failure modes** discovered in the past quarter -6. **Update "Last verified" date** at top of runbook +- `references/runbook-templates.md` --- ## Common Pitfalls -| Pitfall | Fix | -|---|---| -| Commands that require manual copy of dynamic values | Use env vars — `$DATABASE_URL` not `postgres://user:pass@host/db` | -| No expected output specified | Add ✅ with exact expected string after every verification step | -| Rollback steps missing | Every destructive step needs a corresponding undo | -| Runbooks that never get tested | Schedule quarterly staging dry-runs in team calendar | -| L3 escalation contact is the former CTO | Review contact info every quarter | -| Migration runbook doesn't mention table locks | Call out lock risk for large table operations explicitly | - ---- +- Missing rollback triggers or rollback commands +- Steps without expected output checks +- Stale ownership/escalation contacts +- Runbooks never tested outside of incidents ## Best Practices -1. **Every command must be copy-pasteable** — no placeholder text, use env vars -2. **✅ after every step** — explicit expected output, not "it should work" -3. **Time estimates are mandatory** — engineers need to know if they have time to fix before SLA breach -4. **Rollback before you deploy** — plan the undo before executing -5. 
**Runbooks live in the repo** — `docs/runbooks/`, versioned with the code they describe -6. **Postmortem → runbook update** — every incident should improve a runbook -7. **Link, don't duplicate** — reference the canonical config file, don't copy its contents into the runbook -8. **Test runbooks like you test code** — untested runbooks are worse than no runbooks (false confidence) +1. Keep every command copy-pasteable. +2. Include health checks after every critical step. +3. Validate runbooks on a fixed review cadence. +4. Update runbook content after incidents and postmortems. diff --git a/engineering/runbook-generator/references/runbook-templates.md b/engineering/runbook-generator/references/runbook-templates.md new file mode 100644 index 0000000..959218d --- /dev/null +++ b/engineering/runbook-generator/references/runbook-templates.md @@ -0,0 +1,40 @@ +# Runbook Templates + +## Deployment Runbook Template + +- Pre-deployment checks +- Deploy steps with expected output +- Smoke tests +- Rollback plan with explicit triggers +- Escalation and communication notes + +## Incident Response Template + +- Triage phase (first 5 minutes) +- Diagnosis phase (logs, metrics, recent deploys) +- Mitigation phase (containment and restoration) +- Resolution and postmortem actions + +## Database Maintenance Template + +- Backup and restore verification +- Migration sequencing and lock-risk notes +- Vacuum/reindex routines +- Verification queries and performance checks + +## Staleness Detection Template + +Track referenced config files and update runbooks whenever these change: + +- deployment config (`vercel.json`, Helm charts, Terraform) +- CI pipelines (`.github/workflows/*`, `.gitlab-ci.yml`) +- data schema/migration definitions +- service runtime/env configuration + +## Quarterly Validation Checklist + +1. Execute commands in staging. +2. Validate expected outputs. +3. Test rollback paths. +4. Confirm contact/escalation ownership. +5. Update `Last verified` date. 
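The staleness-detection and quarterly-validation templates above can be automated. A minimal sketch, assuming each runbook embeds a `Last verified: YYYY-MM-DD` line (as `build_runbook` emits) and that the runbook path and config file list are illustrative, not fixed by this skill:

```python
#!/usr/bin/env python3
"""Sketch: flag a runbook as stale when a referenced config file was
committed after the runbook's "Last verified" date. The runbook path
and config list in the demo block are hypothetical examples."""

from __future__ import annotations

import re
import subprocess
from datetime import date
from pathlib import Path

# Matches "Last verified: 2026-03-11" and markdown-bold variants.
LAST_VERIFIED_RE = re.compile(r"Last verified[:*\s]+(\d{4}-\d{2}-\d{2})")


def parse_last_verified(runbook_text: str) -> date | None:
    """Extract the 'Last verified' date from runbook markdown, if present."""
    match = LAST_VERIFIED_RE.search(runbook_text)
    return date.fromisoformat(match.group(1)) if match else None


def last_commit_date(path: str) -> date | None:
    """Ask git for the last commit date touching `path` (None if untracked)."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%cs", "--", path],
        capture_output=True, text=True,
    ).stdout.strip()
    return date.fromisoformat(out) if out else None


def stale_configs(runbook_text: str, config_paths: list[str]) -> list[str]:
    """Return referenced configs modified after the runbook was verified."""
    verified = parse_last_verified(runbook_text)
    if verified is None:
        # No date at all: treat every referenced config as stale.
        return list(config_paths)
    return [
        path for path in config_paths
        if (modified := last_commit_date(path)) is not None and modified > verified
    ]


if __name__ == "__main__":
    runbook = Path("docs/runbooks/payments-api.md")  # hypothetical path
    if runbook.exists():
        configs = ["vercel.json", ".github/workflows/deploy.yml"]  # example list
        for path in stale_configs(runbook.read_text(encoding="utf-8"), configs):
            print(f"STALE: {path} changed after runbook was last verified")
```

Run weekly from CI and fail (or comment) when anything is reported stale, which matches the quarterly-validation checklist above.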
diff --git a/engineering/runbook-generator/scripts/runbook_generator.py b/engineering/runbook-generator/scripts/runbook_generator.py new file mode 100755 index 0000000..96e98c8 --- /dev/null +++ b/engineering/runbook-generator/scripts/runbook_generator.py @@ -0,0 +1,128 @@ +#!/usr/bin/env python3 +"""Generate an operational runbook skeleton for a service.""" + +from __future__ import annotations + +import argparse +from datetime import date +from pathlib import Path + + +def build_runbook(service: str, owner: str, environment: str) -> str: + today = date.today().isoformat() + return f"""# Runbook - {service} + +- Service: {service} +- Owner: {owner} +- Environment: {environment} +- Last verified: {today} + +## Overview + +Describe the service purpose, dependencies, and critical user impact. + +## Preconditions + +- Access to deployment platform +- Access to logs/metrics +- Access to secret/config manager + +## Start Procedure + +1. Pull latest config/secrets. +2. Start service process. +3. Confirm process is healthy. + +```bash +# Example +# systemctl start {service} +``` + +## Stop Procedure + +1. Drain traffic if applicable. +2. Stop service process. +3. Confirm no active workers remain. + +```bash +# Example +# systemctl stop {service} +``` + +## Health Checks + +- HTTP health endpoint +- Dependency connectivity checks +- Error-rate and latency checks + +```bash +# Example +# curl -sf https://{service}.example.com/health +``` + +## Deployment Checklist + +1. Verify CI status and artifact integrity. +2. Apply migrations (if required) in safe order. +3. Deploy service revision. +4. Run smoke checks. +5. Observe metrics for 10-15 minutes. + +## Rollback + +1. Identify last known good release. +2. Re-deploy previous version. +3. Re-run health checks. +4. Communicate rollback status to stakeholders. + +```bash +# Example +# deployctl rollback --service {service} +``` + +## Incident Response + +1. Classify severity. +2. Contain user impact. +3. 
Triage likely failing component. +4. Escalate if SLA risk is high. + +## Escalation + +- L1: On-call engineer +- L2: Service owner ({owner}) +- L3: Platform/Engineering leadership + +## Post-Incident + +1. Write timeline and root cause. +2. Define corrective actions with owners. +3. Update this runbook with missing steps. +""" + + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser(description="Generate a markdown runbook skeleton.") + parser.add_argument("service", help="Service name") + parser.add_argument("--owner", default="platform-team", help="Service owner label") + parser.add_argument("--environment", default="production", help="Primary environment") + parser.add_argument("--output", help="Optional output path (prints to stdout if omitted)") + return parser.parse_args() + + +def main() -> int: + args = parse_args() + markdown = build_runbook(args.service, owner=args.owner, environment=args.environment) + + if args.output: + path = Path(args.output) + path.parent.mkdir(parents=True, exist_ok=True) + path.write_text(markdown, encoding="utf-8") + print(f"Wrote runbook skeleton to {path}") + else: + print(markdown) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main())
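
The skeleton emitted by `build_runbook` can be sanity-checked in CI before merge. A minimal sketch — the heading list mirrors the sections the script emits above; importing the real module (e.g. `from runbook_generator import build_runbook`) is an assumption and depends on how the scripts directory is packaged:

```python
#!/usr/bin/env python3
"""Sketch: verify a generated runbook contains every required section.
REQUIRED_SECTIONS mirrors the headings produced by build_runbook."""

REQUIRED_SECTIONS = [
    "## Overview",
    "## Preconditions",
    "## Start Procedure",
    "## Stop Procedure",
    "## Health Checks",
    "## Deployment Checklist",
    "## Rollback",
    "## Incident Response",
    "## Escalation",
    "## Post-Incident",
]


def missing_sections(markdown: str) -> list[str]:
    """Return required section headings absent from the runbook text."""
    return [section for section in REQUIRED_SECTIONS if section not in markdown]
```

A check like `assert not missing_sections(build_runbook("payments-api", "platform", "production"))` in the test suite catches accidental template regressions when the skeleton evolves.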