fix(engineering): improve runbook-generator - add scripts + extract references
This commit is contained in:
@@ -7,409 +7,70 @@ description: "Runbook Generator"
|
||||
|
||||
**Tier:** POWERFUL
|
||||
**Category:** Engineering
|
||||
**Domain:** DevOps / Site Reliability Engineering
|
||||
**Domain:** DevOps / Site Reliability Engineering
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
Analyze a codebase and generate production-grade operational runbooks. Detects your stack (CI/CD, database, hosting, containers), then produces step-by-step runbooks with copy-paste commands, verification checks, rollback procedures, escalation paths, and time estimates. Keeps runbooks fresh with staleness detection linked to config file modification dates.
|
||||
|
||||
---
|
||||
Generate operational runbooks quickly from a service name, then customize for deployment, incident response, maintenance, and rollback workflows.
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
- **Stack detection** — auto-identify CI/CD, database, hosting, orchestration from repo files
|
||||
- **Runbook types** — deployment, incident response, database maintenance, scaling, monitoring setup
|
||||
- **Format discipline** — numbered steps, copy-paste commands, ✅ verification checks, time estimates
|
||||
- **Escalation paths** — L1 → L2 → L3 with contact info and decision criteria
|
||||
- **Rollback procedures** — every deployment step has a corresponding undo
|
||||
- **Staleness detection** — runbook sections reference config files; flag when source changes
|
||||
- **Testing methodology** — dry-run framework for staging validation, quarterly review cadence
|
||||
- Runbook skeleton generation from a CLI
|
||||
- Standard sections for start/stop/health/rollback
|
||||
- Structured escalation and incident handling placeholders
|
||||
- Reference templates for deployment and incident playbooks
|
||||
|
||||
---
|
||||
|
||||
## When to Use
|
||||
|
||||
Use when:
|
||||
- A codebase has no runbooks and you need to bootstrap them fast
|
||||
- Existing runbooks are outdated or incomplete (point at the repo, regenerate)
|
||||
- Onboarding a new engineer who needs clear operational procedures
|
||||
- Preparing for an incident response drill or audit
|
||||
- Setting up monitoring and on-call rotation from scratch
|
||||
|
||||
Skip when:
|
||||
- The system is too early-stage to have stable operational patterns
|
||||
- Runbooks already exist and only need minor updates (edit directly)
|
||||
- A service has no runbook and needs a baseline immediately
|
||||
- Existing runbooks are inconsistent across teams
|
||||
- On-call onboarding requires standardized operations docs
|
||||
- You need repeatable runbook scaffolding for new services
|
||||
|
||||
---
|
||||
|
||||
## Stack Detection
|
||||
|
||||
When given a repo, scan for these signals before writing a single runbook line:
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# CI/CD
|
||||
ls .github/workflows/ → GitHub Actions
|
||||
ls .gitlab-ci.yml → GitLab CI
|
||||
ls Jenkinsfile → Jenkins
|
||||
ls .circleci/ → CircleCI
|
||||
ls bitbucket-pipelines.yml → Bitbucket Pipelines
|
||||
# Print runbook to stdout
|
||||
python3 scripts/runbook_generator.py payments-api
|
||||
|
||||
# Database
|
||||
grep -r "postgresql\|postgres\|pg" package.json pyproject.toml → PostgreSQL
|
||||
grep -r "mysql\|mariadb" package.json → MySQL
|
||||
grep -r "mongodb\|mongoose" package.json → MongoDB
|
||||
grep -r "redis" package.json → Redis
|
||||
ls prisma/schema.prisma → Prisma ORM (check provider field)
|
||||
ls drizzle.config.* → Drizzle ORM
|
||||
|
||||
# Hosting
|
||||
ls vercel.json → Vercel
|
||||
ls railway.toml → Railway
|
||||
ls fly.toml → Fly.io
|
||||
ls .ebextensions/ → AWS Elastic Beanstalk
|
||||
ls terraform/ ls *.tf → Custom AWS/GCP/Azure (check provider)
|
||||
ls kubernetes/ ls k8s/ → Kubernetes
|
||||
ls docker-compose.yml → Docker Compose
|
||||
|
||||
# Framework
|
||||
ls next.config.* → Next.js
|
||||
ls nuxt.config.* → Nuxt
|
||||
ls svelte.config.* → SvelteKit
|
||||
cat package.json | jq '.scripts' → Check build/start commands
|
||||
```
|
||||
|
||||
Map detected stack → runbook templates. A Next.js + PostgreSQL + Vercel + GitHub Actions repo needs:
|
||||
- Deployment runbook (Vercel + GitHub Actions)
|
||||
- Database runbook (PostgreSQL backup, migration, vacuum)
|
||||
- Incident response (with Vercel logs + pg query debugging)
|
||||
- Monitoring setup (Vercel Analytics, pg_stat, alerting)
|
||||
|
||||
---
|
||||
|
||||
## Runbook Types
|
||||
|
||||
### 1. Deployment Runbook
|
||||
|
||||
```markdown
|
||||
# Deployment Runbook — [App Name]
|
||||
**Stack:** Next.js 14 + PostgreSQL 15 + Vercel
|
||||
**Last verified:** 2025-03-01
|
||||
**Source configs:** vercel.json (modified: git log -1 --format=%ci -- vercel.json)
|
||||
**Owner:** Platform Team
|
||||
**Est. total time:** 15–25 min
|
||||
|
||||
---
|
||||
|
||||
## Pre-deployment Checklist
|
||||
- [ ] All PRs merged to main
|
||||
- [ ] CI passing on main (GitHub Actions green)
|
||||
- [ ] Database migrations tested in staging
|
||||
- [ ] Rollback plan confirmed
|
||||
|
||||
## Steps
|
||||
|
||||
### Step 1 — Run CI checks locally (3 min)
|
||||
```bash
|
||||
pnpm test
|
||||
pnpm lint
|
||||
pnpm build
|
||||
```
|
||||
✅ Expected: All pass with 0 errors. Build output in `.next/`
|
||||
|
||||
### Step 2 — Apply database migrations (5 min)
|
||||
```bash
|
||||
# Staging first
|
||||
DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy
|
||||
```
|
||||
✅ Expected: `All migrations have been successfully applied.`
|
||||
|
||||
```bash
|
||||
# Verify migration applied
|
||||
psql $STAGING_DATABASE_URL -c "\d" | grep -i migration
|
||||
```
|
||||
✅ Expected: Migration table shows new entry with today's date
|
||||
|
||||
### Step 3 — Deploy to production (5 min)
|
||||
```bash
|
||||
git push origin main
|
||||
# OR trigger manually:
|
||||
vercel --prod
|
||||
```
|
||||
✅ Expected: Vercel dashboard shows deployment in progress. URL format:
|
||||
`https://app-name-<hash>-team.vercel.app`
|
||||
|
||||
### Step 4 — Smoke test production (5 min)
|
||||
```bash
|
||||
# Health check
|
||||
curl -sf https://your-app.vercel.app/api/health | jq .
|
||||
|
||||
# Critical path
|
||||
curl -sf https://your-app.vercel.app/api/users/me \
|
||||
-H "Authorization: Bearer $TEST_TOKEN" | jq '.id'
|
||||
```
|
||||
✅ Expected: health returns `{"status":"ok","db":"connected"}`. Users API returns valid ID.
|
||||
|
||||
### Step 5 — Monitor for 10 min
|
||||
- Check Vercel Functions log for errors: `vercel logs --since=10m`
|
||||
- Check error rate in Vercel Analytics: < 1% 5xx
|
||||
- Check DB connection pool: `SELECT count(*) FROM pg_stat_activity;` (< 80% of max_connections)
|
||||
|
||||
---
|
||||
|
||||
## Rollback
|
||||
|
||||
If smoke tests fail or error rate spikes:
|
||||
|
||||
```bash
|
||||
# Instant rollback via Vercel (preferred — < 30 sec)
|
||||
vercel rollback [previous-deployment-url]
|
||||
|
||||
# Database rollback (only if migration was applied)
|
||||
DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate reset --skip-seed
|
||||
# WARNING: This resets to previous migration. Confirm data impact first.
|
||||
```
|
||||
|
||||
✅ Expected after rollback: Previous deployment URL becomes active. Verify with smoke test.
|
||||
|
||||
---
|
||||
|
||||
## Escalation
|
||||
- **L1 (on-call engineer):** Check Vercel logs, run smoke tests, attempt rollback
|
||||
- **L2 (platform lead):** DB issues, data loss risk, rollback failed — Slack: @platform-lead
|
||||
- **L3 (CTO):** Production down > 30 min, data breach — PagerDuty: #critical-incidents
|
||||
# Write runbook file
|
||||
python3 scripts/runbook_generator.py payments-api --owner platform --output docs/runbooks/payments-api.md
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 2. Incident Response Runbook
|
||||
## Recommended Workflow
|
||||
|
||||
```markdown
|
||||
# Incident Response Runbook
|
||||
**Severity levels:** P1 (down), P2 (degraded), P3 (minor)
|
||||
**Est. total time:** P1: 30–60 min, P2: 1–4 hours
|
||||
|
||||
## Phase 1 — Triage (5 min)
|
||||
|
||||
### Confirm the incident
|
||||
```bash
|
||||
# Is the app responding?
|
||||
curl -sw "%{http_code}" https://your-app.vercel.app/api/health -o /dev/null
|
||||
|
||||
# Check Vercel function errors (last 15 min)
|
||||
vercel logs --since=15m | grep -i "error\|exception\|5[0-9][0-9]"
|
||||
```
|
||||
✅ 200 = app up. 5xx or timeout = incident confirmed.
|
||||
|
||||
Declare severity:
|
||||
- Site completely down → P1 — page L2/L3 immediately
|
||||
- Partial degradation / slow responses → P2 — notify team channel
|
||||
- Single feature broken → P3 — create ticket, fix in business hours
|
||||
1. Generate the initial skeleton with `scripts/runbook_generator.py`.
|
||||
2. Fill in service-specific commands and URLs.
|
||||
3. Add verification checks and rollback triggers.
|
||||
4. Dry-run in staging.
|
||||
5. Store runbook in version control near service code.
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 — Diagnose (10–15 min)
|
||||
## Reference Docs
|
||||
|
||||
```bash
|
||||
# Recent deployments — did something just ship?
|
||||
vercel ls --limit=5
|
||||
|
||||
# Database health
|
||||
psql $DATABASE_URL -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state != 'idle' LIMIT 20;"
|
||||
|
||||
# Long-running queries (> 30 sec)
|
||||
psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds';"
|
||||
|
||||
# Connection pool saturation
|
||||
psql $DATABASE_URL -c "SELECT count(*), max_conn FROM pg_stat_activity, (SELECT setting::int AS max_conn FROM pg_settings WHERE name='max_connections') t GROUP BY max_conn;"
|
||||
```
|
||||
|
||||
Diagnostic decision tree:
|
||||
- Recent deploy + new errors → rollback (see Deployment Runbook)
|
||||
- DB query timeout / pool saturation → kill long queries, scale connections
|
||||
- External dependency failing → check status pages, add circuit breaker
|
||||
- Memory/CPU spike → check Vercel function logs for infinite loops
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 — Mitigate (variable)
|
||||
|
||||
```bash
|
||||
# Kill a runaway DB query
|
||||
psql $DATABASE_URL -c "SELECT pg_terminate_backend(<pid>);"
|
||||
|
||||
# Scale DB connections (Supabase/Neon — adjust pool size)
|
||||
# Vercel → Settings → Environment Variables → update DATABASE_POOL_MAX
|
||||
|
||||
# Enable maintenance mode (if you have a feature flag)
|
||||
vercel env add MAINTENANCE_MODE true production
|
||||
vercel --prod # redeploy with flag
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 4 — Resolve & Postmortem
|
||||
|
||||
After incident is resolved, within 24 hours:
|
||||
|
||||
1. Write incident timeline (what happened, when, who noticed, what fixed it)
|
||||
2. Identify root cause (5-Whys)
|
||||
3. Define action items with owners and due dates
|
||||
4. Update this runbook if a step was missing or wrong
|
||||
5. Add monitoring/alert that would have caught this earlier
|
||||
|
||||
**Postmortem template:** `docs/postmortems/YYYY-MM-DD-incident-title.md`
|
||||
|
||||
---
|
||||
|
||||
## Escalation Path
|
||||
|
||||
| Level | Who | When | Contact |
|
||||
|-------|-----|------|---------|
|
||||
| L1 | On-call engineer | Always first | PagerDuty rotation |
|
||||
| L2 | Platform lead | DB issues, rollback needed | Slack @platform-lead |
|
||||
| L3 | CTO/VP Eng | P1 > 30 min, data loss | Phone + PagerDuty |
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 3. Database Maintenance Runbook
|
||||
|
||||
```markdown
|
||||
# Database Maintenance Runbook — PostgreSQL
|
||||
**Schedule:** Weekly vacuum (automated), monthly manual review
|
||||
|
||||
## Backup
|
||||
|
||||
```bash
|
||||
# Full backup
|
||||
pg_dump $DATABASE_URL \
|
||||
--format=custom \
|
||||
--compress=9 \
|
||||
--file="backup-$(date +%Y%m%d-%H%M%S).dump"
|
||||
```
|
||||
✅ Expected: File created, size > 0. `pg_restore --list backup.dump | head -20` shows tables.
|
||||
|
||||
Verify backup is restorable (test monthly):
|
||||
```bash
|
||||
pg_restore --dbname=$STAGING_DATABASE_URL backup.dump
|
||||
psql $STAGING_DATABASE_URL -c "SELECT count(*) FROM users;"
|
||||
```
|
||||
✅ Expected: Row count matches production.
|
||||
|
||||
## Migration
|
||||
|
||||
```bash
|
||||
# Always test in staging first
|
||||
DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy
|
||||
# Verify, then:
|
||||
DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate deploy
|
||||
```
|
||||
✅ Expected: `All migrations have been successfully applied.`
|
||||
|
||||
⚠️ For large table migrations (> 1M rows), use `pg_repack` or add column with DEFAULT separately to avoid table locks.
|
||||
|
||||
## Vacuum & Reindex
|
||||
|
||||
```bash
|
||||
# Check bloat before deciding
|
||||
psql $DATABASE_URL -c "
|
||||
SELECT schemaname, tablename,
|
||||
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
|
||||
n_dead_tup, n_live_tup,
|
||||
ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_ratio
|
||||
FROM pg_stat_user_tables
|
||||
ORDER BY n_dead_tup DESC LIMIT 10;"
|
||||
|
||||
# Vacuum high-bloat tables (non-blocking)
|
||||
psql $DATABASE_URL -c "VACUUM ANALYZE users;"
|
||||
psql $DATABASE_URL -c "VACUUM ANALYZE events;"
|
||||
|
||||
# Reindex (use CONCURRENTLY to avoid locks)
|
||||
psql $DATABASE_URL -c "REINDEX INDEX CONCURRENTLY users_email_idx;"
|
||||
```
|
||||
✅ Expected: dead_ratio drops below 5% after vacuum.
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Staleness Detection
|
||||
|
||||
Add a staleness header to every runbook:
|
||||
|
||||
```markdown
|
||||
## Staleness Check
|
||||
This runbook references the following config files. If they've changed since the
|
||||
"Last verified" date, review the affected steps.
|
||||
|
||||
| Config File | Last Modified | Affects Steps |
|
||||
|-------------|--------------|---------------|
|
||||
| vercel.json | `git log -1 --format=%ci -- vercel.json` | Step 3, Rollback |
|
||||
| prisma/schema.prisma | `git log -1 --format=%ci -- prisma/schema.prisma` | Step 2, DB Maintenance |
|
||||
| .github/workflows/deploy.yml | `git log -1 --format=%ci -- .github/workflows/deploy.yml` | Step 1, Step 3 |
|
||||
| docker-compose.yml | `git log -1 --format=%ci -- docker-compose.yml` | All scaling steps |
|
||||
```
|
||||
|
||||
**Automation:** Add a CI job that runs weekly and comments on the runbook doc if any referenced file was modified more recently than the runbook's "Last verified" date.
|
||||
|
||||
---
|
||||
|
||||
## Runbook Testing Methodology
|
||||
|
||||
### Dry-Run in Staging
|
||||
|
||||
Before trusting a runbook in production, validate every step in staging:
|
||||
|
||||
```bash
|
||||
# 1. Create a staging environment mirror
|
||||
vercel env pull .env.staging
|
||||
source .env.staging
|
||||
|
||||
# 2. Run each step with staging credentials
|
||||
# Replace all $DATABASE_URL with $STAGING_DATABASE_URL
|
||||
# Replace all production URLs with staging URLs
|
||||
|
||||
# 3. Verify expected outputs match
|
||||
# Document any discrepancies and update the runbook
|
||||
|
||||
# 4. Time each step — update estimates in the runbook
|
||||
time npx prisma migrate deploy
|
||||
```
|
||||
|
||||
### Quarterly Review Cadence
|
||||
|
||||
Schedule a 1-hour review every quarter:
|
||||
|
||||
1. **Run each command** in staging — does it still work?
|
||||
2. **Check config drift** — compare "Last Modified" dates vs "Last verified"
|
||||
3. **Test rollback procedures** — actually roll back in staging
|
||||
4. **Update contact info** — L1/L2/L3 may have changed
|
||||
5. **Add new failure modes** discovered in the past quarter
|
||||
6. **Update "Last verified" date** at top of runbook
|
||||
- `references/runbook-templates.md`
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
| Pitfall | Fix |
|
||||
|---|---|
|
||||
| Commands that require manual copy of dynamic values | Use env vars — `$DATABASE_URL` not `postgres://user:pass@host/db` |
|
||||
| No expected output specified | Add ✅ with exact expected string after every verification step |
|
||||
| Rollback steps missing | Every destructive step needs a corresponding undo |
|
||||
| Runbooks that never get tested | Schedule quarterly staging dry-runs in team calendar |
|
||||
| L3 escalation contact is the former CTO | Review contact info every quarter |
|
||||
| Migration runbook doesn't mention table locks | Call out lock risk for large table operations explicitly |
|
||||
|
||||
---
|
||||
- Missing rollback triggers or rollback commands
|
||||
- Steps without expected output checks
|
||||
- Stale ownership/escalation contacts
|
||||
- Runbooks never tested outside of incidents
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Every command must be copy-pasteable** — no placeholder text, use env vars
|
||||
2. **✅ after every step** — explicit expected output, not "it should work"
|
||||
3. **Time estimates are mandatory** — engineers need to know if they have time to fix before SLA breach
|
||||
4. **Rollback before you deploy** — plan the undo before executing
|
||||
5. **Runbooks live in the repo** — `docs/runbooks/`, versioned with the code they describe
|
||||
6. **Postmortem → runbook update** — every incident should improve a runbook
|
||||
7. **Link, don't duplicate** — reference the canonical config file, don't copy its contents into the runbook
|
||||
8. **Test runbooks like you test code** — untested runbooks are worse than no runbooks (false confidence)
|
||||
1. Keep every command copy-pasteable.
|
||||
2. Include health checks after every critical step.
|
||||
3. Validate runbooks on a fixed review cadence.
|
||||
4. Update runbook content after incidents and postmortems.
|
||||
|
||||
@@ -0,0 +1,40 @@
|
||||
# Runbook Templates
|
||||
|
||||
## Deployment Runbook Template
|
||||
|
||||
- Pre-deployment checks
|
||||
- Deploy steps with expected output
|
||||
- Smoke tests
|
||||
- Rollback plan with explicit triggers
|
||||
- Escalation and communication notes
|
||||
|
||||
## Incident Response Template
|
||||
|
||||
- Triage phase (first 5 minutes)
|
||||
- Diagnosis phase (logs, metrics, recent deploys)
|
||||
- Mitigation phase (containment and restoration)
|
||||
- Resolution and postmortem actions
|
||||
|
||||
## Database Maintenance Template
|
||||
|
||||
- Backup and restore verification
|
||||
- Migration sequencing and lock-risk notes
|
||||
- Vacuum/reindex routines
|
||||
- Verification queries and performance checks
|
||||
|
||||
## Staleness Detection Template
|
||||
|
||||
Track referenced config files and update runbooks whenever these change:
|
||||
|
||||
- deployment config (`vercel.json`, Helm charts, Terraform)
|
||||
- CI pipelines (`.github/workflows/*`, `.gitlab-ci.yml`)
|
||||
- data schema/migration definitions
|
||||
- service runtime/env configuration
|
||||
|
||||
## Quarterly Validation Checklist
|
||||
|
||||
1. Execute commands in staging.
|
||||
2. Validate expected outputs.
|
||||
3. Test rollback paths.
|
||||
4. Confirm contact/escalation ownership.
|
||||
5. Update `Last verified` date.
|
||||
128
engineering/runbook-generator/scripts/runbook_generator.py
Executable file
128
engineering/runbook-generator/scripts/runbook_generator.py
Executable file
@@ -0,0 +1,128 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Generate an operational runbook skeleton for a service."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
from datetime import date
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def build_runbook(service: str, owner: str, environment: str) -> str:
|
||||
today = date.today().isoformat()
|
||||
return f"""# Runbook - {service}
|
||||
|
||||
- Service: {service}
|
||||
- Owner: {owner}
|
||||
- Environment: {environment}
|
||||
- Last verified: {today}
|
||||
|
||||
## Overview
|
||||
|
||||
Describe the service purpose, dependencies, and critical user impact.
|
||||
|
||||
## Preconditions
|
||||
|
||||
- Access to deployment platform
|
||||
- Access to logs/metrics
|
||||
- Access to secret/config manager
|
||||
|
||||
## Start Procedure
|
||||
|
||||
1. Pull latest config/secrets.
|
||||
2. Start service process.
|
||||
3. Confirm process is healthy.
|
||||
|
||||
```bash
|
||||
# Example
|
||||
# systemctl start {service}
|
||||
```
|
||||
|
||||
## Stop Procedure
|
||||
|
||||
1. Drain traffic if applicable.
|
||||
2. Stop service process.
|
||||
3. Confirm no active workers remain.
|
||||
|
||||
```bash
|
||||
# Example
|
||||
# systemctl stop {service}
|
||||
```
|
||||
|
||||
## Health Checks
|
||||
|
||||
- HTTP health endpoint
|
||||
- Dependency connectivity checks
|
||||
- Error-rate and latency checks
|
||||
|
||||
```bash
|
||||
# Example
|
||||
# curl -sf https://{service}.example.com/health
|
||||
```
|
||||
|
||||
## Deployment Checklist
|
||||
|
||||
1. Verify CI status and artifact integrity.
|
||||
2. Apply migrations (if required) in safe order.
|
||||
3. Deploy service revision.
|
||||
4. Run smoke checks.
|
||||
5. Observe metrics for 10-15 minutes.
|
||||
|
||||
## Rollback
|
||||
|
||||
1. Identify last known good release.
|
||||
2. Re-deploy previous version.
|
||||
3. Re-run health checks.
|
||||
4. Communicate rollback status to stakeholders.
|
||||
|
||||
```bash
|
||||
# Example
|
||||
# deployctl rollback --service {service}
|
||||
```
|
||||
|
||||
## Incident Response
|
||||
|
||||
1. Classify severity.
|
||||
2. Contain user impact.
|
||||
3. Triage likely failing component.
|
||||
4. Escalate if SLA risk is high.
|
||||
|
||||
## Escalation
|
||||
|
||||
- L1: On-call engineer
|
||||
- L2: Service owner ({owner})
|
||||
- L3: Platform/Engineering leadership
|
||||
|
||||
## Post-Incident
|
||||
|
||||
1. Write timeline and root cause.
|
||||
2. Define corrective actions with owners.
|
||||
3. Update this runbook with missing steps.
|
||||
"""
|
||||
|
||||
|
||||
def parse_args() -> argparse.Namespace:
|
||||
parser = argparse.ArgumentParser(description="Generate a markdown runbook skeleton.")
|
||||
parser.add_argument("service", help="Service name")
|
||||
parser.add_argument("--owner", default="platform-team", help="Service owner label")
|
||||
parser.add_argument("--environment", default="production", help="Primary environment")
|
||||
parser.add_argument("--output", help="Optional output path (prints to stdout if omitted)")
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def main() -> int:
|
||||
args = parse_args()
|
||||
markdown = build_runbook(args.service, owner=args.owner, environment=args.environment)
|
||||
|
||||
if args.output:
|
||||
path = Path(args.output)
|
||||
path.parent.mkdir(parents=True, exist_ok=True)
|
||||
path.write_text(markdown, encoding="utf-8")
|
||||
print(f"Wrote runbook skeleton to {path}")
|
||||
else:
|
||||
print(markdown)
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
Reference in New Issue
Block a user