Files
claude-skills-reference/engineering-team/incident-commander/assets/runbook_template.md
Leo f6f50f5282 Fix CI workflows and installation documentation
- Replace non-existent anthropics/claude-code-action@v1 with direct bash steps in smart-sync.yml and pr-issue-auto-close.yml
- Add missing checkout steps to both workflows for WORKFLOW_KILLSWITCH access
- Fix Issue #189: Replace broken 'npx ai-agent-skills install' with working 'npx agent-skills-cli add' command
- Update README.md and INSTALLATION.md with correct Agent Skills CLI commands and repository links
- Verified: agent-skills-cli detects all 53 skills and works with 42+ AI agents

Fixes: Two GitHub Actions workflows that broke on PR #191 merge
Closes: #189
2026-02-16 11:30:18 +00:00

9.5 KiB

Runbook: [Service/Component Name]

Owner: [Team Name] Last Updated: [YYYY-MM-DD] Reviewed By: [Name] Review Cadence: Quarterly


Service Overview

Property Value
Service [service-name]
Repository [repo URL]
Dashboard [monitoring dashboard URL]
On-Call Rotation [PagerDuty/OpsGenie schedule URL]
SLA Tier [Tier 1/2/3]
Availability Target [99.9% / 99.95% / 99.99%]
Dependencies [list upstream/downstream services]
Owner Team [team name]
Escalation Contact [name/email]

Architecture Summary

[2-3 sentence description of the service architecture. Include key components, data stores, and external dependencies.]


Alert Response Decision Tree

High Error Rate (>5%)

Error Rate Alert Fired
├── Check: Is this a deployment-related issue?
│   ├── YES → Go to "Recent Deployment Rollback" section
│   └── NO → Continue
├── Check: Is a downstream dependency failing?
│   ├── YES → Go to "Dependency Failure" section
│   └── NO → Continue
├── Check: Is there unusual traffic volume?
│   ├── YES → Go to "Traffic Spike" section
│   └── NO → Continue
└── Escalate: Engage on-call secondary + service owner

High Latency (p99 > [threshold]ms)

Latency Alert Fired
├── Check: Database query latency elevated?
│   ├── YES → Go to "Database Performance" section
│   └── NO → Continue
├── Check: Connection pool utilization >80%?
│   ├── YES → Go to "Connection Pool Exhaustion" section
│   └── NO → Continue
├── Check: Memory/CPU pressure on service instances?
│   ├── YES → Go to "Resource Exhaustion" section
│   └── NO → Continue
└── Escalate: Engage on-call secondary + service owner

Service Unavailable (Health Check Failing)

Health Check Alert Fired
├── Check: Are all instances down?
│   ├── YES → Go to "Complete Outage" section
│   └── NO → Continue
├── Check: Is only one AZ affected?
│   ├── YES → Go to "AZ Failure" section
│   └── NO → Continue
├── Check: Can instances be restarted?
│   ├── YES → Go to "Instance Restart" section
│   └── NO → Continue
└── Escalate: Declare incident, engage IC

Common Scenarios

Recent Deployment Rollback

Symptoms: Error rate spike or latency increase within 60 minutes of a deployment.

Diagnosis:

  1. Check deployment history: kubectl rollout history deployment/[service-name]
  2. Compare error rate timing with deployment timestamp
  3. Review deployment diff for risky changes

Mitigation:

  1. Initiate rollback: kubectl rollout undo deployment/[service-name]
  2. Verify rollback: kubectl rollout status deployment/[service-name]
  3. Confirm error rate returns to baseline (allow 5 minutes)
  4. If rollback fails: escalate immediately

Communication: If customer-impacting, update status page within 5 minutes of confirming impact.


Database Performance

Symptoms: Elevated query latency, connection pool saturation, timeout errors.

Diagnosis:

  1. Check active queries: SELECT * FROM pg_stat_activity WHERE state = 'active';
  2. Check for long-running queries: SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;
  3. Check connection count: SELECT count(*) FROM pg_stat_activity;
  4. Check table bloat and vacuum status

Mitigation:

  1. Kill long-running queries if identified: SELECT pg_terminate_backend([pid]);
  2. If connection pool exhausted: increase pool size via config (requires restart)
  3. If read replica available: redirect read traffic
  4. If write-heavy: identify and defer non-critical writes

Escalation Trigger: If query latency >10s for >5 minutes, escalate to DBA on-call.


Connection Pool Exhaustion

Symptoms: Connection timeout errors, pool utilization >90%, requests queuing.

Diagnosis:

  1. Check pool metrics: current size, active connections, waiting requests
  2. Check for connection leaks: connections held >30s without activity
  3. Review recent config changes or deployments

Mitigation:

  1. Increase pool size (if infrastructure allows): update config, rolling restart
  2. Kill idle connections exceeding timeout
  3. If caused by leak: identify and restart affected instances
  4. Enable connection pool auto-scaling if available

Prevention: Pool utilization alerting at 70% (warning) and 85% (critical).


Dependency Failure

Symptoms: Errors correlated with downstream service failures, circuit breakers tripping.

Diagnosis:

  1. Check dependency status dashboards
  2. Verify circuit breaker state: open/half-open/closed
  3. Check for correlation with dependency deployments or incidents
  4. Test dependency health endpoints directly

Mitigation:

  1. If circuit breaker not tripping: verify timeout/threshold configuration
  2. Enable graceful degradation (serve cached/default responses)
  3. If critical path: engage dependency team via incident process
  4. If non-critical path: disable feature flag for affected functionality

Communication: Coordinate with dependency team IC if both services have active incidents.


Traffic Spike

Symptoms: Sudden traffic increase beyond normal patterns, resource saturation.

Diagnosis:

  1. Check traffic source: organic growth vs. bot traffic vs. DDoS
  2. Review rate limiting effectiveness
  3. Check auto-scaling status and capacity

Mitigation:

  1. If bot/DDoS: enable rate limiting, engage security team
  2. If organic: trigger manual scale-up, increase auto-scaling limits
  3. Enable request queuing or load shedding if at capacity
  4. Consider feature flag toggles to reduce per-request cost

Complete Outage

Symptoms: All instances unreachable, health checks failing across AZs.

Diagnosis:

  1. Check infrastructure status (AWS/GCP status page)
  2. Verify network connectivity and DNS resolution
  3. Check for infrastructure-level incidents (region outage)
  4. Review recent infrastructure changes (Terraform, network config)

Mitigation:

  1. If infra provider issue: activate disaster recovery plan
  2. If DNS issue: update DNS records, reduce TTL
  3. If deployment corruption: redeploy last known good version
  4. If data corruption: engage data recovery procedures

Escalation: Immediately declare SEV1 incident. Engage infrastructure team and management.


Instance Restart

Symptoms: Individual instances unhealthy, OOM kills, process crashes.

Diagnosis:

  1. Check instance logs for crash reason
  2. Review memory/CPU usage patterns before crash
  3. Check for memory leaks or resource exhaustion
  4. Verify configuration consistency across instances

Mitigation:

  1. Restart unhealthy instances: kubectl delete pod [pod-name]
  2. If recurring: cordon node and migrate workloads
  3. If memory leak: schedule immediate patch with increased memory limit
  4. Monitor for recurrence after restart

AZ Failure

Symptoms: All instances in one availability zone failing, others healthy.

Diagnosis:

  1. Confirm AZ-specific failure vs. instance-specific issues
  2. Check cloud provider AZ status
  3. Verify load balancer is routing around failed AZ

Mitigation:

  1. Ensure load balancer marks AZ instances as unhealthy
  2. Scale up remaining AZs to handle redirected traffic
  3. If auto-scaling: verify it's responding to increased load
  4. Monitor remaining AZs for cascade effects

Key Metrics & Dashboards

Metric Normal Range Warning Critical Dashboard
Error Rate <0.1% >1% >5% [link]
p99 Latency <200ms >500ms >2000ms [link]
CPU Usage <60% >75% >90% [link]
Memory Usage <70% >80% >90% [link]
DB Pool Usage <50% >70% >85% [link]
Request Rate [baseline]±20% ±50% ±100% [link]

Escalation Contacts

Level Contact When
L1: On-Call Primary [name/rotation] First responder
L2: On-Call Secondary [name/rotation] Primary unavailable or needs help
L3: Service Owner [name] Complex issues, architectural decisions
L4: Engineering Manager [name] SEV1/SEV2, customer impact, resource needs
L5: VP Engineering [name] SEV1 >30 min, major customer/revenue impact

Maintenance Procedures

Planned Maintenance Checklist

  • Maintenance window scheduled and communicated (72 hours advance for Tier 1)
  • Status page updated with planned maintenance notice
  • Rollback plan documented and tested
  • On-call notified of maintenance window
  • Customer notification sent (if SLA-impacting)
  • Post-maintenance verification plan ready

Health Verification After Changes

  1. Check all health endpoints return 200
  2. Verify error rate returns to baseline within 5 minutes
  3. Confirm latency within normal range
  4. Run synthetic transaction test
  5. Monitor for 15 minutes before declaring success

Revision History

Date Author Change
[YYYY-MM-DD] [Name] Initial version
[YYYY-MM-DD] [Name] [Description of update]

This runbook should be reviewed quarterly and updated after every incident that reveals missing procedures. The on-call engineer should be able to follow this document without prior context about the service. If any section requires tribal knowledge to execute, it needs to be expanded.