# Runbook: [Service/Component Name]

**Owner:** [Team Name]  
**Last Updated:** [YYYY-MM-DD]  
**Reviewed By:** [Name]  
**Review Cadence:** Quarterly

---
## Service Overview
| Property | Value |
|----------|-------|
| **Service** | [service-name] |
| **Repository** | [repo URL] |
| **Dashboard** | [monitoring dashboard URL] |
| **On-Call Rotation** | [PagerDuty/OpsGenie schedule URL] |
| **SLA Tier** | [Tier 1/2/3] |
| **Availability Target** | [99.9% / 99.95% / 99.99%] |
| **Dependencies** | [list upstream/downstream services] |
| **Owner Team** | [team name] |
| **Escalation Contact** | [name/email] |
### Architecture Summary
[2-3 sentence description of the service architecture. Include key components, data stores, and external dependencies.]

---
## Alert Response Decision Tree
### High Error Rate (>5%)
```
Error Rate Alert Fired
├── Check: Is this a deployment-related issue?
│   ├── YES → Go to "Recent Deployment Rollback" section
│   └── NO → Continue
├── Check: Is a downstream dependency failing?
│   ├── YES → Go to "Dependency Failure" section
│   └── NO → Continue
├── Check: Is there unusual traffic volume?
│   ├── YES → Go to "Traffic Spike" section
│   └── NO → Continue
└── Escalate: Engage on-call secondary + service owner
```
### High Latency (p99 > [threshold]ms)
```
Latency Alert Fired
├── Check: Database query latency elevated?
│   ├── YES → Go to "Database Performance" section
│   └── NO → Continue
├── Check: Connection pool utilization >80%?
│   ├── YES → Go to "Connection Pool Exhaustion" section
│   └── NO → Continue
├── Check: Memory/CPU pressure on service instances?
│   ├── YES → Go to "Instance Restart" section
│   └── NO → Continue
└── Escalate: Engage on-call secondary + service owner
```
### Service Unavailable (Health Check Failing)
```
Health Check Alert Fired
├── Check: Are all instances down?
│   ├── YES → Go to "Complete Outage" section
│   └── NO → Continue
├── Check: Is only one AZ affected?
│   ├── YES → Go to "AZ Failure" section
│   └── NO → Continue
├── Check: Can instances be restarted?
│   ├── YES → Go to "Instance Restart" section
│   └── NO → Continue
└── Escalate: Declare incident, engage IC
```
---
## Common Scenarios
### Recent Deployment Rollback

**Symptoms:** Error rate spike or latency increase within 60 minutes of a deployment.

**Diagnosis:**
1. Check deployment history: `kubectl rollout history deployment/[service-name]`
2. Compare error rate timing with deployment timestamp
3. Review deployment diff for risky changes

**Mitigation:**
1. Initiate rollback: `kubectl rollout undo deployment/[service-name]`
2. Verify rollback: `kubectl rollout status deployment/[service-name]`
3. Confirm error rate returns to baseline (allow 5 minutes)
4. If rollback fails: escalate immediately

**Communication:** If customer-impacting, update status page within 5 minutes of confirming impact.

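
The rollback steps above can be wrapped in one helper. This is a sketch only: it assumes a standard Kubernetes Deployment whose name matches the service, and that `kubectl` is already pointed at the right cluster and namespace.

```shell
#!/usr/bin/env bash
# Sketch: roll back a deployment and block until the rollback completes.
# The service name is a placeholder argument; nothing here runs automatically.
set -euo pipefail

rollback_and_verify() {
  local svc="$1"
  # Step 1: revert to the previous ReplicaSet revision
  kubectl rollout undo "deployment/${svc}"
  # Step 2: wait for the rollback to finish (non-zero exit on timeout)
  kubectl rollout status "deployment/${svc}" --timeout=5m
  # Step 3 (confirming error rate returns to baseline) is dashboard-specific
  # and intentionally left manual.
}

# usage: rollback_and_verify [service-name]
```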

---
### Database Performance

**Symptoms:** Elevated query latency, connection pool saturation, timeout errors.

**Diagnosis:**
1. Check active queries: `SELECT * FROM pg_stat_activity WHERE state = 'active';`
2. Check for long-running queries: `SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;`
3. Check connection count: `SELECT count(*) FROM pg_stat_activity;`
4. Check table bloat and vacuum status

**Mitigation:**
1. Kill long-running queries if identified: `SELECT pg_terminate_backend([pid]);`
2. If connection pool exhausted: increase pool size via config (requires restart)
3. If read replica available: redirect read traffic
4. If write-heavy: identify and defer non-critical writes

**Escalation Trigger:** If query latency >10s for >5 minutes, escalate to DBA on-call.

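
The diagnosis and mitigation queries above can be collected into two helpers. A sketch, assuming `psql` connects via the standard `PG*` environment variables; always confirm a pid is safe to kill before terminating it.

```shell
#!/usr/bin/env bash
# Sketch: helpers around the pg_stat_activity queries above.
set -euo pipefail

# List queries running longer than a threshold (default: 10 seconds)
long_queries() {
  local threshold="${1:-10 seconds}"
  psql -Atc "SELECT pid, now() - query_start AS duration, left(query, 80)
             FROM pg_stat_activity
             WHERE state <> 'idle'
               AND now() - query_start > interval '${threshold}'
             ORDER BY duration DESC;"
}

# Terminate one backend, after confirming the pid with long_queries
kill_query() {
  psql -Atc "SELECT pg_terminate_backend(${1});"
}
```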

---
### Connection Pool Exhaustion

**Symptoms:** Connection timeout errors, pool utilization >90%, requests queuing.

**Diagnosis:**
1. Check pool metrics: current size, active connections, waiting requests
2. Check for connection leaks: connections held >30s without activity
3. Review recent config changes or deployments

**Mitigation:**
1. Increase pool size (if infrastructure allows): update config, rolling restart
2. Kill idle connections exceeding timeout
3. If caused by leak: identify and restart affected instances
4. Enable connection pool auto-scaling if available

**Prevention:** Pool utilization alerting at 70% (warning) and 85% (critical).

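
The prevention thresholds can be encoded directly in a check script or alerting rule. A minimal sketch using the 70%/85% levels named in this section:

```shell
#!/usr/bin/env bash
# Sketch: map pool utilization (integer percent) to the alert levels above.
pool_alert_level() {
  local pct="$1"
  if   [ "$pct" -ge 85 ]; then echo critical
  elif [ "$pct" -ge 70 ]; then echo warning
  else                         echo ok
  fi
}

# usage: pool_alert_level 72   # prints "warning"
```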

---
### Dependency Failure

**Symptoms:** Errors correlated with downstream service failures, circuit breakers tripping.

**Diagnosis:**
1. Check dependency status dashboards
2. Verify circuit breaker state: open/half-open/closed
3. Check for correlation with dependency deployments or incidents
4. Test dependency health endpoints directly

**Mitigation:**
1. If circuit breaker not tripping: verify timeout/threshold configuration
2. Enable graceful degradation (serve cached/default responses)
3. If critical path: engage dependency team via incident process
4. If non-critical path: disable feature flag for affected functionality

**Communication:** Coordinate with dependency team IC if both services have active incidents.

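
Diagnosis step 4 (testing dependency health endpoints directly) can be scripted. A sketch; the health-check URL is a placeholder for the dependency's real endpoint:

```shell
#!/usr/bin/env bash
# Sketch: probe a dependency health endpoint directly.
set -euo pipefail

probe_dependency() {
  local url="$1"
  # -f: treat HTTP >= 400 as failure; --max-time keeps the probe from hanging
  if curl -sf --max-time 5 -o /dev/null "$url"; then
    echo healthy
  else
    echo unhealthy
  fi
}

# usage: probe_dependency https://dependency.internal/healthz
```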

---
### Traffic Spike

**Symptoms:** Sudden traffic increase beyond normal patterns, resource saturation.

**Diagnosis:**
1. Check traffic source: organic growth vs. bot traffic vs. DDoS
2. Review rate limiting effectiveness
3. Check auto-scaling status and capacity

**Mitigation:**
1. If bot/DDoS: enable rate limiting, engage security team
2. If organic: trigger manual scale-up, increase auto-scaling limits
3. Enable request queuing or load shedding if at capacity
4. Consider feature flag toggles to reduce per-request cost

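
Mitigation step 3 (load shedding) is commonly implemented as probabilistic rejection. A toy sketch of the idea; real load shedding belongs in the proxy or middleware layer, not a shell script:

```shell
#!/usr/bin/env bash
# Sketch: decide whether to shed (reject) a request, given a shed percentage.
should_shed() {
  local shed_pct="$1"
  # RANDOM is uniform over 0-32767; this sheds roughly shed_pct% of calls
  [ $(( RANDOM % 100 )) -lt "$shed_pct" ]
}

# usage: should_shed 25 && echo "reject" || echo "serve"
```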

---
### Complete Outage

**Symptoms:** All instances unreachable, health checks failing across AZs.

**Diagnosis:**
1. Check infrastructure status (AWS/GCP status page)
2. Verify network connectivity and DNS resolution
3. Check for infrastructure-level incidents (region outage)
4. Review recent infrastructure changes (Terraform, network config)

**Mitigation:**
1. If infra provider issue: activate disaster recovery plan
2. If DNS issue: update DNS records, reduce TTL
3. If deployment corruption: redeploy last known good version
4. If data corruption: engage data recovery procedures

**Escalation:** Immediately declare SEV1 incident. Engage infrastructure team and management.


---
### Instance Restart

**Symptoms:** Individual instances unhealthy, OOM kills, process crashes.

**Diagnosis:**
1. Check instance logs for crash reason
2. Review memory/CPU usage patterns before crash
3. Check for memory leaks or resource exhaustion
4. Verify configuration consistency across instances

**Mitigation:**
1. Restart unhealthy instances: `kubectl delete pod [pod-name]`
2. If recurring: cordon node and migrate workloads
3. If memory leak: schedule immediate patch with increased memory limit
4. Monitor for recurrence after restart

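
The restart flow above can be sketched as one helper, assuming the pod belongs to a Deployment or ReplicaSet that recreates it; the pod name and label selector are placeholders:

```shell
#!/usr/bin/env bash
# Sketch: capture the crash reason, restart one pod, wait for a healthy replacement.
set -euo pipefail

restart_pod() {
  local pod="$1" selector="$2"
  # Diagnosis first: logs from the previous (crashed) container, if any
  kubectl logs "$pod" --previous --tail=50 || true
  # Delete the pod; the controller schedules a replacement
  kubectl delete pod "$pod"
  # Wait until replacement pods report Ready
  kubectl wait --for=condition=Ready pod -l "$selector" --timeout=120s
}

# usage: restart_pod [pod-name] app=[service-name]
```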

---
### AZ Failure

**Symptoms:** All instances in one availability zone failing, others healthy.

**Diagnosis:**
1. Confirm AZ-specific failure vs. instance-specific issues
2. Check cloud provider AZ status
3. Verify load balancer is routing around failed AZ

**Mitigation:**
1. Ensure load balancer marks AZ instances as unhealthy
2. Scale up remaining AZs to handle redirected traffic
3. If auto-scaling: verify it's responding to increased load
4. Monitor remaining AZs for cascade effects


---
## Key Metrics & Dashboards
| Metric | Normal Range | Warning | Critical | Dashboard |
|--------|-------------|---------|----------|-----------|
| Error Rate | <0.1% | >1% | >5% | [link] |
| p99 Latency | <200ms | >500ms | >2000ms | [link] |
| CPU Usage | <60% | >75% | >90% | [link] |
| Memory Usage | <70% | >80% | >90% | [link] |
| DB Pool Usage | <50% | >70% | >85% | [link] |
| Request Rate | [baseline]±20% | ±50% | ±100% | [link] |
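
The error-rate thresholds in the table can be checked mechanically. A sketch using `awk` for the fractional comparison (shell arithmetic is integer-only):

```shell
#!/usr/bin/env bash
# Sketch: classify an error-rate sample (percent) against the table's thresholds.
error_rate_level() {
  awk -v r="$1" 'BEGIN {
    if (r > 5)      print "critical"
    else if (r > 1) print "warning"
    else            print "ok"
  }'
}

# usage: error_rate_level 0.05   # prints "ok"
```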
---
## Escalation Contacts
| Level | Contact | When |
|-------|---------|------|
| L1: On-Call Primary | [name/rotation] | First responder |
| L2: On-Call Secondary | [name/rotation] | Primary unavailable or needs help |
| L3: Service Owner | [name] | Complex issues, architectural decisions |
| L4: Engineering Manager | [name] | SEV1/SEV2, customer impact, resource needs |
| L5: VP Engineering | [name] | SEV1 >30 min, major customer/revenue impact |
---
## Maintenance Procedures
### Planned Maintenance Checklist
- [ ] Maintenance window scheduled and communicated (72 hours advance for Tier 1)
- [ ] Status page updated with planned maintenance notice
- [ ] Rollback plan documented and tested
- [ ] On-call notified of maintenance window
- [ ] Customer notification sent (if SLA-impacting)
- [ ] Post-maintenance verification plan ready
### Health Verification After Changes
1. Check all health endpoints return 200
2. Verify error rate returns to baseline within 5 minutes
3. Confirm latency within normal range
4. Run synthetic transaction test
5. Monitor for 15 minutes before declaring success
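
Step 1 of the verification can be scripted across all instances. A sketch; the endpoint URLs are placeholders for the service's real health checks:

```shell
#!/usr/bin/env bash
# Sketch: confirm every listed health endpoint returns HTTP 200.
set -euo pipefail

verify_health() {
  local failed=0 url code
  for url in "$@"; do
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")" || code=000
    if [ "$code" != "200" ]; then
      echo "FAIL ${url} -> ${code}"
      failed=1
    fi
  done
  return "$failed"
}

# usage: verify_health https://a.internal/healthz https://b.internal/healthz
```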
---
## Revision History
| Date | Author | Change |
|------|--------|--------|
| [YYYY-MM-DD] | [Name] | Initial version |
| [YYYY-MM-DD] | [Name] | [Description of update] |
---
*This runbook should be reviewed quarterly and updated after every incident that reveals missing procedures. The on-call engineer should be able to follow this document without prior context about the service. If any section requires tribal knowledge to execute, it needs to be expanded.*