# Runbook: [Service/Component Name]

**Owner:** [Team Name]

**Last Updated:** [YYYY-MM-DD]

**Reviewed By:** [Name]

**Review Cadence:** Quarterly

---
## Service Overview

| Property | Value |
|----------|-------|
| **Service** | [service-name] |
| **Repository** | [repo URL] |
| **Dashboard** | [monitoring dashboard URL] |
| **On-Call Rotation** | [PagerDuty/OpsGenie schedule URL] |
| **SLA Tier** | [Tier 1/2/3] |
| **Availability Target** | [99.9% / 99.95% / 99.99%] |
| **Dependencies** | [list upstream/downstream services] |
| **Owner Team** | [team name] |
| **Escalation Contact** | [name/email] |

### Architecture Summary

[2-3 sentence description of the service architecture. Include key components, data stores, and external dependencies.]

---
## Alert Response Decision Tree

### High Error Rate (>5%)

```
Error Rate Alert Fired
├── Check: Is this a deployment-related issue?
│   ├── YES → Go to "Recent Deployment Rollback" section
│   └── NO → Continue
├── Check: Is a downstream dependency failing?
│   ├── YES → Go to "Dependency Failure" section
│   └── NO → Continue
├── Check: Is there unusual traffic volume?
│   ├── YES → Go to "Traffic Spike" section
│   └── NO → Continue
└── Escalate: Engage on-call secondary + service owner
```
### High Latency (p99 > [threshold]ms)

```
Latency Alert Fired
├── Check: Database query latency elevated?
│   ├── YES → Go to "Database Performance" section
│   └── NO → Continue
├── Check: Connection pool utilization >80%?
│   ├── YES → Go to "Connection Pool Exhaustion" section
│   └── NO → Continue
├── Check: Memory/CPU pressure on service instances?
│   ├── YES → Go to "Resource Exhaustion" section
│   └── NO → Continue
└── Escalate: Engage on-call secondary + service owner
```
### Service Unavailable (Health Check Failing)

```
Health Check Alert Fired
├── Check: Are all instances down?
│   ├── YES → Go to "Complete Outage" section
│   └── NO → Continue
├── Check: Is only one AZ affected?
│   ├── YES → Go to "AZ Failure" section
│   └── NO → Continue
├── Check: Can instances be restarted?
│   ├── YES → Go to "Instance Restart" section
│   └── NO → Continue
└── Escalate: Declare incident, engage IC
```
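For teams that want these checks scriptable, the first tree above can be sketched as an ordered series of guards. This is a minimal illustration; the yes/no flags are hypothetical inputs supplied by the operator or an automated checker:

```shell
#!/bin/sh
# Walk the "High Error Rate" tree: the first matching check wins,
# otherwise fall through to escalation. Each argument is "yes" or "no".
error_rate_triage() {
  deploy_related=$1; dependency_failing=$2; traffic_spike=$3
  if [ "$deploy_related" = "yes" ]; then
    echo "Recent Deployment Rollback"
  elif [ "$dependency_failing" = "yes" ]; then
    echo "Dependency Failure"
  elif [ "$traffic_spike" = "yes" ]; then
    echo "Traffic Spike"
  else
    echo "Escalate: engage on-call secondary + service owner"
  fi
}
```

`error_rate_triage no yes no` prints `Dependency Failure`; the same pattern extends to the latency and health-check trees.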
---
## Common Scenarios

### Recent Deployment Rollback

**Symptoms:** Error rate spike or latency increase within 60 minutes of a deployment.

**Diagnosis:**

1. Check deployment history: `kubectl rollout history deployment/[service-name]`
2. Compare error rate timing with deployment timestamp
3. Review deployment diff for risky changes

**Mitigation:**

1. Initiate rollback: `kubectl rollout undo deployment/[service-name]`
2. Verify rollback: `kubectl rollout status deployment/[service-name]`
3. Confirm error rate returns to baseline (allow 5 minutes)
4. If rollback fails: escalate immediately

**Communication:** If customer-impacting, update status page within 5 minutes of confirming impact.
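Mitigation step 3 ("confirm error rate returns to baseline") can be automated. A minimal sketch, assuming the caller fetches both values from the monitoring system; the 0.1-point default tolerance is an assumption, tune it to your SLO:

```shell
#!/bin/sh
# Report whether the current error rate is back within tolerance of
# baseline. Rates are percentages supplied by the caller (e.g. pulled
# from your monitoring API).
rate_recovered() {
  baseline=$1; current=$2; tolerance=${3:-0.1}
  awk -v b="$baseline" -v c="$current" -v t="$tolerance" 'BEGIN {
    if (c <= b + t) print "recovered"; else print "still elevated"
  }'
}
```

Poll this for the 5-minute window before declaring the rollback successful.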
---
### Database Performance

**Symptoms:** Elevated query latency, connection pool saturation, timeout errors.

**Diagnosis:**

1. Check active queries: `SELECT * FROM pg_stat_activity WHERE state = 'active';`
2. Check for long-running queries: `SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;`
3. Check connection count: `SELECT count(*) FROM pg_stat_activity;`
4. Check table bloat and vacuum status

**Mitigation:**

1. Kill long-running queries if identified: `SELECT pg_terminate_backend([pid]);`
2. If connection pool exhausted: increase pool size via config (requires restart)
3. If read replica available: redirect read traffic
4. If write-heavy: identify and defer non-critical writes

**Escalation Trigger:** If query latency >10s for >5 minutes, escalate to DBA on-call.
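Terminating offenders one pid at a time is tedious mid-incident. A hedged sketch of mitigation step 1 in bulk, assuming PostgreSQL with connection settings (`PGHOST`, `PGDATABASE`, etc.) already in the environment; review the candidate pids with the diagnosis queries above before running it:

```shell
#!/bin/sh
# Terminate every non-idle query that has been running longer than
# N seconds (default 10), excluding our own session.
kill_long_queries() {
  seconds=${1:-10}
  psql -Atc "SELECT pg_terminate_backend(pid)
               FROM pg_stat_activity
              WHERE state <> 'idle'
                AND pid <> pg_backend_pid()
                AND now() - query_start > interval '${seconds} seconds';"
}
```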
---
### Connection Pool Exhaustion

**Symptoms:** Connection timeout errors, pool utilization >90%, requests queuing.

**Diagnosis:**

1. Check pool metrics: current size, active connections, waiting requests
2. Check for connection leaks: connections held >30s without activity
3. Review recent config changes or deployments

**Mitigation:**

1. Increase pool size (if infrastructure allows): update config, rolling restart
2. Kill idle connections exceeding timeout
3. If caused by leak: identify and restart affected instances
4. Enable connection pool auto-scaling if available

**Prevention:** Pool utilization alerting at 70% (warning) and 85% (critical).
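The 70%/85% thresholds can back a simple check in a cron job or sidecar. A sketch; the utilization percentage is assumed to come from your pool's metrics endpoint:

```shell
#!/bin/sh
# Classify pool utilization (a percentage) against the 70% warning /
# 85% critical alert thresholds.
pool_alert_level() {
  awk -v u="$1" 'BEGIN {
    if (u > 85)      print "critical"
    else if (u > 70) print "warning"
    else             print "ok"
  }'
}
```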
---
### Dependency Failure

**Symptoms:** Errors correlated with downstream service failures, circuit breakers tripping.

**Diagnosis:**

1. Check dependency status dashboards
2. Verify circuit breaker state: open/half-open/closed
3. Check for correlation with dependency deployments or incidents
4. Test dependency health endpoints directly
5. 
**Mitigation:**

1. If circuit breaker not tripping: verify timeout/threshold configuration
2. Enable graceful degradation (serve cached/default responses)
3. If critical path: engage dependency team via incident process
4. If non-critical path: disable feature flag for affected functionality

**Communication:** Coordinate with dependency team IC if both services have active incidents.
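For diagnosis step 4, a direct probe with a short timeout avoids being misled by stale dashboards. The `/healthz` path and 2-second timeout here are assumptions; substitute the dependency's real health endpoint:

```shell
#!/bin/sh
# Hit a dependency's health endpoint and print only the HTTP status
# code; -m bounds the total request time so the probe itself cannot hang.
probe_dependency() {
  base_url=$1
  curl -s -o /dev/null -m 2 -w '%{http_code}' "${base_url}/healthz"
}
```

Anything other than `200` (or `000` on timeout) points at the dependency rather than your service.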
---
### Traffic Spike

**Symptoms:** Sudden traffic increase beyond normal patterns, resource saturation.

**Diagnosis:**

1. Check traffic source: organic growth vs. bot traffic vs. DDoS
2. Review rate limiting effectiveness
3. Check auto-scaling status and capacity

**Mitigation:**

1. If bot/DDoS: enable rate limiting, engage security team
2. If organic: trigger manual scale-up, increase auto-scaling limits
3. Enable request queuing or load shedding if at capacity
4. Consider feature flag toggles to reduce per-request cost
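The request-rate bands from the Key Metrics & Dashboards table below (baseline ±20% / ±50% / ±100%) can be checked mechanically. A sketch; the "elevated" label for the 20-50% band is our own, since the table leaves that range unnamed:

```shell
#!/bin/sh
# Absolute percent deviation of the current request rate from baseline,
# mapped onto the alerting bands from the metrics table.
rate_deviation_level() {
  awk -v b="$1" -v c="$2" 'BEGIN {
    d = (c - b) / b * 100; if (d < 0) d = -d
    if (d >= 100)     print "critical"
    else if (d >= 50) print "warning"
    else if (d >= 20) print "elevated"
    else              print "normal"
  }'
}
```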
---
### Complete Outage

**Symptoms:** All instances unreachable, health checks failing across AZs.

**Diagnosis:**

1. Check infrastructure status (AWS/GCP status page)
2. Verify network connectivity and DNS resolution
3. Check for infrastructure-level incidents (region outage)
4. Review recent infrastructure changes (Terraform, network config)

**Mitigation:**

1. If infra provider issue: activate disaster recovery plan
2. If DNS issue: update DNS records, reduce TTL
3. If deployment corruption: redeploy last known good version
4. If data corruption: engage data recovery procedures

**Escalation:** Immediately declare SEV1 incident. Engage infrastructure team and management.
---
### Instance Restart

**Symptoms:** Individual instances unhealthy, OOM kills, process crashes.

**Diagnosis:**

1. Check instance logs for crash reason
2. Review memory/CPU usage patterns before crash
3. Check for memory leaks or resource exhaustion
4. Verify configuration consistency across instances

**Mitigation:**

1. Restart unhealthy instances: `kubectl delete pod [pod-name]`
2. If recurring: cordon node and migrate workloads
3. If memory leak: schedule immediate patch with increased memory limit
4. Monitor for recurrence after restart
---
### AZ Failure

**Symptoms:** All instances in one availability zone failing, others healthy.

**Diagnosis:**

1. Confirm AZ-specific failure vs. instance-specific issues
2. Check cloud provider AZ status
3. Verify load balancer is routing around failed AZ

**Mitigation:**

1. Ensure load balancer marks AZ instances as unhealthy
2. Scale up remaining AZs to handle redirected traffic
3. If auto-scaling: verify it's responding to increased load
4. Monitor remaining AZs for cascade effects
---
## Key Metrics & Dashboards

| Metric | Normal Range | Warning | Critical | Dashboard |
|--------|-------------|---------|----------|-----------|
| Error Rate | <0.1% | >1% | >5% | [link] |
| p99 Latency | <200ms | >500ms | >2000ms | [link] |
| CPU Usage | <60% | >75% | >90% | [link] |
| Memory Usage | <70% | >80% | >90% | [link] |
| DB Pool Usage | <50% | >70% | >85% | [link] |
| Request Rate | [baseline]±20% | ±50% | ±100% | [link] |
---
## Escalation Contacts

| Level | Contact | When |
|-------|---------|------|
| L1: On-Call Primary | [name/rotation] | First responder |
| L2: On-Call Secondary | [name/rotation] | Primary unavailable or needs help |
| L3: Service Owner | [name] | Complex issues, architectural decisions |
| L4: Engineering Manager | [name] | SEV1/SEV2, customer impact, resource needs |
| L5: VP Engineering | [name] | SEV1 >30 min, major customer/revenue impact |
---
## Maintenance Procedures

### Planned Maintenance Checklist

- [ ] Maintenance window scheduled and communicated (72 hours advance for Tier 1)
- [ ] Status page updated with planned maintenance notice
- [ ] Rollback plan documented and tested
- [ ] On-call notified of maintenance window
- [ ] Customer notification sent (if SLA-impacting)
- [ ] Post-maintenance verification plan ready

### Health Verification After Changes

1. Check all health endpoints return 200
2. Verify error rate returns to baseline within 5 minutes
3. Confirm latency within normal range
4. Run synthetic transaction test
5. Monitor for 15 minutes before declaring success
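Step 1 of the verification loop can be scripted so it is identical after every change. A minimal sketch; the endpoint URLs passed in are illustrative:

```shell
#!/bin/sh
# Confirm every health endpoint returns HTTP 200; stop and report the
# first failure, otherwise declare the set healthy.
verify_health() {
  for url in "$@"; do
    code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url")
    if [ "$code" != "200" ]; then
      echo "FAIL $url ($code)"
      return 1
    fi
  done
  echo "all healthy"
}
```

Run it once before the change to confirm the script itself works, then again afterward as the first gate of the verification plan.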
---
## Revision History

| Date | Author | Change |
|------|--------|--------|
| [YYYY-MM-DD] | [Name] | Initial version |
| [YYYY-MM-DD] | [Name] | [Description of update] |

---

*This runbook should be reviewed quarterly and updated after every incident that reveals missing procedures. The on-call engineer should be able to follow this document without prior context about the service. If any section requires tribal knowledge to execute, it needs to be expanded.*