# Runbook: [Service/Component Name]

**Owner:** [Team Name] | **Last Updated:** [YYYY-MM-DD] | **Reviewed By:** [Name] | **Review Cadence:** Quarterly

## Service Overview
| Property | Value |
|---|---|
| Service | [service-name] |
| Repository | [repo URL] |
| Dashboard | [monitoring dashboard URL] |
| On-Call Rotation | [PagerDuty/OpsGenie schedule URL] |
| SLA Tier | [Tier 1/2/3] |
| Availability Target | [99.9% / 99.95% / 99.99%] |
| Dependencies | [list upstream/downstream services] |
| Owner Team | [team name] |
| Escalation Contact | [name/email] |
## Architecture Summary
[2-3 sentence description of the service architecture. Include key components, data stores, and external dependencies.]
## Alert Response Decision Tree

### High Error Rate (>5%)
Error Rate Alert Fired
├── Check: Is this a deployment-related issue?
│ ├── YES → Go to "Recent Deployment Rollback" section
│ └── NO → Continue
├── Check: Is a downstream dependency failing?
│ ├── YES → Go to "Dependency Failure" section
│ └── NO → Continue
├── Check: Is there unusual traffic volume?
│ ├── YES → Go to "Traffic Spike" section
│ └── NO → Continue
└── Escalate: Engage on-call secondary + service owner
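The triage order above can be encoded as a small helper so the first responder works the checks in sequence rather than from memory. A minimal sketch; the yes/no answers are gathered manually from dashboards, and the function only maps them to the section to follow:

```shell
#!/usr/bin/env bash
# Sketch: walk the error-rate decision tree in order. Each argument is
# the yes/no answer to one check; the output is the runbook section to
# follow next. Function and argument names are illustrative.
triage_error_rate() {
  local recent_deploy="$1" dep_failing="$2" traffic_spike="$3"
  if   [ "$recent_deploy" = "yes" ]; then echo "Recent Deployment Rollback"
  elif [ "$dep_failing"   = "yes" ]; then echo "Dependency Failure"
  elif [ "$traffic_spike" = "yes" ]; then echo "Traffic Spike"
  else echo "Escalate: on-call secondary + service owner"
  fi
}

# Example: error spike shortly after a deploy.
triage_error_rate yes no no
```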
### High Latency (p99 > [threshold]ms)
Latency Alert Fired
├── Check: Database query latency elevated?
│ ├── YES → Go to "Database Performance" section
│ └── NO → Continue
├── Check: Connection pool utilization >80%?
│ ├── YES → Go to "Connection Pool Exhaustion" section
│ └── NO → Continue
├── Check: Memory/CPU pressure on service instances?
│ ├── YES → Go to "Resource Exhaustion" section
│ └── NO → Continue
└── Escalate: Engage on-call secondary + service owner
### Service Unavailable (Health Check Failing)
Health Check Alert Fired
├── Check: Are all instances down?
│ ├── YES → Go to "Complete Outage" section
│ └── NO → Continue
├── Check: Is only one AZ affected?
│ ├── YES → Go to "AZ Failure" section
│ └── NO → Continue
├── Check: Can instances be restarted?
│ ├── YES → Go to "Instance Restart" section
│ └── NO → Continue
└── Escalate: Declare incident, engage IC
## Common Scenarios

### Recent Deployment Rollback
Symptoms: Error rate spike or latency increase within 60 minutes of a deployment.
Diagnosis:
- Check deployment history: `kubectl rollout history deployment/[service-name]`
- Compare error rate timing with the deployment timestamp
- Review the deployment diff for risky changes
Mitigation:
- Initiate rollback: `kubectl rollout undo deployment/[service-name]`
- Verify rollback: `kubectl rollout status deployment/[service-name]`
- Confirm error rate returns to baseline (allow 5 minutes)
- If rollback fails: escalate immediately
Communication: If customer-impacting, update status page within 5 minutes of confirming impact.
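The "within 60 minutes of a deployment" check lends itself to scripting. A minimal sketch, assuming `kubectl` access; the timestamp-comparison helper is pure shell, and the wiring to `kubectl` is left as comments because the deployment name is a placeholder:

```shell
#!/usr/bin/env bash
# Sketch: decide whether an error spike is plausibly deployment-related,
# i.e. the last rollout finished within the past N minutes (default 60).
was_recent_deploy() {
  local deploy_epoch="$1" now_epoch="$2" window_min="${3:-60}"
  local age_min=$(( (now_epoch - deploy_epoch) / 60 ))
  [ "$age_min" -le "$window_min" ]
}

# Illustrative wiring (not executed here):
#   last=$(kubectl get deploy [service-name] -o \
#     jsonpath='{.status.conditions[?(@.type=="Progressing")].lastUpdateTime}')
#   if was_recent_deploy "$(date -d "$last" +%s)" "$(date +%s)"; then
#     kubectl rollout undo deployment/[service-name]
#     kubectl rollout status deployment/[service-name]
#   fi
```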
### Database Performance
Symptoms: Elevated query latency, connection pool saturation, timeout errors.
Diagnosis:
- Check active queries: `SELECT * FROM pg_stat_activity WHERE state = 'active';`
- Check for long-running queries: `SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;`
- Check connection count: `SELECT count(*) FROM pg_stat_activity;`
- Check table bloat and vacuum status
Mitigation:
- Kill long-running queries if identified: `SELECT pg_terminate_backend([pid]);`
- If connection pool exhausted: increase pool size via config (requires restart)
- If read replica available: redirect read traffic
- If write-heavy: identify and defer non-critical writes
Escalation Trigger: If query latency >10s for >5 minutes, escalate to DBA on-call.
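Killing long-running queries PID by PID is slow under pressure. A hedged sketch that builds one statement terminating everything over a threshold in a single pass; the function only emits SQL, so the operator can review it before piping it to `psql`:

```shell
#!/usr/bin/env bash
# Sketch: emit SQL that terminates all non-idle backends whose query has
# been running longer than the given number of minutes, excluding the
# session running the statement itself. Review before executing.
kill_long_queries_sql() {
  local minutes="$1"
  printf "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state != 'idle' AND now() - query_start > interval '%s minutes' AND pid <> pg_backend_pid();" "$minutes"
}

# Usage (not executed here):
#   kill_long_queries_sql 10 | psql "$DATABASE_URL"
```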
### Connection Pool Exhaustion
Symptoms: Connection timeout errors, pool utilization >90%, requests queuing.
Diagnosis:
- Check pool metrics: current size, active connections, waiting requests
- Check for connection leaks: connections held >30s without activity
- Review recent config changes or deployments
Mitigation:
- Increase pool size (if infrastructure allows): update config, rolling restart
- Kill idle connections exceeding timeout
- If caused by leak: identify and restart affected instances
- Enable connection pool auto-scaling if available
Prevention: Pool utilization alerting at 70% (warning) and 85% (critical).
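The 70%/85% thresholds above can be expressed as a small classifier, useful for ad-hoc checks when the dashboard is unavailable. A sketch using integer percent math; the active/max figures would come from the pool's metrics endpoint in practice:

```shell
#!/usr/bin/env bash
# Sketch: classify pool utilization against the runbook thresholds:
# >=85% critical, >=70% warning, otherwise ok.
pool_severity() {
  local active="$1" max="$2"
  local pct=$(( active * 100 / max ))
  if   [ "$pct" -ge 85 ]; then echo "critical ($pct%)"
  elif [ "$pct" -ge 70 ]; then echo "warning ($pct%)"
  else echo "ok ($pct%)"
  fi
}
```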
### Dependency Failure
Symptoms: Errors correlated with downstream service failures, circuit breakers tripping.
Diagnosis:
- Check dependency status dashboards
- Verify circuit breaker state: open/half-open/closed
- Check for correlation with dependency deployments or incidents
- Test dependency health endpoints directly
Mitigation:
- If circuit breaker not tripping: verify timeout/threshold configuration
- Enable graceful degradation (serve cached/default responses)
- If critical path: engage dependency team via incident process
- If non-critical path: disable feature flag for affected functionality
Communication: Coordinate with dependency team IC if both services have active incidents.
### Traffic Spike
Symptoms: Sudden traffic increase beyond normal patterns, resource saturation.
Diagnosis:
- Check traffic source: organic growth vs. bot traffic vs. DDoS
- Review rate limiting effectiveness
- Check auto-scaling status and capacity
Mitigation:
- If bot/DDoS: enable rate limiting, engage security team
- If organic: trigger manual scale-up, increase auto-scaling limits
- Enable request queuing or load shedding if at capacity
- Consider feature flag toggles to reduce per-request cost
### Complete Outage
Symptoms: All instances unreachable, health checks failing across AZs.
Diagnosis:
- Check infrastructure status (AWS/GCP status page)
- Verify network connectivity and DNS resolution
- Check for infrastructure-level incidents (region outage)
- Review recent infrastructure changes (Terraform, network config)
Mitigation:
- If infra provider issue: activate disaster recovery plan
- If DNS issue: update DNS records, reduce TTL
- If deployment corruption: redeploy last known good version
- If data corruption: engage data recovery procedures
Escalation: Immediately declare SEV1 incident. Engage infrastructure team and management.
### Instance Restart
Symptoms: Individual instances unhealthy, OOM kills, process crashes.
Diagnosis:
- Check instance logs for crash reason
- Review memory/CPU usage patterns before crash
- Check for memory leaks or resource exhaustion
- Verify configuration consistency across instances
Mitigation:
- Restart unhealthy instances: `kubectl delete pod [pod-name]`
- If recurring: cordon the node and migrate workloads
- If memory leak: schedule an immediate patch with an increased memory limit
- Monitor for recurrence after restart
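The restart-versus-cordon decision can be made mechanical. A sketch; the threshold of 3 restarts is an assumption, not a documented value, and the `kubectl` actions are left as comments since pod and node names are placeholders:

```shell
#!/usr/bin/env bash
# Sketch: choose between restarting a pod and cordoning its node, based
# on how many times that pod restarted in the recent window. The default
# threshold (3) is an illustrative assumption.
restart_or_cordon() {
  local recent_restarts="$1" threshold="${2:-3}"
  if [ "$recent_restarts" -ge "$threshold" ]; then
    # recurring: kubectl cordon [node] && kubectl drain [node] --ignore-daemonsets
    echo "cordon"
  else
    # one-off: kubectl delete pod [pod-name]
    echo "restart"
  fi
}
```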
### AZ Failure
Symptoms: All instances in one availability zone failing, others healthy.
Diagnosis:
- Confirm AZ-specific failure vs. instance-specific issues
- Check cloud provider AZ status
- Verify load balancer is routing around failed AZ
Mitigation:
- Ensure load balancer marks AZ instances as unhealthy
- Scale up remaining AZs to handle redirected traffic
- If auto-scaling: verify it's responding to increased load
- Monitor remaining AZs for cascade effects
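When scaling up the remaining AZs, it helps to compute the target replica count explicitly rather than guessing. A sketch using ceiling division so capacity never rounds down; inputs are the total replicas the service needs and the number of AZs still healthy:

```shell
#!/usr/bin/env bash
# Sketch: replicas each surviving AZ must carry after an AZ failure.
# Ceiling division: (total + remaining - 1) / remaining.
replicas_per_remaining_az() {
  local total_replicas="$1" remaining_azs="$2"
  echo $(( (total_replicas + remaining_azs - 1) / remaining_azs ))
}

# e.g. a service needing 9 replicas, with 2 of 3 AZs left, needs 5 per AZ.
replicas_per_remaining_az 9 2
```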
## Key Metrics & Dashboards
| Metric | Normal Range | Warning | Critical | Dashboard |
|---|---|---|---|---|
| Error Rate | <0.1% | >1% | >5% | [link] |
| p99 Latency | <200ms | >500ms | >2000ms | [link] |
| CPU Usage | <60% | >75% | >90% | [link] |
| Memory Usage | <70% | >80% | >90% | [link] |
| DB Pool Usage | <50% | >70% | >85% | [link] |
| Request Rate | [baseline]±20% | ±50% | ±100% | [link] |
## Escalation Contacts
| Level | Contact | When |
|---|---|---|
| L1: On-Call Primary | [name/rotation] | First responder |
| L2: On-Call Secondary | [name/rotation] | Primary unavailable or needs help |
| L3: Service Owner | [name] | Complex issues, architectural decisions |
| L4: Engineering Manager | [name] | SEV1/SEV2, customer impact, resource needs |
| L5: VP Engineering | [name] | SEV1 >30 min, major customer/revenue impact |
## Maintenance Procedures

### Planned Maintenance Checklist
- Maintenance window scheduled and communicated (72 hours advance for Tier 1)
- Status page updated with planned maintenance notice
- Rollback plan documented and tested
- On-call notified of maintenance window
- Customer notification sent (if SLA-impacting)
- Post-maintenance verification plan ready
### Health Verification After Changes
- Check all health endpoints return 200
- Verify error rate returns to baseline within 5 minutes
- Confirm latency within normal range
- Run synthetic transaction test
- Monitor for 15 minutes before declaring success
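The verification gate above can be sketched as a single pass/fail function. A minimal, hedged version: in practice the status code comes from `curl -o /dev/null -w '%{http_code}' [health URL]` and the error rate from the dashboard query; here both are plain arguments, and the integer-percent baseline is a simplification of the <0.1% target:

```shell
#!/usr/bin/env bash
# Sketch: pass post-change verification only if the health endpoint
# returned 200 AND the error rate is back at (or below) baseline.
verify_healthy() {
  local status_code="$1" error_rate_pct="$2" baseline_pct="${3:-1}"
  if [ "$status_code" != "200" ]; then
    echo "fail: health endpoint returned $status_code"; return 1
  fi
  if [ "$error_rate_pct" -gt "$baseline_pct" ]; then
    echo "fail: error rate ${error_rate_pct}% above baseline"; return 1
  fi
  echo "ok"
}
```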
## Revision History
| Date | Author | Change |
|---|---|---|
| [YYYY-MM-DD] | [Name] | Initial version |
| [YYYY-MM-DD] | [Name] | [Description of update] |
This runbook should be reviewed quarterly and updated after every incident that reveals missing procedures. The on-call engineer should be able to follow this document without prior context about the service. If any section requires tribal knowledge to execute, it needs to be expanded.