claude-skills-reference/engineering-team/incident-commander/expected_outputs/pir_markdown_output.md
Leo daace78954 feat: Add comprehensive incident-commander skill
- Add SKILL.md with 300+ lines of incident response playbook
- Implement incident_classifier.py: severity classification and response recommendations
- Implement timeline_reconstructor.py: event timeline reconstruction with phase analysis
- Implement pir_generator.py: comprehensive PIR generation with multiple RCA frameworks
- Add reference documentation: severity matrix, RCA frameworks, communication templates
- Add sample data files and expected outputs for testing
- All scripts are standalone with zero external dependencies
- Dual output formats: JSON + human-readable text
- Professional, opinionated defaults based on SRE best practices

This POWERFUL-tier skill provides end-to-end incident response capabilities from
detection through post-incident review.
2026-02-16 12:43:38 +00:00

# Post-Incident Review: Payment API Database Connection Pool Exhaustion
## Executive Summary
On March 15, 2024, we experienced a SEV2 incident affecting payment-api, checkout-service, and subscription-billing. The incident lasted 1h 5m. During that time, 80% of users were unable to complete payments or checkout, approximately 2,400 payment attempts failed, and users saw immediate 500 errors when attempting to pay. The incident has been resolved, and preventive measures are listed under Follow-up and Prevention.
## Incident Overview
- **Incident ID:** INC-2024-0315-001
- **Date & Time:** 2024-03-15 14:30:00 UTC
- **Duration:** 1h 5m
- **Severity:** SEV2
- **Status:** Resolved
- **Incident Commander:** Mike Rodriguez
- **Responders:**
  - Sarah Chen (On-call Engineer, Primary Responder)
  - Tom Wilson (Database Team Lead)
  - Lisa Park (Database Engineer)
  - Mike Rodriguez (Incident Commander)
  - David Kumar (DevOps Engineer)
### Customer Impact
80% of users were unable to complete payments or checkout. Approximately 2,400 payment attempts failed during the incident. Users experienced immediate 500 errors when attempting to pay.
### Business Impact
Estimated revenue loss of $45,000 during the outage period. No SLA breaches, as resolution was within the 2-hour window. 12 customer escalations through support channels.
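
For a sense of scale, the figures above can be turned into rough rates. This is a back-of-the-envelope sketch only: it assumes the impact was spread evenly over the 1h 5m outage, which this report does not state, and the variable names are illustrative.

```python
# Figures quoted in this report.
duration_minutes = 1 * 60 + 5   # 1h 5m
failed_payments = 2_400         # approximate failed payment attempts
revenue_loss_usd = 45_000       # estimated revenue loss

# Assumption: impact was uniform across the outage window.
loss_per_minute = revenue_loss_usd / duration_minutes
loss_per_failed_payment = revenue_loss_usd / failed_payments
failures_per_minute = failed_payments / duration_minutes

print(f"~${loss_per_minute:,.0f} lost per minute")
print(f"~${loss_per_failed_payment:.2f} lost per failed payment")
print(f"~{failures_per_minute:.0f} failed payments per minute")
```

These are averages, not measurements; the actual failure rate may have varied over the course of the incident.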
## Timeline
No detailed timeline available.
## Root Cause Analysis
### Analysis Method: 5 Whys Analysis
#### Why Analysis
**Why 1:** Why did database connection pool exhaustion cause widespread 500 errors in the payment processing API, preventing users from completing purchases? (The root cause was an inefficient database query introduced in deployment v2.3.1.)
**Answer:** New deployment introduced a regression
**Why 2:** Why wasn't this detected earlier?
**Answer:** Code review process missed the issue
**Why 3:** Why didn't existing safeguards prevent this?
**Answer:** Testing environment didn't match production
**Why 4:** Why wasn't there a backup mechanism?
**Answer:** Further investigation needed
**Why 5:** Why wasn't this scenario anticipated?
**Answer:** Further investigation needed
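
To illustrate the failure mode named in Why 1, the sketch below shows how a slow query can exhaust a fixed-size connection pool so that waiting requests time out and return errors. It is a toy simulation, not the payment-api code: the `ConnectionPool` class, the pool size, the timeout, and the query durations are all hypothetical values chosen for the demonstration.

```python
import queue
import threading
import time


class ConnectionPool:
    """Toy fixed-size pool; strings stand in for real database connections."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout):
        # Raises queue.Empty if no connection frees up within `timeout` seconds.
        return self._free.get(timeout=timeout)

    def release(self, conn):
        self._free.put(conn)


def handle_request(pool, query_seconds, acquire_timeout, results, lock):
    try:
        conn = pool.acquire(timeout=acquire_timeout)
    except queue.Empty:
        with lock:
            results.append(500)  # pool exhausted: the caller sees an error
        return
    try:
        time.sleep(query_seconds)  # simulated query latency
    finally:
        pool.release(conn)  # always hand the connection back
    with lock:
        results.append(200)


def simulate(query_seconds, requests=10, pool_size=2, acquire_timeout=0.2):
    """Run `requests` concurrent requests; return (succeeded, failed)."""
    pool = ConnectionPool(pool_size)
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(
            target=handle_request,
            args=(pool, query_seconds, acquire_timeout, results, lock),
        )
        for _ in range(requests)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count(200), results.count(500)


print("fast query (ok, failed):", simulate(query_seconds=0.001))
print("slow query (ok, failed):", simulate(query_seconds=0.5))
```

With the fast query, connections are returned quickly enough that every request is served. With the slow query, the two connections stay checked out longer than the acquire timeout, so most of the remaining requests fail, which mirrors the immediate 500 errors described in this report.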
## What Went Well
- The incident was successfully resolved
- Incident command was established
- Multiple team members collaborated on resolution
## What Didn't Go Well
- Analysis in progress
## Lessons Learned
Lessons learned to be documented following detailed analysis.
## Action Items
Action items to be defined.
## Follow-up and Prevention
### Prevention Measures
Based on the root cause analysis, the following preventive measures have been identified:
- Implement comprehensive testing for similar scenarios
- Improve monitoring and alerting coverage
- Enhance error handling and resilience patterns
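
As one possible reading of the monitoring measure above, an alert could fire on connection pool utilization before the pool is fully exhausted. The sketch below is an assumption, not a documented part of this incident's remediation: the function name `pool_alert_level` and the 80% and 95% thresholds are hypothetical and would need tuning against real traffic.

```python
def pool_alert_level(in_use, pool_size, warn_at=0.80, critical_at=0.95):
    """Classify pool utilization; the thresholds are illustrative defaults."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    utilization = in_use / pool_size
    if utilization >= critical_at:
        return "critical"
    if utilization >= warn_at:
        return "warning"
    return "ok"


# Example readings for a hypothetical pool of 20 connections.
for in_use in (10, 17, 20):
    print(in_use, pool_alert_level(in_use, pool_size=20))
```

A warning tier below full saturation gives responders time to act while requests are still being served.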
### Follow-up Schedule
- 1 week: Review action item progress
- 1 month: Evaluate effectiveness of implemented changes
- 3 months: Conduct follow-up assessment and update preventive measures
## Appendix
### Additional Information
- Incident ID: INC-2024-0315-001
- Severity Classification: SEV2
- Affected Services: payment-api, checkout-service, subscription-billing
### References
- Incident tracking ticket: [Link TBD]
- Monitoring dashboards: [Link TBD]
- Communication thread: [Link TBD]
---
*Generated on 2026-02-16 by PIR Generator*