Files
claude-skills-reference/engineering-team/incident-commander/expected_outputs/pir_markdown_output.md
Leo daace78954 feat: Add comprehensive incident-commander skill
- Add SKILL.md with 300+ lines of incident response playbook
- Implement incident_classifier.py: severity classification and response recommendations
- Implement timeline_reconstructor.py: event timeline reconstruction with phase analysis
- Implement pir_generator.py: comprehensive PIR generation with multiple RCA frameworks
- Add reference documentation: severity matrix, RCA frameworks, communication templates
- Add sample data files and expected outputs for testing
- All scripts are standalone with zero external dependencies
- Dual output formats: JSON + human-readable text
- Professional, opinionated defaults based on SRE best practices

This POWERFUL-tier skill provides end-to-end incident response capabilities from
detection through post-incident review.
2026-02-16 12:43:38 +00:00

3.2 KiB

Post-Incident Review: Payment API Database Connection Pool Exhaustion

Executive Summary

On March 15, 2024, we experienced a sev2 incident affecting ['payment-api', 'checkout-service', 'subscription-billing']. The incident lasted 1h 5m and had the following impact: 80% of users unable to complete payments or checkout. Approximately 2,400 failed payment attempts during the incident. Users experienced immediate 500 errors when attempting to pay. The incident has been resolved and we have identified specific actions to prevent recurrence.

Incident Overview

  • Incident ID: INC-2024-0315-001
  • Date & Time: 2024-03-15 14:30:00 UTC
  • Duration: 1h 5m
  • Severity: SEV2
  • Status: Resolved
  • Incident Commander: Mike Rodriguez
  • Responders: Sarah Chen - On-call Engineer, Primary Responder, Tom Wilson - Database Team Lead, Lisa Park - Database Engineer, Mike Rodriguez - Incident Commander, David Kumar - DevOps Engineer

Customer Impact

80% of users unable to complete payments or checkout. Approximately 2,400 failed payment attempts during the incident. Users experienced immediate 500 errors when attempting to pay.

Business Impact

Estimated revenue loss of $45,000 during outage period. No SLA breaches as resolution was within 2-hour window. 12 customer escalations through support channels.

Timeline

No detailed timeline available.

Root Cause Analysis

Analysis Method: 5 Whys Analysis

Why Analysis

Why 1: Why did Database connection pool exhaustion caused widespread 500 errors in payment processing API, preventing users from completing purchases. Root cause was an inefficient database query introduced in deployment v2.3.1.? Answer: New deployment introduced a regression

Why 2: Why wasn't this detected earlier? Answer: Code review process missed the issue

Why 3: Why didn't existing safeguards prevent this? Answer: Testing environment didn't match production

Why 4: Why wasn't there a backup mechanism? Answer: Further investigation needed

Why 5: Why wasn't this scenario anticipated? Answer: Further investigation needed

What Went Well

  • The incident was successfully resolved
  • Incident command was established
  • Multiple team members collaborated on resolution

What Didn't Go Well

  • Analysis in progress

Lessons Learned

Lessons learned to be documented following detailed analysis.

Action Items

Action items to be defined.

Follow-up and Prevention

Prevention Measures

Based on the root cause analysis, the following preventive measures have been identified:

  • Implement comprehensive testing for similar scenarios
  • Improve monitoring and alerting coverage
  • Enhance error handling and resilience patterns

Follow-up Schedule

  • 1 week: Review action item progress
  • 1 month: Evaluate effectiveness of implemented changes
  • 3 months: Conduct follow-up assessment and update preventive measures

Appendix

Additional Information

  • Incident ID: INC-2024-0315-001
  • Severity Classification: sev2
  • Affected Services: payment-api, checkout-service, subscription-billing

References

  • Incident tracking ticket: [Link TBD]
  • Monitoring dashboards: [Link TBD]
  • Communication thread: [Link TBD]

Generated on 2026-02-16 by PIR Generator