Incident Commander Skill
A comprehensive incident response framework providing structured tools for managing technology incidents from detection through resolution and post-incident review.
Overview
This skill implements battle-tested practices from SRE and DevOps teams at scale, providing:
- Automated Severity Classification - Intelligent incident triage
- Timeline Reconstruction - Transform scattered events into coherent narratives
- Post-Incident Review Generation - Structured PIRs with RCA frameworks
- Communication Templates - Pre-built stakeholder communication
- Comprehensive Documentation - Reference guides for incident response
Quick Start
Classify an Incident
# From JSON file
python scripts/incident_classifier.py --input incident.json --format text
# From stdin text
echo "Database is down affecting all users" | python scripts/incident_classifier.py --format text
# Interactive mode
python scripts/incident_classifier.py --interactive
Reconstruct Timeline
# Analyze event timeline
python scripts/timeline_reconstructor.py --input events.json --format text
# With gap analysis
python scripts/timeline_reconstructor.py --input events.json --gap-analysis --format markdown
Generate PIR Document
# Basic PIR
python scripts/pir_generator.py --incident incident.json --format markdown
# Comprehensive PIR with timeline
python scripts/pir_generator.py --incident incident.json --timeline timeline.json --rca-method fishbone
Scripts
incident_classifier.py
Purpose: Analyzes incident descriptions and provides severity classification, team recommendations, and response templates.
Input: JSON object with incident details or plain text description
Output: JSON + human-readable classification report
Example Input:
{
"description": "Database connection timeouts causing 500 errors",
"service": "payment-api",
"affected_users": "80%",
"business_impact": "high"
}
Key Features:
- SEV1-4 severity classification
- Recommended response teams
- Initial action prioritization
- Communication templates
- Response timelines
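When incidents are opened from another tool, the example input above can be built programmatically instead of by hand. The sketch below only assembles the four fields shown in the example; the helper name `build_incident_payload` and the empty-description check are illustrative assumptions, not part of the skill.

```python
import json


def build_incident_payload(description, service, affected_users, business_impact):
    """Return a JSON string with the fields from the example input.

    Hypothetical helper: the field names come from the example above,
    everything else (name, validation) is an assumption.
    """
    if not description.strip():
        raise ValueError("description must not be empty")
    incident = {
        "description": description,
        "service": service,
        "affected_users": affected_users,
        "business_impact": business_impact,
    }
    return json.dumps(incident, indent=2)


if __name__ == "__main__":
    # Redirect this output to incident.json and pass it with --input.
    print(build_incident_payload(
        "Database connection timeouts causing 500 errors",
        "payment-api",
        "80%",
        "high",
    ))
```

The resulting file can then be passed to the classifier with `--input incident.json`.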
timeline_reconstructor.py
Purpose: Reconstructs incident timelines from timestamped events, identifies phases, and performs gap analysis.
Input: JSON array of timestamped events
Output: Formatted timeline with phase analysis and metrics
Example Input:
[
{
"timestamp": "2024-01-01T12:00:00Z",
"source": "monitoring",
"message": "High error rate detected",
"severity": "critical",
"actor": "system"
}
]
Key Features:
- Phase detection (detection → triage → mitigation → resolution)
- Duration analysis
- Gap identification
- Communication effectiveness analysis
- Response metrics
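To make gap identification concrete, here is a minimal sketch of one way to find quiet periods between consecutive events. It is not the script's actual implementation: the function names and the 15-minute threshold are assumptions, and only the `timestamp` field from the example input is used.

```python
from datetime import datetime, timedelta


def parse_ts(value):
    # strptime is used because datetime.fromisoformat is unavailable on Python 3.6.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def find_gaps(events, threshold=timedelta(minutes=15)):
    """Return (start, end, duration) for each gap longer than threshold.

    Hypothetical helper; the threshold default is an assumption.
    """
    ordered = sorted(events, key=lambda e: parse_ts(e["timestamp"]))
    gaps = []
    # Compare each event with the one that follows it chronologically.
    for prev, curr in zip(ordered, ordered[1:]):
        delta = parse_ts(curr["timestamp"]) - parse_ts(prev["timestamp"])
        if delta > threshold:
            gaps.append((prev["timestamp"], curr["timestamp"], delta))
    return gaps


if __name__ == "__main__":
    sample = [
        {"timestamp": "2024-01-01T12:00:00Z", "message": "High error rate detected"},
        {"timestamp": "2024-01-01T12:05:00Z", "message": "On-call paged"},
        {"timestamp": "2024-01-01T12:40:00Z", "message": "Rollback started"},
    ]
    for start, end, duration in find_gaps(sample):
        print(start, "->", end, duration)
```

A long gap between detection and the first human action is usually the most useful thing to surface in a review, which is why the sketch reports the bounding timestamps rather than only the duration.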
pir_generator.py
Purpose: Generates comprehensive Post-Incident Review documents with multiple RCA frameworks.
Input: Incident data JSON, optional timeline data
Output: Structured PIR document with RCA analysis
Key Features:
- Multiple RCA methods (5 Whys, Fishbone, Timeline, Bow Tie)
- Automated action item generation
- Lessons learned categorization
- Follow-up planning
- Completeness assessment
Sample Data
The assets/ directory contains sample data files for testing:
- sample_incident_classification.json - Database connection pool exhaustion incident
- sample_timeline_events.json - Complete timeline with 21 events across phases
- sample_incident_pir_data.json - Comprehensive incident data for PIR generation
- simple_incident.json - Minimal incident for basic testing
- simple_timeline_events.json - Simple 4-event timeline
Expected Outputs
The expected_outputs/ directory contains reference outputs showing what each script produces:
- incident_classification_text_output.txt - Detailed classification report
- timeline_reconstruction_text_output.txt - Complete timeline analysis
- pir_markdown_output.md - Full PIR document
- simple_incident_classification.txt - Basic classification example
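The reference outputs can double as a lightweight regression check. The sketch below compares a freshly generated report with a stored reference using the standard library's `difflib`; the function name `diff_against_expected` is hypothetical, and how the two texts are obtained (for example by reading a file from expected_outputs/ and capturing a script's output) is left to the caller.

```python
import difflib


def diff_against_expected(actual_text, expected_text, name="output"):
    """Return unified-diff lines; an empty list means the texts match.

    Hypothetical helper, not shipped with the skill.
    """
    return list(difflib.unified_diff(
        expected_text.splitlines(),
        actual_text.splitlines(),
        fromfile="expected/" + name,
        tofile="actual/" + name,
        lineterm="",
    ))


if __name__ == "__main__":
    expected = "Severity: SEV2\nTeam: database\n"
    actual = "Severity: SEV1\nTeam: database\n"
    for line in diff_against_expected(actual, expected, "classification.txt"):
        print(line)
```

If a report embeds values that change between runs, such as generation timestamps, those lines would need to be filtered out before comparing.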
Reference Documentation
references/incident_severity_matrix.md
Complete severity classification system with:
- SEV1-4 definitions and criteria
- Response requirements and timelines
- Escalation paths
- Communication requirements
- Decision trees and examples
references/rca_frameworks_guide.md
Detailed guide for root cause analysis:
- 5 Whys methodology
- Fishbone (Ishikawa) diagram analysis
- Timeline analysis techniques
- Bow Tie analysis for high-risk incidents
- Framework selection guidelines
references/communication_templates.md
Standardized communication templates:
- Severity-specific notification templates
- Stakeholder-specific messaging
- Escalation communications
- Resolution notifications
- Customer communication guidelines
Usage Patterns
End-to-End Incident Workflow
- Initial Classification
echo "Payment API returning 500 errors for 70% of requests" | \
python scripts/incident_classifier.py --format text
- Timeline Reconstruction (after collecting events)
python scripts/timeline_reconstructor.py \
--input events.json \
--gap-analysis \
--format json \
--output timeline.json
- PIR Generation (after incident resolution)
python scripts/pir_generator.py \
--incident incident.json \
--timeline timeline.json \
--rca-method fishbone \
--output pir.md
Integration Examples
CI/CD Pipeline Integration:
# Classify deployment issues
cat deployment_error.log | python scripts/incident_classifier.py --format json
Monitoring Integration:
# Process alert events
curl -s "monitoring-api/events" | python scripts/timeline_reconstructor.py --format text
Runbook Generation: Use classification output to automatically select appropriate runbooks and escalation procedures.
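A minimal sketch of runbook selection follows. It assumes the classifier's JSON output carries a top-level `severity` field with values such as `SEV1`; that field name, the runbook paths, and the function name are all assumptions for illustration, so check the real output schema before relying on it.

```python
import json

# Hypothetical runbook locations, keyed by severity level.
RUNBOOKS = {
    "SEV1": "runbooks/sev1_complete_outage.md",
    "SEV2": "runbooks/sev2_major_degradation.md",
    "SEV3": "runbooks/sev3_minor_impact.md",
    "SEV4": "runbooks/sev4_low_impact.md",
}

DEFAULT_RUNBOOK = "runbooks/general_triage.md"


def select_runbook(classifier_json):
    """Pick a runbook path from classifier output.

    Assumes a top-level "severity" key; falls back to a general
    triage runbook when the key is missing or unrecognized.
    """
    data = json.loads(classifier_json)
    severity = str(data.get("severity", "")).upper()
    return RUNBOOKS.get(severity, DEFAULT_RUNBOOK)


if __name__ == "__main__":
    print(select_runbook('{"severity": "SEV2"}'))
```

Falling back to a general triage runbook keeps the automation usable even when the classification is missing or malformed.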
Quality Standards
- Zero External Dependencies - All scripts use only the Python standard library
- Dual Output Format - Both JSON (machine-readable) and text (human-readable)
- Robust Input Handling - Graceful handling of missing or malformed data
- Professional Defaults - Opinionated, battle-tested configurations
- Comprehensive Testing - Sample data and expected outputs included
Technical Requirements
- Python 3.6+
- No external dependencies required
- Works with standard Unix tools (pipes, redirection)
- Cross-platform compatible
Severity Classification Reference
| Severity | Description | Response Time | Update Frequency |
|---|---|---|---|
| SEV1 | Complete outage | 5 minutes | Every 15 minutes |
| SEV2 | Major degradation | 15 minutes | Every 30 minutes |
| SEV3 | Minor impact | 2 hours | At milestones |
| SEV4 | Low impact | 1-2 days | Weekly |
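For tooling that needs these targets in code, the table can be expressed as a lookup. This is a sketch rather than anything shipped with the skill: the names `SEVERITY_TARGETS` and `response_deadline` are hypothetical, and the SEV4 range of 1-2 days is represented by its upper bound.

```python
from datetime import datetime, timedelta

# Values copied from the severity table above.
SEVERITY_TARGETS = {
    "SEV1": {"description": "Complete outage",
             "response": timedelta(minutes=5), "updates": "Every 15 minutes"},
    "SEV2": {"description": "Major degradation",
             "response": timedelta(minutes=15), "updates": "Every 30 minutes"},
    "SEV3": {"description": "Minor impact",
             "response": timedelta(hours=2), "updates": "At milestones"},
    # The table gives 1-2 days; the upper bound is used here (assumption).
    "SEV4": {"description": "Low impact",
             "response": timedelta(days=2), "updates": "Weekly"},
}


def response_deadline(severity, detected_at):
    """Return the latest time by which a response is expected."""
    return detected_at + SEVERITY_TARGETS[severity.upper()]["response"]


if __name__ == "__main__":
    detected = datetime(2024, 1, 1, 12, 0)
    print(response_deadline("SEV1", detected))
```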
Getting Help
Each script includes comprehensive help:
python scripts/incident_classifier.py --help
python scripts/timeline_reconstructor.py --help
python scripts/pir_generator.py --help
For methodology questions, refer to the reference documentation in the references/ directory.
Contributing
When adding new features:
- Maintain zero external dependencies
- Add comprehensive examples to assets/
- Update expected outputs in expected_outputs/
- Follow the established patterns for argument parsing and output formatting
License
This skill is part of the claude-skills repository. See the main repository LICENSE for details.