# Incident Commander Skill

A comprehensive incident response framework providing structured tools for managing technology incidents from detection through resolution and post-incident review.

## Overview

This skill implements battle-tested practices from SRE and DevOps teams at scale, providing:

- **Automated Severity Classification** - Intelligent incident triage
- **Timeline Reconstruction** - Transform scattered events into coherent narratives
- **Post-Incident Review Generation** - Structured PIRs with RCA frameworks
- **Communication Templates** - Pre-built stakeholder communication
- **Comprehensive Documentation** - Reference guides for incident response

## Quick Start

### Classify an Incident

```bash
# From JSON file
python scripts/incident_classifier.py --input incident.json --format text

# From stdin text
echo "Database is down affecting all users" | python scripts/incident_classifier.py --format text

# Interactive mode
python scripts/incident_classifier.py --interactive
```

### Reconstruct Timeline

```bash
# Analyze event timeline
python scripts/timeline_reconstructor.py --input events.json --format text

# With gap analysis
python scripts/timeline_reconstructor.py --input events.json --gap-analysis --format markdown
```

### Generate PIR Document

```bash
# Basic PIR
python scripts/pir_generator.py --incident incident.json --format markdown

# Comprehensive PIR with timeline
python scripts/pir_generator.py --incident incident.json --timeline timeline.json --rca-method fishbone
```

## Scripts

### incident_classifier.py

**Purpose:** Analyzes incident descriptions and provides severity classification, team recommendations, and response templates.

**Input:** JSON object with incident details or plain text description

**Output:** JSON + human-readable classification report

**Example Input:**

```json
{
  "description": "Database connection timeouts causing 500 errors",
  "service": "payment-api",
  "affected_users": "80%",
  "business_impact": "high"
}
```

**Key Features:**

- SEV1-4 severity classification
- Recommended response teams
- Initial action prioritization
- Communication templates
- Response timelines

### timeline_reconstructor.py

**Purpose:** Reconstructs incident timelines from timestamped events, identifies phases, and performs gap analysis.

**Input:** JSON array of timestamped events

**Output:** Formatted timeline with phase analysis and metrics

**Example Input:**

```json
[
  {
    "timestamp": "2024-01-01T12:00:00Z",
    "source": "monitoring",
    "message": "High error rate detected",
    "severity": "critical",
    "actor": "system"
  }
]
```

**Key Features:**

- Phase detection (detection → triage → mitigation → resolution)
- Duration analysis
- Gap identification
- Communication effectiveness analysis
- Response metrics

### pir_generator.py

**Purpose:** Generates comprehensive Post-Incident Review documents with multiple RCA frameworks.

**Input:** Incident data JSON, optional timeline data

**Output:** Structured PIR document with RCA analysis

**Key Features:**

- Multiple RCA methods (5 Whys, Fishbone, Timeline, Bow Tie)
- Automated action item generation
- Lessons learned categorization
- Follow-up planning
- Completeness assessment

## Sample Data

The `assets/` directory contains sample data files for testing:

- `sample_incident_classification.json` - Database connection pool exhaustion incident
- `sample_timeline_events.json` - Complete timeline with 21 events across phases
- `sample_incident_pir_data.json` - Comprehensive incident data for PIR generation
- `simple_incident.json` - Minimal incident for basic testing
- `simple_timeline_events.json` - Simple 4-event timeline

## Expected Outputs

The `expected_outputs/` directory contains reference outputs showing what each script produces:

- `incident_classification_text_output.txt` - Detailed classification report
- `timeline_reconstruction_text_output.txt` - Complete timeline analysis
- `pir_markdown_output.md` - Full PIR document
- `simple_incident_classification.txt` - Basic classification example

## Reference Documentation

### references/incident_severity_matrix.md

Complete severity classification system with:

- SEV1-4 definitions and criteria
- Response requirements and timelines
- Escalation paths
- Communication requirements
- Decision trees and examples

### references/rca_frameworks_guide.md

Detailed guide for root cause analysis:

- 5 Whys methodology
- Fishbone (Ishikawa) diagram analysis
- Timeline analysis techniques
- Bow Tie analysis for high-risk incidents
- Framework selection guidelines

### references/communication_templates.md

Standardized communication templates:

- Severity-specific notification templates
- Stakeholder-specific messaging
- Escalation communications
- Resolution notifications
- Customer communication guidelines

## Usage Patterns

### End-to-End Incident Workflow

1.
   **Initial Classification**

   ```bash
   echo "Payment API returning 500 errors for 70% of requests" | \
     python scripts/incident_classifier.py --format text
   ```

2. **Timeline Reconstruction** (after collecting events)

   ```bash
   python scripts/timeline_reconstructor.py \
     --input events.json \
     --gap-analysis \
     --format markdown \
     --output timeline.md
   ```

3. **PIR Generation** (after incident resolution)

   ```bash
   python scripts/pir_generator.py \
     --incident incident.json \
     --timeline timeline.md \
     --rca-method fishbone \
     --output pir.md
   ```

### Integration Examples

**CI/CD Pipeline Integration:**

```bash
# Classify deployment issues
cat deployment_error.log | python scripts/incident_classifier.py --format json
```

**Monitoring Integration:**

```bash
# Process alert events
curl -s "monitoring-api/events" | python scripts/timeline_reconstructor.py --format text
```

**Runbook Generation:** Use classification output to automatically select appropriate runbooks and escalation procedures.

## Quality Standards

- **Zero External Dependencies** - All scripts use only Python standard library
- **Dual Output Format** - Both JSON (machine-readable) and text (human-readable)
- **Robust Input Handling** - Graceful handling of missing or malformed data
- **Professional Defaults** - Opinionated, battle-tested configurations
- **Comprehensive Testing** - Sample data and expected outputs included

## Technical Requirements

- Python 3.6+
- No external dependencies required
- Works with standard Unix tools (pipes, redirection)
- Cross-platform compatible

## Severity Classification Reference

| Severity | Description | Response Time | Update Frequency |
|----------|-------------|---------------|------------------|
| **SEV1** | Complete outage | 5 minutes | Every 15 minutes |
| **SEV2** | Major degradation | 15 minutes | Every 30 minutes |
| **SEV3** | Minor impact | 2 hours | At milestones |
| **SEV4** | Low impact | 1-2 days | Weekly |

## Getting Help

Each script includes comprehensive help:

```bash
python scripts/incident_classifier.py --help
python scripts/timeline_reconstructor.py --help
python scripts/pir_generator.py --help
```

For methodology questions, refer to the reference documentation in the `references/` directory.

## Contributing

When adding new features:

1. Maintain zero external dependencies
2. Add comprehensive examples to `assets/`
3. Update expected outputs in `expected_outputs/`
4. Follow the established patterns for argument parsing and output formatting

## License

This skill is part of the claude-skills repository. See the main repository LICENSE for details.
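
## Appendix: Gap Analysis Sketch

The gap analysis that `timeline_reconstructor.py` performs can be illustrated with a minimal sketch. This is not the script's actual implementation: the `find_gaps` helper, the 15-minute threshold, and the second and third sample events are assumptions made for illustration. Only the event fields (`timestamp`, `source`, `message`, `severity`, `actor`) come from the example input documented above. It uses only the standard library, in keeping with the zero-dependency rule.

```python
from datetime import datetime


def parse_ts(value):
    # The documented example timestamps are UTC strings ending in "Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def find_gaps(events, threshold_minutes=15):
    """Return (start, end, minutes) for consecutive events further apart
    than threshold_minutes. The threshold is a hypothetical default."""
    ordered = sorted(events, key=lambda e: parse_ts(e["timestamp"]))
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        delta = parse_ts(curr["timestamp"]) - parse_ts(prev["timestamp"])
        minutes = delta.total_seconds() / 60
        if minutes > threshold_minutes:
            gaps.append((prev["timestamp"], curr["timestamp"], minutes))
    return gaps


# Sample events in the documented input shape; the last two are invented.
events = [
    {"timestamp": "2024-01-01T12:00:00Z", "source": "monitoring",
     "message": "High error rate detected", "severity": "critical", "actor": "system"},
    {"timestamp": "2024-01-01T12:05:00Z", "source": "pager",
     "message": "On-call engineer acknowledged", "severity": "info", "actor": "oncall"},
    {"timestamp": "2024-01-01T12:45:00Z", "source": "chat",
     "message": "Rollback started", "severity": "info", "actor": "oncall"},
]

gaps = find_gaps(events)
for start, end, minutes in gaps:
    print(f"{minutes:.0f}-minute gap between {start} and {end}")
```

A gap like the one reported here often marks a stretch where work happened but nobody logged it, which is worth filling in before generating the PIR.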