Incident Commander Skill

A comprehensive incident response framework providing structured tools for managing technology incidents from detection through resolution and post-incident review.

Overview

This skill implements battle-tested incident response practices used by SRE and DevOps teams operating at scale. It provides:

  • Automated Severity Classification - Intelligent incident triage
  • Timeline Reconstruction - Transform scattered events into coherent narratives
  • Post-Incident Review Generation - Structured PIRs with RCA frameworks
  • Communication Templates - Pre-built stakeholder communication
  • Comprehensive Documentation - Reference guides for incident response

Quick Start

Classify an Incident

# From JSON file
python scripts/incident_classifier.py --input incident.json --format text

# From stdin text
echo "Database is down affecting all users" | python scripts/incident_classifier.py --format text

# Interactive mode
python scripts/incident_classifier.py --interactive

Reconstruct Timeline

# Analyze event timeline
python scripts/timeline_reconstructor.py --input events.json --format text

# With gap analysis
python scripts/timeline_reconstructor.py --input events.json --gap-analysis --format markdown

Generate PIR Document

# Basic PIR
python scripts/pir_generator.py --incident incident.json --format markdown

# Comprehensive PIR with timeline
python scripts/pir_generator.py --incident incident.json --timeline timeline.json --rca-method fishbone

Scripts

incident_classifier.py

Purpose: Analyzes incident descriptions and provides severity classification, team recommendations, and response templates.

Input: JSON object with incident details, or a plain text description

Output: JSON plus a human-readable classification report

Example Input:

{
  "description": "Database connection timeouts causing 500 errors",
  "service": "payment-api",
  "affected_users": "80%",
  "business_impact": "high"
}
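
If the incident details come from another tool, the input file can be written programmatically. The following is a minimal sketch: the field names are copied from the example above, while the `write_incident` helper is hypothetical and whether the classifier requires every field is an assumption.

```python
import json
import tempfile
from pathlib import Path

# Field names mirror the example input above. Whether the classifier needs
# all of them is an assumption; the README does not say.
incident = {
    "description": "Database connection timeouts causing 500 errors",
    "service": "payment-api",
    "affected_users": "80%",
    "business_impact": "high",
}


def write_incident(payload, directory):
    """Serialize a payload to <directory>/incident.json and return the path."""
    path = Path(directory) / "incident.json"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Write to a throwaway directory and echo the file back for inspection.
    with tempfile.TemporaryDirectory() as tmp:
        print(write_incident(incident, tmp).read_text(encoding="utf-8"))
```

The resulting file can then be passed to the classifier with --input, as shown in Quick Start.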

Key Features:

  • SEV1-4 severity classification
  • Recommended response teams
  • Initial action prioritization
  • Communication templates
  • Response timelines

timeline_reconstructor.py

Purpose: Reconstructs incident timelines from timestamped events, identifies phases, and performs gap analysis.

Input: JSON array of timestamped events

Output: Formatted timeline with phase analysis and metrics

Example Input:

[
  {
    "timestamp": "2024-01-01T12:00:00Z",
    "source": "monitoring",
    "message": "High error rate detected",
    "severity": "critical",
    "actor": "system"
  }
]

Key Features:

  • Phase detection (detection → triage → mitigation → resolution)
  • Duration analysis
  • Gap identification
  • Communication effectiveness analysis
  • Response metrics
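
To illustrate what gap identification means in practice, here is a minimal sketch that sorts events by timestamp and reports silences longer than a threshold. It is not the script's actual implementation: the `find_gaps` function, the 15-minute threshold, and the second and third sample events are assumptions made for illustration. Only the timestamp format is taken from the example input above.

```python
from datetime import datetime, timedelta

# Matches the timestamp format in the example input above.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def find_gaps(events, threshold=timedelta(minutes=15)):
    """Return (start, end, duration) for each silence longer than threshold."""
    times = sorted(
        datetime.strptime(e["timestamp"], TIMESTAMP_FORMAT) for e in events
    )
    gaps = []
    # Compare each event with the one that follows it chronologically.
    for earlier, later in zip(times, times[1:]):
        if later - earlier > threshold:
            gaps.append((earlier, later, later - earlier))
    return gaps


# Hypothetical sample events; only the first comes from the README.
events = [
    {"timestamp": "2024-01-01T12:00:00Z", "message": "High error rate detected"},
    {"timestamp": "2024-01-01T12:05:00Z", "message": "On-call engineer paged"},
    {"timestamp": "2024-01-01T12:45:00Z", "message": "Rollback started"},
]

for start, end, duration in find_gaps(events):
    # Prints: No events between 12:05 and 12:45 (0:40:00)
    print(f"No events between {start:%H:%M} and {end:%H:%M} ({duration})")
```

A long gap in the record often points at missing log sources or an undocumented manual step, which is why it is worth surfacing before the review.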

pir_generator.py

Purpose: Generates comprehensive Post-Incident Review documents with multiple RCA frameworks.

Input: Incident data JSON, optional timeline data

Output: Structured PIR document with RCA analysis

Key Features:

  • Multiple RCA methods (5 Whys, Fishbone, Timeline, Bow Tie)
  • Automated action item generation
  • Lessons learned categorization
  • Follow-up planning
  • Completeness assessment

Sample Data

The assets/ directory contains sample data files for testing:

  • sample_incident_classification.json - Database connection pool exhaustion incident
  • sample_timeline_events.json - Complete timeline with 21 events across phases
  • sample_incident_pir_data.json - Comprehensive incident data for PIR generation
  • simple_incident.json - Minimal incident for basic testing
  • simple_timeline_events.json - Simple 4-event timeline

Expected Outputs

The expected_outputs/ directory contains reference outputs showing what each script produces:

  • incident_classification_text_output.txt - Detailed classification report
  • timeline_reconstruction_text_output.txt - Complete timeline analysis
  • pir_markdown_output.md - Full PIR document
  • simple_incident_classification.txt - Basic classification example

Reference Documentation

references/incident_severity_matrix.md

Complete severity classification system with:

  • SEV1-4 definitions and criteria
  • Response requirements and timelines
  • Escalation paths
  • Communication requirements
  • Decision trees and examples

references/rca_frameworks_guide.md

Detailed guide for root cause analysis:

  • 5 Whys methodology
  • Fishbone (Ishikawa) diagram analysis
  • Timeline analysis techniques
  • Bow Tie analysis for high-risk incidents
  • Framework selection guidelines

references/communication_templates.md

Standardized communication templates:

  • Severity-specific notification templates
  • Stakeholder-specific messaging
  • Escalation communications
  • Resolution notifications
  • Customer communication guidelines

Usage Patterns

End-to-End Incident Workflow

  1. Initial Classification
echo "Payment API returning 500 errors for 70% of requests" | \
  python scripts/incident_classifier.py --format text
  2. Timeline Reconstruction (after collecting events)
python scripts/timeline_reconstructor.py \
  --input events.json \
  --gap-analysis \
  --format markdown \
  --output timeline.md
  3. PIR Generation (after incident resolution)
python scripts/pir_generator.py \
  --incident incident.json \
  --timeline timeline.md \
  --rca-method fishbone \
  --output pir.md
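
The three steps above can also be driven from Python. This is a sketch under stated assumptions: the flags and file names are copied from the commands above, the scripts are assumed to be run from the repository root, and the `workflow_steps` and `run_workflow` helpers are hypothetical names introduced here.

```python
import subprocess
import sys


def workflow_steps(summary, events="events.json", incident="incident.json"):
    """Return (argv, stdin_text) pairs mirroring the three steps above."""
    py = sys.executable  # reuse the interpreter running this script
    return [
        # Step 1: the summary is piped to the classifier on stdin.
        ([py, "scripts/incident_classifier.py", "--format", "text"], summary),
        # Step 2: reconstruct the timeline with gap analysis.
        ([py, "scripts/timeline_reconstructor.py", "--input", events,
          "--gap-analysis", "--format", "markdown",
          "--output", "timeline.md"], None),
        # Step 3: generate the PIR from the incident and timeline files.
        ([py, "scripts/pir_generator.py", "--incident", incident,
          "--timeline", "timeline.md", "--rca-method", "fishbone",
          "--output", "pir.md"], None),
    ]


def run_workflow(steps):
    """Run each step in order; a failing script raises CalledProcessError."""
    for argv, stdin_text in steps:
        subprocess.run(argv, input=stdin_text,
                       universal_newlines=True, check=True)


if __name__ == "__main__":
    for argv, _ in workflow_steps("Payment API returning 500 errors"):
        print(" ".join(argv[1:]))  # preview the commands without running them
```

Separating command construction from execution keeps the sequence easy to inspect or log before anything runs.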

Integration Examples

CI/CD Pipeline Integration:

# Classify deployment issues
cat deployment_error.log | python scripts/incident_classifier.py --format json

Monitoring Integration:

# Process alert events
curl -s "monitoring-api/events" | python scripts/timeline_reconstructor.py --format text

Runbook Generation: Use classification output to automatically select appropriate runbooks and escalation procedures.
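
Runbook selection could look like the sketch below. Everything specific in it is an assumption: the README does not document the classifier's JSON schema, so the top-level "severity" key is a guess, and the runbook paths are placeholders rather than files in this repository.

```python
import json

# Hypothetical mapping; these paths are placeholders, not real files.
RUNBOOKS = {
    "SEV1": "runbooks/full_outage.md",
    "SEV2": "runbooks/major_degradation.md",
    "SEV3": "runbooks/minor_impact.md",
    "SEV4": "runbooks/low_impact.md",
}
DEFAULT_RUNBOOK = "runbooks/triage.md"


def select_runbook(classifier_json):
    """Pick a runbook path from the classifier's JSON output.

    Assumes the output is a JSON object with a top-level "severity" key
    such as "SEV1". Check the real output before relying on this.
    """
    result = json.loads(classifier_json)
    severity = str(result.get("severity", "")).upper()
    # Fall back to a generic triage runbook for unknown or missing values.
    return RUNBOOKS.get(severity, DEFAULT_RUNBOOK)


if __name__ == "__main__":
    print(select_runbook('{"severity": "SEV2"}'))
```

Falling back to a default keeps the automation from failing outright when the classifier output changes shape.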

Quality Standards

  • Zero External Dependencies - All scripts use only Python standard library
  • Dual Output Format - Both JSON (machine-readable) and text (human-readable)
  • Robust Input Handling - Graceful handling of missing or malformed data
  • Professional Defaults - Opinionated, battle-tested configurations
  • Comprehensive Testing - Sample data and expected outputs included

Technical Requirements

  • Python 3.6+
  • No external dependencies required
  • Works with standard Unix tools (pipes, redirection)
  • Cross-platform compatible

Severity Classification Reference

Severity  Description        Response Time  Update Frequency
SEV1      Complete outage    5 minutes      Every 15 minutes
SEV2      Major degradation  15 minutes     Every 30 minutes
SEV3      Minor impact       2 hours        At milestones
SEV4      Low impact         1-2 days       Weekly
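
For tooling that needs these values, the table above can be encoded as a lookup. This is an illustrative sketch, not code from the scripts: `SEVERITY_POLICY` and `response_deadline` are hypothetical names, and SEV4's "1-2 days" is stored as its upper bound for simplicity.

```python
from datetime import datetime, timedelta

# Values transcribed from the table above. SEV4's "1-2 days" is stored as
# the upper bound (2 days), a simplification made for this sketch.
SEVERITY_POLICY = {
    "SEV1": {"description": "Complete outage",
             "response_time": timedelta(minutes=5),
             "update_frequency": "Every 15 minutes"},
    "SEV2": {"description": "Major degradation",
             "response_time": timedelta(minutes=15),
             "update_frequency": "Every 30 minutes"},
    "SEV3": {"description": "Minor impact",
             "response_time": timedelta(hours=2),
             "update_frequency": "At milestones"},
    "SEV4": {"description": "Low impact",
             "response_time": timedelta(days=2),
             "update_frequency": "Weekly"},
}


def response_deadline(severity, detected_at):
    """Return the latest time a first response is due for this severity."""
    policy = SEVERITY_POLICY[severity.upper()]
    return detected_at + policy["response_time"]


detected = datetime(2024, 1, 1, 12, 0)
print(response_deadline("SEV2", detected))  # 2024-01-01 12:15:00
```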

Getting Help

Each script includes comprehensive help:

python scripts/incident_classifier.py --help
python scripts/timeline_reconstructor.py --help
python scripts/pir_generator.py --help

For methodology questions, refer to the reference documentation in the references/ directory.

Contributing

When adding new features:

  1. Maintain zero external dependencies
  2. Add comprehensive examples to assets/
  3. Update expected outputs in expected_outputs/
  4. Follow the established patterns for argument parsing and output formatting

License

This skill is part of the claude-skills repository. See the main repository LICENSE for details.