claude-skills-reference/engineering-team/incident-commander/references/rca_frameworks_guide.md

Root Cause Analysis (RCA) Frameworks Guide

Overview

This guide provides detailed instructions for applying various Root Cause Analysis frameworks during Post-Incident Reviews. Each framework offers a different perspective and approach to identifying underlying causes of incidents.

Framework Selection Guidelines

| Incident Type | Recommended Framework | Why |
|---|---|---|
| Process Failure | 5 Whys | Simple, direct cause-effect chain |
| Complex System Failure | Fishbone + Timeline | Multiple contributing factors |
| Human Error | Fishbone | Systematic analysis of contributing factors |
| Extended Incidents | Timeline Analysis | Understanding decision points |
| High-Risk Incidents | Bow Tie | Comprehensive barrier analysis |
| Recurring Issues | 5 Whys + Fishbone | Deep dive into systemic issues |
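
The selection table above can be expressed as a simple lookup. This is a minimal sketch: the dictionary keys, the function name `recommend_frameworks`, and the fallback to 5 Whys for unknown incident types are assumptions made for illustration and are not part of the skill's scripts.

```python
# Hypothetical lookup mirroring the selection table; names are illustrative.
RECOMMENDED_FRAMEWORKS = {
    "process failure": ["5 Whys"],
    "complex system failure": ["Fishbone", "Timeline"],
    "human error": ["Fishbone"],
    "extended incident": ["Timeline"],
    "high-risk incident": ["Bow Tie"],
    "recurring issue": ["5 Whys", "Fishbone"],
}


def recommend_frameworks(incident_type: str) -> list[str]:
    """Return the recommended frameworks for an incident type.

    Unknown types fall back to 5 Whys. That default is an assumption of
    this sketch, chosen because 5 Whys is the quickest framework to run.
    """
    key = incident_type.strip().lower()
    return RECOMMENDED_FRAMEWORKS.get(key, ["5 Whys"])


print(recommend_frameworks("Complex System Failure"))
```

Treat the result as a starting point; the incident itself may call for a different or additional framework.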

5 Whys Analysis Framework

Purpose

Iteratively drill down through cause-effect relationships to identify root causes.

When to Use

  • Simple, linear cause-effect chains
  • Time-pressured analysis
  • Process-related failures
  • Individual component failures

Process Steps

Step 1: Problem Statement

Write a clear, specific problem statement.

Good Example:

"The payment API returned 500 errors for 2 hours on March 15, affecting 80% of checkout attempts."

Poor Example:

"The system was broken."

Step 2: First Why

Ask why the problem occurred. Focus on immediate, observable causes.

Example:

  • Why 1: Why did the payment API return 500 errors?
  • Answer: The database connection pool was exhausted.

Step 3: Subsequent Whys

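For each answer, ask "why" again. Continue until you reach a cause that, if fixed, would prevent the problem from recurring. This may take fewer or more than five iterations.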

Example Chain:

  • Why 2: Why was the database connection pool exhausted?
  • Answer: The application was creating more connections than usual.

  • Why 3: Why was the application creating more connections?
  • Answer: A new feature wasn't properly closing connections.

  • Why 4: Why wasn't the feature properly closing connections?
  • Answer: Code review missed the connection leak pattern.

  • Why 5: Why did code review miss this pattern?
  • Answer: We don't have automated checks for connection pooling best practices.

Step 4: Validation

Verify that addressing the root cause would prevent the original problem.

Best Practices

  1. Ask at least 3 "whys" - Surface causes are rarely root causes
  2. Focus on process failures, not people - Avoid blame, focus on system improvements
  3. Use evidence - Support each answer with data or observations
  4. Consider multiple paths - Some problems have multiple root causes
  5. Test the logic - Work backwards from root cause to problem
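
The process steps and the first and third best practices above can be sketched as a small data structure with mechanical checks. This is an illustrative sketch, not the logic of `pir_generator.py`: the class names `Why` and `FiveWhys` are hypothetical, and the evidence strings are placeholders because the example chain does not cite any.

```python
from dataclasses import dataclass, field


@dataclass
class Why:
    question: str
    answer: str
    evidence: list[str] = field(default_factory=list)


@dataclass
class FiveWhys:
    problem: str
    chain: list[Why] = field(default_factory=list)

    def root_cause(self) -> str:
        """Treat the answer to the last why as the candidate root cause."""
        if not self.chain:
            raise ValueError("no whys recorded yet")
        return self.chain[-1].answer

    def review(self) -> list[str]:
        """Return warnings for the practices that can be checked mechanically."""
        warnings = []
        if len(self.chain) < 3:
            warnings.append("fewer than 3 whys: likely still a surface cause")
        for number, why in enumerate(self.chain, start=1):
            if not why.evidence:
                warnings.append(f"why {number} has no supporting evidence")
        return warnings


# The example chain from this section; evidence strings are hypothetical.
analysis = FiveWhys(
    problem=("The payment API returned 500 errors for 2 hours on March 15, "
             "affecting 80% of checkout attempts."),
    chain=[
        Why("Why did the payment API return 500 errors?",
            "The database connection pool was exhausted.",
            ["(hypothetical) pool saturation graph"]),
        Why("Why was the database connection pool exhausted?",
            "The application was creating more connections than usual.",
            ["(hypothetical) connection count metrics"]),
        Why("Why was the application creating more connections?",
            "A new feature wasn't properly closing connections."),
        Why("Why wasn't the feature properly closing connections?",
            "Code review missed the connection leak pattern."),
        Why("Why did code review miss this pattern?",
            "We don't have automated checks for connection pooling best practices."),
    ],
)

print("Candidate root cause:", analysis.root_cause())
for warning in analysis.review():
    print("WARNING:", warning)
```

The checks cover only chain length and missing evidence. Whether the chain is logically sound, and whether there are parallel causal paths, still needs human review.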

Common Pitfalls

  • Stopping too early - The first few whys often reveal symptoms, not causes
  • Single-cause assumption - Complex systems often have multiple contributing factors
  • Blame focus - Focusing on individual mistakes rather than system failures
  • Vague answers - Use specific, actionable answers

5 Whys Template

## 5 Whys Analysis

**Problem Statement:** [Clear description of the incident]

**Why 1:** [First why question]
**Answer:** [Specific, evidence-based answer]
**Evidence:** [Supporting data, logs, observations]

**Why 2:** [Second why question]
**Answer:** [Specific answer based on Why 1]
**Evidence:** [Supporting evidence]

[Continue for 3-7 iterations]

**Root Cause(s) Identified:**
1. [Primary root cause]
2. [Secondary root cause if applicable]

**Validation:** [Confirm that addressing root causes would prevent recurrence]

Fishbone (Ishikawa) Diagram Framework

Purpose

Systematically analyze potential causes across multiple categories to identify contributing factors.

When to Use

  • Complex incidents with multiple potential causes
  • When human factors are suspected
  • Systemic or organizational issues
  • When 5 Whys doesn't reveal clear root causes

Categories

People (Human Factors)

  • Training and Skills

    • Insufficient training on new systems
    • Lack of domain expertise
    • Skill gaps in team
    • Knowledge not shared across team
  • Communication

    • Poor communication between teams
    • Unclear responsibilities
    • Information not reaching right people
    • Language/cultural barriers
  • Decision Making

    • Decisions made under pressure
    • Insufficient information for decisions
    • Risk assessment inadequate
    • Approval processes bypassed

Process (Procedures and Workflows)

  • Documentation

    • Outdated procedures
    • Missing runbooks
    • Unclear instructions
    • Process not documented
  • Change Management

    • Inadequate change review
    • Rushed deployments
    • Insufficient testing
    • Rollback procedures unclear
  • Review and Approval

    • Code review gaps
    • Architecture review skipped
    • Security review insufficient
    • Performance review missing

Technology (Systems and Tools)

  • Architecture

    • Single points of failure
    • Insufficient redundancy
    • Scalability limitations
    • Tight coupling between systems
  • Monitoring and Alerting

    • Missing monitoring
    • Alert fatigue
    • Inadequate thresholds
    • Poor alert routing
  • Tools and Automation

    • Manual processes prone to error
    • Tool limitations
    • Automation gaps
    • Integration issues

Environment (External Factors)

  • Infrastructure

    • Hardware failures
    • Network issues
    • Capacity limitations
    • Geographic dependencies
  • Dependencies

    • Third-party service failures
    • External API changes
    • Vendor issues
    • Supply chain problems
  • External Pressure

    • Time pressure from business
    • Resource constraints
    • Regulatory changes
    • Market conditions

Process Steps

Step 1: Define the Problem

Place the incident at the "head" of the fishbone diagram.

Step 2: Brainstorm Causes

For each category, brainstorm potential contributing factors.

Step 3: Drill Down

For each factor, ask what caused that factor (sub-causes).

Step 4: Identify Primary Causes

Mark the most likely contributing factors based on evidence.

Step 5: Validate

Gather evidence to support or refute each suspected cause.
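
Steps 2 through 5 can be sketched as a list of factors tagged by category and ranked by supporting evidence. This is a minimal sketch under stated assumptions: the `Factor` class, the `evidence_count` scoring, and the sample counts are hypothetical and only show one way to make "most likely based on evidence" concrete.

```python
from dataclasses import dataclass

CATEGORIES = ("People", "Process", "Technology", "Environment")


@dataclass
class Factor:
    category: str
    description: str
    evidence_count: int      # supporting observations (hypothetical scoring)
    refuted: bool = False    # set when Step 5 finds evidence against it


def primary_factors(factors: list[Factor], top_n: int = 3) -> list[Factor]:
    """Rank factors that survived validation by how much evidence supports them."""
    for factor in factors:
        if factor.category not in CATEGORIES:
            raise ValueError(f"unknown category: {factor.category}")
    supported = [f for f in factors if not f.refuted and f.evidence_count > 0]
    return sorted(supported, key=lambda f: f.evidence_count, reverse=True)[:top_n]


def unexplored_categories(factors: list[Factor]) -> list[str]:
    """Categories with no brainstormed factor yet (a prompt to revisit Step 2)."""
    seen = {f.category for f in factors}
    return [c for c in CATEGORIES if c not in seen]


# Sample factors drawn from the category lists above; counts are made up.
factors = [
    Factor("Process", "Code review gaps", evidence_count=3),
    Factor("Technology", "Missing monitoring", evidence_count=2),
    Factor("People", "Decisions made under pressure", evidence_count=1),
    Factor("Technology", "Tool limitations", evidence_count=4, refuted=True),
]

for factor in primary_factors(factors):
    print(f"{factor.category}: {factor.description} ({factor.evidence_count})")
print("Not yet explored:", unexplored_categories(factors))
```

Counting observations is a crude proxy for likelihood. A team may prefer to weight evidence by quality or by impact instead.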

Fishbone Template

## Fishbone Analysis

**Problem:** [Incident description]

### People
**Training/Skills:**
- [Factor 1]: [Evidence/likelihood]
- [Factor 2]: [Evidence/likelihood]

**Communication:**
- [Factor 1]: [Evidence/likelihood]

**Decision Making:**
- [Factor 1]: [Evidence/likelihood]

### Process
**Documentation:**
- [Factor 1]: [Evidence/likelihood]

**Change Management:**
- [Factor 1]: [Evidence/likelihood]

**Review/Approval:**
- [Factor 1]: [Evidence/likelihood]

### Technology
**Architecture:**
- [Factor 1]: [Evidence/likelihood]

**Monitoring:**
- [Factor 1]: [Evidence/likelihood]

**Tools:**
- [Factor 1]: [Evidence/likelihood]

### Environment
**Infrastructure:**
- [Factor 1]: [Evidence/likelihood]

**Dependencies:**
- [Factor 1]: [Evidence/likelihood]

**External Factors:**
- [Factor 1]: [Evidence/likelihood]

### Primary Contributing Factors
1. [Factor with highest evidence/impact]
2. [Second most significant factor]
3. [Third most significant factor]

### Root Cause Hypothesis
[Synthesized explanation of how factors combined to cause incident]

Timeline Analysis Framework

Purpose

Analyze the chronological sequence of events to identify decision points, missed opportunities, and process gaps.

When to Use

  • Extended incidents (> 1 hour)
  • Complex multi-phase incidents
  • When response effectiveness is questioned
  • Communication or coordination failures

Analysis Dimensions

Detection Analysis

  • Time to Detection: How long from onset to first alert?
  • Detection Method: How was the incident first identified?
  • Alert Effectiveness: Were the right people notified quickly?
  • False Negatives: What signals were missed?

Response Analysis

  • Time to Response: How long from detection to first response action?
  • Escalation Timing: Were escalations timely and appropriate?
  • Resource Mobilization: How quickly were the right people engaged?
  • Decision Points: What key decisions were made and when?

Communication Analysis

  • Internal Communication: How effective was team coordination?
  • External Communication: Were stakeholders informed appropriately?
  • Communication Gaps: Where did information flow break down?
  • Update Frequency: Were updates provided at appropriate intervals?

Resolution Analysis

  • Mitigation Strategy: Was the chosen approach optimal?
  • Alternative Paths: What other options were considered?
  • Resource Allocation: Were resources used effectively?
  • Verification: How was resolution confirmed?

Process Steps

Step 1: Event Reconstruction

Create a comprehensive timeline from all available events.

Step 2: Phase Identification

Identify distinct phases (detection, triage, escalation, mitigation, resolution).

Step 3: Gap Analysis

Identify time gaps and analyze their causes.

Step 4: Decision Point Analysis

Examine key decision points and alternative paths.

Step 5: Effectiveness Assessment

Evaluate the overall effectiveness of the response.
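
The reconstruction and gap-analysis steps can be sketched with the standard `datetime` module. This is an illustrative sketch, not the behavior of `timeline_reconstructor.py`: the event tuples, the `onset` label, the timestamps, and the 30-minute gap threshold are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical events as (ISO timestamp, phase, description) tuples.
EVENTS = [
    ("2026-03-15T10:12:00", "detection", "First alert fires"),
    ("2026-03-15T10:00:00", "onset", "Error rate starts climbing"),
    ("2026-03-15T10:20:00", "triage", "On-call acknowledges the page"),
    ("2026-03-15T11:05:00", "mitigation", "Rollback started"),
    ("2026-03-15T12:00:00", "resolution", "Error rate back to normal"),
]


def parse_events(events):
    """Parse timestamps and return events in chronological order."""
    return sorted(
        (datetime.fromisoformat(ts), phase, desc) for ts, phase, desc in events
    )


def first_seen(parsed, phase):
    """Timestamp of the first event recorded for a phase."""
    for ts, event_phase, _ in parsed:
        if event_phase == phase:
            return ts
    raise ValueError(f"no event recorded for phase {phase!r}")


def find_gaps(parsed, threshold):
    """Return (from, to, duration) for consecutive events further apart than threshold."""
    gaps = []
    for (prev_ts, _, prev_desc), (ts, _, desc) in zip(parsed, parsed[1:]):
        if ts - prev_ts > threshold:
            gaps.append((prev_desc, desc, ts - prev_ts))
    return gaps


parsed = parse_events(EVENTS)
time_to_detection = first_seen(parsed, "detection") - first_seen(parsed, "onset")
time_to_response = first_seen(parsed, "triage") - first_seen(parsed, "detection")
gaps = find_gaps(parsed, threshold=timedelta(minutes=30))

print("Time to detection:", time_to_detection)
print("Time to response:", time_to_response)
for start, end, duration in gaps:
    print(f"Gap of {duration} between '{start}' and '{end}'")
```

A gap flagged here is only a prompt for Step 3. Quiet periods can be legitimate, for example while a rollback is in progress.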

Timeline Template

## Timeline Analysis

### Incident Phases
1. **Detection** ([start] - [end], [duration])
2. **Triage** ([start] - [end], [duration])
3. **Escalation** ([start] - [end], [duration])
4. **Mitigation** ([start] - [end], [duration])
5. **Resolution** ([start] - [end], [duration])

### Key Decision Points
**[Timestamp]:** [Decision made]
- **Context:** [Situation at time of decision]
- **Alternatives:** [Other options considered]
- **Outcome:** [Result of decision]
- **Assessment:** [Was this optimal?]

### Communication Timeline
**[Timestamp]:** [Communication event]
- **Channel:** [Slack/Email/Phone/etc.]
- **Audience:** [Who was informed]
- **Content:** [What was communicated]
- **Effectiveness:** [Assessment]

### Gaps and Delays
**[Time Period]:** [Description of gap]
- **Duration:** [Length of gap]
- **Cause:** [Why the gap occurred]
- **Impact:** [Effect on incident response]

### Response Effectiveness
**Strengths:**
- [What went well]
- [Effective decisions/actions]

**Weaknesses:**
- [What could be improved]
- [Missed opportunities]

### Root Causes from Timeline
1. [Process-based root cause]
2. [Communication-based root cause]
3. [Decision-making root cause]

Bow Tie Analysis Framework

Purpose

Analyze both preventive measures (left side) and protective measures (right side) around an incident.

When to Use

  • High-severity incidents (SEV1)
  • Security incidents
  • Safety-critical systems
  • When comprehensive barrier analysis is needed

Components

Hazards

What conditions create the potential for incidents?

Examples:

  • High traffic loads
  • Software deployments
  • Human interactions with critical systems
  • Third-party dependencies

Top Event

What actually went wrong? This is the center of the bow tie.

Examples:

  • "Database became unresponsive"
  • "Payment processing failed"
  • "User authentication service crashed"

Threats (Left Side)

What specific causes could lead to the top event?

Examples:

  • Code defects in new deployment
  • Database connection pool exhaustion
  • Network connectivity issues
  • DDoS attack

Consequences (Right Side)

What are the potential impacts of the top event?

Examples:

  • Revenue loss
  • Customer churn
  • Regulatory violations
  • Brand damage
  • Data loss

Barriers

What controls exist (or could exist) to prevent threats or mitigate consequences?

Preventive Barriers (Left Side):

  • Code reviews
  • Automated testing
  • Load testing
  • Input validation
  • Rate limiting

Protective Barriers (Right Side):

  • Circuit breakers
  • Failover systems
  • Backup procedures
  • Customer communication
  • Rollback capabilities

Process Steps

Step 1: Define the Top Event

Clearly state what went wrong.

Step 2: Identify Threats

Brainstorm all possible causes that could lead to the top event.

Step 3: Identify Consequences

List all potential impacts of the top event.

Step 4: Map Existing Barriers

Identify current controls for each threat and consequence.

Step 5: Assess Barrier Effectiveness

Evaluate how well each barrier worked (or failed).

Step 6: Recommend Additional Barriers

Identify new controls needed to prevent recurrence.
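
Steps 4 through 6 can be sketched as a mapping from each threat and consequence to its barriers, then querying for failed barriers and for items with no working barrier. This is a minimal sketch: the `Barrier` and `BowTie` classes are hypothetical, and the `held` values are invented for illustration using the example names from this section.

```python
from dataclasses import dataclass, field


@dataclass
class Barrier:
    name: str
    held: bool  # True if the barrier worked during the incident


@dataclass
class BowTie:
    top_event: str
    # Left side: threat -> preventive barriers.
    threats: dict[str, list[Barrier]] = field(default_factory=dict)
    # Right side: consequence -> protective barriers.
    consequences: dict[str, list[Barrier]] = field(default_factory=dict)

    def failed_barriers(self) -> list[str]:
        """Names of barriers that existed but did not hold (Step 5)."""
        sides = list(self.threats.values()) + list(self.consequences.values())
        return [b.name for barriers in sides for b in barriers if not b.held]

    def uncovered(self) -> dict[str, list[str]]:
        """Threats and consequences with no barrier that held (input to Step 6)."""
        def lacking(side):
            return [name for name, barriers in side.items()
                    if not any(b.held for b in barriers)]
        return {
            "threats": lacking(self.threats),
            "consequences": lacking(self.consequences),
        }


# Illustrative assessment; which barriers held is made up for this sketch.
bow_tie = BowTie(
    top_event="Database became unresponsive",
    threats={
        "Code defects in new deployment": [
            Barrier("Code reviews", held=False),
            Barrier("Automated testing", held=False),
        ],
        "DDoS attack": [Barrier("Rate limiting", held=True)],
    },
    consequences={
        "Revenue loss": [
            Barrier("Failover systems", held=False),
            Barrier("Rollback capabilities", held=True),
        ],
        "Data loss": [],
    },
)

print("Failed barriers:", bow_tie.failed_barriers())
print("Uncovered:", bow_tie.uncovered())
```

A boolean `held` flag is a simplification. Real assessments usually record partial effectiveness and the reason a barrier failed.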

Bow Tie Template

## Bow Tie Analysis

**Top Event:** [What went wrong]

### Threats (Potential Causes)
1. **[Threat 1]**
   - Likelihood: [High/Medium/Low]
   - Current Barriers: [Preventive controls]
   - Barrier Effectiveness: [Assessment]

2. **[Threat 2]**
   - Likelihood: [High/Medium/Low]
   - Current Barriers: [Preventive controls]
   - Barrier Effectiveness: [Assessment]

### Consequences (Potential Impacts)
1. **[Consequence 1]**
   - Severity: [High/Medium/Low]
   - Current Barriers: [Protective controls]
   - Barrier Effectiveness: [Assessment]

2. **[Consequence 2]**
   - Severity: [High/Medium/Low]
   - Current Barriers: [Protective controls]
   - Barrier Effectiveness: [Assessment]

### Barrier Analysis
**Effective Barriers:**
- [Barrier that worked well]
- [Why it was effective]

**Failed Barriers:**
- [Barrier that failed]
- [Why it failed]
- [How to improve]

**Missing Barriers:**
- [Needed preventive control]
- [Needed protective control]

### Recommendations
**Preventive Measures:**
1. [New barrier to prevent threat]
2. [Improvement to existing barrier]

**Protective Measures:**
1. [New barrier to mitigate consequence]
2. [Improvement to existing barrier]

Framework Comparison

| Framework | Time Required | Complexity | Best For | Output |
|---|---|---|---|---|
| 5 Whys | 30-60 minutes | Low | Simple, linear causes | Clear cause chain |
| Fishbone | 1-2 hours | Medium | Complex, multi-factor | Comprehensive factor map |
| Timeline | 2-3 hours | Medium | Extended incidents | Process improvements |
| Bow Tie | 2-4 hours | High | High-risk incidents | Barrier strategy |

Combining Frameworks

5 Whys + Fishbone

Use 5 Whys for initial analysis, then Fishbone to explore contributing factors.

Timeline + 5 Whys

Use Timeline to identify key decision points, then 5 Whys on critical failures.

Fishbone + Bow Tie

Use Fishbone to identify causes, then Bow Tie to develop a comprehensive prevention strategy.
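
When planning a combined analysis, a rough time budget can be derived from the Framework Comparison table. The sketch below assumes the two analyses run back to back, so their time ranges simply add. That is an assumption of this example, and real sessions may overlap or share preparation work. The function name `combined_estimate` is hypothetical.

```python
# Time ranges in minutes, taken from the Framework Comparison table.
TIME_REQUIRED = {
    "5 Whys": (30, 60),
    "Fishbone": (60, 120),
    "Timeline": (120, 180),
    "Bow Tie": (120, 240),
}


def combined_estimate(*frameworks: str) -> tuple[int, int]:
    """Sum the lower and upper bounds, assuming sequential analyses."""
    low = sum(TIME_REQUIRED[name][0] for name in frameworks)
    high = sum(TIME_REQUIRED[name][1] for name in frameworks)
    return low, high


COMBINATIONS = [
    ("5 Whys", "Fishbone"),
    ("Timeline", "5 Whys"),
    ("Fishbone", "Bow Tie"),
]

for combo in COMBINATIONS:
    low, high = combined_estimate(*combo)
    print(f"{' + '.join(combo)}: {low}-{high} minutes")
```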

Quality Checklist

  • Root causes address systemic issues, not symptoms
  • Analysis is backed by evidence, not assumptions
  • Multiple perspectives considered (technical, process, human)
  • Recommendations are specific and actionable
  • Analysis focuses on prevention, not blame
  • Findings are validated against incident timeline
  • Contributing factors are prioritized by impact
  • Root causes link clearly to preventive actions

Common Anti-Patterns

  • Human Error as Root Cause - Dig deeper into why human error occurred
  • Single Root Cause - Complex systems usually have multiple contributing factors
  • Technology-Only Focus - Consider process and organizational factors
  • Blame Assignment - Focus on system improvements, not individual fault
  • Generic Recommendations - Provide specific, measurable actions
  • Surface-Level Analysis - Ensure you've reached true root causes
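
Two of these anti-patterns can be flagged mechanically before a review is published. The following is a rough heuristic sketch only: the function name `lint_root_causes` and the string matching are assumptions for illustration, and most anti-patterns in the list still require human judgment.

```python
def lint_root_causes(root_causes: list[str]) -> list[str]:
    """Flag the anti-patterns that simple text checks can detect."""
    findings = []
    if len(root_causes) == 1:
        findings.append(
            "Single Root Cause: check for additional contributing factors"
        )
    for cause in root_causes:
        if "human error" in cause.lower():
            findings.append(
                f"Human Error as Root Cause: dig deeper into '{cause}'"
            )
    return findings


for finding in lint_root_causes(["Human error during deployment"]):
    print(finding)
```

An empty result does not mean the analysis is sound; it only means these two patterns were not detected.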

Last Updated: February 2026
Next Review: August 2026
Owner: SRE Team + Engineering Leadership