claude-skills-reference/engineering-team/incident-commander/references/rca_frameworks_guide.md

# Root Cause Analysis (RCA) Frameworks Guide

## Overview

This guide provides detailed instructions for applying various Root Cause Analysis frameworks during Post-Incident Reviews. Each framework offers a different perspective and approach to identifying underlying causes of incidents.

## Framework Selection Guidelines

| Incident Type | Recommended Framework | Why |
|---------------|----------------------|-----|
| **Process Failure** | 5 Whys | Simple, direct cause-effect chain |
| **Complex System Failure** | Fishbone + Timeline | Multiple contributing factors |
| **Human Error** | Fishbone | Systematic analysis of contributing factors |
| **Extended Incidents** | Timeline Analysis | Understanding decision points |
| **High-Risk Incidents** | Bow Tie | Comprehensive barrier analysis |
| **Recurring Issues** | 5 Whys + Fishbone | Deep dive into systemic issues |

---

## 5 Whys Analysis Framework

### Purpose
Iteratively drill down through cause-effect relationships to identify root causes.

### When to Use
- Simple, linear cause-effect chains
- Time-pressured analysis
- Process-related failures
- Individual component failures

### Process Steps

#### Step 1: Problem Statement
Write a clear, specific problem statement.

**Good Example:**
> "The payment API returned 500 errors for 2 hours on March 15, affecting 80% of checkout attempts."

**Poor Example:**
> "The system was broken."

#### Step 2: First Why
Ask why the problem occurred. Focus on immediate, observable causes.

**Example:**
- **Why 1:** Why did the payment API return 500 errors?
- **Answer:** The database connection pool was exhausted.

#### Step 3: Subsequent Whys
For each answer, ask "why" again. Continue until you reach a root cause.

**Example Chain:**
- **Why 2:** Why was the database connection pool exhausted?
- **Answer:** The application was creating more connections than usual.

- **Why 3:** Why was the application creating more connections?
- **Answer:** A new feature wasn't properly closing connections.

- **Why 4:** Why wasn't the feature properly closing connections?
- **Answer:** Code review missed the connection leak pattern.

- **Why 5:** Why did code review miss this pattern?
- **Answer:** We don't have automated checks for connection pooling best practices.

#### Step 4: Validation
Verify that addressing the root cause would prevent the original problem.

### Best Practices

1. **Ask at least 3 "whys"** - Surface causes are rarely root causes
2. **Focus on process failures, not people** - Avoid blame, focus on system improvements
3. **Use evidence** - Support each answer with data or observations
4. **Consider multiple paths** - Some problems have multiple root causes
5. **Test the logic** - Work backwards from root cause to problem

### Common Pitfalls

- **Stopping too early** - First few whys often reveal symptoms, not causes
- **Single-cause assumption** - Complex systems often have multiple contributing factors
- **Blame focus** - Focusing on individual mistakes rather than system failures
- **Vague answers** - Use specific, actionable answers

### 5 Whys Template

```markdown
## 5 Whys Analysis

**Problem Statement:** [Clear description of the incident]

**Why 1:** [First why question]
**Answer:** [Specific, evidence-based answer]
**Evidence:** [Supporting data, logs, observations]

**Why 2:** [Second why question]
**Answer:** [Specific answer based on Why 1]
**Evidence:** [Supporting evidence]

[Continue for 3-7 iterations]

**Root Cause(s) Identified:**
1. [Primary root cause]
2. [Secondary root cause if applicable]

**Validation:** [Confirm that addressing root causes would prevent recurrence]
```

---

## Fishbone (Ishikawa) Diagram Framework

### Purpose
Systematically analyze potential causes across multiple categories to identify contributing factors.

### When to Use
- Complex incidents with multiple potential causes
- When human factors are suspected
- Systemic or organizational issues
- When 5 Whys doesn't reveal clear root causes

### Categories

#### People (Human Factors)
- **Training and Skills**
  - Insufficient training on new systems
  - Lack of domain expertise
  - Skill gaps in team
  - Knowledge not shared across team

- **Communication**
  - Poor communication between teams
  - Unclear responsibilities
  - Information not reaching right people
  - Language/cultural barriers

- **Decision Making**
  - Decisions made under pressure
  - Insufficient information for decisions
  - Risk assessment inadequate
  - Approval processes bypassed

#### Process (Procedures and Workflows)
- **Documentation**
  - Outdated procedures
  - Missing runbooks
  - Unclear instructions
  - Process not documented

- **Change Management**
  - Inadequate change review
  - Rushed deployments
  - Insufficient testing
  - Rollback procedures unclear

- **Review and Approval**
  - Code review gaps
  - Architecture review skipped
  - Security review insufficient
  - Performance review missing

#### Technology (Systems and Tools)
- **Architecture**
  - Single points of failure
  - Insufficient redundancy
  - Scalability limitations
  - Tight coupling between systems

- **Monitoring and Alerting**
  - Missing monitoring
  - Alert fatigue
  - Inadequate thresholds
  - Poor alert routing

- **Tools and Automation**
  - Manual processes prone to error
  - Tool limitations
  - Automation gaps
  - Integration issues

#### Environment (External Factors)
- **Infrastructure**
  - Hardware failures
  - Network issues
  - Capacity limitations
  - Geographic dependencies

- **Dependencies**
  - Third-party service failures
  - External API changes
  - Vendor issues
  - Supply chain problems

- **External Pressure**
  - Time pressure from business
  - Resource constraints
  - Regulatory changes
  - Market conditions

### Process Steps

#### Step 1: Define the Problem
Place the incident at the "head" of the fishbone diagram.

#### Step 2: Brainstorm Causes
For each category, brainstorm potential contributing factors.

#### Step 3: Drill Down
For each factor, ask what caused that factor (sub-causes).

#### Step 4: Identify Primary Causes
Mark the most likely contributing factors based on evidence.

#### Step 5: Validate
Gather evidence to support or refute each suspected cause.

### Fishbone Template

```markdown
## Fishbone Analysis

**Problem:** [Incident description]

### People
**Training/Skills:**
- [Factor 1]: [Evidence/likelihood]
- [Factor 2]: [Evidence/likelihood]

**Communication:**
- [Factor 1]: [Evidence/likelihood]

**Decision Making:**
- [Factor 1]: [Evidence/likelihood]

### Process
**Documentation:**
- [Factor 1]: [Evidence/likelihood]

**Change Management:**
- [Factor 1]: [Evidence/likelihood]

**Review/Approval:**
- [Factor 1]: [Evidence/likelihood]

### Technology
**Architecture:**
- [Factor 1]: [Evidence/likelihood]

**Monitoring:**
- [Factor 1]: [Evidence/likelihood]

**Tools:**
- [Factor 1]: [Evidence/likelihood]

### Environment
**Infrastructure:**
- [Factor 1]: [Evidence/likelihood]

**Dependencies:**
- [Factor 1]: [Evidence/likelihood]

**External Factors:**
- [Factor 1]: [Evidence/likelihood]

### Primary Contributing Factors
1. [Factor with highest evidence/impact]
2. [Second most significant factor]
3. [Third most significant factor]

### Root Cause Hypothesis
[Synthesized explanation of how factors combined to cause incident]
```

---

## Timeline Analysis Framework

### Purpose
Analyze the chronological sequence of events to identify decision points, missed opportunities, and process gaps.

### When to Use
- Extended incidents (> 1 hour)
- Complex multi-phase incidents
- When response effectiveness is questioned
- Communication or coordination failures

### Analysis Dimensions

#### Detection Analysis
- **Time to Detection:** How long from onset to first alert?
- **Detection Method:** How was the incident first identified?
- **Alert Effectiveness:** Were the right people notified quickly?
- **False Negatives:** What signals were missed?

#### Response Analysis
- **Time to Response:** How long from detection to first response action?
- **Escalation Timing:** Were escalations timely and appropriate?
- **Resource Mobilization:** How quickly were the right people engaged?
- **Decision Points:** What key decisions were made and when?

#### Communication Analysis
- **Internal Communication:** How effective was team coordination?
- **External Communication:** Were stakeholders informed appropriately?
- **Communication Gaps:** Where did information flow break down?
- **Update Frequency:** Were updates provided at appropriate intervals?

#### Resolution Analysis
- **Mitigation Strategy:** Was the chosen approach optimal?
- **Alternative Paths:** What other options were considered?
- **Resource Allocation:** Were resources used effectively?
- **Verification:** How was resolution confirmed?

### Process Steps

#### Step 1: Event Reconstruction
Create comprehensive timeline with all available events.

#### Step 2: Phase Identification
Identify distinct phases (detection, triage, escalation, mitigation, resolution).

#### Step 3: Gap Analysis
Identify time gaps and analyze their causes.

#### Step 4: Decision Point Analysis
Examine key decision points and alternative paths.

#### Step 5: Effectiveness Assessment
Evaluate the overall effectiveness of the response.

### Timeline Template

```markdown
## Timeline Analysis

### Incident Phases
1. **Detection** ([start] - [end], [duration])
2. **Triage** ([start] - [end], [duration])
3. **Escalation** ([start] - [end], [duration])
4. **Mitigation** ([start] - [end], [duration])
5. **Resolution** ([start] - [end], [duration])

### Key Decision Points
**[Timestamp]:** [Decision made]
- **Context:** [Situation at time of decision]
- **Alternatives:** [Other options considered]
- **Outcome:** [Result of decision]
- **Assessment:** [Was this optimal?]

### Communication Timeline
**[Timestamp]:** [Communication event]
- **Channel:** [Slack/Email/Phone/etc.]
- **Audience:** [Who was informed]
- **Content:** [What was communicated]
- **Effectiveness:** [Assessment]

### Gaps and Delays
**[Time Period]:** [Description of gap]
- **Duration:** [Length of gap]
- **Cause:** [Why did gap occur]
- **Impact:** [Effect on incident response]

### Response Effectiveness
**Strengths:**
- [What went well]
- [Effective decisions/actions]

**Weaknesses:**
- [What could be improved]
- [Missed opportunities]

### Root Causes from Timeline
1. [Process-based root cause]
2. [Communication-based root cause]
3. [Decision-making root cause]
```

---

## Bow Tie Analysis Framework

### Purpose
Analyze both preventive measures (left side) and protective measures (right side) around an incident.

### When to Use
- High-severity incidents (SEV1)
- Security incidents
- Safety-critical systems
- When comprehensive barrier analysis is needed

### Components

#### Hazards
What conditions create the potential for incidents?

**Examples:**
- High traffic loads
- Software deployments
- Human interactions with critical systems
- Third-party dependencies

#### Top Event
What actually went wrong? This is the center of the bow tie.

**Examples:**
- "Database became unresponsive"
- "Payment processing failed"
- "User authentication service crashed"

#### Threats (Left Side)
What specific causes could lead to the top event?

**Examples:**
- Code defects in new deployment
- Database connection pool exhaustion
- Network connectivity issues
- DDoS attack

#### Consequences (Right Side)
What are the potential impacts of the top event?

**Examples:**
- Revenue loss
- Customer churn
- Regulatory violations
- Brand damage
- Data loss

#### Barriers
What controls exist (or could exist) to prevent threats or mitigate consequences?

**Preventive Barriers (Left Side):**
- Code reviews
- Automated testing
- Load testing
- Input validation
- Rate limiting

**Protective Barriers (Right Side):**
- Circuit breakers
- Failover systems
- Backup procedures
- Customer communication
- Rollback capabilities

### Process Steps

#### Step 1: Define the Top Event
Clearly state what went wrong.

#### Step 2: Identify Threats
Brainstorm all possible causes that could lead to the top event.

#### Step 3: Identify Consequences
List all potential impacts of the top event.

#### Step 4: Map Existing Barriers
Identify current controls for each threat and consequence.

#### Step 5: Assess Barrier Effectiveness
Evaluate how well each barrier worked (or failed).

#### Step 6: Recommend Additional Barriers
Identify new controls needed to prevent recurrence.

### Bow Tie Template

```markdown
## Bow Tie Analysis

**Top Event:** [What went wrong]

### Threats (Potential Causes)
1. **[Threat 1]**
   - Likelihood: [High/Medium/Low]
   - Current Barriers: [Preventive controls]
   - Barrier Effectiveness: [Assessment]

2. **[Threat 2]**
   - Likelihood: [High/Medium/Low]
   - Current Barriers: [Preventive controls]
   - Barrier Effectiveness: [Assessment]

### Consequences (Potential Impacts)
1. **[Consequence 1]**
   - Severity: [High/Medium/Low]
   - Current Barriers: [Protective controls]
   - Barrier Effectiveness: [Assessment]

2. **[Consequence 2]**
   - Severity: [High/Medium/Low]
   - Current Barriers: [Protective controls]
   - Barrier Effectiveness: [Assessment]

### Barrier Analysis
**Effective Barriers:**
- [Barrier that worked well]
- [Why it was effective]

**Failed Barriers:**
- [Barrier that failed]
- [Why it failed]
- [How to improve]

**Missing Barriers:**
- [Needed preventive control]
- [Needed protective control]

### Recommendations
**Preventive Measures:**
1. [New barrier to prevent threat]
2. [Improvement to existing barrier]

**Protective Measures:**
1. [New barrier to mitigate consequence]
2. [Improvement to existing barrier]
```

---

## Framework Comparison

| Framework | Time Required | Complexity | Best For | Output |
|-----------|---------------|------------|----------|---------|
| **5 Whys** | 30-60 minutes | Low | Simple, linear causes | Clear cause chain |
| **Fishbone** | 1-2 hours | Medium | Complex, multi-factor | Comprehensive factor map |
| **Timeline** | 2-3 hours | Medium | Extended incidents | Process improvements |
| **Bow Tie** | 2-4 hours | High | High-risk incidents | Barrier strategy |

## Combining Frameworks

### 5 Whys + Fishbone
Use 5 Whys for initial analysis, then Fishbone to explore contributing factors.

### Timeline + 5 Whys
Use Timeline to identify key decision points, then 5 Whys on critical failures.

### Fishbone + Bow Tie
Use Fishbone to identify causes, then Bow Tie to develop comprehensive prevention strategy.

## Quality Checklist

- [ ] Root causes address systemic issues, not symptoms
- [ ] Analysis is backed by evidence, not assumptions
- [ ] Multiple perspectives considered (technical, process, human)
- [ ] Recommendations are specific and actionable
- [ ] Analysis focuses on prevention, not blame
- [ ] Findings are validated against incident timeline
- [ ] Contributing factors are prioritized by impact
- [ ] Root causes link clearly to preventive actions

## Common Anti-Patterns

- **Human Error as Root Cause** - Dig deeper into why human error occurred
- **Single Root Cause** - Complex systems usually have multiple contributing factors
- **Technology-Only Focus** - Consider process and organizational factors
- **Blame Assignment** - Focus on system improvements, not individual fault
- **Generic Recommendations** - Provide specific, measurable actions
- **Surface-Level Analysis** - Ensure you've reached true root causes

---

**Last Updated:** February 2026
**Next Review:** August 2026
**Owner:** SRE Team + Engineering Leadership