
# Root Cause Analysis (RCA) Frameworks Guide
## Overview
This guide explains how to apply four Root Cause Analysis frameworks (5 Whys, Fishbone, Timeline Analysis, and Bow Tie) during Post-Incident Reviews. Each framework takes a different approach to identifying the underlying causes of an incident.
## Framework Selection Guidelines
| Incident Type | Recommended Framework | Why |
|---------------|----------------------|-----|
| **Process Failure** | 5 Whys | Simple, direct cause-effect chain |
| **Complex System Failure** | Fishbone + Timeline | Multiple contributing factors |
| **Human Error** | Fishbone | Systematic analysis of contributing factors |
| **Extended Incidents** | Timeline Analysis | Understanding decision points |
| **High-Risk Incidents** | Bow Tie | Comprehensive barrier analysis |
| **Recurring Issues** | 5 Whys + Fishbone | Deep dive into systemic issues |
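
The selection table above can be encoded as a simple lookup. This is a minimal sketch: the dictionary keys, the normalization rule, and the `recommend_frameworks` name are illustrative assumptions, not part of any existing tool. Unknown incident types return an empty list so the caller has to choose deliberately.

```python
# Mapping copied from the selection table; keys are normalized incident types.
RECOMMENDED_FRAMEWORKS = {
    "process_failure": ["5 Whys"],
    "complex_system_failure": ["Fishbone", "Timeline"],
    "human_error": ["Fishbone"],
    "extended_incidents": ["Timeline"],
    "high_risk_incidents": ["Bow Tie"],
    "recurring_issues": ["5 Whys", "Fishbone"],
}


def recommend_frameworks(incident_type: str) -> list[str]:
    """Return the frameworks recommended for an incident type.

    The input is normalized so that "High-Risk Incidents" and
    "high_risk_incidents" resolve to the same entry. Unknown types
    return an empty list (an assumption of this sketch).
    """
    key = incident_type.strip().lower().replace("-", "_").replace(" ", "_")
    return list(RECOMMENDED_FRAMEWORKS.get(key, []))


print(recommend_frameworks("Complex System Failure"))
```
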
---
## 5 Whys Analysis Framework
### Purpose
Iteratively drill down through cause-effect relationships to identify root causes.
### When to Use
- Simple, linear cause-effect chains
- Time-pressured analysis
- Process-related failures
- Individual component failures
### Process Steps
#### Step 1: Problem Statement
Write a clear, specific problem statement.
**Good Example:**

> "The payment API returned 500 errors for 2 hours on March 15, affecting 80% of checkout attempts."

**Poor Example:**

> "The system was broken."
#### Step 2: First Why
Ask why the problem occurred. Focus on immediate, observable causes.
**Example:**
- **Why 1:** Why did the payment API return 500 errors?
- **Answer:** The database connection pool was exhausted.
#### Step 3: Subsequent Whys
For each answer, ask "why" again. Continue until you reach a cause that is systemic and actionable, such as a missing check or process, rather than another symptom.
**Example Chain:**
- **Why 2:** Why was the database connection pool exhausted?
- **Answer:** The application was creating more connections than usual.
- **Why 3:** Why was the application creating more connections?
- **Answer:** A new feature wasn't properly closing connections.
- **Why 4:** Why wasn't the feature properly closing connections?
- **Answer:** Code review missed the connection leak pattern.
- **Why 5:** Why did code review miss this pattern?
- **Answer:** We don't have automated checks for connection pooling best practices.
#### Step 4: Validation
Verify that addressing the root cause would prevent the original problem.
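
The steps above can be sketched as a small data structure with two checks. This is an illustrative sketch, not an existing tool: the `Why` class, `check_chain`, and `trace_back` are hypothetical names. It flags chains shorter than three whys and answers without evidence, and lists the chain from root cause back to the problem so the logic can be read in reverse during validation.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str
    answer: str
    evidence: str = ""  # logs, metrics, or observations backing the answer


def check_chain(chain: list[Why]) -> list[str]:
    """Return warnings for a 5 Whys chain; an empty list means none were found."""
    warnings = []
    if len(chain) < 3:
        warnings.append("Fewer than 3 whys: the last answer is probably a symptom.")
    for number, why in enumerate(chain, start=1):
        if not why.evidence.strip():
            warnings.append(f"Why {number} has no supporting evidence.")
    return warnings


def trace_back(problem: str, chain: list[Why]) -> list[str]:
    """List answers from root cause back to the problem (Step 4 logic test)."""
    return [why.answer for why in reversed(chain)] + [problem]


# The example chain from this guide. No evidence is recorded because the
# example above does not list any, so every step is flagged.
PROBLEM = "The payment API returned 500 errors for 2 hours."
EXAMPLE_CHAIN = [
    Why("Why did the payment API return 500 errors?",
        "The database connection pool was exhausted."),
    Why("Why was the database connection pool exhausted?",
        "The application was creating more connections than usual."),
    Why("Why was the application creating more connections?",
        "A new feature wasn't properly closing connections."),
    Why("Why wasn't the feature properly closing connections?",
        "Code review missed the connection leak pattern."),
    Why("Why did code review miss this pattern?",
        "We don't have automated checks for connection pooling best practices."),
]

for warning in check_chain(EXAMPLE_CHAIN):
    print(warning)
print(" -> ".join(trace_back(PROBLEM, EXAMPLE_CHAIN)))
```

Reading the `trace_back` output aloud ("because X, therefore Y") is a quick way to spot a step that does not follow from the one before it.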
### Best Practices
1. **Ask at least 3 "whys"** - Surface causes are rarely root causes
2. **Focus on process failures, not people** - Avoid blame, focus on system improvements
3. **Use evidence** - Support each answer with data or observations
4. **Consider multiple paths** - Some problems have multiple root causes
5. **Test the logic** - Work backwards from root cause to problem
### Common Pitfalls
- **Stopping too early** - The first few whys often reveal symptoms, not causes
- **Single-cause assumption** - Complex systems often have multiple contributing factors
- **Blame focus** - Focusing on individual mistakes rather than system failures
- **Vague answers** - Answers that are not specific cannot be verified with evidence or turned into actions
### 5 Whys Template
```markdown
## 5 Whys Analysis
**Problem Statement:** [Clear description of the incident]
**Why 1:** [First why question]
**Answer:** [Specific, evidence-based answer]
**Evidence:** [Supporting data, logs, observations]
**Why 2:** [Second why question]
**Answer:** [Specific answer based on Why 1]
**Evidence:** [Supporting evidence]
[Continue for 3-7 iterations]
**Root Cause(s) Identified:**
1. [Primary root cause]
2. [Secondary root cause if applicable]
**Validation:** [Confirm that addressing root causes would prevent recurrence]
```
---
## Fishbone (Ishikawa) Diagram Framework
### Purpose
Systematically analyze potential causes across multiple categories to identify contributing factors.
### When to Use
- Complex incidents with multiple potential causes
- When human factors are suspected
- Systemic or organizational issues
- When 5 Whys doesn't reveal clear root causes
### Categories
#### People (Human Factors)
- **Training and Skills**
  - Insufficient training on new systems
  - Lack of domain expertise
  - Skill gaps in team
  - Knowledge not shared across team
- **Communication**
  - Poor communication between teams
  - Unclear responsibilities
  - Information not reaching right people
  - Language/cultural barriers
- **Decision Making**
  - Decisions made under pressure
  - Insufficient information for decisions
  - Risk assessment inadequate
  - Approval processes bypassed
#### Process (Procedures and Workflows)
- **Documentation**
  - Outdated procedures
  - Missing runbooks
  - Unclear instructions
  - Process not documented
- **Change Management**
  - Inadequate change review
  - Rushed deployments
  - Insufficient testing
  - Rollback procedures unclear
- **Review and Approval**
  - Code review gaps
  - Architecture review skipped
  - Security review insufficient
  - Performance review missing
#### Technology (Systems and Tools)
- **Architecture**
  - Single points of failure
  - Insufficient redundancy
  - Scalability limitations
  - Tight coupling between systems
- **Monitoring and Alerting**
  - Missing monitoring
  - Alert fatigue
  - Inadequate thresholds
  - Poor alert routing
- **Tools and Automation**
  - Manual processes prone to error
  - Tool limitations
  - Automation gaps
  - Integration issues
#### Environment (External Factors)
- **Infrastructure**
  - Hardware failures
  - Network issues
  - Capacity limitations
  - Geographic dependencies
- **Dependencies**
  - Third-party service failures
  - External API changes
  - Vendor issues
  - Supply chain problems
- **External Pressure**
  - Time pressure from business
  - Resource constraints
  - Regulatory changes
  - Market conditions
### Process Steps
#### Step 1: Define the Problem
Place the incident at the "head" of the fishbone diagram.
#### Step 2: Brainstorm Causes
For each category, brainstorm potential contributing factors.
#### Step 3: Drill Down
For each factor, ask what caused that factor (sub-causes).
#### Step 4: Identify Primary Causes
Mark the most likely contributing factors based on evidence.
#### Step 5: Validate
Gather evidence to support or refute each suspected cause.
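
Steps 2 through 5 can be sketched as a list of factors that is filtered and ranked. This is a minimal illustration under stated assumptions: the `Factor` class, the `evidence_count` score (number of supporting observations), and both function names are hypothetical, and a real review would weigh impact as well as evidence.

```python
from dataclasses import dataclass

# The four categories used in this guide.
CATEGORIES = ("People", "Process", "Technology", "Environment")


@dataclass
class Factor:
    category: str            # one of CATEGORIES
    description: str
    evidence_count: int = 0  # supporting observations (assumed scoring scheme)
    refuted: bool = False    # True once Step 5 validation rules the factor out


def primary_factors(factors: list[Factor], top_n: int = 3) -> list[Factor]:
    """Step 4: keep supported, non-refuted factors, strongest evidence first."""
    for factor in factors:
        if factor.category not in CATEGORIES:
            raise ValueError(f"Unknown category: {factor.category}")
    candidates = [f for f in factors if not f.refuted and f.evidence_count > 0]
    return sorted(candidates, key=lambda f: f.evidence_count, reverse=True)[:top_n]


def uncovered_categories(factors: list[Factor]) -> list[str]:
    """Step 2 check: categories for which nothing has been brainstormed yet."""
    seen = {f.category for f in factors}
    return [c for c in CATEGORIES if c not in seen]
```

Running `uncovered_categories` before ranking is a cheap guard against the technology-only focus described later in this guide.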
### Fishbone Template
```markdown
## Fishbone Analysis
**Problem:** [Incident description]
### People
**Training/Skills:**
- [Factor 1]: [Evidence/likelihood]
- [Factor 2]: [Evidence/likelihood]
**Communication:**
- [Factor 1]: [Evidence/likelihood]
**Decision Making:**
- [Factor 1]: [Evidence/likelihood]
### Process
**Documentation:**
- [Factor 1]: [Evidence/likelihood]
**Change Management:**
- [Factor 1]: [Evidence/likelihood]
**Review/Approval:**
- [Factor 1]: [Evidence/likelihood]
### Technology
**Architecture:**
- [Factor 1]: [Evidence/likelihood]
**Monitoring:**
- [Factor 1]: [Evidence/likelihood]
**Tools:**
- [Factor 1]: [Evidence/likelihood]
### Environment
**Infrastructure:**
- [Factor 1]: [Evidence/likelihood]
**Dependencies:**
- [Factor 1]: [Evidence/likelihood]
**External Factors:**
- [Factor 1]: [Evidence/likelihood]
### Primary Contributing Factors
1. [Factor with highest evidence/impact]
2. [Second most significant factor]
3. [Third most significant factor]
### Root Cause Hypothesis
[Synthesized explanation of how factors combined to cause incident]
```
---
## Timeline Analysis Framework
### Purpose
Analyze the chronological sequence of events to identify decision points, missed opportunities, and process gaps.
### When to Use
- Extended incidents (> 1 hour)
- Complex multi-phase incidents
- When response effectiveness is questioned
- Communication or coordination failures
### Analysis Dimensions
#### Detection Analysis
- **Time to Detection:** How long from onset to first alert?
- **Detection Method:** How was the incident first identified?
- **Alert Effectiveness:** Were the right people notified quickly?
- **False Negatives:** What signals were missed?
#### Response Analysis
- **Time to Response:** How long from detection to first response action?
- **Escalation Timing:** Were escalations timely and appropriate?
- **Resource Mobilization:** How quickly were the right people engaged?
- **Decision Points:** What key decisions were made and when?
#### Communication Analysis
- **Internal Communication:** How effective was team coordination?
- **External Communication:** Were stakeholders informed appropriately?
- **Communication Gaps:** Where did information flow break down?
- **Update Frequency:** Were updates provided at appropriate intervals?
#### Resolution Analysis
- **Mitigation Strategy:** Was the chosen approach optimal?
- **Alternative Paths:** What other options were considered?
- **Resource Allocation:** Were resources used effectively?
- **Verification:** How was resolution confirmed?
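
The detection and response questions above reduce to differences between four timestamps. The sketch below assumes those timestamps are already known; the function name, the dictionary keys, and the `total_duration` metric are illustrative assumptions rather than definitions from this guide.

```python
from datetime import datetime, timedelta


def response_metrics(
    onset: datetime,
    detected: datetime,
    first_response: datetime,
    resolved: datetime,
) -> dict[str, timedelta]:
    """Compute basic timing metrics for an incident.

    time_to_detection: onset to first alert
    time_to_response:  detection to first response action
    total_duration:    onset to confirmed resolution (assumed extra metric)
    """
    points = [onset, detected, first_response, resolved]
    if points != sorted(points):
        raise ValueError("Timestamps must be in chronological order.")
    return {
        "time_to_detection": detected - onset,
        "time_to_response": first_response - detected,
        "total_duration": resolved - onset,
    }
```

The chronological-order check is deliberate: timestamps copied from several tools often mix time zones, and a negative duration is easier to catch here than in a finished report.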
### Process Steps
#### Step 1: Event Reconstruction
Create a comprehensive timeline of all available events.
#### Step 2: Phase Identification
Identify distinct phases (detection, triage, escalation, mitigation, resolution).
#### Step 3: Gap Analysis
Identify time gaps and analyze their causes.
#### Step 4: Decision Point Analysis
Examine key decision points and alternative paths.
#### Step 5: Effectiveness Assessment
Evaluate the overall effectiveness of the response.
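
Step 3 (gap analysis) can be sketched as a scan over consecutive events. This is a hedged illustration: the `find_gaps` name, the `(timestamp, description)` event shape, and the idea of a fixed threshold are assumptions made for the example. An appropriate threshold depends on severity and on the team's own response targets.

```python
from datetime import datetime, timedelta


def find_gaps(
    events: list[tuple[datetime, str]],
    threshold: timedelta,
) -> list[dict]:
    """Return every pause between consecutive events longer than threshold.

    events: (timestamp, description) pairs in any order; they are sorted
            by timestamp before comparison.
    """
    ordered = sorted(events, key=lambda event: event[0])
    gaps = []
    for (prev_time, prev_desc), (next_time, next_desc) in zip(ordered, ordered[1:]):
        duration = next_time - prev_time
        if duration > threshold:
            gaps.append({"after": prev_desc, "before": next_desc, "duration": duration})
    return gaps
```

Each returned gap maps onto one entry in the "Gaps and Delays" part of the template; the cause and impact still have to come from the people who were involved.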
### Timeline Template
```markdown
## Timeline Analysis
### Incident Phases
1. **Detection** ([start] - [end], [duration])
2. **Triage** ([start] - [end], [duration])
3. **Escalation** ([start] - [end], [duration])
4. **Mitigation** ([start] - [end], [duration])
5. **Resolution** ([start] - [end], [duration])
### Key Decision Points
**[Timestamp]:** [Decision made]
- **Context:** [Situation at time of decision]
- **Alternatives:** [Other options considered]
- **Outcome:** [Result of decision]
- **Assessment:** [Was this optimal?]
### Communication Timeline
**[Timestamp]:** [Communication event]
- **Channel:** [Slack/Email/Phone/etc.]
- **Audience:** [Who was informed]
- **Content:** [What was communicated]
- **Effectiveness:** [Assessment]
### Gaps and Delays
**[Time Period]:** [Description of gap]
- **Duration:** [Length of gap]
- **Cause:** [Why did gap occur]
- **Impact:** [Effect on incident response]
### Response Effectiveness
**Strengths:**
- [What went well]
- [Effective decisions/actions]
**Weaknesses:**
- [What could be improved]
- [Missed opportunities]
### Root Causes from Timeline
1. [Process-based root cause]
2. [Communication-based root cause]
3. [Decision-making root cause]
```
---
## Bow Tie Analysis Framework
### Purpose
Analyze both preventive measures (left side) and protective measures (right side) around an incident.
### When to Use
- High-severity incidents (SEV1)
- Security incidents
- Safety-critical systems
- When comprehensive barrier analysis is needed
### Components
#### Hazards
What conditions create the potential for incidents?
**Examples:**
- High traffic loads
- Software deployments
- Human interactions with critical systems
- Third-party dependencies
#### Top Event
What actually went wrong? This is the center of the bow tie.
**Examples:**
- "Database became unresponsive"
- "Payment processing failed"
- "User authentication service crashed"
#### Threats (Left Side)
What specific causes could lead to the top event?
**Examples:**
- Code defects in new deployment
- Database connection pool exhaustion
- Network connectivity issues
- DDoS attack
#### Consequences (Right Side)
What are the potential impacts of the top event?
**Examples:**
- Revenue loss
- Customer churn
- Regulatory violations
- Brand damage
- Data loss
#### Barriers
What controls exist (or could exist) to prevent threats or mitigate consequences?

**Preventive Barriers (Left Side):**
- Code reviews
- Automated testing
- Load testing
- Input validation
- Rate limiting

**Protective Barriers (Right Side):**
- Circuit breakers
- Failover systems
- Backup procedures
- Customer communication
- Rollback capabilities
### Process Steps
#### Step 1: Define the Top Event
Clearly state what went wrong.
#### Step 2: Identify Threats
Brainstorm all possible causes that could lead to the top event.
#### Step 3: Identify Consequences
List all potential impacts of the top event.
#### Step 4: Map Existing Barriers
Identify current controls for each threat and consequence.
#### Step 5: Assess Barrier Effectiveness
Evaluate how well each barrier worked (or failed).
#### Step 6: Recommend Additional Barriers
Identify new controls needed to prevent recurrence.
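
Steps 4 through 6 can be sketched as a mapping from each threat and consequence to its barriers. The class and method names below are hypothetical, and recording barrier effectiveness as a single boolean is a simplifying assumption; a real assessment is usually more graded.

```python
from dataclasses import dataclass, field


@dataclass
class Barrier:
    name: str
    worked: bool  # Step 5: did this barrier hold during the incident?


@dataclass
class BowTie:
    top_event: str
    # Left side: threat -> preventive barriers
    threats: dict[str, list[Barrier]] = field(default_factory=dict)
    # Right side: consequence -> protective barriers
    consequences: dict[str, list[Barrier]] = field(default_factory=dict)

    def unprotected(self) -> dict[str, list[str]]:
        """Step 6 input: items with no barrier that worked (or no barrier at all)."""

        def weak(side: dict[str, list[Barrier]]) -> list[str]:
            return [
                name
                for name, barriers in side.items()
                if not any(barrier.worked for barrier in barriers)
            ]

        return {
            "threats": weak(self.threats),
            "consequences": weak(self.consequences),
        }
```

Everything returned by `unprotected()` is a candidate for the "Failed Barriers" or "Missing Barriers" sections of the template.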
### Bow Tie Template
```markdown
## Bow Tie Analysis
**Top Event:** [What went wrong]
### Threats (Potential Causes)
1. **[Threat 1]**
   - Likelihood: [High/Medium/Low]
   - Current Barriers: [Preventive controls]
   - Barrier Effectiveness: [Assessment]
2. **[Threat 2]**
   - Likelihood: [High/Medium/Low]
   - Current Barriers: [Preventive controls]
   - Barrier Effectiveness: [Assessment]
### Consequences (Potential Impacts)
1. **[Consequence 1]**
   - Severity: [High/Medium/Low]
   - Current Barriers: [Protective controls]
   - Barrier Effectiveness: [Assessment]
2. **[Consequence 2]**
   - Severity: [High/Medium/Low]
   - Current Barriers: [Protective controls]
   - Barrier Effectiveness: [Assessment]
### Barrier Analysis
**Effective Barriers:**
- [Barrier that worked well]
- [Why it was effective]
**Failed Barriers:**
- [Barrier that failed]
- [Why it failed]
- [How to improve]
**Missing Barriers:**
- [Needed preventive control]
- [Needed protective control]
### Recommendations
**Preventive Measures:**
1. [New barrier to prevent threat]
2. [Improvement to existing barrier]
**Protective Measures:**
1. [New barrier to mitigate consequence]
2. [Improvement to existing barrier]
```
---
## Framework Comparison
| Framework | Time Required | Complexity | Best For | Output |
|-----------|---------------|------------|----------|---------|
| **5 Whys** | 30-60 minutes | Low | Simple, linear causes | Clear cause chain |
| **Fishbone** | 1-2 hours | Medium | Complex, multi-factor | Comprehensive factor map |
| **Timeline** | 2-3 hours | Medium | Extended incidents | Process improvements |
| **Bow Tie** | 2-4 hours | High | High-risk incidents | Barrier strategy |
## Combining Frameworks
### 5 Whys + Fishbone
Use 5 Whys for initial analysis, then Fishbone to explore contributing factors.
### Timeline + 5 Whys
Use Timeline to identify key decision points, then 5 Whys on critical failures.
### Fishbone + Bow Tie
Use Fishbone to identify causes, then Bow Tie to develop a comprehensive prevention strategy.
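
When planning a review that combines frameworks, a rough time budget can be derived from the comparison table. The sketch below assumes the time for a combination is simply the sum of the individual ranges, which is an assumption of this example and likely an upper bound, since the second framework reuses findings from the first. The `estimate_hours` name is illustrative.

```python
# (low, high) time required in hours, taken from the comparison table.
TIME_REQUIRED_HOURS = {
    "5 Whys": (0.5, 1.0),
    "Fishbone": (1.0, 2.0),
    "Timeline": (2.0, 3.0),
    "Bow Tie": (2.0, 4.0),
}


def estimate_hours(frameworks: list[str]) -> tuple[float, float]:
    """Sum the low and high estimates for the chosen frameworks.

    Raises KeyError for a framework that is not in the table.
    """
    low = sum(TIME_REQUIRED_HOURS[name][0] for name in frameworks)
    high = sum(TIME_REQUIRED_HOURS[name][1] for name in frameworks)
    return (low, high)


print(estimate_hours(["5 Whys", "Fishbone"]))
```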
## Quality Checklist
- [ ] Root causes address systemic issues, not symptoms
- [ ] Analysis is backed by evidence, not assumptions
- [ ] Multiple perspectives considered (technical, process, human)
- [ ] Recommendations are specific and actionable
- [ ] Analysis focuses on prevention, not blame
- [ ] Findings are validated against incident timeline
- [ ] Contributing factors are prioritized by impact
- [ ] Root causes link clearly to preventive actions
## Common Anti-Patterns
- **Human Error as Root Cause** - Dig deeper into why human error occurred
- **Single Root Cause** - Complex systems usually have multiple contributing factors
- **Technology-Only Focus** - Consider process and organizational factors
- **Blame Assignment** - Focus on system improvements, not individual fault
- **Generic Recommendations** - Provide specific, measurable actions
- **Surface-Level Analysis** - Ensure you've reached true root causes
---
**Last Updated:** February 2026
**Next Review:** August 2026
**Owner:** SRE Team + Engineering Leadership