# Root Cause Analysis (RCA) Frameworks Guide

## Overview

This guide provides detailed instructions for applying various Root Cause Analysis frameworks during Post-Incident Reviews. Each framework offers a different perspective and approach to identifying underlying causes of incidents.

## Framework Selection Guidelines

| Incident Type | Recommended Framework | Why |
|---------------|----------------------|-----|
| **Process Failure** | 5 Whys | Simple, direct cause-effect chain |
| **Complex System Failure** | Fishbone + Timeline | Multiple contributing factors |
| **Human Error** | Fishbone | Systematic analysis of contributing factors |
| **Extended Incidents** | Timeline Analysis | Understanding decision points |
| **High-Risk Incidents** | Bow Tie | Comprehensive barrier analysis |
| **Recurring Issues** | 5 Whys + Fishbone | Deep dive into systemic issues |
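
If the selection table needs to be applied programmatically, it can be encoded as a simple lookup. The sketch below is illustrative only: the dictionary name, the function name, and the fallback to 5 Whys for unknown incident types are assumptions, not part of the skill's scripts.

```python
from __future__ import annotations

# Hypothetical encoding of the selection table; keys are the lowercased
# "Incident Type" labels from the table.
RECOMMENDED_FRAMEWORKS = {
    "process failure": ["5 Whys"],
    "complex system failure": ["Fishbone", "Timeline"],
    "human error": ["Fishbone"],
    "extended incidents": ["Timeline"],
    "high-risk incidents": ["Bow Tie"],
    "recurring issues": ["5 Whys", "Fishbone"],
}


def recommend_frameworks(incident_type: str) -> list[str]:
    """Return the frameworks the table recommends for an incident type."""
    key = incident_type.strip().lower()
    # Assumption: fall back to 5 Whys when the type is not in the table.
    return list(RECOMMENDED_FRAMEWORKS.get(key, ["5 Whys"]))
```

Returning a copy of the list keeps callers from accidentally mutating the shared table.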

---

## 5 Whys Analysis Framework

### Purpose
Iteratively drill down through cause-effect relationships to identify root causes.

### When to Use
- Simple, linear cause-effect chains
- Time-pressured analysis
- Process-related failures
- Individual component failures

### Process Steps

#### Step 1: Problem Statement
Write a clear, specific problem statement.

**Good Example:**
> "The payment API returned 500 errors for 2 hours on March 15, affecting 80% of checkout attempts."

**Poor Example:**
> "The system was broken."

#### Step 2: First Why
Ask why the problem occurred. Focus on immediate, observable causes.

**Example:**
- **Why 1:** Why did the payment API return 500 errors?
- **Answer:** The database connection pool was exhausted.

#### Step 3: Subsequent Whys
For each answer, ask "why" again. Continue until you reach a root cause.

**Example Chain:**
- **Why 2:** Why was the database connection pool exhausted?
- **Answer:** The application was creating more connections than usual.

- **Why 3:** Why was the application creating more connections?
- **Answer:** A new feature wasn't properly closing connections.

- **Why 4:** Why wasn't the feature properly closing connections?
- **Answer:** Code review missed the connection leak pattern.

- **Why 5:** Why did code review miss this pattern?
- **Answer:** We don't have automated checks for connection pooling best practices.

#### Step 4: Validation
Verify that addressing the root cause would prevent the original problem.
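
The chain and the validation step can be captured in a small data structure. This is a minimal sketch, not the skill's `pir_generator.py`: the `WhyStep` class, both helper functions, and the default of three whys are assumptions chosen for illustration. The evidence fields are left empty because the example chain above does not list any.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class WhyStep:
    question: str
    answer: str
    evidence: str = ""  # logs, metrics, or observations backing the answer


def validate_chain(steps: list[WhyStep], min_whys: int = 3) -> list[str]:
    """Return the problems found in a 5 Whys chain (empty list if none)."""
    problems = []
    if len(steps) < min_whys:
        problems.append(f"only {len(steps)} whys; expected at least {min_whys}")
    for number, step in enumerate(steps, start=1):
        if not step.evidence.strip():
            problems.append(f"Why {number} has no supporting evidence")
    return problems


def causal_story(problem: str, steps: list[WhyStep]) -> str:
    """Read the chain backwards, from root cause to the original problem."""
    lines = [f"Because: {step.answer}" for step in reversed(steps)]
    lines.append(f"Therefore: {problem}")
    return "\n".join(lines)


# The example chain from this guide, with evidence still to be collected.
chain = [
    WhyStep("Why did the payment API return 500 errors?",
            "The database connection pool was exhausted."),
    WhyStep("Why was the database connection pool exhausted?",
            "The application was creating more connections than usual."),
    WhyStep("Why was the application creating more connections?",
            "A new feature wasn't properly closing connections."),
    WhyStep("Why wasn't the feature properly closing connections?",
            "Code review missed the connection leak pattern."),
    WhyStep("Why did code review miss this pattern?",
            "We don't have automated checks for connection pooling best practices."),
]
open_items = validate_chain(chain)  # one entry per step lacking evidence
```

Printing `causal_story(...)` for the chain gives reviewers a quick way to check that each answer plausibly leads to the next one up.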

### Best Practices

1. **Ask at least 3 "whys"** - Surface causes are rarely root causes
2. **Focus on process failures, not people** - Avoid blame, focus on system improvements
3. **Use evidence** - Support each answer with data or observations
4. **Consider multiple paths** - Some problems have multiple root causes
5. **Test the logic** - Work backwards from root cause to problem

### Common Pitfalls

- **Stopping too early** - First few whys often reveal symptoms, not causes
- **Single-cause assumption** - Complex systems often have multiple contributing factors
- **Blame focus** - Focusing on individual mistakes rather than system failures
- **Vague answers** - Use specific, actionable answers

### 5 Whys Template

```markdown
## 5 Whys Analysis

**Problem Statement:** [Clear description of the incident]

**Why 1:** [First why question]
**Answer:** [Specific, evidence-based answer]
**Evidence:** [Supporting data, logs, observations]

**Why 2:** [Second why question]
**Answer:** [Specific answer based on Why 1]
**Evidence:** [Supporting evidence]

[Continue for 3-7 iterations]

**Root Cause(s) Identified:**
1. [Primary root cause]
2. [Secondary root cause if applicable]

**Validation:** [Confirm that addressing root causes would prevent recurrence]
```

---

## Fishbone (Ishikawa) Diagram Framework

### Purpose
Systematically analyze potential causes across multiple categories to identify contributing factors.

### When to Use
- Complex incidents with multiple potential causes
- When human factors are suspected
- Systemic or organizational issues
- When 5 Whys doesn't reveal clear root causes

### Categories

#### People (Human Factors)
- **Training and Skills**
  - Insufficient training on new systems
  - Lack of domain expertise
  - Skill gaps in team
  - Knowledge not shared across team

- **Communication**
  - Poor communication between teams
  - Unclear responsibilities
  - Information not reaching right people
  - Language/cultural barriers

- **Decision Making**
  - Decisions made under pressure
  - Insufficient information for decisions
  - Risk assessment inadequate
  - Approval processes bypassed

#### Process (Procedures and Workflows)
- **Documentation**
  - Outdated procedures
  - Missing runbooks
  - Unclear instructions
  - Process not documented

- **Change Management**
  - Inadequate change review
  - Rushed deployments
  - Insufficient testing
  - Rollback procedures unclear

- **Review and Approval**
  - Code review gaps
  - Architecture review skipped
  - Security review insufficient
  - Performance review missing

#### Technology (Systems and Tools)
- **Architecture**
  - Single points of failure
  - Insufficient redundancy
  - Scalability limitations
  - Tight coupling between systems

- **Monitoring and Alerting**
  - Missing monitoring
  - Alert fatigue
  - Inadequate thresholds
  - Poor alert routing

- **Tools and Automation**
  - Manual processes prone to error
  - Tool limitations
  - Automation gaps
  - Integration issues

#### Environment (External Factors)
- **Infrastructure**
  - Hardware failures
  - Network issues
  - Capacity limitations
  - Geographic dependencies

- **Dependencies**
  - Third-party service failures
  - External API changes
  - Vendor issues
  - Supply chain problems

- **External Pressure**
  - Time pressure from business
  - Resource constraints
  - Regulatory changes
  - Market conditions

### Process Steps

#### Step 1: Define the Problem
Place the incident at the "head" of the fishbone diagram.

#### Step 2: Brainstorm Causes
For each category, brainstorm potential contributing factors.

#### Step 3: Drill Down
For each factor, ask what caused that factor (sub-causes).

#### Step 4: Identify Primary Causes
Mark the most likely contributing factors based on evidence.

#### Step 5: Validate
Gather evidence to support or refute each suspected cause.
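
Steps 2 through 4 can be sketched as grouping brainstormed factors by category and then ranking them by how much evidence supports each one. The `Factor` class, the `evidence_count` field (a simple count of supporting observations), and both function names are assumptions made for this example; a real review might weigh evidence by quality rather than quantity.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass

# The four top-level categories used in this guide.
CATEGORIES = ("People", "Process", "Technology", "Environment")


@dataclass
class Factor:
    category: str            # one of CATEGORIES
    description: str
    evidence_count: int = 0  # assumption: number of supporting observations


def group_by_category(factors: list[Factor]) -> dict[str, list[Factor]]:
    """Step 2: arrange brainstormed factors along the fishbone 'bones'."""
    grouped = defaultdict(list)
    for factor in factors:
        if factor.category not in CATEGORIES:
            raise ValueError(f"unknown category: {factor.category!r}")
        grouped[factor.category].append(factor)
    return dict(grouped)


def primary_factors(factors: list[Factor], top_n: int = 3) -> list[Factor]:
    """Step 4: keep only evidence-backed factors, strongest first."""
    supported = [f for f in factors if f.evidence_count > 0]
    # sorted() is stable, so ties keep their brainstorming order.
    return sorted(supported, key=lambda f: f.evidence_count, reverse=True)[:top_n]
```

Factors with no evidence are excluded from the ranking rather than deleted, so they remain visible in the grouped view for Step 5.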

### Fishbone Template

```markdown
## Fishbone Analysis

**Problem:** [Incident description]

### People
**Training/Skills:**
- [Factor 1]: [Evidence/likelihood]
- [Factor 2]: [Evidence/likelihood]

**Communication:**
- [Factor 1]: [Evidence/likelihood]

**Decision Making:**
- [Factor 1]: [Evidence/likelihood]

### Process
**Documentation:**
- [Factor 1]: [Evidence/likelihood]

**Change Management:**
- [Factor 1]: [Evidence/likelihood]

**Review/Approval:**
- [Factor 1]: [Evidence/likelihood]

### Technology
**Architecture:**
- [Factor 1]: [Evidence/likelihood]

**Monitoring:**
- [Factor 1]: [Evidence/likelihood]

**Tools:**
- [Factor 1]: [Evidence/likelihood]

### Environment
**Infrastructure:**
- [Factor 1]: [Evidence/likelihood]

**Dependencies:**
- [Factor 1]: [Evidence/likelihood]

**External Factors:**
- [Factor 1]: [Evidence/likelihood]

### Primary Contributing Factors
1. [Factor with highest evidence/impact]
2. [Second most significant factor]
3. [Third most significant factor]

### Root Cause Hypothesis
[Synthesized explanation of how factors combined to cause incident]
```

---

## Timeline Analysis Framework

### Purpose
Analyze the chronological sequence of events to identify decision points, missed opportunities, and process gaps.

### When to Use
- Extended incidents (> 1 hour)
- Complex multi-phase incidents
- When response effectiveness is questioned
- Communication or coordination failures

### Analysis Dimensions

#### Detection Analysis
- **Time to Detection:** How long from onset to first alert?
- **Detection Method:** How was the incident first identified?
- **Alert Effectiveness:** Were the right people notified quickly?
- **False Negatives:** What signals were missed?

#### Response Analysis
- **Time to Response:** How long from detection to first response action?
- **Escalation Timing:** Were escalations timely and appropriate?
- **Resource Mobilization:** How quickly were the right people engaged?
- **Decision Points:** What key decisions were made and when?

#### Communication Analysis
- **Internal Communication:** How effective was team coordination?
- **External Communication:** Were stakeholders informed appropriately?
- **Communication Gaps:** Where did information flow break down?
- **Update Frequency:** Were updates provided at appropriate intervals?

#### Resolution Analysis
- **Mitigation Strategy:** Was the chosen approach optimal?
- **Alternative Paths:** What other options were considered?
- **Resource Allocation:** Were resources used effectively?
- **Verification:** How was resolution confirmed?
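
The two timing questions above (time to detection and time to response) reduce to differences between timestamps. The following sketch assumes ISO 8601 timestamp strings and a hypothetical `response_metrics` helper; it is not the skill's `timeline_reconstructor.py`, and the example timestamps are invented for illustration.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def response_metrics(onset: str, detected: str, first_action: str) -> dict[str, timedelta]:
    """Compute detection and response delays from ISO 8601 timestamps."""
    t_onset = datetime.fromisoformat(onset)
    t_detected = datetime.fromisoformat(detected)
    t_action = datetime.fromisoformat(first_action)
    if not t_onset <= t_detected <= t_action:
        raise ValueError("timestamps must be in chronological order")
    return {
        # Onset to first alert.
        "time_to_detection": t_detected - t_onset,
        # Detection to first response action.
        "time_to_response": t_action - t_detected,
    }


# Hypothetical timestamps, for illustration only.
metrics = response_metrics(
    onset="2026-01-10T14:00:00",
    detected="2026-01-10T14:12:00",
    first_action="2026-01-10T14:20:00",
)
```

Rejecting out-of-order timestamps early helps catch transcription errors in the reconstructed timeline before they skew the metrics.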

### Process Steps

#### Step 1: Event Reconstruction
Create a comprehensive timeline from all available events.

#### Step 2: Phase Identification
Identify distinct phases (detection, triage, escalation, mitigation, resolution).

#### Step 3: Gap Analysis
Identify time gaps and analyze their causes.

#### Step 4: Decision Point Analysis
Examine key decision points and alternative paths.

#### Step 5: Effectiveness Assessment
Evaluate the overall effectiveness of the response.
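
Gap analysis (Step 3) can be approximated by sorting the reconstructed events and flagging any pair of consecutive events separated by more than a chosen threshold. The `find_gaps` function and the 15-minute default below are assumptions for this sketch; the right threshold depends on incident severity and your update cadence.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def find_gaps(
    events: list[tuple[datetime, str]],
    threshold: timedelta = timedelta(minutes=15),  # assumed default
) -> list[tuple[str, str, timedelta]]:
    """Return (before, after, duration) for each gap longer than threshold."""
    ordered = sorted(events, key=lambda event: event[0])
    gaps = []
    # Compare each event with the one that follows it.
    for (prev_time, prev_desc), (next_time, next_desc) in zip(ordered, ordered[1:]):
        delta = next_time - prev_time
        if delta > threshold:
            gaps.append((prev_desc, next_desc, delta))
    return gaps


# Hypothetical events, for illustration only.
events = [
    (datetime(2026, 1, 10, 14, 12), "First alert fired"),
    (datetime(2026, 1, 10, 14, 20), "On-call acknowledged"),
    (datetime(2026, 1, 10, 15, 5), "Escalated to database team"),
    (datetime(2026, 1, 10, 15, 10), "Rollback started"),
]
gaps = find_gaps(events)
```

Each flagged gap still needs a human explanation (cause and impact); the code only points at where to look.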

### Timeline Template

```markdown
## Timeline Analysis

### Incident Phases
1. **Detection** ([start] - [end], [duration])
2. **Triage** ([start] - [end], [duration])
3. **Escalation** ([start] - [end], [duration])
4. **Mitigation** ([start] - [end], [duration])
5. **Resolution** ([start] - [end], [duration])

### Key Decision Points
**[Timestamp]:** [Decision made]
- **Context:** [Situation at time of decision]
- **Alternatives:** [Other options considered]
- **Outcome:** [Result of decision]
- **Assessment:** [Was this optimal?]

### Communication Timeline
**[Timestamp]:** [Communication event]
- **Channel:** [Slack/Email/Phone/etc.]
- **Audience:** [Who was informed]
- **Content:** [What was communicated]
- **Effectiveness:** [Assessment]

### Gaps and Delays
**[Time Period]:** [Description of gap]
- **Duration:** [Length of gap]
- **Cause:** [Why did gap occur]
- **Impact:** [Effect on incident response]

### Response Effectiveness
**Strengths:**
- [What went well]
- [Effective decisions/actions]

**Weaknesses:**
- [What could be improved]
- [Missed opportunities]

### Root Causes from Timeline
1. [Process-based root cause]
2. [Communication-based root cause]
3. [Decision-making root cause]
```

---

## Bow Tie Analysis Framework

### Purpose
Analyze both preventive measures (left side) and protective measures (right side) around an incident.

### When to Use
- High-severity incidents (SEV1)
- Security incidents
- Safety-critical systems
- When comprehensive barrier analysis is needed

### Components

#### Hazards
What conditions create the potential for incidents?

**Examples:**
- High traffic loads
- Software deployments
- Human interactions with critical systems
- Third-party dependencies

#### Top Event
What actually went wrong? This is the center of the bow tie.

**Examples:**
- "Database became unresponsive"
- "Payment processing failed"
- "User authentication service crashed"

#### Threats (Left Side)
What specific causes could lead to the top event?

**Examples:**
- Code defects in new deployment
- Database connection pool exhaustion
- Network connectivity issues
- DDoS attack

#### Consequences (Right Side)
What are the potential impacts of the top event?

**Examples:**
- Revenue loss
- Customer churn
- Regulatory violations
- Brand damage
- Data loss

#### Barriers
What controls exist (or could exist) to prevent threats or mitigate consequences?

**Preventive Barriers (Left Side):**
- Code reviews
- Automated testing
- Load testing
- Input validation
- Rate limiting

**Protective Barriers (Right Side):**
- Circuit breakers
- Failover systems
- Backup procedures
- Customer communication
- Rollback capabilities

### Process Steps

#### Step 1: Define the Top Event
Clearly state what went wrong.

#### Step 2: Identify Threats
Brainstorm all possible causes that could lead to the top event.

#### Step 3: Identify Consequences
List all potential impacts of the top event.

#### Step 4: Map Existing Barriers
Identify current controls for each threat and consequence.

#### Step 5: Assess Barrier Effectiveness
Evaluate how well each barrier worked (or failed).

#### Step 6: Recommend Additional Barriers
Identify new controls needed to prevent recurrence.
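
Steps 4 through 6 can be modeled by tagging each barrier with its side of the bow tie and its assessed status, then deriving recommendations from the failed and missing ones. All names here (`Side`, `Status`, `Barrier`, `summarize`, `recommend`) are hypothetical and chosen only for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Side(Enum):
    PREVENTIVE = "preventive"  # left side: between a threat and the top event
    PROTECTIVE = "protective"  # right side: between the top event and a consequence


class Status(Enum):
    EFFECTIVE = "effective"
    FAILED = "failed"
    MISSING = "missing"


@dataclass
class Barrier:
    name: str
    side: Side
    guards_against: str  # the threat or consequence this barrier addresses
    status: Status


def summarize(barriers: list[Barrier]) -> dict[Status, list[Barrier]]:
    """Step 5: group barriers by how they performed."""
    summary = {status: [] for status in Status}
    for barrier in barriers:
        summary[barrier.status].append(barrier)
    return summary


def recommend(barriers: list[Barrier]) -> dict[Side, list[str]]:
    """Step 6: turn failed and missing barriers into recommendations."""
    recs = {Side.PREVENTIVE: [], Side.PROTECTIVE: []}
    for barrier in barriers:
        if barrier.status is Status.FAILED:
            verb = "Improve"
        elif barrier.status is Status.MISSING:
            verb = "Add"
        else:
            continue  # effective barriers need no action
        recs[barrier.side].append(
            f"{verb} '{barrier.name}' (guards against: {barrier.guards_against})"
        )
    return recs
```

Keeping preventive and protective recommendations in separate lists mirrors the two halves of the diagram and makes it obvious when one side has no coverage.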

### Bow Tie Template

```markdown
## Bow Tie Analysis

**Top Event:** [What went wrong]

### Threats (Potential Causes)
1. **[Threat 1]**
   - Likelihood: [High/Medium/Low]
   - Current Barriers: [Preventive controls]
   - Barrier Effectiveness: [Assessment]

2. **[Threat 2]**
   - Likelihood: [High/Medium/Low]
   - Current Barriers: [Preventive controls]
   - Barrier Effectiveness: [Assessment]

### Consequences (Potential Impacts)
1. **[Consequence 1]**
   - Severity: [High/Medium/Low]
   - Current Barriers: [Protective controls]
   - Barrier Effectiveness: [Assessment]

2. **[Consequence 2]**
   - Severity: [High/Medium/Low]
   - Current Barriers: [Protective controls]
   - Barrier Effectiveness: [Assessment]

### Barrier Analysis
**Effective Barriers:**
- [Barrier that worked well]
- [Why it was effective]

**Failed Barriers:**
- [Barrier that failed]
- [Why it failed]
- [How to improve]

**Missing Barriers:**
- [Needed preventive control]
- [Needed protective control]

### Recommendations
**Preventive Measures:**
1. [New barrier to prevent threat]
2. [Improvement to existing barrier]

**Protective Measures:**
1. [New barrier to mitigate consequence]
2. [Improvement to existing barrier]
```

---

## Framework Comparison

| Framework | Time Required | Complexity | Best For | Output |
|-----------|---------------|------------|----------|--------|
| **5 Whys** | 30-60 minutes | Low | Simple, linear causes | Clear cause chain |
| **Fishbone** | 1-2 hours | Medium | Complex, multi-factor | Comprehensive factor map |
| **Timeline** | 2-3 hours | Medium | Extended incidents | Process improvements |
| **Bow Tie** | 2-4 hours | High | High-risk incidents | Barrier strategy |

## Combining Frameworks

### 5 Whys + Fishbone
Use 5 Whys for initial analysis, then Fishbone to explore contributing factors.

### Timeline + 5 Whys
Use Timeline to identify key decision points, then 5 Whys on critical failures.

### Fishbone + Bow Tie
Use Fishbone to identify causes, then Bow Tie to develop a comprehensive prevention strategy.
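
When planning a combined analysis, a rough time budget can be derived from the comparison table. The sketch below assumes the frameworks are run one after another with no shared work, so the ranges simply add; in practice the second framework often reuses findings from the first, so treat the upper figure as a ceiling. The constant and function names are hypothetical.

```python
from __future__ import annotations

# Time ranges from the Framework Comparison table, converted to minutes.
TIME_REQUIRED_MINUTES = {
    "5 Whys": (30, 60),
    "Fishbone": (60, 120),
    "Timeline": (120, 180),
    "Bow Tie": (120, 240),
}


def estimate_minutes(frameworks: list[str]) -> tuple[int, int]:
    """Sum the low and high estimates, assuming strictly sequential sessions."""
    low = sum(TIME_REQUIRED_MINUTES[name][0] for name in frameworks)
    high = sum(TIME_REQUIRED_MINUTES[name][1] for name in frameworks)
    return low, high


# 5 Whys followed by Fishbone: 90 to 180 minutes under the assumption above.
budget = estimate_minutes(["5 Whys", "Fishbone"])
```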

## Quality Checklist

- [ ] Root causes address systemic issues, not symptoms
- [ ] Analysis is backed by evidence, not assumptions
- [ ] Multiple perspectives considered (technical, process, human)
- [ ] Recommendations are specific and actionable
- [ ] Analysis focuses on prevention, not blame
- [ ] Findings are validated against incident timeline
- [ ] Contributing factors are prioritized by impact
- [ ] Root causes link clearly to preventive actions

## Common Anti-Patterns

- **Human Error as Root Cause** - Dig deeper into why human error occurred
- **Single Root Cause** - Complex systems usually have multiple contributing factors
- **Technology-Only Focus** - Consider process and organizational factors
- **Blame Assignment** - Focus on system improvements, not individual fault
- **Generic Recommendations** - Provide specific, measurable actions
- **Surface-Level Analysis** - Ensure you've reached true root causes

---

**Last Updated:** February 2026
**Next Review:** August 2026
**Owner:** SRE Team + Engineering Leadership