Root Cause Analysis (RCA) Frameworks Guide
Overview
This guide explains how to apply four Root Cause Analysis (RCA) frameworks during Post-Incident Reviews: 5 Whys, Fishbone (Ishikawa), Timeline Analysis, and Bow Tie. Each framework approaches the search for underlying causes from a different angle, so the right choice depends on the type of incident.
Framework Selection Guidelines
| Incident Type | Recommended Framework | Why |
|---|---|---|
| Process Failure | 5 Whys | Simple, direct cause-effect chain |
| Complex System Failure | Fishbone + Timeline | Multiple contributing factors |
| Human Error | Fishbone | Systematic analysis of contributing factors |
| Extended Incidents | Timeline Analysis | Understanding decision points |
| High-Risk Incidents | Bow Tie | Comprehensive barrier analysis |
| Recurring Issues | 5 Whys + Fishbone | Deep dive into systemic issues |
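The selection table above can be encoded as a simple lookup. The following is a minimal sketch, not part of any existing tool: the dictionary keys and the `recommend_frameworks` function are hypothetical names chosen for illustration.

```python
# Hypothetical lookup mirroring the selection table; the keys and the
# function name are illustrative assumptions, not an existing API.
RECOMMENDED_FRAMEWORKS = {
    "process_failure": ["5 Whys"],
    "complex_system_failure": ["Fishbone", "Timeline"],
    "human_error": ["Fishbone"],
    "extended_incident": ["Timeline"],
    "high_risk_incident": ["Bow Tie"],
    "recurring_issue": ["5 Whys", "Fishbone"],
}


def recommend_frameworks(incident_types):
    """Return the de-duplicated frameworks for one or more incident types."""
    recommended = []
    for incident_type in incident_types:
        # Unknown incident types contribute nothing rather than raising.
        for framework in RECOMMENDED_FRAMEWORKS.get(incident_type, []):
            if framework not in recommended:
                recommended.append(framework)
    return recommended


print(recommend_frameworks(["human_error", "extended_incident"]))
```

An incident that fits several rows (for example, a human error during an extended outage) gets the union of the recommended frameworks, in table order.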
5 Whys Analysis Framework
Purpose
Iteratively drill down through cause-effect relationships to identify root causes.
When to Use
- Simple, linear cause-effect chains
- Time-pressured analysis
- Process-related failures
- Individual component failures
Process Steps
Step 1: Problem Statement
Write a clear, specific problem statement.
Good Example:
"The payment API returned 500 errors for 2 hours on March 15, affecting 80% of checkout attempts."
Poor Example:
"The system was broken."
Step 2: First Why
Ask why the problem occurred. Focus on immediate, observable causes.
Example:
- Why 1: Why did the payment API return 500 errors?
- Answer: The database connection pool was exhausted.
Step 3: Subsequent Whys
For each answer, ask "why" again. Continue until you reach a cause the team can act on, typically a missing or failing process or control rather than a symptom.
Example Chain:
- Why 2: Why was the database connection pool exhausted?
- Answer: The application was creating more connections than usual.
- Why 3: Why was the application creating more connections?
- Answer: A new feature wasn't properly closing connections.
- Why 4: Why wasn't the feature properly closing connections?
- Answer: Code review missed the connection leak pattern.
- Why 5: Why did code review miss this pattern?
- Answer: We don't have automated checks for connection pooling best practices.
Step 4: Validation
Verify that addressing the root cause would prevent the original problem.
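The four steps above can be sketched as a small data structure plus two checks. This is an illustration under stated assumptions: the `Why` class, `review_chain`, and `backward_trace` are hypothetical names, the three-why minimum is a default chosen for the sketch, and the evidence strings are placeholders rather than real data.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str
    answer: str
    evidence: str = ""  # logs, metrics, or observations backing the answer


def review_chain(whys, min_depth=3):
    """Return warnings about a 5 Whys chain; an empty list means none found."""
    warnings = []
    if len(whys) < min_depth:
        warnings.append(f"Only {len(whys)} whys asked; ask at least {min_depth}.")
    for number, why in enumerate(whys, start=1):
        if not why.evidence.strip():
            warnings.append(f"Why {number} has no supporting evidence.")
    return warnings


def backward_trace(problem, whys):
    """List the answers from the last why back to the problem (Step 4)."""
    steps = [why.answer for why in reversed(whys)] + [problem]
    return " -> ".join(steps)


problem = "The payment API returned 500 errors for 2 hours."
chain = [
    # Evidence strings below are placeholders for illustration only.
    Why("Why did the payment API return 500 errors?",
        "The database connection pool was exhausted.",
        "placeholder: pool saturation graph"),
    Why("Why was the database connection pool exhausted?",
        "The application was creating more connections than usual.",
        "placeholder: connection count metric"),
    Why("Why was the application creating more connections?",
        "A new feature wasn't properly closing connections."),
    Why("Why wasn't the feature properly closing connections?",
        "Code review missed the connection leak pattern."),
    Why("Why did code review miss this pattern?",
        "We don't have automated checks for connection pooling best practices."),
]

for warning in review_chain(chain):
    print(warning)
print(backward_trace(problem, chain))
```

Reading the backward trace aloud ("no automated checks, so review missed the leak, so ...") is a quick way to test whether each link actually follows from the previous one.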
Best Practices
- Ask at least 3 "whys" - Surface causes are rarely root causes
- Focus on process failures, not people - Avoid blame, focus on system improvements
- Use evidence - Support each answer with data or observations
- Consider multiple paths - Some problems have multiple root causes
- Test the logic - Work backwards from root cause to problem
Common Pitfalls
- Stopping too early - First few whys often reveal symptoms, not causes
- Single-cause assumption - Complex systems often have multiple contributing factors
- Blame focus - Focusing on individual mistakes rather than system failures
- Vague answers - Use specific, actionable answers
5 Whys Template
## 5 Whys Analysis
**Problem Statement:** [Clear description of the incident]
**Why 1:** [First why question]
**Answer:** [Specific, evidence-based answer]
**Evidence:** [Supporting data, logs, observations]
**Why 2:** [Second why question]
**Answer:** [Specific answer based on Why 1]
**Evidence:** [Supporting evidence]
[Continue for 3-7 iterations]
**Root Cause(s) Identified:**
1. [Primary root cause]
2. [Secondary root cause if applicable]
**Validation:** [Confirm that addressing root causes would prevent recurrence]
Fishbone (Ishikawa) Diagram Framework
Purpose
Systematically analyze potential causes across multiple categories to identify contributing factors.
When to Use
- Complex incidents with multiple potential causes
- When human factors are suspected
- Systemic or organizational issues
- When 5 Whys doesn't reveal clear root causes
Categories
People (Human Factors)
- Training and Skills
  - Insufficient training on new systems
  - Lack of domain expertise
  - Skill gaps in team
  - Knowledge not shared across team
- Communication
  - Poor communication between teams
  - Unclear responsibilities
  - Information not reaching right people
  - Language/cultural barriers
- Decision Making
  - Decisions made under pressure
  - Insufficient information for decisions
  - Risk assessment inadequate
  - Approval processes bypassed
Process (Procedures and Workflows)
- Documentation
  - Outdated procedures
  - Missing runbooks
  - Unclear instructions
  - Process not documented
- Change Management
  - Inadequate change review
  - Rushed deployments
  - Insufficient testing
  - Rollback procedures unclear
- Review and Approval
  - Code review gaps
  - Architecture review skipped
  - Security review insufficient
  - Performance review missing
Technology (Systems and Tools)
- Architecture
  - Single points of failure
  - Insufficient redundancy
  - Scalability limitations
  - Tight coupling between systems
- Monitoring and Alerting
  - Missing monitoring
  - Alert fatigue
  - Inadequate thresholds
  - Poor alert routing
- Tools and Automation
  - Manual processes prone to error
  - Tool limitations
  - Automation gaps
  - Integration issues
Environment (External Factors)
- Infrastructure
  - Hardware failures
  - Network issues
  - Capacity limitations
  - Geographic dependencies
- Dependencies
  - Third-party service failures
  - External API changes
  - Vendor issues
  - Supply chain problems
- External Pressure
  - Time pressure from business
  - Resource constraints
  - Regulatory changes
  - Market conditions
Process Steps
Step 1: Define the Problem
Place the incident at the "head" of the fishbone diagram.
Step 2: Brainstorm Causes
For each category, brainstorm potential contributing factors.
Step 3: Drill Down
For each factor, ask what caused that factor (sub-causes).
Step 4: Identify Primary Causes
Mark the most likely contributing factors based on evidence.
Step 5: Validate
Gather evidence to support or refute each suspected cause.
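Steps 2 through 5 can be sketched as grouping candidate factors by category and ranking them by supporting evidence. This is a minimal sketch under assumptions: the `Factor` class, the helper functions, and the use of an evidence count as the ranking score are illustrative choices, and the counts in the example are made up.

```python
from dataclasses import dataclass

CATEGORIES = ("People", "Process", "Technology", "Environment")


@dataclass
class Factor:
    category: str
    description: str
    evidence_count: int  # assumed score: independent observations supporting it


def group_by_category(factors):
    """Arrange factors under the four fishbone categories (Step 2)."""
    grouped = {category: [] for category in CATEGORIES}
    for factor in factors:
        if factor.category not in grouped:
            raise ValueError(f"Unknown category: {factor.category}")
        grouped[factor.category].append(factor)
    return grouped


def primary_factors(factors, top_n=3):
    """Keep evidence-backed factors and rank them (Steps 4 and 5)."""
    supported = [f for f in factors if f.evidence_count > 0]
    # sorted() is stable, so ties keep their original order.
    return sorted(supported, key=lambda f: f.evidence_count, reverse=True)[:top_n]


# Hypothetical brainstorm output; the evidence counts are illustrative.
factors = [
    Factor("People", "Decisions made under pressure", 1),
    Factor("Process", "Code review gaps", 3),
    Factor("Technology", "Missing monitoring", 2),
    Factor("Environment", "Third-party service failures", 0),
]

for factor in primary_factors(factors):
    print(f"{factor.category}: {factor.description} ({factor.evidence_count})")
```

Factors with no supporting evidence are dropped from the ranking rather than deleted from the brainstorm, so they remain visible in the grouped view if new evidence turns up.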
Fishbone Template
## Fishbone Analysis
**Problem:** [Incident description]
### People
**Training/Skills:**
- [Factor 1]: [Evidence/likelihood]
- [Factor 2]: [Evidence/likelihood]
**Communication:**
- [Factor 1]: [Evidence/likelihood]
**Decision Making:**
- [Factor 1]: [Evidence/likelihood]
### Process
**Documentation:**
- [Factor 1]: [Evidence/likelihood]
**Change Management:**
- [Factor 1]: [Evidence/likelihood]
**Review/Approval:**
- [Factor 1]: [Evidence/likelihood]
### Technology
**Architecture:**
- [Factor 1]: [Evidence/likelihood]
**Monitoring:**
- [Factor 1]: [Evidence/likelihood]
**Tools:**
- [Factor 1]: [Evidence/likelihood]
### Environment
**Infrastructure:**
- [Factor 1]: [Evidence/likelihood]
**Dependencies:**
- [Factor 1]: [Evidence/likelihood]
**External Factors:**
- [Factor 1]: [Evidence/likelihood]
### Primary Contributing Factors
1. [Factor with highest evidence/impact]
2. [Second most significant factor]
3. [Third most significant factor]
### Root Cause Hypothesis
[Synthesized explanation of how factors combined to cause incident]
Timeline Analysis Framework
Purpose
Analyze the chronological sequence of events to identify decision points, missed opportunities, and process gaps.
When to Use
- Extended incidents (> 1 hour)
- Complex multi-phase incidents
- When response effectiveness is questioned
- Communication or coordination failures
Analysis Dimensions
Detection Analysis
- Time to Detection: How long from onset to first alert?
- Detection Method: How was the incident first identified?
- Alert Effectiveness: Were the right people notified quickly?
- False Negatives: What signals were missed?
Response Analysis
- Time to Response: How long from detection to first response action?
- Escalation Timing: Were escalations timely and appropriate?
- Resource Mobilization: How quickly were the right people engaged?
- Decision Points: What key decisions were made and when?
Communication Analysis
- Internal Communication: How effective was team coordination?
- External Communication: Were stakeholders informed appropriately?
- Communication Gaps: Where did information flow break down?
- Update Frequency: Were updates provided at appropriate intervals?
Resolution Analysis
- Mitigation Strategy: Was the chosen approach optimal?
- Alternative Paths: What other options were considered?
- Resource Allocation: Were resources used effectively?
- Verification: How was resolution confirmed?
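The detection and response questions above reduce to differences between a few timestamps. The sketch below assumes ISO 8601 timestamps and hypothetical field names (`onset`, `first_alert`, `first_response`, `resolved`); the values are invented for illustration.

```python
from datetime import datetime


def minutes_between(start, end):
    """Elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


# Hypothetical incident timestamps, for illustration only.
incident = {
    "onset": "2026-01-10T14:00:00",
    "first_alert": "2026-01-10T14:12:00",
    "first_response": "2026-01-10T14:20:00",
    "resolved": "2026-01-10T16:05:00",
}

metrics = {
    # Onset to first alert.
    "time_to_detection_min": minutes_between(incident["onset"], incident["first_alert"]),
    # Detection to first response action.
    "time_to_response_min": minutes_between(incident["first_alert"], incident["first_response"]),
    # Onset to confirmed resolution.
    "time_to_resolution_min": minutes_between(incident["onset"], incident["resolved"]),
}

for name, value in metrics.items():
    print(f"{name}: {value:.0f}")
```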
Process Steps
Step 1: Event Reconstruction
Create comprehensive timeline with all available events.
Step 2: Phase Identification
Identify distinct phases (detection, triage, escalation, mitigation, resolution).
Step 3: Gap Analysis
Identify time gaps and analyze their causes.
Step 4: Decision Point Analysis
Examine key decision points and alternative paths.
Step 5: Effectiveness Assessment
Evaluate the overall effectiveness of the response.
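Steps 1 through 3 can be sketched as sorting events, deriving a start and end per phase, and flagging long silences between consecutive events. This is an illustration, not the behavior of any particular tool: the event tuples, function names, and the 30-minute gap threshold are assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical events: (timestamp, phase, description).
events = [
    ("2026-01-10T14:12:00", "detection", "Alert fired"),
    ("2026-01-10T14:20:00", "triage", "On-call acknowledged"),
    ("2026-01-10T14:25:00", "triage", "Impact assessed"),
    ("2026-01-10T15:10:00", "escalation", "Database team paged"),
    ("2026-01-10T15:30:00", "mitigation", "Connection pool raised"),
    ("2026-01-10T16:05:00", "resolution", "Error rate back to normal"),
]


def parse(events):
    """Step 1: parse timestamps and order events chronologically."""
    return sorted((datetime.fromisoformat(ts), phase, desc) for ts, phase, desc in events)


def phase_spans(events):
    """Step 2: first and last event time for each phase."""
    spans = {}
    for ts, phase, _ in parse(events):
        start, _ = spans.get(phase, (ts, ts))
        spans[phase] = (start, ts)
    return spans


def find_gaps(events, threshold=timedelta(minutes=30)):
    """Step 3: pairs of consecutive events separated by more than threshold."""
    parsed = parse(events)
    gaps = []
    for (t1, _, before), (t2, _, after) in zip(parsed, parsed[1:]):
        if t2 - t1 > threshold:
            gaps.append((before, after, t2 - t1))
    return gaps


for before, after, duration in find_gaps(events):
    print(f"{duration} between '{before}' and '{after}'")
```

Each flagged gap is a prompt for Step 3's question (why did nothing visible happen here?), not a finding in itself; quiet periods can also reflect work that was never logged.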
Timeline Template
## Timeline Analysis
### Incident Phases
1. **Detection** ([start] - [end], [duration])
2. **Triage** ([start] - [end], [duration])
3. **Escalation** ([start] - [end], [duration])
4. **Mitigation** ([start] - [end], [duration])
5. **Resolution** ([start] - [end], [duration])
### Key Decision Points
**[Timestamp]:** [Decision made]
- **Context:** [Situation at time of decision]
- **Alternatives:** [Other options considered]
- **Outcome:** [Result of decision]
- **Assessment:** [Was this optimal?]
### Communication Timeline
**[Timestamp]:** [Communication event]
- **Channel:** [Slack/Email/Phone/etc.]
- **Audience:** [Who was informed]
- **Content:** [What was communicated]
- **Effectiveness:** [Assessment]
### Gaps and Delays
**[Time Period]:** [Description of gap]
- **Duration:** [Length of gap]
- **Cause:** [Why did gap occur]
- **Impact:** [Effect on incident response]
### Response Effectiveness
**Strengths:**
- [What went well]
- [Effective decisions/actions]
**Weaknesses:**
- [What could be improved]
- [Missed opportunities]
### Root Causes from Timeline
1. [Process-based root cause]
2. [Communication-based root cause]
3. [Decision-making root cause]
Bow Tie Analysis Framework
Purpose
Analyze both the preventive measures (left side) that should stop threats from causing the top event and the protective measures (right side) that limit its consequences.
When to Use
- High-severity incidents (SEV1)
- Security incidents
- Safety-critical systems
- When comprehensive barrier analysis is needed
Components
Hazards
What conditions create the potential for incidents?
Examples:
- High traffic loads
- Software deployments
- Human interactions with critical systems
- Third-party dependencies
Top Event
What actually went wrong? This is the center of the bow tie.
Examples:
- "Database became unresponsive"
- "Payment processing failed"
- "User authentication service crashed"
Threats (Left Side)
What specific causes could lead to the top event?
Examples:
- Code defects in new deployment
- Database connection pool exhaustion
- Network connectivity issues
- DDoS attack
Consequences (Right Side)
What are the potential impacts of the top event?
Examples:
- Revenue loss
- Customer churn
- Regulatory violations
- Brand damage
- Data loss
Barriers
What controls exist (or could exist) to prevent threats or mitigate consequences?
Preventive Barriers (Left Side):
- Code reviews
- Automated testing
- Load testing
- Input validation
- Rate limiting
Protective Barriers (Right Side):
- Circuit breakers
- Failover systems
- Backup procedures
- Customer communication
- Rollback capabilities
Process Steps
Step 1: Define the Top Event
Clearly state what went wrong.
Step 2: Identify Threats
Brainstorm all possible causes that could lead to the top event.
Step 3: Identify Consequences
List all potential impacts of the top event.
Step 4: Map Existing Barriers
Identify current controls for each threat and consequence.
Step 5: Assess Barrier Effectiveness
Evaluate how well each barrier worked (or failed).
Step 6: Recommend Additional Barriers
Identify new controls needed to prevent recurrence.
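The six steps above can be sketched as a small model that maps barriers to each threat and consequence, then reports which barriers failed and which items have no effective barrier. The `Barrier` and `BowTie` classes are hypothetical, and the effectiveness values in the example are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Barrier:
    name: str
    side: str        # "preventive" (left side) or "protective" (right side)
    effective: bool  # did the barrier hold during the incident? (Step 5)


@dataclass
class BowTie:
    top_event: str
    threats: dict = field(default_factory=dict)       # threat -> list of Barrier
    consequences: dict = field(default_factory=dict)  # consequence -> list of Barrier

    def _all_items(self):
        return {**self.threats, **self.consequences}

    def failed_barriers(self):
        """Names of barriers that existed but did not work."""
        return [b.name for barriers in self._all_items().values()
                for b in barriers if not b.effective]

    def unguarded(self):
        """Threats and consequences with no effective barrier (input to Step 6)."""
        return [name for name, barriers in self._all_items().items()
                if not any(b.effective for b in barriers)]


# Hypothetical assessment, for illustration only.
analysis = BowTie(
    top_event="Database became unresponsive",
    threats={
        "Database connection pool exhaustion": [
            Barrier("Code reviews", "preventive", False),
            Barrier("Load testing", "preventive", False),
        ],
        "DDoS attack": [Barrier("Rate limiting", "preventive", True)],
    },
    consequences={
        "Revenue loss": [Barrier("Failover systems", "protective", True)],
        "Data loss": [],
    },
)

print("Failed:", analysis.failed_barriers())
print("Unguarded:", analysis.unguarded())
```

The distinction matters for recommendations: a failed barrier usually calls for an improvement to an existing control, while an unguarded item with an empty barrier list calls for a new one.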
Bow Tie Template
## Bow Tie Analysis
**Top Event:** [What went wrong]
### Threats (Potential Causes)
1. **[Threat 1]**
   - Likelihood: [High/Medium/Low]
   - Current Barriers: [Preventive controls]
   - Barrier Effectiveness: [Assessment]
2. **[Threat 2]**
   - Likelihood: [High/Medium/Low]
   - Current Barriers: [Preventive controls]
   - Barrier Effectiveness: [Assessment]
### Consequences (Potential Impacts)
1. **[Consequence 1]**
   - Severity: [High/Medium/Low]
   - Current Barriers: [Protective controls]
   - Barrier Effectiveness: [Assessment]
2. **[Consequence 2]**
   - Severity: [High/Medium/Low]
   - Current Barriers: [Protective controls]
   - Barrier Effectiveness: [Assessment]
### Barrier Analysis
**Effective Barriers:**
- [Barrier that worked well]
- [Why it was effective]
**Failed Barriers:**
- [Barrier that failed]
- [Why it failed]
- [How to improve]
**Missing Barriers:**
- [Needed preventive control]
- [Needed protective control]
### Recommendations
**Preventive Measures:**
1. [New barrier to prevent threat]
2. [Improvement to existing barrier]
**Protective Measures:**
1. [New barrier to mitigate consequence]
2. [Improvement to existing barrier]
Framework Comparison
| Framework | Time Required | Complexity | Best For | Output |
|---|---|---|---|---|
| 5 Whys | 30-60 minutes | Low | Simple, linear causes | Clear cause chain |
| Fishbone | 1-2 hours | Medium | Complex, multi-factor | Comprehensive factor map |
| Timeline | 2-3 hours | Medium | Extended incidents | Process improvements |
| Bow Tie | 2-4 hours | High | High-risk incidents | Barrier strategy |
Combining Frameworks
5 Whys + Fishbone
Use 5 Whys for initial analysis, then Fishbone to explore contributing factors.
Timeline + 5 Whys
Use Timeline to identify key decision points, then 5 Whys on critical failures.
Fishbone + Bow Tie
Use Fishbone to identify causes, then Bow Tie to develop comprehensive prevention strategy.
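When planning a combined analysis, a rough time budget can be derived from the "Time Required" column of the Framework Comparison table. The sketch below assumes the ranges simply add up, which may overestimate when the second framework reuses findings from the first; `combined_estimate` is a hypothetical helper.

```python
# Time ranges in hours, copied from the Framework Comparison table.
TIME_REQUIRED_HOURS = {
    "5 Whys": (0.5, 1.0),
    "Fishbone": (1.0, 2.0),
    "Timeline": (2.0, 3.0),
    "Bow Tie": (2.0, 4.0),
}


def combined_estimate(frameworks):
    """Sum the low and high ends of each framework's range (assumed additive)."""
    low = sum(TIME_REQUIRED_HOURS[name][0] for name in frameworks)
    high = sum(TIME_REQUIRED_HOURS[name][1] for name in frameworks)
    return low, high


low, high = combined_estimate(["Fishbone", "Bow Tie"])
print(f"Fishbone + Bow Tie: {low:g} to {high:g} hours")
```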
Quality Checklist
- Root causes address systemic issues, not symptoms
- Analysis is backed by evidence, not assumptions
- Multiple perspectives considered (technical, process, human)
- Recommendations are specific and actionable
- Analysis focuses on prevention, not blame
- Findings are validated against incident timeline
- Contributing factors are prioritized by impact
- Root causes link clearly to preventive actions
Common Anti-Patterns
- Human Error as Root Cause - Dig deeper into why human error occurred
- Single Root Cause - Complex systems usually have multiple contributing factors
- Technology-Only Focus - Consider process and organizational factors
- Blame Assignment - Focus on system improvements, not individual fault
- Generic Recommendations - Provide specific, measurable actions
- Surface-Level Analysis - Ensure you've reached true root causes
Last Updated: February 2026
Next Review: August 2026
Owner: SRE Team + Engineering Leadership