202 lines
6.7 KiB
Markdown
202 lines
6.7 KiB
Markdown
# incident-commander reference
|
|
|
|
## Reference Information
|
|
- **Architecture Diagram:** {link}
|
|
- **Monitoring Dashboard:** {link}
|
|
- **Related Runbooks:** {links to dependent service runbooks}
|
|
```
|
|
|
|
### Post-Incident Review (PIR) Framework
|
|
|
|
#### PIR Timeline and Ownership
|
|
|
|
**Timeline:**
|
|
- **24 hours:** Initial PIR draft completed by Incident Commander
|
|
- **3 business days:** Final PIR published with all stakeholder input
|
|
- **1 week:** Action items assigned with owners and due dates
|
|
- **4 weeks:** Follow-up review on action item progress
|
|
|
|
**Roles:**
|
|
- **PIR Owner:** Incident Commander (can delegate writing but owns completion)
|
|
- **Technical Contributors:** All engineers involved in response
|
|
- **Review Committee:** Engineering leadership, affected product teams
|
|
- **Action Item Owners:** Assigned based on expertise and capacity
|
|
|
|
#### Root Cause Analysis Frameworks
|
|
|
|
#### 1. Five Whys Method
|
|
|
|
The Five Whys technique involves asking "why" repeatedly to drill down to root causes:
|
|
|
|
**Example Application:**
|
|
- **Problem:** Database became unresponsive during peak traffic
|
|
- **Why 1:** Why did the database become unresponsive? → Connection pool was exhausted
|
|
- **Why 2:** Why was the connection pool exhausted? → Application was creating more connections than usual
|
|
- **Why 3:** Why was the application creating more connections? → New feature wasn't properly connection pooling
|
|
- **Why 4:** Why wasn't the feature properly connection pooling? → Code review missed this pattern
|
|
- **Why 5:** Why did code review miss this? → No automated checks for connection pooling patterns
|
|
|
|
**Best Practices:**
|
|
- Ask "why" at least 3 times, often need 5+ iterations
|
|
- Focus on process failures, not individual blame
|
|
- Each "why" should point to a actionable system improvement
|
|
- Consider multiple root cause paths, not just one linear chain
|
|
|
|
#### 2. Fishbone (Ishikawa) Diagram
|
|
|
|
Systematic analysis across multiple categories of potential causes:
|
|
|
|
**Categories:**
|
|
- **People:** Training, experience, communication, handoffs
|
|
- **Process:** Procedures, change management, review processes
|
|
- **Technology:** Architecture, tooling, monitoring, automation
|
|
- **Environment:** Infrastructure, dependencies, external factors
|
|
|
|
**Application Method:**
|
|
1. State the problem clearly at the "head" of the fishbone
|
|
2. For each category, brainstorm potential contributing factors
|
|
3. For each factor, ask what caused that factor (sub-causes)
|
|
4. Identify the factors most likely to be root causes
|
|
5. Validate root causes with evidence from the incident
|
|
|
|
#### 3. Timeline Analysis
|
|
|
|
Reconstruct the incident chronologically to identify decision points and missed opportunities:
|
|
|
|
**Timeline Elements:**
|
|
- **Detection:** When was the issue first observable? When was it first detected?
|
|
- **Notification:** How quickly were the right people informed?
|
|
- **Response:** What actions were taken and how effective were they?
|
|
- **Communication:** When were stakeholders updated?
|
|
- **Resolution:** What finally resolved the issue?
|
|
|
|
**Analysis Questions:**
|
|
- Where were there delays and what caused them?
|
|
- What decisions would we make differently with perfect information?
|
|
- Where did communication break down?
|
|
- What automation could have detected/resolved faster?
|
|
|
|
### Escalation Paths
|
|
|
|
#### Technical Escalation
|
|
|
|
**Level 1:** On-call engineer
|
|
- **Responsibility:** Initial response and common issue resolution
|
|
- **Escalation Trigger:** Issue not resolved within SLA timeframe
|
|
- **Timeframe:** 15 minutes (SEV1), 30 minutes (SEV2)
|
|
|
|
**Level 2:** Senior engineer/Team lead
|
|
- **Responsibility:** Complex technical issues requiring deeper expertise
|
|
- **Escalation Trigger:** Level 1 requests help or timeout occurs
|
|
- **Timeframe:** 30 minutes (SEV1), 1 hour (SEV2)
|
|
|
|
**Level 3:** Engineering Manager/Staff Engineer
|
|
- **Responsibility:** Cross-team coordination and architectural decisions
|
|
- **Escalation Trigger:** Issue spans multiple systems or teams
|
|
- **Timeframe:** 45 minutes (SEV1), 2 hours (SEV2)
|
|
|
|
**Level 4:** Director of Engineering/CTO
|
|
- **Responsibility:** Resource allocation and business impact decisions
|
|
- **Escalation Trigger:** Extended outage or significant business impact
|
|
- **Timeframe:** 1 hour (SEV1), 4 hours (SEV2)
|
|
|
|
#### Business Escalation
|
|
|
|
**Customer Impact Assessment:**
|
|
- **High:** Revenue loss, SLA breaches, customer churn risk
|
|
- **Medium:** User experience degradation, support ticket volume
|
|
- **Low:** Internal tools, development impact only
|
|
|
|
**Escalation Matrix:**
|
|
|
|
| Severity | Duration | Business Escalation |
|
|
|----------|----------|-------------------|
|
|
| SEV1 | Immediate | VP Engineering |
|
|
| SEV1 | 30 minutes | CTO + Customer Success VP |
|
|
| SEV1 | 1 hour | CEO + Full Executive Team |
|
|
| SEV2 | 2 hours | VP Engineering |
|
|
| SEV2 | 4 hours | CTO |
|
|
| SEV3 | 1 business day | Engineering Manager |
|
|
|
|
### Status Page Management
|
|
|
|
#### Update Principles
|
|
|
|
1. **Transparency:** Provide factual information without speculation
|
|
2. **Timeliness:** Update within committed timeframes
|
|
3. **Clarity:** Use customer-friendly language, avoid technical jargon
|
|
4. **Completeness:** Include impact scope, status, and next update time
|
|
|
|
#### Status Categories
|
|
|
|
- **Operational:** All systems functioning normally
|
|
- **Degraded Performance:** Some users may experience slowness
|
|
- **Partial Outage:** Subset of features unavailable
|
|
- **Major Outage:** Service unavailable for most/all users
|
|
- **Under Maintenance:** Planned maintenance window
|
|
|
|
#### Update Template
|
|
|
|
```
|
|
{Timestamp} - {Status Category}
|
|
|
|
{Brief description of current state}
|
|
|
|
Impact: {who is affected and how}
|
|
Cause: {root cause if known, "under investigation" if not}
|
|
Resolution: {what's being done to fix it}
|
|
|
|
Next update: {specific time}
|
|
|
|
We apologize for any inconvenience this may cause.
|
|
```
|
|
|
|
### Action Item Framework
|
|
|
|
#### Action Item Categories
|
|
|
|
1. **Immediate Fixes**
|
|
- Critical bugs discovered during incident
|
|
- Security vulnerabilities exposed
|
|
- Data integrity issues
|
|
|
|
2. **Process Improvements**
|
|
- Communication gaps
|
|
- Escalation procedure updates
|
|
- Runbook additions/updates
|
|
|
|
3. **Technical Debt**
|
|
- Architecture improvements
|
|
- Monitoring enhancements
|
|
- Automation opportunities
|
|
|
|
4. **Organizational Changes**
|
|
- Team structure adjustments
|
|
- Training requirements
|
|
- Tool/platform investments
|
|
|
|
#### Action Item Template
|
|
|
|
```
|
|
**Title:** {Concise description of the action}
|
|
**Priority:** {Critical/High/Medium/Low}
|
|
**Category:** {Fix/Process/Technical/Organizational}
|
|
**Owner:** {Assigned person}
|
|
**Due Date:** {Specific date}
|
|
**Success Criteria:** {How will we know this is complete}
|
|
**Dependencies:** {What needs to happen first}
|
|
**Related PIRs:** {Links to other incidents this addresses}
|
|
|
|
**Description:**
|
|
{Detailed description of what needs to be done and why}
|
|
|
|
**Implementation Plan:**
|
|
1. {Step 1}
|
|
2. {Step 2}
|
|
3. {Validation step}
|
|
|
|
**Progress Updates:**
|
|
- {Date}: {Progress update}
|
|
- {Date}: {Progress update}
|
|
```
|