# Incident Severity Classification Matrix

## Overview

This document defines the severity classification system used for incident response. The classification determines response requirements, escalation paths, and communication frequency.

## Severity Levels

### SEV1 - Critical Outage

**Definition:** Complete service failure affecting all users or critical business functions

#### Impact Criteria
- Customer-facing services completely unavailable
- Data loss or corruption affecting users
- Security breaches with customer data exposure
- Revenue-generating systems down
- SLA violations with financial penalties
- > 75% of users affected

#### Response Requirements
| Metric | Requirement |
|--------|-------------|
| **Response Time** | Immediate (0-5 minutes) |
| **Incident Commander** | Assigned within 5 minutes |
| **War Room** | Established within 10 minutes |
| **Executive Notification** | Within 15 minutes |
| **Public Status Page** | Updated within 15 minutes |
| **Customer Communication** | Within 30 minutes |

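The SEV1 requirements above are all deadlines measured from detection, so they can be checked mechanically. The following is a minimal sketch, not part of the skill's scripts: `SEV1_DEADLINES` and `overdue_milestones` are hypothetical names, and the "Immediate (0-5 minutes)" response time is assumed to mean a 5-minute upper bound.

```python
from datetime import datetime, timedelta

# Hypothetical encoding of the SEV1 table: minutes allowed after detection.
SEV1_DEADLINES = {
    "response": 5,  # assumption: upper bound of "0-5 minutes"
    "incident_commander": 5,
    "war_room": 10,
    "executive_notification": 15,
    "public_status_page": 15,
    "customer_communication": 30,
}


def overdue_milestones(detected_at, now, completed):
    """Return the SEV1 milestones that are past deadline and not yet completed."""
    elapsed = now - detected_at
    return sorted(
        name
        for name, minutes in SEV1_DEADLINES.items()
        if name not in completed and elapsed > timedelta(minutes=minutes)
    )


detected = datetime(2026, 2, 1, 12, 0)
done = {"response", "incident_commander"}
print(overdue_milestones(detected, detected + timedelta(minutes=12), done))
# ['war_room']
```

Twelve minutes in, only the war room (due at 10 minutes) is late; the 15-minute and 30-minute milestones are still within their windows.
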
#### Escalation Path
1. **Immediate**: On-call Engineer → Incident Commander
2. **15 minutes**: VP Engineering + Customer Success VP
3. **30 minutes**: CTO
4. **60 minutes**: CEO + Full Executive Team

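The escalation path is cumulative: each threshold adds people to those already engaged. A minimal sketch of that reading, where `SEV1_ESCALATION` and `engaged_by` are hypothetical names chosen for illustration:

```python
# Hypothetical encoding of the SEV1 escalation path:
# (minutes since the incident started, roles brought in at that point).
SEV1_ESCALATION = [
    (0, ["On-call Engineer", "Incident Commander"]),
    (15, ["VP Engineering", "Customer Success VP"]),
    (30, ["CTO"]),
    (60, ["CEO", "Full Executive Team"]),
]


def engaged_by(elapsed_minutes):
    """Return every role that should be engaged after the given elapsed time."""
    roles = []
    for threshold, added in SEV1_ESCALATION:
        if elapsed_minutes >= threshold:
            roles.extend(added)
    return roles


print(engaged_by(35))
# ['On-call Engineer', 'Incident Commander', 'VP Engineering', 'Customer Success VP', 'CTO']
```
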
#### Communication Requirements
- **Frequency**: Every 15 minutes until resolution
- **Channels**: PagerDuty, Phone, Slack, Email, Status Page
- **Recipients**: All engineering, executives, customer success
- **Template**: SEV1 Executive Alert Template

---

### SEV2 - Major Impact

**Definition:** Significant degradation affecting a subset of users or non-critical functions

#### Impact Criteria
- Partial service degradation (25-75% of users affected)
- Performance issues causing user frustration
- Non-critical features unavailable
- Internal tools impacting productivity
- Data inconsistencies not affecting user experience
- API errors affecting integrations

#### Response Requirements
| Metric | Requirement |
|--------|-------------|
| **Response Time** | 15 minutes |
| **Incident Commander** | Assigned within 30 minutes |
| **Status Page Update** | Within 30 minutes |
| **Stakeholder Notification** | Within 1 hour |
| **Team Assembly** | Within 30 minutes |

#### Escalation Path
1. **Immediate**: On-call Engineer → Team Lead
2. **30 minutes**: Engineering Manager
3. **2 hours**: VP Engineering
4. **4 hours**: CTO (if unresolved)

#### Communication Requirements
- **Frequency**: Every 30 minutes during active response
- **Channels**: PagerDuty, Slack, Email
- **Recipients**: Engineering team, product team, relevant stakeholders
- **Template**: SEV2 Major Impact Template

---

### SEV3 - Minor Impact

**Definition:** Limited impact with workarounds available

#### Impact Criteria
- Single feature or component affected
- < 25% of users impacted
- Workarounds available
- Performance degradation not significantly impacting UX
- Non-urgent monitoring alerts
- Development/test environment issues

#### Response Requirements
| Metric | Requirement |
|--------|-------------|
| **Response Time** | 2 hours (business hours) |
| **After Hours Response** | Next business day |
| **Team Assignment** | Within 4 hours |
| **Status Page Update** | Optional |
| **Internal Notification** | Within 2 hours |

#### Escalation Path
1. **Immediate**: Assigned Engineer
2. **4 hours**: Team Lead
3. **1 business day**: Engineering Manager (if needed)

#### Communication Requirements
- **Frequency**: At key milestones only
- **Channels**: Slack, Email
- **Recipients**: Assigned team, team lead
- **Template**: SEV3 Minor Impact Template

---

### SEV4 - Low Impact

**Definition:** Minimal impact, cosmetic issues, or planned maintenance

#### Impact Criteria
- Cosmetic bugs
- Documentation issues
- Logging or monitoring gaps
- Performance issues with no user impact
- Development/test environment issues
- Feature requests or enhancements

#### Response Requirements
| Metric | Requirement |
|--------|-------------|
| **Response Time** | 1-2 business days |
| **Assignment** | Next sprint planning |
| **Tracking** | Standard ticket system |
| **Escalation** | None required |

#### Communication Requirements
- **Frequency**: Standard development cycle updates
- **Channels**: Ticket system
- **Recipients**: Product owner, assigned developer
- **Template**: Standard issue template

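Only SEV1 and SEV2 carry a fixed update cadence (15 and 30 minutes); SEV3 and SEV4 updates are driven by milestones or the normal development cycle. A small sketch of how a tool might compute the next update time, assuming hypothetical names `UPDATE_INTERVAL_MINUTES` and `next_update_due`:

```python
from datetime import datetime, timedelta

# Update cadence per severity level, taken from the Communication Requirements
# sections. None means there is no fixed timer (milestone or cycle driven).
UPDATE_INTERVAL_MINUTES = {1: 15, 2: 30, 3: None, 4: None}


def next_update_due(severity, last_update):
    """Return when the next status update is due, or None if not timer-based."""
    interval = UPDATE_INTERVAL_MINUTES[severity]
    if interval is None:
        return None
    return last_update + timedelta(minutes=interval)


last = datetime(2026, 2, 1, 12, 0)
print(next_update_due(1, last))  # 2026-02-01 12:15:00
print(next_update_due(3, last))  # None
```
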
## Classification Guidelines

### User Impact Assessment

| Impact Scope | Description | Typical Severity |
|--------------|-------------|------------------|
| **All or Most Users** | > 75% of users affected | SEV1 |
| **Major Subset** | 50-75% of users affected | SEV1/SEV2 |
| **Significant Subset** | 25-50% of users affected | SEV2 |
| **Limited Users** | 5-25% of users affected | SEV2/SEV3 |
| **Few Users** | < 5% of users affected | SEV3/SEV4 |
| **No User Impact** | Internal only | SEV4 |

### Business Impact Assessment

| Business Impact | Description | Severity Boost |
|-----------------|-------------|----------------|
| **Revenue Loss** | Direct revenue impact | +1 severity level |
| **SLA Breach** | Contract violations | +1 severity level |
| **Regulatory** | Compliance implications | +1 severity level |
| **Brand Damage** | Public-facing issues | +1 severity level |
| **Security** | Data or system security | +2 severity levels |

A boost moves the incident toward SEV1: for example, +1 raises a SEV3 to SEV2. SEV1 is the ceiling.

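The boost arithmetic is easy to get backwards because a higher severity has a lower number. The sketch below shows one way to apply it. The names `BOOSTS` and `apply_business_impact` are hypothetical, and the choice to apply only the largest single boost (rather than summing several) is an assumption, since the table does not say how multiple factors combine.

```python
# Severity boosts from the Business Impact Assessment table.
BOOSTS = {
    "revenue_loss": 1,
    "sla_breach": 1,
    "regulatory": 1,
    "brand_damage": 1,
    "security": 2,
}


def apply_business_impact(severity, factors):
    """Raise severity (1 = SEV1 ... 4 = SEV4) by the largest applicable boost.

    Assumption: boosts do not stack; only the largest one applies.
    The result is clamped so it never goes past SEV1.
    """
    boost = max((BOOSTS[f] for f in factors), default=0)
    return max(1, severity - boost)


print(apply_business_impact(3, ["sla_breach"]))              # 2
print(apply_business_impact(2, ["security"]))                # 1
print(apply_business_impact(4, ["revenue_loss", "security"]))  # 2
print(apply_business_impact(3, []))                          # 3
```
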
### Duration Considerations

| Duration | Impact on Classification |
|----------|--------------------------|
| **< 15 minutes** | May reduce severity by 1 level |
| **15-60 minutes** | Standard classification |
| **1-4 hours** | May increase severity by 1 level |
| **> 4 hours** | Significant severity increase |

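Because the duration guidance is advisory ("may reduce", "may increase"), a tool can surface it as a recommendation rather than change the severity itself. A minimal lookup sketch, with `duration_guidance` as a hypothetical helper; treating exactly 60 minutes as "standard" and exactly 4 hours as "1-4 hours" is an assumption about the overlapping bucket edges.

```python
def duration_guidance(minutes):
    """Return the table's guidance text for an incident of the given duration."""
    if minutes < 15:
        return "May reduce severity by 1 level"
    if minutes <= 60:  # assumption: 60 minutes still counts as standard
        return "Standard classification"
    if minutes <= 240:  # assumption: exactly 4 hours falls in the 1-4 hour bucket
        return "May increase severity by 1 level"
    return "Significant severity increase"


print(duration_guidance(10))   # May reduce severity by 1 level
print(duration_guidance(90))   # May increase severity by 1 level
print(duration_guidance(300))  # Significant severity increase
```
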
## Decision Tree

```
1. Is this a security incident with data exposure?
   → YES: SEV1 (regardless of user count)
   → NO: Continue to step 2

2. Are revenue-generating services completely down?
   → YES: SEV1
   → NO: Continue to step 3

3. What percentage of users are affected?
   → > 75%: SEV1
   → 25-75%: SEV2
   → 5-25%: SEV3
   → < 5%: SEV4

4. Apply business impact modifiers
5. Consider duration factors
6. When in doubt, err on higher severity
```

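Steps 1 to 3 of the decision tree can be sketched as a single function. This is an illustration, not the logic of `incident_classifier.py`: `base_severity` and its parameters are hypothetical names. The ranges in step 3 share their endpoints (25% and 5% each appear in two buckets), so the sketch resolves those ties upward, following step 6.

```python
def base_severity(data_exposure, revenue_services_down, pct_users_affected):
    """Return the base severity (1 = SEV1 ... 4 = SEV4) from steps 1-3.

    Business impact modifiers and duration (steps 4-5) are not applied here.
    """
    # Step 1: a security incident with data exposure is SEV1 regardless of user count.
    if data_exposure:
        return 1
    # Step 2: revenue-generating services completely down is SEV1.
    if revenue_services_down:
        return 1
    # Step 3: classify by share of users affected. Shared endpoints (25%, 5%)
    # go to the higher severity, per "when in doubt, err on higher severity".
    if pct_users_affected > 75:
        return 1
    if pct_users_affected >= 25:
        return 2
    if pct_users_affected >= 5:
        return 3
    return 4


print(base_severity(False, False, 40))  # 2
print(base_severity(True, False, 1))    # 1
print(base_severity(False, False, 25))  # 2
print(base_severity(False, False, 3))   # 4
```
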
## Examples

### SEV1 Examples
- Payment processing system completely down
- All user authentication failing
- Database corruption causing data loss
- Security breach with customer data exposed
- Website returning 500 errors for all users

### SEV2 Examples
- Payment processing slow (30-second delays)
- Search functionality returning incomplete results
- API rate limits causing partner integration issues
- Dashboard displaying stale data (> 1 hour old)
- Mobile app crashing for 40% of users

### SEV3 Examples
- Single feature in admin panel not working
- Email notifications delayed by 1 hour
- Non-critical API endpoint returning errors
- Cosmetic UI bug in settings page
- Development environment deployment failing

### SEV4 Examples
- Typo in help documentation
- Log format change needed for analysis
- Non-critical performance optimization
- Internal tool enhancement request
- Test data cleanup needed

## Escalation Triggers

### Automatic Escalation
- SEV1 incidents automatically escalate every 30 minutes if unresolved
- SEV2 incidents escalate after 2 hours without significant progress
- Any incident with expanding scope increases severity
- Customer escalation to support triggers severity review

### Manual Escalation
- Incident Commander can escalate at any time
- Technical leads can request escalation
- Business stakeholders can request severity review
- External factors (media attention, regulatory) trigger escalation

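The two time-based automatic triggers can be expressed as a simple counter. This is a sketch under stated assumptions: `escalations_due` is a hypothetical helper, "significant progress" is reduced to a caller-supplied number of minutes since the last progress, and the scope-expansion and customer-escalation triggers are left to human judgment.

```python
def escalations_due(severity, minutes_unresolved, minutes_since_progress=0):
    """Return how many automatic escalations should have fired so far.

    SEV1: one escalation per full 30 minutes unresolved.
    SEV2: a single escalation once 2 hours pass without significant progress.
    SEV3/SEV4: no time-based automatic escalation.
    """
    if severity == 1:
        return minutes_unresolved // 30
    if severity == 2:
        return 1 if minutes_since_progress >= 120 else 0
    return 0


print(escalations_due(1, 95))        # 3
print(escalations_due(2, 300, 130))  # 1
print(escalations_due(2, 300, 20))   # 0
```
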
## Communication Templates

### SEV1 Executive Alert
```
Subject: 🚨 CRITICAL INCIDENT - [Service] Complete Outage

URGENT: Customer-facing service outage requiring immediate attention

Service: [Service Name]
Start Time: [Timestamp]
Impact: [Description of customer impact]
Estimated Affected Users: [Number/Percentage]
Business Impact: [Revenue/SLA/Brand implications]

Incident Commander: [Name] ([Contact])
Response Team: [Team members engaged]

Current Status: [Brief status update]
Next Update: [Timestamp - 15 minutes from now]
War Room: [Bridge/Chat link]

This is a customer-impacting incident requiring executive awareness.
```

### SEV2 Major Impact
```
Subject: ⚠️ [SEV2] [Service] - Major Performance Impact

Major service degradation affecting user experience

Service: [Service Name]
Start Time: [Timestamp]
Impact: [Description of user impact]
Scope: [Affected functionality/users]

Response Team: [Team Lead] + [Team members]
Status: [Current mitigation efforts]
Workaround: [If available]

Next Update: 30 minutes
Status Page: [Link if updated]
```

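The templates use `[Placeholder]` markers, which can be filled with a small regular-expression substitution. The sketch below is one possible approach; `fill_template` is a hypothetical helper and the sample values are invented for illustration. Placeholders without a supplied value, and literal tags such as `[SEV2]`, are left untouched so that gaps stay visible.

```python
import re


def fill_template(template, values):
    """Replace [Key] markers with values[Key]; leave unknown markers as-is."""

    def substitute(match):
        key = match.group(1)
        return values.get(key, match.group(0))

    return re.sub(r"\[([^\[\]]+)\]", substitute, template)


# Shortened excerpt of the SEV2 template; values are made up for the example.
template = (
    "Subject: [SEV2] [Service] - Major Performance Impact\n"
    "Service: [Service Name]\n"
    "Start Time: [Timestamp]\n"
    "Workaround: [If available]"
)
values = {
    "Service": "Checkout API",
    "Service Name": "Checkout API",
    "Timestamp": "2026-02-01 12:00 UTC",
}
message = fill_template(template, values)
print(message)
```

Here `[SEV2]` and `[If available]` survive unchanged because no value was supplied for them.
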
## Review and Updates

This severity matrix should be reviewed quarterly and updated based on:
- Incident response learnings
- Business priority changes
- Service architecture evolution
- Regulatory requirement changes
- Customer feedback and SLA updates

**Last Updated:** February 2026
**Next Review:** May 2026
**Owner:** Engineering Leadership