claude-skills-reference/engineering-team/incident-commander/references/incident_severity_matrix.md
Leo daace78954 feat: Add comprehensive incident-commander skill
- Add SKILL.md with 300+ lines of incident response playbook
- Implement incident_classifier.py: severity classification and response recommendations
- Implement timeline_reconstructor.py: event timeline reconstruction with phase analysis
- Implement pir_generator.py: comprehensive PIR generation with multiple RCA frameworks
- Add reference documentation: severity matrix, RCA frameworks, communication templates
- Add sample data files and expected outputs for testing
- All scripts are standalone with zero external dependencies
- Dual output formats: JSON + human-readable text
- Professional, opinionated defaults based on SRE best practices

This POWERFUL-tier skill provides end-to-end incident response capabilities from
detection through post-incident review.
2026-02-16 12:43:38 +00:00

# Incident Severity Classification Matrix
## Overview
This document defines the severity classification system used for incident response. The classification determines response requirements, escalation paths, and communication frequency.
## Severity Levels
### SEV1 - Critical Outage
**Definition:** Complete service failure affecting all users or critical business functions
#### Impact Criteria
- Customer-facing services completely unavailable
- Data loss or corruption affecting users
- Security breaches with customer data exposure
- Revenue-generating systems down
- SLA violations with financial penalties
- More than 75% of users affected
#### Response Requirements
| Metric | Requirement |
|--------|-------------|
| **Response Time** | Immediate (0-5 minutes) |
| **Incident Commander** | Assigned within 5 minutes |
| **War Room** | Established within 10 minutes |
| **Executive Notification** | Within 15 minutes |
| **Public Status Page** | Updated within 15 minutes |
| **Customer Communication** | Within 30 minutes |
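The response requirements above translate directly into deadlines measured from the moment of detection. The following is a minimal sketch, assuming the clock starts at detection time; the names `SEV1_RESPONSE_MINUTES`, `sev1_deadlines`, and `overdue_steps` are hypothetical and not part of any existing tool.

```python
from datetime import datetime, timedelta, timezone

# Minutes allowed for each SEV1 response step, copied from the table above.
# The dictionary name and its keys are illustrative only.
SEV1_RESPONSE_MINUTES = {
    "response": 5,
    "incident_commander": 5,
    "war_room": 10,
    "executive_notification": 15,
    "public_status_page": 15,
    "customer_communication": 30,
}


def sev1_deadlines(detected_at: datetime) -> dict[str, datetime]:
    """Return the latest acceptable time for each SEV1 response step."""
    return {
        step: detected_at + timedelta(minutes=minutes)
        for step, minutes in SEV1_RESPONSE_MINUTES.items()
    }


def overdue_steps(detected_at: datetime, now: datetime, done: set[str]) -> list[str]:
    """List steps whose deadline has passed and that are not yet completed."""
    return [
        step
        for step, deadline in sev1_deadlines(detected_at).items()
        if now > deadline and step not in done
    ]


# Example: print the deadline for every step of an incident detected at 12:00 UTC.
detected = datetime(2026, 2, 16, 12, 0, tzinfo=timezone.utc)
for step, deadline in sev1_deadlines(detected).items():
    print(f"{step}: {deadline:%H:%M} UTC")
```

Keeping the thresholds in a single dictionary means a change to the table only needs to be mirrored in one place.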
#### Escalation Path
1. **Immediate**: On-call Engineer → Incident Commander
2. **15 minutes**: VP Engineering + Customer Success VP
3. **30 minutes**: CTO
4. **60 minutes**: CEO + Full Executive Team
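The escalation path can be read as a cumulative schedule: each tier is added once its time threshold passes while the incident remains unresolved. The sketch below assumes that reading and treats thresholds as inclusive; `SEV1_ESCALATION` and `roles_engaged` are hypothetical names.

```python
# (minutes since incident start, roles to notify), from the SEV1 escalation path.
# Assumption: tiers are cumulative and apply only while the incident is unresolved.
SEV1_ESCALATION = [
    (0, ["On-call Engineer", "Incident Commander"]),
    (15, ["VP Engineering", "Customer Success VP"]),
    (30, ["CTO"]),
    (60, ["CEO", "Full Executive Team"]),
]


def roles_engaged(elapsed_minutes: float) -> list[str]:
    """Return every role that should have been notified by `elapsed_minutes`."""
    roles: list[str] = []
    for threshold, tier in SEV1_ESCALATION:
        if elapsed_minutes >= threshold:
            roles.extend(tier)
    return roles


# Example: 20 minutes in, the first two tiers should be engaged.
print(roles_engaged(20))
```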
#### Communication Requirements
- **Frequency**: Every 15 minutes until resolution
- **Channels**: PagerDuty, Phone, Slack, Email, Status Page
- **Recipients**: All engineering, executives, customer success
- **Template**: SEV1 Executive Alert Template
---
### SEV2 - Major Impact
**Definition:** Significant degradation affecting a subset of users or non-critical functions
#### Impact Criteria
- Partial service degradation (25-75% of users affected)
- Performance issues causing user frustration
- Non-critical features unavailable
- Internal tool failures impacting productivity
- Data inconsistencies not affecting user experience
- API errors affecting integrations
#### Response Requirements
| Metric | Requirement |
|--------|-------------|
| **Response Time** | 15 minutes |
| **Incident Commander** | Assigned within 30 minutes |
| **Status Page Update** | Within 30 minutes |
| **Stakeholder Notification** | Within 1 hour |
| **Team Assembly** | Within 30 minutes |
#### Escalation Path
1. **Immediate**: On-call Engineer → Team Lead
2. **30 minutes**: Engineering Manager
3. **2 hours**: VP Engineering
4. **4 hours**: CTO (if unresolved)
#### Communication Requirements
- **Frequency**: Every 30 minutes during active response
- **Channels**: PagerDuty, Slack, Email
- **Recipients**: Engineering team, product team, relevant stakeholders
- **Template**: SEV2 Major Impact Template
---
### SEV3 - Minor Impact
**Definition:** Limited impact with workarounds available
#### Impact Criteria
- Single feature or component affected
- < 25% of users impacted
- Workarounds available
- Performance degradation not significantly impacting UX
- Non-urgent monitoring alerts
- Development/test environment issues
#### Response Requirements
| Metric | Requirement |
|--------|-------------|
| **Response Time** | 2 hours (business hours) |
| **After Hours Response** | Next business day |
| **Team Assignment** | Within 4 hours |
| **Status Page Update** | Optional |
| **Internal Notification** | Within 2 hours |
#### Escalation Path
1. **Immediate**: Assigned Engineer
2. **4 hours**: Team Lead
3. **1 business day**: Engineering Manager (if needed)
#### Communication Requirements
- **Frequency**: At key milestones only
- **Channels**: Slack, Email
- **Recipients**: Assigned team, team lead
- **Template**: SEV3 Minor Impact Template
---
### SEV4 - Low Impact
**Definition:** Minimal impact, cosmetic issues, or planned maintenance
#### Impact Criteria
- Cosmetic bugs
- Documentation issues
- Logging or monitoring gaps
- Performance issues with no user impact
- Development/test environment issues
- Feature requests or enhancements
#### Response Requirements
| Metric | Requirement |
|--------|-------------|
| **Response Time** | 1-2 business days |
| **Assignment** | Next sprint planning |
| **Tracking** | Standard ticket system |
| **Escalation** | None required |
#### Communication Requirements
- **Frequency**: Standard development cycle updates
- **Channels**: Ticket system
- **Recipients**: Product owner, assigned developer
- **Template**: Standard issue template
## Classification Guidelines
### User Impact Assessment
| Impact Scope | Description | Typical Severity |
|--------------|-------------|------------------|
| **Most or All Users** | > 75% of users affected | SEV1 |
| **Major Subset** | 50-75% of users affected | SEV1/SEV2 |
| **Significant Subset** | 25-50% of users affected | SEV2 |
| **Limited Users** | 5-25% of users affected | SEV2/SEV3 |
| **Few Users** | < 5% of users affected | SEV3/SEV4 |
| **No User Impact** | Internal only | SEV4 |
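### Business Impact Assessment
A boost of +1 raises severity by one level toward SEV1 (for example, SEV3 becomes SEV2). Severity cannot be raised above SEV1.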
| Business Impact | Description | Severity Boost |
|-----------------|-------------|----------------|
| **Revenue Loss** | Direct revenue impact | +1 severity level |
| **SLA Breach** | Contract violations | +1 severity level |
| **Regulatory** | Compliance implications | +1 severity level |
| **Brand Damage** | Public-facing issues | +1 severity level |
| **Security** | Data or system security | +2 severity levels |
### Duration Considerations
| Duration | Impact on Classification |
|----------|--------------------------|
| **< 15 minutes** | May reduce severity by 1 level |
| **15-60 minutes** | Standard classification |
| **1-4 hours** | May increase severity by 1 level |
| **> 4 hours** | Significant severity increase |
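The business impact and duration tables can be combined into a single adjustment step. This is a rough sketch under several assumptions that the tables do not state: when several business impacts apply, only the largest boost is used; the "may" adjustments for duration are always applied; "significant severity increase" is read as two levels; and the result is clamped to the SEV1 to SEV4 range. All names are hypothetical.

```python
# Levels to raise severity for each business impact, from the table above.
BUSINESS_BOOST = {
    "revenue_loss": 1,
    "sla_breach": 1,
    "regulatory": 1,
    "brand_damage": 1,
    "security": 2,
}


def duration_adjustment(duration_minutes: float) -> int:
    """Levels to raise (+) or lower (-) severity based on incident duration."""
    if duration_minutes < 15:
        return -1
    if duration_minutes <= 60:
        return 0
    if duration_minutes <= 240:
        return 1
    return 2  # assumption: "significant severity increase" read as two levels


def adjust_severity(base_sev: int, impacts: set[str], duration_minutes: float) -> int:
    """Apply modifiers to a base SEV number (1 = most severe, 4 = least)."""
    # Assumption: take the largest single boost instead of summing boosts.
    boost = max((BUSINESS_BOOST[i] for i in impacts), default=0)
    raised_by = boost + duration_adjustment(duration_minutes)
    # Raising severity lowers the SEV number; clamp to the valid 1..4 range.
    return min(4, max(1, base_sev - raised_by))


# Example: a SEV3 with direct revenue impact lasting 30 minutes.
print(adjust_severity(3, {"revenue_loss"}, 30))
```

Taking the maximum boost avoids a minor issue with several overlapping labels jumping straight to SEV1; summing is an equally defensible policy if the team prefers it.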
## Decision Tree
```
1. Is this a security incident with data exposure?
   → YES: SEV1 (regardless of user count)
   → NO: Continue to step 2
2. Are revenue-generating services completely down?
   → YES: SEV1
   → NO: Continue to step 3
3. What percentage of users are affected?
   → > 75%: SEV1
   → 25-75%: SEV2
   → 5-25%: SEV3
   → < 5%: SEV4
4. Apply business impact modifiers
5. Consider duration factors
6. When in doubt, err on higher severity
```
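Steps 1 to 3 of the decision tree map onto a short function. The sketch below covers only those steps and leaves the modifiers in steps 4 to 6 to human judgment. The tree does not say which band exactly 25% or 5% falls into, so the boundary handling here is an assumption; the function name `base_severity` is hypothetical.

```python
def base_severity(
    security_data_exposure: bool,
    revenue_services_down: bool,
    percent_users_affected: float,
) -> str:
    """Return the base severity from steps 1-3 of the decision tree."""
    # Step 1: security incident with data exposure is SEV1 regardless of user count.
    if security_data_exposure:
        return "SEV1"
    # Step 2: revenue-generating services completely down.
    if revenue_services_down:
        return "SEV1"
    # Step 3: percentage of users affected.
    # Assumption: 25% and 5% belong to the more severe band; exactly 75% is SEV2.
    if percent_users_affected > 75:
        return "SEV1"
    if percent_users_affected >= 25:
        return "SEV2"
    if percent_users_affected >= 5:
        return "SEV3"
    return "SEV4"


# Example: a mobile app crashing for 40% of users, with no security or revenue outage.
print(base_severity(False, False, 40))
```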
## Examples
### SEV1 Examples
- Payment processing system completely down
- All user authentication failing
- Database corruption causing data loss
- Security breach with customer data exposed
- Website returning 500 errors for all users
### SEV2 Examples
- Payment processing slow (30-second delays)
- Search functionality returning incomplete results
- API rate limits causing partner integration issues
- Dashboard displaying stale data (> 1 hour old)
- Mobile app crashing for 40% of users
### SEV3 Examples
- Single feature in admin panel not working
- Email notifications delayed by 1 hour
- Non-critical API endpoint returning errors
- Cosmetic UI bug in settings page
- Development environment deployment failing
### SEV4 Examples
- Typo in help documentation
- Log format change needed for analysis
- Non-critical performance optimization
- Internal tool enhancement request
- Test data cleanup needed
## Escalation Triggers
### Automatic Escalation
- SEV1 incidents automatically escalate every 30 minutes if unresolved
- SEV2 incidents escalate after 2 hours without significant progress
- Any incident whose scope is expanding is raised to a higher severity
- Customer escalation to support triggers severity review
### Manual Escalation
- Incident Commander can escalate at any time
- Technical leads can request escalation
- Business stakeholders can request severity review
- External factors (media attention, regulatory scrutiny) trigger escalation
## Communication Templates
### SEV1 Executive Alert
```
Subject: 🚨 CRITICAL INCIDENT - [Service] Complete Outage
URGENT: Customer-facing service outage requiring immediate attention
Service: [Service Name]
Start Time: [Timestamp]
Impact: [Description of customer impact]
Estimated Affected Users: [Number/Percentage]
Business Impact: [Revenue/SLA/Brand implications]
Incident Commander: [Name] ([Contact])
Response Team: [Team members engaged]
Current Status: [Brief status update]
Next Update: [Timestamp - 15 minutes from now]
War Room: [Bridge/Chat link]
This is a customer-impacting incident requiring executive awareness.
```
### SEV2 Major Impact
```
Subject: ⚠️ [SEV2] [Service] - Major Performance Impact
Major service degradation affecting user experience
Service: [Service Name]
Start Time: [Timestamp]
Impact: [Description of user impact]
Scope: [Affected functionality/users]
Response Team: [Team Lead] + [Team members]
Status: [Current mitigation efforts]
Workaround: [If available]
Next Update: 30 minutes
Status Page: [Link if updated]
```
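Filling a template by hand during an incident is error-prone, so it can help to render it from named fields. The following is a minimal sketch using `str.format` on the SEV2 template; the field names and the `render_sev2` function are hypothetical. As a choice of this sketch, "Next Update" is written as an absolute time 30 minutes after sending, where the template above gives the relative "30 minutes".

```python
from datetime import datetime, timedelta, timezone

# The SEV2 template with bracketed placeholders replaced by named format fields.
SEV2_TEMPLATE = """\
Subject: ⚠️ [SEV2] {service} - Major Performance Impact
Major service degradation affecting user experience
Service: {service}
Start Time: {start_time}
Impact: {impact}
Scope: {scope}
Response Team: {team_lead} + {team_members}
Status: {status}
Workaround: {workaround}
Next Update: {next_update}
Status Page: {status_page}
"""


def render_sev2(fields: dict[str, str], sent_at: datetime) -> str:
    """Render the SEV2 notification; raises KeyError if a field is missing."""
    # SEV2 communication frequency is every 30 minutes during active response.
    next_update = sent_at + timedelta(minutes=30)
    return SEV2_TEMPLATE.format(
        next_update=next_update.strftime("%Y-%m-%d %H:%M %Z"),
        **fields,
    )


# Example with made-up incident details.
example_fields = {
    "service": "Search API",
    "start_time": "2026-02-16 12:10 UTC",
    "impact": "Search returning incomplete results",
    "scope": "Roughly 40% of queries",
    "team_lead": "Team Lead",
    "team_members": "2 engineers",
    "status": "Rolling back latest deploy",
    "workaround": "None",
    "status_page": "Not yet updated",
}
print(render_sev2(example_fields, datetime(2026, 2, 16, 12, 30, tzinfo=timezone.utc)))
```

A missing field raises `KeyError` instead of sending a message with an unfilled placeholder, which is usually the safer failure mode for stakeholder communication.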
## Review and Updates
This severity matrix should be reviewed quarterly and updated based on:
- Incident response learnings
- Business priority changes
- Service architecture evolution
- Regulatory requirement changes
- Customer feedback and SLA updates
- **Last Updated:** February 2026
- **Next Review:** May 2026
- **Owner:** Engineering Leadership