Merge pull request #195 from alirezarezvani/feature/incident-commander
feat: Add incident-commander skill (POWERFUL tier)
252
engineering-team/incident-commander/README.md
Normal file
@@ -0,0 +1,252 @@
# Incident Commander Skill

A comprehensive incident response framework providing structured tools for managing technology incidents from detection through resolution and post-incident review.

## Overview

This skill packages common SRE and DevOps incident-response practices into three command-line scripts and a set of reference guides, providing:

- **Automated Severity Classification** - Triage of incidents into SEV1-4 from a JSON or plain-text description
- **Timeline Reconstruction** - Transform scattered events into coherent narratives
- **Post-Incident Review Generation** - Structured PIRs with RCA frameworks
- **Communication Templates** - Pre-built stakeholder communication
- **Comprehensive Documentation** - Reference guides for incident response

## Quick Start

### Classify an Incident

```bash
# From JSON file
python scripts/incident_classifier.py --input incident.json --format text

# From stdin text
echo "Database is down affecting all users" | python scripts/incident_classifier.py --format text

# Interactive mode
python scripts/incident_classifier.py --interactive
```

### Reconstruct Timeline

```bash
# Analyze event timeline
python scripts/timeline_reconstructor.py --input events.json --format text

# With gap analysis
python scripts/timeline_reconstructor.py --input events.json --gap-analysis --format markdown
```

### Generate PIR Document

```bash
# Basic PIR
python scripts/pir_generator.py --incident incident.json --format markdown

# Comprehensive PIR with timeline
python scripts/pir_generator.py --incident incident.json --timeline timeline.json --rca-method fishbone
```

## Scripts

### incident_classifier.py

**Purpose:** Analyzes incident descriptions and provides severity classification, team recommendations, and response templates.

**Input:** JSON object with incident details or plain text description
**Output:** JSON + human-readable classification report

**Example Input:**
```json
{
  "description": "Database connection timeouts causing 500 errors",
  "service": "payment-api",
  "affected_users": "80%",
  "business_impact": "high"
}
```

**Key Features:**
- SEV1-4 severity classification
- Recommended response teams
- Initial action prioritization
- Communication templates
- Response timelines
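
To make the classification idea concrete, here is a minimal sketch of how keyword matches and the `affected_users` percentage could be combined into a SEV level. It is not the logic of `incident_classifier.py`: the keyword lists, the percentage thresholds, and the names `KEYWORDS`, `parse_percent`, `severity_from_impact`, and `classify` are all assumptions made for illustration.

```python
import re

# Hypothetical keyword lists, chosen only for illustration.
KEYWORDS = {
    "SEV1": ("down", "outage", "data loss"),
    "SEV2": ("timeout", "500 error", "degraded"),
    "SEV3": ("slow", "intermittent"),
}
SEVERITY_ORDER = ["SEV1", "SEV2", "SEV3", "SEV4"]  # most severe first


def parse_percent(value):
    """Turn a string such as '80%' into 80.0; return None if no percentage is found."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", str(value or ""))
    return float(match.group(1)) if match else None


def severity_from_impact(percent):
    """Map the share of affected users to a severity (assumed thresholds)."""
    if percent is None:
        return "SEV4"
    if percent >= 50:
        return "SEV1"
    if percent >= 10:
        return "SEV2"
    if percent >= 1:
        return "SEV3"
    return "SEV4"


def classify(incident):
    """Return the more severe of the keyword signal and the user-impact signal."""
    text = incident.get("description", "").lower()
    keyword_sev = "SEV4"
    matched = []
    for sev in SEVERITY_ORDER[:3]:
        hits = [k for k in KEYWORDS[sev] if k in text]
        if hits and keyword_sev == "SEV4":
            keyword_sev = sev  # keep the most severe level that matched
        matched.extend(hits)
    impact_sev = severity_from_impact(parse_percent(incident.get("affected_users")))
    severity = min(keyword_sev, impact_sev, key=SEVERITY_ORDER.index)
    return {"severity": severity, "matched_keywords": matched}
```

Taking the more severe of the two signals errs toward over-escalation, which is usually the cheaper mistake during triage. Note that plain substring matching is crude (for example, `down` also matches `download`), so a real classifier would likely need word-boundary matching.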

### timeline_reconstructor.py

**Purpose:** Reconstructs incident timelines from timestamped events, identifies phases, and performs gap analysis.

**Input:** JSON array of timestamped events
**Output:** Formatted timeline with phase analysis and metrics

**Example Input:**
```json
[
  {
    "timestamp": "2024-01-01T12:00:00Z",
    "source": "monitoring",
    "message": "High error rate detected",
    "severity": "critical",
    "actor": "system"
  }
]
```

**Key Features:**
- Phase detection (detection → triage → mitigation → resolution)
- Duration analysis
- Gap identification
- Communication effectiveness analysis
- Response metrics
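
Duration analysis and gap identification can be sketched with the standard library alone. The snippet below is an illustration, not the implementation in `timeline_reconstructor.py`: the function names and the 10-minute default threshold are assumptions, and it only handles timestamps in the `YYYY-MM-DDTHH:MM:SSZ` form used by the sample events.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse a UTC timestamp such as '2024-03-10T09:00:00Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def total_duration_minutes(events):
    """Minutes between the earliest and the latest event."""
    times = [parse_ts(e["timestamp"]) for e in events]
    return (max(times) - min(times)).total_seconds() / 60 if times else 0.0


def find_gaps(events, threshold_minutes=10):
    """Return (start, end, minutes) for consecutive events at least threshold_minutes apart."""
    ordered = sorted(events, key=lambda e: parse_ts(e["timestamp"]))
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        delta = parse_ts(curr["timestamp"]) - parse_ts(prev["timestamp"])
        minutes = delta.total_seconds() / 60
        if minutes >= threshold_minutes:
            gaps.append((prev["timestamp"], curr["timestamp"], minutes))
    return gaps
```

Sorting before pairing matters because events collected from several sources rarely arrive in chronological order.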

### pir_generator.py

**Purpose:** Generates comprehensive Post-Incident Review documents with multiple RCA frameworks.

**Input:** Incident data JSON, optional timeline data
**Output:** Structured PIR document with RCA analysis

**Key Features:**
- Multiple RCA methods (5 Whys, Fishbone, Timeline, Bow Tie)
- Automated action item generation
- Lessons learned categorization
- Follow-up planning
- Completeness assessment
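
As an illustration of the 5 Whys method listed above, the sketch below renders a chain of questions and answers as Markdown, where each question asks about the previous answer. It is a hypothetical helper, not code from `pir_generator.py`; the function name, the question wording, and the placeholder text for unanswered levels are assumptions.

```python
def render_five_whys(problem, answers, max_depth=5):
    """Render a 5 Whys chain as Markdown; levels without an answer are marked open."""
    lines = []
    question = f"Why did this happen: {problem}?"
    for depth in range(1, max_depth + 1):
        if depth <= len(answers):
            answer = answers[depth - 1]
        else:
            answer = "Further investigation needed"
        lines.append(f"**Why {depth}:** {question}")
        lines.append(f"**Answer:** {answer}")
        lines.append("")
        # The next question drills into the answer just given.
        question = f"Why did that happen ({answer})?"
    return "\n".join(lines).rstrip()
```

Marking unanswered levels explicitly keeps an incomplete analysis visible in the PIR instead of silently truncating the chain.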

## Sample Data

The `assets/` directory contains sample data files for testing:

- `sample_incident_classification.json` - Database connection pool exhaustion incident
- `sample_timeline_events.json` - Complete timeline with 21 events across phases
- `sample_incident_pir_data.json` - Comprehensive incident data for PIR generation
- `simple_incident.json` - Minimal incident for basic testing
- `simple_timeline_events.json` - Simple 4-event timeline

## Expected Outputs

The `expected_outputs/` directory contains reference outputs showing what each script produces:

- `incident_classification_text_output.txt` - Detailed classification report
- `timeline_reconstruction_text_output.txt` - Complete timeline analysis
- `pir_markdown_output.md` - Full PIR document
- `simple_incident_classification.txt` - Basic classification example

## Reference Documentation

### references/incident_severity_matrix.md
Complete severity classification system with:
- SEV1-4 definitions and criteria
- Response requirements and timelines
- Escalation paths
- Communication requirements
- Decision trees and examples

### references/rca_frameworks_guide.md
Detailed guide for root cause analysis:
- 5 Whys methodology
- Fishbone (Ishikawa) diagram analysis
- Timeline analysis techniques
- Bow Tie analysis for high-risk incidents
- Framework selection guidelines

### references/communication_templates.md
Standardized communication templates:
- Severity-specific notification templates
- Stakeholder-specific messaging
- Escalation communications
- Resolution notifications
- Customer communication guidelines

## Usage Patterns

### End-to-End Incident Workflow

1. **Initial Classification**
   ```bash
   echo "Payment API returning 500 errors for 70% of requests" | \
     python scripts/incident_classifier.py --format text
   ```

2. **Timeline Reconstruction** (after collecting events)
   ```bash
   python scripts/timeline_reconstructor.py \
     --input events.json \
     --gap-analysis \
     --format markdown \
     --output timeline.md
   ```

3. **PIR Generation** (after incident resolution)
   ```bash
   python scripts/pir_generator.py \
     --incident incident.json \
     --timeline timeline.md \
     --rca-method fishbone \
     --output pir.md
   ```

### Integration Examples

**CI/CD Pipeline Integration:**
```bash
# Classify deployment issues
cat deployment_error.log | python scripts/incident_classifier.py --format json
```

**Monitoring Integration:**
```bash
# Process alert events
curl -s "monitoring-api/events" | python scripts/timeline_reconstructor.py --format text
```

**Runbook Generation:**
Use classification output to automatically select appropriate runbooks and escalation procedures.
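
The runbook selection described above could look roughly like the following. This is a sketch under stated assumptions: it supposes the classifier's JSON output is an object with a top-level `severity` key (verify against the real output of `--format json`), and the runbook paths and the names `RUNBOOKS`, `DEFAULT_RUNBOOK`, and `select_runbook` are hypothetical.

```python
import json

# Hypothetical runbook paths; substitute the locations your team actually uses.
RUNBOOKS = {
    "SEV1": "runbooks/sev1_major_outage.md",
    "SEV2": "runbooks/sev2_degradation.md",
    "SEV3": "runbooks/sev3_minor_impact.md",
    "SEV4": "runbooks/sev4_low_impact.md",
}
DEFAULT_RUNBOOK = "runbooks/triage_checklist.md"


def select_runbook(classification_json):
    """Pick a runbook path from classifier JSON output.

    Assumes a JSON object with a top-level "severity" key. Anything that
    cannot be parsed or lacks a known severity falls back to the default.
    """
    try:
        data = json.loads(classification_json)
    except json.JSONDecodeError:
        return DEFAULT_RUNBOOK
    if not isinstance(data, dict):
        return DEFAULT_RUNBOOK
    severity = str(data.get("severity", "")).upper()
    return RUNBOOKS.get(severity, DEFAULT_RUNBOOK)
```

Falling back to a generic triage checklist means a malformed or unexpected classifier result still points the responder at something actionable.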

## Quality Standards

- **Zero External Dependencies** - All scripts use only Python standard library
- **Dual Output Format** - Both JSON (machine-readable) and text (human-readable)
- **Robust Input Handling** - Graceful handling of missing or malformed data
- **Professional Defaults** - Opinionated, battle-tested configurations
- **Comprehensive Testing** - Sample data and expected outputs included

## Technical Requirements

- Python 3.6+
- No external dependencies required
- Works with standard Unix tools (pipes, redirection)
- Cross-platform compatible

## Severity Classification Reference

| Severity | Description | Response Time | Update Frequency |
|----------|-------------|---------------|------------------|
| **SEV1** | Complete outage | 5 minutes | Every 15 minutes |
| **SEV2** | Major degradation | 15 minutes | Every 30 minutes |
| **SEV3** | Minor impact | 2 hours | At milestones |
| **SEV4** | Low impact | 1-2 days | Weekly |
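
The response-time column of this table can be expressed as a lookup for computing when a first response is due. The sketch below is illustrative only: `RESPONSE_TARGETS` and `response_deadline` are hypothetical names, and the SEV4 range of 1-2 days is represented by its upper bound as an assumption.

```python
from datetime import datetime, timedelta, timezone

# Response targets taken from the table above; SEV4 uses the upper bound of "1-2 days".
RESPONSE_TARGETS = {
    "SEV1": timedelta(minutes=5),
    "SEV2": timedelta(minutes=15),
    "SEV3": timedelta(hours=2),
    "SEV4": timedelta(days=2),
}


def response_deadline(severity, detected_at):
    """Return the latest time by which a first response is expected."""
    return detected_at + RESPONSE_TARGETS[severity.upper()]


detected = datetime(2024, 3, 15, 14, 30, tzinfo=timezone.utc)
deadline = response_deadline("sev2", detected)
print(deadline.isoformat())
```

An unknown severity raises `KeyError` here on purpose, so a typo does not silently produce a wrong deadline.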

## Getting Help

Each script includes comprehensive help:
```bash
python scripts/incident_classifier.py --help
python scripts/timeline_reconstructor.py --help
python scripts/pir_generator.py --help
```

For methodology questions, refer to the reference documentation in the `references/` directory.

## Contributing

When adding new features:
1. Maintain zero external dependencies
2. Add comprehensive examples to `assets/`
3. Update expected outputs in `expected_outputs/`
4. Follow the established patterns for argument parsing and output formatting

## License

This skill is part of the claude-skills repository. See the main repository LICENSE for details.
@@ -0,0 +1,14 @@
{
  "description": "Database connection timeouts causing 500 errors for payment processing API. Users unable to complete checkout. Error rate spiked from 0.1% to 45% starting at 14:30 UTC. Database monitoring shows connection pool exhaustion with 200/200 connections active.",
  "service": "payment-api",
  "affected_users": "80%",
  "business_impact": "high",
  "duration_minutes": 95,
  "metadata": {
    "error_rate": "45%",
    "connection_pool_utilization": "100%",
    "affected_regions": ["us-west", "us-east", "eu-west"],
    "detection_method": "monitoring_alert",
    "customer_escalations": 12
  }
}
@@ -0,0 +1,74 @@
{
  "incident_id": "INC-2024-0315-001",
  "title": "Payment API Database Connection Pool Exhaustion",
  "description": "Database connection pool exhaustion caused widespread 500 errors in payment processing API, preventing users from completing purchases. Root cause was an inefficient database query introduced in deployment v2.3.1.",
  "severity": "sev2",
  "start_time": "2024-03-15T14:30:00Z",
  "end_time": "2024-03-15T15:35:00Z",
  "duration": "1h 5m",
  "affected_services": ["payment-api", "checkout-service", "subscription-billing"],
  "customer_impact": "80% of users unable to complete payments or checkout. Approximately 2,400 failed payment attempts during the incident. Users experienced immediate 500 errors when attempting to pay.",
  "business_impact": "Estimated revenue loss of $45,000 during outage period. No SLA breaches as resolution was within 2-hour window. 12 customer escalations through support channels.",
  "incident_commander": "Mike Rodriguez",
  "responders": [
    "Sarah Chen - On-call Engineer, Primary Responder",
    "Tom Wilson - Database Team Lead",
    "Lisa Park - Database Engineer",
    "Mike Rodriguez - Incident Commander",
    "David Kumar - DevOps Engineer"
  ],
  "status": "resolved",
  "detection_details": {
    "detection_method": "automated_monitoring",
    "detection_time": "2024-03-15T14:30:00Z",
    "alert_source": "Datadog error rate threshold",
    "time_to_detection": "immediate"
  },
  "response_details": {
    "time_to_response": "5 minutes",
    "time_to_escalation": "10 minutes",
    "time_to_resolution": "65 minutes",
    "war_room_established": "2024-03-15T14:45:00Z",
    "executives_notified": false,
    "status_page_updated": true
  },
  "technical_details": {
    "root_cause": "Inefficient database query introduced in deployment v2.3.1 caused each payment validation to take 15 seconds instead of normal 0.1 seconds, exhausting the 200-connection database pool",
    "affected_regions": ["us-west", "us-east", "eu-west"],
    "error_metrics": {
      "peak_error_rate": "45%",
      "normal_error_rate": "0.1%",
      "connection_pool_max": 200,
      "connections_exhausted_at": "100%"
    },
    "resolution_method": "rollback",
    "rollback_target": "v2.2.9",
    "rollback_duration": "7 minutes"
  },
  "communication_log": [
    {
      "timestamp": "2024-03-15T14:50:00Z",
      "type": "status_page",
      "message": "Investigating payment processing issues",
      "audience": "customers"
    },
    {
      "timestamp": "2024-03-15T15:35:00Z",
      "type": "status_page",
      "message": "Payment processing issues resolved",
      "audience": "customers"
    }
  ],
  "lessons_learned_preview": [
    "Deployment v2.3.1 code review missed performance implications of query change",
    "Load testing didn't include realistic database query patterns",
    "Connection pool monitoring could have provided earlier warning",
    "Rollback procedure worked effectively - 7 minute rollback time"
  ],
  "preliminary_action_items": [
    "Fix inefficient query for v2.3.2 deployment",
    "Add database query performance checks to CI pipeline",
    "Improve load testing to include database performance scenarios",
    "Add connection pool utilization alerts"
  ]
}
@@ -0,0 +1,263 @@
[
  {
    "timestamp": "2024-03-15T14:30:00Z",
    "source": "datadog",
    "type": "alert",
    "message": "High error rate detected on payment-api: 45% error rate (threshold: 5%)",
    "severity": "critical",
    "actor": "monitoring-system",
    "metadata": {
      "alert_id": "ALT-001",
      "metric_value": "45%",
      "threshold": "5%"
    }
  },
  {
    "timestamp": "2024-03-15T14:32:00Z",
    "source": "pagerduty",
    "type": "escalation",
    "message": "Paged on-call engineer Sarah Chen for payment-api alerts",
    "severity": "high",
    "actor": "pagerduty-system",
    "metadata": {
      "incident_id": "PD-12345",
      "responder": "sarah.chen@company.com"
    }
  },
  {
    "timestamp": "2024-03-15T14:35:00Z",
    "source": "slack",
    "type": "communication",
    "message": "Sarah Chen acknowledged the alert and is investigating payment-api issues",
    "severity": "medium",
    "actor": "sarah.chen",
    "metadata": {
      "channel": "#incidents",
      "message_id": "1234567890.123456"
    }
  },
  {
    "timestamp": "2024-03-15T14:38:00Z",
    "source": "application_logs",
    "type": "log",
    "message": "Database connection pool exhausted: 200/200 connections active, unable to acquire new connections",
    "severity": "critical",
    "actor": "payment-api",
    "metadata": {
      "log_level": "ERROR",
      "component": "database_pool",
      "connection_count": 200,
      "max_connections": 200
    }
  },
  {
    "timestamp": "2024-03-15T14:40:00Z",
    "source": "slack",
    "type": "escalation",
    "message": "Sarah Chen: Escalating to incident commander - database connection pool exhausted, need database team",
    "severity": "high",
    "actor": "sarah.chen",
    "metadata": {
      "channel": "#incidents",
      "escalation_reason": "database_expertise_needed"
    }
  },
  {
    "timestamp": "2024-03-15T14:42:00Z",
    "source": "pagerduty",
    "type": "escalation",
    "message": "Incident commander Mike Rodriguez assigned to incident PD-12345",
    "severity": "high",
    "actor": "pagerduty-system",
    "metadata": {
      "incident_commander": "mike.rodriguez@company.com",
      "role": "incident_commander"
    }
  },
  {
    "timestamp": "2024-03-15T14:45:00Z",
    "source": "slack",
    "type": "communication",
    "message": "Mike Rodriguez: War room established in #war-room-payment-api. Engaging database team.",
    "severity": "high",
    "actor": "mike.rodriguez",
    "metadata": {
      "channel": "#incidents",
      "war_room": "#war-room-payment-api"
    }
  },
  {
    "timestamp": "2024-03-15T14:47:00Z",
    "source": "pagerduty",
    "type": "escalation",
    "message": "Database team engineers paged: Tom Wilson, Lisa Park",
    "severity": "medium",
    "actor": "pagerduty-system",
    "metadata": {
      "team": "database-team",
      "responders": ["tom.wilson@company.com", "lisa.park@company.com"]
    }
  },
  {
    "timestamp": "2024-03-15T14:50:00Z",
    "source": "statuspage",
    "type": "communication",
    "message": "Status page updated: Investigating payment processing issues",
    "severity": "medium",
    "actor": "mike.rodriguez",
    "metadata": {
      "status": "investigating",
      "affected_systems": ["payment-api"]
    }
  },
  {
    "timestamp": "2024-03-15T14:52:00Z",
    "source": "slack",
    "type": "communication",
    "message": "Tom Wilson: Joining war room. Looking at database metrics now. Seeing unusual query patterns from recent deployment.",
    "severity": "medium",
    "actor": "tom.wilson",
    "metadata": {
      "channel": "#war-room-payment-api",
      "investigation_focus": "database_metrics"
    }
  },
  {
    "timestamp": "2024-03-15T14:55:00Z",
    "source": "database_monitoring",
    "type": "log",
    "message": "Identified slow query introduced in deployment v2.3.1: payment validation taking 15s per request",
    "severity": "critical",
    "actor": "database-monitor",
    "metadata": {
      "deployment_version": "v2.3.1",
      "query_time": "15s",
      "normal_query_time": "0.1s"
    }
  },
  {
    "timestamp": "2024-03-15T15:00:00Z",
    "source": "slack",
    "type": "communication",
    "message": "Tom Wilson: Root cause identified - inefficient query in v2.3.1 deployment. Recommending immediate rollback.",
    "severity": "high",
    "actor": "tom.wilson",
    "metadata": {
      "channel": "#war-room-payment-api",
      "root_cause": "inefficient_query",
      "recommendation": "rollback"
    }
  },
  {
    "timestamp": "2024-03-15T15:02:00Z",
    "source": "slack",
    "type": "communication",
    "message": "Mike Rodriguez: Approved rollback to v2.2.9. Sarah initiating rollback procedure.",
    "severity": "high",
    "actor": "mike.rodriguez",
    "metadata": {
      "channel": "#war-room-payment-api",
      "decision": "rollback_approved",
      "target_version": "v2.2.9"
    }
  },
  {
    "timestamp": "2024-03-15T15:05:00Z",
    "source": "deployment_system",
    "type": "action",
    "message": "Rollback initiated: payment-api v2.3.1 → v2.2.9",
    "severity": "medium",
    "actor": "sarah.chen",
    "metadata": {
      "from_version": "v2.3.1",
      "to_version": "v2.2.9",
      "deployment_type": "rollback"
    }
  },
  {
    "timestamp": "2024-03-15T15:12:00Z",
    "source": "deployment_system",
    "type": "action",
    "message": "Rollback completed successfully: payment-api now running v2.2.9 across all regions",
    "severity": "medium",
    "actor": "deployment-system",
    "metadata": {
      "deployment_status": "completed",
      "regions": ["us-west", "us-east", "eu-west"]
    }
  },
  {
    "timestamp": "2024-03-15T15:15:00Z",
    "source": "datadog",
    "type": "log",
    "message": "Error rate decreasing: payment-api error rate dropped to 8% and continuing to decline",
    "severity": "medium",
    "actor": "monitoring-system",
    "metadata": {
      "error_rate": "8%",
      "trend": "decreasing"
    }
  },
  {
    "timestamp": "2024-03-15T15:18:00Z",
    "source": "database_monitoring",
    "type": "log",
    "message": "Connection pool utilization normalizing: 45/200 connections active",
    "severity": "low",
    "actor": "database-monitor",
    "metadata": {
      "connection_count": 45,
      "max_connections": 200,
      "utilization": "22.5%"
    }
  },
  {
    "timestamp": "2024-03-15T15:25:00Z",
    "source": "datadog",
    "type": "log",
    "message": "Error rate returned to normal: payment-api error rate now 0.2% (within normal range)",
    "severity": "low",
    "actor": "monitoring-system",
    "metadata": {
      "error_rate": "0.2%",
      "status": "normal"
    }
  },
  {
    "timestamp": "2024-03-15T15:30:00Z",
    "source": "slack",
    "type": "communication",
    "message": "Mike Rodriguez: All metrics returned to normal. Declaring incident resolved. Thanks to all responders.",
    "severity": "low",
    "actor": "mike.rodriguez",
    "metadata": {
      "channel": "#war-room-payment-api",
      "status": "resolved"
    }
  },
  {
    "timestamp": "2024-03-15T15:35:00Z",
    "source": "statuspage",
    "type": "communication",
    "message": "Status page updated: Payment processing issues resolved. All systems operational.",
    "severity": "low",
    "actor": "mike.rodriguez",
    "metadata": {
      "status": "resolved",
      "duration": "65 minutes"
    }
  },
  {
    "timestamp": "2024-03-15T15:40:00Z",
    "source": "slack",
    "type": "communication",
    "message": "Mike Rodriguez: PIR scheduled for tomorrow 10am. Action item: fix the inefficient query in v2.3.2",
    "severity": "low",
    "actor": "mike.rodriguez",
    "metadata": {
      "channel": "#incidents",
      "pir_time": "2024-03-16T10:00:00Z",
      "action_item": "fix_query_v2.3.2"
    }
  }
]
@@ -0,0 +1,6 @@
{
  "description": "Users reporting slow page loads on the main website",
  "service": "web-frontend",
  "affected_users": "25%",
  "business_impact": "medium"
}
@@ -0,0 +1,30 @@
[
  {
    "timestamp": "2024-03-10T09:00:00Z",
    "source": "monitoring",
    "message": "High CPU utilization detected on web servers",
    "severity": "medium",
    "actor": "system"
  },
  {
    "timestamp": "2024-03-10T09:05:00Z",
    "source": "slack",
    "message": "Engineer investigating high CPU alerts",
    "severity": "medium",
    "actor": "john.doe"
  },
  {
    "timestamp": "2024-03-10T09:15:00Z",
    "source": "deployment",
    "message": "Deployed hotfix to reduce CPU usage",
    "severity": "low",
    "actor": "john.doe"
  },
  {
    "timestamp": "2024-03-10T09:25:00Z",
    "source": "monitoring",
    "message": "CPU utilization returned to normal levels",
    "severity": "low",
    "actor": "system"
  }
]
@@ -0,0 +1,44 @@
============================================================
INCIDENT CLASSIFICATION REPORT
============================================================

CLASSIFICATION:
  Severity: SEV1
  Confidence: 100.0%
  Reasoning: Classified as SEV1 based on: keywords: timeout, 500 error; user impact: 80%
  Timestamp: 2026-02-16T12:41:46.644096+00:00

RECOMMENDED RESPONSE:
  Primary Team: Analytics Team
  Supporting Teams: SRE, API Team, Backend Engineering, Finance Engineering, Payments Team, DevOps, Compliance Team, Database Team, Platform Team, Data Engineering
  Response Time: 5 minutes

INITIAL ACTIONS:
  1. Establish incident command (Priority 1)
     Timeout: 5 minutes
     Page incident commander and establish war room

  2. Create incident ticket (Priority 1)
     Timeout: 2 minutes
     Create tracking ticket with all known details

  3. Update status page (Priority 2)
     Timeout: 15 minutes
     Post initial status page update acknowledging incident

  4. Notify executives (Priority 2)
     Timeout: 15 minutes
     Alert executive team of customer-impacting outage

  5. Engage subject matter experts (Priority 3)
     Timeout: 10 minutes
     Page relevant SMEs based on affected systems

COMMUNICATION:
  Subject: 🚨 [SEV1] payment-api - Database connection timeouts causing 500 errors fo...
  Urgency: SEV1
  Recipients: on-call, engineering-leadership, executives, customer-success
  Channels: pager, phone, slack, email, status-page
  Update Frequency: Every 15 minutes

============================================================
@@ -0,0 +1,88 @@
# Post-Incident Review: Payment API Database Connection Pool Exhaustion

## Executive Summary
On March 15, 2024, we experienced a sev2 incident affecting ['payment-api', 'checkout-service', 'subscription-billing']. The incident lasted 1h 5m and had the following impact: 80% of users unable to complete payments or checkout. Approximately 2,400 failed payment attempts during the incident. Users experienced immediate 500 errors when attempting to pay. The incident has been resolved and we have identified specific actions to prevent recurrence.

## Incident Overview
- **Incident ID:** INC-2024-0315-001
- **Date & Time:** 2024-03-15 14:30:00 UTC
- **Duration:** 1h 5m
- **Severity:** SEV2
- **Status:** Resolved
- **Incident Commander:** Mike Rodriguez
- **Responders:** Sarah Chen - On-call Engineer, Primary Responder, Tom Wilson - Database Team Lead, Lisa Park - Database Engineer, Mike Rodriguez - Incident Commander, David Kumar - DevOps Engineer

### Customer Impact
80% of users unable to complete payments or checkout. Approximately 2,400 failed payment attempts during the incident. Users experienced immediate 500 errors when attempting to pay.

### Business Impact
Estimated revenue loss of $45,000 during outage period. No SLA breaches as resolution was within 2-hour window. 12 customer escalations through support channels.

## Timeline
No detailed timeline available.

## Root Cause Analysis
### Analysis Method: 5 Whys Analysis

#### Why Analysis

**Why 1:** Why did Database connection pool exhaustion caused widespread 500 errors in payment processing API, preventing users from completing purchases. Root cause was an inefficient database query introduced in deployment v2.3.1.?
**Answer:** New deployment introduced a regression

**Why 2:** Why wasn't this detected earlier?
**Answer:** Code review process missed the issue

**Why 3:** Why didn't existing safeguards prevent this?
**Answer:** Testing environment didn't match production

**Why 4:** Why wasn't there a backup mechanism?
**Answer:** Further investigation needed

**Why 5:** Why wasn't this scenario anticipated?
**Answer:** Further investigation needed


## What Went Well
- The incident was successfully resolved
- Incident command was established
- Multiple team members collaborated on resolution

## What Didn't Go Well
- Analysis in progress

## Lessons Learned
Lessons learned to be documented following detailed analysis.

## Action Items
Action items to be defined.

## Follow-up and Prevention
### Prevention Measures

Based on the root cause analysis, the following preventive measures have been identified:

- Implement comprehensive testing for similar scenarios
- Improve monitoring and alerting coverage
- Enhance error handling and resilience patterns

### Follow-up Schedule

- 1 week: Review action item progress
- 1 month: Evaluate effectiveness of implemented changes
- 3 months: Conduct follow-up assessment and update preventive measures

## Appendix
### Additional Information

- Incident ID: INC-2024-0315-001
- Severity Classification: sev2
- Affected Services: payment-api, checkout-service, subscription-billing

### References

- Incident tracking ticket: [Link TBD]
- Monitoring dashboards: [Link TBD]
- Communication thread: [Link TBD]

---
*Generated on 2026-02-16 by PIR Generator*
@@ -0,0 +1,44 @@
============================================================
INCIDENT CLASSIFICATION REPORT
============================================================

CLASSIFICATION:
  Severity: SEV2
  Confidence: 100.0%
  Reasoning: Classified as SEV2 based on: keywords: slow; user impact: 25%
  Timestamp: 2026-02-16T12:42:41.889774+00:00

RECOMMENDED RESPONSE:
  Primary Team: UX Engineering
  Supporting Teams: Product Engineering, Frontend Team
  Response Time: 15 minutes

INITIAL ACTIONS:
  1. Assign incident commander (Priority 1)
     Timeout: 30 minutes
     Assign IC and establish coordination channel

  2. Create incident tracking (Priority 1)
     Timeout: 5 minutes
     Create incident ticket with details and timeline

  3. Assess customer impact (Priority 2)
     Timeout: 15 minutes
     Determine scope and severity of user impact

  4. Engage response team (Priority 2)
     Timeout: 30 minutes
     Page appropriate technical responders

  5. Begin investigation (Priority 3)
     Timeout: 15 minutes
     Start technical analysis and debugging

COMMUNICATION:
  Subject: ⚠️ [SEV2] web-frontend - Users reporting slow page loads on the main websit...
  Urgency: SEV2
  Recipients: on-call, engineering-leadership, product-team
  Channels: pager, slack, email
  Update Frequency: Every 30 minutes

============================================================
@@ -0,0 +1,110 @@
================================================================================
INCIDENT TIMELINE RECONSTRUCTION
================================================================================

OVERVIEW:
  Time Range: 2024-03-15T14:30:00+00:00 to 2024-03-15T15:40:00+00:00
  Total Duration: 70 minutes
  Total Events: 21
  Phases Detected: 12

PHASES:
  DETECTION:
    Start: 2024-03-15T14:30:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Initial detection of the incident through monitoring or observation

  ESCALATION:
    Start: 2024-03-15T14:32:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Escalation to additional resources or higher severity response

  TRIAGE:
    Start: 2024-03-15T14:35:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Assessment and initial investigation of the incident

  ESCALATION:
    Start: 2024-03-15T14:38:00+00:00
    Duration: 9.0 minutes
    Events: 5
    Description: Escalation to additional resources or higher severity response

  TRIAGE:
    Start: 2024-03-15T14:50:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Assessment and initial investigation of the incident

  ESCALATION:
    Start: 2024-03-15T14:52:00+00:00
    Duration: 10.0 minutes
    Events: 4
    Description: Escalation to additional resources or higher severity response

  TRIAGE:
    Start: 2024-03-15T15:05:00+00:00
    Duration: 7.0 minutes
    Events: 2
    Description: Assessment and initial investigation of the incident

  DETECTION:
    Start: 2024-03-15T15:15:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Initial detection of the incident through monitoring or observation

  RESOLUTION:
    Start: 2024-03-15T15:18:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Confirmation that the incident has been resolved

  DETECTION:
    Start: 2024-03-15T15:25:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Initial detection of the incident through monitoring or observation

  RESOLUTION:
    Start: 2024-03-15T15:30:00+00:00
    Duration: 5.0 minutes
    Events: 2
    Description: Confirmation that the incident has been resolved

  TRIAGE:
    Start: 2024-03-15T15:40:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Assessment and initial investigation of the incident

KEY METRICS:
  Time to Mitigation: 0 minutes
  Time to Resolution: 48.0 minutes
  Events per Hour: 18.0
  Unique Sources: 7

INCIDENT NARRATIVE:
Incident Timeline Summary:
The incident began at 2024-03-15 14:30:00 UTC and concluded at 2024-03-15 15:40:00 UTC, lasting approximately 70 minutes.

The incident progressed through 12 distinct phases: detection, escalation, triage, escalation, triage, escalation, triage, detection, resolution, detection, resolution, triage.

Key milestones:
|
||||
- Detection: 14:30 (0 min)
|
||||
- Escalation: 14:32 (0 min)
|
||||
- Triage: 14:35 (0 min)
|
||||
- Escalation: 14:38 (9 min)
|
||||
- Triage: 14:50 (0 min)
|
||||
- Escalation: 14:52 (10 min)
|
||||
- Triage: 15:05 (7 min)
|
||||
- Detection: 15:15 (0 min)
|
||||
- Resolution: 15:18 (0 min)
|
||||
- Detection: 15:25 (0 min)
|
||||
- Resolution: 15:30 (5 min)
|
||||
- Triage: 15:40 (0 min)
|
||||
|
||||
================================================================================
|
||||
@@ -0,0 +1,591 @@
|
||||
# Incident Communication Templates
|
||||
|
||||
## Overview
|
||||
|
||||
This document provides standardized communication templates for incident response. These templates ensure consistent, clear communication across different severity levels and stakeholder groups.
|
||||
|
||||
## Template Usage Guidelines
|
||||
|
||||
### General Principles
|
||||
1. **Be Clear and Concise** - Use simple language, avoid jargon
|
||||
2. **Be Factual** - Only state what is known, avoid speculation
|
||||
3. **Be Timely** - Send updates at committed intervals
|
||||
4. **Be Actionable** - Include next steps and expected timelines
|
||||
5. **Be Accountable** - Include contact information for follow-up
|
||||
|
||||
### Template Selection
|
||||
- Choose templates based on incident severity and audience
|
||||
- Customize templates with specific incident details
|
||||
- Always include next update time and contact information
|
||||
- Switch to the higher-severity templates when an incident's severity increases
|
||||
|
||||
---
|
||||
|
||||
## SEV1 Templates
|
||||
|
||||
### Initial Alert - Internal Teams
|
||||
|
||||
**Subject:** 🚨 [SEV1] CRITICAL: {Service} Complete Outage - Immediate Response Required
|
||||
|
||||
```
|
||||
CRITICAL INCIDENT ALERT - IMMEDIATE ATTENTION REQUIRED
|
||||
|
||||
Incident Summary:
|
||||
- Service: {Service Name}
|
||||
- Status: Complete Outage
|
||||
- Start Time: {Timestamp}
|
||||
- Customer Impact: {Impact Description}
|
||||
- Estimated Affected Users: {Number/Percentage}
|
||||
|
||||
Immediate Actions Needed:
|
||||
✓ Incident Commander: {Name} - ASSIGNED
|
||||
✓ War Room: {Bridge/Chat Link} - JOIN NOW
|
||||
✓ On-Call Response: {Team} - PAGED
|
||||
⏳ Executive Notification: In progress
|
||||
⏳ Status Page Update: Within 15 minutes
|
||||
|
||||
Current Situation:
|
||||
{Brief description of what we know}
|
||||
|
||||
What We're Doing:
|
||||
{Immediate response actions being taken}
|
||||
|
||||
Next Update: {Timestamp - 15 minutes from now}
|
||||
|
||||
Incident Commander: {Name}
|
||||
Contact: {Phone/Slack}
|
||||
|
||||
THIS IS A CUSTOMER-IMPACTING INCIDENT REQUIRING IMMEDIATE ATTENTION
|
||||
```
|
||||
|
||||
### Executive Notification - SEV1
|
||||
|
||||
**Subject:** 🚨 URGENT: Customer-Impacting Outage - {Service}
|
||||
|
||||
```
|
||||
EXECUTIVE ALERT: Critical customer-facing incident
|
||||
|
||||
Service: {Service Name}
|
||||
Impact: {Customer impact description}
|
||||
Duration: {Current duration} (started {start time})
|
||||
Business Impact: {Revenue/SLA/compliance implications}
|
||||
|
||||
Customer Impact Summary:
|
||||
- Affected Users: {Number/percentage}
|
||||
- Revenue Impact: {$ amount if known}
|
||||
- SLA Status: {Breach status}
|
||||
- Customer Escalations: {Number if any}
|
||||
|
||||
Response Status:
|
||||
- Incident Commander: {Name} ({contact})
|
||||
- Response Team Size: {Number of engineers}
|
||||
- Root Cause: {If known, otherwise "Under investigation"}
|
||||
- ETA to Resolution: {If known, otherwise "Investigating"}
|
||||
|
||||
Executive Actions Required:
|
||||
- [ ] Customer communication approval needed
|
||||
- [ ] Legal/compliance notification: {If applicable}
|
||||
- [ ] PR/Media response preparation: {If needed}
|
||||
- [ ] Resource allocation decisions: {If escalation needed}
|
||||
|
||||
War Room: {Link}
|
||||
Next Update: {15 minutes from now}
|
||||
|
||||
This incident meets SEV1 criteria and requires executive oversight.
|
||||
|
||||
{Incident Commander contact information}
|
||||
```
|
||||
|
||||
### Customer Communication - SEV1
|
||||
|
||||
**Subject:** Service Disruption - Immediate Action Being Taken
|
||||
|
||||
```
|
||||
We are currently experiencing a service disruption affecting {service description}.
|
||||
|
||||
What's Happening:
|
||||
{Clear, customer-friendly description of the issue}
|
||||
|
||||
Impact:
|
||||
{What customers are experiencing - be specific}
|
||||
|
||||
What We're Doing:
|
||||
We detected this issue at {time} and immediately mobilized our engineering team. We are actively working to resolve this issue and will provide updates every 15 minutes.
|
||||
|
||||
Current Actions:
|
||||
• {Action 1 - customer-friendly description}
|
||||
• {Action 2 - customer-friendly description}
|
||||
• {Action 3 - customer-friendly description}
|
||||
|
||||
Workaround:
|
||||
{If available, provide clear steps}
|
||||
{If not available: "We are working on alternative solutions and will share them as soon as available."}
|
||||
|
||||
Next Update: {Timestamp}
|
||||
Status Page: {Link}
|
||||
Support: {Contact information if different from usual}
|
||||
|
||||
We sincerely apologize for the inconvenience and are committed to resolving this as quickly as possible.
|
||||
|
||||
{Company Name} Team
|
||||
```
|
||||
|
||||
### Status Page Update - SEV1
|
||||
|
||||
**Status:** Major Outage
|
||||
|
||||
```
|
||||
{Timestamp} - Investigating
|
||||
|
||||
We are currently investigating reports of {service} being unavailable. Our team has been alerted and is actively investigating the cause.
|
||||
|
||||
Affected Services: {List of affected services}
|
||||
Impact: {Customer-facing impact description}
|
||||
|
||||
We will provide an update within 15 minutes.
|
||||
```
|
||||
|
||||
```
|
||||
{Timestamp} - Identified
|
||||
|
||||
We have identified the cause of the {service} outage. Our engineering team is implementing a fix.
|
||||
|
||||
Root Cause: {Brief, customer-friendly explanation}
|
||||
Expected Resolution: {Timeline if known}
|
||||
|
||||
Next update in 15 minutes.
|
||||
```
|
||||
|
||||
```
|
||||
{Timestamp} - Monitoring
|
||||
|
||||
The fix has been implemented and we are monitoring the service recovery.
|
||||
|
||||
Current Status: {Recovery progress}
|
||||
Next Steps: {What we're monitoring}
|
||||
|
||||
We expect full service restoration within {timeframe}.
|
||||
```
|
||||
|
||||
```
|
||||
{Timestamp} - Resolved
|
||||
|
||||
{Service} is now fully operational. We have confirmed that all functionality is working as expected.
|
||||
|
||||
Total Duration: {Duration}
|
||||
Root Cause: {Brief summary}
|
||||
|
||||
We apologize for the inconvenience. A full post-incident review will be conducted and shared within 24 hours.
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## SEV2 Templates
|
||||
|
||||
### Team Notification - SEV2
|
||||
|
||||
**Subject:** ⚠️ [SEV2] {Service} Performance Issues - Response Team Mobilizing
|
||||
|
||||
```
|
||||
SEV2 INCIDENT: Performance degradation requiring active response
|
||||
|
||||
Incident Details:
|
||||
- Service: {Service Name}
|
||||
- Issue: {Description of performance issue}
|
||||
- Start Time: {Timestamp}
|
||||
- Affected Users: {Percentage/description}
|
||||
- Business Impact: {Impact on business operations}
|
||||
|
||||
Current Status:
|
||||
{What we know about the issue}
|
||||
|
||||
Response Team:
|
||||
- Incident Commander: {Name} ({contact})
|
||||
- Primary Responder: {Name} ({team})
|
||||
- Supporting Teams: {List of engaged teams}
|
||||
|
||||
Immediate Actions:
|
||||
✓ {Action 1 - completed}
|
||||
⏳ {Action 2 - in progress}
|
||||
⏳ {Action 3 - next step}
|
||||
|
||||
Metrics:
|
||||
- Error Rate: {Current vs normal}
|
||||
- Response Time: {Current vs normal}
|
||||
- Throughput: {Current vs normal}
|
||||
|
||||
Communication Plan:
|
||||
- Internal Updates: Every 30 minutes
|
||||
- Stakeholder Notification: {If needed}
|
||||
- Status Page Update: {Planned/not needed}
|
||||
|
||||
Coordination Channel: {Slack channel}
|
||||
Next Update: {30 minutes from now}
|
||||
|
||||
Incident Commander: {Name} | {Contact}
|
||||
```
|
||||
|
||||
### Stakeholder Update - SEV2
|
||||
|
||||
**Subject:** [SEV2] Service Performance Update - {Service}
|
||||
|
||||
```
|
||||
Service Performance Incident Update
|
||||
|
||||
Service: {Service Name}
|
||||
Duration: {Current duration}
|
||||
Impact: {Description of user impact}
|
||||
|
||||
Current Status:
|
||||
{Brief status of the incident and response efforts}
|
||||
|
||||
What We Know:
|
||||
• {Key finding 1}
|
||||
• {Key finding 2}
|
||||
• {Key finding 3}
|
||||
|
||||
What We're Doing:
|
||||
• {Response action 1}
|
||||
• {Response action 2}
|
||||
• {Monitoring/verification steps}
|
||||
|
||||
Customer Impact:
|
||||
{Realistic assessment of what users are experiencing}
|
||||
|
||||
Workaround:
|
||||
{If available, provide steps}
|
||||
|
||||
Expected Resolution:
|
||||
{Timeline if known, otherwise "Continuing investigation"}
|
||||
|
||||
Next Update: {30 minutes}
|
||||
Contact: {Incident Commander information}
|
||||
|
||||
This incident is being actively managed and does not currently require escalation.
|
||||
```
|
||||
|
||||
### Customer Communication - SEV2 (Optional)
|
||||
|
||||
**Subject:** Temporary Service Performance Issues
|
||||
|
||||
```
|
||||
We are currently experiencing performance issues with {service name} that may affect your experience.
|
||||
|
||||
What You Might Notice:
|
||||
{Specific symptoms users might experience}
|
||||
|
||||
What We're Doing:
|
||||
Our team identified this issue at {time} and is actively working on a resolution. We expect to have this resolved within {timeframe}.
|
||||
|
||||
Workaround:
|
||||
{If applicable, provide simple workaround steps}
|
||||
|
||||
We will update our status page at {link} with progress information.
|
||||
|
||||
Thank you for your patience as we work to resolve this issue quickly.
|
||||
|
||||
{Company Name} Support Team
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## SEV3 Templates
|
||||
|
||||
### Team Assignment - SEV3
|
||||
|
||||
**Subject:** [SEV3] Issue Assignment - {Component} Issue
|
||||
|
||||
```
|
||||
SEV3 Issue Assignment
|
||||
|
||||
Service/Component: {Affected component}
|
||||
Issue: {Description}
|
||||
Reported: {Timestamp}
|
||||
Reporter: {Person/system that reported}
|
||||
|
||||
Issue Details:
|
||||
{Detailed description of the problem}
|
||||
|
||||
Impact Assessment:
|
||||
- Affected Users: {Scope}
|
||||
- Business Impact: {Assessment}
|
||||
- Urgency: {Business hours response appropriate}
|
||||
|
||||
Assignment:
|
||||
- Primary: {Engineer name}
|
||||
- Team: {Responsible team}
|
||||
- Expected Response: {Within 2-4 hours}
|
||||
|
||||
Investigation Plan:
|
||||
1. {Investigation step 1}
|
||||
2. {Investigation step 2}
|
||||
3. {Communication checkpoint}
|
||||
|
||||
Workaround:
|
||||
{If known, otherwise "Investigating alternatives"}
|
||||
|
||||
This issue will be tracked in {ticket system} as {ticket number}.
|
||||
|
||||
Team Lead: {Name} | {Contact}
|
||||
```
|
||||
|
||||
### Status Update - SEV3
|
||||
|
||||
**Subject:** [SEV3] Progress Update - {Component}
|
||||
|
||||
```
|
||||
SEV3 Issue Progress Update
|
||||
|
||||
Issue: {Brief description}
|
||||
Assigned to: {Engineer/Team}
|
||||
Investigation Status: {Current progress}
|
||||
|
||||
Findings So Far:
|
||||
{What has been discovered during investigation}
|
||||
|
||||
Next Steps:
|
||||
{Planned actions and timeline}
|
||||
|
||||
Impact Update:
|
||||
{Any changes to scope or urgency}
|
||||
|
||||
Expected Resolution:
|
||||
{Timeline if known}
|
||||
|
||||
This issue continues to be tracked as SEV3 with no escalation required.
|
||||
|
||||
Contact: {Assigned engineer} | {Team lead}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## SEV4 Templates
|
||||
|
||||
### Issue Documentation - SEV4
|
||||
|
||||
**Subject:** [SEV4] Issue Documented - {Description}
|
||||
|
||||
```
|
||||
SEV4 Issue Logged
|
||||
|
||||
Description: {Clear description of the issue}
|
||||
Reporter: {Name/system}
|
||||
Date: {Date reported}
|
||||
|
||||
Impact:
|
||||
{Minimal impact description}
|
||||
|
||||
Priority Assessment:
|
||||
This issue has been classified as SEV4 and will be addressed in the normal development cycle.
|
||||
|
||||
Assignment:
|
||||
- Team: {Responsible team}
|
||||
- Sprint: {Target sprint}
|
||||
- Estimated Effort: {Story points/hours}
|
||||
|
||||
This issue is tracked as {ticket number} in {system}.
|
||||
|
||||
Product Owner: {Name}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Escalation Templates
|
||||
|
||||
### Severity Escalation
|
||||
|
||||
**Subject:** ESCALATION: {Original Severity} → {New Severity} - {Service}
|
||||
|
||||
```
|
||||
SEVERITY ESCALATION NOTIFICATION
|
||||
|
||||
Original Classification: {Original severity}
|
||||
New Classification: {New severity}
|
||||
Escalation Time: {Timestamp}
|
||||
Escalated By: {Name and role}
|
||||
|
||||
Escalation Reasons:
|
||||
• {Reason 1 - scope expansion/duration/impact}
|
||||
• {Reason 2}
|
||||
• {Reason 3}
|
||||
|
||||
Updated Impact:
|
||||
{New assessment of customer/business impact}
|
||||
|
||||
Updated Response Requirements:
|
||||
{New response team, communication frequency, etc.}
|
||||
|
||||
Previous Response Actions:
|
||||
{Summary of actions taken under previous severity}
|
||||
|
||||
New Incident Commander: {If changed}
|
||||
Updated Communication Plan: {New frequency/recipients}
|
||||
|
||||
All stakeholders should adjust response according to {new severity} protocols.
|
||||
|
||||
Incident Commander: {Name} | {Contact}
|
||||
```
|
||||
|
||||
### Management Escalation
|
||||
|
||||
**Subject:** MANAGEMENT ESCALATION: Extended {Severity} Incident - {Service}
|
||||
|
||||
```
|
||||
Management Escalation Required
|
||||
|
||||
Incident: {Service} {brief description}
|
||||
Original Severity: {Severity}
|
||||
Duration: {Current duration}
|
||||
Escalation Trigger: {Duration threshold/scope change/customer escalation}
|
||||
|
||||
Current Status:
|
||||
{Brief status of incident response}
|
||||
|
||||
Challenges Encountered:
|
||||
• {Challenge 1}
|
||||
• {Challenge 2}
|
||||
• {Resource/expertise needs}
|
||||
|
||||
Business Impact:
|
||||
{Updated assessment of business implications}
|
||||
|
||||
Management Decision Required:
|
||||
• {Decision 1 - resource allocation/external expertise/communication}
|
||||
• {Decision 2}
|
||||
|
||||
Recommended Actions:
|
||||
{Incident Commander's recommendations}
|
||||
|
||||
This escalation follows standard procedures for {trigger type}.
|
||||
|
||||
Incident Commander: {Name}
|
||||
Contact: {Phone/Slack}
|
||||
War Room: {Link}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution Templates
|
||||
|
||||
### Resolution Confirmation - All Severities
|
||||
|
||||
**Subject:** RESOLVED: [{Severity}] {Service} Incident - {Brief Description}
|
||||
|
||||
```
|
||||
INCIDENT RESOLVED
|
||||
|
||||
Service: {Service Name}
|
||||
Issue: {Brief description}
|
||||
Duration: {Total duration}
|
||||
Resolution Time: {Timestamp}
|
||||
|
||||
Resolution Summary:
|
||||
{Brief description of how the issue was resolved}
|
||||
|
||||
Root Cause:
|
||||
{Brief explanation - detailed PIR to follow}
|
||||
|
||||
Impact Summary:
|
||||
- Users Affected: {Final count/percentage}
|
||||
- Business Impact: {Final assessment}
|
||||
- Services Affected: {List}
|
||||
|
||||
Resolution Actions Taken:
|
||||
• {Action 1}
|
||||
• {Action 2}
|
||||
• {Verification steps}
|
||||
|
||||
Monitoring:
|
||||
We will continue monitoring {service} for {duration} to ensure stability.
|
||||
|
||||
Next Steps:
|
||||
• Post-incident review scheduled for {date}
|
||||
• Action items to be tracked in {system}
|
||||
• Follow-up communication: {If needed}
|
||||
|
||||
Thank you to everyone who participated in the incident response.
|
||||
|
||||
Incident Commander: {Name}
|
||||
```
|
||||
|
||||
### Customer Resolution Communication
|
||||
|
||||
**Subject:** Service Restored - Thank You for Your Patience
|
||||
|
||||
```
|
||||
Service Update: Issue Resolved
|
||||
|
||||
We're pleased to report that the {service} issues have been fully resolved as of {timestamp}.
|
||||
|
||||
What Was Fixed:
|
||||
{Customer-friendly explanation of the resolution}
|
||||
|
||||
Duration:
|
||||
The issue lasted {duration} from {start time} to {end time}.
|
||||
|
||||
What We Learned:
|
||||
{Brief, high-level takeaway}
|
||||
|
||||
Our Commitment:
|
||||
We are conducting a thorough review of this incident and will implement improvements to prevent similar issues in the future. A summary of our findings and improvements will be shared {timeframe}.
|
||||
|
||||
We sincerely apologize for any inconvenience this may have caused and appreciate your patience while we worked to resolve the issue.
|
||||
|
||||
If you continue to experience any problems, please contact our support team at {contact information}.
|
||||
|
||||
Thank you,
|
||||
{Company Name} Team
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Template Customization Guidelines
|
||||
|
||||
### Placeholders to Always Replace
|
||||
- `{Service}` / `{Service Name}` - Specific service or component
|
||||
- `{Timestamp}` - Specific date/time in a consistent format
|
||||
- `{Name}` / `{Contact}` - Actual names and contact information
|
||||
- `{Duration}` - Actual time durations
|
||||
- `{Link}` - Real URLs to war rooms, status pages, etc.
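
A minimal sketch of how the placeholder rule above could be checked before a message goes out. The templates use names with spaces such as `{Service Name}`, so this uses a regular expression rather than `str.format`. The function name `fill_template` and the `values` mapping are hypothetical and are not part of the skill's scripts.

```python
import re

# Matches {Placeholder} tokens, including names with spaces or slashes.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def fill_template(template: str, values: dict) -> tuple:
    """Replace known {Placeholder} tokens and list the ones left unfilled."""
    missing = []

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key in values:
            return values[key]
        if key not in missing:
            missing.append(key)
        # Leave the token in place so the gap stays visible to the sender.
        return match.group(0)

    return PLACEHOLDER.sub(substitute, template), missing


text, missing = fill_template(
    "Service: {Service Name}\nNext Update: {Timestamp}",
    {"Service Name": "web-frontend"},
)
print(text)
print("Unfilled placeholders:", missing)
```

A non-empty `missing` list could be treated as a reason to hold the message until every placeholder has a real value.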
|
||||
|
||||
### Language Guidelines
|
||||
- Use active voice ("We are investigating" not "The issue is being investigated")
|
||||
- Be specific about timelines ("within 30 minutes" not "soon")
|
||||
- Avoid technical jargon in customer communications
|
||||
- Include empathy in customer-facing messages
|
||||
- Use consistent terminology throughout the incident lifecycle
|
||||
|
||||
### Timing Guidelines
|
||||
| Severity | Initial Notification | Update Frequency | Resolution Notification |
|----------|---------------------|------------------|------------------------|
| SEV1 | Immediate (< 5 min) | Every 15 minutes | Immediate |
| SEV2 | Within 15 minutes | Every 30 minutes | Within 15 minutes |
| SEV3 | Within 2 hours | At milestones | Within 1 hour |
| SEV4 | Within 1 business day | Weekly | When resolved |
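
The update frequencies in the table above can be turned into a "next update due" calculation, which is the value the templates ask for in `Next Update`. This is only a sketch: `UPDATE_INTERVALS` and `next_update_due` are hypothetical names, and SEV3 is modelled as `None` because its updates are milestone-driven rather than timed.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Update frequency per severity, taken from the Timing Guidelines table.
UPDATE_INTERVALS = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": None,  # "At milestones": no fixed interval
    "SEV4": timedelta(weeks=1),
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next update is due, or None for milestone-based updates."""
    interval = UPDATE_INTERVALS[severity]
    if interval is None:
        return None
    return last_update + interval


last = datetime(2024, 3, 15, 14, 30, tzinfo=timezone.utc)
print(next_update_due("SEV2", last))
```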
|
||||
|
||||
### Audience-Specific Considerations
|
||||
|
||||
#### Engineering Teams
|
||||
- Include technical details
|
||||
- Provide specific metrics and logs
|
||||
- Include coordination channels
|
||||
- List specific actions and owners
|
||||
|
||||
#### Executive/Business
|
||||
- Focus on business impact
|
||||
- Include customer and revenue implications
|
||||
- Provide clear timeline and resource needs
|
||||
- Highlight any external factors (PR, legal, compliance)
|
||||
|
||||
#### Customers
|
||||
- Use plain language
|
||||
- Focus on customer impact and workarounds
|
||||
- Provide realistic timelines
|
||||
- Include support contact information
|
||||
- Show empathy and accountability
|
||||
|
||||
---
|
||||
|
||||
**Last Updated:** February 2026
|
||||
**Next Review:** May 2026
|
||||
**Owner:** Incident Management Team
|
||||
@@ -0,0 +1,292 @@
|
||||
# Incident Severity Classification Matrix
|
||||
|
||||
## Overview
|
||||
|
||||
This document defines the severity classification system used for incident response. The classification determines response requirements, escalation paths, and communication frequency.
|
||||
|
||||
## Severity Levels
|
||||
|
||||
### SEV1 - Critical Outage
|
||||
|
||||
**Definition:** Complete service failure affecting all users or critical business functions
|
||||
|
||||
#### Impact Criteria
|
||||
- Customer-facing services completely unavailable
|
||||
- Data loss or corruption affecting users
|
||||
- Security breaches with customer data exposure
|
||||
- Revenue-generating systems down
|
||||
- SLA violations with financial penalties
|
||||
- > 75% of users affected
|
||||
|
||||
#### Response Requirements
|
||||
| Metric | Requirement |
|--------|-------------|
| **Response Time** | Immediate (0-5 minutes) |
| **Incident Commander** | Assigned within 5 minutes |
| **War Room** | Established within 10 minutes |
| **Executive Notification** | Within 15 minutes |
| **Public Status Page** | Updated within 15 minutes |
| **Customer Communication** | Within 30 minutes |
|
||||
|
||||
#### Escalation Path
|
||||
1. **Immediate**: On-call Engineer → Incident Commander
|
||||
2. **15 minutes**: VP Engineering + Customer Success VP
|
||||
3. **30 minutes**: CTO
|
||||
4. **60 minutes**: CEO + Full Executive Team
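
As an illustration, the SEV1 escalation path above can be expressed as time thresholds, so that a tool can report who should already be engaged at a given point in the incident. The names `SEV1_ESCALATION_PATH` and `contacts_to_engage` are hypothetical, and the sketch assumes elapsed time is measured in minutes from incident start.

```python
# (minutes since incident start, who should be engaged by then)
SEV1_ESCALATION_PATH = [
    (0, "Incident Commander"),
    (15, "VP Engineering + Customer Success VP"),
    (30, "CTO"),
    (60, "CEO + Full Executive Team"),
]


def contacts_to_engage(elapsed_minutes: float) -> list:
    """List every escalation tier whose threshold has been reached."""
    return [
        contact
        for threshold, contact in SEV1_ESCALATION_PATH
        if elapsed_minutes >= threshold
    ]


print(contacts_to_engage(20))
```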
|
||||
|
||||
#### Communication Requirements
|
||||
- **Frequency**: Every 15 minutes until resolution
|
||||
- **Channels**: PagerDuty, Phone, Slack, Email, Status Page
|
||||
- **Recipients**: All engineering, executives, customer success
|
||||
- **Template**: SEV1 Executive Alert Template
|
||||
|
||||
---
|
||||
|
||||
### SEV2 - Major Impact
|
||||
|
||||
**Definition:** Significant degradation affecting a subset of users or non-critical functions
|
||||
|
||||
#### Impact Criteria
|
||||
- Partial service degradation (25-75% of users affected)
|
||||
- Performance issues causing user frustration
|
||||
- Non-critical features unavailable
|
||||
- Internal tool issues impacting productivity
|
||||
- Data inconsistencies not affecting user experience
|
||||
- API errors affecting integrations
|
||||
|
||||
#### Response Requirements
|
||||
| Metric | Requirement |
|--------|-------------|
| **Response Time** | 15 minutes |
| **Incident Commander** | Assigned within 30 minutes |
| **Status Page Update** | Within 30 minutes |
| **Stakeholder Notification** | Within 1 hour |
| **Team Assembly** | Within 30 minutes |
|
||||
|
||||
#### Escalation Path
|
||||
1. **Immediate**: On-call Engineer → Team Lead
|
||||
2. **30 minutes**: Engineering Manager
|
||||
3. **2 hours**: VP Engineering
|
||||
4. **4 hours**: CTO (if unresolved)
|
||||
|
||||
#### Communication Requirements
|
||||
- **Frequency**: Every 30 minutes during active response
|
||||
- **Channels**: PagerDuty, Slack, Email
|
||||
- **Recipients**: Engineering team, product team, relevant stakeholders
|
||||
- **Template**: SEV2 Major Impact Template
|
||||
|
||||
---
|
||||
|
||||
### SEV3 - Minor Impact
|
||||
|
||||
**Definition:** Limited impact with workarounds available
|
||||
|
||||
#### Impact Criteria
|
||||
- Single feature or component affected
|
||||
- < 25% of users impacted
|
||||
- Workarounds available
|
||||
- Performance degradation not significantly impacting UX
|
||||
- Non-urgent monitoring alerts
|
||||
- Development/test environment issues
|
||||
|
||||
#### Response Requirements
|
||||
| Metric | Requirement |
|--------|-------------|
| **Response Time** | 2 hours (business hours) |
| **After Hours Response** | Next business day |
| **Team Assignment** | Within 4 hours |
| **Status Page Update** | Optional |
| **Internal Notification** | Within 2 hours |
|
||||
|
||||
#### Escalation Path
|
||||
1. **Immediate**: Assigned Engineer
|
||||
2. **4 hours**: Team Lead
|
||||
3. **1 business day**: Engineering Manager (if needed)
|
||||
|
||||
#### Communication Requirements
|
||||
- **Frequency**: At key milestones only
|
||||
- **Channels**: Slack, Email
|
||||
- **Recipients**: Assigned team, team lead
|
||||
- **Template**: SEV3 Minor Impact Template
|
||||
|
||||
---
|
||||
|
||||
### SEV4 - Low Impact
|
||||
|
||||
**Definition:** Minimal impact, cosmetic issues, or planned maintenance
|
||||
|
||||
#### Impact Criteria
|
||||
- Cosmetic bugs
|
||||
- Documentation issues
|
||||
- Logging or monitoring gaps
|
||||
- Performance issues with no user impact
|
||||
- Development/test environment issues
|
||||
- Feature requests or enhancements
|
||||
|
||||
#### Response Requirements
|
||||
| Metric | Requirement |
|--------|-------------|
| **Response Time** | 1-2 business days |
| **Assignment** | Next sprint planning |
| **Tracking** | Standard ticket system |
| **Escalation** | None required |
|
||||
|
||||
#### Communication Requirements
|
||||
- **Frequency**: Standard development cycle updates
|
||||
- **Channels**: Ticket system
|
||||
- **Recipients**: Product owner, assigned developer
|
||||
- **Template**: Standard issue template
|
||||
|
||||
## Classification Guidelines
|
||||
|
||||
### User Impact Assessment
|
||||
|
||||
| Impact Scope | Description | Typical Severity |
|--------------|-------------|------------------|
| **All or Most Users** | > 75% of users affected | SEV1 |
| **Major Subset** | 50-75% of users affected | SEV1/SEV2 |
| **Significant Subset** | 25-50% of users affected | SEV2 |
| **Limited Users** | 5-25% of users affected | SEV2/SEV3 |
| **Few Users** | < 5% of users affected | SEV3/SEV4 |
| **No User Impact** | Internal only | SEV4 |
|
||||
|
||||
### Business Impact Assessment
|
||||
|
||||
| Business Impact | Description | Severity Boost |
|-----------------|-------------|----------------|
| **Revenue Loss** | Direct revenue impact | +1 severity level |
| **SLA Breach** | Contract violations | +1 severity level |
| **Regulatory** | Compliance implications | +1 severity level |
| **Brand Damage** | Public-facing issues | +1 severity level |
| **Security** | Data or system security | +2 severity levels |
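
The boosts in the table above could be applied roughly as follows. This sketch makes two assumptions that the table does not state: boosts do not stack (only the largest applicable boost is used), and the result is capped at SEV1. Severity is represented as an integer from 1 (SEV1) to 4 (SEV4), so a boost lowers the number. `SEVERITY_BOOST` and `apply_business_impact` are hypothetical names.

```python
from typing import Iterable

# Severity levels to raise per business impact, from the table above.
SEVERITY_BOOST = {
    "revenue_loss": 1,
    "sla_breach": 1,
    "regulatory": 1,
    "brand_damage": 1,
    "security": 2,
}


def apply_business_impact(base_level: int, impacts: Iterable[str]) -> int:
    """Raise severity by the largest applicable boost (assumed non-stacking)."""
    boost = max((SEVERITY_BOOST[impact] for impact in impacts), default=0)
    # Lower number means higher severity; never go past SEV1.
    return max(1, base_level - boost)


print("SEV%d" % apply_business_impact(3, ["sla_breach"]))
```

If the team decides that boosts should stack, replacing `max(...)` with `sum(...)` is the only change needed.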
|
||||
|
||||
### Duration Considerations
|
||||
|
||||
| Duration | Impact on Classification |
|----------|--------------------------|
| **< 15 minutes** | May reduce severity by 1 level |
| **15-60 minutes** | Standard classification |
| **1-4 hours** | May increase severity by 1 level |
| **> 4 hours** | Significant severity increase |
|
||||
|
||||
## Decision Tree
|
||||
|
||||
```
|
||||
1. Is this a security incident with data exposure?
|
||||
→ YES: SEV1 (regardless of user count)
|
||||
→ NO: Continue to step 2
|
||||
|
||||
2. Are revenue-generating services completely down?
|
||||
→ YES: SEV1
|
||||
→ NO: Continue to step 3
|
||||
|
||||
3. What percentage of users are affected?
|
||||
→ > 75%: SEV1
|
||||
→ 25-75%: SEV2
|
||||
→ 5-25%: SEV3
|
||||
→ < 5%: SEV4
|
||||
|
||||
4. Apply business impact modifiers
|
||||
5. Consider duration factors
|
||||
6. When in doubt, err on the side of higher severity
|
||||
```
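
Steps 1-3 of the decision tree can be sketched as a small function. The tree does not say which band the exact boundary values (25% and 5%) fall into, so this sketch assigns them to the more severe band, in line with step 6. The function name `base_severity` and its parameters are hypothetical, and steps 4-6 (modifiers and judgment) are deliberately left out.

```python
def base_severity(
    data_exposure: bool,
    revenue_services_down: bool,
    percent_users_affected: float,
) -> str:
    """Return the base severity from steps 1-3 of the decision tree."""
    # Step 1: security incident with data exposure, regardless of user count.
    if data_exposure:
        return "SEV1"
    # Step 2: revenue-generating services completely down.
    if revenue_services_down:
        return "SEV1"
    # Step 3: percentage of users affected.
    if percent_users_affected > 75:
        return "SEV1"
    if percent_users_affected >= 25:  # boundary assigned to the higher severity
        return "SEV2"
    if percent_users_affected >= 5:  # boundary assigned to the higher severity
        return "SEV3"
    return "SEV4"


print(base_severity(False, False, 40))
```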
|
||||
|
||||
## Examples
|
||||
|
||||
### SEV1 Examples
|
||||
- Payment processing system completely down
|
||||
- All user authentication failing
|
||||
- Database corruption causing data loss
|
||||
- Security breach with customer data exposed
|
||||
- Website returning 500 errors for all users
|
||||
|
||||
### SEV2 Examples
|
||||
- Payment processing slow (30-second delays)
|
||||
- Search functionality returning incomplete results
|
||||
- API rate limits causing partner integration issues
|
||||
- Dashboard displaying stale data (> 1 hour old)
|
||||
- Mobile app crashing for 40% of users
|
||||
|
||||
### SEV3 Examples
|
||||
- Single feature in admin panel not working
|
||||
- Email notifications delayed by 1 hour
|
||||
- Non-critical API endpoint returning errors
|
||||
- Cosmetic UI bug in settings page
|
||||
- Development environment deployment failing
|
||||
|
||||
### SEV4 Examples
|
||||
- Typo in help documentation
|
||||
- Log format change needed for analysis
|
||||
- Non-critical performance optimization
|
||||
- Internal tool enhancement request
|
||||
- Test data cleanup needed
|
||||
|
||||
## Escalation Triggers
|
||||
|
||||
### Automatic Escalation
|
||||
- SEV1 incidents automatically escalate every 30 minutes if unresolved
|
||||
- SEV2 incidents escalate after 2 hours without significant progress
|
||||
- Any incident whose scope expands is raised in severity
|
||||
- Customer escalation to support triggers a severity review
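
The two time-based rules in this list can be checked mechanically. In this sketch, `since_checkpoint` is an assumed input: for SEV1 it means the time since the last escalation (or since incident start), and for SEV2 the time since the last significant progress. `ESCALATION_THRESHOLDS` and `escalation_due` are hypothetical names; the scope-expansion and customer-escalation rules need human judgment and are not modelled.

```python
from datetime import timedelta

# Time-based automatic escalation rules from the list above.
ESCALATION_THRESHOLDS = {
    "SEV1": timedelta(minutes=30),  # every 30 minutes if unresolved
    "SEV2": timedelta(hours=2),     # after 2 hours without significant progress
}


def escalation_due(severity: str, since_checkpoint: timedelta) -> bool:
    """Return True when a time-based automatic escalation should fire."""
    threshold = ESCALATION_THRESHOLDS.get(severity)
    # SEV3 and SEV4 have no time-based automatic escalation in this list.
    return threshold is not None and since_checkpoint >= threshold


print(escalation_due("SEV2", timedelta(minutes=90)))
```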
|
||||
|
||||
### Manual Escalation
|
||||
- Incident Commander can escalate at any time
|
||||
- Technical leads can request escalation
|
||||
- Business stakeholders can request severity review
|
||||
- External factors (media attention, regulatory) trigger escalation
|
||||
|
||||
## Communication Templates
|
||||
|
||||
### SEV1 Executive Alert
|
||||
```
|
||||
Subject: 🚨 CRITICAL INCIDENT - [Service] Complete Outage
|
||||
|
||||
URGENT: Customer-facing service outage requiring immediate attention
|
||||
|
||||
Service: [Service Name]
|
||||
Start Time: [Timestamp]
|
||||
Impact: [Description of customer impact]
|
||||
Estimated Affected Users: [Number/Percentage]
|
||||
Business Impact: [Revenue/SLA/Brand implications]
|
||||
|
||||
Incident Commander: [Name] ([Contact])
|
||||
Response Team: [Team members engaged]
|
||||
|
||||
Current Status: [Brief status update]
|
||||
Next Update: [Timestamp - 15 minutes from now]
|
||||
War Room: [Bridge/Chat link]
|
||||
|
||||
This is a customer-impacting incident requiring executive awareness.
|
||||
```
|
||||
|
||||
### SEV2 Major Impact
|
||||
```
|
||||
Subject: ⚠️ [SEV2] [Service] - Major Performance Impact
|
||||
|
||||
Major service degradation affecting user experience
|
||||
|
||||
Service: [Service Name]
|
||||
Start Time: [Timestamp]
|
||||
Impact: [Description of user impact]
|
||||
Scope: [Affected functionality/users]
|
||||
|
||||
Response Team: [Team Lead] + [Team members]
|
||||
Status: [Current mitigation efforts]
|
||||
Workaround: [If available]
|
||||
|
||||
Next Update: 30 minutes
|
||||
Status Page: [Link if updated]
|
||||
```
|
||||
|
||||
## Review and Updates
|
||||
|
||||
This severity matrix should be reviewed quarterly and updated based on:
|
||||
- Incident response learnings
|
||||
- Business priority changes
|
||||
- Service architecture evolution
|
||||
- Regulatory requirement changes
|
||||
- Customer feedback and SLA updates
|
||||
|
||||
**Last Updated:** February 2026
|
||||
**Next Review:** May 2026
|
||||
**Owner:** Engineering Leadership
|
||||
@@ -0,0 +1,562 @@
|
||||
# Root Cause Analysis (RCA) Frameworks Guide
|
||||
|
||||
## Overview
|
||||
|
||||
This guide provides detailed instructions for applying various Root Cause Analysis frameworks during Post-Incident Reviews. Each framework offers a different perspective and approach to identifying underlying causes of incidents.
|
||||
|
||||
## Framework Selection Guidelines
|
||||
|
||||
| Incident Type | Recommended Framework | Why |
|---------------|----------------------|-----|
| **Process Failure** | 5 Whys | Simple, direct cause-effect chain |
| **Complex System Failure** | Fishbone + Timeline | Multiple contributing factors |
| **Human Error** | Fishbone | Systematic analysis of contributing factors |
| **Extended Incidents** | Timeline Analysis | Understanding decision points |
| **High-Risk Incidents** | Bow Tie | Comprehensive barrier analysis |
| **Recurring Issues** | 5 Whys + Fishbone | Deep dive into systemic issues |
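The selection table above can be expressed as a simple lookup. This is a minimal sketch, not code from the skill: the dictionary keys, the function name `recommend_frameworks`, and the fallback to 5 Whys for unknown incident types are all assumptions made for illustration.

```python
from typing import List

# Mirrors the selection table; keys are lower-cased incident types.
FRAMEWORKS_BY_INCIDENT_TYPE = {
    "process failure": ["5 Whys"],
    "complex system failure": ["Fishbone", "Timeline"],
    "human error": ["Fishbone"],
    "extended incident": ["Timeline"],
    "high-risk incident": ["Bow Tie"],
    "recurring issue": ["5 Whys", "Fishbone"],
}


def recommend_frameworks(incident_type: str) -> List[str]:
    """Look up recommended frameworks; fall back to 5 Whys (an assumed default)."""
    key = incident_type.strip().lower()
    return FRAMEWORKS_BY_INCIDENT_TYPE.get(key, ["5 Whys"])
```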
|
||||
|
||||
---
|
||||
|
||||
## 5 Whys Analysis Framework
|
||||
|
||||
### Purpose
|
||||
Iteratively drill down through cause-effect relationships to identify root causes.
|
||||
|
||||
### When to Use
|
||||
- Simple, linear cause-effect chains
|
||||
- Time-pressured analysis
|
||||
- Process-related failures
|
||||
- Individual component failures
|
||||
|
||||
### Process Steps
|
||||
|
||||
#### Step 1: Problem Statement
|
||||
Write a clear, specific problem statement.
|
||||
|
||||
**Good Example:**
|
||||
> "The payment API returned 500 errors for 2 hours on March 15, affecting 80% of checkout attempts."
|
||||
|
||||
**Poor Example:**
|
||||
> "The system was broken."
|
||||
|
||||
#### Step 2: First Why
|
||||
Ask why the problem occurred. Focus on immediate, observable causes.
|
||||
|
||||
**Example:**
|
||||
- **Why 1:** Why did the payment API return 500 errors?
|
||||
- **Answer:** The database connection pool was exhausted.
|
||||
|
||||
#### Step 3: Subsequent Whys
|
||||
For each answer, ask "why" again. Continue until you reach a root cause.
|
||||
|
||||
**Example Chain:**
|
||||
- **Why 2:** Why was the database connection pool exhausted?
|
||||
- **Answer:** The application was creating more connections than usual.
|
||||
|
||||
- **Why 3:** Why was the application creating more connections?
|
||||
- **Answer:** A new feature wasn't properly closing connections.
|
||||
|
||||
- **Why 4:** Why wasn't the feature properly closing connections?
|
||||
- **Answer:** Code review missed the connection leak pattern.
|
||||
|
||||
- **Why 5:** Why did code review miss this pattern?
|
||||
- **Answer:** We don't have automated checks for connection pooling best practices.
|
||||
|
||||
#### Step 4: Validation
|
||||
Verify that addressing the root cause would prevent the original problem.
|
||||
|
||||
### Best Practices
|
||||
|
||||
1. **Ask at least 3 "whys"** - Surface causes are rarely root causes
|
||||
2. **Focus on process failures, not people** - Avoid blame, focus on system improvements
|
||||
3. **Use evidence** - Support each answer with data or observations
|
||||
4. **Consider multiple paths** - Some problems have multiple root causes
|
||||
5. **Test the logic** - Work backwards from root cause to problem
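Two of these practices (asking at least three whys and backing each answer with evidence) are easy to check mechanically. The following is a sketch under assumptions: `WhyStep` and `review_chain` are hypothetical names, and the check only flags gaps; it does not judge whether the reasoning is sound.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WhyStep:
    """One question/answer pair in a 5 Whys chain."""
    question: str
    answer: str
    evidence: str = ""


def review_chain(steps: List[WhyStep], min_depth: int = 3) -> List[str]:
    """Return a list of problems found in the chain (empty if none)."""
    problems = []
    if len(steps) < min_depth:
        problems.append(f"only {len(steps)} whys asked; expected at least {min_depth}")
    for i, step in enumerate(steps, start=1):
        if not step.evidence.strip():
            problems.append(f"Why {i} has no supporting evidence")
    return problems
```

Running it against a draft chain before the review meeting gives the facilitator a short list of answers that still need data behind them.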
|
||||
|
||||
### Common Pitfalls
|
||||
|
||||
- **Stopping too early** - First few whys often reveal symptoms, not causes
|
||||
- **Single-cause assumption** - Complex systems often have multiple contributing factors
|
||||
- **Blame focus** - Focusing on individual mistakes rather than system failures
|
||||
- **Vague answers** - Use specific, actionable answers
|
||||
|
||||
### 5 Whys Template
|
||||
|
||||
```markdown
## 5 Whys Analysis

**Problem Statement:** [Clear description of the incident]

**Why 1:** [First why question]
**Answer:** [Specific, evidence-based answer]
**Evidence:** [Supporting data, logs, observations]

**Why 2:** [Second why question]
**Answer:** [Specific answer based on Why 1]
**Evidence:** [Supporting evidence]

[Continue for 3-7 iterations]

**Root Cause(s) Identified:**
1. [Primary root cause]
2. [Secondary root cause if applicable]

**Validation:** [Confirm that addressing root causes would prevent recurrence]
```
|
||||
|
||||
---
|
||||
|
||||
## Fishbone (Ishikawa) Diagram Framework
|
||||
|
||||
### Purpose
|
||||
Systematically analyze potential causes across multiple categories to identify contributing factors.
|
||||
|
||||
### When to Use
|
||||
- Complex incidents with multiple potential causes
|
||||
- When human factors are suspected
|
||||
- Systemic or organizational issues
|
||||
- When 5 Whys doesn't reveal clear root causes
|
||||
|
||||
### Categories
|
||||
|
||||
#### People (Human Factors)
- **Training and Skills**
  - Insufficient training on new systems
  - Lack of domain expertise
  - Skill gaps in team
  - Knowledge not shared across team

- **Communication**
  - Poor communication between teams
  - Unclear responsibilities
  - Information not reaching right people
  - Language/cultural barriers

- **Decision Making**
  - Decisions made under pressure
  - Insufficient information for decisions
  - Risk assessment inadequate
  - Approval processes bypassed
|
||||
|
||||
#### Process (Procedures and Workflows)
- **Documentation**
  - Outdated procedures
  - Missing runbooks
  - Unclear instructions
  - Process not documented

- **Change Management**
  - Inadequate change review
  - Rushed deployments
  - Insufficient testing
  - Rollback procedures unclear

- **Review and Approval**
  - Code review gaps
  - Architecture review skipped
  - Security review insufficient
  - Performance review missing
|
||||
|
||||
#### Technology (Systems and Tools)
- **Architecture**
  - Single points of failure
  - Insufficient redundancy
  - Scalability limitations
  - Tight coupling between systems

- **Monitoring and Alerting**
  - Missing monitoring
  - Alert fatigue
  - Inadequate thresholds
  - Poor alert routing

- **Tools and Automation**
  - Manual processes prone to error
  - Tool limitations
  - Automation gaps
  - Integration issues
|
||||
|
||||
#### Environment (External Factors)
- **Infrastructure**
  - Hardware failures
  - Network issues
  - Capacity limitations
  - Geographic dependencies

- **Dependencies**
  - Third-party service failures
  - External API changes
  - Vendor issues
  - Supply chain problems

- **External Pressure**
  - Time pressure from business
  - Resource constraints
  - Regulatory changes
  - Market conditions
|
||||
|
||||
### Process Steps
|
||||
|
||||
#### Step 1: Define the Problem
|
||||
Place the incident at the "head" of the fishbone diagram.
|
||||
|
||||
#### Step 2: Brainstorm Causes
|
||||
For each category, brainstorm potential contributing factors.
|
||||
|
||||
#### Step 3: Drill Down
|
||||
For each factor, ask what caused that factor (sub-causes).
|
||||
|
||||
#### Step 4: Identify Primary Causes
|
||||
Mark the most likely contributing factors based on evidence.
|
||||
|
||||
#### Step 5: Validate
|
||||
Gather evidence to support or refute each suspected cause.
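Steps 2 through 5 can be sketched as grouping brainstormed factors by category and then ranking them by supporting evidence. This is a rough illustration only: `Factor`, `evidence_count`, and both helper functions are hypothetical, and counting pieces of evidence is just one possible way to rank.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

CATEGORIES = ("People", "Process", "Technology", "Environment")


@dataclass
class Factor:
    category: str          # one of CATEGORIES
    description: str
    evidence_count: int    # assumed metric: number of logs/observations supporting it


def group_by_category(factors: List[Factor]) -> Dict[str, List[Factor]]:
    """Arrange factors along the four 'bones' of the diagram."""
    bones: Dict[str, List[Factor]] = defaultdict(list)
    for factor in factors:
        if factor.category not in CATEGORIES:
            raise ValueError(f"unknown category: {factor.category}")
        bones[factor.category].append(factor)
    return dict(bones)


def primary_factors(factors: List[Factor], top_n: int = 3) -> List[Factor]:
    """Keep only evidence-backed factors and return the strongest ones first."""
    supported = [f for f in factors if f.evidence_count > 0]
    return sorted(supported, key=lambda f: f.evidence_count, reverse=True)[:top_n]
```

Factors with no evidence are dropped from the primary list rather than deleted, so they remain visible on the diagram as open hypotheses.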
|
||||
|
||||
### Fishbone Template
|
||||
|
||||
```markdown
## Fishbone Analysis

**Problem:** [Incident description]

### People
**Training/Skills:**
- [Factor 1]: [Evidence/likelihood]
- [Factor 2]: [Evidence/likelihood]

**Communication:**
- [Factor 1]: [Evidence/likelihood]

**Decision Making:**
- [Factor 1]: [Evidence/likelihood]

### Process
**Documentation:**
- [Factor 1]: [Evidence/likelihood]

**Change Management:**
- [Factor 1]: [Evidence/likelihood]

**Review/Approval:**
- [Factor 1]: [Evidence/likelihood]

### Technology
**Architecture:**
- [Factor 1]: [Evidence/likelihood]

**Monitoring:**
- [Factor 1]: [Evidence/likelihood]

**Tools:**
- [Factor 1]: [Evidence/likelihood]

### Environment
**Infrastructure:**
- [Factor 1]: [Evidence/likelihood]

**Dependencies:**
- [Factor 1]: [Evidence/likelihood]

**External Factors:**
- [Factor 1]: [Evidence/likelihood]

### Primary Contributing Factors
1. [Factor with highest evidence/impact]
2. [Second most significant factor]
3. [Third most significant factor]

### Root Cause Hypothesis
[Synthesized explanation of how factors combined to cause incident]
```
|
||||
|
||||
---
|
||||
|
||||
## Timeline Analysis Framework
|
||||
|
||||
### Purpose
|
||||
Analyze the chronological sequence of events to identify decision points, missed opportunities, and process gaps.
|
||||
|
||||
### When to Use
|
||||
- Extended incidents (> 1 hour)
|
||||
- Complex multi-phase incidents
|
||||
- When response effectiveness is questioned
|
||||
- Communication or coordination failures
|
||||
|
||||
### Analysis Dimensions
|
||||
|
||||
#### Detection Analysis
|
||||
- **Time to Detection:** How long from onset to first alert?
|
||||
- **Detection Method:** How was the incident first identified?
|
||||
- **Alert Effectiveness:** Were the right people notified quickly?
|
||||
- **False Negatives:** What signals were missed?
|
||||
|
||||
#### Response Analysis
|
||||
- **Time to Response:** How long from detection to first response action?
|
||||
- **Escalation Timing:** Were escalations timely and appropriate?
|
||||
- **Resource Mobilization:** How quickly were the right people engaged?
|
||||
- **Decision Points:** What key decisions were made and when?
|
||||
|
||||
#### Communication Analysis
|
||||
- **Internal Communication:** How effective was team coordination?
|
||||
- **External Communication:** Were stakeholders informed appropriately?
|
||||
- **Communication Gaps:** Where did information flow break down?
|
||||
- **Update Frequency:** Were updates provided at appropriate intervals?
|
||||
|
||||
#### Resolution Analysis
|
||||
- **Mitigation Strategy:** Was the chosen approach optimal?
|
||||
- **Alternative Paths:** What other options were considered?
|
||||
- **Resource Allocation:** Were resources used effectively?
|
||||
- **Verification:** How was resolution confirmed?
|
||||
|
||||
### Process Steps
|
||||
|
||||
#### Step 1: Event Reconstruction
|
||||
Create a comprehensive timeline from all available events.
|
||||
|
||||
#### Step 2: Phase Identification
|
||||
Identify distinct phases (detection, triage, escalation, mitigation, resolution).
|
||||
|
||||
#### Step 3: Gap Analysis
|
||||
Identify time gaps and analyze their causes.
|
||||
|
||||
#### Step 4: Decision Point Analysis
|
||||
Examine key decision points and alternative paths.
|
||||
|
||||
#### Step 5: Effectiveness Assessment
|
||||
Evaluate the overall effectiveness of the response.
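The gap analysis in Step 3 amounts to sorting event timestamps and flagging long silences between consecutive events. The sketch below is an assumption about one way to do this; `find_gaps` and the 15-minute default threshold are illustrative and do not describe how `timeline_reconstructor.py` implements its own gap analysis.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Gap = Tuple[datetime, datetime, timedelta]


def find_gaps(timestamps: List[datetime],
              threshold: timedelta = timedelta(minutes=15)) -> List[Gap]:
    """Return (start, end, duration) for each silence longer than the threshold."""
    ordered = sorted(timestamps)
    gaps = []
    # Compare each event with the one that follows it.
    for earlier, later in zip(ordered, ordered[1:]):
        delta = later - earlier
        if delta > threshold:
            gaps.append((earlier, later, delta))
    return gaps
```

Each reported gap is a candidate for the "Gaps and Delays" part of the review: the code only finds the silence, and the team still has to explain its cause and impact.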
|
||||
|
||||
### Timeline Template
|
||||
|
||||
```markdown
## Timeline Analysis

### Incident Phases
1. **Detection** ([start] - [end], [duration])
2. **Triage** ([start] - [end], [duration])
3. **Escalation** ([start] - [end], [duration])
4. **Mitigation** ([start] - [end], [duration])
5. **Resolution** ([start] - [end], [duration])

### Key Decision Points
**[Timestamp]:** [Decision made]
- **Context:** [Situation at time of decision]
- **Alternatives:** [Other options considered]
- **Outcome:** [Result of decision]
- **Assessment:** [Was this optimal?]

### Communication Timeline
**[Timestamp]:** [Communication event]
- **Channel:** [Slack/Email/Phone/etc.]
- **Audience:** [Who was informed]
- **Content:** [What was communicated]
- **Effectiveness:** [Assessment]

### Gaps and Delays
**[Time Period]:** [Description of gap]
- **Duration:** [Length of gap]
- **Cause:** [Why did gap occur]
- **Impact:** [Effect on incident response]

### Response Effectiveness
**Strengths:**
- [What went well]
- [Effective decisions/actions]

**Weaknesses:**
- [What could be improved]
- [Missed opportunities]

### Root Causes from Timeline
1. [Process-based root cause]
2. [Communication-based root cause]
3. [Decision-making root cause]
```
|
||||
|
||||
---
|
||||
|
||||
## Bow Tie Analysis Framework
|
||||
|
||||
### Purpose
|
||||
Analyze both preventive measures (left side) and protective measures (right side) around an incident.
|
||||
|
||||
### When to Use
|
||||
- High-severity incidents (SEV1)
|
||||
- Security incidents
|
||||
- Safety-critical systems
|
||||
- When comprehensive barrier analysis is needed
|
||||
|
||||
### Components
|
||||
|
||||
#### Hazards
|
||||
What conditions create the potential for incidents?
|
||||
|
||||
**Examples:**
|
||||
- High traffic loads
|
||||
- Software deployments
|
||||
- Human interactions with critical systems
|
||||
- Third-party dependencies
|
||||
|
||||
#### Top Event
|
||||
What actually went wrong? This is the center of the bow tie.
|
||||
|
||||
**Examples:**
|
||||
- "Database became unresponsive"
|
||||
- "Payment processing failed"
|
||||
- "User authentication service crashed"
|
||||
|
||||
#### Threats (Left Side)
|
||||
What specific causes could lead to the top event?
|
||||
|
||||
**Examples:**
|
||||
- Code defects in new deployment
|
||||
- Database connection pool exhaustion
|
||||
- Network connectivity issues
|
||||
- DDoS attack
|
||||
|
||||
#### Consequences (Right Side)
|
||||
What are the potential impacts of the top event?
|
||||
|
||||
**Examples:**
|
||||
- Revenue loss
|
||||
- Customer churn
|
||||
- Regulatory violations
|
||||
- Brand damage
|
||||
- Data loss
|
||||
|
||||
#### Barriers
|
||||
What controls exist (or could exist) to prevent threats or mitigate consequences?
|
||||
|
||||
**Preventive Barriers (Left Side):**
|
||||
- Code reviews
|
||||
- Automated testing
|
||||
- Load testing
|
||||
- Input validation
|
||||
- Rate limiting
|
||||
|
||||
**Protective Barriers (Right Side):**
|
||||
- Circuit breakers
|
||||
- Failover systems
|
||||
- Backup procedures
|
||||
- Customer communication
|
||||
- Rollback capabilities
|
||||
|
||||
### Process Steps
|
||||
|
||||
#### Step 1: Define the Top Event
|
||||
Clearly state what went wrong.
|
||||
|
||||
#### Step 2: Identify Threats
|
||||
Brainstorm all possible causes that could lead to the top event.
|
||||
|
||||
#### Step 3: Identify Consequences
|
||||
List all potential impacts of the top event.
|
||||
|
||||
#### Step 4: Map Existing Barriers
|
||||
Identify current controls for each threat and consequence.
|
||||
|
||||
#### Step 5: Assess Barrier Effectiveness
|
||||
Evaluate how well each barrier worked (or failed).
|
||||
|
||||
#### Step 6: Recommend Additional Barriers
|
||||
Identify new controls needed to prevent recurrence.
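Steps 4 through 6 boil down to sorting barriers into those that worked, those that failed, and those that did not exist. The sketch below is illustrative only; `Barrier`, its fields, and `classify_barriers` are hypothetical names chosen to match the headings of the template that follows.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Barrier:
    name: str
    side: str              # "preventive" (left side) or "protective" (right side)
    existed: bool          # was the control in place when the incident happened?
    worked: bool = False   # did it stop the threat or limit the consequence?


def classify_barriers(barriers: List[Barrier]) -> Dict[str, List[str]]:
    """Sort barrier names into effective, failed, and missing."""
    result: Dict[str, List[str]] = {"effective": [], "failed": [], "missing": []}
    for barrier in barriers:
        if not barrier.existed:
            result["missing"].append(barrier.name)
        elif barrier.worked:
            result["effective"].append(barrier.name)
        else:
            result["failed"].append(barrier.name)
    return result
```

The "failed" and "missing" lists feed directly into the recommendations: failed barriers need improvement, missing ones need to be built.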
|
||||
|
||||
### Bow Tie Template
|
||||
|
||||
```markdown
## Bow Tie Analysis

**Top Event:** [What went wrong]

### Threats (Potential Causes)
1. **[Threat 1]**
   - Likelihood: [High/Medium/Low]
   - Current Barriers: [Preventive controls]
   - Barrier Effectiveness: [Assessment]

2. **[Threat 2]**
   - Likelihood: [High/Medium/Low]
   - Current Barriers: [Preventive controls]
   - Barrier Effectiveness: [Assessment]

### Consequences (Potential Impacts)
1. **[Consequence 1]**
   - Severity: [High/Medium/Low]
   - Current Barriers: [Protective controls]
   - Barrier Effectiveness: [Assessment]

2. **[Consequence 2]**
   - Severity: [High/Medium/Low]
   - Current Barriers: [Protective controls]
   - Barrier Effectiveness: [Assessment]

### Barrier Analysis
**Effective Barriers:**
- [Barrier that worked well]
- [Why it was effective]

**Failed Barriers:**
- [Barrier that failed]
- [Why it failed]
- [How to improve]

**Missing Barriers:**
- [Needed preventive control]
- [Needed protective control]

### Recommendations
**Preventive Measures:**
1. [New barrier to prevent threat]
2. [Improvement to existing barrier]

**Protective Measures:**
1. [New barrier to mitigate consequence]
2. [Improvement to existing barrier]
```
|
||||
|
||||
---
|
||||
|
||||
## Framework Comparison
|
||||
|
||||
| Framework | Time Required | Complexity | Best For | Output |
|-----------|---------------|------------|----------|--------|
| **5 Whys** | 30-60 minutes | Low | Simple, linear causes | Clear cause chain |
| **Fishbone** | 1-2 hours | Medium | Complex, multi-factor | Comprehensive factor map |
| **Timeline** | 2-3 hours | Medium | Extended incidents | Process improvements |
| **Bow Tie** | 2-4 hours | High | High-risk incidents | Barrier strategy |
|
||||
|
||||
## Combining Frameworks
|
||||
|
||||
### 5 Whys + Fishbone
|
||||
Use 5 Whys for initial analysis, then Fishbone to explore contributing factors.
|
||||
|
||||
### Timeline + 5 Whys
|
||||
Use Timeline to identify key decision points, then 5 Whys on critical failures.
|
||||
|
||||
### Fishbone + Bow Tie
|
||||
Use Fishbone to identify causes, then Bow Tie to develop comprehensive prevention strategy.
|
||||
|
||||
## Quality Checklist
|
||||
|
||||
- [ ] Root causes address systemic issues, not symptoms
|
||||
- [ ] Analysis is backed by evidence, not assumptions
|
||||
- [ ] Multiple perspectives considered (technical, process, human)
|
||||
- [ ] Recommendations are specific and actionable
|
||||
- [ ] Analysis focuses on prevention, not blame
|
||||
- [ ] Findings are validated against incident timeline
|
||||
- [ ] Contributing factors are prioritized by impact
|
||||
- [ ] Root causes link clearly to preventive actions
|
||||
|
||||
## Common Anti-Patterns
|
||||
|
||||
- **Human Error as Root Cause** - Dig deeper into why human error occurred
|
||||
- **Single Root Cause** - Complex systems usually have multiple contributing factors
|
||||
- **Technology-Only Focus** - Consider process and organizational factors
|
||||
- **Blame Assignment** - Focus on system improvements, not individual fault
|
||||
- **Generic Recommendations** - Provide specific, measurable actions
|
||||
- **Surface-Level Analysis** - Ensure you've reached true root causes
|
||||
|
||||
---
|
||||
|
||||
**Last Updated:** February 2026
|
||||
**Next Review:** August 2026
|
||||
**Owner:** SRE Team + Engineering Leadership
|
||||
@@ -0,0 +1,914 @@
|
||||
#!/usr/bin/env python3
"""
Incident Classifier

Analyzes incident descriptions and outputs severity levels, recommended response teams,
initial actions, and communication templates.

This tool uses pattern matching and keyword analysis to classify incidents according to
SEV1-4 criteria and provide structured response guidance.

Usage:
    python incident_classifier.py --input incident.json
    echo "Database is down" | python incident_classifier.py --format text
    python incident_classifier.py --interactive
"""

import argparse
import json
import sys
import re
from datetime import datetime, timezone
from typing import Dict, List, Tuple, Optional, Any


class IncidentClassifier:
    """
    Classifies incidents based on description, impact metrics, and business context.
    Provides severity assessment, team recommendations, and response templates.
    """

    def __init__(self):
        """Initialize the classifier with rules and templates."""
        self.severity_rules = self._load_severity_rules()
        self.team_mappings = self._load_team_mappings()
        self.communication_templates = self._load_communication_templates()
        self.action_templates = self._load_action_templates()
|
||||
|
||||
    def _load_severity_rules(self) -> Dict[str, Dict]:
        """Load severity classification rules and keywords."""
        return {
            "sev1": {
                "keywords": [
                    "down", "outage", "offline", "unavailable", "crashed", "failed",
                    "critical", "emergency", "dead", "broken", "timeout", "500 error",
                    "data loss", "corrupted", "breach", "security incident",
                    "revenue impact", "customer facing", "all users", "complete failure"
                ],
                "impact_indicators": [
                    "100%", "all users", "entire service", "complete",
                    "revenue loss", "sla violation", "customer churn",
                    "security breach", "data corruption", "regulatory"
                ],
                "duration_threshold": 0,  # Immediate classification
                "response_time": 300,  # 5 minutes
                "description": "Complete service failure affecting all users or critical business functions"
            },
            "sev2": {
                "keywords": [
                    "degraded", "slow", "performance", "errors", "partial",
                    "intermittent", "high latency", "timeouts", "some users",
                    "feature broken", "api errors", "database slow"
                ],
                "impact_indicators": [
                    "50%", "25-75%", "many users", "significant",
                    "performance degradation", "feature unavailable",
                    "support tickets", "user complaints"
                ],
                "duration_threshold": 300,  # 5 minutes
                "response_time": 900,  # 15 minutes
                "description": "Significant degradation affecting subset of users or non-critical functions"
            },
            "sev3": {
                "keywords": [
                    "minor", "cosmetic", "single feature", "workaround available",
                    "edge case", "rare issue", "non-critical", "internal tool",
                    "logging issue", "monitoring gap"
                ],
                "impact_indicators": [
                    "<25%", "few users", "limited impact",
                    "workaround exists", "internal only",
                    "development environment"
                ],
                "duration_threshold": 3600,  # 1 hour
                "response_time": 7200,  # 2 hours
                "description": "Limited impact with workarounds available"
            },
            "sev4": {
                "keywords": [
                    "cosmetic", "documentation", "typo", "minor bug",
                    "enhancement", "nice to have", "low priority",
                    "test environment", "dev tools"
                ],
                "impact_indicators": [
                    "no impact", "cosmetic only", "documentation",
                    "development", "testing", "non-production"
                ],
                "duration_threshold": 86400,  # 24 hours
                "response_time": 172800,  # 2 days
                "description": "Minimal impact, cosmetic issues, or planned maintenance"
            }
        }
|
||||
|
||||
    def _load_team_mappings(self) -> Dict[str, List[str]]:
        """Load team assignment rules based on service/component keywords."""
        return {
            "database": ["Database Team", "SRE", "Backend Engineering"],
            "frontend": ["Frontend Team", "UX Engineering", "Product Engineering"],
            "api": ["API Team", "Backend Engineering", "Platform Team"],
            "infrastructure": ["SRE", "DevOps", "Platform Team"],
            "security": ["Security Team", "SRE", "Compliance Team"],
            "network": ["Network Engineering", "SRE", "Infrastructure Team"],
            "authentication": ["Identity Team", "Security Team", "Backend Engineering"],
            "payment": ["Payments Team", "Finance Engineering", "Compliance Team"],
            "mobile": ["Mobile Team", "API Team", "QA Engineering"],
            "monitoring": ["SRE", "Platform Team", "DevOps"],
            "deployment": ["DevOps", "Release Engineering", "SRE"],
            "data": ["Data Engineering", "Analytics Team", "Backend Engineering"]
        }
|
||||
|
||||
    def _load_communication_templates(self) -> Dict[str, Dict]:
        """Load communication templates for each severity level."""
        return {
            "sev1": {
                "subject": "🚨 [SEV1] {service} - {brief_description}",
                "body": """CRITICAL INCIDENT ALERT

Incident Details:
- Start Time: {timestamp}
- Severity: SEV1 - Critical Outage
- Service: {service}
- Impact: {impact_description}
- Current Status: Investigating

Customer Impact:
{customer_impact}

Response Team:
- Incident Commander: TBD (assigning now)
- Primary Responder: {primary_responder}
- SMEs Required: {subject_matter_experts}

Immediate Actions Taken:
{initial_actions}

War Room: {war_room_link}
Status Page: Will be updated within 15 minutes
Next Update: {next_update_time}

This is a customer-impacting incident requiring immediate attention.

{incident_commander_contact}"""
            },
            "sev2": {
                "subject": "⚠️ [SEV2] {service} - {brief_description}",
                "body": """MAJOR INCIDENT NOTIFICATION

Incident Details:
- Start Time: {timestamp}
- Severity: SEV2 - Major Impact
- Service: {service}
- Impact: {impact_description}
- Current Status: Investigating

User Impact:
{customer_impact}

Response Team:
- Primary Responder: {primary_responder}
- Supporting Team: {supporting_teams}
- Incident Commander: {incident_commander}

Initial Assessment:
{initial_assessment}

Next Steps:
{next_steps}

Updates will be provided every 30 minutes.
Status page: {status_page_link}

{contact_information}"""
            },
            "sev3": {
                "subject": "ℹ️ [SEV3] {service} - {brief_description}",
                "body": """MINOR INCIDENT NOTIFICATION

Incident Details:
- Start Time: {timestamp}
- Severity: SEV3 - Minor Impact
- Service: {service}
- Impact: {impact_description}
- Status: {current_status}

Details:
{incident_details}

Assigned Team: {assigned_team}
Estimated Resolution: {eta}

Workaround: {workaround}

This incident has limited customer impact and is being addressed during normal business hours.

{team_contact}"""
            },
            "sev4": {
                "subject": "[SEV4] {service} - {brief_description}",
                "body": """LOW PRIORITY ISSUE

Issue Details:
- Reported: {timestamp}
- Severity: SEV4 - Low Impact
- Component: {service}
- Description: {description}

This issue will be addressed in the normal development cycle.

Assigned to: {assigned_team}
Target Resolution: {target_date}

{standard_contact}"""
            }
        }
|
||||
|
||||
    def _load_action_templates(self) -> Dict[str, List[Dict]]:
        """Load initial action templates for each severity level."""
        return {
            "sev1": [
                {
                    "action": "Establish incident command",
                    "priority": 1,
                    "timeout_minutes": 5,
                    "description": "Page incident commander and establish war room"
                },
                {
                    "action": "Create incident ticket",
                    "priority": 1,
                    "timeout_minutes": 2,
                    "description": "Create tracking ticket with all known details"
                },
                {
                    "action": "Update status page",
                    "priority": 2,
                    "timeout_minutes": 15,
                    "description": "Post initial status page update acknowledging incident"
                },
                {
                    "action": "Notify executives",
                    "priority": 2,
                    "timeout_minutes": 15,
                    "description": "Alert executive team of customer-impacting outage"
                },
                {
                    "action": "Engage subject matter experts",
                    "priority": 3,
                    "timeout_minutes": 10,
                    "description": "Page relevant SMEs based on affected systems"
                },
                {
                    "action": "Begin technical investigation",
                    "priority": 3,
                    "timeout_minutes": 5,
                    "description": "Start technical diagnosis and mitigation efforts"
                }
            ],
            "sev2": [
                {
                    "action": "Assign incident commander",
                    "priority": 1,
                    "timeout_minutes": 30,
                    "description": "Assign IC and establish coordination channel"
                },
                {
                    "action": "Create incident tracking",
                    "priority": 1,
                    "timeout_minutes": 5,
                    "description": "Create incident ticket with details and timeline"
                },
                {
                    "action": "Assess customer impact",
                    "priority": 2,
                    "timeout_minutes": 15,
                    "description": "Determine scope and severity of user impact"
                },
                {
                    "action": "Engage response team",
                    "priority": 2,
                    "timeout_minutes": 30,
                    "description": "Page appropriate technical responders"
                },
                {
                    "action": "Begin investigation",
                    "priority": 3,
                    "timeout_minutes": 15,
                    "description": "Start technical analysis and debugging"
                },
                {
                    "action": "Plan status communication",
                    "priority": 3,
                    "timeout_minutes": 30,
                    "description": "Determine if status page update is needed"
                }
            ],
            "sev3": [
                {
                    "action": "Assign to appropriate team",
                    "priority": 1,
                    "timeout_minutes": 120,
                    "description": "Route to team with relevant expertise"
                },
                {
                    "action": "Create tracking ticket",
                    "priority": 1,
                    "timeout_minutes": 30,
                    "description": "Document issue in standard ticketing system"
                },
                {
                    "action": "Assess scope and impact",
                    "priority": 2,
                    "timeout_minutes": 60,
                    "description": "Understand full scope of the issue"
                },
                {
                    "action": "Identify workarounds",
                    "priority": 2,
                    "timeout_minutes": 60,
                    "description": "Find temporary solutions if possible"
                },
                {
                    "action": "Plan resolution approach",
                    "priority": 3,
                    "timeout_minutes": 120,
                    "description": "Develop plan for permanent fix"
                }
            ],
            "sev4": [
                {
                    "action": "Create backlog item",
                    "priority": 1,
                    "timeout_minutes": 1440,  # 24 hours
                    "description": "Add to team backlog for future sprint planning"
                },
                {
                    "action": "Triage and prioritize",
                    "priority": 2,
                    "timeout_minutes": 2880,  # 2 days
                    "description": "Review and prioritize against other work"
                },
                {
                    "action": "Assign owner",
                    "priority": 3,
                    "timeout_minutes": 4320,  # 3 days
                    "description": "Assign to appropriate developer when capacity allows"
                }
            ]
        }
|
||||
|
||||
    def classify_incident(self, incident_data: Dict[str, Any]) -> Dict[str, Any]:
        """
        Main classification method that analyzes incident data and returns
        comprehensive response recommendations.

        Args:
            incident_data: Dictionary containing incident information

        Returns:
            Dictionary with classification results and recommendations
        """
        # Extract key information from incident data
        description = incident_data.get('description', '').lower()
        affected_users = incident_data.get('affected_users', '0%')
        business_impact = incident_data.get('business_impact', 'unknown')
        service = incident_data.get('service', 'unknown service')
        duration = incident_data.get('duration_minutes', 0)

        # Classify severity
        severity = self._classify_severity(description, affected_users, business_impact, duration)

        # Determine response teams
        response_teams = self._determine_teams(description, service)

        # Generate initial actions
        initial_actions = self._generate_initial_actions(severity, incident_data)

        # Create communication template
        communication = self._generate_communication(severity, incident_data)

        # Calculate response timeline
        timeline = self._generate_timeline(severity)

        # Determine escalation path
        escalation = self._determine_escalation(severity, business_impact)

        return {
            "classification": {
                "severity": severity.upper(),
                "confidence": self._calculate_confidence(description, affected_users, business_impact),
                "reasoning": self._explain_classification(severity, description, affected_users),
                "timestamp": datetime.now(timezone.utc).isoformat()
            },
            "response": {
                "primary_team": response_teams[0] if response_teams else "General Engineering",
                "supporting_teams": response_teams[1:] if len(response_teams) > 1 else [],
                "all_teams": response_teams,
                "response_time_minutes": self.severity_rules[severity]["response_time"] // 60
            },
            "initial_actions": initial_actions,
            "communication": communication,
            "timeline": timeline,
            "escalation": escalation,
            "incident_data": {
                "service": service,
                "description": incident_data.get('description', ''),
                "affected_users": affected_users,
                "business_impact": business_impact,
                "duration_minutes": duration
|
||||
}
|
||||
}
|
||||
|
||||
    def _classify_severity(self, description: str, affected_users: str,
                           business_impact: str, duration: int) -> str:
        """Classify incident severity based on multiple factors.

        `duration` is expressed in minutes (it comes from `duration_minutes`).
        """
        scores = {"sev1": 0, "sev2": 0, "sev3": 0, "sev4": 0}

        # Keyword analysis
        for severity, rules in self.severity_rules.items():
            for keyword in rules["keywords"]:
                if keyword in description:
                    scores[severity] += 2

            for indicator in rules["impact_indicators"]:
                if indicator.lower() in description or indicator.lower() in affected_users.lower():
                    scores[severity] += 3

        # Business impact weighting
        impact = business_impact.lower()
        if impact in ['critical', 'high', 'severe']:
            scores["sev1"] += 5
            scores["sev2"] += 3
        elif impact in ['medium', 'moderate']:
            scores["sev2"] += 3
            scores["sev3"] += 2
        elif impact in ['low', 'minimal']:
            scores["sev3"] += 2
            scores["sev4"] += 3

        # User impact analysis
        if '%' in affected_users:
            try:
                percentage = float(re.findall(r'\d+', affected_users)[0])
                if percentage >= 75:
                    scores["sev1"] += 4
                elif percentage >= 25:
                    scores["sev2"] += 4
                elif percentage >= 5:
                    scores["sev3"] += 3
                else:
                    scores["sev4"] += 2
            except (IndexError, ValueError):
                pass

        # Duration consideration (thresholds in minutes, matching the input unit)
        if duration > 0:
            if duration >= 60:  # 1 hour
                scores["sev1"] += 2
                scores["sev2"] += 1
            elif duration >= 30:  # 30 minutes
                scores["sev2"] += 2
                scores["sev3"] += 1

        # Return highest scoring severity (ties resolve to the more severe level,
        # because max() keeps the first maximal key in insertion order)
        return max(scores, key=scores.get)
|
||||
|
||||
    def _determine_teams(self, description: str, service: str) -> List[str]:
        """Determine which teams should respond based on affected systems."""
        teams = set()
        text_to_analyze = f"{description} {service}".lower()

        for component, team_list in self.team_mappings.items():
            if component in text_to_analyze:
                teams.update(team_list)

        # Default teams if no specific match
        if not teams:
            teams = {"General Engineering", "SRE"}

        # Sort so the result (and therefore the primary team, which is the
        # first element) is deterministic; plain set order varies between runs.
        return sorted(teams)
|
||||
|
||||
    def _generate_initial_actions(self, severity: str, incident_data: Dict) -> List[Dict]:
        """Generate prioritized initial actions based on severity."""
        # Copy each action dict: list.copy() is shallow, so adding "urgency"
        # below would otherwise mutate the shared templates.
        base_actions = [dict(action) for action in self.action_templates[severity]]

        # Customize actions based on incident details
        for action in base_actions:
            if severity in ["sev1", "sev2"]:
                action["urgency"] = "immediate" if severity == "sev1" else "high"
            else:
                action["urgency"] = "normal" if severity == "sev3" else "low"

        return base_actions
|
||||
|
||||
def _generate_communication(self, severity: str, incident_data: Dict) -> Dict:
|
||||
"""Generate communication template filled with incident data."""
|
||||
template = self.communication_templates[severity]
|
||||
|
||||
# Fill template with incident data
|
||||
now = datetime.now(timezone.utc)
|
||||
service = incident_data.get('service', 'Unknown Service')
|
||||
description = incident_data.get('description', 'Incident detected')
|
||||
|
||||
communication = {
|
||||
"subject": template["subject"].format(
|
||||
service=service,
|
||||
brief_description=description[:50] + "..." if len(description) > 50 else description
|
||||
),
|
||||
"body": template["body"],
|
||||
"urgency": severity,
|
||||
"recipients": self._determine_recipients(severity),
|
||||
"channels": self._determine_channels(severity),
|
||||
"frequency_minutes": self._get_update_frequency(severity)
|
||||
}
|
||||
|
||||
return communication
|
||||
|
||||
def _generate_timeline(self, severity: str) -> Dict:
|
||||
"""Generate expected response timeline."""
|
||||
rules = self.severity_rules[severity]
|
||||
now = datetime.now(timezone.utc)
|
||||
|
||||
milestones = []
|
||||
if severity == "sev1":
|
||||
milestones = [
|
||||
{"milestone": "Incident Commander assigned", "minutes": 5},
|
||||
{"milestone": "War room established", "minutes": 10},
|
||||
{"milestone": "Initial status page update", "minutes": 15},
|
||||
{"milestone": "Executive notification", "minutes": 15},
|
||||
{"milestone": "First customer update", "minutes": 30}
|
||||
]
|
||||
elif severity == "sev2":
|
||||
milestones = [
|
||||
{"milestone": "Response team assembled", "minutes": 15},
|
||||
{"milestone": "Initial assessment complete", "minutes": 30},
|
||||
{"milestone": "Stakeholder notification", "minutes": 60},
|
||||
{"milestone": "Status page update (if needed)", "minutes": 60}
|
||||
]
|
||||
elif severity == "sev3":
|
||||
milestones = [
|
||||
{"milestone": "Team assignment", "minutes": 120},
|
||||
{"milestone": "Initial triage complete", "minutes": 240},
|
||||
{"milestone": "Resolution plan created", "minutes": 480}
|
||||
]
|
||||
else: # sev4
|
||||
milestones = [
|
||||
{"milestone": "Backlog creation", "minutes": 1440},
|
||||
{"milestone": "Priority assessment", "minutes": 2880}
|
||||
]
|
||||
|
||||
return {
|
||||
"response_time_minutes": rules["response_time"] // 60,
|
||||
"milestones": milestones,
|
||||
"update_frequency_minutes": self._get_update_frequency(severity)
|
||||
}
|
||||
|
||||
def _determine_escalation(self, severity: str, business_impact: str) -> Dict:
|
||||
"""Determine escalation requirements and triggers."""
|
||||
escalation_rules = {
|
||||
"sev1": {
|
||||
"immediate": ["Incident Commander", "Engineering Manager"],
|
||||
"15_minutes": ["VP Engineering", "Customer Success"],
|
||||
"30_minutes": ["CTO"],
|
||||
"60_minutes": ["CEO", "All C-Suite"],
|
||||
"triggers": ["Extended outage", "Revenue impact", "Media attention"]
|
||||
},
|
||||
"sev2": {
|
||||
"immediate": ["Team Lead", "On-call Engineer"],
|
||||
"30_minutes": ["Engineering Manager"],
|
||||
"120_minutes": ["VP Engineering"],
|
||||
"triggers": ["No progress", "Expanding scope", "Customer escalation"]
|
||||
},
|
||||
"sev3": {
|
||||
"immediate": ["Assigned Engineer"],
|
||||
"240_minutes": ["Team Lead"],
|
||||
"triggers": ["Issue complexity", "Multiple teams needed"]
|
||||
},
|
||||
"sev4": {
|
||||
"immediate": ["Product Owner"],
|
||||
"triggers": ["Customer request", "Stakeholder priority"]
|
||||
}
|
||||
}
|
||||
|
||||
return escalation_rules.get(severity, escalation_rules["sev4"])
|
||||
|
||||
def _determine_recipients(self, severity: str) -> List[str]:
|
||||
"""Determine who should receive notifications."""
|
||||
recipients = {
|
||||
"sev1": ["on-call", "engineering-leadership", "executives", "customer-success"],
|
||||
"sev2": ["on-call", "engineering-leadership", "product-team"],
|
||||
"sev3": ["assigned-team", "team-lead"],
|
||||
"sev4": ["assigned-engineer"]
|
||||
}
|
||||
return recipients.get(severity, recipients["sev4"])
|
||||
|
||||
def _determine_channels(self, severity: str) -> List[str]:
|
||||
"""Determine communication channels to use."""
|
||||
channels = {
|
||||
"sev1": ["pager", "phone", "slack", "email", "status-page"],
|
||||
"sev2": ["pager", "slack", "email"],
|
||||
"sev3": ["slack", "email"],
|
||||
"sev4": ["ticket-system"]
|
||||
}
|
||||
return channels.get(severity, channels["sev4"])
|
||||
|
||||
def _get_update_frequency(self, severity: str) -> int:
|
||||
"""Get recommended update frequency in minutes."""
|
||||
frequencies = {"sev1": 15, "sev2": 30, "sev3": 240, "sev4": 0}
|
||||
return frequencies.get(severity, 0)
|
||||
|
||||
def _calculate_confidence(self, description: str, affected_users: str, business_impact: str) -> float:
|
||||
"""Calculate confidence score for the classification."""
|
||||
confidence = 0.5 # Base confidence
|
||||
|
||||
# Higher confidence with more specific information
|
||||
if '%' in affected_users and any(char.isdigit() for char in affected_users):
|
||||
confidence += 0.2
|
||||
|
||||
if business_impact.lower() in ['critical', 'high', 'medium', 'low']:
|
||||
confidence += 0.15
|
||||
|
||||
if len(description.split()) > 5: # Detailed description
|
||||
confidence += 0.15
|
||||
|
||||
return min(confidence, 1.0)
|
||||
|
||||
def _explain_classification(self, severity: str, description: str, affected_users: str) -> str:
|
||||
"""Provide explanation for the classification decision."""
|
||||
rules = self.severity_rules[severity]
|
||||
|
||||
matched_keywords = []
|
||||
for keyword in rules["keywords"]:
|
||||
if keyword in description.lower():
|
||||
matched_keywords.append(keyword)
|
||||
|
||||
explanation = f"Classified as {severity.upper()} based on: "
|
||||
reasons = []
|
||||
|
||||
if matched_keywords:
|
||||
reasons.append(f"keywords: {', '.join(matched_keywords[:3])}")
|
||||
|
||||
if '%' in affected_users:
|
||||
reasons.append(f"user impact: {affected_users}")
|
||||
|
||||
if not reasons:
|
||||
reasons.append("default classification based on available information")
|
||||
|
||||
return explanation + "; ".join(reasons)
|
||||
|
||||
|
||||
def format_json_output(result: Dict) -> str:
|
||||
"""Format result as pretty JSON."""
|
||||
return json.dumps(result, indent=2, ensure_ascii=False)
|
||||
|
||||
|
||||
def format_text_output(result: Dict) -> str:
|
||||
"""Format result as human-readable text."""
|
||||
classification = result["classification"]
|
||||
response = result["response"]
|
||||
actions = result["initial_actions"]
|
||||
communication = result["communication"]
|
||||
|
||||
output = []
|
||||
output.append("=" * 60)
|
||||
output.append("INCIDENT CLASSIFICATION REPORT")
|
||||
output.append("=" * 60)
|
||||
output.append("")
|
||||
|
||||
# Classification section
|
||||
output.append("CLASSIFICATION:")
|
||||
output.append(f" Severity: {classification['severity']}")
|
||||
output.append(f" Confidence: {classification['confidence']:.1%}")
|
||||
output.append(f" Reasoning: {classification['reasoning']}")
|
||||
output.append(f" Timestamp: {classification['timestamp']}")
|
||||
output.append("")
|
||||
|
||||
# Response section
|
||||
output.append("RECOMMENDED RESPONSE:")
|
||||
output.append(f" Primary Team: {response['primary_team']}")
|
||||
if response['supporting_teams']:
|
||||
output.append(f" Supporting Teams: {', '.join(response['supporting_teams'])}")
|
||||
output.append(f" Response Time: {response['response_time_minutes']} minutes")
|
||||
output.append("")
|
||||
|
||||
# Actions section
|
||||
output.append("INITIAL ACTIONS:")
|
||||
for i, action in enumerate(actions[:5], 1): # Show first 5 actions
|
||||
output.append(f" {i}. {action['action']} (Priority {action['priority']})")
|
||||
output.append(f" Timeout: {action['timeout_minutes']} minutes")
|
||||
output.append(f" {action['description']}")
|
||||
output.append("")
|
||||
|
||||
# Communication section
|
||||
output.append("COMMUNICATION:")
|
||||
output.append(f" Subject: {communication['subject']}")
|
||||
output.append(f" Urgency: {communication['urgency'].upper()}")
|
||||
output.append(f" Recipients: {', '.join(communication['recipients'])}")
|
||||
output.append(f" Channels: {', '.join(communication['channels'])}")
|
||||
if communication['frequency_minutes'] > 0:
|
||||
output.append(f" Update Frequency: Every {communication['frequency_minutes']} minutes")
|
||||
output.append("")
|
||||
|
||||
output.append("=" * 60)
|
||||
|
||||
return "\n".join(output)
|
||||
|
||||
|
||||
def parse_input_text(text: str) -> Dict[str, Any]:
    """Parse free-form text input into structured incident data."""
    # Basic parsing - in a real system, this would be more sophisticated
    incident_data = {
        "description": text.strip(),
        "service": "unknown service",
        "affected_users": "unknown",
        "business_impact": "unknown"
    }
    lowered = text.lower()

    # Try to extract service name
    service_patterns = [
        r'(?:service|api|database|server|application)\s+(\w+)',
        r'(\w+)(?:\s+(?:is|has|service|api|database))',
        r'(?:^|\s)(\w+)\s+(?:down|failed|broken)'
    ]

    for pattern in service_patterns:
        match = re.search(pattern, lowered)
        if match:
            incident_data["service"] = match.group(1)
            break

    # Try to extract user impact
    impact_patterns = [
        r'(\d+%)\s+(?:of\s+)?(?:users?|customers?)',
        r'(?:all|every|100%)\s+(?:users?|customers?)',
        r'(?:some|many|several)\s+(?:users?|customers?)'
    ]

    for pattern in impact_patterns:
        match = re.search(pattern, lowered)
        if match:
            # Only the first pattern has a capture group. Calling group(1) on
            # the others raises IndexError, so fall back to the whole match.
            incident_data["affected_users"] = match.group(1) if match.lastindex else match.group(0)
            break

    # Try to infer business impact
    if any(word in lowered for word in ['critical', 'urgent', 'emergency', 'down', 'outage']):
        incident_data["business_impact"] = "high"
    elif any(word in lowered for word in ['slow', 'degraded', 'performance']):
        incident_data["business_impact"] = "medium"
    elif any(word in lowered for word in ['minor', 'cosmetic', 'small']):
        incident_data["business_impact"] = "low"

    return incident_data
|
||||
|
||||
|
||||
def interactive_mode():
|
||||
"""Run in interactive mode, prompting user for input."""
|
||||
classifier = IncidentClassifier()
|
||||
|
||||
print("🚨 Incident Classifier - Interactive Mode")
|
||||
print("=" * 50)
|
||||
print("Enter incident details (or 'quit' to exit):")
|
||||
print()
|
||||
|
||||
while True:
|
||||
try:
|
||||
description = input("Incident description: ").strip()
|
||||
if description.lower() in ['quit', 'exit', 'q']:
|
||||
break
|
||||
|
||||
if not description:
|
||||
print("Please provide an incident description.")
|
||||
continue
|
||||
|
||||
service = input("Affected service (optional): ").strip() or "unknown"
|
||||
affected_users = input("Affected users (e.g., '50%', 'all users'): ").strip() or "unknown"
|
||||
business_impact = input("Business impact (high/medium/low): ").strip() or "unknown"
|
||||
|
||||
incident_data = {
|
||||
"description": description,
|
||||
"service": service,
|
||||
"affected_users": affected_users,
|
||||
"business_impact": business_impact
|
||||
}
|
||||
|
||||
result = classifier.classify_incident(incident_data)
|
||||
print("\n" + "=" * 50)
|
||||
print(format_text_output(result))
|
||||
print("=" * 50)
|
||||
print()
|
||||
|
||||
except KeyboardInterrupt:
|
||||
print("\n\nExiting...")
|
||||
break
|
||||
except Exception as e:
|
||||
print(f"Error: {e}")
|
||||
|
||||
|
||||
def main():
|
||||
"""Main function with argument parsing and execution."""
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Classify incidents and provide response recommendations",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
python incident_classifier.py --input incident.json
|
||||
echo "Database is down" | python incident_classifier.py --format text
|
||||
python incident_classifier.py --interactive
|
||||
|
||||
Input JSON format:
|
||||
{
|
||||
"description": "Database connection timeouts",
|
||||
"service": "user-service",
|
||||
"affected_users": "80%",
|
||||
"business_impact": "high"
|
||||
}
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--input", "-i",
|
||||
help="Input file path (JSON format) or '-' for stdin"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--format", "-f",
|
||||
choices=["json", "text"],
|
||||
default="json",
|
||||
help="Output format (default: json)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--interactive",
|
||||
action="store_true",
|
||||
help="Run in interactive mode"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--output", "-o",
|
||||
help="Output file path (default: stdout)"
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Interactive mode
|
||||
if args.interactive:
|
||||
interactive_mode()
|
||||
return
|
||||
|
||||
classifier = IncidentClassifier()
|
||||
|
||||
try:
|
||||
# Read input
|
||||
if args.input == "-" or (not args.input and not sys.stdin.isatty()):
|
||||
# Read from stdin
|
||||
input_text = sys.stdin.read().strip()
|
||||
if not input_text:
|
||||
parser.error("No input provided")
|
||||
|
||||
# Try to parse as JSON first, then as text
|
||||
try:
|
||||
incident_data = json.loads(input_text)
|
||||
except json.JSONDecodeError:
|
||||
incident_data = parse_input_text(input_text)
|
||||
|
||||
elif args.input:
|
||||
# Read from file
|
||||
with open(args.input, 'r') as f:
|
||||
incident_data = json.load(f)
|
||||
else:
|
||||
parser.error("No input specified. Use --input, --interactive, or pipe data to stdin.")
|
||||
|
||||
# Validate required fields
|
||||
if not isinstance(incident_data, dict):
|
||||
parser.error("Input must be a JSON object")
|
||||
|
||||
if "description" not in incident_data:
|
||||
parser.error("Input must contain 'description' field")
|
||||
|
||||
# Classify incident
|
||||
result = classifier.classify_incident(incident_data)
|
||||
|
||||
# Format output
|
||||
if args.format == "json":
|
||||
output = format_json_output(result)
|
||||
else:
|
||||
output = format_text_output(result)
|
||||
|
||||
# Write output
|
||||
if args.output:
|
||||
with open(args.output, 'w') as f:
|
||||
f.write(output)
|
||||
f.write('\n')
|
||||
else:
|
||||
print(output)
|
||||
|
||||
except FileNotFoundError as e:
|
||||
print(f"Error: File not found - {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
except json.JSONDecodeError as e:
|
||||
print(f"Error: Invalid JSON - {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
except Exception as e:
|
||||
print(f"Error: {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
1638
engineering-team/incident-commander/scripts/pir_generator.py
Normal file
File diff suppressed because it is too large
680
engineering/api-design-reviewer/references/api_antipatterns.md
Normal file
@@ -0,0 +1,680 @@
|
||||
# Common API Anti-Patterns and How to Avoid Them
|
||||
|
||||
## Introduction
|
||||
|
||||
This document outlines common anti-patterns in REST API design that can lead to poor developer experience, maintenance nightmares, and scalability issues. Each anti-pattern is accompanied by examples and recommended solutions.
|
||||
|
||||
## 1. Verb-Based URLs (The RPC Trap)
|
||||
|
||||
### Anti-Pattern
|
||||
Using verbs in URLs instead of treating endpoints as resources.
|
||||
|
||||
```
|
||||
❌ Bad Examples:
|
||||
POST /api/getUsers
|
||||
POST /api/createUser
|
||||
GET /api/deleteUser/123
|
||||
POST /api/updateUserPassword
|
||||
GET /api/calculateOrderTotal/456
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Violates REST principles
|
||||
- Makes the API feel like RPC instead of REST
|
||||
- HTTP methods lose their semantic meaning
|
||||
- Reduces cacheability, since intermediaries rarely cache responses to POST requests
|
||||
- Harder to understand resource relationships
|
||||
|
||||
### Solution
|
||||
```
|
||||
✅ Good Examples:
|
||||
GET /api/users # Get users
|
||||
POST /api/users # Create user
|
||||
DELETE /api/users/123 # Delete user
|
||||
PATCH /api/users/123/password # Update password
|
||||
GET /api/orders/456/total # Get order total
|
||||
```
|
||||
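A reviewer could flag verb-based paths automatically. The following is a minimal sketch of such a check, not a complete linter: the verb list and the `find_verb_segments` helper are assumptions made for illustration, and the heuristic only catches verbs that are followed by an upper-case letter, underscore, or hyphen.

```python
import re

# Assumed verb list for illustration; extend it to match your own API vocabulary.
VERB_SEGMENT = re.compile(r"^(get|create|update|delete|calculate)(?=[A-Z_-])")


def find_verb_segments(path: str) -> list:
    """Return the path segments that start with a verb, e.g. 'getUsers'."""
    return [seg for seg in path.strip("/").split("/") if VERB_SEGMENT.match(seg)]


if __name__ == "__main__":
    for candidate in ["/api/getUsers", "/api/users/123/password"]:
        print(candidate, find_verb_segments(candidate))
```

Plural nouns such as `/api/updates` are not flagged, because the lookahead requires a word boundary in camelCase, snake_case, or kebab-case form.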
|
||||
## 2. Inconsistent Naming Conventions
|
||||
|
||||
### Anti-Pattern
|
||||
Mixed naming conventions across the API.
|
||||
|
||||
```json
|
||||
❌ Bad Examples:
|
||||
{
|
||||
"user_id": 123, // snake_case
|
||||
"firstName": "John", // camelCase
|
||||
"last-name": "Doe", // kebab-case
|
||||
"EMAIL": "john@example.com", // UPPER_CASE
|
||||
"IsActive": true // PascalCase
|
||||
}
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Confuses developers
|
||||
- Increases cognitive load
|
||||
- Makes code generation difficult
|
||||
- Reduces API adoption
|
||||
|
||||
### Solution
|
||||
```json
|
||||
✅ Choose one convention and stick to it (camelCase recommended):
|
||||
{
|
||||
"userId": 123,
|
||||
"firstName": "John",
|
||||
"lastName": "Doe",
|
||||
"email": "john@example.com",
|
||||
"isActive": true
|
||||
}
|
||||
```
|
||||
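One way to enforce a single convention is to normalize keys at the API boundary. The sketch below assumes camelCase as the target and handles only top-level keys; `to_camel_case` and `normalize_keys` are hypothetical helper names, and multi-word parts inside a single segment (for example `firstName_x`) lose their inner capitalization.

```python
import re
from typing import Any, Dict


def to_camel_case(name: str) -> str:
    """Convert snake_case, kebab-case, UPPER_CASE, or PascalCase to camelCase."""
    parts = [p for p in re.split(r"[_\-\s]+", name) if p]
    if not parts:
        return name
    if len(parts) == 1:
        word = parts[0]
        # A single all-caps word such as "EMAIL" becomes fully lower-case.
        if word.isupper():
            return word.lower()
        # Otherwise only the first letter changes (PascalCase -> camelCase).
        return word[0].lower() + word[1:]
    first, rest = parts[0].lower(), parts[1:]
    return first + "".join(p[:1].upper() + p[1:].lower() for p in rest)


def normalize_keys(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the payload with every top-level key in camelCase."""
    return {to_camel_case(key): value for key, value in payload.items()}
```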
|
||||
## 3. Ignoring HTTP Status Codes
|
||||
|
||||
### Anti-Pattern
|
||||
Always returning HTTP 200 regardless of the actual result.
|
||||
|
||||
```json
|
||||
❌ Bad Example:
|
||||
HTTP/1.1 200 OK
|
||||
{
|
||||
"status": "error",
|
||||
"code": 404,
|
||||
"message": "User not found"
|
||||
}
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Breaks HTTP semantics
|
||||
- Prevents proper error handling by clients
|
||||
- Misleads caches and proxies, which may store or forward the error as a successful response
|
||||
- Makes monitoring and debugging harder
|
||||
|
||||
### Solution
|
||||
```json
|
||||
✅ Good Example:
|
||||
HTTP/1.1 404 Not Found
|
||||
{
|
||||
"error": {
|
||||
"code": "USER_NOT_FOUND",
|
||||
"message": "User with ID 123 not found",
|
||||
"requestId": "req-abc123"
|
||||
}
|
||||
}
|
||||
```
|
||||
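On the server side, the status line and the error body can be derived from one lookup so they never disagree. This is a framework-neutral sketch: the `ERROR_STATUS` table and the `error_response` helper are assumptions for illustration, and unknown codes fall back to 500.

```python
from http import HTTPStatus
from typing import Any, Dict, Tuple

# Hypothetical mapping from application error codes to HTTP status codes.
ERROR_STATUS = {
    "USER_NOT_FOUND": HTTPStatus.NOT_FOUND,
    "VALIDATION_ERROR": HTTPStatus.BAD_REQUEST,
    "RATE_LIMIT_EXCEEDED": HTTPStatus.TOO_MANY_REQUESTS,
}


def error_response(code: str, message: str, request_id: str) -> Tuple[int, Dict[str, Any]]:
    """Return (status_code, body) so the HTTP status matches the error body."""
    status = ERROR_STATUS.get(code, HTTPStatus.INTERNAL_SERVER_ERROR)
    body = {"error": {"code": code, "message": message, "requestId": request_id}}
    return int(status), body
```

A handler would pass the returned status to whatever framework call sets the response code, instead of always replying with 200.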
|
||||
## 4. Overly Complex Nested Resources
|
||||
|
||||
### Anti-Pattern
|
||||
Creating deeply nested URL structures that are hard to navigate.
|
||||
|
||||
```
|
||||
❌ Bad Example:
|
||||
/companies/123/departments/456/teams/789/members/012/projects/345/tasks/678/comments/901
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- URLs become unwieldy
|
||||
- Creates tight coupling between resources
|
||||
- Makes independent resource access difficult
|
||||
- Complicates authorization logic
|
||||
|
||||
### Solution
|
||||
```
|
||||
✅ Good Examples:
|
||||
/tasks/678 # Direct access to task
|
||||
/tasks/678/comments # Task comments
|
||||
/users/012/tasks # User's tasks
|
||||
/projects/345?team=789 # Project filtering
|
||||
```
|
||||
|
||||
## 5. Inconsistent Error Response Formats
|
||||
|
||||
### Anti-Pattern
|
||||
Different error response structures across endpoints.
|
||||
|
||||
```json
|
||||
❌ Bad Examples:
|
||||
# Endpoint 1
|
||||
{"error": "Invalid email"}
|
||||
|
||||
# Endpoint 2
|
||||
{"success": false, "msg": "User not found", "code": 404}
|
||||
|
||||
# Endpoint 3
|
||||
{"errors": [{"field": "name", "message": "Required"}]}
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Makes error handling complex for clients
|
||||
- Reduces code reusability
|
||||
- Poor developer experience
|
||||
|
||||
### Solution
|
||||
```json
|
||||
✅ Standardized Error Format:
|
||||
{
|
||||
"error": {
|
||||
"code": "VALIDATION_ERROR",
|
||||
"message": "The request contains invalid data",
|
||||
"details": [
|
||||
{
|
||||
"field": "email",
|
||||
"code": "INVALID_FORMAT",
|
||||
"message": "Email address is not valid"
|
||||
}
|
||||
],
|
||||
"requestId": "req-123456",
|
||||
"timestamp": "2024-02-16T13:00:00Z"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## 6. Missing or Poor Pagination
|
||||
|
||||
### Anti-Pattern
|
||||
Returning all results in a single response or inconsistent pagination.
|
||||
|
||||
```json
|
||||
❌ Bad Examples:
|
||||
# No pagination (returns 10,000 records)
|
||||
GET /api/users
|
||||
|
||||
# Inconsistent pagination parameters
|
||||
GET /api/users?page=1&size=10
|
||||
GET /api/orders?offset=0&limit=20
|
||||
GET /api/products?start=0&count=50
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Can cause performance issues
|
||||
- May overwhelm clients
|
||||
- Inconsistent pagination parameters confuse developers
|
||||
- No way to estimate total results
|
||||
|
||||
### Solution
|
||||
```json
|
||||
✅ Good Example:
|
||||
GET /api/users?page=1&pageSize=10
|
||||
|
||||
{
|
||||
"data": [...],
|
||||
"pagination": {
|
||||
"page": 1,
|
||||
"pageSize": 10,
|
||||
"total": 150,
|
||||
"totalPages": 15,
|
||||
"hasNext": true,
|
||||
"hasPrev": false
|
||||
}
|
||||
}
|
||||
```
|
||||
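The pagination metadata above can be computed from three inputs: the total count, the page number, and the page size. The sketch below paginates an in-memory sequence purely for illustration; a real endpoint would usually push the offset and limit into the database query. The `paginate` name and its defaults are assumptions.

```python
import math
from typing import Any, Dict, Sequence


def paginate(items: Sequence[Any], page: int = 1, page_size: int = 10) -> Dict[str, Any]:
    """Slice one page out of `items` and describe it with pagination metadata."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")

    total = len(items)
    total_pages = math.ceil(total / page_size)
    start = (page - 1) * page_size

    return {
        "data": list(items[start:start + page_size]),
        "pagination": {
            "page": page,
            "pageSize": page_size,
            "total": total,
            "totalPages": total_pages,
            "hasNext": page < total_pages,
            "hasPrev": page > 1,
        },
    }
```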
|
||||
## 7. Exposing Internal Implementation Details
|
||||
|
||||
### Anti-Pattern
|
||||
URLs and field names that reflect database structure or internal architecture.
|
||||
|
||||
```
|
||||
❌ Bad Examples:
|
||||
/api/user_table/123
|
||||
/api/db_orders
|
||||
/api/legacy_customer_data
|
||||
/api/temp_migration_users
|
||||
|
||||
Response fields:
|
||||
{
|
||||
"user_id_pk": 123,
|
||||
"internal_ref_code": "usr_abc",
|
||||
"db_created_timestamp": 1645123456
|
||||
}
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Couples API to internal implementation
|
||||
- Makes refactoring difficult
|
||||
- Exposes unnecessary technical details
|
||||
- Reduces API longevity
|
||||
|
||||
### Solution
|
||||
```
|
||||
✅ Good Examples:
|
||||
/api/users/123
|
||||
/api/orders
|
||||
/api/customers
|
||||
|
||||
Response fields:
|
||||
{
|
||||
"id": 123,
|
||||
"referenceCode": "usr_abc",
|
||||
"createdAt": "2024-02-16T13:00:00Z"
|
||||
}
|
||||
```
|
||||
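A small mapping layer keeps storage names out of the public contract. The sketch below assumes a database row shaped like the bad example above, with the timestamp stored as Unix epoch seconds; `to_public_user` is a hypothetical helper, and any column it does not list explicitly is never exposed.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def to_public_user(row: Dict[str, Any]) -> Dict[str, Any]:
    """Map an internal database row to the public API representation."""
    # Assumes db_created_timestamp holds Unix epoch seconds.
    created = datetime.fromtimestamp(row["db_created_timestamp"], tz=timezone.utc)
    return {
        "id": row["user_id_pk"],
        "referenceCode": row["internal_ref_code"],
        "createdAt": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Because the public shape is an allow-list, renaming a column later only requires changing this function, not the API.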
|
||||
## 8. Overloading Single Endpoint
|
||||
|
||||
### Anti-Pattern
|
||||
Using one endpoint for multiple unrelated operations based on request parameters.
|
||||
|
||||
```
|
||||
❌ Bad Example:
|
||||
POST /api/user-actions
|
||||
{
|
||||
"action": "create_user",
|
||||
"userData": {...}
|
||||
}
|
||||
|
||||
POST /api/user-actions
|
||||
{
|
||||
"action": "delete_user",
|
||||
"userId": 123
|
||||
}
|
||||
|
||||
POST /api/user-actions
|
||||
{
|
||||
"action": "send_email",
|
||||
"userId": 123,
|
||||
"emailType": "welcome"
|
||||
}
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Breaks REST principles
|
||||
- Makes documentation complex
|
||||
- Complicates client implementation
|
||||
- Reduces discoverability
|
||||
|
||||
### Solution
|
||||
```
|
||||
✅ Good Examples:
|
||||
POST /api/users # Create user
|
||||
DELETE /api/users/123 # Delete user
|
||||
POST /api/users/123/emails # Send email to user
|
||||
```
|
||||
|
||||
## 9. Lack of Versioning Strategy
|
||||
|
||||
### Anti-Pattern
|
||||
Making breaking changes without version management.
|
||||
|
||||
```
|
||||
❌ Bad Examples:
|
||||
# Original API
|
||||
{
|
||||
"name": "John Doe",
|
||||
"age": 30
|
||||
}
|
||||
|
||||
# Later (breaking change with no versioning)
|
||||
{
|
||||
"firstName": "John",
|
||||
"lastName": "Doe",
|
||||
"birthDate": "1994-02-16"
|
||||
}
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Breaks existing clients
|
||||
- Forces all clients to update simultaneously
|
||||
- No graceful migration path
|
||||
- Reduces API stability
|
||||
|
||||
### Solution
|
||||
```
|
||||
✅ Good Examples:
|
||||
# Version 1
|
||||
GET /api/v1/users/123
|
||||
{
|
||||
"name": "John Doe",
|
||||
"age": 30
|
||||
}
|
||||
|
||||
# Version 2 (with both versions supported)
|
||||
GET /api/v2/users/123
|
||||
{
|
||||
"firstName": "John",
|
||||
"lastName": "Doe",
|
||||
"birthDate": "1994-02-16",
|
||||
"age": 30 // Backwards compatibility
|
||||
}
|
||||
```
|
||||
|
||||
## 10. Poor Error Messages
|
||||
|
||||
### Anti-Pattern
|
||||
Vague, unhelpful, or technical error messages.
|
||||
|
||||
```json
|
||||
❌ Bad Examples:
|
||||
{"error": "Something went wrong"}
|
||||
{"error": "Invalid input"}
|
||||
{"error": "SQL constraint violation: FK_user_profile_id"}
|
||||
{"error": "NullPointerException at line 247"}
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Doesn't help developers fix issues
|
||||
- Increases support burden
|
||||
- Poor developer experience
|
||||
- May expose sensitive information
|
||||
|
||||
### Solution
|
||||
```json
|
||||
✅ Good Examples:
|
||||
{
|
||||
"error": {
|
||||
"code": "VALIDATION_ERROR",
|
||||
"message": "The email address is required and must be in a valid format",
|
||||
"details": [
|
||||
{
|
||||
"field": "email",
|
||||
"code": "REQUIRED",
|
||||
"message": "Email address is required"
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## 11. Ignoring Content Negotiation
|
||||
|
||||
### Anti-Pattern
|
||||
Hard-coding response format without considering client preferences.
|
||||
|
||||
```
|
||||
❌ Bad Example:
|
||||
# Always returns JSON regardless of Accept header
|
||||
GET /api/users/123
|
||||
Accept: application/xml
|
||||
# Returns JSON anyway
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Reduces API flexibility
|
||||
- Ignores HTTP standards
|
||||
- Makes integration harder for diverse clients
|
||||
|
||||
### Solution
|
||||
```
|
||||
✅ Good Example:
|
||||
GET /api/users/123
|
||||
Accept: application/xml
|
||||
|
||||
HTTP/1.1 200 OK
|
||||
Content-Type: application/xml
|
||||
|
||||
<?xml version="1.0"?>
|
||||
<user>
|
||||
<id>123</id>
|
||||
<name>John Doe</name>
|
||||
</user>
|
||||
```
|
||||
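Choosing the response format from the `Accept` header can be sketched as follows. This is a simplified illustration, not a full implementation of HTTP content negotiation: it honors `q` values and wildcards but ignores media-type parameters and specificity rules. The `SUPPORTED` list and both function names are assumptions; a `None` result would typically be answered with `406 Not Acceptable`.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed server capabilities; the first entry is the default representation.
SUPPORTED = ["application/json", "application/xml"]


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    ranges = []
    for part in header.split(","):
        pieces = [piece.strip() for piece in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for param in pieces[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        ranges.append((media_type, quality))
    # sorted() is stable, so equal q-values keep their header order.
    return sorted(ranges, key=lambda item: item[1], reverse=True)


def negotiate(accept_header: str, supported: Sequence[str] = SUPPORTED) -> Optional[str]:
    """Pick the best supported media type, or None if nothing is acceptable."""
    if not accept_header.strip():
        return supported[0]
    for media_type, quality in parse_accept(accept_header):
        if quality <= 0:
            continue
        if media_type == "*/*":
            return supported[0]
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # e.g. "application/"
            for candidate in supported:
                if candidate.startswith(prefix):
                    return candidate
        elif media_type in supported:
            return media_type
    return None
```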
|
||||
## 12. Stateful API Design
|
||||
|
||||
### Anti-Pattern
|
||||
Maintaining session state on the server between requests.
|
||||
|
||||
```
|
||||
❌ Bad Example:
|
||||
# Step 1: Initialize session
|
||||
POST /api/session/init
|
||||
|
||||
# Step 2: Set context (requires step 1)
|
||||
POST /api/session/set-user/123
|
||||
|
||||
# Step 3: Get data (requires steps 1 & 2)
|
||||
GET /api/session/user-data
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Breaks REST statelessness principle
|
||||
- Reduces scalability
|
||||
- Makes caching difficult
|
||||
- Complicates error recovery
|
||||
|
||||
### Solution
|
||||
```
|
||||
✅ Good Example:
|
||||
# Self-contained requests
|
||||
GET /api/users/123/data
|
||||
Authorization: Bearer jwt-token-with-context
|
||||
```
|
||||
|
||||
## 13. Inconsistent HTTP Method Usage
|
||||
|
||||
### Anti-Pattern
|
||||
Using HTTP methods inappropriately or inconsistently.
|
||||
|
||||
```
|
||||
❌ Bad Examples:
|
||||
GET /api/users/123/delete # DELETE operation with GET
|
||||
POST /api/users/123/get # GET operation with POST
|
||||
PUT /api/users # Creating with PUT on collection
|
||||
GET /api/users/search # Search with side effects
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Violates HTTP semantics
|
||||
- Breaks caching and idempotency expectations
|
||||
- Confuses developers and tools
|
||||
|
||||
### Solution
|
||||
```
|
||||
✅ Good Examples:
|
||||
DELETE /api/users/123 # Delete with DELETE
|
||||
GET /api/users/123 # Get with GET
|
||||
POST /api/users # Create on collection
|
||||
GET /api/users?q=search # Safe search with GET
|
||||
```
|
||||
|
||||
## 14. Missing Rate Limiting Information
|
||||
|
||||
### Anti-Pattern
|
||||
Not providing rate limiting information to clients.
|
||||
|
||||
```
|
||||
❌ Bad Example:
|
||||
HTTP/1.1 429 Too Many Requests
|
||||
{
|
||||
"error": "Rate limit exceeded"
|
||||
}
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Clients don't know when to retry
|
||||
- No information about current limits
|
||||
- Difficult to implement proper backoff strategies
|
||||
|
||||
### Solution
|
||||
```
|
||||
✅ Good Example:
|
||||
HTTP/1.1 429 Too Many Requests
|
||||
X-RateLimit-Limit: 1000
|
||||
X-RateLimit-Remaining: 0
|
||||
X-RateLimit-Reset: 1640995200
|
||||
Retry-After: 3600
|
||||
|
||||
{
|
||||
"error": {
|
||||
"code": "RATE_LIMIT_EXCEEDED",
|
||||
"message": "API rate limit exceeded",
|
||||
"retryAfter": 3600
|
||||
}
|
||||
}
|
||||
```
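A minimal sketch of how the headers above might be produced on the server and consumed by a client. The function names are hypothetical, and the `X-RateLimit-*` headers are a common convention rather than a formal standard, so real services may name them differently.

```python
import math
import time


def rate_limit_headers(limit, remaining, reset_epoch, now=None):
    """Build rate-limit headers; add Retry-After once the quota is exhausted."""
    now = time.time() if now is None else now
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(remaining, 0)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    if remaining <= 0:
        # Retry-After is expressed in whole seconds until the window resets.
        headers["Retry-After"] = str(max(0, math.ceil(reset_epoch - now)))
    return headers


def seconds_until_retry(headers, now=None, default=1.0):
    """Client side: how long to wait before retrying a 429 response."""
    now = time.time() if now is None else now
    retry_after = headers.get("Retry-After")
    if retry_after is not None and retry_after.isdigit():
        return float(retry_after)
    reset = headers.get("X-RateLimit-Reset")
    if reset is not None and reset.isdigit():
        return max(0.0, float(reset) - now)
    return default
```

The client helper prefers `Retry-After` because it is relative and therefore unaffected by clock skew between client and server.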
|
||||
|
||||
## 15. Chatty API Design
|
||||
|
||||
### Anti-Pattern
|
||||
Requiring multiple API calls to accomplish common tasks.
|
||||
|
||||
```
|
||||
❌ Bad Example:
|
||||
# Get user profile requires 4 API calls
|
||||
GET /api/users/123 # Basic info
|
||||
GET /api/users/123/profile # Profile details
|
||||
GET /api/users/123/settings # User settings
|
||||
GET /api/users/123/stats # User statistics
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Increases latency
|
||||
- Creates network overhead
|
||||
- Makes mobile apps inefficient
|
||||
- Complicates client implementation
|
||||
|
||||
### Solution
|
||||
```
|
||||
✅ Good Examples:
|
||||
# Single call with expansion
|
||||
GET /api/users/123?include=profile,settings,stats
|
||||
|
||||
# Or provide composite endpoints
|
||||
GET /api/users/123/dashboard
|
||||
|
||||
# Or batch operations
|
||||
POST /api/batch
|
||||
{
|
||||
"requests": [
|
||||
{"method": "GET", "url": "/users/123"},
|
||||
{"method": "GET", "url": "/users/123/profile"}
|
||||
]
|
||||
}
|
||||
```
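The `include` expansion shown above could be handled roughly like this. It is a sketch under assumptions: `EXPANDERS`, `parse_include`, and `get_user` are hypothetical names, and the lambdas stand in for whatever lookups the separate endpoints would have performed.

```python
# Hypothetical loaders standing in for the separate sub-resource lookups.
EXPANDERS = {
    "profile": lambda user_id: {"bio": "example"},
    "settings": lambda user_id: {"theme": "dark"},
    "stats": lambda user_id: {"logins": 42},
}


def parse_include(raw):
    """Split ?include=a,b,c into a list of names, rejecting unknown ones."""
    names = [n.strip() for n in (raw or "").split(",") if n.strip()]
    unknown = [n for n in names if n not in EXPANDERS]
    if unknown:
        raise ValueError("Unsupported include value(s): " + ", ".join(unknown))
    # Preserve order while dropping duplicates.
    return list(dict.fromkeys(names))


def get_user(user_id, include=None):
    """Return the base user plus any requested expansions in one response."""
    user = {"id": user_id, "name": "John Doe"}
    for name in parse_include(include):
        user[name] = EXPANDERS[name](user_id)
    return user
```

Validating the names against a fixed set keeps clients from requesting fields the API never intended to expose.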
|
||||
|
||||
## 16. No Input Validation
|
||||
|
||||
### Anti-Pattern
|
||||
Accepting and processing invalid input without proper validation.
|
||||
|
||||
```
|
||||
❌ Bad Example:
|
||||
POST /api/users
|
||||
{
|
||||
"email": "not-an-email",
|
||||
"age": -5,
|
||||
"name": ""
|
||||
}
|
||||
|
||||
# API processes this and fails later or stores invalid data
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Leads to data corruption
|
||||
- Security vulnerabilities
|
||||
- Difficult to debug issues
|
||||
- Poor user experience
|
||||
|
||||
### Solution
|
||||
```
|
||||
✅ Good Example:
|
||||
POST /api/users
|
||||
{
|
||||
"email": "not-an-email",
|
||||
"age": -5,
|
||||
"name": ""
|
||||
}
|
||||
|
||||
HTTP/1.1 400 Bad Request
|
||||
{
|
||||
"error": {
|
||||
"code": "VALIDATION_ERROR",
|
||||
"message": "The request contains invalid data",
|
||||
"details": [
|
||||
{
|
||||
"field": "email",
|
||||
"code": "INVALID_FORMAT",
|
||||
"message": "Email must be a valid email address"
|
||||
},
|
||||
{
|
||||
"field": "age",
|
||||
"code": "INVALID_RANGE",
|
||||
"message": "Age must be between 0 and 150"
|
||||
},
|
||||
{
|
||||
"field": "name",
|
||||
"code": "REQUIRED",
|
||||
"message": "Name is required and cannot be empty"
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
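A minimal validation sketch that yields the error structure above. The helper names `validate_user` and `create_user` are hypothetical, and the email pattern is a deliberately loose assumption rather than a full address grammar.

```python
import re

# Deliberately loose pattern: something@something.tld. Real services often
# confirm addresses by sending mail rather than relying on a regex.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_user(payload):
    """Return a list of field errors; an empty list means the payload is valid."""
    errors = []
    email = payload.get("email")
    if not isinstance(email, str) or not EMAIL_RE.match(email):
        errors.append({"field": "email", "code": "INVALID_FORMAT",
                       "message": "Email must be a valid email address"})
    age = payload.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(age, int) or isinstance(age, bool) or not 0 <= age <= 150:
        errors.append({"field": "age", "code": "INVALID_RANGE",
                       "message": "Age must be between 0 and 150"})
    name = payload.get("name")
    if not isinstance(name, str) or not name.strip():
        errors.append({"field": "name", "code": "REQUIRED",
                       "message": "Name is required and cannot be empty"})
    return errors


def create_user(payload):
    """Return (status, body); reject invalid input before touching storage."""
    errors = validate_user(payload)
    if errors:
        return 400, {"error": {"code": "VALIDATION_ERROR",
                               "message": "The request contains invalid data",
                               "details": errors}}
    return 201, {"data": payload}
```

Collecting every field error in one pass lets the client fix all problems in a single round trip.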
|
||||
|
||||
## 17. Synchronous Long-Running Operations
|
||||
|
||||
### Anti-Pattern
|
||||
Blocking the client with long-running operations in synchronous endpoints.
|
||||
|
||||
```
|
||||
❌ Bad Example:
|
||||
POST /api/reports/generate
|
||||
# Client waits 30 seconds for response
|
||||
```
|
||||
|
||||
### Why It's Bad
|
||||
- Poor user experience
|
||||
- Timeouts and connection issues
|
||||
- Resource waste on client and server
|
||||
- Doesn't scale well
|
||||
|
||||
### Solution
|
||||
```
|
||||
✅ Good Example:
|
||||
# Async pattern
|
||||
POST /api/reports
|
||||
HTTP/1.1 202 Accepted
|
||||
Location: /api/reports/job-123
|
||||
{
|
||||
"jobId": "job-123",
|
||||
"status": "processing",
|
||||
"estimatedCompletion": "2024-02-16T13:05:00Z"
|
||||
}
|
||||
|
||||
# Check status
|
||||
GET /api/reports/job-123
|
||||
{
|
||||
"jobId": "job-123",
|
||||
"status": "completed",
|
||||
"result": "/api/reports/download/report-456"
|
||||
}
|
||||
```
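The 202-then-poll flow above can be sketched with an in-memory job table. Everything here is an assumption for illustration: the function names are hypothetical, the dictionary stands in for a durable job store, and `complete_report` stands in for a background worker.

```python
import itertools

_job_ids = itertools.count(123)
_jobs = {}  # In-memory stand-in for a durable job store.


def submit_report(params):
    """Handle POST /api/reports: record the job and answer 202 immediately."""
    job_id = "job-{}".format(next(_job_ids))
    _jobs[job_id] = {"jobId": job_id, "status": "processing", "params": params}
    headers = {"Location": "/api/reports/" + job_id}
    return 202, headers, {"jobId": job_id, "status": "processing"}


def complete_report(job_id, result_url):
    """Called by the (hypothetical) background worker when the report is ready."""
    _jobs[job_id].update(status="completed", result=result_url)


def get_report_status(job_id):
    """Handle GET /api/reports/{job_id}."""
    job = _jobs.get(job_id)
    if job is None:
        return 404, {"error": {"code": "NOT_FOUND", "message": "Unknown job"}}
    # Do not echo the original request parameters back in the status body.
    body = {k: v for k, v in job.items() if k != "params"}
    return 200, body
```

The request handler returns as soon as the job is recorded, so the connection is never held open for the duration of the work.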
|
||||
|
||||
## Prevention Strategies
|
||||
|
||||
### 1. API Design Reviews
|
||||
- Implement mandatory design reviews
|
||||
- Use checklists based on these anti-patterns
|
||||
- Include multiple stakeholders
|
||||
|
||||
### 2. API Style Guides
|
||||
- Create and enforce API style guides
|
||||
- Use linting tools for consistency
|
||||
- Regular training for development teams
|
||||
|
||||
### 3. Automated Testing
|
||||
- Test for common anti-patterns
|
||||
- Include contract testing
|
||||
- Monitor API usage patterns
|
||||
|
||||
### 4. Documentation Standards
|
||||
- Require comprehensive API documentation
|
||||
- Include examples and error scenarios
|
||||
- Keep documentation up-to-date
|
||||
|
||||
### 5. Client Feedback
|
||||
- Regularly collect feedback from API consumers
|
||||
- Monitor API usage analytics
|
||||
- Conduct developer experience surveys
|
||||
|
||||
## Conclusion
|
||||
|
||||
Avoiding these anti-patterns requires:
|
||||
- Understanding REST principles
|
||||
- Consistent design standards
|
||||
- Regular review and refactoring
|
||||
- Focus on developer experience
|
||||
- Proper tooling and automation
|
||||
|
||||
Remember: A well-designed API is an asset that grows in value over time, while a poorly designed API becomes a liability that hampers development and adoption.
|
||||
487
engineering/api-design-reviewer/references/rest_design_rules.md
Normal file
@@ -0,0 +1,487 @@
|
||||
# REST API Design Rules Reference
|
||||
|
||||
## Core Principles
|
||||
|
||||
### 1. Resources, Not Actions
|
||||
REST APIs should focus on **resources** (nouns) rather than **actions** (verbs). The HTTP methods provide the actions.
|
||||
|
||||
```
|
||||
✅ Good:
|
||||
GET /users # Get all users
|
||||
GET /users/123 # Get user 123
|
||||
POST /users # Create new user
|
||||
PUT /users/123 # Update user 123
|
||||
DELETE /users/123 # Delete user 123
|
||||
|
||||
❌ Bad:
|
||||
POST /getUsers
|
||||
POST /createUser
|
||||
POST /updateUser/123
|
||||
POST /deleteUser/123
|
||||
```
|
||||
|
||||
### 2. Hierarchical Resource Structure
|
||||
Use hierarchical URLs to represent resource relationships:
|
||||
|
||||
```
|
||||
/users/123/orders/456/items/789
|
||||
```
|
||||
|
||||
But avoid excessive nesting (max 3-4 levels):
|
||||
|
||||
```
|
||||
❌ Too deep: /companies/123/departments/456/teams/789/members/012/tasks/345
|
||||
✅ Better: /tasks/345?member=012&team=789
|
||||
```
|
||||
|
||||
## Resource Naming Conventions
|
||||
|
||||
### URLs Should Use Kebab-Case
|
||||
```
|
||||
✅ Good:
|
||||
/user-profiles
|
||||
/order-items
|
||||
/shipping-addresses
|
||||
|
||||
❌ Bad:
|
||||
/userProfiles
|
||||
/user_profiles
|
||||
/orderItems
|
||||
```
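A linting step could check or normalize path segments along these lines. This is a sketch only: `to_kebab_case` and `is_kebab_case` are hypothetical helpers, and the regular expressions assume ASCII identifiers.

```python
import re


def to_kebab_case(segment):
    """Convert a camelCase, PascalCase, or snake_case segment to kebab-case."""
    # Hyphenate lower/digit-to-upper boundaries: userProfiles -> user-Profiles
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1-\2", segment)
    # Split an acronym run from the word that follows it: APIKeys -> API-Keys
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1-\2", s)
    return re.sub(r"[_\s]+", "-", s).strip("-").lower()


def is_kebab_case(segment):
    """True if the segment is lowercase words joined by single hyphens."""
    return re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", segment) is not None
```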
|
||||
|
||||
### Collections vs Individual Resources
|
||||
```
|
||||
Collection: /users
|
||||
Individual: /users/123
|
||||
Sub-resource: /users/123/orders
|
||||
```
|
||||
|
||||
### Pluralization Rules
|
||||
- Use **plural nouns** for collections: `/users`, `/orders`
|
||||
- Use **singular nouns** for singleton resources (exactly one per context): `/user-profile`, `/current-session`
|
||||
- Be consistent throughout your API
|
||||
|
||||
## HTTP Methods Usage
|
||||
|
||||
### GET - Safe and Idempotent
|
||||
- **Purpose**: Retrieve data
|
||||
- **Safe**: No side effects
|
||||
- **Idempotent**: Repeating the same request has the same effect on server state as making it once
|
||||
- **Request Body**: Should not have one
|
||||
- **Cacheable**: Yes
|
||||
|
||||
```
|
||||
GET /users/123
|
||||
GET /users?status=active&limit=10
|
||||
```
|
||||
|
||||
### POST - Not Idempotent
|
||||
- **Purpose**: Create resources, non-idempotent operations
|
||||
- **Safe**: No
|
||||
- **Idempotent**: No
|
||||
- **Request Body**: Usually required
|
||||
- **Cacheable**: Generally no
|
||||
|
||||
```
|
||||
POST /users # Create new user
|
||||
POST /users/123/activate # Activate user (action)
|
||||
```
|
||||
|
||||
### PUT - Idempotent
|
||||
- **Purpose**: Create or completely replace a resource
|
||||
- **Safe**: No
|
||||
- **Idempotent**: Yes
|
||||
- **Request Body**: Required (complete resource)
|
||||
- **Cacheable**: No
|
||||
|
||||
```
|
||||
PUT /users/123 # Replace entire user resource
|
||||
```
|
||||
|
||||
### PATCH - Partial Update
|
||||
- **Purpose**: Partially update a resource
|
||||
- **Safe**: No
|
||||
- **Idempotent**: Not necessarily
|
||||
- **Request Body**: Required (partial resource)
|
||||
- **Cacheable**: No
|
||||
|
||||
```
|
||||
PATCH /users/123 # Update only specified fields
|
||||
```
|
||||
|
||||
### DELETE - Idempotent
|
||||
- **Purpose**: Remove a resource
|
||||
- **Safe**: No
|
||||
- **Idempotent**: Yes (repeating the call leaves the server in the same state, although a later call may return 404)
|
||||
- **Request Body**: Usually not needed
|
||||
- **Cacheable**: No
|
||||
|
||||
```
|
||||
DELETE /users/123
|
||||
```
|
||||
|
||||
## Status Codes
|
||||
|
||||
### Success Codes (2xx)
|
||||
- **200 OK**: Standard success response
|
||||
- **201 Created**: Resource created successfully (POST)
|
||||
- **202 Accepted**: Request accepted for processing (async)
|
||||
- **204 No Content**: Success with no response body (DELETE, PUT)
|
||||
|
||||
### Redirection Codes (3xx)
|
||||
- **301 Moved Permanently**: Resource permanently moved
|
||||
- **302 Found**: Temporary redirect
|
||||
- **304 Not Modified**: Use cached version
|
||||
|
||||
### Client Error Codes (4xx)
|
||||
- **400 Bad Request**: Invalid request syntax or data
|
||||
- **401 Unauthorized**: Authentication required
|
||||
- **403 Forbidden**: Access denied (user authenticated but not authorized)
|
||||
- **404 Not Found**: Resource not found
|
||||
- **405 Method Not Allowed**: HTTP method not supported
|
||||
- **409 Conflict**: Resource conflict (duplicates, version mismatch)
|
||||
- **422 Unprocessable Entity**: Valid syntax but semantic errors
|
||||
- **429 Too Many Requests**: Rate limit exceeded
|
||||
|
||||
### Server Error Codes (5xx)
|
||||
- **500 Internal Server Error**: Unexpected server error
|
||||
- **502 Bad Gateway**: Invalid response from upstream server
|
||||
- **503 Service Unavailable**: Server temporarily unavailable
|
||||
- **504 Gateway Timeout**: Upstream server timeout
|
||||
|
||||
## URL Design Patterns
|
||||
|
||||
### Query Parameters for Filtering
|
||||
```
|
||||
GET /users?status=active
|
||||
GET /users?role=admin&department=engineering
|
||||
GET /orders?created_after=2024-01-01&status=pending
|
||||
```
|
||||
|
||||
### Pagination Parameters
|
||||
```
|
||||
# Offset-based
|
||||
GET /users?offset=20&limit=10
|
||||
|
||||
# Cursor-based
|
||||
GET /users?cursor=eyJpZCI6MTIzfQ&limit=10
|
||||
|
||||
# Page-based
|
||||
GET /users?page=3&page_size=10
|
||||
```
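The cursor in the example above is URL-safe base64 of a small JSON object (it decodes to `{"id":123}`). A minimal sketch of that scheme follows; the function names are hypothetical, and it assumes rows are ordered by an ascending integer `id`. An unsigned cursor like this can be read and altered by clients, so it should not carry anything sensitive.

```python
import base64
import json


def encode_cursor(position):
    """Serialize a position dict into a URL-safe, unpadded base64 string."""
    raw = json.dumps(position, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_cursor(cursor):
    """Reverse encode_cursor; raises ValueError on malformed input."""
    padded = cursor + "=" * (-len(cursor) % 4)
    try:
        return json.loads(base64.urlsafe_b64decode(padded))
    except ValueError as exc:  # Covers base64, UTF-8, and JSON decoding errors.
        raise ValueError("Invalid cursor") from exc


def page_after(rows, cursor=None, limit=10):
    """Return (items, next_cursor) for rows sorted by ascending id."""
    last_id = decode_cursor(cursor)["id"] if cursor else 0
    items = [r for r in rows if r["id"] > last_id][:limit]
    has_more = bool(items) and any(r["id"] > items[-1]["id"] for r in rows)
    next_cursor = encode_cursor({"id": items[-1]["id"]}) if has_more else None
    return items, next_cursor
```

Unlike offset-based paging, the cursor refers to a position in the data, so rows inserted earlier in the list do not shift later pages.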
|
||||
|
||||
### Sorting Parameters
|
||||
```
|
||||
GET /users?sort=created_at # Ascending
|
||||
GET /users?sort=-created_at # Descending (prefix with -)
|
||||
GET /users?sort=last_name,first_name # Multiple fields
|
||||
```
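The `sort` syntax above can be parsed as sketched below. The names `ALLOWED_SORT_FIELDS`, `parse_sort`, and `apply_sort` are hypothetical, and the in-memory sort stands in for what would normally become an `ORDER BY` clause.

```python
# Hypothetical whitelist of sortable fields.
ALLOWED_SORT_FIELDS = {"created_at", "last_name", "first_name"}


def parse_sort(raw, allowed=ALLOWED_SORT_FIELDS):
    """Turn 'last_name,-created_at' into [('last_name', 'asc'), ('created_at', 'desc')]."""
    result = []
    for token in (raw or "").split(","):
        token = token.strip()
        if not token:
            continue
        direction = "desc" if token.startswith("-") else "asc"
        field = token.lstrip("-")
        if field not in allowed:
            raise ValueError("Cannot sort by " + repr(field))
        result.append((field, direction))
    return result


def apply_sort(rows, sort_spec):
    """Sort a list of dicts; later keys are applied first so earlier keys win."""
    for field, direction in reversed(sort_spec):
        rows = sorted(rows, key=lambda r: r[field], reverse=(direction == "desc"))
    return rows
```

Checking fields against a whitelist prevents clients from sorting on internal columns or injecting arbitrary expressions.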
|
||||
|
||||
### Field Selection
|
||||
```
|
||||
GET /users?fields=id,name,email
|
||||
GET /users/123?include=orders,profile
|
||||
GET /users/123?exclude=internal_notes
|
||||
```
|
||||
|
||||
### Search Parameters
|
||||
```
|
||||
GET /users?q=john
|
||||
GET /products?search=laptop&category=electronics
|
||||
```
|
||||
|
||||
## Response Format Standards
|
||||
|
||||
### Consistent Response Structure
|
||||
```json
|
||||
{
|
||||
"data": {
|
||||
"id": 123,
|
||||
"name": "John Doe",
|
||||
"email": "john@example.com"
|
||||
},
|
||||
"meta": {
|
||||
"timestamp": "2024-02-16T13:00:00Z",
|
||||
"version": "1.0"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Collection Responses
|
||||
```json
|
||||
{
|
||||
"data": [
|
||||
{"id": 1, "name": "Item 1"},
|
||||
{"id": 2, "name": "Item 2"}
|
||||
],
|
||||
"pagination": {
|
||||
"total": 150,
|
||||
"page": 1,
|
||||
"pageSize": 10,
|
||||
"totalPages": 15,
|
||||
"hasNext": true,
|
||||
"hasPrev": false
|
||||
},
|
||||
"meta": {
|
||||
"timestamp": "2024-02-16T13:00:00Z"
|
||||
}
|
||||
}
|
||||
```
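The `pagination` block above follows directly from three inputs. A minimal sketch, with `pagination_meta` as a hypothetical helper name:

```python
import math


def pagination_meta(total, page, page_size):
    """Build the pagination block for a page-based collection response."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    total_pages = math.ceil(total / page_size)
    return {
        "total": total,
        "page": page,
        "pageSize": page_size,
        "totalPages": total_pages,
        "hasNext": page < total_pages,
        "hasPrev": page > 1,
    }
```

With `total=150`, `page=1`, and `page_size=10`, this reproduces the values in the example response.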
|
||||
|
||||
### Error Response Format
|
||||
```json
|
||||
{
|
||||
"error": {
|
||||
"code": "VALIDATION_ERROR",
|
||||
"message": "The request contains invalid parameters",
|
||||
"details": [
|
||||
{
|
||||
"field": "email",
|
||||
"code": "INVALID_FORMAT",
|
||||
"message": "Email address is not valid"
|
||||
}
|
||||
],
|
||||
"requestId": "req-123456",
|
||||
"timestamp": "2024-02-16T13:00:00Z"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Field Naming Conventions
|
||||
|
||||
### Use camelCase for JSON Fields
|
||||
```json
|
||||
✅ Good:
|
||||
{
|
||||
"firstName": "John",
|
||||
"lastName": "Doe",
|
||||
"createdAt": "2024-02-16T13:00:00Z",
|
||||
"isActive": true
|
||||
}
|
||||
|
||||
❌ Bad:
|
||||
{
|
||||
"first_name": "John",
|
||||
"LastName": "Doe",
|
||||
"created-at": "2024-02-16T13:00:00Z"
|
||||
}
|
||||
```
|
||||
|
||||
### Boolean Fields
|
||||
Use positive, clear names with "is", "has", "can", or "should" prefixes:
|
||||
|
||||
```json
|
||||
✅ Good:
|
||||
{
|
||||
"isActive": true,
|
||||
"hasPermission": false,
|
||||
"canEdit": true,
|
||||
"shouldNotify": false
|
||||
}
|
||||
|
||||
❌ Bad:
|
||||
{
|
||||
"active": true,
|
||||
"disabled": false, // Double negative
|
||||
"permission": false // Unclear meaning
|
||||
}
|
||||
```
|
||||
|
||||
### Date/Time Fields
|
||||
- Use ISO 8601 format: `2024-02-16T13:00:00Z`
|
||||
- Include timezone information
|
||||
- Use consistent field naming:
|
||||
|
||||
```json
|
||||
{
|
||||
"createdAt": "2024-02-16T13:00:00Z",
|
||||
"updatedAt": "2024-02-16T13:30:00Z",
|
||||
"deletedAt": null,
|
||||
"publishedAt": "2024-02-16T14:00:00Z"
|
||||
}
|
||||
```
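Producing and parsing these timestamps can be sketched with the standard library alone. The helper names are hypothetical; the sketch assumes all timestamps are serialized in UTC with a trailing `Z`.

```python
from datetime import datetime, timezone


def to_iso8601_utc(dt):
    """Format an aware datetime as ISO 8601 in UTC with a trailing 'Z'."""
    if dt.tzinfo is None:
        raise ValueError("Refusing to serialize a naive datetime")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def from_iso8601(text):
    """Parse the format above; 'Z' is mapped to '+00:00' for older Pythons."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))
```

Rejecting naive datetimes avoids silently emitting local times that look like UTC.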
|
||||
|
||||
## Content Negotiation
|
||||
|
||||
### Accept Headers
|
||||
```
|
||||
Accept: application/json
|
||||
Accept: application/xml
|
||||
Accept: application/json; version=1
|
||||
```
|
||||
|
||||
### Content-Type Headers
|
||||
```
|
||||
Content-Type: application/json
|
||||
Content-Type: application/json; charset=utf-8
|
||||
Content-Type: multipart/form-data
|
||||
```
|
||||
|
||||
### Versioning via Headers
|
||||
```
|
||||
Accept: application/vnd.myapi.v1+json
|
||||
API-Version: 1.0
|
||||
```
|
||||
|
||||
## Caching Guidelines
|
||||
|
||||
### Cache-Control Headers
|
||||
```
|
||||
Cache-Control: public, max-age=3600 # Cache for 1 hour
|
||||
Cache-Control: no-store # Don't cache
|
||||
Cache-Control: no-cache, must-revalidate # Always validate
|
||||
```
|
||||
|
||||
### ETags for Conditional Requests
|
||||
```
|
||||
HTTP/1.1 200 OK
|
||||
ETag: "123456789"
|
||||
Last-Modified: Wed, 21 Oct 2015 07:28:00 GMT
|
||||
|
||||
# Client subsequent request:
|
||||
If-None-Match: "123456789"
|
||||
If-Modified-Since: Wed, 21 Oct 2015 07:28:00 GMT
|
||||
```
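The conditional request above could be served as sketched here. It is an illustration under assumptions: `make_etag` and `conditional_get` are hypothetical names, and deriving the tag from a truncated SHA-256 of the body is one possible choice, not a requirement of HTTP.

```python
import hashlib


def make_etag(body_bytes):
    """Derive a quoted ETag from the response body."""
    return '"{}"'.format(hashlib.sha256(body_bytes).hexdigest()[:16])


def conditional_get(body_bytes, if_none_match=None):
    """Return (status, headers, body) honoring If-None-Match."""
    etag = make_etag(body_bytes)
    headers = {"ETag": etag}
    if if_none_match:
        # The header may carry several comma-separated tags, or "*".
        # If-None-Match uses weak comparison, so a W/ prefix is ignored.
        tags = [t.strip() for t in if_none_match.split(",")]
        tags = [t[2:] if t.startswith("W/") else t for t in tags]
        if "*" in tags or etag in tags:
            return 304, headers, b""
    return 200, headers, body_bytes
```

A 304 response carries no body, so an unchanged resource costs only the headers on the wire.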
|
||||
|
||||
## Security Headers
|
||||
|
||||
### Authentication
|
||||
```
|
||||
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
|
||||
Authorization: Basic dXNlcjpwYXNzd29yZA==
|
||||
Authorization: Api-Key abc123def456
|
||||
```
|
||||
|
||||
### CORS Headers
|
||||
```
|
||||
Access-Control-Allow-Origin: https://example.com
|
||||
Access-Control-Allow-Methods: GET, POST, PUT, DELETE
|
||||
Access-Control-Allow-Headers: Content-Type, Authorization
|
||||
```
|
||||
|
||||
## Rate Limiting
|
||||
|
||||
### Rate Limit Headers
|
||||
```
|
||||
X-RateLimit-Limit: 1000
|
||||
X-RateLimit-Remaining: 999
|
||||
X-RateLimit-Reset: 1640995200
|
||||
X-RateLimit-Window: 3600
|
||||
```
|
||||
|
||||
### Rate Limit Exceeded Response
|
||||
```
|
||||
HTTP/1.1 429 Too Many Requests
|
||||
Retry-After: 3600
|
||||
|
||||
{
|
||||
"error": {
|
||||
"code": "RATE_LIMIT_EXCEEDED",
|
||||
"message": "API rate limit exceeded",
|
||||
"details": {
|
||||
"limit": 1000,
|
||||
"window": "1 hour",
|
||||
"retryAfter": 3600
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Hypermedia (HATEOAS)
|
||||
|
||||
### Links in Responses
|
||||
```json
|
||||
{
|
||||
"id": 123,
|
||||
"name": "John Doe",
|
||||
"email": "john@example.com",
|
||||
"_links": {
|
||||
"self": {
|
||||
"href": "/users/123"
|
||||
},
|
||||
"orders": {
|
||||
"href": "/users/123/orders"
|
||||
},
|
||||
"edit": {
|
||||
"href": "/users/123",
|
||||
"method": "PUT"
|
||||
},
|
||||
"delete": {
|
||||
"href": "/users/123",
|
||||
"method": "DELETE"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Link Relations
|
||||
- **self**: Link to the resource itself
|
||||
- **edit**: Link to edit the resource
|
||||
- **delete**: Link to delete the resource
|
||||
- **related**: Link to related resources
|
||||
- **next/prev**: Pagination links
|
||||
|
||||
## Common Anti-Patterns to Avoid
|
||||
|
||||
### 1. Verbs in URLs
|
||||
```
|
||||
❌ Bad: /api/getUser/123
|
||||
✅ Good: GET /api/users/123
|
||||
```
|
||||
|
||||
### 2. Inconsistent Naming
|
||||
```
|
||||
❌ Bad: /user-profiles and /userAddresses
|
||||
✅ Good: /user-profiles and /user-addresses
|
||||
```
|
||||
|
||||
### 3. Deep Nesting
|
||||
```
|
||||
❌ Bad: /companies/123/departments/456/teams/789/members/012
|
||||
✅ Good: /team-members/012?team=789
|
||||
```
|
||||
|
||||
### 4. Ignoring HTTP Status Codes
|
||||
```
|
||||
❌ Bad: Always return 200 with error info in body
|
||||
✅ Good: Use appropriate status codes (404, 400, 500, etc.)
|
||||
```
|
||||
|
||||
### 5. Exposing Internal Structure
|
||||
```
|
||||
❌ Bad: /api/database_table_users
|
||||
✅ Good: /api/users
|
||||
```
|
||||
|
||||
### 6. No Versioning Strategy
|
||||
```
|
||||
❌ Bad: Breaking changes without version management
|
||||
✅ Good: /api/v1/users or Accept: application/vnd.api+json;version=1
|
||||
```
|
||||
|
||||
### 7. Inconsistent Error Responses
|
||||
```
|
||||
❌ Bad: Different error formats for different endpoints
|
||||
✅ Good: Standardized error response structure
|
||||
```
|
||||
|
||||
## Best Practices Summary
|
||||
|
||||
1. **Use nouns for resources, not verbs**
|
||||
2. **Leverage HTTP methods correctly**
|
||||
3. **Maintain consistent naming conventions**
|
||||
4. **Implement proper error handling**
|
||||
5. **Use appropriate HTTP status codes**
|
||||
6. **Design for cacheability**
|
||||
7. **Implement security from the start**
|
||||
8. **Plan for versioning**
|
||||
9. **Provide comprehensive documentation**
|
||||
10. **Follow HATEOAS principles when applicable**
|
||||
|
||||
## Further Reading
|
||||
|
||||
- [RFC 9110 - HTTP Semantics](https://www.rfc-editor.org/rfc/rfc9110) (obsoletes RFC 7231)
|
||||
- [RFC 6570 - URI Template](https://tools.ietf.org/html/rfc6570)
|
||||
- [OpenAPI Specification](https://swagger.io/specification/)
|
||||
- [REST API Design Best Practices](https://www.restapitutorial.com/)
|
||||
- [HTTP Status Code Definitions](https://httpstatuses.com/)
|
||||
1661
engineering/api-design-reviewer/scripts/api_scorecard.py
Normal file
File diff suppressed because it is too large
1102
engineering/api-design-reviewer/scripts/breaking_change_detector.py
Normal file
File diff suppressed because it is too large
458
engineering/interview-system-designer/SKILL.md
Normal file
@@ -0,0 +1,458 @@
|
||||
---
|
||||
name: interview-system-designer
|
||||
description: This skill should be used when the user asks to "design interview processes", "create hiring pipelines", "calibrate interview loops", "generate interview questions", "design competency matrices", "analyze interviewer bias", "create scoring rubrics", "build question banks", or "optimize hiring systems". Use for designing role-specific interview loops, competency assessments, and hiring calibration systems.
|
||||
---
|
||||
|
||||
# Interview System Designer
|
||||
|
||||
Comprehensive interview system design, competency assessment, and hiring process optimization.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Quick Start](#quick-start)
|
||||
- [Tools Overview](#tools-overview)
|
||||
- [Interview Loop Designer](#1-interview-loop-designer)
|
||||
- [Question Bank Generator](#2-question-bank-generator)
|
||||
- [Hiring Calibrator](#3-hiring-calibrator)
|
||||
- [Interview System Workflows](#interview-system-workflows)
|
||||
- [Role-Specific Loop Design](#role-specific-loop-design)
|
||||
- [Competency Matrix Development](#competency-matrix-development)
|
||||
- [Question Bank Creation](#question-bank-creation)
|
||||
- [Bias Mitigation Framework](#bias-mitigation-framework)
|
||||
- [Hiring Bar Calibration](#hiring-bar-calibration)
|
||||
- [Competency Frameworks](#competency-frameworks)
|
||||
- [Scoring & Calibration](#scoring--calibration)
|
||||
- [Reference Documentation](#reference-documentation)
|
||||
- [Industry Standards](#industry-standards)
|
||||
|
||||
---
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# Design a complete interview loop for a senior software engineer role
|
||||
python loop_designer.py --role "Senior Software Engineer" --level senior --team platform --output loops/
|
||||
|
||||
# Generate a comprehensive question bank for a product manager position
|
||||
python question_bank_generator.py --role "Product Manager" --level senior --competencies leadership,strategy,analytics --output questions/
|
||||
|
||||
# Analyze interview calibration across multiple candidates and interviewers
|
||||
python hiring_calibrator.py --input interview_data.json --output calibration_report.json --analysis-type full
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Tools Overview
|
||||
|
||||
### 1. Interview Loop Designer
|
||||
|
||||
Generates calibrated interview loops tailored to specific roles, levels, and teams.
|
||||
|
||||
**Input:** Role definition (title, level, team, competency requirements)
|
||||
**Output:** Complete interview loop with rounds, focus areas, time allocation, scorecard templates
|
||||
|
||||
**Key Features:**
|
||||
- Role-specific competency mapping
|
||||
- Level-appropriate question difficulty
|
||||
- Interviewer skill requirements
|
||||
- Time-optimized scheduling
|
||||
- Standardized scorecards
|
||||
|
||||
**Usage:**
|
||||
```bash
|
||||
# Design loop for a specific role
|
||||
python loop_designer.py --role "Staff Data Scientist" --level staff --team ml-platform
|
||||
|
||||
# Generate loop with specific focus areas
|
||||
python loop_designer.py --role "Engineering Manager" --level senior --competencies leadership,technical,strategy
|
||||
|
||||
# Create loop for multiple levels
|
||||
python loop_designer.py --role "Backend Engineer" --levels junior,mid,senior --output loops/backend/
|
||||
```
|
||||
|
||||
### 2. Question Bank Generator
|
||||
|
||||
Creates comprehensive, competency-based interview questions with detailed scoring criteria.
|
||||
|
||||
**Input:** Role requirements, competency areas, experience level
|
||||
**Output:** Structured question bank with scoring rubrics, follow-up probes, and calibration examples
|
||||
|
||||
**Key Features:**
|
||||
- Competency-based question organization
|
||||
- Level-appropriate difficulty progression
|
||||
- Behavioral and technical question types
|
||||
- Anti-bias question design
|
||||
- Calibration examples (poor/good/great answers)
|
||||
|
||||
**Usage:**
|
||||
```bash
|
||||
# Generate questions for technical competencies
|
||||
python question_bank_generator.py --role "Frontend Engineer" --competencies react,typescript,system-design
|
||||
|
||||
# Create behavioral question bank
|
||||
python question_bank_generator.py --role "Product Manager" --question-types behavioral,leadership --output pm_questions/
|
||||
|
||||
# Generate questions for all levels
|
||||
python question_bank_generator.py --role "DevOps Engineer" --levels junior,mid,senior,staff
|
||||
```
|
||||
|
||||
### 3. Hiring Calibrator
|
||||
|
||||
Analyzes interview scores to detect bias and calibration issues, and recommends improvements.
|
||||
|
||||
**Input:** Interview results data (candidate scores, interviewer feedback, demographics)
|
||||
**Output:** Calibration analysis, bias detection report, interviewer coaching recommendations
|
||||
|
||||
**Key Features:**
|
||||
- Statistical bias detection
|
||||
- Interviewer calibration analysis
|
||||
- Score distribution analysis
|
||||
- Recommendation engine
|
||||
- Trend tracking over time
|
||||
|
||||
**Usage:**
|
||||
```bash
|
||||
# Analyze calibration across all interviews
|
||||
python hiring_calibrator.py --input interview_results.json --analysis-type comprehensive
|
||||
|
||||
# Focus on specific competency areas
|
||||
python hiring_calibrator.py --input data.json --competencies technical,leadership --output bias_report.json
|
||||
|
||||
# Track calibration trends over time
|
||||
python hiring_calibrator.py --input historical_data.json --trend-analysis --period quarterly
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Interview System Workflows
|
||||
|
||||
### Role-Specific Loop Design
|
||||
|
||||
#### Software Engineering Roles
|
||||
|
||||
**Junior/Mid Software Engineer (2-4 years)**
|
||||
- **Duration:** 3-4 hours across 3-4 rounds
|
||||
- **Focus Areas:** Coding fundamentals, debugging, system understanding, growth mindset
|
||||
- **Rounds:**
|
||||
1. Technical Phone Screen (45min) - Coding fundamentals, algorithms
|
||||
2. Coding Deep Dive (60min) - Problem-solving, code quality, testing
|
||||
3. System Design Basics (45min) - Component interaction, basic scalability
|
||||
4. Behavioral & Values (30min) - Team collaboration, learning agility
|
||||
|
||||
**Senior Software Engineer (5-8 years)**
|
||||
- **Duration:** 4-5 hours across 4-5 rounds
|
||||
- **Focus Areas:** System design, technical leadership, mentoring capability, domain expertise
|
||||
- **Rounds:**
|
||||
1. Technical Phone Screen (45min) - Advanced algorithms, optimization
|
||||
2. System Design (60min) - Scalability, trade-offs, architectural decisions
|
||||
3. Coding Excellence (60min) - Code quality, testing strategies, refactoring
|
||||
4. Technical Leadership (45min) - Mentoring, technical decisions, cross-team collaboration
|
||||
5. Behavioral & Culture (30min) - Leadership examples, conflict resolution
|
||||
|
||||
**Staff+ Engineer (8+ years)**
|
||||
- **Duration:** 5-6 hours across 5-6 rounds
|
||||
- **Focus Areas:** Architectural vision, organizational impact, technical strategy, cross-functional leadership
|
||||
- **Rounds:**
|
||||
1. Technical Phone Screen (45min) - System architecture, complex problem-solving
|
||||
2. Architecture Design (90min) - Large-scale systems, technology choices, evolution patterns
|
||||
3. Technical Strategy (60min) - Technical roadmaps, technology adoption, risk assessment
|
||||
4. Leadership & Influence (60min) - Cross-team impact, technical vision, stakeholder management
|
||||
5. Coding & Best Practices (45min) - Code quality standards, development processes
|
||||
6. Cultural & Strategic Fit (30min) - Company values, strategic thinking
|
||||
|
||||
#### Product Management Roles
|
||||
|
||||
**Product Manager (3-6 years)**
|
||||
- **Duration:** 3-4 hours across 4 rounds
|
||||
- **Focus Areas:** Product sense, analytical thinking, stakeholder management, execution
|
||||
- **Rounds:**
|
||||
1. Product Sense (60min) - Feature prioritization, user empathy, market understanding
|
||||
2. Analytical Thinking (45min) - Data interpretation, metrics design, experimentation
|
||||
3. Execution & Process (45min) - Project management, cross-functional collaboration
|
||||
4. Behavioral & Leadership (30min) - Stakeholder management, conflict resolution
|
||||
|
||||
**Senior Product Manager (6-10 years)**
|
||||
- **Duration:** 4-5 hours across 4-5 rounds
|
||||
- **Focus Areas:** Product strategy, team leadership, business impact, market analysis
|
||||
- **Rounds:**
|
||||
1. Product Strategy (75min) - Market analysis, competitive positioning, roadmap planning
|
||||
2. Leadership & Influence (60min) - Team building, stakeholder management, decision-making
|
||||
3. Data & Analytics (45min) - Advanced metrics, experimentation design, business intelligence
|
||||
4. Technical Collaboration (45min) - Technical trade-offs, engineering partnership
|
||||
5. Case Study Presentation (45min) - Past impact, lessons learned, strategic thinking
|
||||
|
||||
#### Design Roles
|
||||
|
||||
**UX Designer (2-5 years)**
|
||||
- **Duration:** 3-4 hours across 3-4 rounds
|
||||
- **Focus Areas:** Design process, user research, visual design, collaboration
|
||||
- **Rounds:**
|
||||
1. Portfolio Review (60min) - Design process, problem-solving approach, visual skills
|
||||
2. Design Challenge (90min) - User-centered design, wireframing, iteration
|
||||
3. Collaboration & Process (45min) - Cross-functional work, feedback incorporation
|
||||
4. Behavioral & Values (30min) - User advocacy, creative problem-solving
|
||||
|
||||
**Senior UX Designer (5+ years)**
|
||||
- **Duration:** 4-5 hours across 4-5 rounds
|
||||
- **Focus Areas:** Design leadership, system thinking, research methodology, business impact
|
||||
- **Rounds:**
|
||||
1. Portfolio Deep Dive (75min) - Design impact, methodology, leadership examples
|
||||
2. Design System Challenge (90min) - Systems thinking, scalability, consistency
|
||||
3. Research & Strategy (60min) - User research methods, data-driven design decisions
|
||||
4. Leadership & Mentoring (45min) - Design team leadership, process improvement
|
||||
5. Business & Strategy (30min) - Design's business impact, stakeholder management
|
||||
|
||||
### Competency Matrix Development

#### Technical Competencies

**Software Engineering**
- **Coding Proficiency:** Algorithm design, data structures, language expertise
- **System Design:** Architecture patterns, scalability, performance optimization
- **Testing & Quality:** Unit testing, integration testing, code review practices
- **DevOps & Tools:** CI/CD, monitoring, debugging, development workflows

**Data Science & Analytics**
- **Statistical Analysis:** Statistical methods, hypothesis testing, experimental design
- **Machine Learning:** Algorithm selection, model evaluation, feature engineering
- **Data Engineering:** ETL processes, data pipeline design, data quality
- **Business Intelligence:** Metrics design, dashboard creation, stakeholder communication

**Product Management**
- **Product Strategy:** Market analysis, competitive research, roadmap planning
- **User Research:** User interviews, usability testing, persona development
- **Data Analysis:** Metrics interpretation, A/B testing, cohort analysis
- **Technical Understanding:** API design, database concepts, system architecture

#### Behavioral Competencies

**Leadership & Influence**
- **Team Building:** Hiring, onboarding, team culture development
- **Mentoring & Coaching:** Skill development, career guidance, feedback delivery
- **Strategic Thinking:** Long-term planning, vision setting, decision-making frameworks
- **Change Management:** Process improvement, organizational change, resistance handling

**Communication & Collaboration**
- **Stakeholder Management:** Expectation setting, conflict resolution, alignment building
- **Cross-Functional Partnership:** Engineering-Product-Design collaboration
- **Presentation Skills:** Technical communication, executive briefings, documentation
- **Active Listening:** Empathy, question asking, perspective taking

**Problem-Solving & Innovation**
- **Analytical Thinking:** Problem decomposition, root cause analysis, hypothesis formation
- **Creative Problem-Solving:** Alternative solution generation, constraint navigation
- **Learning Agility:** Skill acquisition, adaptation to change, knowledge transfer
- **Risk Assessment:** Uncertainty navigation, trade-off analysis, mitigation planning

### Question Bank Creation

#### Technical Questions by Level

**Junior Level Questions**
- **Coding:** "Implement a function to find the second largest element in an array"
- **System Design:** "How would you design a simple URL shortener for 1000 users?"
- **Debugging:** "Walk through how you would debug a slow-loading web page"

**Senior Level Questions**
- **Architecture:** "Design a real-time chat system supporting 1M concurrent users"
- **Leadership:** "Describe how you would onboard a new team member in your area"
- **Trade-offs:** "Compare microservices vs monolith for a rapidly scaling startup"

**Staff+ Level Questions**
- **Strategy:** "How would you evaluate and introduce a new programming language to the organization?"
- **Influence:** "Describe a time you drove technical consensus across multiple teams"
- **Vision:** "How do you balance technical debt against feature development?"

#### Behavioral Questions Framework

**STAR Method Implementation**
- **Situation:** Context and background of the scenario
- **Task:** Specific challenge or goal that needed to be addressed
- **Action:** Concrete steps taken to address the challenge
- **Result:** Measurable outcomes and lessons learned

**Sample Questions:**
- "Tell me about a time you had to influence a decision without formal authority"
- "Describe a situation where you had to deliver difficult feedback to a colleague"
- "Give an example of when you had to adapt your communication style for different audiences"
- "Walk me through a time when you had to make a decision with incomplete information"

### Bias Mitigation Framework

#### Structural Bias Prevention

**Interview Panel Composition**
- Diverse interviewer panels (gender, ethnicity, experience level)
- Rotating panel assignments to prevent pattern bias
- Anonymized resume screening for initial phone screens
- Standardized question sets to ensure consistency

**Process Standardization**
- Structured interview guides with required probing questions
- Consistent time allocation across all candidates
- Standardized evaluation criteria and scoring rubrics
- Required justification for all scoring decisions

#### Cognitive Bias Recognition

**Common Interview Biases**
- **Halo Effect:** One strong impression influences the overall assessment
- **Confirmation Bias:** Seeking information that confirms initial impressions
- **Similarity Bias:** Favoring candidates with similar backgrounds/experiences
- **Contrast Effect:** Comparing candidates against each other rather than against the standard
- **Anchoring Bias:** Over-relying on the first piece of information received

**Mitigation Strategies**
- Pre-interview bias awareness training for all interviewers
- Structured debrief sessions with independent score recording
- Regular calibration sessions with example candidate discussions
- Statistical monitoring of scoring patterns by interviewer and demographic
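
The statistical monitoring item could be approached in many ways. As a minimal sketch, assuming scores are available as `(interviewer, group, score)` tuples (a hypothetical record layout, not something this skill defines), the following computes, per interviewer, the spread between their highest and lowest group mean score. A large spread is only a prompt for a closer look during calibration, not evidence of bias on its own.

```python
from collections import defaultdict
from statistics import mean


def score_gaps_by_interviewer(records):
    """Return, per interviewer, the gap between the highest and lowest
    group mean score.

    `records` is an iterable of (interviewer, group, score) tuples.
    The tuple layout and group labels are assumptions for illustration.
    """
    scores = defaultdict(list)
    for interviewer, group, score in records:
        scores[(interviewer, group)].append(score)

    # Mean score each interviewer gave to each group.
    group_means = defaultdict(dict)
    for (interviewer, group), values in scores.items():
        group_means[interviewer][group] = mean(values)

    return {
        interviewer: max(means.values()) - min(means.values())
        for interviewer, means in group_means.items()
    }


records = [
    ("alice", "group_a", 3), ("alice", "group_a", 4),
    ("alice", "group_b", 3), ("alice", "group_b", 4),
    ("bob", "group_a", 4), ("bob", "group_a", 4),
    ("bob", "group_b", 2), ("bob", "group_b", 2),
]
gaps = score_gaps_by_interviewer(records)
```

Here `alice` scores both groups identically on average, while `bob` shows a two-point gap that would merit review.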

### Hiring Bar Calibration

#### Calibration Methodology

**Regular Calibration Sessions**
- Monthly interviewer calibration meetings
- Shadow interviewing for new interviewers (minimum 5 sessions)
- Quarterly cross-team calibration reviews
- Annual hiring bar review and adjustment process

**Performance Tracking**
- New hire performance correlation with interview scores
- Interviewer accuracy tracking (prediction vs actual performance)
- False positive/negative analysis
- Offer acceptance rate analysis by interviewer
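
To make the false positive/negative item concrete, here is a rough sketch. It assumes each record pairs an interview score with a later "performed well" flag, and that a score of 3 or higher counts as a pass; both the record layout and the `pass_score` threshold are illustrative assumptions. In practice false negatives are hard to observe, because rejected candidates rarely have performance data.

```python
def hiring_error_counts(outcomes, pass_score=3.0):
    """Count interview decisions that disagreed with later performance.

    `outcomes` is a list of (interview_score, performed_well) pairs.
    A false positive passed the interview but underperformed;
    a false negative failed the interview but performed well.
    """
    false_positives = sum(
        1 for score, ok in outcomes if score >= pass_score and not ok
    )
    false_negatives = sum(
        1 for score, ok in outcomes if score < pass_score and ok
    )
    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "total": len(outcomes),
    }


outcomes = [(3.5, True), (3.0, False), (2.0, True), (1.5, False), (4.0, True)]
errors = hiring_error_counts(outcomes)
```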

**Feedback Loops**
- Six-month new hire performance reviews
- Manager feedback on interview process effectiveness
- Candidate experience surveys and feedback integration
- Continuous process improvement based on data analysis

---

## Competency Frameworks

### Engineering Competency Levels

#### Level 1-2: Individual Contributor (Junior/Mid)
- **Technical Skills:** Language proficiency, testing basics, code review participation
- **Problem Solving:** Structured approach to debugging, logical thinking
- **Communication:** Clear status updates, effective question asking
- **Learning:** Proactive skill development, mentorship seeking

#### Level 3-4: Senior Individual Contributor
- **Technical Leadership:** Architecture decisions, code quality advocacy
- **Mentoring:** Junior developer guidance, knowledge sharing
- **Project Ownership:** End-to-end feature delivery, stakeholder communication
- **Innovation:** Process improvement, technology evaluation

#### Level 5-6: Staff+ Engineer
- **Organizational Impact:** Cross-team technical leadership, strategic planning
- **Technical Vision:** Long-term architectural planning, technology roadmap
- **People Development:** Team growth, hiring contribution, culture building
- **External Influence:** Industry contribution, thought leadership

### Product Management Competency Levels

#### Level 1-2: Associate/Product Manager
- **Product Execution:** Feature specification, requirements gathering
- **User Focus:** User research participation, feedback collection
- **Data Analysis:** Basic metrics analysis, experiment interpretation
- **Stakeholder Management:** Cross-functional collaboration, communication

#### Level 3-4: Senior Product Manager
- **Strategic Thinking:** Market analysis, competitive positioning
- **Leadership:** Cross-functional team leadership, decision making
- **Business Impact:** Revenue impact, market share growth
- **Process Innovation:** Product development process improvement

#### Level 5-6: Principal Product Manager
- **Vision Setting:** Product strategy, market direction
- **Organizational Influence:** Executive communication, team building
- **Innovation Leadership:** New market creation, disruptive thinking
- **Talent Development:** PM team growth, hiring leadership

---

## Scoring & Calibration

### Scoring Rubric Framework

#### 4-Point Scoring Scale
- **4 - Exceeds Expectations:** Demonstrates mastery beyond required level
- **3 - Meets Expectations:** Solid performance meeting all requirements
- **2 - Partially Meets:** Shows potential but has development areas
- **1 - Does Not Meet:** Significant gaps in required competencies
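
One way to encode this scale in code is as an integer enum, so scores stay readable in scorecards but can still be averaged. The sketch below is illustrative only; the `Score` class and `average_score` helper are hypothetical names and are not taken from the scripts in this skill.

```python
from enum import IntEnum


class Score(IntEnum):
    """The 4-point scale above, as integer-valued labels."""
    DOES_NOT_MEET = 1
    PARTIALLY_MEETS = 2
    MEETS_EXPECTATIONS = 3
    EXCEEDS_EXPECTATIONS = 4


def average_score(scores):
    """Average a non-empty list of Score values (or plain ints 1-4)."""
    return sum(scores) / len(scores)


panel = [
    Score.MEETS_EXPECTATIONS,
    Score.EXCEEDS_EXPECTATIONS,
    Score.PARTIALLY_MEETS,
    Score.MEETS_EXPECTATIONS,
]
panel_average = average_score(panel)  # (3 + 4 + 2 + 3) / 4
```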

#### Competency-Specific Scoring

**Technical Competencies**
- Code Quality (4): Clean, maintainable, well-tested code with excellent documentation
- Code Quality (3): Functional code with good structure and basic testing
- Code Quality (2): Working code with some structural issues or missing tests
- Code Quality (1): Non-functional or poorly structured code with significant issues

**Leadership Competencies**
- Team Influence (4): Drives team success, develops others, creates lasting positive change
- Team Influence (3): Contributes positively to team dynamics and outcomes
- Team Influence (2): Shows leadership potential with some effective examples
- Team Influence (1): Limited evidence of leadership ability or negative team impact

### Calibration Standards

#### Statistical Benchmarks
- Target score distribution: 20% (4s), 40% (3s), 30% (2s), 10% (1s)
- Interviewer consistency target: within 0.5 standard deviations of the team average
- Pass rate target: 15-25% for most roles (varies by level and market conditions)
- Time to hire target: 2-3 weeks from first interview to offer
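
The first two benchmarks can be checked mechanically. The sketch below is one possible reading, not a prescribed method: it compares observed score shares with the target distribution, and flags interviewers whose mean score sits more than 0.5 team standard deviations away from the team mean. The function names, the use of the population standard deviation, and the input shapes are all assumptions, and inputs are assumed to be non-empty.

```python
from collections import Counter
from statistics import mean, pstdev

# Target shares copied from the benchmark list above.
TARGET_DISTRIBUTION = {4: 0.20, 3: 0.40, 2: 0.30, 1: 0.10}


def distribution_drift(scores):
    """Return observed share minus target share for each score level."""
    counts = Counter(scores)
    total = len(scores)
    return {
        level: counts.get(level, 0) / total - target
        for level, target in TARGET_DISTRIBUTION.items()
    }


def inconsistent_interviewers(scores_by_interviewer, max_deviation=0.5):
    """List interviewers whose mean score is more than `max_deviation`
    team standard deviations away from the team mean."""
    all_scores = [s for scores in scores_by_interviewer.values() for s in scores]
    team_mean = mean(all_scores)
    team_sd = pstdev(all_scores)
    if team_sd == 0:
        return []  # Everyone gave identical scores; nothing to flag.
    return sorted(
        name
        for name, scores in scores_by_interviewer.items()
        if abs(mean(scores) - team_mean) / team_sd > max_deviation
    )


drift = distribution_drift([4, 4, 3, 3, 3, 3, 2, 2, 2, 1])
flagged = inconsistent_interviewers({
    "ana": [3, 2, 3, 2],
    "ben": [3, 2, 3, 2],
    "cho": [2, 3, 2, 3],
    "dev": [3, 2, 2, 3],
    "erin": [4, 4, 3, 4],
})
```

In this sample the score list matches the target distribution exactly, and only `erin`, who scores about a point higher than the rest of the team, is flagged.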

#### Quality Metrics
- New hire 6-month performance correlation: >0.6 with interview scores
- Interviewer agreement rate: >80% within 1 point on final recommendations
- Candidate experience satisfaction: >4.0/5.0 average rating
- Offer acceptance rate: >85% for preferred candidates
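
The first two quality metrics are simple to compute once the data exists. As a hedged sketch, the code below uses a hand-written Pearson correlation for the score-to-performance link, and treats "agreement" as the share of interviewer pairs on the same candidate whose scores differ by at most one point. That pairwise definition is an assumption; the document does not specify how agreement is measured.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def agreement_rate(panel_scores, tolerance=1):
    """Share of interviewer pairs (per candidate) within `tolerance` points.

    `panel_scores` holds one list of scores per candidate.
    """
    pairs = [
        (a, b)
        for scores in panel_scores
        for a, b in combinations(scores, 2)
    ]
    agreeing = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return agreeing / len(pairs)


interview_scores = [2.0, 3.0, 3.5, 4.0]
six_month_ratings = [1.0, 2.0, 2.5, 3.0]  # made-up sample data
r = pearson(interview_scores, six_month_ratings)

rate = agreement_rate([[3, 3, 4], [2, 4, 3], [1, 1, 2]])
```

The results can then be compared against the `>0.6` and `>80%` targets listed above.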

---

## Reference Documentation

### Interview Templates
- Role-specific interview guides and question banks
- Scorecard templates for consistent evaluation
- Debrief facilitation guides for effective team discussions

### Bias Mitigation Resources
- Unconscious bias training materials and exercises
- Structured interviewing best practices checklist
- Demographic diversity tracking and reporting templates

### Calibration Tools
- Interview performance correlation analysis templates
- Interviewer coaching and development frameworks
- Hiring pipeline metrics and dashboard specifications

---

## Industry Standards

### Best Practices Integration
- Google's structured interviewing methodology
- Amazon's Leadership Principles assessment framework
- Microsoft's competency-based evaluation system
- Netflix's culture fit assessment approach

### Compliance & Legal Considerations
- EEOC compliance requirements and documentation
- ADA accommodation procedures and guidelines
- International hiring law considerations
- Privacy and data protection requirements (GDPR, CCPA)

### Continuous Improvement Framework
- Regular process auditing and refinement cycles
- Industry benchmarking and comparative analysis
- Technology integration for interview optimization
- Candidate experience enhancement initiatives

This interview system design framework provides the structure and tools needed to build fair, effective, and scalable hiring processes that consistently identify top talent while minimizing bias and improving the candidate experience.
908
engineering/interview-system-designer/loop_designer.py
Normal file
@@ -0,0 +1,908 @@
#!/usr/bin/env python3
"""
Interview Loop Designer

Generates calibrated interview loops tailored to specific roles, levels, and teams.
Creates complete interview loops with rounds, focus areas, time allocation,
interviewer skill requirements, and scorecard templates.

Usage:
    python loop_designer.py --role "Senior Software Engineer" --level senior --team platform
    python loop_designer.py --role "Product Manager" --level mid --competencies leadership,strategy
    python loop_designer.py --input role_definition.json --output loops/
"""

import os
import sys
import json
import argparse
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Any, Tuple
from collections import defaultdict


class InterviewLoopDesigner:
    """Designs comprehensive interview loops based on role requirements."""

    def __init__(self):
        self.competency_frameworks = self._init_competency_frameworks()
        self.role_templates = self._init_role_templates()
        self.interviewer_skills = self._init_interviewer_skills()

    def _init_competency_frameworks(self) -> Dict[str, Dict]:
        """Initialize competency frameworks for different roles."""
        return {
            "software_engineer": {
                "junior": {
                    "required": ["coding_fundamentals", "debugging", "testing_basics", "version_control"],
                    "preferred": ["system_understanding", "code_review", "collaboration"],
                    "focus_areas": ["technical_execution", "learning_agility", "team_collaboration"]
                },
                "mid": {
                    "required": ["advanced_coding", "system_design_basics", "testing_strategy", "debugging_complex"],
                    "preferred": ["mentoring_basics", "technical_communication", "project_ownership"],
                    "focus_areas": ["technical_depth", "system_thinking", "ownership"]
                },
                "senior": {
                    "required": ["system_architecture", "technical_leadership", "mentoring", "cross_team_collab"],
                    "preferred": ["technology_evaluation", "process_improvement", "hiring_contribution"],
                    "focus_areas": ["technical_leadership", "system_architecture", "people_development"]
                },
                "staff": {
                    "required": ["architectural_vision", "organizational_impact", "technical_strategy", "team_building"],
                    "preferred": ["industry_influence", "innovation_leadership", "executive_communication"],
                    "focus_areas": ["organizational_impact", "technical_vision", "strategic_influence"]
                },
                "principal": {
                    "required": ["company_wide_impact", "technical_vision", "talent_development", "strategic_planning"],
                    "preferred": ["industry_leadership", "board_communication", "market_influence"],
                    "focus_areas": ["strategic_leadership", "organizational_transformation", "external_influence"]
                }
            },
            "product_manager": {
                "junior": {
                    "required": ["product_execution", "user_research", "data_analysis", "stakeholder_comm"],
                    "preferred": ["market_awareness", "technical_understanding", "project_management"],
                    "focus_areas": ["execution_excellence", "user_focus", "analytical_thinking"]
                },
                "mid": {
                    "required": ["product_strategy", "cross_functional_leadership", "metrics_design", "market_analysis"],
                    "preferred": ["team_building", "technical_collaboration", "competitive_analysis"],
                    "focus_areas": ["strategic_thinking", "leadership", "business_impact"]
                },
                "senior": {
                    "required": ["business_strategy", "team_leadership", "p&l_ownership", "market_positioning"],
                    "preferred": ["hiring_leadership", "board_communication", "partnership_development"],
                    "focus_areas": ["business_leadership", "market_strategy", "organizational_impact"]
                },
                "staff": {
                    "required": ["portfolio_management", "organizational_leadership", "strategic_planning", "market_creation"],
                    "preferred": ["executive_presence", "investor_relations", "acquisition_strategy"],
                    "focus_areas": ["strategic_leadership", "market_innovation", "organizational_transformation"]
                }
            },
            "designer": {
                "junior": {
                    "required": ["design_fundamentals", "user_research", "prototyping", "design_tools"],
                    "preferred": ["user_empathy", "visual_design", "collaboration"],
                    "focus_areas": ["design_execution", "user_research", "creative_problem_solving"]
                },
                "mid": {
                    "required": ["design_systems", "user_testing", "cross_functional_collab", "design_strategy"],
                    "preferred": ["mentoring", "process_improvement", "business_understanding"],
                    "focus_areas": ["design_leadership", "system_thinking", "business_impact"]
                },
                "senior": {
                    "required": ["design_leadership", "team_building", "strategic_design", "stakeholder_management"],
                    "preferred": ["design_culture", "hiring_leadership", "executive_communication"],
                    "focus_areas": ["design_strategy", "team_leadership", "organizational_impact"]
                }
            },
            "data_scientist": {
                "junior": {
                    "required": ["statistical_analysis", "python_r", "data_visualization", "sql"],
                    "preferred": ["machine_learning", "business_understanding", "communication"],
                    "focus_areas": ["analytical_skills", "technical_execution", "business_impact"]
                },
                "mid": {
                    "required": ["advanced_ml", "experiment_design", "data_engineering", "stakeholder_comm"],
                    "preferred": ["mentoring", "project_leadership", "product_collaboration"],
                    "focus_areas": ["advanced_analytics", "project_leadership", "cross_functional_impact"]
                },
                "senior": {
                    "required": ["data_strategy", "team_leadership", "ml_systems", "business_strategy"],
                    "preferred": ["hiring_leadership", "executive_communication", "technology_evaluation"],
                    "focus_areas": ["strategic_leadership", "technical_vision", "organizational_impact"]
                }
            },
            "devops_engineer": {
                "junior": {
                    "required": ["infrastructure_basics", "scripting", "monitoring", "troubleshooting"],
                    "preferred": ["automation", "cloud_platforms", "security_awareness"],
                    "focus_areas": ["operational_excellence", "automation_mindset", "problem_solving"]
                },
                "mid": {
                    "required": ["ci_cd_design", "infrastructure_as_code", "security_implementation", "performance_optimization"],
                    "preferred": ["team_collaboration", "incident_management", "capacity_planning"],
                    "focus_areas": ["system_reliability", "automation_leadership", "cross_team_collaboration"]
                },
                "senior": {
                    "required": ["platform_architecture", "team_leadership", "security_strategy", "organizational_impact"],
                    "preferred": ["hiring_contribution", "technology_evaluation", "executive_communication"],
                    "focus_areas": ["platform_leadership", "strategic_thinking", "organizational_transformation"]
                }
            },
            "engineering_manager": {
                "junior": {
                    "required": ["team_leadership", "technical_background", "people_management", "project_coordination"],
                    "preferred": ["hiring_experience", "performance_management", "technical_mentoring"],
                    "focus_areas": ["people_leadership", "team_building", "execution_excellence"]
                },
                "senior": {
                    "required": ["organizational_leadership", "strategic_planning", "talent_development", "cross_functional_leadership"],
                    "preferred": ["technical_vision", "culture_building", "executive_communication"],
                    "focus_areas": ["organizational_impact", "strategic_leadership", "talent_development"]
                },
                "staff": {
                    "required": ["multi_team_leadership", "organizational_strategy", "executive_presence", "cultural_transformation"],
                    "preferred": ["board_communication", "market_understanding", "acquisition_integration"],
                    "focus_areas": ["organizational_transformation", "strategic_leadership", "cultural_evolution"]
                }
            }
        }

    def _init_role_templates(self) -> Dict[str, Dict]:
        """Initialize role-specific interview templates."""
        return {
            "software_engineer": {
                "core_rounds": ["technical_phone_screen", "coding_deep_dive", "system_design", "behavioral"],
                "optional_rounds": ["technical_leadership", "domain_expertise", "culture_fit"],
                "total_duration_range": (180, 360),  # 3-6 hours
                "required_competencies": ["coding", "problem_solving", "communication"]
            },
            "product_manager": {
                "core_rounds": ["product_sense", "analytical_thinking", "execution_process", "behavioral"],
                "optional_rounds": ["strategic_thinking", "technical_collaboration", "leadership"],
                "total_duration_range": (180, 300),  # 3-5 hours
                "required_competencies": ["product_strategy", "analytical_thinking", "stakeholder_management"]
            },
            "designer": {
                "core_rounds": ["portfolio_review", "design_challenge", "collaboration_process", "behavioral"],
                "optional_rounds": ["design_system_thinking", "research_methodology", "leadership"],
                "total_duration_range": (180, 300),  # 3-5 hours
                "required_competencies": ["design_process", "user_empathy", "visual_communication"]
            },
            "data_scientist": {
                "core_rounds": ["technical_assessment", "case_study", "statistical_thinking", "behavioral"],
                "optional_rounds": ["ml_systems", "business_strategy", "technical_leadership"],
                "total_duration_range": (210, 330),  # 3.5-5.5 hours
                "required_competencies": ["statistical_analysis", "programming", "business_acumen"]
            },
            "devops_engineer": {
                "core_rounds": ["technical_assessment", "system_design", "troubleshooting", "behavioral"],
                "optional_rounds": ["security_assessment", "automation_design", "leadership"],
                "total_duration_range": (180, 300),  # 3-5 hours
                "required_competencies": ["infrastructure", "automation", "problem_solving"]
            },
            "engineering_manager": {
                "core_rounds": ["leadership_assessment", "technical_background", "people_management", "behavioral"],
                "optional_rounds": ["strategic_thinking", "hiring_assessment", "culture_building"],
                "total_duration_range": (240, 360),  # 4-6 hours
                "required_competencies": ["people_leadership", "technical_understanding", "strategic_thinking"]
            }
        }

    def _init_interviewer_skills(self) -> Dict[str, Dict]:
        """Initialize interviewer skill requirements for different round types."""
        return {
            "technical_phone_screen": {
                "required_skills": ["technical_assessment", "coding_evaluation"],
                "preferred_experience": ["same_domain", "senior_level"],
                "calibration_level": "standard"
            },
            "coding_deep_dive": {
                "required_skills": ["advanced_technical", "code_quality_assessment"],
                "preferred_experience": ["senior_engineer", "system_design"],
                "calibration_level": "high"
            },
            "system_design": {
                "required_skills": ["architecture_design", "scalability_assessment"],
                "preferred_experience": ["senior_architect", "large_scale_systems"],
                "calibration_level": "high"
            },
            "behavioral": {
                "required_skills": ["behavioral_interviewing", "competency_assessment"],
                "preferred_experience": ["hiring_manager", "people_leadership"],
                "calibration_level": "standard"
            },
            "technical_leadership": {
                "required_skills": ["leadership_assessment", "technical_mentoring"],
                "preferred_experience": ["engineering_manager", "tech_lead"],
                "calibration_level": "high"
            },
            "product_sense": {
                "required_skills": ["product_evaluation", "market_analysis"],
                "preferred_experience": ["product_manager", "product_leadership"],
                "calibration_level": "high"
            },
            "analytical_thinking": {
                "required_skills": ["data_analysis", "metrics_evaluation"],
                "preferred_experience": ["data_analyst", "product_manager"],
                "calibration_level": "standard"
            },
            "design_challenge": {
                "required_skills": ["design_evaluation", "user_experience"],
                "preferred_experience": ["senior_designer", "design_manager"],
                "calibration_level": "high"
            }
        }

    def generate_interview_loop(self, role: str, level: str, team: Optional[str] = None,
                                competencies: Optional[List[str]] = None) -> Dict[str, Any]:
        """Generate a complete interview loop for the specified role and level."""

        # Normalize inputs
        role_key = role.lower().replace(" ", "_").replace("-", "_")
        level_key = level.lower()

        # Get role template and competency requirements
        if role_key not in self.competency_frameworks:
            role_key = self._find_closest_role(role_key)

        if level_key not in self.competency_frameworks[role_key]:
            level_key = self._find_closest_level(role_key, level_key)

        competency_req = self.competency_frameworks[role_key][level_key]
        role_template = self.role_templates.get(role_key, self.role_templates["software_engineer"])

        # Design the interview loop
        rounds = self._design_rounds(role_key, level_key, competency_req, role_template, competencies)
        schedule = self._create_schedule(rounds)
        scorecard = self._generate_scorecard(role_key, level_key, competency_req)
        interviewer_requirements = self._define_interviewer_requirements(rounds)

        return {
            "role": role,
            "level": level,
            "team": team,
            "generated_at": datetime.now().isoformat(),
            "total_duration_minutes": sum(round_info["duration_minutes"] for round_info in rounds.values()),
            "total_rounds": len(rounds),
            "rounds": rounds,
            "suggested_schedule": schedule,
            "scorecard_template": scorecard,
            "interviewer_requirements": interviewer_requirements,
            "competency_framework": competency_req,
            "calibration_notes": self._generate_calibration_notes(role_key, level_key)
        }

    def _find_closest_role(self, role_key: str) -> str:
        """Find the closest matching role template."""
        role_mappings = {
            "engineer": "software_engineer",
            "developer": "software_engineer",
            "swe": "software_engineer",
            "backend": "software_engineer",
            "frontend": "software_engineer",
            "fullstack": "software_engineer",
            "pm": "product_manager",
            "product": "product_manager",
            "ux": "designer",
            "ui": "designer",
            "graphic": "designer",
            "data": "data_scientist",
            "analyst": "data_scientist",
            "ml": "data_scientist",
            "ops": "devops_engineer",
            "sre": "devops_engineer",
            "infrastructure": "devops_engineer",
            "manager": "engineering_manager",
            "lead": "engineering_manager"
        }

        for key_part in role_key.split("_"):
            if key_part in role_mappings:
                return role_mappings[key_part]

        return "software_engineer"  # Default fallback

    def _find_closest_level(self, role_key: str, level_key: str) -> str:
        """Find the closest matching level for the role."""
        available_levels = list(self.competency_frameworks[role_key].keys())

        level_mappings = {
            "entry": "junior",
            "associate": "junior",
            "jr": "junior",
            "mid": "mid",
            "middle": "mid",
            "sr": "senior",
            "senior": "senior",
            "staff": "staff",
            "principal": "principal",
            "lead": "senior",
            "manager": "senior"
        }

        mapped_level = level_mappings.get(level_key, level_key)

        if mapped_level in available_levels:
            return mapped_level
        elif "senior" in available_levels:
            return "senior"
        else:
            return available_levels[0]

    def _design_rounds(self, role_key: str, level_key: str, competency_req: Dict,
                       role_template: Dict, custom_competencies: Optional[List[str]]) -> Dict[str, Dict]:
        """Design the specific interview rounds based on role and level."""
        rounds = {}

        # Determine which rounds to include
        core_rounds = role_template["core_rounds"].copy()
        optional_rounds = role_template["optional_rounds"].copy()

        # Add optional rounds based on level
        if level_key in ["senior", "staff", "principal"]:
            if "technical_leadership" in optional_rounds and role_key in ["software_engineer", "engineering_manager"]:
                core_rounds.append("technical_leadership")
            if "strategic_thinking" in optional_rounds and role_key in ["product_manager", "engineering_manager"]:
                core_rounds.append("strategic_thinking")
            if "design_system_thinking" in optional_rounds and role_key == "designer":
                core_rounds.append("design_system_thinking")

        if level_key in ["staff", "principal"]:
            if "domain_expertise" in optional_rounds:
                core_rounds.append("domain_expertise")

        # Define round details
        round_definitions = self._get_round_definitions()

        # Round types without a predefined definition are skipped, so the
        # "order" values of the remaining rounds can have gaps.
        for i, round_type in enumerate(core_rounds, 1):
            if round_type in round_definitions:
                round_def = round_definitions[round_type].copy()
                round_def["order"] = i
                round_def["focus_areas"] = self._customize_focus_areas(round_type, competency_req, custom_competencies)
                rounds[f"round_{i}_{round_type}"] = round_def

        return rounds

def _get_round_definitions(self) -> Dict[str, Dict]:
|
||||
"""Get predefined round definitions with standard durations and formats."""
|
||||
return {
|
||||
"technical_phone_screen": {
|
||||
"name": "Technical Phone Screen",
|
||||
"duration_minutes": 45,
|
||||
"format": "virtual",
|
||||
"objectives": ["Assess coding fundamentals", "Evaluate problem-solving approach", "Screen for basic technical competency"],
|
||||
"question_types": ["coding_problems", "technical_concepts", "experience_questions"],
|
||||
"evaluation_criteria": ["technical_accuracy", "problem_solving_process", "communication_clarity"]
|
||||
},
|
||||
"coding_deep_dive": {
|
||||
"name": "Coding Deep Dive",
|
||||
"duration_minutes": 75,
|
||||
"format": "in_person_or_virtual",
|
||||
"objectives": ["Evaluate coding skills in depth", "Assess code quality and testing", "Review debugging approach"],
|
                "question_types": ["complex_coding_problems", "code_review", "testing_strategy"],
                "evaluation_criteria": ["code_quality", "testing_approach", "debugging_skills", "optimization_thinking"]
            },
            "system_design": {
                "name": "System Design",
                "duration_minutes": 75,
                "format": "collaborative_whiteboard",
                "objectives": ["Assess architectural thinking", "Evaluate scalability considerations", "Review trade-off analysis"],
                "question_types": ["system_architecture", "scalability_design", "trade_off_analysis"],
                "evaluation_criteria": ["architectural_thinking", "scalability_awareness", "trade_off_reasoning"]
            },
            "behavioral": {
                "name": "Behavioral Interview",
                "duration_minutes": 45,
                "format": "conversational",
                "objectives": ["Assess cultural fit", "Evaluate past experiences", "Review leadership examples"],
                "question_types": ["star_method_questions", "situational_scenarios", "values_alignment"],
                "evaluation_criteria": ["communication_skills", "leadership_examples", "cultural_alignment"]
            },
            "technical_leadership": {
                "name": "Technical Leadership",
                "duration_minutes": 60,
                "format": "discussion_based",
                "objectives": ["Evaluate mentoring capability", "Assess technical decision making", "Review cross-team collaboration"],
                "question_types": ["leadership_scenarios", "technical_decisions", "mentoring_examples"],
                "evaluation_criteria": ["leadership_potential", "technical_judgment", "influence_skills"]
            },
            "product_sense": {
                "name": "Product Sense",
                "duration_minutes": 75,
                "format": "case_study",
                "objectives": ["Assess product intuition", "Evaluate user empathy", "Review market understanding"],
                "question_types": ["product_scenarios", "feature_prioritization", "user_journey_analysis"],
                "evaluation_criteria": ["product_intuition", "user_empathy", "analytical_thinking"]
            },
            "analytical_thinking": {
                "name": "Analytical Thinking",
                "duration_minutes": 60,
                "format": "data_analysis",
                "objectives": ["Evaluate data interpretation", "Assess metric design", "Review experiment planning"],
                "question_types": ["data_interpretation", "metric_design", "experiment_analysis"],
                "evaluation_criteria": ["analytical_rigor", "metric_intuition", "experimental_thinking"]
            },
            "design_challenge": {
                "name": "Design Challenge",
                "duration_minutes": 90,
                "format": "hands_on_design",
                "objectives": ["Assess design process", "Evaluate user-centered thinking", "Review iteration approach"],
                "question_types": ["design_problems", "user_research", "design_critique"],
                "evaluation_criteria": ["design_process", "user_focus", "visual_communication"]
            },
            "portfolio_review": {
                "name": "Portfolio Review",
                "duration_minutes": 75,
                "format": "presentation_discussion",
                "objectives": ["Review past work", "Assess design thinking", "Evaluate impact measurement"],
                "question_types": ["portfolio_walkthrough", "design_decisions", "impact_stories"],
                "evaluation_criteria": ["design_quality", "process_thinking", "business_impact"]
            }
        }

    def _customize_focus_areas(self, round_type: str, competency_req: Dict,
                               custom_competencies: Optional[List[str]]) -> List[str]:
        """Customize focus areas based on role competency requirements."""
        base_focus_areas = competency_req.get("focus_areas", [])

        round_focus_mapping = {
            "technical_phone_screen": ["coding_fundamentals", "problem_solving"],
            "coding_deep_dive": ["technical_execution", "code_quality"],
            "system_design": ["system_thinking", "architectural_reasoning"],
            "behavioral": ["cultural_fit", "communication", "teamwork"],
            "technical_leadership": ["leadership", "mentoring", "influence"],
            "product_sense": ["product_intuition", "user_empathy"],
            "analytical_thinking": ["data_analysis", "metric_design"],
            "design_challenge": ["design_process", "user_focus"]
        }

        focus_areas = round_focus_mapping.get(round_type, [])

        # Add custom competencies if specified
        if custom_competencies:
            focus_areas.extend([comp for comp in custom_competencies if comp not in focus_areas])

        # Add role-specific focus areas
        focus_areas.extend([area for area in base_focus_areas if area not in focus_areas])

        return focus_areas[:5]  # Limit to top 5 focus areas

    def _create_schedule(self, rounds: Dict[str, Dict]) -> Dict[str, Any]:
        """Create a suggested interview schedule."""
        sorted_rounds = sorted(rounds.items(), key=lambda x: x[1]["order"])

        # Calculate optimal scheduling
        total_duration = sum(round_info["duration_minutes"] for _, round_info in sorted_rounds)

        if total_duration <= 240:  # 4 hours or less - single day
            schedule_type = "single_day"
            day_structure = self._create_single_day_schedule(sorted_rounds)
        else:  # Multi-day schedule
            schedule_type = "multi_day"
            day_structure = self._create_multi_day_schedule(sorted_rounds)

        return {
            "type": schedule_type,
            "total_duration_minutes": total_duration,
            "recommended_breaks": self._calculate_breaks(total_duration),
            "day_structure": day_structure,
            "logistics_notes": self._generate_logistics_notes(sorted_rounds)
        }

    def _create_single_day_schedule(self, rounds: List[Tuple[str, Dict]]) -> Dict[str, Any]:
        """Create a single-day interview schedule."""
        start_time = datetime.strptime("09:00", "%H:%M")
        current_time = start_time

        schedule = []
        minutes_since_break = 0  # Interview minutes accumulated since the last break

        for round_name, round_info in rounds:
            # Add a break once 90 minutes of interviews have passed since the last break
            if minutes_since_break >= 90:
                schedule.append({
                    "type": "break",
                    "start_time": current_time.strftime("%H:%M"),
                    "duration_minutes": 15,
                    "end_time": (current_time + timedelta(minutes=15)).strftime("%H:%M")
                })
                current_time += timedelta(minutes=15)
                minutes_since_break = 0

            # Add the interview round
            end_time = current_time + timedelta(minutes=round_info["duration_minutes"])
            schedule.append({
                "type": "interview",
                "round_name": round_name,
                "title": round_info["name"],
                "start_time": current_time.strftime("%H:%M"),
                "end_time": end_time.strftime("%H:%M"),
                "duration_minutes": round_info["duration_minutes"],
                "format": round_info["format"]
            })
            current_time = end_time
            minutes_since_break += round_info["duration_minutes"]

        return {
            "day_1": {
                "date": "TBD",
                "start_time": start_time.strftime("%H:%M"),
                "end_time": current_time.strftime("%H:%M"),
                "rounds": schedule
            }
        }

    def _create_multi_day_schedule(self, rounds: List[Tuple[str, Dict]]) -> Dict[str, Any]:
        """Create a multi-day interview schedule."""
        # Split rounds across days (max 4 hours per day)
        max_daily_minutes = 240
        days = {}
        current_day = 1
        current_day_duration = 0
        current_day_rounds = []

        for round_name, round_info in rounds:
            duration = round_info["duration_minutes"] + 15  # Add buffer time

            if current_day_duration + duration > max_daily_minutes and current_day_rounds:
                # Finalize current day
                days[f"day_{current_day}"] = self._finalize_day_schedule(current_day_rounds)
                current_day += 1
                current_day_duration = 0
                current_day_rounds = []

            current_day_rounds.append((round_name, round_info))
            current_day_duration += duration

        # Finalize last day
        if current_day_rounds:
            days[f"day_{current_day}"] = self._finalize_day_schedule(current_day_rounds)

        return days

    def _finalize_day_schedule(self, day_rounds: List[Tuple[str, Dict]]) -> Dict[str, Any]:
        """Finalize the schedule for a specific day."""
        start_time = datetime.strptime("09:00", "%H:%M")
        current_time = start_time
        schedule = []

        for round_name, round_info in day_rounds:
            end_time = current_time + timedelta(minutes=round_info["duration_minutes"])
            schedule.append({
                "type": "interview",
                "round_name": round_name,
                "title": round_info["name"],
                "start_time": current_time.strftime("%H:%M"),
                "end_time": end_time.strftime("%H:%M"),
                "duration_minutes": round_info["duration_minutes"],
                "format": round_info["format"]
            })
            current_time = end_time + timedelta(minutes=15)  # 15-min buffer

        return {
            "date": "TBD",
            "start_time": start_time.strftime("%H:%M"),
            "end_time": (current_time - timedelta(minutes=15)).strftime("%H:%M"),
            "rounds": schedule
        }

    def _calculate_breaks(self, total_duration: int) -> List[Dict[str, Any]]:
        """Calculate recommended breaks based on total duration."""
        breaks = []

        if total_duration >= 120:  # 2+ hours
            breaks.append({"type": "short_break", "duration": 15, "after_minutes": 90})

        if total_duration >= 240:  # 4+ hours
            breaks.append({"type": "lunch_break", "duration": 60, "after_minutes": 180})

        if total_duration >= 360:  # 6+ hours
            breaks.append({"type": "short_break", "duration": 15, "after_minutes": 300})

        return breaks

    def _generate_scorecard(self, role_key: str, level_key: str, competency_req: Dict) -> Dict[str, Any]:
        """Generate a scorecard template for the interview loop."""
        scoring_dimensions = []

        # Add competency-based scoring dimensions
        for competency in competency_req["required"]:
            scoring_dimensions.append({
                "dimension": competency,
                "weight": "high",
                "scale": "1-4",
                "description": f"Assessment of {competency.replace('_', ' ')} competency"
            })

        for competency in competency_req.get("preferred", []):
            scoring_dimensions.append({
                "dimension": competency,
                "weight": "medium",
                "scale": "1-4",
                "description": f"Assessment of {competency.replace('_', ' ')} competency"
            })

        # Add standard dimensions
        standard_dimensions = [
            {"dimension": "communication", "weight": "high", "scale": "1-4"},
            {"dimension": "cultural_fit", "weight": "medium", "scale": "1-4"},
            {"dimension": "learning_agility", "weight": "medium", "scale": "1-4"}
        ]

        scoring_dimensions.extend(standard_dimensions)

        return {
            "scoring_scale": {
                "4": "Exceeds Expectations - Demonstrates mastery beyond required level",
                "3": "Meets Expectations - Solid performance meeting all requirements",
                "2": "Partially Meets - Shows potential but has development areas",
                "1": "Does Not Meet - Significant gaps in required competencies"
            },
            "dimensions": scoring_dimensions,
            "overall_recommendation": {
                "options": ["Strong Hire", "Hire", "No Hire", "Strong No Hire"],
                "criteria": "Based on weighted average and minimum thresholds"
            },
            "calibration_notes": {
                "required": True,
                "min_length": 100,
                "sections": ["strengths", "areas_for_development", "specific_examples"]
            }
        }

    def _define_interviewer_requirements(self, rounds: Dict[str, Dict]) -> Dict[str, Dict]:
        """Define interviewer skill requirements for each round."""
        requirements = {}

        for round_name, round_info in rounds.items():
            round_type = round_name.split("_", 2)[-1]  # Extract round type

            if round_type in self.interviewer_skills:
                skill_req = self.interviewer_skills[round_type].copy()
                skill_req["suggested_interviewers"] = self._suggest_interviewer_profiles(round_type)
                requirements[round_name] = skill_req
            else:
                # Default requirements
                requirements[round_name] = {
                    "required_skills": ["interviewing_basics", "evaluation_skills"],
                    "preferred_experience": ["relevant_domain"],
                    "calibration_level": "standard",
                    "suggested_interviewers": ["experienced_interviewer"]
                }

        return requirements

    def _suggest_interviewer_profiles(self, round_type: str) -> List[str]:
        """Suggest specific interviewer profiles for different round types."""
        profile_mapping = {
            "technical_phone_screen": ["senior_engineer", "tech_lead"],
            "coding_deep_dive": ["senior_engineer", "staff_engineer"],
            "system_design": ["senior_architect", "staff_engineer"],
            "behavioral": ["hiring_manager", "people_manager"],
            "technical_leadership": ["engineering_manager", "senior_staff"],
            "product_sense": ["senior_pm", "product_leader"],
            "analytical_thinking": ["senior_analyst", "data_scientist"],
            "design_challenge": ["senior_designer", "design_manager"]
        }

        return profile_mapping.get(round_type, ["experienced_interviewer"])

    def _generate_calibration_notes(self, role_key: str, level_key: str) -> Dict[str, Any]:
        """Generate calibration notes and best practices."""
        return {
            "hiring_bar_notes": f"Calibrated for {level_key} level {role_key.replace('_', ' ')} role",
            "common_pitfalls": [
                "Avoid comparing candidates to each other rather than to the role standard",
                "Don't let one strong/weak area overshadow overall assessment",
                "Ensure consistent application of evaluation criteria"
            ],
            "calibration_checkpoints": [
                "Review score distribution after every 5 candidates",
                "Conduct monthly interviewer calibration sessions",
                "Track correlation with 6-month performance reviews"
            ],
            "escalation_criteria": [
                "Any candidate receiving all 4s or all 1s",
                "Significant disagreement between interviewers (>1.5 point spread)",
                "Unusual circumstances or accommodations needed"
            ]
        }

    def _generate_logistics_notes(self, rounds: List[Tuple[str, Dict]]) -> List[str]:
        """Generate logistics and coordination notes."""
        notes = [
            "Coordinate interviewer availability before scheduling",
            "Ensure all interviewers have access to job description and competency requirements",
            "Prepare interview rooms/virtual links for all rounds",
            "Share candidate resume and application with all interviewers"
        ]

        # Add format-specific notes
        formats_used = {round_info["format"] for _, round_info in rounds}

        if "virtual" in formats_used:
            notes.append("Test video conferencing setup before virtual interviews")
            notes.append("Share virtual meeting links with candidate 24 hours in advance")

        if "collaborative_whiteboard" in formats_used:
            notes.append("Prepare whiteboard or collaborative online tool for design sessions")

        if "hands_on_design" in formats_used:
            notes.append("Provide design tools access or ensure candidate can screen share their preferred tools")

        return notes


def format_human_readable(loop_data: Dict[str, Any]) -> str:
    """Format the interview loop data in a human-readable format."""
    output = []

    # Header
    output.append(f"Interview Loop Design for {loop_data['role']} ({loop_data['level'].title()} Level)")
    output.append("=" * 60)

    if loop_data.get('team'):
        output.append(f"Team: {loop_data['team']}")

    output.append(f"Generated: {loop_data['generated_at']}")
    output.append(f"Total Duration: {loop_data['total_duration_minutes']} minutes ({loop_data['total_duration_minutes']//60}h {loop_data['total_duration_minutes']%60}m)")
    output.append(f"Total Rounds: {loop_data['total_rounds']}")
    output.append("")

    # Interview Rounds
    output.append("INTERVIEW ROUNDS")
    output.append("-" * 40)

    sorted_rounds = sorted(loop_data['rounds'].items(), key=lambda x: x[1]['order'])
    for round_name, round_info in sorted_rounds:
        output.append(f"\nRound {round_info['order']}: {round_info['name']}")
        output.append(f"Duration: {round_info['duration_minutes']} minutes")
        output.append(f"Format: {round_info['format'].replace('_', ' ').title()}")

        output.append("Objectives:")
        for obj in round_info['objectives']:
            output.append(f" • {obj}")

        output.append("Focus Areas:")
        for area in round_info['focus_areas']:
            output.append(f" • {area.replace('_', ' ').title()}")

    # Suggested Schedule
    output.append("\nSUGGESTED SCHEDULE")
    output.append("-" * 40)

    schedule = loop_data['suggested_schedule']
    output.append(f"Schedule Type: {schedule['type'].replace('_', ' ').title()}")

    for day_name, day_info in schedule['day_structure'].items():
        output.append(f"\n{day_name.replace('_', ' ').title()}:")
        output.append(f"Time: {day_info['start_time']} - {day_info['end_time']}")

        for item in day_info['rounds']:
            if item['type'] == 'interview':
                output.append(f" {item['start_time']}-{item['end_time']}: {item['title']} ({item['duration_minutes']}min)")
            else:
                output.append(f" {item['start_time']}-{item['end_time']}: {item['type'].title()} ({item['duration_minutes']}min)")

    # Interviewer Requirements
    output.append("\nINTERVIEWER REQUIREMENTS")
    output.append("-" * 40)

    for round_name, requirements in loop_data['interviewer_requirements'].items():
        round_display = round_name.split("_", 2)[-1].replace("_", " ").title()
        output.append(f"\n{round_display}:")
        output.append(f"Required Skills: {', '.join(requirements['required_skills'])}")
        output.append(f"Suggested Interviewers: {', '.join(requirements['suggested_interviewers'])}")
        output.append(f"Calibration Level: {requirements['calibration_level'].title()}")

    # Scorecard Overview
    output.append("\nSCORECARD TEMPLATE")
    output.append("-" * 40)

    scorecard = loop_data['scorecard_template']
    output.append("Scoring Scale:")
    for score, description in scorecard['scoring_scale'].items():
        output.append(f" {score}: {description}")

    output.append("\nEvaluation Dimensions:")
    for dim in scorecard['dimensions']:
        output.append(f" • {dim['dimension'].replace('_', ' ').title()} (Weight: {dim['weight']})")

    # Calibration Notes
    output.append("\nCALIBRATION NOTES")
    output.append("-" * 40)

    calibration = loop_data['calibration_notes']
    output.append(f"Hiring Bar: {calibration['hiring_bar_notes']}")

    output.append("\nCommon Pitfalls:")
    for pitfall in calibration['common_pitfalls']:
        output.append(f" • {pitfall}")

    return "\n".join(output)


def main():
    parser = argparse.ArgumentParser(description="Generate calibrated interview loops for specific roles and levels")
    parser.add_argument("--role", type=str, help="Job role title (e.g., 'Senior Software Engineer')")
    parser.add_argument("--level", type=str, help="Experience level (junior, mid, senior, staff, principal)")
    parser.add_argument("--team", type=str, help="Team or department (optional)")
    parser.add_argument("--competencies", type=str, help="Comma-separated list of specific competencies to focus on")
    parser.add_argument("--input", type=str, help="Input JSON file with role definition")
    parser.add_argument("--output", type=str, help="Output directory or file path")
    parser.add_argument("--format", choices=["json", "text", "both"], default="both", help="Output format")

    args = parser.parse_args()

    designer = InterviewLoopDesigner()

    # Handle input
    if args.input:
        try:
            with open(args.input, 'r') as f:
                role_data = json.load(f)
            role = role_data.get('role') or role_data.get('title', '')
            level = role_data.get('level', 'senior')
            team = role_data.get('team')
            competencies = role_data.get('competencies')
        except Exception as e:
            print(f"Error reading input file: {e}")
            sys.exit(1)
    else:
        if not args.role or not args.level:
            print("Error: --role and --level are required when not using --input")
            sys.exit(1)

        role = args.role
        level = args.level
        team = args.team
        # Strip whitespace around each entry so "a, b" yields ["a", "b"]
        competencies = [c.strip() for c in args.competencies.split(',') if c.strip()] if args.competencies else None

    # Generate interview loop
    try:
        loop_data = designer.generate_interview_loop(role, level, team, competencies)

        # Handle output
        if args.output:
            output_path = args.output
            if os.path.isdir(output_path):
                safe_role = "".join(c for c in role.lower() if c.isalnum() or c in (' ', '-', '_')).replace(' ', '_')
                base_filename = f"{safe_role}_{level}_interview_loop"
                json_path = os.path.join(output_path, f"{base_filename}.json")
                text_path = os.path.join(output_path, f"{base_filename}.txt")
            else:
                # Use provided path as base, swapping only a trailing .json extension
                base_path = output_path[:-len('.json')] if output_path.endswith('.json') else output_path
                json_path = f"{base_path}.json"
                text_path = f"{base_path}.txt"
        else:
            safe_role = "".join(c for c in role.lower() if c.isalnum() or c in (' ', '-', '_')).replace(' ', '_')
            base_filename = f"{safe_role}_{level}_interview_loop"
            json_path = f"{base_filename}.json"
            text_path = f"{base_filename}.txt"

        # Write outputs
        if args.format in ["json", "both"]:
            with open(json_path, 'w') as f:
                json.dump(loop_data, f, indent=2, default=str)
            print(f"JSON output written to: {json_path}")

        if args.format in ["text", "both"]:
            with open(text_path, 'w') as f:
                f.write(format_human_readable(loop_data))
            print(f"Text output written to: {text_path}")

        # Always print summary to stdout
        print("\nInterview Loop Summary:")
        print(f"Role: {loop_data['role']} ({loop_data['level'].title()})")
        print(f"Total Duration: {loop_data['total_duration_minutes']} minutes")
        print(f"Number of Rounds: {loop_data['total_rounds']}")
        print(f"Schedule Type: {loop_data['suggested_schedule']['type'].replace('_', ' ').title()}")

    except Exception as e:
        print(f"Error generating interview loop: {e}")
        sys.exit(1)


if __name__ == "__main__":
    main()