# Alert Design Patterns: A Guide to Effective Alerting
## Introduction
Well-designed alerts are the difference between a team that trusts its monitoring and one that gets paged at 3 AM about non-issues. This guide provides patterns and anti-patterns for creating alerts that deliver value without causing fatigue.
## Fundamental Principles
### The Golden Rules of Alerting
1. **Every alert should be actionable** - If you can't do something about it, don't alert
2. **Every alert should require human intelligence** - If a script can handle it, automate the response
3. **Every alert should be novel** - Don't alert on known, ongoing issues
4. **Every alert should represent a user-visible impact** - Internal metrics matter only if users are affected
### Alert Classification
#### Critical Alerts
- Service is completely down
- Data loss is occurring
- Security breach detected
- SLO burn rate indicates imminent SLO violation
#### Warning Alerts
- Service degradation affecting some users
- Approaching resource limits
- Dependent service issues
- Elevated error rates within SLO
#### Info Alerts
- Deployment notifications
- Capacity planning triggers
- Configuration changes
- Maintenance windows
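In rule files, these classes are typically encoded as a `severity` label, which the routing examples later in this guide match on. A minimal sketch (the metric names and thresholds are illustrative):
```yaml
# Severity travels as a label so Alertmanager routes can match on it.
- alert: CheckoutDown
  expr: up{job="checkout"} == 0
  for: 1m
  labels:
    severity: critical   # pages the on-call
- alert: CheckoutErrorsElevated
  expr: checkout_error_rate > 0.01
  for: 10m
  labels:
    severity: warning    # goes to team chat
```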
## Alert Design Patterns
### Pattern 1: Symptoms, Not Causes
**Good**: Alert on user-visible symptoms
```yaml
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  annotations:
    summary: "API latency is high"
    description: "95th percentile latency is {{ $value }}s, above 500ms threshold"
```
**Bad**: Alert on internal metrics that may not affect users
```yaml
- alert: HighCPU
  expr: cpu_usage > 80
  # This might not affect users at all!
```
### Pattern 2: Multi-Window Alerting
Reduce false positives by requiring sustained problems:
```yaml
- alert: ServiceDown
  expr: |
    (
      avg_over_time(up[2m]) == 0     # Short window: immediate detection
      and
      avg_over_time(up[10m]) < 0.8   # Long window: avoid flapping
    )
  for: 1m
```
### Pattern 3: Burn Rate Alerting
Alert based on error budget consumption rate:
```yaml
# Fast burn: 2% of monthly budget in 1 hour
- alert: ErrorBudgetFastBurn
  expr: |
    (
      error_rate_5m > (14.4 * error_budget_slo)
      and
      error_rate_1h > (14.4 * error_budget_slo)
    )
  for: 2m
  labels:
    severity: critical

# Slow burn: 10% of monthly budget in 3 days
- alert: ErrorBudgetSlowBurn
  expr: |
    (
      error_rate_6h > (1.0 * error_budget_slo)
      and
      error_rate_3d > (1.0 * error_budget_slo)
    )
  for: 15m
  labels:
    severity: warning
```
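The 14.4x and 1.0x multipliers above fall out of a simple ratio: the fraction of budget a rule is allowed to consume, divided by its window expressed as a fraction of the SLO period. A quick sanity check, assuming a 30-day period:

```python
# Burn-rate multiplier: how many times faster than "budget-neutral"
# errors must arrive to consume `budget_fraction` within `window_hours`.

BUDGET_PERIOD_HOURS = 30 * 24  # 30-day SLO period

def burn_rate(budget_fraction: float, window_hours: float) -> float:
    return budget_fraction / (window_hours / BUDGET_PERIOD_HOURS)

print(burn_rate(0.02, 1))   # Fast burn: 2% of budget in 1 hour -> ~14.4
print(burn_rate(0.10, 72))  # Slow burn: 10% of budget in 3 days -> ~1.0
```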
### Pattern 4: Hysteresis
Use different thresholds for firing and resolving to prevent flapping:
```yaml
- alert: HighErrorRate
  # Fire at 5%; once firing, stay firing until the rate drops below 3%.
  # Prometheus has no native resolve threshold, so the rule latches by
  # referencing its own firing state via the ALERTS series. This prevents
  # flapping around the 5% threshold.
  expr: |
    error_rate > 0.05
    or
    (error_rate > 0.03 and on() ALERTS{alertname="HighErrorRate", alertstate="firing"} == 1)
  for: 5m
```
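The latch behavior is easier to see in a small simulation (a sketch; the thresholds match the 5%/3% values above):

```python
def hysteresis_state(samples, fire_at=0.05, resolve_at=0.03):
    """Walk a series of error-rate samples, firing above `fire_at`
    and resolving only once the rate drops below `resolve_at`."""
    firing = False
    states = []
    for rate in samples:
        if not firing and rate > fire_at:
            firing = True
        elif firing and rate < resolve_at:
            firing = False
        states.append(firing)
    return states

# An error rate oscillating around 5% fires once instead of flapping:
print(hysteresis_state([0.02, 0.06, 0.04, 0.06, 0.02]))
# [False, True, True, True, False]
```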
### Pattern 5: Composite Alerts
Alert when multiple conditions indicate a problem:
```yaml
- alert: ServiceDegraded
  expr: |
    (
      (latency_p95 > latency_threshold)
      or
      (error_rate > error_threshold)
      or
      (availability < availability_threshold)
    ) and (
      request_rate > min_request_rate  # Only alert if we have traffic
    )
```
### Pattern 6: Contextual Alerting
Include relevant context in alerts:
```yaml
- alert: DatabaseConnections
  expr: db_connections_active / db_connections_max > 0.8
  for: 5m
  annotations:
    summary: "Database connection pool nearly exhausted"
    description: "{{ $labels.database }} has {{ $value | humanizePercentage }} connection utilization"
    runbook_url: "https://runbooks.company.com/database-connections"
    impact: "New requests may be rejected, causing 500 errors"
    suggested_action: "Check for connection leaks or increase pool size"
```
## Alert Routing and Escalation
### Routing by Impact and Urgency
#### Critical Path Services
```yaml
route:
  group_by: ['service']
  routes:
    - match:
        service: 'payment-api'
        severity: 'critical'
      receiver: 'payment-team-pager'
      continue: true
    - match:
        service: 'payment-api'
        severity: 'warning'
      receiver: 'payment-team-slack'
```
#### Time-Based Routing
```yaml
route:
  routes:
    - match:
        severity: 'critical'
      receiver: 'oncall-pager'
    - match:
        severity: 'warning'
      receiver: 'team-slack'
      # e.g. 9 AM - 5 PM, defined under top-level time_intervals
      active_time_intervals: ['business_hours']
    - match:
        severity: 'warning'
      receiver: 'team-email'  # Lower urgency outside business hours
```
### Escalation Patterns
#### Linear Escalation
```yaml
receivers:
  - name: 'primary-oncall'
    pagerduty_configs:
      # The escalation policy itself lives in PagerDuty, attached to the
      # service behind this integration key, e.g.:
      #   0 min: Primary on-call
      #   5 min: Secondary on-call
      #  15 min: Engineering manager
      #  30 min: Director of engineering
      - routing_key: '<pagerduty-integration-key>'
```
#### Severity-Based Escalation
```yaml
# Critical: Immediate escalation
- match:
    severity: 'critical'
  receiver: 'critical-escalation'
# Warning: Team-first escalation
- match:
    severity: 'warning'
  receiver: 'team-escalation'
```
## Alert Fatigue Prevention
### Grouping and Suppression
#### Time-Based Grouping
```yaml
route:
  group_wait: 30s      # Wait 30s to group similar alerts
  group_interval: 2m   # Send grouped alerts every 2 minutes
  repeat_interval: 1h  # Re-send unresolved alerts every hour
```
#### Dependent Service Suppression
```yaml
# Prometheus alerting rules:
- alert: ServiceDown
  expr: up == 0
- alert: HighLatency
  expr: latency_p95 > 1

# Alertmanager configuration: suppress HighLatency while ServiceDown
# is firing for the same service.
inhibit_rules:
  - source_match:
      alertname: 'ServiceDown'
    target_match:
      alertname: 'HighLatency'
    equal: ['service']
```
### Alert Throttling
```yaml
# Limit noise from conditions that fluctuate
- alert: HighMemoryUsage
  expr: memory_usage_percent > 85
  for: 10m  # Longer 'for' duration reduces noise
  annotations:
    summary: "Memory usage has been high for 10+ minutes"
```
### Smart Defaults
```yaml
# Use business logic to set intelligent thresholds
- alert: LowTraffic
  expr: |
    request_rate < (
      avg_over_time(request_rate[7d]) * 0.1  # 10% of weekly average
    )
  # Only alert during business hours, when low traffic is unusual
  for: 30m
```
## Runbook Integration
### Runbook Structure Template
```markdown
# Alert: {{ $labels.alertname }}
## Immediate Actions
1. Check service status dashboard
2. Verify if users are affected
3. Look at recent deployments/changes
## Investigation Steps
1. Check logs for errors in the last 30 minutes
2. Verify dependent services are healthy
3. Check resource utilization (CPU, memory, disk)
4. Review recent alerts for patterns
## Resolution Actions
- If deployment-related: Consider rollback
- If resource-related: Scale up or optimize queries
- If dependency-related: Engage appropriate team
## Escalation
- Primary: @team-oncall
- Secondary: @engineering-manager
- Emergency: @site-reliability-team
```
### Runbook Integration in Alerts
```yaml
annotations:
  runbook_url: "https://runbooks.company.com/alerts/{{ $labels.alertname }}"
  quick_debug: |
    1. curl -s https://{{ $labels.instance }}/health
    2. kubectl logs {{ $labels.pod }} --tail=50
    3. Check dashboard: https://grafana.company.com/d/service-{{ $labels.service }}
```
## Testing and Validation
### Alert Testing Strategies
#### Chaos Engineering Integration
```python
# Test that alerts fire during controlled failures
def test_alert_during_cpu_spike():
    with chaos.cpu_spike(target='payment-api', duration='2m'):
        assert wait_for_alert('HighCPU', timeout=180)

def test_alert_during_network_partition():
    with chaos.network_partition(target='database'):
        assert wait_for_alert('DatabaseUnreachable', timeout=60)
```
#### Historical Alert Analysis
```prometheus
# Alerts that fired in the last 30 days without a matching incident
count by (alertname) (
  count_over_time(ALERTS{alertstate="firing"}[30d])
) unless on (alertname) (
  count by (alertname) (
    count_over_time(incident_created{source="alert"}[30d])
  )
)
```
### Alert Quality Metrics
#### Alert Precision
```
Precision = True Positives / (True Positives + False Positives)
```
Track alerts that resulted in actual incidents vs false alarms.
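As a sketch, precision can be computed from a log of fired alerts tagged with whether each led to a real incident (the tagging itself is assumed to come from your incident tracker):

```python
def alert_precision(outcomes: list[bool]) -> float:
    """Precision = true positives / all fired alerts, where True marks
    an alert that corresponded to a real incident."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)

# 30 days of HighLatency firings: 6 real incidents, 2 false alarms
print(alert_precision([True] * 6 + [False] * 2))  # 0.75
```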
#### Time to Resolution
```prometheus
# Average time from alert firing to resolution, per alert
# (timestamp metrics are illustrative; note the subquery step)
avg by (alertname) (
  avg_over_time(
    (alert_resolved_timestamp - alert_fired_timestamp)[30d:1h]
  )
)
```
#### Alert Fatigue Indicators
```prometheus
# Alerts per day by team
sum by (team) (
  increase(alerts_fired_total[1d])
)

# Percentage of alerts acknowledged within 15 minutes
sum(alerts_acked_within_15m) / sum(alerts_fired) * 100
```
## Advanced Patterns
### Machine Learning-Enhanced Alerting
#### Anomaly Detection
```yaml
- alert: AnomalousTraffic
  expr: |
    abs(request_rate - predict_linear(request_rate[1h], 300)) /
      stddev_over_time(request_rate[1h]) > 3
  for: 10m
  annotations:
    summary: "Traffic pattern is anomalous"
    description: "Current traffic deviates from predicted pattern by >3 standard deviations"
```
#### Dynamic Thresholds
```yaml
- alert: DynamicHighLatency
  expr: |
    latency_p95 > (
      quantile_over_time(0.95, latency_p95[7d]) +  # Historical 95th percentile
      2 * stddev_over_time(latency_p95[7d])        # Plus 2 standard deviations
    )
```
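The same computation can be sanity-checked offline against exported latency history; a standard-library sketch (the sample data is synthetic):

```python
from statistics import quantiles, stdev

def dynamic_threshold(history: list[float], q: float = 0.95, sigmas: float = 2.0) -> float:
    """Threshold = historical q-quantile plus `sigmas` standard deviations,
    mirroring the PromQL rule above."""
    cut_points = quantiles(history, n=100)  # 99 percentile cut points
    q_value = cut_points[int(q * 100) - 1]  # index 94 -> 95th percentile
    return q_value + sigmas * stdev(history)

# A week of hourly p95 latency samples (seconds), with a daily noon spike:
week_of_p95 = [0.20 + 0.05 * (i % 24 == 12) for i in range(168)]
threshold = dynamic_threshold(week_of_p95)
```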
### Business Hours Awareness
```yaml
# Different thresholds for business vs off hours
- alert: HighLatencyBusinessHours
  expr: latency_p95 > 0.2  # Stricter during business hours
  for: 2m
  # Active 9 AM - 5 PM weekdays

- alert: HighLatencyOffHours
  expr: latency_p95 > 0.5  # More lenient after hours
  for: 5m
  # Active nights and weekends
```
### Progressive Alerting
```yaml
# Escalating alert severity based on duration
- alert: ServiceLatencyElevated
  expr: latency_p95 > 0.5
  for: 5m
  labels:
    severity: info

- alert: ServiceLatencyHigh
  expr: latency_p95 > 0.5
  for: 15m  # Same condition, longer duration
  labels:
    severity: warning

- alert: ServiceLatencyCritical
  expr: latency_p95 > 0.5
  for: 30m  # Same condition, even longer duration
  labels:
    severity: critical
```
## Anti-Patterns to Avoid
### Anti-Pattern 1: Alerting on Everything
**Problem**: Too many alerts create noise and fatigue
**Solution**: Be selective; only alert on user-impacting issues
### Anti-Pattern 2: Vague Alert Messages
**Problem**: "Service X is down" - which instance? what's the impact?
**Solution**: Include specific details and context
### Anti-Pattern 3: Alerts Without Runbooks
**Problem**: Alerts that don't explain what to do
**Solution**: Every alert must have an associated runbook
### Anti-Pattern 4: Static Thresholds
**Problem**: 80% CPU might be normal during peak hours
**Solution**: Use contextual, adaptive thresholds
### Anti-Pattern 5: Ignoring Alert Quality
**Problem**: Accepting high false positive rates
**Solution**: Regularly review and tune alert precision
## Implementation Checklist
### Pre-Implementation
- [ ] Define alert severity levels and escalation policies
- [ ] Create runbook templates
- [ ] Set up alert routing configuration
- [ ] Define SLOs that alerts will protect
### Alert Development
- [ ] Each alert has clear success criteria
- [ ] Alert conditions tested against historical data
- [ ] Runbook created and accessible
- [ ] Severity and routing configured
- [ ] Context and suggested actions included
### Post-Implementation
- [ ] Monitor alert precision and recall
- [ ] Regular review of alert fatigue metrics
- [ ] Quarterly alert effectiveness review
- [ ] Team training on alert response procedures
### Quality Assurance
- [ ] Test alerts fire during controlled failures
- [ ] Verify alerts resolve when conditions improve
- [ ] Confirm runbooks are accurate and helpful
- [ ] Validate escalation paths work correctly
Remember: Great alerts are invisible when things work and invaluable when things break. Focus on quality over quantity, and always optimize for the human who will respond to the alert at 3 AM.