# Alert Design Patterns: A Guide to Effective Alerting

## Introduction

Well-designed alerts are the difference between a reliable system and 3 AM pages about non-issues. This guide provides patterns and anti-patterns for creating alerts that provide value without causing fatigue.

## Fundamental Principles

### The Golden Rules of Alerting

1. **Every alert should be actionable** - If you can't do something about it, don't alert
2. **Every alert should require human intelligence** - If a script can handle it, automate the response
3. **Every alert should be novel** - Don't alert on known, ongoing issues
4. **Every alert should represent a user-visible impact** - Internal metrics matter only if users are affected

### Alert Classification

#### Critical Alerts

- Service is completely down
- Data loss is occurring
- Security breach detected
- SLO burn rate indicates imminent SLO violation

#### Warning Alerts

- Service degradation affecting some users
- Approaching resource limits
- Dependent service issues
- Elevated error rates within SLO

#### Info Alerts

- Deployment notifications
- Capacity planning triggers
- Configuration changes
- Maintenance windows
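
As a rough sketch of how these classes are usually carried on the alerts themselves, severity is attached as a label on each rule and used later for routing. Metric, job, and alert names below are illustrative, not taken from any particular system:

```yaml
groups:
  - name: severity-examples
    rules:
      # Critical: user-visible outage, pages the on-call
      - alert: PaymentAPIDown
        expr: sum(up{job="payment-api"}) == 0
        for: 2m
        labels:
          severity: critical
      # Warning: degradation that is still within SLO
      - alert: PaymentAPIElevatedErrors
        expr: job:request_errors:ratio_rate5m{job="payment-api"} > 0.01
        for: 10m
        labels:
          severity: warning
      # Info: non-urgent signal such as a capacity-planning trigger
      - alert: PaymentAPIDiskFillingUp
        expr: disk_used_percent{job="payment-api"} > 70
        for: 1h
        labels:
          severity: info
```
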
## Alert Design Patterns

### Pattern 1: Symptoms, Not Causes

**Good**: Alert on user-visible symptoms

```yaml
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  annotations:
    summary: "API latency is high"
    description: "95th percentile latency is {{ $value }}s, above 500ms threshold"
```

**Bad**: Alert on internal metrics that may not affect users

```yaml
- alert: HighCPU
  expr: cpu_usage > 80
  # This might not affect users at all!
```

### Pattern 2: Multi-Window Alerting

Reduce false positives by requiring sustained problems:

```yaml
- alert: ServiceDown
  expr: |
    (
      avg_over_time(up[2m]) == 0     # Short window: immediate detection
      and
      avg_over_time(up[10m]) < 0.8   # Long window: avoid flapping
    )
  for: 1m
```

### Pattern 3: Burn Rate Alerting

Alert based on error budget consumption rate:

```yaml
# Fast burn: 2% of monthly budget in 1 hour
- alert: ErrorBudgetFastBurn
  expr: |
    (
      error_rate_5m > (14.4 * error_budget_slo)
      and
      error_rate_1h > (14.4 * error_budget_slo)
    )
  for: 2m
  labels:
    severity: critical

# Slow burn: 10% of monthly budget in 3 days
- alert: ErrorBudgetSlowBurn
  expr: |
    (
      error_rate_6h > (1.0 * error_budget_slo)
      and
      error_rate_3d > (1.0 * error_budget_slo)
    )
  for: 15m
  labels:
    severity: warning
```
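
As a point of reference, the 14.4x multiplier follows from the budget math: consuming 2% of a 30-day (720-hour) budget in 1 hour requires a burn rate of 0.02 * 720 = 14.4. A minimal sketch of the recording rules such alerts might build on, assuming a 99.9% SLO and a hypothetical `http_requests_total` counter with a `code` label:

```yaml
groups:
  - name: slo-recordings
    rules:
      # Fraction of requests failing over the last 5 minutes
      - record: error_rate_5m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      # Allowed error rate for a 99.9% availability SLO (0.1% error budget)
      - record: error_budget_slo
        expr: vector(0.001)
      # The 1h, 6h, and 3d rates follow the same shape with wider ranges
```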

### Pattern 4: Hysteresis

Use different thresholds for firing and resolving to prevent flapping. Note that Prometheus has no built-in resolve threshold: an alert resolves as soon as its expression stops returning results, so a separate clear threshold has to be encoded in the expression itself (see the sketch below).

```yaml
- alert: HighErrorRate
  expr: error_rate > 0.05   # Fire at 5%
  for: 5m
  # As written, this resolves as soon as error_rate drops back below 5%.
  # To resolve only below 3% and avoid flapping around a single threshold,
  # the 3% clear condition must be part of the expression.
```
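
One way to approximate hysteresis in Prometheus is to keep the alert active while it is already firing, by referencing the built-in `ALERTS` series. This is a sketch: the `on()` matching assumes `error_rate` is a single series, and self-referencing alerts should be used with care.

```yaml
- alert: HighErrorRate
  expr: |
    error_rate > 0.05            # fire at 5%
    or
    (
      error_rate > 0.03          # once firing, stay firing until below 3%
      and on()
      ALERTS{alertname="HighErrorRate", alertstate="firing"} == 1
    )
  for: 5m
```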

### Pattern 5: Composite Alerts

Alert when multiple conditions indicate a problem:

```yaml
- alert: ServiceDegraded
  expr: |
    (
      (latency_p95 > latency_threshold)
      or
      (error_rate > error_threshold)
      or
      (availability < availability_threshold)
    )
    and
    (
      request_rate > min_request_rate   # Only alert if we have traffic
    )
```

### Pattern 6: Contextual Alerting

Include relevant context in alerts:

```yaml
- alert: DatabaseConnections
  expr: db_connections_active / db_connections_max > 0.8
  for: 5m
  annotations:
    summary: "Database connection pool nearly exhausted"
    description: "{{ $labels.database }} has {{ $value | humanizePercentage }} connection utilization"
    runbook_url: "https://runbooks.company.com/database-connections"
    impact: "New requests may be rejected, causing 500 errors"
    suggested_action: "Check for connection leaks or increase pool size"
```

## Alert Routing and Escalation

### Routing by Impact and Urgency

#### Critical Path Services

```yaml
route:
  group_by: ['service']
  routes:
    - match:
        service: 'payment-api'
        severity: 'critical'
      receiver: 'payment-team-pager'
      continue: true
    - match:
        service: 'payment-api'
        severity: 'warning'
      receiver: 'payment-team-slack'
```

#### Time-Based Routing

Alertmanager does not match on a `time` label; time-of-day routing is expressed with time intervals (a concrete sketch follows this example). Conceptually:

```yaml
route:
  routes:
    - match:
        severity: 'critical'
      receiver: 'oncall-pager'
    - match:
        severity: 'warning'
        time: 'business_hours'   # 9 AM - 5 PM (conceptual, not a real matcher)
      receiver: 'team-slack'
    - match:
        severity: 'warning'
        time: 'after_hours'
      receiver: 'team-email'     # Lower urgency outside business hours
```
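
A hedged sketch of how this intent maps onto Alertmanager's time-interval support (`time_intervals` plus `active_time_intervals`/`mute_time_intervals` on routes, available in recent Alertmanager versions); the interval name, hours, and receivers are assumptions, and exact fall-through behavior should be verified against your version:

```yaml
time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

route:
  routes:
    - match:
        severity: 'critical'
      receiver: 'oncall-pager'
    # Warnings notify chat during business hours...
    - match:
        severity: 'warning'
      receiver: 'team-slack'
      active_time_intervals: ['business_hours']
      continue: true
    # ...and email outside of them
    - match:
        severity: 'warning'
      receiver: 'team-email'
      mute_time_intervals: ['business_hours']
```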

### Escalation Patterns

#### Linear Escalation

```yaml
receivers:
  - name: 'primary-oncall'
    pagerduty_configs:
      # Escalation timing lives in the PagerDuty escalation policy itself,
      # not in Alertmanager; shown here for illustration:
      - escalation_policy: 'P1-Escalation'
        #  0 min: Primary on-call
        #  5 min: Secondary on-call
        # 15 min: Engineering manager
        # 30 min: Director of engineering
```

#### Severity-Based Escalation

```yaml
# Critical: Immediate escalation
- match:
    severity: 'critical'
  receiver: 'critical-escalation'

# Warning: Team-first escalation
- match:
    severity: 'warning'
  receiver: 'team-escalation'
```

## Alert Fatigue Prevention

### Grouping and Suppression

#### Time-Based Grouping

```yaml
route:
  group_wait: 30s        # Wait 30s to group similar alerts
  group_interval: 2m     # Send grouped alerts every 2 minutes
  repeat_interval: 1h    # Re-send unresolved alerts every hour
```

#### Dependent Service Suppression

```yaml
# Alerting rules (Prometheus rule file)
- alert: ServiceDown
  expr: up == 0

- alert: HighLatency
  expr: latency_p95 > 1

# Inhibition (Alertmanager configuration): HighLatency is suppressed
# while ServiceDown is firing for the same service
inhibit_rules:
  - source_match:
      alertname: 'ServiceDown'
    target_match:
      alertname: 'HighLatency'
    equal: ['service']
```

### Alert Throttling

```yaml
# Require the condition to hold for 10 minutes before firing;
# repeat_interval on the route controls how often it re-notifies
- alert: HighMemoryUsage
  expr: memory_usage_percent > 85
  for: 10m   # Longer 'for' duration reduces noise
  annotations:
    summary: "Memory usage has been high for 10+ minutes"
```

### Smart Defaults

```yaml
# Use business logic to set intelligent thresholds
- alert: LowTraffic
  expr: |
    request_rate < (
      avg_over_time(request_rate[7d]) * 0.1   # 10% of weekly average
    )
  for: 30m
  # Only alert during business hours, when low traffic is unusual
  # (see Time-Based Routing above and Business Hours Awareness below)
```

## Runbook Integration

### Runbook Structure Template

```markdown
# Alert: {{ $labels.alertname }}

## Immediate Actions
1. Check service status dashboard
2. Verify if users are affected
3. Look at recent deployments/changes

## Investigation Steps
1. Check logs for errors in the last 30 minutes
2. Verify dependent services are healthy
3. Check resource utilization (CPU, memory, disk)
4. Review recent alerts for patterns

## Resolution Actions
- If deployment-related: Consider rollback
- If resource-related: Scale up or optimize queries
- If dependency-related: Engage appropriate team

## Escalation
- Primary: @team-oncall
- Secondary: @engineering-manager
- Emergency: @site-reliability-team
```

### Runbook Integration in Alerts

```yaml
annotations:
  runbook_url: "https://runbooks.company.com/alerts/{{ $labels.alertname }}"
  quick_debug: |
    1. curl -s https://{{ $labels.instance }}/health
    2. kubectl logs {{ $labels.pod }} --tail=50
    3. Check dashboard: https://grafana.company.com/d/service-{{ $labels.service }}
```

## Testing and Validation

### Alert Testing Strategies

#### Chaos Engineering Integration

```python
# Test that alerts fire during controlled failures
# ('chaos' and 'wait_for_alert' are illustrative test helpers)
def test_alert_during_cpu_spike():
    with chaos.cpu_spike(target='payment-api', duration='2m'):
        assert wait_for_alert('HighCPU', timeout=180)


def test_alert_during_network_partition():
    with chaos.network_partition(target='database'):
        assert wait_for_alert('DatabaseUnreachable', timeout=60)
```

#### Historical Alert Analysis

```prometheus
# Alerts that fired in the last 30 days with no matching incident
count by (alertname) (
  count_over_time(ALERTS{alertstate="firing"}[30d])
)
unless on (alertname)
count by (alertname) (
  count_over_time(incident_created{source="alert"}[30d])
)
```

### Alert Quality Metrics

#### Alert Precision

```
Precision = True Positives / (True Positives + False Positives)
```

Track alerts that resulted in actual incidents vs false alarms.
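
If those outcomes are exported as metrics, precision can be tracked continuously as a recording rule; a sketch assuming hypothetical counters `alerts_resulting_in_incident_total` and `alerts_fired_total` labelled by team:

```yaml
- record: team:alert_precision:ratio_30d
  expr: |
    sum by (team) (increase(alerts_resulting_in_incident_total[30d]))
    /
    sum by (team) (increase(alerts_fired_total[30d]))
```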

#### Time to Resolution

```prometheus
# Average time from alert firing to resolution (30-day subquery)
avg by (alertname) (
  avg_over_time(
    (alert_resolved_timestamp - alert_fired_timestamp)[30d:1h]
  )
)
```

#### Alert Fatigue Indicators

```prometheus
# Alerts per day by team
sum by (team) (
  increase(alerts_fired_total[1d])
)

# Percentage of alerts acknowledged within 15 minutes
sum(alerts_acked_within_15m) / sum(alerts_fired) * 100
```

## Advanced Patterns

### Machine Learning-Enhanced Alerting

#### Anomaly Detection

```yaml
- alert: AnomalousTraffic
  expr: |
    abs(request_rate - predict_linear(request_rate[1h], 300)) /
    stddev_over_time(request_rate[1h]) > 3
  for: 10m
  annotations:
    summary: "Traffic pattern is anomalous"
    description: "Current traffic deviates from predicted pattern by >3 standard deviations"
```

#### Dynamic Thresholds

```yaml
- alert: DynamicHighLatency
  expr: |
    latency_p95 > (
      quantile_over_time(0.95, latency_p95[7d])   # Historical 95th percentile
      + 2 * stddev_over_time(latency_p95[7d])     # Plus 2 standard deviations
    )
```

### Business Hours Awareness

```yaml
# Different thresholds for business vs off hours
- alert: HighLatencyBusinessHours
  expr: latency_p95 > 0.2   # Stricter during business hours
  for: 2m
  # Active 9 AM - 5 PM weekdays

- alert: HighLatencyOffHours
  expr: latency_p95 > 0.5   # More lenient after hours
  for: 5m
  # Active nights and weekends
```
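
Alert rules have no built-in schedule, so "active 9 AM - 5 PM" has to come either from routing (as in the time-interval sketch earlier) or from the expression itself. One way is PromQL's `hour()` and `day_of_week()` functions; a sketch, with times in UTC and the threshold carried over from above:

```yaml
- alert: HighLatencyBusinessHours
  expr: |
    latency_p95 > 0.2
    and on()
    (hour() >= 9 and hour() < 17)                  # 09:00-17:00 UTC
    and on()
    (day_of_week() >= 1 and day_of_week() <= 5)    # Monday-Friday
  for: 2m
```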

### Progressive Alerting

```yaml
# Escalating alert severity based on duration
- alert: ServiceLatencyElevated
  expr: latency_p95 > 0.5
  for: 5m
  labels:
    severity: info

- alert: ServiceLatencyHigh
  expr: latency_p95 > 0.5
  for: 15m   # Same condition, longer duration
  labels:
    severity: warning

- alert: ServiceLatencyCritical
  expr: latency_p95 > 0.5
  for: 30m   # Same condition, even longer duration
  labels:
    severity: critical
```

## Anti-Patterns to Avoid

### Anti-Pattern 1: Alerting on Everything

**Problem**: Too many alerts create noise and fatigue

**Solution**: Be selective; only alert on user-impacting issues

### Anti-Pattern 2: Vague Alert Messages

**Problem**: "Service X is down" - which instance? What's the impact?

**Solution**: Include specific details and context

### Anti-Pattern 3: Alerts Without Runbooks

**Problem**: Alerts that don't explain what to do

**Solution**: Every alert must have an associated runbook

### Anti-Pattern 4: Static Thresholds

**Problem**: 80% CPU might be normal during peak hours

**Solution**: Use contextual, adaptive thresholds

### Anti-Pattern 5: Ignoring Alert Quality

**Problem**: Accepting high false positive rates

**Solution**: Regularly review and tune alert precision

## Implementation Checklist

### Pre-Implementation

- [ ] Define alert severity levels and escalation policies
- [ ] Create runbook templates
- [ ] Set up alert routing configuration
- [ ] Define SLOs that alerts will protect

### Alert Development

- [ ] Each alert has clear success criteria
- [ ] Alert conditions tested against historical data
- [ ] Runbook created and accessible
- [ ] Severity and routing configured
- [ ] Context and suggested actions included

### Post-Implementation

- [ ] Monitor alert precision and recall
- [ ] Regular review of alert fatigue metrics
- [ ] Quarterly alert effectiveness review
- [ ] Team training on alert response procedures

### Quality Assurance

- [ ] Test alerts fire during controlled failures
- [ ] Verify alerts resolve when conditions improve
- [ ] Confirm runbooks are accurate and helpful
- [ ] Validate escalation paths work correctly

Remember: Great alerts are invisible when things work and invaluable when things break. Focus on quality over quantity, and always optimize for the human who will respond to the alert at 3 AM.