- SLO Designer: generates comprehensive SLI/SLO frameworks with error budgets and burn rate alerts - Alert Optimizer: analyzes and optimizes alert configurations to reduce noise and improve effectiveness - Dashboard Generator: creates role-based dashboard specifications with golden signals coverage Includes comprehensive documentation, sample data, and expected outputs for testing.
571 lines
14 KiB
Markdown
571 lines
14 KiB
Markdown
# Dashboard Best Practices: Design for Insight and Action
|
|
|
|
## Introduction
|
|
|
|
A well-designed dashboard is like a good story - it guides you through the data with purpose and clarity. This guide provides practical patterns for creating dashboards that inform decisions and enable quick troubleshooting.
|
|
|
|
## Design Principles
|
|
|
|
### The Hierarchy of Information
|
|
|
|
#### Primary Information (Top Third)
|
|
- Service health status
|
|
- SLO achievement
|
|
- Critical alerts
|
|
- Business KPIs
|
|
|
|
#### Secondary Information (Middle Third)
|
|
- Golden signals (latency, traffic, errors, saturation)
|
|
- Resource utilization
|
|
- Throughput and performance metrics
|
|
|
|
#### Tertiary Information (Bottom Third)
|
|
- Detailed breakdowns
|
|
- Historical trends
|
|
- Dependency status
|
|
- Debug information
|
|
|
|
### Visual Design Principles
|
|
|
|
#### Rule of 7±2
|
|
- Maximum 7±2 panels per screen
|
|
- Group related information together
|
|
- Use sections to organize complexity
|
|
|
|
#### Color Psychology
|
|
- **Red**: Critical issues, danger, immediate attention needed
|
|
- **Yellow/Orange**: Warnings, caution, degraded state
|
|
- **Green**: Healthy, normal operation, success
|
|
- **Blue**: Information, neutral metrics, capacity
|
|
- **Gray**: Disabled, unknown, or baseline states
|
|
|
|
#### Chart Selection Guide
|
|
- **Line charts**: Time series, trends, comparisons over time
|
|
- **Bar charts**: Categorical comparisons, top N lists
|
|
- **Gauges**: Single value with defined good/bad ranges
|
|
- **Stat panels**: Key metrics, percentages, counts
|
|
- **Heatmaps**: Distribution data, correlation analysis
|
|
- **Tables**: Detailed breakdowns, multi-dimensional data
|
|
|
|
## Dashboard Archetypes
|
|
|
|
### The Overview Dashboard
|
|
|
|
**Purpose**: High-level health check and business metrics
|
|
**Audience**: Executives, managers, cross-team stakeholders
|
|
**Update Frequency**: 5-15 minutes
|
|
|
|
```yaml
|
|
sections:
|
|
- title: "Business Health"
|
|
panels:
|
|
- service_availability_summary
|
|
- revenue_per_hour
|
|
- active_users
|
|
- conversion_rate
|
|
|
|
- title: "System Health"
|
|
panels:
|
|
- critical_alerts_count
|
|
- slo_achievement_summary
|
|
- error_budget_remaining
|
|
- deployment_status
|
|
```
|
|
|
|
### The SRE Operational Dashboard
|
|
|
|
**Purpose**: Real-time monitoring and incident response
|
|
**Audience**: SRE, on-call engineers
|
|
**Update Frequency**: 15-30 seconds
|
|
|
|
```yaml
|
|
sections:
|
|
- title: "Service Status"
|
|
panels:
|
|
- service_up_status
|
|
- active_incidents
|
|
- recent_deployments
|
|
|
|
- title: "Golden Signals"
|
|
panels:
|
|
- latency_percentiles
|
|
- request_rate
|
|
- error_rate
|
|
- resource_saturation
|
|
|
|
- title: "Infrastructure"
|
|
panels:
|
|
- cpu_memory_utilization
|
|
- network_io
|
|
- disk_space
|
|
```
|
|
|
|
### The Developer Debug Dashboard
|
|
|
|
**Purpose**: Deep-dive troubleshooting and performance analysis
|
|
**Audience**: Development teams
|
|
**Update Frequency**: 30 seconds - 2 minutes
|
|
|
|
```yaml
|
|
sections:
|
|
- title: "Application Performance"
|
|
panels:
|
|
- endpoint_latency_breakdown
|
|
- database_query_performance
|
|
- cache_hit_rates
|
|
- queue_depths
|
|
|
|
- title: "Errors and Logs"
|
|
panels:
|
|
- error_rate_by_endpoint
|
|
- log_volume_by_level
|
|
- exception_types
|
|
- slow_queries
|
|
```
|
|
|
|
## Layout Patterns
|
|
|
|
### The F-Pattern Layout
|
|
|
|
Based on eye-tracking studies, users scan in an F-pattern:
|
|
|
|
```
|
|
[Critical Status] [SLO Summary ] [Error Budget ]
|
|
[Latency ] [Traffic ] [Errors ]
|
|
[Saturation ] [Resource Use ] [Detailed View]
|
|
[Historical ] [Dependencies ] [Debug Info ]
|
|
```
|
|
|
|
### The Z-Pattern Layout
|
|
|
|
For executive dashboards, follow the Z-pattern:
|
|
|
|
```
|
|
[Business KPIs ] → [System Status]
|
|
↓ ↓
|
|
[Trend Analysis ] ← [Key Metrics ]
|
|
```
|
|
|
|
### Responsive Design
|
|
|
|
#### Desktop (1920x1080)
|
|
- 24-column grid
|
|
- Panels can be 6, 8, 12, or 24 units wide
|
|
- 4-6 rows visible without scrolling
|
|
|
|
#### Laptop (1366x768)
|
|
- Stack wider panels vertically
|
|
- Reduce panel heights
|
|
- Prioritize most critical information
|
|
|
|
#### Mobile (768px width)
|
|
- Single column layout
|
|
- Simplified panels
|
|
- Touch-friendly controls
|
|
|
|
## Effective Panel Design
|
|
|
|
### Stat Panels
|
|
|
|
```yaml
|
|
# Good: Clear value with context
|
|
- title: "API Availability"
|
|
type: stat
|
|
targets:
|
|
- expr: avg(up{service="api"}) * 100
|
|
field_config:
|
|
unit: percent
|
|
thresholds:
|
|
steps:
|
|
- color: red
|
|
value: 0
|
|
- color: yellow
|
|
value: 99
|
|
- color: green
|
|
value: 99.9
|
|
options:
|
|
color_mode: background
|
|
text_mode: value_and_name
|
|
```
|
|
|
|
### Time Series Panels
|
|
|
|
```yaml
|
|
# Good: Multiple related metrics with clear legend
|
|
- title: "Request Latency"
|
|
type: timeseries
|
|
targets:
|
|
- expr: histogram_quantile(0.50, rate(http_duration_bucket[5m]))
|
|
legend: "P50"
|
|
- expr: histogram_quantile(0.95, rate(http_duration_bucket[5m]))
|
|
legend: "P95"
|
|
- expr: histogram_quantile(0.99, rate(http_duration_bucket[5m]))
|
|
legend: "P99"
|
|
field_config:
|
|
unit: ms
|
|
custom:
|
|
draw_style: line
|
|
fill_opacity: 10
|
|
options:
|
|
legend:
|
|
display_mode: table
|
|
placement: bottom
|
|
values: [min, max, mean, last]
|
|
```
|
|
|
|
### Table Panels
|
|
|
|
```yaml
|
|
# Good: Top N with relevant columns
|
|
- title: "Slowest Endpoints"
|
|
type: table
|
|
targets:
|
|
- expr: topk(10, histogram_quantile(0.95, sum by (handler)(rate(http_duration_bucket[5m]))))
|
|
format: table
|
|
instant: true
|
|
transformations:
|
|
- id: organize
|
|
options:
|
|
exclude_by_name:
|
|
Time: true
|
|
rename_by_name:
|
|
Value: "P95 Latency (ms)"
|
|
handler: "Endpoint"
|
|
```
|
|
|
|
## Color and Visualization Best Practices
|
|
|
|
### Threshold Configuration
|
|
|
|
```yaml
|
|
# Traffic light system with meaningful boundaries
|
|
thresholds:
|
|
steps:
|
|
- color: green # Good performance
|
|
value: null # Default
|
|
- color: yellow # Degraded performance
|
|
value: 95 # 95th percentile of historical normal
|
|
- color: orange # Poor performance
|
|
value: 99 # 99th percentile of historical normal
|
|
- color: red # Critical performance
|
|
value: 99.9 # Worst case scenario
|
|
```
|
|
|
|
### Color Blind Friendly Palettes
|
|
|
|
```yaml
|
|
# Use patterns and shapes in addition to color
|
|
field_config:
|
|
overrides:
|
|
- matcher:
|
|
id: byName
|
|
options: "Critical"
|
|
properties:
|
|
- id: color
|
|
value:
|
|
mode: fixed
|
|
fixed_color: "#d73027" # Red-orange for protanopia
|
|
- id: custom.draw_style
|
|
value: "points" # Different shape
|
|
```
|
|
|
|
### Consistent Color Semantics
|
|
|
|
- **Success/Health**: Green (#28a745)
|
|
- **Warning/Degraded**: Yellow (#ffc107)
|
|
- **Error/Critical**: Red (#dc3545)
|
|
- **Information**: Blue (#007bff)
|
|
- **Neutral**: Gray (#6c757d)
|
|
|
|
## Time Range Strategy
|
|
|
|
### Default Time Ranges by Dashboard Type
|
|
|
|
#### Real-time Operational
|
|
- **Default**: Last 15 minutes
|
|
- **Quick options**: 5m, 15m, 1h, 4h
|
|
- **Auto-refresh**: 15-30 seconds
|
|
|
|
#### Troubleshooting
|
|
- **Default**: Last 1 hour
|
|
- **Quick options**: 15m, 1h, 4h, 12h, 1d
|
|
- **Auto-refresh**: 1 minute
|
|
|
|
#### Business Review
|
|
- **Default**: Last 24 hours
|
|
- **Quick options**: 1d, 7d, 30d, 90d
|
|
- **Auto-refresh**: 5 minutes
|
|
|
|
#### Capacity Planning
|
|
- **Default**: Last 7 days
|
|
- **Quick options**: 7d, 30d, 90d, 1y
|
|
- **Auto-refresh**: 15 minutes
|
|
|
|
### Time Range Annotations
|
|
|
|
```yaml
|
|
# Add context for time-based events
|
|
annotations:
|
|
- name: "Deployments"
|
|
datasource: "Prometheus"
|
|
expr: "deployment_timestamp"
|
|
title_format: "Deploy {{ version }}"
|
|
text_format: "Deployed version {{ version }} to {{ environment }}"
|
|
|
|
- name: "Incidents"
|
|
datasource: "Incident API"
|
|
query: "incidents.json?service={{ service }}"
|
|
color: "red"
|
|
```
|
|
|
|
## Interactive Features
|
|
|
|
### Template Variables
|
|
|
|
```yaml
|
|
# Service selector
|
|
- name: service
|
|
type: query
|
|
query: label_values(up, service)
|
|
current:
|
|
text: All
|
|
value: $__all
|
|
include_all: true
|
|
multi: true
|
|
|
|
# Environment selector
|
|
- name: environment
|
|
type: query
|
|
query: label_values(up{service="$service"}, environment)
|
|
current:
|
|
text: production
|
|
value: production
|
|
```
|
|
|
|
### Drill-Down Links
|
|
|
|
```yaml
|
|
# Panel-level drill-downs
|
|
- title: "Error Rate"
|
|
type: timeseries
|
|
# ... other config ...
|
|
options:
|
|
data_links:
|
|
- title: "View Error Logs"
|
|
url: "/d/logs-dashboard?var-service=${__field.labels.service}&from=${__from}&to=${__to}"
|
|
- title: "Error Traces"
|
|
url: "/d/traces-dashboard?var-service=${__field.labels.service}"
|
|
```
|
|
|
|
### Dynamic Panel Titles
|
|
|
|
```yaml
|
|
- title: "${service} - Request Rate" # Uses template variable
|
|
type: timeseries
|
|
# Title updates automatically when service variable changes
|
|
```
|
|
|
|
## Performance Optimization
|
|
|
|
### Query Optimization
|
|
|
|
#### Use Recording Rules
|
|
```yaml
|
|
# Instead of complex queries in dashboards
|
|
groups:
|
|
- name: http_requests
|
|
rules:
|
|
- record: http_request_rate_5m
|
|
expr: sum(rate(http_requests_total[5m])) by (service, method, handler)
|
|
|
|
- record: http_request_latency_p95_5m
|
|
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
|
|
```
|
|
|
|
#### Limit Data Points
|
|
```yaml
|
|
# Good: Reasonable resolution for dashboard
|
|
- expr: http_request_rate_5m[1h]
|
|
interval: 15s # One point every 15 seconds
|
|
|
|
# Bad: Too many points for visualization
|
|
- expr: http_request_rate_1s[1h] # 3600 points!
|
|
```
|
|
|
|
### Dashboard Performance
|
|
|
|
#### Panel Limits
|
|
- **Maximum panels per dashboard**: 20-30
|
|
- **Maximum queries per panel**: 10
|
|
- **Maximum time series per panel**: 50
|
|
|
|
#### Caching Strategy
|
|
```yaml
|
|
# Use appropriate cache headers
|
|
cache_timeout: 30 # Cache for 30 seconds on fast-changing panels
|
|
cache_timeout: 300 # Cache for 5 minutes on slow-changing panels
|
|
```
|
|
|
|
## Accessibility
|
|
|
|
### Screen Reader Support
|
|
|
|
```yaml
|
|
# Provide text alternatives for visual elements
|
|
- title: "Service Health Status"
|
|
type: stat
|
|
options:
|
|
text_mode: value_and_name # Includes both value and description
|
|
field_config:
|
|
mappings:
|
|
- options:
|
|
"1":
|
|
text: "Healthy"
|
|
color: "green"
|
|
"0":
|
|
text: "Unhealthy"
|
|
color: "red"
|
|
```
|
|
|
|
### Keyboard Navigation
|
|
|
|
- Ensure all interactive elements are keyboard accessible
|
|
- Provide logical tab order
|
|
- Include skip links for complex dashboards
|
|
|
|
### High Contrast Mode
|
|
|
|
```yaml
|
|
# Test dashboards work in high contrast mode
|
|
theme: high_contrast
|
|
colors:
|
|
- "#000000" # Pure black
|
|
- "#ffffff" # Pure white
|
|
- "#ffff00" # Pure yellow
|
|
- "#ff0000" # Pure red
|
|
```
|
|
|
|
## Testing and Validation
|
|
|
|
### Dashboard Testing Checklist
|
|
|
|
#### Functional Testing
|
|
- [ ] All panels load without errors
|
|
- [ ] Template variables filter correctly
|
|
- [ ] Time range changes update all panels
|
|
- [ ] Drill-down links work as expected
|
|
- [ ] Auto-refresh functions properly
|
|
|
|
#### Visual Testing
|
|
- [ ] Dashboard renders correctly on different screen sizes
|
|
- [ ] Colors are distinguishable and meaningful
|
|
- [ ] Text is readable at normal zoom levels
|
|
- [ ] Legends and labels are clear
|
|
|
|
#### Performance Testing
|
|
- [ ] Dashboard loads in < 5 seconds
|
|
- [ ] No queries timeout under normal load
|
|
- [ ] Auto-refresh doesn't cause browser lag
|
|
- [ ] Memory usage remains reasonable
|
|
|
|
#### Usability Testing
|
|
- [ ] New team members can understand the dashboard
|
|
- [ ] Action items are clear during incidents
|
|
- [ ] Key information is quickly discoverable
|
|
- [ ] Dashboard supports common troubleshooting workflows
|
|
|
|
## Maintenance and Governance
|
|
|
|
### Dashboard Lifecycle
|
|
|
|
#### Creation
|
|
1. Define dashboard purpose and audience
|
|
2. Identify key metrics and success criteria
|
|
3. Design layout following established patterns
|
|
4. Implement with consistent styling
|
|
5. Test with real data and user scenarios
|
|
|
|
#### Maintenance
|
|
- **Weekly**: Check for broken panels or queries
|
|
- **Monthly**: Review dashboard usage analytics
|
|
- **Quarterly**: Gather user feedback and iterate
|
|
- **Annually**: Major review and potential redesign
|
|
|
|
#### Retirement
|
|
- Archive dashboards that are no longer used
|
|
- Migrate users to replacement dashboards
|
|
- Document lessons learned
|
|
|
|
### Dashboard Standards
|
|
|
|
```yaml
|
|
# Organization dashboard standards
|
|
standards:
|
|
naming_convention: "[Team] [Service] - [Purpose]"
|
|
tags: [team, service_type, environment, purpose]
|
|
refresh_intervals: [15s, 30s, 1m, 5m, 15m]
|
|
time_ranges: [5m, 15m, 1h, 4h, 1d, 7d, 30d]
|
|
color_scheme: "company_standard"
|
|
max_panels_per_dashboard: 25
|
|
```
|
|
|
|
## Advanced Patterns
|
|
|
|
### Composite Dashboards
|
|
|
|
```yaml
|
|
# Dashboard that includes panels from other dashboards
|
|
- title: "Service Overview"
|
|
type: dashlist
|
|
targets:
|
|
- "service-health"
|
|
- "service-performance"
|
|
- "service-business-metrics"
|
|
options:
|
|
show_headings: true
|
|
max_items: 10
|
|
```
|
|
|
|
### Dynamic Dashboard Generation
|
|
|
|
```python
|
|
# Generate dashboards from service definitions
|
|
def generate_service_dashboard(service_config):
|
|
panels = []
|
|
|
|
# Always include golden signals
|
|
panels.extend(generate_golden_signals_panels(service_config))
|
|
|
|
# Add service-specific panels
|
|
if service_config.type == 'database':
|
|
panels.extend(generate_database_panels(service_config))
|
|
elif service_config.type == 'queue':
|
|
panels.extend(generate_queue_panels(service_config))
|
|
|
|
return {
|
|
'title': f"{service_config.name} - Operational Dashboard",
|
|
'panels': panels,
|
|
'variables': generate_variables(service_config)
|
|
}
|
|
```
|
|
|
|
### A/B Testing for Dashboards
|
|
|
|
```yaml
|
|
# Test different dashboard designs with different teams
|
|
experiment:
|
|
name: "dashboard_layout_test"
|
|
variants:
|
|
- name: "traditional_layout"
|
|
weight: 50
|
|
config: "dashboard_v1.json"
|
|
- name: "f_pattern_layout"
|
|
weight: 50
|
|
config: "dashboard_v2.json"
|
|
success_metrics:
|
|
- "time_to_insight"
|
|
- "user_satisfaction"
|
|
- "troubleshooting_efficiency"
|
|
```
|
|
|
|
Remember: A dashboard should tell a story about your system's health and guide users toward the right actions. Focus on clarity over complexity, and always optimize for the person who will use it during a stressful incident. |