SLA Management Guide

Comprehensive reference for Service Level Agreements, Objectives, and Indicators. Designed for incident commanders who must understand, protect, and communicate SLA status during and after incidents.


1. Definitions & Relationships

Service Level Indicator (SLI)

An SLI is the quantitative measurement of a specific aspect of service quality. SLIs are the raw data that feed everything above them. They must be precisely defined, automatically collected, and unambiguous.

Common SLI types by service:

| Service Type | SLI | Measurement Method |
|---|---|---|
| Web Application | Request latency (p50, p95, p99) | Server-side histogram |
| Web Application | Availability (successful responses / total requests) | Load balancer logs |
| REST API | Error rate (5xx responses / total responses) | API gateway metrics |
| REST API | Throughput (requests per second) | Counter metric |
| Database | Query latency (p99) | Slow query log + APM |
| Database | Replication lag (seconds) | Replica monitoring |
| Message Queue | End-to-end delivery latency | Timestamp comparison |
| Message Queue | Message loss rate | Producer vs consumer counts |
| Storage | Durability (objects lost / objects stored) | Integrity checksums |
| CDN | Cache hit ratio | Edge server logs |

SLI specification formula:

SLI = (good events / total events) x 100

For availability: SLI = (successful requests / total requests) x 100

For latency: SLI = (requests faster than threshold / total requests) x 100
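The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production collector: the function name `sli_percent` and the sample counts are hypothetical.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Return the SLI as the percentage of good events out of total events."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    return good_events / total_events * 100


# Availability SLI: hypothetical count of successful responses vs all requests.
availability = sli_percent(999_500, 1_000_000)

# Latency SLI: hypothetical count of requests faster than the chosen threshold.
latency_sli = sli_percent(9_620, 10_000)

print(f"availability={availability:.2f}% latency={latency_sli:.2f}%")
```

The same function serves both SLI types because only the definition of a "good event" changes.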

Service Level Objective (SLO)

An SLO is the target value or range for an SLI. It defines the acceptable level of reliability. SLOs are internal goals that engineering teams commit to.

Setting meaningful SLOs:

  1. Measure the current baseline over 30 days minimum
  2. Subtract a safety margin (typically 0.05%-0.1% below actual performance)
  3. Validate against user expectations and business requirements
  4. Never set an SLO higher than what the system can sustain without heroics

Common pitfall: Setting 99.99% availability when 99.9% meets every user need. The jump from 99.9% to 99.99% is a 10x reduction in allowed downtime and typically requires 3-5x the engineering investment.

SLO examples:

  • 99.9% of HTTP requests return a non-5xx response within each calendar month
  • 95% of API requests complete in under 200ms (p95 latency)
  • 99.95% of messages are delivered within 30 seconds of production

Service Level Agreement (SLA)

An SLA is a formal contract between a service provider and its customers that specifies consequences for failing to meet defined service levels. SLAs must always be looser than SLOs to provide a buffer zone.

Rule of thumb: If your SLO is 99.95%, your SLA should be 99.9% or lower. The gap between SLO and SLA is your safety margin.

The Hierarchy

  SLA (99.9%)     ← Contract with customers, financial penalties
    ↑ backs
  SLO (99.95%)    ← Internal target, triggers error budget policy
    ↑ targets
  SLI (measured)  ← Raw metric: actual uptime = 99.97% this month

Standard combinations by tier:

| Tier | SLI (Metric) | SLO (Target) | SLA (Contract) | Allowed Downtime/Month |
|---|---|---|---|---|
| Critical (payments) | Availability | 99.99% | 99.95% | SLO: 4.38 min / SLA: 21.9 min |
| High (core API) | Availability | 99.95% | 99.9% | SLO: 21.9 min / SLA: 43.8 min |
| Standard (dashboard) | Availability | 99.9% | 99.5% | SLO: 43.8 min / SLA: 3.65 hrs |
| Low (internal tools) | Availability | 99.5% | 99.0% | SLO: 3.65 hrs / SLA: 7.3 hrs |

2. Error Budget Policy

What Is an Error Budget

An error budget is the maximum amount of unreliability a service can have within a given period while still meeting its SLO. It is calculated as:

Error Budget = 1 - SLO target

For a 99.9% SLO over a 30-day month (43,200 minutes):

Error Budget = 1 - 0.999 = 0.001 = 0.1%
Allowed Downtime = 43,200 x 0.001 = 43.2 minutes
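As a rough sketch, the same calculation can be expressed as a small Python helper. The function name and the `window_days` parameter are illustrative assumptions; the second call uses an average month of 365.25 / 12 days.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: float = 30) -> float:
    """Convert an SLO percentage into allowed downtime minutes for a window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    error_budget = 1 - slo_percent / 100      # e.g. 0.001 for a 99.9% SLO
    window_minutes = window_days * 24 * 60    # 43,200 for a 30-day month
    return window_minutes * error_budget


# 30-day month, as in the calculation above.
print(round(allowed_downtime_minutes(99.9), 1))

# Average calendar month (365.25 / 12 days), a common convention for published tables.
print(round(allowed_downtime_minutes(99.9, window_days=365.25 / 12), 2))
```

The window length matters: a 30-day month and an average month give slightly different budgets for the same SLO.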

Downtime Allowances by SLO

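| SLO | Error Budget | Monthly Downtime | Quarterly Downtime | Annual Downtime |
|---|---|---|---|---|
| 99.0% | 1.0% | 7 hrs 18 min | 21 hrs 54 min | 3 days 15 hrs |
| 99.5% | 0.5% | 3 hrs 39 min | 10 hrs 57 min | 1 day 19 hrs |
| 99.9% | 0.1% | 43.8 min | 2 hrs 11 min | 8 hrs 46 min |
| 99.95% | 0.05% | 21.9 min | 1 hr 6 min | 4 hrs 23 min |
| 99.99% | 0.01% | 4.38 min | 13.1 min | 52.6 min |
| 99.999% | 0.001% | 26.3 sec | 78.9 sec | 5.26 min |

Monthly figures assume an average month of 30.44 days (365.25 / 12), which is why 99.9% shows 43.8 minutes here rather than the 43.2 minutes of a 30-day month.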

Error Budget Consumption Tracking

Track budget consumption as a percentage of the total budget used so far in the current window:

Budget Consumed (%) = (actual bad minutes / allowed bad minutes) x 100

Example: SLO is 99.9% (43.8 min budget/month). On day 10, you have had 15 minutes of downtime.

Budget Consumed = (15 / 43.8) x 100 = 34.2%
Expected consumption at day 10 = (10/30) x 100 = 33.3%
Status: Slightly over pace (34.2% consumed at 33.3% of month elapsed)
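The pace comparison above can be sketched as follows. The function name and return shape are hypothetical; the inputs reproduce the day-10 example.

```python
def budget_status(bad_minutes: float, budget_minutes: float,
                  day_of_window: int, window_days: int = 30):
    """Return (consumed %, expected % at this point in the window, over_pace)."""
    if budget_minutes <= 0 or window_days <= 0:
        raise ValueError("budget_minutes and window_days must be positive")
    consumed = bad_minutes / budget_minutes * 100
    expected = day_of_window / window_days * 100
    return consumed, expected, consumed > expected


consumed, expected, over_pace = budget_status(15, 43.8, day_of_window=10)
print(f"consumed={consumed:.1f}% expected={expected:.1f}% over_pace={over_pace}")
```

Comparing consumption against elapsed time is a simple linear model; it assumes the budget should be spent evenly across the window.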

Burn Rate

Burn rate measures how fast the error budget is being consumed relative to the steady-state rate:

Burn Rate = (error rate observed / error rate allowed by SLO)

A burn rate of 1.0 means the budget will be exactly exhausted by the end of the window. A burn rate of 10 means the budget will be exhausted in 1/10th of the window.

Burn rate to time-to-exhaustion (30-day month):

| Burn Rate | Budget Exhausted In | Urgency |
|---|---|---|
| 1x | 30 days | On pace, monitoring only |
| 2x | 15 days | Elevated attention |
| 6x | 5 days | Active investigation required |
| 14.4x | 2.08 days (~50 hours) | Immediate page |
| 36x | 20 hours | Critical, all-hands |
| 720x | 1 hour | Total outage scenario |
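The time-to-exhaustion values follow directly from dividing the window length by the burn rate. A minimal sketch, assuming a constant burn rate and a hypothetical helper name:

```python
def hours_to_exhaustion(burn_rate: float, window_days: float = 30) -> float:
    """Hours until the full error budget is spent at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return window_days * 24 / burn_rate


for rate in (1, 2, 6, 14.4, 36, 720):
    print(f"{rate}x -> {hours_to_exhaustion(rate):.1f} hours")
```

This assumes the budget starts full; if part of it is already consumed, the remaining time shrinks proportionally.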

Error Budget Exhaustion Policy

As the error budget is consumed, the following actions trigger at each threshold:

Tier 1 - Budget at 75% consumed (Yellow):

  • Notify service team lead via automated alert
  • Freeze non-critical deployments to the affected service
  • Conduct pre-emptive review of upcoming changes for risk
  • Increase monitoring sensitivity (lower alert thresholds)

Tier 2 - Budget at 100% consumed (Orange):

  • Hard feature freeze on the affected service
  • Mandatory reliability sprint: all engineering effort redirected to reliability
  • Daily status updates to engineering leadership
  • Postmortem required for the incidents that consumed the budget
  • Freeze lasts until budget replenishes to 50% or systemic fixes are verified

Tier 3 - Budget at 150% consumed / SLA breach imminent (Red):

  • Escalation to VP Engineering and CTO
  • Cross-team war room if dependencies are involved
  • Customer communication prepared and staged
  • Legal and finance teams briefed on potential SLA credit obligations
  • Recovery plan with specific milestones required within 24 hours
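One way to wire these thresholds into tooling is a simple lookup, sketched below. The function name is hypothetical, and the "green" label for consumption below 75% is an assumption added for illustration; it does not appear in the policy above.

```python
def budget_tier(consumed_percent: float) -> str:
    """Map error budget consumption (%) to the policy tier colour."""
    if consumed_percent < 0:
        raise ValueError("consumed_percent cannot be negative")
    if consumed_percent >= 150:
        return "red"      # Tier 3: executive escalation
    if consumed_percent >= 100:
        return "orange"   # Tier 2: hard feature freeze
    if consumed_percent >= 75:
        return "yellow"   # Tier 1: freeze non-critical deployments
    return "green"        # Assumed label: below all policy thresholds


for pct in (40, 75, 120, 150):
    print(pct, budget_tier(pct))
```

Checking the highest threshold first keeps each branch a single comparison.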

Error Budget Policy Template

SERVICE: [service-name]
SLO: [target]% availability over [rolling 30-day / calendar month] window
ERROR BUDGET: [calculated] minutes per window

BUDGET THRESHOLDS:
  - 50% consumed: Team notification, increased vigilance
  - 75% consumed: Feature freeze for this service, reliability focus
  - 100% consumed: Full feature freeze, reliability sprint mandatory
  - SLA threshold crossed: Executive escalation, customer communication

REVIEW CADENCE: Monthly budget review on [day], quarterly SLO adjustment

EXCEPTIONS: Planned maintenance windows excluded if communicated 72+ hours in advance
            and within agreed maintenance allowance.

APPROVED BY: [Engineering Lead] / [Product Lead] / [Date]

3. SLA Breach Handling

Detection Methods

Automated detection (primary):

  • Real-time monitoring dashboards with SLA burn-rate alerts
  • Automated SLA compliance calculations running every 5 minutes
  • Threshold-based alerts when cumulative downtime approaches SLA limits
  • Synthetic monitoring (external probes) for customer-perspective validation

Manual review (secondary):

  • Monthly SLA compliance reports generated on the 1st of each month
  • Customer-reported incidents cross-referenced with internal metrics
  • Quarterly audits comparing measured SLIs against contracted SLAs
  • Discrepancy review between internal metrics and customer-perceived availability

Breach Classification

Minor Breach:

  • SLA missed by less than 0.05 percentage points (e.g., 99.87% vs 99.9% SLA)
  • Fewer than 3 discrete incidents contributed
  • No single incident exceeded 30 minutes
  • Customer impact was limited or partial degradation only
  • Financial credit: typically 5-10% of monthly service fee

Major Breach:

  • SLA missed by 0.05 to 0.5 percentage points
  • Extended outage of 1-4 hours in a single incident, or multiple significant incidents
  • Clear customer impact with support tickets generated
  • Financial credit: typically 10-25% of monthly service fee

Critical Breach:

  • SLA missed by more than 0.5 percentage points
  • Total outage exceeding 4 hours, or repeated major incidents in same window
  • Data loss, security incident, or compliance violation involved
  • Financial credit: typically 25-100% of monthly service fee
  • May trigger contract termination clauses
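A first-pass classification can be automated from the shortfall alone, as sketched below. This is a simplification: the function name is hypothetical, and the other criteria listed above (incident count, duration, data loss) still need human review before a class is confirmed.

```python
def classify_breach(measured_percent: float, sla_percent: float) -> str:
    """Classify an SLA breach by shortfall in percentage points only."""
    # Round to avoid floating-point noise at the 0.05 and 0.5 boundaries.
    shortfall = round(sla_percent - measured_percent, 6)
    if shortfall <= 0:
        return "none"
    if shortfall < 0.05:
        return "minor"
    if shortfall <= 0.5:
        return "major"
    return "critical"


print(classify_breach(99.87, 99.9))    # shortfall of 0.03 points
print(classify_breach(99.722, 99.95))  # shortfall of 0.228 points
print(classify_breach(98.9, 99.9))     # shortfall of 1.0 points
```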

Response Protocol

For Minor Breach (within 3 business days):

  1. Generate SLA compliance report with exact metrics
  2. Document contributing incidents with root causes
  3. Send proactive notification to customer success manager
  4. Issue service credits if contractually required (do not wait for customer to ask)
  5. File internal improvement ticket with 30-day remediation target

For Major Breach (within 24 hours):

  1. Incident commander confirms SLA impact calculation
  2. Draft customer communication (see template below)
  3. Executive sponsor reviews and approves communication
  4. Issue service credits with detailed breakdown
  5. Schedule root cause review with customer within 5 business days
  6. Produce remediation plan with committed timelines

For Critical Breach (immediate):

  1. Activate executive escalation chain
  2. Legal team reviews contractual exposure
  3. Finance team calculates credit obligations
  4. Customer communication from VP or C-level within 4 hours
  5. Dedicated remediation task force assigned
  6. Weekly status updates to customer until remediation complete
  7. Formal postmortem document shared with customer within 10 business days

Customer Communication Template

Subject: Service Level Update - [Service Name] - [Month Year]

Dear [Customer Name],

We are writing to inform you that [Service Name] did not meet the committed
service level of [SLA target]% availability during [time period].

MEASURED PERFORMANCE: [actual]% availability
COMMITTED SLA: [SLA target]% availability
SHORTFALL: [delta] percentage points

CONTRIBUTING FACTORS:
- [Date/Time]: [Brief description of incident] ([duration] impact)
- [Date/Time]: [Brief description of incident] ([duration] impact)

SERVICE CREDIT: In accordance with our agreement, a credit of [amount/percentage]
will be applied to your next invoice.

REMEDIATION ACTIONS:
1. [Specific technical fix with completion date]
2. [Process improvement with implementation date]
3. [Monitoring enhancement with deployment date]

We take our service commitments seriously. [Name], [Title] is personally
overseeing the remediation and is available to discuss further at your convenience.

Sincerely,
[Name, Title]

Contractual and record-keeping considerations:

  • Maintain auditable records of all SLA measurements for the full contract term plus 2 years
  • SLA calculations must use the measurement methodology defined in the contract, not internal approximations
  • Force majeure clauses typically exclude events such as natural disasters from SLA calculations, but verify per contract
  • Planned maintenance exclusions must match the exact notification procedures in the contract
  • Multi-region SLAs may have separate calculations per region; verify aggregation method

4. Incident-to-SLA Mapping

Downtime Calculation Methodologies

Full outage: Service completely unavailable. Every minute counts as a full minute of downtime.

Downtime = End Time - Start Time (in minutes)

Partial degradation: Service available but impaired. Apply a degradation factor:

Effective Downtime = Actual Duration x Degradation Factor

| Degradation Level | Factor | Description |
|---|---|---|
| Complete outage | 1.0 | Service fully unavailable |
| Severe degradation | 0.75 | >50% of requests failing or >10x latency |
| Moderate degradation | 0.5 | 10-50% of requests affected or 3-10x latency |
| Minor degradation | 0.25 | <10% of requests affected or <3x latency increase |
| Cosmetic / non-functional | 0.0 | No impact on core SLI metrics |

Note: The exact degradation factors must be agreed upon in the SLA contract. The above are industry-standard starting points.
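The weighting can be sketched as a lookup plus a sum. The dictionary keys and function name are hypothetical, and the factors are the starting points from the table above rather than contract values. The sample incidents match Example 3 in the Worked Examples.

```python
# Starting-point factors from the table; replace with the contract's values.
DEGRADATION_FACTORS = {
    "complete": 1.0,
    "severe": 0.75,
    "moderate": 0.5,
    "minor": 0.25,
    "cosmetic": 0.0,
}


def effective_downtime(incidents) -> float:
    """Sum effective downtime minutes for (duration_minutes, level) pairs."""
    return sum(minutes * DEGRADATION_FACTORS[level] for minutes, level in incidents)


incidents = [(45, "complete"), (120, "severe"), (30, "complete")]
print(effective_downtime(incidents))  # 165.0
```

An unknown level raises `KeyError`, which is deliberate here: an unclassified incident should not silently count as zero.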

Planned vs Unplanned Downtime

Most SLAs exclude pre-announced maintenance windows from availability calculations, subject to conditions:

  • Notification provided N hours/days in advance (commonly 72 hours)
  • Maintenance occurs within an agreed window (e.g., Sunday 02:00-06:00 UTC)
  • Total planned downtime does not exceed the monthly maintenance allowance (e.g., 4 hours/month)
  • Any overrun beyond the planned window counts as unplanned downtime

SLA Availability = (Total Minutes - Excluded Maintenance - Unplanned Downtime) / (Total Minutes - Excluded Maintenance) x 100
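A minimal sketch of this formula follows. The function name is hypothetical, and the 240 minutes of excluded maintenance in the second call is an illustrative figure, not a recommendation.

```python
def sla_availability(total_minutes: float, excluded_maintenance: float,
                     unplanned_downtime: float) -> float:
    """Availability (%) with excluded maintenance removed from the measured period."""
    measured_minutes = total_minutes - excluded_maintenance
    if measured_minutes <= 0:
        raise ValueError("measured period must be positive")
    if not 0 <= unplanned_downtime <= measured_minutes:
        raise ValueError("unplanned_downtime must fit within the measured period")
    return (measured_minutes - unplanned_downtime) / measured_minutes * 100


# 30-day month with 15 minutes of unplanned downtime and no maintenance.
print(f"{sla_availability(43_200, 0, 15):.3f}%")

# Same month with a hypothetical 240 minutes of excluded maintenance.
print(f"{sla_availability(43_200, 240, 15):.3f}%")
```

Excluding maintenance shrinks the denominator, so the same unplanned downtime costs slightly more availability.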

Multi-Service SLA Composition

When a customer-facing product depends on multiple services, composite SLA is calculated as:

Serial dependency (all must be up):

Composite SLA = SLA_A x SLA_B x SLA_C
Example: 99.9% x 99.95% x 99.99% = 99.84%

Parallel / redundant (any one must be up):

Composite Availability = 1 - ((1 - SLA_A) x (1 - SLA_B))
Example: 1 - ((1 - 0.999) x (1 - 0.999)) = 1 - 0.000001 = 99.9999%

This is critical during incidents: an outage in a shared dependency may breach SLAs for multiple customer-facing products simultaneously.
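Both composition rules can be sketched with `math.prod` (Python 3.8+). The function names are hypothetical, availabilities are passed as fractions (0.999 for 99.9%), and the model assumes failures are independent, which real shared dependencies often violate.

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """All components must be up: multiply the availabilities."""
    return prod(availabilities)


def parallel_availability(*availabilities: float) -> float:
    """Any one component must be up: 1 minus the product of failure probabilities."""
    return 1 - prod(1 - a for a in availabilities)


print(f"{serial_availability(0.999, 0.9995, 0.9999) * 100:.2f}%")   # 99.84%
print(f"{parallel_availability(0.999, 0.999) * 100:.4f}%")          # 99.9999%
```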

Worked Examples

Example 1: Simple outage

  • Service: Core API (SLA: 99.9%)
  • Month: 30 days = 43,200 minutes
  • Incident: Full outage from 14:23 to 14:38 UTC on the 12th (15 minutes)
  • No other incidents this month

Availability = (43,200 - 15) / 43,200 x 100 = 99.965%
SLA Status: PASS (99.965% > 99.9%)
Error Budget Consumed: 15 / 43.2 = 34.7%

Example 2: Partial degradation

  • Service: Payment Processing (SLA: 99.95%)
  • Month: 30 days = 43,200 minutes
  • Incident: 50% of transactions failing for 4 hours (240 minutes)
  • Degradation factor: 0.5 (moderate - 50% of requests affected)

Effective Downtime = 240 x 0.5 = 120 minutes
Availability = (43,200 - 120) / 43,200 x 100 = 99.722%
SLA Status: FAIL (99.722% < 99.95%)
Shortfall: 0.228 percentage points → Major Breach

Example 3: Multiple incidents

  • Service: Dashboard (SLA: 99.5%)
  • Month: 31 days = 44,640 minutes
  • Incident A: 45-minute full outage on the 5th
  • Incident B: 2-hour severe degradation (factor 0.75) on the 18th
  • Incident C: 30-minute full outage on the 25th

Total Effective Downtime = 45 + (120 x 0.75) + 30 = 45 + 90 + 30 = 165 minutes
Availability = (44,640 - 165) / 44,640 x 100 = 99.630%
SLA Status: PASS (99.630% > 99.5%)
Error Budget Consumed: 165 / 223.2 = 73.9% → Just under the 75% Yellow threshold; prepare to freeze non-critical deployments

5. SLO Best Practices

Start with User Journeys

Do not set SLOs based on infrastructure metrics. Start from what users experience:

  1. Identify critical user journeys (e.g., "User completes checkout")
  2. Map each journey to the services and dependencies involved
  3. Define what "good" looks like for each journey (fast, error-free, complete)
  4. Select the SLIs that most directly measure that user experience
  5. Set SLO targets that reflect the minimum acceptable user experience

A database with 99.99% uptime is meaningless if the API in front of it has a bug causing 5% error rates.

The Four Golden Signals as SLI Sources

From Google SRE, the four golden signals provide comprehensive service health:

| Signal | SLI Example | Typical SLO |
|---|---|---|
| Latency | p99 request duration < 500ms | 99% of requests under threshold |
| Traffic | Requests per second | N/A (capacity planning, not SLO) |
| Errors | 5xx rate as % of total requests | < 0.1% error rate over rolling window |
| Saturation | CPU/memory/queue depth | < 80% utilization (capacity SLI) |

For most services, latency and error rate are the two most important SLIs to back with SLOs.

Setting SLO Targets

  1. Collect 90 days of historical SLI data
  2. Calculate the 5th percentile performance (worst 5% of days)
  3. Set SLO slightly above that baseline (this ensures the SLO is achievable without heroics)
  4. Validate: would a breach at this level actually impact users negatively?
  5. Adjust upward only if user impact analysis demands it

Never set SLOs by aspiration. A 99.99% SLO on a service that has historically achieved 99.93% is a guaranteed source of perpetual firefighting with no reliability improvement.
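Step 2 of the procedure above can be sketched with the standard library's `statistics.quantiles`. The function name and the 90-day history are hypothetical; choosing the final target relative to this baseline remains a judgment call.

```python
from statistics import quantiles


def fifth_percentile(daily_sli_percent) -> float:
    """Return the 5th percentile of daily SLI values (worst 5% of days)."""
    if len(daily_sli_percent) < 2:
        raise ValueError("need at least two daily values")
    # n=20 yields cut points every 5%; the first one is the 5th percentile.
    return quantiles(daily_sli_percent, n=20)[0]


# Hypothetical 90-day history: 85 good days and 5 degraded days.
history = [99.99] * 85 + [99.90] * 5
baseline = fifth_percentile(history)
print(f"5th percentile baseline: {baseline:.2f}%")
```

Using daily values keeps one bad day from being averaged away by the rest of the window.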

Review Cadence

  • Weekly: Review current error budget burn rate, flag services approaching thresholds
  • Monthly: Full SLO compliance review, adjust alert thresholds if needed
  • Quarterly: Reassess SLO targets based on 90-day data, review SLA contract alignment
  • Annually: Strategic SLO review tied to product roadmap and infrastructure investments

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Vanity SLOs | Setting 99.99% to impress, then ignoring breaches | Set achievable targets, enforce budget policy |
| SLO Inflation | Ratcheting SLOs up whenever performance is good | Only increase SLOs when users demonstrably need it |
| Unmeasured SLAs | Committing contractual SLAs without actual SLI measurement | Instrument SLIs before signing SLA contracts |
| Copy-Paste SLOs | Same SLO for every service regardless of criticality | Tier services by business impact, set SLOs accordingly |
| Ignoring Dependencies | Setting aggressive SLOs without accounting for dependency reliability | Calculate composite SLA; your SLO cannot exceed the composite availability of your dependency chain |
| Alert-Free SLOs | Having SLOs but no automated alerting on budget consumption | Every SLO must have corresponding burn rate alerts |

6. Monitoring & Alerting for SLAs

Multi-Window Burn Rate Alerting

The Google SRE approach uses multiple time windows to balance speed of detection against alert noise. Each alert condition requires both a short window (for speed) and a long window (for confirmation):

Alert configuration matrix:

| Severity | Long Window | Long Threshold | Short Window | Short Threshold | Action |
|---|---|---|---|---|---|
| Critical (Page) | 1 hour | > 14.4x burn rate | 5 minutes | > 14.4x burn rate | Wake someone up |
| High (Page) | 6 hours | > 6x burn rate | 30 minutes | > 6x burn rate | Page on-call within 30 min |
| Medium (Ticket) | 3 days | > 1x burn rate | 6 hours | > 1x burn rate | Create ticket, next business day |

Why these specific numbers:

  • 14.4x burn rate over 1 hour consumes 2% of monthly budget in that hour. At this rate, the entire 30-day budget is gone in ~50 hours. This demands immediate human attention.
  • 6x burn rate over 6 hours consumes 5% of monthly budget. The budget will be exhausted in 5 days. Urgent but not wake-up-at-3am urgent.
  • 1x burn rate over 3 days means you are on pace to exactly exhaust the budget. This needs investigation but is not an emergency.
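The dual-window check can be sketched as follows, pairing each longer window (1 hour, 6 hours, 3 days) with its shorter companion (5 minutes, 30 minutes, 6 hours). The rule structure, window keys, and function name are hypothetical; in practice this logic usually lives in the alerting system's rule language rather than application code.

```python
from typing import Dict, NamedTuple, Optional


class BurnRateRule(NamedTuple):
    severity: str
    long_window: str
    short_window: str
    threshold: float


# Ordered from most to least severe so the first match wins.
RULES = [
    BurnRateRule("critical", "1h", "5m", 14.4),
    BurnRateRule("high", "6h", "30m", 6.0),
    BurnRateRule("medium", "3d", "6h", 1.0),
]


def firing_severity(burn_rates: Dict[str, float]) -> Optional[str]:
    """Return the highest severity whose long AND short windows both exceed the threshold."""
    for rule in RULES:
        long_ok = burn_rates[rule.long_window] > rule.threshold
        short_ok = burn_rates[rule.short_window] > rule.threshold
        if long_ok and short_ok:
            return rule.severity
    return None


# Hypothetical readings: the 1h window is hot but the 5m window has recovered.
recovering = {"5m": 0.5, "30m": 2.0, "1h": 15.0, "6h": 7.0, "3d": 1.5}
print(firing_severity(recovering))  # medium
```

Requiring both windows means a burst that has already stopped does not page, while a sustained burn still raises a ticket.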

Burn Rate Alert Formulas

For a given time window, calculate the burn rate:

burn_rate = (error_count_in_window / request_count_in_window) / (1 - SLO_target)

Example for a 99.9% SLO, observing 50 errors out of 10,000 requests in a 1-hour window:

observed_error_rate = 50 / 10,000 = 0.005 (0.5%)
allowed_error_rate = 1 - 0.999 = 0.001 (0.1%)
burn_rate = 0.005 / 0.001 = 5.0

A burn rate of 5.0 means the error budget is being consumed 5 times faster than the sustainable rate.
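The formula translates directly into a small function. This is a sketch with a hypothetical name; `slo_target` is a fraction (0.999 for 99.9%).

```python
def burn_rate(error_count: int, request_count: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if request_count <= 0:
        raise ValueError("request_count must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    observed_error_rate = error_count / request_count
    allowed_error_rate = 1 - slo_target
    return observed_error_rate / allowed_error_rate


print(f"{burn_rate(50, 10_000, 0.999):.1f}")  # 5.0
```

A window with no traffic raises an error here instead of returning zero, since "no requests" and "no errors" are different situations.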

Alert Severity to SLA Risk Mapping

| Burn Rate | Budget Impact | SLA Risk | Response |
|---|---|---|---|
| < 1x | Under budget pace | None | Routine monitoring |
| 1x - 3x | On pace or slightly over | Low | Investigate next business day |
| 3x - 6x | Budget will exhaust in 5-10 days | Moderate | Investigate within 4 hours |
| 6x - 14.4x | Budget will exhaust in 2-5 days | High | Page on-call, respond in 30 min |
| > 14.4x | Budget will exhaust in < 2 days | Critical | Immediate page, incident declared |
| > 100x | Active major outage | SLA breach imminent | All-hands incident response |

Dashboard Design for SLA Tracking

Every SLA-tracked service should have a dashboard with these panels:

Row 1 - Current Status:

  • Current availability (real-time, rolling 5-minute window)
  • Current error rate (real-time)
  • Current p99 latency (real-time)

Row 2 - Budget Status:

  • Error budget remaining (% of monthly budget, gauge visualization)
  • Budget consumption timeline (line chart, actual vs expected burn)
  • Budget burn rate (current 1h, 6h, and 3d burn rates)

Row 3 - Historical Context:

  • 30-day availability trend (daily granularity)
  • SLA compliance status for current and previous 3 months
  • Incident markers overlaid on availability timeline

Row 4 - Dependencies:

  • Upstream dependency availability (services this service depends on)
  • Downstream impact scope (services that depend on this service)
  • Composite SLA calculation for customer-facing products

Alert Fatigue Prevention

Alert fatigue is the primary reason SLA monitoring fails in practice. Mitigation strategies:

  1. Require dual-window confirmation. Never page on a single short window. Always require both the short window (for speed) and long window (for persistence) to fire simultaneously.

  2. Separate page-worthy from ticket-worthy. Only two conditions should wake someone up: >14.4x burn rate sustained, or >6x burn rate sustained. Everything else is a ticket.

  3. Deduplicate aggressively. If the same service triggers both a latency and error rate alert for the same underlying issue, group them into a single notification.

  4. Auto-resolve. Alerts must auto-resolve when the burn rate drops below threshold. Never leave stale alerts open.

  5. Review alert quality monthly. Track the ratio of actionable alerts to total alerts. Target >80% actionable rate. If an alert fires and no human action is needed, tune or remove it.

  6. Escalation, not repetition. If an alert is not acknowledged within the response window, escalate to the next tier. Do not re-send the same alert every 5 minutes.

Practical Monitoring Stack

| Layer | Tool Category | Purpose |
|---|---|---|
| Collection | Prometheus, OpenTelemetry, StatsD | Gather SLI metrics from services |
| Storage | Prometheus TSDB, Thanos, Mimir | Retain metrics for SLO window + 90 days |
| Calculation | Prometheus recording rules, Sloth | Pre-compute burn rates and budget consumption |
| Alerting | Alertmanager, PagerDuty, OpsGenie | Route alerts by severity and schedule |
| Visualization | Grafana, Datadog | Dashboards for real-time and historical SLA views |
| Reporting | Custom scripts, SLO generators | Monthly SLA compliance reports for customers |

Retention requirement: SLI data must be retained for at least the SLA reporting period (typically monthly or quarterly) plus a 90-day dispute window. Annual SLA reviews require 12 months of data at daily granularity minimum.


Last updated: February 2026
For use with: incident-commander skill
Maintainer: Engineering Team