firefrost-gaming/claude-skills-reference

Files

Leo f6f50f5282 Fix CI workflows and installation documentation

- Replace non-existent anthropics/claude-code-action@v1 with direct bash steps in smart-sync.yml and pr-issue-auto-close.yml
- Add missing checkout steps to both workflows for WORKFLOW_KILLSWITCH access
- Fix Issue #189: Replace broken 'npx ai-agent-skills install' with working 'npx agent-skills-cli add' command
- Update README.md and INSTALLATION.md with correct Agent Skills CLI commands and repository links
- Verified: agent-skills-cli detects all 53 skills and works with 42+ AI agents

Fixes: Two GitHub Actions workflows that broke on PR #191 merge
Closes: #189

2026-02-16 11:30:18 +00:00

23 KiB

Raw Blame History

SLA Management Guide

Comprehensive reference for Service Level Agreements, Objectives, and Indicators. Designed for incident commanders who must understand, protect, and communicate SLA status during and after incidents.

1. Definitions & Relationships

Service Level Indicator (SLI)

An SLI is the quantitative measurement of a specific aspect of service quality. SLIs are the raw data that feed everything above them. They must be precisely defined, automatically collected, and unambiguous.

Common SLI types by service:

Service Type	SLI	Measurement Method
Web Application	Request latency (p50, p95, p99)	Server-side histogram
Web Application	Availability (successful responses / total requests)	Load balancer logs
REST API	Error rate (5xx responses / total responses)	API gateway metrics
REST API	Throughput (requests per second)	Counter metric
Database	Query latency (p99)	Slow query log + APM
Database	Replication lag (seconds)	Replica monitoring
Message Queue	End-to-end delivery latency	Timestamp comparison
Message Queue	Message loss rate	Producer vs consumer counts
Storage	Durability (objects lost / objects stored)	Integrity checksums
CDN	Cache hit ratio	Edge server logs

SLI specification formula:

SLI = (good events / total events) x 100

For availability: SLI = (successful requests / total requests) x 100 For latency: SLI = (requests faster than threshold / total requests) x 100

Service Level Objective (SLO)

An SLO is the target value or range for an SLI. It defines the acceptable level of reliability. SLOs are internal goals that engineering teams commit to.

Setting meaningful SLOs:

Measure the current baseline over 30 days minimum
Subtract a safety margin (typically 0.05%-0.1% below actual performance)
Validate against user expectations and business requirements
Never set an SLO higher than what the system can sustain without heroics

Common pitfall: Setting 99.99% availability when 99.9% meets every user need. The jump from 99.9% to 99.99% is a 10x reduction in allowed downtime and typically requires 3-5x the engineering investment.

SLO examples:

99.9% of HTTP requests return a non-5xx response within each calendar month
95% of API requests complete in under 200ms (p95 latency)
99.95% of messages are delivered within 30 seconds of production

Service Level Agreement (SLA)

An SLA is a formal contract between a service provider and its customers that specifies consequences for failing to meet defined service levels. SLAs must always be looser than SLOs to provide a buffer zone.

Rule of thumb: If your SLO is 99.95%, your SLA should be 99.9% or lower. The gap between SLO and SLA is your safety margin.

The Hierarchy

  SLA (99.9%)     ← Contract with customers, financial penalties
    ↑ backs
  SLO (99.95%)    ← Internal target, triggers error budget policy
    ↑ targets
  SLI (measured)  ← Raw metric: actual uptime = 99.97% this month

Standard combinations by tier:

Tier	SLI (Metric)	SLO (Target)	SLA (Contract)	Allowed Downtime/Month
Critical (payments)	Availability	99.99%	99.95%	SLO: 4.38 min / SLA: 21.9 min
High (core API)	Availability	99.95%	99.9%	SLO: 21.9 min / SLA: 43.8 min
Standard (dashboard)	Availability	99.9%	99.5%	SLO: 43.8 min / SLA: 3.65 hrs
Low (internal tools)	Availability	99.5%	99.0%	SLO: 3.65 hrs / SLA: 7.3 hrs

2. Error Budget Policy

What Is an Error Budget

An error budget is the maximum amount of unreliability a service can have within a given period while still meeting its SLO. It is calculated as:

Error Budget = 1 - SLO target

For a 99.9% SLO over a 30-day month (43,200 minutes):

Error Budget = 1 - 0.999 = 0.001 = 0.1%
Allowed Downtime = 43,200 x 0.001 = 43.2 minutes

Downtime Allowances by SLO

SLO	Error Budget	Monthly Downtime	Quarterly Downtime	Annual Downtime
99.0%	1.0%	7 hrs 18 min	21 hrs 54 min	3 days 15 hrs
99.5%	0.5%	3 hrs 39 min	10 hrs 57 min	1 day 19 hrs
99.9%	0.1%	43.8 min	2 hrs 11 min	8 hrs 46 min
99.95%	0.05%	21.9 min	1 hr 6 min	4 hrs 23 min
99.99%	0.01%	4.38 min	13.1 min	52.6 min
99.999%	0.001%	26.3 sec	78.9 sec	5.26 min

Error Budget Consumption Tracking

Track budget consumption as a percentage of the total budget used so far in the current window:

Budget Consumed (%) = (actual bad minutes / allowed bad minutes) x 100

Example: SLO is 99.9% (43.8 min budget/month). On day 10, you have had 15 minutes of downtime.

Budget Consumed = (15 / 43.8) x 100 = 34.2%
Expected consumption at day 10 = (10/30) x 100 = 33.3%
Status: Slightly over pace (34.2% consumed at 33.3% of month elapsed)

Burn Rate

Burn rate measures how fast the error budget is being consumed relative to the steady-state rate:

Burn Rate = (error rate observed / error rate allowed by SLO)

A burn rate of 1.0 means the budget will be exactly exhausted by the end of the window. A burn rate of 10 means the budget will be exhausted in 1/10th of the window.

Burn rate to time-to-exhaustion (30-day month):

Burn Rate	Budget Exhausted In	Urgency
1x	30 days	On pace, monitoring only
2x	15 days	Elevated attention
6x	5 days	Active investigation required
14.4x	2.08 days (~50 hours)	Immediate page
36x	20 hours	Critical, all-hands
720x	1 hour	Total outage scenario

Error Budget Exhaustion Policy

When the error budget is consumed, the following actions trigger based on threshold:

Tier 1 - Budget at 75% consumed (Yellow):

Notify service team lead via automated alert
Freeze non-critical deployments to the affected service
Conduct pre-emptive review of upcoming changes for risk
Increase monitoring sensitivity (lower alert thresholds)

Tier 2 - Budget at 100% consumed (Orange):

Hard feature freeze on the affected service
Mandatory reliability sprint: all engineering effort redirected to reliability
Daily status updates to engineering leadership
Postmortem required for the incidents that consumed the budget
Freeze lasts until budget replenishes to 50% or systemic fixes are verified

Tier 3 - Budget at 150% consumed / SLA breach imminent (Red):

Escalation to VP Engineering and CTO
Cross-team war room if dependencies are involved
Customer communication prepared and staged
Legal and finance teams briefed on potential SLA credit obligations
Recovery plan with specific milestones required within 24 hours

Error Budget Policy Template

SERVICE: [service-name]
SLO: [target]% availability over [rolling 30-day / calendar month] window
ERROR BUDGET: [calculated] minutes per window

BUDGET THRESHOLDS:
  - 50% consumed: Team notification, increased vigilance
  - 75% consumed: Feature freeze for this service, reliability focus
  - 100% consumed: Full feature freeze, reliability sprint mandatory
  - SLA threshold crossed: Executive escalation, customer communication

REVIEW CADENCE: Monthly budget review on [day], quarterly SLO adjustment

EXCEPTIONS: Planned maintenance windows excluded if communicated 72+ hours in advance
            and within agreed maintenance allowance.

APPROVED BY: [Engineering Lead] / [Product Lead] / [Date]

3. SLA Breach Handling

Detection Methods

Automated detection (primary):

Real-time monitoring dashboards with SLA burn-rate alerts
Automated SLA compliance calculations running every 5 minutes
Threshold-based alerts when cumulative downtime approaches SLA limits
Synthetic monitoring (external probes) for customer-perspective validation

Manual review (secondary):

Monthly SLA compliance reports generated on the 1st of each month
Customer-reported incidents cross-referenced with internal metrics
Quarterly audits comparing measured SLIs against contracted SLAs
Discrepancy review between internal metrics and customer-perceived availability

Breach Classification

Minor Breach:

SLA missed by less than 0.05 percentage points (e.g., 99.85% vs 99.9% SLA)
Fewer than 3 discrete incidents contributed
No single incident exceeded 30 minutes
Customer impact was limited or partial degradation only
Financial credit: typically 5-10% of monthly service fee

Major Breach:

SLA missed by 0.05 to 0.5 percentage points
Extended outage of 1-4 hours in a single incident, or multiple significant incidents
Clear customer impact with support tickets generated
Financial credit: typically 10-25% of monthly service fee

Critical Breach:

SLA missed by more than 0.5 percentage points
Total outage exceeding 4 hours, or repeated major incidents in same window
Data loss, security incident, or compliance violation involved
Financial credit: typically 25-100% of monthly service fee
May trigger contract termination clauses

Response Protocol

For Minor Breach (within 3 business days):

Generate SLA compliance report with exact metrics
Document contributing incidents with root causes
Send proactive notification to customer success manager
Issue service credits if contractually required (do not wait for customer to ask)
File internal improvement ticket with 30-day remediation target

For Major Breach (within 24 hours):

Incident commander confirms SLA impact calculation
Draft customer communication (see template below)
Executive sponsor reviews and approves communication
Issue service credits with detailed breakdown
Schedule root cause review with customer within 5 business days
Produce remediation plan with committed timelines

For Critical Breach (immediate):

Activate executive escalation chain
Legal team reviews contractual exposure
Finance team calculates credit obligations
Customer communication from VP or C-level within 4 hours
Dedicated remediation task force assigned
Weekly status updates to customer until remediation complete
Formal postmortem document shared with customer within 10 business days

Customer Communication Template

Subject: Service Level Update - [Service Name] - [Month Year]

Dear [Customer Name],

We are writing to inform you that [Service Name] did not meet the committed
service level of [SLA target]% availability during [time period].

MEASURED PERFORMANCE: [actual]% availability
COMMITTED SLA: [SLA target]% availability
SHORTFALL: [delta] percentage points

CONTRIBUTING FACTORS:
- [Date/Time]: [Brief description of incident] ([duration] impact)
- [Date/Time]: [Brief description of incident] ([duration] impact)

SERVICE CREDIT: In accordance with our agreement, a credit of [amount/percentage]
will be applied to your next invoice.

REMEDIATION ACTIONS:
1. [Specific technical fix with completion date]
2. [Process improvement with implementation date]
3. [Monitoring enhancement with deployment date]

We take our service commitments seriously. [Name], [Title] is personally
overseeing the remediation and is available to discuss further at your convenience.

Sincerely,
[Name, Title]

Legal and Compliance Considerations

Maintain auditable records of all SLA measurements for the full contract term plus 2 years
SLA calculations must use the measurement methodology defined in the contract, not internal approximations
Force majeure clauses typically exclude natural disasters, but verify per contract
Planned maintenance exclusions must match the exact notification procedures in the contract
Multi-region SLAs may have separate calculations per region; verify aggregation method

4. Incident-to-SLA Mapping

Downtime Calculation Methodologies

Full outage: Service completely unavailable. Every minute counts as a full minute of downtime.

Downtime = End Time - Start Time (in minutes)

Partial degradation: Service available but impaired. Apply a degradation factor:

Effective Downtime = Actual Duration x Degradation Factor

Degradation Level	Factor	Description
Complete outage	1.0	Service fully unavailable
Severe degradation	0.75	>50% of requests failing or >10x latency
Moderate degradation	0.5	10-50% of requests affected or 3-10x latency
Minor degradation	0.25	<10% of requests affected or <3x latency increase
Cosmetic / non-functional	0.0	No impact on core SLI metrics

Note: The exact degradation factors must be agreed upon in the SLA contract. The above are industry-standard starting points.

Planned vs Unplanned Downtime

Most SLAs exclude pre-announced maintenance windows from availability calculations, subject to conditions:

Notification provided N hours/days in advance (commonly 72 hours)
Maintenance occurs within an agreed window (e.g., Sunday 02:00-06:00 UTC)
Total planned downtime does not exceed the monthly maintenance allowance (e.g., 4 hours/month)
Any overrun beyond the planned window counts as unplanned downtime

SLA Availability = (Total Minutes - Excluded Maintenance - Unplanned Downtime) / (Total Minutes - Excluded Maintenance) x 100

Multi-Service SLA Composition

When a customer-facing product depends on multiple services, composite SLA is calculated as:

Serial dependency (all must be up):

Composite SLA = SLA_A x SLA_B x SLA_C
Example: 99.9% x 99.95% x 99.99% = 99.84%

Parallel / redundant (any one must be up):

Composite Availability = 1 - ((1 - SLA_A) x (1 - SLA_B))
Example: 1 - ((1 - 0.999) x (1 - 0.999)) = 1 - 0.000001 = 99.9999%

This is critical during incidents: an outage in a shared dependency may breach SLAs for multiple customer-facing products simultaneously.

Worked Examples

Example 1: Simple outage

Service: Core API (SLA: 99.9%)
Month: 30 days = 43,200 minutes
Incident: Full outage from 14:23 to 14:38 UTC on the 12th (15 minutes)
No other incidents this month

Availability = (43,200 - 15) / 43,200 x 100 = 99.965%
SLA Status: PASS (99.965% > 99.9%)
Error Budget Consumed: 15 / 43.2 = 34.7%

Example 2: Partial degradation

Service: Payment Processing (SLA: 99.95%)
Month: 30 days = 43,200 minutes
Incident: 50% of transactions failing for 4 hours (240 minutes)
Degradation factor: 0.5 (moderate - 50% of requests affected)

Effective Downtime = 240 x 0.5 = 120 minutes
Availability = (43,200 - 120) / 43,200 x 100 = 99.722%
SLA Status: FAIL (99.722% < 99.95%)
Shortfall: 0.228 percentage points → Major Breach

Example 3: Multiple incidents

Service: Dashboard (SLA: 99.5%)
Month: 31 days = 44,640 minutes
Incident A: 45-minute full outage on the 5th
Incident B: 2-hour severe degradation (factor 0.75) on the 18th
Incident C: 30-minute full outage on the 25th

Total Effective Downtime = 45 + (120 x 0.75) + 30 = 45 + 90 + 30 = 165 minutes
Availability = (44,640 - 165) / 44,640 x 100 = 99.630%
SLA Status: PASS (99.630% > 99.5%)
Error Budget Consumed: 165 / 223.2 = 73.9% → Yellow threshold, feature freeze recommended

5. SLO Best Practices

Start with User Journeys

Do not set SLOs based on infrastructure metrics. Start from what users experience:

Identify critical user journeys (e.g., "User completes checkout")
Map each journey to the services and dependencies involved
Define what "good" looks like for each journey (fast, error-free, complete)
Select the SLIs that most directly measure that user experience
Set SLO targets that reflect the minimum acceptable user experience

A database with 99.99% uptime is meaningless if the API in front of it has a bug causing 5% error rates.

The Four Golden Signals as SLI Sources

From Google SRE, the four golden signals provide comprehensive service health:

Signal	SLI Example	Typical SLO
Latency	p99 request duration < 500ms	99% of requests under threshold
Traffic	Requests per second	N/A (capacity planning, not SLO)
Errors	5xx rate as % of total requests	< 0.1% error rate over rolling window
Saturation	CPU/memory/queue depth	< 80% utilization (capacity SLI)

For most services, latency and error rate are the two most important SLIs to back with SLOs.

Setting SLO Targets

Collect 90 days of historical SLI data
Calculate the 5th percentile performance (worst 5% of days)
Set SLO slightly above that baseline (this ensures the SLO is achievable without heroics)
Validate: would a breach at this level actually impact users negatively?
Adjust upward only if user impact analysis demands it

Never set SLOs by aspiration. A 99.99% SLO on a service that has historically achieved 99.93% is a guaranteed source of perpetual firefighting with no reliability improvement.

Review Cadence

Weekly: Review current error budget burn rate, flag services approaching thresholds
Monthly: Full SLO compliance review, adjust alert thresholds if needed
Quarterly: Reassess SLO targets based on 90-day data, review SLA contract alignment
Annually: Strategic SLO review tied to product roadmap and infrastructure investments

Anti-Patterns

Anti-Pattern	Problem	Fix
Vanity SLOs	Setting 99.99% to impress, then ignoring breaches	Set achievable targets, enforce budget policy
SLO Inflation	Ratcheting SLOs up whenever performance is good	Only increase SLOs when users demonstrably need it
Unmeasured SLAs	Committing contractual SLAs without actual SLI measurement	Instrument SLIs before signing SLA contracts
Copy-Paste SLOs	Same SLO for every service regardless of criticality	Tier services by business impact, set SLOs accordingly
Ignoring Dependencies	Setting aggressive SLOs without accounting for dependency reliability	Calculate composite SLA; your SLO cannot exceed dependency chain
Alert-Free SLOs	Having SLOs but no automated alerting on budget consumption	Every SLO must have corresponding burn rate alerts

6. Monitoring & Alerting for SLAs

Multi-Window Burn Rate Alerting

The Google SRE approach uses multiple time windows to balance speed of detection against alert noise. Each alert condition requires both a short window (for speed) and a long window (for confirmation):

Alert configuration matrix:

Severity	Short Window	Short Threshold	Long Window	Long Threshold	Action
Critical (Page)	1 hour	> 14.4x burn rate	5 minutes	> 14.4x burn rate	Wake someone up
High (Page)	6 hours	> 6x burn rate	30 minutes	> 6x burn rate	Page on-call within 30 min
Medium (Ticket)	3 days	> 1x burn rate	6 hours	> 1x burn rate	Create ticket, next business day

Why these specific numbers:

14.4x burn rate over 1 hour consumes 2% of monthly budget in that hour. At this rate, the entire 30-day budget is gone in ~50 hours. This demands immediate human attention.
6x burn rate over 6 hours consumes 5% of monthly budget. The budget will be exhausted in 5 days. Urgent but not wake-up-at-3am urgent.
1x burn rate over 3 days means you are on pace to exactly exhaust the budget. This needs investigation but is not an emergency.

Burn Rate Alert Formulas

For a given time window, calculate the burn rate:

burn_rate = (error_count_in_window / request_count_in_window) / (1 - SLO_target)

Example for a 99.9% SLO, observing 50 errors out of 10,000 requests in a 1-hour window:

observed_error_rate = 50 / 10,000 = 0.005 (0.5%)
allowed_error_rate = 1 - 0.999 = 0.001 (0.1%)
burn_rate = 0.005 / 0.001 = 5.0

A burn rate of 5.0 means the error budget is being consumed 5 times faster than the sustainable rate.

Alert Severity to SLA Risk Mapping

Burn Rate	Budget Impact	SLA Risk	Response
< 1x	Under budget pace	None	Routine monitoring
1x - 3x	On pace or slightly over	Low	Investigate next business day
3x - 6x	Budget will exhaust in 5-10 days	Moderate	Investigate within 4 hours
6x - 14.4x	Budget will exhaust in 2-5 days	High	Page on-call, respond in 30 min
> 14.4x	Budget will exhaust in < 2 days	Critical	Immediate page, incident declared
> 100x	Active major outage	SLA breach imminent	All-hands incident response

Dashboard Design for SLA Tracking

Every SLA-tracked service should have a dashboard with these panels:

Row 1 - Current Status:

Current availability (real-time, rolling 5-minute window)
Current error rate (real-time)
Current p99 latency (real-time)

Row 2 - Budget Status:

Error budget remaining (% of monthly budget, gauge visualization)
Budget consumption timeline (line chart, actual vs expected burn)
Budget burn rate (current 1h, 6h, and 3d burn rates)

Row 3 - Historical Context:

30-day availability trend (daily granularity)
SLA compliance status for current and previous 3 months
Incident markers overlaid on availability timeline

Row 4 - Dependencies:

Upstream dependency availability (services this service depends on)
Downstream impact scope (services that depend on this service)
Composite SLA calculation for customer-facing products

Alert Fatigue Prevention

Alert fatigue is the primary reason SLA monitoring fails in practice. Mitigation strategies:

Require dual-window confirmation. Never page on a single short window. Always require both the short window (for speed) and long window (for persistence) to fire simultaneously.
Separate page-worthy from ticket-worthy. Only two conditions should wake someone up: >14.4x burn rate sustained, or >6x burn rate sustained. Everything else is a ticket.
Deduplicate aggressively. If the same service triggers both a latency and error rate alert for the same underlying issue, group them into a single notification.
Auto-resolve. Alerts must auto-resolve when the burn rate drops below threshold. Never leave stale alerts open.
Review alert quality monthly. Track the ratio of actionable alerts to total alerts. Target >80% actionable rate. If an alert fires and no human action is needed, tune or remove it.
Escalation, not repetition. If an alert is not acknowledged within the response window, escalate to the next tier. Do not re-send the same alert every 5 minutes.

Practical Monitoring Stack

Layer	Tool Category	Purpose
Collection	Prometheus, OpenTelemetry, StatsD	Gather SLI metrics from services
Storage	Prometheus TSDB, Thanos, Mimir	Retain metrics for SLO window + 90 days
Calculation	Prometheus recording rules, Sloth	Pre-compute burn rates and budget consumption
Alerting	Alertmanager, PagerDuty, OpsGenie	Route alerts by severity and schedule
Visualization	Grafana, Datadog	Dashboards for real-time and historical SLA views
Reporting	Custom scripts, SLO generators	Monthly SLA compliance reports for customers

Retention requirement: SLI data must be retained for at least the SLA reporting period (typically monthly or quarterly) plus a 90-day dispute window. Annual SLA reviews require 12 months of data at daily granularity minimum.

Last updated: February 2026 For use with: incident-commander skill Maintainer: Engineering Team

23 KiB Raw Blame History