# Runbook: [Service/Component Name]

**Owner:** [Team Name] | **Last Updated:** [YYYY-MM-DD] | **Reviewed By:** [Name] | **Review Cadence:** Quarterly

## Service Overview
| Property | Value |
|---|---|
| Service | [service-name] |
| Repository | [repo URL] |
| Dashboard | [monitoring dashboard URL] |
| On-Call Rotation | [PagerDuty/OpsGenie schedule URL] |
| SLA Tier | [Tier 1/2/3] |
| Availability Target | [99.9% / 99.95% / 99.99%] |
| Dependencies | [list upstream/downstream services] |
| Owner Team | [team name] |
| Escalation Contact | [name/email] |
## Architecture Summary
[2-3 sentence description of the service architecture. Include key components, data stores, and external dependencies.]
## Alert Response Decision Tree

### High Error Rate (>5%)
Error Rate Alert Fired
├── Check: Is this a deployment-related issue?
│ ├── YES → Go to "Recent Deployment Rollback" section
│ └── NO → Continue
├── Check: Is a downstream dependency failing?
│ ├── YES → Go to "Dependency Failure" section
│ └── NO → Continue
├── Check: Is there unusual traffic volume?
│ ├── YES → Go to "Traffic Spike" section
│ └── NO → Continue
└── Escalate: Engage on-call secondary + service owner
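The triage order above can be encoded as a small helper so the first responder works the checks in sequence rather than from memory. A minimal sketch; the yes/no answers are gathered manually from dashboards, and the function only maps them to the section to follow:

```shell
#!/usr/bin/env bash
# Sketch: walk the error-rate decision tree in order. Each argument is
# the yes/no answer to one check; the output is the runbook section to
# follow next. Function and argument names are illustrative.
triage_error_rate() {
  local recent_deploy="$1" dep_failing="$2" traffic_spike="$3"
  if   [ "$recent_deploy" = "yes" ]; then echo "Recent Deployment Rollback"
  elif [ "$dep_failing"   = "yes" ]; then echo "Dependency Failure"
  elif [ "$traffic_spike" = "yes" ]; then echo "Traffic Spike"
  else echo "Escalate: on-call secondary + service owner"
  fi
}

# Example: error spike shortly after a deploy.
triage_error_rate yes no no
```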
### High Latency (p99 > [threshold]ms)
Latency Alert Fired
├── Check: Database query latency elevated?
│ ├── YES → Go to "Database Performance" section
│ └── NO → Continue
├── Check: Connection pool utilization >80%?
│ ├── YES → Go to "Connection Pool Exhaustion" section
│ └── NO → Continue
├── Check: Memory/CPU pressure on service instances?
│ ├── YES → Go to "Resource Exhaustion" section
│ └── NO → Continue
└── Escalate: Engage on-call secondary + service owner
### Service Unavailable (Health Check Failing)
Health Check Alert Fired
├── Check: Are all instances down?
│ ├── YES → Go to "Complete Outage" section
│ └── NO → Continue
├── Check: Is only one AZ affected?
│ ├── YES → Go to "AZ Failure" section
│ └── NO → Continue
├── Check: Can instances be restarted?
│ ├── YES → Go to "Instance Restart" section
│ └── NO → Continue
└── Escalate: Declare incident, engage IC
## Common Scenarios

### Recent Deployment Rollback
Symptoms: Error rate spike or latency increase within 60 minutes of a deployment.
Diagnosis:
- Check deployment history: `kubectl rollout history deployment/[service-name]`
- Compare error rate timing with the deployment timestamp
- Review the deployment diff for risky changes
Mitigation:
- Initiate rollback: `kubectl rollout undo deployment/[service-name]`
- Verify rollback: `kubectl rollout status deployment/[service-name]`
- Confirm error rate returns to baseline (allow 5 minutes)
- If rollback fails: escalate immediately
Communication: If customer-impacting, update status page within 5 minutes of confirming impact.
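The "within 60 minutes of a deployment" check lends itself to scripting. A minimal sketch, assuming `kubectl` access; the timestamp-comparison helper is pure shell, and the wiring to `kubectl` is left as comments because the deployment name is a placeholder:

```shell
#!/usr/bin/env bash
# Sketch: decide whether an error spike is plausibly deployment-related,
# i.e. the last rollout finished within the past N minutes (default 60).
was_recent_deploy() {
  local deploy_epoch="$1" now_epoch="$2" window_min="${3:-60}"
  local age_min=$(( (now_epoch - deploy_epoch) / 60 ))
  [ "$age_min" -le "$window_min" ]
}

# Illustrative wiring (not executed here):
#   last=$(kubectl get deploy [service-name] -o \
#     jsonpath='{.status.conditions[?(@.type=="Progressing")].lastUpdateTime}')
#   if was_recent_deploy "$(date -d "$last" +%s)" "$(date +%s)"; then
#     kubectl rollout undo deployment/[service-name]
#     kubectl rollout status deployment/[service-name]
#   fi
```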
### Database Performance
Symptoms: Elevated query latency, connection pool saturation, timeout errors.
Diagnosis:
- Check active queries: `SELECT * FROM pg_stat_activity WHERE state = 'active';`
- Check for long-running queries: `SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;`
- Check connection count: `SELECT count(*) FROM pg_stat_activity;`
- Check table bloat and vacuum status
Mitigation:
- Kill long-running queries if identified: `SELECT pg_terminate_backend([pid]);`
- If connection pool exhausted: increase pool size via config (requires restart)
- If read replica available: redirect read traffic
- If write-heavy: identify and defer non-critical writes
Escalation Trigger: If query latency >10s for >5 minutes, escalate to DBA on-call.
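Killing long-running queries PID by PID is slow under pressure. A hedged sketch that builds one statement terminating everything over a threshold in a single pass; the function only emits SQL, so the operator can review it before piping it to `psql`:

```shell
#!/usr/bin/env bash
# Sketch: emit SQL that terminates all non-idle backends whose query has
# been running longer than the given number of minutes, excluding the
# session running the statement itself. Review before executing.
kill_long_queries_sql() {
  local minutes="$1"
  printf "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state != 'idle' AND now() - query_start > interval '%s minutes' AND pid <> pg_backend_pid();" "$minutes"
}

# Usage (not executed here):
#   kill_long_queries_sql 10 | psql "$DATABASE_URL"
```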
### Connection Pool Exhaustion
Symptoms: Connection timeout errors, pool utilization >90%, requests queuing.
Diagnosis:
- Check pool metrics: current size, active connections, waiting requests
- Check for connection leaks: connections held >30s without activity
- Review recent config changes or deployments
Mitigation:
- Increase pool size (if infrastructure allows): update config, rolling restart
- Kill idle connections exceeding timeout
- If caused by leak: identify and restart affected instances
- Enable connection pool auto-scaling if available
Prevention: Pool utilization alerting at 70% (warning) and 85% (critical).
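The 70%/85% thresholds above can be expressed as a small classifier, useful for ad-hoc checks when the dashboard is unavailable. A sketch using integer percent math; the active/max figures would come from the pool's metrics endpoint in practice:

```shell
#!/usr/bin/env bash
# Sketch: classify pool utilization against the runbook thresholds:
# >=85% critical, >=70% warning, otherwise ok.
pool_severity() {
  local active="$1" max="$2"
  local pct=$(( active * 100 / max ))
  if   [ "$pct" -ge 85 ]; then echo "critical ($pct%)"
  elif [ "$pct" -ge 70 ]; then echo "warning ($pct%)"
  else echo "ok ($pct%)"
  fi
}
```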
### Dependency Failure
Symptoms: Errors correlated with downstream service failures, circuit breakers tripping.
Diagnosis:
- Check dependency status dashboards
- Verify circuit breaker state: open/half-open/closed
- Check for correlation with dependency deployments or incidents
- Test dependency health endpoints directly
Mitigation:
- If circuit breaker not tripping: verify timeout/threshold configuration
- Enable graceful degradation (serve cached/default responses)
- If critical path: engage dependency team via incident process
- If non-critical path: disable feature flag for affected functionality
Communication: Coordinate with dependency team IC if both services have active incidents.
### Traffic Spike
Symptoms: Sudden traffic increase beyond normal patterns, resource saturation.
Diagnosis:
- Check traffic source: organic growth vs. bot traffic vs. DDoS
- Review rate limiting effectiveness
- Check auto-scaling status and capacity
Mitigation:
- If bot/DDoS: enable rate limiting, engage security team
- If organic: trigger manual scale-up, increase auto-scaling limits
- Enable request queuing or load shedding if at capacity
- Consider feature flag toggles to reduce per-request cost
### Complete Outage
Symptoms: All instances unreachable, health checks failing across AZs.
Diagnosis:
- Check infrastructure status (AWS/GCP status page)
- Verify network connectivity and DNS resolution
- Check for infrastructure-level incidents (region outage)
- Review recent infrastructure changes (Terraform, network config)
Mitigation:
- If infra provider issue: activate disaster recovery plan
- If DNS issue: update DNS records, reduce TTL
- If deployment corruption: redeploy last known good version
- If data corruption: engage data recovery procedures
Escalation: Immediately declare SEV1 incident. Engage infrastructure team and management.
### Instance Restart
Symptoms: Individual instances unhealthy, OOM kills, process crashes.
Diagnosis:
- Check instance logs for crash reason
- Review memory/CPU usage patterns before crash
- Check for memory leaks or resource exhaustion
- Verify configuration consistency across instances
Mitigation:
- Restart unhealthy instances: `kubectl delete pod [pod-name]`
- If recurring: cordon the node and migrate workloads
- If memory leak: schedule an immediate patch with an increased memory limit
- Monitor for recurrence after restart
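The restart-versus-cordon decision can be made mechanical. A sketch; the threshold of 3 restarts is an assumption, not a documented value, and the `kubectl` actions are left as comments since pod and node names are placeholders:

```shell
#!/usr/bin/env bash
# Sketch: choose between restarting a pod and cordoning its node, based
# on how many times that pod restarted in the recent window. The default
# threshold (3) is an illustrative assumption.
restart_or_cordon() {
  local recent_restarts="$1" threshold="${2:-3}"
  if [ "$recent_restarts" -ge "$threshold" ]; then
    # recurring: kubectl cordon [node] && kubectl drain [node] --ignore-daemonsets
    echo "cordon"
  else
    # one-off: kubectl delete pod [pod-name]
    echo "restart"
  fi
}
```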
### AZ Failure
Symptoms: All instances in one availability zone failing, others healthy.
Diagnosis:
- Confirm AZ-specific failure vs. instance-specific issues
- Check cloud provider AZ status
- Verify load balancer is routing around failed AZ
Mitigation:
- Ensure load balancer marks AZ instances as unhealthy
- Scale up remaining AZs to handle redirected traffic
- If auto-scaling: verify it's responding to increased load
- Monitor remaining AZs for cascade effects
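When scaling up the remaining AZs, it helps to compute the target replica count explicitly rather than guessing. A sketch using ceiling division so capacity never rounds down; inputs are the total replicas the service needs and the number of AZs still healthy:

```shell
#!/usr/bin/env bash
# Sketch: replicas each surviving AZ must carry after an AZ failure.
# Ceiling division: (total + remaining - 1) / remaining.
replicas_per_remaining_az() {
  local total_replicas="$1" remaining_azs="$2"
  echo $(( (total_replicas + remaining_azs - 1) / remaining_azs ))
}

# e.g. a service needing 9 replicas, with 2 of 3 AZs left, needs 5 per AZ.
replicas_per_remaining_az 9 2
```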
## Key Metrics & Dashboards
| Metric | Normal Range | Warning | Critical | Dashboard |
|---|---|---|---|---|
| Error Rate | <0.1% | >1% | >5% | [link] |
| p99 Latency | <200ms | >500ms | >2000ms | [link] |
| CPU Usage | <60% | >75% | >90% | [link] |
| Memory Usage | <70% | >80% | >90% | [link] |
| DB Pool Usage | <50% | >70% | >85% | [link] |
| Request Rate | [baseline]±20% | ±50% | ±100% | [link] |
## Escalation Contacts
| Level | Contact | When |
|---|---|---|
| L1: On-Call Primary | [name/rotation] | First responder |
| L2: On-Call Secondary | [name/rotation] | Primary unavailable or needs help |
| L3: Service Owner | [name] | Complex issues, architectural decisions |
| L4: Engineering Manager | [name] | SEV1/SEV2, customer impact, resource needs |
| L5: VP Engineering | [name] | SEV1 >30 min, major customer/revenue impact |
## Maintenance Procedures

### Planned Maintenance Checklist
- Maintenance window scheduled and communicated (72 hours advance for Tier 1)
- Status page updated with planned maintenance notice
- Rollback plan documented and tested
- On-call notified of maintenance window
- Customer notification sent (if SLA-impacting)
- Post-maintenance verification plan ready
### Health Verification After Changes
- Check all health endpoints return 200
- Verify error rate returns to baseline within 5 minutes
- Confirm latency within normal range
- Run synthetic transaction test
- Monitor for 15 minutes before declaring success
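The verification gate above can be sketched as a single pass/fail function. A minimal, hedged version: in practice the status code comes from `curl -o /dev/null -w '%{http_code}' [health URL]` and the error rate from the dashboard query; here both are plain arguments, and the integer-percent baseline is a simplification of the <0.1% target:

```shell
#!/usr/bin/env bash
# Sketch: pass post-change verification only if the health endpoint
# returned 200 AND the error rate is back at (or below) baseline.
verify_healthy() {
  local status_code="$1" error_rate_pct="$2" baseline_pct="${3:-1}"
  if [ "$status_code" != "200" ]; then
    echo "fail: health endpoint returned $status_code"; return 1
  fi
  if [ "$error_rate_pct" -gt "$baseline_pct" ]; then
    echo "fail: error rate ${error_rate_pct}% above baseline"; return 1
  fi
  echo "ok"
}
```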
## Revision History
| Date | Author | Change |
|---|---|---|
| [YYYY-MM-DD] | [Name] | Initial version |
| [YYYY-MM-DD] | [Name] | [Description of update] |
This runbook should be reviewed quarterly and updated after every incident that reveals missing procedures. The on-call engineer should be able to follow this document without prior context about the service. If any section requires tribal knowledge to execute, it needs to be expanded.