firefrost-gaming/claude-skills-reference

Leo 4dbb0c581f feat: add experiment designer skill with sample size calculator

2026-03-11 14:58:27 +01:00

Experiment Playbook

Experiment Types

A/B Test

Compare one control versus one variant.
Best for high-confidence directional decisions.

Multivariate Test

Test combinations of multiple factors.
Useful for detecting interaction effects, but requires more traffic because every combination needs its own sample.
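To make the traffic cost concrete, here is a minimal sketch that counts the cells in a full-factorial design. The factor names, their levels, and the per-cell sample of 5000 are hypothetical values chosen for illustration only.

```python
from itertools import product
from math import prod

# Hypothetical factors and levels, for illustration only.
factors = {
    "headline": ["A", "B"],
    "button_color": ["green", "blue", "orange"],
    "layout": ["single", "two_column"],
}

def count_cells(factors):
    """Number of distinct combinations in a full-factorial design."""
    return prod(len(levels) for levels in factors.values())

def required_total_sample(factors, n_per_cell):
    """Total users needed if every combination needs n_per_cell users."""
    return count_cells(factors) * n_per_cell

# Every combination that would receive traffic.
cells = list(product(*factors.values()))

print(count_cells(factors))                  # 12
print(required_total_sample(factors, 5000))  # 60000
```

Adding one more two-level factor doubles the number of cells, which is why multivariate tests need far more traffic than a single A/B comparison.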

Holdout Test

Keep a percentage of users unexposed to the intervention.
Useful for measuring the cumulative incremental lift of a broader set of changes.
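One common way to keep a stable holdout is deterministic hashing of a user identifier. The sketch below assumes string user IDs; the function name `in_holdout`, the salt value, and the 5% share are hypothetical.

```python
import hashlib

def in_holdout(user_id: str, salt: str, holdout_pct: float) -> bool:
    """Deterministically place a user in the holdout group.

    The same (salt, user_id) pair always maps to the same bucket, so a
    user stays in or out of the holdout across sessions.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < holdout_pct

# Hypothetical usage: hold out roughly 5% of users.
users = [f"user-{i}" for i in range(20000)]
held_out = [u for u in users if in_holdout(u, "holdout-2026", 0.05)]
print(len(held_out) / len(users))
```

Using a different salt per holdout avoids reusing the same users as the unexposed group in every test.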

Metric Design

Primary Metric

One metric that decides ship/no-ship.
Must align with user value and business objective.

Guardrail Metrics

Catch harm elsewhere when a change improves the primary metric at the expense of the wider product.
Examples: error rate, latency, churn proxy, support contacts.
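A guardrail check can be as simple as comparing each metric's relative change against a pre-agreed limit. The sketch below is illustrative: the metric names, limits, and observed values are hypothetical, and it ignores statistical noise, so a real check would also account for sampling variance.

```python
# Hypothetical limits: maximum tolerated relative increase versus control.
GUARDRAILS = {
    "error_rate": 0.10,
    "p95_latency_ms": 0.05,
    "support_contacts_per_1k": 0.15,
}

def breached_guardrails(control, variant, limits=GUARDRAILS):
    """Return the guardrail metrics whose relative increase exceeds its limit."""
    breaches = []
    for name, max_rel_increase in limits.items():
        rel_change = (variant[name] - control[name]) / control[name]
        if rel_change > max_rel_increase:
            breaches.append(name)
    return breaches

# Hypothetical observed values.
control = {"error_rate": 0.010, "p95_latency_ms": 400, "support_contacts_per_1k": 2.0}
variant = {"error_rate": 0.012, "p95_latency_ms": 410, "support_contacts_per_1k": 2.1}

print(breached_guardrails(control, variant))  # ['error_rate']
```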

Diagnostic Metrics

Explain why a change happened.
Do not use as a decision gate unless pre-specified.

Stopping Rules

Define before launch:

Fixed sample size per group
Minimum run duration (to capture weekday/weekend behavior)
Guardrail breach thresholds (pause criteria)
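The first two items can be estimated before launch. The sketch below uses the standard normal-approximation formula for a two-sided test on two proportions; it is an illustration and not necessarily the calculator this skill ships with. The baseline rate, minimum detectable effect, and daily traffic are hypothetical inputs.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(baseline, mde_abs, alpha=0.05, power=0.80):
    """Users per group needed to detect an absolute lift of mde_abs.

    Normal approximation for a two-sided two-proportion test.
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

def run_duration_days(n_per_group, daily_users, groups=2):
    """Days needed to reach the sample, rounded up to whole weeks
    so that every weekday and weekend day is covered equally."""
    days = ceil(n_per_group * groups / daily_users)
    return ceil(days / 7) * 7

# Hypothetical inputs: 10% baseline, 2 percentage point lift, 1000 users/day.
n = sample_size_per_group(baseline=0.10, mde_abs=0.02)
print(n, run_duration_days(n, daily_users=1000))
```

Halving the minimum detectable effect roughly quadruples the required sample, which is worth checking before committing to a very small effect size.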

Avoid:

Continuous peeking with fixed-horizon inference
Changing the success metric mid-test
Retroactive segmentation without multiple-comparison correction
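To see why peeking is a problem, the sketch below simulates A/A tests, where both groups share the same true conversion rate, and compares the false positive rate of "stop at the first significant look" against a single test at the planned sample size. The simulation sizes, the 10% rate, and the seed are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

def z_stat(conv_a, conv_b, n):
    """Pooled two-proportion z statistic for two groups of size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se

def simulate_aa(n_sims=300, n_per_group=2000, look_every=100,
                rate=0.10, alpha=0.05, seed=7):
    """Return (false positive rate with peeking, rate with one final test)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = fixed_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        rejected_at_any_look = False
        for i in range(1, n_per_group + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # Peek at the running result every `look_every` users per group.
            if i % look_every == 0 and abs(z_stat(conv_a, conv_b, i)) > z_crit:
                rejected_at_any_look = True
        peeking_hits += rejected_at_any_look
        fixed_hits += abs(z_stat(conv_a, conv_b, n_per_group)) > z_crit
    return peeking_hits / n_sims, fixed_hits / n_sims

peek_rate, fixed_rate = simulate_aa()
print(f"peeking: {peek_rate:.3f}  fixed horizon: {fixed_rate:.3f}")
```

Because there is no real difference between the groups, every rejection is a false positive. The peeking rate can never be lower than the fixed-horizon rate here, since the final look is one of the peeks.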

Novelty and Primacy Effects

Novelty effect: short-term spike due to newness, not durable value.
Primacy effect: users accustomed to the old experience respond poorly at first, so early results understate the long-run effect.

Mitigation:

Run long enough for behavior to stabilize.
Check returning users and delayed cohorts separately.
Re-run key tests when stakes are high.

Pre-Launch Checklist

Hypothesis complete (If/Then/Because)
Metric definitions frozen
Instrumentation validated
Randomization and assignment verified
Sample size and duration approved
Rollback plan documented
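One way to verify randomization and assignment is a sample ratio mismatch (SRM) check, which compares observed group sizes with the planned split. The sketch below uses a chi-square test with one degree of freedom; the function name `srm_p_value` and the 0.001 alert threshold are assumptions, not part of the checklist.

```python
from math import sqrt
from statistics import NormalDist

def srm_p_value(n_control, n_variant, expected_control_share=0.5):
    """P-value for the observed split under the planned allocation.

    For two groups the chi-square statistic has one degree of freedom,
    so its p-value equals the two-sided normal p-value of sqrt(chi2).
    """
    total = n_control + n_variant
    exp_control = total * expected_control_share
    exp_variant = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_variant - exp_variant) ** 2 / exp_variant)
    return 2 * (1 - NormalDist().cdf(sqrt(chi2)))

SRM_ALERT_THRESHOLD = 0.001  # assumed threshold; pick one before launch

# Hypothetical counts from an intended 50/50 split.
p = srm_p_value(5000, 5400)
print(p < SRM_ALERT_THRESHOLD)  # True: investigate assignment before reading results
```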

Post-Test Readout Template

Hypothesis and scope
Experiment setup and quality checks
Primary metric effect size + confidence interval
Guardrail status
Segment-level observations (pre-registered only)
Decision: ship, iterate, or reject
Follow-up experiments
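For the effect size and confidence interval in the readout, a minimal sketch for a conversion-style primary metric follows. It uses a Wald interval for the difference in proportions; the counts are hypothetical, and other interval methods may suit small samples or rates near 0 or 1 better.

```python
from math import sqrt
from statistics import NormalDist

def diff_in_proportions(conv_c, n_c, conv_v, n_v, confidence=0.95):
    """Absolute lift (variant minus control) with a Wald confidence interval."""
    p_c = conv_c / n_c
    p_v = conv_v / n_v
    diff = p_v - p_c
    # Unpooled standard error of the difference.
    se = sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se

# Hypothetical counts: 400/4000 control conversions, 480/4000 variant.
diff, low, high = diff_in_proportions(400, 4000, 480, 4000)
print(f"lift: {diff:.4f}  95% CI: [{low:.4f}, {high:.4f}]")
```

Reporting the interval alongside the point estimate shows how small the true effect could plausibly be, which matters for the ship, iterate, or reject decision.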