Experiment Playbook
Experiment Types
A/B Test
- Compare one control versus one variant.
- Best for high-confidence directional decisions.
Multivariate Test
- Test combinations of multiple factors.
- Useful for detecting interaction effects, but requires more traffic than an A/B test.
Holdout Test
- Keep a percentage of users unexposed to the intervention.
- Useful for measuring the incremental lift of a broader set of changes.
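One way to implement assignment for the test types above is a salted hash of the user ID, so each user lands in the same group on every visit. This is a minimal sketch, not a prescribed implementation: the function names, the salt format, and the 5% holdout default are assumptions made for illustration.

```python
import hashlib


def bucket(user_id: str, salt: str, buckets: int = 10_000) -> int:
    """Map a user to a stable bucket in [0, buckets) using a salted hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:15], 16) % buckets


def assign(user_id: str, experiment: str, holdout_pct: float = 5.0) -> str:
    """Return 'holdout', 'control', or 'variant' for a user.

    holdout_pct is a hypothetical default; pick the real value per experiment.
    """
    # The holdout uses its own salt so it is independent of the A/B split.
    if bucket(user_id, f"{experiment}:holdout") < holdout_pct * 100:
        return "holdout"
    # Remaining users are split evenly between control and variant.
    if bucket(user_id, f"{experiment}:ab") % 2 == 0:
        return "control"
    return "variant"
```

Using a different salt per experiment keeps assignments uncorrelated across experiments, which matters when several tests run on the same population.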
Metric Design
Primary Metric
- One metric that decides ship/no-ship.
- Must align with user value and business objective.
Guardrail Metrics
- Catch harm elsewhere in the product when the primary metric is optimized in isolation.
- Examples: error rate, latency, churn proxy, support contacts.
Diagnostic Metrics
- Explain why a change happened.
- Do not use as a decision gate unless pre-specified.
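The division of labor between the metric types can be sketched as a small decision function. This is an illustration under stated assumptions, not a rule the playbook mandates: it assumes a higher primary metric is better, a higher guardrail value is worse, and that the guardrail thresholds (the hypothetical `max_allowed` field) were agreed before launch. Diagnostic metrics deliberately do not appear in the gate.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    name: str
    observed_change: float  # relative change vs. control, e.g. 0.02 = +2%
    max_allowed: float      # hypothetical threshold fixed before launch


def decide(primary_ci_low: float, primary_ci_high: float,
           guardrails: list[Guardrail]) -> str:
    """Return 'ship', 'iterate', or 'reject' from pre-specified inputs only."""
    # Any guardrail breach overrides a positive primary result.
    if any(g.observed_change > g.max_allowed for g in guardrails):
        return "reject"
    # The whole confidence interval is above zero: a clear improvement.
    if primary_ci_low > 0:
        return "ship"
    # The whole confidence interval is below zero: a clear regression.
    if primary_ci_high < 0:
        return "reject"
    # The interval spans zero: the test did not settle the question.
    return "iterate"
```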
Stopping Rules
Define before launch:
- Fixed sample size per group
- Minimum run duration (to capture weekday/weekend behavior)
- Guardrail breach thresholds (pause criteria)
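The sample size and run duration can be estimated before launch. The sketch below uses the standard normal-approximation formula for comparing two proportions; the default significance level, the power, the 14-day minimum, and rounding up to whole weeks are assumptions chosen for illustration, not values from this playbook.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline: float, mde_abs: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per group to detect an absolute lift of mde_abs.

    baseline: control conversion rate, e.g. 0.10
    mde_abs:  minimum detectable effect in absolute terms, e.g. 0.01
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def run_days(n_per_group: int, daily_users: int,
             groups: int = 2, min_days: int = 14) -> int:
    """Days needed to reach the sample size, rounded up to whole weeks."""
    days = max(math.ceil(groups * n_per_group / daily_users), min_days)
    # Whole weeks so every weekday and weekend day is represented equally.
    return math.ceil(days / 7) * 7
```

For a 10% baseline and a one-point absolute lift, this formula asks for roughly 15,000 users per group at the default settings.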
Avoid:
- Continuous peeking with fixed-horizon inference (repeated looks inflate the false-positive rate)
- Changing the success metric mid-test
- Retroactive segmentation without multiple-comparison correction
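The cost of peeking can be shown with a small A/A simulation: both groups draw from the same distribution, so every significant result is a false positive. This is a sketch under simplifying assumptions (normally distributed metric with known unit variance, a two-sided z-test, ten evenly spaced looks); the function name and defaults are illustrative.

```python
import math
import random
from statistics import NormalDist


def false_positive_rates(sims: int = 400, n: int = 1000, looks: int = 10,
                         alpha: float = 0.05, seed: int = 7):
    """Return (rate when testing at every look, rate when testing once at n)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    checkpoints = {n * k // looks for k in range(1, looks + 1)}
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(sims):
        sum_a = sum_b = 0.0
        any_look_significant = False
        final_significant = False
        for i in range(1, n + 1):
            # A/A test: both groups have the same true mean.
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
            if i in checkpoints:
                z = (sum_a / i - sum_b / i) / math.sqrt(2 / i)
                significant = abs(z) > z_crit
                any_look_significant = any_look_significant or significant
                if i == n:
                    final_significant = significant
        peeking_hits += any_look_significant
        fixed_hits += final_significant
    return peeking_hits / sims, fixed_hits / sims
```

The fixed-horizon rate should land near the nominal alpha, while the rate for stopping at the first significant look comes out several times higher. Sequential testing methods exist for teams that need to look early; they are outside the scope of this sketch.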
Novelty and Primacy Effects
- Novelty effect: short-term spike due to newness, not durable value.
- Primacy effect: users accustomed to the old experience initially resist the change, so early results understate durable value.
Mitigation:
- Run long enough for behavior stabilization.
- Check returning users and delayed cohorts separately.
- Re-run key tests when stakes are high.
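One way to check for behavior stabilization is to compute the relative lift separately for each week since a user's first exposure. The sketch below assumes a hypothetical row format of `(group, days_since_first_exposure, converted)`; adapt it to whatever the event pipeline actually produces.

```python
from collections import defaultdict


def lift_by_exposure_week(rows):
    """Relative conversion lift of variant over control, per exposure week.

    rows: iterable of (group, days_since_first_exposure, converted), where
    group is 'control' or 'variant' and converted is truthy or falsy.
    Returns {week_index: relative_lift}.
    """
    # counts[week][group] = [users, conversions]
    counts = defaultdict(lambda: {"control": [0, 0], "variant": [0, 0]})
    for group, days, converted in rows:
        cell = counts[days // 7][group]
        cell[0] += 1
        cell[1] += int(bool(converted))
    lifts = {}
    for week in sorted(counts):
        n_c, x_c = counts[week]["control"]
        n_v, x_v = counts[week]["variant"]
        if n_c == 0 or n_v == 0 or x_c == 0:
            continue  # not enough data to compute a relative lift
        lifts[week] = (x_v / n_v) / (x_c / n_c) - 1
    return lifts
```

A lift that shrinks steadily from week to week suggests a novelty effect; a lift that starts low and grows suggests users are adapting to the change.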
Pre-Launch Checklist
- Hypothesis complete (If/Then/Because)
- Metric definitions frozen
- Instrumentation validated
- Randomization and assignment verified
- Sample size and duration approved
- Rollback plan documented
Post-Test Readout Template
- Hypothesis and scope
- Experiment setup and quality checks
- Primary metric effect size + confidence interval
- Guardrail status
- Segment-level observations (pre-registered only)
- Decision: ship, iterate, or reject
- Follow-up experiments
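For a conversion-style primary metric, the effect size and confidence interval in the readout can be computed as below. This is a sketch using the normal (Wald) approximation for a difference of two proportions; it assumes large samples and a binary outcome, and other metric types need a different estimator.

```python
import math
from statistics import NormalDist


def diff_with_ci(conv_control: int, n_control: int,
                 conv_variant: int, n_variant: int,
                 confidence: float = 0.95):
    """Return (absolute difference, CI lower bound, CI upper bound)."""
    p_c = conv_control / n_control
    p_v = conv_variant / n_variant
    diff = p_v - p_c
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_v * (1 - p_v) / n_variant)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se
```

Reporting the interval alongside the point estimate shows how large or small the true effect could plausibly be, which a bare p-value does not.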