# Experiment Playbook

## Experiment Types

### A/B Test
- Compare one control versus one variant.
- Best when a single change needs a clear, high-confidence ship/no-ship decision.

### Multivariate Test
- Test combinations of multiple factors.
- Useful for detecting interaction effects, but requires more traffic than an A/B test.

### Holdout Test
- Keep a percentage of users unexposed to the intervention.
- Useful for measuring the incremental lift of broader changes.
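
A minimal sketch of how users might be assigned to a holdout, control, or variant group. It assumes deterministic hash-based bucketing; the function name `assign`, the group labels, and the 5% default holdout are illustrative choices, not part of this playbook.

```python
import hashlib


def assign(user_id: str, experiment: str, holdout_pct: float = 0.05) -> str:
    """Return 'holdout', 'control', or 'variant' for a user.

    Hashing the experiment name together with the user ID keeps the
    assignment stable across sessions and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < holdout_pct:
        return "holdout"
    # Split the remaining users evenly between control and variant.
    midpoint = holdout_pct + (1 - holdout_pct) / 2
    return "control" if bucket < midpoint else "variant"
```

Because the assignment depends only on the inputs, calling `assign("user-42", "new-checkout")` twice returns the same group, so no assignment table is needed.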

## Metric Design

### Primary Metric
- One metric that decides ship/no-ship.
- Must align with user value and business objective.

### Guardrail Metrics
- Catch damage elsewhere when a change optimizes the primary metric locally.
- Examples: error rate, latency, churn proxy, support contacts.

### Diagnostic Metrics
- Explain why the change happened.
- Do not use as a decision gate unless pre-specified.

## Stopping Rules

Define before launch:
- Fixed sample size per group
- Minimum run duration (to capture weekday/weekend behavior)
- Guardrail breach thresholds (pause criteria)
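
The fixed sample size and minimum duration can be estimated before launch. The sketch below assumes a conversion-rate primary metric, a two-sided two-proportion z-test, and an even traffic split; the function names and the defaults (`alpha=0.05`, `power=0.8`) are illustrative assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline: float, mde_abs: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per group to detect an absolute lift of mde_abs."""
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def min_run_days(n_per_group: int, daily_users: int, groups: int = 2) -> int:
    """Days needed to reach the sample size, rounded up to whole weeks
    so that every weekday and weekend day is covered equally."""
    days = math.ceil(n_per_group * groups / daily_users)
    return math.ceil(days / 7) * 7


n = sample_size_per_group(baseline=0.10, mde_abs=0.01)
days = min_run_days(n, daily_users=5000)
```

Halving the minimum detectable effect roughly quadruples the required sample, which is usually the main lever when a test looks too long to run.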

Avoid:
- Continuous peeking with fixed-horizon inference (repeated significance checks inflate the false-positive rate)
- Changing the success metric mid-test
- Retroactive segmentation without multiple-comparison correction
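
When segments are examined after the fact, one way to apply a correction is the Holm-Bonferroni step-down procedure. This is a sketch of that technique, chosen here for illustration; the playbook does not prescribe a specific correction, and `holm_reject` is a hypothetical name.

```python
def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return, for each p-value, whether it stays significant after
    Holm-Bonferroni correction at family-wise error rate alpha."""
    m = len(p_values)
    # Visit p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The threshold loosens at each step: alpha/m, alpha/(m-1), ...
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


segment_p_values = [0.01, 0.04, 0.03, 0.005]
print(holm_reject(segment_p_values))  # → [True, False, False, True]
```

Note that 0.03 and 0.04 would each pass an uncorrected 0.05 threshold, which is exactly the kind of segment finding the correction is meant to filter out.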

## Novelty and Primacy Effects

- Novelty effect: short-term spike due to newness, not durable value.
- Primacy effect: users accustomed to the old experience initially resist the change, so early results understate its durable value.

Mitigation:
- Run long enough for behavior to stabilize.
- Check returning users and delayed cohorts separately.
- Re-run key tests when stakes are high.
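
One rough way to check for behavior stabilization is to compare relative lift period by period. The sketch below assumes weekly conversion rates per group are already available; the helper names and the `decay_ratio=0.5` threshold are hypothetical and should be tuned per product.

```python
def lift_by_period(control_rates: list[float],
                   variant_rates: list[float]) -> list[float]:
    """Relative lift of variant over control for each period (e.g. each week)."""
    return [(v - c) / c for c, v in zip(control_rates, variant_rates)]


def looks_like_novelty(lifts: list[float], decay_ratio: float = 0.5) -> bool:
    """Flag a positive early lift that has decayed below decay_ratio
    of its first-period value by the last period."""
    first, last = lifts[0], lifts[-1]
    return first > 0 and last < first * decay_ratio


weekly_lifts = lift_by_period([0.10, 0.10, 0.10], [0.13, 0.11, 0.105])
flagged = looks_like_novelty(weekly_lifts)  # lift fell from about 30% to about 5%
```

A flag here is a prompt to extend the test or inspect returning users separately, not a statistical conclusion on its own.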

## Pre-Launch Checklist

- [ ] Hypothesis complete (If/Then/Because)
- [ ] Metric definitions frozen
- [ ] Instrumentation validated
- [ ] Randomization and assignment verified
- [ ] Sample size and duration approved
- [ ] Rollback plan documented
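
For the randomization and assignment item, a common verification is a sample ratio mismatch (SRM) check. The sketch below assumes two groups and uses a chi-square goodness-of-fit test with one degree of freedom; `srm_p_value` and the 0.001 alert threshold in the usage lines are illustrative assumptions.

```python
import math


def srm_p_value(control_n: int, variant_n: int,
                expected_control_share: float = 0.5) -> float:
    """P-value that the observed split matches the intended split."""
    total = control_n + variant_n
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (variant_n - expected_variant) ** 2 / expected_variant)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


p = srm_p_value(control_n=5000, variant_n=5400)
if p < 0.001:  # assumed alert threshold
    print("Possible sample ratio mismatch: investigate assignment before reading results")
```

A very small p-value suggests the assignment or logging pipeline is dropping or duplicating users unevenly, which invalidates the metric comparison regardless of the observed effect.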

## Post-Test Readout Template

1. Hypothesis and scope
2. Experiment setup and quality checks
3. Primary metric effect size + confidence interval
4. Guardrail status
5. Segment-level observations (pre-registered only)
6. Decision: ship, iterate, or reject
7. Follow-up experiments
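
For item 3 of the template, the effect size and its confidence interval could be computed as follows. This sketch assumes a conversion-rate primary metric and a normal-approximation (Wald) interval for the difference in proportions; the function name and the example counts are hypothetical.

```python
import math
from statistics import NormalDist


def diff_in_proportions_ci(conv_c: int, n_c: int, conv_v: int, n_v: int,
                           confidence: float = 0.95) -> tuple[float, float, float]:
    """Return (absolute lift, lower bound, upper bound) for variant minus control."""
    p_c = conv_c / n_c
    p_v = conv_v / n_v
    diff = p_v - p_c
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


lift, low, high = diff_in_proportions_ci(conv_c=1000, n_c=10000,
                                         conv_v=1100, n_v=10000)
print(f"Absolute lift: {lift:.4f} (95% CI {low:.4f} to {high:.4f})")
```

Reporting the interval rather than only a p-value lets readers judge whether the plausible range of effects is large enough to justify shipping.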