Files
claude-skills-reference/product-team/experiment-designer/SKILL.md

3.1 KiB

name, description
name description
experiment-designer Use when planning product experiments, writing testable hypotheses, estimating sample size, prioritizing tests, or interpreting A/B outcomes with practical statistical rigor.

Experiment Designer

Design, prioritize, and evaluate product experiments with clear hypotheses and defensible decisions.

When To Use

Use this skill for:

  • A/B and multivariate experiment planning
  • Hypothesis writing and success criteria definition
  • Sample size and minimum detectable effect planning
  • Experiment prioritization with ICE scoring
  • Reading statistical output for product decisions

Core Workflow

  1. Write hypothesis in If/Then/Because format
  • If we change [intervention]
  • Then [metric] will change by [expected direction/magnitude]
  • Because [behavioral mechanism]
  1. Define metrics before running test
  • Primary metric: single decision metric
  • Guardrail metrics: quality/risk protection
  • Secondary metrics: diagnostics only
  1. Estimate sample size
  • Baseline conversion or baseline mean
  • Minimum detectable effect (MDE)
  • Significance level (alpha) and power

Use:

python3 scripts/sample_size_calculator.py --baseline-rate 0.12 --mde 0.02 --mde-type absolute
  1. Prioritize experiments with ICE
  • Impact: potential upside
  • Confidence: evidence quality
  • Ease: cost/speed/complexity

ICE Score = (Impact * Confidence * Ease) / 10

  1. Launch with stopping rules
  • Decide fixed sample size or fixed duration in advance
  • Avoid repeated peeking without proper method
  • Monitor guardrails continuously
  1. Interpret results
  • Statistical significance is not business significance
  • Compare point estimate + confidence interval to decision threshold
  • Investigate novelty effects and segment heterogeneity

Hypothesis Quality Checklist

  • Contains explicit intervention and audience
  • Specifies measurable metric change
  • States plausible causal reason
  • Includes expected minimum effect
  • Defines failure condition

Common Experiment Pitfalls

  • Underpowered tests leading to false negatives
  • Running too many simultaneous changes without isolation
  • Changing targeting or implementation mid-test
  • Stopping early on random spikes
  • Ignoring sample ratio mismatch and instrumentation drift
  • Declaring success from p-value without effect-size context

Statistical Interpretation Guardrails

  • p-value < alpha indicates evidence against null, not guaranteed truth.
  • Confidence interval crossing zero/no-effect means uncertain directional claim.
  • Wide intervals imply low precision even when significant.
  • Use practical significance thresholds tied to business impact.

See:

  • references/experiment-playbook.md
  • references/statistics-reference.md

Tooling

scripts/sample_size_calculator.py

Computes required sample size (per variant and total) from:

  • baseline rate
  • MDE (absolute or relative)
  • significance level (alpha)
  • statistical power

Example:

python3 scripts/sample_size_calculator.py \
  --baseline-rate 0.10 \
  --mde 0.015 \
  --mde-type absolute \
  --alpha 0.05 \
  --power 0.8