Files
claude-skills-reference/engineering/statistical-analyst/references/statistical-testing-concepts.md
Reza Rezvani 7c2564845a refactor(engineering): move statistical-analyst to engineering/, fix cross-refs
- Move from data-analysis/ to engineering/
- Fix 5 cross-references to use correct domain paths
- Fix Python 3.9 compat in sample_size_calculator.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 11:18:12 +02:00

6.1 KiB
Raw Blame History

Statistical Testing Concepts Reference

Deep-dive reference for the Statistical Analyst skill. Keeps SKILL.md lean while preserving the theory.


The Frequentist Framework

All tests in this skill operate in the frequentist framework: we define a null hypothesis (H₀) and an alternative (H₁), then ask "how often would we see data this extreme if H₀ were true?"

  • H₀ (null): No difference exists between control and treatment
  • H₁ (alternative): A difference exists (two-tailed)
  • p-value: P(observing this result or more extreme | H₀ is true)
  • α (significance level): The threshold we set in advance. Reject H₀ if p < α.

The p-value misconception

A p-value of 0.03 does not mean "there is a 97% chance the effect is real." It means: "If there were no effect, we would see data this extreme only 3% of the time."


Type I and Type II Errors

H₀ True H₀ False
Reject H₀ Type I Error (α) — False Positive Correct (Power = 1β)
Fail to reject H₀ Correct Type II Error (β) — False Negative
  • α (false positive rate): Typically 0.05. Reduce it when false positives are costly (medical trials, irreversible changes).
  • β (false negative rate): Typically 0.20 (power = 80%). Reduce it when missing real effects is costly.

Two-Proportion Z-Test

When: Comparing two binary conversion rates (e.g. clicked/not, signed up/not).

Assumptions:

  • Independent samples
  • n×p ≥ 5 and n×(1p) ≥ 5 for both groups (normal approximation valid)
  • No interference between units (SUTVA)

Formula:

z = (p̂₂  p̂₁) / √[p̄(1p̄)(1/n₁ + 1/n₂)]

where p̄ = (x₁ + x₂) / (n₁ + n₂)  (pooled proportion)

Effect size — Cohen's h:

h = 2 arcsin(√p₂)  2 arcsin(√p₁)

The arcsine transformation stabilizes variance across different baseline rates.


Welch's Two-Sample t-Test

When: Comparing means of a continuous metric between two groups (revenue, latency, session length).

Why Welch's (not Student's): Welch's t-test does not assume equal variances — it is strictly more general and loses little power when variances are equal. Always prefer it.

Formula:

t = (x̄₂  x̄₁) / √(s₁²/n₁ + s₂²/n₂)

WelchSatterthwaite df:
df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁1) + (s₂²/n₂)²/(n₂1)]

Effect size — Cohen's d:

d = (x̄₂  x̄₁) / s_pooled

s_pooled = √[((n₁1)s₁² + (n₂1)s₂²) / (n₁+n₂2)]

Warning for heavy-tailed metrics (revenue, LTV): Mean tests are sensitive to outliers. If the distribution has heavy tails, consider:

  1. Winsorizing at 99th percentile before testing
  2. Log-transforming (if values are positive)
  3. Using a non-parametric test (Mann-Whitney U) and flagging for human review

Chi-Square Test

When: Comparing categorical distributions (e.g. which plan users selected, which error type occurred).

Assumptions:

  • Expected count ≥ 5 per cell (otherwise, combine categories or use Fisher's exact)
  • Independent observations

Formula:

χ² = Σ (Oᵢ  Eᵢ)² / Eᵢ

df = k  1  (goodness-of-fit)
df = (r1)(c1)  (contingency table, r rows, c columns)

Effect size — Cramér's V:

V = √[χ² / (n × (min(r,c)  1))]

Wilson Score Interval

The standard confidence interval formula for proportions (p̂ ± z√(p̂(1p̂)/n)) can produce impossible values (< 0 or > 1) for small n or extreme p. The Wilson score interval fixes this:

center = (p̂ + z²/2n) / (1 + z²/n)
margin = z/(1+z²/n) × √(p̂(1p̂)/n + z²/4n²)
CI = [center  margin, center + margin]

Always use Wilson (or Clopper-Pearson) for proportions. The normal approximation is a historical artifact.


Sample Size & Power

Power: The probability of correctly detecting a real effect of size δ.

n = (z_α/2 + z_β)² × (σ₁² + σ₂²) / δ²    [means]
n = (z_α/2 + z_β)² × (p₁(1p₁) + p₂(1p₂)) / (p₂p₁)²    [proportions]

Key levers:

  • Increase n → more power (or detect smaller effects)
  • Increase MDE → smaller n (but you might miss smaller real effects)
  • Increase α → smaller n (but more false positives)
  • Increase power → larger n

The peeking problem: Checking results before the planned end date inflates your effective α. If you peek at 50%, 75%, and 100% of planned n, your true α is ~0.13 instead of 0.05 — a 2.6× inflation of false positives.

Solutions:

  • Pre-commit to a stopping rule and don't peek
  • Use sequential testing (SPRT) if early stopping is required
  • Use a Bonferroni-corrected α if you peek at scheduled intervals

Multiple Comparisons

Testing k hypotheses at α = 0.05 gives P(at least one false positive) ≈ 1 (1 0.05)^k

k tests P(≥1 false positive)
1 5%
3 14%
5 23%
10 40%
20 64%

Corrections:

  • Bonferroni: Use α/k per test. Conservative but simple. Appropriate for independent tests.
  • Benjamini-Hochberg (FDR): Controls false discovery rate, not family-wise error. Preferred when many tests are expected to be true positives.

SUTVA (Stable Unit Treatment Value Assumption)

A critical assumption for valid A/B tests: the outcome of unit i depends only on its own treatment assignment, not on other units' assignments.

Violations:

  • Social features (user A sees user B's activity — network spillover)
  • Shared inventory (one variant depletes shared stock)
  • Two-sided marketplaces (buyers and sellers interact)

Solutions:

  • Cluster randomization (randomize at the group/geography level)
  • Network A/B testing (graph-based splits)
  • Holdout-based testing

References

  • Imbens, G. & Rubin, D. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge.
  • Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments. Cambridge.
  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd ed.
  • Wilson, E.B. (1927). "Probable Inference, the Law of Succession, and Statistical Inference." JASA 22(158): 209212.