# Statistical Testing Concepts Reference
Deep-dive reference for the Statistical Analyst skill. Keeps SKILL.md lean while preserving the theory.
---
## The Frequentist Framework
All tests in this skill operate in the **frequentist framework**: we define a null hypothesis (H₀) and an alternative (H₁), then ask "how often would we see data this extreme if H₀ were true?"
- **H₀ (null):** No difference exists between control and treatment
- **H₁ (alternative):** A difference exists (two-tailed)
- **p-value:** P(observing this result or more extreme | H₀ is true)
- **α (significance level):** The threshold we set in advance. Reject H₀ if p < α.
### The p-value misconception
A p-value of 0.03 does **not** mean "there is a 97% chance the effect is real."
It means: "If there were no effect, we would see data this extreme only 3% of the time."
---
## Type I and Type II Errors
| | H₀ True | H₀ False |
|---|---|---|
| Reject H₀ | **Type I Error (α)** — False Positive | Correct (Power = 1 − β) |
| Fail to reject H₀ | Correct | **Type II Error (β)** — False Negative |
- **α** (false positive rate): Typically 0.05. Reduce it when false positives are costly (medical trials, irreversible changes).
- **β** (false negative rate): Typically 0.20 (power = 80%). Reduce it when missing real effects is costly.
---
## Two-Proportion Z-Test
**When:** Comparing two binary conversion rates (e.g. clicked/not, signed up/not).
**Assumptions:**
- Independent samples
- n×p ≥ 5 and n×(1 − p) ≥ 5 for both groups (normal approximation valid)
- No interference between units (SUTVA)
**Formula:**
```
z = (p̂₂ − p̂₁) / √[p̄(1 − p̄)(1/n₁ + 1/n₂)]
where p̄ = (x₁ + x₂) / (n₁ + n₂) (pooled proportion)
```
**Effect size — Cohen's h:**
```
h = 2 arcsin(√p₂) − 2 arcsin(√p₁)
```
The arcsine transformation stabilizes variance across different baseline rates.
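As a sketch, the formulas above translate to a few lines of standard-library Python (the function name and sample counts are illustrative):

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-tailed pooled z-test for two proportions, plus Cohen's h."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)                      # pooled proportion p̄
    se = math.sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Two-sided p-value: 2·(1 − Φ(|z|)), with Φ built from erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # Cohen's h via the arcsine transformation
    h = 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))
    return z, p_value, h

# Hypothetical example: 10% vs 13% conversion on 1,000 users per arm
z, p, h = two_proportion_ztest(100, 1000, 130, 1000)
```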
---
## Welch's Two-Sample t-Test
**When:** Comparing means of a continuous metric between two groups (revenue, latency, session length).
**Why Welch's (not Student's):**
Welch's t-test does not assume equal variances — it is strictly more general and loses little power when variances are equal. Always prefer it.
**Formula:**
```
t = (x̄₂ − x̄₁) / √(s₁²/n₁ + s₂²/n₂)
Welch–Satterthwaite df:
df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1)]
```
**Effect size — Cohen's d:**
```
d = (x̄₂ − x̄₁) / s_pooled
s_pooled = √[((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2)]
```
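A minimal sketch of the test and effect size, assuming scipy and numpy are available (the session-length data is made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical per-user session lengths (minutes) for two variants
control = np.array([10.1, 9.8, 10.5, 10.2, 9.9, 10.4, 10.0, 9.7])
treatment = np.array([11.0, 12.4, 10.9, 11.8, 12.1, 11.5, 10.8, 12.0])

# equal_var=False selects Welch's test with the Welch–Satterthwaite df
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Cohen's d from the pooled standard deviation
n1, n2 = len(control), len(treatment)
s1, s2 = control.std(ddof=1), treatment.std(ddof=1)
s_pooled = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (treatment.mean() - control.mean()) / s_pooled
```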
**Warning for heavy-tailed metrics (revenue, LTV):**
Mean tests are sensitive to outliers. If the distribution has heavy tails, consider:
1. Winsorizing at 99th percentile before testing
2. Log-transforming (if values are positive)
3. Using a non-parametric test (Mann-Whitney U) and flagging for human review
---
## Chi-Square Test
**When:** Comparing categorical distributions (e.g. which plan users selected, which error type occurred).
**Assumptions:**
- Expected count ≥ 5 per cell (otherwise, combine categories or use Fisher's exact)
- Independent observations
**Formula:**
```
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
df = k − 1 (goodness-of-fit)
df = (r − 1)(c − 1) (contingency table, r rows, c columns)
```
**Effect size — Cramér's V:**
```
V = √[χ² / (n × (min(r,c) − 1))]
```
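With scipy available, the contingency-table version looks like this (the plan-choice counts are invented for the example):

```python
import numpy as np
from scipy import stats

# Hypothetical 2×3 table: plan chosen (Basic / Pro / Team) by variant
observed = np.array([[120,  80, 40],    # control
                     [100, 110, 50]])   # treatment

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Cramér's V from the χ² statistic and table shape
n = observed.sum()
r, c = observed.shape
v = np.sqrt(chi2 / (n * (min(r, c) - 1)))
```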
---
## Wilson Score Interval
The standard confidence interval formula for proportions (`p̂ ± z√(p̂(1 − p̂)/n)`) can produce impossible values (< 0 or > 1) for small n or extreme p. The Wilson score interval fixes this:
```
center = (p̂ + z²/2n) / (1 + z²/n)
margin = z/(1 + z²/n) × √(p̂(1 − p̂)/n + z²/4n²)
CI = [center − margin, center + margin]
```
Always use Wilson (or Clopper-Pearson) for proportions. The normal approximation is a historical artifact.
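A direct transcription of the Wilson formulas into standard-library Python (function name is illustrative):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

# 1 conversion out of 20: the normal approximation would dip below zero here,
# while the Wilson interval stays inside [0, 1]
lo, hi = wilson_interval(1, 20)
```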
---
## Sample Size & Power
**Power:** The probability of correctly detecting a real effect of size δ.
```
n = (z_α/2 + z_β)² × (σ₁² + σ₂²) / δ² [means]
n = (z_α/2 + z_β)² × (p₁(1 − p₁) + p₂(1 − p₂)) / (p₂ − p₁)² [proportions]
```
**Key levers:**
- Increase n → more power (or detect smaller effects)
- Increase MDE → smaller n (but you might miss smaller real effects)
- Increase α → smaller n (but more false positives)
- Increase power → larger n
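The proportions formula can be sketched with only the standard library, since `statistics.NormalDist` provides the inverse normal CDF (function name and example rates are illustrative):

```python
import math
from statistics import NormalDist

def sample_size_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-proportion test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # z_{α/2}
    z_beta = NormalDist().inv_cdf(power)            # z_β
    n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return math.ceil(n)

# Detect a lift from 10% to 12% at α = 0.05 and 80% power
n_per_group = sample_size_proportions(0.10, 0.12)
```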
**The peeking problem:**
Checking results before the planned end date inflates your effective α. If you peek at 50%, 75%, and 100% of planned n, your true α is ~0.13 instead of 0.05 — a 2.6× inflation of false positives.
**Solutions:**
- Pre-commit to a stopping rule and don't peek
- Use sequential testing (SPRT) if early stopping is required
- Use a Bonferroni-corrected α if you peek at scheduled intervals
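The inflation from peeking can be demonstrated with a small A/A-test simulation, assuming numpy is available (the exact effective α depends on the peeking schedule and the stopping behavior):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_total = 4000, 1000
peeks = [n_total // 2, 3 * n_total // 4, n_total]   # look at 50%, 75%, 100%

false_positives = 0
for _ in range(n_sims):
    # A/A test: both arms draw from the same distribution, so H0 is true
    a = rng.standard_normal(n_total)
    b = rng.standard_normal(n_total)
    for n in peeks:
        z = (b[:n].mean() - a[:n].mean()) / np.sqrt(2 / n)
        if abs(z) > 1.96:        # "significant" at this peek → stop early
            false_positives += 1
            break

effective_alpha = false_positives / n_sims   # noticeably above the nominal 0.05
```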
---
## Multiple Comparisons
Testing k hypotheses at α = 0.05 gives P(at least one false positive) ≈ 1 − (1 − 0.05)^k
| k tests | P(≥1 false positive) |
|---|---|
| 1 | 5% |
| 3 | 14% |
| 5 | 23% |
| 10 | 40% |
| 20 | 64% |
**Corrections:**
- **Bonferroni:** Use α/k per test. Conservative but simple. Appropriate for independent tests.
- **Benjamini-Hochberg (FDR):** Controls false discovery rate, not family-wise error. Preferred when many tests are expected to be true positives.
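The Benjamini-Hochberg step-up procedure is short enough to sketch directly (the p-values are illustrative):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected while controlling the FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:    # BH step-up criterion
            k_max = rank                   # largest rank passing the threshold
    return sorted(order[:k_max])           # reject the k_max smallest p-values

# Five metrics tested in one experiment (made-up p-values)
rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.60])
```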
---
## SUTVA (Stable Unit Treatment Value Assumption)
A critical assumption for valid A/B tests: the outcome of unit i depends only on its own treatment assignment, not on other units' assignments.
**Violations:**
- Social features (user A sees user B's activity — network spillover)
- Shared inventory (one variant depletes shared stock)
- Two-sided marketplaces (buyers and sellers interact)
**Solutions:**
- Cluster randomization (randomize at the group/geography level)
- Network A/B testing (graph-based splits)
- Holdout-based testing
---
## References
- Imbens, G. & Rubin, D. (2015). *Causal Inference for Statistics, Social, and Biomedical Sciences*. Cambridge.
- Kohavi, R., Tang, D., & Xu, Y. (2020). *Trustworthy Online Controlled Experiments*. Cambridge.
- Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences*. 2nd ed.
- Wilson, E.B. (1927). "Probable Inference, the Law of Succession, and Statistical Inference." *JASA* 22(158): 209–212.