@@ -3,7 +3,7 @@
   "name": "claude-code-skills",
   "description": "Production-ready skill packages for AI agents - Marketing, Engineering, Product, C-Level, PM, and RA/QM",
   "repository": "https://github.com/alirezarezvani/claude-skills",
-  "total_skills": 192,
+  "total_skills": 193,
   "skills": [
     {
       "name": "contract-and-proposal-writer",
@@ -653,6 +653,12 @@
       "category": "engineering-advanced",
       "description": "Use when the user asks to write SQL queries, optimize database performance, generate migrations, explore database schemas, or work with ORMs like Prisma, Drizzle, TypeORM, or SQLAlchemy."
     },
+    {
+      "name": "statistical-analyst",
+      "source": "../../engineering/statistical-analyst",
+      "category": "engineering-advanced",
+      "description": "Run hypothesis tests, analyze A/B experiment results, calculate sample sizes, and interpret statistical significance with effect sizes. Use when you need to validate whether observed differences are real, size an experiment correctly before launch, or interpret test results with confidence."
+    },
     {
       "name": "tech-debt-tracker",
       "source": "../../engineering/tech-debt-tracker",
@@ -1175,7 +1181,7 @@
     "description": "Software engineering and technical skills"
   },
   "engineering-advanced": {
-    "count": 42,
+    "count": 43,
     "source": "../../engineering",
     "description": "Advanced engineering skills - agents, RAG, MCP, CI/CD, databases, observability"
   },
.codex/skills/statistical-analyst (symbolic link, 1 line)
@@ -0,0 +1 @@
../../engineering/statistical-analyst

engineering/statistical-analyst/SKILL.md (new file, 246 lines)
@@ -0,0 +1,246 @@
---
name: statistical-analyst
description: Run hypothesis tests, analyze A/B experiment results, calculate sample sizes, and interpret statistical significance with effect sizes. Use when you need to validate whether observed differences are real, size an experiment correctly before launch, or interpret test results with confidence.
---

You are an expert statistician and data scientist. Your goal is to help teams make decisions grounded in statistical evidence — not gut feel. You distinguish signal from noise, size experiments correctly before they start, and interpret results with full context: significance, effect size, power, and practical impact.

You treat "statistically significant" and "practically significant" as separate questions and always answer both.

---

## Entry Points

### Mode 1 — Analyze Experiment Results (A/B Test)
Use when an experiment has already run and you have result data.

1. **Clarify** — Confirm metric type (conversion rate, mean, count), sample sizes, and observed values
2. **Choose test** — Proportions → Z-test; Continuous means → t-test; Categorical → Chi-square
3. **Run** — Execute `hypothesis_tester.py` with appropriate method
4. **Interpret** — Report p-value, confidence interval, effect size (Cohen's d / Cohen's h / Cramér's V)
5. **Decide** — Ship / hold / extend using the decision framework below

### Mode 2 — Size an Experiment (Pre-Launch)
Use before launching a test to ensure it will be conclusive.

1. **Define** — Baseline rate, minimum detectable effect (MDE), significance level (α), power (1−β)
2. **Calculate** — Run `sample_size_calculator.py` to get required N per variant
3. **Sanity-check** — Confirm traffic volume can deliver N within acceptable time window
4. **Document** — Lock the stopping rule before launch to prevent p-hacking

### Mode 3 — Interpret Existing Numbers
Use when someone shares a result and asks "is this significant?" or "what does this mean?"

1. Ask for: sample sizes, observed values, baseline, and what decision depends on the result
2. Run the appropriate test
3. Report using the Bottom Line → What → Why → How to Act structure
4. Flag any validity threats (peeking, multiple comparisons, SUTVA violations)

---

## Tools

### `scripts/hypothesis_tester.py`
Run Z-test (proportions), two-sample t-test (means), or Chi-square test (categorical). Returns p-value, confidence interval, effect size, and a plain-English verdict.

```bash
# Z-test for two proportions (A/B conversion rates)
python3 scripts/hypothesis_tester.py --test ztest \
  --control-n 5000 --control-x 250 \
  --treatment-n 5000 --treatment-x 310

# Two-sample t-test (comparing means, e.g. revenue per user)
python3 scripts/hypothesis_tester.py --test ttest \
  --control-mean 42.3 --control-std 18.1 --control-n 800 \
  --treatment-mean 46.1 --treatment-std 19.4 --treatment-n 820

# Chi-square test (multi-category outcomes)
python3 scripts/hypothesis_tester.py --test chi2 \
  --observed "120,80,50" --expected "100,100,50"

# Output JSON for downstream use
python3 scripts/hypothesis_tester.py --test ztest \
  --control-n 5000 --control-x 250 \
  --treatment-n 5000 --treatment-x 310 \
  --format json
```

### `scripts/sample_size_calculator.py`
Calculate required sample size per variant before launching an experiment.

```bash
# Proportion test (conversion rate experiment)
python3 scripts/sample_size_calculator.py --test proportion \
  --baseline 0.05 --mde 0.20 --alpha 0.05 --power 0.80

# Mean test (continuous metric experiment)
python3 scripts/sample_size_calculator.py --test mean \
  --baseline-mean 42.3 --baseline-std 18.1 --mde 0.10 \
  --alpha 0.05 --power 0.80

# Show tradeoff table across power levels
python3 scripts/sample_size_calculator.py --test proportion \
  --baseline 0.05 --mde 0.20 --table

# Output JSON
python3 scripts/sample_size_calculator.py --test proportion \
  --baseline 0.05 --mde 0.20 --format json
```

### `scripts/confidence_interval.py`
Compute confidence intervals for a proportion or mean. Use for reporting observed metrics with uncertainty bounds.

```bash
# CI for a proportion
python3 scripts/confidence_interval.py --type proportion \
  --n 1200 --x 96

# CI for a mean
python3 scripts/confidence_interval.py --type mean \
  --n 800 --mean 42.3 --std 18.1

# Custom confidence level
python3 scripts/confidence_interval.py --type proportion \
  --n 1200 --x 96 --confidence 0.99

# Output JSON
python3 scripts/confidence_interval.py --type proportion \
  --n 1200 --x 96 --format json
```

---

## Test Selection Guide

| Scenario | Metric | Test |
|---|---|---|
| A/B conversion rate (clicked/not) | Proportion | Z-test for two proportions |
| A/B revenue, load time, session length | Continuous mean | Two-sample t-test (Welch's) |
| A/B/C/n multi-variant with categories | Categorical counts | Chi-square |
| Single sample vs. known value | Mean vs. constant | One-sample t-test |
| Non-normal data, small n | Rank-based | Use Mann-Whitney U (flag for human) |

**When NOT to use these tools:**
- n < 30 per group without checking normality
- Metrics with heavy tails (e.g. revenue with whales) — consider log transform or trimmed mean first
- Sequential / peeking scenarios — use sequential testing or SPRT instead
- Clustered data (e.g. users within countries) — standard tests assume independence

---

## Decision Framework (Post-Experiment)

Use this after running the test:

| p-value | Effect Size | Practical Impact | Decision |
|---|---|---|---|
| < α | Large / Medium | Meaningful | ✅ Ship |
| < α | Small | Negligible | ⚠️ Hold — statistically significant but not worth the complexity |
| ≥ α | — | — | 🔁 Extend (if underpowered) or ❌ Kill |
| < α | Any | Negative UX | ❌ Kill regardless |

**Always ask:** "If this effect were exactly as measured, would the business care?" If no — don't ship on significance alone.

---

## Effect Size Reference

Effect sizes translate statistical results into practical language:

**Cohen's d (means):**
| d | Interpretation |
|---|---|
| < 0.2 | Negligible |
| 0.2–0.5 | Small |
| 0.5–0.8 | Medium |
| > 0.8 | Large |

**Cohen's h (proportions):**
| h | Interpretation |
|---|---|
| < 0.2 | Negligible |
| 0.2–0.5 | Small |
| 0.5–0.8 | Medium |
| > 0.8 | Large |

**Cramér's V (chi-square):**
| V | Interpretation |
|---|---|
| < 0.1 | Negligible |
| 0.1–0.3 | Small |
| 0.3–0.5 | Medium |
| > 0.5 | Large |

---

## Proactive Risk Triggers

Surface these unprompted when you spot the signals:

- **Peeking / early stopping** — Running a test and checking results daily inflates false positive rate. Ask: "Did you look at results before the planned end date?"
- **Multiple comparisons** — Testing 10 metrics at α=0.05 gives ~40% chance of at least one false positive. Flag when > 3 metrics are being evaluated.
- **Underpowered test** — If n is below the required sample size, a non-significant result tells you nothing. Always check power retroactively.
- **SUTVA violations** — If users in control and treatment can interact (e.g. social features, shared inventory), the independence assumption breaks.
- **Simpson's Paradox** — An aggregate result can reverse when segmented. Flag when segment-level results are available.
- **Novelty effect** — Significant early results in UX tests often decay. Flag for post-novelty re-measurement.

---

## Output Artifacts

| Request | Deliverable |
|---|---|
| "Did our test win?" | Significance report: p-value, CI, effect size, verdict, caveats |
| "How big should our test be?" | Sample size report with power/MDE tradeoff table |
| "What's the confidence interval for X?" | CI report with margin of error and interpretation |
| "Is this difference real?" | Hypothesis test with plain-English conclusion |
| "How long should we run this?" | Duration estimate = (required N per variant) / (daily traffic per variant) |
| "We tested 5 things — what's significant?" | Multiple comparison analysis with Bonferroni-adjusted thresholds |
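
The duration deliverable in the table above is simple arithmetic; a minimal sketch, where both the required N (normally produced by `sample_size_calculator.py`) and the traffic level are assumed numbers for illustration:

```python
import math

# Hypothetical inputs: required N would come from sample_size_calculator.py;
# daily traffic per variant is an assumption about the site under test.
required_n_per_variant = 31000    # assumed calculator output
daily_traffic_per_variant = 2500  # assumed: 5,000 eligible visitors/day, split 50/50

# Round up: a partial final day still has to be run to reach N.
duration_days = math.ceil(required_n_per_variant / daily_traffic_per_variant)
print(f"Run the experiment for at least {duration_days} days")
```

Rounding up matters in practice: stopping a fraction of a day early means stopping below the pre-committed N.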

---

## Quality Loop

Tag every finding with confidence:

- 🟢 **Verified** — Test assumptions met, sufficient n, no validity threats
- 🟡 **Likely** — Minor assumption violations; interpret directionally
- 🔴 **Inconclusive** — Underpowered, peeking, or data integrity issue; do not act

---

## Communication Standard

Structure all results as:

**Bottom Line** — One sentence: "Treatment increased conversion by 1.2pp (95% CI: 0.4–2.0pp). Result is statistically significant (p=0.003) with a small effect (h=0.18). Recommend shipping."

**What** — The numbers: observed rates/means, difference, p-value, CI, effect size

**Why It Matters** — Business translation: what does the effect size mean in revenue, users, or decisions?

**How to Act** — Ship / hold / extend / kill with specific rationale

---

## Related Skills

| Skill | Use When |
|---|---|
| `marketing-skill/ab-test-setup` | Designing the experiment before it runs — randomization, instrumentation, holdout |
| `engineering/data-quality-auditor` | Verifying input data integrity before running any statistical test |
| `product-team/experiment-designer` | Structuring the hypothesis, success metrics, and guardrail metrics |
| `product-team/product-analytics` | Analyzing product funnel and retention metrics |
| `finance/saas-metrics-coach` | Interpreting SaaS KPIs that may feed into experiments (ARR, churn, LTV) |
| `marketing-skill/campaign-analytics` | Statistical analysis of marketing campaign performance |

**When NOT to use this skill:**
- You need to design or instrument the experiment — use `marketing-skill/ab-test-setup` or `product-team/experiment-designer`
- You need to clean or validate the input data — use `engineering/data-quality-auditor` first
- You need Bayesian inference or multi-armed bandit analysis — flag that frequentist tests may not be appropriate

---

## References

- `references/statistical-testing-concepts.md` — t-test, Z-test, chi-square theory; p-value interpretation; Type I/II errors; power analysis math

engineering/statistical-analyst/references/statistical-testing-concepts.md (new file, 189 lines)
@@ -0,0 +1,189 @@

# Statistical Testing Concepts Reference

Deep-dive reference for the Statistical Analyst skill. Keeps SKILL.md lean while preserving the theory.

---

## The Frequentist Framework

All tests in this skill operate in the **frequentist framework**: we define a null hypothesis (H₀) and an alternative (H₁), then ask "how often would we see data this extreme if H₀ were true?"

- **H₀ (null):** No difference exists between control and treatment
- **H₁ (alternative):** A difference exists (two-tailed)
- **p-value:** P(observing this result or more extreme | H₀ is true)
- **α (significance level):** The threshold we set in advance. Reject H₀ if p < α.

### The p-value misconception
A p-value of 0.03 does **not** mean "there is a 97% chance the effect is real."
It means: "If there were no effect, we would see data this extreme only 3% of the time."
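
A quick way to ground this definition: simulate A/A experiments, where H₀ is true by construction. The fraction of "significant" results converges to α, not to the share of real effects. A stdlib-only sketch (all numbers illustrative, fixed seed for reproducibility):

```python
import math
import random

random.seed(7)

def ztest_p(cn, cx, tn, tx):
    """Two-tailed p-value for a two-proportion Z-test with pooled SE."""
    p_pool = (cx + tx) / (cn + tn)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / cn + 1 / tn))
    z = ((tx / tn) - (cx / cn)) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

# 2,000 A/A experiments: both arms share the same true 5% conversion rate,
# so every "significant" result is a false positive by construction.
n, p_true, sims = 1000, 0.05, 2000
false_positives = sum(
    ztest_p(n, sum(random.random() < p_true for _ in range(n)),
            n, sum(random.random() < p_true for _ in range(n))) < 0.05
    for _ in range(sims)
)
print(false_positives / sims)  # close to 0.05, the chosen alpha
```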

---

## Type I and Type II Errors

| | H₀ True | H₀ False |
|---|---|---|
| Reject H₀ | **Type I Error (α)** — False Positive | Correct (Power = 1−β) |
| Fail to reject H₀ | Correct | **Type II Error (β)** — False Negative |

- **α** (false positive rate): Typically 0.05. Reduce it when false positives are costly (medical trials, irreversible changes).
- **β** (false negative rate): Typically 0.20 (power = 80%). Reduce it when missing real effects is costly.

---

## Two-Proportion Z-Test

**When:** Comparing two binary conversion rates (e.g. clicked/not, signed up/not).

**Assumptions:**
- Independent samples
- n×p ≥ 5 and n×(1−p) ≥ 5 for both groups (normal approximation valid)
- No interference between units (SUTVA)

**Formula:**
```
z = (p̂₂ − p̂₁) / √[p̄(1−p̄)(1/n₁ + 1/n₂)]

where p̄ = (x₁ + x₂) / (n₁ + n₂)   (pooled proportion)
```

**Effect size — Cohen's h:**
```
h = 2 arcsin(√p₂) − 2 arcsin(√p₁)
```
The arcsine transformation stabilizes variance across different baseline rates.
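
The formulas above can be evaluated directly with the stdlib. A sketch using the illustrative 250/5000 vs 310/5000 counts from the SKILL.md CLI example (assumed data, not real measurements):

```python
import math

cn, cx = 5000, 250   # control: sample size, conversions
tn, tx = 5000, 310   # treatment: sample size, conversions

p1, p2 = cx / cn, tx / tn
p_bar = (cx + tx) / (cn + tn)                       # pooled proportion
se = math.sqrt(p_bar * (1 - p_bar) * (1/cn + 1/tn))
z = (p2 - p1) / se
p_value = math.erfc(abs(z) / math.sqrt(2))          # two-tailed p-value
h = 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))

# Note: h is negligible by Cohen's benchmarks even though p < 0.05,
# which is exactly the significant-vs-practical distinction this skill makes.
print(round(z, 2), round(p_value, 4), round(h, 3))
```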

---

## Welch's Two-Sample t-Test

**When:** Comparing means of a continuous metric between two groups (revenue, latency, session length).

**Why Welch's (not Student's):**
Welch's t-test does not assume equal variances — it is strictly more general and loses little power when variances are equal. Always prefer it.

**Formula:**
```
t = (x̄₂ − x̄₁) / √(s₁²/n₁ + s₂²/n₂)

Welch–Satterthwaite df:
df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)]
```

**Effect size — Cohen's d:**
```
d = (x̄₂ − x̄₁) / s_pooled

s_pooled = √[((n₁−1)s₁² + (n₂−1)s₂²) / (n₁+n₂−2)]
```
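
A stdlib sketch of the same three formulas, using the illustrative t-test summary stats from the SKILL.md CLI example (assumed data):

```python
import math

m1, s1, n1 = 42.3, 18.1, 800   # control: mean, std, n
m2, s2, n2 = 46.1, 19.4, 820   # treatment: mean, std, n

v1, v2 = s1**2 / n1, s2**2 / n2
t = (m2 - m1) / math.sqrt(v1 + v2)                          # Welch's t
df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))   # Welch-Satterthwaite

s_pooled = math.sqrt(((n1 - 1)*s1**2 + (n2 - 1)*s2**2) / (n1 + n2 - 2))
d = (m2 - m1) / s_pooled                                    # Cohen's d

print(round(t, 2), round(df), round(d, 2))
```

With df in the thousands the t-distribution is effectively normal, which is why the script's `t_cdf` falls back to the normal CDF for df > 1000.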

**Warning for heavy-tailed metrics (revenue, LTV):**
Mean tests are sensitive to outliers. If the distribution has heavy tails, consider:
1. Winsorizing at 99th percentile before testing
2. Log-transforming (if values are positive)
3. Using a non-parametric test (Mann-Whitney U) and flagging for human review

---

## Chi-Square Test

**When:** Comparing categorical distributions (e.g. which plan users selected, which error type occurred).

**Assumptions:**
- Expected count ≥ 5 per cell (otherwise, combine categories or use Fisher's exact)
- Independent observations

**Formula:**
```
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

df = k − 1          (goodness-of-fit)
df = (r−1)(c−1)     (contingency table, r rows, c columns)
```

**Effect size — Cramér's V:**
```
V = √[χ² / (n × (min(r,c) − 1))]
```
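
A goodness-of-fit sketch with the same counts as the `--test chi2` CLI example, using the k - 1 convention that the script's `cramers_v` helper applies:

```python
import math

# Same illustrative counts as the hypothesis_tester.py chi2 usage example
observed = [120, 80, 50]
expected = [100, 100, 50]

chi2 = sum((o - e)**2 / e for o, e in zip(observed, expected))
n = sum(observed)
k = len(observed)                       # goodness-of-fit: df = k - 1
v = math.sqrt(chi2 / (n * (k - 1)))     # Cramér's V

print(chi2, round(v, 3))
```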

---

## Wilson Score Interval

The standard confidence interval formula for proportions (`p̂ ± z√(p̂(1−p̂)/n)`) can produce impossible values (< 0 or > 1) for small n or extreme p. The Wilson score interval fixes this:

```
center = (p̂ + z²/2n) / (1 + z²/n)
margin = z/(1+z²/n) × √(p̂(1−p̂)/n + z²/4n²)
CI = [center − margin, center + margin]
```

Always use Wilson (or Clopper-Pearson) for proportions. The normal approximation is a historical artifact.
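
The Wilson formula above, evaluated for the n=1200, x=96 example from the `confidence_interval.py` usage section. The critical value is hardcoded here for brevity (the script derives it from `normal_ppf`):

```python
import math

n, x = 1200, 96          # counts from the confidence_interval.py example
p = x / n                # observed rate: 0.08
z = 1.959964             # two-sided 95% critical value

z2 = z * z
center = (p + z2 / (2 * n)) / (1 + z2 / n)
margin = (z / (1 + z2 / n)) * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
lo, hi = center - margin, center + margin

print(f"95% Wilson CI: [{lo:.4f}, {hi:.4f}]")
```

Note the interval is not symmetric around p̂ = 0.08; the center is pulled slightly toward 0.5, which is what keeps the bounds inside [0, 1].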

---

## Sample Size & Power

**Power:** The probability of correctly detecting a real effect of size δ.

```
n = (z_α/2 + z_β)² × (σ₁² + σ₂²) / δ²                      [means]
n = (z_α/2 + z_β)² × (p₁(1−p₁) + p₂(1−p₂)) / (p₂−p₁)²      [proportions]
```
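
Evaluating the proportions formula for the running example (5% baseline, 20% relative MDE, α = 0.05, power = 0.80). The z constants are hardcoded as assumptions here; the real calculator derives them from an inverse normal CDF:

```python
import math

p1 = 0.05                 # baseline conversion rate
p2 = 0.06                 # baseline lifted by a 20% relative MDE
z_alpha = 1.959964        # z_{alpha/2} for alpha = 0.05, two-sided
z_beta = 0.841621         # z_beta for power = 0.80

n = (z_alpha + z_beta)**2 * (p1*(1 - p1) + p2*(1 - p2)) / (p2 - p1)**2
n = math.ceil(n)          # required sample size per variant

print(n)
```

Roughly 8,000+ users per variant to detect a 1pp absolute lift on a 5% baseline. Halving the MDE roughly quadruples n, since δ enters the denominator squared.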

**Key levers:**
- Increase n → more power (or detect smaller effects)
- Increase MDE → smaller n (but you might miss smaller real effects)
- Increase α → smaller n (but more false positives)
- Increase power → larger n

**The peeking problem:**
Checking results before the planned end date inflates your effective α. If you peek at 50%, 75%, and 100% of planned n, your true α is ~0.13 instead of 0.05 — a 2.6× inflation of false positives.

**Solutions:**
- Pre-commit to a stopping rule and don't peek
- Use sequential testing (SPRT) if early stopping is required
- Use a Bonferroni-corrected α if you peek at scheduled intervals

---

## Multiple Comparisons

Testing k hypotheses at α = 0.05 gives P(at least one false positive) ≈ 1 − (1 − 0.05)^k

| k tests | P(≥1 false positive) |
|---|---|
| 1 | 5% |
| 3 | 14% |
| 5 | 23% |
| 10 | 40% |
| 20 | 64% |

**Corrections:**
- **Bonferroni:** Use α/k per test. Conservative but simple. Appropriate for independent tests.
- **Benjamini-Hochberg (FDR):** Controls false discovery rate, not family-wise error. Preferred when many tests are expected to be true positives.
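
The table above and the Bonferroni correction are one line of arithmetic each; a minimal sketch for the k = 10 row:

```python
k = 10           # number of hypotheses tested
alpha = 0.05

family_wise = 1 - (1 - alpha)**k       # P(at least one false positive)
bonferroni_alpha = alpha / k           # per-test threshold after correction

print(f"uncorrected family-wise rate: {family_wise:.0%}")
print(f"Bonferroni per-test alpha:    {bonferroni_alpha}")
```

The approximation assumes independent tests; for positively correlated metrics the true family-wise rate is somewhat lower, which is part of why Bonferroni is conservative.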

---

## SUTVA (Stable Unit Treatment Value Assumption)

A critical assumption for valid A/B tests: the outcome of unit i depends only on its own treatment assignment, not on other units' assignments.

**Violations:**
- Social features (user A sees user B's activity — network spillover)
- Shared inventory (one variant depletes shared stock)
- Two-sided marketplaces (buyers and sellers interact)

**Solutions:**
- Cluster randomization (randomize at the group/geography level)
- Network A/B testing (graph-based splits)
- Holdout-based testing

---

## References

- Imbens, G. & Rubin, D. (2015). *Causal Inference for Statistics, Social, and Biomedical Sciences*. Cambridge.
- Kohavi, R., Tang, D., & Xu, Y. (2020). *Trustworthy Online Controlled Experiments*. Cambridge.
- Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences*. 2nd ed.
- Wilson, E.B. (1927). "Probable Inference, the Law of Succession, and Statistical Inference." *JASA* 22(158): 209–212.

engineering/statistical-analyst/scripts/confidence_interval.py (new file, 198 lines)
@@ -0,0 +1,198 @@

#!/usr/bin/env python3
"""
confidence_interval.py — Confidence intervals for proportions and means.

Methods:
    proportion — Wilson score interval (recommended over normal approximation for small n or extreme p)
    mean — z-based normal-approximation interval; results with n < 30 are flagged for caution

Usage:
    python3 confidence_interval.py --type proportion --n 1200 --x 96
    python3 confidence_interval.py --type mean --n 800 --mean 42.3 --std 18.1
    python3 confidence_interval.py --type proportion --n 1200 --x 96 --confidence 0.99
    python3 confidence_interval.py --type proportion --n 1200 --x 96 --format json
"""

import argparse
import json
import math
import sys


def normal_ppf(p: float) -> float:
    """Inverse normal CDF via bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * math.erfc(-mid / math.sqrt(2)) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def wilson_interval(n: int, x: int, confidence: float) -> dict:
    """
    Wilson score confidence interval for a proportion.
    More accurate than normal approximation, especially for small n or p near 0/1.
    """
    if n <= 0:
        return {"error": "n must be positive"}
    if x < 0 or x > n:
        return {"error": "x must be between 0 and n"}

    p_hat = x / n
    z = normal_ppf(1 - (1 - confidence) / 2)
    z2 = z ** 2

    center = (p_hat + z2 / (2 * n)) / (1 + z2 / n)
    margin = (z / (1 + z2 / n)) * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n ** 2))

    lo = max(0.0, center - margin)
    hi = min(1.0, center + margin)

    # Normal approximation for comparison
    se = math.sqrt(p_hat * (1 - p_hat) / n) if n > 0 else 0
    normal_lo = max(0.0, p_hat - z * se)
    normal_hi = min(1.0, p_hat + z * se)

    return {
        "type": "proportion",
        "method": "Wilson score interval",
        "n": n,
        "successes": x,
        "observed_rate": round(p_hat, 6),
        "confidence": confidence,
        "lower": round(lo, 6),
        "upper": round(hi, 6),
        "margin_of_error": round((hi - lo) / 2, 6),
        "normal_approximation": {
            "lower": round(normal_lo, 6),
            "upper": round(normal_hi, 6),
            "note": "Wilson is preferred; normal approx shown for reference",
        },
    }


def mean_interval(n: int, mean: float, std: float, confidence: float) -> dict:
    """
    Confidence interval for a mean.
    Uses a z-based normal approximation; results with n < 30 are flagged for caution.
    """
    if n <= 1:
        return {"error": "n must be > 1"}
    if std < 0:
        return {"error": "std must be non-negative"}

    se = std / math.sqrt(n)
    z = normal_ppf(1 - (1 - confidence) / 2)

    lo = mean - z * se
    hi = mean + z * se
    moe = z * se

    rel_moe = moe / abs(mean) * 100 if mean != 0 else None

    precision_note = ""
    if rel_moe is not None and rel_moe > 20:
        precision_note = "Wide CI — consider increasing sample size for tighter estimates."
    elif rel_moe is not None and rel_moe < 5:
        precision_note = "Tight CI — high precision estimate."

    return {
        "type": "mean",
        "method": "Normal approximation (z-based)" if n >= 30 else "Use with caution (n < 30)",
        "n": n,
        "observed_mean": round(mean, 6),
        "std": round(std, 6),
        "standard_error": round(se, 6),
        "confidence": confidence,
        "lower": round(lo, 6),
        "upper": round(hi, 6),
        "margin_of_error": round(moe, 6),
        "relative_margin_of_error_pct": round(rel_moe, 2) if rel_moe is not None else None,
        "precision_note": precision_note,
    }


def print_report(result: dict):
    if "error" in result:
        print(f"Error: {result['error']}", file=sys.stderr)
        sys.exit(1)

    conf_pct = int(result["confidence"] * 100)
    print("=" * 60)
    print(" CONFIDENCE INTERVAL REPORT")
    print("=" * 60)
    print(f" Method: {result['method']}")
    print(f" Confidence level: {conf_pct}%")
    print()

    if result["type"] == "proportion":
        print(f" Observed rate: {result['observed_rate']:.4%} ({result['successes']}/{result['n']})")
        print()
        print(f" {conf_pct}% CI: [{result['lower']:.4%}, {result['upper']:.4%}]")
        print(f" Margin of error: ±{result['margin_of_error']:.4%}")
        print()
        norm = result.get("normal_approximation", {})
        print(f" Normal approx CI (ref): [{norm.get('lower', 0):.4%}, {norm.get('upper', 0):.4%}]")

    elif result["type"] == "mean":
        print(f" Observed mean: {result['observed_mean']} (std={result['std']}, n={result['n']})")
        print(f" Standard error: {result['standard_error']}")
        print()
        print(f" {conf_pct}% CI: [{result['lower']}, {result['upper']}]")
        print(f" Margin of error: ±{result['margin_of_error']}")
        if result.get("relative_margin_of_error_pct") is not None:
            print(f" Relative MoE: ±{result['relative_margin_of_error_pct']:.1f}%")
        if result.get("precision_note"):
            print(f"\n ℹ️ {result['precision_note']}")

    print()
    # Interpretation guide
    print(" Interpretation: If this experiment were repeated many times,")
    print(f" {conf_pct}% of the computed intervals would contain the true value.")
    print(f" This does NOT mean there is a {conf_pct}% chance the true value is")
    print(" in this specific interval — it either is or it isn't.")
    print("=" * 60)


def main():
    parser = argparse.ArgumentParser(
        description="Compute confidence intervals for proportions and means."
    )
    parser.add_argument("--type", choices=["proportion", "mean"], required=True)
    parser.add_argument("--confidence", type=float, default=0.95,
                        help="Confidence level (default: 0.95)")
    parser.add_argument("--format", choices=["text", "json"], default="text")

    # Proportion
    parser.add_argument("--n", type=int, help="Total sample size")
    parser.add_argument("--x", type=int, help="Number of successes (for proportion)")

    # Mean
    parser.add_argument("--mean", type=float, help="Observed mean")
    parser.add_argument("--std", type=float, help="Observed standard deviation")

    args = parser.parse_args()

    if args.type == "proportion":
        if args.n is None or args.x is None:
            print("Error: --n and --x are required for proportion CI", file=sys.stderr)
            sys.exit(1)
        result = wilson_interval(args.n, args.x, args.confidence)

    elif args.type == "mean":
        if args.n is None or args.mean is None or args.std is None:
            print("Error: --n, --mean, and --std are required for mean CI", file=sys.stderr)
            sys.exit(1)
        result = mean_interval(args.n, args.mean, args.std, args.confidence)

    if args.format == "json":
        print(json.dumps(result, indent=2))
    else:
        print_report(result)


if __name__ == "__main__":
    main()

engineering/statistical-analyst/scripts/hypothesis_tester.py (new file, 471 lines)
@@ -0,0 +1,471 @@

#!/usr/bin/env python3
"""
hypothesis_tester.py — Z-test (proportions), Welch's t-test (means), Chi-square (categorical).

All math uses Python stdlib (math module only). No scipy, numpy, or pandas required.

Usage:
    python3 hypothesis_tester.py --test ztest \
        --control-n 5000 --control-x 250 \
        --treatment-n 5000 --treatment-x 310

    python3 hypothesis_tester.py --test ttest \
        --control-mean 42.3 --control-std 18.1 --control-n 800 \
        --treatment-mean 46.1 --treatment-std 19.4 --treatment-n 820

    python3 hypothesis_tester.py --test chi2 \
        --observed "120,80,50" --expected "100,100,50"
"""

import argparse
import json
import math
import sys


# ---------------------------------------------------------------------------
# Normal / t-distribution approximations (stdlib only)
# ---------------------------------------------------------------------------

def normal_cdf(z: float) -> float:
    """Cumulative distribution function of standard normal using math.erfc."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def normal_ppf(p: float) -> float:
    """Percent-point function (inverse CDF) of standard normal via bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_cdf(t: float, df: float) -> float:
    """
    CDF of the t-distribution via the regularized incomplete beta function.
    With x = df/(df + t²): P(T ≤ t) = I_x(df/2, 1/2)/2 for t ≤ 0,
    and 1 − I_x(df/2, 1/2)/2 for t > 0.
    Falls back to normal CDF for large df (> 1000).
    """
    if df > 1000:
        return normal_cdf(t)
    x = df / (df + t * t)
    # Regularized incomplete beta via continued fraction (Lentz)
    ib = _regularized_incomplete_beta(x, df / 2, 0.5)
    p = ib / 2
    return p if t <= 0 else 1 - p


def _regularized_incomplete_beta(x: float, a: float, b: float) -> float:
    """Regularized incomplete beta I_x(a,b) via continued fraction expansion."""
    if x < 0 or x > 1:
        return 0.0
    if x == 0:
        return 0.0
    if x == 1:
        return 1.0
    lbeta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    front = math.exp(math.log(x) * a + math.log(1 - x) * b - lbeta) / a
    # Use symmetry for better convergence
    if x > (a + 1) / (a + b + 2):
        return 1 - _regularized_incomplete_beta(1 - x, b, a)
    # Lentz continued fraction
    TINY = 1e-30
    f = TINY
    C = f
    D = 0.0
    for m in range(200):
        for s in (0, 1):
            if m == 0 and s == 0:
                num = 1.0
            elif s == 0:
                num = m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m))
            else:
                num = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))
            D = 1 + num * D
            if abs(D) < TINY:
                D = TINY
            D = 1 / D
            C = 1 + num / C
            if abs(C) < TINY:
                C = TINY
            f *= C * D
            if abs(C * D - 1) < 1e-10:
                break
    return front * f
|
||||
|
||||
|
||||
def two_tail_p_normal(z: float) -> float:
    return 2 * (1 - normal_cdf(abs(z)))


def two_tail_p_t(t: float, df: float) -> float:
    return 2 * (1 - t_cdf(abs(t), df))


# ---------------------------------------------------------------------------
# Effect sizes
# ---------------------------------------------------------------------------


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def cohens_d(mean1: float, std1: float, n1: int, mean2: float, std2: float, n2: int) -> float:
    """Cohen's d using the pooled standard deviation."""
    pooled = math.sqrt(((n1 - 1) * std1 ** 2 + (n2 - 1) * std2 ** 2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled if pooled else 0.0


def cramers_v(chi2: float, n: int, k: int) -> float:
    """Cramér's V effect size for a chi-square test."""
    return math.sqrt(chi2 / (n * (k - 1))) if n and k > 1 else 0.0


def effect_label(val: float, metric: str) -> str:
    """Map an effect-size value to Cohen's conventional magnitude labels."""
    thresholds = {"h": [0.2, 0.5, 0.8], "d": [0.2, 0.5, 0.8], "v": [0.1, 0.3, 0.5]}
    t = thresholds.get(metric, [0.2, 0.5, 0.8])
    v = abs(val)
    if v < t[0]:
        return "negligible"
    if v < t[1]:
        return "small"
    if v < t[2]:
        return "medium"
    return "large"


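As a quick sanity check of the effect-size helpers, this standalone sketch computes Cohen's h for a hypothetical 5% vs 6% conversion comparison (the rates are illustrative, not from any real experiment):

```python
import math

# Cohen's h for 5% vs 6% conversion rates (illustrative numbers)
h = 2 * math.asin(math.sqrt(0.06)) - 2 * math.asin(math.sqrt(0.05))
print(round(abs(h), 4))  # ≈ 0.0439 → "negligible" under the 0.2 / 0.5 / 0.8 thresholds
```

Note that a lift that looks large in relative terms (+20%) can still be a negligible effect on the arcsine scale when the base rates are small.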
# ---------------------------------------------------------------------------
# Tests
# ---------------------------------------------------------------------------

def ztest_proportions(cn: int, cx: int, tn: int, tx: int, alpha: float) -> dict:
    """Two-proportion Z-test."""
    if cn <= 0 or tn <= 0:
        return {"error": "Sample sizes must be positive."}

    p_c = cx / cn
    p_t = tx / tn
    p_pool = (cx + tx) / (cn + tn)

    se = math.sqrt(p_pool * (1 - p_pool) * (1 / cn + 1 / tn))
    if se == 0:
        return {"error": "Standard error is zero — check input values."}

    z = (p_t - p_c) / se
    p_value = two_tail_p_normal(z)

    # Confidence interval for the difference (unpooled SE)
    se_diff = math.sqrt(p_c * (1 - p_c) / cn + p_t * (1 - p_t) / tn)
    z_crit = normal_ppf(1 - alpha / 2)
    diff = p_t - p_c
    ci_lo = diff - z_crit * se_diff
    ci_hi = diff + z_crit * se_diff

    h = cohens_h(p_t, p_c)
    lift = (p_t - p_c) / p_c * 100 if p_c else 0

    return {
        "test": "Two-proportion Z-test",
        "control": {"n": cn, "conversions": cx, "rate": round(p_c, 6)},
        "treatment": {"n": tn, "conversions": tx, "rate": round(p_t, 6)},
        "difference": round(diff, 6),
        "relative_lift_pct": round(lift, 2),
        "z_statistic": round(z, 4),
        "p_value": round(p_value, 6),
        "significant": p_value < alpha,
        "alpha": alpha,
        "confidence_interval": {
            "level": f"{int((1 - alpha) * 100)}%",
            "lower": round(ci_lo, 6),
            "upper": round(ci_hi, 6),
        },
        "effect_size": {
            "cohens_h": round(abs(h), 4),
            "interpretation": effect_label(h, "h"),
        },
    }


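The core arithmetic of the two-proportion Z-test above can be checked by hand. A standalone sketch with illustrative counts (10,000 users per arm, 5.0% vs 5.6% conversion; `erfc(|z|/√2)` is the same two-tailed p-value as `2 * (1 - normal_cdf(|z|))`):

```python
import math

cn, cx = 10_000, 500   # control: n, conversions  (5.0%)
tn, tx = 10_000, 560   # treatment: n, conversions (5.6%)

p_c, p_t = cx / cn, tx / tn
p_pool = (cx + tx) / (cn + tn)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / cn + 1 / tn))
z = (p_t - p_c) / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed normal p-value
print(f"z={z:.3f}  p={p_value:.4f}")  # z≈1.894, p≈0.058 → not significant at α=0.05
```

A +12% relative lift on these sample sizes still misses α = 0.05, which is exactly the situation the sample-size calculator below is meant to prevent.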
def ttest_means(cm: float, cs: float, cn: int, tm: float, ts: float, tn: int, alpha: float) -> dict:
    """Welch's two-sample t-test (unequal variances)."""
    if cn < 2 or tn < 2:
        return {"error": "Each group needs at least 2 observations."}

    se = math.sqrt(cs ** 2 / cn + ts ** 2 / tn)
    if se == 0:
        return {"error": "Standard error is zero — check std values."}

    t = (tm - cm) / se

    # Welch–Satterthwaite degrees of freedom
    num = (cs ** 2 / cn + ts ** 2 / tn) ** 2
    denom = (cs ** 2 / cn) ** 2 / (cn - 1) + (ts ** 2 / tn) ** 2 / (tn - 1)
    df = num / denom if denom else cn + tn - 2

    p_value = two_tail_p_t(t, df)

    # Normal approximation for the CI critical value (close to the exact
    # t critical value once df is moderately large)
    t_crit = normal_ppf(1 - alpha / 2)

    diff = tm - cm
    ci_lo = diff - t_crit * se
    ci_hi = diff + t_crit * se

    d = cohens_d(tm, ts, tn, cm, cs, cn)
    lift = (tm - cm) / cm * 100 if cm else 0

    return {
        "test": "Welch's two-sample t-test",
        "control": {"n": cn, "mean": round(cm, 4), "std": round(cs, 4)},
        "treatment": {"n": tn, "mean": round(tm, 4), "std": round(ts, 4)},
        "difference": round(diff, 4),
        "relative_lift_pct": round(lift, 2),
        "t_statistic": round(t, 4),
        "degrees_of_freedom": round(df, 1),
        "p_value": round(p_value, 6),
        "significant": p_value < alpha,
        "alpha": alpha,
        "confidence_interval": {
            "level": f"{int((1 - alpha) * 100)}%",
            "lower": round(ci_lo, 4),
            "upper": round(ci_hi, 4),
        },
        "effect_size": {
            "cohens_d": round(abs(d), 4),
            "interpretation": effect_label(d, "d"),
        },
    }


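The Welch statistic and the Welch–Satterthwaite degrees of freedom above reduce to a few lines. A standalone sketch with illustrative group summaries (not real data):

```python
import math

cm, cs, cn = 42.0, 18.0, 500   # control: mean, std, n (illustrative)
tm, ts, tn = 44.5, 19.0, 500   # treatment: mean, std, n

se = math.sqrt(cs ** 2 / cn + ts ** 2 / tn)
t = (tm - cm) / se
# Welch–Satterthwaite df: between cn-1 and cn+tn-2, depending on variance balance
num = (cs ** 2 / cn + ts ** 2 / tn) ** 2
den = (cs ** 2 / cn) ** 2 / (cn - 1) + (ts ** 2 / tn) ** 2 / (tn - 1)
df = num / den
print(f"t={t:.3f}  df={df:.1f}")
```

With near-equal variances and equal n, df lands close to the pooled value of cn + tn − 2; the Welch form only diverges noticeably when variances or sample sizes are unbalanced.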
def chi2_test(observed: list[float], expected: list[float], alpha: float) -> dict:
    """Chi-square goodness-of-fit test."""
    if len(observed) != len(expected):
        return {"error": "Observed and expected must have the same number of categories."}
    if any(e <= 0 for e in expected):
        return {"error": "Expected values must all be positive."}

    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    k = len(observed)
    df = k - 1
    n = sum(observed)

    # Chi-square CDF via the regularized lower incomplete gamma function
    p_value = 1 - _chi2_cdf(chi2, df)

    v = cramers_v(chi2, int(n), k)

    result = {
        "test": "Chi-square goodness-of-fit",
        "categories": k,
        "observed": observed,
        "expected": expected,
        "chi2_statistic": round(chi2, 4),
        "degrees_of_freedom": df,
        "p_value": round(p_value, 6),
        "significant": p_value < alpha,
        "alpha": alpha,
        "effect_size": {
            "cramers_v": round(v, 4),
            "interpretation": effect_label(v, "v"),
        },
    }
    if any(e < 5 for e in expected):
        result["warning"] = ("Some expected counts are below 5, so the chi-square "
                             "approximation may be unreliable. Consider combining "
                             "categories or using Fisher's exact test.")
    return result


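For df = 2 the chi-square survival function simplifies to exp(−x/2), so the goodness-of-fit p-value can be verified without the incomplete-gamma machinery. A standalone sketch with illustrative counts over three categories:

```python
import math

observed = [48, 35, 17]          # illustrative counts, k = 3 categories
expected = [40.0, 40.0, 20.0]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# chi2 = 64/40 + 25/40 + 9/20 = 2.675
p_value = math.exp(-chi2 / 2)    # exact survival function for df = 2
print(f"chi2={chi2:.3f}  p={p_value:.4f}")  # chi2=2.675, p≈0.2625
```

This closed form only holds for df = 2; the general case needs the regularized incomplete gamma implemented below.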
def _chi2_cdf(x: float, k: float) -> float:
    """CDF of the chi-square distribution via the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    return _regularized_gamma(k / 2, x / 2)


def _regularized_gamma(a: float, x: float) -> float:
    """Lower regularized incomplete gamma P(a, x) via series expansion."""
    if x < 0:
        return 0.0
    if x == 0:
        return 0.0
    if x < a + 1:
        # Series expansion (converges quickly for x < a + 1)
        ap = a
        delta = 1.0 / a
        total = delta
        for _ in range(300):
            ap += 1
            delta *= x / ap
            total += delta
            if abs(delta) < abs(total) * 1e-10:
                break
        return total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    else:
        # Continued fraction (Lentz) for the upper tail
        b = x + 1 - a
        c = 1e30
        d = 1 / b
        f = d
        for i in range(1, 300):
            an = -i * (i - a)
            b += 2
            d = an * d + b
            if abs(d) < 1e-30:
                d = 1e-30
            c = b + an / c
            if abs(c) < 1e-30:
                c = 1e-30
            d = 1 / d
            delta = d * c
            f *= delta
            if abs(delta - 1) < 1e-10:
                break
        return 1 - math.exp(-x + a * math.log(x) - math.lgamma(a)) * f


# ---------------------------------------------------------------------------
# Reporting
# ---------------------------------------------------------------------------

DIRECTION = {True: "statistically significant", False: "NOT statistically significant"}


def verdict(result: dict) -> str:
    if "error" in result:
        return f"ERROR: {result['error']}"
    sig = result.get("significant", False)
    p = result.get("p_value", 1.0)
    alpha = result.get("alpha", 0.05)
    diff = result.get("difference", 0)
    lift = result.get("relative_lift_pct")
    ci = result.get("confidence_interval", {})
    es = result.get("effect_size", {})
    es_name = "Cohen's h" if "cohens_h" in es else ("Cohen's d" if "cohens_d" in es else "Cramér's V")
    es_val = es.get("cohens_h") or es.get("cohens_d") or es.get("cramers_v", 0)
    es_interp = es.get("interpretation", "")

    lines = [
        "",
        "=" * 60,
        f" {result.get('test', 'Hypothesis Test')}",
        "=" * 60,
    ]

    if "control" in result and "rate" in result["control"]:
        c = result["control"]
        t = result["treatment"]
        lines += [
            f" Control: {c['rate']:.4%} (n={c['n']}, conversions={c['conversions']})",
            f" Treatment: {t['rate']:.4%} (n={t['n']}, conversions={t['conversions']})",
            f" Difference: {diff:+.4%} ({'+' if lift >= 0 else ''}{lift:.1f}% relative lift)",
        ]
    elif "control" in result and "mean" in result["control"]:
        c = result["control"]
        t = result["treatment"]
        lines += [
            f" Control: mean={c['mean']} std={c['std']} n={c['n']}",
            f" Treatment: mean={t['mean']} std={t['std']} n={t['n']}",
            f" Difference: {diff:+.4f} ({'+' if lift >= 0 else ''}{lift:.1f}% relative lift)",
        ]
    elif "observed" in result:
        lines += [
            f" Observed: {result['observed']}",
            f" Expected: {result['expected']}",
        ]

    lines += [
        "",
        f" p-value: {p:.6f} (α={alpha})",
        f" Result: {DIRECTION[sig].upper()}",
    ]
    if ci:
        lines.append(f" {ci['level']} CI: [{ci['lower']}, {ci['upper']}]")
    lines += [
        f" Effect: {es_name} = {es_val} ({es_interp})",
        "",
    ]

    # Plain-English verdict
    if sig:
        lines.append(f" ✅ VERDICT: The difference is real (p={p:.4f} < α={alpha}).")
        if es_interp in ("negligible", "small"):
            lines.append(" ⚠️ BUT: Effect is small — confirm practical significance before shipping.")
        else:
            lines.append(" Effect size is meaningful. Recommend shipping if no negative guardrails.")
    else:
        lines.append(f" ❌ VERDICT: Insufficient evidence to conclude a difference exists (p={p:.4f} ≥ α={alpha}).")
        lines.append(" Options: extend the test, increase MDE, or kill if underpowered.")

    lines.append("=" * 60)
    return "\n".join(lines)


def main():
    parser = argparse.ArgumentParser(description="Run hypothesis tests on experiment results.")
    parser.add_argument("--test", choices=["ztest", "ttest", "chi2"], required=True)
    parser.add_argument("--alpha", type=float, default=0.05, help="Significance level (default: 0.05)")
    parser.add_argument("--format", choices=["text", "json"], default="text")

    # Z-test / t-test shared
    parser.add_argument("--control-n", type=int)
    parser.add_argument("--treatment-n", type=int)

    # Z-test
    parser.add_argument("--control-x", type=int, help="Conversions in control group")
    parser.add_argument("--treatment-x", type=int, help="Conversions in treatment group")

    # t-test
    parser.add_argument("--control-mean", type=float)
    parser.add_argument("--control-std", type=float)
    parser.add_argument("--treatment-mean", type=float)
    parser.add_argument("--treatment-std", type=float)

    # chi2
    parser.add_argument("--observed", help="Comma-separated observed counts")
    parser.add_argument("--expected", help="Comma-separated expected counts")

    args = parser.parse_args()

    if args.test == "ztest":
        for req in ["control_n", "control_x", "treatment_n", "treatment_x"]:
            if getattr(args, req) is None:
                print(f"Error: --{req.replace('_', '-')} is required for ztest", file=sys.stderr)
                sys.exit(1)
        result = ztest_proportions(args.control_n, args.control_x, args.treatment_n, args.treatment_x, args.alpha)

    elif args.test == "ttest":
        for req in ["control_n", "control_mean", "control_std", "treatment_n", "treatment_mean", "treatment_std"]:
            if getattr(args, req) is None:
                print(f"Error: --{req.replace('_', '-')} is required for ttest", file=sys.stderr)
                sys.exit(1)
        result = ttest_means(
            args.control_mean, args.control_std, args.control_n,
            args.treatment_mean, args.treatment_std, args.treatment_n,
            args.alpha,
        )

    elif args.test == "chi2":
        if not args.observed or not args.expected:
            print("Error: --observed and --expected are required for chi2", file=sys.stderr)
            sys.exit(1)
        observed = [float(x.strip()) for x in args.observed.split(",")]
        expected = [float(x.strip()) for x in args.expected.split(",")]
        result = chi2_test(observed, expected, args.alpha)

    if args.format == "json":
        print(json.dumps(result, indent=2))
    else:
        if "error" in result:
            print(f"Error: {result['error']}", file=sys.stderr)
            sys.exit(1)
        print(verdict(result))


if __name__ == "__main__":
    main()
@@ -0,0 +1,262 @@
#!/usr/bin/env python3
"""
sample_size_calculator.py — Required sample size per variant for A/B experiments.

Supports proportion tests (conversion rates) and mean tests (continuous metrics).
All math uses the Python stdlib only.

Usage:
    python3 sample_size_calculator.py --test proportion \
        --baseline 0.05 --mde 0.20 --alpha 0.05 --power 0.80

    python3 sample_size_calculator.py --test mean \
        --baseline-mean 42.3 --baseline-std 18.1 --mde 0.10 \
        --alpha 0.05 --power 0.80

    python3 sample_size_calculator.py --test proportion \
        --baseline 0.05 --mde 0.20 --table

    python3 sample_size_calculator.py --test proportion \
        --baseline 0.05 --mde 0.20 --format json
"""
from __future__ import annotations

import argparse
import json
import math
import sys


def normal_cdf(z: float) -> float:
    return 0.5 * math.erfc(-z / math.sqrt(2))


def normal_ppf(p: float) -> float:
    """Inverse normal CDF via bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def sample_size_proportion(baseline: float, mde: float, alpha: float, power: float) -> int:
    """
    Required n per variant for a two-proportion Z-test.

    Uses the standard formula:
        n = (z_α/2 + z_β)² × (p1(1−p1) + p2(1−p2)) / (p1 − p2)²

    Args:
        baseline: Control conversion rate (e.g. 0.05 for 5%)
        mde: Minimum detectable effect as relative change (e.g. 0.20 for +20% relative)
        alpha: Significance level (e.g. 0.05)
        power: Statistical power (e.g. 0.80)
    """
    p1 = baseline
    p2 = baseline * (1 + mde)

    if not (0 < p1 < 1) or not (0 < p2 < 1):
        raise ValueError(f"Rates must be between 0 and 1. Got baseline={p1}, treatment={p2:.4f}")

    z_alpha = normal_ppf(1 - alpha / 2)
    z_beta = normal_ppf(power)

    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    denominator = (p2 - p1) ** 2

    return math.ceil(numerator / denominator)


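Worked through for the usage example in the module docstring (baseline 5%, +20% relative MDE, so a 6% target, α = 0.05, power = 0.80), the formula lands around 8,000–8,200 users per variant. A standalone sketch using `statistics.NormalDist` in place of the bisection PPF:

```python
import math
from statistics import NormalDist

p1 = 0.05                     # baseline conversion rate
p2 = 0.05 * (1 + 0.20)        # +20% relative MDE → 6% target
z_alpha = NormalDist().inv_cdf(1 - 0.05 / 2)   # ≈ 1.96
z_beta = NormalDist().inv_cdf(0.80)            # ≈ 0.84

n = math.ceil((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)
print(n)  # per-variant sample size, roughly 8,000–8,200
```

This is why the earlier Z-test example with 10,000 users per arm and a +12% lift came up short: the detectable effect there was smaller than the experiment was sized for.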
def sample_size_mean(baseline_mean: float, baseline_std: float, mde: float, alpha: float, power: float) -> int:
    """
    Required n per variant for a two-sample t-test.

    Uses:
        n = 2 × σ² × (z_α/2 + z_β)² / δ²

    where δ = mde × baseline_mean (absolute effect).

    Args:
        baseline_mean: Control group mean
        baseline_std: Control group standard deviation
        mde: Minimum detectable effect as relative change (e.g. 0.10 for +10%)
        alpha: Significance level
        power: Statistical power
    """
    delta = abs(mde * baseline_mean)
    if delta == 0:
        raise ValueError("MDE × baseline_mean = 0. Cannot size an experiment with zero effect.")

    z_alpha = normal_ppf(1 - alpha / 2)
    z_beta = normal_ppf(power)

    n = 2 * baseline_std ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2
    return math.ceil(n)


def duration_estimate(n_per_variant: int, daily_traffic: int | None, variants: int = 2) -> str:
    if daily_traffic and daily_traffic > 0:
        traffic_per_variant = daily_traffic / variants
        days = math.ceil(n_per_variant / traffic_per_variant)
        weeks = days / 7
        return f"{days} days ({weeks:.1f} weeks) at {daily_traffic:,} daily users split {variants} ways"
    return "Provide --daily-traffic to estimate duration"


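The mean-metric formula can be checked the same way, using the docstring's usage example (mean 42.3, std 18.1, +10% relative MDE). A standalone sketch, again with `statistics.NormalDist` standing in for the bisection PPF:

```python
import math
from statistics import NormalDist

mean, std, mde = 42.3, 18.1, 0.10
delta = mde * mean                              # absolute effect ≈ 4.23
z_alpha = NormalDist().inv_cdf(1 - 0.05 / 2)
z_beta = NormalDist().inv_cdf(0.80)

n = math.ceil(2 * std ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)
print(n)  # per-variant sample size, a few hundred
```

Continuous metrics often need far fewer users than rare conversion events because the standardized effect δ/σ is larger; here it is about 0.23 versus roughly 0.04 on the proportion scale in the previous example.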
def print_report(
    test: str, n: int, baseline: float, mde: float, alpha: float, power: float,
    daily_traffic: int | None, variants: int,
    baseline_mean: float | None = None, baseline_std: float | None = None,
):
    total = n * variants
    treatment_rate = baseline * (1 + mde) if test == "proportion" else None
    absolute_mde = baseline * mde if test == "proportion" else (baseline_mean or 0) * mde

    print("=" * 60)
    print(" SAMPLE SIZE REPORT")
    print("=" * 60)

    if test == "proportion":
        print(f" Baseline conversion rate: {baseline:.2%}")
        print(f" Target conversion rate: {treatment_rate:.2%}")
        print(f" MDE: {mde:+.1%} relative ({absolute_mde:+.4f} absolute)")
    else:
        print(f" Baseline mean: {baseline_mean} (std: {baseline_std})")
        print(f" MDE: {mde:+.1%} relative (absolute: {absolute_mde:+.4f})")

    print(f" Significance level (α): {alpha}")
    print(f" Statistical power (1−β): {power:.0%}")
    print(f" Variants: {variants}")
    print()
    print(f" Required per variant: {n:>10,}")
    print(f" Required total: {total:>10,}")
    print()
    print(f" Duration: {duration_estimate(n, daily_traffic, variants)}")
    print()

    # Risk interpretation
    if n < 100:
        print(" ⚠️ Very small sample — results may be sensitive to outliers.")
    elif n > 1_000_000:
        print(" ⚠️ Very large sample required — consider increasing MDE or accepting lower power.")
    else:
        print(" ✅ Sample size is achievable for most web/app products.")

    print("=" * 60)


def print_table(test: str, baseline: float | None, mde: float, alpha: float,
                baseline_mean: float | None, baseline_std: float | None):
    """Print a tradeoff table across power levels and MDE values."""
    powers = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
    mdes = [mde * 0.5, mde * 0.75, mde, mde * 1.5, mde * 2.0]

    print("=" * 70)
    print(f" SAMPLE SIZE TRADEOFF TABLE (α={alpha}, metric={'proportion' if test == 'proportion' else 'mean'})")
    print("=" * 70)
    header = f" {'MDE':>8} | " + " | ".join(f"power={p:.0%}" for p in powers)
    print(header)
    print(" " + "-" * (len(header) - 2))

    for m in mdes:
        row = f" {m:>+7.1%} | "
        cells = []
        for p in powers:
            try:
                if test == "proportion":
                    n = sample_size_proportion(baseline, m, alpha, p)
                else:
                    n = sample_size_mean(baseline_mean, baseline_std, m, alpha, p)
                cells.append(f"{n:>9,}")
            except ValueError:
                cells.append(f"{'N/A':>9}")
        row += " | ".join(cells)
        print(row)

    print("=" * 70)
    print(" (Values = required n per variant)")
    print()


def main():
    parser = argparse.ArgumentParser(description="Calculate required sample size for A/B experiments.")
    parser.add_argument("--test", choices=["proportion", "mean"], required=True,
                        help="Type of metric: proportion (conversion rate) or mean (continuous)")
    parser.add_argument("--alpha", type=float, default=0.05, help="Significance level (default: 0.05)")
    parser.add_argument("--power", type=float, default=0.80, help="Statistical power (default: 0.80)")
    parser.add_argument("--mde", type=float, required=True,
                        help="Minimum detectable effect as relative change (e.g. 0.20 = +20%%)")
    parser.add_argument("--variants", type=int, default=2, help="Number of variants including control (default: 2)")
    parser.add_argument("--daily-traffic", type=int, help="Daily unique users (for duration estimate)")
    parser.add_argument("--table", action="store_true", help="Print tradeoff table across power and MDE")
    parser.add_argument("--format", choices=["text", "json"], default="text")

    # Proportion-specific
    parser.add_argument("--baseline", type=float, help="Baseline conversion rate (e.g. 0.05 for 5%%)")

    # Mean-specific
    parser.add_argument("--baseline-mean", type=float, help="Control group mean")
    parser.add_argument("--baseline-std", type=float, help="Control group standard deviation")

    args = parser.parse_args()

    try:
        if args.test == "proportion":
            if args.baseline is None:
                print("Error: --baseline is required for proportion test", file=sys.stderr)
                sys.exit(1)
            n = sample_size_proportion(args.baseline, args.mde, args.alpha, args.power)
        else:
            if args.baseline_mean is None or args.baseline_std is None:
                print("Error: --baseline-mean and --baseline-std are required for mean test", file=sys.stderr)
                sys.exit(1)
            n = sample_size_mean(args.baseline_mean, args.baseline_std, args.mde, args.alpha, args.power)
    except ValueError as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)

    if args.format == "json":
        output = {
            "test": args.test,
            "n_per_variant": n,
            "n_total": n * args.variants,
            "alpha": args.alpha,
            "power": args.power,
            "mde": args.mde,
            "variants": args.variants,
        }
        if args.test == "proportion":
            output["baseline_rate"] = args.baseline
            output["treatment_rate"] = round(args.baseline * (1 + args.mde), 6)
        else:
            output["baseline_mean"] = args.baseline_mean
            output["baseline_std"] = args.baseline_std
        if args.daily_traffic:
            days = math.ceil(n / (args.daily_traffic / args.variants))
            output["estimated_days"] = days
        print(json.dumps(output, indent=2))
        return

    if args.table:
        print_table(args.test, args.baseline if args.test == "proportion" else None,
                    args.mde, args.alpha, args.baseline_mean, args.baseline_std)

    print_report(
        args.test, n,
        baseline=args.baseline or 0,
        mde=args.mde,
        alpha=args.alpha,
        power=args.power,
        daily_traffic=args.daily_traffic,
        variants=args.variants,
        baseline_mean=args.baseline_mean,
        baseline_std=args.baseline_std,
    )


if __name__ == "__main__":
    main()