---
name: ab-test-setup
description: "Structured guide for setting up A/B tests with mandatory gates for hypothesis, metrics, and execution readiness."
risk: unknown
source: community
date_added: "2026-02-27"
---

# A/B Test Setup

## 1️⃣ Purpose & Scope

Ensure every A/B test is **valid, rigorous, and safe** before a single line of code is written.

- Prevents "peeking"
- Enforces statistical power
- Blocks invalid hypotheses

---

## 2️⃣ Pre-Requisites

You must have:

- A clear user problem
- Access to an analytics source
- A rough estimate of traffic volume

### Hypothesis Quality Checklist

A valid hypothesis includes:

- Observation or evidence
- Single, specific change
- Directional expectation
- Defined audience
- Measurable success criteria

---

## 3️⃣ Hypothesis Lock (Hard Gate)

Before designing variants or metrics, you MUST:

- Present the **final hypothesis**
- Specify:
  - Target audience
  - Primary metric
  - Expected direction of effect
  - Minimum Detectable Effect (MDE)

Ask explicitly:

> “Is this the final hypothesis we are committing to for this test?”

**Do NOT proceed until confirmed.**
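
One way to make this gate checkable is to record the locked fields in a small structure and list what is still missing. The sketch below is illustrative only: `LockedHypothesis`, `lock_problems`, and every field name are hypothetical, and treating the MDE as a relative lift is an assumption rather than something this guide prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LockedHypothesis:
    """Hypothetical record of the fields the Hypothesis Lock asks for."""
    statement: str
    target_audience: str
    primary_metric: str
    expected_direction: str  # assumed vocabulary: "increase" or "decrease"
    mde_relative: float      # assumed relative MDE, e.g. 0.10 for a 10% lift
    confirmed: bool = False  # set True only after explicit confirmation


def lock_problems(h: LockedHypothesis) -> list[str]:
    """Return the reasons the hypothesis is not yet locked (empty if locked)."""
    problems = []
    for field_name in ("statement", "target_audience", "primary_metric"):
        if not getattr(h, field_name).strip():
            problems.append(f"{field_name} is empty")
    if h.expected_direction not in ("increase", "decrease"):
        problems.append("expected_direction must be 'increase' or 'decrease'")
    if h.mde_relative <= 0:
        problems.append("mde_relative must be positive")
    if not h.confirmed:
        problems.append("hypothesis has not been explicitly confirmed")
    return problems
```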

---

## 4️⃣ Assumptions & Validity Check (Mandatory)

Explicitly list assumptions about:

- Traffic stability
- User independence
- Metric reliability
- Randomization quality
- External factors (seasonality, campaigns, releases)

If assumptions are weak or violated:

- Warn the user
- Recommend delaying or redesigning the test
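
Randomization quality is one assumption that can be checked numerically once users have been assigned. A common technique, not one this guide mandates, is a sample ratio mismatch (SRM) check: a chi-square goodness-of-fit test of the observed split against the intended split. The function name and the 50/50 default below are assumptions for illustration, and the p-value threshold for raising a warning is left to the reader.

```python
import math


def srm_p_value(control_users: int, variant_users: int,
                expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the split.

    A very small p-value suggests the observed allocation does not match
    the intended one, which points to a randomization or tracking problem.
    """
    total = control_users + variant_users
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (variant_users - expected_variant) ** 2 / expected_variant)
    # For 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))
```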

---

## 5️⃣ Test Type Selection

Choose the simplest valid test:

- **A/B Test** – single change, two variants
- **A/B/n Test** – multiple variants, higher traffic required
- **Multivariate Test (MVT)** – interaction effects, very high traffic
- **Split URL Test** – major structural changes

Default to **A/B** unless there is a clear reason otherwise.

---

## 6️⃣ Metrics Definition

### Primary Metric (Mandatory)

- Single metric used to evaluate success
- Directly tied to the hypothesis
- Pre-defined and frozen before launch

### Secondary Metrics

- Provide context
- Explain _why_ results occurred
- Must not override the primary metric

### Guardrail Metrics

- Metrics that must not degrade
- Used to prevent harmful wins
- Trigger a test stop if they turn significantly negative

---

## 7️⃣ Sample Size & Duration

Define upfront:

- Baseline rate
- MDE
- Significance level (typically α = 0.05, i.e. 95% confidence)
- Statistical power (typically 80%)

Estimate:

- Required sample size per variant
- Expected test duration

**Do NOT proceed without a realistic sample size estimate.**
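
As a rough sketch of how these inputs combine, the standard normal-approximation formula for comparing two proportions can be computed with the Python standard library. The assumptions here are mine, not the guide's: the primary metric is a conversion rate, the MDE is a relative lift, the test is two-sided, and traffic is split evenly. The function names are hypothetical and inputs are not validated. With a 5% baseline and a 10% relative MDE at the typical settings, this gives on the order of 31,000 users per variant.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, mde_relative: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant for a two-sided test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)      # rate if the MDE is achieved
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def duration_days(n_per_variant: int, variants: int, daily_users: int) -> int:
    """Days needed if all eligible daily users enter the test."""
    return math.ceil(n_per_variant * variants / daily_users)
```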

---

## 8️⃣ Execution Readiness Gate (Hard Stop)

You may proceed to implementation **only if all are true**:

- Hypothesis is locked
- Primary metric is frozen
- Sample size is calculated
- Test duration is defined
- Guardrails are set
- Tracking is verified

If any item is missing, stop and resolve it.
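
The gate above is an all-or-nothing checklist, which can be expressed as a small function that reports what is still missing. This is a minimal sketch: the key names in `READINESS_ITEMS` are hypothetical labels for the six items listed, and an absent key is treated as not done.

```python
READINESS_ITEMS = (
    "hypothesis_locked",
    "primary_metric_frozen",
    "sample_size_calculated",
    "test_duration_defined",
    "guardrails_set",
    "tracking_verified",
)


def missing_readiness_items(status: dict[str, bool]) -> list[str]:
    """Return checklist items that are absent or False; empty means proceed."""
    return [item for item in READINESS_ITEMS if not status.get(item, False)]
```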

---

## Running the Test

### During the Test

**DO:**

- Monitor technical health
- Document external factors

**DO NOT:**

- Stop early due to “good-looking” results
- Change variants mid-test
- Add new traffic sources
- Redefine success criteria

---

## Analyzing Results

### Analysis Discipline

When interpreting results:

- Do NOT generalize beyond the tested population
- Do NOT claim causality beyond the tested change
- Do NOT override guardrail failures
- Separate statistical significance from business judgment

### Interpretation Outcomes

| Result               | Action                                 |
| -------------------- | -------------------------------------- |
| Significant positive | Consider rollout                       |
| Significant negative | Reject variant, document learning      |
| Inconclusive         | Consider more traffic or bolder change |
| Guardrail failure    | Do not ship, even if primary wins      |
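
The outcome table can be read as a decision rule in which a guardrail failure takes precedence over everything else. The sketch below pairs it with a pooled two-proportion z-test, which is one common way to test a conversion-rate metric and is an assumption here, not the guide's prescribed method. The function names are hypothetical, a higher metric value is assumed to be better, and the rule should only be applied once the planned sample size has been reached.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:  # no variation at all (every user or no user converted)
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def recommend_action(lift: float, p_value: float, guardrail_failed: bool,
                     alpha: float = 0.05) -> str:
    """Map a result to the action in the Interpretation Outcomes table."""
    if guardrail_failed:  # checked first: guardrails override a primary win
        return "Do not ship, even if primary wins"
    if p_value >= alpha:
        return "Consider more traffic or bolder change"
    if lift > 0:
        return "Consider rollout"
    return "Reject variant, document learning"
```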

---

## Documentation & Learning

### Test Record (Mandatory)

Document:

- Hypothesis
- Variants
- Metrics
- Planned vs. achieved sample size
- Results
- Decision
- Learnings
- Follow-up ideas

Store records in a shared, searchable location to avoid repeated failures.
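
A structured record makes the fields above easier to search later. The following is one possible shape, serialized to JSON with the standard library; `ExperimentRecord`, its field names, and the choice of JSON are illustrative assumptions, and where the record is stored is up to the team.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentRecord:
    """Hypothetical container for the mandatory test record fields."""
    hypothesis: str
    variants: list[str]
    metrics: dict[str, str]        # e.g. {"primary": "...", "guardrail": "..."}
    planned_sample_size: int
    achieved_sample_size: int
    results: str
    decision: str
    learnings: list[str] = field(default_factory=list)
    follow_up_ideas: list[str] = field(default_factory=list)


def to_json(record: ExperimentRecord) -> str:
    """Serialize the record with stable key order for diff-friendly storage."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```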

---

## Refusal Conditions (Safety)

Refuse to proceed if:

- Baseline rate is unknown and cannot be estimated
- Traffic is insufficient to detect the MDE
- Primary metric is undefined
- Multiple variables are changed without proper design
- Hypothesis cannot be clearly stated

Explain why and recommend next steps.

---

## Key Principles (Non-Negotiable)

- One hypothesis per test
- One primary metric
- Commit before launch
- No peeking
- Learning over winning
- Statistical rigor first

---

## Final Reminder

A/B testing is not about proving ideas right.
It is about **learning the truth with confidence**.

If you feel tempted to rush, simplify, or “just try it”,
that is the signal to **slow down and re-check the design**.

## When to Use

Use this skill whenever you need to carry out the A/B test setup workflow described above.