Enhance A/B test setup documentation with new guidelines

Added a Hypothesis Quality Checklist and detailed guidelines for designing A/B tests, including sections on hypothesis formulation, test types, metrics selection, and common mistakes.
2026-01-25 16:41:24 +05:00
parent 5e888ef6bb
commit 27ce8af114
1 changed files with 120 additions and 432 deletions
--- a/skills/ab-test-setup/SKILL.md
+++ b/skills/ab-test-setup/SKILL.md
@@ -1,508 +1,196 @@
---
-name: ab-test-setup
-description: When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," or "hypothesis." For tracking implementation, see analytics-tracking.
---

-# A/B Test Setup
-
-You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.
-
-## Initial Assessment
-
-Before designing a test, understand:
-
-1. **Test Context**
-   - What are you trying to improve?
-   - What change are you considering?
-   - What made you want to test this?
-
-2. **Current State**
-   - Baseline conversion rate?
-   - Current traffic volume?
-   - Any historical test data?
-
-3. **Constraints**
-   - Technical implementation complexity?
-   - Timeline requirements?
-   - Tools available?
+#### Hypothesis Quality Checklist
+A valid hypothesis includes:
+- Observation or evidence
+- Single, specific change
+- Directional expectation
+- Defined audience
+- Measurable success criteria

 ---

-## Core Principles
+### 3️⃣ Hypothesis Lock (Hard Gate)

-### 1. Start with a Hypothesis
- Not just "let's see what happens"
- Specific prediction of outcome
- Based on reasoning or data
+Before designing variants or metrics, you MUST:

-### 2. Test One Thing
- Single variable per test
- Otherwise you don't know what worked
- Save MVT for later
+- Present the **final hypothesis**
+- Specify:
+  - Target audience
+  - Primary metric
+  - Expected direction of effect
+  - Minimum Detectable Effect (MDE)

-### 3. Statistical Rigor
- Pre-determine sample size
- Don't peek and stop early
- Commit to the methodology
+Ask explicitly:

-### 4. Measure What Matters
- Primary metric tied to business value
- Secondary metrics for context
- Guardrail metrics to prevent harm
+> “Is this the final hypothesis we are committing to for this test?”
+
+**Do NOT proceed until confirmed.**
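
One way to make the lock concrete is to record the hypothesis as a structured object and refuse to continue while any part is blank. The sketch below is illustrative only: the `Hypothesis` class, its field names, and `missing_fields` are hypothetical and not part of the skill itself.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical record of a locked hypothesis (field names are assumptions)."""
    observation: str         # evidence that prompted the idea
    change: str              # single, specific change
    audience: str            # target audience
    primary_metric: str      # the one metric used to evaluate success
    expected_direction: str  # "increase" or "decrease"
    mde_relative: float      # minimum detectable effect, e.g. 0.15 for a 15% lift


def missing_fields(h: Hypothesis) -> list[str]:
    """Return the names of fields that are blank or invalid."""
    missing = []
    for f in fields(h):
        value = getattr(h, f.name)
        if isinstance(value, str) and not value.strip():
            missing.append(f.name)
    if h.expected_direction not in ("increase", "decrease"):
        if "expected_direction" not in missing:
            missing.append("expected_direction")
    if h.mde_relative <= 0:
        missing.append("mde_relative")
    return missing


h = Hypothesis(
    observation="Heatmaps and feedback show users struggle to find the CTA",
    change="Larger CTA button in a contrasting color",
    audience="new visitors",
    primary_metric="CTA click-through rate",
    expected_direction="increase",
    mde_relative=0.15,
)
print(missing_fields(h))  # an empty list means the hypothesis can be presented for confirmation
```

A frozen dataclass is used so the record cannot be edited silently after the user confirms it.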

 ---

-## Hypothesis Framework
+### 4️⃣ Assumptions & Validity Check (Mandatory)

-### Structure
+Explicitly list assumptions about:

-```
-Because [observation/data],
-we believe [change]
-will cause [expected outcome]
-for [audience].
-We'll know this is true when [metrics].
-```
+- Traffic stability
+- User independence
+- Metric reliability
+- Randomization quality
+- External factors (seasonality, campaigns, releases)

-### Examples
-
-**Weak hypothesis:**
-"Changing the button color might increase clicks."
-
-**Strong hypothesis:**
-"Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."
-
-### Good Hypotheses Include
-
- **Observation**: What prompted this idea
- **Change**: Specific modification
- **Effect**: Expected outcome and direction
- **Audience**: Who this applies to
- **Metric**: How you'll measure success
+If assumptions are weak or violated:
+- Warn the user
+- Recommend delaying or redesigning the test
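
As an example of checking one of these assumptions, randomization quality is often probed with a sample ratio mismatch (SRM) test on the observed group sizes. This is a minimal sketch, assuming an intended 50/50 split; the function name and the 0.001 alert threshold are assumptions chosen for illustration, not values prescribed by the skill.

```python
import math
from statistics import NormalDist


def srm_p_value(control_users, variant_users):
    """P-value of a chi-square test (1 degree of freedom) against an intended 50/50 split."""
    total = control_users + variant_users
    expected = total / 2
    chi2 = ((control_users - expected) ** 2 + (variant_users - expected) ** 2) / expected
    # With 1 degree of freedom, chi2 is the square of a standard normal variable.
    return 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))


SRM_THRESHOLD = 0.001  # assumed alert threshold

p = srm_p_value(5000, 5500)
if p < SRM_THRESHOLD:
    print("Group sizes deviate from 50/50 more than chance would explain; check assignment.")
```

A very small p-value suggests the assignment mechanism or the tracking is biased, which is a reason to warn the user before trusting any result.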

 ---

-## Test Types
+### 5️⃣ Test Type Selection

-### A/B Test (Split Test)
- Two versions: Control (A) vs. Variant (B)
- Single change between versions
- Most common, easiest to analyze
+Choose the simplest valid test:

-### A/B/n Test
- Multiple variants (A vs. B vs. C...)
- Requires more traffic
- Good for testing several options
+- **A/B Test** – single change, two variants
+- **A/B/n Test** – multiple variants, higher traffic required
+- **Multivariate Test (MVT)** – multiple changes in combination to measure interaction effects, very high traffic required
+- **Split URL Test** – variants served from different URLs, suited to major structural changes

-### Multivariate Test (MVT)
- Multiple changes in combinations
- Tests interactions between changes
- Requires significantly more traffic
- Complex analysis
-
-### Split URL Test
- Different URLs for variants
- Good for major page changes
- Easier implementation sometimes
+Default to **A/B** unless there is a clear reason otherwise.

 ---

-## Sample Size Calculation
+### 6️⃣ Metrics Definition

-### Inputs Needed
+#### Primary Metric (Mandatory)
+- Single metric used to evaluate success
+- Directly tied to the hypothesis
+- Pre-defined and frozen before launch

-1. **Baseline conversion rate**: Your current rate
-2. **Minimum detectable effect (MDE)**: Smallest change worth detecting
-3. **Statistical significance level**: Usually 95%
-4. **Statistical power**: Usually 80%
+#### Secondary Metrics
+- Provide context
+- Explain *why* results occurred
+- Must not override the primary metric

-### Quick Reference
-
-| Baseline Rate | 10% Lift | 20% Lift | 50% Lift |
-|---------------|----------|----------|----------|
-| 1% | 150k/variant | 39k/variant | 6k/variant |
-| 3% | 47k/variant | 12k/variant | 2k/variant |
-| 5% | 27k/variant | 7k/variant | 1.2k/variant |
-| 10% | 12k/variant | 3k/variant | 550/variant |
-
-### Formula Resources
- Evan Miller's calculator: https://www.evanmiller.org/ab-testing/sample-size.html
- Optimizely's calculator: https://www.optimizely.com/sample-size-calculator/
-
-### Test Duration
-
-```
-Duration = Sample size needed per variant × Number of variants
-           ───────────────────────────────────────────────────
-           Daily traffic to test page × Conversion rate
-```
-
-Minimum: 1-2 business cycles (usually 1-2 weeks)
-Maximum: Avoid running too long (novelty effects, external factors)
+#### Guardrail Metrics
+- Metrics that must not degrade
+- Used to prevent harmful wins
+- Trigger test stop if significantly negative

 ---

-## Metrics Selection
+### 7️⃣ Sample Size & Duration

-### Primary Metric
- Single metric that matters most
- Directly tied to hypothesis
- What you'll use to call the test
+Define upfront:
+- Baseline rate
+- MDE
+- Significance level (typically α = 0.05, i.e., 95% confidence)
+- Statistical power (typically 80%)

-### Secondary Metrics
- Support primary metric interpretation
- Explain why/how the change worked
- Help understand user behavior
+Estimate:
+- Required sample size per variant
+- Expected test duration

-### Guardrail Metrics
- Things that shouldn't get worse
- Revenue, retention, satisfaction
- Stop test if significantly negative
-
-### Metric Examples by Test Type
-
-**Homepage CTA test:**
- Primary: CTA click-through rate
- Secondary: Time to click, scroll depth
- Guardrail: Bounce rate, downstream conversion
-
-**Pricing page test:**
- Primary: Plan selection rate
- Secondary: Time on page, plan distribution
- Guardrail: Support tickets, refund rate
-
-**Signup flow test:**
- Primary: Signup completion rate
- Secondary: Field-level completion, time to complete
- Guardrail: User activation rate (post-signup quality)
+**Do NOT proceed without a realistic sample size estimate.**
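
The estimate can be sketched with the standard normal-approximation formula for comparing two proportions. This is a rough sketch under stated assumptions: a two-sided test, equal allocation across variants, an MDE expressed as a relative lift, and `daily_visitors` meaning users who actually enter the test per day. The function names are hypothetical; dedicated calculators may return slightly different numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect a relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def estimated_duration_days(n_per_variant, variants, daily_visitors):
    """Days needed to fill every variant at the given daily traffic."""
    return math.ceil(n_per_variant * variants / daily_visitors)


n = sample_size_per_variant(baseline=0.05, relative_mde=0.20)
days = estimated_duration_days(n, variants=2, daily_visitors=1000)
print(n, days)
```

Halving the MDE roughly quadruples the required sample, which is why the MDE should be fixed before the duration is promised.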

 ---

-## Designing Variants
+### 8️⃣ Execution Readiness Gate (Hard Stop)

-### Control (A)
- Current experience, unchanged
- Don't modify during test
+You may proceed to implementation **only if all are true**:

-### Variant (B+)
+- Hypothesis is locked
+- Primary metric is frozen
+- Sample size is calculated
+- Test duration is defined
+- Guardrails are set
+- Tracking is verified

-**Best practices:**
- Single, meaningful change
- Bold enough to make a difference
- True to the hypothesis
-
-**What to vary:**
-
-Headlines/Copy:
- Message angle
- Value proposition
- Specificity level
- Tone/voice
-
-Visual Design:
- Layout structure
- Color and contrast
- Image selection
- Visual hierarchy
-
-CTA:
- Button copy
- Size/prominence
- Placement
- Number of CTAs
-
-Content:
- Information included
- Order of information
- Amount of content
- Social proof type
-
-### Documenting Variants
-
-```
-Control (A):
- Screenshot
- Description of current state
-
-Variant (B):
- Screenshot or mockup
- Specific changes made
- Hypothesis for why this will win
-```
-
---
-
-## Traffic Allocation
-
-### Standard Split
- 50/50 for A/B test
- Equal split for multiple variants
-
-### Conservative Rollout
- 90/10 or 80/20 initially
- Limits risk of bad variant
- Longer to reach significance
-
-### Ramping
- Start small, increase over time
- Good for technical risk mitigation
- Most tools support this
-
-### Considerations
- Consistency: Users see same variant on return
- Segment sizes: Ensure segments are large enough
- Time of day/week: Balanced exposure
-
---
-
-## Implementation Approaches
-
-### Client-Side Testing
-
-**Tools**: PostHog, Optimizely, VWO, custom
-
-**How it works**:
- JavaScript modifies page after load
- Quick to implement
- Can cause flicker
-
-**Best for**:
- Marketing pages
- Copy/visual changes
- Quick iteration
-
-### Server-Side Testing
-
-**Tools**: PostHog, LaunchDarkly, Split, custom
-
-**How it works**:
- Variant determined before page renders
- No flicker
- Requires development work
-
-**Best for**:
- Product features
- Complex changes
- Performance-sensitive pages
-
-### Feature Flags
-
- Binary on/off (not true A/B)
- Good for rollouts
- Can convert to A/B with percentage split
+If any item is missing, stop and resolve it.
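
The gate amounts to an all-or-nothing check over the six items. A minimal sketch, assuming the status of each item is tracked as a boolean in a dictionary; the key names and function names are hypothetical.

```python
READINESS_ITEMS = (
    "hypothesis_locked",
    "primary_metric_frozen",
    "sample_size_calculated",
    "test_duration_defined",
    "guardrails_set",
    "tracking_verified",
)


def unresolved_items(status):
    """Return readiness items that are missing or not True, in checklist order."""
    return [item for item in READINESS_ITEMS if status.get(item) is not True]


def ready_to_implement(status):
    """True only when every readiness item is explicitly True."""
    return not unresolved_items(status)


status = {item: True for item in READINESS_ITEMS}
status["tracking_verified"] = False
print(unresolved_items(status))  # lists what must be resolved before implementation
```

Treating a missing key the same as `False` keeps the gate conservative: an item nobody checked counts as unresolved.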

 ---

 ## Running the Test

-### Pre-Launch Checklist
-
- [ ] Hypothesis documented
- [ ] Primary metric defined
- [ ] Sample size calculated
- [ ] Test duration estimated
- [ ] Variants implemented correctly
- [ ] Tracking verified
- [ ] QA completed on all variants
- [ ] Stakeholders informed
-
 ### During the Test

 **DO:**
- Monitor for technical issues
- Check segment quality
- Document any external factors
+- Monitor technical health
+- Document external factors

-**DON'T:**
- Peek at results and stop early
- Make changes to variants
- Add traffic from new sources
- End early because you "know" the answer
-
-### Peeking Problem
-
-Looking at results before reaching sample size and stopping when you see significance leads to:
- False positives
- Inflated effect sizes
- Wrong decisions
-
-**Solutions:**
- Pre-commit to sample size and stick to it
- Use sequential testing if you must peek
- Trust the process
+**DO NOT:**
+- Stop early due to “good-looking” results
+- Change variants mid-test
+- Add new traffic sources
+- Redefine success criteria

 ---

 ## Analyzing Results

-### Statistical Significance
+### Analysis Discipline

- 95% confidence = p-value < 0.05
- Means: <5% chance result is random
- Not a guarantee—just a threshold
+When interpreting results:

-### Practical Significance
+- Do NOT generalize beyond the tested population
+- Do NOT claim causality beyond the tested change
+- Do NOT override guardrail failures
+- Separate statistical significance from business judgment

-Statistical ≠ Practical
+### Interpretation Outcomes

- Is the effect size meaningful for business?
- Is it worth the implementation cost?
- Is it sustainable over time?
-
-### What to Look At
-
-1. **Did you reach sample size?**
-   - If not, result is preliminary
-
-2. **Is it statistically significant?**
-   - Check confidence intervals
-   - Check p-value
-
-3. **Is the effect size meaningful?**
-   - Compare to your MDE
-   - Project business impact
-
-4. **Are secondary metrics consistent?**
-   - Do they support the primary?
-   - Any unexpected effects?
-
-5. **Any guardrail concerns?**
-   - Did anything get worse?
-   - Long-term risks?
-
-6. **Segment differences?**
-   - Mobile vs. desktop?
-   - New vs. returning?
-   - Traffic source?
-
-### Interpreting Results
-
-| Result | Conclusion |
-|--------|------------|
-| Significant winner | Implement variant |
-| Significant loser | Keep control, learn why |
-| No significant difference | Need more traffic or bolder test |
-| Mixed signals | Dig deeper, maybe segment |
+| Result | Action |
+|------|-------|
+| Significant positive | Consider rollout |
+| Significant negative | Reject variant, document learning |
+| Inconclusive | Consider more traffic or bolder change |
+| Guardrail failure | Do not ship, even if primary wins |
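
The table can be read as a decision rule in which the guardrail check comes first. The sketch below pairs it with a pooled two-proportion z-test; it assumes a conversion-style primary metric, a two-sided test at α = 0.05, and that a positive lift is the hypothesized direction. Function names are hypothetical, and the test assumes the pooled rate is strictly between 0 and 1.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled z-test for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def interpret(p_value, lift, guardrail_failed, alpha=0.05):
    """Map a result to the action listed in the table."""
    if guardrail_failed:
        return "Do not ship, even if primary wins"
    if p_value >= alpha:
        return "Consider more traffic or bolder change"
    if lift > 0:
        return "Consider rollout"
    return "Reject variant, document learning"


p = two_proportion_p_value(conv_a=400, n_a=8000, conv_b=480, n_b=8000)
lift = (480 / 8000) / (400 / 8000) - 1
print(interpret(p, lift, guardrail_failed=False))
```

The returned action is an input to business judgment, not a replacement for it: "Consider rollout" still leaves the rollout decision to the team.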

 ---

-## Documenting and Learning
+## Documentation & Learning

-### Test Documentation
+### Test Record (Mandatory)

-```
-Test Name: [Name]
-Test ID: [ID in testing tool]
-Dates: [Start] - [End]
-Owner: [Name]
+Document:
+- Hypothesis
+- Variants
+- Metrics
+- Planned vs. achieved sample size
+- Results
+- Decision
+- Learnings
+- Follow-up ideas

-Hypothesis:
-[Full hypothesis statement]
-
-Variants:
- Control: [Description + screenshot]
- Variant: [Description + screenshot]
-
-Results:
- Sample size: [achieved vs. target]
- Primary metric: [control] vs. [variant] ([% change], [confidence])
- Secondary metrics: [summary]
- Segment insights: [notable differences]
-
-Decision: [Winner/Loser/Inconclusive]
-Action: [What we're doing]
-
-Learnings:
-[What we learned, what to test next]
-```
-
-### Building a Learning Repository
-
- Central location for all tests
- Searchable by page, element, outcome
- Prevents re-running failed tests
- Builds institutional knowledge
+Store records in a shared, searchable location so that failed tests are not unknowingly re-run.

 ---

-## Output Format
+## Refusal Conditions (Safety)

-### Test Plan Document
+Refuse to proceed if:
+- Baseline rate is unknown and cannot be estimated
+- Traffic is insufficient to detect the MDE
+- Primary metric is undefined
+- Multiple variables are changed without a design that can isolate their effects
+- Hypothesis cannot be clearly stated

-```
-# A/B Test: [Name]
-
-## Hypothesis
-[Full hypothesis using framework]
-
-## Test Design
- Type: A/B / A/B/n / MVT
- Duration: X weeks
- Sample size: X per variant
- Traffic allocation: 50/50
-
-## Variants
-[Control and variant descriptions with visuals]
-
-## Metrics
- Primary: [metric and definition]
- Secondary: [list]
- Guardrails: [list]
-
-## Implementation
- Method: Client-side / Server-side
- Tool: [Tool name]
- Dev requirements: [If any]
-
-## Analysis Plan
- Success criteria: [What constitutes a win]
- Segment analysis: [Planned segments]
-```
-
-### Results Summary
-When test is complete
-
-### Recommendations
-Next steps based on results
+Explain why and recommend next steps.

 ---

-## Common Mistakes
+## Key Principles (Non-Negotiable)

-### Test Design
- Testing too small a change (undetectable)
- Testing too many things (can't isolate)
- No clear hypothesis
- Wrong audience
-
-### Execution
- Stopping early
- Changing things mid-test
- Not checking implementation
- Uneven traffic allocation
-
-### Analysis
- Ignoring confidence intervals
- Cherry-picking segments
- Over-interpreting inconclusive results
- Not considering practical significance
+- One hypothesis per test
+- One primary metric
+- Commit before launch
+- No peeking
+- Learning over winning
+- Statistical rigor first

 ---

-## Questions to Ask
+## Final Reminder

-If you need more context:
-1. What's your current conversion rate?
-2. How much traffic does this page get?
-3. What change are you considering and why?
-4. What's the smallest improvement worth detecting?
-5. What tools do you have for testing?
-6. Have you tested this area before?
+A/B testing is not about proving ideas right.
+It is about **learning the truth with confidence**.

---
-
-## Related Skills
-
- **page-cro**: For generating test ideas based on CRO principles
- **analytics-tracking**: For setting up test measurement
- **copywriting**: For creating variant copy
+If you feel tempted to rush, simplify, or “just try it” —
+that is the signal to **slow down and re-check the design**.