From 27ce8af114af4345a5a0a6f3ec4c481c1f9d2bc9 Mon Sep 17 00:00:00 2001
From: Munir Abbasi
Date: Sun, 25 Jan 2026 16:41:24 +0500
Subject: [PATCH] Enhance A/B test setup documentation with new guidelines

Added a Hypothesis Quality Checklist and detailed guidelines for designing
A/B tests, including hard gates for hypothesis lock and execution readiness,
plus sections on test type selection, metrics definition, sample sizing, and
refusal conditions.
---
 skills/ab-test-setup/SKILL.md | 552 ++++++++--------------------------
 1 file changed, 120 insertions(+), 432 deletions(-)

diff --git a/skills/ab-test-setup/SKILL.md b/skills/ab-test-setup/SKILL.md
index b9eeeefb..3b365eee 100644
--- a/skills/ab-test-setup/SKILL.md
+++ b/skills/ab-test-setup/SKILL.md
@@ -1,508 +1,196 @@
----
-name: ab-test-setup
-description: When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," or "hypothesis." For tracking implementation, see analytics-tracking.
----
-# A/B Test Setup
-
-You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.
-
-## Initial Assessment
-
-Before designing a test, understand:
-
-1. **Test Context**
-   - What are you trying to improve?
-   - What change are you considering?
-   - What made you want to test this?
-
-2. **Current State**
-   - Baseline conversion rate?
-   - Current traffic volume?
-   - Any historical test data?
-
-3. **Constraints**
-   - Technical implementation complexity?
-   - Timeline requirements?
-   - Tools available?
+#### Hypothesis Quality Checklist
+A valid hypothesis includes:
+- Observation or evidence
+- Single, specific change
+- Directional expectation
+- Defined audience
+- Measurable success criteria

---

-## Core Principles
+### 3️⃣ Hypothesis Lock (Hard Gate)

-### 1. Start with a Hypothesis
-- Not just "let's see what happens"
-- Specific prediction of outcome
-- Based on reasoning or data
+Before designing variants or metrics, you MUST:

-### 2. Test One Thing
-- Single variable per test
-- Otherwise you don't know what worked
-- Save MVT for later
+- Present the **final hypothesis**
+- Specify:
+  - Target audience
+  - Primary metric
+  - Expected direction of effect
+  - Minimum Detectable Effect (MDE)

-### 3. Statistical Rigor
-- Pre-determine sample size
-- Don't peek and stop early
-- Commit to the methodology
+Ask explicitly:

-### 4. Measure What Matters
-- Primary metric tied to business value
-- Secondary metrics for context
-- Guardrail metrics to prevent harm
+> “Is this the final hypothesis we are committing to for this test?”
+
+**Do NOT proceed until confirmed.**

---

-## Hypothesis Framework
+### 4️⃣ Assumptions & Validity Check (Mandatory)

-### Structure
+Explicitly list assumptions about:

-```
-Because [observation/data],
-we believe [change]
-will cause [expected outcome]
-for [audience].
-We'll know this is true when [metrics].
-```
+- Traffic stability
+- User independence
+- Metric reliability
+- Randomization quality
+- External factors (seasonality, campaigns, releases)

-### Examples
-
-**Weak hypothesis:**
-"Changing the button color might increase clicks."
-
-**Strong hypothesis:**
-"Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."
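+
+One of these assumptions, randomization quality, can be checked with data rather than judgment once traffic starts flowing. A common diagnostic is a sample ratio mismatch (SRM) check on assignment counts; a minimal sketch, assuming SciPy is available (the function name and counts are illustrative):
+
+```python
+from scipy.stats import chisquare
+
+def srm_detected(observed_counts, planned_ratios, alpha=0.001):
+    """Flag assignment counts that deviate from the planned split
+    by more than chance plausibly allows (sample ratio mismatch)."""
+    total = sum(observed_counts)
+    expected = [ratio * total for ratio in planned_ratios]
+    _, p_value = chisquare(observed_counts, f_exp=expected)
+    return p_value < alpha  # a deliberately strict threshold
+
+# Planned 50/50 split, observed 10,322 vs. 9,478 assignments.
+if srm_detected([10_322, 9_478], [0.5, 0.5]):
+    print("Possible SRM: investigate assignment before trusting results")
+```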
-
-### Good Hypotheses Include
-
-- **Observation**: What prompted this idea
-- **Change**: Specific modification
-- **Effect**: Expected outcome and direction
-- **Audience**: Who this applies to
-- **Metric**: How you'll measure success
+If assumptions are weak or violated:
+- Warn the user
+- Recommend delaying or redesigning the test

---

-## Test Types
+### 5️⃣ Test Type Selection

-### A/B Test (Split Test)
-- Two versions: Control (A) vs. Variant (B)
-- Single change between versions
-- Most common, easiest to analyze
+Choose the simplest valid test:

-### A/B/n Test
-- Multiple variants (A vs. B vs. C...)
-- Requires more traffic
-- Good for testing several options
+- **A/B Test** – single change, two variants
+- **A/B/n Test** – multiple variants, higher traffic required
+- **Multivariate Test (MVT)** – interaction effects, very high traffic
+- **Split URL Test** – major structural changes

-### Multivariate Test (MVT)
-- Multiple changes in combinations
-- Tests interactions between changes
-- Requires significantly more traffic
-- Complex analysis
+Default to **A/B** unless there is a clear reason otherwise.

-### Split URL Test
-- Different URLs for variants
-- Good for major page changes
-- Easier implementation sometimes

---

-## Sample Size Calculation
+### 6️⃣ Metrics Definition

-### Inputs Needed
+#### Primary Metric (Mandatory)
+- Single metric used to evaluate success
+- Directly tied to the hypothesis
+- Pre-defined and frozen before launch

-1. **Baseline conversion rate**: Your current rate
-2. **Minimum detectable effect (MDE)**: Smallest change worth detecting
-3. **Statistical significance level**: Usually 95%
-4. **Statistical power**: Usually 80%
+#### Secondary Metrics
+- Provide context
+- Explain *why* results occurred
+- Must not override the primary metric

-### Quick Reference
-
-| Baseline Rate | 10% Lift | 20% Lift | 50% Lift |
-|---------------|----------|----------|----------|
-| 1% | 150k/variant | 39k/variant | 6k/variant |
-| 3% | 47k/variant | 12k/variant | 2k/variant |
-| 5% | 27k/variant | 7k/variant | 1.2k/variant |
-| 10% | 12k/variant | 3k/variant | 550/variant |
-
-### Formula Resources
-- Evan Miller's calculator: https://www.evanmiller.org/ab-testing/sample-size.html
-- Optimizely's calculator: https://www.optimizely.com/sample-size-calculator/
-
-### Test Duration
-
-```
-Duration = Sample size needed per variant × Number of variants
-           ───────────────────────────────────────────────────
-           Daily traffic to test page × Conversion rate
-```
-
-Minimum: 1-2 business cycles (usually 1-2 weeks)
-Maximum: Avoid running too long (novelty effects, external factors)
+#### Guardrail Metrics
+- Metrics that must not degrade
+- Used to prevent harmful wins
+- Trigger a test stop if significantly negative

---

-## Metrics Selection
+### 7️⃣ Sample Size & Duration

-### Primary Metric
-- Single metric that matters most
-- Directly tied to hypothesis
-- What you'll use to call the test
+Define upfront:
+- Baseline rate
+- MDE
+- Significance level (typically 5%, i.e. 95% confidence)
+- Statistical power (typically 80%)

-### Secondary Metrics
-- Support primary metric interpretation
-- Explain why/how the change worked
-- Help understand user behavior
+Estimate:
+- Required sample size per variant
+- Expected test duration
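+
+For a rough feasibility check, the sketch below estimates both numbers using only the Python standard library; it assumes a normal-approximation formula, and the baseline, MDE, and traffic figures are illustrative, so treat your testing tool's calculator as authoritative:
+
+```python
+from statistics import NormalDist
+
+def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
+    """Approximate per-variant sample size for a two-proportion z-test."""
+    p1 = baseline
+    p2 = baseline * (1 + relative_mde)      # e.g. 0.15 means a +15% lift
+    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
+    z_power = NormalDist().inv_cdf(power)
+    pooled = (p1 + p2) / 2
+    top = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
+           + z_power * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
+    return int(top / (p2 - p1) ** 2) + 1
+
+# Illustrative inputs: 5% baseline, +15% relative MDE, 4,000 visitors/day.
+n = sample_size_per_variant(0.05, 0.15)
+days = 2 * n / 4_000                        # two variants share the traffic
+print(f"~{n:,} per variant, about {days:.0f} days at 4,000 visitors/day")
+```
+
+If the estimated duration stretches far beyond one or two business cycles, treat that as a signal to raise the MDE or pick a higher-traffic page rather than to run a months-long test.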
-### Guardrail Metrics
-- Things that shouldn't get worse
-- Revenue, retention, satisfaction
-- Stop test if significantly negative
-
-### Metric Examples by Test Type
-
-**Homepage CTA test:**
-- Primary: CTA click-through rate
-- Secondary: Time to click, scroll depth
-- Guardrail: Bounce rate, downstream conversion
-
-**Pricing page test:**
-- Primary: Plan selection rate
-- Secondary: Time on page, plan distribution
-- Guardrail: Support tickets, refund rate
-
-**Signup flow test:**
-- Primary: Signup completion rate
-- Secondary: Field-level completion, time to complete
-- Guardrail: User activation rate (post-signup quality)
+**Do NOT proceed without a realistic sample size estimate.**

---

-## Designing Variants
+### 8️⃣ Execution Readiness Gate (Hard Stop)

-### Control (A)
-- Current experience, unchanged
-- Don't modify during test
+You may proceed to implementation **only if all are true**:

-### Variant (B+)
+- Hypothesis is locked
+- Primary metric is frozen
+- Sample size is calculated
+- Test duration is defined
+- Guardrails are set
+- Tracking is verified

-**Best practices:**
-- Single, meaningful change
-- Bold enough to make a difference
-- True to the hypothesis
-
-**What to vary:**
-
-Headlines/Copy:
-- Message angle
-- Value proposition
-- Specificity level
-- Tone/voice
-
-Visual Design:
-- Layout structure
-- Color and contrast
-- Image selection
-- Visual hierarchy
-
-CTA:
-- Button copy
-- Size/prominence
-- Placement
-- Number of CTAs
-
-Content:
-- Information included
-- Order of information
-- Amount of content
-- Social proof type
-
-### Documenting Variants
-
-```
-Control (A):
-- Screenshot
-- Description of current state
-
-Variant (B):
-- Screenshot or mockup
-- Specific changes made
-- Hypothesis for why this will win
-```
-
----
-
-## Traffic Allocation
-
-### Standard Split
-- 50/50 for A/B test
-- Equal split for multiple variants
-
-### Conservative Rollout
-- 90/10 or 80/20 initially
-- Limits risk of bad variant
-- Longer to reach significance
-
-### Ramping
-- Start small, increase over time
-- Good for technical risk mitigation
-- Most tools support this
-
-### Considerations
-- Consistency: Users see same variant on return
-- Segment sizes: Ensure segments are large enough
-- Time of day/week: Balanced exposure
-
----
-
-## Implementation Approaches
-
-### Client-Side Testing
-
-**Tools**: PostHog, Optimizely, VWO, custom
-
-**How it works**:
-- JavaScript modifies page after load
-- Quick to implement
-- Can cause flicker
-
-**Best for**:
-- Marketing pages
-- Copy/visual changes
-- Quick iteration
-
-### Server-Side Testing
-
-**Tools**: PostHog, LaunchDarkly, Split, custom
-
-**How it works**:
-- Variant determined before page renders
-- No flicker
-- Requires development work
-
-**Best for**:
-- Product features
-- Complex changes
-- Performance-sensitive pages
-
-### Feature Flags
-
-- Binary on/off (not true A/B)
-- Good for rollouts
-- Can convert to A/B with percentage split
+If any item is missing, stop and resolve it.
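+
+One readiness item that is easy to verify mechanically is assignment consistency: a returning user must see the same variant on every visit. A minimal hash-based sketch (the test name, split, and scheme are illustrative; most testing tools provide this for you):
+
+```python
+import hashlib
+
+def assign_variant(user_id: str, test_name: str, split: float = 0.5) -> str:
+    """Deterministic bucketing: the same user and test always
+    map to the same variant, with no stored state."""
+    digest = hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest()
+    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # roughly uniform in [0, 1]
+    return "control" if bucket < split else "variant"
+
+# Stable across sessions and page loads that share the same user ID.
+assert assign_variant("user-123", "cta-copy") == assign_variant("user-123", "cta-copy")
+```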
---

## Running the Test

-### Pre-Launch Checklist
-
-- [ ] Hypothesis documented
-- [ ] Primary metric defined
-- [ ] Sample size calculated
-- [ ] Test duration estimated
-- [ ] Variants implemented correctly
-- [ ] Tracking verified
-- [ ] QA completed on all variants
-- [ ] Stakeholders informed
-
### During the Test

**DO:**
-- Monitor for technical issues
-- Check segment quality
-- Document any external factors
+- Monitor technical health
+- Document external factors

-**DON'T:**
-- Peek at results and stop early
-- Make changes to variants
-- Add traffic from new sources
-- End early because you "know" the answer
+**DO NOT:**
+- Stop early due to “good-looking” results
+- Change variants mid-test
+- Add new traffic sources
+- Redefine success criteria

-### Peeking Problem
-
-Looking at results before reaching sample size and stopping when you see significance leads to:
-- False positives
-- Inflated effect sizes
-- Wrong decisions
-
-**Solutions:**
-- Pre-commit to sample size and stick to it
-- Use sequential testing if you must peek
-- Trust the process

---

## Analyzing Results

-### Statistical Significance
+### Analysis Discipline

-- 95% confidence = p-value < 0.05
-- Means: <5% chance result is random
-- Not a guarantee—just a threshold
+When interpreting results:

-### Practical Significance
+- Do NOT generalize beyond the tested population
+- Do NOT claim causality beyond the tested change
+- Do NOT override guardrail failures
+- Separate statistical significance from business judgment

-Statistical ≠ Practical
-
-- Is the effect size meaningful for business?
-- Is it worth the implementation cost?
-- Is it sustainable over time?
+### Interpretation Outcomes

-### What to Look At
-
-1. **Did you reach sample size?**
-   - If not, result is preliminary
-
-2. **Is it statistically significant?**
-   - Check confidence intervals
-   - Check p-value
-
-3. **Is the effect size meaningful?**
-   - Compare to your MDE
-   - Project business impact
-
-4. **Are secondary metrics consistent?**
-   - Do they support the primary?
-   - Any unexpected effects?
-
-5. **Any guardrail concerns?**
-   - Did anything get worse?
-   - Long-term risks?
-
-6. **Segment differences?**
-   - Mobile vs. desktop?
-   - New vs. returning?
-   - Traffic source?
-
-### Interpreting Results
-
-| Result | Conclusion |
-|--------|------------|
-| Significant winner | Implement variant |
-| Significant loser | Keep control, learn why |
-| No significant difference | Need more traffic or bolder test |
-| Mixed signals | Dig deeper, maybe segment |
+| Result | Action |
+|--------|--------|
+| Significant positive | Consider rollout |
+| Significant negative | Reject variant, document learning |
+| Inconclusive | Consider more traffic or bolder change |
+| Guardrail failure | Do not ship, even if primary wins |
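+
+For a conversion-rate primary metric, the arithmetic behind "significant" is typically a two-proportion z-test. A minimal standard-library sketch, applicable only once the pre-committed sample size has been reached (the counts are illustrative):
+
+```python
+from statistics import NormalDist
+
+def two_proportion_p_value(conversions_a, n_a, conversions_b, n_b):
+    """Two-sided p-value for the difference between two conversion rates."""
+    rate_a, rate_b = conversions_a / n_a, conversions_b / n_b
+    pooled = (conversions_a + conversions_b) / (n_a + n_b)
+    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
+    z = (rate_b - rate_a) / se
+    return 2 * (1 - NormalDist().cdf(abs(z)))
+
+# Illustrative readout: control 600/12,000 vs. variant 690/12,000.
+p = two_proportion_p_value(600, 12_000, 690, 12_000)
+print(f"p = {p:.4f}")  # compare against the alpha committed to up front
+```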
---

-## Documenting and Learning
+## Documentation & Learning

-### Test Documentation
+### Test Record (Mandatory)

-```
-Test Name: [Name]
-Test ID: [ID in testing tool]
-Dates: [Start] - [End]
-Owner: [Name]
+Document:
+- Hypothesis
+- Variants
+- Metrics
+- Sample size (target vs. achieved)
+- Results
+- Decision
+- Learnings
+- Follow-up ideas

-Hypothesis:
-[Full hypothesis statement]
-
-Variants:
-- Control: [Description + screenshot]
-- Variant: [Description + screenshot]
-
-Results:
-- Sample size: [achieved vs. target]
-- Primary metric: [control] vs. [variant] ([% change], [confidence])
-- Secondary metrics: [summary]
-- Segment insights: [notable differences]
-
-Decision: [Winner/Loser/Inconclusive]
-Action: [What we're doing]
-
-Learnings:
-[What we learned, what to test next]
-```
-
-### Building a Learning Repository
-
-- Central location for all tests
-- Searchable by page, element, outcome
-- Prevents re-running failed tests
-- Builds institutional knowledge
+Store records in a shared, searchable location so failed tests are not unknowingly re-run.

---

-## Output Format
+## Refusal Conditions (Safety)

-### Test Plan Document
+Refuse to proceed if:
+- Baseline rate is unknown and cannot be estimated
+- Traffic is insufficient to detect the MDE
+- Primary metric is undefined
+- Multiple variables are changed without proper design
+- Hypothesis cannot be clearly stated

-```
-# A/B Test: [Name]
-
-## Hypothesis
-[Full hypothesis using framework]
-
-## Test Design
-- Type: A/B / A/B/n / MVT
-- Duration: X weeks
-- Sample size: X per variant
-- Traffic allocation: 50/50
-
-## Variants
-[Control and variant descriptions with visuals]
-
-## Metrics
-- Primary: [metric and definition]
-- Secondary: [list]
-- Guardrails: [list]
-
-## Implementation
-- Method: Client-side / Server-side
-- Tool: [Tool name]
-- Dev requirements: [If any]
-
-## Analysis Plan
-- Success criteria: [What constitutes a win]
-- Segment analysis: [Planned segments]
-```
-
-### Results Summary
-When test is complete
-
-### Recommendations
-Next steps based on results
+Explain why and recommend next steps.

---

-## Common Mistakes
+## Key Principles (Non-Negotiable)

-### Test Design
-- Testing too small a change (undetectable)
-- Testing too many things (can't isolate)
-- No clear hypothesis
-- Wrong audience
+- One hypothesis per test
+- One primary metric
+- Commit before launch
+- No peeking
+- Learning over winning
+- Statistical rigor first

-### Execution
-- Stopping early
-- Changing things mid-test
-- Not checking implementation
-- Uneven traffic allocation
-
-### Analysis
-- Ignoring confidence intervals
-- Cherry-picking segments
-- Over-interpreting inconclusive results
-- Not considering practical significance

---

-## Questions to Ask
+## Final Reminder

-If you need more context:
-1. What's your current conversion rate?
-2. How much traffic does this page get?
-3. What change are you considering and why?
-4. What's the smallest improvement worth detecting?
-5. What tools do you have for testing?
-6. Have you tested this area before?
+A/B testing is not about proving ideas right.
+It is about **learning the truth with confidence**.

---

-## Related Skills
-
-- **page-cro**: For generating test ideas based on CRO principles
-- **analytics-tracking**: For setting up test measurement
-- **copywriting**: For creating variant copy
+If you feel tempted to rush, simplify, or “just try it”,
+that is the signal to **slow down and re-check the design**.