A/B Test Sample Size Guide

Calculate required sample size and test duration for statistically valid experiments.

Use Case

Planning A/B tests, determining test feasibility, understanding statistical requirements, or educating stakeholders on why tests take time.

Prompt

You are a statistics expert helping me determine the right sample size for an A/B test. I need to ensure my test has enough statistical power to detect meaningful effects.

Test Context:
- Primary metric: [What you're measuring, e.g., conversion rate]
- Current baseline: [Current metric value, e.g., 3.2% conversion rate]
- Minimum detectable effect: [Smallest change worth detecting, e.g., 10% relative lift]
- Daily traffic/sample: [How many users per day reach this test point]
- Traffic split: [e.g., 50/50 between control and treatment]

Please help me calculate:

1. Sample Size Calculation
   Assumptions:
   - Confidence level: 95% (significance level α = 0.05)
   - Statistical power: 80% (β = 0.20)
   - Test type: Two-tailed
   
   Calculation:
   - Required sample size per variant: [Calculate]
   - Total sample size needed: [Calculate]
   - Formula/methodology used: [Explain]

2. Test Duration Estimate
   - Daily sample per variant: [Based on traffic]
   - Estimated test duration: [Days/weeks needed]
   - Recommended minimum: [At least 1 full week for weekly patterns]
   - Buffer recommendation: [Add 20% for safety]

3. Sensitivity Analysis
   Show how sample size changes with different parameters:
   
   | MDE | Required Sample | Duration |
   |-----|-----------------|----------|
   | 5%  | [Calculate]     | [Days]   |
   | 10% | [Calculate]     | [Days]   |
   | 15% | [Calculate]     | [Days]   |
   | 20% | [Calculate]     | [Days]   |

4. Power Analysis
   With your current traffic over [X] weeks, what's the:
   - Minimum detectable effect you can reliably detect
   - Statistical power for your target MDE
   - Risk of false negatives

5. Practical Considerations
   - Weekly patterns: Test should run at least 1 full week
   - Novelty effects: Consider running 2+ weeks
   - Holidays/events: Avoid testing during anomalies
   - Multiple variants: Apply a Bonferroni correction to α when comparing more than one treatment against control

6. Recommendations
   - Is your test feasible with current traffic?
   - If not, what are your options?
     - Raise the MDE (detect larger effects only)
     - Increase traffic allocation
     - Run longer
     - Use more sensitive metrics
   - Red flags to watch for

7. Monitoring Plan
   - When to first check results: [Not before X samples]
   - How often to monitor: [Recommend daily for issues, not for significance]
   - When to stop: [Only when sample size reached OR critical issue]
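
Outside the prompt itself, the answers to steps 1 and 2 can be sanity-checked with a short script. The sketch below assumes the common normal-approximation formula for a two-sided, two-proportion z-test with unpooled variance; online calculators may use slightly different variance terms, so small differences are expected. The function names and the 5,000 users/day figure are hypothetical, and the 3.2% baseline and 10% relative lift are just the example values from the placeholders above.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant (two-sided two-proportion z-test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)            # rate under the target lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # unpooled variance (assumption)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def duration_days(n_per_variant, daily_traffic, variants=2, buffer=0.20):
    """Days needed with an even split, a safety buffer, and a 7-day minimum."""
    total = n_per_variant * variants * (1 + buffer)
    return max(math.ceil(total / daily_traffic), 7)


# Example values from the prompt placeholders; daily traffic is hypothetical.
n = sample_size_per_variant(baseline=0.032, relative_mde=0.10)
days = duration_days(n, daily_traffic=5000)
print(f"Per variant: {n:,} users; estimated duration: {days} days")
```

With these inputs the result is on the order of 50,000 users per variant, which is why a low-baseline metric with a modest lift can take weeks to test.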

How to use

  1. Identify your primary metric and current baseline
  2. Determine the minimum effect worth detecting (usually 5-20% relative)
  3. Check your traffic data for daily sample size
  4. Replace placeholders with your specific values
  5. Review the duration estimate against your timeline
  6. If duration is too long, explore the sensitivity analysis options
  7. Share with stakeholders to set realistic expectations
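
For step 6, it can help to see how quickly the required sample shrinks as the MDE grows. The following is a minimal sketch that rebuilds the sensitivity table locally, assuming the same normal-approximation formula for a two-sided two-proportion test, a 50/50 split, 95% confidence, and 80% power. The 3.2% baseline comes from the placeholder example; the 5,000 users/day figure and all function names are hypothetical.

```python
import math
from statistics import NormalDist


def required_n(p1, relative_mde, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a relative lift over baseline p1."""
    p2 = p1 * (1 + relative_mde)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)


def sensitivity_rows(p1, daily_traffic, mdes=(0.05, 0.10, 0.15, 0.20)):
    """Return (mde, per-variant sample, days) tuples, assuming a 50/50 split."""
    rows = []
    for mde in mdes:
        n = required_n(p1, mde)
        days = math.ceil(2 * n / daily_traffic)  # two variants share the traffic
        rows.append((mde, n, days))
    return rows


rows = sensitivity_rows(p1=0.032, daily_traffic=5000)  # hypothetical traffic
print(" MDE | Per variant | Days")
for mde, n, days in rows:
    print(f"{mde:>4.0%} | {n:>11,} | {days:>4}")
```

Because the sample size scales roughly with 1/MDE², halving the MDE roughly quadruples the required traffic, which is usually the most persuasive number to show stakeholders.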

Pro Tips

  • Use online calculators (Evan Miller, Optimizely) to verify the calculation
  • Conservative estimates are better than optimistic ones
  • Account for weekly patterns - never run for fewer than 7 days
  • If you can't detect 5% changes, you might need a different approach
  • Multiple metrics = multiple comparison corrections needed
  • Low traffic? Consider qualitative research instead
  • Share this with stakeholders who ask "why does it take so long?"
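
The tip about multiple comparisons can be made concrete. The sketch below assumes a plain Bonferroni correction (α divided by the number of comparisons) and the normal approximation for a two-sided two-proportion test; it estimates how much power a fixed sample retains as more metrics or variants are tested. The function names, the 3.2% baseline, and the 50,000 users per variant are illustrative assumptions, not values from any particular tool.

```python
import math
from statistics import NormalDist


def bonferroni_alpha(alpha, comparisons):
    """Per-comparison significance level under a Bonferroni correction."""
    return alpha / comparisons


def achieved_power(p1, relative_mde, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test at a fixed n."""
    p2 = p1 * (1 + relative_mde)
    se = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of rejecting in the wrong direction.
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)


# Hypothetical scenario: 3.2% baseline, 10% relative lift, 50,000 users per variant.
for m in (1, 2, 3, 5):
    a = bonferroni_alpha(0.05, m)
    power = achieved_power(0.032, 0.10, 50_000, alpha=a)
    print(f"{m} comparison(s): alpha = {a:.4f}, power = {power:.2f}")
```

If the power for the corrected α falls well below 80%, that is a signal to run longer, test fewer things at once, or accept a larger MDE.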

Tags

ab-testing, statistics, sample-size, experimentation, power-analysis