What Is Hypothesis Testing?
Hypothesis testing is a statistical framework for making data-driven decisions by evaluating whether observed results are likely due to chance or reflect a real effect. It is the foundation of A/B testing — the controlled experiment methodology used by product teams, marketers, and analysts to compare two or more variants and determine which performs better. Understanding hypothesis testing prevents analysts from confusing random variation with meaningful change, a mistake that leads to incorrect product decisions and wasted resources.
Core Concepts of Hypothesis Testing
| Concept | Definition | Example |
|---|---|---|
| Null hypothesis (H₀) | The default assumption that there is no effect or difference between groups | The new button color has no effect on click-through rate |
| Alternative hypothesis (H₁) | The claim that an effect or difference exists | The new button color increases click-through rate |
| p-value | Probability of observing results at least as extreme as the data if H₀ were true | p = 0.03 means that, if H₀ were true, a difference at least this large would appear in about 3% of experiments; it is not the probability that H₀ is true |
| Significance level (α) | The threshold below which we reject H₀; typically 0.05 | If p < 0.05, reject H₀ and treat the difference as statistically significant |
| Statistical power (1−β) | Probability of correctly detecting a real effect when one exists (β is the Type II error rate) | 80% power means an 80% chance of detecting a true effect of the assumed size |
| Effect size | The magnitude of the difference between groups (not just statistical significance) | A 0.5% lift in conversion with p < 0.05 may not be practically meaningful |
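
To make these definitions concrete, here is a minimal sketch of a pooled two-proportion z-test that reports an effect size (absolute lift) alongside a p-value. The function name and the click counts are hypothetical and chosen only for illustration. The sketch uses a two-sided test, which is a common default; a one-sided H₁ such as "the new color increases click-through rate" would use only one tail.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift, z statistic, two-sided p-value) for B vs. A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: the estimate of the common rate if H0 (no difference) holds.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: probability of a |z| at least this large under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


ALPHA = 0.05  # significance level, fixed before looking at the data

# Hypothetical counts: 1,000 clicks from 10,000 users (control)
# vs. 1,100 clicks from 10,000 users (variant).
lift, z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
print(f"lift={lift:.4f}  z={z:.2f}  p={p_value:.4f}  reject H0: {p_value < ALPHA}")
```

The lift and the p-value answer different questions: the p-value says how surprising the data would be under H₀, while the lift says whether the difference is large enough to matter.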
Type I and Type II Errors
| Error Type | Description | Consequence | Controlled By |
|---|---|---|---|
| Type I Error (False Positive) | Rejecting H₀ when it is actually true; concluding an effect exists when it does not | Shipping a feature that doesn't actually improve the metric | Significance level α (lower α → fewer false positives) |
| Type II Error (False Negative) | Failing to reject H₀ when it is false; missing a real effect | Discarding a feature that genuinely improves the metric | Statistical power (higher power → fewer false negatives) |
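
Both error rates can be estimated by simulation. The sketch below, which assumes a pooled two-proportion z-test and made-up conversion rates (10% vs. 10% for an A/A test, 10% vs. 12% for a real effect), counts how often H₀ is rejected across many simulated experiments. All function names and parameters are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value_two_proportions(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence against H0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(rate_a, rate_b, n_per_group, alpha, n_sims, rng):
    """Fraction of simulated experiments in which H0 is rejected."""
    rejections = 0
    for _ in range(n_sims):
        # Each user converts independently with the group's true rate.
        conv_a = sum(rng.random() < rate_a for _ in range(n_per_group))
        conv_b = sum(rng.random() < rate_b for _ in range(n_per_group))
        if p_value_two_proportions(conv_a, n_per_group, conv_b, n_per_group) < alpha:
            rejections += 1
    return rejections / n_sims


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# A/A test: both groups share a 10% rate, so every rejection is a Type I error.
false_positive_rate = rejection_rate(0.10, 0.10, 1000, 0.05, 1000, rng)

# Real effect (10% vs. 12%): every failure to reject is a Type II error.
power = rejection_rate(0.10, 0.12, 1000, 0.05, 1000, rng)

print(f"Estimated Type I error rate: {false_positive_rate:.3f}")
print(f"Estimated power: {power:.3f} (Type II error rate: {1 - power:.3f})")
```

Under these assumptions the estimated Type I error rate should land near α, while the power at 1,000 users per group falls well short of the usual 80% target, which is the situation a power analysis is meant to prevent.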
A/B Test Design Checklist
| Step | Decision | Why It Matters |
|---|---|---|
| 1. Define the metric | Choose a primary success metric (e.g., conversion rate, revenue per user) | Multiple metrics increase false positive risk; one primary metric keeps the test focused |
| 2. State hypotheses | Write H₀ and H₁ explicitly before collecting data | Post-hoc hypothesis selection inflates false positive rate |
| 3. Determine sample size | Use power analysis: specify α (0.05), power (0.80), and minimum detectable effect (MDE) | Underpowered tests miss real effects; overpowered tests are wasteful |
| 4. Randomize correctly | Assign users randomly to control/treatment; avoid selection bias | Non-random assignment invalidates causal inference |
| 5. Run for full duration | Do not stop early based on interim results | Early stopping inflates Type I error rate dramatically |
| 6. Check for novelty effect | Verify that early engagement lifts are sustained over time | Users may engage with anything new; the effect may fade |
| 7. Analyze and decide | Compare p-value to α and assess practical effect size | Statistical significance alone is not sufficient for a ship decision |
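
Step 3 of the checklist can be sketched with the standard normal-approximation formula for comparing two proportions. The function name, the 10% baseline rate, and the one-percentage-point MDE below are assumptions for illustration; dedicated power-analysis tools may use slightly different approximations and return slightly different numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per group for a two-sided two-proportion z-test.

    baseline_rate: expected conversion rate in control (e.g. 0.10)
    mde_abs: minimum detectable effect as an absolute difference (e.g. 0.01)
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / mde_abs ** 2)


# Hypothetical plan: 10% baseline, detect an absolute lift of 1 percentage point.
n_one_point = sample_size_per_group(0.10, 0.01)
# Halving the MDE roughly quadruples the required sample size.
n_half_point = sample_size_per_group(0.10, 0.005)
print(f"MDE 1.0 pt: {n_one_point} users per group")
print(f"MDE 0.5 pt: {n_half_point} users per group")
```

Because the required sample grows with the inverse square of the MDE, choosing a realistic MDE is usually the decision that dominates test duration.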
Choosing the Right Statistical Test
| Metric Type | Test | Notes |
|---|---|---|
| Proportions (conversion rate, CTR) | Two-proportion z-test or chi-square test | Requires sufficient sample size (rule of thumb: n×p ≥ 5 and n×(1−p) ≥ 5 in each group) |
| Continuous metric (revenue, session duration) | Two-sample t-test (Welch's) | Robust to unequal variances; check for normality or use large n |
| Non-normal continuous metric | Mann-Whitney U test (non-parametric) | No normality assumption; tests whether values in one group tend to be larger than in the other, not a difference in means |
| Multiple variants (A/B/C) | ANOVA followed by post-hoc tests (Bonferroni, Tukey HSD) | Running multiple pairwise t-tests without correction inflates the overall false positive rate |
| Ratio metrics (revenue per session) | Delta method or bootstrap | Ratio metrics have complex variance structures requiring special treatment |
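
For the ratio-metric row, a percentile bootstrap is one way to get a confidence interval without deriving the variance by hand. The sketch below assumes users are the randomization unit and that each user is a hypothetical `(revenue, sessions)` pair; the data are simulated, and the function names are illustrative rather than taken from any library.

```python
import random


def ratio_metric(users):
    """Revenue per session: total revenue divided by total sessions."""
    total_revenue = sum(revenue for revenue, _ in users)
    total_sessions = sum(sessions for _, sessions in users)
    return total_revenue / total_sessions


def bootstrap_ratio_diff_ci(control, treatment, n_boot=2000, ci=0.95, seed=0):
    """Percentile bootstrap CI for treatment minus control revenue per session."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample whole users (the randomization unit), not single sessions,
        # so the correlation between a user's sessions is preserved.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(ratio_metric(t) - ratio_metric(c))
    diffs.sort()
    lo_idx = int((1 - ci) / 2 * n_boot)
    hi_idx = int((1 + ci) / 2 * n_boot) - 1
    return diffs[lo_idx], diffs[hi_idx]


def simulate_users(n, revenue_per_session, rng):
    """Made-up data: 1-10 sessions per user, exponential revenue per session."""
    users = []
    for _ in range(n):
        sessions = rng.randint(1, 10)
        revenue = sum(rng.expovariate(1 / revenue_per_session) for _ in range(sessions))
        users.append((revenue, sessions))
    return users


data_rng = random.Random(1)
control = simulate_users(500, 2.00, data_rng)
treatment = simulate_users(500, 2.40, data_rng)

low, high = bootstrap_ratio_diff_ci(control, treatment)
print(f"95% CI for difference in revenue per session: [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the difference is significant at roughly the corresponding α; the delta method gives a closed-form alternative when resampling is too expensive.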
Common A/B Testing Pitfalls
| Pitfall | Description | Fix |
|---|---|---|
| Peeking at results early | Checking significance repeatedly and stopping when p < 0.05 | Fix the sample size in advance; use sequential testing methods if early stopping is needed |
| Multiple comparisons | Testing many metrics or segments inflates false positive rate | Apply Bonferroni correction or control False Discovery Rate (FDR) |
| Sample ratio mismatch (SRM) | Actual traffic split differs significantly from the intended allocation (e.g., 50/50) | Check for SRM before analyzing results; investigate assignment bugs |
| Network / interference effects | Treatment users affect control users (e.g., social networks, marketplaces) | Use cluster-based randomization or time-based holdouts |
| Novelty or primacy effects | Users react differently to new/changed experiences in the short term | Run tests long enough to capture steady-state behavior |
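
The SRM check in the table can be sketched as a chi-square goodness-of-fit test on the observed group sizes. The user counts below are hypothetical, and the 0.001 threshold is an assumption: SRM checks are often run with a stricter cutoff than 0.05 to avoid false alarms, but the right value depends on the team's tolerance.

```python
from math import sqrt
from statistics import NormalDist


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the split."""
    total = n_control + n_treatment
    exp_control = total * expected_control_share
    exp_treatment = total - exp_control
    chi2 = (
        (n_control - exp_control) ** 2 / exp_control
        + (n_treatment - exp_treatment) ** 2 / exp_treatment
    )
    # With 1 degree of freedom, chi2 is a squared standard normal, so the
    # upper-tail probability follows from the normal CDF.
    return 2 * (1 - NormalDist().cdf(sqrt(chi2)))


SRM_THRESHOLD = 0.001  # assumed cutoff; stricter than the usual 0.05

# Hypothetical counts for an intended 50/50 split of 100,000 users.
p_ok = srm_p_value(50_210, 49_790)   # small imbalance
p_bad = srm_p_value(50_900, 49_100)  # larger imbalance

for label, p in [("50,210 / 49,790", p_ok), ("50,900 / 49,100", p_bad)]:
    verdict = "possible SRM, investigate" if p < SRM_THRESHOLD else "split looks fine"
    print(f"{label}: p={p:.2g} -> {verdict}")
```

A flagged SRM usually points to a bug in assignment or logging, so the experiment's results should not be trusted until the cause is found.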
Summary
A/B testing is powerful precisely because it provides causal evidence, not just correlation. But that power depends entirely on rigorous experimental design: pre-specifying hypotheses, calculating sample sizes before the test starts, randomizing correctly, and not stopping early. A test that violates these principles can produce a false positive just as easily as a poorly designed survey. Analysts who master hypothesis testing bring scientific rigor to product decisions, replacing gut instinct and HiPPO (Highest Paid Person's Opinion) with reliable evidence.