What Is A/B Testing?
A/B testing (also called split testing or controlled experimentation) is a method of comparing two or more variants of something — a webpage, email subject line, feature, price, or recommendation algorithm — to determine which performs better on a defined metric. Users are randomly assigned to a control group (A) or one or more treatment groups (B, C...). By measuring outcomes for each group and applying statistical tests, analysts can attribute differences in outcomes to the change rather than to random noise or confounding factors.
A/B testing is the gold standard for causal inference in product and marketing analytics. Unlike observational analysis — which can reveal correlations but cannot definitively establish causation — a properly designed experiment with random assignment allows analysts to conclude that a change caused an improvement.
Key Terminology
| Term | Definition |
|---|---|
| Control (A) | The existing version; baseline for comparison |
| Treatment (B) | The new variant being tested |
| Randomization unit | The entity randomly assigned (user, session, device, account) |
| Primary metric | The main outcome the experiment is designed to move (conversion rate, revenue per user) |
| Guardrail metric | A metric that must not regress (e.g., page load time, error rate) |
| Statistical significance | A result is statistically significant when its p-value falls below the pre-specified significance level α. It is not the probability that the effect is real, and it is not equal to 1 − p-value |
| p-value | The probability of observing a result at least as extreme if there were truly no difference |
| Significance level (α) | The threshold p-value for declaring a result significant (commonly 0.05) |
| Statistical power (1 − β) | The probability of detecting a true effect when one exists (commonly targeted at 80%) |
| Minimum detectable effect (MDE) | The smallest effect size the test is designed to detect with the target power |
The Hypothesis Testing Framework
Every A/B test is a hypothesis test. The null hypothesis (H₀) states that there is no difference between control and treatment. The alternative hypothesis (H₁) states that a difference exists (or that the treatment is better).
The analyst sets a significance level α (usually 0.05) before running the test. If the p-value from the test falls below α, the null hypothesis is rejected and the result is declared statistically significant. This does not prove the treatment is better; it means that a difference at least as large as the one observed would occur with probability below α if there were truly no effect.
Two types of errors:
| Error Type | Description | Controlled By |
|---|---|---|
| Type I error (false positive) | Declaring a winner when there is no real effect | Significance level α (lower α = fewer false positives) |
| Type II error (false negative) | Failing to detect a real effect | Power (1 − β); requires sufficient sample size |
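One way to see what α controls is to simulate many A/A tests, where both groups share the same true conversion rate, and count how often a test is declared significant anyway. The sketch below uses only the Python standard library; the function names, the 10% conversion rate, and the group sizes are illustrative assumptions rather than values from any real experiment.

```python
import random
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0  # identical all-or-nothing outcomes: no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(true_rate=0.10, n_per_group=500, n_tests=1000,
                        alpha=0.05, seed=0):
    """Share of simulated A/A tests (no real effect) that reach significance."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        # Both groups draw from the same true rate, so H0 holds by construction.
        conv_a = sum(rng.random() < true_rate for _ in range(n_per_group))
        conv_b = sum(rng.random() < true_rate for _ in range(n_per_group))
        p = two_proportion_p_value(conv_a, n_per_group, conv_b, n_per_group)
        if p < alpha:
            false_positives += 1
    return false_positives / n_tests


# The printed share should land near alpha, since every "win" here is a Type I error.
print(false_positive_rate())
```

Lowering `alpha` in this simulation reduces the share of false positives, at the cost of needing more data to detect real effects.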
Sample Size Calculation
Before running a test, calculate the required sample size to achieve adequate statistical power. Underpowered tests frequently produce false negatives — real improvements go undetected. The required sample size depends on four inputs:
| Input | Typical Value | Effect on Sample Size |
|---|---|---|
| Significance level (α) | 0.05 | Lower α → larger sample needed |
| Statistical power (1 − β) | 0.80 | Higher power → larger sample needed |
| Baseline conversion rate | Measured from historical data | For a fixed absolute MDE, rates near 50% need the largest samples because p(1 − p) peaks there; for a fixed relative MDE, low baseline rates need larger samples |
| Minimum detectable effect (MDE) | Defined by business need | Smaller MDE → much larger sample needed |
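A common approximation for a two-proportion z-test is n = 2 × (z_α/2 + z_β)² × p(1 − p) / δ², where n is the number of users required in each group, p is the baseline rate, δ is the MDE expressed as an absolute difference in rates, and z_α/2 and z_β are the standard normal critical values for the two-sided significance level and the target power (about 1.96 and 0.84 for α = 0.05 and 80% power). In practice, use an online calculator or Python's statsmodels: statsmodels.stats.proportion.proportion_effectsize converts two rates into a standardized effect size, and statsmodels.stats.power.NormalIndPower().solve_power turns that effect size into a required sample size.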
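The formula above can be evaluated directly. The following is a minimal sketch using only the Python standard library; it assumes n is read as a per-group sample size and δ as an absolute difference in rates, and the function name `sample_size_per_group` is illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = baseline_rate * (1 - baseline_rate)
    n = 2 * (z_alpha + z_beta) ** 2 * variance / mde_abs ** 2
    return math.ceil(n)


# Example: 10% baseline, detect an absolute change of 1 percentage point.
print(sample_size_per_group(0.10, 0.01))
# Halving the MDE roughly quadruples the required sample.
print(sample_size_per_group(0.10, 0.005))
```

With these inputs the requirement comes out at roughly 14,000 users per group, which illustrates why small MDEs are expensive: the sample size grows with the inverse square of δ.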
Running the Test
Randomization: Assign users randomly to groups. The most common unit is user ID (persistent across sessions). Session-level randomization is faster to reach sample size but can create inconsistent experiences. Use a deterministic hash of the user ID and experiment name to ensure the same user always gets the same variant across sessions.
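The deterministic-hash approach described above might look like the following sketch. The key format, the choice of SHA-256, and the function name `assign_variant` are assumptions made for illustration; production systems often add salts, exposure logging, and weighted buckets.

```python
import hashlib


def assign_variant(user_id, experiment_name, variants=("control", "treatment")):
    """Map a user to a variant, stable across sessions for a given experiment."""
    # Including the experiment name keeps assignments independent across experiments.
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the hash as a uniform integer bucket.
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


# The same user and experiment always produce the same variant.
print(assign_variant("user-123", "checkout-button-color"))
print(assign_variant("user-123", "checkout-button-color"))
```

Because the assignment depends only on the user ID and experiment name, no lookup table is needed and any service can recompute it consistently.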
Traffic split: The default 50/50 split maximizes statistical power per total sample. Unequal splits (e.g., 90/10) are used when the risk of degrading user experience with the treatment is high — but they require a larger total sample for the same power.
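The cost of an unequal split can be estimated with a short calculation. Assuming both groups have roughly the same outcome variance, the variance of the estimated difference is proportional to 1/n_A + 1/n_B, so the total sample needed for the same power scales with 1 / (4 × r × (1 − r)), where r is the share of traffic sent to the treatment. The helper name below is illustrative.

```python
def total_sample_inflation(treatment_share):
    """Total-sample multiplier relative to a 50/50 split, assuming equal variances."""
    if not 0 < treatment_share < 1:
        raise ValueError("treatment_share must be strictly between 0 and 1")
    return 1 / (4 * treatment_share * (1 - treatment_share))


for share in (0.5, 0.3, 0.1):
    print(share, round(total_sample_inflation(share), 2))
```

Under this approximation a 90/10 split needs about 2.8 times the total traffic of a 50/50 split to reach the same power.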
Novelty effect: Users may behave differently simply because something is new. Run tests for at least one full week to capture weekday/weekend patterns, and longer for products with lower usage frequency.
Analyzing Results
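The most common statistical test for A/B tests on proportions (conversion rates, click-through rates) is the two-proportion z-test. For continuous metrics (revenue per user, time on page), use a two-sample t-test, or a Mann-Whitney U test if distributions are heavily skewed. Note that the Mann-Whitney U test compares the two distributions as a whole rather than their means, so it answers a slightly different question.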
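In Python, using statsmodels for the z-test and scipy for the t-test:
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest
z_stat, p_value_conversion = proportions_ztest([conversions_B, conversions_A], [n_B, n_A])
t_stat, p_value_revenue = stats.ttest_ind(revenue_B, revenue_A)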
Report the observed lift: lift = (rate_B - rate_A) / rate_A, the confidence interval for the difference, the p-value, and whether the result passes your pre-specified significance threshold.
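The quantities listed above can be computed together in one place. This is a standard-library sketch, not a replacement for a statistics package: it uses a pooled standard error for the z-test and an unpooled (Wald) standard error for the confidence interval, and it assumes a nonzero control rate. The function name and the example counts are hypothetical.

```python
from statistics import NormalDist


def summarize_ab_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Relative lift, CI for the absolute difference, and two-sided p-value."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    diff = rate_b - rate_a
    lift = diff / rate_a  # relative lift; assumes rate_a > 0

    # Pooled standard error for the hypothesis test (H0: equal rates).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval of the difference.
    se_unpooled = (rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    return {"lift": lift, "diff": diff, "ci": ci,
            "p_value": p_value, "significant": p_value < alpha}


# Hypothetical counts: 1,000 of 10,000 convert in A; 1,100 of 10,000 in B.
print(summarize_ab_test(1000, 10000, 1100, 10000))
```

Reporting the interval alongside the p-value shows how large or small the true difference could plausibly be, which a bare significant/not-significant verdict hides.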
Common Pitfalls
| Pitfall | Description | Fix |
|---|---|---|
| Peeking (optional stopping) | Stopping the test early when results look significant, which inflates Type I error | Pre-specify end date and sample size; use sequential testing if early stopping is needed |
| Multiple comparisons | Testing many metrics or variants increases false positive rate | Apply Bonferroni correction; pre-specify primary metric |
| Sample ratio mismatch | Actual traffic split differs significantly from intended split, which indicates a bug | Always check observed vs. expected traffic per variant before analyzing |
| Network effects / interference | Treatment users affect control users (e.g., social features, marketplaces) | Use cluster-level randomization (by geography, friend group) |
| Novelty or primacy effects | Short-term behavior change due to newness, not a lasting effect | Run tests longer; analyze new vs. returning user segments separately |
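The sample ratio mismatch check in the table can be automated with a chi-square goodness-of-fit test on the observed group sizes. The sketch below covers the two-variant case only, where the test has one degree of freedom and the p-value can be computed from the standard library's `math.erfc`; the function name and the example counts are hypothetical. Teams often use a strict threshold for this check (for example 0.001), which is an assumption here rather than a rule from the text.

```python
import math


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square (1 degree of freedom) p-value for a two-variant traffic split."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts for an intended 50/50 split.
print(srm_p_value(5000, 5050))    # small imbalance, consistent with chance
print(srm_p_value(50000, 51200))  # large imbalance, worth investigating
```

A very small p-value here says nothing about the treatment effect; it signals that assignment or logging is likely broken and the experiment results should not be trusted until the cause is found.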
Summary
A/B testing provides the rigorous causal evidence that separates correlation from causation in product and marketing analytics. The discipline requires careful experimental design — specifying hypotheses, calculating sample sizes, pre-registering metrics, and respecting the test duration — before any data is collected. Analyzing results correctly means reporting effect sizes, confidence intervals, and p-values against pre-specified thresholds, not data-mining for significance after the fact. When done properly, A/B testing is the most reliable tool available for making decisions that improve products, user experience, and revenue in a measured, evidence-based way.