A/B Testing and Experimentation for Data Analysts

What Is A/B Testing?

A/B testing (also called split testing or controlled experimentation) is a method of comparing two or more variants of something — a webpage, email subject line, feature, price, or recommendation algorithm — to determine which performs better on a defined metric. Users are randomly assigned to a control group (A) or one or more treatment groups (B, C...). By measuring outcomes for each group and applying statistical tests, analysts can attribute differences in outcomes to the change rather than to random noise or confounding factors.

A/B testing is the gold standard for causal inference in product and marketing analytics. Unlike observational analysis — which can reveal correlations but cannot definitively establish causation — a properly designed experiment with random assignment allows analysts to conclude that a change caused an improvement.

Key Terminology

Term	Definition
Control (A)	The existing version; baseline for comparison
Treatment (B)	The new variant being tested
Randomization unit	The entity randomly assigned (user, session, device, account)
Primary metric	The main outcome the experiment is designed to move (conversion rate, revenue per user)
Guardrail metric	A metric that must not regress (e.g., page load time, error rate)
Statistical significance	A result is statistically significant when its p-value falls below the pre-specified significance level α; it is not the probability that the effect is real
p-value	The probability of observing a result at least as extreme if there were truly no difference
Significance level (α)	The threshold p-value for declaring a result significant (commonly 0.05)
Statistical power (1 − β)	The probability of detecting a true effect when one exists (commonly targeted at 80%)
Minimum detectable effect (MDE)	The smallest effect size the test is designed to detect with the target power

The Hypothesis Testing Framework

Every A/B test is a hypothesis test. The null hypothesis (H₀) states that there is no difference between control and treatment. The alternative hypothesis (H₁) states that a difference exists (or that the treatment is better).

The analyst sets a significance level α (usually 0.05) before running the test. If the p-value from the test falls below α, the null hypothesis is rejected and the result is declared statistically significant. This does not prove the treatment is better. It means that a difference at least as large as the one observed would be unlikely if the null hypothesis were true.

Two types of errors:

Error Type	Description	Controlled By
Type I error (false positive)	Declaring a winner when there is no real effect	Significance level α (lower α = fewer false positives)
Type II error (false negative)	Failing to detect a real effect	Power (1 − β); requires sufficient sample size
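One way to make the Type I error rate concrete is to simulate A/A tests, where both groups share the same true conversion rate, and count how often a test is declared significant anyway. The sketch below is illustrative only: the function names, the 10% conversion rate, and the group sizes are assumptions chosen for the example, and the p-value comes from a pooled two-proportion z-test written with the standard library.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_false_positive_rate(true_rate=0.10, n_per_group=500,
                                 n_experiments=1000, alpha=0.05, seed=0):
    """Run A/A tests (no real effect) and return the share declared significant."""
    rng = random.Random(seed)  # seeded so the illustration is repeatable
    false_positives = 0
    for _ in range(n_experiments):
        # Both groups are drawn from the same true conversion rate.
        conv_a = sum(rng.random() < true_rate for _ in range(n_per_group))
        conv_b = sum(rng.random() < true_rate for _ in range(n_per_group))
        if two_proportion_p_value(conv_a, n_per_group, conv_b, n_per_group) < alpha:
            false_positives += 1
    return false_positives / n_experiments


rate = simulate_false_positive_rate()
print(f"Share of A/A tests declared significant: {rate:.3f}")
```

With α = 0.05 the printed share should land near 5%, which is what "controlled by α" means in the table above.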

Sample Size Calculation

Before running a test, calculate the required sample size to achieve adequate statistical power. Underpowered tests frequently produce false negatives — real improvements go undetected. The required sample size depends on four inputs:

Input	Typical Value	Effect on Sample Size
Significance level (α)	0.05	Lower α → larger sample needed
Statistical power (1 − β)	0.80	Higher power → larger sample needed
Baseline conversion rate	Measured from historical data	For a fixed absolute MDE, rates near 50% need the largest samples; for a fixed relative MDE, low baseline rates need larger samples
Minimum detectable effect (MDE)	Defined by business need	Smaller MDE → much larger sample needed

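A common approximation for the per-group sample size in a two-sided two-proportion z-test is n = 2 × (z_{1−α/2} + z_{1−β})² × p(1 − p) / δ², where p is the baseline rate, δ is the MDE expressed as an absolute difference in rates, and the z values are quantiles of the standard normal distribution (about 1.96 and 0.84 for α = 0.05 and 80% power). In practice, use an online calculator or Python's statsmodels: statsmodels.stats.proportion.proportion_effectsize converts the two rates into an effect size (Cohen's h), which statsmodels.stats.power.NormalIndPower().solve_power then turns into a sample size.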
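The approximation above can be evaluated directly with the standard library. This is a minimal sketch, not a replacement for a dedicated power calculator: the function name and the example inputs (a 10% baseline and a 2 percentage point absolute MDE) are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed in EACH group for a two-sided two-proportion z-test.

    baseline_rate: control conversion rate p (e.g. 0.10)
    mde_abs: minimum detectable effect as an absolute difference in rates (e.g. 0.02)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * baseline_rate * (1 - baseline_rate) / mde_abs ** 2
    return math.ceil(n)


n = sample_size_per_group(baseline_rate=0.10, mde_abs=0.02)
print(f"Users needed per group: {n}")

# Halving the MDE roughly quadruples the requirement, because delta is squared.
n_small_mde = sample_size_per_group(baseline_rate=0.10, mde_abs=0.01)
print(f"Users needed per group with half the MDE: {n_small_mde}")
```

The formula uses the baseline variance p(1 − p) for both groups, so tools that use each group's own variance or an arcsine transformation will return slightly different numbers.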

Running the Test

Randomization: Assign users randomly to groups. The most common unit is user ID (persistent across sessions). Session-level randomization is faster to reach sample size but can create inconsistent experiences. Use a deterministic hash of the user ID and experiment name to ensure the same user always gets the same variant across sessions.
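The deterministic hashing approach described above might look like the following sketch. The function name, the key format, and the choice of SHA-256 are assumptions for illustration; production assignment services typically add features such as traffic allocation and exposure logging that are omitted here.

```python
import hashlib


def assign_variant(user_id, experiment_name, variants=("control", "treatment")):
    """Deterministically map a user to a variant for one experiment.

    Including the experiment name in the hashed key means the same user can
    land in different groups across different experiments.
    """
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)  # first 32 bits of the hash
    return variants[bucket]


# The same user and experiment always produce the same assignment.
first = assign_variant("user-12345", "checkout_button_color")
second = assign_variant("user-12345", "checkout_button_color")
print(first, first == second)
```

Because the assignment depends only on the user ID and the experiment name, no lookup table is needed to keep a user's experience consistent across sessions.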

Traffic split: The default 50/50 split maximizes statistical power per total sample. Unequal splits (e.g., 90/10) are used when the risk of degrading user experience with the treatment is high — but they require a larger total sample for the same power.
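To see how much an unequal split costs, note that the variance of the difference between two group means is proportional to 1/n_A + 1/n_B. The sketch below compares that quantity against a 50/50 split, assuming equal per-user variance in both groups; the function name is hypothetical.

```python
def total_sample_inflation(treatment_share):
    """Factor by which total sample size must grow, relative to a 50/50 split,
    to keep the same power when `treatment_share` of traffic gets the treatment.

    With total N and treatment share f, 1/n_A + 1/n_B = 1 / (N * f * (1 - f)),
    which is smallest at f = 0.5. Assumes equal per-user variance in both groups.
    """
    if not 0 < treatment_share < 1:
        raise ValueError("treatment_share must be strictly between 0 and 1")
    return 0.25 / (treatment_share * (1 - treatment_share))


for share in (0.5, 0.3, 0.1):
    factor = total_sample_inflation(share)
    print(f"{share:.0%} treatment: {factor:.2f}x the total sample of a 50/50 split")
```

Under these assumptions a 90/10 split needs roughly 2.8 times the total traffic of a 50/50 split to reach the same power.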

Novelty effect: Users may behave differently simply because something is new, so early results can misstate the lasting effect. Run tests for at least one full week so that both weekday and weekend patterns are covered, and longer for products with lower usage frequency.

Analyzing Results

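The most common statistical test for A/B tests on proportions (conversion rates, click-through rates) is the two-proportion z-test. For continuous metrics (revenue per user, time on page), use a two-sample t-test; Welch's version, which does not assume equal variances, is a safe default. If distributions are heavily skewed, the Mann-Whitney U test is an alternative, but note that it compares the two distributions as a whole rather than their means.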

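In Python with statsmodels and scipy:

from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

# Two-proportion z-test on conversion counts (proportions_ztest lives in statsmodels, not scipy)
z_stat, p_value_conv = proportions_ztest([conversions_B, conversions_A], [n_B, n_A])

# Welch's t-test on per-user revenue (does not assume equal variances)
t_stat, p_value_rev = stats.ttest_ind(revenue_B, revenue_A, equal_var=False)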

Report the observed relative lift, (rate_B - rate_A) / rate_A, together with the confidence interval for the difference, the p-value, and whether the result passes your pre-specified significance threshold.
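The quantities listed above can be computed together for a conversion-rate test. The following is a standard-library sketch under stated assumptions: the function name and the example counts (200 of 2,000 control users and 250 of 2,000 treatment users converting) are made up for illustration, the p-value uses the pooled standard error, and the confidence interval uses the unpooled (Wald) standard error.

```python
import math
from statistics import NormalDist


def analyze_conversion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Summarize a two-group conversion test: lift, CI for the difference, p-value."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a

    # Two-sided z-test using the pooled rate under the null hypothesis.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Wald confidence interval for the absolute difference (unpooled SE).
    se_unpooled = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    return {
        "rate_a": rate_a,
        "rate_b": rate_b,
        "relative_lift": diff / rate_a,
        "ci_difference": ci,
        "p_value": p_value,
        "significant": p_value < alpha,
    }


# Hypothetical counts, for illustration only.
result = analyze_conversion_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
for name, value in result.items():
    print(name, value)
```

Reporting the interval alongside the p-value shows how large or small the true difference could plausibly be, which a bare "significant" label does not convey.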

Common Pitfalls

Pitfall	Description	Fix
Peeking (optional stopping)	Stopping the test early when results look significant — inflates Type I error	Pre-specify end date and sample size; use sequential testing if early stopping is needed
Multiple comparisons	Testing many metrics or variants increases false positive rate	Apply Bonferroni correction; pre-specify primary metric
Sample ratio mismatch	Actual traffic split differs significantly from intended split — indicates a bug	Always check observed vs. expected traffic per variant before analyzing
Network effects / interference	Treatment users affect control users (e.g., social features, marketplaces)	Use cluster-level randomization (by geography, friend group)
Novelty or primacy effects	Short-term behavior change due to newness — not a lasting effect	Run tests longer; analyze new vs. returning user segments separately
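The sample ratio mismatch check in the table can be automated with a chi-square goodness-of-fit test on the observed group sizes. This is a minimal two-group sketch using only the standard library; the function name, the example counts, and the 0.001 alert threshold are assumptions for illustration rather than fixed rules.

```python
import math


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """P-value of a chi-square goodness-of-fit test (1 degree of freedom)
    comparing observed group sizes against the intended traffic split."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts for an intended 50/50 split.
p_srm = srm_p_value(n_control=50_000, n_treatment=51_500)
print(f"SRM p-value: {p_srm:.2e}")
if p_srm < 0.001:  # assumed alert threshold
    print("Possible sample ratio mismatch: investigate before analyzing results")
```

A very small p-value here says nothing about the treatment effect; it signals that assignment or logging may be broken, so the experiment's results should not be trusted until the cause is found.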

Summary

A/B testing provides the rigorous causal evidence that separates correlation from causation in product and marketing analytics. The discipline requires careful experimental design — specifying hypotheses, calculating sample sizes, pre-registering metrics, and respecting the test duration — before any data is collected. Analyzing results correctly means reporting effect sizes, confidence intervals, and p-values against pre-specified thresholds, not data-mining for significance after the fact. When done properly, A/B testing is the most reliable tool available for making decisions that improve products, user experience, and revenue in a measured, evidence-based way.