A/B Testing and Experimentation for Data Analysts

Why Experimentation Is Central to Data-Driven Decision Making

A/B testing — formally called a randomised controlled experiment — is the gold standard for establishing causal relationships between a change and a business outcome. Unlike observational analysis, where confounding variables can distort conclusions, a well-run experiment randomly assigns users to a control group (A) and a treatment group (B), ensuring that the only systematic difference between groups is the change being tested. This allows analysts to attribute differences in outcomes directly to the intervention, not to pre-existing differences in user behaviour or demographics. Organisations that build strong experimentation cultures — Amazon, Booking.com, Netflix, Airbnb — run thousands of simultaneous tests and make product decisions based on measured causal effects rather than intuition or correlation.

Core Statistical Concepts in A/B Testing

Concept	Definition	Practical Meaning
Null hypothesis (H₀)	The assumption that the treatment has no effect; the two groups have equal means	Your starting position: "The new button colour makes no difference to conversion rate"
Alternative hypothesis (H₁)	The claim that the treatment does have an effect	"The new button colour changes conversion rate" (two-sided) or "increases it" (one-sided)
p-value	Probability of observing a result at least as extreme as yours if H₀ were true	Low p-value (e.g. < 0.05) = the data are unlikely under the null; evidence to reject H₀
Significance level (α)	Pre-defined threshold below which you reject H₀; typically 0.05	Accepting a 5% chance of a false positive (Type I error) when the treatment truly has no effect
Statistical power (1 − β)	Probability of detecting a true effect if one exists; typically set to 0.8 or 0.9	80–90% chance of correctly identifying a real improvement; determines required sample size
Minimum Detectable Effect (MDE)	Smallest relative or absolute change the test is designed to detect	Detecting a 1% lift requires far more traffic than detecting a 10% lift; drives sample size calculation
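Confidence interval	Range of effect sizes consistent with the observed data at a stated confidence level (e.g. 95% CI); across repeated experiments, 95% of such intervals would contain the true effect	More informative than a binary reject/fail-to-reject decision; shows the magnitude and uncertainty of the effect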
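
How α, power, and the MDE combine into a required sample size can be sketched with a standard normal-approximation formula for comparing two proportions. This is one common approximation rather than the only valid method; the function name sample_size_per_group and the example inputs (a 10% baseline and an absolute MDE of 2 or 1 percentage points) are illustrative assumptions, not values from a real experiment.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per group for a two-sided two-proportion test.

    baseline: control conversion rate, e.g. 0.10
    mde_abs:  absolute minimum detectable effect, e.g. 0.02 for +2 points
    """
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for power = 1 - beta
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / mde_abs ** 2)


# Hypothetical inputs: 10% baseline, detecting +2 points vs. +1 point.
n_large_effect = sample_size_per_group(0.10, 0.02)
n_small_effect = sample_size_per_group(0.10, 0.01)
print(n_large_effect, n_small_effect)
```

Because the effect size enters the formula squared, halving the MDE roughly quadruples the traffic each group needs.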

Types of Errors in Hypothesis Testing

Error Type	What Happens	Probability Denoted By	Controlled By
Type I error (false positive)	You conclude the treatment works when it actually has no effect	α (significance level)	Setting α = 0.05 means you accept a 5% false positive rate
Type II error (false negative)	You conclude the treatment has no effect when it actually does	β	Setting power = 0.8 means β = 0.2; increasing sample size reduces β
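
What a 5% false positive rate means in practice can be illustrated by simulating A/A tests, where both groups share the same true conversion rate and every "significant" result is therefore a Type I error. The sketch below also counts how often a test would be declared significant if it were checked at several interim looks and stopped at the first p < α. All names and parameters (simulate_aa, 10 looks, a 10% conversion rate, the seed) are illustrative assumptions.

```python
import random
from statistics import NormalDist


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:  # no conversions (or all conversions) in both groups
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa(n_sims=1000, n_per_group=1000, looks=10,
                rate=0.10, alpha=0.05, seed=42):
    """Return (false positive rate at the planned end, rate with peeking)."""
    rng = random.Random(seed)
    step = n_per_group // looks
    fixed_hits = peeking_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        significant_at_any_look = False
        p = 1.0
        for look in range(1, looks + 1):
            # Both groups draw from the SAME true rate: H0 holds by design.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            p = z_test_p_value(conv_a, look * step, conv_b, look * step)
            if p < alpha:
                significant_at_any_look = True
        fixed_hits += int(p < alpha)              # only the final look counts
        peeking_hits += int(significant_at_any_look)
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate_aa()
print(f"fixed horizon: {fixed_rate:.3f}, with peeking: {peeking_rate:.3f}")
```

The fixed-horizon rate should land near α, while the peeking rate is necessarily at least as high, since any test that is significant at the final look is also counted under peeking.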

The A/B Testing Workflow

Step	What to Do	Common Mistakes
1. Define the hypothesis	State a specific, testable claim: "Adding social proof below the CTA button will increase checkout conversion by at least 2%"	Vague hypotheses; testing multiple changes at once in a simple A/B (use multivariate instead)
2. Choose the primary metric	Identify one primary metric (the measure of success) and secondary/guardrail metrics to monitor for unintended harm	Optimising a proxy metric that doesn't reflect business value; ignoring guardrail metrics
3. Calculate sample size	Use a power calculator with: baseline rate, MDE, α (0.05), power (0.8); determine how long to run the test	Running until you see significance (peeking); stopping too early due to impatience
4. Randomise and split traffic	Randomly assign users (not sessions) to groups; ensure consistent assignment for the same user across visits	Assigning by session (one user can be in both groups); sampling bias in who enters the experiment
5. Run the experiment	Let it run for the full planned duration; avoid making changes mid-experiment	Peeking at results and stopping early when significance is reached (inflates Type I error rate)
6. Analyse results	Run the appropriate test (two-proportion z-test for conversion rates, t-test for means); calculate effect size and CI	Using the wrong statistical test; ignoring practical significance in favour of statistical significance
7. Decide and document	Ship if significant and practically meaningful; document the test, result, and decision regardless of outcome	Only sharing wins; not documenting null results (which carry valuable information about what doesn't work)
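
Step 4 asks for user-level assignment that stays consistent across visits. One common way to get this without storing any state is to hash the user ID together with an experiment name. The sketch below assumes string user IDs; the function assign_variant and the experiment name "checkout_social_proof" are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing the experiment name together with the user ID keeps a user's
    assignment stable across visits and independent between experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# The same user always gets the same answer for a given experiment.
first_visit = assign_variant("user-123", "checkout_social_proof")
second_visit = assign_variant("user-123", "checkout_social_proof")

# Across many users the split should be close to the requested share.
users = [f"user-{i}" for i in range(10_000)]
share = sum(assign_variant(u, "checkout_social_proof") == "treatment"
            for u in users) / len(users)
print(first_visit, second_visit, round(share, 3))
```

Keying on the user rather than the session avoids the mistake listed for step 4, where one person can land in both groups.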

Choosing the Right Statistical Test

Metric Type	Recommended Test	Python Implementation
Binary outcome (conversion rate, click-through rate)	Two-proportion z-test	from statsmodels.stats.proportion import proportions_ztest
Continuous outcome (revenue per user, session duration)	Welch's t-test (does not assume equal variances)	from scipy.stats import ttest_ind; ttest_ind(a, b, equal_var=False)
Non-normally distributed continuous metric	Mann-Whitney U test (non-parametric; compares the two distributions rather than their means)	from scipy.stats import mannwhitneyu
More than two variants	ANOVA followed by post-hoc tests (Tukey HSD); or Bonferroni correction for pairwise comparisons	from scipy.stats import f_oneway; statsmodels pairwise_tukeyhsd
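
To make the first row of the table concrete, here is a standard-library sketch of the calculation behind a two-proportion z-test, together with a confidence interval for the difference in conversion rates. It is a hand-rolled illustration, not the statsmodels implementation; the function name two_proportion_ztest and the counts (1,000 vs. 1,100 conversions out of 10,000 users each) are invented for the example.

```python
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Return (z, two-sided p-value, CI for p_b - p_a)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Test statistic: pooled standard error, valid under H0 (equal rates).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Interval: unpooled standard error, since H0 is not assumed here.
    se_unpooled = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_unpooled
    return z, p_value, (diff - margin, diff + margin)


# Hypothetical counts: control 1,000/10,000, treatment 1,100/10,000.
z, p_value, (ci_low, ci_high) = two_proportion_ztest(1000, 10_000, 1100, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}, 95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```

Reporting the interval alongside the p-value shows how large the lift plausibly is, which is what the practical-significance judgement depends on.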

Common Pitfalls and How to Avoid Them

Pitfall	Why It's Dangerous	How to Avoid
Peeking (optional stopping)	Checking results repeatedly and stopping when p < 0.05 inflates the false positive rate far above 5%	Pre-register the sample size and end date; use sequential testing methods (e.g. always-valid inference) if early stopping is needed
Multiple comparisons	Testing many metrics or variants simultaneously inflates the chance of at least one false positive	Apply Bonferroni correction (divide α by number of tests) or use False Discovery Rate methods (Benjamini-Hochberg)
Network effects / interference	In social or marketplace products, users in control and treatment interact, violating the independence assumption	Use cluster randomisation (randomise by geography or social cluster rather than individual user, so that users who interact share the same variant)
Novelty effect	Users respond to any change initially due to novelty, not because the change is truly better	Run the test long enough to allow novelty to wear off; analyse behaviour of new vs. returning users separately
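Simpson's paradox	Overall results can reverse direction when you segment by a confounding variable (e.g. mobile vs. desktop); in experiments this typically arises when the control/treatment split differs across segments or over time	Keep the allocation ratio constant across segments and throughout the test; segment results by key dimensions and check for interaction effects between the treatment and user segments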
Survivorship bias in sample selection	Only including users who complete a funnel step biases estimates of the treatment effect	Define the experiment unit at entry to the funnel, not at a downstream step
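
The two multiple-comparison corrections named in the table can be sketched in a few lines. The p-values below are made up for illustration, and the function names are hypothetical; in practice a library routine would normally be preferred over a hand-written version.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where p <= alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            max_rank = rank  # largest rank whose p-value clears its threshold
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for eight metrics from one experiment.
p_values = [0.041, 0.001, 0.205, 0.008, 0.039, 0.074, 0.042, 0.06]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print("uncorrected:", sum(p < 0.05 for p in p_values))
print("Bonferroni: ", sum(bonf))
print("BH:         ", sum(bh))
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg controls the expected share of false positives among the rejected hypotheses, so it usually retains more discoveries.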

Beyond Simple A/B: Advanced Experimentation Methods

Method	When to Use	Key Advantage
Multivariate testing (MVT)	Testing multiple elements simultaneously (headline + image + CTA colour)	Reveals interaction effects between variables; more efficient than sequential A/B tests
Bandit algorithms (e.g. Thompson Sampling)	When you want to minimise regret during the experiment by routing more traffic to the winning variant	Reduces the cost of running losing variants; useful when traffic is scarce
Holdout testing	Measuring the long-term cumulative impact of a feature by keeping a small holdout group that never receives it	Captures long-term effects missed by short-term A/B tests
Difference-in-differences	When randomisation is not possible (e.g. a feature rolls out to an entire country)	Uses pre/post trends in treated vs. untreated groups as a quasi-experiment
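
As an illustration of the bandit row above, the following is a minimal sketch of Thompson Sampling for binary outcomes with Beta(1, 1) priors. The true conversion rates, round count, and seed are simulation assumptions chosen for the example; a production system would also need to handle delayed feedback and non-stationary rates, which this sketch ignores.

```python
import random


def thompson_sampling(true_rates, n_rounds=3000, seed=0):
    """Simulate Beta-Bernoulli Thompson Sampling; return pulls per variant."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    for _ in range(n_rounds):
        # Draw a plausible conversion rate for each variant from its posterior.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n_arms)]
        arm = samples.index(max(samples))  # serve the variant with the best draw
        # Simulated user response (in production this is the observed outcome).
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical true rates: variant 0 converts at 5%, variant 1 at 15%.
pulls = thompson_sampling([0.05, 0.15])
print(pulls)
```

As evidence accumulates, the posterior for the weaker variant rarely produces the highest draw, so most traffic shifts to the stronger one. The trade-off is that the weaker variant receives fewer observations, so its effect is estimated with less precision than in a fixed 50/50 split.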

Summary

A/B testing is the most rigorous tool available to data analysts for moving from correlation to causation in product and business decisions. Getting it right requires upfront statistical planning (sample size, MDE, α, power), disciplined execution (no peeking, proper randomisation), and honest analysis (confidence intervals, effect size, guardrail metrics). The common pitfalls — peeking, multiple comparisons, network effects — are well understood and avoidable. Analysts who master experimentation become invaluable partners to product and engineering teams, because they can reliably distinguish changes that genuinely improve outcomes from those that merely appear to in noisy data.