What Is A/B Testing?
A/B testing, also called split testing or controlled experimentation, is a method for comparing two versions of something to determine which one performs better on a measurable outcome. One group of users (the control group) experiences version A — typically the existing baseline — while another group (the treatment group) experiences version B, the proposed change. By measuring outcomes for both groups simultaneously and applying statistical tests, analysts can determine whether the observed difference in performance is genuine or merely the result of random variation.
A/B testing is used across product development, marketing, pricing, and UX design. Common applications include testing a new checkout flow against the existing one, comparing two subject lines in an email campaign, evaluating whether a price change affects conversion rates, or measuring the impact of a new recommendation algorithm on user engagement. The common thread is a binary comparison on a clearly defined metric.
The Hypothesis Framework
Every valid A/B test starts with a hypothesis — a specific, falsifiable prediction about the effect of a change. The hypothesis must be formed before the test begins. Forming it afterward (based on what the data happens to show) is a form of p-hacking that invalidates the statistical interpretation.
A good A/B test hypothesis follows the structure: "If we change X, then Y will increase/decrease by approximately Z, because [mechanism]." For example: "If we change the CTA button color from grey to orange, then the click-through rate will increase by at least 5%, because the higher contrast will draw more attention." The mechanism clause matters — it forces you to think about why the change should work, which helps prioritize which tests are worth running.
The two statistical hypotheses you are testing are the null hypothesis (H0: there is no difference between A and B) and the alternative hypothesis (H1: there is a difference). The goal of the test is to collect enough evidence to reject the null hypothesis at a chosen significance level.
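For a conversion-rate experiment, the same pair of hypotheses can be written in notation (here p_A and p_B denote the true conversion rates of control and treatment); this is the two-sided form that the code examples later in this section assume:

```latex
H_0:\; p_B - p_A = 0 \qquad \text{vs.} \qquad H_1:\; p_B - p_A \neq 0
```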
Key Statistical Concepts
| Concept | Definition | Typical Value |
|---|---|---|
| Significance level (α) | Probability of falsely rejecting H0 (Type I error) | 0.05 (5%) |
| Statistical power (1 − β) | Probability of correctly detecting a true effect | 0.80 (80%) |
| p-value | Probability of observing results at least as extreme as ours if H0 is true | Should be below α to reject H0 |
| Minimum Detectable Effect (MDE) | Smallest effect size the test is designed to detect | Depends on business context |
| Sample size | Number of observations needed per variant | Calculated from α, power, and MDE |
| Confidence interval | Range likely to contain the true effect | 95% CI for α = 0.05 |
Sample Size Calculation
One of the most common mistakes in A/B testing is running a test without calculating the required sample size in advance. An underpowered test is unlikely to detect real effects, so genuine improvements get missed (false negatives); and when an underpowered test does cross the significance threshold, the declared "winner" is often just noise with an exaggerated effect estimate.
Sample size depends on three inputs: the baseline conversion rate, the minimum detectable effect (the smallest improvement that would be practically meaningful), and the desired power and significance levels. In Python, you can calculate sample size using the statsmodels library:
```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # current conversion rate: 10%
mde = 0.02             # minimum detectable effect: +2pp lift
alpha = 0.05           # significance level
power = 0.80           # desired statistical power

effect_size = proportion_effectsize(baseline_rate + mde, baseline_rate)
analysis = NormalIndPower()
n = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power,
                         alternative='two-sided')

print(f"Required sample size per variant: {int(n) + 1}")
# Required sample size per variant: 3843
```
This means you need approximately 3,843 users in each group (7,686 total) before you can reliably detect a 2 percentage point lift in conversion rate. Only start the test once you have a credible plan for reaching this sample size within a reasonable time window.
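A quick way to sanity-check that time window is to divide the required sample size by expected traffic. The sketch below assumes, purely for illustration, 1,000 eligible users per day split evenly between the two variants; substitute your own traffic figures.

```python
import math

# Hypothetical traffic figures -- substitute your own
required_per_variant = 3843    # from the power calculation above
daily_eligible_users = 1000    # assumed users entering the experiment per day
traffic_split = 0.5            # share of eligible users routed to each variant

users_per_variant_per_day = daily_eligible_users * traffic_split
days_needed = math.ceil(required_per_variant / users_per_variant_per_day)
print(f"Estimated test duration: {days_needed} days")
# Estimated test duration: 8 days
```

In practice, many teams round the duration up to whole weeks so that weekday and weekend behaviour are both represented.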
Running the Test
Once the hypothesis is formed and sample size calculated, the test can be designed and launched. The key operational requirements are random assignment, isolation, and consistent exposure.
Random assignment means each user is assigned to control or treatment randomly, with equal probability and without any systematic bias. Common methods include hashing the user ID modulo 2, or using a dedicated experimentation platform (like Optimizely, LaunchDarkly, or an in-house system) that handles randomization automatically.
Isolation means users in the control group never see the treatment and vice versa. Leakage between groups contaminates results. This is especially important in social or network products where one user's experience can influence another's (a phenomenon called network interference).
Consistent exposure means a user assigned to a variant always sees that variant throughout the test — no switching. Variant switching introduces noise and violates the independence assumption of the statistical test.
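A minimal sketch of the hash-based assignment mentioned above (the salt string and user ID are illustrative): because the hash of a given user ID never changes, the same user always lands in the same group, which also gives you consistent exposure for free.

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str = "checkout-test-v1") -> str:
    """Deterministically assign a user to 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 2   # hash the ID, take it modulo 2
    return "control" if bucket == 0 else "treatment"

# The same user always gets the same answer on every call
print(assign_variant("user-12345"))
print(assign_variant("user-12345"))  # identical to the call above
```

Salting the hash with an experiment-specific string keeps bucketing independent across experiments, so the same users are not always grouped together.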
Analyzing Results
After collecting the required number of observations, you can perform the statistical test. For conversion rate metrics (proportions), a two-proportion z-test is appropriate:
```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Control:   3843 users, 384 conversions (10.0%)
# Treatment: 3843 users, 461 conversions (12.0%)
conversions = np.array([384, 461])
nobs = np.array([3843, 3843])

stat, p_value = proportions_ztest(conversions, nobs, alternative='two-sided')
print(f"Z-statistic: {stat:.3f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Result: Statistically significant - reject H0")
else:
    print("Result: Not significant - fail to reject H0")
```
A low p-value (below your α threshold) means the observed difference is unlikely to be due to chance. However, statistical significance alone is not sufficient — you must also evaluate practical significance. A test with 10 million users might detect a 0.01% lift as statistically significant, but a 0.01% conversion improvement may have no meaningful business impact. Always report effect sizes and confidence intervals alongside p-values.
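Continuing the same illustrative counts, a minimal sketch of reporting the effect size with a normal-approximation (Wald) 95% confidence interval alongside the p-value:

```python
import numpy as np

# Same illustrative counts as the z-test above
control_conv, control_n = 384, 3843
treat_conv, treat_n = 461, 3843

p_control = control_conv / control_n
p_treat = treat_conv / treat_n
diff = p_treat - p_control                     # absolute lift in conversion rate

# Normal-approximation (Wald) 95% confidence interval for the difference
se = np.sqrt(p_control * (1 - p_control) / control_n
             + p_treat * (1 - p_treat) / treat_n)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"Absolute lift: {diff:.4f} ({diff / p_control:.1%} relative)")
print(f"95% CI for the lift: [{ci_low:.4f}, {ci_high:.4f}]")
```

If the whole interval sits above the smallest lift that would justify shipping the change, the result is practically as well as statistically significant.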
Common Pitfalls
| Pitfall | Description | How to Avoid |
|---|---|---|
| Peeking | Stopping early when results look significant | Pre-commit to the sample size; use sequential testing if needed |
| Multiple testing | Testing many metrics inflates the false positive rate | Pre-define one primary metric; apply a Bonferroni correction for secondary metrics (sketched below) |
| Novelty effect | Users engage more with new things regardless of quality | Run the test long enough (typically 1–2 weeks) for the novelty to wear off |
| Sample ratio mismatch | Groups are not the expected 50/50 split | Audit randomization logic; check for assignment bugs |
| Survivorship bias | Excluding users who dropped out early distorts results | Use intention-to-treat analysis; include all assigned users |
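For the multiple-testing row, a minimal sketch of a Bonferroni correction using statsmodels; the three secondary-metric p-values here are made up for illustration.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values for three secondary metrics
secondary_p_values = [0.012, 0.034, 0.210]

# Bonferroni multiplies each p-value by the number of comparisons
# (equivalently, each metric is tested at alpha / 3)
reject, adjusted_p, _, _ = multipletests(secondary_p_values, alpha=0.05,
                                         method='bonferroni')
print(adjusted_p)  # 0.036, 0.102, 0.63 -> only the first survives the correction
print(reject)
```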
Beyond Binary: Multivariate and Sequential Testing
Standard A/B testing compares two variants. When you want to test multiple changes simultaneously, multivariate testing allows you to test combinations of changes (e.g., headline A vs B, image X vs Y) in a single experiment. This is more efficient than running sequential A/B tests but requires a much larger sample size, as the number of cells grows multiplicatively.
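To make that cost concrete, a rough back-of-the-envelope comparison, assuming for simplicity that each cell needs the same sample size as a variant in the earlier calculation (detecting interaction effects would require more still):

```python
n_per_cell = 3843                 # per-variant figure from the earlier power calculation
ab_total = 2 * n_per_cell         # simple A/B: control + treatment
mvt_total = (2 * 2) * n_per_cell  # 2 headlines x 2 images = 4 cells
print(f"A/B total: {ab_total:,}, multivariate total: {mvt_total:,}")
# A/B total: 7,686, multivariate total: 15,372
```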
Sequential testing (also called adaptive testing) allows you to monitor results continuously and stop early when sufficient evidence has accumulated, without inflating the false positive rate. Methods like the Sequential Probability Ratio Test (SPRT) and the always-valid inference framework used by tools like Optimizely and VWO allow analysts to "peek" at results without violating statistical guarantees, addressing one of the most common practical challenges in A/B testing.
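For illustration, a minimal sketch of the textbook Bernoulli SPRT for a conversion metric, testing the baseline rate against the lifted rate the experiment was powered for. The observation stream is simulated here, and commercial platforms use more elaborate always-valid procedures, but the stopping logic is the same idea.

```python
import numpy as np

p0, p1 = 0.10, 0.12          # H0: baseline conversion rate, H1: hoped-for lifted rate
alpha, beta = 0.05, 0.20     # target Type I and Type II error rates

# Wald's stopping boundaries on the cumulative log-likelihood ratio
upper = np.log((1 - beta) / alpha)   # crossing this accepts H1 (lift detected)
lower = np.log(beta / (1 - alpha))   # crossing this accepts H0 (no lift)

rng = np.random.default_rng(42)
llr = 0.0
for i, converted in enumerate(rng.random(20_000) < p1, start=1):  # simulated users
    llr += np.log(p1 / p0) if converted else np.log((1 - p1) / (1 - p0))
    if llr >= upper or llr <= lower:
        decision = "lift detected" if llr >= upper else "no lift"
        print(f"Stopped after {i} users: {decision}")
        break
else:
    print("No decision within the simulated sample")
```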
Bayesian A/B testing is an alternative framework that some teams prefer because it answers a more intuitive question: given the data, what is the probability that B is better than A? Rather than rejecting or failing to reject a null hypothesis, Bayesian methods produce posterior distributions over effect sizes that are easier to communicate to non-technical stakeholders.
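As a sketch of that framing, the comparison from the analysis section can be rerun with uniform Beta(1, 1) priors on each conversion rate; sampling from the two Beta posteriors gives a direct Monte Carlo estimate of the probability that B beats A (the counts are the same illustrative ones used earlier).

```python
import numpy as np

rng = np.random.default_rng(0)

# Same illustrative counts as the z-test example; Beta(1, 1) priors on each rate
a_conversions, a_users = 384, 3843
b_conversions, b_users = 461, 3843

posterior_a = rng.beta(1 + a_conversions, 1 + a_users - a_conversions, size=200_000)
posterior_b = rng.beta(1 + b_conversions, 1 + b_users - b_conversions, size=200_000)

print(f"P(B better than A): {(posterior_b > posterior_a).mean():.3f}")
print(f"Expected absolute lift: {(posterior_b - posterior_a).mean():.4f}")
```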