What Is A/B Testing?
A/B testing (also called split testing) is a controlled experiment in which two or more variants of a feature, design, or message are shown to randomly assigned user groups so that their effect on a target metric can be measured with statistical rigor. The core idea is to isolate the impact of a single change by holding everything else constant. Unlike before/after comparisons, a properly randomized A/B test controls for external confounds such as seasonality, marketing campaigns, or natural user behaviour shifts.
Data analysts are responsible for designing experiments correctly, monitoring them during execution, and drawing valid conclusions from the results. Mistakes are common at each of these three stages.
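Random assignment is often implemented by hashing a stable user identifier, so that the same user always lands in the same group. The sketch below shows one minimal way to do this and does not describe any particular platform. The helper name `assign_variant`, the salt format, and the 50/50 split are assumptions made for illustration.

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment' (hypothetical helper)."""
    # Salting with the experiment ID keeps assignments uncorrelated across experiments.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Read the first 8 hex digits as an integer in [0, 2**32) and scale to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"

# Synthetic user IDs, used only to show that the split comes out roughly even.
variants = [assign_variant(f"user_{i}", "checkout_button_color") for i in range(10_000)]
print(variants.count("treatment") / len(variants))  # close to 0.5
```

Because the assignment depends only on the user and experiment IDs, it can be recomputed at analysis time to audit what the logging pipeline recorded.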
Key Concepts and Terminology
| Term | Definition | Practical Note |
|---|---|---|
| Control (A) | The existing version shown to the baseline group | Must be the current production experience, not a stripped-down version |
| Treatment (B) | The new variant shown to the test group | Change only one thing at a time to attribute effects clearly |
| Null Hypothesis (H₀) | There is no difference between A and B | The test tries to reject this; failing to reject ≠ proving A = B |
| p-value | Probability of observing results at least this extreme if H₀ is true | p < 0.05 is a common threshold but should be set before the test |
| Statistical Significance | The observed difference would be unlikely if H₀ were true, judged against a pre-set significance level α | Significance alone does not mean the effect is large or meaningful |
| Practical Significance | The effect size is large enough to matter to the business | A 0.01% lift in conversion may be significant but not worth shipping |
| Power (1 − β) | Probability of detecting a true effect when one exists | Aim for 80% or higher; low power leads to missed improvements |
| Sample Size | Number of users needed per variant to achieve desired power | Calculate before running; stopping early invalidates the test |
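The gap between statistical and practical significance is easiest to see with numbers. The sketch below uses hypothetical counts (100 million users per variant, 10.00% vs 10.01% conversion) and a pooled two-proportion z-test written out by hand. The function name `two_proportion_p_value` is illustrative.

```python
import math
from scipy import stats

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * stats.norm.sf(abs(z))

# Hypothetical counts: 10.00% vs 10.01% conversion, 100 million users per variant.
n = 100_000_000
p_value = two_proportion_p_value(10_000_000, n, 10_010_000, n)
absolute_lift = 10_010_000 / n - 10_000_000 / n

print(f"p-value: {p_value:.4f}")              # below 0.05
print(f"absolute lift: {absolute_lift:.4%}")  # one hundredth of a percentage point
```

With a sample this large, even a lift of 0.01 percentage points clears the 0.05 threshold, which is why the effect size needs to be judged separately from the p-value.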
Sample Size Calculation
Running a test without computing the required sample size is one of the most common mistakes in experimentation. Too few users means the test lacks power to detect real effects; too many wastes time and exposes users to an inferior experience longer than necessary.
from scipy import stats
import math

def required_sample_size(baseline_rate, min_detectable_effect, alpha=0.05, power=0.80):
    """
    Calculate required sample size per variant for a two-sided, two-proportion z-test.

    baseline_rate: current conversion rate (e.g., 0.10 for 10%)
    min_detectable_effect: smallest relative lift worth detecting (e.g., 0.05 for 5%)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_effect)
    # Z-scores for alpha (two-tailed) and power
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    # Average rate, used for the variance under the null hypothesis
    pooled = (p1 + p2) / 2
    n = (z_alpha * math.sqrt(2 * pooled * (1 - pooled)) +
         z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p2 - p1) ** 2
    return math.ceil(n)

n = required_sample_size(baseline_rate=0.10, min_detectable_effect=0.10)
print(f"Users needed per variant: {n}")  # ~14,750

In this example, detecting a 10% relative lift (from 10% to 11% conversion) at 80% power requires roughly 14,750 users per variant, a number that surprises many teams running tests on low-traffic pages.
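A per-variant sample size becomes easier to plan around once it is translated into calendar time. The sketch below is a rough planning aid under stated assumptions: traffic is steady, every eligible user enters the experiment, and the duration is rounded up to whole weeks so that each weekday is covered equally. The function name and the example figures (15,000 users per variant, 2,000 eligible users per day) are hypothetical round numbers chosen for illustration.

```python
import math

def estimated_duration_days(n_per_variant, daily_eligible_users, n_variants=2):
    """Days needed to reach the planned sample, rounded up to whole weeks."""
    if daily_eligible_users <= 0:
        raise ValueError("daily_eligible_users must be positive")
    total_needed = n_per_variant * n_variants
    days = math.ceil(total_needed / daily_eligible_users)
    # Round up to full weeks so weekday and weekend behaviour are both represented.
    return math.ceil(days / 7) * 7

# Hypothetical page: 15,000 users needed per variant, 2,000 eligible users per day.
print(estimated_duration_days(15_000, 2_000))  # 21
```

If the resulting duration is impractical, the usual levers are a larger minimum detectable effect or a higher-traffic page, not a shorter run.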
Running the Statistical Test
Once the experiment has collected its planned sample, use a two-proportion z-test (for binary outcomes like conversion) or a t-test (for continuous outcomes like revenue per user). The mechanics in Python, using statsmodels for the z-test:
import numpy as np
# proportions_ztest lives in statsmodels, not in scipy.stats
from statsmodels.stats.proportion import proportions_ztest

# Conversion data
control_visitors = 12500
control_converts = 1175    # 9.4% conversion
treatment_visitors = 12500
treatment_converts = 1312  # 10.5% conversion

# Two-proportion z-test (two-sided by default)
count = np.array([treatment_converts, control_converts])
nobs = np.array([treatment_visitors, control_visitors])
z_stat, p_value = proportions_ztest(count, nobs)
control_rate = control_converts / control_visitors
treatment_rate = treatment_converts / treatment_visitors
relative_lift = (treatment_rate - control_rate) / control_rate * 100
print(f"Control rate: {control_rate:.2%}")
print(f"Treatment rate: {treatment_rate:.2%}")
print(f"Relative lift: {relative_lift:.1f}%")
print(f"z-statistic: {z_stat:.3f}")
print(f"p-value: {p_value:.4f}")
print("Significant!" if p_value < 0.05 else "Not significant.")
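A p-value on its own says nothing about how large the effect is, so it helps to report an interval for the lift as well. The sketch below computes a Wald (normal-approximation) confidence interval for the absolute difference in conversion rates, reusing the counts from the example above. The function name `diff_confidence_interval` is illustrative, and the Wald interval is only one of several reasonable choices.

```python
import math
from scipy import stats

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for p_b - p_a, using the unpooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Control: 1175 / 12500, treatment: 1312 / 12500 (same counts as the z-test example).
low, high = diff_confidence_interval(1175, 12500, 1312, 12500)
print(f"Absolute lift, 95% CI: [{low:.2%}, {high:.2%}]")
```

An interval that excludes zero agrees with a significant z-test, and its width shows stakeholders how uncertain the size of the lift still is.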
Querying Experiment Results in SQL
Most experimentation platforms store assignment and event data in a data warehouse. Analysts query it directly to compute per-variant metrics before passing them to statistical tests.
-- Step 1: Get users assigned to each variant
WITH assignments AS (
    SELECT user_id, variant, assigned_at  -- variant is 'control' or 'treatment'
    FROM experiment_assignments
    WHERE experiment_id = 'checkout_button_color'
      AND assigned_at BETWEEN '2024-03-01' AND '2024-03-14'
),
-- Step 2: Join to conversion events
conversions AS (
    SELECT
        a.variant,
        COUNT(DISTINCT a.user_id) AS visitors,
        COUNT(DISTINCT e.user_id) AS converters,
        COUNT(DISTINCT e.user_id) * 1.0
            / COUNT(DISTINCT a.user_id) AS conv_rate,
        AVG(e.order_value) AS avg_order_value  -- averaged over purchase events, not users
    FROM assignments a
    LEFT JOIN events e
        ON a.user_id = e.user_id
       AND e.event_type = 'purchase'
       AND e.event_time >= a.assigned_at  -- ignore purchases made before assignment
       -- If event_time is a timestamp, this upper bound stops at midnight on the 14th
       AND e.event_time BETWEEN '2024-03-01' AND '2024-03-14'
    GROUP BY a.variant
)
SELECT * FROM conversions;
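The join logic in a query like this is worth checking on a handful of rows before trusting it on warehouse data. The sketch below runs a simplified version in an in-memory SQLite database from Python. The date filters are dropped for brevity, and the four users and their events are synthetic rows invented purely to exercise the `LEFT JOIN` and `COUNT(DISTINCT ...)` behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiment_assignments (user_id INTEGER, experiment_id TEXT, variant TEXT);
CREATE TABLE events (user_id INTEGER, event_type TEXT, order_value REAL);
""")

# Synthetic rows: two users per variant.
conn.executemany(
    "INSERT INTO experiment_assignments VALUES (?, 'checkout_button_color', ?)",
    [(1, "control"), (2, "control"), (3, "treatment"), (4, "treatment")],
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "purchase", 20.0),
        (3, "purchase", 30.0),
        (3, "purchase", 10.0),   # repeat purchase by the same user
        (4, "page_view", None),  # not a conversion
    ],
)

rows = conn.execute("""
SELECT a.variant,
       COUNT(DISTINCT a.user_id) AS visitors,
       COUNT(DISTINCT e.user_id) AS converters
FROM experiment_assignments a
LEFT JOIN events e
  ON a.user_id = e.user_id AND e.event_type = 'purchase'
WHERE a.experiment_id = 'checkout_button_color'
GROUP BY a.variant
ORDER BY a.variant
""").fetchall()

print(rows)  # [('control', 2, 1), ('treatment', 2, 1)]
```

The `LEFT JOIN` keeps users who never purchased in the visitor count, and `COUNT(DISTINCT e.user_id)` counts user 3 once despite two purchases.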
Common Pitfalls and How to Avoid Them
| Pitfall | What Goes Wrong | Prevention |
|---|---|---|
| Peeking / Early Stopping | Checking p-values repeatedly and stopping when p < 0.05 inflates false positive rate | Pre-commit to sample size and end date; use sequential testing methods if early stopping is needed |
| Multiple Testing | Testing 10 independent metrics simultaneously means a ~40% chance (1 − 0.95¹⁰) of at least one false positive at α=0.05 | Define one primary metric; apply Bonferroni correction or FDR control for secondary metrics |
| Sample Ratio Mismatch | The observed traffic split deviates from the intended split, usually because of an assignment or logging bug, causing biased results | Check that the actual split matches the intended split (for example with a chi-square goodness-of-fit test) before analysing results |
| Novelty Effect | New UI gets inflated engagement simply because it is new, not because it is better | Run the test long enough for novelty to wear off (typically 2+ weeks) |
| Network Effects | Control and treatment users interact, violating independence assumption | Use cluster-level randomisation (e.g., by household or geographic region) |
| Survivorship Bias | Analysing only users who completed a flow ignores those who dropped off | Use intention-to-treat analysis: include all assigned users regardless of engagement |
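Two of the rows above reduce to short calculations: the family-wise false positive rate behind the multiple-testing pitfall, and a chi-square goodness-of-fit check for sample ratio mismatch. The sketch below illustrates both under the assumption of independent tests. The function names are hypothetical, and the strict 0.001 cut-off for the SRM check is a common convention assumed here, not a value taken from the text.

```python
from scipy import stats

def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def srm_p_value(observed_counts, intended_shares):
    """Chi-square goodness-of-fit p-value for the observed traffic split."""
    total = sum(observed_counts)
    expected = [total * share for share in intended_shares]
    return stats.chisquare(f_obs=observed_counts, f_exp=expected).pvalue

# Multiple testing: 10 metrics at alpha = 0.05.
print(f"{familywise_error_rate(10):.1%}")      # 40.1%
print(f"Bonferroni alpha: {0.05 / 10:.3f}")    # 0.005

# Sample ratio mismatch: intended 50/50 split, hypothetical observed counts.
print(srm_p_value([12500, 12500], [0.5, 0.5]))          # 1.0
print(srm_p_value([12500, 11900], [0.5, 0.5]) < 0.001)  # True, investigate before analysing
```

A very small SRM p-value points to a problem in assignment or logging, so the metric comparison should wait until the cause is found.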
Beyond A/B: Multivariate and Bandit Testing
Standard A/B tests compare two variants but become impractical when testing many combinations simultaneously. Two alternatives are common in production systems.
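Multivariate Testing (MVT) tests multiple elements at once (e.g., headline + button colour + image). Because every combination of element values becomes its own cell (a full factorial design), the required sample grows multiplicatively with the number of elements and levels. It is therefore best suited to high-traffic pages where interactions between elements matter.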
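The growth in cells is easy to enumerate. The sketch below lists every combination for a hypothetical landing page with three headlines, two button colours, and two images. The per-cell sample size is an assumed round number, and powering every cell like an A/B arm is a conservative simplification.

```python
from itertools import product

# Hypothetical elements and levels for a landing-page MVT.
elements = {
    "headline": ["A", "B", "C"],
    "button_colour": ["green", "orange"],
    "image": ["photo", "illustration"],
}

# One cell per combination of levels (full factorial design).
cells = [dict(zip(elements, combo)) for combo in product(*elements.values())]
n_per_cell = 15_000  # assumed per-cell sample size, for illustration only

print(len(cells))               # 12
print(len(cells) * n_per_cell)  # 180000
print(cells[0])
```

Adding one more two-level element would double the number of cells, which is why MVT is rarely feasible on low-traffic pages.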
Multi-Armed Bandit (MAB) algorithms adaptively allocate more traffic to better-performing variants during the experiment itself, reducing the opportunity cost of showing users an inferior experience. Common algorithms include Epsilon-Greedy, UCB1, and Thompson Sampling. The trade-off is that bandits sacrifice statistical rigour for regret minimisation — they are better for short-horizon optimisation than for causal inference.
# Simple epsilon-greedy bandit simulation
import numpy as np

def epsilon_greedy(true_rates, n_rounds=10000, epsilon=0.1):
    n_arms = len(true_rates)
    counts = np.zeros(n_arms)   # pulls per arm
    rewards = np.zeros(n_arms)  # total reward per arm
    for _ in range(n_rounds):
        if np.random.rand() < epsilon:
            arm = np.random.randint(n_arms)  # explore
        else:
            arm = np.argmax(rewards / (counts + 1e-9))  # exploit
        reward = int(np.random.rand() < true_rates[arm])
        counts[arm] += 1
        rewards[arm] += reward
    # The 1e-9 term guards against division by zero for arms never pulled
    return counts, rewards / (counts + 1e-9)

counts, est_rates = epsilon_greedy([0.10, 0.12, 0.09])
print("Traffic allocated:", counts)
print("Estimated rates:", est_rates.round(3))
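For comparison, Thompson Sampling replaces the fixed exploration rate with draws from each arm's posterior. The sketch below is a minimal Beta-Bernoulli version on simulated arms. The uniform Beta(1, 1) prior, the fixed seed, and the function name `thompson_sampling` are assumptions made for illustration, and the true rates are synthetic.

```python
import numpy as np

def thompson_sampling(true_rates, n_rounds=10_000, seed=0):
    """Beta-Bernoulli Thompson Sampling on simulated arms."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_rates)
    successes = np.zeros(n_arms)
    failures = np.zeros(n_arms)
    for _ in range(n_rounds):
        # Draw one plausible rate per arm from its Beta(1 + s, 1 + f) posterior.
        sampled = rng.beta(1 + successes, 1 + failures)
        arm = int(np.argmax(sampled))
        # Simulate a Bernoulli reward from the (synthetic) true rate.
        reward = int(rng.random() < true_rates[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
    counts = successes + failures
    posterior_means = (1 + successes) / (2 + counts)
    return counts, posterior_means

counts, est_rates = thompson_sampling([0.10, 0.12, 0.09])
print("Traffic allocated:", counts)
print("Posterior means:", est_rates.round(3))
```

Arms that look worse are sampled less often as evidence accumulates, so their estimates stay noisy. This is the loss of inferential rigour described above.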
Summary
A/B testing is the gold standard for measuring the causal impact of product changes. Done correctly — with pre-calculated sample sizes, a single primary metric, full runtime, and proper statistical tests — it produces reliable, actionable results. Done incorrectly — with peeking, multiple testing, or biased randomisation — it creates a false sense of certainty. As a data analyst, your role is to be the guardian of experimental rigour: design tests that can actually answer the question being asked, catch integrity issues before results are shared, and communicate effect sizes alongside p-values so stakeholders can make informed decisions.