A/B Testing and Experimentation for Data Analysts

What Is A/B Testing?

A/B testing (also called split testing) is a controlled experiment in which two or more variants of a feature, design, or message are shown to randomly assigned user groups so that their effect on a target metric can be measured with statistical rigor. The core idea is to isolate the impact of a single change by holding everything else constant. Unlike before/after comparisons, a properly randomized A/B test controls for external confounds such as seasonality, marketing campaigns, or natural user behaviour shifts.
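
One common way to implement the random assignment described above is to hash a stable user identifier, so that a given user always sees the same variant. The sketch below is an illustration rather than a description of any particular platform; the function name `assign_variant`, the ID format, and the choice of SHA-256 are all assumptions.

```python
import hashlib

def assign_variant(user_id, experiment_id, variants=("control", "treatment")):
    """Deterministically map a user to one of the variants (hypothetical helper)."""
    # Salting with the experiment ID keeps assignments uncorrelated across experiments.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # The hash is effectively uniform, so the modulo yields a near-even split.
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The same user always lands in the same variant for a given experiment.
print(assign_variant("user_42", "checkout_button_color"))
```

Because the mapping is a pure function of the two IDs, assignment can be recomputed later to audit what each user should have seen.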

Data analysts are responsible for designing experiments correctly, monitoring them during execution, and drawing valid conclusions from the results. Mistakes commonly occur at all three stages.

Key Concepts and Terminology

Term	Definition	Practical Note
Control (A)	The existing version shown to the baseline group	Must be the current production experience, not a stripped-down version
Treatment (B)	The new variant shown to the test group	Change only one thing at a time to attribute effects clearly
Null Hypothesis (H₀)	There is no difference between A and B	The test tries to reject this; failing to reject ≠ proving A = B
p-value	Probability of observing results at least this extreme if H₀ is true	p < 0.05 is a common threshold but should be set before the test
Statistical Significance	The p-value is below the pre-set threshold α, so a result this extreme would be unlikely if H₀ were true	Significance alone does not mean the effect is large or meaningful
Practical Significance	The effect size is large enough to matter to the business	A 0.01% lift in conversion may be statistically significant but not worth shipping
Power (1 − β)	Probability of detecting a true effect when one exists	Aim for 80% or higher; low power leads to missed improvements
Sample Size	Number of users needed per variant to achieve desired power	Calculate before running; stopping early invalidates the test
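
To make the table's distinction between statistical and practical significance concrete, the following sketch computes a p-value and a confidence interval for the difference between two conversion rates. The helper `diff_in_proportions` and the visitor counts are hypothetical, and the interval is a simple Wald approximation that assumes large samples.

```python
import math
from scipy import stats

def diff_in_proportions(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Return (absolute difference, CI low, CI high, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error: used for the z-test under H0 (no difference)
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    p_value = 2 * stats.norm.sf(abs(diff / se_pooled))
    # Unpooled standard error: used for the Wald confidence interval
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = stats.norm.ppf(1 - alpha / 2)
    return diff, diff - z * se, diff + z * se, p_value

# Made-up data: 5,000,000 users per arm, 10.00% vs 10.05% conversion
diff, ci_low, ci_high, p = diff_in_proportions(500_000, 5_000_000, 502_500, 5_000_000)
print(f"Absolute lift: {diff:.4%}  (95% CI {ci_low:.4%} to {ci_high:.4%})")
print(f"p-value: {p:.4f}")
```

With samples this large the p-value falls below 0.05, yet the whole interval sits under a tenth of a percentage point. Whether that is worth shipping is a business question the p-value cannot answer.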

Sample Size Calculation

Running a test without computing the required sample size is one of the most common mistakes in experimentation. Too few users means the test lacks power to detect real effects; too many wastes time and exposes users to an inferior experience longer than necessary.

from scipy import stats
import math

def required_sample_size(baseline_rate, min_detectable_effect, alpha=0.05, power=0.80):
    """
    Calculate required sample size per variant for a two-proportion z-test.
    baseline_rate: current conversion rate (e.g., 0.10 for 10%)
    min_detectable_effect: smallest relative lift worth detecting (e.g., 0.05 for 5%)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_effect)
    # Z-scores for alpha (two-tailed) and power
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta  = stats.norm.ppf(power)
    pooled  = (p1 + p2) / 2
    n = (z_alpha * math.sqrt(2 * pooled * (1 - pooled)) +
         z_beta  * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p2 - p1) ** 2
    return math.ceil(n)

n = required_sample_size(baseline_rate=0.10, min_detectable_effect=0.10)
print(f"Users needed per variant: {n}")  # ~14,751

In this example, detecting a 10% relative lift (from 10% to 11% conversion) at 80% power requires roughly 14,750 users per variant, a number that surprises many teams running tests on low-traffic pages.
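
The sample size translates directly into runtime once daily traffic is known. The sketch below restates the same formula so that it runs on its own, then estimates the test duration for several minimum detectable effects. `DAILY_VISITORS` is a made-up figure, and the estimate ignores repeat visitors and weekly seasonality.

```python
import math
from scipy import stats

def sample_size_per_variant(p1, relative_lift, alpha=0.05, power=0.80):
    """Two-proportion sample size, using the same formula as the function above."""
    p2 = p1 * (1 + relative_lift)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * pooled * (1 - pooled))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

DAILY_VISITORS = 2_000  # hypothetical traffic to the tested page, shared by both variants

for lift in (0.05, 0.10, 0.20):
    n = sample_size_per_variant(0.10, lift)
    days = math.ceil(2 * n / DAILY_VISITORS)
    print(f"relative lift {lift:.0%}: {n:,} users per variant, about {days} days")
```

Halving the detectable lift roughly quadruples the required sample, which is why the minimum detectable effect should be agreed with stakeholders before the test starts.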

Running the Statistical Test

Once the experiment has collected enough data, use a two-proportion z-test (for binary outcomes like conversion) or a t-test (for continuous outcomes like revenue per user). The mechanics in Python:

from statsmodels.stats.proportion import proportions_ztest
import numpy as np

# Conversion data
control_visitors   = 12500
control_converts   = 1175   # 9.4% conversion
treatment_visitors = 12500
treatment_converts = 1312   # 10.5% conversion

# Two-proportion z-test (proportions_ztest lives in statsmodels, not scipy.stats)
count = np.array([treatment_converts, control_converts])
nobs  = np.array([treatment_visitors, control_visitors])
z_stat, p_value = proportions_ztest(count, nobs)

control_rate   = control_converts / control_visitors
treatment_rate = treatment_converts / treatment_visitors
relative_lift  = (treatment_rate - control_rate) / control_rate * 100

print(f"Control rate:   {control_rate:.2%}")
print(f"Treatment rate: {treatment_rate:.2%}")
print(f"Relative lift:  {relative_lift:.1f}%")
print(f"z-statistic:    {z_stat:.3f}")
print(f"p-value:        {p_value:.4f}")
print("Significant!" if p_value < 0.05 else "Not significant.")
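
For the continuous case mentioned above (revenue per user), a comparable sketch uses Welch's t-test, which does not assume equal variances in the two groups. The revenue data here is simulated purely for illustration; `simulate_revenue` and its parameters are assumptions, not part of any real dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

def simulate_revenue(n_users, conv_rate, mean_order):
    """Synthetic revenue per user: zero for most users, skewed for purchasers."""
    purchased = rng.random(n_users) < conv_rate
    order_values = rng.exponential(scale=mean_order, size=n_users)
    return purchased * order_values

control_revenue = simulate_revenue(12_500, 0.094, 40.0)
treatment_revenue = simulate_revenue(12_500, 0.105, 40.0)

# equal_var=False selects Welch's t-test instead of Student's t-test
t_stat, p_value = stats.ttest_ind(treatment_revenue, control_revenue, equal_var=False)

print(f"Control mean:   {control_revenue.mean():.2f}")
print(f"Treatment mean: {treatment_revenue.mean():.2f}")
print(f"t-statistic:    {t_stat:.3f}")
print(f"p-value:        {p_value:.4f}")
```

Revenue is usually heavily skewed, so the t-test leans on large samples here. With small samples or extreme outliers, a bootstrap or a capped metric may be a safer choice.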

Querying Experiment Results in SQL

Most experimentation platforms store assignment and event data in a data warehouse. Analysts query it directly to compute per-variant metrics before passing them to statistical tests.

-- Step 1: Get users assigned to each variant
WITH assignments AS (
  SELECT user_id, variant, assigned_at   -- variant is 'control' or 'treatment'
  FROM experiment_assignments
  WHERE experiment_id = 'checkout_button_color'
    -- Half-open range: includes all of 2024-03-14 if assigned_at is a timestamp
    -- (assumes the test ran through the end of that day)
    AND assigned_at >= '2024-03-01'
    AND assigned_at <  '2024-03-15'
),
-- Step 2: Join to conversion events (LEFT JOIN keeps non-converters in the denominator)
conversions AS (
  SELECT
    a.variant,
    COUNT(DISTINCT a.user_id)                        AS visitors,
    COUNT(DISTINCT e.user_id)                        AS converters,
    COUNT(DISTINCT e.user_id) * 1.0
      / COUNT(DISTINCT a.user_id)                    AS conv_rate,
    AVG(e.order_value)                               AS avg_order_value  -- per purchase, not per user
  FROM assignments a
  LEFT JOIN events e
    ON a.user_id = e.user_id
   AND e.event_type = 'purchase'
   AND e.event_time >= a.assigned_at   -- count only purchases made after assignment
   AND e.event_time <  '2024-03-15'
  GROUP BY a.variant
)
SELECT * FROM conversions;

Common Pitfalls and How to Avoid Them

Pitfall	What Goes Wrong	Prevention
Peeking / Early Stopping	Checking p-values repeatedly and stopping when p < 0.05 inflates false positive rate	Pre-commit to sample size and end date; use sequential testing methods if early stopping is needed
Multiple Testing	Testing 10 independent metrics simultaneously means a ~40% chance of at least one false positive at α=0.05 (1 − 0.95^10 ≈ 0.40)	Define one primary metric; apply Bonferroni correction or FDR control for secondary metrics
Sample Ratio Mismatch	Variants receive traffic in proportions that differ from the intended split, usually because of an assignment or logging bug, which biases results	Test the observed split against the intended split (e.g., with a chi-square goodness-of-fit test) before analysing results
Novelty Effect	New UI gets inflated engagement simply because it is new, not because it is better	Run the test long enough for novelty to wear off (typically 2+ weeks)
Network Effects	Control and treatment users interact, violating independence assumption	Use cluster-level randomisation (e.g., by household or geographic region)
Survivorship Bias	Analysing only users who completed a flow ignores those who dropped off	Use intention-to-treat analysis: include all assigned users regardless of engagement
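
Two of the checks in the table are easy to automate. The sketch below tests for sample ratio mismatch with a chi-square goodness-of-fit test and reproduces the multiple-testing arithmetic. The helper names and traffic counts are hypothetical, and the family-wise error formula assumes the metrics are independent.

```python
from scipy import stats

def srm_p_value(observed_counts, expected_ratios):
    """Chi-square goodness-of-fit test of observed traffic against the intended split."""
    total = sum(observed_counts)
    expected = [total * ratio for ratio in expected_ratios]
    _, p_value = stats.chisquare(f_obs=observed_counts, f_exp=expected)
    return p_value

def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests

# Made-up counts for an intended 50/50 split
print(f"Healthy split:    p = {srm_p_value([12_500, 12_480], [0.5, 0.5]):.3f}")
print(f"Suspicious split: p = {srm_p_value([12_500, 11_900], [0.5, 0.5]):.6f}")

print(f"FWER for 10 tests:            {family_wise_error_rate(10):.3f}")
print(f"Bonferroni threshold (10):    {bonferroni_alpha(10):.4f}")
```

A very small p-value from the split check suggests a bug in assignment or logging rather than a real effect. Teams often use a stricter threshold than 0.05 for this check so that it does not fire on ordinary noise.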

Beyond A/B: Multivariate and Bandit Testing

Standard A/B tests compare two variants but become impractical when testing many combinations simultaneously. Two alternatives are common in production systems.

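Multivariate Testing (MVT) tests multiple elements at once (e.g., headline + button colour + image). The number of combinations is the product of the number of options for each element, so the required sample size grows multiplicatively with every element added. MVT is therefore suited to high-traffic pages where interactions between elements matter.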
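
A quick way to see how fast a full factorial design grows is to enumerate its cells. The elements, options, and `USERS_PER_CELL` figure below are placeholders, and the total assumes every combination is analysed as its own arm.

```python
from itertools import product

# Hypothetical page elements and the options being tested for each
elements = {
    "headline": ["current", "benefit-led"],
    "button_colour": ["blue", "green", "orange"],
    "image": ["product", "lifestyle"],
}

# Every combination of options is one cell of the full factorial design
cells = list(product(*elements.values()))
USERS_PER_CELL = 15_000  # placeholder; take this from a sample size calculation

print(f"Combinations to test: {len(cells)}")
print(f"Total users needed:   {len(cells) * USERS_PER_CELL:,}")
```

Adding one more three-option element would triple the number of cells, which is why MVT is rarely practical on low-traffic pages.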

Multi-Armed Bandit (MAB) algorithms adaptively allocate more traffic to better-performing variants during the experiment itself, reducing the opportunity cost of showing users an inferior experience. Common algorithms include Epsilon-Greedy, UCB1, and Thompson Sampling. The trade-off is that bandits sacrifice statistical rigour for regret minimisation — they are better for short-horizon optimisation than for causal inference.

# Simple epsilon-greedy bandit simulation
import numpy as np

def epsilon_greedy(true_rates, n_rounds=10000, epsilon=0.1, seed=None):
    rng      = np.random.default_rng(seed)   # pass a seed for reproducible runs
    n_arms   = len(true_rates)
    counts   = np.zeros(n_arms)              # pulls per arm
    rewards  = np.zeros(n_arms)              # total conversions per arm
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.integers(n_arms)                   # explore: pick a random arm
        else:
            arm = np.argmax(rewards / (counts + 1e-9))   # exploit: best estimated rate
        reward = int(rng.random() < true_rates[arm])     # simulate a Bernoulli conversion
        counts[arm]  += 1
        rewards[arm] += reward
    # The small constant avoids division by zero for arms that were never pulled
    return counts, rewards / (counts + 1e-9)

counts, est_rates = epsilon_greedy([0.10, 0.12, 0.09])
print("Traffic allocated:", counts)
print("Estimated rates:", est_rates.round(3))
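
For comparison, here is a minimal sketch of Thompson Sampling, one of the other algorithms named above. It assumes binary rewards and a uniform Beta(1, 1) prior on each arm; the function name and defaults are illustrative choices, not a reference implementation.

```python
import numpy as np

def thompson_sampling(true_rates, n_rounds=10_000, seed=0):
    """Beta-Bernoulli Thompson Sampling on simulated conversion data."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_rates)
    successes = np.zeros(n_arms)
    failures = np.zeros(n_arms)
    for _ in range(n_rounds):
        # Draw one plausible conversion rate per arm from its Beta posterior,
        # then play the arm whose draw is highest.
        sampled = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(sampled))
        reward = int(rng.random() < true_rates[arm])  # simulate a Bernoulli conversion
        successes[arm] += reward
        failures[arm] += 1 - reward
    pulls = successes + failures
    # np.maximum guards against division by zero for arms that were never played
    return pulls, successes / np.maximum(pulls, 1)

pulls, est_rates = thompson_sampling([0.10, 0.12, 0.09])
print("Traffic allocated:", pulls)
print("Estimated rates:", est_rates.round(3))
```

Unlike epsilon-greedy, exploration here is not a fixed fraction of traffic. It shrinks on its own as the posteriors narrow, so clearly worse arms stop receiving traffic sooner.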

Summary

A/B testing is the gold standard for measuring the causal impact of product changes. Done correctly — with pre-calculated sample sizes, a single primary metric, full runtime, and proper statistical tests — it produces reliable, actionable results. Done incorrectly — with peeking, multiple testing, or biased randomisation — it creates a false sense of certainty. As a data analyst, your role is to be the guardian of experimental rigour: design tests that can actually answer the question being asked, catch integrity issues before results are shared, and communicate effect sizes alongside p-values so stakeholders can make informed decisions.