What Is Hypothesis Testing?
Hypothesis testing is a statistical framework for making evidence-based decisions about populations using sample data. It answers questions like: Is the conversion rate of version B actually higher than that of version A, or could the difference be due to random chance? Does the new onboarding flow genuinely reduce churn, or is the observed improvement just noise? Hypothesis testing provides a rigorous, reproducible method for answering these questions with quantified uncertainty.
At its core, hypothesis testing works by assuming the null hypothesis is true (no effect, no difference), then measuring how unlikely the observed data would be under that assumption. If the data is sufficiently unlikely — below a predefined threshold — we reject the null hypothesis and conclude the effect is real.
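This logic can be made concrete with a small simulation. The sketch below (illustrative only; the group labels and values are made up) shuffles group labels to build the distribution of differences you would see if the null hypothesis were true, then asks how often a shuffled difference is at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical metric values for two groups (made-up numbers for illustration)
group_a = np.array([12.1, 11.8, 12.5, 11.9, 12.3, 12.0, 11.7, 12.2])
group_b = np.array([12.6, 12.9, 12.4, 13.1, 12.8, 12.7, 13.0, 12.5])

observed_diff = group_b.mean() - group_a.mean()

# Under H0 the group labels are interchangeable, so shuffle them many times
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
null_diffs = []
for _ in range(10_000):
    shuffled = rng.permutation(pooled)
    null_diffs.append(shuffled[n_a:].mean() - shuffled[:n_a].mean())

# Fraction of shuffled differences at least as extreme as the observed one
p_value = np.mean(np.abs(null_diffs) >= abs(observed_diff))
print(f"observed difference = {observed_diff:.3f}, permutation p-value ~ {p_value:.4f}")
```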
Key Concepts: Null and Alternative Hypotheses
Every hypothesis test starts with two competing hypotheses. The null hypothesis (H₀) represents the status quo — no difference, no effect, no relationship. The alternative hypothesis (H₁) represents what you're trying to demonstrate — a difference exists, the treatment works, the groups differ.
For example, in an A/B test on a signup button: H₀ = the conversion rate of variant B equals that of variant A (no effect); H₁ = the conversion rate of variant B is greater than that of variant A (improvement). The goal is to determine whether the data provides sufficient evidence to reject H₀ in favor of H₁.
P-Values and Significance Levels
The p-value is the probability of observing data as extreme as (or more extreme than) the actual data, assuming the null hypothesis is true. A small p-value means the observed result is unlikely if there's truly no effect — suggesting the null hypothesis should be rejected. A large p-value means the data is consistent with the null hypothesis.
The significance level α (typically 0.05) is the threshold below which we reject H₀. If p < α, the result is "statistically significant." This does NOT mean the effect is practically important or large — just that it's unlikely to be due to chance alone. Confusing statistical significance with practical importance is one of the most common errors in data analysis.
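As a sketch of how this plays out in code for the signup-button example above (the counts are invented for illustration), a one-sided proportions z-test returns a p-value that can be compared directly against α:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B results: conversions and visitors per variant (made-up numbers)
conversions = [620, 530]       # variant B, variant A
visitors = [10_000, 10_000]

# H1: conversion rate of B is greater than A, so use the one-sided 'larger' alternative
z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors, alternative="larger")

alpha = 0.05
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```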
Type I and Type II Errors
| Error Type | What Happens | Also Called | Controlled By |
|---|---|---|---|
| Type I (False Positive) | Reject H₀ when it's actually true | α error | Significance level α |
| Type II (False Negative) | Fail to reject H₀ when it's false | β error | Statistical power (1 − β) |
Reducing α (e.g., from 0.05 to 0.01) makes it harder to falsely claim significance but increases the risk of missing real effects. Power analysis before running an experiment determines the sample size needed to detect a meaningful effect with sufficient probability. Low power means real effects go undetected — a common problem in underpowered studies.
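As a sketch, statsmodels can solve for the per-group sample size needed to detect a given standardized effect with 80% power at α = 0.05 (the effect size here is an assumed input chosen before the experiment, not something you know in advance):

```python
from statsmodels.stats.power import TTestIndPower

# Assumed minimum effect worth detecting (Cohen's d), chosen before the experiment
effect_size = 0.3

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                   power=0.80, alternative="two-sided")
print(f"Required sample size per group: {int(round(n_per_group))}")
```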
Common Statistical Tests
| Test | Use Case | Data Type | Python Function |
|---|---|---|---|
| One-sample t-test | Compare sample mean to known value | Continuous | scipy.stats.ttest_1samp |
| Two-sample t-test | Compare means of two groups | Continuous | scipy.stats.ttest_ind |
| Paired t-test | Before/after or matched pairs | Continuous | scipy.stats.ttest_rel |
| Chi-square test | Association between two categorical variables | Categorical | scipy.stats.chi2_contingency |
| One-way ANOVA | Compare means across 3+ groups | Continuous | scipy.stats.f_oneway |
| Mann-Whitney U | Non-parametric alternative to the two-sample t-test | Ordinal / non-normal | scipy.stats.mannwhitneyu |
| Proportion z-test | Compare conversion rates | Binary | statsmodels.stats.proportion.proportions_ztest |
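A brief illustration of two tests from the table, using simulated and invented data (the numbers are assumptions, not real results):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two-sample t-test on simulated continuous data
group_a = rng.normal(loc=100, scale=15, size=200)
group_b = rng.normal(loc=104, scale=15, size=200)
t_stat, p_t = stats.ttest_ind(group_a, group_b)
print(f"t-test: t = {t_stat:.2f}, p = {p_t:.4f}")

# Chi-square test of association on a 2x2 contingency table
# rows = variant A/B, columns = converted / did not convert (invented counts)
table = np.array([[530, 9470],
                  [620, 9380]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
print(f"chi-square: chi2 = {chi2:.2f}, p = {p_chi:.4f}")
```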
A/B Testing: Hypothesis Testing in Practice
A/B testing is the most widely used application of hypothesis testing in business. You randomly split users into a control group (A) and treatment group (B), expose them to different versions of something, and measure whether the treatment produces a statistically significant improvement on a target metric.
The key steps are:

1. Define the metric and minimum detectable effect before starting.
2. Run a power analysis to determine the required sample size (see the sketch after the next paragraph).
3. Collect data until the predetermined sample size is reached (do not stop early based on peeking at results).
4. Run the appropriate statistical test.
5. Interpret results accounting for practical significance, not just statistical significance.
Peeking at results and stopping as soon as p < 0.05 is one of the most common A/B testing mistakes — it dramatically inflates the false positive rate. Always pre-commit to a sample size and stick to it.
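A minimal sketch of pre-committing to a sample size for a conversion-rate test, assuming a 5% baseline rate and a minimum detectable lift to 5.5% (both values are assumptions you would set before the experiment):

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_rate = 0.050   # assumed current conversion rate
target_rate = 0.055     # smallest lift worth detecting

# Convert the two proportions into a standardized effect size (Cohen's h)
effect_size = proportion_effectsize(target_rate, baseline_rate)

# Solve for the per-group sample size at alpha = 0.05 and 80% power (one-sided)
n_per_group = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05,
                                           power=0.80, alternative="larger")
print(f"Collect at least {int(round(n_per_group))} users per variant before testing")
```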
Multiple Testing Problem
Running many hypothesis tests on the same dataset inflates the false positive rate. If you test 20 independent hypotheses at α = 0.05, you should expect about one false positive on average even if none of the effects are real, and the chance of at least one false positive is roughly 64% (1 − 0.95^20 ≈ 0.64). This is the multiple comparisons problem.
The Bonferroni correction addresses this by dividing the significance threshold by the number of tests (α/n), making each individual test more stringent. The Benjamini-Hochberg procedure controls the False Discovery Rate (FDR) — a less conservative approach better suited to exploratory analysis. Always account for multiple comparisons when testing many hypotheses simultaneously.
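As a sketch, statsmodels exposes both corrections through a single helper; the raw p-values below are invented for illustration:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from 8 separate tests on the same dataset
p_values = [0.001, 0.008, 0.012, 0.030, 0.041, 0.049, 0.130, 0.620]

# Bonferroni: effectively compares each p-value against alpha / number of tests
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate instead
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni rejections:        ", reject_bonf)
print("Benjamini-Hochberg rejections:", reject_bh)
```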
Effect Size: The Other Half of the Story
Statistical significance tells you the effect is unlikely to be chance. Effect size tells you how large the effect is. Cohen's d measures the standardized difference between two means — values of 0.2, 0.5, and 0.8 correspond to small, medium, and large effects. For proportions, relative risk or odds ratios quantify effect size. For correlation, Pearson's r is itself an effect size measure.
A statistically significant result with a tiny effect size (e.g., d = 0.02) may not be worth acting on. Conversely, a non-significant result in an underpowered study might miss a practically important effect. Report both p-values and effect sizes to give a complete picture.
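Cohen's d is straightforward to compute by hand as the mean difference divided by the pooled standard deviation; the sketch below uses simulated data rather than real results:

```python
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
treatment = rng.normal(loc=52, scale=10, size=500)  # simulated treatment group
control = rng.normal(loc=50, scale=10, size=500)    # simulated control group

print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```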
Assumptions and Diagnostics
Each test has assumptions that must hold for valid results. T-tests assume approximately normal distributions (robust with large samples via the Central Limit Theorem), and the standard two-sample t-test additionally assumes equal variances; check this with Levene's test, or use Welch's t-test, which does not require it. Chi-square tests require expected cell counts of at least 5. ANOVA assumes independence, normality, and homogeneity of variance. When assumptions are violated, non-parametric alternatives like the Mann-Whitney U or Kruskal-Wallis test are appropriate.
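A short sketch of checking assumptions before choosing a test, using deliberately skewed simulated data: Levene's test for equal variances, a normality check, and a non-parametric fallback.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.exponential(scale=2.0, size=80)   # deliberately skewed data
group_b = rng.exponential(scale=2.5, size=80)

# Check equality of variances and (approximate) normality
_, p_levene = stats.levene(group_a, group_b)
_, p_shapiro = stats.shapiro(group_a)

if p_shapiro < 0.05:
    # Normality looks doubtful: fall back to the non-parametric Mann-Whitney U test
    stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    test_name = "Mann-Whitney U"
else:
    # Use Welch's t-test, which does not assume equal variances
    stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
    test_name = "Welch's t-test"

print(f"Levene p = {p_levene:.3f}; chose {test_name}: p = {p_value:.4f}")
```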
Conclusion
Hypothesis testing is the foundation of evidence-based decision making in data analytics. Mastering the key tests, understanding p-values and their limitations, avoiding common pitfalls like peeking and multiple comparisons, and always reporting effect sizes alongside significance will make your analytical conclusions rigorous and trustworthy. In an era of data-driven decisions, these skills separate analysts who generate noise from those who generate genuine insight.