What Is Hypothesis Testing?
Hypothesis testing is a statistical framework for making evidence-based decisions about populations using sample data. It answers questions like: Is the conversion rate of version B actually higher than that of version A, or could the difference be due to random chance? Does the new onboarding flow genuinely reduce churn, or is the observed improvement just noise? Hypothesis testing provides a rigorous, reproducible method for answering these questions with quantified uncertainty.
At its core, hypothesis testing works by assuming the null hypothesis is true (no effect, no difference), then measuring how unlikely the observed data would be under that assumption. If the data is sufficiently unlikely — below a predefined threshold — we reject the null hypothesis and conclude the effect is real.
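This logic can be made concrete with a small simulation. The sketch below (illustrative only; the group labels and values are made up) shuffles group labels to build the distribution of differences you would see if the null hypothesis were true, then asks how often a shuffled difference is at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical metric values for two groups (made-up numbers for illustration)
group_a = np.array([12.1, 11.8, 12.5, 11.9, 12.3, 12.0, 11.7, 12.2])
group_b = np.array([12.6, 12.9, 12.4, 13.1, 12.8, 12.7, 13.0, 12.5])

observed_diff = group_b.mean() - group_a.mean()

# Under H0 the group labels are interchangeable, so shuffle them many times
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
null_diffs = []
for _ in range(10_000):
    shuffled = rng.permutation(pooled)
    null_diffs.append(shuffled[n_a:].mean() - shuffled[:n_a].mean())

# Fraction of shuffled differences at least as extreme as the observed one
p_value = np.mean(np.abs(null_diffs) >= abs(observed_diff))
print(f"observed difference = {observed_diff:.3f}, permutation p-value ~ {p_value:.4f}")
```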
Key Concepts: Null and Alternative Hypotheses
Every hypothesis test starts with two competing hypotheses. The null hypothesis (H₀) represents the status quo — no difference, no effect, no relationship. The alternative hypothesis (H₁) represents what you're trying to demonstrate — a difference exists, the treatment works, the groups differ.
For example, in an A/B test on a signup button: H₀ = the conversion rate of variant B equals that of variant A (no effect); H₁ = the conversion rate of variant B is greater than that of variant A (improvement). The goal is to determine whether the data provides sufficient evidence to reject H₀ in favor of H₁.
P-Values and Significance Levels
The p-value is the probability of observing data as extreme as (or more extreme than) the actual data, assuming the null hypothesis is true. A small p-value means the observed result is unlikely if there's truly no effect — suggesting the null hypothesis should be rejected. A large p-value means the data is consistent with the null hypothesis.
The significance level α (typically 0.05) is the threshold below which we reject H₀. If p < α, the result is "statistically significant." This does NOT mean the effect is practically important or large — just that it's unlikely to be due to chance alone. Confusing statistical significance with practical importance is one of the most common errors in data analysis.
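As a sketch of how this plays out in code for the signup-button example above (the counts are invented for illustration), a one-sided proportions z-test returns a p-value that can be compared directly against α:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B results: conversions and visitors per variant (made-up numbers)
conversions = [620, 530]       # variant B, variant A
visitors = [10_000, 10_000]

# H1: conversion rate of B is greater than A, so use the one-sided 'larger' alternative
z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors, alternative="larger")

alpha = 0.05
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```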
Type I and Type II Errors
| Error Type | What Happens | Also Called | Controlled By |
|---|---|---|---|
| Type I (False Positive) | Reject H₀ when it's actually true | α error | Significance level α |
| Type II (False Negative) | Fail to reject H₀ when it's false | β error | Statistical power (1 − β) |
Reducing α (e.g., from 0.05 to 0.01) makes it harder to falsely claim significance but increases the risk of missing real effects. Power analysis before running an experiment determines the sample size needed to detect a meaningful effect with sufficient probability. Low power means real effects go undetected — a common problem in underpowered studies.
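As a sketch, statsmodels can solve for the per-group sample size needed to detect a given standardized effect with 80% power at α = 0.05 (the effect size here is an assumed input chosen before the experiment, not something you know in advance):

```python
from statsmodels.stats.power import TTestIndPower

# Assumed minimum effect worth detecting (Cohen's d), chosen before the experiment
effect_size = 0.3

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                   power=0.80, alternative="two-sided")
print(f"Required sample size per group: {int(round(n_per_group))}")
```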
Common Statistical Tests
| Test | Use Case | Data Type | Python Function |
|---|---|---|---|
| One-sample t-test | Compare sample mean to known value | Continuous | scipy.stats.ttest_1samp |
| Two-sample t-test | Compare means of two groups | Continuous | scipy.stats.ttest_ind |
| Paired t-test | Before/after or matched pairs | Continuous | scipy.stats.ttest_rel |
| Chi-square test | Association between two categorical variables | Categorical | scipy.stats.chi2_contingency |
| One-way ANOVA | Compare means across 3+ groups | Continuous | scipy.stats.f_oneway |
| Mann-Whitney U | Non-parametric alternative to the two-sample t-test | Ordinal / non-normal | scipy.stats.mannwhitneyu |
| Proportion z-test | Compare conversion rates | Binary | statsmodels.stats.proportion.proportions_ztest |
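A brief illustration of two tests from the table, using simulated and invented data (the numbers are assumptions, not real results):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two-sample t-test on simulated continuous data
group_a = rng.normal(loc=100, scale=15, size=200)
group_b = rng.normal(loc=104, scale=15, size=200)
t_stat, p_t = stats.ttest_ind(group_a, group_b)
print(f"t-test: t = {t_stat:.2f}, p = {p_t:.4f}")

# Chi-square test of association on a 2x2 contingency table
# rows = variant A/B, columns = converted / did not convert (invented counts)
table = np.array([[530, 9470],
                  [620, 9380]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
print(f"chi-square: chi2 = {chi2:.2f}, p = {p_chi:.4f}")
```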
A/B Testing: Hypothesis Testing in Practice
A/B testing is the most widely used application of hypothesis testing in business. You randomly split users into a control group (A) and treatment group (B), expose them to different versions of something, and measure whether the treatment produces a statistically significant improvement on a target metric.
The key steps are:

1. Define the metric and minimum detectable effect before starting.
2. Run a power analysis to determine the required sample size (see the sketch after the next paragraph).
3. Collect data until the predetermined sample size is reached (do not stop early based on peeking at results).
4. Run the appropriate statistical test.
5. Interpret results accounting for practical significance, not just statistical significance.
Peeking at results and stopping as soon as p < 0.05 is one of the most common A/B testing mistakes — it dramatically inflates the false positive rate. Always pre-commit to a sample size and stick to it.
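A minimal sketch of pre-committing to a sample size for a conversion-rate test, assuming a 5% baseline rate and a minimum detectable lift to 5.5% (both values are assumptions you would set before the experiment):

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_rate = 0.050   # assumed current conversion rate
target_rate = 0.055     # smallest lift worth detecting

# Convert the two proportions into a standardized effect size (Cohen's h)
effect_size = proportion_effectsize(target_rate, baseline_rate)

# Solve for the per-group sample size at alpha = 0.05 and 80% power (one-sided)
n_per_group = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05,
                                           power=0.80, alternative="larger")
print(f"Collect at least {int(round(n_per_group))} users per variant before testing")
```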
Multiple Testing Problem
Running many hypothesis tests on the same dataset inflates the false positive rate. If you test 20 independent hypotheses at α = 0.05, you should expect about one false positive on average even if none of the effects are real, and the chance of at least one false positive is roughly 64% (1 − 0.95^20 ≈ 0.64). This is the multiple comparisons problem.
The Bonferroni correction addresses this by dividing the significance threshold by the number of tests (α/n), making each individual test more stringent. The Benjamini-Hochberg procedure controls the False Discovery Rate (FDR) — a less conservative approach better suited to exploratory analysis. Always account for multiple comparisons when testing many hypotheses simultaneously.
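As a sketch, statsmodels exposes both corrections through a single helper; the raw p-values below are invented for illustration:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from 8 separate tests on the same dataset
p_values = [0.001, 0.008, 0.012, 0.030, 0.041, 0.049, 0.130, 0.620]

# Bonferroni: effectively compares each p-value against alpha / number of tests
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate instead
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni rejections:        ", reject_bonf)
print("Benjamini-Hochberg rejections:", reject_bh)
```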
Effect Size: The Other Half of the Story
Statistical significance tells you the effect is unlikely to be chance. Effect size tells you how large the effect is. Cohen's d measures the standardized difference between two means — values of 0.2, 0.5, and 0.8 correspond to small, medium, and large effects. For proportions, relative risk or odds ratios quantify effect size. For correlation, Pearson's r is itself an effect size measure.
A statistically significant result with a tiny effect size (e.g., d = 0.02) may not be worth acting on. Conversely, a non-significant result in an underpowered study might miss a practically important effect. Report both p-values and effect sizes to give a complete picture.
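Cohen's d is straightforward to compute by hand as the mean difference divided by the pooled standard deviation; the sketch below uses simulated data rather than real results:

```python
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
treatment = rng.normal(loc=52, scale=10, size=500)  # simulated treatment group
control = rng.normal(loc=50, scale=10, size=500)    # simulated control group

print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```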
Assumptions and Diagnostics
Each test has assumptions that must hold for valid results. T-tests assume approximately normal distributions (robust with large samples via the Central Limit Theorem), and the standard two-sample t-test additionally assumes equal variances; check this with Levene's test, or use Welch's t-test, which does not require it. Chi-square tests require expected cell counts of at least 5. ANOVA assumes independence, normality, and homogeneity of variance. When assumptions are violated, non-parametric alternatives like the Mann-Whitney U or Kruskal-Wallis test are appropriate.
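A short sketch of checking assumptions before choosing a test, using deliberately skewed simulated data: Levene's test for equal variances, a normality check, and a non-parametric fallback.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.exponential(scale=2.0, size=80)   # deliberately skewed data
group_b = rng.exponential(scale=2.5, size=80)

# Check equality of variances and (approximate) normality
_, p_levene = stats.levene(group_a, group_b)
_, p_shapiro = stats.shapiro(group_a)

if p_shapiro < 0.05:
    # Normality looks doubtful: fall back to the non-parametric Mann-Whitney U test
    stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    test_name = "Mann-Whitney U"
else:
    # Use Welch's t-test, which does not assume equal variances
    stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
    test_name = "Welch's t-test"

print(f"Levene p = {p_levene:.3f}; chose {test_name}: p = {p_value:.4f}")
```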
Conclusion
Hypothesis testing is the foundation of evidence-based decision making in data analytics. Mastering the key tests, understanding p-values and their limitations, avoiding common pitfalls like peeking and multiple comparisons, and always reporting effect sizes alongside significance will make your analytical conclusions rigorous and trustworthy. In an era of data-driven decisions, these skills separate analysts who generate noise from those who generate genuine insight.