Statistical Distributions for Data Analysts

Why Distributions Matter for Data Analysts

A statistical distribution describes the pattern of values in a dataset — what values are possible, how likely each is, and the shape of the spread. Understanding distributions helps analysts choose the right summary statistics, select appropriate statistical tests, identify anomalies, and build better models. Misidentifying a distribution leads to flawed conclusions: averaging skewed data, applying normal-theory tests to count data, or missing the long tail that causes most of the business impact.

Key Distribution Properties

Property	Description	Measures
Central tendency	Where the "middle" of the data is	Mean, median, mode
Spread / dispersion	How spread out the values are	Variance, standard deviation, IQR, range
Skewness	Asymmetry of the distribution	Positive (right tail), negative (left tail)
Kurtosis	Weight of tails vs. normal distribution	Leptokurtic (heavy tails), platykurtic (thin tails)
Support	Range of possible values	All reals, non-negative integers, 0–1, etc.
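
As a rough illustration of the measures in the table, the sketch below computes them on a synthetic right-skewed sample. The log-normal parameters and the variable names (values, summary) are assumptions made for the example, not values from any real dataset.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample (log-normal); the parameters are arbitrary.
rng = np.random.default_rng(42)
values = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)

summary = {
    "mean": float(np.mean(values)),
    "median": float(np.median(values)),
    "std": float(np.std(values, ddof=1)),      # sample standard deviation
    "iqr": float(stats.iqr(values)),           # 75th minus 25th percentile
    "skewness": float(stats.skew(values)),     # > 0 means a longer right tail
    "excess_kurtosis": float(stats.kurtosis(values)),  # Fisher definition: normal = 0
}

for name, value in summary.items():
    print(f"{name:>16}: {value:.2f}")
```

On data like this the mean sits well above the median, which is the usual signature of a right tail.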

The Normal (Gaussian) Distribution

The normal distribution is the most commonly assumed distribution for continuous measurements. It is symmetric and bell-shaped, defined by its mean (μ) and standard deviation (σ).

Range	% of Data
μ ± 1σ	~68%
μ ± 2σ	~95%
μ ± 3σ	~99.7%

Analyst applications: z-score outlier detection, A/B test analysis, control charts, many ML model assumptions. Check normality with histograms, Q-Q plots, or the Shapiro-Wilk test.
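
The coverage figures in the table can be reproduced from the normal CDF, and the same idea gives a simple z-score outlier check. This is a minimal sketch: the helper name zscore_outliers, the threshold of 3, and the synthetic sample are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Probability mass within k standard deviations of the mean of a normal distribution.
coverage = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} sigma: {share:.4f}")

def zscore_outliers(x, threshold=3.0):
    """Return a boolean mask for points more than `threshold` SDs from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) > threshold

# Synthetic data: 1000 roughly normal points plus one planted extreme value.
rng = np.random.default_rng(0)
sample = np.append(rng.normal(loc=100, scale=10, size=1000), [200.0])
mask = zscore_outliers(sample)
print("flagged points:", int(mask.sum()))
```

Because the mean and standard deviation are themselves sensitive to extreme values, z-score flags are most trustworthy when the bulk of the data really is close to normal.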

Common Continuous Distributions

Distribution	Support	Shape	Analyst Use Cases
Normal	All reals	Symmetric bell curve	Heights, errors, test scores, A/B tests
Log-normal	Positive reals	Right-skewed; normal when logged	Revenue, income, session durations, prices
Exponential	Non-negative reals	Monotone decreasing	Time between events, wait times, survival analysis
Uniform	[a, b]	Flat / constant	Random sampling, simulation inputs
Beta	[0, 1]	Flexible shape	Conversion rates, proportions, Bayesian priors
Pareto (power law)	x ≥ x_min	Heavy right tail	Wealth, page views, word frequency (80/20 rule)
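
Each distribution in the table has a counterpart in scipy.stats. The sketch below instantiates one example of each and prints its support and theoretical skewness. The parameter values are illustrative assumptions, chosen only so that every quantity is finite.

```python
from scipy import stats

# Illustrative parameter choices; none of them come from the text.
continuous = {
    "normal": stats.norm(loc=0, scale=1),
    "log-normal": stats.lognorm(s=1.0, scale=1.0),
    "exponential": stats.expon(scale=2.0),
    "uniform": stats.uniform(loc=0, scale=1),   # defined on [loc, loc + scale]
    "beta": stats.beta(a=2, b=5),
    "pareto": stats.pareto(b=4.0, scale=1.0),   # defined on x >= scale
}

supports, skews = {}, {}
for name, dist in continuous.items():
    supports[name] = dist.support()
    skews[name] = float(dist.stats(moments="s"))
    print(f"{name:>12}: support={supports[name]}, skewness={skews[name]:.2f}")
```

The symmetric shapes (normal, uniform) report zero skewness, while the log-normal, exponential, and Pareto examples are all strongly right-skewed.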

Common Discrete Distributions

Distribution	Support	Parameters	Analyst Use Cases
Binomial	0 to n	n trials, p success probability	Click-through rates, pass/fail counts, conversions
Poisson	Non-negative integers	λ = expected count per interval	Error counts, support tickets per hour, arrivals
Negative Binomial	Non-negative integers	r (target number of successes), p (success probability)	Overdispersed counts (variance greater than mean); web sessions, purchases
Geometric	Positive integers	p success probability	Trials until first success, churn modeling
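
One quick way to choose between Poisson and negative binomial for count data is the dispersion index (variance divided by mean), which is 1 for a Poisson process. A minimal sketch on simulated counts follows; the helper name dispersion_index and all parameter values are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(7)

# Poisson counts: variance equals the mean (both 4 here).
poisson_counts = rng.poisson(lam=4.0, size=20_000)

# NumPy's negative_binomial(n, p) counts failures before n successes,
# so the mean is n * (1 - p) / p = 4 and the variance is n * (1 - p) / p**2 = 12.
negbin_counts = rng.negative_binomial(n=2, p=1 / 3, size=20_000)

def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for Poisson, above 1 when overdispersed."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

d_poisson = dispersion_index(poisson_counts)
d_negbin = dispersion_index(negbin_counts)
print(f"Poisson dispersion index:           {d_poisson:.2f}")
print(f"Negative binomial dispersion index: {d_negbin:.2f}")
```

A ratio well above 1 on real counts suggests a Poisson model will understate the variability.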

Identifying Distributions in Practice

Before assuming normality, always visualize the data:

Tool	What to Look For
Histogram	Overall shape, skewness, multimodality
Box plot	Median, IQR, outliers
Q-Q plot	Deviation from a reference distribution (e.g. normal)
CDF plot	Cumulative probability — useful for percentile analysis
Log-scale histogram	Reveals power-law or log-normal patterns in skewed data
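
The Q-Q comparison can also be summarized numerically. The sketch below uses scipy.stats.probplot, which returns the Q-Q points together with the correlation of a straight-line fit. The data is synthetic and the variable names are assumptions for the example.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed data standing in for a metric such as session duration.
rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=2_000)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r)).
# An r close to 1 means the Q-Q points lie close to a straight line.
_, (_, _, r_raw) = stats.probplot(skewed, dist="norm")
_, (_, _, r_log) = stats.probplot(np.log(skewed), dist="norm")

print(f"Q-Q correlation, raw values: {r_raw:.3f}")
print(f"Q-Q correlation, log values: {r_log:.3f}")
```

A clear improvement after the log transform is a hint that a log-normal model is worth considering. Passing plot=plt (with matplotlib.pyplot imported as plt) to probplot draws the plot itself.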

Python: Fitting and Comparing Distributions

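import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# Assumes df is an existing pandas DataFrame with a numeric 'revenue' column
data = df['revenue'].dropna()

# Basic statistics
print(data.describe())
print('skewness:', data.skew())
print('excess kurtosis:', data.kurtosis())  # pandas reports excess kurtosis (normal = 0)

# Test for normality on at most 5000 points (SciPy warns that the p-value may be inaccurate beyond that)
sample = data.sample(min(len(data), 5000), random_state=0)
stat, p = stats.shapiro(sample)
verdict = 'no evidence against normality' if p > 0.05 else 'not normal'
print(f'Shapiro-Wilk p={p:.4f}: {verdict}')

# Fit log-normal (if data is right-skewed and positive)
log_data = np.log(data[data > 0])
mu, sigma = log_data.mean(), log_data.std()
print(f'log-normal fit: mu={mu:.3f}, sigma={sigma:.3f}')

# Generate percentiles
percentiles = np.percentile(data, [10, 25, 50, 75, 90, 95, 99])
print('Percentiles:', percentiles)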
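
To compare candidate distributions directly, one option is to fit each by maximum likelihood and rank them by AIC (lower is better). This is a sketch under stated assumptions: the revenue array is synthetic, the candidate list is arbitrary, and AIC is only one of several reasonable criteria.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a positive, right-skewed revenue column.
rng = np.random.default_rng(3)
revenue = rng.lognormal(mean=4.0, sigma=0.8, size=3_000)

# Each candidate: (scipy distribution, parameters held fixed during fitting).
candidates = {
    "normal": (stats.norm, {}),
    "log-normal": (stats.lognorm, {"floc": 0}),
    "exponential": (stats.expon, {"floc": 0}),
}

aic = {}
for name, (dist, fixed) in candidates.items():
    params = dist.fit(revenue, **fixed)            # maximum-likelihood fit
    log_lik = np.sum(dist.logpdf(revenue, *params))
    n_free = len(params) - len(fixed)              # fixed parameters are not estimated
    aic[name] = 2 * n_free - 2 * log_lik

best = min(aic, key=aic.get)
for name, value in sorted(aic.items(), key=lambda item: item[1]):
    print(f"{name:>12}: AIC = {value:.1f}")
print("lowest AIC:", best)
```

AIC only ranks the candidates supplied, so a Q-Q plot of the winning fit is still worth checking.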

Skewed Data: When Not to Use the Mean

Many business metrics — revenue, session duration, load time — follow right-skewed or log-normal distributions. The arithmetic mean is pulled toward outliers and may not represent the typical user experience.

Situation	Recommended Measure
Revenue, income (right-skewed)	Median or geometric mean (exponentiated mean of logs); segment into percentile buckets
Page load time (right-skewed)	P50, P75, P95, P99 (percentiles)
Error rates (count data)	Poisson model; rate per time interval
Conversion rates (proportions)	Beta distribution; binomial confidence intervals
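
For the conversion-rate row, a Beta distribution gives an interval estimate directly. The sketch below assumes a uniform Beta(1, 1) prior and uses hypothetical counts (120 conversions from 2,400 visitors) to show the mechanics.

```python
from scipy import stats

# Hypothetical counts, for illustration only.
conversions, visitors = 120, 2_400
observed_rate = conversions / visitors

# With a uniform Beta(1, 1) prior, the posterior for the conversion rate is
# Beta(1 + successes, 1 + failures).
posterior = stats.beta(1 + conversions, 1 + visitors - conversions)
low, high = posterior.ppf([0.025, 0.975])

print(f"observed rate: {observed_rate:.4f}")
print(f"95% credible interval: [{low:.4f}, {high:.4f}]")
```

Reporting the interval alongside the point estimate makes it clear how much a rate based on a small sample can move.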

Central Limit Theorem

The Central Limit Theorem (CLT) states that the distribution of sample means approaches a normal distribution as sample size grows, whatever the shape of the underlying distribution, provided its variance is finite. This is why normal-theory tests (t-tests, z-tests) remain approximately valid for comparing sample means even when the raw data is skewed, as long as the sample size is large enough. The common rule of thumb is n ≥ 30, but heavily skewed or heavy-tailed metrics such as revenue can need far larger samples before the approximation is good. This underpins most frequentist A/B testing.
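
A small simulation shows the effect. The sketch below draws repeated samples from an exponential distribution (skewness 2) and tracks how the skewness of the sample means shrinks as n grows. The function name, sample sizes, and number of repetitions are assumptions for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

def skew_of_sample_means(n, draws=10_000):
    """Skewness of the means of `draws` exponential samples, each of size n."""
    samples = rng.exponential(scale=1.0, size=(draws, n))
    return float(stats.skew(samples.mean(axis=1)))

# n = 1 is the raw distribution; larger n should look increasingly symmetric.
skews = {n: skew_of_sample_means(n) for n in (1, 5, 30, 200)}
for n, value in skews.items():
    print(f"n = {n:>3}: skewness of sample means = {value:.2f}")
```

For exponential data the skewness of the mean falls in proportion to 1/sqrt(n), so some asymmetry is still visible at n = 30. This is one reason the rule of thumb should not be treated as a guarantee.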

Summary

Recognizing which distribution governs your data is a foundational data analysis skill. It determines which summary statistics are meaningful, which tests are appropriate, and how to interpret variability. The normal distribution is a useful default for many measurements, but revenue, durations, rates, and counts each follow their own distributional patterns. Visualizing data before modeling — with histograms, Q-Q plots, and log-scale views — reveals the true shape and prevents the common mistake of applying normal-theory methods to non-normal data.