Why Distributions Matter for Data Analysts
A statistical distribution describes the pattern of values in a dataset — what values are possible, how likely each is, and the shape of the spread. Understanding distributions helps analysts choose the right summary statistics, select appropriate statistical tests, identify anomalies, and build better models. Misidentifying a distribution leads to flawed conclusions: averaging skewed data, applying normal-theory tests to count data, or missing the long tail that causes most of the business impact.
Key Distribution Properties
| Property | Description | Measures |
|---|---|---|
| Central tendency | Where the "middle" of the data is | Mean, median, mode |
| Spread / dispersion | How spread out the values are | Variance, standard deviation, IQR, range |
| Skewness | Asymmetry of the distribution | Positive (right tail), negative (left tail) |
| Kurtosis | Weight of tails vs. normal distribution | Leptokurtic (heavy tails), platykurtic (thin tails) |
| Support | Range of possible values | All reals, non-negative integers, 0–1, etc. |
The Normal (Gaussian) Distribution
The normal distribution is the most commonly assumed distribution for continuous measurements. It is symmetric and bell-shaped, defined by its mean (μ) and standard deviation (σ).
| Range | % of Data |
|---|---|
| μ ± 1σ | ~68% |
| μ ± 2σ | ~95% |
| μ ± 3σ | ~99.7% |
Analyst applications: z-score outlier detection, A/B test analysis, control charts, many ML model assumptions. Check normality with histograms, Q-Q plots, or the Shapiro-Wilk test.
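
The coverage figures and the z-score outlier rule mentioned above can be checked on simulated data. The following is a minimal sketch, assuming a synthetic normal sample; the values of `mu`, `sigma`, the sample size, and the |z| > 3 cutoff are illustrative choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the sketch is reproducible
mu, sigma = 100.0, 15.0           # hypothetical mean and standard deviation
x = rng.normal(mu, sigma, size=100_000)

# z-score: how many sample standard deviations each value lies from the sample mean
z = (x - x.mean()) / x.std(ddof=1)

# Share of observations within 1, 2 and 3 standard deviations of the mean
coverage = {k: float(np.mean(np.abs(z) <= k)) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} sigma: {share:.2%}")

# A common (not universal) convention: flag values with |z| > 3 for review
outliers = x[np.abs(z) > 3]
print("values flagged:", outliers.size)
```

On genuinely normal data roughly 0.3% of points exceed |z| = 3 by chance alone, so a flag is a prompt for review rather than proof of an error.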
Common Continuous Distributions
| Distribution | Support | Shape | Analyst Use Cases |
|---|---|---|---|
| Normal | All reals | Symmetric bell curve | Heights, errors, test scores, A/B tests |
| Log-normal | Positive reals | Right-skewed; normal when logged | Revenue, income, session durations, prices |
| Exponential | Non-negative reals | Monotone decreasing | Time between events, wait times, survival analysis |
| Uniform | [a, b] | Flat / constant | Random sampling, simulation inputs |
| Beta | [0, 1] | Flexible shape | Conversion rates, proportions, Bayesian priors |
| Pareto (power law) | x ≥ x_min | Heavy right tail | Wealth, page views, word frequency (80/20 rule) |
Common Discrete Distributions
| Distribution | Support | Parameters | Analyst Use Cases |
|---|---|---|---|
| Binomial | 0 to n | n trials, p success probability | Click-through rates, pass/fail counts, conversions |
| Poisson | Non-negative integers | λ = expected count per interval | Error counts, support tickets per hour, arrivals |
| Negative Binomial | Non-negative integers | r, p | Overdispersed counts; web sessions, purchases |
| Geometric | Positive integers | p success probability | Trials until first success, churn modeling |
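
One quick way to choose between the Poisson and negative binomial rows is the variance-to-mean ratio, since a Poisson variable has variance equal to its mean. The sketch below uses simulated counts; the parameter values and the helper name `dispersion_index` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Poisson counts: the variance should sit close to the mean
poisson_counts = rng.poisson(lam=4.0, size=50_000)

# Negative binomial counts with the same mean, r * (1 - p) / p = 4,
# but a larger variance, r * (1 - p) / p**2 = 12
r, p = 2.0, 1.0 / 3.0
negbin_counts = rng.negative_binomial(r, p, size=50_000)

def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for Poisson, above 1 when overdispersed."""
    return counts.var(ddof=1) / counts.mean()

d_poisson = dispersion_index(poisson_counts)
d_negbin = dispersion_index(negbin_counts)
print(f"Poisson dispersion index:           {d_poisson:.2f}")
print(f"Negative binomial dispersion index: {d_negbin:.2f}")
```

A ratio well above 1 on real count data suggests a Poisson model would understate the variability, which is when the negative binomial becomes the more natural candidate.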
Identifying Distributions in Practice
Before assuming normality, always visualize the data:
| Tool | What to Look For |
|---|---|
| Histogram | Overall shape, skewness, multimodality |
| Box plot | Median, IQR, outliers |
| Q-Q plot | Deviation from a reference distribution (e.g. normal) |
| CDF plot | Cumulative probability; useful for percentile analysis |
| Log-scale histogram | Reveals power-law or log-normal patterns in skewed data |
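
The Q-Q and CDF rows can also be computed without drawing anything, which is handy in automated checks. This is a sketch on synthetic data: `durations` is a made-up right-skewed sample, and the Q-Q correlation returned by `scipy.stats.probplot` is used here as a rough numeric stand-in for eyeballing the plot.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical right-skewed sample, e.g. session durations in seconds
durations = rng.lognormal(mean=3.0, sigma=1.0, size=2_000)

# probplot returns the Q-Q points plus a least-squares line through them;
# a correlation r near 1 means the sample quantiles track the normal reference
(_, _), (_, _, r_raw) = stats.probplot(durations, dist="norm")
(_, _), (_, _, r_log) = stats.probplot(np.log(durations), dist="norm")
print(f"Q-Q correlation, raw values: {r_raw:.3f}")
print(f"Q-Q correlation, log values: {r_log:.3f}")

# Empirical CDF: sort the values and pair each with its cumulative share
x_sorted = np.sort(durations)
ecdf = np.arange(1, x_sorted.size + 1) / x_sorted.size
p90 = x_sorted[np.searchsorted(ecdf, 0.90)]
print(f"about 90% of sessions last at most {p90:.1f} seconds")
```

If the log-transformed values line up far better than the raw ones, as they should here by construction, a log-normal model is worth considering.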
Python: Fitting and Comparing Distributions
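```python
import numpy as np
import pandas as pd
from scipy import stats

# Assumes `df` is an existing pandas DataFrame with a numeric 'revenue' column
data = df['revenue'].dropna()

# Basic statistics
print(data.describe())
print('skewness:', data.skew())
print('excess kurtosis:', data.kurtosis())  # pandas reports excess kurtosis (normal = 0)

# Test for normality (SciPy warns that Shapiro-Wilk p-values may be inaccurate above 5000 rows)
subsample = data.sample(min(len(data), 5000), random_state=0)
stat, p = stats.shapiro(subsample)
verdict = 'no evidence against normality' if p > 0.05 else 'not normal'
print(f'Shapiro-Wilk p={p:.4f}: {verdict}')

# Fit log-normal (if data is right-skewed and positive)
log_data = np.log(data[data > 0])
mu, sigma = log_data.mean(), log_data.std()
print(f'log-normal fit: mu={mu:.3f}, sigma={sigma:.3f}, median={np.exp(mu):.2f}')

# Generate percentiles
percentiles = np.percentile(data, [10, 25, 50, 75, 90, 95, 99])
print('Percentiles:', percentiles)
```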
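
To compare several candidate distributions rather than test only for normality, one option is to fit each by maximum likelihood and rank them by AIC. The sketch below is one possible approach, not the only one; `sample` is synthetic stand-in data, and the candidate list and the choice to fix the location at 0 are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-in for a positive, right-skewed metric such as revenue
sample = rng.lognormal(mean=4.0, sigma=0.8, size=3_000)

# Candidate distributions and any parameters held fixed during fitting
candidates = {
    "normal": (stats.norm, {}),
    "log-normal": (stats.lognorm, {"floc": 0}),
    "exponential": (stats.expon, {"floc": 0}),
}

aic = {}
for name, (dist, fit_kwargs) in candidates.items():
    params = dist.fit(sample, **fit_kwargs)      # maximum-likelihood estimates
    log_lik = np.sum(dist.logpdf(sample, *params))
    n_free = len(params) - len(fit_kwargs)       # fixed parameters are not estimated
    aic[name] = 2 * n_free - 2 * log_lik         # lower AIC means a better trade-off

for name, value in sorted(aic.items(), key=lambda item: item[1]):
    print(f"{name:12s} AIC = {value:,.1f}")
best = min(aic, key=aic.get)
print("lowest AIC:", best)
```

AIC only ranks the candidates supplied; a Q-Q plot of the winning fit is still worth a look before relying on it.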
Skewed Data: When Not to Use the Mean
Many business metrics — revenue, session duration, load time — follow right-skewed or log-normal distributions. The arithmetic mean is pulled toward outliers and may not represent the typical user experience.
| Situation | Recommended Measure |
|---|---|
| Revenue, income (right-skewed) | Median or geometric mean (the exponential of the mean of the logs); segment into percentile buckets |
| Page load time (right-skewed) | P50, P75, P95, P99 (percentiles) |
| Error rates (count data) | Poisson model; rate per time interval |
| Conversion rates (proportions) | Beta distribution; binomial confidence intervals |
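
The gap between the mean and the typical value is easy to see on simulated right-skewed data. This is a minimal sketch; `revenue` is a synthetic log-normal sample with arbitrarily chosen parameters, not real business data.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic right-skewed metric (log-normal), standing in for per-user revenue
revenue = rng.lognormal(mean=3.5, sigma=1.2, size=20_000)

mean = revenue.mean()
median = np.median(revenue)
geo_mean = np.exp(np.log(revenue).mean())   # geometric mean: exp of the mean of logs
p50, p75, p95, p99 = np.percentile(revenue, [50, 75, 95, 99])

# Share of users whose value falls below the arithmetic mean
share_below_mean = float(np.mean(revenue < mean))

print(f"mean={mean:.1f}  median={median:.1f}  geometric mean={geo_mean:.1f}")
print(f"P50={p50:.1f}  P75={p75:.1f}  P95={p95:.1f}  P99={p99:.1f}")
print(f"{share_below_mean:.0%} of users fall below the mean")
```

For log-normal data the geometric mean and the median estimate the same quantity, which is why either can stand in for the "typical" value when the arithmetic mean is dominated by the tail.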
Central Limit Theorem
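The Central Limit Theorem (CLT) states that the distribution of sample means approaches a normal distribution as sample size grows, whatever the shape of the underlying distribution, provided that distribution has finite variance. This is why normal-theory tests (t-tests, z-tests) are approximately valid for comparing sample means even when the raw data is skewed, as long as the sample size is large enough. The common n ≥ 30 rule of thumb works for mildly skewed data, but heavy-tailed metrics such as revenue can need far larger samples before the approximation holds. This result underpins most frequentist A/B testing.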
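
A small simulation makes the convergence concrete. The sketch below draws repeated samples from an exponential distribution (strongly right-skewed) and measures how skewed the resulting sample means are; the sample sizes, repeat count, and helper name `sample_mean_skewness` are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

def sample_mean_skewness(n, n_repeats=10_000):
    """Skewness of the distribution of sample means for exponential data."""
    draws = rng.exponential(scale=1.0, size=(n_repeats, n))
    return float(stats.skew(draws.mean(axis=1)))

# The exponential distribution itself has skewness 2; for the mean of n
# independent draws the theoretical skewness is 2 / sqrt(n)
skew_by_n = {n: sample_mean_skewness(n) for n in (2, 30, 300)}
for n, skewness in skew_by_n.items():
    print(f"n={n:4d}  skewness of sample means={skewness:.3f}  (theory {2 / np.sqrt(n):.3f})")
```

The skewness shrinks toward 0 as n grows but is still visible at n = 30, which illustrates why the usual rule of thumb is only a starting point for skewed data.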
Summary
Recognizing which distribution governs your data is a foundational data analysis skill. It determines which summary statistics are meaningful, which tests are appropriate, and how to interpret variability. The normal distribution is a useful default for many measurements, but revenue, durations, rates, and counts each follow their own distributional patterns. Visualizing data before modeling — with histograms, Q-Q plots, and log-scale views — reveals the true shape and prevents the common mistake of applying normal-theory methods to non-normal data.