Why Statistics Underpins Data Analysis
Data analysis without statistical grounding produces confident but unreliable conclusions. Statistics provides the framework for quantifying uncertainty, distinguishing signal from noise, and making defensible inferences from samples. For data analysts, practical fluency in descriptive statistics, probability distributions, sampling theory, and inference is not optional: it is what separates analysts who can say "our conversion rate improved" from analysts who can say "our conversion rate improved by 2.3 percentage points, the 95% confidence interval excludes zero, and the difference is unlikely to be due to chance alone."
Descriptive Statistics: Summarising Data
| Statistic | Formula / Definition | Best For | Pitfall |
|---|---|---|---|
| Mean | Sum of values divided by count | Symmetric, unimodal distributions without outliers | Heavily distorted by outliers; the mean income of a room that includes a billionaire is misleading |
| Median | Middle value when sorted; average of the two middle values for even counts | Skewed distributions and ordinal data | Less mathematically tractable than the mean; subgroup medians cannot be combined into an overall median without the underlying data |
| Mode | Most frequently occurring value | Categorical data; identifying the most common category | Uninformative for continuous data where every value may be unique |
| Variance | Average squared deviation from the mean (divide by n − 1 rather than n when estimating from a sample) | Measuring spread; used in many statistical models | Expressed in squared units, so harder to interpret directly; report the standard deviation instead |
| Standard Deviation | Square root of the variance; same units as the data | Summarising spread in interpretable units; comparing variability across groups | Like the mean, sensitive to outliers |
| Percentiles / Quantiles | Value below which p% of observations fall | Understanding distribution shape; outlier detection (IQR = P75 − P25) | Different interpolation methods can give slightly different results |
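
The statistics in the table can be computed with Python's standard `statistics` module. The sketch below uses a small, made-up list of incomes (the values and the variable names are illustrative assumptions, not data from this article) to show how a single outlier drags the mean and standard deviation while leaving the median almost untouched, and how two interpolation methods give different quartiles for the same data.

```python
import statistics

# Hypothetical incomes in thousands; the last value is a deliberate extreme outlier.
incomes = [32, 35, 38, 41, 41, 45, 52, 58, 1_000_000]

mean = statistics.mean(incomes)          # pulled far upward by the outlier
median = statistics.median(incomes)      # middle value of the sorted data
mode = statistics.mode(incomes)          # most frequent value
variance = statistics.variance(incomes)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(incomes)      # square root of the sample variance

# Quartiles under two interpolation conventions; the results differ slightly.
q1_inc, _, q3_inc = statistics.quantiles(incomes, n=4, method="inclusive")
q1_exc, _, q3_exc = statistics.quantiles(incomes, n=4, method="exclusive")

# IQR rule of thumb: flag values more than 1.5 * IQR beyond the quartiles.
iqr = q3_inc - q1_inc
low_fence = q1_inc - 1.5 * iqr
high_fence = q3_inc + 1.5 * iqr
outliers = [x for x in incomes if x < low_fence or x > high_fence]

print(f"mean={mean:.1f} median={median} mode={mode} std_dev={std_dev:.1f}")
print(f"inclusive quartiles: {q1_inc}, {q3_inc}; exclusive quartiles: {q1_exc}, {q3_exc}")
print(f"outliers by the IQR rule: {outliers}")
```

The 1.5 × IQR fence is a common convention rather than a fixed rule; the multiplier is a judgement call that depends on how heavy-tailed the data is expected to be.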
Common Probability Distributions in Data Analysis
| Distribution | Parameters | Shape | Typical Use Cases |
|---|---|---|---|
| Normal (Gaussian) | Mean μ, standard deviation σ | Symmetric, bell-shaped | Modelling measurement errors, heights, test scores; many statistical tests assume normality |
| Binomial | n (trials), p (success probability) | Discrete; symmetric when p=0.5, skewed otherwise | Counts of successes in a fixed number of trials: conversions, click-throughs, pass/fail outcomes |
| Poisson | λ (average rate per interval) | Discrete; right-skewed for small λ | Count data: orders per hour, support tickets per day, page views per minute |
| Log-Normal | Mean μ and standard deviation σ of the log-transformed variable | Right-skewed; the log transformation yields a normal distribution | Revenue, income, session duration, file sizes: strictly positive, right-skewed quantities |
| Uniform | Minimum a, maximum b | Flat; all values equally likely | Random number generation; modelling uncertainty when no prior information exists |
| Exponential | Rate λ | Continuous; right-skewed | Time between events in a Poisson process: time between purchases, inter-arrival times |
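
Several of these distributions can be evaluated directly with the standard library. The following is a minimal sketch, not a replacement for a statistics package; the helper names (`binomial_pmf`, `poisson_pmf`, `exponential_cdf`) and all rates and probabilities are hypothetical values chosen for illustration.

```python
import math
import random
import statistics

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events in an interval with average rate lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def exponential_cdf(t, rate):
    """P(waiting time until the next event is at most t) for the given rate."""
    return 1 - math.exp(-rate * t)

# Normal: share of values within one standard deviation of the mean.
std_normal = statistics.NormalDist(mu=0, sigma=1)
within_one_sd = std_normal.cdf(1) - std_normal.cdf(-1)

# Binomial: exactly 3 conversions among 20 visitors at an assumed 10% rate.
p_three = binomial_pmf(3, 20, 0.10)

# Poisson: no support tickets in an hour that averages 2 tickets.
p_zero = poisson_pmf(0, 2.0)

# Exponential: next ticket arrives within half an hour at 2 tickets per hour.
p_wait = exponential_cdf(0.5, 2.0)

# Log-normal: right skew shows up as mean > median in a simulated sample.
rng = random.Random(42)
log_normal_sample = [rng.lognormvariate(0, 1) for _ in range(10_000)]
sample_mean = statistics.fmean(log_normal_sample)
sample_median = statistics.median(log_normal_sample)

print(f"within one sd: {within_one_sd:.4f}")
print(f"P(3 of 20 convert): {p_three:.4f}, P(no tickets): {p_zero:.4f}")
print(f"P(wait <= 0.5h): {p_wait:.4f}")
print(f"log-normal mean {sample_mean:.2f} vs median {sample_median:.2f}")
```

The Poisson and exponential rows describe the same process from two angles: the probability of zero events in an interval of length t equals the probability that the waiting time exceeds t, so `poisson_pmf(0, rate * t)` matches `1 - exponential_cdf(t, rate)`.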
Sampling and Estimation
| Concept | Definition | Practical Implication |
|---|---|---|
| Population vs. sample | A population is the entire group of interest; a sample is a subset drawn from it | Analysts rarely have access to the full population; sample statistics estimate population parameters |
| Sampling bias | A systematic tendency for the sample to over- or under-represent parts of the population | Survivorship bias, self-selection bias, and non-response bias all distort conclusions |
| Central Limit Theorem (CLT) | The sampling distribution of the mean approaches a normal distribution as sample size grows, for independent observations from any population with finite variance | Justifies normal-distribution-based inference on means (z-tests, t-tests) even when the data is not normally distributed, provided the sample is large enough; strongly skewed or heavy-tailed data needs larger samples |
| Standard error | Standard deviation of the sampling distribution of a statistic; equals σ/√n for the mean, estimated in practice by s/√n | Larger samples produce smaller standard errors, so estimates become more precise as n grows; halving the standard error requires four times the sample size |
| Confidence interval | A range computed from the sample by a procedure that captures the true population parameter in a stated proportion (e.g., 95%) of repeated samples | A 95% CI does NOT mean there is a 95% chance that the true value lies in this particular interval; it means that 95% of intervals constructed this way would contain the true value |
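
The repeated-sampling reading of a confidence interval, and the σ/√n formula for the standard error, can both be checked by simulation. The sketch below is an illustration under stated assumptions: the population is an exponential distribution with a hypothetical true mean of 5 (deliberately skewed, so the CLT has work to do), the interval uses the normal approximation with z = 1.96, and `mean_confidence_interval` is a helper defined here rather than a library function.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Normal-approximation interval for the mean: sample mean +/- z * s / sqrt(n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean - z * std_err, mean + z * std_err

rng = random.Random(0)   # fixed seed so the run is reproducible
TRUE_MEAN = 5.0          # assumed population mean (exponential, so sigma = 5 as well)
N = 100                  # observations per sample
REPEATS = 2000           # number of independent samples drawn

hits = 0
sample_means = []
for _ in range(REPEATS):
    sample = [rng.expovariate(1 / TRUE_MEAN) for _ in range(N)]
    low, high = mean_confidence_interval(sample)
    hits += low <= TRUE_MEAN <= high          # did this interval capture the truth?
    sample_means.append(statistics.fmean(sample))

coverage = hits / REPEATS                      # share of intervals containing TRUE_MEAN
empirical_se = statistics.stdev(sample_means)  # spread of the sample means
theoretical_se = TRUE_MEAN / math.sqrt(N)      # sigma / sqrt(n)

print(f"coverage: {coverage:.3f}")
print(f"standard error: empirical {empirical_se:.3f}, theoretical {theoretical_se:.3f}")
```

Coverage should land near the nominal 95%, typically a little below it for skewed populations at moderate sample sizes. Any single interval either contains the true mean or it does not; the 95% describes the procedure across repeated samples.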
Correlation and Causation
| Concept | Definition | Key Points |
|---|---|---|
| Pearson correlation (r) | Measures the linear relationship between two continuous variables; ranges from −1 to +1 | Only captures linear relationships; sensitive to outliers; r=0 does not imply independence |
| Spearman correlation | Rank-based correlation; measures monotonic (not necessarily linear) relationships | More robust to outliers and non-normal data; appropriate for ordinal variables |
| Confounding variable | A third variable that causally affects both the independent and dependent variable, creating a spurious association | Ice cream sales and drowning rates are correlated because both are driven by hot weather (the confounder) |
| Causal inference | Establishing that a change in X causes a change in Y, not just that they co-vary | Requires randomised experiments or quasi-experimental designs such as instrumental variables, difference-in-differences, or regression discontinuity; correlation analysis alone is not enough |
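
The difference between the two correlation coefficients is easiest to see on small constructed examples. The following sketch implements both from their definitions using only the standard library; the function names and the toy data are assumptions made for illustration. It shows a monotonic but non-linear relationship (Spearman is 1, Pearson is not) and a fully dependent relationship with zero Pearson correlation.

```python
import math

def pearson(xs, ys):
    """Pearson r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def ranks(values):
    """1-based ranks; tied values share the average of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Monotonic but non-linear: y = x^3 on 1..10.
xs = list(range(1, 11))
cubic = [x**3 for x in xs]
r_cubic = pearson(xs, cubic)       # strong, but below 1 because the curve is not a line
rho_cubic = spearman(xs, cubic)    # 1, because the ordering is perfectly preserved

# Fully dependent, yet uncorrelated: y = x^2 on a range symmetric around zero.
symmetric_xs = list(range(-5, 6))
parabola = [x**2 for x in symmetric_xs]
r_parabola = pearson(symmetric_xs, parabola)

print(f"cubic: pearson={r_cubic:.3f}, spearman={rho_cubic:.3f}")
print(f"parabola: pearson={r_parabola:.3f}")
```

Neither coefficient says anything about causation: both would report a strong association between two variables driven by a shared confounder.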
Summary
Statistical foundations give data analysts the vocabulary and tools to make rigorous, defensible claims from data. Descriptive statistics summarise what the data shows; probability distributions model how data-generating processes behave; sampling theory connects samples to populations; and correlation analysis reveals relationships while guarding against the correlation-causation fallacy. Analytical techniques such as regression, A/B testing, and forecasting are all built on these foundations, which makes statistical literacy one of the highest-leverage investments a data analyst can make.