Statistical Foundations for Data Analysis

Why Statistics Underpins Data Analysis

Data analysis without statistical grounding produces confident but unreliable conclusions. Statistics provides the framework for quantifying uncertainty, distinguishing signal from noise, and making defensible inferences from samples. For data analysts, practical fluency in descriptive statistics, probability distributions, sampling theory, and inference is not optional: it is what separates analysts who can say "our conversion rate improved" from analysts who can say "our conversion rate improved by 2.3 percentage points, the 95% confidence interval for the improvement excludes zero, and a difference this large would be unlikely if there were no real effect."

Descriptive Statistics: Summarising Data

Statistic	Formula / Definition	Best For	Pitfall
Mean	Sum of values divided by count	Symmetric, unimodal distributions without outliers	Heavily distorted by outliers; the mean income of a room including a billionaire is misleading
Median	Middle value when sorted; average of two middle values for even counts	Skewed distributions and ordinal data	Less mathematically tractable than the mean; subgroup medians cannot be combined into an overall median without the underlying data
Mode	Most frequently occurring value	Categorical data; identifying the most common category	Uninformative for continuous data where every value may be unique
Variance	Average squared deviation from the mean (the sample variance divides by n − 1 rather than n)	Measuring spread; used in many statistical models	Expressed in squared units, so harder to interpret directly; report the standard deviation when communicating spread
Standard Deviation	Square root of variance; same units as the data	Summarising spread in interpretable units; comparing variability across groups	Like the mean, sensitive to outliers
Percentiles / Quantiles	Value below which p% of observations fall	Understanding distribution shape; outlier detection (IQR = P75 − P25)	Different interpolation methods can give slightly different results
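
The pitfalls in the table above can be seen on a small example. The sketch below uses Python's standard `statistics` module on a made-up list of incomes (the values, including the single extreme outlier, are hypothetical and chosen only for illustration). It shows how one outlier pulls the mean and standard deviation while leaving the median nearly untouched, and how two common quantile interpolation methods give slightly different quartiles.

```python
import statistics

# Hypothetical annual incomes in thousands; the last value is an extreme outlier.
incomes = [32, 35, 38, 41, 45, 48, 52, 1_000_000]

mean_income = statistics.mean(incomes)      # dragged upward by the outlier
median_income = statistics.median(incomes)  # average of the two middle values

sample_sd = statistics.stdev(incomes)       # sample SD: divides by n - 1
population_sd = statistics.pstdev(incomes)  # population SD: divides by n

# Quartile cut points under two interpolation conventions.
q_exclusive = statistics.quantiles(incomes, n=4, method="exclusive")
q_inclusive = statistics.quantiles(incomes, n=4, method="inclusive")

# Interquartile range (P75 - P25) using the inclusive convention.
iqr_inclusive = q_inclusive[2] - q_inclusive[0]

print(f"mean={mean_income:.1f}  median={median_income:.1f}")
print(f"sample SD={sample_sd:.1f}  population SD={population_sd:.1f}")
print(f"quartiles (exclusive)={q_exclusive}")
print(f"quartiles (inclusive)={q_inclusive}")
print(f"IQR (inclusive)={iqr_inclusive:.2f}")
```

On data like this the median and IQR describe the typical observation far better than the mean and standard deviation, which is why the choice of summary should follow the shape of the distribution.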

Common Probability Distributions in Data Analysis

Distribution	Parameters	Shape	Typical Use Cases
Normal (Gaussian)	Mean μ, standard deviation σ	Symmetric, bell-shaped	Modelling measurement errors, heights, test scores; many statistical tests assume normality
Binomial	n (trials), p (success probability)	Discrete; symmetric when p=0.5, skewed otherwise	Conversion rates, click-through rates, pass/fail outcomes
Poisson	λ (average rate per interval)	Discrete; right-skewed for small λ	Count data: orders per hour, support tickets per day, page views per minute
Log-Normal	Mean and σ of the log-transformed variable	Right-skewed; log transformation yields normal	Revenue, income, session duration, file sizes — quantities that cannot be negative and are positively skewed
Uniform	Minimum a, maximum b	Flat; all values equally likely	Random number generation; modelling uncertainty when no prior information exists
Exponential	Rate λ	Continuous; right-skewed	Time between events in a Poisson process: time between purchases, inter-arrival times
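
To make a few of the distributions above concrete, here is a minimal sketch that evaluates the binomial and Poisson probability mass functions and the exponential cumulative distribution function directly from their textbook formulas. The function names and the example parameters (100 visitors, a 10% conversion probability, 4 tickets per hour) are assumptions made for illustration; in practice a library such as SciPy would normally be used instead.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """P(exactly k events in an interval when the average rate is lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def exponential_cdf(t: float, rate: float) -> float:
    """P(waiting time until the next event <= t) for a Poisson process with this rate."""
    return 1.0 - math.exp(-rate * t)


# Hypothetical scenario: 100 visitors, each converting with probability 0.10.
p_at_least_15 = sum(binomial_pmf(k, 100, 0.10) for k in range(15, 101))

# Hypothetical scenario: support tickets arrive at an average of 4 per hour.
p_no_tickets_in_hour = poisson_pmf(0, 4.0)
p_next_ticket_within_15_min = exponential_cdf(0.25, 4.0)

print(f"P(at least 15 conversions out of 100) = {p_at_least_15:.4f}")
print(f"P(no tickets in one hour)             = {p_no_tickets_in_hour:.4f}")
print(f"P(next ticket within 15 minutes)      = {p_next_ticket_within_15_min:.4f}")
```

The last two quantities describe the same process from two angles: the Poisson distribution counts events per interval, while the exponential distribution describes the waiting time between them. The probability of zero events in an interval of length t equals the probability that the waiting time exceeds t.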

Sampling and Estimation

Concept	Definition	Practical Implication
Population vs. sample	A population is the entire group of interest; a sample is a subset drawn from it	Analysts rarely have access to the full population; sample statistics estimate population parameters
Sampling bias	A systematic tendency for the sample to over- or under-represent parts of the population	Survivorship bias, self-selection bias, and non-response bias all distort conclusions
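Central Limit Theorem (CLT)	The sampling distribution of the mean of independent observations approaches a normal distribution as sample size grows, whatever the shape of the population distribution, provided its variance is finite	Justifies using normal-distribution-based inference (z-tests, t-tests) even when the data is not normally distributed; heavily skewed or heavy-tailed data needs larger samples before the approximation is good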
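Standard error	Standard deviation of the sampling distribution of a statistic; equals σ/√n for the mean, estimated by s/√n when σ is unknown	Larger samples produce smaller standard errors, so estimates become more precise as n grows; because of the square root, halving the standard error requires four times the sample size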
Confidence interval	A range computed from the sample by a procedure that, over repeated sampling, captures the true population parameter a stated proportion of the time (e.g., 95%)	A 95% CI does NOT mean "95% chance the true value is in this interval"; it means that 95% of intervals constructed this way will contain the true value
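
The repeated-sampling interpretation of a confidence interval, together with the CLT and the standard error, can be checked by simulation. The sketch below is an illustration under stated assumptions rather than a recommended procedure: it draws many samples from a right-skewed exponential population with a known mean, builds a normal-approximation interval (mean ± 1.96 × s/√n) from each, and counts how often the interval contains the true mean. The function names, sample size, number of trials, and seed are all arbitrary choices.

```python
import math
import random
import statistics

Z_95 = 1.96  # approximate 97.5th percentile of the standard normal distribution


def mean_confidence_interval(sample, z=Z_95):
    """Normal-approximation interval for the mean: sample mean +/- z * s / sqrt(n)."""
    n = len(sample)
    centre = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return centre - z * standard_error, centre + z * standard_error


def coverage(true_mean=1.0, n=50, trials=2000, seed=0):
    """Fraction of simulated intervals that contain the true population mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Exponential population with mean true_mean: strongly right-skewed.
        sample = [rng.expovariate(1.0 / true_mean) for _ in range(n)]
        low, high = mean_confidence_interval(sample)
        if low <= true_mean <= high:
            hits += 1
    return hits / trials


print(f"Observed coverage of nominal 95% intervals: {coverage():.3f}")
```

The observed coverage should be close to 95%, though for a skewed population at moderate n it may fall slightly short because the normal approximation is not exact. Each individual interval either contains the true mean or it does not; the 95% describes the long-run behaviour of the procedure.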

Correlation and Causation

Concept	Definition	Key Points
Pearson correlation (r)	Measures the linear relationship between two continuous variables; ranges from −1 to +1	Only captures linear relationships; sensitive to outliers; r=0 does not imply independence
Spearman correlation	Rank-based correlation; measures monotonic (not necessarily linear) relationships	More robust to outliers and non-normal data; appropriate for ordinal variables
Confounding variable	A third variable that causally affects both the independent and dependent variable, creating a spurious association	Ice cream sales and drowning rates are correlated — both are driven by hot weather (the confounder)
Causal inference	Establishing that a change in X causes a change in Y, not just that they co-vary	Requires randomised experiments, instrumental variables, difference-in-differences, or regression discontinuity — not just correlation analysis
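
The difference between Pearson and Spearman correlation, and the point that r = 0 does not imply independence, can be illustrated with two small deterministic datasets. The sketch below implements both coefficients by hand so that it depends only on the standard library; the helper names (`pearson`, `ranks`, `spearman`) and the example data are assumptions introduced for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by the two standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


xs = list(range(-5, 6))

# Monotonic but non-linear: y always rises with x, just not along a straight line.
ys_exponential = [math.exp(x) for x in xs]

# Fully determined by x, yet with no linear trend: a symmetric U-shape.
ys_squared = [x**2 for x in xs]

print(f"exp(x): pearson={pearson(xs, ys_exponential):.3f}  "
      f"spearman={spearman(xs, ys_exponential):.3f}")
print(f"x**2:   pearson={pearson(xs, ys_squared):.3f}  "
      f"spearman={spearman(xs, ys_squared):.3f}")
```

For the exponential relationship Spearman reports a perfect monotonic association while Pearson understates it. For the squared relationship both coefficients are essentially zero even though y is a deterministic function of x, which is why a scatter plot should accompany any correlation coefficient.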

Summary

Statistical foundations give data analysts the vocabulary and tools to make rigorous, defensible claims from data. Descriptive statistics summarise what the data shows; probability distributions model how data-generating processes behave; sampling theory connects samples to populations; and correlation analysis quantifies relationships, while an understanding of confounding and causal inference guards against the correlation-causation fallacy. Analytical techniques such as regression, A/B testing, and forecasting are all built on these foundations, which makes statistical literacy one of the highest-leverage investments a data analyst can make.