Why Statistics Is the Language of Data Analysis
Statistics provides the formal framework for turning raw numbers into defensible conclusions. Without it, analysts are limited to describing what happened — with it, they can quantify uncertainty, test whether observed differences are real, and make predictions with known confidence levels. Every core task in data analysis — summarising distributions, comparing groups, building models, designing experiments — is grounded in statistical reasoning.
This guide covers the practical statistical concepts every data analyst needs: descriptive statistics, probability distributions, hypothesis testing, confidence intervals, and regression fundamentals, with Python implementations throughout.
Descriptive Statistics
Descriptive statistics summarise a dataset's central tendency, spread, and shape. Always compute all three — a mean alone conceals whether data is tightly clustered or wildly spread.
| Statistic | Formula / Function | When to Use | Sensitive to Outliers? |
|---|---|---|---|
| Mean | sum(x) / n | Symmetric distributions without extreme outliers | Yes |
| Median | Middle value when sorted | Skewed distributions; income, house prices, session durations | No |
| Mode | Most frequent value | Categorical data; detecting default/placeholder values | No |
| Standard Deviation | sqrt(variance) | Measuring spread around the mean | Yes |
| IQR (Q3 − Q1) | 75th percentile minus 25th | Robust spread measure; outlier detection | No |
| Skewness | Third standardised moment | Measuring asymmetry; positive = right tail | Yes |
| Kurtosis | Fourth standardised moment | Measuring tail heaviness; >3 = heavy tails (equivalently, excess kurtosis > 0) | Yes |
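To make the "Sensitive to Outliers?" column concrete, here is a minimal sketch using invented order values (the numbers and the `summarise` helper are illustrative assumptions, not part of any real dataset or library). It adds one extreme value and compares how far each statistic moves.

```python
import numpy as np

# Hypothetical order values, invented for illustration
values = np.array([48, 50, 51, 52, 53, 55, 56, 58], dtype=float)
with_outlier = np.append(values, 500.0)  # one extreme order added


def summarise(x):
    """Return mean, median, sample std dev, and IQR for a 1-D array."""
    q1, q3 = np.percentile(x, [25, 75])
    return {
        "mean": x.mean(),
        "median": np.median(x),
        "std": x.std(ddof=1),  # sample standard deviation
        "iqr": q3 - q1,
    }


clean = summarise(values)
dirty = summarise(with_outlier)

# Print each statistic without and with the outlier
for name in clean:
    print(f"{name:>6}: {clean[name]:8.2f} -> {dirty[name]:8.2f}")
```

In this sketch the mean and standard deviation shift substantially, while the median and IQR change by less than one unit, which is why the robust measures are usually preferred for skewed data.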
Descriptive Statistics in Python
import pandas as pd
import numpy as np
from scipy import stats
df = pd.read_csv("orders.csv")
x = df["order_value"].dropna()
# Central tendency
print(f"Mean: {x.mean():.2f}")
print(f"Median: {x.median():.2f}")
print(f"Mode: {x.mode()[0]:.2f}")
# Spread
print(f"Std Dev: {x.std():.2f}")
print(f"IQR: {x.quantile(0.75) - x.quantile(0.25):.2f}")
# Shape
print(f"Skewness: {x.skew():.3f}") # >0 = right-skewed
print(f"Kurtosis: {x.kurtosis():.3f}") # excess kurtosis
# Full percentile profile
percentiles = [5, 10, 25, 50, 75, 90, 95, 99]
print(x.quantile([p/100 for p in percentiles]))
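The IQR is also commonly used for outlier detection. The sketch below applies the widely used 1.5 × IQR rule of thumb (Tukey's fences) to a small invented series standing in for `df["order_value"]`; the multiplier 1.5 is a convention rather than something this guide prescribes, and the data are hypothetical.

```python
import pandas as pd

# Hypothetical order values standing in for df["order_value"]
x = pd.Series([20, 22, 25, 27, 30, 31, 33, 35, 38, 250], dtype=float)

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1

# Rule of thumb: flag points more than 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = x[(x < lower_fence) | (x > upper_fence)]
print(f"Fences: ({lower_fence:.2f}, {upper_fence:.2f})")
print(f"Flagged values: {outliers.tolist()}")
```

Flagged values are candidates for inspection, not automatic deletion; whether an extreme order is an error or a genuine large purchase depends on the business context.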
Probability Distributions Every Analyst Should Know
| Distribution | Shape / Parameters | Typical Use in Analytics |
|---|---|---|
| Normal (Gaussian) | Bell curve; defined by mean μ and std σ | Modelling continuous metrics that cluster around an average; many statistical tests assume normality |
| Log-Normal | Right-skewed; log of values is normal | Revenue, session duration, file sizes — quantities that cannot go below zero but can be very large |
| Binomial | Discrete; n trials, probability p of success | Conversion rates, click-through rates, defect counts |
| Poisson | Discrete; events per fixed interval, parameter λ | Arrivals, page views per minute, support tickets per day |
| Exponential | Continuous; time between events, parameter λ | Time between purchases, time-to-failure, churn timing |
| Uniform | Equal probability across a range | Baseline comparison; random number generation in simulations |
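A few of the properties in the table can be checked by simulation. The following is a rough sketch using NumPy's random generator with arbitrarily chosen parameters (all values are assumptions for illustration): it confirms that the log of log-normal samples is roughly symmetric, that a Poisson sample has mean and variance both near λ, and that binomial and exponential sample means sit near their theoretical values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
size = 50_000

# Log-normal: raw values are right-skewed, their logs are roughly symmetric
lognormal = rng.lognormal(mean=3.0, sigma=0.8, size=size)
skew_raw = stats.skew(lognormal)
skew_log = stats.skew(np.log(lognormal))
print(f"Skew of raw values:  {skew_raw:.2f}")
print(f"Skew of log(values): {skew_log:.2f}")

# Poisson: mean and variance are both approximately lambda
poisson = rng.poisson(lam=4.0, size=size)
print(f"Poisson mean: {poisson.mean():.2f}, variance: {poisson.var():.2f}")

# Binomial: expected number of successes = trials * p
binomial = rng.binomial(n=100, p=0.12, size=size)
print(f"Binomial mean: {binomial.mean():.2f} (theory: {100 * 0.12:.2f})")

# Exponential: mean waiting time = 1 / lambda (NumPy takes scale = 1 / lambda)
rate = 0.5
exponential = rng.exponential(scale=1 / rate, size=size)
print(f"Exponential mean: {exponential.mean():.2f} (theory: {1 / rate:.2f})")
```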
Hypothesis Testing
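A hypothesis test evaluates whether an observed pattern in the data is larger than chance variation alone would plausibly produce. The framework is always the same: define null and alternative hypotheses, choose a significance level, compute a test statistic, and compare the resulting p-value to that level. The p-value is the probability of seeing a result at least as extreme as the observed one if the null hypothesis were true.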
from scipy import stats
import numpy as np
# Two-sample t-test: do two groups have different means?
# Note: ttest_ind assumes equal variances by default; pass equal_var=False for Welch's t-test.
group_a = np.array([45, 52, 48, 61, 55, 50, 47, 53])
group_b = np.array([58, 63, 70, 55, 67, 72, 60, 65])
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
# One-sample t-test: is the mean different from a target?
target_mean = 50
t_stat2, p_value2 = stats.ttest_1samp(group_a, popmean=target_mean)
print(f"\nOne-sample t: p = {p_value2:.4f}")
# Chi-squared test: are two categorical variables independent?
observed = [[120, 80], [95, 105]]
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"\nChi-squared: p = {p:.4f}")
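The logic behind a p-value can also be seen without any distributional formula. The sketch below is a simple permutation test on the same two groups used in the t-test example: it is a swapped-in illustration of the null-hypothesis idea, not the method the snippets above use. If group labels did not matter, shuffling them should often produce a difference in means as large as the observed one; the number of shuffles and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Same illustrative groups as in the t-test example
group_a = np.array([45, 52, 48, 61, 55, 50, 47, 53])
group_b = np.array([58, 63, 70, 55, 67, 72, 60, 65])

observed_diff = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

n_permutations = 10_000
count_extreme = 0
for _ in range(n_permutations):
    # Shuffle the pooled values, then split them back into two groups
    shuffled = rng.permutation(pooled)
    diff = shuffled[len(group_a):].mean() - shuffled[:len(group_a)].mean()
    if abs(diff) >= abs(observed_diff):
        count_extreme += 1

# Add 1 to numerator and denominator so the estimate is never exactly zero
p_value = (count_extreme + 1) / (n_permutations + 1)
print(f"Observed difference in means: {observed_diff:.3f}")
print(f"Permutation p-value (two-sided): {p_value:.4f}")
```

A small p-value here says that very few random relabellings reproduce a gap as large as the one observed, which is the same conclusion the t-test reaches by a different route.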
Confidence Intervals
A confidence interval gives a range of plausible values for the true population parameter. A 95% CI means that if you repeated the experiment many times, roughly 95% of the resulting intervals would contain the true value; it does not mean there is a 95% chance the true value is in this particular interval.
import numpy as np
from scipy import stats
data = np.array([42, 55, 61, 48, 53, 58, 50, 47, 64, 52])
n = len(data)
mean = data.mean()
se = stats.sem(data) # standard error of the mean
# 95% confidence interval
ci_low, ci_high = stats.t.interval(0.95, df=n-1, loc=mean, scale=se)
print(f"Mean: {mean:.2f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
# Margin of error
margin = ci_high - mean
print(f"Margin of error: ±{margin:.2f}")
Confidence intervals are more informative than p-values alone because they communicate both statistical significance and practical significance (the magnitude of the effect). A very narrow CI around a small effect may be significant but not worth acting on.
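The repeated-sampling interpretation of a 95% CI given earlier in this section can be checked by simulation. The sketch below assumes a hypothetical normal population with a known mean (the parameters, sample size, and number of repetitions are all invented for illustration), draws many samples, builds a t-based interval from each, and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

true_mean, true_sd = 50.0, 8.0  # hypothetical population parameters
sample_size, n_experiments = 10, 2_000

covered = 0
for _ in range(n_experiments):
    sample = rng.normal(true_mean, true_sd, size=sample_size)
    low, high = stats.t.interval(
        0.95, df=sample_size - 1, loc=sample.mean(), scale=stats.sem(sample)
    )
    # Count this experiment if its interval contains the true mean
    if low <= true_mean <= high:
        covered += 1

coverage = covered / n_experiments
print(f"Intervals containing the true mean: {coverage:.1%}")
```

The coverage should land close to 95%. Any single interval either contains the true mean or it does not; the 95% describes the procedure across repetitions.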
Linear Regression Fundamentals
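Linear regression models the relationship between a continuous outcome variable and one or more predictors. It is one of the most interpretable predictive models, because each coefficient gives the expected change in the outcome for a one-unit change in that predictor with the others held fixed, and it forms the conceptual basis for more advanced techniques like logistic regression, ridge regression, and neural networks.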
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
df = pd.read_csv("sales.csv")
X = df[["marketing_spend", "price_discount", "days_in_market"]]
y = df["units_sold"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Coefficients:")
for feat, coef in zip(X.columns, model.coef_):
    print(f"  {feat}: {coef:.3f}")
print(f"Intercept: {model.intercept_:.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.2f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")
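The R² score measures the proportion of variance in the outcome explained by the model (0 = no better than always predicting the mean, 1 = perfect fit). On held-out data, as computed here with r2_score on the test set, it can also be negative when the model predicts worse than the mean. An R² of 0.70 means the features explain 70% of the variation in units sold.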
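To show where the R² number comes from, here is a small sketch that computes it by hand as 1 − SS_res / SS_tot. It uses synthetic data with a single hypothetical predictor (the coefficients and noise level are invented) and a NumPy least-squares line rather than the scikit-learn model above.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data: one hypothetical predictor with a known linear effect plus noise
marketing_spend = rng.uniform(0, 100, size=200)
units_sold = 20 + 1.5 * marketing_spend + rng.normal(0, 15, size=200)

# Ordinary least squares fit of a straight line (highest power first)
slope, intercept = np.polyfit(marketing_spend, units_sold, deg=1)
predicted = intercept + slope * marketing_spend

ss_res = np.sum((units_sold - predicted) ** 2)          # unexplained variation
ss_tot = np.sum((units_sold - units_sold.mean()) ** 2)  # total variation
r_squared = 1 - ss_res / ss_tot

print(f"Slope: {slope:.3f}, intercept: {intercept:.3f}")
print(f"R² (in-sample): {r_squared:.3f}")
```

For a single-predictor fit evaluated on the data it was trained on, this value equals the squared Pearson correlation between predictor and outcome.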
Choosing the Right Statistical Test
| Question | Data Type | Test to Use |
|---|---|---|
| Is the mean of one group different from a target value? | Continuous | One-sample t-test |
| Are the means of two independent groups different? | Continuous | Two-sample t-test (Welch's if unequal variance) |
| Are the means of 3+ groups different? | Continuous | One-way ANOVA, followed by Tukey HSD post-hoc |
| Are two categorical variables independent? | Categorical | Chi-squared test of independence |
| Are two conversion rates different? | Binary / Proportions | Two-proportion z-test |
| Is there a monotonic relationship between two variables? | Ordinal / Non-normal | Spearman rank correlation |
| Are two paired measurements different (before/after)? | Continuous | Paired t-test |
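Several tests in the table are not demonstrated elsewhere in this guide. The sketch below runs them with SciPy on invented data (every array and count is hypothetical). The two-proportion z-test is computed by hand with a pooled standard error, which is one common formulation; dedicated library helpers exist but are not assumed here.

```python
import numpy as np
from scipy import stats

# All numbers below are invented for illustration
before = np.array([72, 68, 75, 80, 66, 71, 77, 69])
after = np.array([70, 65, 74, 76, 63, 70, 73, 66])

# Paired t-test: the same subjects measured twice
t_paired, p_paired = stats.ttest_rel(before, after)

# Welch's t-test: two independent groups, equal variances not assumed
group_a = np.array([45, 52, 48, 61, 55, 50, 47, 53])
group_b = np.array([58, 63, 70, 55, 67, 72, 60, 65])
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# One-way ANOVA: three or more independent groups
group_c = np.array([49, 51, 56, 60, 54, 57, 52, 58])
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

# Spearman rank correlation: monotonic relationship between two variables
satisfaction = np.array([1, 2, 2, 3, 4, 4, 5, 5])
repeat_orders = np.array([0, 1, 0, 2, 3, 5, 4, 8])
rho, p_spearman = stats.spearmanr(satisfaction, repeat_orders)

# Two-proportion z-test, computed by hand with a pooled standard error
conv_a, n_a = 120, 2000
conv_b, n_b = 156, 2000
p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_prop = 2 * stats.norm.sf(abs(z))  # two-sided p-value

print(f"Paired t-test:     p = {p_paired:.4f}")
print(f"Welch's t-test:    p = {p_welch:.4f}")
print(f"One-way ANOVA:     p = {p_anova:.4f}")
print(f"Spearman rho:      {rho:.3f} (p = {p_spearman:.4f})")
print(f"Two-proportion z:  z = {z:.3f}, p = {p_prop:.4f}")
```

A significant ANOVA only says that at least one group mean differs; a post-hoc comparison such as Tukey HSD is still needed to identify which pairs differ.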
Summary
Statistical analysis is the foundation of credible data work. Descriptive statistics characterise what the data looks like; probability distributions model the processes that generate it; hypothesis tests distinguish real signals from random noise; confidence intervals quantify uncertainty; and regression models the relationships between variables. Mastering these tools — and knowing which one to apply in which situation — is what separates analysts who describe data from those who extract actionable insight from it.