Statistical Analysis Fundamentals for Data Analysts

Why Statistics Is the Language of Data Analysis

Statistics provides the formal framework for turning raw numbers into defensible conclusions. Without it, analysts are limited to describing what happened — with it, they can quantify uncertainty, test whether observed differences are real, and make predictions with known confidence levels. Every core task in data analysis — summarising distributions, comparing groups, building models, designing experiments — is grounded in statistical reasoning.

This guide covers the practical statistical concepts every data analyst needs: descriptive statistics, probability distributions, hypothesis testing, confidence intervals, and regression fundamentals, with Python implementations throughout.

Descriptive Statistics

Descriptive statistics summarise a dataset's central tendency, spread, and shape. Always compute all three — a mean alone conceals whether data is tightly clustered or wildly spread.

Statistic	Formula / Function	When to Use	Sensitive to Outliers?
Mean	sum(x) / n	Symmetric distributions without extreme outliers	Yes
Median	Middle value when sorted	Skewed distributions; income, house prices, session durations	No
Mode	Most frequent value	Categorical data; detecting default/placeholder values	No
Standard Deviation	sqrt(variance)	Measuring spread around the mean	Yes
IQR (Q3 − Q1)	75th percentile minus 25th	Robust spread measure; outlier detection	No
Skewness	Third standardised moment	Measuring asymmetry; positive = right tail	Yes
Kurtosis	Fourth standardised moment	Measuring tail heaviness; >3 = heavier tails than a normal distribution (equivalently, excess kurtosis >0)	Yes

Descriptive Statistics in Python

import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv("orders.csv")
x = df["order_value"].dropna()

# Central tendency
print(f"Mean:    {x.mean():.2f}")
print(f"Median:  {x.median():.2f}")
print(f"Mode:    {x.mode()[0]:.2f}")

# Spread
print(f"Std Dev: {x.std():.2f}")
print(f"IQR:     {x.quantile(0.75) - x.quantile(0.25):.2f}")

# Shape
print(f"Skewness:  {x.skew():.3f}")   # >0 = right-skewed
print(f"Kurtosis:  {x.kurtosis():.3f}")  # excess kurtosis: normal = 0, >0 = heavy tails

# Full percentile profile
percentiles = [5, 10, 25, 50, 75, 90, 95, 99]
print(x.quantile([p/100 for p in percentiles]))
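The table above lists outlier detection as a use for the IQR. A minimal sketch of one common convention, Tukey's 1.5 × IQR fences, is shown here. The `order_values` series is made-up illustrative data, not drawn from `orders.csv`, and the 1.5 multiplier is a convention rather than a rule.

```python
import pandas as pd

# Hypothetical order values; the last entry is a deliberately extreme value
order_values = pd.Series([20.0, 22.5, 19.0, 25.0, 23.0, 21.5, 24.0, 250.0])

q1 = order_values.quantile(0.25)
q3 = order_values.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (order_values < lower_fence) | (order_values > upper_fence)
outliers = order_values[is_outlier]

print(f"IQR: {iqr:.2f}, fences: ({lower_fence:.2f}, {upper_fence:.2f})")
print("Flagged values:", outliers.tolist())

# The mean shifts sharply once the flagged value is excluded; the median barely moves
print(f"Mean (all / unflagged):   {order_values.mean():.2f} / {order_values[~is_outlier].mean():.2f}")
print(f"Median (all / unflagged): {order_values.median():.2f} / {order_values[~is_outlier].median():.2f}")
```

Flagged points are candidates for inspection, not automatic deletion: a large order may be a data-entry error or a genuine high-value customer.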

Probability Distributions Every Analyst Should Know

Distribution	Shape / Parameters	Typical Use in Analytics
Normal (Gaussian)	Bell curve; defined by mean μ and std σ	Modelling continuous metrics that cluster around an average; many statistical tests assume normality
Log-Normal	Right-skewed; log of values is normal	Revenue, session duration, file sizes — quantities that cannot go below zero but can be very large
Binomial	Discrete; n trials, probability p of success	Conversion rates, click-through rates, defect counts
Poisson	Discrete; events per fixed interval, parameter λ	Arrivals, page views per minute, support tickets per day
Exponential	Continuous; time between events, parameter λ	Time between purchases, time-to-failure, churn timing
Uniform	Equal probability across a range	Baseline comparison; random number generation in simulations
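Several of the distributions in the table can be queried directly through `scipy.stats`. The sketch below uses invented rates and counts (a 5% conversion rate, 10 tickets per day, and so on) purely for illustration. Note that SciPy parameterises the exponential distribution by `scale = 1 / λ` rather than by λ itself.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Binomial: probability of at least 60 conversions in 1,000 visits at a 5% rate.
# sf(k) is P(X > k), so sf(59) gives P(X >= 60).
p_at_least_60 = stats.binom.sf(59, n=1000, p=0.05)

# Poisson: probability of more than 15 tickets in a day when the average is 10
p_more_than_15 = stats.poisson.sf(15, mu=10)

# Exponential: probability that the gap between events exceeds 3 time units
# when events occur at rate lambda = 0.5 per unit (scale = 1 / lambda)
p_gap_over_3 = stats.expon.sf(3, scale=1 / 0.5)

# Log-normal: simulated values are right-skewed, but their logs are roughly symmetric
revenue = rng.lognormal(mean=3.0, sigma=0.8, size=5000)
skew_raw = stats.skew(revenue)
skew_log = stats.skew(np.log(revenue))

print(f"P(conversions >= 60): {p_at_least_60:.4f}")
print(f"P(tickets > 15):      {p_more_than_15:.4f}")
print(f"P(gap > 3):           {p_gap_over_3:.4f}")
print(f"Skewness raw / log:   {skew_raw:.2f} / {skew_log:.2f}")
```

Comparing the skewness before and after a log transform is a quick, informal check of whether a log-normal model is plausible for a metric.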

Hypothesis Testing

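A hypothesis test evaluates whether an observed pattern in data could plausibly be explained by chance alone. The framework is always the same: define null and alternative hypotheses, choose a significance level (commonly 0.05), compute a test statistic, and compare the resulting p-value to the threshold. The p-value is the probability of seeing a result at least as extreme as the observed one if the null hypothesis were true.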

from scipy import stats
import numpy as np

# Two-sample t-test: do two groups have different means?
# Note: ttest_ind assumes equal variances by default; pass equal_var=False for Welch's test
group_a = np.array([45, 52, 48, 61, 55, 50, 47, 53])
group_b = np.array([58, 63, 70, 55, 67, 72, 60, 65])

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value:     {p_value:.4f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")

# One-sample t-test: is the mean different from a target?
target_mean = 50
t_stat2, p_value2 = stats.ttest_1samp(group_a, popmean=target_mean)
print(f"\nOne-sample t: p = {p_value2:.4f}")

# Chi-squared test: are two categorical variables independent?
observed = [[120, 80], [95, 105]]
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"\nChi-squared: p = {p:.4f}")
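To make the idea of "a result at least as extreme under the null hypothesis" concrete, the sketch below estimates a p-value by permutation for the same two groups used in the t-test. This is a different technique from the t-test, shown only as an illustration: if group labels do not matter, shuffling them should produce differences as large as the observed one reasonably often. The number of permutations and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

group_a = np.array([45, 52, 48, 61, 55, 50, 47, 53])
group_b = np.array([58, 63, 70, 55, 67, 72, 60, 65])

observed_diff = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

n_permutations = 10_000
count_extreme = 0
for _ in range(n_permutations):
    # Under H0 the labels are interchangeable, so shuffle and re-split
    shuffled = rng.permutation(pooled)
    diff = shuffled[n_a:].mean() - shuffled[:n_a].mean()
    if abs(diff) >= abs(observed_diff):   # two-sided comparison
        count_extreme += 1

# Adding 1 to numerator and denominator keeps the estimate above zero
p_perm = (count_extreme + 1) / (n_permutations + 1)

print(f"Observed difference in means: {observed_diff:.3f}")
print(f"Permutation p-value:          {p_perm:.4f}")
```

A permutation test makes no normality assumption, which can make it a useful cross-check on a t-test when samples are small.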

Confidence Intervals

A confidence interval gives a range of plausible values for the true population parameter. A 95% CI means that if you repeated the experiment 100 times, roughly 95 of the resulting intervals would contain the true value. In the frequentist framework it does not mean there is a 95% chance the true value is in this particular interval: once computed, a given interval either contains the true value or it does not.

import numpy as np
from scipy import stats

data = np.array([42, 55, 61, 48, 53, 58, 50, 47, 64, 52])
n    = len(data)
mean = data.mean()
se   = stats.sem(data)   # standard error of the mean

# 95% confidence interval
ci_low, ci_high = stats.t.interval(0.95, df=n-1, loc=mean, scale=se)
print(f"Mean: {mean:.2f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")

# Margin of error
margin = ci_high - mean
print(f"Margin of error: ±{margin:.2f}")

Confidence intervals are more informative than p-values alone because they communicate both statistical significance and practical significance (the magnitude of the effect). A very narrow CI around a small effect may be significant but not worth acting on.
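The repeated-sampling interpretation of a 95% CI can be checked by simulation. The sketch below assumes a hypothetical normal population with a known mean, draws many small samples, builds a t-based interval from each, and counts how often the interval contains the true mean. The population parameters, sample size, and number of repetitions are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

true_mean, true_sd = 50.0, 8.0    # hypothetical population parameters
n, n_experiments = 10, 2000

covered = 0
for _ in range(n_experiments):
    sample = rng.normal(true_mean, true_sd, size=n)
    se = stats.sem(sample)
    low, high = stats.t.interval(0.95, df=n - 1, loc=sample.mean(), scale=se)
    if low <= true_mean <= high:
        covered += 1

coverage = covered / n_experiments
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should land close to 0.95, with some simulation noise. Each individual interval either contains the true mean or misses it; the 95% describes the procedure, not any single interval.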

Linear Regression Fundamentals

Linear regression models the relationship between a continuous outcome variable and one or more predictors. It is among the most interpretable predictive models and forms the conceptual basis for more advanced techniques like logistic regression, ridge regression, and neural networks.

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

df = pd.read_csv("sales.csv")
X = df[["marketing_spend", "price_discount", "days_in_market"]]
y = df["units_sold"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Coefficients:")
for feat, coef in zip(X.columns, model.coef_):
    print(f"  {feat}: {coef:.3f}")
print(f"Intercept: {model.intercept_:.3f}")
print(f"MAE:  {mean_absolute_error(y_test, y_pred):.2f}")
print(f"R²:   {r2_score(y_test, y_pred):.3f}")

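The R² score measures the proportion of variance in the outcome explained by the model (1 = perfect fit, 0 = no better than always predicting the mean). An R² of 0.70 means the features explain 70% of the variation in units sold. On held-out data, as computed here, R² can also be negative when the model predicts worse than the mean.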
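The definition of R² can be written out directly as 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations from the mean. The sketch below computes it by hand on synthetic data (a stand-in, since `sales.csv` is not available here) and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)

# Synthetic stand-in data: one predictor, a linear signal, and Gaussian noise
X = rng.uniform(0, 100, size=(200, 1))
y = 3.0 * X[:, 0] + 20.0 + rng.normal(0, 25, size=200)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

ss_res = np.sum((y - y_pred) ** 2)        # variation left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)      # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

# Baseline: always predicting the mean gives R² = 0 by construction
r2_baseline = r2_score(y, np.full_like(y, y.mean()))

print(f"R² (manual):   {r2_manual:.4f}")
print(f"R² (sklearn):  {r2_score(y, y_pred):.4f}")
print(f"R² (mean-only baseline): {r2_baseline:.4f}")
```

Seeing the mean-only baseline score zero clarifies what R² is measured against: any model that predicts worse than the mean ends up below zero.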

Choosing the Right Statistical Test

Question	Data Type	Test to Use
Is the mean of one group different from a target value?	Continuous	One-sample t-test
Are the means of two independent groups different?	Continuous	Two-sample t-test (Welch's if unequal variance)
Are the means of 3+ groups different?	Continuous	One-way ANOVA, followed by Tukey HSD post-hoc
Are two categorical variables independent?	Categorical	Chi-squared test of independence
Are two conversion rates different?	Binary / Proportions	Two-proportion z-test
Is there a monotonic relationship between two variables?	Ordinal / Non-normal	Spearman rank correlation
Are two paired measurements different (before/after)?	Continuous	Paired t-test
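Several tests from the table are not demonstrated earlier in this guide. The sketch below runs them with `scipy.stats` on small invented samples. The third group, the pairing of `group_a` with `group_b` as before/after measurements, and the conversion counts are all assumptions made for illustration. The two-proportion z-test is computed by hand with a pooled standard error rather than through a dedicated library function.

```python
import numpy as np
from scipy import stats

# Hypothetical samples
group_a = np.array([45, 52, 48, 61, 55, 50, 47, 53])
group_b = np.array([58, 63, 70, 55, 67, 72, 60, 65])
group_c = np.array([49, 57, 54, 60, 52, 58, 51, 56])

# Welch's t-test: two independent groups, no equal-variance assumption
_, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# One-way ANOVA: do three or more group means differ?
_, p_anova = stats.f_oneway(group_a, group_b, group_c)

# Paired t-test: treats a[i] and b[i] as before/after readings on the same subject
_, p_paired = stats.ttest_rel(group_a, group_b)

# Spearman rank correlation: monotonic association between two variables
rho, p_spearman = stats.spearmanr(group_a, group_b)

# Two-proportion z-test (pooled standard error), computed manually
conversions = np.array([120, 95])
visitors = np.array([200, 200])
rates = conversions / visitors
p_pool = conversions.sum() / visitors.sum()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / visitors[0] + 1 / visitors[1]))
z = (rates[0] - rates[1]) / se
p_ztest = 2 * stats.norm.sf(abs(z))      # two-sided p-value

print(f"Welch's t-test:    p = {p_welch:.4f}")
print(f"One-way ANOVA:     p = {p_anova:.4f}")
print(f"Paired t-test:     p = {p_paired:.4f}")
print(f"Spearman:          rho = {rho:.3f}, p = {p_spearman:.4f}")
print(f"Two-proportion z:  z = {z:.3f}, p = {p_ztest:.4f}")
```

A significant ANOVA result only says that at least one mean differs; a post-hoc procedure such as Tukey HSD is still needed to identify which pairs.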

Summary

Statistical analysis is the foundation of credible data work. Descriptive statistics characterise what the data looks like; probability distributions model the processes that generate it; hypothesis tests distinguish real signals from random noise; confidence intervals quantify uncertainty; and regression models the relationships between variables. Mastering these tools — and knowing which one to apply in which situation — is what separates analysts who describe data from those who extract actionable insight from it.