Introduction to Statistical Analysis
Statistics is the language of data. As a data analyst, you will constantly use statistical concepts to summarize datasets, understand distributions, identify patterns, and communicate findings clearly. Without a solid grounding in statistics, analysis risks becoming surface-level description rather than genuine insight.
This article covers the essential statistical concepts every data analyst should master: measures of central tendency, measures of spread, probability distributions, skewness, kurtosis, correlation, and hypothesis testing.
Measures of Central Tendency
Measures of central tendency describe the typical or central value of a dataset. The three most important are the mean, median, and mode.
The mean is the arithmetic average: the sum of all values divided by the count. It is widely used but sensitive to outliers. A single extreme value can pull the mean far from the center of most values.
The median is the middle value when the data is sorted. It is robust to outliers and is often preferred for skewed distributions such as income or house prices.
The mode is the most frequently occurring value. It is most useful for categorical data and can reveal multimodal distributions.
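To see this outlier sensitivity in action, here is a minimal sketch (the extreme value 200 is made up for illustration):
import numpy as np
values = [10, 12, 14, 15, 18]
with_outlier = values + [200]
# The single outlier drags the mean far upward...
print(np.mean(values), np.median(values))              # 13.8 14.0
# ...while the median barely moves
print(np.mean(with_outlier), np.median(with_outlier))  # ~44.83 14.5
The mean more than triples while the median shifts by half a point. The basic computations for this article's sample dataset follow: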
import numpy as np
from scipy import stats
data = [12, 15, 14, 10, 18, 14, 22, 14, 19, 11]
mean = np.mean(data) # 14.9
median = np.median(data) # 14.0
mode_result = stats.mode(data, keepdims=False) # 14 (appears 3 times)
print(f"Mean: {mean}, Median: {median}, Mode: {mode_result.mode}")

Measures of Spread: Variance and Standard Deviation
Two datasets can have the same mean but very different spreads. Measures of spread quantify how dispersed the values are around the center.
Variance is the average of squared deviations from the mean. Squaring ensures positive and negative deviations do not cancel. A higher variance means data is more dispersed.
Standard deviation is the square root of variance, expressed in the same units as the data. For a normal distribution, approximately 68% of values fall within one standard deviation of the mean and 95% within two standard deviations.
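The empirical rule can be verified directly from the normal CDF (a quick sketch using scipy):
from scipy.stats import norm
# Probability mass within k standard deviations of the mean
within_1 = norm.cdf(1) - norm.cdf(-1)   # ~0.6827
within_2 = norm.cdf(2) - norm.cdf(-2)   # ~0.9545
print(f"Within 1 SD: {within_1:.4f}, within 2 SD: {within_2:.4f}")
Computing the spread of the article's sample data: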
# Population variance and standard deviation
variance = np.var(data)
std_dev = np.std(data)
# Sample statistics (Bessel's correction: divide by n-1)
sample_variance = np.var(data, ddof=1)
sample_std_dev = np.std(data, ddof=1)
print(f"Variance: {sample_variance:.2f}, Std Dev: {sample_std_dev:.2f}")The Interquartile Range (IQR) — the difference between the 75th and 25th percentile — is another robust measure of spread that is less sensitive to outliers than standard deviation.
Probability Distributions
A probability distribution describes how likely each value is in a dataset or random variable. Understanding distributions allows analysts to model data, select appropriate statistical tests, and make probabilistic predictions.
The normal distribution (Gaussian or bell curve) is the most important in statistics. It is symmetric around the mean, and many natural phenomena approximate it. Many statistical tests assume normality.
Other distributions analysts commonly encounter include the binomial distribution (number of successes in binary trials), the Poisson distribution (event counts over a fixed interval), the exponential distribution (time between events), and the uniform distribution (all values equally probable).
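As a brief illustration (the parameter values here are arbitrary), scipy.stats exposes all of these distributions through a common interface:
from scipy.stats import binom, poisson, expon, uniform
print(binom.pmf(3, n=10, p=0.5))   # P(3 successes in 10 fair trials), ~0.1172
print(poisson.pmf(2, mu=4))        # P(2 events at a mean rate of 4), ~0.1465
print(expon.cdf(1, scale=2))       # P(wait <= 1 when mean wait is 2), ~0.3935
print(uniform.cdf(0.25))           # P(value <= 0.25) on [0, 1], 0.25
A histogram overlaid with a fitted normal curve, followed by a Shapiro-Wilk test, gives a quick normality check: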
import matplotlib.pyplot as plt
from scipy.stats import norm, shapiro
# Visualize normality
plt.hist(data, bins=15, density=True, alpha=0.7)
x = np.linspace(min(data), max(data), 100)
plt.plot(x, norm.pdf(x, np.mean(data), np.std(data)), 'r-', lw=2)
plt.title('Distribution vs Normal Curve')
plt.show()
# Shapiro-Wilk normality test
stat, p = shapiro(data)
print(f"p-value: {p:.4f} — {'normal' if p > 0.05 else 'not normal'}")Skewness
Skewness measures the asymmetry of a distribution. A perfectly symmetric distribution has zero skewness. A right-skewed (positive) distribution has a long tail to the right — the mean exceeds the median, and there are a few very high values. Income and wealth distributions are classic examples. A left-skewed (negative) distribution has a long tail to the left.
Skewness matters because many statistical methods assume normality. Highly skewed data often needs transformation (e.g., a log transformation) before applying these techniques.
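As a minimal sketch of this (using a synthetic right-skewed sample, since the small dataset above is roughly symmetric), a log transform pulls in the long right tail:
from scipy.stats import skew
rng = np.random.default_rng(42)
skewed_sample = rng.lognormal(mean=0, sigma=1, size=1000)  # strictly positive, right-skewed
print(f"Before: {skew(skewed_sample):.2f}")          # strongly positive
print(f"After:  {skew(np.log(skewed_sample)):.2f}")  # near zero
Measuring skewness on the article's sample data is then a one-liner: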
from scipy.stats import skew
skewness = skew(data)
print(f"Skewness: {skewness:.4f}")
# Positive → right-skewed, Negative → left-skewed, ~0 → symmetric

Kurtosis
Kurtosis measures the tailedness of a distribution — how much probability mass lies in the tails compared to a normal distribution. High kurtosis (leptokurtic) means heavy tails and more extreme outliers. Low kurtosis (platykurtic) means light tails and a flatter shape. The normal distribution has a kurtosis of 3; most libraries report excess kurtosis (kurtosis minus 3), so the normal distribution has excess kurtosis of 0.
from scipy.stats import kurtosis
kurt = kurtosis(data) # Excess kurtosis by default
print(f"Excess Kurtosis: {kurt:.4f}")
# > 0 → heavier tails than normal
# < 0 → lighter tails than normal

Correlation and Covariance
Correlation measures the strength and direction of the linear relationship between two variables. The Pearson correlation coefficient ranges from -1 to +1: a value near +1 indicates a strong positive relationship, a value near -1 a strong negative relationship, and a value near 0 no linear relationship.
Covariance is similar but not normalized, making its magnitude hard to interpret in isolation. Correlation standardizes covariance, making it comparable across variables.
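This is easy to verify numerically (a quick sketch; the two arrays match the example below): dividing the covariance by the product of the standard deviations recovers the Pearson coefficient.
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
cov_xy = np.cov(x, y, ddof=1)[0, 1]  # sample covariance
pearson_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(f"Covariance: {cov_xy:.4f}, Pearson: {pearson_manual:.4f}")  # 1.5000, 0.7746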
When the data is ordinal or the relationship is monotonic but non-linear, the Spearman rank correlation is a more appropriate measure.
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 4, 5, 4, 5]})
pearson = df['x'].corr(df['y'])
spearman = df['x'].corr(df['y'], method='spearman')
print(f"Pearson: {pearson:.4f}, Spearman: {spearman:.4f}")Hypothesis Testing
Hypothesis testing is a formal framework for deciding whether an observed effect in a sample is likely to reflect a real effect in the population. The null hypothesis (H₀) states there is no effect or difference. The alternative hypothesis (H₁) states the opposite. The p-value is the probability of observing data at least as extreme as yours if H₀ were true. A p-value below 0.05 is conventionally taken as evidence to reject H₀.
Common tests include the t-test (comparing means of two groups), ANOVA (comparing means of three or more groups), the chi-squared test (comparing categorical distributions), and the Mann-Whitney U test (non-parametric alternative to the t-test).
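As one example (a minimal sketch with made-up samples; the t-test example follows), the Mann-Whitney U test is a single scipy call:
from scipy.stats import mannwhitneyu
sample_a = [1.2, 3.4, 2.2, 5.1, 2.8]
sample_b = [4.5, 6.1, 5.8, 7.0, 5.5]
u_stat, p = mannwhitneyu(sample_a, sample_b, alternative='two-sided')
print(f"U={u_stat}, p={p:.4f}")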
from scipy.stats import ttest_ind
group_a = [23, 25, 28, 22, 27]
group_b = [30, 32, 29, 31, 28]
t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t={t_stat:.4f}, p={p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: groups differ significantly")

Conclusion
Statistical analysis is the core skill that transforms raw data into defensible insight. Understanding central tendency, spread, distributions, skewness, kurtosis, correlation, and hypothesis testing equips you to accurately describe data, identify patterns, and draw conclusions that hold up to scrutiny. These concepts are the foundation on which all higher-level data analysis and machine learning are built.