Introduction to Statistical Analysis
Statistics is the language of data. As a data analyst, you will constantly use statistical concepts to summarize data, understand distributions, identify patterns, and communicate findings. Without a solid grounding in statistics, data analysis risks becoming nothing more than surface-level description.
This article covers the essential statistical concepts every data analyst should understand: measures of central tendency (mean, median, mode), measures of spread (variance and standard deviation), probability distributions, and shape descriptors (skewness and kurtosis).
Measures of Central Tendency
Measures of central tendency describe the center or typical value of a dataset. The three most important are the mean, median, and mode.
The mean is the arithmetic average: the sum of all values divided by the number of values. It is the most widely used measure but is sensitive to extreme values (outliers). For example, the mean salary in a company can be misleadingly high if a few executives earn significantly more than most employees.
The median is the middle value when the dataset is sorted in order. If the dataset has an even number of values, the median is the average of the two middle values. The median is robust to outliers and is often preferred for skewed distributions, such as income or house prices.
The mode is the value that appears most frequently. It is most useful for categorical data. A dataset can have no mode, one mode (unimodal), or multiple modes (bimodal, multimodal).
import numpy as np
import pandas as pd
from scipy import stats
data = [12, 15, 14, 10, 18, 14, 22, 14, 19, 11]
mean = np.mean(data) # 14.9
median = np.median(data) # 14.0
mode_result = stats.mode(data, keepdims=False) # 14 (appears 3 times)
print(f"Mean: {mean}, Median: {median}, Mode: {mode_result.mode}")
Measures of Spread: Variance and Standard Deviation
Knowing the center of a dataset is not enough. You also need to know how spread out the values are. Two datasets can have the same mean but very different spreads — one tightly clustered, the other highly dispersed.
Variance measures the average squared deviation from the mean. Squaring ensures that positive and negative deviations do not cancel each other out. A higher variance means data points are more spread out from the mean.
Standard deviation is the square root of variance. It expresses spread in the same units as the original data, making it more interpretable than variance. For normally distributed data, about 68% of values fall within one standard deviation of the mean, and 95% within two standard deviations.
variance = np.var(data) # population variance
std_dev = np.std(data) # population standard deviation
# For sample statistics (divides by n-1 instead of n)
sample_variance = np.var(data, ddof=1)
sample_std_dev = np.std(data, ddof=1)
The range (max - min) is another simple measure of spread, but it is highly sensitive to outliers. The interquartile range (IQR), the difference between the 75th and 25th percentiles, is a more robust alternative.
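As a quick sketch (variable names are illustrative), the IQR can be computed with `np.percentile`, and the 68% rule mentioned above can be checked empirically on simulated normal data:

```python
import numpy as np

data = [12, 15, 14, 10, 18, 14, 22, 14, 19, 11]

# IQR: difference between the 75th and 25th percentiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")

# Empirical check of the 68% rule on a large simulated normal sample
rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=0, scale=1, size=100_000)
within_one_sd = np.mean(np.abs(normal_sample) <= 1)
print(f"Share within 1 SD: {within_one_sd:.3f}")  # close to 0.68
```

Note that `np.percentile` interpolates between data points by default, so the quartiles need not be values that actually occur in the dataset.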
Probability Distributions
A probability distribution describes how the values of a random variable are distributed. Understanding distributions helps analysts model data, make predictions, and apply the right statistical tests.
The normal distribution (also called the Gaussian distribution or bell curve) is the most important distribution in statistics. It is symmetric around the mean, and many natural phenomena — heights, measurement errors, test scores — approximate it. Many statistical tests assume normality.
Other common distributions include the binomial distribution (number of successes in n binary trials), the Poisson distribution (count of events in a fixed interval of time or space), the exponential distribution (time between events), and the uniform distribution (all values equally likely).
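A hedged sketch of what sampling from these distributions looks like with NumPy's random `Generator` (parameter values here are arbitrary, chosen only for illustration); the sample means should land near the theoretical means:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from each distribution mentioned above
binomial = rng.binomial(n=10, p=0.5, size=10_000)      # successes in 10 trials
poisson = rng.poisson(lam=3, size=10_000)              # events per fixed interval
exponential = rng.exponential(scale=2.0, size=10_000)  # time between events
uniform = rng.uniform(low=0, high=1, size=10_000)      # all values equally likely

# Sample means should sit near the theoretical means (5, 3, 2, 0.5)
for name, sample in [("binomial", binomial), ("poisson", poisson),
                     ("exponential", exponential), ("uniform", uniform)]:
    print(f"{name}: mean={sample.mean():.3f}")
```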
You can check whether your data follows a normal distribution using visual tools like histograms and Q-Q plots, or statistical tests like the Shapiro-Wilk test:
import matplotlib.pyplot as plt
from scipy.stats import shapiro, norm
# Visual check
plt.hist(data, bins=20, density=True)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, np.mean(data), np.std(data))
plt.plot(x, p, 'k', linewidth=2)
plt.show()
# Statistical test
stat, p_value = shapiro(data)
print(f"Shapiro-Wilk: stat={stat:.4f}, p={p_value:.4f}")
Skewness
Skewness measures the asymmetry of a distribution. A symmetric distribution has zero skewness. A positive skew (right-skewed) means the tail extends to the right — the mean is greater than the median, and there are a few very high values. Income distributions are typically right-skewed. A negative skew (left-skewed) means the tail extends to the left.
Understanding skewness is important because many statistical methods assume normality (no skew). Highly skewed data may need to be transformed (e.g., log transformation) before applying these methods.
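As a minimal sketch of the log-transformation idea (using a synthetic lognormal sample as a stand-in for right-skewed data like incomes), taking the log can bring the skewness close to zero:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
# A lognormal sample is strongly right-skewed
incomes = rng.lognormal(mean=10, sigma=1, size=10_000)

print(f"Skewness before: {skew(incomes):.2f}")             # large positive
print(f"Skewness after log: {skew(np.log(incomes)):.2f}")  # near zero
```

This works here because the log of a lognormal variable is exactly normal; for real data the transform typically reduces, rather than eliminates, the skew.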
from scipy.stats import skew
skewness = skew(data)
print(f"Skewness: {skewness:.4f}")
# Positive value = right-skewed, Negative = left-skewed
Kurtosis
Kurtosis measures the "tailedness" of a distribution — how much data is concentrated in the tails versus the center compared to a normal distribution. High kurtosis (leptokurtic) means heavy tails and a sharper peak, indicating more extreme outliers than a normal distribution. Low kurtosis (platykurtic) means light tails and a flatter peak.
The normal distribution has a kurtosis of 3. Many statistical packages report "excess kurtosis" (kurtosis minus 3), so that the normal distribution has excess kurtosis of 0.
from scipy.stats import kurtosis
kurt = kurtosis(data) # Returns excess kurtosis by default
print(f"Excess Kurtosis: {kurt:.4f}")
# > 0: heavier tails than normal
# < 0: lighter tails than normal
Correlation and Covariance
Correlation measures the strength and direction of the linear relationship between two variables. The Pearson correlation coefficient ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship. Spearman rank correlation is a non-parametric alternative that works with ordinal data or non-linear relationships.
Covariance is similar but not scaled, making values hard to interpret without context. Correlation is the standardized version of covariance and is therefore more useful for comparison.
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 4, 5, 4, 5]})
pearson_corr = df['x'].corr(df['y'])
spearman_corr = df['x'].corr(df['y'], method='spearman')
print(f"Pearson: {pearson_corr:.4f}, Spearman: {spearman_corr:.4f}")
Hypothesis Testing
Hypothesis testing is a statistical method to determine whether there is enough evidence in a sample to infer that a condition holds for the entire population. The null hypothesis (H₀) typically states that there is no effect or no difference. The alternative hypothesis (H₁) states the opposite. The p-value is the probability of observing data at least as extreme as what was actually observed, assuming H₀ is true — a p-value below 0.05 conventionally leads to rejecting H₀.
Common tests include the t-test (comparing means of two groups), chi-squared test (comparing categorical distributions), ANOVA (comparing means of three or more groups), and the Mann-Whitney U test (non-parametric alternative to the t-test).
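A hedged sketch of the two-sample t-test mentioned above, using `scipy.stats.ttest_ind` on two synthetic groups (the group parameters are made up for illustration):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
# Two synthetic groups whose true means genuinely differ
group_a = rng.normal(loc=50, scale=5, size=100)
group_b = rng.normal(loc=53, scale=5, size=100)

# H0: the two groups have equal means
stat, p_value = ttest_ind(group_a, group_b)
print(f"t={stat:.3f}, p={p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the means differ significantly")
else:
    print("Fail to reject H0")
```

For non-normal data or ordinal scales, the Mann-Whitney U test (`scipy.stats.mannwhitneyu`) follows the same call-and-interpret pattern.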
Conclusion
Statistical analysis is the core skill that transforms raw data into meaningful insight. Understanding central tendency, spread, distributions, skewness, kurtosis, correlation, and hypothesis testing gives you the tools to accurately describe data, identify patterns, and draw defensible conclusions. These concepts are the foundation on which all higher-level data analysis and machine learning are built.