Exploratory Data Analysis: A Practical Guide

What Is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the practice of systematically examining a dataset before building models or drawing conclusions. Popularised by statistician John Tukey in the 1970s, EDA emphasises visualisation, summary statistics, and pattern detection as a first step in any analytical workflow. The goal is not to confirm a hypothesis but to understand the data's structure, quality, and behaviour well enough to ask better questions.

A thorough EDA catches data quality issues early, reveals distributions that violate model assumptions, uncovers unexpected relationships between variables, and often surfaces the most important findings before any formal analysis begins. Skipping EDA is one of the most common reasons analytical projects produce misleading results.

The EDA Workflow

Stage	What to Do	Key Questions
1. Understand the Schema	List columns, data types, row counts, and source definitions	What does each column mean? What is the grain of the table?
2. Assess Data Quality	Count nulls, duplicates, and out-of-range values	Are there unexpected gaps? Are IDs unique where they should be?
3. Univariate Analysis	Examine each variable in isolation: distributions, extremes, modes	Is the distribution normal, skewed, or bimodal? Are there outliers?
4. Bivariate Analysis	Explore relationships between pairs of variables	Are correlated variables causally related or confounded?
5. Multivariate Analysis	Look for interactions and grouping patterns across 3+ variables	Does a relationship hold across all segments or only some?
6. Document Findings	Record anomalies, assumptions, and decisions made during EDA	What would a new team member need to know about this dataset?

Data Quality Checks in Python

The first thing to run on any new dataset is a quality audit. Pandas makes this concise:

import pandas as pd
import numpy as np

df = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Shape and dtypes
print(df.shape)
print(df.dtypes)

# Null counts and percentages
null_summary = pd.DataFrame({
    "null_count": df.isnull().sum(),
    "null_pct":   df.isnull().mean().mul(100).round(1)
}).query("null_count > 0").sort_values("null_pct", ascending=False)
print(null_summary)

# Duplicate rows
print("Duplicate rows:", df.duplicated().sum())

# Value ranges for numeric columns
print(df.describe())

# Cardinality for categorical columns
for col in df.select_dtypes(include="object").columns:
    print(f"{col}: {df[col].nunique()} unique values")

Univariate Analysis: Understanding Distributions

For numeric columns, histograms and box plots reveal shape, spread, and outliers. For categorical columns, bar charts show frequency imbalances that can bias models.

import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Histogram with KDE overlay
sns.histplot(df["order_value"], kde=True, ax=axes[0], bins=40)
axes[0].set_title("Distribution of Order Value")

# Box plot to highlight outliers
sns.boxplot(y=df["order_value"], ax=axes[1])
axes[1].set_title("Order Value Box Plot")

plt.tight_layout()
plt.show()

# Skewness and kurtosis
print(f"Skewness:  {df['order_value'].skew():.2f}")
print(f"Kurtosis:  {df['order_value'].kurtosis():.2f}")

# IQR-based outlier detection
Q1  = df["order_value"].quantile(0.25)
Q3  = df["order_value"].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df["order_value"] < Q1 - 1.5 * IQR) | (df["order_value"] > Q3 + 1.5 * IQR)]
print(f"Outliers: {len(outliers)} rows ({len(outliers)/len(df):.1%})")

Bivariate Analysis: Finding Relationships

Scatter plots and correlation matrices expose how pairs of variables move together. Always visualise before trusting a correlation coefficient — Anscombe's Quartet famously showed four datasets with identical statistics but completely different patterns.

# Correlation matrix heat map
import matplotlib.pyplot as plt
import seaborn as sns

numeric_df = df.select_dtypes(include="number")
corr = numeric_df.corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, square=True)
plt.title("Correlation Matrix")
plt.tight_layout()
plt.show()

# Scatter plot with regression line
sns.regplot(data=df, x="days_since_signup", y="order_value",
            scatter_kws={"alpha": 0.3}, line_kws={"color": "red"})
plt.title("Order Value vs Days Since Signup")
plt.show()

A high correlation between two features does not imply causation. When correlations are high, check whether a third variable (a confounder) could be driving both.

EDA in SQL: Quick Profiling Queries

When a dataset lives in a warehouse and is too large to pull into Python, SQL-based profiling is faster and avoids unnecessary data movement.

-- Profile a single table: null rates, min, max, approx distinct SELECT COUNT(*) AS total_rows, COUNT(order_id) AS non_null_order_id, COUNT(DISTINCT order_id) AS distinct_orders, MIN(order_date) AS earliest_date, MAX(order_date) AS latest_date, ROUND(AVG(order_value), 2) AS avg_value, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY order_value) AS median_value, STDDEV(order_value) AS stddev_value, SUM(CASE WHEN order_value IS NULL THEN 1 END) * 100.0 / COUNT(*) AS pct_null_value FROM orders; -- Frequency distribution for a categorical column SELECT country, COUNT(*) AS orders, ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct FROM orders GROUP BY country ORDER BY orders DESC LIMIT 20;

Common EDA Red Flags

Red Flag	What It Might Mean	Next Step
High null rate in a key column	Data pipeline failure, optional field never filled, or systematic missingness for a subgroup	Check if nulls are random or correlated with another variable (MCAR vs MAR vs MNAR)
Unexpected value spikes on round numbers	Rounding artefact, default value used as placeholder, or data entry bias	Verify with source system; ask if 0, 100, and 1000 are real or sentinel values
Dates outside expected range	Timezone conversion errors, default dates (1970-01-01), or future-dated test records	Filter to plausible range; trace back to ingestion logic
Perfect correlation between two features	One is derived from the other (leakage) or they are the same column with two names	Drop one; investigate data lineage
Bimodal distribution	Two distinct sub-populations mixed together	Segment by a categorical variable to see if the modes separate cleanly
Constant column	Feature has no variance and carries no information	Drop from analysis; investigate why it is constant

Red Flag

What It Might Mean

Next Step

High null rate in a key column

Data pipeline failure, optional field never filled, or systematic missingness for a subgroup

Check if nulls are random or correlated with another variable (MCAR vs MAR vs MNAR)

Unexpected value spikes on round numbers

Rounding artefact, default value used as placeholder, or data entry bias

Verify with source system; ask if 0, 100, and 1000 are real or sentinel values

Dates outside expected range

Timezone conversion errors, default dates (1970-01-01), or future-dated test records

Filter to plausible range; trace back to ingestion logic

Perfect correlation between two features

One is derived from the other (leakage) or they are the same column with two names

Drop one; investigate data lineage

Bimodal distribution

Two distinct sub-populations mixed together

Segment by a categorical variable to see if the modes separate cleanly

Constant column

Feature has no variance and carries no information

Drop from analysis; investigate why it is constant

Summary

EDA is not a box to check before the "real" analysis — it is the foundation that makes everything else reliable. A disciplined EDA routine catches problems that would otherwise invalidate models, embarrass analysts in front of stakeholders, or lead to bad business decisions. Use Python for flexible profiling and visualisation on manageable datasets, SQL for fast profiling directly in the warehouse, and always document what you find. The discoveries made during EDA often turn out to be the most valuable outputs of the entire analytical project.