What Is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is the practice of systematically examining a dataset before building models or drawing conclusions. Popularised by statistician John Tukey in the 1970s, EDA emphasises visualisation, summary statistics, and pattern detection as a first step in any analytical workflow. The goal is not to confirm a hypothesis but to understand the data's structure, quality, and behaviour well enough to ask better questions.
A thorough EDA catches data quality issues early, reveals distributions that violate model assumptions, uncovers unexpected relationships between variables, and often surfaces the most important findings before any formal analysis begins. Skipping EDA is one of the most common reasons analytical projects produce misleading results.
The EDA Workflow
| Stage | What to Do | Key Questions |
|---|---|---|
| 1. Understand the Schema | List columns, data types, row counts, and source definitions | What does each column mean? What is the grain of the table? |
| 2. Assess Data Quality | Count nulls, duplicates, and out-of-range values | Are there unexpected gaps? Are IDs unique where they should be? |
| 3. Univariate Analysis | Examine each variable in isolation: distributions, extremes, modes | Is the distribution normal, skewed, or bimodal? Are there outliers? |
| 4. Bivariate Analysis | Explore relationships between pairs of variables | Are correlated variables causally related or confounded? |
| 5. Multivariate Analysis | Look for interactions and grouping patterns across 3+ variables | Does a relationship hold across all segments or only some? |
| 6. Document Findings | Record anomalies, assumptions, and decisions made during EDA | What would a new team member need to know about this dataset? |
Data Quality Checks in Python
The first thing to run on any new dataset is a quality audit. Pandas makes this concise:
import pandas as pd
import numpy as np
df = pd.read_csv("orders.csv", parse_dates=["order_date"])
# Shape and dtypes
print(df.shape)
print(df.dtypes)
# Null counts and percentages
null_summary = pd.DataFrame({
"null_count": df.isnull().sum(),
"null_pct": df.isnull().mean().mul(100).round(1)
}).query("null_count > 0").sort_values("null_pct", ascending=False)
print(null_summary)
# Duplicate rows
print("Duplicate rows:", df.duplicated().sum())
# Value ranges for numeric columns
print(df.describe())
# Cardinality for categorical columns
for col in df.select_dtypes(include="object").columns:
print(f"{col}: {df[col].nunique()} unique values")
Univariate Analysis: Understanding Distributions
For numeric columns, histograms and box plots reveal shape, spread, and outliers. For categorical columns, bar charts show frequency imbalances that can bias models.
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Histogram with KDE overlay
sns.histplot(df["order_value"], kde=True, ax=axes[0], bins=40)
axes[0].set_title("Distribution of Order Value")
# Box plot to highlight outliers
sns.boxplot(y=df["order_value"], ax=axes[1])
axes[1].set_title("Order Value Box Plot")
plt.tight_layout()
plt.show()
# Skewness and kurtosis
print(f"Skewness: {df['order_value'].skew():.2f}")
print(f"Kurtosis: {df['order_value'].kurtosis():.2f}")
# IQR-based outlier detection
Q1 = df["order_value"].quantile(0.25)
Q3 = df["order_value"].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df["order_value"] < Q1 - 1.5 * IQR) | (df["order_value"] > Q3 + 1.5 * IQR)]
print(f"Outliers: {len(outliers)} rows ({len(outliers)/len(df):.1%})")
Bivariate Analysis: Finding Relationships
Scatter plots and correlation matrices expose how pairs of variables move together. Always visualise before trusting a correlation coefficient — Anscombe's Quartet famously showed four datasets with identical statistics but completely different patterns.
# Correlation matrix heat map
import matplotlib.pyplot as plt
import seaborn as sns
numeric_df = df.select_dtypes(include="number")
corr = numeric_df.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
vmin=-1, vmax=1, square=True)
plt.title("Correlation Matrix")
plt.tight_layout()
plt.show()
# Scatter plot with regression line
sns.regplot(data=df, x="days_since_signup", y="order_value",
scatter_kws={"alpha": 0.3}, line_kws={"color": "red"})
plt.title("Order Value vs Days Since Signup")
plt.show()
A high correlation between two features does not imply causation. When correlations are high, check whether a third variable (a confounder) could be driving both.
EDA in SQL: Quick Profiling Queries
When a dataset lives in a warehouse and is too large to pull into Python, SQL-based profiling is faster and avoids unnecessary data movement.
-- Profile a single table: null rates, min, max, approx distinct
SELECT
COUNT(*) AS total_rows,
COUNT(order_id) AS non_null_order_id,
COUNT(DISTINCT order_id) AS distinct_orders,
MIN(order_date) AS earliest_date,
MAX(order_date) AS latest_date,
ROUND(AVG(order_value), 2) AS avg_value,
PERCENTILE_CONT(0.5) WITHIN GROUP
(ORDER BY order_value) AS median_value,
STDDEV(order_value) AS stddev_value,
SUM(CASE WHEN order_value IS NULL THEN 1 END)
* 100.0 / COUNT(*) AS pct_null_value
FROM orders;
-- Frequency distribution for a categorical column
SELECT
country,
COUNT(*) AS orders,
ROUND(COUNT(*) * 100.0
/ SUM(COUNT(*)) OVER (), 1) AS pct
FROM orders
GROUP BY country
ORDER BY orders DESC
LIMIT 20;
Common EDA Red Flags
| Red Flag | What It Might Mean | Next Step |
|---|---|---|
| High null rate in a key column | Data pipeline failure, optional field never filled, or systematic missingness for a subgroup | Check if nulls are random or correlated with another variable (MCAR vs MAR vs MNAR) |
| Unexpected value spikes on round numbers | Rounding artefact, default value used as placeholder, or data entry bias | Verify with source system; ask if 0, 100, and 1000 are real or sentinel values |
| Dates outside expected range | Timezone conversion errors, default dates (1970-01-01), or future-dated test records | Filter to plausible range; trace back to ingestion logic |
| Perfect correlation between two features | One is derived from the other (leakage) or they are the same column with two names | Drop one; investigate data lineage |
| Bimodal distribution | Two distinct sub-populations mixed together | Segment by a categorical variable to see if the modes separate cleanly |
| Constant column | Feature has no variance and carries no information | Drop from analysis; investigate why it is constant |
Summary
EDA is not a box to check before the "real" analysis — it is the foundation that makes everything else reliable. A disciplined EDA routine catches problems that would otherwise invalidate models, embarrass analysts in front of stakeholders, or lead to bad business decisions. Use Python for flexible profiling and visualisation on manageable datasets, SQL for fast profiling directly in the warehouse, and always document what you find. The discoveries made during EDA often turn out to be the most valuable outputs of the entire analytical project.
Create a free reader account to keep reading.