Exploratory Data Analysis (EDA)

What Is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the practice of investigating a dataset before formal modeling or hypothesis testing in order to understand its structure, detect anomalies, discover patterns, and form hypotheses. The term was popularized by statistician John Tukey, who argued that data should be explored visually and descriptively before any assumptions are imposed. EDA is not a checklist; it is a mindset of curiosity-driven investigation. For a data analyst, EDA is the critical bridge between raw data ingestion and actionable insight.

Goals of EDA

Goal	Question It Answers	Typical Technique
Understand shape of data	How many rows and columns? What are the data types?	df.shape, df.dtypes, INFORMATION_SCHEMA
Assess data quality	Are there missing values, duplicates, or anomalies?	Null counts, duplicate checks, range validation
Understand distributions	How are individual variables distributed?	Histograms, box plots, summary statistics
Identify relationships	How do variables relate to each other?	Scatter plots, correlation matrices, crosstabs
Detect outliers	Are there extreme values that need attention?	Box plots, z-score analysis, IQR fencing
Generate hypotheses	What patterns suggest causal or predictive relationships?	Grouped comparisons, trend analysis
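
The first two goals (shape and data quality) can be sketched in a few lines of pandas. This is a minimal illustration, not a complete quality audit: the DataFrame, its column names, and its values are hypothetical, and the 0 to 120 age range is only an example of a domain rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, 51, 51, np.nan, 130],
    "city": ["Oslo", "Lima", "Lima", None, "Pune"],
})

# Shape of data: row/column counts and the type of each column.
n_rows, n_cols = df.shape
dtypes = df.dtypes

# Data quality: nulls per column and fully duplicated rows.
null_counts = df.isnull().sum()
duplicate_rows = int(df.duplicated().sum())

# Range validation with an assumed domain rule (age must be 0 to 120).
# Comparisons against NaN are False, so missing ages are not flagged here.
out_of_range = df[(df["age"] < 0) | (df["age"] > 120)]

print(f"{n_rows} rows x {n_cols} columns")
print(null_counts)
print(f"duplicate rows: {duplicate_rows}")
print(out_of_range)
```

Missing values and out-of-range values are kept as separate checks on purpose: they usually have different causes and call for different fixes.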

Univariate Analysis: Understanding Single Variables

Variable Type	Summary Statistics	Visualizations
Continuous numerical	Mean, median, std dev, min, max, percentiles (25th, 75th, 95th, 99th)	Histogram, density plot, box plot
Discrete numerical	Count, min, max, mode, frequency table	Bar chart of value counts
Categorical (nominal)	Frequency counts, mode, cardinality (# distinct values)	Bar chart sorted by frequency
Ordinal	Frequency counts, rank order	Bar chart with ordered categories
Date/time	Min date, max date, range, gaps	Time series line chart, calendar heatmap
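
As a rough sketch of how the summaries in this table map onto pandas, the example below computes one summary per variable type. The data, column names, and the "low/medium/high" ordering are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "amount": [12.5, 8.0, 15.25, 9.75, 120.0, 11.0],
    "channel": ["web", "store", "web", "web", "app", "store"],
    "satisfaction": ["low", "high", "medium", "high", "low", "high"],
    "order_date": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03",
        "2024-01-03", "2024-01-07", "2024-01-08",
    ]),
})

# Continuous: summary statistics including the tail percentiles.
# describe() always reports the median (50%) as well.
amount_summary = df["amount"].describe(percentiles=[0.25, 0.75, 0.95, 0.99])

# Nominal: frequency counts (sorted by frequency), mode, and cardinality.
channel_counts = df["channel"].value_counts()
channel_mode = df["channel"].mode().iloc[0]
channel_cardinality = df["channel"].nunique()

# Ordinal: declare the category order so counts follow rank, not frequency.
levels = ["low", "medium", "high"]  # assumed ordering
satisfaction = pd.Categorical(df["satisfaction"], categories=levels, ordered=True)
satisfaction_counts = pd.Series(satisfaction).value_counts().sort_index()

# Date/time: range of the data and calendar days with no records (gaps).
first_day, last_day = df["order_date"].min(), df["order_date"].max()
all_days = pd.date_range(first_day, last_day, freq="D")
missing_days = all_days.difference(pd.DatetimeIndex(df["order_date"]))

print(amount_summary)
print(channel_counts)
print(satisfaction_counts)
print(missing_days)
```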

Bivariate Analysis: Exploring Relationships

Variable Pair	Technique	What to Look For
Numerical vs. Numerical	Scatter plot, Pearson/Spearman correlation	Linear or nonlinear trends, clusters, outliers
Categorical vs. Numerical	Box plot by group, violin plot, grouped means	Differences in central tendency or spread across groups
Categorical vs. Categorical	Crosstab (contingency table), stacked bar chart	Associations, disproportionate distributions
Time vs. Numerical	Line chart, rolling average	Trends, seasonality, step changes, anomalies
Time vs. Categorical	Stacked area chart, small multiples	Shifts in category composition over time
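
Three of the pairings above have direct numeric counterparts to their charts. The sketch below assumes a small hypothetical dataset (all names and values invented) and shows grouped statistics, a contingency table with row shares, and a rolling average.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south", "north"],
    "plan": ["basic", "pro", "basic", "basic", "pro", "pro"],
    "spend": [10.0, 30.0, 12.0, 14.0, 40.0, 50.0],
})

# Categorical vs. numerical: central tendency and spread per group.
spend_by_plan = df.groupby("plan")["spend"].agg(["mean", "median", "std"])

# Categorical vs. categorical: raw counts, then shares within each row
# so that regions of different size can be compared.
region_plan = pd.crosstab(df["region"], df["plan"])
region_plan_share = pd.crosstab(df["region"], df["plan"], normalize="index")

# Time vs. numerical: a 3-day rolling average smooths daily noise.
# The first two positions are NaN because the window is not yet full.
daily = pd.Series(
    [100, 102, 98, 130, 101, 99, 103],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)
rolling_mean = daily.rolling(window=3).mean()

print(spend_by_plan)
print(region_plan_share)
print(rolling_mean)
```

Normalizing the crosstab by row is one choice among several; `normalize="columns"` or `normalize="all"` answer different questions about the same counts.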

Correlation and Its Limits

Concept	Explanation	Analyst Action
Pearson correlation (r)	Measures the strength of a linear relationship; ranges from -1 to +1	Use for continuous variables with a roughly linear relationship; check for outliers first, since they can dominate r
Spearman correlation	Rank-based; captures monotonic relationships, including non-linear ones	Prefer when data is skewed, ordinal, or contains outliers
Correlation ≠ causation	Two variables can be correlated due to a third confounding variable	Investigate confounders; never infer causation from correlation alone
Spurious correlations	Random or coincidental relationships with no logical basis	Apply domain knowledge; require logical plausibility
Correlation matrix	Pairwise correlations for all numeric columns displayed as a heatmap	Identify multicollinearity and promising predictor variables
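
The difference between Pearson and Spearman is easiest to see on a relationship that is monotonic but far from linear. The sketch below uses made-up data (y grows exponentially with x, and z is a deliberately redundant copy of x) and computes Spearman as the Pearson correlation of the ranks, which is its definition.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y always increases with x, but not along a straight line.
x = pd.Series(np.arange(1, 11), dtype=float)
y = np.exp(x)

# Pearson (the pandas default) measures linear association only.
pearson_r = x.corr(y)

# Spearman is Pearson applied to ranks, so any strictly increasing
# relationship scores 1.
spearman_rho = x.rank().corr(y.rank())

# Correlation matrix over numeric columns; z = 2x + 1 is perfectly
# collinear with x and shows up as a correlation of 1.
df = pd.DataFrame({"x": x, "y": y, "z": 2 * x + 1})
corr_matrix = df.select_dtypes("number").corr(method="pearson")

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
print(corr_matrix.round(3))
```

Here Pearson comes out well below 1 even though y is a deterministic function of x, which is a reminder that a modest r does not rule out a strong relationship.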

Detecting Outliers

Method	How It Works	When to Use
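IQR fencing	Flag values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR	Robust to extreme values and needs no normality assumption; standard box plot method; on heavily skewed data it can flag many legitimate tail values
Z-score	Flag values more than 3 standard deviations from the mean	Roughly normal data with a reasonable sample size; the mean and standard deviation are themselves distorted by extreme values, which can mask outliers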
Percentile thresholds	Flag values below 1st or above 99th percentile	Business-driven thresholds; easy to explain to stakeholders
Domain rules	Apply business logic (e.g., age must be 0–120)	Always apply alongside statistical methods
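
The four methods can disagree on the same data, which is a good reason to run more than one. The sketch below applies each to a small hypothetical series of ages containing one implausible value; the data and the 0 to 120 rule are illustrative assumptions.

```python
import pandas as pd

# Hypothetical ages with one implausible entry (150).
ages = pd.Series([23, 25, 27, 29, 31, 33, 35, 37, 39, 150], name="age")

# IQR fencing: the quartiles barely move when one value is extreme.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
iqr_flags = (ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)

# Z-score: uses the mean and sample standard deviation, both of which
# are pulled toward the extreme value.
z = (ages - ages.mean()) / ages.std()
z_flags = z.abs() > 3

# Percentile thresholds: flags the tails by construction, whether or
# not the values are actually wrong.
p01, p99 = ages.quantile(0.01), ages.quantile(0.99)
percentile_flags = (ages < p01) | (ages > p99)

# Domain rule: business logic, independent of the distribution.
domain_flags = ~ages.between(0, 120)

print("IQR:       ", ages[iqr_flags].tolist())
print("z-score:   ", ages[z_flags].tolist())
print("percentile:", ages[percentile_flags].tolist())
print("domain:    ", ages[domain_flags].tolist())
```

In this toy series the z-score rule flags nothing, because the value 150 inflates the standard deviation enough to hide itself, while the percentile rule also flags the perfectly ordinary minimum. The IQR fence and the domain rule both isolate 150.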

EDA Workflow

Step	Activity	Tool / Function
1. Load and inspect	Check shape, dtypes, first/last rows, sample	df.head(), df.info(), df.sample()
2. Summarize numerics	Descriptive statistics for all numerical columns	df.describe(): count, mean, std, min, 25th/50th/75th percentiles, max
3. Assess missingness	Count and visualize null values per column	df.isnull().sum(), missingno library (matrix or bar chart; its heatmap shows how missingness correlates between columns)
4. Univariate distributions	Plot each variable individually	Histograms, bar charts, box plots
5. Bivariate relationships	Plot pairs and compute correlations	sns.pairplot(), correlation heatmap, scatter plots
6. Group comparisons	Compare key metrics across categorical segments	df.groupby().agg(), pivot tables
7. Document findings	Record observations, anomalies, and hypotheses	Jupyter notebook markdown, EDA report
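
A condensed pass through the non-plotting steps of this workflow might look like the sketch below. The dataset is hypothetical and built inline so the example is self-contained; in practice step 1 would start from something like `pd.read_csv(...)` on a real file.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset standing in for a loaded file.
df = pd.DataFrame({
    "segment": ["a", "a", "b", "b", "b", "c"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "revenue": [100.0, 120.0, 80.0, np.nan, 90.0, 200.0],
})

# Step 1: load and inspect. df.info() prints dtypes and non-null counts.
preview = df.head()
df.info()

# Step 2: descriptive statistics for numeric columns only (the default).
numeric_summary = df.describe()

# Step 3: share of missing values per column.
missing_share = df.isnull().mean()

# Step 6: compare a metric across segments. NaN values are skipped by
# count and mean, so the count column shows how many rows contributed.
by_segment = df.groupby("segment")["revenue"].agg(["count", "mean"])
pivot = df.pivot_table(
    index="segment", columns="quarter", values="revenue", aggfunc="mean"
)

print(numeric_summary)
print(missing_share)
print(by_segment)
print(pivot)
```

Reporting the count next to each group mean is a small habit that pays off: a mean based on one or two rows deserves less weight than one based on hundreds.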

Summary

EDA is not a single technique but a phase of open-ended investigation. The goal is to build intuition about the data (what it contains, how it behaves, and what questions it can answer) before committing to a specific analysis or model. Analysts who invest in thorough EDA avoid costly mistakes downstream: models trained on misunderstood data, conclusions drawn from biased samples, and dashboards built on the wrong metrics. A well-executed EDA notebook is also a communication tool: it documents the journey from raw data to informed hypothesis, making the analysis transparent and reproducible.