What Is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is the practice of investigating a dataset before formal modeling or hypothesis testing — to understand its structure, detect anomalies, discover patterns, and form hypotheses. The term was popularized by statistician John Tukey, who argued that data should be explored visually and descriptively before any assumptions are imposed. EDA is not a checklist; it is a mindset of curiosity-driven investigation. For a data analyst, EDA is the critical bridge between raw data ingestion and actionable insight.
Goals of EDA
| Goal | Question It Answers | Typical Technique |
|---|---|---|
| Understand shape of data | How many rows and columns? What are the data types? | df.shape, df.dtypes, INFORMATION_SCHEMA |
| Assess data quality | Are there missing values, duplicates, or anomalies? | Null counts, duplicate checks, range validation |
| Understand distributions | How are individual variables distributed? | Histograms, box plots, summary statistics |
| Identify relationships | How do variables relate to each other? | Scatter plots, correlation matrices, crosstabs |
| Detect outliers | Are there extreme values that need attention? | Box plots, z-score analysis, IQR fencing |
| Generate hypotheses | What patterns suggest causal or predictive relationships? | Grouped comparisons, trend analysis |
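
The first two goals in the table can be sketched with a few pandas calls. This is a minimal illustration, not a prescribed procedure: the orders table, its column names, and the 0 to 120 age rule are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical orders table; every column name and value is invented for illustration.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "customer_age": [34, 51, 51, None, 140],
    "amount": [19.99, 5.00, 5.00, 42.50, 7.25],
})

# Shape and types: how many rows and columns, and what each column holds.
n_rows, n_cols = df.shape
dtypes = df.dtypes

# Data quality: nulls per column and fully duplicated rows.
null_counts = df.isnull().sum()
n_duplicate_rows = int(df.duplicated().sum())

# Range validation under an assumed domain rule (valid age is 0 to 120).
# Missing ages compare as False here, so they are counted by null_counts instead.
age_out_of_range = df[(df["customer_age"] < 0) | (df["customer_age"] > 120)]

print(f"{n_rows} rows x {n_cols} columns")
print(dtypes)
print(null_counts)
print(f"duplicate rows: {n_duplicate_rows}, ages out of range: {len(age_out_of_range)}")
```

Keeping the range check separate from the null count makes it clear whether a value is absent or present but implausible, and those two cases usually call for different handling.
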
Univariate Analysis: Understanding Single Variables
| Variable Type | Summary Statistics | Visualizations |
|---|---|---|
| Continuous numerical | Mean, median, std dev, min, max, percentiles (25th, 75th, 95th, 99th) | Histogram, density plot, box plot |
| Discrete numerical | Count, min, max, mode, frequency table | Bar chart of value counts |
| Categorical (nominal) | Frequency counts, mode, cardinality (number of distinct values) | Bar chart sorted by frequency |
| Ordinal | Frequency counts, rank order | Bar chart with ordered categories |
| Date/time | Min date, max date, range, gaps | Time series line chart, calendar heatmap |
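
As a rough sketch of the summary statistics listed above, the snippet below profiles one continuous and one categorical column. The data and the column names `amount` and `channel` are hypothetical.

```python
import pandas as pd

# Hypothetical data; column names and values are invented for illustration.
df = pd.DataFrame({
    "amount": [5.0, 7.0, 8.0, 9.0, 10.0, 12.0, 15.0, 200.0],
    "channel": ["web", "web", "store", "web", "app", "store", "web", "app"],
})

# Continuous variable: summary statistics with extra upper percentiles.
amount_summary = df["amount"].describe(percentiles=[0.25, 0.75, 0.95, 0.99])
amount_median = df["amount"].median()

# Categorical variable: frequency table (sorted by count), mode, and cardinality.
channel_counts = df["channel"].value_counts()
channel_mode = df["channel"].mode().iloc[0]
channel_cardinality = df["channel"].nunique()

print(amount_summary)
print("median:", amount_median)
print(channel_counts)
print("mode:", channel_mode, "| distinct values:", channel_cardinality)
```

In this made-up sample the single large value pulls the mean well above the median, which is the kind of gap that suggests checking a histogram or box plot before trusting the mean.
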
Bivariate Analysis: Exploring Relationships
| Variable Pair | Technique | What to Look For |
|---|---|---|
| Numerical vs. Numerical | Scatter plot, Pearson/Spearman correlation | Linear or nonlinear trends, clusters, outliers |
| Categorical vs. Numerical | Box plot by group, violin plot, grouped means | Differences in central tendency or spread across groups |
| Categorical vs. Categorical | Crosstab (contingency table), stacked bar chart | Associations, disproportionate distributions |
| Time vs. Numerical | Line chart, rolling average | Trends, seasonality, step changes, anomalies |
| Time vs. Categorical | Stacked area chart, small multiples | Shifts in category composition over time |
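
The non-graphical techniques in the table (grouped means, crosstabs, rolling averages) might look like the following in pandas. The daily sales data, the column names, and the 3-day window are assumptions chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical daily sales; all names and values are invented for illustration.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "region": ["north", "south", "north", "south", "north", "south"],
    "plan": ["basic", "basic", "pro", "pro", "basic", "pro"],
    "revenue": [100.0, 80.0, 120.0, 90.0, 110.0, 70.0],
})

# Categorical vs. numerical: compare central tendency and spread by group.
revenue_by_region = df.groupby("region")["revenue"].agg(["mean", "std"])

# Categorical vs. categorical: contingency table of row counts.
region_by_plan = pd.crosstab(df["region"], df["plan"])

# Time vs. numerical: 3-day rolling average to smooth day-to-day noise.
# The first two values are NaN because the window is not yet full.
rolling_revenue = df.set_index("date")["revenue"].rolling(window=3).mean()

print(revenue_by_region)
print(region_by_plan)
print(rolling_revenue)
```
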
Correlation and Its Limits
| Concept | Explanation | Analyst Action |
|---|---|---|
| Pearson correlation (r) | Measures the strength of a linear relationship; ranges from -1 to +1 | Use for continuous variables with a roughly linear relationship; sensitive to outliers |
| Spearman correlation | Rank-based; captures monotonic relationships, including non-linear ones | Prefer when data is skewed, ordinal, or contains outliers |
| Correlation ≠ causation | Two variables can be correlated because a third, confounding variable drives both | Investigate confounders; never infer causation from correlation alone |
| Spurious correlations | Random or coincidental relationships with no logical basis | Apply domain knowledge; require logical plausibility |
| Correlation matrix | Pairwise correlations for all numeric columns, often displayed as a heatmap | Identify multicollinearity and promising predictor variables |
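
To make the Pearson versus Spearman distinction concrete, here is a small sketch on synthetic data where `y` rises monotonically but not linearly with `x`. The data is invented for the example; `DataFrame.corr` is the pandas call that produces the pairwise matrix.

```python
import pandas as pd

# Synthetic data: y = x cubed, so y always increases with x but not in a straight line.
df = pd.DataFrame({"x": range(1, 11)})
df["y"] = df["x"] ** 3

# Pairwise correlation matrices for all numeric columns.
pearson_matrix = df.corr(method="pearson")
spearman_matrix = df.corr(method="spearman")

pearson_r = pearson_matrix.loc["x", "y"]
spearman_rho = spearman_matrix.loc["x", "y"]

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Spearman comes out at 1 because the ranks of `x` and `y` agree perfectly, while Pearson is lower because the points do not lie on a line. Neither number says anything about whether one variable causes the other.
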
Detecting Outliers
| Method | How It Works | When to Use |
|---|---|---|
| IQR fencing | Flag values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR | Robust to extreme values because quartiles barely move; standard box plot method. On heavily skewed data it can flag much of the long tail |
| Z-score | Flag values more than 3 standard deviations from the mean | Roughly normal data; the mean and standard deviation are themselves pulled by extreme values, which can mask outliers |
| Percentile thresholds | Flag values below 1st or above 99th percentile | Business-driven thresholds; easy to explain to stakeholders |
| Domain rules | Apply business logic (e.g., age must be 0–120) | Always apply alongside statistical methods |
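
The first two methods can be compared side by side on a small sample. This is an illustrative sketch on invented values, using the 1.5×IQR and 3-standard-deviation cutoffs from the table.

```python
import pandas as pd

# Hypothetical measurements with one extreme value; invented for illustration.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 95], dtype="float64")

# IQR fencing: flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score: flag anything more than 3 sample standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 3]

print("IQR fences:", lower, upper)
print("IQR outliers:", iqr_outliers.tolist())
print("Z-score outliers:", z_outliers.tolist())
```

On this sample the IQR fences flag 95, but the z-score rule flags nothing: the extreme value inflates the standard deviation enough that its own z-score stays below 3. With only ten observations a z-score cannot exceed 3 at all, so the rule is better suited to larger samples.
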
EDA Workflow
| Step | Activity | Tool / Function |
|---|---|---|
| 1. Load and inspect | Check shape, dtypes, first/last rows, sample | df.shape, df.info(), df.head(), df.tail(), df.sample() |
| 2. Summarize numerics | Descriptive statistics for all numerical columns | df.describe() (count, mean, std, min, percentiles, max) |
| 3. Assess missingness | Count and visualize null values per column | df.isnull().sum(), missingno matrix or heatmap |
| 4. Univariate distributions | Plot each variable individually | Histograms, bar charts, box plots |
| 5. Bivariate relationships | Plot pairs and compute correlations | sns.pairplot(), correlation heatmap, scatter plots |
| 6. Group comparisons | Compare key metrics across categorical segments | df.groupby().agg(), pivot tables |
| 7. Document findings | Record observations, anomalies, and hypotheses | Jupyter notebook markdown, EDA report |
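
The non-plotting steps of the workflow could be collected into one small helper, as sketched below. `quick_eda` is a hypothetical function written for this example, not part of pandas, and the sample data and column names are likewise invented.

```python
import pandas as pd


def quick_eda(df: pd.DataFrame, group_col: str, metric_col: str) -> dict:
    """Hypothetical helper gathering workflow steps 1, 2, 3, and 6 into one report."""
    return {
        # Step 1: shape and column types.
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        # Step 2: descriptive statistics for the numeric columns.
        "numeric_summary": df.describe(),
        # Step 3: null counts per column.
        "null_counts": df.isnull().sum().to_dict(),
        # Step 6: compare a metric across the segments of one categorical column.
        "group_summary": df.groupby(group_col)[metric_col].agg(["count", "mean", "median"]),
    }


# Invented sample data with one missing spend value.
df = pd.DataFrame({
    "segment": ["a", "a", "b", "b", "b"],
    "spend": [10.0, 20.0, 5.0, None, 15.0],
})

report = quick_eda(df, group_col="segment", metric_col="spend")
for name, result in report.items():
    print(f"--- {name} ---")
    print(result)
```

The plotting steps (4 and 5) are left out so the sketch runs without a display. The group counts exclude the missing value, so comparing them with the raw row count per segment is a quick way to see where nulls are concentrated.
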
Summary
EDA is not a single technique but a phase of open-ended investigation. The goal is to build intuition about the data — what it contains, how it behaves, and what questions it can answer — before committing to a specific analysis or model. Analysts who invest in thorough EDA avoid costly mistakes downstream: models trained on misunderstood data, conclusions drawn from biased samples, and dashboards built on the wrong metrics. A well-executed EDA notebook is also a communication tool: it documents the journey from raw data to informed hypothesis, making the analysis transparent and reproducible.