Data Cleaning and Preprocessing

What Is Data Cleaning?

Data cleaning (also called data preprocessing or data wrangling) is the process of detecting and correcting errors, inconsistencies, and missing values in raw datasets before analysis. Studies consistently show that analysts spend 60–80% of their time on data preparation — not analysis itself. A model or chart built on dirty data will produce misleading results regardless of how sophisticated the technique is. Mastering data cleaning is therefore one of the highest-leverage skills a data analyst can develop.

Common Data Quality Problems

Problem Type	Description	Example
Missing values	NULL, NaN, or blank fields where data should exist	Age column has 12% NULL rows
Duplicate records	The same entity appears more than once	Customer ID 1042 has three rows with identical data
Inconsistent formats	Same value represented in multiple ways	"USA", "U.S.A.", "United States" in the country column
Outliers and errors	Values that are implausible or outside expected range	Age = 999, revenue = -50000
Wrong data types	A column stored as the wrong type	Date stored as VARCHAR: "January 5th, 2024"
Referential integrity issues	Foreign keys that point to non-existent records	order.customer_id = 99 but customer 99 does not exist
Structural issues	Data that violates expected schema or business rules	end_date earlier than start_date

Handling Missing Values

Strategy	When to Use	Risk
Drop rows	Missing rate is very low (<5%) and rows are missing at random	Loses data; can introduce bias if missingness is not random
Drop columns	Column has very high missing rate (>60%) with little analytical value	Permanently loses a feature
Mean/median/mode imputation	Numerical or categorical columns with moderate missingness	Reduces variance; distorts distributions
Forward/backward fill	Time series data where the previous or next value is a reasonable estimate	Inappropriate for non-sequential data
Model-based imputation	Missing values have predictable patterns based on other features	Adds complexity; can propagate model errors
Flag as a category	"Missing" is itself meaningful (e.g., no purchase = no data)	Adds a new category that must be managed downstream

Deduplication Techniques

Duplicates arise from data ingestion errors, system migrations, or manual data entry. The SQL pattern for exact duplicates uses ROW_NUMBER() to keep only the first occurrence:

Duplicate Type	Detection Method	Resolution
Exact duplicates	GROUP BY all columns; COUNT(*) > 1	Keep one row using ROW_NUMBER() OVER (PARTITION BY key ORDER BY created_at)
Near duplicates (fuzzy)	String similarity functions (Levenshtein distance, Jaccard similarity)	Cluster similar records and merge manually or algorithmically
Logical duplicates	Same business entity, different IDs (e.g., merged companies)	Build a canonical ID mapping table

Standardizing and Transforming Data

Transformation	Purpose	Example
Case normalization	Eliminate case-sensitive mismatches	UPPER(country) or LOWER(email)
Trimming whitespace	Remove leading/trailing spaces that cause join failures	TRIM(first_name)
Date parsing	Convert string dates to proper date types	CAST('2024-01-05' AS DATE) or TO_DATE('Jan 5, 2024', 'Mon DD, YYYY')
Type casting	Ensure correct data types for calculations	CAST(revenue_str AS NUMERIC)
Value mapping	Harmonize inconsistent categorical values	CASE WHEN country IN ('USA','U.S.A.') THEN 'United States' END
Outlier capping (winsorizing)	Limit extreme values without removing rows	LEAST(revenue, PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY revenue))
Numeric scaling	Normalize features for ML models	Min-max scaling: (x - min) / (max - min)

A Practical Data Cleaning Checklist

Step	Action	SQL / Python Approach
1. Profile the data	Count rows, nulls, distinct values, min/max per column	SELECT COUNT(*), COUNT(col), MIN(col), MAX(col), COUNT(DISTINCT col) FROM table
2. Identify duplicates	Check for exact and key-level duplicates	SELECT id, COUNT() FROM table GROUP BY id HAVING COUNT() > 1
3. Handle nulls	Decide per column: drop, impute, or flag	COALESCE(col, default_value) or df.fillna()
4. Fix formats	Standardize strings, dates, and codes	TRIM(), LOWER(), TO_DATE(), REGEXP_REPLACE()
5. Validate ranges	Flag rows outside business-defined valid ranges	SELECT * FROM table WHERE age NOT BETWEEN 0 AND 120
6. Check referential integrity	Find orphaned foreign keys	SELECT * FROM orders o LEFT JOIN customers c ON o.customer_id = c.id WHERE c.id IS NULL
7. Document decisions	Record every cleaning rule and its rationale	Maintain a data quality log or dbt test file

Data Profiling Tools

Tool	Environment	Key Capability
pandas-profiling / ydata-profiling	Python	Generates full HTML report with distributions, correlations, missing values
Great Expectations	Python / dbt	Define and run automated data quality tests as code
dbt tests	SQL / dbt	Built-in tests: not_null, unique, accepted_values, relationships
OpenRefine	Desktop GUI	Cluster and merge inconsistent string values interactively
SQL INFORMATION_SCHEMA	SQL	Inspect column types, nullable flags, and table metadata

Summary

Data cleaning is not a one-time step — it is an ongoing discipline. Clean data requires profiling to understand what exists, systematic rules to fix what is wrong, and automated tests to ensure quality does not degrade over time. Analysts who build repeatable cleaning pipelines — rather than manual fixes — produce results that are reproducible, auditable, and trustworthy. The investment in clean data pays compounding dividends: every downstream analysis, model, and dashboard built on it inherits that reliability.