Why Data Cleaning Matters
Raw data is almost never ready for analysis. Inconsistent formats, duplicate records, missing values, outliers introduced by system errors, and mismatched join keys are the norm rather than the exception. Studies consistently show that data analysts spend 50–80% of their time on data preparation — not because cleaning is intellectually rewarding, but because the cost of skipping it is higher: models trained on dirty data produce unreliable predictions, dashboards built on duplicate records overstate metrics, and business decisions made on uncleaned data can be actively harmful.
Data cleaning is not a one-time task. It should be codified into repeatable, version-controlled scripts so that the same transformations are applied consistently as new data arrives.
Handling Missing Values
Missing data is the most common quality issue. The right treatment depends on why the data is missing and how it will be used downstream.
| Missingness Type | Definition | Example | Recommended Treatment |
|---|---|---|---|
| MCAR (Missing Completely At Random) | Missingness is unrelated to any variable in the dataset | Survey response dropped due to network error | Safe to drop rows or use simple imputation |
| MAR (Missing At Random) | Missingness depends on observed variables but not the missing value itself | Income missing more often for younger respondents | Use model-based imputation conditioned on related variables |
| MNAR (Missing Not At Random) | Missingness depends on the unobserved value itself | High earners decline to report income | Cannot be fixed statistically; document the bias and consider collecting data differently |
Imputation Techniques in Python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
df = pd.read_csv("customers.csv")
# 1. Drop rows where a critical column is null
df = df.dropna(subset=["user_id", "signup_date"])
# 2. Fill with a constant (sentinel value)
df["promo_code"] = df["promo_code"].fillna("NONE")
# 3. Fill numeric with median (robust to outliers)
df["age"] = df["age"].fillna(df["age"].median())
# 4. Forward-fill time series gaps
df = df.sort_values("date")
df["daily_active_users"] = df["daily_active_users"].ffill()
# 5. KNN imputation for correlated numeric features
num_cols = ["age", "income", "tenure_months"]
imputer = KNNImputer(n_neighbors=5)
df[num_cols] = imputer.fit_transform(df[num_cols])
print(df[num_cols].isnull().sum()) # Should all be 0
Removing Duplicates
Duplicate records inflate counts, skew aggregations, and create false joins. They arise from double-ingestion, retry logic in pipelines, or schema migrations that import the same records twice.
import pandas as pd
df = pd.read_csv("orders.csv")
# Exact duplicates across all columns
print("Exact dupes:", df.duplicated().sum())
df = df.drop_duplicates()
# Business-key duplicates: same order_id appearing twice
print("Duplicate order_ids:", df.duplicated(subset=["order_id"]).sum())
# Keep the most recent record per order_id
df = (
df.sort_values("updated_at", ascending=False)
.drop_duplicates(subset=["order_id"], keep="first")
)
print("Rows after dedup:", len(df))
Always investigate why duplicates exist before removing them — a duplicate order_id might indicate a genuine data integrity problem in the source system that needs to be fixed upstream.
Standardising Formats and Types
Inconsistent formatting breaks joins and aggregations silently. A column containing "US", "USA", "United States", and "united states" will produce four separate groups instead of one.
import pandas as pd
import re
df = pd.read_csv("leads.csv")
# Standardise country names
country_map = {"US": "United States", "USA": "United States", "U.S.": "United States"}
df["country"] = df["country"].str.strip().replace(country_map)
# Normalise email addresses
df["email"] = df["email"].str.strip().str.lower()
# Parse dates with mixed formats
df["signup_date"] = pd.to_datetime(df["signup_date"], infer_datetime_format=True, errors="coerce")
# Clean phone numbers: keep digits only, enforce 10-digit format
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)
df["phone"] = df["phone"].where(df["phone"].str.len() == 10, other=pd.NA)
# Convert revenue stored as string ("$1,200.50") to float
df["revenue"] = (
df["revenue"].str.replace(r"[\$,]", "", regex=True)
.astype(float)
)
Outlier Detection and Treatment
Outliers can be genuine extreme values, data entry errors, or system anomalies. The right action depends on context — a $50,000 order is an outlier in a consumer app but perfectly normal in B2B sales.
import pandas as pd
import numpy as np
# IQR method
Q1 = df["order_value"].quantile(0.25)
Q3 = df["order_value"].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
df["is_outlier"] = ~df["order_value"].between(lower, upper)
print(df["is_outlier"].sum(), "outliers detected")
# Option A: Remove outliers
df_clean = df[~df["is_outlier"]]
# Option B: Winsorise (cap at bounds instead of removing)
df["order_value_w"] = df["order_value"].clip(lower=lower, upper=upper)
# Option C: Log-transform to reduce skew
df["order_value_log"] = np.log1p(df["order_value"])
Winsorising is usually preferred over removal for continuous metrics because it retains all rows while dampening the influence of extreme values on aggregates.
Data Cleaning in SQL
Many cleaning operations can be performed directly in the warehouse, avoiding unnecessary data movement and keeping transformations close to the source.
-- Deduplicate, standardise, and filter in one CTE chain
WITH raw AS (
SELECT *
FROM raw_orders
WHERE order_date >= '2023-01-01' -- exclude stale data
AND order_value > 0 -- remove zero/negative values
),
deduped AS (
SELECT DISTINCT ON (order_id) -- keep one row per order
*
FROM raw
ORDER BY order_id, updated_at DESC
),
standardised AS (
SELECT
order_id,
UPPER(TRIM(country)) AS country,
LOWER(TRIM(email)) AS email,
DATE_TRUNC('day', order_date) AS order_date,
NULLIF(promo_code, '') AS promo_code,
LEAST(order_value, 10000) AS order_value -- cap at 10k
FROM deduped
)
SELECT * FROM standardised;
Data Cleaning Checklist
| Check | Tool / Method | Pass Condition |
|---|---|---|
| Null rates in required columns | df.isnull().sum() / SQL COUNT | 0% nulls in primary keys and required fields |
| Duplicate primary keys | df.duplicated(subset=[pk]) / SQL GROUP BY | No duplicates |
| Date range validity | df["date"].min() / .max() | All dates within expected business range |
| Numeric value range | df.describe() / SQL MIN, MAX | No negatives where impossible; no implausible extremes |
| Categorical consistency | df["col"].value_counts() | No unexpected values or near-duplicate categories |
| Join key integrity | SQL LEFT JOIN with NULL check | All foreign keys match a record in the parent table |
| Row count stability | Compare before/after cleaning | Dropped rows are documented and justified |
Summary
Data cleaning is not glamorous but it is non-negotiable. Every hour spent writing repeatable, documented cleaning scripts pays dividends across every future analysis that uses the same data. The key principles are: understand your data before cleaning it, treat the root cause rather than just the symptoms, prefer transformations that preserve information (winsorising, flagging) over those that discard it (row deletion), and always record what was changed and why. Clean data is the bedrock on which trustworthy analysis is built.
Create a free reader account to keep reading.