Data Cleaning and Preprocessing for Data Analysts

Why Data Cleaning Matters

Raw data is almost never ready for analysis. Inconsistent formats, duplicate records, missing values, outliers introduced by system errors, and mismatched join keys are the norm rather than the exception. Studies consistently show that data analysts spend 50–80% of their time on data preparation — not because cleaning is intellectually rewarding, but because the cost of skipping it is higher: models trained on dirty data produce unreliable predictions, dashboards built on duplicate records overstate metrics, and business decisions made on uncleaned data can be actively harmful.

Data cleaning is not a one-time task. It should be codified into repeatable, version-controlled scripts so that the same transformations are applied consistently as new data arrives.

Handling Missing Values

Missing data is the most common quality issue. The right treatment depends on why the data is missing and how it will be used downstream.

Missingness Type	Definition	Example	Recommended Treatment
MCAR (Missing Completely At Random)	Missingness is unrelated to any variable in the dataset	Survey response dropped due to network error	Safe to drop rows or use simple imputation
MAR (Missing At Random)	Missingness depends on observed variables but not the missing value itself	Income missing more often for younger respondents	Use model-based imputation conditioned on related variables
MNAR (Missing Not At Random)	Missingness depends on the unobserved value itself	High earners decline to report income	Cannot be fixed statistically; document the bias and consider collecting data differently

Imputation Techniques in Python

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.read_csv("customers.csv")

# 1. Drop rows where a critical column is null
df = df.dropna(subset=["user_id", "signup_date"])

# 2. Fill with a constant (sentinel value)
df["promo_code"] = df["promo_code"].fillna("NONE")

# 3. Fill numeric with median (robust to outliers)
df["age"] = df["age"].fillna(df["age"].median())

# 4. Forward-fill time series gaps
df = df.sort_values("date")
df["daily_active_users"] = df["daily_active_users"].ffill()

# 5. KNN imputation for correlated numeric features
num_cols = ["age", "income", "tenure_months"]
imputer = KNNImputer(n_neighbors=5)
df[num_cols] = imputer.fit_transform(df[num_cols])

print(df[num_cols].isnull().sum())  # Should all be 0

Removing Duplicates

Duplicate records inflate counts, skew aggregations, and create false joins. They arise from double-ingestion, retry logic in pipelines, or schema migrations that import the same records twice.

import pandas as pd

df = pd.read_csv("orders.csv")

# Exact duplicates across all columns
print("Exact dupes:", df.duplicated().sum())
df = df.drop_duplicates()

# Business-key duplicates: same order_id appearing twice
print("Duplicate order_ids:", df.duplicated(subset=["order_id"]).sum())

# Keep the most recent record per order_id
df = (
    df.sort_values("updated_at", ascending=False)
      .drop_duplicates(subset=["order_id"], keep="first")
)

print("Rows after dedup:", len(df))

Always investigate why duplicates exist before removing them — a duplicate order_id might indicate a genuine data integrity problem in the source system that needs to be fixed upstream.

Standardising Formats and Types

Inconsistent formatting breaks joins and aggregations silently. A column containing "US", "USA", "United States", and "united states" will produce four separate groups instead of one.

import pandas as pd
import re

df = pd.read_csv("leads.csv")

# Standardise country names
country_map = {"US": "United States", "USA": "United States", "U.S.": "United States"}
df["country"] = df["country"].str.strip().replace(country_map)

# Normalise email addresses
df["email"] = df["email"].str.strip().str.lower()

# Parse dates with mixed formats
df["signup_date"] = pd.to_datetime(df["signup_date"], infer_datetime_format=True, errors="coerce")

# Clean phone numbers: keep digits only, enforce 10-digit format
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)
df["phone"] = df["phone"].where(df["phone"].str.len() == 10, other=pd.NA)

# Convert revenue stored as string ("$1,200.50") to float
df["revenue"] = (
    df["revenue"].str.replace(r"[\$,]", "", regex=True)
                 .astype(float)
)

Outlier Detection and Treatment

Outliers can be genuine extreme values, data entry errors, or system anomalies. The right action depends on context — a $50,000 order is an outlier in a consumer app but perfectly normal in B2B sales.

import pandas as pd import numpy as np # IQR method Q1 = df["order_value"].quantile(0.25) Q3 = df["order_value"].quantile(0.75) IQR = Q3 - Q1 lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR df["is_outlier"] = ~df["order_value"].between(lower, upper) print(df["is_outlier"].sum(), "outliers detected") # Option A: Remove outliers df_clean = df[~df["is_outlier"]] # Option B: Winsorise (cap at bounds instead of removing) df["order_value_w"] = df["order_value"].clip(lower=lower, upper=upper) # Option C: Log-transform to reduce skew df["order_value_log"] = np.log1p(df["order_value"])

Data Cleaning in SQL

Many cleaning operations can be performed directly in the warehouse, avoiding unnecessary data movement and keeping transformations close to the source.

-- Deduplicate, standardise, and filter in one CTE chain WITH raw AS ( SELECT * FROM raw_orders WHERE order_date >= '2023-01-01' -- exclude stale data AND order_value > 0 -- remove zero/negative values ), deduped AS ( SELECT DISTINCT ON (order_id) -- keep one row per order * FROM raw ORDER BY order_id, updated_at DESC ), standardised AS ( SELECT order_id, UPPER(TRIM(country)) AS country, LOWER(TRIM(email)) AS email, DATE_TRUNC('day', order_date) AS order_date, NULLIF(promo_code, '') AS promo_code, LEAST(order_value, 10000) AS order_value -- cap at 10k FROM deduped ) SELECT * FROM standardised;

Data Cleaning Checklist

Check	Tool / Method	Pass Condition
Null rates in required columns	df.isnull().sum() / SQL COUNT	0% nulls in primary keys and required fields
Duplicate primary keys	df.duplicated(subset=[pk]) / SQL GROUP BY	No duplicates
Date range validity	df["date"].min() / .max()	All dates within expected business range
Numeric value range	df.describe() / SQL MIN, MAX	No negatives where impossible; no implausible extremes
Categorical consistency	df["col"].value_counts()	No unexpected values or near-duplicate categories
Join key integrity	SQL LEFT JOIN with NULL check	All foreign keys match a record in the parent table
Row count stability	Compare before/after cleaning	Dropped rows are documented and justified

Check

Tool / Method

Pass Condition

Null rates in required columns

df.isnull().sum() / SQL COUNT

0% nulls in primary keys and required fields

Duplicate primary keys

df.duplicated(subset=[pk]) / SQL GROUP BY

No duplicates

Date range validity

df["date"].min() / .max()

All dates within expected business range

Numeric value range

df.describe() / SQL MIN, MAX

No negatives where impossible; no implausible extremes

Categorical consistency

df["col"].value_counts()

No unexpected values or near-duplicate categories

Join key integrity

SQL LEFT JOIN with NULL check

All foreign keys match a record in the parent table

Row count stability

Compare before/after cleaning

Dropped rows are documented and justified

Summary

Data cleaning is not glamorous but it is non-negotiable. Every hour spent writing repeatable, documented cleaning scripts pays dividends across every future analysis that uses the same data. The key principles are: understand your data before cleaning it, treat the root cause rather than just the symptoms, prefer transformations that preserve information (winsorising, flagging) over those that discard it (row deletion), and always record what was changed and why. Clean data is the bedrock on which trustworthy analysis is built.