Data Cleaning and Preparation

Why Data Cleaning Matters

Data cleaning — also called data wrangling or data preparation — is the process of detecting and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset before analysis. Studies consistently show that data professionals spend 60–80% of their time on cleaning and preparation rather than modelling. Raw data from real-world sources almost always contains missing values, duplicates, inconsistent formats, outliers, and encoding errors. A model trained on dirty data will produce unreliable results regardless of its sophistication — the principle known as garbage in, garbage out.

This article covers the most common data quality problems and the practical techniques to fix them, with Python code examples using pandas.

A Data Quality Checklist

Issue	Description	How to Detect	Common Fix
Missing values	NaN, None, empty strings, sentinel values like -999 or 0	df.isnull().sum(), df.describe()	Impute, drop, or flag depending on pattern and volume
Duplicates	Identical or near-identical rows caused by data pipeline errors or merges	df.duplicated().sum()	df.drop_duplicates()
Incorrect data types	Numbers stored as strings, dates as objects, booleans as integers	df.dtypes	df.astype(), pd.to_datetime(), pd.to_numeric()
Outliers	Values far outside the expected range — could be errors or legitimate extremes	df.describe(), box plots, IQR method	Investigate first; cap, remove, or transform
Inconsistent categories	"male", "Male", "M", "MALE" for the same value	df[col].value_counts()	Normalise with .str.lower().str.strip() and mapping
Schema violations	Columns in wrong order, extra or missing columns vs expected schema	Compare df.columns to expected list	Reorder, rename, or add missing columns with defaults

Handling Missing Values

The right strategy for missing data depends on why the data is missing (the missingness mechanism) and how much is missing. There are three mechanisms:

MCAR (Missing Completely At Random) — missingness is unrelated to any data; safe to drop rows
MAR (Missing At Random) — missingness is related to other observed variables; imputation is appropriate
MNAR (Missing Not At Random) — missingness is related to the missing value itself; requires domain knowledge and careful handling

import pandas as pd
import numpy as np

df = pd.read_csv('customers.csv')

# 1. Profile missing data
print(df.isnull().sum())
print(df.isnull().mean() * 100)  # percentage missing per column

# 2. Drop rows where the target variable is missing
df = df.dropna(subset=['revenue'])

# 3. Simple imputation — fill numeric with median, categorical with mode
df['age'].fillna(df['age'].median(), inplace=True)
df['country'].fillna(df['country'].mode()[0], inplace=True)

# 4. Flag imputed values for downstream transparency
df['age_imputed'] = df['age'].isnull().astype(int)  # run before fillna

# 5. Forward fill for time-ordered data (e.g., sensor readings)
df['temperature'] = df['temperature'].ffill()

Imputation Strategies Compared

Strategy	When to Use	Pros	Cons
Mean imputation	Numeric; MCAR; low missingness (<5%)	Simple and fast	Reduces variance; distorts correlations
Median imputation	Numeric with outliers; skewed distributions	Robust to outliers	Same correlation distortion as mean
Mode imputation	Categorical columns	Preserves the most common value	Can over-represent the dominant category
Forward/backward fill	Time-ordered data (sensor readings, prices)	Preserves temporal continuity	Wrong if gap is long; propagates stale values
KNN imputation	Numeric; MAR; moderate missingness	Uses relationships between features	Slow on large datasets; sensitive to scale
Model-based (IterativeImputer)	Complex missingness patterns	Most accurate; handles interactions	Computationally expensive; risk of overfitting

Detecting and Handling Outliers

Outliers are data points that lie far from the rest of the distribution. They may represent genuine extreme values (high-value customers, rare events) or errors (data entry mistakes, sensor failures). Always investigate before removing.

# IQR method — flag values outside 1.5 * IQR from Q1/Q3
Q1 = df['revenue'].quantile(0.25)
Q3 = df['revenue'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = df[(df['revenue'] < lower) | (df['revenue'] > upper)]
print(f'{len(outliers)} outliers detected')

# Option 1: Remove confirmed errors
df = df[(df['revenue'] >= lower) & (df['revenue'] <= upper)]

# Option 2: Cap (Winsorise) at percentile boundaries
df['revenue'] = df['revenue'].clip(lower=lower, upper=upper)

# Option 3: Log-transform to reduce skew without removing data
import numpy as np
df['revenue_log'] = np.log1p(df['revenue'])  # log1p handles zero values

Fixing Data Types and Formats

Incorrect data types silently cause errors in aggregations, comparisons, and model inputs. Always validate types immediately after loading data.

import pandas as pd

df = pd.read_csv('orders.csv')
print(df.dtypes)  # inspect before cleaning

# Fix dates stored as strings
df['order_date'] = pd.to_datetime(df['order_date'], format='%Y-%m-%d', errors='coerce')

# Fix numeric columns stored as strings (e.g., "$1,200.00")
df['price'] = df['price'].str.replace('[$,]', '', regex=True).astype(float)

# Fix boolean columns stored as 0/1 integers
df['is_active'] = df['is_active'].astype(bool)

# Downcast numeric types to save memory
df['age'] = pd.to_numeric(df['age'], errors='coerce').astype('Int16')  # nullable int

# Convert high-cardinality string to category for efficiency
df['country'] = df['country'].astype('category')

print(df.dtypes)  # verify after cleaning

Standardising Categorical Values

Inconsistent string values in categorical columns are one of the most common sources of dirty data. A single column might contain "United States", "US", "U.S.A.", and "usa" all meaning the same thing.

# Normalise case and whitespace first
df['country'] = df['country'].str.strip().str.lower()

# Check unique values after normalisation
print(df['country'].value_counts())

# Map known variants to canonical values
country_map = {
    'us': 'united states',
    'u.s.a.': 'united states',
    'usa': 'united states',
    'uk': 'united kingdom',
    'gb': 'united kingdom',
}
df['country'] = df['country'].replace(country_map)

# Use fuzzy matching for harder cases (requires rapidfuzz library)
from rapidfuzz import process, fuzz
canonical = ['united states', 'united kingdom', 'germany', 'france']

def fuzzy_match(value, choices, threshold=85):
    match, score, _ = process.extractOne(value, choices, scorer=fuzz.token_sort_ratio)
    return match if score >= threshold else None

df['country_clean'] = df['country'].apply(lambda x: fuzzy_match(x, canonical))

Removing Duplicates

Duplicates can arise from system retries, failed de-duplication in ETL pipelines, or accidental double-counting in joins. There are exact duplicates (identical rows) and near-duplicates (same entity, slightly different records).

# Exact duplicates
print(f'Duplicate rows: {df.duplicated().sum()}')
df = df.drop_duplicates()  # keeps first occurrence by default

# Duplicates on a subset of columns (e.g., same customer_id and order_date)
df = df.drop_duplicates(subset=['customer_id', 'order_date'], keep='last')

# Near-duplicates — keep the most recent record per customer
df = df.sort_values('updated_at', ascending=False)
df = df.drop_duplicates(subset='customer_id', keep='first')

# Verify
print(f'Rows after dedup: {len(df)}')
print(f'Unique customers: {df['customer_id'].nunique()}')

Data Validation and Schema Enforcement

Validation ensures your cleaned data conforms to expected rules before it enters a model or dashboard. Catching violations early prevents silent errors downstream.

Validation Rule	Example Check	Action on Failure
Range check	Age must be between 0 and 120	Flag or drop; investigate if large proportion fails
Referential integrity	Every order must have a matching customer_id	Drop orphaned records; log for upstream fix
Uniqueness	order_id must be unique	De-duplicate; alert pipeline owner
Non-null constraint	revenue must not be null	Drop or impute; block record from analysis
Format check	Email matches regex pattern	Mask or nullify invalid values
Business rule	Order date must be after customer signup date	Flag for manual review; remove from cohort analyses

# Using Great Expectations for automated validation
import great_expectations as ge

df_ge = ge.from_pandas(df)

# Define expectations
df_ge.expect_column_values_to_not_be_null('customer_id')
df_ge.expect_column_values_to_be_between('age', min_value=0, max_value=120)
df_ge.expect_column_values_to_be_unique('order_id')
df_ge.expect_column_values_to_match_regex('email', r'^[\w.+-]+@[\w-]+\.[a-z]{2,}$')

# Run validation and get results
results = df_ge.validate()
print(results['statistics'])  # shows how many expectations passed/failed

Building a Reusable Cleaning Pipeline

Rather than cleaning data ad hoc in a notebook, a production pipeline applies cleaning steps in a consistent, documented, and repeatable order. Here is a minimal example using a function-per-step pattern:

import pandas as pd
import numpy as np

def load_data(path):
    return pd.read_csv(path)

def drop_duplicates(df):
    before = len(df)
    df = df.drop_duplicates(subset='order_id')
    print(f'Duplicates removed: {before - len(df)}')
    return df

def fix_types(df):
    df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
    df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')
    return df

def impute_missing(df):
    df['revenue'].fillna(df['revenue'].median(), inplace=True)
    df['country'].fillna('unknown', inplace=True)
    return df

def remove_outliers(df, col, lower_q=0.01, upper_q=0.99):
    lo = df[col].quantile(lower_q)
    hi = df[col].quantile(upper_q)
    return df[(df[col] >= lo) & (df[col] <= hi)]

def run_pipeline(path):
    df = load_data(path)
    df = drop_duplicates(df)
    df = fix_types(df)
    df = impute_missing(df)
    df = remove_outliers(df, 'revenue')
    return df

clean_df = run_pipeline('raw_orders.csv')

Summary

Data cleaning is not a one-time task — it is a continuous discipline. Every dataset has its own quality issues shaped by how it was collected, stored, and transported. The most effective analysts develop a systematic profiling habit: always inspect distributions, missing rates, type consistency, and value uniqueness before any analysis. Imputation and outlier handling require judgement, not just code — the wrong strategy can introduce bias that is harder to detect than the original dirty data. Documenting every cleaning decision and building reusable pipeline functions pays dividends when the data changes or a colleague needs to reproduce your work.