Data Cleaning and Preparation

Why Data Cleaning Matters

Data cleaning is the process of detecting and correcting corrupt, inaccurate, or irrelevant records from a dataset before analysis. Data professionals spend 60–80% of their time on cleaning rather than modelling. Raw data almost always contains missing values, duplicates, inconsistent formats, and outliers. A model trained on dirty data produces unreliable results regardless of its sophistication — garbage in, garbage out.

Data Quality Checklist

Issue	How to Detect	Common Fix
Missing values	df.isnull().sum()	Impute, drop, or flag depending on pattern
Duplicates	df.duplicated().sum()	df.drop_duplicates()
Wrong data types	df.dtypes	df.astype(), pd.to_datetime(), pd.to_numeric()
Outliers	df.describe(), IQR method	Investigate first; cap, remove, or log-transform
Inconsistent categories	df[col].value_counts()	.str.lower().str.strip() + value mapping
Schema violations	Compare df.columns to expected schema	Reorder, rename, or add missing columns

Handling Missing Values

Choose your strategy based on why data is missing: MCAR (safe to drop), MAR (impute using other features), or MNAR (requires domain knowledge). Always create a binary flag column before imputing so downstream models can learn from the missingness pattern itself.

import pandas as pd

df = pd.read_csv('customers.csv')

# Profile: percentage missing per column
print((df.isnull().mean() * 100).round(1))

# Flag before imputing
df['age_was_missing'] = df['age'].isnull().astype(int)

# Impute numeric with median, categorical with mode
df['age'] = df['age'].fillna(df['age'].median())
df['country'] = df['country'].fillna(df['country'].mode()[0])

# Drop rows where the target is missing
df = df.dropna(subset=['revenue'])

# Forward fill for time-series (sensor, price data)
df['sensor_temp'] = df['sensor_temp'].ffill()

Imputation Strategies Compared

Strategy	When to Use	Pros	Cons
Mean / Median	Numeric; low missingness (<5%); MCAR	Simple and fast	Reduces variance; distorts correlations
Mode	Categorical columns	Preserves most common value	Over-represents the dominant category
Forward/backward fill	Time-ordered data	Preserves temporal continuity	Wrong if gap is long; propagates stale values
KNN imputation	Numeric; MAR; moderate missingness	Uses feature relationships	Slow on large datasets; sensitive to scale
IterativeImputer	Complex patterns; MAR	Most accurate; handles interactions	Computationally expensive

Outliers: Detect and Handle

# IQR method to detect outliers
Q1 = df['revenue'].quantile(0.25)
Q3 = df['revenue'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

n_outliers = ((df['revenue'] < lower) | (df['revenue'] > upper)).sum()
print(f'{n_outliers} outliers detected out of {len(df)} rows')

# Option 1: Remove (only for confirmed data entry errors)
df_clean = df[(df['revenue'] >= lower) & (df['revenue'] <= upper)]

# Option 2: Winsorise — cap at percentile boundaries
df['revenue'] = df['revenue'].clip(lower=lower, upper=upper)

# Option 3: Log-transform to reduce right skew without removing data
import numpy as np
df['revenue_log'] = np.log1p(df['revenue'])  # log1p handles 0 safely

Fixing Data Types and Standardising Categories

import pandas as pd

df = pd.read_csv('orders.csv')

# Fix dates stored as strings
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')

# Strip currency symbols before converting to float
df['price'] = df['price'].str.replace(r'[$,]', '', regex=True).astype(float)

# Downcast integers; use nullable Int16 so NaN is preserved
df['age'] = pd.to_numeric(df['age'], errors='coerce').astype('Int16')

# Standardise categorical strings
df['country'] = df['country'].str.strip().str.lower()
country_map = {'us': 'united states', 'u.s.a.': 'united states', 'uk': 'united kingdom'}
df['country'] = df['country'].replace(country_map)

# Convert to category dtype to save memory
df['country'] = df['country'].astype('category')
print(df.dtypes)

Data Validation Rules

Rule Type	Example	Action on Failure
Range check	Age between 0 and 120	Flag or drop; investigate if many fail
Referential integrity	Every order must have a matching customer_id	Drop orphaned records; log for upstream fix
Uniqueness	order_id must be unique	De-duplicate; alert pipeline owner
Non-null constraint	revenue must not be null	Drop or impute; block from analysis
Format check	Email matches regex pattern	Nullify invalid values
Business rule	Order date must be after customer signup date	Flag for review; exclude from cohort analyses

Building a Reusable Cleaning Pipeline

Structure cleaning as a series of pure functions — each takes a DataFrame and returns a cleaned DataFrame. This makes every step independently testable and easy to swap out when data sources change.

import pandas as pd, numpy as np

def drop_duplicates(df):
    before = len(df)
    df = df.drop_duplicates(subset='order_id')
    print(f'Duplicates removed: {before - len(df)}')
    return df

def fix_types(df):
    df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
    df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')
    return df

def impute_missing(df):
    df['revenue'] = df['revenue'].fillna(df['revenue'].median())
    df['country'] = df['country'].fillna('unknown')
    return df

def remove_outliers(df, col, lo_q=0.01, hi_q=0.99):
    lo, hi = df[col].quantile(lo_q), df[col].quantile(hi_q)
    return df[(df[col] >= lo) & (df[col] <= hi)]

def run_pipeline(path):
    return (pd.read_csv(path)
            .pipe(drop_duplicates)
            .pipe(fix_types)
            .pipe(impute_missing)
            .pipe(remove_outliers, 'revenue'))

clean_df = run_pipeline('raw_orders.csv')
print(f'Final shape: {clean_df.shape}')

Summary

Data cleaning is a continuous discipline, not a one-time task. Every dataset has its own quality issues shaped by how it was collected, stored, and transported. The most effective analysts develop a systematic profiling habit: always inspect distributions, missing rates, type consistency, and value uniqueness before any analysis. Imputation and outlier handling require judgement — the wrong strategy can introduce bias harder to detect than the original dirty data. Document every cleaning decision and build reusable pipeline functions so your work is reproducible when the data changes.