Why Data Cleaning Matters
Data cleaning is the process of detecting and correcting corrupt, inaccurate, or irrelevant records from a dataset before analysis. Data professionals spend 60–80% of their time on cleaning rather than modelling. Raw data almost always contains missing values, duplicates, inconsistent formats, and outliers. A model trained on dirty data produces unreliable results regardless of its sophistication — garbage in, garbage out.
Data Quality Checklist
| Issue | How to Detect | Common Fix |
|---|---|---|
| Missing values | df.isnull().sum() | Impute, drop, or flag depending on pattern |
| Duplicates | df.duplicated().sum() | df.drop_duplicates() |
| Wrong data types | df.dtypes | df.astype(), pd.to_datetime(), pd.to_numeric() |
| Outliers | df.describe(), IQR method | Investigate first; cap, remove, or log-transform |
| Inconsistent categories | df[col].value_counts() | .str.lower().str.strip() + value mapping |
| Schema violations | Compare df.columns to expected schema | Reorder, rename, or add missing columns |
Handling Missing Values
Choose your strategy based on why data is missing: MCAR (safe to drop), MAR (impute using other features), or MNAR (requires domain knowledge). Always create a binary flag column before imputing so downstream models can learn from the missingness pattern itself.
import pandas as pd
df = pd.read_csv('customers.csv')
# Profile: percentage missing per column
print((df.isnull().mean() * 100).round(1))
# Flag before imputing
df['age_was_missing'] = df['age'].isnull().astype(int)
# Impute numeric with median, categorical with mode
df['age'] = df['age'].fillna(df['age'].median())
df['country'] = df['country'].fillna(df['country'].mode()[0])
# Drop rows where the target is missing
df = df.dropna(subset=['revenue'])
# Forward fill for time-series (sensor, price data)
df['sensor_temp'] = df['sensor_temp'].ffill()
Imputation Strategies Compared
| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| Mean / Median | Numeric; low missingness (<5%); MCAR | Simple and fast | Reduces variance; distorts correlations |
| Mode | Categorical columns | Preserves most common value | Over-represents the dominant category |
| Forward/backward fill | Time-ordered data | Preserves temporal continuity | Wrong if gap is long; propagates stale values |
| KNN imputation | Numeric; MAR; moderate missingness | Uses feature relationships | Slow on large datasets; sensitive to scale |
| IterativeImputer | Complex patterns; MAR | Most accurate; handles interactions | Computationally expensive |
Outliers: Detect and Handle
# IQR method to detect outliers
Q1 = df['revenue'].quantile(0.25)
Q3 = df['revenue'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
n_outliers = ((df['revenue'] < lower) | (df['revenue'] > upper)).sum()
print(f'{n_outliers} outliers detected out of {len(df)} rows')
# Option 1: Remove (only for confirmed data entry errors)
df_clean = df[(df['revenue'] >= lower) & (df['revenue'] <= upper)]
# Option 2: Winsorise — cap at percentile boundaries
df['revenue'] = df['revenue'].clip(lower=lower, upper=upper)
# Option 3: Log-transform to reduce right skew without removing data
import numpy as np
df['revenue_log'] = np.log1p(df['revenue']) # log1p handles 0 safely
Fixing Data Types and Standardising Categories
import pandas as pd
df = pd.read_csv('orders.csv')
# Fix dates stored as strings
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
# Strip currency symbols before converting to float
df['price'] = df['price'].str.replace(r'[$,]', '', regex=True).astype(float)
# Downcast integers; use nullable Int16 so NaN is preserved
df['age'] = pd.to_numeric(df['age'], errors='coerce').astype('Int16')
# Standardise categorical strings
df['country'] = df['country'].str.strip().str.lower()
country_map = {'us': 'united states', 'u.s.a.': 'united states', 'uk': 'united kingdom'}
df['country'] = df['country'].replace(country_map)
# Convert to category dtype to save memory
df['country'] = df['country'].astype('category')
print(df.dtypes)
Data Validation Rules
| Rule Type | Example | Action on Failure |
|---|---|---|
| Range check | Age between 0 and 120 | Flag or drop; investigate if many fail |
| Referential integrity | Every order must have a matching customer_id | Drop orphaned records; log for upstream fix |
| Uniqueness | order_id must be unique | De-duplicate; alert pipeline owner |
| Non-null constraint | revenue must not be null | Drop or impute; block from analysis |
| Format check | Email matches regex pattern | Nullify invalid values |
| Business rule | Order date must be after customer signup date | Flag for review; exclude from cohort analyses |
Building a Reusable Cleaning Pipeline
Structure cleaning as a series of pure functions — each takes a DataFrame and returns a cleaned DataFrame. This makes every step independently testable and easy to swap out when data sources change.
import pandas as pd, numpy as np
def drop_duplicates(df):
before = len(df)
df = df.drop_duplicates(subset='order_id')
print(f'Duplicates removed: {before - len(df)}')
return df
def fix_types(df):
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')
return df
def impute_missing(df):
df['revenue'] = df['revenue'].fillna(df['revenue'].median())
df['country'] = df['country'].fillna('unknown')
return df
def remove_outliers(df, col, lo_q=0.01, hi_q=0.99):
lo, hi = df[col].quantile(lo_q), df[col].quantile(hi_q)
return df[(df[col] >= lo) & (df[col] <= hi)]
def run_pipeline(path):
return (pd.read_csv(path)
.pipe(drop_duplicates)
.pipe(fix_types)
.pipe(impute_missing)
.pipe(remove_outliers, 'revenue'))
clean_df = run_pipeline('raw_orders.csv')
print(f'Final shape: {clean_df.shape}')
Summary
Data cleaning is a continuous discipline, not a one-time task. Every dataset has its own quality issues shaped by how it was collected, stored, and transported. The most effective analysts develop a systematic profiling habit: always inspect distributions, missing rates, type consistency, and value uniqueness before any analysis. Imputation and outlier handling require judgement — the wrong strategy can introduce bias harder to detect than the original dirty data. Document every cleaning decision and build reusable pipeline functions so your work is reproducible when the data changes.
Create a free reader account to keep reading.