Why Data Cleaning Matters
Data cleaning — also called data wrangling or data preparation — is the process of detecting and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset before analysis. Studies consistently show that data professionals spend 60–80% of their time on cleaning and preparation rather than modelling. Raw data from real-world sources almost always contains missing values, duplicates, inconsistent formats, outliers, and encoding errors. A model trained on dirty data will produce unreliable results regardless of its sophistication — the principle known as garbage in, garbage out.
This article covers the most common data quality problems and the practical techniques to fix them, with Python code examples using pandas.
A Data Quality Checklist
Issue | Description | How to Detect | Common Fix |
|---|---|---|---|
Missing values | NaN, None, empty strings, sentinel values like -999 or 0 | df.isnull().sum(), df.describe() | Impute, drop, or flag depending on pattern and volume |
Duplicates | Identical or near-identical rows caused by data pipeline errors or merges | df.duplicated().sum() | df.drop_duplicates() |
Incorrect data types | Numbers stored as strings, dates as objects, booleans as integers | df.dtypes | df.astype(), pd.to_datetime(), pd.to_numeric() |
Outliers | Values far outside the expected range — could be errors or legitimate extremes | df.describe(), box plots, IQR method | Investigate first; cap, remove, or transform |
Inconsistent categories | "male", "Male", "M", "MALE" for the same value | df[col].value_counts() | Normalise with .str.lower().str.strip() and mapping |
Schema violations | Columns in wrong order, extra or missing columns vs expected schema | Compare df.columns to expected list | Reorder, rename, or add missing columns with defaults |
Handling Missing Values
The right strategy for missing data depends on why the data is missing (the missingness mechanism) and how much is missing. There are three mechanisms:
MCAR (Missing Completely At Random) — missingness is unrelated to any data; safe to drop rows
MAR (Missing At Random) — missingness is related to other observed variables; imputation is appropriate
MNAR (Missing Not At Random) — missingness is related to the missing value itself; requires domain knowledge and careful handling
import pandas as pd
import numpy as np
df = pd.read_csv('customers.csv')
# 1. Profile missing data
print(df.isnull().sum())
print(df.isnull().mean() * 100) # percentage missing per column
# 2. Drop rows where the target variable is missing
df = df.dropna(subset=['revenue'])
# 3. Simple imputation — fill numeric with median, categorical with mode
df['age'].fillna(df['age'].median(), inplace=True)
df['country'].fillna(df['country'].mode()[0], inplace=True)
# 4. Flag imputed values for downstream transparency
df['age_imputed'] = df['age'].isnull().astype(int) # run before fillna
# 5. Forward fill for time-ordered data (e.g., sensor readings)
df['temperature'] = df['temperature'].ffill()
Imputation Strategies Compared
Strategy | When to Use | Pros | Cons |
|---|---|---|---|
Mean imputation | Numeric; MCAR; low missingness (<5%) | Simple and fast | Reduces variance; distorts correlations |
Median imputation | Numeric with outliers; skewed distributions | Robust to outliers | Same correlation distortion as mean |
Mode imputation | Categorical columns | Preserves the most common value | Can over-represent the dominant category |
Forward/backward fill | Time-ordered data (sensor readings, prices) | Preserves temporal continuity | Wrong if gap is long; propagates stale values |
KNN imputation | Numeric; MAR; moderate missingness | Uses relationships between features | Slow on large datasets; sensitive to scale |
Model-based (IterativeImputer) | Complex missingness patterns | Most accurate; handles interactions | Computationally expensive; risk of overfitting |
Detecting and Handling Outliers
Outliers are data points that lie far from the rest of the distribution. They may represent genuine extreme values (high-value customers, rare events) or errors (data entry mistakes, sensor failures). Always investigate before removing.
# IQR method — flag values outside 1.5 * IQR from Q1/Q3
Q1 = df['revenue'].quantile(0.25)
Q3 = df['revenue'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['revenue'] < lower) | (df['revenue'] > upper)]
print(f'{len(outliers)} outliers detected')
# Option 1: Remove confirmed errors
df = df[(df['revenue'] >= lower) & (df['revenue'] <= upper)]
# Option 2: Cap (Winsorise) at percentile boundaries
df['revenue'] = df['revenue'].clip(lower=lower, upper=upper)
# Option 3: Log-transform to reduce skew without removing data
import numpy as np
df['revenue_log'] = np.log1p(df['revenue']) # log1p handles zero values
Fixing Data Types and Formats
Incorrect data types silently cause errors in aggregations, comparisons, and model inputs. Always validate types immediately after loading data.
import pandas as pd
df = pd.read_csv('orders.csv')
print(df.dtypes) # inspect before cleaning
# Fix dates stored as strings
df['order_date'] = pd.to_datetime(df['order_date'], format='%Y-%m-%d', errors='coerce')
# Fix numeric columns stored as strings (e.g., "$1,200.00")
df['price'] = df['price'].str.replace('[$,]', '', regex=True).astype(float)
# Fix boolean columns stored as 0/1 integers
df['is_active'] = df['is_active'].astype(bool)
# Downcast numeric types to save memory
df['age'] = pd.to_numeric(df['age'], errors='coerce').astype('Int16') # nullable int
# Convert high-cardinality string to category for efficiency
df['country'] = df['country'].astype('category')
print(df.dtypes) # verify after cleaning
Standardising Categorical Values
Inconsistent string values in categorical columns are one of the most common sources of dirty data. A single column might contain "United States", "US", "U.S.A.", and "usa" all meaning the same thing.
# Normalise case and whitespace first
df['country'] = df['country'].str.strip().str.lower()
# Check unique values after normalisation
print(df['country'].value_counts())
# Map known variants to canonical values
country_map = {
'us': 'united states',
'u.s.a.': 'united states',
'usa': 'united states',
'uk': 'united kingdom',
'gb': 'united kingdom',
}
df['country'] = df['country'].replace(country_map)
# Use fuzzy matching for harder cases (requires rapidfuzz library)
from rapidfuzz import process, fuzz
canonical = ['united states', 'united kingdom', 'germany', 'france']
def fuzzy_match(value, choices, threshold=85):
match, score, _ = process.extractOne(value, choices, scorer=fuzz.token_sort_ratio)
return match if score >= threshold else None
df['country_clean'] = df['country'].apply(lambda x: fuzzy_match(x, canonical))
Removing Duplicates
Duplicates can arise from system retries, failed de-duplication in ETL pipelines, or accidental double-counting in joins. There are exact duplicates (identical rows) and near-duplicates (same entity, slightly different records).
# Exact duplicates
print(f'Duplicate rows: {df.duplicated().sum()}')
df = df.drop_duplicates() # keeps first occurrence by default
# Duplicates on a subset of columns (e.g., same customer_id and order_date)
df = df.drop_duplicates(subset=['customer_id', 'order_date'], keep='last')
# Near-duplicates — keep the most recent record per customer
df = df.sort_values('updated_at', ascending=False)
df = df.drop_duplicates(subset='customer_id', keep='first')
# Verify
print(f'Rows after dedup: {len(df)}')
print(f'Unique customers: {df['customer_id'].nunique()}')
Data Validation and Schema Enforcement
Validation ensures your cleaned data conforms to expected rules before it enters a model or dashboard. Catching violations early prevents silent errors downstream.
Validation Rule | Example Check | Action on Failure |
|---|---|---|
Range check | Age must be between 0 and 120 | Flag or drop; investigate if large proportion fails |
Referential integrity | Every order must have a matching customer_id | Drop orphaned records; log for upstream fix |
Uniqueness | order_id must be unique | De-duplicate; alert pipeline owner |
Non-null constraint | revenue must not be null | Drop or impute; block record from analysis |
Format check | Email matches regex pattern | Mask or nullify invalid values |
Business rule | Order date must be after customer signup date | Flag for manual review; remove from cohort analyses |
# Using Great Expectations for automated validation
import great_expectations as ge
df_ge = ge.from_pandas(df)
# Define expectations
df_ge.expect_column_values_to_not_be_null('customer_id')
df_ge.expect_column_values_to_be_between('age', min_value=0, max_value=120)
df_ge.expect_column_values_to_be_unique('order_id')
df_ge.expect_column_values_to_match_regex('email', r'^[\w.+-]+@[\w-]+\.[a-z]{2,}$')
# Run validation and get results
results = df_ge.validate()
print(results['statistics']) # shows how many expectations passed/failed
Building a Reusable Cleaning Pipeline
Rather than cleaning data ad hoc in a notebook, a production pipeline applies cleaning steps in a consistent, documented, and repeatable order. Here is a minimal example using a function-per-step pattern:
import pandas as pd
import numpy as np
def load_data(path):
return pd.read_csv(path)
def drop_duplicates(df):
before = len(df)
df = df.drop_duplicates(subset='order_id')
print(f'Duplicates removed: {before - len(df)}')
return df
def fix_types(df):
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')
return df
def impute_missing(df):
df['revenue'].fillna(df['revenue'].median(), inplace=True)
df['country'].fillna('unknown', inplace=True)
return df
def remove_outliers(df, col, lower_q=0.01, upper_q=0.99):
lo = df[col].quantile(lower_q)
hi = df[col].quantile(upper_q)
return df[(df[col] >= lo) & (df[col] <= hi)]
def run_pipeline(path):
df = load_data(path)
df = drop_duplicates(df)
df = fix_types(df)
df = impute_missing(df)
df = remove_outliers(df, 'revenue')
return df
clean_df = run_pipeline('raw_orders.csv')
Summary
Data cleaning is not a one-time task — it is a continuous discipline. Every dataset has its own quality issues shaped by how it was collected, stored, and transported. The most effective analysts develop a systematic profiling habit: always inspect distributions, missing rates, type consistency, and value uniqueness before any analysis. Imputation and outlier handling require judgement, not just code — the wrong strategy can introduce bias that is harder to detect than the original dirty data. Documenting every cleaning decision and building reusable pipeline functions pays dividends when the data changes or a colleague needs to reproduce your work.
Create a free reader account to keep reading.