What Is Data Cleaning?
Data cleaning (also called data preprocessing or data wrangling) is the process of detecting and correcting errors, inconsistencies, and missing values in raw datasets before analysis. Studies consistently show that analysts spend 60–80% of their time on data preparation — not analysis itself. A model or chart built on dirty data will produce misleading results regardless of how sophisticated the technique is. Mastering data cleaning is therefore one of the highest-leverage skills a data analyst can develop.
Common Data Quality Problems
Problem Type | Description | Example |
|---|---|---|
Missing values | NULL, NaN, or blank fields where data should exist | Age column has 12% NULL rows |
Duplicate records | The same entity appears more than once | Customer ID 1042 has three rows with identical data |
Inconsistent formats | Same value represented in multiple ways | "USA", "U.S.A.", "United States" in the country column |
Outliers and errors | Values that are implausible or outside expected range | Age = 999, revenue = -50000 |
Wrong data types | A column stored as the wrong type | Date stored as VARCHAR: "January 5th, 2024" |
Referential integrity issues | Foreign keys that point to non-existent records | order.customer_id = 99 but customer 99 does not exist |
Structural issues | Data that violates expected schema or business rules | end_date earlier than start_date |
Handling Missing Values
Strategy | When to Use | Risk |
|---|---|---|
Drop rows | Missing rate is very low (<5%) and rows are missing at random | Loses data; can introduce bias if missingness is not random |
Drop columns | Column has very high missing rate (>60%) with little analytical value | Permanently loses a feature |
Mean/median/mode imputation | Numerical or categorical columns with moderate missingness | Reduces variance; distorts distributions |
Forward/backward fill | Time series data where the previous or next value is a reasonable estimate | Inappropriate for non-sequential data |
Model-based imputation | Missing values have predictable patterns based on other features | Adds complexity; can propagate model errors |
Flag as a category | "Missing" is itself meaningful (e.g., no purchase = no data) | Adds a new category that must be managed downstream |
Deduplication Techniques
Duplicates arise from data ingestion errors, system migrations, or manual data entry. The SQL pattern for exact duplicates uses ROW_NUMBER() to keep only the first occurrence:
Duplicate Type | Detection Method | Resolution |
|---|---|---|
Exact duplicates | GROUP BY all columns; COUNT(*) > 1 | Keep one row using ROW_NUMBER() OVER (PARTITION BY key ORDER BY created_at) |
Near duplicates (fuzzy) | String similarity functions (Levenshtein distance, Jaccard similarity) | Cluster similar records and merge manually or algorithmically |
Logical duplicates | Same business entity, different IDs (e.g., merged companies) | Build a canonical ID mapping table |
Standardizing and Transforming Data
Transformation | Purpose | Example |
|---|---|---|
Case normalization | Eliminate case-sensitive mismatches | UPPER(country) or LOWER(email) |
Trimming whitespace | Remove leading/trailing spaces that cause join failures | TRIM(first_name) |
Date parsing | Convert string dates to proper date types | CAST('2024-01-05' AS DATE) or TO_DATE('Jan 5, 2024', 'Mon DD, YYYY') |
Type casting | Ensure correct data types for calculations | CAST(revenue_str AS NUMERIC) |
Value mapping | Harmonize inconsistent categorical values | CASE WHEN country IN ('USA','U.S.A.') THEN 'United States' END |
Outlier capping (winsorizing) | Limit extreme values without removing rows | LEAST(revenue, PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY revenue)) |
Numeric scaling | Normalize features for ML models | Min-max scaling: (x - min) / (max - min) |
A Practical Data Cleaning Checklist
Step | Action | SQL / Python Approach |
|---|---|---|
1. Profile the data | Count rows, nulls, distinct values, min/max per column | SELECT COUNT(*), COUNT(col), MIN(col), MAX(col), COUNT(DISTINCT col) FROM table |
2. Identify duplicates | Check for exact and key-level duplicates | SELECT id, COUNT(*) FROM table GROUP BY id HAVING COUNT(*) > 1 |
3. Handle nulls | Decide per column: drop, impute, or flag | COALESCE(col, default_value) or df.fillna() |
4. Fix formats | Standardize strings, dates, and codes | TRIM(), LOWER(), TO_DATE(), REGEXP_REPLACE() |
5. Validate ranges | Flag rows outside business-defined valid ranges | SELECT * FROM table WHERE age NOT BETWEEN 0 AND 120 |
6. Check referential integrity | Find orphaned foreign keys | SELECT * FROM orders o LEFT JOIN customers c ON o.customer_id = c.id WHERE c.id IS NULL |
7. Document decisions | Record every cleaning rule and its rationale | Maintain a data quality log or dbt test file |
Data Profiling Tools
Tool | Environment | Key Capability |
|---|---|---|
pandas-profiling / ydata-profiling | Python | Generates full HTML report with distributions, correlations, missing values |
Great Expectations | Python / dbt | Define and run automated data quality tests as code |
dbt tests | SQL / dbt | Built-in tests: not_null, unique, accepted_values, relationships |
OpenRefine | Desktop GUI | Cluster and merge inconsistent string values interactively |
SQL INFORMATION_SCHEMA | SQL | Inspect column types, nullable flags, and table metadata |
Summary
Data cleaning is not a one-time step — it is an ongoing discipline. Clean data requires profiling to understand what exists, systematic rules to fix what is wrong, and automated tests to ensure quality does not degrade over time. Analysts who build repeatable cleaning pipelines — rather than manual fixes — produce results that are reproducible, auditable, and trustworthy. The investment in clean data pays compounding dividends: every downstream analysis, model, and dashboard built on it inherits that reliability.
Create a free reader account to keep reading.