Introduction to Data Cleanup
Raw data is rarely ready for analysis straight away. In the real world, datasets are messy: they contain missing values, duplicate records, formatting inconsistencies, and erroneous entries. Data cleanup — also called data cleaning or data wrangling — is the process of identifying and correcting these problems so that the data is accurate, consistent, and ready for analysis.
Data cleanup is often cited as the most time-consuming part of a data analyst's job, yet it is also among the most important. Poor data quality leads directly to poor analysis. This article covers the core techniques every data analyst needs to master.
Handling Missing Values
Missing data is one of the most common issues in real-world datasets. A value can be missing for many reasons: a user skipped a form field, a sensor failed to record, or data was lost during transfer. Missing values are typically represented as NaN (Not a Number) in pandas or NULL in SQL.
There are several strategies for handling missing values. First, you can drop rows or columns with missing data if the proportion is small and the missing data is not systematic. Second, you can impute missing values — replacing them with a calculated substitute such as the mean, median, or mode of the column. Third, for time-series data, forward fill or backward fill can propagate the nearest known value.
In Python with pandas:
import pandas as pd
df = pd.read_csv('data.csv')
# Drop rows where any value is missing
df_clean = df.dropna()
# Fill missing values in a column with the median
df['age'] = df['age'].fillna(df['age'].median())
# Forward fill for time series
df['temperature'] = df['temperature'].ffill()
The right strategy depends on the context. Dropping data reduces sample size and can introduce bias. Imputation preserves sample size but introduces assumptions. Always understand why data is missing before deciding how to handle it.
Removing Duplicates
Duplicate records occur when the same entity appears more than once in a dataset. This can happen due to data entry errors, system bugs during data migration, or combining datasets from multiple sources. Duplicates inflate counts, distort statistics, and can skew model training.
In pandas, finding and removing duplicates is straightforward:
# Find duplicates
print(df.duplicated().sum())
# Remove duplicates, keeping the first occurrence
df = df.drop_duplicates()
# Remove duplicates based on specific columns
df = df.drop_duplicates(subset=['customer_id', 'order_date'])
In SQL, you can identify duplicates using GROUP BY and HAVING:
SELECT customer_id, COUNT(*) AS count
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 1;
Before removing duplicates, confirm that they are true duplicates and not legitimate repeated records. For example, a customer can place multiple orders — those are not duplicates. Always use meaningful unique identifiers to guide your deduplication logic.
Detecting and Handling Outliers
Outliers are data points that differ significantly from the rest of the dataset. They can result from genuine extreme values (a billionaire in a salary dataset), data entry errors (a height recorded as 1500 cm), or measurement errors. Outliers can have a large effect on statistical measures like the mean and standard deviation.
Common methods for detecting outliers include the Z-score method, the IQR (Interquartile Range) method, and visual inspection using box plots or scatter plots.
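For example, a minimal Z-score check in pandas might look like the sketch below, using the same illustrative 'salary' column as the IQR example that follows; a common rule of thumb flags values more than three standard deviations from the mean.
# Z-score method: flag values more than 3 standard deviations from the mean
mean = df['salary'].mean()
std = df['salary'].std()
z_scores = (df['salary'] - mean) / std
outliers = df[z_scores.abs() > 3]
print(outliers)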
The IQR method:
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Filter out outliers
df_filtered = df[(df['salary'] >= lower_bound) & (df['salary'] <= upper_bound)]
Once detected, you can handle outliers by removing them (if they are errors), capping them at a threshold (winsorization), or transforming the data (e.g., log transformation to reduce skew). The right approach depends on the analysis goals and the nature of the data.
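As a rough sketch of the capping and transformation options, again assuming the 'salary' column from the example above:
import numpy as np
# Cap (winsorize) salaries at the 1st and 99th percentiles
lower_cap = df['salary'].quantile(0.01)
upper_cap = df['salary'].quantile(0.99)
df['salary_capped'] = df['salary'].clip(lower=lower_cap, upper=upper_cap)
# Log transformation to reduce right skew; log1p handles zero values safely
df['salary_log'] = np.log1p(df['salary'])
Capping keeps every row but limits the influence of extreme values, while a log transform changes the scale of the whole column, so document whichever choice you make.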
Standardizing and Transforming Data
Raw data often comes in inconsistent formats that need to be standardized before analysis. Common transformation tasks include converting date formats (e.g., "01/04/2024" vs. "2024-04-01"), normalizing text case (making all strings lowercase or title case), parsing composite fields (splitting a "Full Name" column into "First Name" and "Last Name"), and converting units (miles to kilometers, Fahrenheit to Celsius).
In pandas, string standardization is easy:
# Lowercase all text in a column
df['city'] = df['city'].str.lower().str.strip()
# Parse dates
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
# Split a full name column
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', n=1, expand=True)
Data type conversions are also essential. Columns that represent numbers but are stored as strings (due to commas or dollar signs) need to be converted before arithmetic operations can be performed on them.
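A minimal sketch of such a conversion, assuming a hypothetical 'price' column stored as strings like "$1,200.50":
# Remove currency symbols and thousands separators, then convert to numbers
df['price'] = df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
# Invalid entries become NaN rather than raising an error
df['price'] = pd.to_numeric(df['price'], errors='coerce')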
Validating Data Quality
After cleaning, it is important to validate that the data meets the expected standards. Validation checks include range checks (ages between 0 and 120), uniqueness checks (primary keys have no duplicates), referential integrity checks (every foreign key refers to an existing record), and business rule checks (end date must be after start date).
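Several of these checks can be written directly in pandas. The sketch below reuses the 'age' and 'customer_id' columns from earlier examples; the 'start_date' and 'end_date' columns are illustrative assumptions.
# Range check: ages should fall between 0 and 120
assert df['age'].between(0, 120).all(), "age values out of range"
# Uniqueness check: the primary key column has no duplicates
assert df['customer_id'].is_unique, "duplicate customer_id values"
# Business rule check (assumed columns): end date must be after start date
assert (df['end_date'] > df['start_date']).all(), "end_date not after start_date"
Referential integrity checks follow the same pattern, comparing a foreign key column against the key column of its reference table.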
Libraries like Great Expectations and Pandera in Python allow analysts to define and run automated data quality tests as part of a data pipeline. This makes the cleanup process repeatable and auditable.
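For illustration, a small Pandera schema expressing the range and uniqueness checks might look like the sketch below (assuming Pandera is installed; column names follow the earlier examples):
import pandera as pa

schema = pa.DataFrameSchema({
    "age": pa.Column(float, pa.Check.in_range(0, 120), nullable=True),
    "customer_id": pa.Column(str, unique=True),
})
# Raises a SchemaError describing any failed check
validated = schema.validate(df)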
Documenting the Cleaning Process
Every data cleaning decision should be documented. When you drop rows, record why. When you impute missing values, record the method. When you remove outliers, explain the threshold. This documentation serves two purposes: reproducibility (anyone can replicate your clean dataset from the raw data) and transparency (stakeholders can understand and trust your analysis).
A good practice is to maintain a data cleaning log or notebook that records the original shape of the data, each transformation applied, and the final shape of the cleaned dataset. Tools like Jupyter Notebook make this natural, as code and narrative explanation sit side by side.
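A minimal sketch of such a log in plain pandas, where the helper name log_step is purely illustrative:
cleaning_log = []

def log_step(df, description):
    # Record the step and the resulting shape of the DataFrame
    cleaning_log.append({"step": description, "rows": df.shape[0], "columns": df.shape[1]})
    return df

df = log_step(pd.read_csv('data.csv'), "load raw data")
df = log_step(df.drop_duplicates(), "remove duplicate rows")
df = log_step(df.dropna(subset=['age']), "drop rows missing age")
print(pd.DataFrame(cleaning_log))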
Conclusion
Data cleanup is a foundational skill for every data analyst. Mastering how to handle missing values, remove duplicates, detect outliers, and standardize formats ensures that the datasets you work with are reliable and trustworthy. While it may not be the most glamorous part of the job, thorough data cleaning is what separates insightful analysis from misleading conclusions. Invest the time to do it well.