Data Quality and Validation for Data Analysts

What Is Data Quality?

Data quality refers to the degree to which data is fit for its intended analytical use. Poor data quality is one of the most common causes of incorrect analysis, misinformed decisions, and eroded trust in analytics teams. A data analyst who understands data quality — how to measure it, where it breaks, and how to fix it — produces more reliable work and builds stakeholder confidence faster than one who treats data as given.

Data quality is not a binary property. It is multidimensional: data can be accurate on one dimension while failing on another. The most widely used framework covers six dimensions:

Dimension	Definition	Example Failure
Accuracy	Data correctly reflects the real-world entity it represents	A customer's recorded age is 200 years due to a date parsing bug
Completeness	All required fields are populated; no unexpected nulls	30% of order records are missing the shipping country field
Consistency	The same entity has the same value across systems and tables	A user's email in the CRM differs from the email in the billing system
Timeliness	Data is available when needed and reflects the current state	Yesterday's sales data has not arrived in the warehouse by 9 AM
Uniqueness	Each real-world entity appears exactly once; no duplicates	A user signed up twice with different emails; both records exist
Validity	Values conform to defined formats, ranges, or reference sets	A status field contains the value "actve" instead of "active"

Sources of Data Quality Problems

Understanding where quality breaks makes it easier to fix it at the source rather than patching it downstream:

Source	Description	Example
Data entry errors	Human mistakes during manual data input	A support agent types the wrong account ID into a ticket
Schema changes	A source system changes a field's name, type, or semantics without notifying downstream consumers	A renamed column breaks a pipeline and causes nulls in the warehouse
Pipeline failures	ETL/ELT jobs fail silently, delivering partial or stale data	A daily sync job fails at 2 AM; analysts see yesterday's data all day
Integration bugs	Incorrect join logic, unit mismatches, or timezone offsets between systems	Revenue is in USD in one table and EUR in another; a join without conversion doubles or halves values
Business logic drift	The definition of a metric changes but old data is not backfilled	"Active user" is redefined from 30-day to 7-day; historical charts break

Data Profiling

Data profiling is the process of examining a dataset to understand its structure, content, and statistical characteristics. It is the first step in any data quality assessment. Key profiling statistics include:

Statistic	What It Reveals
Row count and growth rate	Whether data volume is as expected; sudden drops suggest pipeline failures
Null rate per column	Completeness; columns with unexpected nulls signal upstream issues
Distinct value count	Cardinality; helps detect duplicate keys or enumerations with unexpected values
Min / Max / Mean / Stddev	Outliers, impossible values (negative ages, future dates)
Value distribution	Unexpected skew; a column that should be uniform but has 90% of records in one category
Format consistency	Mixed date formats, inconsistent casing, trailing whitespace

In SQL, a quick profile of a column looks like:

SELECT COUNT(*) AS total_rows, COUNT(email) AS non_null_emails, COUNT(DISTINCT email) AS unique_emails, SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_count, MIN(created_at) AS earliest, MAX(created_at) AS latest FROM users;

Data Validation Checks

Validation checks are assertions about the data that must be true. They are typically implemented at two levels: in the transformation layer (dbt tests, SQL assertions) and in the ingestion layer (source freshness checks, row count monitors).

Check Type	SQL Example	dbt Test
Not null	WHERE id IS NULL	not_null
Uniqueness	HAVING COUNT(*) > 1 on primary key	unique
Accepted values	WHERE status NOT IN ('active','inactive','pending')	accepted_values
Referential integrity	LEFT JOIN to parent; WHERE parent.id IS NULL	relationships
Range check	WHERE age < 0 OR age > 120	Custom dbt test or Great Expectations
Freshness	WHERE MAX(updated_at) < NOW() - INTERVAL '2 hours'	dbt source freshness
Row count threshold	HAVING COUNT(*) < expected_minimum	Custom assertion

Handling Bad Data

When validation catches bad data, there are several strategies for handling it. The choice depends on the severity of the issue and business requirements:

Strategy	Description	When to Use
Block the pipeline	Fail the job and alert; do not load bad data downstream	Critical fields; financial data; compliance requirements
Quarantine	Load bad records to a separate table for review without blocking good records	High-volume pipelines where most records are good
Imputation	Fill missing values with a reasonable default (median, mode, or domain-specific default)	Non-critical fields where a best estimate is acceptable
Flag and document	Add a data_quality_flag column; let analysts filter; document in the data catalog	Known issues that cannot be fixed at source; self-service analytics environments
Fix at source	Correct the upstream system or process that generates bad data	Always preferred when feasible; permanent fix rather than downstream workaround

Summary

Data quality is not a one-time cleanup — it is an ongoing discipline. High-quality analytical work starts with profiling the data to understand its actual state, validating it systematically against defined expectations, and having a clear protocol for handling failures. Tools like dbt bring software engineering practices (testing, version control, documentation) to SQL transformations, making quality checks a first-class part of the analytics workflow rather than an afterthought. Data analysts who build quality into their pipelines spend less time debugging incorrect dashboards and more time generating insights that stakeholders trust.