What Is Data Quality?
Data quality refers to the degree to which data is fit for its intended analytical use. Poor data quality is one of the most common causes of incorrect analysis, misinformed decisions, and eroded trust in analytics teams. A data analyst who understands data quality — how to measure it, where it breaks, and how to fix it — produces more reliable work and builds stakeholder confidence faster than one who treats data as given.
Data quality is not a binary property. It is multidimensional: data can be accurate on one dimension while failing on another. The most widely used framework covers six dimensions:
Dimension | Definition | Example Failure |
|---|---|---|
Accuracy | Data correctly reflects the real-world entity it represents | A customer's recorded age is 200 years due to a date parsing bug |
Completeness | All required fields are populated; no unexpected nulls | 30% of order records are missing the shipping country field |
Consistency | The same entity has the same value across systems and tables | A user's email in the CRM differs from the email in the billing system |
Timeliness | Data is available when needed and reflects the current state | Yesterday's sales data has not arrived in the warehouse by 9 AM |
Uniqueness | Each real-world entity appears exactly once; no duplicates | A user signed up twice with different emails; both records exist |
Validity | Values conform to defined formats, ranges, or reference sets | A status field contains the value "actve" instead of "active" |
Sources of Data Quality Problems
Understanding where quality breaks makes it easier to fix it at the source rather than patching it downstream:
Source | Description | Example |
|---|---|---|
Data entry errors | Human mistakes during manual data input | A support agent types the wrong account ID into a ticket |
Schema changes | A source system changes a field's name, type, or semantics without notifying downstream consumers | A renamed column breaks a pipeline and causes nulls in the warehouse |
Pipeline failures | ETL/ELT jobs fail silently, delivering partial or stale data | A daily sync job fails at 2 AM; analysts see yesterday's data all day |
Integration bugs | Incorrect join logic, unit mismatches, or timezone offsets between systems | Revenue is in USD in one table and EUR in another; a join without conversion doubles or halves values |
Business logic drift | The definition of a metric changes but old data is not backfilled | "Active user" is redefined from 30-day to 7-day; historical charts break |
Data Profiling
Data profiling is the process of examining a dataset to understand its structure, content, and statistical characteristics. It is the first step in any data quality assessment. Key profiling statistics include:
Statistic | What It Reveals |
|---|---|
Row count and growth rate | Whether data volume is as expected; sudden drops suggest pipeline failures |
Null rate per column | Completeness; columns with unexpected nulls signal upstream issues |
Distinct value count | Cardinality; helps detect duplicate keys or enumerations with unexpected values |
Min / Max / Mean / Stddev | Outliers, impossible values (negative ages, future dates) |
Value distribution | Unexpected skew; a column that should be uniform but has 90% of records in one category |
Format consistency | Mixed date formats, inconsistent casing, trailing whitespace |
In SQL, a quick profile of a column looks like:
SELECT
COUNT(*) AS total_rows,
COUNT(email) AS non_null_emails,
COUNT(DISTINCT email) AS unique_emails,
SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_count,
MIN(created_at) AS earliest,
MAX(created_at) AS latest
FROM users;
Data Validation Checks
Validation checks are assertions about the data that must be true. They are typically implemented at two levels: in the transformation layer (dbt tests, SQL assertions) and in the ingestion layer (source freshness checks, row count monitors).
2
Check Type | SQL Example | dbt Test |
|---|---|---|
Not null | WHERE id IS NULL | not_null |
Uniqueness | HAVING COUNT(*) > 1 on primary key | unique |
Accepted values | WHERE status NOT IN ('active','inactive','pending') | accepted_values |
Referential integrity | LEFT JOIN to parent; WHERE parent.id IS NULL | relationships |
Range check | WHERE age < 0 OR age > 120 | Custom dbt test or Great Expectations |
Freshness | WHERE MAX(updated_at) < NOW() - INTERVAL '2 hours' | dbt source freshness |
Row count threshold | HAVING COUNT(*) < expected_minimum | Custom assertion |
Handling Bad Data
When validation catches bad data, there are several strategies for handling it. The choice depends on the severity of the issue and business requirements:
Strategy | Description | When to Use |
|---|---|---|
Block the pipeline | Fail the job and alert; do not load bad data downstream | Critical fields; financial data; compliance requirements |
Quarantine | Load bad records to a separate table for review without blocking good records | High-volume pipelines where most records are good |
Imputation | Fill missing values with a reasonable default (median, mode, or domain-specific default) | Non-critical fields where a best estimate is acceptable |
Flag and document | Add a data_quality_flag column; let analysts filter; document in the data catalog | Known issues that cannot be fixed at source; self-service analytics environments |
Fix at source | Correct the upstream system or process that generates bad data | Always preferred when feasible; permanent fix rather than downstream workaround |
Summary
Data quality is not a one-time cleanup — it is an ongoing discipline. High-quality analytical work starts with profiling the data to understand its actual state, validating it systematically against defined expectations, and having a clear protocol for handling failures. Tools like dbt bring software engineering practices (testing, version control, documentation) to SQL transformations, making quality checks a first-class part of the analytics workflow rather than an afterthought. Data analysts who build quality into their pipelines spend less time debugging incorrect dashboards and more time generating insights that stakeholders trust.
Create a free reader account to keep reading.