## What Is Data Governance?
Data governance is the set of policies, processes, roles, and standards that ensure data is accurate, consistent, secure, and used appropriately across an organization. It answers fundamental questions: Who owns this data? Who can access it? How should it be defined? How do we know it's correct? Without governance, organizations accumulate conflicting definitions ("revenue" means different things to Finance and Sales), siloed datasets, compliance risks, and decisions based on data nobody trusts.
Data quality is governance in practice — the ongoing measurement and improvement of data's fitness for use. The two disciplines are inseparable: governance defines the rules; data quality enforces them.
## The Six Dimensions of Data Quality
| Dimension | Definition | Example Failure | How to Measure |
|---|---|---|---|
| Completeness | Required fields are populated | 30% of customer records missing email address | % of non-null values in required columns |
| Accuracy | Values correctly represent reality | Product prices differ from ERP system | Comparison against authoritative source |
| Consistency | Same data is identical across systems | Customer count differs between CRM and data warehouse | Cross-system record matching |
| Timeliness | Data is current when needed | Daily sales report still shows yesterday's data at 2pm | Data freshness lag; SLA breach rate |
| Validity | Values conform to expected format and range | Age = -5, ZIP code = "ABCDE" | % of values passing format/range constraints |
| Uniqueness | No unintended duplicates | Same customer appears 3 times with different IDs | Duplicate rate on primary key or natural key |
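The pipeline example later in this section shows code for completeness, uniqueness, validity, and timeliness; consistency typically requires comparing the same entities across two systems. A minimal sketch, assuming two hypothetical extracts (CRM and warehouse) already loaded as DataFrames:

```python
import pandas as pd

# Hypothetical extracts: customer IDs as seen by the CRM and by the warehouse.
crm = pd.DataFrame({'customer_id': [1, 2, 3, 4, 5]})
warehouse = pd.DataFrame({'customer_id': [1, 2, 3, 5, 6]})

crm_ids = set(crm['customer_id'])
wh_ids = set(warehouse['customer_id'])

# Records present in one system but not the other indicate inconsistency.
only_in_crm = crm_ids - wh_ids
only_in_wh = wh_ids - crm_ids
match_rate = len(crm_ids & wh_ids) / len(crm_ids | wh_ids)

print(f"In CRM only: {only_in_crm}, in warehouse only: {only_in_wh}")
print(f"Cross-system match rate: {match_rate:.1%}")  # 66.7% for this toy data
```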
## Data Governance Roles and Responsibilities
Effective governance requires clearly defined human roles, not just technical systems.
| Role | Responsibilities | Typical Owner |
|---|---|---|
| Data Owner | Accountable for data quality and access policies in their domain; approves access requests | Business VP or Director |
| Data Steward | Day-to-day enforcement of data standards; resolves quality issues; maintains business glossary | Senior Analyst or Manager |
| Data Engineer | Implements pipelines, quality checks, and lineage tracking in code | Engineering team |
| Data Analyst | Consumes governed data; raises quality issues; documents findings | Analytics team |
| Chief Data Officer (CDO) | Sets enterprise data strategy; sponsors governance program; resolves cross-domain disputes | C-suite |
| Data Governance Council | Cross-functional body that sets standards, resolves conflicts, and reviews policy | Owners + CDO + Legal + IT |
## The Business Glossary and Data Catalog
The business glossary is a centralized list of agreed-upon definitions for business terms. Without it, "active customer," "churn," and "monthly recurring revenue" mean different things to different teams, making cross-functional reports impossible to reconcile. A good glossary entry includes the term name, definition, owner, related terms, and examples of correct and incorrect usage.
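To make the shape of an entry concrete, here is a minimal sketch that models a glossary entry as a Python dataclass; the class and the example values are illustrative, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class GlossaryEntry:
    """One agreed-upon business term, mirroring the fields listed above."""
    term: str
    definition: str
    owner: str
    related_terms: list[str] = field(default_factory=list)
    correct_usage: str = ""
    incorrect_usage: str = ""

# Hypothetical entry resolving the "active customer" ambiguity.
active_customer = GlossaryEntry(
    term="Active customer",
    definition="A customer with at least one completed order in the last 90 days.",
    owner="VP of Sales Operations",
    related_terms=["churn", "customer"],
    correct_usage="Count of active customers as of the report date.",
    incorrect_usage="Any customer with a login event in the last year.",
)
```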
The data catalog extends the glossary to the technical layer — it inventories datasets, tables, and columns, and links each to its business definition, lineage, owner, and quality metrics. Modern data catalogs (Alation, Collibra, Atlan, dbt docs, Google Data Catalog) allow analysts to search for data, understand its history, and trust its provenance before using it in a report.
## Data Lineage
Data lineage tracks the full journey of data from its source through every transformation to its final use. It answers: Where did this number come from? What transformations changed it? Which reports depend on this table? Lineage is critical for impact analysis (if we change this source table, what breaks?), debugging (why did this metric change?), and compliance (can we trace every value in this GDPR report back to its source?).
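To make impact analysis concrete, here is a minimal sketch that walks a lineage graph downstream from a changed table. The graph is a hypothetical adjacency map from each table to its direct consumers; real lineage tools build this map automatically:

```python
from collections import deque

# Hypothetical lineage: each table maps to its direct downstream dependents.
lineage = {
    'raw_orders': ['stg_orders'],
    'stg_orders': ['fct_orders'],
    'fct_orders': ['daily_sales_report', 'finance_dashboard'],
}

def downstream_of(table: str, graph: dict) -> set:
    """Breadth-first walk: everything that transitively depends on `table`."""
    impacted, queue = set(), deque([table])
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

# "If we change raw_orders, what breaks?"
print(sorted(downstream_of('raw_orders', lineage)))
# ['daily_sales_report', 'fct_orders', 'finance_dashboard', 'stg_orders']
```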
Tools like dbt automatically generate lineage for SQL transformations from the dependencies declared with `ref()`. Data observability platforms (Monte Carlo, Acceldata, Bigeye) extend lineage to runtime monitoring, alerting when table row counts, distributions, or freshness deviate from expectations.
## Implementing Data Quality Checks in Code
Quality checks belong in code, not spreadsheets. They should run automatically at ingestion, at transformation, and before delivery to downstream consumers.
```python
import pandas as pd
from datetime import datetime


def run_quality_checks(df: pd.DataFrame, table_name: str) -> dict:
    """
    Run a standard suite of data quality checks on a DataFrame.
    Returns a dict of check results with pass/fail status.
    """
    results = {}
    total_rows = len(df)

    # 1. Completeness: required columns must have no nulls
    required_cols = ['customer_id', 'order_date', 'amount']
    for col in required_cols:
        if col in df.columns:
            null_rate = df[col].isna().mean()
            results[f'completeness_{col}'] = {
                'passed': null_rate == 0,
                'null_rate': round(float(null_rate), 4),
                'threshold': 0.0,
            }

    # 2. Uniqueness: primary key must not repeat
    if 'order_id' in df.columns:
        dup_rate = df['order_id'].duplicated().mean()
        results['uniqueness_order_id'] = {
            'passed': dup_rate == 0,
            'duplicate_rate': round(float(dup_rate), 4),
        }

    # 3. Validity: amount must be positive
    if 'amount' in df.columns:
        invalid_rate = (df['amount'] <= 0).mean()
        results['validity_amount_positive'] = {
            'passed': invalid_rate == 0,
            'invalid_rate': round(float(invalid_rate), 4),
        }

    # 4. Timeliness: most recent order must be at most 1 day old
    if 'order_date' in df.columns:
        max_date = pd.to_datetime(df['order_date']).max()
        lag_days = (datetime.now() - max_date).days
        results['timeliness_order_date'] = {
            'passed': lag_days <= 1,
            'lag_days': lag_days,
            'threshold_days': 1,
        }

    # 5. Row count anomaly: flag an unexpected drop to zero rows
    results['row_count'] = {
        'row_count': total_rows,
        'passed': total_rows > 0,
    }

    # Summary: count the checks *before* adding the summary entry itself
    total_checks = len(results)
    passed = sum(1 for r in results.values() if r.get('passed', False))
    results['summary'] = {
        'table': table_name,
        'total_checks': total_checks,
        'passed': passed,
        'failed': total_checks - passed,
        'run_at': datetime.now().isoformat(),
    }
    return results


# Usage
df = pd.read_parquet('s3://my-bucket/orders/2024-03/')
results = run_quality_checks(df, 'orders')
failed = [k for k, v in results.items()
          if k != 'summary' and not v.get('passed', True)]
if failed:
    print(f"QUALITY FAILURES: {failed}")
```
## dbt Tests for Data Quality
dbt (data build tool) has built-in test types that run automatically after each model build. Generic tests check common patterns across any column; singular tests are custom SQL assertions for complex business rules.
```yaml
# schema.yml — dbt generic tests
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']
      - name: amount
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
```
Run `dbt test` after every build, or use `dbt build`, which interleaves models with their tests and skips downstream models when a test fails, preventing bad data from reaching dashboards.
## Data Quality Metrics and SLAs
Data quality should be measured, tracked over time, and tied to service level agreements (SLAs). A data quality SLA might state: "The orders table will be refreshed by 6am UTC every day, with zero null order_ids and a duplicate rate below 0.01%." Breaches trigger alerts to the data owner. Tracking quality metrics over time reveals systemic issues — a gradual increase in null rates often signals an upstream schema change. The table below lists common metrics; a minimal breach-check sketch follows it.
| Metric | Definition | Typical SLA |
|---|---|---|
| Null rate | % of null values in required columns | 0% for primary keys; <5% for optional fields |
| Duplicate rate | % of duplicate records on primary/natural key | 0% for transactional tables |
| Freshness lag | Hours since last successful load | Depends on use case; often <24h or <1h |
| Row count variance | % change in row count vs. prior period | Alert if >20% deviation from 7-day average |
| Schema drift rate | Frequency of unexpected column changes | Zero unannounced changes to critical tables |
| Referential integrity rate | % of foreign keys with matching primary key records | 100% for enforced relationships |
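As a sketch of how two of these metrics might be checked in a scheduled job (the 6am deadline, 20% variance threshold, and example numbers are assumptions taken from the sample SLA and table above):

```python
from datetime import datetime, timezone

def check_sla(last_load_utc: datetime, row_count: int,
              row_count_7day_avg: float,
              deadline_hour_utc: int = 6,
              variance_threshold: float = 0.20) -> list[str]:
    """Return a list of SLA breach messages (an empty list means all good)."""
    breaches = []
    now = datetime.now(timezone.utc)

    # Freshness: today's load must have landed by the deadline hour.
    deadline = now.replace(hour=deadline_hour_utc, minute=0,
                           second=0, microsecond=0)
    if now >= deadline and last_load_utc.date() < now.date():
        breaches.append(
            f"freshness: last load {last_load_utc.isoformat()} "
            f"missed the {deadline_hour_utc}:00 UTC deadline")

    # Row count variance: alert if >20% deviation from the 7-day average.
    deviation = abs(row_count - row_count_7day_avg) / row_count_7day_avg
    if deviation > variance_threshold:
        breaches.append(
            f"row count: {deviation:.0%} deviation from 7-day average")

    return breaches

# Example: row count dropped 21% vs. the 7-day average, so at least
# the variance breach fires.
last_load = datetime(2024, 3, 1, 23, 50, tzinfo=timezone.utc)
print(check_sla(last_load, row_count=7_900, row_count_7day_avg=10_000))
```

In practice such checks run on a scheduler (Airflow, cron, or a dbt job) and route breach messages to the data owner recorded in the catalog.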
## Master Data Management
Master data management (MDM) ensures that core business entities — customers, products, employees, locations — have a single authoritative definition shared across all systems. Without MDM, the same customer appears in the CRM, ERP, and support system with three different IDs and slightly different names, making a unified customer view impossible. MDM solutions create a "golden record" for each entity by deduplicating, matching, and merging records from multiple sources. This is technically hard (fuzzy matching, survivorship rules) and politically hard (which system's version of the truth wins?).
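As a toy illustration of the matching step, the sketch below scores name similarity with the standard library's difflib; production MDM uses far richer matching (multiple attributes, blocking, trained models) and explicit survivorship rules:

```python
from difflib import SequenceMatcher

# Hypothetical customer records from three systems.
records = [
    {'source': 'crm',     'id': 'C-101', 'name': 'Acme Corporation'},
    {'source': 'erp',     'id': '9031',  'name': 'ACME Corp.'},
    {'source': 'support', 'id': 'acme1', 'name': 'Acme Corp'},
]

def similarity(a: str, b: str) -> float:
    """Rough fuzzy match score in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Pairwise comparison: pairs above the threshold are candidate duplicates
# to merge into a single golden record.
THRESHOLD = 0.6
for i, r1 in enumerate(records):
    for r2 in records[i + 1:]:
        score = similarity(r1['name'], r2['name'])
        if score >= THRESHOLD:
            print(f"Candidate match ({score:.2f}): "
                  f"{r1['source']}:{r1['id']} <-> {r2['source']}:{r2['id']}")
```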
## Privacy and Regulatory Compliance
Data governance must account for legal obligations. GDPR (Europe), CCPA (California), HIPAA (US healthcare), and similar regulations impose requirements on how personal data is collected, stored, processed, and deleted. Key governance practices for compliance include data classification (tagging columns as PII, sensitive, or public), data minimization (don't collect what you don't need), retention policies (automated deletion after N years), right-to-erasure workflows, and audit logging of who accessed what data when.
```sql
-- SQL: identify and tag PII columns in a data catalog audit
SELECT
    table_schema,
    table_name,
    column_name,
    data_type,
    CASE
        WHEN LOWER(column_name) LIKE '%email%'    THEN 'PII_EMAIL'
        WHEN LOWER(column_name) LIKE '%phone%'    THEN 'PII_PHONE'
        WHEN LOWER(column_name) LIKE '%ssn%'      THEN 'PII_SSN'
        WHEN LOWER(column_name) LIKE '%birth%'    THEN 'PII_DOB'
        WHEN LOWER(column_name) LIKE '%address%'  THEN 'PII_ADDRESS'
        WHEN LOWER(column_name) LIKE '%password%' THEN 'SENSITIVE'
        ELSE 'UNCLASSIFIED'
    END AS data_classification
FROM information_schema.columns
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY table_schema, table_name, column_name;
```
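Retention policies can be automated in the same spirit. Below is a minimal sketch against an in-memory SQLite database; the table, columns, and 7-year window are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE customer_events (
        event_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        created_at  TEXT  -- ISO 8601 timestamp
    )
""")
conn.execute("INSERT INTO customer_events VALUES (1, 42, '2015-06-01 00:00:00')")
conn.execute("INSERT INTO customer_events VALUES (2, 42, '2024-01-15 00:00:00')")

# Retention sweep: delete personal data older than the 7-year window.
deleted = conn.execute(
    "DELETE FROM customer_events "
    "WHERE created_at < datetime('now', '-7 years')"
).rowcount
conn.commit()
print(f"Deleted {deleted} expired records")  # audit-log this count in practice
```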
## Operationalizing Data Governance
Governance fails when it's treated as a one-time project or a purely technical initiative. Sustainable governance requires executive sponsorship (a CDO or equivalent with budget and authority), a federated model (domain teams own their data quality, a central team sets standards), tooling that integrates with analyst workflows (dbt, data catalogs, observability platforms), and regular review cycles (quarterly data quality scorecards reviewed by owners). The goal is not perfection — it's a culture where data quality is everyone's responsibility, issues are surfaced quickly, and fixes are systematized.
## Summary
Data governance and data quality are foundational disciplines that determine whether an organization can trust its data. The six quality dimensions — completeness, accuracy, consistency, timeliness, validity, and uniqueness — provide a framework for measurement. Clear ownership (data owners, stewards, councils) provides accountability. Technical tools — dbt tests, data catalogs, observability platforms, quality check pipelines — make standards enforceable at scale. Organizations that invest in governance build a compound advantage: every new dataset and report starts from a higher baseline of trust, enabling faster and more confident decision-making.