## What Is Data Governance?
Data governance is the set of policies, processes, roles, and standards that ensure data is accurate, consistent, secure, and used appropriately across an organization. It answers fundamental questions: Who owns this data? Who can access it? How should it be defined? How do we know it's correct? Without governance, organizations accumulate conflicting definitions ("revenue" means different things to Finance and Sales), siloed datasets, compliance risks, and decisions based on data nobody trusts.
Data quality is governance in practice — the ongoing measurement and improvement of data's fitness for use. The two disciplines are inseparable: governance defines the rules; data quality enforces them.
## The Six Dimensions of Data Quality
| Dimension | Definition | Example Failure | How to Measure |
|---|---|---|---|
| Completeness | Required fields are populated | 30% of customer records missing email address | % of non-null values in required columns |
| Accuracy | Values correctly represent reality | Product prices differ from ERP system | Comparison against authoritative source |
| Consistency | Same data is identical across systems | Customer count differs between CRM and data warehouse | Cross-system record matching |
| Timeliness | Data is current when needed | Daily sales report still shows yesterday's data at 2pm | Data freshness lag; SLA breach rate |
| Validity | Values conform to expected format and range | Age = -5, ZIP code = "ABCDE" | % of values passing format/range constraints |
| Uniqueness | No unintended duplicates | Same customer appears 3 times with different IDs | Duplicate rate on primary key or natural key |
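The pipeline example later in this section shows code for completeness, uniqueness, validity, and timeliness; consistency typically requires comparing the same entities across two systems. A minimal sketch, assuming two hypothetical extracts (CRM and warehouse) already loaded as DataFrames:

```python
import pandas as pd

# Hypothetical extracts: customer IDs as seen by the CRM and by the warehouse.
crm = pd.DataFrame({'customer_id': [1, 2, 3, 4, 5]})
warehouse = pd.DataFrame({'customer_id': [1, 2, 3, 5, 6]})

crm_ids = set(crm['customer_id'])
wh_ids = set(warehouse['customer_id'])

# Records present in one system but not the other indicate inconsistency.
only_in_crm = crm_ids - wh_ids
only_in_wh = wh_ids - crm_ids
match_rate = len(crm_ids & wh_ids) / len(crm_ids | wh_ids)

print(f"In CRM only: {only_in_crm}, in warehouse only: {only_in_wh}")
print(f"Cross-system match rate: {match_rate:.1%}")  # 66.7% for this toy data
```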
## Data Governance Roles and Responsibilities
Effective governance requires clearly defined human roles, not just technical systems.
| Role | Responsibilities | Typical Owner |
|---|---|---|
| Data Owner | Accountable for data quality and access policies in their domain; approves access requests | Business VP or Director |
| Data Steward | Day-to-day enforcement of data standards; resolves quality issues; maintains business glossary | Senior Analyst or Manager |
| Data Engineer | Implements pipelines, quality checks, and lineage tracking in code | Engineering team |
| Data Analyst | Consumes governed data; raises quality issues; documents findings | Analytics team |
| Chief Data Officer (CDO) | Sets enterprise data strategy; sponsors governance program; resolves cross-domain disputes | C-suite |
| Data Governance Council | Cross-functional body that sets standards, resolves conflicts, and reviews policy | Owners + CDO + Legal + IT |
## The Business Glossary and Data Catalog
The business glossary is a centralized list of agreed-upon definitions for business terms. Without it, "active customer," "churn," and "monthly recurring revenue" mean different things to different teams, making cross-functional reports impossible to reconcile. A good glossary entry includes the term name, definition, owner, related terms, and examples of correct and incorrect usage.
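To make the shape of an entry concrete, here is a minimal sketch that models a glossary entry as a Python dataclass; the class and the example values are illustrative, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class GlossaryEntry:
    """One agreed-upon business term, mirroring the fields listed above."""
    term: str
    definition: str
    owner: str
    related_terms: list[str] = field(default_factory=list)
    correct_usage: str = ""
    incorrect_usage: str = ""

# Hypothetical entry resolving the "active customer" ambiguity.
active_customer = GlossaryEntry(
    term="Active customer",
    definition="A customer with at least one completed order in the last 90 days.",
    owner="VP of Sales Operations",
    related_terms=["churn", "customer"],
    correct_usage="Count of active customers as of the report date.",
    incorrect_usage="Any customer with a login event in the last year.",
)
```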
The data catalog extends the glossary to the technical layer — it inventories datasets, tables, and columns, and links each to its business definition, lineage, owner, and quality metrics. Modern data catalogs (Alation, Collibra, Atlan, dbt docs, Google Data Catalog) allow analysts to search for data, understand its history, and trust its provenance before using it in a report.
## Data Lineage
Data lineage tracks the full journey of data from its source through every transformation to its final use. It answers: Where did this number come from? What transformations changed it? Which reports depend on this table? Lineage is critical for impact analysis (if we change this source table, what breaks?), debugging (why did this metric change?), and compliance (can we trace every value in this GDPR report back to its source?).
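To make impact analysis concrete, here is a minimal sketch that walks a lineage graph downstream from a changed table. The graph is a hypothetical adjacency map from each table to its direct consumers; real lineage tools build this map automatically:

```python
from collections import deque

# Hypothetical lineage: each table maps to its direct downstream dependents.
lineage = {
    'raw_orders': ['stg_orders'],
    'stg_orders': ['fct_orders'],
    'fct_orders': ['daily_sales_report', 'finance_dashboard'],
}

def downstream_of(table: str, graph: dict) -> set:
    """Breadth-first walk: everything that transitively depends on `table`."""
    impacted, queue = set(), deque([table])
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

# "If we change raw_orders, what breaks?"
print(sorted(downstream_of('raw_orders', lineage)))
# ['daily_sales_report', 'fct_orders', 'finance_dashboard', 'stg_orders']
```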
Tools like dbt automatically generate lineage for SQL transformations from the dependencies declared with `ref()`. Data observability platforms (Monte Carlo, Acceldata, Bigeye) extend lineage to runtime monitoring, alerting when table row counts, distributions, or freshness deviate from expectations.
## Implementing Data Quality Checks in Code
Quality checks belong in code, not spreadsheets. They should run automatically at ingestion, at transformation, and before delivery to downstream consumers.
```python
import pandas as pd
from datetime import datetime


def run_quality_checks(df: pd.DataFrame, table_name: str) -> dict:
    """
    Run a standard suite of data quality checks on a DataFrame.
    Returns a dict of check results with pass/fail status.
    """
    results = {}
    total_rows = len(df)

    # 1. Completeness: required columns must have no nulls
    required_cols = ['customer_id', 'order_date', 'amount']
    for col in required_cols:
        if col in df.columns:
            null_rate = df[col].isna().mean()
            results[f'completeness_{col}'] = {
                'passed': null_rate == 0,
                'null_rate': round(float(null_rate), 4),
                'threshold': 0.0,
            }

    # 2. Uniqueness: primary key must not repeat
    if 'order_id' in df.columns:
        dup_rate = df['order_id'].duplicated().mean()
        results['uniqueness_order_id'] = {
            'passed': dup_rate == 0,
            'duplicate_rate': round(float(dup_rate), 4),
        }

    # 3. Validity: amount must be positive
    if 'amount' in df.columns:
        invalid_rate = (df['amount'] <= 0).mean()
        results['validity_amount_positive'] = {
            'passed': invalid_rate == 0,
            'invalid_rate': round(float(invalid_rate), 4),
        }

    # 4. Timeliness: most recent order must be at most 1 day old
    if 'order_date' in df.columns:
        max_date = pd.to_datetime(df['order_date']).max()
        lag_days = (datetime.now() - max_date).days
        results['timeliness_order_date'] = {
            'passed': lag_days <= 1,
            'lag_days': lag_days,
            'threshold_days': 1,
        }

    # 5. Row count anomaly: flag an unexpected drop to zero rows
    results['row_count'] = {
        'row_count': total_rows,
        'passed': total_rows > 0,
    }

    # Summary: count the checks *before* adding the summary entry itself
    total_checks = len(results)
    passed = sum(1 for r in results.values() if r.get('passed', False))
    results['summary'] = {
        'table': table_name,
        'total_checks': total_checks,
        'passed': passed,
        'failed': total_checks - passed,
        'run_at': datetime.now().isoformat(),
    }
    return results


# Usage
df = pd.read_parquet('s3://my-bucket/orders/2024-03/')
results = run_quality_checks(df, 'orders')
failed = [k for k, v in results.items()
          if k != 'summary' and not v.get('passed', True)]
if failed:
    print(f"QUALITY FAILURES: {failed}")
```
## dbt Tests for Data Quality
dbt (data build tool) has built-in test types that run automatically after each model build. Generic tests check common patterns across any column; singular tests are custom SQL assertions for complex business rules.
```yaml
# schema.yml — dbt generic tests
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']
      - name: amount
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
```
Run `dbt test` after every build, or use `dbt build`, which interleaves models with their tests and skips downstream models when a test fails, preventing bad data from reaching dashboards.
## Data Quality Metrics and SLAs
Data quality should be measured, tracked over time, and tied to service level agreements (SLAs). A data quality SLA might state: "The orders table will be refreshed by 6am UTC every day, with zero null order_ids and a duplicate rate below 0.01%." Breaches trigger alerts to the data owner. Tracking quality metrics over time reveals systemic issues — a gradual increase in null rates often signals an upstream schema change. The table below lists common metrics; a minimal breach-check sketch follows it.
| Metric | Definition | Typical SLA |
|---|---|---|
| Null rate | % of null values in required columns | 0% for primary keys; <5% for optional fields |
| Duplicate rate | % of duplicate records on primary/natural key | 0% for transactional tables |
| Freshness lag | Hours since last successful load | Depends on use case; often <24h or <1h |
| Row count variance | % change in row count vs. prior period | Alert if >20% deviation from 7-day average |
| Schema drift rate | Frequency of unexpected column changes | Zero unannounced changes to critical tables |
| Referential integrity rate | % of foreign keys with matching primary key records | 100% for enforced relationships |
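As a sketch of how two of these metrics might be checked in a scheduled job (the 6am deadline, 20% variance threshold, and example numbers are assumptions taken from the sample SLA and table above):

```python
from datetime import datetime, timezone

def check_sla(last_load_utc: datetime, row_count: int,
              row_count_7day_avg: float,
              deadline_hour_utc: int = 6,
              variance_threshold: float = 0.20) -> list[str]:
    """Return a list of SLA breach messages (an empty list means all good)."""
    breaches = []
    now = datetime.now(timezone.utc)

    # Freshness: today's load must have landed by the deadline hour.
    deadline = now.replace(hour=deadline_hour_utc, minute=0,
                           second=0, microsecond=0)
    if now >= deadline and last_load_utc.date() < now.date():
        breaches.append(
            f"freshness: last load {last_load_utc.isoformat()} "
            f"missed the {deadline_hour_utc}:00 UTC deadline")

    # Row count variance: alert if >20% deviation from the 7-day average.
    deviation = abs(row_count - row_count_7day_avg) / row_count_7day_avg
    if deviation > variance_threshold:
        breaches.append(
            f"row count: {deviation:.0%} deviation from 7-day average")

    return breaches

# Example: row count dropped 21% vs. the 7-day average, so at least
# the variance breach fires.
last_load = datetime(2024, 3, 1, 23, 50, tzinfo=timezone.utc)
print(check_sla(last_load, row_count=7_900, row_count_7day_avg=10_000))
```

In practice such checks run on a scheduler (Airflow, cron, or a dbt job) and route breach messages to the data owner recorded in the catalog.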
## Master Data Management
Master data management (MDM) ensures that core business entities — customers, products, employees, locations — have a single authoritative definition shared across all systems. Without MDM, the same customer appears in the CRM, ERP, and support system with three different IDs and slightly different names, making a unified customer view impossible. MDM solutions create a "golden record" for each entity by deduplicating, matching, and merging records from multiple sources. This is technically hard (fuzzy matching, survivorship rules) and politically hard (which system's version of the truth wins?).
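As a toy illustration of the matching step, the sketch below scores name similarity with the standard library's difflib; production MDM uses far richer matching (multiple attributes, blocking, trained models) and explicit survivorship rules:

```python
from difflib import SequenceMatcher

# Hypothetical customer records from three systems.
records = [
    {'source': 'crm',     'id': 'C-101', 'name': 'Acme Corporation'},
    {'source': 'erp',     'id': '9031',  'name': 'ACME Corp.'},
    {'source': 'support', 'id': 'acme1', 'name': 'Acme Corp'},
]

def similarity(a: str, b: str) -> float:
    """Rough fuzzy match score in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Pairwise comparison: pairs above the threshold are candidate duplicates
# to merge into a single golden record.
THRESHOLD = 0.6
for i, r1 in enumerate(records):
    for r2 in records[i + 1:]:
        score = similarity(r1['name'], r2['name'])
        if score >= THRESHOLD:
            print(f"Candidate match ({score:.2f}): "
                  f"{r1['source']}:{r1['id']} <-> {r2['source']}:{r2['id']}")
```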
## Privacy and Regulatory Compliance
Data governance must account for legal obligations. GDPR (Europe), CCPA (California), HIPAA (US healthcare), and similar regulations impose requirements on how personal data is collected, stored, processed, and deleted. Key governance practices for compliance include data classification (tagging columns as PII, sensitive, or public), data minimization (don't collect what you don't need), retention policies (automated deletion after N years), right-to-erasure workflows, and audit logging of who accessed what data when.
```sql
-- SQL: identify and tag PII columns in a data catalog audit
SELECT
    table_schema,
    table_name,
    column_name,
    data_type,
    CASE
        WHEN LOWER(column_name) LIKE '%email%'    THEN 'PII_EMAIL'
        WHEN LOWER(column_name) LIKE '%phone%'    THEN 'PII_PHONE'
        WHEN LOWER(column_name) LIKE '%ssn%'      THEN 'PII_SSN'
        WHEN LOWER(column_name) LIKE '%birth%'    THEN 'PII_DOB'
        WHEN LOWER(column_name) LIKE '%address%'  THEN 'PII_ADDRESS'
        WHEN LOWER(column_name) LIKE '%password%' THEN 'SENSITIVE'
        ELSE 'UNCLASSIFIED'
    END AS data_classification
FROM information_schema.columns
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY table_schema, table_name, column_name;
```
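Retention policies can be automated in the same spirit. Below is a minimal sketch against an in-memory SQLite database; the table, columns, and 7-year window are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE customer_events (
        event_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        created_at  TEXT  -- ISO 8601 timestamp
    )
""")
conn.execute("INSERT INTO customer_events VALUES (1, 42, '2015-06-01 00:00:00')")
conn.execute("INSERT INTO customer_events VALUES (2, 42, '2024-01-15 00:00:00')")

# Retention sweep: delete personal data older than the 7-year window.
deleted = conn.execute(
    "DELETE FROM customer_events "
    "WHERE created_at < datetime('now', '-7 years')"
).rowcount
conn.commit()
print(f"Deleted {deleted} expired records")  # audit-log this count in practice
```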
## Operationalizing Data Governance
Governance fails when it's treated as a one-time project or a purely technical initiative. Sustainable governance requires executive sponsorship (a CDO or equivalent with budget and authority), a federated model (domain teams own their data quality, a central team sets standards), tooling that integrates with analyst workflows (dbt, data catalogs, observability platforms), and regular review cycles (quarterly data quality scorecards reviewed by owners). The goal is not perfection — it's a culture where data quality is everyone's responsibility, issues are surfaced quickly, and fixes are systematized.
## Summary
Data governance and data quality are foundational disciplines that determine whether an organization can trust its data. The six quality dimensions — completeness, accuracy, consistency, timeliness, validity, and uniqueness — provide a framework for measurement. Clear ownership (data owners, stewards, councils) provides accountability. Technical tools — dbt tests, data catalogs, observability platforms, quality check pipelines — make standards enforceable at scale. Organizations that invest in governance build a compound advantage: every new dataset and report starts from a higher baseline of trust, enabling faster and more confident decision-making.