The Role of a Data Warehouse in an Analytics Stack
A data warehouse is a centralised repository that integrates data from multiple source systems, organises it for analytical queries, and serves as the authoritative source of truth for business reporting. Unlike operational databases, which are optimised for transactional reads and writes on individual records, data warehouses are optimised for aggregating millions to billions of rows across many columns — exactly the query pattern of analytical workloads. Modern cloud-based warehouses such as Snowflake, Google BigQuery, Amazon Redshift, and Databricks SQL have largely replaced on-premise systems and changed how data pipelines are built.
OLTP vs. OLAP: Two Different Purposes
2
Dimension | OLTP (Operational) | OLAP (Analytical) |
|---|---|---|
Primary purpose | Record transactions: inserts, updates, deletes of individual rows | Analyse trends: aggregate queries across large row sets |
Query pattern | Many short queries on a few rows each | Fewer long queries scanning many rows and columns |
Data freshness | Real-time; data is current to the second | Typically delayed by hours or one day; batch or micro-batch loads |
Schema style | Normalised (3NF) to reduce redundancy and update anomalies | Denormalised (star/snowflake schema) for fast aggregation |
Storage format | Row-oriented: all columns of a row stored together | Column-oriented: each column stored separately, enabling compression and column-level scans |
Examples | PostgreSQL, MySQL, Oracle, SQL Server | Snowflake, BigQuery, Redshift, Databricks SQL, ClickHouse |
Dimensional Modelling: Star and Snowflake Schemas
�2
Concept | Definition | Role in the Warehouse |
|---|---|---|
Fact table | Central table containing measurable business events (transactions, clicks, sessions); typically has many rows and foreign keys to dimension tables | Stores the numeric measures analysts aggregate: revenue, quantity, duration |
Dimension table | Lookup table describing the who, what, where, and when of fact events | Stores descriptive attributes: customer name, product category, date, region |
Star schema | Fact table surrounded by denormalised dimension tables; all lookups are one join away from the fact | Optimal for BI tools and fast query performance; at the cost of some data redundancy |
Snowflake schema | Dimension tables are further normalised into sub-dimension tables | Reduces storage redundancy but requires more joins; less common in modern cloud warehouses where storage is cheap |
Slowly Changing Dimension (SCD) | A dimension attribute that changes over time (e.g., customer address, product price) | SCD Type 1 overwrites; SCD Type 2 adds a new row with effective/expiry dates to preserve history |
ETL vs. ELT
Approach | Pipeline Order | Where Transformation Happens | Best For | Common Tools |
|---|---|---|---|---|
ETL (Extract-Transform-Load) | Extract from source → Transform in a staging server → Load clean data into warehouse | Outside the warehouse, in a dedicated transformation server or tool | Legacy on-premise warehouses with limited compute; strict data governance before data enters the warehouse | Informatica, Talend, SSIS, Apache Spark |
ELT (Extract-Load-Transform) | Extract from source → Load raw data into warehouse → Transform inside the warehouse using SQL | Inside the warehouse, leveraging its massive parallel processing power | Modern cloud warehouses where compute is elastic and cheap; enables iterative transformation and data exploration | Fivetran + dbt, Airbyte + dbt, Stitch + dbt |
The Modern Data Stack
Layer | Purpose | Common Tools |
|---|---|---|
Ingestion | Move raw data from source systems into the data warehouse or data lake | Fivetran, Airbyte, Stitch, custom API connectors |
Storage | Store raw and transformed data; provide the compute layer for SQL queries | Snowflake, BigQuery, Redshift, Databricks, ClickHouse |
Transformation | Transform raw tables into clean, modelled, analytics-ready tables using SQL | dbt (data build tool) — the dominant open-source transformation framework |
Orchestration | Schedule, monitor, and manage pipeline dependencies | Apache Airflow, Prefect, Dagster, dbt Cloud Scheduler |
Semantic layer / BI | Define business metrics consistently and expose data to dashboards and self-serve analytics | Looker (LookML), Tableau, Power BI, Metabase, dbt Semantic Layer |
Data Quality and Governance in the Warehouse
Practice | Description | Implementation |
|---|---|---|
Data contracts | Formal agreements between data producers (source teams) and data consumers (analytics teams) on schema, freshness, and quality | Codified in YAML or documentation; enforced at ingestion or transformation time |
Data tests | Automated assertions on transformed tables (uniqueness, not-null, referential integrity, accepted value ranges) | dbt tests run after each model build; failures block downstream models from loading |
Data lineage | Tracking the origin and transformations of each field from source to final table | Provided by dbt docs, OpenLineage, or barehouse-native lineage tools |
Partitioning and clustering | Physical organisation of warehouse tables by date or high-cardinality columns to enable partition pruning | Critical for query cost and performance at scale; filter on the partition key to avoid full table scans |
Summary
Data warehousing provides the infrastructure layer that makes large-scale analytics possible. By separating the storage and compute of analytical queries from operational systems, warehouses allow analysts to run complex aggregations without impacting production applications. The shift from ETL to ELT, powered by cloud warehouses and tools like dbt, has democratised data transformation — allowing analysts to own and version their transformation logic in SQL rather than depending on engineering teams. Understanding dimensional modelling, the difference between fact and dimension tables, and how the modern data stack fits together is foundational knowledge for any analyst working with enterprise data.
Create a free reader account to keep reading.