Data Warehousing and ETL/ELT Concepts

The Role of a Data Warehouse in an Analytics Stack

A data warehouse is a centralised repository that integrates data from multiple source systems, organises it for analytical queries, and serves as the authoritative source of truth for business reporting. Unlike operational databases, which are optimised for transactional reads and writes on individual records, data warehouses are optimised for aggregating millions to billions of rows across many columns — exactly the query pattern of analytical workloads. Modern cloud-based warehouses such as Snowflake, Google BigQuery, Amazon Redshift, and Databricks SQL have largely replaced on-premise systems and changed how data pipelines are built.

OLTP vs. OLAP: Two Different Purposes

Dimension	OLTP (Operational)	OLAP (Analytical)
Primary purpose	Record transactions: inserts, updates, deletes of individual rows	Analyse trends: aggregate queries across large row sets
Query pattern	Many short queries on a few rows each	Fewer long queries scanning many rows and columns
Data freshness	Real-time; data is current to the second	Typically delayed by hours or one day; batch or micro-batch loads
Schema style	Normalised (3NF) to reduce redundancy and update anomalies	Denormalised (star/snowflake schema) for fast aggregation
Storage format	Row-oriented: all columns of a row stored together	Column-oriented: each column stored separately, enabling compression and column-level scans
Examples	PostgreSQL, MySQL, Oracle, SQL Server	Snowflake, BigQuery, Redshift, Databricks SQL, ClickHouse

Dimensional Modelling: Star and Snowflake Schemas

�2

Concept	Definition	Role in the Warehouse
Fact table	Central table containing measurable business events (transactions, clicks, sessions); typically has many rows and foreign keys to dimension tables	Stores the numeric measures analysts aggregate: revenue, quantity, duration
Dimension table	Lookup table describing the who, what, where, and when of fact events	Stores descriptive attributes: customer name, product category, date, region
Star schema	Fact table surrounded by denormalised dimension tables; all lookups are one join away from the fact	Optimal for BI tools and fast query performance; at the cost of some data redundancy
Snowflake schema	Dimension tables are further normalised into sub-dimension tables	Reduces storage redundancy but requires more joins; less common in modern cloud warehouses where storage is cheap
Slowly Changing Dimension (SCD)	A dimension attribute that changes over time (e.g., customer address, product price)	SCD Type 1 overwrites; SCD Type 2 adds a new row with effective/expiry dates to preserve history

ETL vs. ELT

Approach	Pipeline Order	Where Transformation Happens	Best For	Common Tools
ETL (Extract-Transform-Load)	Extract from source → Transform in a staging server → Load clean data into warehouse	Outside the warehouse, in a dedicated transformation server or tool	Legacy on-premise warehouses with limited compute; strict data governance before data enters the warehouse	Informatica, Talend, SSIS, Apache Spark
ELT (Extract-Load-Transform)	Extract from source → Load raw data into warehouse → Transform inside the warehouse using SQL	Inside the warehouse, leveraging its massive parallel processing power	Modern cloud warehouses where compute is elastic and cheap; enables iterative transformation and data exploration	Fivetran + dbt, Airbyte + dbt, Stitch + dbt

The Modern Data Stack

Layer	Purpose	Common Tools
Ingestion	Move raw data from source systems into the data warehouse or data lake	Fivetran, Airbyte, Stitch, custom API connectors
Storage	Store raw and transformed data; provide the compute layer for SQL queries	Snowflake, BigQuery, Redshift, Databricks, ClickHouse
Transformation	Transform raw tables into clean, modelled, analytics-ready tables using SQL	dbt (data build tool) — the dominant open-source transformation framework
Orchestration	Schedule, monitor, and manage pipeline dependencies	Apache Airflow, Prefect, Dagster, dbt Cloud Scheduler
Semantic layer / BI	Define business metrics consistently and expose data to dashboards and self-serve analytics	Looker (LookML), Tableau, Power BI, Metabase, dbt Semantic Layer

Data Quality and Governance in the Warehouse

Practice	Description	Implementation
Data contracts	Formal agreements between data producers (source teams) and data consumers (analytics teams) on schema, freshness, and quality	Codified in YAML or documentation; enforced at ingestion or transformation time
Data tests	Automated assertions on transformed tables (uniqueness, not-null, referential integrity, accepted value ranges)	dbt tests run after each model build; failures block downstream models from loading
Data lineage	Tracking the origin and transformations of each field from source to final table	Provided by dbt docs, OpenLineage, or barehouse-native lineage tools
Partitioning and clustering	Physical organisation of warehouse tables by date or high-cardinality columns to enable partition pruning	Critical for query cost and performance at scale; filter on the partition key to avoid full table scans

Summary

Data warehousing provides the infrastructure layer that makes large-scale analytics possible. By separating the storage and compute of analytical queries from operational systems, warehouses allow analysts to run complex aggregations without impacting production applications. The shift from ETL to ELT, powered by cloud warehouses and tools like dbt, has democratised data transformation — allowing analysts to own and version their transformation logic in SQL rather than depending on engineering teams. Understanding dimensional modelling, the difference between fact and dimension tables, and how the modern data stack fits together is foundational knowledge for any analyst working with enterprise data.