What Is a Data Warehouse?
A data warehouse is a centralized repository designed for analytical querying rather than transactional processing. Unlike operational databases that handle day-to-day inserts, updates, and deletes, a data warehouse is optimized for reading large volumes of historical data and aggregating it for reporting and analysis. The distinction between these two types of systems is captured by the terms OLTP and OLAP.
OLTP vs OLAP
| Dimension | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|---|---|---|
| Primary purpose | Record and process business transactions | Analyze historical data and support decisions |
| Typical queries | Single-row lookups, inserts, updates | Aggregations over millions of rows |
| Data volume | Gigabytes to terabytes (current data) | Terabytes to petabytes (historical data) |
| Schema style | Highly normalized (3NF) | Denormalized (star/snowflake schema) |
| Query frequency | Thousands of short transactions per second | Fewer, longer-running analytical queries |
| Users | Application end users, operational staff | Analysts, data scientists, executives |
| Examples | PostgreSQL, MySQL, Oracle, SQL Server | BigQuery, Snowflake, Redshift, Databricks |
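The contrast in query shape can be seen in a few lines of SQL. The sketch below uses an in-memory SQLite database as a stand-in for both kinds of systems; the `orders` table and its columns are illustrative, not taken from any particular product.

```python
import sqlite3

# In-memory SQLite stands in for both system types here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "East", 100.0), (2, "West", 250.0), (3, "East", 75.0)],
)

# OLTP-style query: fetch a single row by primary key.
row = conn.execute(
    "SELECT revenue FROM orders WHERE order_id = ?", (2,)
).fetchone()
print(row[0])  # 250.0

# OLAP-style query: scan and aggregate across all rows.
totals = dict(
    conn.execute("SELECT region, SUM(revenue) FROM orders GROUP BY region")
)
print(totals)  # {'East': 175.0, 'West': 250.0}
```

A real OLTP system executes thousands of the first kind of query per second; a warehouse is built to make the second kind fast over billions of rows.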
Dimensional Modeling
Dimensional modeling is the standard technique for structuring data in a warehouse. Developed by Ralph Kimball, it organizes data into two types of tables: fact tables and dimension tables. The goal is to make analytical queries fast and intuitive, even for non-technical users.
Fact tables store quantitative measurements — the things you want to measure and aggregate. Each row represents an event or transaction: a sale, a click, a page view, a payment. Fact tables contain numeric measures (revenue, quantity, duration) and foreign keys pointing to dimension tables. They tend to be narrow — just keys and measures — but very tall, with millions to billions of rows.
Dimension tables store descriptive context — the things you want to filter, group, and slice by. A customer dimension contains name, region, segment, and signup date. A product dimension contains category, brand, and SKU. Dimension tables are narrower and shorter than fact tables.
Star Schema
In a star schema, a central fact table is directly connected to multiple dimension tables. The diagram resembles a star, with the fact table at the center and dimension tables radiating outward. Each dimension is fully denormalized — all attributes of a customer, for example, are stored in a single flat customer dimension table rather than being split across multiple normalized tables.
Advantages of star schema: queries are simple and require few joins; query performance is fast because dimension tables are small and can be loaded into memory; easy for analysts and BI tools to understand and navigate.
Disadvantages: data redundancy in dimension tables (the same city name may appear in many rows); updates to dimension attributes require updating many rows.
Example star schema for e-commerce:
| Table | Type | Key Columns |
|---|---|---|
| fact_orders | Fact | order_id, customer_key, product_key, date_key, revenue, quantity |
| dim_customer | Dimension | customer_key, name, region, segment, signup_date |
| dim_product | Dimension | product_key, name, category, brand, unit_price |
| dim_date | Dimension | date_key, date, year, quarter, month, week, day_of_week |
| dim_geography | Dimension | geo_key, country, state, city, postal_code |
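A typical star-schema query joins the fact table to each dimension it needs, then aggregates. The sketch below builds a stripped-down version of the schema above in SQLite (column lists abbreviated to what the query uses, with made-up sample data) and asks for revenue by region and quarter:

```python
import sqlite3

# Abbreviated version of the star schema above, with illustrative data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, date TEXT, year INTEGER, quarter INTEGER);
CREATE TABLE fact_orders (
    order_id INTEGER, customer_key INTEGER, date_key INTEGER,
    revenue REAL, quantity INTEGER
);
INSERT INTO dim_customer VALUES (1, 'Acme', 'East'), (2, 'Globex', 'West');
INSERT INTO dim_date VALUES (20240115, '2024-01-15', 2024, 1),
                            (20240410, '2024-04-10', 2024, 2);
INSERT INTO fact_orders VALUES (101, 1, 20240115, 500.0, 5),
                               (102, 2, 20240115, 300.0, 2),
                               (103, 1, 20240410, 200.0, 1);
""")

# One join per dimension, then aggregate: the canonical star-schema shape.
rows = conn.execute("""
    SELECT c.region, d.quarter, SUM(f.revenue)
    FROM fact_orders f
    JOIN dim_customer c ON f.customer_key = c.customer_key
    JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY c.region, d.quarter
    ORDER BY c.region, d.quarter
""").fetchall()
print(rows)  # [('East', 1, 500.0), ('East', 2, 200.0), ('West', 1, 300.0)]
```

Note that every dimension is one hop from the fact table — this is what keeps star-schema queries simple and fast.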
Snowflake Schema
A snowflake schema is an extension of the star schema where dimension tables are normalized — broken into multiple related tables. For example, instead of a single flat dim_product table, a snowflake schema might have dim_product, dim_category, and dim_brand tables, with product referencing category and brand by foreign key.
Advantages: less data redundancy; smaller dimension table storage size; easier to maintain and update hierarchies.
Disadvantages: more complex queries requiring additional joins; slower query performance than star schema for the same question; harder for BI tools and non-technical analysts to navigate.
In practice, most modern data warehouse designs favor the star schema over the snowflake schema because storage is cheap and query performance matters more than normalization. The snowflake schema is better suited to scenarios with very large, frequently updated dimension tables.
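The extra join that snowflaking introduces is easy to see with the product example above. In this sketch (illustrative names and data, again using SQLite), category attributes live in their own table, so revenue-by-category needs two joins instead of one:

```python
import sqlite3

# Snowflaked product dimension: category split into its own table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_orders (product_key INTEGER, revenue REAL);
INSERT INTO dim_category VALUES (1, 'Electronics'), (2, 'Apparel');
INSERT INTO dim_product VALUES (10, 'Headphones', 1), (11, 'T-shirt', 2);
INSERT INTO fact_orders VALUES (10, 120.0), (10, 80.0), (11, 25.0);
""")

# Revenue by category: fact -> product -> category, one join deeper
# than the equivalent star-schema query against a flat dim_product.
rows = conn.execute("""
    SELECT cat.category, SUM(f.revenue)
    FROM fact_orders f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_category cat ON p.category_key = cat.category_key
    GROUP BY cat.category
    ORDER BY cat.category
""").fetchall()
print(rows)  # [('Apparel', 25.0), ('Electronics', 200.0)]
```

In the star-schema version, `category` would simply be a column on `dim_product` and the second join would disappear.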
Slowly Changing Dimensions (SCDs)
A slowly changing dimension (SCD) is a dimension where attributes change over time, raising the question of how to handle historical accuracy. For example, if a customer moves from New York to Chicago, which city should be recorded on their past orders?
| SCD Type | Behavior | Use When |
|---|---|---|
| Type 0 | Never update — retain original value forever | Truly fixed attributes (date of birth) |
| Type 1 | Overwrite old value with new value | Corrections; history not needed |
| Type 2 | Insert new row with new value; flag old row as expired | Full history required; most common for customer/product dims |
| Type 3 | Add a new column to track previous value | Only one prior value matters |
| Type 4 | Maintain separate history table | Large dimensions with many changes |
Type 2 SCDs are the most widely used in practice. They require a surrogate key (an artificial integer or UUID key), an effective date range (valid_from, valid_to), and an is_current flag to distinguish the active row from expired historical rows.
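The Type 2 mechanics can be sketched in a few lines. This is a minimal in-memory illustration of the expire-and-insert step; in practice the same logic runs in SQL (often via a `MERGE` statement), and the function and field names below are assumptions for the example, not a standard API.

```python
from datetime import date

def apply_scd2(dim_rows, natural_key, new_attrs, change_date, next_key):
    """Expire the current row for natural_key, then insert a new current row."""
    for row in dim_rows:
        if row["customer_id"] == natural_key and row["is_current"]:
            row["is_current"] = False       # flag old version as expired
            row["valid_to"] = change_date   # close its date range
    dim_rows.append({
        "surrogate_key": next_key,          # new artificial key for the new version
        "customer_id": natural_key,         # natural/business key stays the same
        **new_attrs,
        "valid_from": change_date,
        "valid_to": None,                   # open-ended: this is the active row
        "is_current": True,
    })

dim_customer = [{
    "surrogate_key": 1, "customer_id": "C42", "city": "New York",
    "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True,
}]
apply_scd2(dim_customer, "C42", {"city": "Chicago"}, date(2024, 6, 1), next_key=2)

# Past fact rows keep pointing at surrogate_key 1 (New York), so historical
# orders stay accurate; new fact rows will reference surrogate_key 2.
```

This is exactly why the surrogate key matters: facts join on the surrogate key, so each fact row is permanently tied to the version of the customer that was current when the event happened.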
Grain
The grain of a fact table is the most atomic level at which a fact is recorded — the lowest level of detail captured in each row. Declaring the grain is the most critical decision in dimensional modeling because it determines what questions can and cannot be answered from the fact table.
A fact table at the order-line grain captures one row per product per order, enabling analysis at the product level. A fact table at the order grain collapses multiple products into one row, losing product-level detail. You can always aggregate from a finer grain to a coarser one, but you cannot disaggregate in the other direction.
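The one-way nature of grain is worth seeing concretely. The toy data below (illustrative values) is at the order-line grain; rolling it up to the order grain is a simple aggregation, but nothing in the resulting order totals would let you recover the per-product breakdown:

```python
# Order-line grain: one row per product per order.
order_lines = [
    {"order_id": 1, "product": "widget", "revenue": 30.0},
    {"order_id": 1, "product": "gadget", "revenue": 20.0},
    {"order_id": 2, "product": "widget", "revenue": 15.0},
]

# Roll up to order grain: fine -> coarse always works.
orders = {}
for line in order_lines:
    orders[line["order_id"]] = orders.get(line["order_id"], 0.0) + line["revenue"]

print(orders)  # {1: 50.0, 2: 15.0}
# The reverse is impossible: {1: 50.0} alone cannot tell you that order 1
# contained a 30.0 widget and a 20.0 gadget.
```

This is why the standard advice is to build fact tables at the finest grain the source data supports and aggregate later as needed.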
Data Warehouse Architecture
Kimball (Bottom-Up): Build individual data marts (subject-oriented subsets of the warehouse) first, then integrate them using shared conformed dimensions. Fast time-to-value but can create integration complexity at scale.
Inmon (Top-Down): Build a normalized enterprise data warehouse first, then derive data marts from it. More rigorous governance but slower to deliver initial value.
Data Vault: A hybrid approach that uses hubs (business keys), links (relationships), and satellites (attributes) to build a flexible, auditable, and scalable warehouse layer. Common in regulated industries requiring full auditability.
Modern Lakehouse: Combines the flexibility of a data lake (raw files in cloud storage) with the query performance and structure of a warehouse. Platforms like Databricks, BigQuery, and Snowflake support lakehouse patterns where raw, curated, and serving layers coexist.
Key Concepts for Analysts
| Concept | Definition | Why It Matters |
|---|---|---|
| Surrogate key | An artificial primary key (integer or UUID) assigned by the warehouse | Decouples warehouse from source system keys; required for SCDs |
| Conformed dimension | A dimension shared identically across multiple fact tables | Enables drill-across queries joining multiple fact tables |
| Degenerate dimension | A dimension attribute stored in the fact table (e.g., order number) | Avoids creating a dimension table for single-attribute dimensions |
| Junk dimension | A dimension that bundles miscellaneous low-cardinality flags | Avoids cluttering the fact table with many small flag columns |
| Factless fact table | A fact table that records events with no numeric measures | Tracks presence/absence (student enrolled, item viewed) |
Summary
Data warehousing is the foundation of analytical infrastructure. Understanding the difference between OLTP and OLAP systems, the principles of dimensional modeling, and the trade-offs between star and snowflake schemas allows analysts to reason about how their queries work, why certain patterns perform well, and how to structure data models that serve the business effectively. As data volumes grow and cloud warehouses become the default, these foundational concepts remain central to how analytics teams organize and deliver reliable insights.