What Are ETL and ELT?
ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are the two dominant paradigms for moving data from source systems into an analytical destination. Both follow the same three logical steps — extracting raw data from sources, transforming it into a usable form, and loading it into a target — but differ in where and when the transformation happens.
Paradigm | Transformation Location | Typical Destination | Best For |
|---|---|---|---|
ETL | Intermediate processing server before loading | Traditional data warehouse (on-prem) | Strict schema requirements, sensitive data masking before storage |
ELT | Inside the destination data warehouse after loading | Cloud data warehouse (BigQuery, Snowflake, Redshift) | Large volumes, flexible exploration, SQL-based transformations |
The Extract Phase
Extraction pulls data from one or more source systems. Sources range from operational databases (PostgreSQL, MySQL), SaaS APIs (Salesforce, HubSpot, Stripe), flat files (CSV, JSON), event streams (Kafka, Kinesis), and third-party data vendors.
Extraction Method | Description | Use Case |
|---|---|---|
Full extract | Pull the entire dataset every run | Small reference tables, lookup data |
Incremental extract | Pull only records changed since the last run (using updated_at timestamp or sequence ID) | Large transactional tables with clear change indicators |
Change Data Capture (CDC) | Stream database transaction log changes in near real-time | Low-latency replication, event-driven pipelines |
API polling | Periodically call REST or GraphQL endpoints for new records | SaaS sources (CRM, marketing tools) |
The Transform Phase
Transformation converts raw extracted data into analysis-ready form. In ETL, transforms run in a dedicated processing layer before loading. In ELT (the modern default), raw data lands in the warehouse first, and transformations are written as SQL and orchestrated by tools like dbt.
Common transformation operations:
Operation | Description | Example |
|---|---|---|
Type casting | Converting string timestamps to TIMESTAMP type | CAST(created_at AS TIMESTAMP) |
Deduplication | Removing duplicate records from the source | ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1 |
Joins and enrichment | Combining tables to add dimensional context | JOIN users ON orders.user_id = users.id |
Aggregation | Pre-computing summaries for performance | SUM(revenue) GROUP BY date, product_id |
Business logic encoding | Translating raw status codes into business-meaningful labels | CASE WHEN status = 1 THEN 'active' ELSE 'churned' END |
PII masking / anonymization | Removing or hashing personally identifiable information | MD5(email) or NULL-ing sensitive columns |
The Load Phase
Loading moves data into the destination. For ELT pipelines, the raw load is typically an append or upsert into a staging schema. The destination is usually a cloud data warehouse or data lake.
Load Strategy | Description | When to Use |
|---|---|---|
Full refresh | Truncate and reload the target table entirely | Small tables, reference data, no historical tracking needed |
Incremental append | Insert only new records; old records never updated | Immutable event logs, append-only fact tables |
Upsert (merge) | Insert new records, update existing ones matched by primary key | Slowly changing entities (users, accounts) |
Slowly Changing Dimensions (SCD Type 2) | Insert a new row for each change, preserving full history with valid_from/valid_to columns | When full historical accuracy of dimension attributes matters |
Modern ELT Stack
The modern ELT stack combines dedicated tools for each phase rather than monolithic ETL servers:
Layer | Common Tools | Function |
|---|---|---|
Ingestion (Extract + Load) | Fivetran, Airbyte, Stitch, Meltano | Managed connectors that extract from sources and load raw data to the warehouse |
Transformation | dbt (data build tool) | SQL-based modular transformations with version control, testing, and lineage documentation |
Warehouse / Storage | BigQuery, Snowflake, Databricks, Redshift | Scalable columnar storage and compute for SQL analytics |
Orchestration | Airflow, Prefect, Dagster | Scheduling, dependency management, and monitoring of pipeline runs |
Reverse ETL | Census, Hightouch | Sync warehouse data back into operational tools (CRM, marketing platforms) |
Data Quality in Pipelines
Pipelines degrade over time as sources change without notice. Data analysts should build quality checks into the pipeline at both the ingestion and transformation layers:
Check Type | Example | Tool |
|---|---|---|
Not null | Primary key and required fields must not be NULL | dbt test: not_null |
Uniqueness | Primary key must be unique in the output table | dbt test: unique |
Referential integrity | Every order.user_id must exist in the users table | dbt test: relationships |
Row count threshold | Daily load must deliver at least N rows; sudden drops indicate pipeline failure | Custom SQL assertion or Great Expectations |
Freshness | The most recent record must be within expected recency window | dbt source freshness |
Summary
Understanding ETL and ELT is foundational for data analysts who work with warehoused data. The modern ELT pattern — managed ingestion tools loading raw data, dbt transforming it in SQL, and a cloud warehouse executing queries — has become the industry standard. Analysts who understand how data moves from source to serving layer can write better SQL (aware of what the raw data looks like and how it was deduplicated), design reliable dashboards (aware of update frequencies and freshness), and troubleshoot data quality issues at the right layer of the stack rather than guessing.
Create a free reader account to keep reading.