ETL and ELT: Data Pipeline Concepts for Analysts

What Are ETL and ELT?

ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are the two dominant paradigms for moving data from source systems into an analytical destination. Both follow the same three logical steps — extracting raw data from sources, transforming it into a usable form, and loading it into a target — but differ in where and when the transformation happens.

Paradigm	Transformation Location	Typical Destination	Best For
ETL	Intermediate processing server before loading	Traditional data warehouse (on-prem)	Strict schema requirements, sensitive data masking before storage
ELT	Inside the destination data warehouse after loading	Cloud data warehouse (BigQuery, Snowflake, Redshift)	Large volumes, flexible exploration, SQL-based transformations

The Extract Phase

Extraction pulls data from one or more source systems. Sources range from operational databases (PostgreSQL, MySQL), SaaS APIs (Salesforce, HubSpot, Stripe), flat files (CSV, JSON), event streams (Kafka, Kinesis), and third-party data vendors.

Extraction Method	Description	Use Case
Full extract	Pull the entire dataset every run	Small reference tables, lookup data
Incremental extract	Pull only records changed since the last run (using updated_at timestamp or sequence ID)	Large transactional tables with clear change indicators
Change Data Capture (CDC)	Stream database transaction log changes in near real-time	Low-latency replication, event-driven pipelines
API polling	Periodically call REST or GraphQL endpoints for new records	SaaS sources (CRM, marketing tools)

The Transform Phase

Transformation converts raw extracted data into analysis-ready form. In ETL, transforms run in a dedicated processing layer before loading. In ELT (the modern default), raw data lands in the warehouse first, and transformations are written as SQL and orchestrated by tools like dbt.

Common transformation operations:

Operation	Description	Example
Type casting	Converting string timestamps to TIMESTAMP type	CAST(created_at AS TIMESTAMP)
Deduplication	Removing duplicate records from the source	ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1
Joins and enrichment	Combining tables to add dimensional context	JOIN users ON orders.user_id = users.id
Aggregation	Pre-computing summaries for performance	SUM(revenue) GROUP BY date, product_id
Business logic encoding	Translating raw status codes into business-meaningful labels	CASE WHEN status = 1 THEN 'active' ELSE 'churned' END
PII masking / anonymization	Removing or hashing personally identifiable information	MD5(email) or NULL-ing sensitive columns

The Load Phase

Loading moves data into the destination. For ELT pipelines, the raw load is typically an append or upsert into a staging schema. The destination is usually a cloud data warehouse or data lake.

Load Strategy	Description	When to Use
Full refresh	Truncate and reload the target table entirely	Small tables, reference data, no historical tracking needed
Incremental append	Insert only new records; old records never updated	Immutable event logs, append-only fact tables
Upsert (merge)	Insert new records, update existing ones matched by primary key	Slowly changing entities (users, accounts)
Slowly Changing Dimensions (SCD Type 2)	Insert a new row for each change, preserving full history with valid_from/valid_to columns	When full historical accuracy of dimension attributes matters

Modern ELT Stack

The modern ELT stack combines dedicated tools for each phase rather than monolithic ETL servers:

Layer	Common Tools	Function
Ingestion (Extract + Load)	Fivetran, Airbyte, Stitch, Meltano	Managed connectors that extract from sources and load raw data to the warehouse
Transformation	dbt (data build tool)	SQL-based modular transformations with version control, testing, and lineage documentation
Warehouse / Storage	BigQuery, Snowflake, Databricks, Redshift	Scalable columnar storage and compute for SQL analytics
Orchestration	Airflow, Prefect, Dagster	Scheduling, dependency management, and monitoring of pipeline runs
Reverse ETL	Census, Hightouch	Sync warehouse data back into operational tools (CRM, marketing platforms)

Data Quality in Pipelines

Pipelines degrade over time as sources change without notice. Data analysts should build quality checks into the pipeline at both the ingestion and transformation layers:

Check Type	Example	Tool
Not null	Primary key and required fields must not be NULL	dbt test: not_null
Uniqueness	Primary key must be unique in the output table	dbt test: unique
Referential integrity	Every order.user_id must exist in the users table	dbt test: relationships
Row count threshold	Daily load must deliver at least N rows; sudden drops indicate pipeline failure	Custom SQL assertion or Great Expectations
Freshness	The most recent record must be within expected recency window	dbt source freshness

Summary

Understanding ETL and ELT is foundational for data analysts who work with warehoused data. The modern ELT pattern — managed ingestion tools loading raw data, dbt transforming it in SQL, and a cloud warehouse executing queries — has become the industry standard. Analysts who understand how data moves from source to serving layer can write better SQL (aware of what the raw data looks like and how it was deduplicated), design reliable dashboards (aware of update frequencies and freshness), and troubleshoot data quality issues at the right layer of the stack rather than guessing.