SQL for Data Analysis: Joins, Aggregations, and Window Functions

Why SQL Is the Core Language of Data Analysis

Structured Query Language (SQL) remains the most widely used tool in a data analyst's toolkit. Unlike programming languages that require writing algorithms, SQL lets you declare what data you want and lets the database engine figure out how to retrieve it. Nearly every analytical platform — from traditional relational databases like PostgreSQL and MySQL to modern cloud warehouses like BigQuery, Snowflake, and Redshift — uses SQL as its primary query language. Mastering joins, aggregations, and window functions unlocks the ability to answer complex business questions directly from raw data without exporting to spreadsheets.

Types of Joins

Join Type	What It Returns	Typical Use Case	Gotcha
INNER JOIN	Only rows where the join key exists in both tables	Linking orders to customers where both records must exist	Silently drops rows with no match — check row counts before and after
LEFT JOIN	All rows from the left table; NULLs for non-matching right rows	Finding customers who have never placed an order	Aggregations on the right-side columns count NULLs as 0 with COUNT(*) but as 0 with COUNT(col)
RIGHT JOIN	All rows from the right table; NULLs for non-matching left rows	Rarely preferred; rewrite as LEFT JOIN with tables swapped for clarity	Hard to read in complex queries; avoid in practice
FULL OUTER JOIN	All rows from both tables; NULLs where no match exists on either side	Reconciling two data sources to find discrepancies	Not supported in all databases (e.g., MySQL requires UNION workaround)
CROSS JOIN	Cartesian product — every row in table A paired with every row in table B	Generating a date spine or all combinations of two dimension tables	Explosive row growth; a 1,000-row × 1,000-row cross join = 1,000,000 rows
SELF JOIN	A table joined to itself using aliases	Comparing rows within the same table (e.g., manager–employee hierarchy)	Requires clear aliasing to avoid confusion

Aggregation Functions

Function	Description	NULL Handling	Example
COUNT(*)	Counts all rows including NULLs	Includes rows where column is NULL	COUNT(*) to get total orders
COUNT(col)	Counts non-NULL values in a column	Ignores NULLs	COUNT(email) to count users with an email address
COUNT(DISTINCT col)	Counts unique non-NULL values	Ignores NULLs and duplicates	COUNT(DISTINCT user_id) for unique active users
SUM(col)	Total of numeric values	Ignores NULLs; returns NULL if all values are NULL	SUM(revenue) for total revenue
AVG(col)	Mean of numeric values	Ignores NULLs — not the same as SUM/COUNT(*)	AVG(order_value) for average order value
MIN / MAX	Smallest or largest value	Ignores NULLs	MIN(created_at) for the earliest signup date

GROUP BY and HAVING

GROUP BY collapses rows with the same value in the grouping column(s) into a single output row, allowing aggregation functions to summarize each group. HAVING filters groups after aggregation — it is the WHERE clause for aggregated results.

Clause	When It Filters	Can Reference Aggregates?	Example
WHERE	Before aggregation (on raw rows)	No	WHERE status = 'completed'
HAVING	After aggregation (on grouped results)	Yes	HAVING COUNT(*) > 10

Window Functions

Window functions perform calculations across a set of rows related to the current row, without collapsing rows like GROUP BY does. They use an OVER() clause to define the window (partition and ordering). Window functions are essential for ranking, running totals, period-over-period comparisons, and moving averages.

Function	Category	Description	Common Use
ROW_NUMBER()	Ranking	Assigns a unique sequential integer to each row within a partition	Deduplication: keep only the first row per user_id ordered by created_at
RANK()	Ranking	Assigns the same rank to ties; leaves gaps after tied ranks	Ranking products by revenue where ties are possible
DENSE_RANK()	Ranking	Like RANK() but no gaps after ties	Top-N analysis where gaps in rank would be confusing
SUM() OVER()	Aggregate	Running or partitioned sum without collapsing rows	Cumulative revenue per month
AVG() OVER()	Aggregate	Moving average over a defined frame	7-day rolling average of daily active users
LAG(col, n)	Offset	Returns the value of col from n rows before the current row	Week-over-week or month-over-month comparisons
LEAD(col, n)	Offset	Returns the value of col from n rows after the current row	Time to next event; churn prediction features
NTILE(n)	Ranking	Divides rows into n roughly equal buckets	Assigning users to quartiles by spend
FIRST_VALUE / LAST_VALUE	Offset	Returns the first or last value in the window frame	Attribution: first touch vs. last touch channel

Subqueries vs. CTEs

Approach	Syntax	Readability	Reusability	Best For
Inline subquery	SELECT * FROM (SELECT ...) sub	Poor for complex logic; nested queries become hard to follow	Cannot reference the same subquery twice	Simple one-off transformations
CTE (WITH clause)	WITH cte AS (SELECT ...) SELECT * FROM cte	High; each step is named and readable	Can reference the same CTE multiple times in the main query	Multi-step analytical queries; self-documenting pipelines
Recursive CTE	WITH RECURSIVE cte AS (...)	Moderate; requires understanding of anchor + recursive member	Yes, within the same query	Hierarchical data (org charts, category trees)

Performance Considerations for Analytical SQL

Practice	Why It Matters
Filter early with WHERE before joining	Reduces the number of rows that need to be joined, lowering memory and CPU usage
Avoid SELECT *	Columnar databases scan only requested columns; SELECT * forces reading all columns
Use partition pruning	Filter on partition keys (e.g., date) so the engine skips irrelevant partitions entirely
Prefer COUNT(DISTINCT) approximations when exact precision is not needed	Exact COUNT(DISTINCT) is expensive at scale; HyperLogLog approximations (e.g., APPROX_COUNT_DISTINCT in BigQuery) are much faster
Inspect the query execution plan (EXPLAIN)	Reveals whether indexes are being used, where full table scans occur, and what the estimated row counts are at each step

Summary

SQL mastery for data analysts goes well beyond basic SELECT statements. Joins allow you to combine data from multiple sources while understanding the cardinality implications of each join type. Aggregations with GROUP BY and HAVING enable summarizing large datasets into actionable metrics. Window functions are particularly powerful because they allow row-level calculations — rankings, running totals, period comparisons — without losing the granularity of the underlying data. CTEs make complex multi-step queries readable and maintainable. Together, these SQL capabilities allow analysts to answer the majority of business questions directly in the database, where the data lives.