Why SQL Is the Core Language of Data Analysis
Structured Query Language (SQL) remains the most widely used tool in a data analyst's toolkit. Unlike programming languages that require writing algorithms, SQL lets you declare what data you want and lets the database engine figure out how to retrieve it. Nearly every analytical platform — from traditional relational databases like PostgreSQL and MySQL to modern cloud warehouses like BigQuery, Snowflake, and Redshift — uses SQL as its primary query language. Mastering joins, aggregations, and window functions unlocks the ability to answer complex business questions directly from raw data without exporting to spreadsheets.
Types of Joins
Join Type | What It Returns | Typical Use Case | Gotcha |
|---|---|---|---|
INNER JOIN | Only rows where the join key exists in both tables | Linking orders to customers where both records must exist | Silently drops rows with no match — check row counts before and after |
LEFT JOIN | All rows from the left table; NULLs for non-matching right rows | Finding customers who have never placed an order | Aggregations on the right-side columns count NULLs as 0 with COUNT(*) but as 0 with COUNT(col) |
RIGHT JOIN | All rows from the right table; NULLs for non-matching left rows | Rarely preferred; rewrite as LEFT JOIN with tables swapped for clarity | Hard to read in complex queries; avoid in practice |
FULL OUTER JOIN | All rows from both tables; NULLs where no match exists on either side | Reconciling two data sources to find discrepancies | Not supported in all databases (e.g., MySQL requires UNION workaround) |
CROSS JOIN | Cartesian product — every row in table A paired with every row in table B | Generating a date spine or all combinations of two dimension tables | Explosive row growth; a 1,000-row × 1,000-row cross join = 1,000,000 rows |
SELF JOIN | A table joined to itself using aliases | Comparing rows within the same table (e.g., manager–employee hierarchy) | Requires clear aliasing to avoid confusion |
Aggregation Functions
Function | Description | NULL Handling | Example |
|---|---|---|---|
COUNT(*) | Counts all rows including NULLs | Includes rows where column is NULL | COUNT(*) to get total orders |
COUNT(col) | Counts non-NULL values in a column | Ignores NULLs | COUNT(email) to count users with an email address |
COUNT(DISTINCT col) | Counts unique non-NULL values | Ignores NULLs and duplicates | COUNT(DISTINCT user_id) for unique active users |
SUM(col) | Total of numeric values | Ignores NULLs; returns NULL if all values are NULL | SUM(revenue) for total revenue |
AVG(col) | Mean of numeric values | Ignores NULLs — not the same as SUM/COUNT(*) | AVG(order_value) for average order value |
MIN / MAX | Smallest or largest value | Ignores NULLs | MIN(created_at) for the earliest signup date |
GROUP BY and HAVING
GROUP BY collapses rows with the same value in the grouping column(s) into a single output row, allowing aggregation functions to summarize each group. HAVING filters groups after aggregation — it is the WHERE clause for aggregated results.
Clause | When It Filters | Can Reference Aggregates? | Example |
|---|---|---|---|
WHERE | Before aggregation (on raw rows) | No | WHERE status = 'completed' |
HAVING | After aggregation (on grouped results) | Yes | HAVING COUNT(*) > 10 |
Window Functions
Window functions perform calculations across a set of rows related to the current row, without collapsing rows like GROUP BY does. They use an OVER() clause to define the window (partition and ordering). Window functions are essential for ranking, running totals, period-over-period comparisons, and moving averages.
Function | Category | Description | Common Use |
|---|---|---|---|
ROW_NUMBER() | Ranking | Assigns a unique sequential integer to each row within a partition | Deduplication: keep only the first row per user_id ordered by created_at |
RANK() | Ranking | Assigns the same rank to ties; leaves gaps after tied ranks | Ranking products by revenue where ties are possible |
DENSE_RANK() | Ranking | Like RANK() but no gaps after ties | Top-N analysis where gaps in rank would be confusing |
SUM() OVER() | Aggregate | Running or partitioned sum without collapsing rows | Cumulative revenue per month |
AVG() OVER() | Aggregate | Moving average over a defined frame | 7-day rolling average of daily active users |
LAG(col, n) | Offset | Returns the value of col from n rows before the current row | Week-over-week or month-over-month comparisons |
LEAD(col, n) | Offset | Returns the value of col from n rows after the current row | Time to next event; churn prediction features |
NTILE(n) | Ranking | Divides rows into n roughly equal buckets | Assigning users to quartiles by spend |
FIRST_VALUE / LAST_VALUE | Offset | Returns the first or last value in the window frame | Attribution: first touch vs. last touch channel |
Subqueries vs. CTEs
Approach | Syntax | Readability | Reusability | Best For |
|---|---|---|---|---|
Inline subquery | SELECT * FROM (SELECT ...) sub | Poor for complex logic; nested queries become hard to follow | Cannot reference the same subquery twice | Simple one-off transformations |
CTE (WITH clause) | WITH cte AS (SELECT ...) SELECT * FROM cte | High; each step is named and readable | Can reference the same CTE multiple times in the main query | Multi-step analytical queries; self-documenting pipelines |
Recursive CTE | WITH RECURSIVE cte AS (...) | Moderate; requires understanding of anchor + recursive member | Yes, within the same query | Hierarchical data (org charts, category trees) |
Performance Considerations for Analytical SQL
Practice | Why It Matters |
|---|---|
Filter early with WHERE before joining | Reduces the number of rows that need to be joined, lowering memory and CPU usage |
Avoid SELECT * | Columnar databases scan only requested columns; SELECT * forces reading all columns |
Use partition pruning | Filter on partition keys (e.g., date) so the engine skips irrelevant partitions entirely |
Prefer COUNT(DISTINCT) approximations when exact precision is not needed | Exact COUNT(DISTINCT) is expensive at scale; HyperLogLog approximations (e.g., APPROX_COUNT_DISTINCT in BigQuery) are much faster |
Inspect the query execution plan (EXPLAIN) | Reveals whether indexes are being used, where full table scans occur, and what the estimated row counts are at each step |
Summary
SQL mastery for data analysts goes well beyond basic SELECT statements. Joins allow you to combine data from multiple sources while understanding the cardinality implications of each join type. Aggregations with GROUP BY and HAVING enable summarizing large datasets into actionable metrics. Window functions are particularly powerful because they allow row-level calculations — rankings, running totals, period comparisons — without losing the granularity of the underlying data. CTEs make complex multi-step queries readable and maintainable. Together, these SQL capabilities allow analysts to answer the majority of business questions directly in the database, where the data lives.
Create a free reader account to keep reading.