SQL for Data Analysis

Introduction to SQL for Data Analysis

SQL (Structured Query Language) is the most essential tool in a data analyst's arsenal. Virtually every organization stores its data in relational databases, and SQL is the language used to retrieve, filter, aggregate, and transform that data. Regardless of whether you work with PostgreSQL, MySQL, SQL Server, BigQuery, or SQLite, the core syntax is consistent and transferable across platforms.

This article walks through the foundational SQL concepts every data analyst needs: selecting data, filtering rows, sorting results, aggregating with GROUP BY, joining tables, and using subqueries and window functions.

The SELECT Statement

Every SQL query begins with SELECT. It tells the database which columns to return. The FROM clause specifies which table to pull from.

-- Select all columns from a table
SELECT * FROM orders;

-- Select specific columns
SELECT order_id, customer_id, total_amount, order_date
FROM orders;

Using SELECT * retrieves every column, which is convenient for exploration but inefficient in production. Always name the specific columns you need in analytical queries to keep results clear and fast.

You can also create computed columns directly in the SELECT clause:

SELECT
  order_id,
  quantity,
  unit_price,
  quantity * unit_price AS total_revenue
FROM order_items;

Filtering with WHERE

The WHERE clause filters rows based on conditions. Only rows that satisfy the condition are returned.

-- Filter by a single value
SELECT * FROM customers WHERE country = 'France';

-- Multiple conditions with AND / OR
SELECT * FROM orders
WHERE total_amount > 500
  AND status = 'completed';

-- Range filter with BETWEEN
SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';

-- Filter from a list of values
SELECT * FROM products
WHERE category IN ('Electronics', 'Accessories');

-- Partial text match with LIKE
SELECT * FROM customers
WHERE email LIKE '%@gmail.com';

NULL values require special handling — you cannot use = NULL. Instead use IS NULL or IS NOT NULL:

SELECT * FROM customers WHERE phone_number IS NULL;

Sorting with ORDER BY

The ORDER BY clause sorts the result set. ASC (ascending) is the default; DESC sorts from highest to lowest.

-- Sort by a single column
SELECT * FROM products ORDER BY price DESC;

-- Sort by multiple columns
SELECT customer_id, order_date, total_amount
FROM orders
ORDER BY customer_id ASC, order_date DESC;

LIMIT (or TOP in SQL Server) restricts the number of rows returned. This is extremely useful when exploring large tables:

SELECT * FROM events ORDER BY created_at DESC LIMIT 100;

Aggregating Data with GROUP BY

Aggregation is where SQL becomes truly powerful for analysis. The GROUP BY clause groups rows that share a value in one or more columns, and aggregate functions compute a summary value for each group.

Common aggregate functions: COUNT(), SUM(), AVG(), MIN(), MAX().

-- Total revenue by country
SELECT
  country,
  COUNT(*) AS num_orders,
  SUM(total_amount) AS total_revenue,
  AVG(total_amount) AS avg_order_value
FROM orders
GROUP BY country
ORDER BY total_revenue DESC;

-- Monthly sales summary
SELECT
  DATE_TRUNC('month', order_date) AS month,
  COUNT(*) AS orders,
  SUM(total_amount) AS revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;

The HAVING clause filters groups after aggregation — it works like WHERE but for GROUP BY results:

-- Only show countries with more than 100 orders
SELECT country, COUNT(*) AS num_orders
FROM orders
GROUP BY country
HAVING COUNT(*) > 100
ORDER BY num_orders DESC;

Joining Tables

Real-world databases store related data across multiple tables. JOINs combine rows from two or more tables based on a related column. Understanding joins is one of the most important SQL skills for any analyst.

INNER JOIN returns only rows where there is a match in both tables:

SELECT
  o.order_id,
  c.name AS customer_name,
  o.total_amount
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;

LEFT JOIN returns all rows from the left table, and matching rows from the right. Rows with no match get NULL for the right table's columns. This is useful for finding customers with no orders:

-- Customers who have never placed an order
SELECT c.customer_id, c.name
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;

You can join more than two tables in a single query:

SELECT
  o.order_id,
  c.name AS customer_name,
  p.product_name,
  oi.quantity,
  oi.unit_price
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id;

Subqueries and CTEs

Subqueries are queries nested inside another query. They allow you to break complex problems into steps. A subquery in the WHERE clause can filter using the result of another query:

-- Customers who spent more than the average
SELECT customer_id, name
FROM customers
WHERE customer_id IN (
  SELECT customer_id
  FROM orders
  GROUP BY customer_id
  HAVING SUM(total_amount) > (SELECT AVG(total_amount) FROM orders)
);

Common Table Expressions (CTEs) use the WITH keyword and make complex queries much more readable by naming intermediate result sets:

WITH monthly_revenue AS (
  SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(total_amount) AS revenue
  FROM orders
  GROUP BY DATE_TRUNC('month', order_date)
),
ranked AS (
  SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_revenue
  FROM monthly_revenue
)
SELECT
  month,
  revenue,
  ROUND((revenue - prev_revenue) / prev_revenue * 100, 2) AS growth_pct
FROM ranked
ORDER BY month;

Window Functions

Window functions perform calculations across a set of rows related to the current row, without collapsing the result into groups like GROUP BY does. They are among the most powerful SQL tools for analysts.

-- Running total of revenue
SELECT
  order_date,
  total_amount,
  SUM(total_amount) OVER (ORDER BY order_date) AS running_total

-- Rank customers by total spending
SELECT
  customer_id,
  total_spent,
  RANK() OVER (ORDER BY total_spent DESC) AS spending_rank

-- Month-over-month comparison
SELECT
  month,
  revenue,
  LAG(revenue, 1) OVER (ORDER BY month) AS previous_month,
  revenue - LAG(revenue, 1) OVER (ORDER BY month) AS change
FROM monthly_revenue;

Key window functions to know: ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), SUM() OVER, AVG() OVER, and NTILE().

Best Practices for Analytical SQL

Write readable queries by using consistent indentation, aliasing tables with short but meaningful names, and adding comments to explain complex logic. Format keywords in uppercase to distinguish them from column and table names.

Always test queries on a small subset before running them on full tables. Use LIMIT to preview results and avoid accidentally scanning millions of rows. When filtering on large tables, make sure the columns in your WHERE and JOIN clauses are indexed for performance.

Avoid SELECT * in production queries — always specify the columns you need. This makes queries faster, makes the intent clear, and prevents breaking changes when tables are altered.

Conclusion

SQL is the lingua franca of data analysis. Mastering SELECT, filtering, aggregation, joins, CTEs, and window functions gives you direct access to the insights locked inside any relational database. Whether you are building dashboards, answering ad-hoc business questions, or feeding data into Python for further analysis, SQL proficiency is a non-negotiable skill for every data analyst.