Python for Data Analysis: A Pandas Handbook

Why Pandas Is the Data Analyst's Core Python Tool

Pandas is the foundational Python library for data manipulation and analysis. Built on top of NumPy, it introduces the DataFrame — a two-dimensional labelled data structure that behaves like a spreadsheet or SQL table but with the full power of a programming language behind it. Almost every Python-based data workflow begins with Pandas: loading data, cleaning it, transforming it, aggregating it, and preparing it for visualisation or modelling.

This guide covers the Pandas operations that appear in real analytical work every day, with practical examples throughout.

Loading and Inspecting Data

import pandas as pd
import numpy as np

# Load from various sources
df_csv   = pd.read_csv("orders.csv", parse_dates=["order_date"])
df_excel = pd.read_excel("report.xlsx", sheet_name="Sales")
df_sql   = pd.read_sql("SELECT * FROM orders WHERE status = 'PAID'", con=engine)
df_json  = pd.read_json("events.json")

# Basic inspection
print(df_csv.shape)          # (rows, columns)
print(df_csv.dtypes)         # data types per column
print(df_csv.head())         # first 5 rows
print(df_csv.describe())     # stats for numeric columns
print(df_csv.info())         # non-null counts + dtypes

# Check nulls
print(df_csv.isnull().sum())

# Value counts for a column
print(df_csv["status"].value_counts(normalize=True).mul(100).round(1))

Selecting, Filtering, and Slicing

# Select columns
df["order_value"]                        # single column (Series)
df[["user_id", "order_value", "status"]] # multiple columns (DataFrame)

# Filter rows
df[df["order_value"] > 100]
df[(df["status"] == "PAID") & (df["order_value"] > 50)]
df[df["country"].isin(["US", "UK", "DE"])]
df[df["email"].str.contains("@gmail.com", na=False)]

# loc: label-based selection
df.loc[df["status"] == "PAID", ["user_id", "order_value"]]

# iloc: position-based selection
df.iloc[0:10, 0:4]   # first 10 rows, first 4 columns

# query: readable string syntax
df.query("order_value > 100 and status == 'PAID'")

Aggregation and GroupBy

GroupBy is the Pandas equivalent of SQL's GROUP BY — it splits the DataFrame into groups, applies a function to each, and combines the results.

# Simple aggregation
df.groupby("country")["order_value"].sum()
df.groupby("country")["order_value"].agg(["sum", "mean", "count"])

# Multiple columns in groupby
df.groupby(["country", "status"])["order_value"].sum().reset_index()

# Named aggregations (pandas >= 0.25)
summary = df.groupby("country").agg(
    total_revenue = ("order_value", "sum"),
    order_count   = ("order_id",   "nunique"),
    avg_order     = ("order_value", "mean"),
    first_order   = ("order_date",  "min"),
    last_order    = ("order_date",  "max"),
).reset_index()

# Transform: add group aggregate back to original rows
df["country_avg"] = df.groupby("country")["order_value"].transform("mean")
df["pct_of_country"] = df["order_value"] / df["country_avg"] * 100

Merging and Joining DataFrames

Pandas merge mirrors SQL joins and is one of the most frequently used operations in analytical pipelines.

orders    = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
products  = pd.read_csv("products.csv")

# Inner join: only rows with matches in both tables
df = orders.merge(customers, on="user_id", how="inner")

# Left join: all orders, customer info where available
df = orders.merge(customers, on="user_id", how="left")

# Join on different column names
df = orders.merge(products, left_on="product_id", right_on="id", how="left")

# Chain multiple joins
df = (orders
      .merge(customers, on="user_id",    how="left")
      .merge(products,  on="product_id", how="left"))

# Check for join quality: rows lost/gained
print("Before:", len(orders))
print("After: ", len(df))
print("Unmatched:", df["customer_name"].isnull().sum())

Reshaping: Pivot Tables and Melt

Reshaping converts data between wide and long formats — essential for preparing data for visualisation libraries and for computing cross-tabulations.

# Pivot table: rows = month, columns = country, values = revenue pivot = df.pivot_table( index="month", columns="country", values="order_value", aggfunc="sum", fill_value=0 ) # Crosstab: frequency table of two categorical variables cross = pd.crosstab(df["device_type"], df["status"], margins=True, normalize="index") # Stack / unstack: toggle between wide and long long_df = pivot.stack().reset_index(name="revenue") wide_df = long_df.pivot(index="month", columns="country", values="revenue") # Melt: unpivot wide to long (useful for Seaborn and Plotly) long = df.melt( id_vars=["order_date", "country"], value_vars=["gross_revenue", "net_revenue", "tax"], var_name="metric", value_name="amount" )

Window Functions with Rolling and Expanding

Pandas rolling and expanding methods replicate SQL window functions, enabling moving averages, cumulative sums, and rank calculations directly on DataFrames.

df = df.sort_values("order_date") # 7-day rolling average of daily revenue df["revenue_7d_avg"] = df["daily_revenue"].rolling(window=7, min_periods=1).mean() # 28-day rolling sum df["revenue_28d"] = df["daily_revenue"].rolling(28).sum() # Expanding (cumulative) sum — grows with each row df["cumulative_revenue"] = df["daily_revenue"].expanding().sum() # Percentage change vs. prior period df["wow_growth"] = df["weekly_revenue"].pct_change(periods=1).mul(100).round(1) # Year-over-year change (52 weeks back) df["yoy_growth"] = df["weekly_revenue"].pct_change(periods=52).mul(100).round(1) # Rank within group df["revenue_rank"] = df.groupby("month")["order_value"].rank(ascending=False)

Key Pandas Operations Quick Reference

Task	Pandas Method	SQL Equivalent
Filter rows	df[condition] or df.query()	WHERE
Select columns	df[["col1","col2"]]	SELECT col1, col2
Aggregate by group	df.groupby().agg()	GROUP BY + aggregate functions
Join two tables	df.merge(other, on=key, how=type)	JOIN
Sort rows	df.sort_values("col", ascending=False)	ORDER BY col DESC
Add computed column	df["new"] = expression	SELECT *, expression AS new
Remove duplicates	df.drop_duplicates(subset=[key])	SELECT DISTINCT / DISTINCT ON
Running total	df["col"].expanding().sum()	SUM() OVER (ORDER BY ...)
Moving average	df["col"].rolling(n).mean()	AVG() OVER (ROWS n PRECEDING)
Unpivot wide to long	df.melt(id_vars, value_vars)	UNPIVOT (vendor-specific)

Task

Pandas Method

SQL Equivalent

Filter rows

df[condition] or df.query()

WHERE

Select columns

df[["col1","col2"]]

SELECT col1, col2

Aggregate by group

df.groupby().agg()

GROUP BY + aggregate functions

Join two tables

df.merge(other, on=key, how=type)

JOIN

Sort rows

df.sort_values("col", ascending=False)

ORDER BY col DESC

Add computed column

df["new"] = expression

SELECT *, expression AS new

Remove duplicates

df.drop_duplicates(subset=[key])

SELECT DISTINCT / DISTINCT ON

Running total

df["col"].expanding().sum()

SUM() OVER (ORDER BY ...)

Moving average

df["col"].rolling(n).mean()

AVG() OVER (ROWS n PRECEDING)

Unpivot wide to long

df.melt(id_vars, value_vars)

UNPIVOT (vendor-specific)

Summary

Pandas is the backbone of Python-based data analysis. Mastering its core operations — loading, filtering, grouping, merging, reshaping, and rolling calculations — gives you the ability to answer virtually any analytical question on tabular data without leaving Python. The key to fluency is practice: take a dataset you care about and replicate every SQL query you know as a Pandas operation. The SQL-to-Pandas mapping becomes second nature quickly, and once it does, you gain access to the full Python ecosystem — visualisation, machine learning, statistical testing — as a seamless extension of your analytical workflow.