Why Pandas Is the Data Analyst's Core Python Tool
Pandas is the foundational Python library for data manipulation and analysis. Built on top of NumPy, it introduces the DataFrame — a two-dimensional labelled data structure that behaves like a spreadsheet or SQL table but with the full power of a programming language behind it. Almost every Python-based data workflow begins with Pandas: loading data, cleaning it, transforming it, aggregating it, and preparing it for visualisation or modelling.
This guide covers the Pandas operations that appear in real analytical work every day, with practical examples throughout.
Loading and Inspecting Data
import pandas as pd
import numpy as np
# Load from various sources
df_csv = pd.read_csv("orders.csv", parse_dates=["order_date"])
df_excel = pd.read_excel("report.xlsx", sheet_name="Sales")
df_sql = pd.read_sql("SELECT * FROM orders WHERE status = 'PAID'", con=engine)
df_json = pd.read_json("events.json")
# Basic inspection
print(df_csv.shape) # (rows, columns)
print(df_csv.dtypes) # data types per column
print(df_csv.head()) # first 5 rows
print(df_csv.describe()) # stats for numeric columns
print(df_csv.info()) # non-null counts + dtypes
# Check nulls
print(df_csv.isnull().sum())
# Value counts for a column
print(df_csv["status"].value_counts(normalize=True).mul(100).round(1))
Selecting, Filtering, and Slicing
# Select columns
df["order_value"] # single column (Series)
df[["user_id", "order_value", "status"]] # multiple columns (DataFrame)
# Filter rows
df[df["order_value"] > 100]
df[(df["status"] == "PAID") & (df["order_value"] > 50)]
df[df["country"].isin(["US", "UK", "DE"])]
df[df["email"].str.contains("@gmail.com", na=False)]
# loc: label-based selection
df.loc[df["status"] == "PAID", ["user_id", "order_value"]]
# iloc: position-based selection
df.iloc[0:10, 0:4] # first 10 rows, first 4 columns
# query: readable string syntax
df.query("order_value > 100 and status == 'PAID'")
Aggregation and GroupBy
GroupBy is the Pandas equivalent of SQL's GROUP BY — it splits the DataFrame into groups, applies a function to each, and combines the results.
# Simple aggregation
df.groupby("country")["order_value"].sum()
df.groupby("country")["order_value"].agg(["sum", "mean", "count"])
# Multiple columns in groupby
df.groupby(["country", "status"])["order_value"].sum().reset_index()
# Named aggregations (pandas >= 0.25)
summary = df.groupby("country").agg(
total_revenue = ("order_value", "sum"),
order_count = ("order_id", "nunique"),
avg_order = ("order_value", "mean"),
first_order = ("order_date", "min"),
last_order = ("order_date", "max"),
).reset_index()
# Transform: add group aggregate back to original rows
df["country_avg"] = df.groupby("country")["order_value"].transform("mean")
df["pct_of_country"] = df["order_value"] / df["country_avg"] * 100
Merging and Joining DataFrames
Pandas merge mirrors SQL joins and is one of the most frequently used operations in analytical pipelines.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
products = pd.read_csv("products.csv")
# Inner join: only rows with matches in both tables
df = orders.merge(customers, on="user_id", how="inner")
# Left join: all orders, customer info where available
df = orders.merge(customers, on="user_id", how="left")
# Join on different column names
df = orders.merge(products, left_on="product_id", right_on="id", how="left")
# Chain multiple joins
df = (orders
.merge(customers, on="user_id", how="left")
.merge(products, on="product_id", how="left"))
# Check for join quality: rows lost/gained
print("Before:", len(orders))
print("After: ", len(df))
print("Unmatched:", df["customer_name"].isnull().sum())
Reshaping: Pivot Tables and Melt
Reshaping converts data between wide and long formats — essential for preparing data for visualisation libraries and for computing cross-tabulations.
# Pivot table: rows = month, columns = country, values = revenue
pivot = df.pivot_table(
index="month", columns="country", values="order_value",
aggfunc="sum", fill_value=0
)
# Crosstab: frequency table of two categorical variables
cross = pd.crosstab(df["device_type"], df["status"],
margins=True, normalize="index")
# Stack / unstack: toggle between wide and long
long_df = pivot.stack().reset_index(name="revenue")
wide_df = long_df.pivot(index="month", columns="country", values="revenue")
# Melt: unpivot wide to long (useful for Seaborn and Plotly)
long = df.melt(
id_vars=["order_date", "country"],
value_vars=["gross_revenue", "net_revenue", "tax"],
var_name="metric", value_name="amount"
)
Window Functions with Rolling and Expanding
Pandas rolling and expanding methods replicate SQL window functions, enabling moving averages, cumulative sums, and rank calculations directly on DataFrames.
df = df.sort_values("order_date")
# 7-day rolling average of daily revenue
df["revenue_7d_avg"] = df["daily_revenue"].rolling(window=7, min_periods=1).mean()
# 28-day rolling sum
df["revenue_28d"] = df["daily_revenue"].rolling(28).sum()
# Expanding (cumulative) sum — grows with each row
df["cumulative_revenue"] = df["daily_revenue"].expanding().sum()
# Percentage change vs. prior period
df["wow_growth"] = df["weekly_revenue"].pct_change(periods=1).mul(100).round(1)
# Year-over-year change (52 weeks back)
df["yoy_growth"] = df["weekly_revenue"].pct_change(periods=52).mul(100).round(1)
# Rank within group
df["revenue_rank"] = df.groupby("month")["order_value"].rank(ascending=False)
Key Pandas Operations Quick Reference
| Task | Pandas Method | SQL Equivalent |
|---|---|---|
| Filter rows | df[condition] or df.query() | WHERE |
| Select columns | df[["col1","col2"]] | SELECT col1, col2 |
| Aggregate by group | df.groupby().agg() | GROUP BY + aggregate functions |
| Join two tables | df.merge(other, on=key, how=type) | JOIN |
| Sort rows | df.sort_values("col", ascending=False) | ORDER BY col DESC |
| Add computed column | df["new"] = expression | SELECT *, expression AS new |
| Remove duplicates | df.drop_duplicates(subset=[key]) | SELECT DISTINCT / DISTINCT ON |
| Running total | df["col"].expanding().sum() | SUM() OVER (ORDER BY ...) |
| Moving average | df["col"].rolling(n).mean() | AVG() OVER (ROWS n PRECEDING) |
| Unpivot wide to long | df.melt(id_vars, value_vars) | UNPIVOT (vendor-specific) |
Summary
Pandas is the backbone of Python-based data analysis. Mastering its core operations — loading, filtering, grouping, merging, reshaping, and rolling calculations — gives you the ability to answer virtually any analytical question on tabular data without leaving Python. The key to fluency is practice: take a dataset you care about and replicate every SQL query you know as a Pandas operation. The SQL-to-Pandas mapping becomes second nature quickly, and once it does, you gain access to the full Python ecosystem — visualisation, machine learning, statistical testing — as a seamless extension of your analytical workflow.
Create a free reader account to keep reading.