pandas for Data Analysis: DataFrames, GroupBy, Merging, and Time Series

What is pandas?

pandas is the primary data manipulation library for Python. Built on top of NumPy, it provides two core data structures — Series (1-dimensional labeled array) and DataFrame (2-dimensional labeled table) — along with an extensive API for loading, cleaning, transforming, aggregating, and merging tabular data. For most data analysts working in Python, pandas is the daily workhorse: it handles everything from reading CSV files to computing grouped statistics to preparing data for visualization or machine learning.

The standard import convention is import pandas as pd.

Series and DataFrame

Structure	Description	Analogy
Series	1D labeled array with an index	A single column in a spreadsheet
DataFrame	2D table with labeled rows and columns	A full spreadsheet or SQL table
Index	Labels for rows (can be integers, strings, dates)	Row IDs or primary key

Reading and Writing Data

Function	Description
pd.read_csv(path)	Read CSV file into DataFrame
pd.read_excel(path)	Read Excel file (.xlsx, .xls)
pd.read_json(path)	Read JSON file or string
pd.read_parquet(path)	Read Parquet (columnar, efficient for large data)
pd.read_sql(query, con)	Execute SQL query and load result
df.to_csv(path)	Write DataFrame to CSV
df.to_parquet(path)	Write DataFrame to Parquet

Key parameters for read_csv: sep (delimiter), header (row to use as column names), index_col (column to use as index), dtype (specify column types), parse_dates (columns to parse as datetime), na_values (strings to treat as NaN), nrows (read only N rows).

Exploring a DataFrame

Method / Attribute	What It Returns
df.shape	Tuple (rows, columns)
df.dtypes	Data type of each column
df.head(n)	First n rows (default 5)
df.tail(n)	Last n rows
df.info()	Column names, dtypes, non-null counts, memory usage
df.describe()	Summary statistics (count, mean, std, min, quartiles, max)
df.value_counts()	Frequency of each unique value in a Series
df.nunique()	Number of unique values per column
df.isnull().sum()	Count of missing values per column

Selecting Data

pandas provides two primary selectors: .loc[] for label-based selection and .iloc[] for integer-position-based selection.

Expression	Result
df['col']	Select a single column (returns Series)
df[['col1', 'col2']]	Select multiple columns (returns DataFrame)
df.loc[row_label, col_label]	Select by row and column labels
df.iloc[row_int, col_int]	Select by integer position
df.loc[df['col'] > 5]	Boolean filtering
df.query('col > 5 and col2 == "A"')	SQL-like string filtering

Handling Missing Data

Method	Description
df.isnull()	Boolean mask of missing values
df.dropna()	Remove rows (or columns) with any missing values
df.dropna(subset=['col'])	Drop rows where specific column is null
df.fillna(value)	Replace NaN with a constant value
df['col'].fillna(df['col'].mean())	Impute with column mean
df.fillna(method='ffill')	Forward-fill: propagate last valid value forward
df.fillna(method='bfill')	Back-fill: propagate next valid value backward

Transforming Data

Adding and modifying columns:

df['new_col'] = df['a'] + df['b'] creates a new column from existing ones. df['col'] = df['col'].str.upper() transforms string columns using the .str accessor. df['col'] = df['col'].astype(float) casts types.

Applying functions:

df['col'].apply(func) applies a function element-wise to a Series. df.apply(func, axis=1) applies a function row-wise across the DataFrame. df.applymap(func) applies a function element-wise to every cell (deprecated in newer pandas; use df.map(func)).

Renaming and reindexing:

df.rename(columns={'old': 'new'}) renames columns. df.set_index('col') sets a column as the index. df.reset_index() moves the index back into a column.

GroupBy and Aggregation

The groupby operation is one of the most powerful features in pandas. It follows a split-apply-combine pattern: split the DataFrame into groups, apply an aggregation or transformation to each group, and combine the results.

df.groupby('category')['revenue'].sum() — sum of revenue per category.

df.groupby(['region', 'category'])['revenue'].agg(['sum', 'mean', 'count']) — multiple aggregations on grouped data.

df.groupby('user_id').agg({'revenue': 'sum', 'order_id': 'nunique', 'date': 'max'}) — different aggregations per column.

Transform vs Aggregate: .agg() reduces each group to a scalar (result has fewer rows). .transform() returns a result with the same shape as the original DataFrame, useful for adding group-level statistics as new columns: df['group_mean'] = df.groupby('cat')['val'].transform('mean').

Merging and Joining DataFrames

Operation	Description	SQL Equivalent
pd.merge(left, right, on='key', how='inner')	Join on shared column	INNER JOIN
pd.merge(left, right, on='key', how='left')	Keep all left rows	LEFT JOIN
pd.merge(left, right, on='key', how='outer')	Keep all rows from both	FULL OUTER JOIN
pd.concat([df1, df2], axis=0)	Stack DataFrames vertically	UNION ALL
pd.concat([df1, df2], axis=1)	Join DataFrames side by side	Column binding

Pivot Tables and Reshaping

df.pivot_table(values='revenue', index='region', columns='category', aggfunc='sum') creates a cross-tabulation similar to an Excel pivot table. fill_value=0 replaces NaN with 0 in empty cells.

df.melt(id_vars=['id'], value_vars=['jan', 'feb', 'mar']) unpivots columns into rows — converting wide format to long format. This is the inverse of pivot. Long format is typically better for analysis and visualization.

df.stack() and df.unstack() operate on hierarchical (MultiIndex) DataFrames, collapsing or expanding index levels.

Working with Dates and Times

Convert a column to datetime with pd.to_datetime(df['date_col']). Once a column is datetime type, the .dt accessor exposes many useful attributes and methods:

Expression	Result
df['date'].dt.year	Year component
df['date'].dt.month	Month (1-12)
df['date'].dt.day_name()	Day of week name (Monday, Tuesday...)
df['date'].dt.quarter	Quarter (1-4)
df['date'].dt.to_period('M')	Convert to monthly period
df.resample('W', on='date')['val'].sum()	Weekly aggregation of a time series

Performance Tips

Use vectorized operations: Avoid Python loops over rows. Operations on entire columns (e.g., df['a'] + df['b']) execute in compiled C code and are orders of magnitude faster than iterating with iterrows().

Choose the right dtype: Using category dtype for low-cardinality string columns (e.g., region, status) dramatically reduces memory usage and speeds up groupby operations. Use int32 or float32 instead of 64-bit types when precision allows.

Prefer query() for filtering: df.query() is often faster than boolean indexing for large DataFrames because it uses numexpr under the hood.

Read only needed columns: pd.read_csv(path, usecols=['col1', 'col2']) avoids loading unnecessary data into memory.

Summary

pandas is the cornerstone of tabular data analysis in Python. Mastering its selection syntax (.loc/.iloc), groupby/agg patterns, merge operations, and datetime handling covers the vast majority of real-world data manipulation tasks. Understanding how pandas relates to the underlying NumPy arrays — and when to use vectorized operations versus apply — helps analysts write code that is both readable and performant. As datasets grow beyond memory limits, the pandas API carries over cleanly to distributed frameworks like Dask, Polars, and PySpark, making these foundations transferable at scale.