What is pandas?
pandas is the primary data manipulation library for Python. Built on top of NumPy, it provides two core data structures — Series (1-dimensional labeled array) and DataFrame (2-dimensional labeled table) — along with an extensive API for loading, cleaning, transforming, aggregating, and merging tabular data. For most data analysts working in Python, pandas is the daily workhorse: it handles everything from reading CSV files to computing grouped statistics to preparing data for visualization or machine learning.
The standard import convention is import pandas as pd.
Series and DataFrame
Structure | Description | Analogy |
|---|---|---|
Series | 1D labeled array with an index | A single column in a spreadsheet |
DataFrame | 2D table with labeled rows and columns | A full spreadsheet or SQL table |
Index | Labels for rows (can be integers, strings, dates) | Row IDs or primary key |
Reading and Writing Data
Function | Description |
|---|---|
pd.read_csv(path) | Read CSV file into DataFrame |
pd.read_excel(path) | Read Excel file (.xlsx, .xls) |
pd.read_json(path) | Read JSON file or string |
pd.read_parquet(path) | Read Parquet (columnar, efficient for large data) |
pd.read_sql(query, con) | Execute SQL query and load result |
df.to_csv(path) | Write DataFrame to CSV |
df.to_parquet(path) | Write DataFrame to Parquet |
Key parameters for read_csv: sep (delimiter), header (row to use as column names), index_col (column to use as index), dtype (specify column types), parse_dates (columns to parse as datetime), na_values (strings to treat as NaN), nrows (read only N rows).
Exploring a DataFrame
Method / Attribute | What It Returns |
|---|---|
df.shape | Tuple (rows, columns) |
df.dtypes | Data type of each column |
df.head(n) | First n rows (default 5) |
df.tail(n) | Last n rows |
df.info() | Column names, dtypes, non-null counts, memory usage |
df.describe() | Summary statistics (count, mean, std, min, quartiles, max) |
df.value_counts() | Frequency of each unique value in a Series |
df.nunique() | Number of unique values per column |
df.isnull().sum() | Count of missing values per column |
Selecting Data
pandas provides two primary selectors: .loc[] for label-based selection and .iloc[] for integer-position-based selection.
Expression | Result |
|---|---|
df['col'] | Select a single column (returns Series) |
df[['col1', 'col2']] | Select multiple columns (returns DataFrame) |
df.loc[row_label, col_label] | Select by row and column labels |
df.iloc[row_int, col_int] | Select by integer position |
df.loc[df['col'] > 5] | Boolean filtering |
df.query('col > 5 and col2 == "A"') | SQL-like string filtering |
Handling Missing Data
Method | Description |
|---|---|
df.isnull() | Boolean mask of missing values |
df.dropna() | Remove rows (or columns) with any missing values |
df.dropna(subset=['col']) | Drop rows where specific column is null |
df.fillna(value) | Replace NaN with a constant value |
df['col'].fillna(df['col'].mean()) | Impute with column mean |
df.fillna(method='ffill') | Forward-fill: propagate last valid value forward |
df.fillna(method='bfill') | Back-fill: propagate next valid value backward |
Transforming Data
Adding and modifying columns:
df['new_col'] = df['a'] + df['b'] creates a new column from existing ones. df['col'] = df['col'].str.upper() transforms string columns using the .str accessor. df['col'] = df['col'].astype(float) casts types.
Applying functions:
df['col'].apply(func) applies a function element-wise to a Series. df.apply(func, axis=1) applies a function row-wise across the DataFrame. df.applymap(func) applies a function element-wise to every cell (deprecated in newer pandas; use df.map(func)).
Renaming and reindexing:
df.rename(columns={'old': 'new'}) renames columns. df.set_index('col') sets a column as the index. df.reset_index() moves the index back into a column.
GroupBy and Aggregation
The groupby operation is one of the most powerful features in pandas. It follows a split-apply-combine pattern: split the DataFrame into groups, apply an aggregation or transformation to each group, and combine the results.
df.groupby('category')['revenue'].sum() — sum of revenue per category.
df.groupby(['region', 'category'])['revenue'].agg(['sum', 'mean', 'count']) — multiple aggregations on grouped data.
df.groupby('user_id').agg({'revenue': 'sum', 'order_id': 'nunique', 'date': 'max'}) — different aggregations per column.
Transform vs Aggregate: .agg() reduces each group to a scalar (result has fewer rows). .transform() returns a result with the same shape as the original DataFrame, useful for adding group-level statistics as new columns: df['group_mean'] = df.groupby('cat')['val'].transform('mean').
Merging and Joining DataFrames
Operation | Description | SQL Equivalent |
|---|---|---|
pd.merge(left, right, on='key', how='inner') | Join on shared column | INNER JOIN |
pd.merge(left, right, on='key', how='left') | Keep all left rows | LEFT JOIN |
pd.merge(left, right, on='key', how='outer') | Keep all rows from both | FULL OUTER JOIN |
pd.concat([df1, df2], axis=0) | Stack DataFrames vertically | UNION ALL |
pd.concat([df1, df2], axis=1) | Join DataFrames side by side | Column binding |
Pivot Tables and Reshaping
df.pivot_table(values='revenue', index='region', columns='category', aggfunc='sum') creates a cross-tabulation similar to an Excel pivot table. fill_value=0 replaces NaN with 0 in empty cells.
df.melt(id_vars=['id'], value_vars=['jan', 'feb', 'mar']) unpivots columns into rows — converting wide format to long format. This is the inverse of pivot. Long format is typically better for analysis and visualization.
df.stack() and df.unstack() operate on hierarchical (MultiIndex) DataFrames, collapsing or expanding index levels.
Working with Dates and Times
Convert a column to datetime with pd.to_datetime(df['date_col']). Once a column is datetime type, the .dt accessor exposes many useful attributes and methods:
Expression | Result |
|---|---|
df['date'].dt.year | Year component |
df['date'].dt.month | Month (1-12) |
df['date'].dt.day_name() | Day of week name (Monday, Tuesday...) |
df['date'].dt.quarter | Quarter (1-4) |
df['date'].dt.to_period('M') | Convert to monthly period |
df.resample('W', on='date')['val'].sum() | Weekly aggregation of a time series |
Performance Tips
Use vectorized operations: Avoid Python loops over rows. Operations on entire columns (e.g., df['a'] + df['b']) execute in compiled C code and are orders of magnitude faster than iterating with iterrows().
Choose the right dtype: Using category dtype for low-cardinality string columns (e.g., region, status) dramatically reduces memory usage and speeds up groupby operations. Use int32 or float32 instead of 64-bit types when precision allows.
Prefer query() for filtering: df.query() is often faster than boolean indexing for large DataFrames because it uses numexpr under the hood.
Read only needed columns: pd.read_csv(path, usecols=['col1', 'col2']) avoids loading unnecessary data into memory.
Summary
pandas is the cornerstone of tabular data analysis in Python. Mastering its selection syntax (.loc/.iloc), groupby/agg patterns, merge operations, and datetime handling covers the vast majority of real-world data manipulation tasks. Understanding how pandas relates to the underlying NumPy arrays — and when to use vectorized operations versus apply — helps analysts write code that is both readable and performant. As datasets grow beyond memory limits, the pandas API carries over cleanly to distributed frameworks like Dask, Polars, and PySpark, making these foundations transferable at scale.
Create a free reader account to keep reading.