Python for Data Analysis: pandas and numpy Essentials

Why Python for Data Analysis?

Python has become the dominant language for data analysis, displacing spreadsheets and statistical tools like SAS and SPSS for most analytical workflows. Two libraries sit at the foundation of nearly every Python data project: NumPy for fast numerical computing and pandas for tabular data manipulation. Understanding these two libraries deeply — how they represent data, how they handle missing values, and how to use them efficiently — is the prerequisite for virtually everything else in the Python data ecosystem, including visualization (matplotlib, seaborn), machine learning (scikit-learn), and data pipelines.

NumPy: The Foundation of Numerical Python

Concept	Description	Why It Matters for Analysts
ndarray	N-dimensional array of a single numeric dtype stored in contiguous memory	Enables vectorized operations — applying a function to every element without a Python loop, which is 10–100x faster
dtype	The data type of every element in the array (int32, float64, bool, etc.)	Choosing the correct dtype reduces memory usage; float64 uses 8 bytes per element vs float32's 4 bytes
Broadcasting	Rules that allow operations between arrays of different shapes by implicitly expanding dimensions	Allows adding a scalar or 1-D array to a 2-D array without explicit loops
Vectorization	Replacing explicit Python for-loops with array operations executed in compiled C code	Critical for performance on large datasets; a loop over 10M rows in Python takes seconds; the vectorized equivalent takes milliseconds
np.nan	A special IEEE 754 floating-point "not a number" value used to represent missing numeric data	np.nan propagates through arithmetic — any operation involving np.nan returns np.nan — which helps surface missing data problems

pandas Data Structures

Structure	Description	Analogy	Common Creation
Series	1-D labeled array that can hold any dtype; has an index	A single column in a spreadsheet, with row labels	pd.Series([1, 2, 3], index=['a','b','c'])
DataFrame	2-D labeled data structure with columns of potentially different dtypes; has both row index and column labels	A spreadsheet or SQL table	pd.read_csv('file.csv') or pd.DataFrame({'col': values})
Index	Immutable sequence used for axis labels; enables fast lookup and alignment	The row numbers or labels in a spreadsheet	Created automatically; reset with df.reset_index()

Key DataFrame Operations

Operation	Method / Syntax	Notes
Select columns	df['col'] or df[['col1','col2']]	Single brackets return a Series; double brackets return a DataFrame
Filter rows	df[df['col'] > 10] or df.query('col > 10')	Boolean indexing is the most common pattern; .query() is more readable for complex conditions
Label-based indexing	df.loc[row_label, col_label]	Uses index labels; inclusive of both endpoints in slices
Integer-based indexing	df.iloc[row_int, col_int]	Uses integer positions; exclusive of end in slices (like Python lists)
Add/modify column	df['new_col'] = expression	Vectorized assignment; avoid row-by-row loops
Drop rows or columns	df.drop('col', axis=1) or df.drop(index=0)	Returns a new DataFrame by default; use inplace=True to modify in place
Rename columns	df.rename(columns={'old': 'new'})	Pass a dict mapping old names to new names
Sort	df.sort_values('col', ascending=False)	NaN values are placed at the end by default; control with na_position

Grouping and Aggregation in pandas

Pattern	Code	Description
Simple groupby	df.groupby('segment')['revenue'].sum()	Groups rows by segment, then sums revenue within each group
Multiple aggregations	df.groupby('segment').agg({'revenue': 'sum', 'orders': 'count'})	Apply different aggregation functions to different columns
Named aggregations	df.groupby('segment').agg(total_rev=('revenue','sum'), n_orders=('orders','count'))	Produces clean column names directly in the output
Transform (no collapse)	df['pct_of_group'] = df.groupby('segment')['revenue'].transform(lambda x: x / x.sum())	Returns a Series with the same index as the original DataFrame — useful for computing within-group percentages
Pivot table	df.pivot_table(values='revenue', index='region', columns='product', aggfunc='sum')	Two-dimensional groupby; equivalent to a spreadsheet pivot table

Handling Missing Data

Method	Code	When to Use
Detect missing values	df.isna() or df.isnull()	Returns a boolean DataFrame; use df.isna().sum() to count missing values per column
Drop missing rows	df.dropna(subset=['col'])	Use when missingness is random and the proportion is small; risky if missingness is systematic
Fill with a constant	df['col'].fillna(0)	Appropriate for count columns where NULL means zero; dangerous for continuous measures
Forward fill	df['col'].ffill()	Carry the last known value forward; common in time series data
Fill with column mean	df['col'].fillna(df['col'].mean())	Simple imputation; works for MCAR data; do not use for skewed distributions — use median instead

Merging DataFrames

Function	Equivalent SQL	Key Parameters
pd.merge(left, right, on='key', how='inner')	INNER JOIN	how: 'inner','left','right','outer'; on: join key column(s)
pd.merge(left, right, left_on='a', right_on='b')	JOIN ON different column names	Use left_on/right_on when key columns have different names in each DataFrame
pd.concat([df1, df2], axis=0)	UNION ALL	axis=0 stacks rows; use ignore_index=True to reset the index
pd.concat([df1, df2], axis=1)	Horizontal join (by position)	Joins side by side on index; both DataFrames must have matching indices

Summary

NumPy provides the high-performance array foundation that makes Python competitive with compiled languages for numerical computation. pandas builds on top of NumPy to provide the DataFrame abstraction — the lingua franca of tabular data in Python. Mastering selection, filtering, groupby aggregations, merges, and missing-value handling in pandas covers the vast majority of day-to-day data cleaning and exploration tasks. The key principle throughout is to favour vectorized operations over explicit loops, which keeps code concise and ensures performance scales to larger datasets.