Why Python for Data Analysis?
Python has become the dominant language for data analysis, displacing spreadsheets and statistical tools like SAS and SPSS for most analytical workflows. Two libraries sit at the foundation of nearly every Python data project: NumPy for fast numerical computing and pandas for tabular data manipulation. Understanding these two libraries deeply — how they represent data, how they handle missing values, and how to use them efficiently — is the prerequisite for virtually everything else in the Python data ecosystem, including visualization (matplotlib, seaborn), machine learning (scikit-learn), and data pipelines.
NumPy: The Foundation of Numerical Python
Concept | Description | Why It Matters for Analysts |
|---|---|---|
ndarray | N-dimensional array of a single numeric dtype stored in contiguous memory | Enables vectorized operations — applying a function to every element without a Python loop, which is 10–100x faster |
dtype | The data type of every element in the array (int32, float64, bool, etc.) | Choosing the correct dtype reduces memory usage; float64 uses 8 bytes per element vs float32's 4 bytes |
Broadcasting | Rules that allow operations between arrays of different shapes by implicitly expanding dimensions | Allows adding a scalar or 1-D array to a 2-D array without explicit loops |
Vectorization | Replacing explicit Python for-loops with array operations executed in compiled C code | Critical for performance on large datasets; a loop over 10M rows in Python takes seconds; the vectorized equivalent takes milliseconds |
np.nan | A special IEEE 754 floating-point "not a number" value used to represent missing numeric data | np.nan propagates through arithmetic — any operation involving np.nan returns np.nan — which helps surface missing data problems |
pandas Data Structures
Structure | Description | Analogy | Common Creation |
|---|---|---|---|
Series | 1-D labeled array that can hold any dtype; has an index | A single column in a spreadsheet, with row labels | pd.Series([1, 2, 3], index=['a','b','c']) |
DataFrame | 2-D labeled data structure with columns of potentially different dtypes; has both row index and column labels | A spreadsheet or SQL table | pd.read_csv('file.csv') or pd.DataFrame({'col': values}) |
Index | Immutable sequence used for axis labels; enables fast lookup and alignment | The row numbers or labels in a spreadsheet | Created automatically; reset with df.reset_index() |
Key DataFrame Operations
Operation | Method / Syntax | Notes |
|---|---|---|
Select columns | df['col'] or df[['col1','col2']] | Single brackets return a Series; double brackets return a DataFrame |
Filter rows | df[df['col'] > 10] or df.query('col > 10') | Boolean indexing is the most common pattern; .query() is more readable for complex conditions |
Label-based indexing | df.loc[row_label, col_label] | Uses index labels; inclusive of both endpoints in slices |
Integer-based indexing | df.iloc[row_int, col_int] | Uses integer positions; exclusive of end in slices (like Python lists) |
Add/modify column | df['new_col'] = expression | Vectorized assignment; avoid row-by-row loops |
Drop rows or columns | df.drop('col', axis=1) or df.drop(index=0) | Returns a new DataFrame by default; use inplace=True to modify in place |
Rename columns | df.rename(columns={'old': 'new'}) | Pass a dict mapping old names to new names |
Sort | df.sort_values('col', ascending=False) | NaN values are placed at the end by default; control with na_position |
Grouping and Aggregation in pandas
Pattern | Code | Description |
|---|---|---|
Simple groupby | df.groupby('segment')['revenue'].sum() | Groups rows by segment, then sums revenue within each group |
Multiple aggregations | df.groupby('segment').agg({'revenue': 'sum', 'orders': 'count'}) | Apply different aggregation functions to different columns |
Named aggregations | df.groupby('segment').agg(total_rev=('revenue','sum'), n_orders=('orders','count')) | Produces clean column names directly in the output |
Transform (no collapse) | df['pct_of_group'] = df.groupby('segment')['revenue'].transform(lambda x: x / x.sum()) | Returns a Series with the same index as the original DataFrame — useful for computing within-group percentages |
Pivot table | df.pivot_table(values='revenue', index='region', columns='product', aggfunc='sum') | Two-dimensional groupby; equivalent to a spreadsheet pivot table |
Handling Missing Data
Method | Code | When to Use |
|---|---|---|
Detect missing values | df.isna() or df.isnull() | Returns a boolean DataFrame; use df.isna().sum() to count missing values per column |
Drop missing rows | df.dropna(subset=['col']) | Use when missingness is random and the proportion is small; risky if missingness is systematic |
Fill with a constant | df['col'].fillna(0) | Appropriate for count columns where NULL means zero; dangerous for continuous measures |
Forward fill | df['col'].ffill() | Carry the last known value forward; common in time series data |
Fill with column mean | df['col'].fillna(df['col'].mean()) | Simple imputation; works for MCAR data; do not use for skewed distributions — use median instead |
Merging DataFrames
Function | Equivalent SQL | Key Parameters |
|---|---|---|
pd.merge(left, right, on='key', how='inner') | INNER JOIN | how: 'inner','left','right','outer'; on: join key column(s) |
pd.merge(left, right, left_on='a', right_on='b') | JOIN ON different column names | Use left_on/right_on when key columns have different names in each DataFrame |
pd.concat([df1, df2], axis=0) | UNION ALL | axis=0 stacks rows; use ignore_index=True to reset the index |
pd.concat([df1, df2], axis=1) | Horizontal join (by position) | Joins side by side on index; both DataFrames must have matching indices |
Summary
NumPy provides the high-performance array foundation that makes Python competitive with compiled languages for numerical computation. pandas builds on top of NumPy to provide the DataFrame abstraction — the lingua franca of tabular data in Python. Mastering selection, filtering, groupby aggregations, merges, and missing-value handling in pandas covers the vast majority of day-to-day data cleaning and exploration tasks. The key principle throughout is to favour vectorized operations over explicit loops, which keeps code concise and ensures performance scales to larger datasets.
Create a free reader account to keep reading.