What Is NumPy?
NumPy (Numerical Python) is the foundational library for numerical computing in Python. It provides the ndarray, a fast, flexible n-dimensional array, along with a comprehensive suite of mathematical functions for working with arrays. Most major data science libraries in Python either build directly on NumPy (pandas, scikit-learn, SciPy) or interoperate closely with NumPy arrays (TensorFlow, PyTorch).
The core advantage of NumPy over plain Python lists is performance. NumPy arrays are stored in contiguous blocks of memory with a fixed data type, enabling vectorized operations that execute in compiled C code rather than interpreted Python loops. For large arrays, vectorized code is often 10–100x faster than an equivalent pure Python loop, although the exact speedup depends on the operation and the array size.
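As a minimal sketch of what "vectorized" means in practice, the example below computes the same result with a Python-level loop and with a single array expression. The array size and the formula are arbitrary choices made for illustration.

```python
import numpy as np

# Hypothetical data: 100,000 float64 values (size chosen for illustration).
values = np.arange(100_000, dtype=np.float64)

# Pure-Python approach: the interpreter processes one element per iteration.
loop_result = [v * 2.0 + 1.0 for v in values.tolist()]

# Vectorized approach: one expression, evaluated in compiled code.
vectorized_result = values * 2.0 + 1.0

# Both approaches produce the same numbers.
same = np.array_equal(np.array(loop_result), vectorized_result)
print(same)
```

To measure the speed difference on a particular machine, each approach can be wrapped in the standard library's timeit module; the ratio will vary with hardware and array size.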
Installing and Importing NumPy
NumPy is included in most data science environments (Anaconda, Google Colab, etc.) and can be installed with pip install numpy. The universal import convention is import numpy as np.
The ndarray
The ndarray is the central data structure in NumPy. Every array has a shape (dimensions and sizes), a dtype (data type of elements), and an ndim (number of dimensions).
| Attribute | Meaning | Example |
|---|---|---|
| shape | Tuple describing size of each dimension | (3,) for 1D of 3 elements; (4, 5) for 2D matrix |
| dtype | Data type of all elements (must be uniform) | float64, int32, bool, complex128 |
| ndim | Number of dimensions (axes) | 1 for vector, 2 for matrix, 3 for tensor |
| size | Total number of elements | 20 for a (4, 5) array |
| itemsize | Bytes per element | 8 for float64 |
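The attributes in the table can be inspected directly on any array. A short sketch, using a hypothetical (4, 5) array of zeros:

```python
import numpy as np

# A hypothetical 4x5 array of float64 values.
a = np.zeros((4, 5), dtype=np.float64)

print(a.shape)     # (4, 5)
print(a.dtype)     # float64
print(a.ndim)      # 2
print(a.size)      # 20
print(a.itemsize)  # 8
```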
Creating Arrays
| Function | Description | Example |
|---|---|---|
| np.array() | Create array from Python list or nested list | np.array([1, 2, 3]) |
| np.zeros(shape) | Array of all zeros | np.zeros((3, 4)) |
| np.ones(shape) | Array of all ones | np.ones((2, 5)) |
| np.full(shape, val) | Array filled with a constant value | np.full((3, 3), 7) |
| np.arange(start, stop, step) | Evenly spaced values (like Python range; stop is excluded) | np.arange(0, 10, 2) |
| np.linspace(start, stop, n) | n evenly spaced values from start to stop, both included by default | np.linspace(0, 1, 100) |
| np.eye(n) | n×n identity matrix | np.eye(3) |
| np.random.rand(d0, d1, ...) | Uniform random values in [0, 1); dimensions are separate arguments, not a tuple | np.random.rand(4, 4) |
| np.random.randn(d0, d1, ...) | Standard normal (Gaussian) random values; dimensions are separate arguments | np.random.randn(100) |
| np.random.randint(low, high, size) | Random integers in [low, high) | np.random.randint(1, 7, 10) |
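The creation functions above can be tried side by side. In this sketch the seed value and the variable names are arbitrary choices for illustration; the seed only makes the random arrays reproducible.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the random arrays are reproducible

from_list = np.array([1, 2, 3])
zeros = np.zeros((3, 4))
sevens = np.full((3, 3), 7)
evens = np.arange(0, 10, 2)         # [0 2 4 6 8]; the stop value 10 is excluded
grid = np.linspace(0, 1, 5)         # [0.   0.25 0.5  0.75 1.  ]; endpoints included
identity = np.eye(3)
uniform = np.random.rand(4, 4)      # dimensions passed as separate arguments
normal = np.random.randn(100)
dice = np.random.randint(1, 7, 10)  # ten integers, each from 1 to 6
```

The np.random functions shown here belong to NumPy's legacy random interface; recent NumPy documentation recommends np.random.default_rng() for new code, which may be worth considering outside of quick experiments.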
Indexing and Slicing
NumPy arrays support indexing mechanisms that go beyond those of Python lists. Indices are zero-based, so row 2 is the third row. For a 2D array a:
| Expression | Result |
|---|---|
| a[2, 3] | Element at row 2, column 3 |
| a[0:3, :] | First 3 rows, all columns |
| a[:, 1] | All rows, column 1 (returns 1D array) |
| a[a > 5] | Boolean indexing: elements greater than 5 |
| a[[0, 2, 4], :] | Fancy indexing: rows 0, 2, and 4 |
| a[..., -1] | Last column (ellipsis selects all preceding dimensions) |
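Important: basic slices (such as a[0:3, :] or a[:, 1]) return views of the original array, not copies, so modifying a slice modifies the original. Boolean and fancy indexing, by contrast, return copies. Use .copy() when you need an independent copy of a slice.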
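The indexing expressions and the view behavior can be checked on a small array. This is a sketch on a hypothetical 5×6 array whose values simply count up from 0:

```python
import numpy as np

a = np.arange(30).reshape(5, 6)  # 5 rows, 6 columns, values 0..29

element = a[2, 3]         # 15
first_rows = a[0:3, :]    # shape (3, 6)
column = a[:, 1]          # shape (5,)
large = a[a > 25]         # [26 27 28 29]
picked = a[[0, 2, 4], :]  # shape (3, 6)
last_col = a[..., -1]     # [ 5 11 17 23 29]

# A basic slice is a view: writing through it changes `a`.
view = a[0:2, :]
view[0, 0] = -1
print(a[0, 0])            # -1

# .copy() produces independent data: `a` is unaffected.
independent = a[0:2, :].copy()
independent[0, 1] = 99
print(a[0, 1])            # 1

# Fancy indexing already returned a copy, so `picked` shares no memory with `a`.
print(np.shares_memory(view, a), np.shares_memory(picked, a))  # True False
```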
Vectorized Operations and Broadcasting
NumPy operations apply element-wise across arrays without explicit loops. Arithmetic operators (+, -, *, /, **) all work element-wise on arrays of the same shape.
Broadcasting extends element-wise operations to arrays of different but compatible shapes. The rules are: NumPy compares shapes starting from the rightmost dimension; two dimensions are compatible if they are equal or one of them is 1; a size-1 dimension is "stretched" to match the other array's dimension; and an array with fewer dimensions is treated as if its shape were padded with 1s on the left. For example, adding a shape (4, 3) array to a shape (3,) array broadcasts the 1D array across all 4 rows.
Broadcasting enables efficient operations like subtracting the column mean from every row without any Python loops: centered = a - a.mean(axis=0).
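A small sketch of these rules, using hypothetical arrays of shape (4, 3), (3,), and (4,). The last case shows how np.newaxis makes an otherwise incompatible shape broadcastable.

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])  # shape (4, 3)
row = np.array([10.0, 20.0, 30.0])  # shape (3,)

# (4, 3) + (3,): the 1D array is applied to each of the 4 rows.
shifted = a + row

# Column-wise centering from the text.
col_means = a.mean(axis=0)  # shape (3,): [5.5 6.5 7.5]
centered = a - col_means    # every column now has mean 0

# A shape (4,) array does not line up with the trailing dimension 3.
# Adding a trailing axis gives shape (4, 1), which is compatible with (4, 3).
per_row = np.array([1.0, 2.0, 3.0, 4.0])
scaled = a * per_row[:, np.newaxis]
```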
Aggregation Functions
| Function | Description (no axis given) | With axis=0 on a 2D array |
|---|---|---|
| np.sum(a) | Sum of all elements | Sum per column |
| np.mean(a) | Arithmetic mean | Mean per column |
| np.std(a) | Standard deviation (population form, ddof=0, by default) | Std per column |
| np.var(a) | Variance (population form, ddof=0, by default) | Variance per column |
| np.min(a) / np.max(a) | Minimum / maximum | Min/max per column |
| np.argmin(a) / np.argmax(a) | Index of min/max in the flattened array | Row index of min/max per column |
| np.median(a) | Median value | Median per column |
| np.percentile(a, q) | qth percentile | Percentile per column |
| np.cumsum(a) | Cumulative sum over the flattened array | Cumulative sum down each column |
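The effect of the axis argument is easiest to see on a tiny array. A sketch with a hypothetical 2×3 array:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

total = np.sum(a)                   # 21
per_column = np.sum(a, axis=0)      # [5 7 9]
per_row = np.sum(a, axis=1)         # [ 6 15]

flat_argmax = np.argmax(a)          # 5, an index into the flattened array
col_argmax = np.argmax(a, axis=0)   # [1 1 1], the row index of each column's max

running = np.cumsum(a, axis=0)      # [[1 2 3], [5 7 9]]
median_cols = np.median(a, axis=0)  # [2.5 3.5 4.5]
```

With axis=0 the operation collapses the rows and leaves one result per column; with axis=1 it collapses the columns and leaves one result per row.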
Reshaping and Stacking
| Operation | Description |
|---|---|
| a.reshape(new_shape) | Change shape without changing data; total elements must match |
| a.flatten() | Collapse to 1D array (returns a copy) |
| a.ravel() | Collapse to 1D array (returns a view when possible) |
| a.T or a.transpose() | Transpose (reverses the order of the axes) |
| np.concatenate([a, b], axis=0) | Join arrays along an existing axis |
| np.vstack([a, b]) | Stack arrays vertically (row-wise) |
| np.hstack([a, b]) | Stack arrays horizontally (column-wise) |
| np.split(a, n, axis) | Split array into n equal sub-arrays (raises an error if the axis length is not divisible by n) |
| np.newaxis | Add a new axis (increase dimensions by 1) |
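The operations in the table can be combined as in the following sketch. The array contents and variable names are hypothetical and chosen only to keep the shapes easy to follow.

```python
import numpy as np

a = np.arange(6)         # [0 1 2 3 4 5]
m = a.reshape(2, 3)      # [[0 1 2], [3 4 5]]
t = m.T                  # shape (3, 2)

flat_copy = m.flatten()  # always a copy
flat_view = m.ravel()    # a view here, because m is contiguous in memory

b = np.arange(6, 12).reshape(2, 3)
rows = np.vstack([m, b])                       # shape (4, 3)
cols = np.hstack([m, b])                       # shape (2, 6)
same_as_rows = np.concatenate([m, b], axis=0)  # identical to rows

halves = np.split(rows, 2, axis=0)  # list of two arrays, each of shape (2, 3)
column_vector = a[:, np.newaxis]    # shape (6, 1)
```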
Linear Algebra with numpy.linalg
The numpy.linalg module provides essential linear algebra operations used in machine learning and statistics:
| Function | Description |
|---|---|
| np.dot(a, b) or a @ b | Matrix multiplication (dot product) |
| np.linalg.inv(a) | Matrix inverse |
| np.linalg.det(a) | Determinant |
| np.linalg.eig(a) | Eigenvalues and eigenvectors |
| np.linalg.svd(a) | Singular value decomposition |
| np.linalg.solve(a, b) | Solve linear system ax = b |
| np.linalg.norm(a) | Matrix or vector norm |
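A short sketch that exercises several of these functions on a hypothetical 2×2 system (2x + y = 5, x + 3y = 10), chosen so the answer is easy to verify by hand:

```python
import numpy as np

# Coefficient matrix and right-hand side of the hypothetical system.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

x = np.linalg.solve(A, b)             # solution is x = 1, y = 3
residual = np.linalg.norm(A @ x - b)  # close to 0 for a correct solution

det = np.linalg.det(A)                # approximately 5
A_inv = np.linalg.inv(A)              # A @ A_inv is approximately the identity

# Column i of `eigenvectors` pairs with eigenvalues[i].
eigenvalues, eigenvectors = np.linalg.eig(A)

# svd returns U, the singular values s, and V transposed.
U, s, Vt = np.linalg.svd(A)
reconstructed = (U * s) @ Vt          # approximately equal to A
```

For solving a linear system, np.linalg.solve is generally preferable to computing the inverse and multiplying, since it avoids forming the inverse explicitly.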
Practical Patterns for Data Analysis
Normalize a feature to [0,1]:
normalized = (x - x.min()) / (x.max() - x.min())
Standardize (z-score):
standardized = (x - x.mean()) / x.std()
Count values meeting a condition, or compute the proportion that do:
count = np.sum(x > threshold)
proportion = np.mean(x > threshold)
Replace values meeting a condition (np.where):
capped = np.where(x > 100, 100, x)
Compute a correlation matrix:
corr = np.corrcoef(matrix, rowvar=False)  # columns are variables, rows are observations
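The patterns above can be run together on a small sample. In this sketch the values of x, the threshold, and the second column of matrix are all made up for illustration.

```python
import numpy as np

# Hypothetical feature values and threshold.
x = np.array([10.0, 50.0, 120.0, 80.0, 200.0])
threshold = 75.0

normalized = (x - x.min()) / (x.max() - x.min())  # smallest value -> 0, largest -> 1
standardized = (x - x.mean()) / x.std()           # mean 0, std 1 (x.std() uses ddof=0)

count = np.sum(x > threshold)        # 3
proportion = np.mean(x > threshold)  # 0.6

capped = np.where(x > 100, 100, x)   # [ 10.  50. 100.  80. 100.]

# Hypothetical data matrix: 5 observations (rows) of 2 variables (columns).
# The second column is a linear function of the first, so the correlation is 1.
matrix = np.column_stack([x, 2.0 * x + 1.0])
corr = np.corrcoef(matrix, rowvar=False)  # shape (2, 2)
```

Note that the normalization divides by x.max() - x.min(), so a constant feature would produce a division by zero; real pipelines usually guard against that case.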
NumPy vs pandas
NumPy and pandas serve complementary roles. NumPy excels at homogeneous numerical computation: arrays must be a single dtype, and the API is optimized for mathematical operations. pandas is built on NumPy and adds labeled axes (index, column names), mixed dtypes per column, missing value handling, and high-level operations like groupby and merge. For pure numerical work (matrix operations, numerical simulation, machine learning feature arrays), NumPy is the right tool. For tabular data with mixed types and named columns, use pandas — it calls NumPy under the hood for the heavy lifting.
Summary
NumPy is the numerical backbone of the Python data science stack. Mastering its array creation, indexing, broadcasting, and aggregation patterns allows analysts to write fast, expressive numerical code without explicit Python loops. Understanding NumPy also makes it easier to work with higher-level libraries like pandas and scikit-learn, since those libraries expose NumPy arrays and respect NumPy conventions at their core interfaces.