Introduction to Python for Data Analysis
Python has become the dominant programming language for data analysis. Its readable syntax, rich ecosystem of libraries, and vibrant community make it the tool of choice for analysts worldwide. Whether you are loading a CSV, querying a database, cleaning messy data, running statistical tests, or creating visualizations, Python has a library for every step of the workflow.
This article covers the core Python tools every data analyst should know: NumPy for numerical computing, pandas for data manipulation, and Matplotlib and Seaborn for visualization. We also introduce Jupyter Notebooks, the preferred environment for exploratory data analysis.
Setting Up Your Environment
The easiest way to get started with Python for data analysis is to install Anaconda, a distribution that bundles Python with all the key data science libraries pre-installed. After installing Anaconda, you can launch Jupyter Notebook or JupyterLab from the terminal or Anaconda Navigator.
# Install key libraries (if not using Anaconda)
pip install numpy pandas matplotlib seaborn jupyter
Jupyter Notebooks are the standard environment for data analysis work. They allow you to write and run code in cells, display outputs (tables, charts) inline, and document your analysis with markdown text — all in the same file. This makes notebooks ideal for exploration and for communicating results.
NumPy: The Foundation of Numerical Computing
NumPy (Numerical Python) provides the ndarray, a fast and memory-efficient multi-dimensional array. It is the foundation that pandas and most scientific Python libraries are built on.
import numpy as np
# Create arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3], [4, 5, 6]])
# Array operations (vectorized — no loop needed)
print(arr * 2) # [2, 4, 6, 8, 10]
print(arr.mean()) # 3.0
print(arr.std()) # 1.414
print(np.sqrt(arr)) # [1., 1.41, 1.73, 2., 2.24]
# Create ranges
linspace = np.linspace(0, 1, 5) # [0, 0.25, 0.5, 0.75, 1.0]
zeros = np.zeros((3, 4)) # 3x4 matrix of zeros
The power of NumPy lies in vectorized operations — calculations that apply to entire arrays at once without needing explicit Python loops. This makes numeric operations dramatically faster than plain Python lists.
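To make the loop-versus-vectorized contrast concrete, here is a small sketch (the array values are arbitrary) showing that both approaches compute the same result, with the vectorized form expressed in a single line:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Plain-Python approach: loop over the elements one by one
squared_loop = np.array([x ** 2 for x in arr])

# Vectorized approach: one expression applied to the whole array at once
squared_vec = arr ** 2

print(squared_vec)  # [ 1  4  9 16 25]
```

On large arrays, the vectorized version is typically orders of magnitude faster, because the loop runs in compiled C code rather than in the Python interpreter.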
pandas: The Core of Data Analysis
pandas is the most important Python library for data analysts. It provides two key data structures: the Series (a one-dimensional labeled array) and the DataFrame (a two-dimensional table with labeled rows and columns). DataFrames are similar to SQL tables or Excel spreadsheets, and most data analysis in Python revolves around them.
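Before loading real files, it helps to see both structures built by hand; this minimal sketch uses invented values:

```python
import pandas as pd

# A Series: a one-dimensional labeled array
revenue = pd.Series([250, 310, 180], name='revenue')

# A DataFrame: a two-dimensional table, built here from a dict of columns
df = pd.DataFrame({
    'date': ['2024-01-05', '2024-01-06', '2024-01-07'],
    'country': ['France', 'Germany', 'France'],
    'revenue': [250, 310, 180],
})

print(df.shape)             # (3, 3)
print(df['revenue'].sum())  # 740
```

Each column of a DataFrame is itself a Series, which is why selecting a single column returns one.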
import pandas as pd
# Load data
df = pd.read_csv('sales.csv')
df_excel = pd.read_excel('report.xlsx')
# First look at the data
print(df.head()) # First 5 rows
print(df.shape) # (rows, columns)
print(df.dtypes) # Column data types
print(df.describe()) # Summary statistics
print(df.isnull().sum()) # Count of missing values per column
Selecting and Filtering Data
# Select a single column (returns Series)
revenue = df['revenue']
# Select multiple columns (returns DataFrame)
subset = df[['date', 'revenue', 'country']]
# Filter rows by condition
big_orders = df[df['revenue'] > 1000]
france_orders = df[df['country'] == 'France']
# Multiple conditions
high_value_france = df[(df['country'] == 'France') & (df['revenue'] > 500)]
# loc (label-based) and iloc (integer-based)
df.loc[0:4, 'revenue':'country']
df.iloc[0:5, 2:5]
Transforming and Aggregating Data
# Add a new computed column
df['profit_margin'] = (df['profit'] / df['revenue']) * 100
# Group by and aggregate
summary = df.groupby('country').agg(
    total_revenue=('revenue', 'sum'),
    avg_order=('revenue', 'mean'),
    num_orders=('order_id', 'count')
).reset_index()
# Sort
summary = summary.sort_values('total_revenue', ascending=False)
# Apply a function to a column
df['revenue_category'] = df['revenue'].apply(
    lambda x: 'High' if x > 1000 else ('Medium' if x > 500 else 'Low')
)
Handling Missing Data
# Drop rows with any missing value
df_clean = df.dropna()
# Fill missing values
df['age'] = df['age'].fillna(df['age'].median())
df['region'] = df['region'].fillna('Unknown')
# Check for duplicates and remove them
print(df.duplicated().sum())
df = df.drop_duplicates()
Merging and Joining DataFrames
pandas has powerful merge and join operations that mirror SQL JOIN syntax:
# Inner join (like SQL INNER JOIN)
merged = pd.merge(orders, customers, on='customer_id', how='inner')
# Left join
merged = pd.merge(orders, customers, on='customer_id', how='left')
# Concatenate DataFrames vertically (stack rows)
combined = pd.concat([df_2023, df_2024], ignore_index=True)
Visualization with Matplotlib and Seaborn
Matplotlib is the foundation for plotting in Python. Seaborn builds on top of it with a higher-level API and attractive statistical chart types.
import matplotlib.pyplot as plt
import seaborn as sns
# Line chart
plt.figure(figsize=(10, 5))
plt.plot(df['month'], df['revenue'], marker='o', color='steelblue')
plt.title('Monthly Revenue')
plt.xlabel('Month')
plt.ylabel('Revenue ($)')
plt.tight_layout()
plt.savefig('revenue.png', dpi=150)
plt.show()
# Seaborn bar chart
sns.barplot(data=df, x='country', y='revenue', hue='country', palette='Set2', legend=False)
plt.title('Revenue by Country')
plt.xticks(rotation=45)
plt.show()
# Distribution
sns.histplot(df['revenue'], kde=True, bins=30)
# Correlation heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
Exporting Results
# Save to CSV
df.to_csv('clean_data.csv', index=False)
# Save to Excel
df.to_excel('report.xlsx', sheet_name='Sales', index=False)
# Save multiple sheets
with pd.ExcelWriter('multi_sheet.xlsx') as writer:
    summary.to_excel(writer, sheet_name='Summary', index=False)
    df.to_excel(writer, sheet_name='Raw Data', index=False)
A Typical Analysis Workflow
A typical data analysis project in Python follows this pattern: load the data with pd.read_csv() or a database connector, inspect it with head(), describe(), and isnull(), clean it by handling missing values and removing duplicates, transform it by creating new columns and aggregating, visualize the key findings with matplotlib or seaborn, and export the results to CSV or Excel for stakeholders.
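The whole pipeline can be compressed into one short sketch. The inline CSV text stands in for a real file on disk, and the column names and numbers are invented for illustration:

```python
import io
import pandas as pd

# 1. Load (an inline string stands in for a real sales.csv)
csv_data = """order_id,country,revenue
1,France,1200
2,Germany,800
3,France,
4,Germany,800
4,Germany,800"""
df = pd.read_csv(io.StringIO(csv_data))

# 2. Inspect
print(df.shape)
print(df.isnull().sum())

# 3. Clean: fill the missing revenue with the median, drop the duplicated order
df['revenue'] = df['revenue'].fillna(df['revenue'].median())
df = df.drop_duplicates()

# 4. Transform and aggregate: total revenue and order count per country
summary = (df.groupby('country')
             .agg(total_revenue=('revenue', 'sum'),
                  num_orders=('order_id', 'count'))
             .reset_index()
             .sort_values('total_revenue', ascending=False))

# 5. Export for stakeholders
summary.to_csv('country_summary.csv', index=False)
```

Swapping `io.StringIO(csv_data)` for a file path is all it takes to run the same steps on real data.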
Conclusion
Python with pandas, NumPy, and Matplotlib gives data analysts a powerful, flexible, and reproducible workflow. Once you are comfortable with these libraries, you can handle virtually any tabular data problem — from small CSV files to datasets with millions of rows. The investment in learning Python pays dividends at every stage of the data analysis process.