Why Data Analysts Work with APIs
Application Programming Interfaces (APIs) are the standard mechanism for programmatically fetching data from external services — social platforms, payment processors, weather providers, financial data vendors, and internal microservices. As a data analyst, understanding how to call APIs, handle authentication, parse responses, and load results into analysis tools dramatically expands the data sources available beyond what already lives in your data warehouse.
How REST APIs Work
Most data APIs follow the REST (Representational State Transfer) pattern. Clients send HTTP requests to a URL (endpoint), and the server returns data, usually as JSON. The key elements are:
| Element | Description | Example |
|---|---|---|
| Base URL | The root address of the API | https://api.example.com/v2 |
| Endpoint | The specific resource path | /users, /orders, /metrics/daily |
| HTTP Method | The action to perform | GET (read), POST (create), PUT/PATCH (update), DELETE (remove) |
| Query Parameters | Filters and options in the URL | ?start_date=2024-01-01&limit=100 |
| Request Headers | Metadata sent with the request | Authorization, Content-Type, Accept |
| Response Body | The returned data, typically JSON | {"data": [...], "next_page": "..."} |
| Status Code | Numeric result indicating success or failure | 200 OK, 401 Unauthorized, 429 Too Many Requests |
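To make the table concrete, here is a minimal sketch of how a base URL, an endpoint, and query parameters combine into the address a client actually requests. It uses only the standard library; the helper name build_url is hypothetical, and the URL and parameters are the illustrative ones from the table.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://api.example.com/v2"  # illustrative base URL from the table


def build_url(endpoint, params=None):
    """Join the base URL, an endpoint path, and optional query parameters."""
    url = BASE_URL.rstrip("/") + "/" + endpoint.lstrip("/")
    if params:
        url += "?" + urlencode(params)  # percent-encodes values where needed
    return url


url = build_url("/orders", {"start_date": "2024-01-01", "limit": 100})
print(url)

# Going the other way: split a URL back into its path and query parameters.
parts = urlsplit(url)
print(parts.path)
print(parse_qs(parts.query))
```

In practice the requests library does this assembly for you when you pass a params dictionary, so a helper like this is mostly useful for logging or debugging the exact URL being called.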
HTTP Status Codes for Analysts
| Code | Meaning | Analyst Action |
|---|---|---|
| 200 | OK: request succeeded | Parse the response body |
| 201 | Created: resource was created | Capture the returned ID if needed |
| 400 | Bad Request: malformed parameters | Check query parameters and request body |
| 401 | Unauthorized: invalid or missing credentials | Refresh or regenerate API key/token |
| 403 | Forbidden: credentials valid but no access | Check account permissions or scopes |
| 404 | Not Found: endpoint or resource doesn't exist | Verify the URL path and resource ID |
| 429 | Too Many Requests: rate limit hit | Implement backoff and retry logic |
| 500 | Internal Server Error: API-side problem | Retry with exponential backoff; log the incident |
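One way to put the table to work is a small dispatcher that turns a status code into the next step for a script. The sketch below is an assumption-laden illustration: the function name next_action and the action labels are made up, and treating 502 to 504 as retryable goes beyond what the table lists.

```python
# Codes worth retrying. 429 and 500 come from the table; 502-504 are an
# added assumption, since gateway errors are often transient as well.
RETRYABLE = {429, 500, 502, 503, 504}


def next_action(status_code):
    """Map an HTTP status code to a coarse action label (hypothetical labels)."""
    if 200 <= status_code < 300:
        return "parse"              # success: read the response body
    if status_code in RETRYABLE:
        return "retry"              # back off, then try again
    if status_code in (401, 403):
        return "check_credentials"  # token, key, or permission problem
    if 400 <= status_code < 500:
        return "fix_request"        # the request itself is wrong; retrying won't help
    return "investigate"            # anything unexpected, e.g. an unhandled redirect


for code in (200, 401, 404, 429, 500):
    print(code, next_action(code))
```

The key design point is the split between errors that retrying can fix (rate limits, server faults) and errors that need a change to the request or credentials.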
Authentication Methods
| Method | How It Works | Common Uses |
|---|---|---|
| API Key | Static key sent in a header or query parameter | Simple public APIs, internal tools |
| Bearer Token (often a JWT) | Short-lived token sent as Authorization: Bearer <token> | Modern REST APIs, OAuth 2.0 flows |
| Basic Auth | Base64-encoded username:password in Authorization header | Legacy APIs, some analytics platforms |
| OAuth 2.0 | Token exchange flow; access + refresh tokens | Google Analytics, Salesforce, social APIs |
| HMAC Signature | Request signed with a secret key | AWS, financial APIs requiring audit trails |
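Two of these methods are easy to illustrate with the standard library alone. The sketch below builds a Basic Auth header and computes an HMAC-SHA256 signature. The function names are hypothetical, and the string being signed is an assumption: every provider defines its own canonical message format and header names, so check the API's documentation before relying on this shape.

```python
import base64
import hashlib
import hmac


def basic_auth_header(username, password):
    """Build the Authorization header for HTTP Basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def hmac_signature(secret, message):
    """Sign a message with HMAC-SHA256 and return the hex digest."""
    return hmac.new(
        secret.encode("utf-8"), message.encode("utf-8"), hashlib.sha256
    ).hexdigest()


print(basic_auth_header("user", "pass"))

# Hypothetical message layout: method, path, and a timestamp joined by newlines.
signature = hmac_signature("my-secret", "GET\n/v2/orders\n2024-01-01T00:00:00Z")
print(signature)
```

Base64 is an encoding, not encryption, so Basic Auth should only ever travel over HTTPS. The secrets shown inline here are placeholders for the demo; real ones belong in environment variables or a secret manager.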
Never hard-code credentials. Store API keys in environment variables or secret managers and read them at runtime:
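import os
import requests

# Raises KeyError if the variable is unset, which fails fast instead of
# sending an unauthenticated request
API_KEY = os.environ['MY_API_KEY']
headers = {'Authorization': f'Bearer {API_KEY}'}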
Calling APIs with Python
The requests library is the standard tool for HTTP calls in Python.
import os
import requests

response = requests.get(
    'https://api.example.com/v2/orders',
    headers={'Authorization': f"Bearer {os.environ['MY_API_KEY']}"},
    params={'start_date': '2024-01-01', 'limit': 100},
    timeout=30,  # seconds; without a timeout a stalled connection can hang indefinitely
)
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
data = response.json()       # parse the JSON body into a dict or list
Pagination
APIs rarely return all records in one response. Most use one of these pagination strategies:
| Strategy | How It Works | Python Pattern |
|---|---|---|
| Page/offset | ?page=2 or ?offset=100&limit=100 | Loop incrementing page until empty results |
| Cursor-based | Response includes a next_cursor token for the next page | Loop until next_cursor is null |
| Link header | Response headers include a rel="next" URL | Follow response.headers["Link"] until absent |
| next_page_token | Token in the response body for the next call | Loop while token is not None |
Cursor pagination example:
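import requests

# url and headers are assumed to be defined as in the earlier examples
all_records = []
cursor = None
while True:
    params = {'limit': 200}
    if cursor:
        params['cursor'] = cursor  # omit the cursor on the first request
    resp = requests.get(url, headers=headers, params=params, timeout=30)
    resp.raise_for_status()
    body = resp.json()
    all_records.extend(body['data'])
    cursor = body.get('next_cursor')
    if not cursor:  # a missing or null cursor marks the last page
        break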
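The page/offset and Link header strategies from the table follow the same loop shape. The sketch below shows both without any network access: fetch_all_pages takes a caller-supplied function (a stand-in for the real HTTP call), and next_link pulls the rel="next" URL out of a Link header value. Both function names and the sample data are hypothetical.

```python
import re


def fetch_all_pages(fetch_page, limit=100):
    """Collect records from an offset-paginated source.

    fetch_page(offset, limit) is supplied by the caller and returns a list.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        if not page:
            break                 # empty page: nothing left to fetch
        records.extend(page)
        if len(page) < limit:
            break                 # short page: this was the last one
        offset += limit
    return records


def next_link(link_header):
    """Return the rel="next" URL from a Link header value, or None."""
    for part in link_header.split(","):
        match = re.match(r'\s*<([^>]+)>\s*;(.*)', part)
        if match and re.search(r'rel="?next"?', match.group(2)):
            return match.group(1)
    return None


# Stand-in for an API holding 250 records.
FAKE_ROWS = [{"id": i} for i in range(250)]


def fake_fetch(offset, limit):
    return FAKE_ROWS[offset:offset + limit]


rows = fetch_all_pages(fake_fetch, limit=100)
print(len(rows))

header = ('<https://api.example.com/v2/orders?page=2>; rel="next", '
          '<https://api.example.com/v2/orders?page=9>; rel="last"')
print(next_link(header))
```

With requests, the parsed form of the Link header is also available as response.links, so the hand-written parser is mainly useful for seeing what the header contains. Splitting on commas is a simplification that would break on URLs containing commas.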
Rate Limiting and Backoff
APIs enforce rate limits to prevent abuse. When you hit a 429 response, implement exponential backoff:
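import time
import requests

def get_with_retry(url, headers, params, max_retries=5):
    delay = 1  # seconds; doubled after every rate-limited attempt
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, params=params, timeout=30)
        if resp.status_code == 429:
            # Prefer the server's Retry-After hint when it is a number of seconds;
            # it can also be an HTTP date, in which case fall back to our own delay
            retry_after = resp.headers.get('Retry-After', '')
            wait = int(retry_after) if retry_after.isdigit() else delay
            time.sleep(wait)
            delay *= 2
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f'Still rate limited after {max_retries} attempts')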
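When many workers hit the same limit at once, plain doubling can make them all retry at the same instant. A common refinement is to cap the delay and add random jitter. The sketch below computes such a schedule using the "full jitter" variant, where each wait is drawn uniformly between zero and the capped exponential value. The function name and default values are assumptions, not part of any particular API's guidance.

```python
import random


def backoff_delays(max_retries=5, base=1.0, cap=60.0, jitter=True, rng=random):
    """Return the wait (in seconds) to use before each retry attempt."""
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)  # exponential growth, capped
        # Full jitter: pick a random point below the ceiling so that
        # concurrent clients spread their retries out.
        delays.append(rng.uniform(0, ceiling) if jitter else ceiling)
    return delays


# Without jitter the schedule is the familiar doubling sequence.
print(backoff_delays(jitter=False))  # [1.0, 2.0, 4.0, 8.0, 16.0]

# With a lower cap, later attempts stop growing.
print(backoff_delays(max_retries=6, cap=10.0, jitter=False))
```

A retry loop would then sleep for each value in turn, still preferring the server's Retry-After header whenever one is provided.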
Loading API Data into a DataFrame
Once you have JSON data, converting it to a pandas DataFrame is usually straightforward:
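import pandas as pd

# Flat list of records: the DataFrame constructor is enough
df = pd.DataFrame(all_records)

# Nested JSON: use json_normalize instead. It flattens nested objects into
# columns whose names join parent and child keys with sep
df = pd.json_normalize(all_records, sep='_')

# Select and rename columns
df = df[['id', 'created_at', 'amount', 'status']].rename(
    columns={'created_at': 'order_date'}
)
df['order_date'] = pd.to_datetime(df['order_date'])  # parse strings into datetimes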
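To see what flattening with sep='_' does to column names, it can help to look at a pure-Python approximation. The sketch below is not pandas code and does not reproduce every json_normalize option; it only mirrors the basic behavior of joining nested dictionary keys with a separator and leaving lists untouched. The function name flatten and the sample record are hypothetical.

```python
def flatten(record, sep="_", prefix=""):
    """Flatten nested dicts into one level, joining keys with sep."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent key as a prefix.
            flat.update(flatten(value, sep=sep, prefix=name))
        else:
            flat[name] = value  # scalars and lists are kept as-is
    return flat


order = {
    "id": 1,
    "amount": 25.0,
    "customer": {"name": "Ada", "address": {"city": "Paris"}},
    "tags": ["a", "b"],
}
print(flatten(order))
```

Knowing the naming rule in advance makes it easier to write the column selection step, since nested fields show up under names such as customer_address_city rather than their original key.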
Common Data Analyst API Patterns
| Task | Approach |
|---|---|
| Incremental refresh | Track last-fetched timestamp; use updated_after parameter on each run |
| Parallel requests | Use concurrent.futures.ThreadPoolExecutor for independent endpoints |
| Schema discovery | Fetch a small sample, call pd.json_normalize to see all fields |
| Writing to warehouse | Use SQLAlchemy + df.to_sql() or cloud SDK bulk-insert methods |
| Caching responses | Save raw JSON to disk or S3 before transforming; re-run transforms without re-fetching |
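Two of these patterns, incremental refresh and parallel requests, can be sketched with the standard library. The example below stores a watermark in a small JSON file and fans a fetch function out over several endpoints. It rests on several assumptions: the helper names are hypothetical, records are assumed to carry an updated_at field, and timestamps are assumed to be ISO 8601 strings in a single time zone so that string comparison matches chronological order.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_watermark(path, default="1970-01-01T00:00:00Z"):
    """Read the last-fetched timestamp, or a default on the first run."""
    p = Path(path)
    if not p.exists():
        return default
    return json.loads(p.read_text())["updated_after"]


def save_watermark(path, records, current):
    """Advance the watermark to the newest updated_at seen in this batch."""
    newest = max([r["updated_at"] for r in records] + [current])
    Path(path).write_text(json.dumps({"updated_after": newest}))
    return newest


def fetch_many(fetch, endpoints, max_workers=4):
    """Call fetch(endpoint) for independent endpoints in parallel threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return dict(zip(endpoints, pool.map(fetch, endpoints)))


state_file = Path(tempfile.mkdtemp()) / "orders_state.json"
since = load_watermark(state_file)  # would be sent as updated_after
batch = [
    {"id": 1, "updated_at": "2024-01-02T10:00:00Z"},
    {"id": 2, "updated_at": "2024-01-03T08:30:00Z"},
]
save_watermark(state_file, batch, since)
print(load_watermark(state_file))

# The lambda stands in for a real HTTP call.
results = fetch_many(lambda ep: f"fetched {ep}", ["/users", "/orders"])
print(results)
```

Saving the watermark only after a batch has been stored successfully means a failed run is simply repeated from the same point on the next schedule. Keep max_workers modest so parallel calls stay within the API's rate limit.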
GraphQL APIs
Some APIs use GraphQL instead of REST. Rather than many endpoints, GraphQL has one endpoint where you send a query specifying exactly the fields you need:
import requests

query = """
{
  orders(first: 100, after: "cursor123") {
    edges {
      node { id createdAt totalPrice status }
    }
    pageInfo { hasNextPage endCursor }
  }
}
"""
resp = requests.post(
    'https://api.shop.com/graphql',
    json={'query': query},
    headers={'Authorization': 'Bearer TOKEN'},  # placeholder; load the real token from the environment
    timeout=30,
)
resp.raise_for_status()
data = resp.json()['data']['orders']
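The pageInfo block in that query is what drives pagination: keep requesting with the last endCursor until hasNextPage is false. The sketch below shows the loop with the HTTP call abstracted behind a caller-supplied run_query function, so it can be exercised with fake data. The variable types in the query (Int!, String) are assumptions about the schema, and the function names are hypothetical.

```python
ORDERS_QUERY = """
query Orders($first: Int!, $after: String) {
  orders(first: $first, after: $after) {
    edges { node { id createdAt totalPrice status } }
    pageInfo { hasNextPage endCursor }
  }
}
"""


def fetch_all_orders(run_query, page_size=100):
    """Page through the orders connection and return every node.

    run_query(query, variables) must return the decoded JSON response.
    """
    nodes = []
    cursor = None  # None asks for the first page
    while True:
        payload = run_query(ORDERS_QUERY, {"first": page_size, "after": cursor})
        # GraphQL servers commonly report failures in an "errors" list,
        # often alongside an HTTP 200 status, so check the body explicitly.
        if payload.get("errors"):
            raise RuntimeError(f"GraphQL errors: {payload['errors']}")
        orders = payload["data"]["orders"]
        nodes.extend(edge["node"] for edge in orders["edges"])
        if not orders["pageInfo"]["hasNextPage"]:
            break
        cursor = orders["pageInfo"]["endCursor"]
    return nodes


def fake_run_query(query, variables):
    """Stand-in server holding three orders; the cursor is a list index."""
    data = [{"id": str(i)} for i in range(1, 4)]
    start = int(variables["after"] or 0)
    end = start + variables["first"]
    return {"data": {"orders": {
        "edges": [{"node": n} for n in data[start:end]],
        "pageInfo": {"hasNextPage": end < len(data), "endCursor": str(end)},
    }}}


orders = fetch_all_orders(fake_run_query, page_size=2)
print([o["id"] for o in orders])
```

A real run_query would wrap requests.post with json={'query': query, 'variables': variables}. Passing the cursor as a variable avoids rebuilding the query string for every page.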
Summary
For data analysts, API literacy means knowing how to authenticate, handle pagination, respect rate limits, and convert JSON responses into clean DataFrames ready for analysis. Python's requests library and pandas' json_normalize together cover many common API ingestion tasks. Building reusable fetch functions with retry logic and incremental refresh support turns one-off API pulls into reliable, schedulable data pipelines.