Working with APIs for Data Analysts

Why Data Analysts Work with APIs

Application Programming Interfaces (APIs) are the standard mechanism for programmatically fetching data from external services — social platforms, payment processors, weather providers, financial data vendors, and internal microservices. As a data analyst, understanding how to call APIs, handle authentication, parse responses, and load results into analysis tools dramatically expands the data sources available beyond what already lives in your data warehouse.

How REST APIs Work

Most data APIs follow the REST (Representational State Transfer) pattern. Clients send HTTP requests to a URL (endpoint), and the server returns data — almost always as JSON. The key elements are:

Element	Description	Example
Base URL	The root address of the API	https://api.example.com/v2
Endpoint	The specific resource path	/users, /orders, /metrics/daily
HTTP Method	The action to perform	GET (read), POST (create), PUT/PATCH (update), DELETE
Query Parameters	Filters and options in the URL	?start_date=2024-01-01&limit=100
Request Headers	Metadata sent with the request	Authorization, Content-Type, Accept
Response Body	The returned data, typically JSON	{"data": [...], "next_page": "..."}
Status Code	Numeric result indicating success or failure	200 OK, 401 Unauthorized, 429 Too Many Requests
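To make the pieces in the table concrete, here is a minimal sketch that assembles a full request URL from a base URL, an endpoint, and query parameters using only the standard library. The helper name build_url is an assumption for illustration; the URL and parameters are the example values from the table.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://api.example.com/v2"  # example base URL from the table


def build_url(endpoint, params=None):
    """Join the base URL, an endpoint path, and optional query parameters."""
    url = BASE_URL.rstrip("/") + "/" + endpoint.lstrip("/")
    if params:
        url += "?" + urlencode(params)
    return url


url = build_url("/orders", {"start_date": "2024-01-01", "limit": 100})
print(url)

# Splitting the URL back apart recovers the endpoint path and the parameters.
parts = urlsplit(url)
print(parts.path)
print(parse_qs(parts.query))
```

In practice the requests library performs this assembly for you when you pass a params dictionary, so a helper like this is mainly useful for logging or debugging the exact URL being called.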

HTTP Status Codes for Analysts

Code	Meaning	Analyst Action
200	OK — request succeeded	Parse the response body
201	Created — resource was created	Capture the returned ID if needed
400	Bad Request — malformed parameters	Check query parameters and request body
401	Unauthorized — invalid or missing credentials	Refresh or regenerate API key/token
403	Forbidden — credentials valid but no access	Check account permissions or scopes
404	Not Found — endpoint or resource doesn't exist	Verify the URL path and resource ID
429	Too Many Requests — rate limit hit	Implement backoff and retry logic
500	Internal Server Error — API-side problem	Retry with exponential backoff; log the incident
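The table above can be turned into a small decision helper. The following is a sketch only: the function name next_action, the action labels, and the choice to treat 502, 503, and 504 like 500 are assumptions, not part of any library.

```python
# Assumption: common gateway/availability errors are retried like a 500.
RETRYABLE = {429, 500, 502, 503, 504}


def next_action(status_code):
    """Map an HTTP status code to the analyst action suggested in the table."""
    if 200 <= status_code < 300:
        return "parse"              # success: read the response body
    if status_code in RETRYABLE:
        return "retry"              # back off, then try again
    if status_code in (401, 403):
        return "check_credentials"  # key, token, permissions, or scopes
    if 400 <= status_code < 500:
        return "fix_request"        # bad parameters, wrong path, missing ID
    return "investigate"            # anything else (for example, redirects)


for code in (200, 401, 404, 429, 500):
    print(code, next_action(code))
```

Centralizing this mapping keeps retry loops from retrying errors that will never succeed, such as a 400 caused by a malformed parameter.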

Authentication Methods

Method	How It Works	Common Uses
API Key	Static key sent in a header or query parameter	Simple public APIs, internal tools
Bearer Token (often a JWT)	Token, usually short-lived, sent in the Authorization: Bearer header	Modern REST APIs, OAuth 2.0 flows
Basic Auth	Base64-encoded username:password in Authorization header	Legacy APIs, some analytics platforms
OAuth 2.0	Token exchange flow; access + refresh tokens	Google Analytics, Salesforce, social APIs
HMAC Signature	Request signed with a secret key	AWS, financial APIs requiring audit trails

Never hard-code credentials. Store API keys in environment variables or secret managers and read them at runtime:

    import os
    import requests

    API_KEY = os.environ['MY_API_KEY']
    headers = {'Authorization': f'Bearer {API_KEY}'}
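Two other rows of the authentication table, Basic Auth and HMAC signatures, can be sketched with the standard library alone. The Basic Auth header format is standard. The string that gets signed in sign_request is a hypothetical layout chosen for illustration, since every provider defines its own signing rules. The credentials shown are placeholders; real ones would come from the environment as described above.

```python
import base64
import hashlib
import hmac


def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header (Base64 of 'username:password')."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def sign_request(secret, method, path, body=""):
    """Return a hex HMAC-SHA256 signature over a HYPOTHETICAL canonical string.

    Real providers specify their own string-to-sign; check their documentation.
    """
    message = f"{method.upper()}\n{path}\n{body}".encode("utf-8")
    return hmac.new(secret.encode("utf-8"), message, hashlib.sha256).hexdigest()


headers = basic_auth_header("analyst", "s3cret")          # placeholder credentials
signature = sign_request("demo-secret", "get", "/v2/orders")  # placeholder secret
print(headers)
print(signature)
```

Note that Base64 is an encoding, not encryption, so Basic Auth is only safe over HTTPS.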

Calling APIs with Python

The third-party requests library (installed with pip install requests) is the de facto standard tool for HTTP calls in Python.

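    import os
    import requests

    response = requests.get(
        'https://api.example.com/v2/orders',
        # Read the token from the environment rather than hard-coding it
        headers={'Authorization': f"Bearer {os.environ['MY_API_KEY']}"},
        params={'start_date': '2024-01-01', 'limit': 100},
        timeout=30,  # requests has no default timeout; without one a call can hang
    )
    response.raise_for_status()  # raises HTTPError on 4xx/5xx
    data = response.json()       # parse JSON body to dict/list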

Pagination

APIs rarely return all records in one response. Most use one of these pagination strategies:

Strategy	How It Works	Python Pattern
Page/offset	?page=2 or ?offset=100&limit=100	Loop incrementing page until empty results
Cursor-based	Response includes a next_cursor token for the next page	Loop until next_cursor is null
Link header	Response headers include a rel="next" URL	Follow response.links["next"]["url"] (parsed from the Link header by requests) until absent
next_page_token	Token in the response body for the next call	Loop while token is not None

Cursor pagination example:

    # url and headers are assumed to be defined by the caller
    all_records = []
    cursor = None
    while True:
        params = {'limit': 200}
        if cursor:
            params['cursor'] = cursor
        resp = requests.get(url, headers=headers, params=params)
        resp.raise_for_status()
        body = resp.json()
        all_records.extend(body['data'])
        cursor = body.get('next_cursor')  # missing or null on the last page
        if not cursor:
            break
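The page/offset strategy from the table follows the same shape. Below is a minimal sketch in which the network call is passed in as a function, so the loop can be shown and tested without a live API. The names fetch_all_pages, fake_fetch, and FAKE_ROWS are assumptions for illustration; in real use fetch_page would wrap a requests.get call with page and limit parameters.

```python
def fetch_all_pages(fetch_page, page_size=100, max_pages=1000):
    """Collect records from a page-numbered API until an empty page comes back.

    fetch_page(page, page_size) is a caller-supplied function returning a list.
    max_pages guards against an API that never returns an empty page.
    """
    records = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, page_size)
        if not batch:
            break
        records.extend(batch)
    return records


# Stand-in for a real API: 250 fake records served in page-sized slices.
FAKE_ROWS = [{"id": i} for i in range(250)]


def fake_fetch(page, page_size):
    start = (page - 1) * page_size
    return FAKE_ROWS[start:start + page_size]


rows = fetch_all_pages(fake_fetch, page_size=100)
print(len(rows))  # 250
```

The max_pages cap is a defensive choice: some APIs keep returning the last page instead of an empty list, which would otherwise loop forever.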

Rate Limiting and Backoff

APIs enforce rate limits to prevent abuse. When you receive a 429 response, wait before retrying: honor the Retry-After header if the server sends one, and otherwise back off exponentially.

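    import time
    import requests

    def get_with_retry(url, headers, params, max_retries=5):
        delay = 1  # seconds; doubles after every 429
        for attempt in range(max_retries):
            resp = requests.get(url, headers=headers, params=params, timeout=30)
            if resp.status_code == 429:
                # Retry-After may be a number of seconds or an HTTP date; fall
                # back to our own delay when it is missing or not an integer
                retry_after = resp.headers.get('Retry-After', '')
                wait = int(retry_after) if retry_after.isdigit() else delay
                time.sleep(wait)
                delay *= 2
                continue
            resp.raise_for_status()
            return resp.json()
        raise RuntimeError(f'Still rate limited after {max_retries} attempts')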
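It can help to see the wait times an exponential backoff policy produces before wiring it into a retry loop. The sketch below computes the schedule as plain numbers. The function name backoff_delays and its parameters are assumptions for illustration; the optional jitter follows the common "full jitter" idea of picking a random wait between zero and the computed delay so that many clients do not retry in lockstep.

```python
import random


def backoff_delays(retries, base=1.0, factor=2.0, cap=60.0, jitter=False, rng=None):
    """Return the wait (in seconds) before each retry attempt.

    The delay grows as base * factor**attempt and is capped at `cap`.
    With jitter=True, each wait is drawn uniformly from [0, delay].
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(retries):
        delay = min(cap, base * factor ** attempt)
        if jitter:
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays


print(backoff_delays(5))            # [1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_delays(8, cap=30.0))  # later waits level off at 30.0
```

The cap matters for long-running jobs: without it, a handful of consecutive failures can push the wait into minutes or hours.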

Loading API Data into a DataFrame

Once you have JSON data, converting it to a pandas DataFrame is usually straightforward:

    import pandas as pd

    # Flat list of records: the constructor is enough
    df = pd.DataFrame(all_records)

    # Nested JSON: use json_normalize instead, which flattens nested keys
    # into columns (for example, customer.name becomes customer_name)
    df = pd.json_normalize(all_records, sep='_')

    # Select and rename columns
    df = df[['id', 'created_at', 'amount', 'status']].rename(
        columns={'created_at': 'order_date'}
    )
    df['order_date'] = pd.to_datetime(df['order_date'])
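To build intuition for the column names that normalizing nested JSON with sep='_' produces, here is a small pure-Python sketch of the same idea. It is not how pandas implements json_normalize, and the sample order record is invented for illustration; it only mirrors the naming of nested dictionary keys and, like json_normalize without a record path, leaves lists untouched.

```python
def flatten(record, sep="_", prefix=""):
    """Flatten nested dicts into one level, joining keys with `sep`.

    Lists and scalar values are kept as-is.
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, sep=sep, prefix=name))
        else:
            flat[name] = value
    return flat


# Invented sample record with two levels of nesting and one list field.
order = {
    "id": 17,
    "amount": 42.5,
    "customer": {"name": "Ada", "address": {"city": "Leeds"}},
    "tags": ["web", "promo"],
}
print(flatten(order))
```

Here the nested city ends up under the key customer_address_city, which is the kind of column name to expect (and to rename) after normalizing deeply nested responses.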

Common Data Analyst API Patterns

Task	Approach
Incremental refresh	Track last-fetched timestamp; use updated_after parameter on each run
Parallel requests	Use concurrent.futures.ThreadPoolExecutor for independent endpoints
Schema discovery	Fetch a small sample, call pd.json_normalize to see all fields
Writing to warehouse	Use SQLAlchemy + df.to_sql() or cloud SDK bulk-insert methods
Caching responses	Save raw JSON to disk or S3 before transforming; re-run transforms without re-fetching
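The incremental refresh row can be sketched as a small "watermark" file that remembers the newest timestamp seen so far. Everything named here (load_watermark, save_watermark, the state file name, the updated_at field, and the sample batch) is an assumption for illustration. The sketch also assumes timestamps are ISO 8601 strings in UTC with a fixed format, so that comparing them as strings matches chronological order.

```python
import json
import tempfile
from pathlib import Path


def load_watermark(path, default="1970-01-01T00:00:00Z"):
    """Return the last-fetched timestamp, or `default` on the first run."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())["updated_after"]
    return default


def save_watermark(path, records, current):
    """Store the newest updated_at seen; keep `current` if nothing newer arrived."""
    newest = max((r["updated_at"] for r in records), default=current)
    newest = max(newest, current)
    Path(path).write_text(json.dumps({"updated_after": newest}))
    return newest


state_file = Path(tempfile.mkdtemp()) / "orders_state.json"
since = load_watermark(state_file)  # first run: falls back to the default

# A real run would pass `since` as the updated_after query parameter here.
batch = [
    {"id": 1, "updated_at": "2024-01-03T09:00:00Z"},
    {"id": 2, "updated_at": "2024-01-05T12:30:00Z"},
]
save_watermark(state_file, batch, since)
print(load_watermark(state_file))  # 2024-01-05T12:30:00Z
```

Saving the watermark only after the batch has been written successfully means a failed run is simply repeated from the same point on the next schedule.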

GraphQL APIs

Some APIs use GraphQL instead of REST. Rather than many endpoints, GraphQL has one endpoint where you send a query specifying exactly the fields you need:

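    import requests

    query = """
    {
      orders(first: 100, after: "cursor123") {
        edges {
          node { id createdAt totalPrice status }
        }
        pageInfo { hasNextPage endCursor }
      }
    }
    """

    resp = requests.post(
        'https://api.shop.com/graphql',
        json={'query': query},
        headers={'Authorization': 'Bearer TOKEN'},
        timeout=30,
    )
    resp.raise_for_status()
    body = resp.json()
    # GraphQL servers often return HTTP 200 even on failure; check 'errors'
    if body.get('errors'):
        raise RuntimeError(body['errors'])
    data = body['data']['orders']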
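The pageInfo block in the query above exists to support pagination. A minimal sketch of following it might look like the code below. It reuses the field names from the example query; the variable types (Int!, String), the function names, and the stand-in PAGES data are assumptions. The HTTP call is passed in as run_query so the loop can be exercised without a server; in real use it would wrap requests.post(...).json().

```python
ORDERS_QUERY = """
query Orders($first: Int!, $after: String) {
  orders(first: $first, after: $after) {
    edges { node { id createdAt totalPrice status } }
    pageInfo { hasNextPage endCursor }
  }
}
"""


def fetch_all_orders(run_query, page_size=100):
    """Follow pageInfo.endCursor until hasNextPage is false.

    run_query(payload) is a caller-supplied function that sends the payload
    and returns the decoded JSON response as a dict.
    """
    nodes, cursor = [], None
    while True:
        payload = {
            "query": ORDERS_QUERY,
            "variables": {"first": page_size, "after": cursor},
        }
        result = run_query(payload)
        if result.get("errors"):
            raise RuntimeError(f"GraphQL errors: {result['errors']}")
        orders = result["data"]["orders"]
        nodes.extend(edge["node"] for edge in orders["edges"])
        if not orders["pageInfo"]["hasNextPage"]:
            return nodes
        cursor = orders["pageInfo"]["endCursor"]


# Stand-in server: two pages keyed by the cursor sent in the variables.
PAGES = {
    None: {"edges": [{"node": {"id": "1"}}, {"node": {"id": "2"}}],
           "pageInfo": {"hasNextPage": True, "endCursor": "c2"}},
    "c2": {"edges": [{"node": {"id": "3"}}],
           "pageInfo": {"hasNextPage": False, "endCursor": None}},
}


def fake_run_query(payload):
    return {"data": {"orders": PAGES[payload["variables"]["after"]]}}


orders = fetch_all_orders(fake_run_query, page_size=2)
print([o["id"] for o in orders])  # ['1', '2', '3']
```

Passing the cursor as a query variable, rather than formatting it into the query string, avoids quoting problems and lets the same query text be reused for every page.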

Summary

For data analysts, API literacy means knowing how to authenticate, handle pagination, respect rate limits, and convert JSON responses into clean DataFrames ready for analysis. Python's requests library and pandas json_normalize together handle the majority of real-world API ingestion tasks. Building reusable fetch functions with retry logic and incremental refresh support turns one-off API pulls into reliable, schedulable data pipelines.