Why Web Scraping and APIs Matter for Data Analysts
Most analytical projects start with data that already lives in a database or data warehouse. But a large share of the world's most valuable information is locked inside websites and web services: competitor pricing, job postings, social media signals, public records, financial disclosures, and more. Web scraping and API integration give analysts the ability to gather this external data programmatically, transforming it into structured datasets ready for analysis.
Understanding both techniques is essential for any analyst who wants to build original datasets, automate recurring data collection, or integrate third-party services into reporting pipelines.
APIs vs. Web Scraping: Choosing the Right Approach
| Dimension | API | Web Scraping |
|---|---|---|
| Data availability | Only what the provider exposes | Anything visible in a browser |
| Reliability | High — structured, versioned contracts | Low — breaks when HTML changes |
| Legal clarity | Clear — governed by Terms of Service | Murky — check robots.txt and ToS |
| Rate limits | Explicit, often enforced | Implicit — polite crawling required |
| Authentication | API keys, OAuth | Session cookies, form login |
| Best for | Integrating services, real-time data | Sites without APIs, historical snapshots |
Always prefer an official API when one exists. Scraping without permission may violate Terms of Service or, in some jurisdictions, computer fraud laws. Always review robots.txt (e.g., https://example.com/robots.txt) and the site's Terms of Service before scraping.
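As a quick pre-flight check, the standard library's urllib.robotparser can tell you whether a path is disallowed for your crawler. A minimal sketch, assuming a placeholder site and user-agent string:

```python
from urllib import robotparser

# Illustrative example: check whether our crawler may fetch a listing page
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse robots.txt

user_agent = "DataAnalystBot"  # placeholder user-agent string
target = "https://example.com/products?page=1"
if rp.can_fetch(user_agent, target):
    print("Allowed by robots.txt")
else:
    print("Disallowed; do not scrape this path")
```

Note that robots.txt only covers crawler etiquette; the Terms of Service still apply even when a path is allowed.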
Working with REST APIs
HTTP Fundamentals
REST APIs communicate over HTTP. Every request has a method (GET, POST, PUT, PATCH, DELETE), a URL endpoint, headers (metadata including authentication tokens), and optionally a body (for POST/PUT/PATCH). Responses include a status code and a body, usually JSON.
| Status Code | Meaning |
|---|---|
| 200 OK | Success |
| 201 Created | Resource created successfully |
| 400 Bad Request | Malformed request syntax |
| 401 Unauthorized | Missing or invalid credentials |
| 403 Forbidden | Authenticated but not permitted |
| 404 Not Found | Resource does not exist |
| 429 Too Many Requests | Rate limit exceeded |
| 500 Internal Server Error | Server-side error |
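To make this anatomy concrete, here is a minimal sketch of a POST request with a JSON body and a look at the response; the endpoint, token, and payload are placeholders rather than a real API:

```python
import requests

# Hypothetical endpoint and token, for illustration only
url = "https://api.example.com/v1/reports"
headers = {
    "Authorization": "Bearer your_api_key_here",  # authentication metadata in a header
    "Content-Type": "application/json",
}
payload = {"name": "weekly_sales", "format": "csv"}  # request body for POST

response = requests.post(url, headers=headers, json=payload, timeout=15)
print(response.status_code)  # e.g. 201 if the resource was created
print(response.json())       # response body, parsed from JSON
```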
Making API Requests in Python
```python
import requests
import pandas as pd
import time

# Basic GET request with API key authentication
API_KEY = "your_api_key_here"
BASE_URL = "https://api.example.com/v1"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Accept": "application/json"
}

# Fetch paginated data
def fetch_all_records(endpoint, params=None):
    all_records = []
    page = 1
    params = params or {}

    while True:
        params["page"] = page
        params["per_page"] = 100
        response = requests.get(
            f"{BASE_URL}/{endpoint}",
            headers=headers,
            params=params
        )

        # Handle rate limiting
        if response.status_code == 429:
            retry_after = int(response.headers.get("Retry-After", 60))
            print(f"Rate limited. Waiting {retry_after} seconds...")
            time.sleep(retry_after)
            continue

        response.raise_for_status()  # Raises for 4xx/5xx errors
        data = response.json()
        records = data.get("results", data.get("data", []))

        if not records:
            break

        all_records.extend(records)
        page += 1
        time.sleep(0.2)  # Respect rate limits

    return all_records

records = fetch_all_records("users")
df = pd.DataFrame(records)
print(df.head())
```
Authentication Patterns
APIs use several authentication mechanisms. API keys are the simplest — pass them in a header or query parameter. OAuth 2.0 is the standard for user-delegated access (e.g., accessing a user's Google Analytics data on their behalf) and involves an authorization flow that exchanges credentials for short-lived access tokens. Basic authentication sends a username and password encoded in base64; the encoding is not encryption, so it is only appropriate over HTTPS. JWT (JSON Web Token) authentication involves signing a payload with a secret and including the token in the Authorization header.
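The sketch below shows how the first patterns typically look with the requests library; the endpoints and credentials are placeholders, and real OAuth flows often go through a provider SDK rather than hand-rolled requests:

```python
import requests

# API key in a header (placeholder endpoint and key)
resp = requests.get(
    "https://api.example.com/v1/metrics",
    headers={"Authorization": "Bearer your_api_key_here"},
    timeout=15,
)

# HTTP Basic authentication: requests base64-encodes the credentials for you
resp = requests.get(
    "https://api.example.com/v1/metrics",
    auth=("analyst@example.com", "your_password_here"),
    timeout=15,
)

# OAuth 2.0 client-credentials grant, simplified: exchange credentials for a
# short-lived access token, then send it as a Bearer header
token_resp = requests.post(
    "https://auth.example.com/oauth/token",  # hypothetical token endpoint
    data={
        "grant_type": "client_credentials",
        "client_id": "your_client_id",
        "client_secret": "your_client_secret",
    },
    timeout=15,
)
access_token = token_resp.json()["access_token"]
resp = requests.get(
    "https://api.example.com/v1/metrics",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=15,
)
```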
Handling Pagination
Most APIs paginate responses to limit payload size. Three common patterns exist: page-based pagination (pass page=2&per_page=100), cursor-based pagination (each response includes a next_cursor token to use in the next request — more stable for real-time data), and offset-based pagination (pass offset=100&limit=100). Always handle all pages; stopping at the first page is a common data quality bug.
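Cursor-based pagination is worth a sketch because the loop shape differs from the page-based function above. Assuming a hypothetical API that returns a next_cursor field:

```python
import requests

def fetch_all_with_cursor(url, headers=None):
    """Follow cursor-based pagination until the API stops returning a cursor."""
    all_records = []
    cursor = None
    while True:
        params = {"limit": 100}
        if cursor:
            params["cursor"] = cursor
        response = requests.get(url, headers=headers, params=params, timeout=15)
        response.raise_for_status()
        data = response.json()
        all_records.extend(data.get("results", []))
        cursor = data.get("next_cursor")  # hypothetical field name; varies by API
        if not cursor:
            break
    return all_records
```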
Web Scraping with Python
The Scraping Stack
Two tools cover the vast majority of scraping needs: Requests + BeautifulSoup for static HTML pages, and Playwright or Selenium for JavaScript-rendered pages that require a real browser. Use the simpler tool first — firing up a headless browser for a static page is wasteful.
Scraping Static Pages with BeautifulSoup
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def scrape_product_listings(base_url, max_pages=5):
    products = []
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; DataAnalyst/1.0; research)"
    }

    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        # Find all product cards
        cards = soup.find_all("div", class_="product-card")
        if not cards:
            break  # No more pages

        for card in cards:
            name = card.find("h2", class_="product-name")
            price = card.find("span", class_="price")
            rating = card.find("span", class_="rating")
            products.append({
                "name": name.get_text(strip=True) if name else None,
                "price": price.get_text(strip=True) if price else None,
                "rating": rating.get_text(strip=True) if rating else None,
                "page": page
            })

        time.sleep(1)  # Be polite — 1 second between requests

    return pd.DataFrame(products)

df = scrape_product_listings("https://example.com/products")
print(df.head())
```
CSS Selectors and XPath
BeautifulSoup supports both CSS selectors (via soup.select()) and its own find methods. CSS selectors are often more concise: soup.select("div.product-card h2") finds all h2 elements inside div elements with class product-card. XPath is more powerful for complex traversals and is the native selector language for lxml and Scrapy. For simple scraping, CSS selectors are usually sufficient and easier to read.
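For example, the two approaches below pull the same elements from the product-card markup used earlier; the inline HTML and class names are illustrative:

```python
from bs4 import BeautifulSoup

html = """
<div class="product-card">
  <h2 class="product-name">Espresso Machine</h2>
  <span class="price">$199.00</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: all h2 elements inside product cards
names_css = [h2.get_text(strip=True) for h2 in soup.select("div.product-card h2")]

# Equivalent find_all traversal
names_find = [
    card.find("h2", class_="product-name").get_text(strip=True)
    for card in soup.find_all("div", class_="product-card")
]
print(names_css, names_find)
```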
Scraping JavaScript-Rendered Pages with Playwright
```python
from playwright.sync_api import sync_playwright
import pandas as pd

def scrape_dynamic_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Block images and fonts to speed up loading
        page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2}", lambda route: route.abort())

        page.goto(url, wait_until="networkidle")

        # Wait for specific content to load
        page.wait_for_selector(".data-table", timeout=10000)

        # Extract table data
        rows = page.query_selector_all(".data-table tbody tr")
        data = []
        for row in rows:
            cells = row.query_selector_all("td")
            data.append([cell.inner_text() for cell in cells])

        browser.close()
        return pd.DataFrame(data)

df = scrape_dynamic_page("https://example.com/dynamic-dashboard")
print(df.head())
```
Data Cleaning After Collection
Raw scraped data is rarely clean. Common issues include leading and trailing whitespace, currency symbols and commas in numeric fields, mixed date formats, unescaped HTML entities (&amp; appearing where & was meant), inconsistent null representations (empty string, "N/A", "-", "null"), and duplicate records from overlapping scrape runs.
```python
import pandas as pd

def clean_scraped_prices(df):
    # Remove currency symbols and thousands separators, then convert to float
    df["price"] = df["price"].str.replace(r"[^\d.]", "", regex=True)
    df["price"] = pd.to_numeric(df["price"], errors="coerce")

    # Clean ratings (e.g., "4.5 out of 5" → 4.5); expand=False returns a Series
    df["rating"] = df["rating"].str.extract(r"(\d+\.?\d*)", expand=False).astype(float)

    # Strip whitespace from string columns
    str_cols = df.select_dtypes(include="object").columns
    df[str_cols] = df[str_cols].apply(lambda col: col.str.strip())

    # Drop exact duplicates
    df = df.drop_duplicates()
    return df
```
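The mixed date formats and inconsistent null representations mentioned above are worth handling explicitly as well. A small sketch, assuming illustrative columns named scraped_at and availability:

```python
import pandas as pd

def normalize_nulls_and_dates(df):
    # Map common "missing" placeholders to real missing values
    null_tokens = ["", "N/A", "n/a", "-", "null", "None"]
    df = df.replace(null_tokens, pd.NA)

    # Parse mixed date formats; unparseable values become NaT instead of raising
    # (column name "scraped_at" is illustrative; format="mixed" needs pandas 2.0+)
    df["scraped_at"] = pd.to_datetime(df["scraped_at"], errors="coerce", format="mixed")
    return df
```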
Building Robust Data Pipelines
Error Handling and Retries
Network requests fail. Build retry logic with exponential backoff into any scraping pipeline. The tenacity library provides decorators for this. Always set timeouts on requests — a hanging connection without a timeout can stall your pipeline indefinitely. Log failures with enough context (URL, timestamp, error message) to diagnose problems without re-running everything.
```python
from tenacity import retry, stop_after_attempt, wait_exponential
import requests

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30)
)
def fetch_with_retry(url, headers=None):
    response = requests.get(url, headers=headers, timeout=15)
    response.raise_for_status()
    return response
```
Storing Collected Data
For small one-off collections, CSV or Parquet files are fine. For ongoing pipelines, store raw responses (JSON or HTML) in object storage (S3, GCS) before parsing — this lets you re-parse without re-scraping if your parsing logic changes. Use a database (PostgreSQL, SQLite) for structured data that needs querying. Track scrape metadata (URL, timestamp, HTTP status, response size) in a separate table for debugging and audit.
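As a concrete illustration of the raw-plus-metadata idea, the sketch below writes each response body to local disk (a stand-in for object storage) and records scrape metadata in a SQLite table; the paths and column names are illustrative:

```python
import sqlite3
import hashlib
from datetime import datetime, timezone
from pathlib import Path

RAW_DIR = Path("raw_responses")   # local stand-in for an object store such as S3
DB_PATH = "scrape_metadata.db"

def store_response(url, response):
    RAW_DIR.mkdir(exist_ok=True)
    # Name raw files by a hash of the URL plus a timestamp so reruns never overwrite
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    key = hashlib.sha256(url.encode()).hexdigest()[:16]
    raw_path = RAW_DIR / f"{key}_{stamp}.html"
    raw_path.write_bytes(response.content)

    # Record metadata for debugging and audit
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS scrape_log (
                   url TEXT, fetched_at TEXT, status INTEGER,
                   bytes INTEGER, raw_path TEXT
               )"""
        )
        conn.execute(
            "INSERT INTO scrape_log VALUES (?, ?, ?, ?, ?)",
            (url, stamp, response.status_code, len(response.content), str(raw_path)),
        )
```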
Scheduling and Automation
Recurring data collection should run automatically. Options range from simple cron jobs on a server, to cloud schedulers (AWS EventBridge, GCP Cloud Scheduler), to workflow orchestrators like Airflow or Prefect for complex multi-step pipelines. Choose the simplest tool that meets your reliability and monitoring needs. A cron job that emails you on failure is often better than an orchestrator that requires ongoing maintenance.
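For instance, an entry-point script like the sketch below can be invoked from cron and will email you when the run fails; the SMTP host, addresses, and pipeline function are placeholders:

```python
# Example crontab entry (runs daily at 06:00):
#   0 6 * * * /usr/bin/python3 /opt/pipelines/run_scrape.py
import smtplib
import traceback
from email.message import EmailMessage

def run_pipeline():
    ...  # call your collection, cleaning, and loading steps here

def send_failure_email(error_text):
    msg = EmailMessage()
    msg["Subject"] = "Scrape pipeline failed"
    msg["From"] = "pipeline@example.com"      # placeholder addresses
    msg["To"] = "analyst@example.com"
    msg.set_content(error_text)
    with smtplib.SMTP("smtp.example.com") as server:  # placeholder SMTP host
        server.send_message(msg)

if __name__ == "__main__":
    try:
        run_pipeline()
    except Exception:
        send_failure_email(traceback.format_exc())
        raise
```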
Ethical and Legal Considerations
| Consideration | Guidance |
|---|---|
| robots.txt | Respect disallow directives; they signal the site owner's intent even if not legally binding everywhere |
| Terms of Service | Read them; many prohibit automated access or commercial use of scraped data |
| Rate limiting | Add delays between requests; aggressive scraping can constitute a denial-of-service attack |
| Personal data | GDPR and CCPA impose restrictions on collecting and processing personal information even from public sources |
| Copyright | Collected content may be copyrighted; using it beyond internal analysis may require permission |
| Authentication bypass | Never attempt to bypass login walls, CAPTCHAs, or access controls |
Popular APIs for Data Analysts
| API | Data Available | Free Tier |
|---|---|---|
| Alpha Vantage | Stock prices, forex, crypto | Yes (limited calls/day) |
| OpenWeatherMap | Current and historical weather | Yes |
| World Bank | Global economic and development indicators | Yes (public) |
| US Census Bureau | Demographic and economic data | Yes (public) |
| Twitter/X API | Tweets, user data, trends | Limited |
| Google Maps Platform | Geocoding, places, distance matrix | Monthly credit |
| Reddit API (PRAW) | Posts, comments, subreddit metadata | Yes |
| GitHub API | Repositories, commits, issues, users | Yes (authenticated) |
Summary
Web scraping and API integration are core data collection skills that let analysts go beyond pre-existing datasets. APIs offer structured, reliable access with clear terms; web scraping fills gaps where APIs don't exist. Both require thoughtful error handling, rate limit respect, and legal awareness. By combining these techniques with clean data pipelines and proper storage, analysts can build self-refreshing datasets that power ongoing dashboards, competitive intelligence tools, and research projects impossible to assemble from internal sources alone.