Why Web Scraping and APIs Matter for Data Analysts
Most analytical projects start with data that already lives in a database or data warehouse. But a large share of the world's most valuable information is locked inside websites and web services: competitor pricing, job postings, social media signals, public records, financial disclosures, and more. Web scraping and API integration give analysts the ability to gather this external data programmatically, transforming it into structured datasets ready for analysis.
Understanding both techniques is essential for any analyst who wants to build original datasets, automate recurring data collection, or integrate third-party services into reporting pipelines.
APIs vs. Web Scraping: Choosing the Right Approach
| Dimension | API | Web Scraping |
|---|---|---|
| Data availability | Only what the provider exposes | Anything visible in a browser |
| Reliability | High — structured, versioned contracts | Low — breaks when HTML changes |
| Legal clarity | Clear — governed by Terms of Service | Murky — check robots.txt and ToS |
| Rate limits | Explicit, often enforced | Implicit — polite crawling required |
| Authentication | API keys, OAuth | Session cookies, form login |
| Best for | Integrating services, real-time data | Sites without APIs, historical snapshots |
Always prefer an official API when one exists. Scraping without permission may violate Terms of Service or, in some jurisdictions, computer fraud laws. Always review robots.txt (e.g., https://example.com/robots.txt) and the site's Terms of Service before scraping.
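As a quick pre-flight check, the standard library's urllib.robotparser can tell you whether a path is disallowed for your crawler. A minimal sketch, assuming a placeholder site and user-agent string:

```python
from urllib import robotparser

# Illustrative example: check whether our crawler may fetch a listing page
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse robots.txt

user_agent = "DataAnalystBot"  # placeholder user-agent string
target = "https://example.com/products?page=1"
if rp.can_fetch(user_agent, target):
    print("Allowed by robots.txt")
else:
    print("Disallowed; do not scrape this path")
```

Note that robots.txt only covers crawler etiquette; the Terms of Service still apply even when a path is allowed.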
Working with REST APIs
HTTP Fundamentals
REST APIs communicate over HTTP. Every request has a method (GET, POST, PUT, PATCH, DELETE), a URL endpoint, headers (metadata including authentication tokens), and optionally a body (for POST/PUT/PATCH). Responses include a status code and a body, usually JSON.
| Status Code | Meaning |
|---|---|
| 200 OK | Success |
| 201 Created | Resource created successfully |
| 400 Bad Request | Malformed request syntax |
| 401 Unauthorized | Missing or invalid credentials |
| 403 Forbidden | Authenticated but not permitted |
| 404 Not Found | Resource does not exist |
| 429 Too Many Requests | Rate limit exceeded |
| 500 Internal Server Error | Server-side error |
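To make this anatomy concrete, here is a minimal sketch of a POST request with a JSON body and a look at the response; the endpoint, token, and payload are placeholders rather than a real API:

```python
import requests

# Hypothetical endpoint and token, for illustration only
url = "https://api.example.com/v1/reports"
headers = {
    "Authorization": "Bearer your_api_key_here",  # authentication metadata in a header
    "Content-Type": "application/json",
}
payload = {"name": "weekly_sales", "format": "csv"}  # request body for POST

response = requests.post(url, headers=headers, json=payload, timeout=15)
print(response.status_code)  # e.g. 201 if the resource was created
print(response.json())       # response body, parsed from JSON
```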
Making API Requests in Python
```python
import requests
import pandas as pd
import time

# Basic GET request with API key authentication
API_KEY = "your_api_key_here"
BASE_URL = "https://api.example.com/v1"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Accept": "application/json"
}

# Fetch paginated data
def fetch_all_records(endpoint, params=None):
    all_records = []
    page = 1
    params = params or {}

    while True:
        params["page"] = page
        params["per_page"] = 100
        response = requests.get(
            f"{BASE_URL}/{endpoint}",
            headers=headers,
            params=params
        )

        # Handle rate limiting
        if response.status_code == 429:
            retry_after = int(response.headers.get("Retry-After", 60))
            print(f"Rate limited. Waiting {retry_after} seconds...")
            time.sleep(retry_after)
            continue

        response.raise_for_status()  # Raises for 4xx/5xx errors
        data = response.json()
        records = data.get("results", data.get("data", []))

        if not records:
            break

        all_records.extend(records)
        page += 1
        time.sleep(0.2)  # Respect rate limits

    return all_records

records = fetch_all_records("users")
df = pd.DataFrame(records)
print(df.head())
```
Authentication Patterns
APIs use several authentication mechanisms. API keys are the simplest — pass them in a header or query parameter. OAuth 2.0 is the standard for user-delegated access (e.g., accessing a user's Google Analytics data on their behalf) and involves an authorization flow that exchanges credentials for short-lived access tokens. Basic authentication sends a username and password encoded in base64; the encoding is not encryption, so it is only appropriate over HTTPS. JWT (JSON Web Token) authentication involves signing a payload with a secret and including the token in the Authorization header.
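The sketch below shows how the first patterns typically look with the requests library; the endpoints and credentials are placeholders, and real OAuth flows often go through a provider SDK rather than hand-rolled requests:

```python
import requests

# API key in a header (placeholder endpoint and key)
resp = requests.get(
    "https://api.example.com/v1/metrics",
    headers={"Authorization": "Bearer your_api_key_here"},
    timeout=15,
)

# HTTP Basic authentication: requests base64-encodes the credentials for you
resp = requests.get(
    "https://api.example.com/v1/metrics",
    auth=("analyst@example.com", "your_password_here"),
    timeout=15,
)

# OAuth 2.0 client-credentials grant, simplified: exchange credentials for a
# short-lived access token, then send it as a Bearer header
token_resp = requests.post(
    "https://auth.example.com/oauth/token",  # hypothetical token endpoint
    data={
        "grant_type": "client_credentials",
        "client_id": "your_client_id",
        "client_secret": "your_client_secret",
    },
    timeout=15,
)
access_token = token_resp.json()["access_token"]
resp = requests.get(
    "https://api.example.com/v1/metrics",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=15,
)
```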
Handling Pagination
Most APIs paginate responses to limit payload size. Three common patterns exist: page-based pagination (pass page=2&per_page=100), cursor-based pagination (each response includes a next_cursor token to use in the next request — more stable for real-time data), and offset-based pagination (pass offset=100&limit=100). Always handle all pages; stopping at the first page is a common data quality bug.
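Cursor-based pagination is worth a sketch because the loop shape differs from the page-based function above. Assuming a hypothetical API that returns a next_cursor field:

```python
import requests

def fetch_all_with_cursor(url, headers=None):
    """Follow cursor-based pagination until the API stops returning a cursor."""
    all_records = []
    cursor = None
    while True:
        params = {"limit": 100}
        if cursor:
            params["cursor"] = cursor
        response = requests.get(url, headers=headers, params=params, timeout=15)
        response.raise_for_status()
        data = response.json()
        all_records.extend(data.get("results", []))
        cursor = data.get("next_cursor")  # hypothetical field name; varies by API
        if not cursor:
            break
    return all_records
```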
Web Scraping with Python
The Scraping Stack
Two tools cover the vast majority of scraping needs: Requests + BeautifulSoup for static HTML pages, and Playwright or Selenium for JavaScript-rendered pages that require a real browser. Use the simpler tool first — firing up a headless browser for a static page is wasteful.
Scraping Static Pages with BeautifulSoup
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def scrape_product_listings(base_url, max_pages=5):
    products = []
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; DataAnalyst/1.0; research)"
    }

    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        # Find all product cards
        cards = soup.find_all("div", class_="product-card")
        if not cards:
            break  # No more pages

        for card in cards:
            name = card.find("h2", class_="product-name")
            price = card.find("span", class_="price")
            rating = card.find("span", class_="rating")
            products.append({
                "name": name.get_text(strip=True) if name else None,
                "price": price.get_text(strip=True) if price else None,
                "rating": rating.get_text(strip=True) if rating else None,
                "page": page
            })

        time.sleep(1)  # Be polite — 1 second between requests

    return pd.DataFrame(products)

df = scrape_product_listings("https://example.com/products")
print(df.head())
```
CSS Selectors and XPath
BeautifulSoup supports both CSS selectors (via soup.select()) and its own find methods. CSS selectors are often more concise: soup.select("div.product-card h2") finds all h2 elements inside div elements with class product-card. XPath is more powerful for complex traversals and is the native selector language for lxml and Scrapy. For simple scraping, CSS selectors are usually sufficient and easier to read.
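For example, the two approaches below pull the same elements from the product-card markup used earlier; the inline HTML and class names are illustrative:

```python
from bs4 import BeautifulSoup

html = """
<div class="product-card">
  <h2 class="product-name">Espresso Machine</h2>
  <span class="price">$199.00</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: all h2 elements inside product cards
names_css = [h2.get_text(strip=True) for h2 in soup.select("div.product-card h2")]

# Equivalent find_all traversal
names_find = [
    card.find("h2", class_="product-name").get_text(strip=True)
    for card in soup.find_all("div", class_="product-card")
]
print(names_css, names_find)
```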
Scraping JavaScript-Rendered Pages with Playwright
```python
from playwright.sync_api import sync_playwright
import pandas as pd

def scrape_dynamic_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Block images and fonts to speed up loading
        page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2}", lambda route: route.abort())

        page.goto(url, wait_until="networkidle")

        # Wait for specific content to load
        page.wait_for_selector(".data-table", timeout=10000)

        # Extract table data
        rows = page.query_selector_all(".data-table tbody tr")
        data = []
        for row in rows:
            cells = row.query_selector_all("td")
            data.append([cell.inner_text() for cell in cells])

        browser.close()
        return pd.DataFrame(data)

df = scrape_dynamic_page("https://example.com/dynamic-dashboard")
print(df.head())
```
Data Cleaning After Collection
Raw scraped data is rarely clean. Common issues include leading and trailing whitespace, currency symbols and commas in numeric fields, mixed date formats, unescaped HTML entities (&amp; appearing where & was meant), inconsistent null representations (empty string, "N/A", "-", "null"), and duplicate records from overlapping scrape runs.
```python
import pandas as pd

def clean_scraped_prices(df):
    # Remove currency symbols and thousands separators, then convert to float
    df["price"] = df["price"].str.replace(r"[^\d.]", "", regex=True)
    df["price"] = pd.to_numeric(df["price"], errors="coerce")

    # Clean ratings (e.g., "4.5 out of 5" → 4.5); expand=False returns a Series
    df["rating"] = df["rating"].str.extract(r"(\d+\.?\d*)", expand=False).astype(float)

    # Strip whitespace from string columns
    str_cols = df.select_dtypes(include="object").columns
    df[str_cols] = df[str_cols].apply(lambda col: col.str.strip())

    # Drop exact duplicates
    df = df.drop_duplicates()
    return df
```
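The mixed date formats and inconsistent null representations mentioned above are worth handling explicitly as well. A small sketch, assuming illustrative columns named scraped_at and availability:

```python
import pandas as pd

def normalize_nulls_and_dates(df):
    # Map common "missing" placeholders to real missing values
    null_tokens = ["", "N/A", "n/a", "-", "null", "None"]
    df = df.replace(null_tokens, pd.NA)

    # Parse mixed date formats; unparseable values become NaT instead of raising
    # (column name "scraped_at" is illustrative; format="mixed" needs pandas 2.0+)
    df["scraped_at"] = pd.to_datetime(df["scraped_at"], errors="coerce", format="mixed")
    return df
```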
Building Robust Data Pipelines
Error Handling and Retries
Network requests fail. Build retry logic with exponential backoff into any scraping pipeline. The tenacity library provides decorators for this. Always set timeouts on requests — a hanging connection without a timeout can stall your pipeline indefinitely. Log failures with enough context (URL, timestamp, error message) to diagnose problems without re-running everything.
```python
from tenacity import retry, stop_after_attempt, wait_exponential
import requests

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30)
)
def fetch_with_retry(url, headers=None):
    response = requests.get(url, headers=headers, timeout=15)
    response.raise_for_status()
    return response
```
Storing Collected Data
For small one-off collections, CSV or Parquet files are fine. For ongoing pipelines, store raw responses (JSON or HTML) in object storage (S3, GCS) before parsing — this lets you re-parse without re-scraping if your parsing logic changes. Use a database (PostgreSQL, SQLite) for structured data that needs querying. Track scrape metadata (URL, timestamp, HTTP status, response size) in a separate table for debugging and audit.
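As a concrete illustration of the raw-plus-metadata idea, the sketch below writes each response body to local disk (a stand-in for object storage) and records scrape metadata in a SQLite table; the paths and column names are illustrative:

```python
import sqlite3
import hashlib
from datetime import datetime, timezone
from pathlib import Path

RAW_DIR = Path("raw_responses")   # local stand-in for an object store such as S3
DB_PATH = "scrape_metadata.db"

def store_response(url, response):
    RAW_DIR.mkdir(exist_ok=True)
    # Name raw files by a hash of the URL plus a timestamp so reruns never overwrite
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    key = hashlib.sha256(url.encode()).hexdigest()[:16]
    raw_path = RAW_DIR / f"{key}_{stamp}.html"
    raw_path.write_bytes(response.content)

    # Record metadata for debugging and audit
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS scrape_log (
                   url TEXT, fetched_at TEXT, status INTEGER,
                   bytes INTEGER, raw_path TEXT
               )"""
        )
        conn.execute(
            "INSERT INTO scrape_log VALUES (?, ?, ?, ?, ?)",
            (url, stamp, response.status_code, len(response.content), str(raw_path)),
        )
```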
Scheduling and Automation
Recurring data collection should run automatically. Options range from simple cron jobs on a server, to cloud schedulers (AWS EventBridge, GCP Cloud Scheduler), to workflow orchestrators like Airflow or Prefect for complex multi-step pipelines. Choose the simplest tool that meets your reliability and monitoring needs. A cron job that emails you on failure is often better than an orchestrator that requires ongoing maintenance.
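For instance, an entry-point script like the sketch below can be invoked from cron and will email you when the run fails; the SMTP host, addresses, and pipeline function are placeholders:

```python
# Example crontab entry (runs daily at 06:00):
#   0 6 * * * /usr/bin/python3 /opt/pipelines/run_scrape.py
import smtplib
import traceback
from email.message import EmailMessage

def run_pipeline():
    ...  # call your collection, cleaning, and loading steps here

def send_failure_email(error_text):
    msg = EmailMessage()
    msg["Subject"] = "Scrape pipeline failed"
    msg["From"] = "pipeline@example.com"      # placeholder addresses
    msg["To"] = "analyst@example.com"
    msg.set_content(error_text)
    with smtplib.SMTP("smtp.example.com") as server:  # placeholder SMTP host
        server.send_message(msg)

if __name__ == "__main__":
    try:
        run_pipeline()
    except Exception:
        send_failure_email(traceback.format_exc())
        raise
```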
Ethical and Legal Considerations
| Consideration | Guidance |
|---|---|
| robots.txt | Respect disallow directives; they signal the site owner's intent even if not legally binding everywhere |
| Terms of Service | Read them; many prohibit automated access or commercial use of scraped data |
| Rate limiting | Add delays between requests; aggressive scraping can constitute a denial-of-service attack |
| Personal data | GDPR and CCPA impose restrictions on collecting and processing personal information even from public sources |
| Copyright | Collected content may be copyrighted; using it beyond internal analysis may require permission |
| Authentication bypass | Never attempt to bypass login walls, CAPTCHAs, or access controls |
Popular APIs for Data Analysts
| API | Data Available | Free Tier |
|---|---|---|
| Alpha Vantage | Stock prices, forex, crypto | Yes (limited calls/day) |
| OpenWeatherMap | Current and historical weather | Yes |
| World Bank | Global economic and development indicators | Yes (public) |
| US Census Bureau | Demographic and economic data | Yes (public) |
| Twitter/X API | Tweets, user data, trends | Limited |
| Google Maps Platform | Geocoding, places, distance matrix | Monthly credit |
| Reddit API (PRAW) | Posts, comments, subreddit metadata | Yes |
| GitHub API | Repositories, commits, issues, users | Yes (authenticated) |
Summary
Web scraping and API integration are core data collection skills that let analysts go beyond pre-existing datasets. APIs offer structured, reliable access with clear terms; web scraping fills gaps where APIs don't exist. Both require thoughtful error handling, rate limit respect, and legal awareness. By combining these techniques with clean data pipelines and proper storage, analysts can build self-refreshing datasets that power ongoing dashboards, competitive intelligence tools, and research projects impossible to assemble from internal sources alone.