JSON and Semi-Structured Data for Data Analysts

What Is Semi-Structured Data?

Semi-structured data does not conform to a rigid tabular schema but does contain self-describing tags or markers that separate elements. JSON (JavaScript Object Notation) and XML are the two dominant formats. JSON has become the default format for web APIs, application logs, event streams, and NoSQL databases. Data analysts increasingly encounter JSON columns inside relational databases, JSON Lines files in data lakes, and nested JSON blobs returned from APIs.

JSON Data Types

Type	Example	Notes
String	"name": "Alice"	Always double-quoted
Number	"age": 30, "score": 9.5	Integer or float; no quotes
Boolean	"active": true	Lowercase true/false only
Null	"deleted_at": null	Represents missing/unknown value
Object	"address": {"city": "Paris", "zip": "75001"}	Nested key-value pairs
Array	"tags": ["sql", "python"]	Ordered list of any JSON type

Querying JSON in SQL

Modern SQL databases provide operators and functions for querying JSON columns without extracting them first.

Database	Extract Value	Example
PostgreSQL	-> (object), ->> (text), jsonb_path_query	data->>'city' returns city as text
MySQL 8+	JSON_EXTRACT() or -> shorthand	JSON_EXTRACT(data, '$.address.city')
BigQuery	JSON_VALUE(), JSON_QUERY()	JSON_VALUE(data, '$.user.name')
Snowflake	: operator or GET_PATH()	data:address:city::string
SQL Server	JSON_VALUE(), OPENJSON()	JSON_VALUE(data, '$.status')

PostgreSQL example — flatten a JSONB column of user events:

SELECT id, payload->>'event_type' AS event_type, (payload->>'timestamp')::timestamptz AS event_time, payload->'properties'->>'page' AS page, payload->'properties'->>'duration_ms' AS duration_ms FROM raw_events WHERE payload->>'event_type' = 'page_view';

Unnesting Arrays in SQL

When a JSON column contains an array, you need to unnest it to analyze each element as a row.

Database	Function	Usage
PostgreSQL	jsonb_array_elements()	CROSS JOIN LATERAL jsonb_array_elements(tags) AS tag
BigQuery	JSON_QUERY_ARRAY() + UNNEST()	CROSS JOIN UNNEST(JSON_QUERY_ARRAY(data,'$.tags')) AS tag
Snowflake	FLATTEN()	CROSS JOIN TABLE(FLATTEN(data:tags)) f
Spark SQL	explode()	SELECT explode(tags) AS tag FROM events

Parsing JSON with Python

Python's built-in json module handles encoding and decoding. For DataFrames, pandas.json_normalize flattens nested structures.

import json, pandas as pd # Parse a JSON string record = json.loads('{"user": {"id": 1, "name": "Alice"}, "score": 95}') # Flatten nested JSON to DataFrame columns df = pd.json_normalize(records, sep='_') # Result: columns user_id, user_name, score # Explode an array column into rows df = df.explode('tags').reset_index(drop=True) # Parse a JSON column in an existing DataFrame df['payload_parsed'] = df['payload'].apply(json.loads) df_flat = pd.json_normalize(df['payload_parsed'])

JSON Lines (JSONL) Format

JSON Lines (also called JSONL or ndjson) stores one JSON object per line. It is the standard format for streaming data, log files, and data lake exports because it is appendable and parallelizable.

# Reading JSONL with pandas df = pd.read_json('events.jsonl', lines=True) # Writing JSONL df.to_json('output.jsonl', orient='records', lines=True) # Reading a large JSONL file in chunks chunks = pd.read_json('large_events.jsonl', lines=True, chunksize=10000) df = pd.concat(chunks, ignore_index=True)

Schema Inference and Handling Variable Keys

Real-world JSON data often has inconsistent schemas — some records have keys others lack. Key strategies:

Challenge	Approach
Missing keys	Use dict.get('key', default) in Python; coalesce functions in SQL
Inconsistent nesting depth	pd.json_normalize with max_level parameter; flatten only top levels
Mixed types in same key	Cast explicitly after extraction; validate with a schema library (jsonschema)
Large arrays in one record	Explode early; avoid loading full arrays into memory
Unknown key set	Sample a subset of records; collect all keys with pd.json_normalize on sample

XML Basics for Analysts

XML is less common than JSON in modern APIs but still appears in enterprise systems, SOAP services, and legacy data exports. Python's xml.etree.ElementTree handles parsing:

import xml.etree.ElementTree as ET tree = ET.parse('data.xml') root = tree.getroot() records = [] for order in root.findall('order'): records.append({ 'id': order.find('id').text, 'total': order.find('total').text, 'status': order.get('status') # attribute }) df = pd.DataFrame(records)

Summary

JSON is now a first-class data type in most analytical databases and Python workflows. Data analysts need to know how to extract scalar values from nested objects in SQL, unnest arrays into rows, and flatten hierarchical JSON into tabular DataFrames using pandas. JSON Lines is the standard format for log and event data in data lakes, and understanding schema variability — missing keys, mixed types, inconsistent nesting — is as important as knowing the syntax for parsing JSON itself.