What Is Semi-Structured Data?
Semi-structured data does not conform to a rigid tabular schema but does contain self-describing tags or markers that separate elements. JSON (JavaScript Object Notation) and XML are the two dominant formats. JSON has become the default format for web APIs, application logs, event streams, and NoSQL databases. Data analysts increasingly encounter JSON columns inside relational databases, JSON Lines files in data lakes, and nested JSON blobs returned from APIs.
JSON Data Types
Type | Example | Notes |
|---|---|---|
String | "name": "Alice" | Always double-quoted |
Number | "age": 30, "score": 9.5 | Integer or float; no quotes |
Boolean | "active": true | Lowercase true/false only |
Null | "deleted_at": null | Represents missing/unknown value |
Object | "address": {"city": "Paris", "zip": "75001"} | Nested key-value pairs |
Array | "tags": ["sql", "python"] | Ordered list of any JSON type |
Querying JSON in SQL
Modern SQL databases provide operators and functions for querying JSON columns without extracting them first.
Database | Extract Value | Example |
|---|---|---|
PostgreSQL | -> (object), ->> (text), jsonb_path_query | data->>'city' returns city as text |
MySQL 8+ | JSON_EXTRACT() or -> shorthand | JSON_EXTRACT(data, '$.address.city') |
BigQuery | JSON_VALUE(), JSON_QUERY() | JSON_VALUE(data, '$.user.name') |
Snowflake | : operator or GET_PATH() | data:address:city::string |
SQL Server | JSON_VALUE(), OPENJSON() | JSON_VALUE(data, '$.status') |
PostgreSQL example — flatten a JSONB column of user events:
SELECT
id,
payload->>'event_type' AS event_type,
(payload->>'timestamp')::timestamptz AS event_time,
payload->'properties'->>'page' AS page,
payload->'properties'->>'duration_ms' AS duration_ms
FROM raw_events
WHERE payload->>'event_type' = 'page_view';
Unnesting Arrays in SQL
When a JSON column contains an array, you need to unnest it to analyze each element as a row.
Database | Function | Usage |
|---|---|---|
PostgreSQL | jsonb_array_elements() | CROSS JOIN LATERAL jsonb_array_elements(tags) AS tag |
BigQuery | JSON_QUERY_ARRAY() + UNNEST() | CROSS JOIN UNNEST(JSON_QUERY_ARRAY(data,'$.tags')) AS tag |
Snowflake | FLATTEN() | CROSS JOIN TABLE(FLATTEN(data:tags)) f |
Spark SQL | explode() | SELECT explode(tags) AS tag FROM events |
Parsing JSON with Python
Python's built-in json module handles encoding and decoding. For DataFrames, pandas.json_normalize flattens nested structures.
import json, pandas as pd
# Parse a JSON string
record = json.loads('{"user": {"id": 1, "name": "Alice"}, "score": 95}')
# Flatten nested JSON to DataFrame columns
df = pd.json_normalize(records, sep='_')
# Result: columns user_id, user_name, score
# Explode an array column into rows
df = df.explode('tags').reset_index(drop=True)
# Parse a JSON column in an existing DataFrame
df['payload_parsed'] = df['payload'].apply(json.loads)
df_flat = pd.json_normalize(df['payload_parsed'])
JSON Lines (JSONL) Format
JSON Lines (also called JSONL or ndjson) stores one JSON object per line. It is the standard format for streaming data, log files, and data lake exports because it is appendable and parallelizable.
# Reading JSONL with pandas
df = pd.read_json('events.jsonl', lines=True)
# Writing JSONL
df.to_json('output.jsonl', orient='records', lines=True)
# Reading a large JSONL file in chunks
chunks = pd.read_json('large_events.jsonl', lines=True, chunksize=10000)
df = pd.concat(chunks, ignore_index=True)
Schema Inference and Handling Variable Keys
Real-world JSON data often has inconsistent schemas — some records have keys others lack. Key strategies:
Challenge | Approach |
|---|---|
Missing keys | Use dict.get('key', default) in Python; coalesce functions in SQL |
Inconsistent nesting depth | pd.json_normalize with max_level parameter; flatten only top levels |
Mixed types in same key | Cast explicitly after extraction; validate with a schema library (jsonschema) |
Large arrays in one record | Explode early; avoid loading full arrays into memory |
Unknown key set | Sample a subset of records; collect all keys with pd.json_normalize on sample |
XML Basics for Analysts
XML is less common than JSON in modern APIs but still appears in enterprise systems, SOAP services, and legacy data exports. Python's xml.etree.ElementTree handles parsing:
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
records = []
for order in root.findall('order'):
records.append({
'id': order.find('id').text,
'total': order.find('total').text,
'status': order.get('status') # attribute
})
df = pd.DataFrame(records)
Summary
JSON is now a first-class data type in most analytical databases and Python workflows. Data analysts need to know how to extract scalar values from nested objects in SQL, unnest arrays into rows, and flatten hierarchical JSON into tabular DataFrames using pandas. JSON Lines is the standard format for log and event data in data lakes, and understanding schema variability — missing keys, mixed types, inconsistent nesting — is as important as knowing the syntax for parsing JSON itself.
Create a free reader account to keep reading.