What Are Regular Expressions?
A regular expression (regex) is a sequence of characters that defines a search pattern. In data analysis, regex is used to validate string formats, extract substrings from unstructured text, clean messy data, and parse log files or semi-structured data. Most SQL dialects, Python, R, and command-line tools support regex natively, making it one of the most transferable technical skills a data analyst can have.
Regex Syntax Reference
Pattern | Meaning | Example Match |
|---|---|---|
. | Any single character except newline | a.c matches "abc", "a1c", "a-c" |
^ | Start of string (or line in multiline mode) | ^Hello matches "Hello world" |
$ | End of string | end$ matches "the end" |
* | Zero or more of the preceding element | ab* matches "a", "ab", "abb" |
+ | One or more of the preceding element | ab+ matches "ab", "abb" but not "a" |
? | Zero or one of the preceding element (optional) | colou?r matches "color" and "colour" |
{n,m} | Between n and m repetitions | \d{2,4} matches 2 to 4 digits |
[abc] | Character class: matches any one of a, b, c | [aeiou] matches any vowel |
[^abc] | Negated character class: anything except a, b, c | [^0-9] matches any non-digit |
\d | Any digit (equivalent to [0-9]) | \d+ matches "123" |
\w | Any word character (letter, digit, underscore) | \w+ matches "hello_123" |
\s | Any whitespace character (space, tab, newline) | \s+ matches whitespace runs |
\b | Word boundary | \bcat\b matches "cat" but not "catfish" |
(abc) | Capturing group: groups and captures matched text | (\d{4})-(\d{2}) captures year and month |
(?:abc) | Non-capturing group: groups without capturing | (?:Mr|Ms)\.? matches titles without capture |
a|b | Alternation: matches a or b | cat|dog matches "cat" or "dog" |
Common Data Analyst Regex Patterns
Use Case | Pattern | Notes |
|---|---|---|
Email address | ^[\w.+-]+@[\w-]+\.[a-z]{2,}$ | Basic validation; not RFC-complete |
ISO date (YYYY-MM-DD) | ^\d{4}-\d{2}-\d{2}$ | Format check only; doesn't validate calendar |
US phone number | ^(\+1)?[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$ | Handles multiple common formats |
IPv4 address | ^(\d{1,3}\.){3}\d{1,3}$ | Format check; use logic for range validation |
Extract domain from URL | https?://([^/]+) | Group 1 captures the domain |
Remove HTML tags | <[^>]+> | Replace with empty string to strip tags |
Leading/trailing whitespace | ^\s+|\s+$ | Replace with '' to trim; most DBs have TRIM() |
Numeric with optional decimals | ^-?\d+(\.\d+)?$ | Validates numbers like -3, 42, 3.14 |
Regex in SQL
Most SQL dialects support regex through specific functions. The syntax differs between databases:
Database | Function / Operator | Example |
|---|---|---|
PostgreSQL | ~ (match), REGEXP_REPLACE, REGEXP_MATCHES | WHERE email ~ '^[\w.]+@[\w.]+$' |
MySQL / BigQuery | REGEXP, REGEXP_EXTRACT, REGEXP_REPLACE | WHERE email REGEXP '^[a-z]+' |
Snowflake | REGEXP_LIKE, REGEXP_SUBSTR, REGEXP_REPLACE | REGEXP_SUBSTR(url, 'https?://([^/]+)', 1, 1, 'e', 1) |
SQLite | GLOB (limited) or user-defined functions | Native regex limited; use application layer |
Example: Extract the domain from a URL column in PostgreSQL:
SELECT
url,
REGEXP_REPLACE(url, '^https?://([^/]+).*', '\1') AS domain
FROM page_views
WHERE url ~ '^https?://';
Regex in Python
Python's re module provides full regex support. Key functions:
Function | Description | Returns |
|---|---|---|
re.match(pattern, string) | Match at the beginning of the string only | Match object or None |
re.search(pattern, string) | Find first match anywhere in string | Match object or None |
re.findall(pattern, string) | Find all non-overlapping matches | List of strings or tuples |
re.sub(pattern, repl, string) | Replace matches with repl | Modified string |
re.split(pattern, string) | Split string on pattern matches | List of strings |
With pandas, str.extract(), str.contains(), and str.replace() all accept regex patterns:
import re, pandas as pd
df['domain'] = df['url'].str.extract(r'https?://([^/]+)')
df['is_valid_email'] = df['email'].str.match(r'^[\w.+-]+@[\w-]+\.[a-z]{2,}$')
df['phone_clean'] = df['phone'].str.replace(r'[^\d]', '', regex=True)
Summary
Regular expressions are a precision tool for working with string data. They excel at validation (does this value match a format?), extraction (pull the domain from this URL), and cleaning (strip non-numeric characters from a phone number). While complex regex patterns can be hard to read, well-chosen patterns solve entire categories of messy data problems in a single expression. Data analysts who are comfortable with regex in both SQL and Python dramatically reduce the time spent on data cleaning and open up new possibilities for parsing unstructured sources like log files, free-text fields, and scraped web content.
Create a free reader account to keep reading.