Regular Expressions for Data Analysts

What Are Regular Expressions?

A regular expression (regex) is a sequence of characters that defines a search pattern. In data analysis, regex is used to validate string formats, extract substrings from unstructured text, clean messy data, and parse log files or semi-structured data. Most SQL dialects, Python, R, and command-line tools support regex natively, making it one of the most transferable technical skills a data analyst can have.

Regex Syntax Reference

Pattern	Meaning	Example Match
.	Any single character except newline	a.c matches "abc", "a1c", "a-c"
^	Start of string (or line in multiline mode)	^Hello matches "Hello world"
$	End of string	end$ matches "the end"
*	Zero or more of the preceding element	ab* matches "a", "ab", "abb"
+	One or more of the preceding element	ab+ matches "ab", "abb" but not "a"
?	Zero or one of the preceding element (optional)	colou?r matches "color" and "colour"
{n,m}	Between n and m repetitions	\d{2,4} matches 2 to 4 digits
[abc]	Character class: matches any one of a, b, c	[aeiou] matches any vowel
[^abc]	Negated character class: anything except a, b, c	[^0-9] matches any non-digit
\d	Any digit (equivalent to [0-9])	\d+ matches "123"
\w	Any word character (letter, digit, underscore)	\w+ matches "hello_123"
\s	Any whitespace character (space, tab, newline)	\s+ matches whitespace runs
\b	Word boundary	\bcat\b matches "cat" but not "catfish"
(abc)	Capturing group: groups and captures matched text	(\d{4})-(\d{2}) captures year and month
(?:abc)	Non-capturing group: groups without capturing	(?:Mr\|Ms)\.? matches titles without capture
a\|b	Alternation: matches a or b	cat\|dog matches "cat" or "dog"

Common Data Analyst Regex Patterns

Use Case	Pattern	Notes
Email address	^[\w.+-]+@[\w-]+\.[a-z]{2,}$	Basic validation; not RFC-complete
ISO date (YYYY-MM-DD)	^\d{4}-\d{2}-\d{2}$	Format check only; doesn't validate calendar
US phone number	^(\+1)?[\s.-]?$?\d{3}$?[\s.-]?\d{3}[\s.-]?\d{4}$	Handles multiple common formats
IPv4 address	^(\d{1,3}\.){3}\d{1,3}$	Format check; use logic for range validation
Extract domain from URL	https?://([^/]+)	Group 1 captures the domain
Remove HTML tags	<[^>]+>	Replace with empty string to strip tags
Leading/trailing whitespace	^\s+\|\s+$	Replace with '' to trim; most DBs have TRIM()
Numeric with optional decimals	^-?\d+(\.\d+)?$	Validates numbers like -3, 42, 3.14

Regex in SQL

Most SQL dialects support regex through specific functions. The syntax differs between databases:

Database	Function / Operator	Example
PostgreSQL	~ (match), REGEXP_REPLACE, REGEXP_MATCHES	WHERE email ~ '^[\w.]+@[\w.]+$'
MySQL / BigQuery	REGEXP, REGEXP_EXTRACT, REGEXP_REPLACE	WHERE email REGEXP '^[a-z]+'
Snowflake	REGEXP_LIKE, REGEXP_SUBSTR, REGEXP_REPLACE	REGEXP_SUBSTR(url, 'https?://([^/]+)', 1, 1, 'e', 1)
SQLite	GLOB (limited) or user-defined functions	Native regex limited; use application layer

Example: Extract the domain from a URL column in PostgreSQL:

SELECT url, REGEXP_REPLACE(url, '^https?://([^/]+).*', '\1') AS domain FROM page_views WHERE url ~ '^https?://';

Regex in Python

Python's re module provides full regex support. Key functions:

Function	Description	Returns
re.match(pattern, string)	Match at the beginning of the string only	Match object or None
re.search(pattern, string)	Find first match anywhere in string	Match object or None
re.findall(pattern, string)	Find all non-overlapping matches	List of strings or tuples
re.sub(pattern, repl, string)	Replace matches with repl	Modified string
re.split(pattern, string)	Split string on pattern matches	List of strings

With pandas, str.extract(), str.contains(), and str.replace() all accept regex patterns:

import re, pandas as pd df['domain'] = df['url'].str.extract(r'https?://([^/]+)') df['is_valid_email'] = df['email'].str.match(r'^[\w.+-]+@[\w-]+\.[a-z]{2,}$') df['phone_clean'] = df['phone'].str.replace(r'[^\d]', '', regex=True)

Summary

Regular expressions are a precision tool for working with string data. They excel at validation (does this value match a format?), extraction (pull the domain from this URL), and cleaning (strip non-numeric characters from a phone number). While complex regex patterns can be hard to read, well-chosen patterns solve entire categories of messy data problems in a single expression. Data analysts who are comfortable with regex in both SQL and Python dramatically reduce the time spent on data cleaning and open up new possibilities for parsing unstructured sources like log files, free-text fields, and scraped web content.