What Is Cohort Analysis?
Cohort analysis groups users (or other entities) by a shared characteristic at a specific point in time — typically when they first performed an action — and then tracks their behaviour over subsequent periods. The most common cohort is the acquisition cohort: users grouped by the week or month they first signed up. By comparing cohorts side-by-side you can see whether product changes improved retention, whether newer customers are more or less valuable than older ones, and whether churn rates are improving over time.
Two questions cohort analysis answers that aggregate metrics cannot: Is retention actually improving, or is the average being propped up by a growing new-user base? And: do users acquired in different periods or channels behave differently long-term?
Types of Cohort Analysis
| Type | Cohort Definition | What You Measure | Typical Use Case |
|---|---|---|---|
| Acquisition cohort | Users grouped by signup month/week | Retention rate, LTV, engagement over time | SaaS subscription retention, app retention |
| Behavioural cohort | Users grouped by an action taken (e.g., completed onboarding) | Conversion rate, engagement lift vs non-cohort | Measuring impact of a feature or onboarding step |
| Size cohort | Customers grouped by their first order value | Repeat purchase rate, LTV per segment | E-commerce: do big first orders predict loyalty? |
| Channel cohort | Users grouped by acquisition channel | Retention and LTV by channel | Marketing mix: which channels acquire valuable users? |
Building a Retention Cohort Table in SQL
The standard output is a triangle-shaped table where rows are cohorts (signup month), columns are periods (months since signup: 0, 1, 2, ...), and each cell contains the retention rate.
-- Step 1: Get each user's cohort (first activity month)
WITH cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', MIN(order_date)) AS cohort_month
FROM orders
GROUP BY user_id
),
-- Step 2: Join back to get all activity and compute period number
activity AS (
SELECT
o.user_id,
c.cohort_month,
DATE_TRUNC('month', o.order_date) AS activity_month,
DATEDIFF('month', c.cohort_month, DATE_TRUNC('month', o.order_date)) AS period
FROM orders o
JOIN cohorts c ON o.user_id = c.user_id
),
-- Step 3: Count distinct active users per cohort per period
cohort_counts AS (
SELECT
cohort_month,
period,
COUNT(DISTINCT user_id) AS active_users
FROM activity
GROUP BY cohort_month, period
)
-- Step 4: Divide by cohort size (period=0) to get retention rate
SELECT
c.cohort_month,
c.period,
c.active_users,
c0.active_users AS cohort_size,
ROUND(100.0 * c.active_users / c0.active_users, 1) AS retention_pct
FROM cohort_counts c
JOIN cohort_counts c0
ON c.cohort_month = c0.cohort_month AND c0.period = 0
ORDER BY c.cohort_month, c.period;
Pivot to a Triangle Heatmap in Python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# df has columns: cohort_month, period, retention_pct
df = pd.read_csv('cohort_retention.csv')
# Pivot to wide format
cohort_pivot = df.pivot_table(
index='cohort_month',
columns='period',
values='retention_pct'
)
# Plot as heatmap
plt.figure(figsize=(14, 7))
sns.heatmap(
cohort_pivot,
annot=True,
fmt='.1f',
cmap='YlGn',
linewidths=0.5,
vmin=0, vmax=100
)
plt.title('Monthly Retention Cohorts (%)')
plt.xlabel('Months Since Signup')
plt.ylabel('Cohort Month')
plt.tight_layout()
plt.savefig('retention_heatmap.png', dpi=150)
plt.show()
Reading a Retention Cohort Table
| What to Look For | What It Means | Action |
|---|---|---|
| Period-0 retention is 100% | Expected — every user is active in their signup period | Baseline; use as denominator for all other periods |
| Sharp drop from period 0 to period 1 | Many users never return after first use — early churn problem | Investigate onboarding; improve activation experience |
| Newer cohorts have higher retention at period 1, 2, 3 | Retention is genuinely improving over time | Identify what changed (product, onboarding, channel) and double down |
| Retention flattens after period 3–4 | Users who survive early churn become long-term loyal users | Focus on getting users past the "churn cliff"; early engagement drives LTV |
| One cohort month looks anomalously high | Could be a marketing campaign, seasonal effect, or data quality issue | Investigate the cause before drawing conclusions |
LTV Cohort Analysis
Beyond retention, cohorts can track cumulative revenue per user over time — revealing which acquisition periods or channels produce the highest lifetime value.
-- Cumulative LTV per cohort
WITH cohorts AS (
SELECT user_id,
DATE_TRUNC('month', MIN(order_date)) AS cohort_month
FROM orders GROUP BY user_id
),
revenue_by_period AS (
SELECT
c.cohort_month,
DATEDIFF('month', c.cohort_month, DATE_TRUNC('month', o.order_date)) AS period,
SUM(o.revenue) AS period_revenue,
COUNT(DISTINCT c.user_id) AS cohort_size
FROM orders o
JOIN cohorts c ON o.user_id = c.user_id
GROUP BY c.cohort_month, period
)
SELECT
cohort_month,
period,
ROUND(SUM(period_revenue) OVER (
PARTITION BY cohort_month
ORDER BY period
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) / MAX(cohort_size) OVER (PARTITION BY cohort_month), 2) AS cumulative_ltv_per_user
FROM revenue_by_period
ORDER BY cohort_month, period;
Cohort Analysis in Python with pandas
import pandas as pd
df = pd.read_csv('orders.csv', parse_dates=['order_date'])
# Step 1: Assign cohort month to each user
df['cohort_month'] = df.groupby('user_id')['order_date'].transform('min').dt.to_period('M')
df['order_month'] = df['order_date'].dt.to_period('M')
# Step 2: Calculate period (months since cohort)
df['period'] = (df['order_month'] - df['cohort_month']).apply(lambda x: x.n)
# Step 3: Count unique users per cohort per period
cohort_data = df.groupby(['cohort_month', 'period'])['user_id'].nunique().reset_index()
cohort_data.columns = ['cohort_month', 'period', 'users']
# Step 4: Get cohort sizes (period == 0)
cohort_sizes = cohort_data[cohort_data['period'] == 0].set_index('cohort_month')['users']
# Step 5: Calculate retention
cohort_data['cohort_size'] = cohort_data['cohort_month'].map(cohort_sizes)
cohort_data['retention'] = cohort_data['users'] / cohort_data['cohort_size'] * 100
# Step 6: Pivot for heatmap
retention_pivot = cohort_data.pivot_table(
index='cohort_month', columns='period', values='retention'
).round(1)
print(retention_pivot)
Common Pitfalls in Cohort Analysis
- Incomplete recent cohorts: A cohort from last month only has 1 month of data. Never compare period-6 retention for a new cohort against a mature one — you don't have the data yet. Truncate the triangle at the diagonal.
- Mixing absolute and relative time: Always use "months since signup" (relative), not calendar month (absolute), for the column axis.
- Survivorship bias in activity definition: If your activity event only fires for paying users, you're measuring retention of paying users, not all users. Be explicit about the denominator.
- Ignoring cohort size: A 60% retention rate from a cohort of 50 is statistically noise. Always display cohort sizes alongside retention rates.
- Forgetting seasonality: Users acquired in December may behave differently from those acquired in March due to seasonal reasons, not product quality.
Summary
Cohort analysis reveals whether your product is genuinely improving for users, or whether growth is masking underlying churn. The acquisition cohort retention heatmap is the most important single chart for subscription and recurring-revenue businesses — it shows whether you are building a product that people keep using over time. Combine it with LTV cohorts to understand revenue impact, and channel cohorts to inform marketing budget allocation. The SQL and Python patterns in this article cover 90% of real-world cohort analysis needs.
Create a free reader account to keep reading.