Cohort Analysis for Data Analysts

What Is Cohort Analysis?

Cohort analysis groups users (or other entities) by a shared characteristic at a specific point in time — typically when they first performed an action — and then tracks their behaviour over subsequent periods. The most common cohort is the acquisition cohort: users grouped by the week or month they first signed up. By comparing cohorts side-by-side you can see whether product changes improved retention, whether newer customers are more or less valuable than older ones, and whether churn rates are improving over time.

Two questions cohort analysis answers that aggregate metrics cannot: Is retention actually improving, or is the average being propped up by a growing new-user base? And: do users acquired in different periods or channels behave differently long-term?

Types of Cohort Analysis

Type	Cohort Definition	What You Measure	Typical Use Case
Acquisition cohort	Users grouped by signup month/week	Retention rate, LTV, engagement over time	SaaS subscription retention, app retention
Behavioural cohort	Users grouped by an action taken (e.g., completed onboarding)	Conversion rate, engagement lift vs non-cohort	Measuring impact of a feature or onboarding step
Size cohort	Customers grouped by their first order value	Repeat purchase rate, LTV per segment	E-commerce: do big first orders predict loyalty?
Channel cohort	Users grouped by acquisition channel	Retention and LTV by channel	Marketing mix: which channels acquire valuable users?

Building a Retention Cohort Table in SQL

The standard output is a triangle-shaped table where rows are cohorts (signup month), columns are periods (months since signup: 0, 1, 2, ...), and each cell contains the retention rate.

-- Step 1: Get each user's cohort (first activity month)
WITH cohorts AS (
  SELECT
    user_id,
    DATE_TRUNC('month', MIN(order_date)) AS cohort_month
  FROM orders
  GROUP BY user_id
),

-- Step 2: Join back to get all activity and compute period number
activity AS (
  SELECT
    o.user_id,
    c.cohort_month,
    DATE_TRUNC('month', o.order_date) AS activity_month,
    DATEDIFF('month', c.cohort_month, DATE_TRUNC('month', o.order_date)) AS period
  FROM orders o
  JOIN cohorts c ON o.user_id = c.user_id
),

-- Step 3: Count distinct active users per cohort per period
cohort_counts AS (
  SELECT
    cohort_month,
    period,
    COUNT(DISTINCT user_id) AS active_users
  FROM activity
  GROUP BY cohort_month, period
)

-- Step 4: Divide by cohort size (period=0) to get retention rate
SELECT
  c.cohort_month,
  c.period,
  c.active_users,
  c0.active_users AS cohort_size,
  ROUND(100.0 * c.active_users / c0.active_users, 1) AS retention_pct
FROM cohort_counts c
JOIN cohort_counts c0
  ON c.cohort_month = c0.cohort_month AND c0.period = 0
ORDER BY c.cohort_month, c.period;

Pivot to a Triangle Heatmap in Python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# df has columns: cohort_month, period, retention_pct
df = pd.read_csv('cohort_retention.csv')

# Pivot to wide format
cohort_pivot = df.pivot_table(
    index='cohort_month',
    columns='period',
    values='retention_pct'
)

# Plot as heatmap
plt.figure(figsize=(14, 7))
sns.heatmap(
    cohort_pivot,
    annot=True,
    fmt='.1f',
    cmap='YlGn',
    linewidths=0.5,
    vmin=0, vmax=100
)
plt.title('Monthly Retention Cohorts (%)')
plt.xlabel('Months Since Signup')
plt.ylabel('Cohort Month')
plt.tight_layout()
plt.savefig('retention_heatmap.png', dpi=150)
plt.show()

Reading a Retention Cohort Table

What to Look For	What It Means	Action
Period-0 retention is 100%	Expected — every user is active in their signup period	Baseline; use as denominator for all other periods
Sharp drop from period 0 to period 1	Many users never return after first use — early churn problem	Investigate onboarding; improve activation experience
Newer cohorts have higher retention at period 1, 2, 3	Retention is genuinely improving over time	Identify what changed (product, onboarding, channel) and double down
Retention flattens after period 3–4	Users who survive early churn become long-term loyal users	Focus on getting users past the "churn cliff"; early engagement drives LTV
One cohort month looks anomalously high	Could be a marketing campaign, seasonal effect, or data quality issue	Investigate the cause before drawing conclusions

LTV Cohort Analysis

Beyond retention, cohorts can track cumulative revenue per user over time — revealing which acquisition periods or channels produce the highest lifetime value.

-- Cumulative LTV per cohort WITH cohorts AS ( SELECT user_id, DATE_TRUNC('month', MIN(order_date)) AS cohort_month FROM orders GROUP BY user_id ), revenue_by_period AS ( SELECT c.cohort_month, DATEDIFF('month', c.cohort_month, DATE_TRUNC('month', o.order_date)) AS period, SUM(o.revenue) AS period_revenue, COUNT(DISTINCT c.user_id) AS cohort_size FROM orders o JOIN cohorts c ON o.user_id = c.user_id GROUP BY c.cohort_month, period ) SELECT cohort_month, period, ROUND(SUM(period_revenue) OVER ( PARTITION BY cohort_month ORDER BY period ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) / MAX(cohort_size) OVER (PARTITION BY cohort_month), 2) AS cumulative_ltv_per_user FROM revenue_by_period ORDER BY cohort_month, period;

Cohort Analysis in Python with pandas

import pandas as pd df = pd.read_csv('orders.csv', parse_dates=['order_date']) # Step 1: Assign cohort month to each user df['cohort_month'] = df.groupby('user_id')['order_date'].transform('min').dt.to_period('M') df['order_month'] = df['order_date'].dt.to_period('M') # Step 2: Calculate period (months since cohort) df['period'] = (df['order_month'] - df['cohort_month']).apply(lambda x: x.n) # Step 3: Count unique users per cohort per period cohort_data = df.groupby(['cohort_month', 'period'])['user_id'].nunique().reset_index() cohort_data.columns = ['cohort_month', 'period', 'users'] # Step 4: Get cohort sizes (period == 0) cohort_sizes = cohort_data[cohort_data['period'] == 0].set_index('cohort_month')['users'] # Step 5: Calculate retention cohort_data['cohort_size'] = cohort_data['cohort_month'].map(cohort_sizes) cohort_data['retention'] = cohort_data['users'] / cohort_data['cohort_size'] * 100 # Step 6: Pivot for heatmap retention_pivot = cohort_data.pivot_table( index='cohort_month', columns='period', values='retention' ).round(1) print(retention_pivot)

Common Pitfalls in Cohort Analysis

Incomplete recent cohorts: A cohort from last month only has 1 month of data. Never compare period-6 retention for a new cohort against a mature one — you don't have the data yet. Truncate the triangle at the diagonal.

Mixing absolute and relative time: Always use "months since signup" (relative), not calendar month (absolute), for the column axis.

Survivorship bias in activity definition: If your activity event only fires for paying users, you're measuring retention of paying users, not all users. Be explicit about the denominator.

Ignoring cohort size: A 60% retention rate from a cohort of 50 is statistically noise. Always display cohort sizes alongside retention rates.

Forgetting seasonality: Users acquired in December may behave differently from those acquired in March due to seasonal reasons, not product quality.

Summary

Cohort analysis reveals whether your product is genuinely improving for users, or whether growth is masking underlying churn. The acquisition cohort retention heatmap is the most important single chart for subscription and recurring-revenue businesses — it shows whether you are building a product that people keep using over time. Combine it with LTV cohorts to understand revenue impact, and channel cohorts to inform marketing budget allocation. The SQL and Python patterns in this article cover 90% of real-world cohort analysis needs.