What Is Cohort Analysis?
Cohort analysis is a technique for segmenting users or customers into groups — cohorts — based on a shared characteristic or experience at a specific point in time, then tracking those groups' behavior over subsequent time periods. The most common cohort definition is the acquisition cohort: all users who signed up, made their first purchase, or Were acquired in the same week or month. By comparing cohorts across time, analysts can distinguish genuine behavioral trends from compositional changes in the user base.
Cohort analysis answers questions like: Are users acquired in Q3 retaining better than those acquired in Q1? Is lifetime value improving for newer cohorts? How does engagement in week 2 predict long-term retention?
Why Cohort Analysis Matters
Aggregate metrics like "monthly active users" or "average revenue" hide the underlying dynamics of a product. A growing MAU number can mask the fact that churn is accelerating — you're simply acquiring new users faster than you're losing old ones. Cohort analysis exposes these dynamics by keeping each group of users separate and measuring them at the same relative time points (e.g., Day 1, Day 7, Day 30 after signup — not calendar dates).
This relative-time perspective is what makes cohort analysis powerful: it controls for the maturity of the user relationship, enabling apples-to-apples comparisons across cohorts acquired at different times.
Types of Cohorts
Cohort Type | Definition | Example Use Case |
|---|---|---|
Acquisition cohort | Grouped by when they first joined or made a purchase | Retention curves by signup month |
Behavioral cohort | Grouped by an action they took (e.g., used a feature) | Do users who complete onboarding retain better? |
Size cohort | Grouped by a size attribute (e.g., company size in B2B) | Do enterprise accounts expand faster than SMB? |
Channel cohort | Grouped by acquisition channel (paid, organic, referral) | Which channel yields highest LTV users? |
The Cohort Retention Table
The standard output of cohort analysis is a retention table (also called a cohort matrix). Rows represent cohorts (e.g., the month of first purchase), columns represent time periods after the cohort event (Month 0, Month 1, Month 2...), and each cell contains the percentage of users from that cohort who were still active in that period.
Cohort | Month 0 | Month 1 | Month 2 | Month 3 |
|---|---|---|---|---|
Jan 2024 | 100% | 62% | 48% | 41% |
Feb 2024 | 100% | 65% | 51% | 44% |
Mar 2024 | 100% | 68% | 54% | — |
Apr 2024 | 100% | 70% | — | — |
Reading down a column shows how the same time period compares across cohorts — an improving Month 1 column (62% → 65% → 68% → 70%) indicates that recent cohorts are retaining better at that stage, suggesting a product improvement or better-qualified acquisition.
Building a Cohort Analysis in SQL
The SQL pattern for cohort analysis involves three steps: assign each user to a cohort, compute the activity period for each event, and aggregate by cohort and period.
Step 1 — Assign cohort (first activity date):
SELECT user_id, DATE_TRUNC('month', MIN(created_at)) AS cohort_month FROM orders GROUP BY user_id
Step 2 — Compute period index (months since cohort):
SELECT o.user_id, c.cohort_month, DATE_TRUNC('month', o.created_at) AS activity_month, DATEDIFF('month', c.cohort_month, DATE_TRUNC('month', o.created_at)) AS period FROM orders o JOIN cohorts c ON o.user_id = c.user_id
Step 3 — Aggregate into retention table:
SELECT cohort_month, period, COUNT(DISTINCT user_id) AS users FROM cohort_activity GROUP BY 1, 2 ORDER BY 1, 2
Divide each cell by the cohort size (period = 0 count) to get retention percentages. Then pivot by period to get the matrix format.
Building a Cohort Analysis in pandas
In Python, the cohort analysis pattern typically involves: assigning cohort labels, computing a period index, grouping and counting, then pivoting.
df['cohort'] = df.groupby('user_id')['order_date'].transform('min').dt.to_period('M')
df['period'] = (df['order_date'].dt.to_period('M') - df['cohort']).apply(lambda x: x.n)
cohort_data = df.groupby(['cohort', 'period'])['user_id'].nunique().reset_index()
cohort_pivot = cohort_data.pivot(index='cohort', columns='period', values='user_id')
retention = cohort_pivot.divide(cohort_pivot[0], axis=0) — divide by cohort size to get percentages.
Interpreting the Retention Curve
The retention curve plots the percentage of users active at each period (Day 1, Day 7, Day 30...). Key patterns to look for:
Pattern | What It Means | Implication |
|---|---|---|
Steep early drop, then flat | Many users churn quickly but a loyal core stays | Focus on improving early activation |
Gradual continuous decline | Users are slowly disengaging over time | Product may lack long-term value hooks |
Improving across cohorts | Newer cohorts retain better — product is improving | Positive signal; identify what changed |
Smiling curve (retention rises after initial drop) | Seasonal or periodic usage pattern | Normal for seasonal products |
Cohort Analysis for Revenue: LTV
Cohort analysis extends naturally to revenue. Instead of tracking active users, track cumulative revenue per user over time. This builds a customer lifetime value (LTV) curve by cohort:
LTV at Month N = (total revenue from cohort through Month N) / (number of users in cohort)
Comparing LTV curves across cohorts — and across acquisition channels — reveals which cohorts are most valuable and whether LTV is improving over time. LTV curves are also essential for computing the LTV:CAC ratio, which determines whether customer acquisition spending is economically justified.
Common Pitfalls
Pitfall | Description | Fix |
|---|---|---|
Immature cohorts | Recent cohorts have fewer periods — their curve looks better simply because we haven't seen their full churn yet | Only compare cohorts at periods where all have data |
Survivorship bias | Analyzing only users who completed a certain action excludes those who churned before reaching it | Define cohort at the earliest possible event |
Wrong cohort definition | Using a non-first event (e.g., last purchase) as the cohort anchor mixes up cohort timing | Always use the user's first event as the cohort date |
Calendar vs. relative time | Comparing activity in January vs. February mixes cohort age with seasonality | Always use period index (0, 1, 2...) not calendar dates |
Summary
Cohort analysis is one of the most valuable analytical techniques for understanding product health, customer retention, and lifetime value. By grouping users by acquisition time and tracking them through relative time periods, analysts can separate compositional effects from behavioral trends, detect whether product improvements are actually improving retention, and build robust LTV models by acquisition channel. The SQL and pandas patterns are consistent across industries: assign cohorts, compute period indices, aggregate, and pivot — then look for trends across the diagonal of the retention matrix.
Create a free reader account to keep reading.