SQL Story of unbroken chains of events (daily/weekly streaks)

#sql #postgres #tutorial

Recently I decided to write my own habit tracker.
I'm planning to have 2 kinds of habits: everyday habits and every-week habits. Plus in order to accomplish some of them you must check them several times per period. To track it and prevent me from quitting my new habits we'll calculate unbroken chains of events in 3 parts:

Everyday habit streak
Every week habit streak
With postgreSQL queries

1. Everyday habit streak

For a start let's calculate consecutive series of events that should happen every day.
For the demonstration purpose let's use data dumps on Stack Exchange Data Explorer. We will use a post table and try to find streaks of post creation. You can find the schema description here. It uses SQL Server, but next, we'll rewrite it for PostgresSQL.
First, let's check what do we have for user 1.

SELECT 
  ROW_NUMBER() OVER (ORDER BY CreationDate) rowNumber,
  CAST(CreationDate AS DATE) date
FROM Posts
WHERE OwnerUserId = ##UserId##
ORDER BY CreationDate

(you can run this snippet here and check all 144 rows of data)
We can see lots of posts created during the first days. And we can find 5 day streak here:

rowNumber   date
-----------------------
1   2008-07-31 00:00:00
2   2008-07-31 00:00:00
3   2008-07-31 00:00:00
4   2008-08-04 00:00:00
5   2008-08-04 00:00:00
6   2008-08-04 00:00:00

7   2008-08-10 00:00:00 ----
8   2008-08-11 00:00:00     |
9   2008-08-12 00:00:00     |
10  2008-08-12 00:00:00     |
11  2008-08-12 00:00:00     |
12  2008-08-12 00:00:00     |5 day
13  2008-08-12 00:00:00     |streak
14  2008-08-13 00:00:00     |found  
15  2008-08-13 00:00:00     | 
16  2008-08-13 00:00:00     |
17  2008-08-14 00:00:00     |
18  2008-08-14 00:00:00     |
19  2008-08-14 00:00:00     |
20  2008-08-14 00:00:00 ----

21  2008-08-17 00:00:00
...

What is ROW_NUMBER() function in the snippet above? It's called window function, it allows to make calculations across all rows of the current query.

Before using it in our calculations let's resolve another problem.
In our 5 days streak we can see a lot of repeating days, which gives us nothing new, let's group it and calculate the number of posts for these specific days - column amount.

SELECT 
  ROW_NUMBER() OVER (ORDER BY MIN(CreationDate)) rowNumber,
  COUNT(*) amount,
  CAST(MIN(CreationDate)AS DATE) date
FROM Posts
WHERE OwnerUserId = ##UserId##
  AND CreationDate BETWEEN '2008-08-10' and '2008-08-17'
GROUP BY CAST(CreationDate AS DATE)
ORDER BY MIN(CreationDate)

(run it!)

r amount  date
---------------------------
1   1   2008-08-10 00:00:00 ----
2   1   2008-08-11 00:00:00     |our
3   5   2008-08-12 00:00:00     |5 days
4   3   2008-08-13 00:00:00     |streak
5   4   2008-08-14 00:00:00 ----

Now we can see a pattern: day 5 of our streak here is also the 5th row.
Let's use this sweet ROW_NUMBER() function and subtract row number from our date.

SELECT 
  ROW_NUMBER() OVER (ORDER BY MIN(CreationDate)) row,
  COUNT(*) amount,
  CAST(MIN(CreationDate)AS DATE) date,
  CAST(MIN(CreationDate) - ROW_NUMBER() OVER (ORDER BY MIN(CreationDate)) as DATE) dateMinusRow
FROM Posts
WHERE OwnerUserId = ##UserId##
  AND CreationDate BETWEEN '2008-08-10' and '2008-08-17'
GROUP BY CAST(CreationDate AS DATE)
ORDER BY MIN(CreationDate)

(run it here)

r   amount          date                dateMinusRow
------------------------------------------------------------
1   1       2008-08-10 00:00:00     2008-08-09 00:00:00 --
2   1       2008-08-11 00:00:00     2008-08-09 00:00:00    |same value
3   5       2008-08-12 00:00:00     2008-08-09 00:00:00    |because of
4   3       2008-08-13 00:00:00     2008-08-09 00:00:00    |(date - row_number)
5   4       2008-08-14 00:00:00     2008-08-09 00:00:00 --

And apply this logic to the whole set:

SELECT 
  ROW_NUMBER() OVER (ORDER BY MIN(CreationDate)) as DATE) row,
  COUNT(*) amount,
  CAST(MIN(CreationDate)AS DATE) date,
  CAST(MIN(CreationDate) - ROW_NUMBER() OVER (ORDER BY MIN(CreationDate)) as DATE) dateMinusRow
FROM Posts
WHERE OwnerUserId = ##UserId##
GROUP BY CAST(CreationDate AS DATE)
ORDER BY MIN(CreationDate)

(run me)

r   amount      date             dateMinusRow
-----------------------------------------------------------
1    3       2008-07-31 00:00:00   2008-07-30 00:00:00
2    3       2008-08-04 00:00:00   2008-08-02 00:00:00

3    1       2008-08-10 00:00:00   2008-08-07 00:00:00-- 
4    1       2008-08-11 00:00:00   2008-08-07 00:00:00  |same value
5    5       2008-08-12 00:00:00   2008-08-07 00:00:00  |because of
6    3       2008-08-13 00:00:00   2008-08-07 00:00:00  |(date - row_number)
7    4       2008-08-14 00:00:00   2008-08-07 00:00:00--

8    2       2008-08-17 00:00:00   2008-08-09 00:00:00

As we can see it works perfectly, column dateMinusRow has the same value for all streak days.
All we need to do now is group the values and enjoy the result.

SELECT
  COUNT(*) streak,
  SUM(amount) streakAmount,
  MIN(date) startDate,
  MAX(date) endDate,
  dateMinusRow dateMinusRow
FROM (
  SELECT 
    COUNT(*) amount,
    CAST(MIN(CreationDate)AS DATE) date,
    CAST(MIN(CreationDate) - ROW_NUMBER() OVER (ORDER BY MIN(CreationDate)) as DATE) dateMinusRow
  FROM Posts
  WHERE OwnerUserId = ##UserId##
  GROUP BY CAST(CreationDate AS DATE)
) groupedDays
GROUP BY dateMinusRow

And result was as we expected (run it here and check all records)

streak          start                  endDate
--------------------------------------------------------
1       2008-07-31 00:00:00     2008-07-31 00:00:00 
1       2008-08-04 00:00:00     2008-08-04 00:00:00 
5       2008-08-10 00:00:00     2008-08-14 00:00:00 - our 5 days streak
1       2008-08-17 00:00:00     2008-08-17 00:00:00

We can work with this query further and find out what is our user's max daily streak (wasn't it 5?). But now let's talk about weeks and use this ROW_NUMBER() method there.

2. Every week habit streak

To use the same calculation method we need to understand what can represent all events during the same week.

First, we can use DATEPART() to get a week number and year from the date.
In this case, we'll have to get our head around calculations when streak goes from one year to another.

Another way to solve the week streak case is to use the difference in weeks between our CreationDate and some start date.
We'll use 0 as a start date. Sql Server interprets it like 1900-01-01 00:00:00.000.
Also by default Sql Server uses Sunday as the first day of the week:

SELECT 
  COUNT(*) anount,
  MIN(p.CreationDate) minDate,
  MAX(p.CreationDate) maxDate,
  DATEDIFF(wk, 0, CreationDate) weeks,
  DATEDIFF(wk, 0, CreationDate) - ROW_NUMBER() OVER (ORDER BY MIN(CreationDate)) weekMinusRow
FROM Posts p
WHERE OwnerUserId = ##UserId##
GROUP BY DATEDIFF(wk, 0, CreationDate)

In my case, I prefer to use Monday as the first day. Even though DATEDIFF ignores SET DATEFIRST there is a hint from here.

(run version with Monday as a first day)

So our final query:

SET DATEFIRST 1
SELECT
  COUNT(*) streak,
  MIN(minDate) startDate,
  MAX(maxDate) endDate
FROM (
  SELECT 
    COUNT(*) counts,
    MIN(p.CreationDate) minDate,
    MAX(p.CreationDate) maxDate,
    DATEDIFF(wk, DATEADD(dd,-@@datefirst,0), DATEADD(dd,-@@datefirst,CreationDate)) weeks,
    DATEDIFF(wk, DATEADD(dd,-@@datefirst,0), DATEADD(dd,-@@datefirst,CreationDate)) - ROW_NUMBER() OVER (ORDER BY MIN(CreationDate)) weekMinusRow
  FROM Posts p
  WHERE OwnerUserId = ##UserId##
  GROUP BY DATEDIFF(wk, DATEADD(dd,-@@datefirst,0), DATEADD(dd,-@@datefirst,CreationDate))
  HAVING COUNT(*) >= ##numberOfRepetitions##
) groupedRows
GROUP BY weekMinusRow
ORDER BY endDate DESC

(run it)

streak      startDate                  endDate
-----------------------------------------------------------
1        2009-03-30 08:00:18       2009-04-03 10:35:03
1        2009-03-10 01:05:12       2009-03-14 17:45:04
1        2008-12-01 04:56:38       2008-12-05 00:34:40
3        2008-10-07 08:01:40       2008-10-24 15:45:15
1        2008-08-25 00:11:43       2008-08-28 13:26:27
2        2008-08-04 02:45:07       2008-08-17 03:00:34

By the way, I also added HAVING COUNT(*) >= ##numberOfRepetitions## to filter by the number of actions that were performed per week.
From now on for streak week we should post harder, at least 4 times a week ;-)

3. With PostgreSQL queries

Data dumps on Stack Exchange Data Explorer are amazing and a lot of fun to play with. But in my current project, I use PostgreSQL. Let's rewrite our queries. For better readability, we will use with clause and extract subquery.

Sql Server	PostgreSQL
CAST(date AS DATE)	date_trunc('day', date)
CAST(MIN(CreationDate)AS DATE) - ROW_NUMBER() OVER (ORDER BY MIN(CreationDate))	date_trunc('day', CreationDate) - INTERVAL '1' DAY * ROW_NUMBER() OVER (ORDER BY MIN(CreationDate))
DATEDIFF(wk, start, end)	(DATE_PART('day', end - start)/7)::int

The final query for day streaks:

WITH
  groups(date, dateMinusRow) AS (
    SELECT 
      date_trunc('day', CreationDate) date,
      date_trunc('day', CreationDate) - INTERVAL '1' DAY * DENSE_RANK() OVER (ORDER BY date_trunc('day', CreationDate)) dateMinusRow
    FROM Posts
    GROUP BY date_trunc('day', CreationDate)
    HAVING COUNT(*) >= 1
  )
SELECT
  COUNT(*) AS streak,
  MIN(date) AS startDate,
  MAX(date) AS endDate
FROM groups
GROUP BY dateMinusRow
ORDER BY endDate DESC

run it here

The final query for week streaks:

WITH
  groups(minDate, maxDate, weekMinusRow) AS (
    SELECT
      MIN(CreationDate) minDate,
      MAX(CreationDate) maxDate,
      (DATE_PART('day', CreationDate - to_timestamp(0))/7)::int - ROW_NUMBER() OVER (ORDER BY MIN(CreationDate)) weekMinusRow
    FROM Posts
    GROUP BY (DATE_PART('day', CreationDate - to_timestamp(0))/7)::int
    HAVING COUNT(*) >= 2
  )
SELECT
  COUNT(*) streak,
  MIN(minDate) startDate,
  MAX(maxDate) endDate
FROM groups
GROUP BY weekMinusRow
ORDER BY endDate DESC