Introduction
If you're one of the first analysts joining a company, you will probably arrive to find an overwhelming amount of data. You'll almost certainly have some relational data (probably powering your application's backend), and likely have event data too (possibly through a third party like Segment).
This can make it difficult to know where to start with each analytical request. You might be tempted to create a new dataset for each analysis, but this adds to the sprawl of data and quickly becomes difficult to manage.
Fortunately, you can gather all of your data into three tables that can then be used to answer the majority of user-related questions.
The SQL snippets in this post apply to Google BigQuery but can easily be adapted for other data warehouses.
Table 1: User Stats
To start, you need a table that tells you all about your users. This table will contain dimensions (e.g. user account creation timestamp) and metrics (e.g. user lifetime transaction count). It might be simple to begin with, but as your company grows and gets more complex, you’ll keep adding to this table.
Key point: `user_stats` has one row per user.
SELECT * FROM user_stats
user_id | created_at | last_seen_at | transactions_last_7d | transactions_lifetime |
---|---|---|---|---|
32356543 | 2019-01-01 14:36:51 | 2019-03-14 19:55:10 | 2 | 13 |
64763654 | 2019-02-01 10:30:18 | 2019-02-08 10:02:32 | 0 | 4 |
35795445 | 2019-03-03 23:00:04 | 2019-03-03 23:03:07 | 0 | 0 |
98765446 | 2019-03-06 14:36:31 | 2019-03-15 08:20:22 | 1 | 2 |
... | ... | ... | ... | ... |
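How you build this table depends on what source data you have. As a rough sketch, assume a `users` table, a `transactions` table with a `transaction_at` timestamp, and a `logins` table with a `login_at` timestamp (all three names are hypothetical); a query along these lines could then populate `user_stats`:
-- A minimal sketch: the users, transactions and logins tables (and their columns) are assumptions
WITH transaction_stats AS (
  SELECT
    user_id,
    COUNTIF(transaction_at >= TIMESTAMP_SUB(current_timestamp, INTERVAL 7 DAY)) AS transactions_last_7d,
    COUNT(*) AS transactions_lifetime
  FROM transactions
  GROUP BY user_id
),
login_stats AS (
  SELECT
    user_id,
    MAX(login_at) AS last_seen_at
  FROM logins
  GROUP BY user_id
)
SELECT
  users.user_id,
  users.created_at,
  login_stats.last_seen_at,
  COALESCE(transaction_stats.transactions_last_7d, 0) AS transactions_last_7d,
  COALESCE(transaction_stats.transactions_lifetime, 0) AS transactions_lifetime
FROM users
LEFT JOIN transaction_stats ON transaction_stats.user_id = users.user_id
LEFT JOIN login_stats ON login_stats.user_id = users.user_id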
What kind of questions can this table be used to answer?
How many users signed up this week?
SELECT
COUNT(*) as user_count
FROM
user_stats
WHERE
created_at >= TIMESTAMP_SUB(current_timestamp, INTERVAL 7 DAY)
How many users have made at least 5 transactions?
SELECT
COUNT(*) as user_count
FROM
user_stats
WHERE
transactions_lifetime >= 5
How many users have logged in within the past 24 hours?
SELECT
COUNT(*) as user_count
FROM
user_stats
WHERE
last_seen_at >= TIMESTAMP_SUB(current_timestamp, INTERVAL 24 HOUR)
Table 2: Daily User Stats
The need for this table might not be so obvious, but it turns out that having a table describing each user’s activity every single day since they signed up (even if they were not active at all on that day) is really useful. There are four dimensions you’ll definitely want in this table:
- `user_id`: a unique identifier for your users that you can link back to `user_stats`
- `date`: the date this row corresponds to
- `day_n`: the day (in the lifetime of the user) this date represents. On the day a user signs up, `day_n = 1`; the day after sign up, `day_n = 2`; and so on
- `was_active`: was the user active on this day? (Exactly how you choose to define "active" should depend on the specifics of your business. If you are building a messaging app, for example, you may define "active" as having sent at least one message.)
As with `user_stats`, this table will probably end up with many more than these four fields as your business becomes more complex.
Key point: `daily_user_stats` has one row per user per day (for all days since they signed up).
SELECT * FROM daily_user_stats ORDER BY user_id, day_n
user_id | date | day_n | was_active |
---|---|---|---|
32356543 | 2019-03-01 | 1 | 1 |
32356543 | 2019-03-02 | 2 | 0 |
32356543 | 2019-03-03 | 3 | 1 |
35795445 | 2019-03-02 | 1 | 1 |
35795445 | 2019-03-03 | 2 | 0 |
98765446 | 2019-03-03 | 1 | 1 |
... | ... | ... | ... |
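Generating one row per user per day is straightforward in BigQuery using GENERATE_DATE_ARRAY. A minimal sketch, assuming a `users` table with a `created_at` column and a `daily_activity` table with one row per user per active day (both table names are hypothetical):
-- A minimal sketch: users and daily_activity are assumed source tables
SELECT
  users.user_id,
  date,
  -- day 1 is the sign-up day itself
  DATE_DIFF(date, DATE(users.created_at), DAY) + 1 AS day_n,
  IF(daily_activity.user_id IS NULL, 0, 1) AS was_active
FROM users
-- one row per user per calendar day, from sign up until today
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(DATE(users.created_at), CURRENT_DATE())) AS date
LEFT JOIN daily_activity
  ON daily_activity.user_id = users.user_id
  AND daily_activity.activity_date = date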
What kind of questions can this table be used to answer?
What percentage of users are active one week after they signed up?
SELECT
AVG(was_active) as active_user_ratio
FROM
daily_user_stats
WHERE
day_n = 8
How has the percentage of users active the day after they sign up trended over time?
SELECT
DATE(user_stats.created_at) AS created_date,
AVG(was_active) AS active_user_ratio
FROM
daily_user_stats
INNER JOIN user_stats
ON daily_user_stats.user_id = user_stats.user_id
WHERE
day_n = 2
GROUP BY
created_date
ORDER BY created_date ASC
What does our user retention curve look like?
SELECT
day_n,
AVG(was_active) AS active_user_ratio
FROM
daily_user_stats
INNER JOIN user_stats
ON daily_user_stats.user_id = user_stats.user_id
WHERE
user_stats.created_at >= TIMESTAMP_SUB(current_timestamp, INTERVAL 7 DAY)
AND day_n <= 7
GROUP BY
day_n
ORDER BY day_n ASC
Table 3: Sessions
The `sessions` table is a place to store your event data. Events are user interactions on your app or site (e.g. button clicks, form submissions, message sends). You probably track many different events, and finding a consistent approach to analysing them can be tricky. The `sessions` table organises the chaos!
A session is a collection of events that represent one end to end interaction with your product: user opens app > user opens product abc page > user clicks add to cart > user closes app. How you combine a user’s events into a session depends on your product, but a standard approach is to end a session after a period of 30 minutes of inactivity. Any subsequent events from that user will be added to a new session.
Using some relatively complex SQL, you can split your raw events into sessions (a sketch of one approach follows the example table below). Your sessions table should at least have these dimensions:
- `user_id`: a unique identifier for your users
- `session_id`: starting at 1 (the user's first ever session) and incrementing with each further session
- `start_time`: timestamp of the first event in the session
- `end_time`: timestamp of the last event in the session
- `events`: an ordered list/array of all events in the session (arrays are well supported in BigQuery; other query languages may need a slightly different approach, for example using JSON)
Key point: `sessions` has one row for each session a user has had.
SELECT * FROM sessions
user_id | session_id | start_time | end_time | events |
---|---|---|---|---|
32356543 | 1 | 2019-03-14 19:55:10 | 2019-03-14 19:58:15 | [opened_app, clicked_product, closed_app] |
32356543 | 2 | 2019-03-16 13:45:10 | 2019-03-16 13:51:15 | [opened_app, scrolled_list, closed_app] |
23501294 | 1 | 2019-03-18 16:13:10 | 2019-03-18 16:14:15 | [opened_app, ..., closed_app] |
23501294 | 2 | 2019-03-19 19:55:10 | 2019-03-19 19:58:15 | [opened_app, ..., closed_app] |
... | ... | ... | ... | ... |
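The sessionisation query itself usually relies on window functions: flag each event that starts a new session (the user's first event, or any event more than 30 minutes after their previous one), then take a running sum of those flags to number the sessions. A sketch, assuming a raw `events` table with `user_id`, `event_name` and `event_timestamp` columns (all hypothetical names):
-- A minimal sessionisation sketch; the raw events table and its columns are assumptions
WITH flagged_events AS (
  SELECT
    user_id,
    event_name,
    event_timestamp,
    -- a new session starts on the user's first event, or after a gap of 30+ minutes
    IF(
      LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp) IS NULL
      OR TIMESTAMP_DIFF(
        event_timestamp,
        LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp),
        MINUTE) >= 30,
      1, 0) AS is_new_session
  FROM events
),
numbered_events AS (
  SELECT
    *,
    -- a running total of session starts gives each event its session number
    SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_timestamp) AS session_id
  FROM flagged_events
)
SELECT
  user_id,
  session_id,
  MIN(event_timestamp) AS start_time,
  MAX(event_timestamp) AS end_time,
  ARRAY_AGG(event_name ORDER BY event_timestamp) AS events
FROM numbered_events
GROUP BY user_id, session_id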
What kind of questions can this table be used to answer?
How has average session length changed over time?
SELECT
DATE(end_time) AS session_date,
AVG(TIMESTAMP_DIFF(end_time, start_time, SECOND)) AS session_length_seconds
FROM
sessions
GROUP BY
1
How has the share of sessions containing a transaction changed over time?
SELECT
DATE(end_time) session_date,
AVG(IF(ARRAY_TO_STRING(events, '-') LIKE '%transaction_complete%',
1,
0)) AS transaction_ratio
FROM
sessions
GROUP BY
1
What chain of events led to a user leaving the app and never returning?
SELECT
user_id,
ARRAY_REVERSE(events)[SAFE_OFFSET(2)] AS last_event_offset_2,
ARRAY_REVERSE(events)[SAFE_OFFSET(1)] AS last_event_offset_1,
ARRAY_REVERSE(events)[SAFE_OFFSET(0)] AS last_event_offset_0
FROM
sessions
LEFT JOIN user_stats
ON sessions.user_id = user_stats.user_id
WHERE
sessions.session_id = 1
AND user_stats.sessions_all_time = 1
Summary
These three well-structured tables can help answer the majority of your user-related questions:
- `user_stats`: who is in my user base?
- `daily_user_stats`: how many active users do I have, and how well do they retain?
- `sessions`: what journeys are users taking through my product?
Once you've written the queries to combine your other datasets into these three simple tables, set up a schedule to keep them up to date as new data arrives. You'll soon find yourself relying on these tables for a large share of your analytics.
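In BigQuery, for example, each build query can be wrapped in a CREATE OR REPLACE TABLE statement and run as a scheduled query (the `analytics` dataset below is just a placeholder):
-- Hypothetical scheduled rebuild; analytics.user_stats is a placeholder destination
CREATE OR REPLACE TABLE analytics.user_stats AS
SELECT
  users.user_id,
  users.created_at
  -- ...plus the aggregated metrics from the user_stats build query sketched earlier
FROM analytics.users AS users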
Dataform provides a fully managed, collaborative workspace to set up schedules and manage your data warehouse. For more information on Dataform check out our documentation, or create an account for free and start using Dataform's fully managed Web platform.