DEV Community

Mark Adel
Mark Adel

Posted on • Edited on

A Practical Example of Using SQL Self Joins

Problem

In this article, I am going to share a scenario that I recently faced at work and made use of SQL self joins.

Our company operates a food ordering app. We noticed that some customers add items to their basket, and for one or more hours, do not proceed to create an order. To address this, we decided to send these customers a push notification to encourage them to complete the ordering process. And that was my task. In this article, we will focus on how to retrieve these customers.

Solution

We have a relational table called customer_events where we store various events created by our app customers, such as: 'opened_app', 'added_items_to_basket', 'created_order', 'closed_app', and so on.

While Time-Series Databases could be a more suitable solution for this type of data, given our specific scenario and scale, we decided to stick with the traditional relational database.

Following is a sample of the customer_events table:

id customer_id event_type event_timestamp
1 20 opened_app 2023-01-01 11:00:00
2 30 added_items_to_basket 2023-01-01 12:00:00
3 30 closed_app 2023-01-01 12:10:00
4 40 added_items_to_basket 2023-01-01 13:00:00
5 40 created_order 2023-01-01 13:30:00
6 50 added_items_to_basket 2023-01-01 15:00:00
7 60 added_items_to_basket 2023-01-01 16:30:00

We have events from 5 customers. Only the customer_id of 40 added items to their basket and then created an order. The others did not, so we want to retrieve these customers to send them a push notification, which are customer_id of 30, 50, and 60.

Before writing non-straightforward SQL queries, I like to use plain text to describe exactly what I want to achieve, in a way that would help me translate that into the query. Let's see what I would write for our case:

"I want to retrieve customer IDs that their event_type = 'added_items_to_basket' and their event_timestamp is one or more hours ago, excluding those for whom there exists a row with the same customer_id and event_name = 'created_order' and a greater event_timestamp"

Let's start with the easy part of the query, which is selecting all customers who added items to their basket at least one hour ago:

SELECT
    customer_id
FROM
    customer_events
WHERE
    event_type = 'added_items_to_basket'
    AND (EXTRACT(EPOCH FROM (NOW() - event_timestamp)) / 3600) >= 1;
Enter fullscreen mode Exit fullscreen mode

Now, let's update the query to exclude those customers who created an order after they added items to the basket. We can do this using self joins:

SELECT
    a.customer_id
FROM
    customer_events a
LEFT JOIN
    customer_events b
    ON a.customer_id = b.customer_id
    AND b.event_type = 'created_order'
    AND b.event_timestamp > a.event_timestamp
WHERE
    a.event_type = 'added_items_to_basket'
    AND (EXTRACT(EPOCH FROM (NOW() - a.event_timestamp)) / 3600) >= 1
    AND b.customer_id IS NULL;
Enter fullscreen mode Exit fullscreen mode

To the untrained eye, this query might look intimidating, but it's actually quite simple. Let's break it down.

We performed a LEFT JOIN on the customer_events table (a) with itself (b). This retrieves all rows from the left table (a) along with their matching rows from the right table (b) based on the specified join conditions. In the right side, all columns will be NULL if there are no subsequent 'created_order' event for the customer on the left side, and will have values otherwise. We use the b.customer_id IS NULL condition to only get customers where there are no subsequent 'created_order' event.

Query Output:

customer_id
30
50
60

If you are not sure what the output of the self join looks like before filtering using b.customer_id IS NULL, you may run this example I prepared on db-fiddle.

It is worth noting that this is a simplified version of the actual query. There are other cases that we need to handle, such as ensuring that the order belongs to the specified basket, or only taking the last event into account. However, this is beyond the scope of this article.

Now, I want you to go back and read the plain text we wrote for this query, and then check the query again. You will find that it does exactly what we described.

Another Approach

There is an alternative way we can write this query to achieve the same result, using sub-queries:

SELECT
    a.customer_id
FROM
    customer_events a
WHERE
    a.event_type = 'added_items_to_basket'
    AND (EXTRACT(EPOCH FROM (NOW() - a.event_timestamp)) / 3600) >= 1
    AND NOT EXISTS (
        SELECT 1
        FROM customer_events b
        WHERE a.customer_id = b.customer_id
        AND b.event_type = 'created_order'
        AND b.event_timestamp > a.event_timestamp
    );
Enter fullscreen mode Exit fullscreen mode

Which approach to use? We have two factors to consider: readability and performance.

In terms of readability, it is a personal preference. For me, I find both equally readable.

In terms of performance, it depends on many factors. It is always advisable to check the execution plans of your queries.

Conclusion

Whenever you find yourself in a situation where you want to exclude certain rows from your result based on conditions that involve other rows in the same table, SQL self joins can come in handy.

I hope you found this article helpful. Please let me know if you have any feedback. Happy querying!

Top comments (0)