The Problem
The challenge involves two tables: Trips
and Users
. The Trips
table holds information about all taxi trips, including the unique trip id, client id, driver id, city id, status of the trip, and the date of the request. The Users
table maintains records of all users, encompassing the unique user id, user's status (banned or not), and their role (client, driver, partner).
The task is to compute the cancellation rate each day between "2013-10-01" and "2013-10-03". The cancellation rate is defined as the ratio of the number of cancelled requests by unbanned users (both client and driver) to the total number of requests made by unbanned users that day. The cancellation rate should be expressed up to two decimal places. The tables' schemas are as follows:
Trips
Column Name | Type |
---|---|
id | int |
client_id | int |
driver_id | int |
city_id | int |
status | enum |
request_at | date |
Users
Column Name | Type |
---|---|
users_id | int |
banned | enum |
role | enum |
Explanation
Consider the following input:
Trips
id | client_id | driver_id | city_id | status | request_at |
---|---|---|---|---|---|
1 | 1 | 10 | 1 | completed | 2013-10-01 |
2 | 2 | 11 | 1 | cancelled_by_driver | 2013-10-01 |
3 | 3 | 12 | 6 | completed | 2013-10-01 |
4 | 4 | 13 | 6 | cancelled_by_client | 2013-10-01 |
5 | 1 | 10 | 1 | completed | 2013-10-02 |
6 | 2 | 11 | 6 | completed | 2013-10-02 |
7 | 3 | 12 | 6 | completed | 2013-10-02 |
8 | 2 | 12 | 12 | completed | 2013-10-03 |
9 | 3 | 10 | 12 | completed | 2013-10-03 |
10 | 4 | 13 | 12 | cancelled_by_driver | 2013-10-03 |
Users
users_id | banned | role |
---|---|---|
1 | No | client |
2 | Yes | client |
3 | No | client |
4 | No | client |
10 | No | driver |
11 | No | driver |
12 | No | driver |
13 | No | driver |
For this data, the expected output is:
Day | Cancellation Rate |
---|---|
2013-10-01 | 0.33 |
2013-10-02 | 0.00 |
2013-10-03 | 0.50 |
Explanation:
On 2013-10-01:
- There were 4 requests in total, 2 of which were canceled.
- The request with Id=2 was made by a banned client (User_Id=2), so it is excluded from the calculation.
- Hence, considering unbanned users, there were 3 requests in total, 1 of which was canceled.
- Therefore, the Cancellation Rate is (1 / 3) = 0.33
On 2013-10-02:
- There were 3 requests in total, none of which were canceled.
- The request with Id=6 was made by a banned client, so it is excluded from the calculation.
- Thus, considering unbanned users, there were 2 requests in total, none of which were canceled.
- Hence, the Cancellation Rate is (0 / 2) = 0.00
On 2013-10-03:
- There were 3 requests in total, 1 of which was canceled.
- The request with Id=8 was made by a banned client, so it is excluded from the calculation.
- Hence, considering unbanned users, there were 2 requests in total, 1 of which was canceled.
- Therefore, the Cancellation Rate is (1 / 2) = 0.50
The cancellation rate for each day is calculated considering only unbanned users (both client and driver) and requests that occurred on the day. If a request was made by a banned user, it is ignored in the calculation. The final cancellation rate is presented as a fraction of the total requests made by unbanned users that were canceled, rounded to two decimal places.
The Solution
Let's delve into three different approaches to solve this problem. As we progress from Source Code 1 to Source Code 3, the code gets more efficient and easier to read, even though all three of them provide the correct results.
Source Code 1
This SQL query makes use of a Common Table Expression (CTE) called 'cancellation_rates'. It generates a temporary result set by joining the Trips
table with the Users
table twice, once for the client and once for the driver. This query uses window functions COUNT(*) OVER()
to get the number of unbanned requests and unbanned statuses for each day. It then calculates the cancellation rate in the main SELECT
query.
This query might seem a bit complex due to the use of window functions and CTE. However, it provides a clear separation of calculating the unbanned request count and the actual calculation of cancellation rate.
The performance of this query is good but not the best, with a runtime of 572ms, beating 45.44% of other submissions.
WITH cancellation_rates AS (
SELECT DISTINCT
t.request_at AS Day,
t.status,
COUNT(*) OVER(PARTITION BY t.request_at, t.status) AS unbanned_stat,
COUNT(*) OVER(PARTITION BY t.request_at) AS unbanned_req
FROM Trips t JOIN Users c ON t.client_id = c.users_id JOIN Users d ON t.driver_id = d.users_id
WHERE
c.banned = 'No'
AND d.banned = 'No'
AND t.request_at BETWEEN '2013-10-01' AND '2013-10-03'
)
SELECT
Day,
CASE
WHEN status = 'completed' AND unbanned_stat = unbanned_req THEN 0
ELSE ROUND(SUM(unbanned_stat) OVER (PARTITION BY Day, unbanned_stat) * 1.0 / unbanned_req, 2)
END AS 'Cancellation Rate'
FROM cancellation_rates
WHERE
status LIKE 'cancelled%'
OR (status = 'completed' AND unbanned_stat = unbanned_req)
Source Code 2
This query simplifies the approach by removing the use of a CTE. Instead, it calculates the cancellation rate directly within the main SELECT
statement using conditional aggregation. It uses a CASE
statement inside the SUM()
function to count the number of cancellations and then divides that by the total number of requests.
This approach is easier to understand but has a slightly worse performance, with a runtime of 741ms, beating 12.61% of other submissions.
SELECT
t.request_at AS Day,
ISNULL(
ROUND(
CAST(
SUM(
CASE
WHEN t.status LIKE 'cancelled%' THEN 1
ELSE 0
END
) AS FLOAT) /
CAST(
COUNT(*) AS FLOAT),
2),
0) AS 'Cancellation Rate'
FROM
Trips t
JOIN
Users c ON t.client_id = c.users_id AND c.banned = 'No'
JOIN
Users d ON t.driver_id = d.users_id AND d.banned = 'No'
WHERE
t.request_at BETWEEN '2013-10-01' AND '2013-10-03'
GROUP BY
t.request_at
Source Code 3
The third query is even more simplified and performs better than the previous two queries. It removes the explicit JOIN
operations, replacing them with subqueries in the WHERE
clause to filter out banned users.
This not only improves readability but also enhances performance because the query doesn't need to create large intermediate join tables. Instead, it filters rows directly based on client_id and driver_id, which is more efficient.
This query has the best performance among the three, with a runtime of 435ms, beating 81.43% of other submissions.
SELECT
t.request_at AS Day,
ISNULL(
ROUND(
SUM(
CASE
WHEN t.status LIKE 'cancelled%' THEN 1.0
ELSE 0.0
END
) /
COUNT(*),
2),
0) AS 'Cancellation Rate'
FROM
Trips t
WHERE
t.request_at BETWEEN '2013-10-01' AND '2013-10-03'
AND t.client_id IN (SELECT users_id FROM Users WHERE banned = 'No' AND role = 'client')
AND t.driver_id IN (SELECT users_id FROM Users WHERE banned = 'No' AND role = 'driver')
GROUP BY
t.request_at
Conclusion
From these solutions, we can learn that while different SQL queries can achieve the same result, their performance can vary based on how we leverage the SQL features and functions. Using joins and subqueries strategically can have a significant impact on the query execution speed.
You can find the original problem at LeetCode.
For more insightful solutions and tech-related content, feel free to connect with me on my Beacons page.
Top comments (0)