The Problem
Consider a table DailySales
with the following schema:
Column Name | Type |
---|---|
date_id | date |
make_name | varchar |
lead_id | int |
partner_id | int |
The table contains the date and the name of the product sold as well as the IDs of the lead and partner to whom it was sold. The name consists of only lowercase English letters. For each date_id
and make_name
, you need to find the number of distinct lead_id
's and distinct partner_id
's. The result table should be returned in any order.
Explanation
Consider the following sample DailySales
table:
date_id | make_name | lead_id | partner_id |
---|---|---|---|
2020-12-8 | toyota | 0 | 1 |
2020-12-8 | toyota | 1 | 0 |
2020-12-8 | toyota | 1 | 2 |
2020-12-7 | toyota | 0 | 2 |
2020-12-7 | toyota | 0 | 1 |
2020-12-8 | honda | 1 | 2 |
2020-12-8 | honda | 2 | 1 |
2020-12-7 | honda | 0 | 1 |
2020-12-7 | honda | 1 | 2 |
2020-12-7 | honda | 2 | 1 |
For this input, the expected output would be:
date_id | make_name | unique_leads | unique_partners |
---|---|---|---|
2020-12-8 | toyota | 2 | 3 |
2020-12-7 | toyota | 1 | 2 |
2020-12-8 | honda | 2 | 2 |
2020-12-7 | honda | 3 | 2 |
Here, each row represents a distinct date and make_name, and the count of unique leads and partners for each combination.
The Solution
Let's discuss two different approaches to solve this problem, highlight their strengths and weaknesses, and explain their underlying structures.
Using COUNT(DISTINCT)
This approach directly counts the distinct lead_id
and partner_id
for each combination of date_id
and make_name
.
SELECT
date_id,
make_name,
COUNT(DISTINCT lead_id) [unique_leads],
COUNT(DISTINCT partner_id) [unique_partners]
FROM DailySales
GROUP BY
date_id,
make_name
This query is fairly straightforward and easy to understand. It works by grouping the data by date_id
and make_name
, and then it counts the distinct leads and partners within each group. However, COUNT(DISTINCT)
can be slow on large data sets. This query runtime is 1073ms, beating 22.3% of other submissions on LeetCode.
Using DENSE_RANK()
This approach first ranks lead_id
and partner_id
within each group defined by date_id
and make_name
, and then it selects the maximum rank for each group as the count of distinct leads or partners.
WITH ranks AS (
SELECT
date_id,
make_name,
DENSE_RANK() OVER (PARTITION BY date_id, make_name ORDER BY lead_id) [lead_rank],
DENSE_RANK() OVER (PARTITION BY date_id, make_name ORDER BY partner_id) [partner_rank]
FROM DailySales
)
SELECT
date_id,
make_name,
(SELECT MAX(lead_rank) FROM ranks r WHERE r.date_id = ds.date_id AND r.make_name = ds.make_name) [unique_leads],
(SELECT MAX(partner_rank) FROM ranks r WHERE r.date_id = ds.date_id AND r.make_name = ds.make_name) [unique_partners]
FROM DailySales ds
GROUP BY
date_id,
make_name
This approach, while more complex, eliminates the need for COUNT(DISTINCT)
, which can improve performance in some situations. However, the inclusion of the subqueries in the SELECT clause may cause performance degradation when handling large datasets. This query runtime is 2467ms, beating 5.17% of other submissions on LeetCode.
Conclusion
Both methods solve the problem, but their performances can vary based on the data set size and distribution. The COUNT(DISTINCT)
method is simpler and performed better on LeetCode's platform, making it the preferred solution in this case. However, the performance may vary in real-world RDBMS depending on several factors such as data distribution and the database engine's optimizations.
You can find the original problem at LeetCode.
For more insightful solutions and tech-related content, feel free to connect with me on my Beacons page.
Top comments (0)