Depending on the type of questions we want to answer there might be times when we need to compare aggregated data to unaggregated data in SQL.
This is where the fact that the GROUP BY clause forces us to 'squeeze' all unaggregated columns into one row can be a headache.
In this article, you would see how a window function can make this quite easy.
The Big question
Suppose we wanted to answer the following question using our COVID Dataset:
How does each case recorded in Nigeria from March 2020 to June 2020 compared with the maximum case recorded within the same time frame.
We can answer this question by returning a table that has the recorded case for each day and the maximum case recorded in a day within the time frame. Something like this:
date | country | confirmed_today | max_value_in_timeframe |
---|---|---|---|
2020-03-01 | Nigeria | value1 | max_value_in_result_set |
2020-03-02 | Nigeria | value2 | max_value_in_result_set |
2020-03-03 | Nigeria | value3 | max_value_in_result_set |
2020-03-04 | Nigeria | value4 | max_value_in_result_set |
This would allow us to easily plot graphs that help us with the comparison. Let's see how we can solve this
The Inefficient solution
We can generate this result using a subquery in the SELECT clause. This subquery would calculate the maximum value and return that value for each row in our table. See the code below:
SELECT
date,
countries_and_territories as country,
daily_confirmed_cases AS confirmed_today,
(
SELECT
MAX(daily_confirmed_cases)
FROM
`bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`
WHERE
geo_id = 'NG'
AND date >= '2020-03-01'
AND date < '2020-06-01'
) as highest_case_recorded_in_a_day
FROM
`bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`
WHERE
geo_id = 'NG'
AND date >= '2020-03-01'
AND date < '2020-06-01';
This returns the following results:
date | country | confirmed_today | highest_case_recorded_in_a_day |
---|---|---|---|
2020-03-01 | Nigeria | 0 | 533 |
2020-03-02 | Nigeria | 0 | 553 |
2020-03-10 | Nigeria | 1 | 553 |
2020-03-15 | Nigeria | 0 | 553 |
2020-03-16 | Nigeria | 0 | 553 |
2020-03-17 | Nigeria | 0 | 553 |
2020-03-18 | Nigeria | 1 | 553 |
2020-03-19 | Nigeria | 5 | 553 |
2020-03-20 | Nigeria | 0 | 553 |
2020-03-21 | Nigeria | 4 | 553 |
2020-03-22 | Nigeria | 10 | 553 |
2020-03-23 | Nigeria | 8 | 553 |
2020-03-24 | Nigeria | 10 | 553 |
2020-03-25 | Nigeria | 4 | 553 |
2020-03-26 | Nigeria | 7 | 553 |
2020-03-27 | Nigeria | 14 | 553 |
2020-03-28 | Nigeria | 16 | 553 |
2020-03-29 | Nigeria | 16 | 553 |
2020-03-30 | Nigeria | 0 | 553 |
2020-03-31 | Nigeria | 34 | 553 |
2020-04-01 | Nigeria | 0 | 553 |
2020-04-02 | Nigeria | 20 | 553 |
2020-04-03 | Nigeria | 23 | 553 |
2020-04-04 | Nigeria | 16 | 553 |
2020-04-05 | Nigeria | 20 | 553 |
2020-04-06 | Nigeria | 22 | 553 |
2020-04-07 | Nigeria | 6 | 553 |
2020-04-08 | Nigeria | 16 | 553 |
2020-04-09 | Nigeria | 22 | 553 |
2020-04-10 | Nigeria | 12 | 553 |
2020-04-11 | Nigeria | 17 | 553 |
2020-04-12 | Nigeria | 13 | 553 |
2020-04-13 | Nigeria | 5 | 553 |
2020-04-14 | Nigeria | 20 | 553 |
2020-04-15 | Nigeria | 30 | 553 |
2020-04-16 | Nigeria | 34 | 553 |
2020-04-17 | Nigeria | 35 | 553 |
2020-04-18 | Nigeria | 51 | 553 |
2020-04-19 | Nigeria | 49 | 553 |
2020-04-20 | Nigeria | 85 | 553 |
2020-04-21 | Nigeria | 38 | 553 |
2020-04-22 | Nigeria | 117 | 553 |
2020-04-23 | Nigeria | 91 | 553 |
2020-04-24 | Nigeria | 108 | 553 |
2020-04-25 | Nigeria | 114 | 553 |
2020-04-26 | Nigeria | 87 | 553 |
2020-04-27 | Nigeria | 91 | 553 |
2020-04-28 | Nigeria | 64 | 553 |
2020-04-29 | Nigeria | 195 | 553 |
2020-04-30 | Nigeria | 196 | 553 |
2020-05-01 | Nigeria | 204 | 553 |
2020-05-02 | Nigeria | 238 | 553 |
2020-05-03 | Nigeria | 218 | 553 |
2020-05-04 | Nigeria | 170 | 553 |
2020-05-05 | Nigeria | 244 | 553 |
2020-05-06 | Nigeria | 148 | 553 |
2020-05-07 | Nigeria | 195 | 553 |
2020-05-08 | Nigeria | 381 | 553 |
2020-05-09 | Nigeria | 386 | 553 |
2020-05-10 | Nigeria | 239 | 553 |
2020-05-11 | Nigeria | 248 | 553 |
2020-05-12 | Nigeria | 242 | 553 |
2020-05-13 | Nigeria | 146 | 553 |
2020-05-14 | Nigeria | 184 | 553 |
2020-05-15 | Nigeria | 191 | 553 |
2020-05-16 | Nigeria | 283 | 553 |
2020-05-17 | Nigeria | 176 | 553 |
2020-05-18 | Nigeria | 338 | 553 |
2020-05-19 | Nigeria | 216 | 553 |
2020-05-20 | Nigeria | 226 | 553 |
2020-05-21 | Nigeria | 276 | 553 |
2020-05-22 | Nigeria | 339 | 553 |
2020-05-23 | Nigeria | 245 | 553 |
2020-05-24 | Nigeria | 265 | 553 |
2020-05-25 | Nigeria | 313 | 553 |
2020-05-26 | Nigeria | 229 | 553 |
2020-05-27 | Nigeria | 276 | 553 |
2020-05-28 | Nigeria | 389 | 553 |
2020-05-29 | Nigeria | 182 | 553 |
2020-05-30 | Nigeria | 387 | 553 |
2020-05-31 | Nigeria | 553 | 553 |
As you can see it returns exactly the shape of the data that we want in our table. Now we can go ahead and generate a chart, like the one shown below, which are easier to comprehend:
You can interact with the chart here
So what are the problems
The first issue is that subqueries perform calculations by running through the entire table. This simply means that the query above runs through the table twice to get the results that we got. This can affect performance
Another problem is that subqueries are quite difficult to read and maintain. It is quite difficult to grasp what the query above is doing at a glance
The window function solution
Now let's try to solve this problem using a window function.
The way we would do this is to replace the subquery part of the first query with our window function, like so:
SELECT
date,
countries_and_territories as country,
daily_confirmed_cases AS confirmed_today,
MAX(daily_confirmed_cases) OVER() as highest_case_recorded_in_a_day
FROM
`bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`
WHERE
geo_id = 'NG'
AND date >= '2020-03-01'
AND date < '2020-06-01'
The OVER() function is what starts our window function(as stated in the previous article in the series) while the MAX() gets the maximum case recorded on a particular day in the specified time frame.
The window function uses the ResultSet gotten from the WHERE part of the query thus there it does not run through the entire table like the subquery alternative.
In the end, you get a much shorter, more readable, maintainable and performant code.
Conclusion
What I have described in this article is a very simple use case of a window function. Let me know what insightful questions related to COVID-19 we could answer by comparing aggregated and unaggregated data in the comments section.
Next in this series, we would be talking about performing calculations that are dependent on the value of other rows eg. running totals, daily_increase in cases etc.
Top comments (0)