Imagine you are analyzing sales transactions and want to see a running total of daily sales. Or you want to find the best-performing product in each region based on the total sales value. These are all common questions from the business when working with data. In both scenarios, notice that we cannot change the granularity of the data but need an aggregate value. For running totals, we still want the data to be at a daily level and also have an aggregate of all the sales that happened until that day. For the best-performing product, we still want the data to be at the product level but need to get which product has the highest sales in a given region. While we might think of using GROUP BY
to achieve these, GROUP BY
aggregates the data losing the granularity. For these scenarios, SQL provides a powerful feature called Window functions, a way to aggregate data (and more) without losing granularity.
What are window functions?
Window functions perform operations on a set of table rows that are somehow related to the current row. This set of related rows is called a Window. Unlike aggregate functions like SUM()
with GROUP BY
, which collapses the rows to the group level, window functions retain the original number of rows in the output.
Understanding with an example
Problem definition
Let us break that definition down and understand with an example. Firstly in window functions, we perform operations on every row. This operation which is performed on every row takes into account a set of related rows called windows. Imagine we have a dataset of product sales in each region as shown below. Let us say we want to calculate the percentage contribution of each product to the overall sales in that region. So, for every row in the dataset, if we have the total sales in the region, then we can divide the product sales by the region sales to get the percentage. Below is a sample dataset.
id | region | product | sales_amount |
---|---|---|---|
1 | Bangalore | Ice cream | 5000 |
2 | Bangalore | Chocolate | 10000 |
3 | Bangalore | Soft drinks | 2000 |
4 | Delhi | Ice cream | 1000 |
5 | Delhi | Chocolate | 3000 |
6 | Delhi | Soft drinks | 6000 |
7 | Hyderabad | Ice cream | 8000 |
8 | Hyderabad | Chocolate | 1000 |
9 | Hyderabad | Soft drinks | 3500 |
10 | Mumbai | Ice cream | 12000 |
11 | Mumbai | Chocolate | 5800 |
12 | Mumbai | Soft drinks | 12000 |
Table: Sample dataset showing sales across cities and products
Solution using window function concepts
Let us break this problem into steps as per the definition of the window functions.
For every row in the dataset, we need to apply an operation. In this scenario, we apply the SUM(sales_amount)
operation to get the total sales of all the products.
But, this operation has to take into account only a set of related rows along with the current row. In this case, these related rows are all the rows belonging to the same region. For example, if we are operating on the row with id=1
, we know this row belongs to the Bangalore region. Now to compute SUM(sales_amount)
, the window function considers rows with IDs 1,2 and 3 since all three belong to the Bangalore window. Now combining the sales amount of all three rows gives the value 17000 which is computed against row id 1. By repeating this logic for every row, we get the total_sales_in_region
for every row. So, in this case, all the rows in a region are considered a window. Try to compute the total_sales_in_region
for row_id=4
using the above explanation.
As a final step, we can simply divide sales_amount by total_sales_in_region to get the percentages. Note that this step is not shown in the screenshot below.
Notice how the windows are defined based on the values in the region column. Also, the total_sales_in_region is computed by summing up sales_amount of all the rows in the respective windows. |
With this example, we can see how we computed the aggregate of sales at the region level and still maintained the table at region-product granularity. This is how window functions operate in SQL.
How is GROUP BY different?
Before we move forward, let us be clear that GROUP BY
is different and we would not be able to achieve the same result with GROUP BY
. If we SUM(total_sales)
using GROUP BY
on the region column, then the output will be at the region level as shown below. Notice how GROUP BY
reduced the total rows in the output to just show totals at the region level. We no longer have access to the product information in this.
Output after aggregating the sales_amount at region level using GROUP BY |
Here is the query used to get the total sales at the region level using GROUP BY
.
-- GROUP BY query to get the total sales amount at the region level
SELECT region,
SUM(sales_amount) AS total_sales_in_region
FROM region_product_sales
GROUP BY region
Window functions β Syntax
Now let us understand the basic syntax of a window function in MySQL. Again starting with the definition, we define the aggregate function for each row over a defined window. We define the window by telling SQL which columns it should use to partition the dataset into multiple windows. Here is the syntax for the above problem to get the total sales in the region for every row.
Breaking down the syntax of Window functions |
SUM(sales_amount) OVER (PARTITION BY region) AS total_sales_in_region
All this syntax goes in-line in the SELECT part where we list the columns. So, the full SQL query for the example problem will be as follows.
-- Window functions SQL syntax using PARTITION BY
SELECT region,
product,
sales_amount,
SUM(sales_amount) OVER (PARTITION BY region) AS total_sales_in_region
FROM region_product_sales
Conclusion
With this example, I hope it is clear what window functions are and how they work. We also saw how the window functions output differs from aggregates computed using GROUP BY
. Finally, we saw the syntax to use window functions with PARTITION BY
in MySQL. This is only a simple example and there are a lot more features to explore in window functions in SQL.
Whatβs next?
In this blog, we only discussed a very basic example of using window functions in MySQL. Window functions offer several operations that you can do on the rows apart from aggregate functions like SUM
, COUNT
. You can calculate for example the rank of every product based on their sales amount using the RANK
window function. We can also, form more complex windows using multiple columns in the PARTITION BY
clause. You can see the full list of MySQL window functions here. Find the list of window functions for Postgres here. In the upcoming blogs, we will discuss more complex scenarios using window functions and also introduce how to use ORDER BY
in window functions.
Hope you enjoyed the read. Do share any feedback in the comments.
Top comments (0)