Sometimes when writing queries for reporting purposes, the queries required to manipulate the data can get complex.
If you are able to use PHP or Python to manipulate the data after running a query, then the query required to get the desired result is much simpler.
But sometimes you just have to find a way to fetch the data you need using only a SQL query.
For the purposes of this article, the desired outcome from the query is to fetch the total amount transacted each day and display both the daily amount and a running tally for a 7 day period.
Date | Amount | Tally |
---|---|---|
2023-01-01 | 28.50 | 28.50 |
2023-01-02 | 40.50 | 69.00 |
2023-01-03 | 15.50 | 84.50 |
2023-01-04 | 48.00 | 132.50 |
2023-01-05 | 0.00 | 132.50 |
2023-01-06 | 0.00 | 132.50 |
2023-01-07 | 30.00 | 162.50 |
The data queried to get this output is
transaction_date | amount |
---|---|
2023-01-01 12:55:19 | 18.00 |
2023-01-01 23:05:43 | 10.50 |
2023-01-02 10:29:23 | 10.50 |
2023-01-02 11:40:19 | 30.00 |
2023-01-03 11:08:33 | 10.50 |
2023-01-03 11:38:20 | 5.00 |
2023-01-04 09:32:00 | 18.00 |
2023-01-04 10:52:01 | 30.00 |
2023-01-07 16:14:27 | 30.00 |
Getting Started
With any complex query, the trick is to break it down into its parts and work on each part until you are ready to put it all together.
Here I would run a query to fetch the data for the date range just so I can see what I need to do.
SELECT transaction_date, amount
FROM subscriptions
WHERE transaction_date
BETWEEN '2023-01-01 00:00:00'
AND '2023-01-07 23:59:59'
transaction_date | amount |
---|---|
2023-01-01 12:55:19 | 18.00 |
2023-01-01 23:05:43 | 10.50 |
2023-01-02 10:29:23 | 10.50 |
2023-01-02 11:40:19 | 30.00 |
2023-01-03 11:08:33 | 10.50 |
2023-01-03 11:38:20 | 5.00 |
2023-01-04 09:32:00 | 18.00 |
2023-01-04 10:52:01 | 30.00 |
2023-01-07 16:14:27 | 30.00 |
Now I need to group the data by day and SUM the amount transacted each day.
SELECT DATE(transaction_date) transaction_date,
SUM(amount) amount
FROM subscriptions
WHERE transaction_date
BETWEEN '2023-01-01 00:00:00'
AND '2023-01-07 23:59:59'
GROUP BY DATE(transaction_date)
transaction_date | amount |
---|---|
2023-01-01 | 28.50 |
2023-01-02 | 40.50 |
2023-01-03 | 15.50 |
2023-01-04 | 48.00 |
2023-01-07 | 30.00 |
The date is in the right format, the amounts are grouped and summed correctly, but we are missing days.
There are many ways in SQL to fill in the gaps, some are fairly simple and others are overly complex.
For me, my preferred method is to use Common Table Expressions.
What are Common Table Expressions (CTE)
CTEs are reusable temporary result sets. In their simplest form they can replace a sub query or a view, but they have some unique characteristics which makes them powerful.
CTEs are reusable, which means that a query is run once, but the resulting data set can be used/queried multiple times.
Defining a CTE
CTEs are defined using WITH
WITH counter AS (...query...)
This creates a temporary result set that can then be used in a query
WITH counter AS (SELECT 1)
SELECT * FROM counter
If needed, the field name returned from a CTE can be defined
WITH counter (number) AS (SELECT 1)
SELECT * FROM counter
Its also possible to define multiple CTEs
WITH counter (number) AS (SELECT 1),
WITH total (total) AS (SELECT 2)
SELECT * FROM counter
A slightly more complex CTE might look like
WITH counter (number) AS (
SELECT 1
UNION ALL
SELECT 2
)
SELECT * FROM counter
number |
---|
1 |
2 |
Where CTEs become powerful is in their ability to reference themselves recursively
If I wanted to return the numbers 1 to 6 I could use a recursive CTE to generate the numbers
WITH RECURSIVE counter (number) AS (
SELECT 1
UNION ALL
SELECT number + 1 FROM counter WHERE number < 6
)
SELECT * FROM counter
This will produce
number |
---|
1 |
2 |
3 |
4 |
5 |
6 |
Using a CTE to generate a date range
For our original request, all dates within the date range were required even for dates where there were no transactions.
To do this, a CTE can be created to return the date range required.
WITH RECURSIVE dates (day) AS (
SELECT '2023-01-01'
UNION ALL
SELECT day + INTERVAL 1 day FROM dates WHERE day < '2023-01-07'
)
SELECT * FROM dates
day |
---|
2023-01-01 |
2023-01-02 |
2023-01-03 |
2023-01-04 |
2023-01-05 |
2023-01-06 |
2023-01-07 |
Joining CTEs In Queries
For the purposes of the original requirements, I am adding the query that fetched the transactions into its own CTE so I can manipulate the data a little later on
WITH RECURSIVE dates (day) AS (
SELECT '2023-01-01'
UNION ALL
SELECT day + INTERVAL 1 day FROM dates WHERE day < '2023-01-07'
),
transactions AS (
SELECT DATE(transaction_date)
transaction_date,
SUM(amount) amount
FROM subscriptions
WHERE transaction_date
BETWEEN '2023-01-01 00:00:00'
AND '2023-01-07 23:59:59'
GROUP BY DATE(transaction_date))
SELECT * FROM transactions
A feature of CTEs is the ability to check the result from each CTE before putting it all together
SELECT * FROM transactions OR SELECT * FROM dates will help me verify that the data the queries produce are correct.
While this query works as-is, if the date range needs to change, both CTEs would need to be updated.
A CTE can reference another CTE. To change the query so only one CTE needs to be updated in order to change the date range, we can reference the dates CTE from within the transactions CTE.
WHERE transaction_date
BETWEEN (SELECT MIN(day) FROM dates)
AND (SELECT MAX(day) + INTERVAL 1 DAY FROM dates)
Now to join the two CTEs together to see all the data together
WITH RECURSIVE dates (day) AS (
SELECT '2023-01-01'
UNION ALL
SELECT day + INTERVAL 1 day FROM dates WHERE day < '2023-01-07'
),
transactions AS (
SELECT DATE(transaction_date)
transaction_date,
SUM(amount) amount
FROM subscriptions
WHERE transaction_date
BETWEEN (SELECT MIN(day) FROM dates)
AND (SELECT MAX(day) + INTERVAL 1 DAY FROM dates)
GROUP BY DATE(transaction_date))
SELECT day,amount FROM dates d
LEFT JOIN transactions t ON t.transaction_date = d.day
day | amount |
---|---|
2023-01-01 | 28.50 |
2023-01-02 | 40.50 |
2023-01-03 | 15.50 |
2023-01-04 | 48.00 |
2023-01-05 | NULL |
2023-01-06 | NULL |
2023-01-07 | 30.00 |
Other than the NULLs, the data is looking good so far. To remove the NULL values, we can use the COALESCE function.
What Does COALESCE Do?
COALESCE replaces NULL values with the first non null value in a list of values
COALEASCE(value,0) will replace the value with 0 if the value is NULL. Equally COALESCE(value, null,null,0) will also replace with 0 as its the first non null value in the list of values.
SELECT day,COALESCE(amount,0) amount FROM dates d
LEFT JOIN transactions t ON t.transaction_date = d.day
day | amount |
---|---|
2023-01-01 | 28.50 |
2023-01-02 | 40.50 |
2023-01-03 | 15.50 |
2023-01-04 | 48.00 |
2023-01-05 | 0.00 |
2023-01-06 | 0.00 |
2023-01-07 | 30.00 |
The last part of the request was a running tally. To return a running tally, SQL Window functions are required.
What Are Window Functions?
Window functions are functions that are run across the entire result set. They allow you to fetch a running tally, or data from a previous row, or from the rows ahead of the current row.
There are quite a few window function, some of them are LEAD, LAG, NTILE
LEAD - return a value from a row ahead of the current row
LAG - return a value from a previous row
NTILE - break rows out into sections, this would allow you to only return the top 25% of results for example.
You can also use aggregating functions in a window (SUM,AVG etc)
We will use the SUM function to fetch the running tally.
The format of a window function is
function(field) OVER (clause)
The PARTITION BY clause is similar to a GROUP BY but not the same.
In a window function the function is run over every row, it won't group rows together, but it will produce a result that grouped values together in its calculation.
ORDER BY will order the results in a specific way that is then used to calculate a result.
Fetching A Running Tally
To fetch the running tally required, we need to add a window function field into the query
SUM(amount) OVER (ORDER BY day) tally
Predefining a Window
While the method above is fine to use. If you find yourself in a situation where multiple window functions are required, it is possible to pre define the window and reference it.
In our current query
SELECT day,COALESCE(amount,0) amount,
SUM(amount) OVER (ORDER BY day) tally
FROM dates d
LEFT JOIN transactions t ON t.transaction_date = d.day
This could be changed to
SELECT day,COALESCE(amount,0) amount,
SUM(amount) OVER w tally
FROM dates d
LEFT JOIN transactions t ON t.transaction_date = d.day
WINDOW w AS (ORDER BY day)
Final Query
The final query using the window functions and CTEs looks like
WITH RECURSIVE dates (day) AS (
SELECT '2023-01-01'
UNION ALL
SELECT day + INTERVAL 1 day FROM dates WHERE day < '2023-01-07'
),
transactions AS (
SELECT DATE(transaction_date)
transaction_date,
SUM(amount) amount
FROM subscriptions
WHERE transaction_date
BETWEEN (SELECT MIN(day) FROM dates)
AND (SELECT MAX(day) + INTERVAL 1 DAY FROM dates)
GROUP BY DATE(transaction_date)
)
SELECT day,COALESCE(amount,0) amount,
SUM(amount) OVER w tally
FROM dates d
LEFT JOIN transactions t ON t.transaction_date = d.day
WINDOW w AS (ORDER BY day)
day | amount | tally |
---|---|---|
2023-01-01 | 28.50 | 28.50 |
2023-01-02 | 40.50 | 69.00 |
2023-01-03 | 15.50 | 84.50 |
2023-01-04 | 48.00 | 132.50 |
2023-01-05 | 0.00 | 132.50 |
2023-01-06 | 0.00 | 132.50 |
2023-01-07 | 30.00 | 162.50 |
Common Table Expressions and Windows open up SQL for better data processing especially for reporting and data analysis.
CTEs contain powerful features that simplify tasks, they also add readability making building or debugging complex queries easier.
Being reusable means that where you might have to run a subquery more than once in a query, the CTE will only run the query once and use the result set instead of running multiple subqueries, which can add efficiency to the query in some cases.
Window functions are a great way to provide analysis over the result set, this is evident in accounting type reports, but not limited to just accounting, it has a lot of use cases where using window functions allow you to gather additional information needed to produce the results required.
Top comments (1)
Excellent article. For me the key advantages of CTEs are:
Code simplification by avoiding sub queries.
code simplification by avoiding complex join rules (prep the data in a CTE instead)
unit testability of each CTE separately
being able to reference the same CTE multiple times
recursion
preparation of any data that uses variables or configuration instead of 'hiding' it in the main body of code
Be aware that sometimes using lots of CTEs will slow down your code because it is all done in memory. If this happens I tend to convert the CTEs to #tmp tables which can be magnitudes faster and allow you to add indexes if necessary, just remember to be a good boy scout and explicitly drop them when you are done.