This is sort of a continuation of this
SQL Subqueries in action
Chidiebere Ogujeiofor ・ Dec 13 '19 ・ 8 min read
You can generate the data I used in this blogpost from this GitHub Gist.
In that post, I talked about subqueries and how they work. How you can use them to answer complex questions. In this post, I would extend on that and talk about Common Table Expression(CTEs).
What are they
According to PostgreSQL Tutorial:
A common table expression is a temporary result set which you can reference within another SQL statement including SELECT, INSERT, UPDATE or DELETE. Common Table Expressions are temporary in the sense that they only exist during the execution of the query.
Once created the CTE can then be treated as a table thus you can perform JOINs, put them in a FROM, etc.
Compared to nested subqueries, CTEs have the following advantages:
- Since they store data temporarily, they can improve performance than subqueries
- They are more readable than nested subqueries
Single CTE in a query syntax
CTEs are created using the WITH keyword and are immediately followed by the main query. Below is an overall syntax for CTEs:
WITH <cte-name> AS (
<the CTE query>
)
<Final query>
For example, in the post on subqueries, we answered the question
In which courses did the student with an ID of 9 scores above his overall average? What was his score in those courses?
We can answer that question, using CTEs as following:
WITH student_average AS (
SELECT ROUND(AVG(score),2) AS avg FROM results
WHERE student_id=9
)
SELECT course, score, student_average.avg as average
FROM results
JOIN student_average
on TRUE
WHERE student_id=9 AND score > student_average.avg
ORDER BY course;
course | score | average |
---|---|---|
Biology | 84 | 82.08 |
Computer Science | 93 | 82.08 |
History | 84 | 82.08 |
Physics | 85 | 82.08 |
Statistics | 85 | 82.08 |
With similar results as when we used subqueries in the previous post.
Multiple CTEs
It is possible to declare multiple CTEs such that one CTE may or may not call other CTEs. Again this technique helps make queries faster and readable as parts of the query are named.
Below is the overall structure of using multiple CTEs:
WITH cte1 AS (
<SOME-sql-statement>
),
cte2 AS (
<some-sql-that-may-or-may-not-use-cte1>
),
.
.
.,
cteN AS (
<some-sql-that-may-or-may-not-previous-ctes>
)
<main/final-sql-statement>
;
For example, say we want to answer the following question:
What was the class average and which students' average was higher than the class average? What was each of those students' average?
We can answer that question by retrieving a table with student_ids, their average, the class average and filter only when the student_average is more than the class average. We can do this using the following query:
WITH
each_student_avg AS (
SELECT student_id, ROUND(AVG(score),2) as student_overall_avg
FROM results
GROUP BY results.student_id
),
class_average AS (
SELECT ROUND(AVG(score),2) as class_avg
FROM results
)
SELECT student_id, student_overall_avg, cls.class_avg
FROM each_student_avg AS esa
JOIN class_average AS cls
ON TRUE
WHERE student_overall_avg > class_avg
;
This gives the following result:
student_id | student_overall_avg | class_avg |
---|---|---|
4 | 65.62 | 65.46 |
18 | 66.23 | 65.46 |
8 | 66.92 | 65.46 |
13 | 68.62 | 65.46 |
16 | 68.69 | 65.46 |
2 | 69.62 | 65.46 |
5 | 69.62 | 65.46 |
10 | 70.38 | 65.46 |
19 | 71.77 | 65.46 |
17 | 72.00 | 65.46 |
7 | 75.77 | 65.46 |
14 | 75.85 | 65.46 |
9 | 82.08 | 65.46 |
24 | 86.00 | 65.46 |
With this ResultSet we can see that the class average is 65.46
and the ids of the students whose average is above the class average.
NB
After each CTE is declared (class_average
and each_student_avg
), their result is stored in memory. Thus, the calls to class_average
and each_student_avg
in the main query performs operations on the ResultSet stored in each of the CTE and not the whole results
table.
Readability powers
Without CTEs here's how the previous query would look like:
SELECT student_id, student_overall_avg, cls.class_avg
FROM (
SELECT student_id, ROUND(AVG(score),2) as student_overall_avg
FROM results
GROUP BY results.student_id
) AS esa
JOIN (
SELECT ROUND(AVG(score),2) as class_avg
FROM results
) AS cls
ON TRUE
WHERE student_overall_avg > class_avg;
You can agree that the query with CTEs improves the readability of the query.
A more complex question
CTEs are particularly handy in answering questions that are very complex and involve a lot of aggregation/reshaping of the data.
For example, say our school has a policy of placing students on Probation or Withdrawing their admission based on poor performance.
When a student gets below 60 in any course in a particular semester and they are not on probation from the previous term, then they are placed on probation for the next semester.
Also, if they get below 60 in any course and they were on probation from the previous term then their admission is withdrawn. In summary:
When a student gets below 60 in any course in the current semester and he got below 60 in any course in the previous semester then the admission of such a student should be Withdrawn
When a student gets below 60 in any course in the current semester and passed all his courses in the previous semester then such a student should be placed on probation for the next semester
Assuming that Semester 2 started from '2019-08-01' and ended in '2019-10-30' while Semester 1 started from '2019-05-01' and ended in '2019-07-30':
Which students that should be placed on placed on probation and which should have their admission withdrawn at the end of Semester 2? Which courses did they fail? What was their score in those courses and when did they take them?
We can answer the above question using a table that has student ID, a status that can be either 'Withdrawal' or 'Probation' and a column that contains all the courses the student got below 60 in both semesters, the date the course was taken and their score in a comma-separated list. Something like:
student_id | status | course_info |
---|---|---|
{id-1} | Probation | {score1} {course1} {date-taken1}, ...,{scoreN} {courseN} {date-takenN} |
{id-2} | Withdrawal | {score1} {course1} {date-taken1}, ...,{scoreN} {courseN} {date-takenN} |
. | . | . |
. | . | . |
. | . | . |
Using CTEs, we begin with the WITH
keyword:
WITH
Then we get all those that scored below 60 in semester 1 like so
semester_1_lt_60 AS (
--Retrieves all the data for semester 1 that got below 60
SELECT student_id, course_end_date,score FROM results
WHERE
score < 60 AND
course_end_date BETWEEN '2019-05-01' AND '2019-07-30'
),
Then we get all those that scored below 60 in semester 2 like so
semester_2_lt_60 AS (
--Retrieves all the data for semester 2 that got below 60
SELECT student_id, course_end_date,score FROM results
WHERE
score < 60 AND
course_end_date BETWEEN '2019-08-01' AND '2019-10-30'
),
We can now get all those due for withdrawals as those who failed courses in both semester 1 and 2
withdrawals as (
-- Retrieves all those that got below 60 in both semester 1 and 2
SELECT s1.student_id
FROM semester_1_lt_60 as s1
JOIN semester_2_lt_60 AS s2
ON s1.student_id = s2.student_id
),
Next, we can retrieve our Probation candidates as those that scored below 60 in semester 2 that are not in the withdrawal
CTE:
probation AS (
-- Retrieves all those that failed only in semester 2 but not in both semester 1 and 2
-- This is essentially removing all those ids in semester_2_lt_60 that are not in withdrawals
SELECT student_id
FROM semester_2_lt_60
WHERE student_id NOT IN
(
SELECT student_id
FROM withdrawals
)
),
Now we create our final CTE which joins both Probation
and Withdrawal
CTEs:
id_with_status AS (
--Adds retrieves both data with the student status for the term
SELECT
DISTINCT COALESCE(w.student_id, p.student_id) AS student_id,
CASE
WHEN w.student_id IS NOT NULL THEN 'Withdrawal'
WHEN p.student_id IS NOT NULL THEN 'Probation'
END AS status FROM withdrawals as w
FULL JOIN probation as p
ON p.student_id = w.student_id
ORDER BY status
)
Finally, our main query simply queries the results table JOINs id_with_status
and performs a string aggregation that returns all the courses that each of the students failed in both semesters with the name of the course, their score and the date the course ended
SELECT r.student_id, id_status.status, STRING_AGG(
CONCAT(r.score, ' ',r.course, ' ', r.course_end_date), ','
) as course_info
FROM results AS r
JOIN id_with_status AS id_status
ON r.student_id = id_status.student_id
WHERE
r.score < 60 AND
r.course_end_date BETWEEN '2019-05-01' AND '2019-10-30'
GROUP BY r.student_id, id_status.status
ORDER BY id_status.status ;
Here is how it looks all together
WITH
semester_1_lt_60 AS (
--Retrieves all the data for semester 1 that got below 60
SELECT student_id, course_end_date,score FROM results
WHERE
score < 60 AND
course_end_date BETWEEN '2019-05-01' AND '2019-07-30'
),
semester_2_lt_60 AS (
--Retrieves all the data for semester 2 that got below 60
SELECT student_id, course_end_date,score FROM results
WHERE
score < 60 AND
course_end_date BETWEEN '2019-08-01' AND '2019-10-30'
),
withdrawals as (
-- Retrieves all those that got below 60 in both semester 1 and 2
SELECT s1.student_id
FROM semester_1_lt_60 as s1
JOIN semester_2_lt_60 AS s2
ON s1.student_id = s2.student_id
),
probation AS (
-- Retrieves all those that failed only in semester 2 but not in both semester 1 and 2
-- This is essentially removing all those ids in semester_2_lt_60 that are not in withdrawals
SELECT student_id
FROM semester_2_lt_60
WHERE student_id NOT IN
(
SELECT student_id
FROM withdrawals
)
),
id_with_status AS (
--Adds retrieves both data with the student status for the term
SELECT
DISTINCT COALESCE(w.student_id, p.student_id) AS student_id,
CASE
WHEN w.student_id IS NOT NULL THEN 'Withdrawal'
WHEN p.student_id IS NOT NULL THEN 'Probation'
END AS status FROM withdrawals as w
FULL JOIN probation as p
ON p.student_id = w.student_id
ORDER BY status
)
SELECT r.student_id, id_status.status, STRING_AGG(
CONCAT(r.score, ' ',r.course, ' ', r.course_end_date), ','
) as course_info
FROM results AS r
JOIN id_with_status AS id_status
ON r.student_id = id_status.student_id
WHERE
r.score < 60 AND
r.course_end_date BETWEEN '2019-05-01' AND '2019-10-30'
GROUP BY r.student_id, id_status.status
ORDER BY id_status.status ;
Below is the final result:
student_id | status | course_info |
---|---|---|
6 | Probation | 53 Statistics 2019-10-05 |
13 | Probation | 58 Statistics 2019-10-05 |
15 | Probation | 57 Statistics 2019-10-05 |
18 | Probation | 0 History 2019-10-05 |
20 | Probation | 40 Chemistry 2019-10-05,54 Statistics 2019-10-05,0 History 2019-10-05 |
23 | Probation | 36 Chemistry 2019-10-05,57 Statistics 2019-10-05 |
1 | Withdrawal | 0 Chemistry 2019-10-05,0 Mathematics 2019-05-09,14 Statistics 2019-10-05 |
12 | Withdrawal | 52 Statistics 2019-10-05,40 Computer Science 2019-05-09,44 Commerce 2019-05-09,47 Mathematics 2019-05-09 |
21 | Withdrawal | 0 Chemistry 2019-10-05,35 Computer Science 2019-05-09,48 Statistics 2019-10-05 |
22 | Withdrawal | 55 Commerce 2019-05-09,49 Statistics 2019-10-05,41 Chemistry 2019-10-05 |
Looking at the SQL query above you can easily see how we are storing data in memory and re-using them in subsequent CTEs and the main query thus breaking down the problem into smaller bits until we have enough info to generate/return our final ResultSet. This greatly improves performance as opposed to running through the table several times.
The query is also quite readable as we are naming parts of the query while using the CTE names.
I would like to make one last attempt to show you how readable CTEs are by showing you the nested subquery equivalent( Brace yourself, comrades!⛈⚡️🎇):
SELECT r.student_id, id_status.status, STRING_AGG(
CONCAT(r.score, ' ',r.course, ' ', r.course_end_date), ','
) as course_info
FROM results AS r
JOIN (
SELECT
DISTINCT COALESCE(w.student_id, p.student_id) AS student_id,
CASE
WHEN w.student_id IS NOT NULL THEN 'Withdrawal'
WHEN p.student_id IS NOT NULL THEN 'Probation'
END AS status FROM (
SELECT s1.student_id
FROM (
SELECT student_id, course_end_date,score FROM results
WHERE
score < 60 AND
course_end_date BETWEEN '2019-05-01' AND '2019-07-30'
) as s1
JOIN (
SELECT student_id, course_end_date,score FROM results
WHERE
score < 60 AND
course_end_date BETWEEN '2019-08-01' AND '2019-10-30'
) AS s2
ON s1.student_id = s2.student_id
) as w
FULL JOIN (
SELECT student_id
FROM (
SELECT student_id, course_end_date,score FROM results
WHERE
score < 60 AND
course_end_date BETWEEN '2019-08-01' AND '2019-10-30'
) AS lt_60_in_semester_1
WHERE student_id NOT IN
(
SELECT student_id
FROM (
SELECT s1.student_id
FROM (
SELECT student_id, course_end_date,score FROM results
WHERE
score < 60 AND
course_end_date BETWEEN '2019-05-01' AND '2019-07-30'
) as s1
JOIN (
SELECT student_id, course_end_date,score FROM results
WHERE
score < 60 AND
course_end_date BETWEEN '2019-08-01' AND '2019-10-30'
) AS s2
ON s1.student_id = s2.student_id
) AS withdrawal
)
)as p
ON p.student_id = w.student_id
ORDER BY status
) AS id_status
ON r.student_id = id_status.student_id
WHERE
r.score < 60 AND
r.course_end_date BETWEEN '2019-05-01' AND '2019-10-30'
GROUP BY r.student_id, id_status.status
ORDER BY id_status.status ;
I believe you now get the point! 😉
Conclusion
In this blogpost:
- we have looked at CTEs what they are and how they work
- we have seen that CTEs are more readable than nested subqueries
- we have also seen that CTEs can improve performance.
- finally, that CTEs can be used to efficiently answer complex questions
I would like to conclude guys by urging you to look at the query that used CTEs to answer the question in the Complex Question
section. How would you make it more efficient?
Folks its time to throw away those unreadable/inefficient nested subqueries
Top comments (3)
Great article. One key advantage of CTEs not mentioned is that you can effectively debug/ unit test them by just running select * from CTE immediately after the CTE and check the results rather than having to comment out the outer queries.
The first example could also have been achieved using SUM() OVER () which I think is a cleaner syntax but as this is an article about CTEs we can let it go.
Instead of writing two CTEs, one for each semester, you could have had a single CTE with the semester in the dataset and then joined to it twice using the semester column as a filter in the join clause.
One other trick with CTEs us that normally you can't use windowed functions like RANK() in the WHERE clause, but if you wrap the initial query as a CTE and give the RANK() result a column name you can use it in the main query WHERE clause.
All fields returned by a CTE have to have a unique column name.
CTEs are not always the most performant solution, at least on MSSQL, in my experience very large CTE results sets seem to put pressure on memory and can also stop the query optimiser from using indexes and table statistics when generating a query plan. As there are no indexes bon CTEs you can often get better performance by persisting the data to temp tables and building indexes on them
Makes sense. Great comment
edited this to have syntax highlighting. eg
Some comments may only be visible to logged-in visitors. Sign in to view all comments.