I have recently been helping out my brother on a very exciting project that requires managing student data and we recently implemented some features using subqueries and CTEs.
Thus, I decided to write this post about subqueries and another about CTEs.
You can generate the data I used in this blogpost from this GitHub Gist.
SQL challenges are all about answering questions. To this effect, each example in the blog post asks a question and then answers it using an SQL statement
Subqueries
A subquery is essentially a SELECT
statement that is within another SELECT
statement. It may be placed in SELECT, WHERE or FROM clause of their parent SELECT statement.
Subqueries in SELECT clause
A subquery in the SELECT clause is useful when we want to add a column to our data that cannot be easily gotten from the ResultSet. These subqueries must return a single value.
For example, consider the following ResultSet for the student with an id of 9.
SELECT course, score
FROM results
WHERE student_id=9
ORDER BY course;
course | score |
---|---|
Arts | 80 |
Biology | 84 |
Chemistry | 81 |
Commerce | 82 |
Computer Science | 93 |
Economics | 75 |
English Language | 78 |
History | 84 |
Literature | 79 |
Mathematics | 80 |
Physical Education | 81 |
Physics | 85 |
Statistics | 85 |
To get the student's average score per course:
SELECT ROUND(AVG(score),2) AS average FROM results WHERE student_id=9;
We get an average of 82.08
.
Now say we want to answer the question:
How well did the student with ID of 9 do in each course compared to his/her overall average?
The following query gives us an answer:
SELECT course, score,
(
SELECT ROUND(AVG(score),2) FROM results
WHERE student_id =9
)AS student_avg
FROM results
WHERE student_id=9
ORDER BY course
;
course | score | student_avg |
---|---|---|
Arts | 80 | 82.08 |
Biology | 84 | 82.08 |
Chemistry | 81 | 82.08 |
Commerce | 82 | 82.08 |
Computer Science | 93 | 82.08 |
Economics | 75 | 82.08 |
English Language | 78 | 82.08 |
History | 84 | 82.08 |
Literature | 79 | 82.08 |
Mathematics | 80 | 82.08 |
Physical Education | 81 | 82.08 |
Physics | 85 | 82.08 |
Statistics | 85 | 82.08 |
We could also decide to perform a mathematical operation such that we see the difference between the student's score in each course and the student's overall average as follows:
SELECT course, score,
ROUND(score - (
SELECT AVG(score) FROM results
WHERE student_id =9
),2) AS difference_between_score_and_student_average
FROM results
WHERE student_id=9
ORDER BY difference_between_score_and_student_average DESC;
course | score | difference_between_score_and_student_average |
---|---|---|
Computer Science | 93 | 10.92 |
Statistics | 85 | 2.92 |
Physics | 85 | 2.92 |
Biology | 84 | 1.92 |
History | 84 | 1.92 |
Commerce | 82 | -0.08 |
Chemistry | 81 | -1.08 |
Physical Education | 81 | -1.08 |
Arts | 80 | -2.08 |
Mathematics | 80 | -2.08 |
Literature | 79 | -3.08 |
English Language | 78 | -4.08 |
Economics | 75 | -7.08 |
Some ground rules
When writing subqueries in the SELECT clause it is important to note the following:
- The subquery must return a single value. If it returns a table there would be no way of putting in one column thus an error is thrown
- The WHERE clause of the outer query does not affect the subquery in any way. For instance, if we had omitted the student_id filter in our subquery, that is
SELECT course, score,
(
SELECT ROUND(AVG(score),2)
FROM results
)AS student_avg
FROM results
WHERE student_id=9
ORDER BY course;
We would have gotten totally different results:
course | score | student_avg |
---|---|---|
Arts | 80 | 65.46 |
Biology | 84 | 65.46 |
Chemistry | 81 | 65.46 |
Commerce | 82 | 65.46 |
Computer Science | 93 | 65.46 |
Economics | 75 | 65.46 |
English Language | 78 | 65.46 |
History | 84 | 65.46 |
Literature | 79 | 65.46 |
Mathematics | 80 | 65.46 |
Physical Education | 81 | 65.46 |
Physics | 85 | 65.46 |
Statistics | 85 | 65.46 |
So the WHERE clause in the outer query does not affect the subquery.
Subqueries in WHERE clause
Adding a subquery in the WHERE clause is useful when we want to filter a ResultSet by some value from another query.
For example, to answer the following question:
In which courses did the student with an ID of 9 scores above his overall average? What was his score in those courses?
If we know that the student average is 82.08
and we can answer that question with the following query:
SELECT course, score
FROM results
WHERE student_id=9 AND score > 82.08
ORDER BY course;
And get the following:
course | score |
---|---|
Biology | 84 |
Computer Science | 93 |
History | 84 |
Physics | 85 |
Statistics | 85 |
But if we don't know the average beforehand, we can use a subquery to calculate the average and use the value in the where clause immediately:
SELECT course, score
FROM results
WHERE student_id=9 AND score > (
SELECT AVG(score) FROM results
WHERE student_id=9
)
ORDER BY course;
with the same results but a more reusable query.
When using subqueries in the WHERE clause, we could also return a list of values and filter using the IN operator. For example, we could retrieve courses taken in the first quarter of 2019:
SELECT course, score
FROM results
WHERE student_id=9 AND course IN (
SELECT course
Where course_end_date >= '2019-01-01' AND course_end_date <= '2019-03-31'
)
ORDER BY course
;
course | score |
---|---|
Biology | 84 |
Economics | 75 |
Physical Education | 81 |
NB
Note that the above example is used to demonstrate using IN clause with a subquery. A more efficient way would have been to just do
WHERE student_id=9 AND course_end_date >= '2019-01-01' AND course_end_date <= '2019-03-31'
Some ground rules
When using subqueries in WHERE clause:
- the subqueries would have to return a single column
- if you want to compare a single value ensure that the subquery returns a single row/value.
- if you want to compare a list of values, ensure that the subquery returns a single column.
Subqueries in FROM/JOIN clause
Let us say we want to answer the following question:
How did each student perform in Literature compared to the student's average on all the courses
One way we can answer this is by using a subquery to get each student's average in each course, that is:
SELECT student_id, ROUND(AVG(score),2) as student_avg
FROM results
GROUP BY results.student_id
and then join that to our results table like so:
SELECT
main_result.student_id,
main_result.course,
score,
student_av.student_avg
FROM results AS main_result
JOIN (
SELECT student_id, ROUND(AVG(score),2) as student_avg
FROM results
GROUP BY results.student_id
) as student_av
ON student_av.student_id = main_result.student_id
WHERE course = 'Literature'
ORDER BY score;
student_id | course | score | student_avg |
---|---|---|---|
1 | Literature | 0 | 42.00 |
2 | Literature | 73 | 69.62 |
4 | Literature | 77 | 65.62 |
5 | Literature | 62 | 69.62 |
6 | Literature | 61 | 64.31 |
7 | Literature | 77 | 75.77 |
8 | Literature | 66 | 66.92 |
9 | Literature | 79 | 82.08 |
10 | Literature | 63 | 70.38 |
11 | Literature | 68 | 60.92 |
12 | Literature | 61 | 50.69 |
13 | Literature | 60 | 68.62 |
14 | Literature | 78 | 75.85 |
15 | Literature | 60 | 60.77 |
16 | Literature | 61 | 68.69 |
17 | Literature | 68 | 72.00 |
18 | Literature | 79 | 66.23 |
19 | Literature | 65 | 71.77 |
20 | Literature | 60 | 52.69 |
21 | Literature | 62 | 44.08 |
22 | Literature | 66 | 61.31 |
23 | Literature | 53 | 57.92 |
24 | Literature | 88 | 86.00 |
Now we can take it one step further by getting the difference between the student Overall average and his Literature score
SELECT
course,
main_result.student_id,
score,
student_avg_sub.student_avg,
score - student_avg_sub.student_avg AS score_diff
FROM results AS main_result
JOIN (
SELECT student_id, ROUND(AVG(score),2) as student_avg
FROM results
GROUP BY results.student_id
) as student_avg_sub
ON student_avg_sub.student_id = main_result.student_id
WHERE course = 'Literature'
ORDER BY score_diff DESC;
course | student_id | score | student_avg | score_diff |
---|---|---|---|---|
Literature | 21 | 62 | 44.08 | 17.92 |
Literature | 18 | 79 | 66.23 | 12.77 |
Literature | 4 | 77 | 65.62 | 11.38 |
Literature | 12 | 61 | 50.69 | 10.31 |
Literature | 20 | 60 | 52.69 | 7.31 |
Literature | 11 | 68 | 60.92 | 7.08 |
Literature | 22 | 66 | 61.31 | 4.69 |
Literature | 2 | 73 | 69.62 | 3.38 |
Literature | 14 | 78 | 75.85 | 2.15 |
Literature | 24 | 88 | 86.00 | 2.00 |
Literature | 7 | 77 | 75.77 | 1.23 |
Literature | 15 | 60 | 60.77 | -0.77 |
Literature | 8 | 66 | 66.92 | -0.92 |
Literature | 9 | 79 | 82.08 | -3.08 |
Literature | 6 | 61 | 64.31 | -3.31 |
Literature | 17 | 68 | 72.00 | -4.00 |
Literature | 23 | 53 | 57.92 | -4.92 |
Literature | 19 | 65 | 71.77 | -6.77 |
Literature | 10 | 63 | 70.38 | -7.38 |
Literature | 5 | 62 | 69.62 | -7.62 |
Literature | 16 | 61 | 68.69 | -7.69 |
Literature | 13 | 60 | 68.62 | -8.62 |
Literature | 1 | 0 | 42.00 | -42.00 |
With this, we can see that student with the id
of 21 exceeded his overall average the most. Thus we can conclude that this is one of his best courses.
Subqueries in all parts of the main query
Subqueries can also be used in more than one part of the main query. That is, they can be used in the WHERE or FROM or SELECT clauses or a combination of 2 or all.
For example, consider the question:
What was the average score of the student with an ID of 9 and in which courses did he score above this average?
We could run the following:
SELECT course, score, (
SELECT ROUND(AVG(score),2) FROM results
WHERE student_id=9
) AS average_score
FROM results
WHERE student_id = 9 AND score > (
SELECT AVG(score) FROM results
WHERE student_id=9
)
ORDER BY score;
course | score | average_score |
---|---|---|
History | 84 | 82.08 |
Biology | 84 | 82.08 |
Physics | 85 | 82.08 |
Statistics | 85 | 82.08 |
Computer Science | 93 | 82.08 |
Looking at the above ResultSet we can see the student's average score and all the courses where he performed above his average.
Other notes about subqueries
Here are some extra features of subqueries:
- you can have a subquery within another subquery. This is known as a nested subquery.
- subqueries can access data from the main query. This is known as a correlated subquery.
Disadvantages of Subqueries
All tools/techniques have their downside and subqueries are not excluded from this. Here are some downsides of using subqueries
- subqueries are executed as separate SELECT statements thus it increases the amount of time our query takes to run. For this reason, it is best to use it in a few places as possible
- subqueries are quite unreadable and can be difficult to maintain(well we could add a comment that says "don't touch the query below" 😀).
- if we have two subqueries that do the same thing, they would be executed twice. For example, the query from the previous section actually computes the average of the student twice. This can be very inefficient in large tables. One way around this is to use CTEs
Conclusions
In this blog post, we have seen that:
- subqueries are SELECT statements within another statement.
- subqueries are useful for reshaping and filtering our data
- we can use subqueries in the SELECT, FROM or WHERE clauses of our outer query or a combination of 2 or more.
- subqueries are executed as separate SELECT statements. For this reason, multiple subqueries can re-compute the same value.
- it is best to use subqueries sparingly as they can be quite unreadable and difficult to maintain
Top comments (3)
edited to add syntax highlighting eg
from
to
I know sub query use it on daily. But he has given complete picture of Sub query end to end. For sure I will recommend this for new learners.
Thanks a lot @vinoth . Really appreciate this feedback