When writing queries, we should always take time to find the best way to write the query.
Sometimes this can mean using methods that on the surface seem like they wouldn't be fast, but actually are.
Query optimization is critical to having an efficient website.
While query optimization also applies to reporting and analytics, queries that run as part of a web service are the ones most noticed by users of your website.
For this article I am using the MySQL test employee database: https://dev.mysql.com/doc/employee/en/
The Schema
CREATE TABLE `employees` (
`emp_no` int NOT NULL,
`birth_date` date NOT NULL,
`first_name` varchar(14) NOT NULL,
`last_name` varchar(16) NOT NULL,
`gender` enum('M','F') NOT NULL,
`hire_date` date NOT NULL,
PRIMARY KEY (`emp_no`),
KEY `name` (`first_name`,`last_name`)
)
CREATE TABLE `salaries` (
`emp_no` int NOT NULL,
`salary` int NOT NULL,
`from_date` date NOT NULL,
`to_date` date NOT NULL,
PRIMARY KEY (`emp_no`,`from_date`),
KEY `salary` (`emp_no`,`salary`)
)
The salaries table can contain the same employee multiple times, each time an employees salary changes, it's a new row in the salaries table.
The task
The task for this query is to return a unique list of employee number, first_name, last_name who earn over $50,000 a year.
Along with selecting the data, we will need to ensure there are no duplicate employees.
Using DISTINCT
SELECT DISTINCT
employees.emp_no,
first_name,
last_name
FROM
employees
INNER JOIN salaries USING (emp_no)
WHERE
salary > 50000
In general, the use of DISTINCT is an indication that the query could be written better.
DISTINCT fetches all the possible rows, and at the end of the query process, strips out duplicate rows it doesn't need.
Distinct is calculated against all selected rows. This can mean that it's possible to return duplicate names in some cases.
An example of when this could occur would be if we included a column where each row for an employee changed, for example salary
SELECT DISTINCT
employees.emp_no,
first_name,
last_name,
salary
FROM
employees
INNER JOIN salaries USING (emp_no)
WHERE
salary > 50000
Query Execution Plan:
-> Table scan on <temporary> (cost=241946..245972 rows=321886)
└─> Temporary table with deduplication (cost=241946..241946 rows=321886)
└─> Nested loop inner join (cost=209757 rows=321886)
├─> Filter: (salaries.salary > 50000) (cost=97097 rows=321886)
│ └─> Index scan on salaries using salary (cost=97097 rows=965756)
└─> Single-row index lookup on employees using PRIMARY (emp_no=salaries.emp_no) (cost=0.25 rows=1)
The execution plan shows the use of a temporary table and a high cost. Temporary tables are generally slower queries. They are necessary at times, but if you can find a way to query without the use of a temporary table, it's generally going to be more efficient.
Average response time: 745ms
Using GROUP BY
A common method of ensuring unique users is to use GROUP BY
GROUP BY is generally faster than DISTINCT. It doesn't need that last step of removing duplicates to complete the query plan
SELECT
employees.emp_no,
first_name,
last_name
FROM
employees
INNER JOIN salaries USING(emp_no)
WHERE
salary > 50000
GROUP BY
employees.emp_no
Query Execution Plan:
-> Table scan on <temporary> (cost=241946..245972 rows=321886)
└─> Temporary table with deduplication (cost=241946..241946 rows=321886)
└─> Nested loop inner join (cost=209757 rows=321886)
├─> Filter: (salaries.salary > 50000) (cost=97097 rows=321886)
│ └─> Index scan on salaries using salary (cost=97097 rows=965756)
└─> Single-row index lookup on employees using PRIMARY (emp_no=salaries.emp_no) (cost=0.25 rows=1)
While the GROUP BY is slightly faster than DISTINCT, the execution plan is the same. The difference between them in this case is generally related to the internal query optimizer, query caching etc.
While execution plans are very useful, they don't always give you the whole story of what is going on internally, which leads to subtle differences between queries that might have the same execution plan.
Average response time: 721ms
Using Subquery
While subqueries are often viewed as less efficient, there are times where they can reduce the row count, which can make queries faster.
In this case, we are going to use a subquery to find the employee numbers where salary is over $50,000
SELECT
employees.emp_no,
first_name,
last_name
FROM
employees
WHERE
emp_no IN(
SELECT
emp_no FROM salaries
WHERE
salary > 50000)
Using this method, the query time drops significantly.
Query Execution Plan:
-> Nested loop inner join (cost=89029 rows=33961)
├─> Remove duplicates from input sorted on salary (cost=5161 rows=33961)
│ └─> Filter: (salaries.salary > 50000) (cost=5161 rows=33961)
│ └─> Index scan on salaries using salary (cost=5161 rows=965756)
└─> Single-row index lookup on employees using PRIMARY (emp_no=salaries.emp_no) (cost=80472 rows=1)
Here you will see that the query is no longer using a temporary table, and is using a much simpler plan, with a much lower cost value.
These factors lead to a faster response time.
Average response time: 234ms
While using a subquery significantly improved the query performance, we may be able to achieve better results by using the EXISTS clause, which offers some advantages over the IN statement used in the subquery.
Using EXISTS
When using EXISTS, the query early terminates once it finds a match. In this case, it will early terminate once it has found a specific employee.
While there are multiple rows in the salaries table for an employee, it does not need to continue checking if that specific employee exists if it has found a matching row, so it stops looking for the employee and moves onto looking for the next one.
SELECT
employees.emp_no,
first_name,
last_name
FROM
employees
WHERE
EXISTS (
SELECT
1
FROM
salaries
WHERE
salaries.emp_no = employees.emp_no
AND salary > 50000)
We use SELECT 1 in this query because EXISTS only returns TRUE or FALSE, not what that the row contains.
While we could use SELECT emp_no or SELECT *, returning a constant makes the intent of the query clearer, and in some cases, can be more efficient.
Query Execution Plan:
-> Nested loop inner join (cost=89029 rows=33961)
├─> Remove duplicates from input sorted on salary (cost=5161 rows=33961)
│ └─> Filter: (salaries.salary > 50000) (cost=5161 rows=33961)
│ └─> Index scan on salaries using salary (cost=5161 rows=965756)
└─> Single-row index lookup on employees using PRIMARY (emp_no=salaries.emp_no) (cost=80472 rows=1)
While this query plan is the same as the subquery query plan, the early termination improves the execution time.
Average response time: 220ms
Summary
Distinct: 745ms
Group By: 721ms
Subquery: 234ms
Exists : 220ms
Using subqueries is not always the most efficient querying method, however, in scenarios like this, it can significantly improve your query.
While just changing the query can help fix slow queries, there are other optimizations that could be considered.
Creating better indexes can also help resolve slow queries, but adding indexes should be reserved for times where rewriting the query doesn't help the query to be more efficient.
It's important to try out different query strategies on your own data. While EXISTS was the most efficient strategy when querying this dataset, results may differ on other datasets, so try out a variety of queries and see which one works best for you.
Top comments (0)