We can keep our database-backed applications performing pretty well by following a couple of simple rules, for example:
- avoiding N+1 queries
- adding adequate indexes
- keeping the output "narrow" (retrieving only the required columns)
The scenario I'm about to show you is a bit different. It was already down to just one query, it had adequate indexes, and yet it was still taking ~30 seconds
to run, so it had to be improved. Everything was in place for it to be fast, but somehow, it wasn't! I'll show you what I did to get it running in under a second,
and explain every step with the reasoning behind it.
The tables
What I had was a many-to-many relationship involving two entity classes; let's call them `Foo` and `Bar`. They were mapped with the `@ManyToMany` JPA annotation in the Java code. At the database level, besides the corresponding tables for storing the entities, there was a link table called `foo_bar` containing only two columns (foreign keys).
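To make that concrete, here's a minimal sketch of what the schema could have looked like (names and types are placeholders, and the real tables had more columns):

```sql
-- Placeholder schema, simplified for this post.
create table foo (
    id      bigint primary key,
    columnA text,
    columnB text
);

create table bar (
    id bigint primary key
);

-- The link table: just the two foreign keys.
create table foo_bar (
    foo_id bigint not null references foo (id),
    bar_id bigint not null references bar (id)
);

-- The standard b-tree indexes mentioned below.
create index foo_bar_foo_id_idx on foo_bar (foo_id);
create index foo_bar_bar_id_idx on foo_bar (bar_id);
```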
The tables were all pretty large, especially the link table, which totalled more than a hundred million rows. I should note that the link table did have standard b-tree indexes on both columns.
| Table | Nr. of rows |
|---|---|
| foo | 2819724 |
| bar | 21109691 |
| foo_bar | 126167975 |
The query
The query was the following, written in JPQL:
```
delete
from
    Foo foo
where
    foo.columnA in ... and
    foo.columnB = ...
```
Based on the above, Hibernate generated the following SQL query:
```sql
delete
from foo_bar
where
    foo_bar.foo_id in (select
        f1_0.id
    from
        foo f1_0
    where
        f1_0.columnA in ... and
        f1_0.columnB = ...
    )
```
As you can see, it's actually a `delete`; however, as you will see below, the bottleneck was a join operation in the execution plan.
In hindsight, the query could have been written as a native query, and using the `CASCADE` feature would have allowed solving it more declaratively. However, for the rest of this post I'll continue with the query that Hibernate generated, as I still think it's a good vehicle for showing you some of the instruments you have at your disposal when a query is slow.
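For illustration, the declarative alternative could look something like this (a sketch against the placeholder schema above; the constraint name is assumed):

```sql
-- Hypothetical: with on delete cascade on the link table's foreign key,
-- deleting rows from foo automatically removes the matching foo_bar rows.
alter table foo_bar
    drop constraint foo_bar_foo_id_fkey,
    add constraint foo_bar_foo_id_fkey
        foreign key (foo_id) references foo (id) on delete cascade;

-- Now a plain delete on foo is enough; no separate statement for foo_bar.
delete from foo where columnA in ('...') and columnB = '...';
```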
I've extracted the set of parameters from a representative execution in order to reproduce the problem. I've then added a transaction block around the query, such that I can roll back and no actual deletes happen.
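Something along these lines, with the extracted parameters substituted for the placeholders:

```sql
begin;

delete
from foo_bar
where
    foo_bar.foo_id in (select f1_0.id
    from foo f1_0
    where f1_0.columnA in ... and f1_0.columnB = ...);

-- Roll back, so no rows are actually deleted.
rollback;
```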
The explain plan
When we want to understand what exactly a query is doing to retrieve our data, we consult so-called explain plans. Let's have a look.
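In Postgres that means prefixing the statement with `explain`, or `explain (analyze)` to actually execute it and get real timings; since we're analysing a `delete`, the transaction block from above keeps it safe:

```sql
begin;

-- analyze executes the statement for real timings; buffers adds I/O details.
explain (analyze, buffers)
delete
from foo_bar
where
    foo_bar.foo_id in (select f1_0.id
    from foo f1_0
    where f1_0.columnA in ... and f1_0.columnB = ...);

rollback;
```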
Whilst it's clear what's slowing it down (the sequential scan reading the entire contents of the large link table, the thick line on the left side in the image above), it was unexpected, at least for me. Given the presence of the indexes, and the fact that the sub-select by itself only returns ~800 rows, I had expected that we could avoid reading the entire large table: given the data volume, doing so will always be slow.
No problem, let's see what our options are.
Looking for inspiration
Let's first get some inspiration by disabling some of the operations the database can use in planning; for example, let's prevent it from going for hash joins. It will then have to use one of the alternatives (the other two options being nested loop joins and merge joins).
```sql
set enable_hashjoin = off;
```
Oh - take a look at that!
No more sequential scan of a hundred million rows. By disabling the hash join option, Postgres went for a much more efficient nested loop that fully utilises the indexes we've defined on the link table.
So far so good. We now know what we're after; the next question is how to get there, because disabling hash joins like this is meant for experimentation only.
Hypothesis
The question to ask is: what's preventing Postgres from employing the more efficient nested loop alternative? Let's take a step back and reflect on how Postgres makes its decisions with regard to how it retrieves our data from disk.
Postgres uses a cost-based optimiser that computes the cost of the various alternatives and then selects the best one. The costs are calculated based on various factors, including statistics about the data in the tables and configuration properties.
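A few of those knobs are visible directly in `pg_settings`; here's a quick way to peek at some of the cost constants and the statistics setting we'll meet shortly:

```sql
-- Planner cost constants and the statistics sample knob (defaults in brackets).
select name, setting
from pg_settings
where name in (
    'seq_page_cost',              -- cost of a sequentially fetched page (1)
    'random_page_cost',           -- cost of a randomly fetched page (4)
    'cpu_tuple_cost',             -- cost of processing one row (0.01)
    'default_statistics_target'   -- drives the statistics sample size (100)
);
```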
From the Postgres codebase we learn that it collects a sample of 300 multiplied by the so-called default statistics target, a configuration option we can control. If you're wondering, like me, why 300: it's not due to the 300 Spartans confronting the Persians at Thermopylae. For the real reason, you can have a look here, where, as usual, the code has a helpful comment pointing even to the research paper on which the choice is based.
For our use-case, the intuition is that due to the size of the table (hundreds of millions of rows), the sample might be too small to be an accurate representation of the data, which might lead to less efficient planning. But let's try to confirm that with some numbers.
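A quick back-of-the-envelope calculation supports that intuition: with the default statistics target of 100, the sample is 300 × 100 = 30,000 rows, a tiny fraction of the link table:

```sql
-- Sample size vs. table size, using the row count from earlier.
select
    300 * 100                                   as sampled_rows,  -- 30000
    126167975                                   as total_rows,
    round(100.0 * (300 * 100) / 126167975, 4)   as sampled_pct;   -- ~0.0238 %
```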
Checking the statistics
One of the statistics Postgres collects is `n_distinct`, an estimate of the number of distinct values in a column. Let's see it for the `foo_id` column of table `foo_bar`:
```sql
select
    n_distinct
from
    pg_stats
where
    tablename = 'foo_bar' and
    attname = 'foo_id';
```

28001
Let's now compute the actual number of distinct values and compare with the estimate. For this, I'm using the following SQL query:
```sql
select
    count(distinct(foo_bar.foo_id))
from
    foo_bar;
```

910322
There we go! The estimate in `pg_stats` is ~30 times smaller than the actual number of distinct values.
This is a problem because it will impact the calculation of the selectivity value used by the planner. To calculate this, Postgres uses the frequencies of the most common values (MCV) in a table. Note that how many of these it collects is also bounded by the default statistics target value we talked about earlier. If the values we're looking for in a table are not in this list, Postgres falls back to using the following formula for the selectivity:
```
selectivity = (1 - sum(mcv_freqs)) / (num_distinct - num_mcv)
```
Now, if the values we're looking for are not in the MCV list (because we collected too few), and the result of this formula is off because of the wrong `n_distinct` estimate, then indeed, the planner will not choose the best plan. In our case, the underestimated `n_distinct` makes the computed selectivity too high, so the planner expects many more matching rows than there really are and reaches for the hash join with a sequential scan, when the index-backed nested loop would be much cheaper.
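You can approximate that fallback computation yourself from `pg_stats`, which also exposes the MCV frequencies. A sketch (it assumes `n_distinct` is stored as a positive count; for very large tables Postgres may store it as a negative fraction of the row count instead):

```sql
-- Rough re-computation of the fallback selectivity for foo_bar.foo_id.
select
    s.n_distinct,
    m.n_mcv,
    (1 - coalesce(m.mcv_freq_sum, 0)) / (s.n_distinct - coalesce(m.n_mcv, 0))
        as fallback_selectivity
from pg_stats s
cross join lateral (
    select sum(f) as mcv_freq_sum, count(*) as n_mcv
    from unnest(s.most_common_freqs) as f
) m
where s.tablename = 'foo_bar'
  and s.attname = 'foo_id';
```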
Increasing the amount of statistics
Let's try to improve the situation by allowing Postgres to use a larger sample size in order to gather better statistics. This means it will store more values in the MCV list, as well as look at more rows when determining `n_distinct`. We have to keep in mind, however, that it will then take longer to create plans ("Planning time" in the explain analyse output).
Let's increase the statistics target from its default of 100 to a value, say, 10 times bigger, but only for the one column, like so:
```sql
alter table foo_bar alter foo_id set statistics 1000;
```
If we query for `n_distinct` now, we'll get the same value as before; it will only update after running:

```sql
analyse foo_bar;
```
This will take a while. Remember, it has more work to do.
After a couple of minutes, it finished. Let's see if the estimate is closer to reality now.
```sql
select
    n_distinct
from
    pg_stats
where
    tablename = 'foo_bar' and
    attname = 'foo_id';
```

121894
Progress! The estimate is still about 7 times smaller than the actual number, but let's run explain analyse now and see if the nested loop gets chosen.
Bingo
This executes in under a second, which is a big improvement over where we started. Nice!
Altering the amount of statistics collected was enough to significantly improve this example. However, in case you need even more control, at least for `n_distinct` you can manually set it to whatever you want with the following command:

```sql
alter table foo_bar alter column foo_id set (n_distinct = ...);
```
However, I would advise against going for this approach directly, because it means we would no longer benefit from the automatic statistics collection that the database does in the background in an unattended fashion.
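If you've already set it manually and want to hand control back to the automatic collection, the override can be removed again:

```sql
-- Drop the manual override and let the next analyse recompute n_distinct.
alter table foo_bar alter column foo_id reset (n_distinct);
analyse foo_bar;
```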
An alternative - changing random_page_cost
Let's look at something else. From the Postgres code we can look up other variables that come into the picture when choosing an index scan over a sequential scan. For example, there's `random_page_cost`, which can be found here in the file `costsize.c`.
For a description of this setting, have a look here. Basically, it's the estimate for the cost of retrieving a page non-sequentially from disk. With modern hardware like SSDs, there isn't such a big difference between sequential and random retrieval, so the default value of 4 is not a great fit. For example, Crunchy Bridge has changed this value to 1.1 for all new databases on their platform.
Let's try adjusting this to 1.1 and see what happens.
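As with the hash join experiment, we can scope this to the current session first:

```sql
-- Session-scoped experiment; does not affect other connections.
set random_page_cost = 1.1;

-- If we like the outcome, this is how it could be persisted server-wide:
-- alter system set random_page_cost = 1.1;
-- select pg_reload_conf();
```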
Result
It worked again! We got the nested loop and the sub-second execution time once more. Great stuff.
Conclusion
Postgres features a very clever query planner that does an excellent job of finding an efficient way to retrieve our data in most cases. However, in uncommon situations like very large tables, it sometimes needs a bit of guidance from us: more details about the context, or permission to use some more storage or time for its internal operations. With that, it continues to deliver the results of our queries as fast as possible.