I got absolutely thrashed by some PostgreSQL performance issues this past week. It was painful. And it was incredibly instructive. I didn't just solve the problem; I now better understand Postgres. I'm better equipped to diagnose the next performance skirmish that comes my way. And I'm paying the lesson forward with this article.
Let me give you an idea of how bad it was. I had a query deleting 50k records on one table in ~100ms. That's fast enough for me. I then had a similar query deleting the same number of records from another table that would run for ~30 minutes. More than 10,000x slower! 😱 That's a nope.
After diagnosing and addressing the issue, I got that second delete query down to about 1 second. Much better! Toward the end of the post we'll see how to make it even faster than that.
But before I reconstruct a minimal example that reproduces the problem and get into the details, here is the...
tl;dr
Foreign keys are essential for enforcing the shape and integrity of our data. Indexes are there to keep queries fast. We can combine the two to start getting the most out of our database.
I don't tend to think of foreign key constraints as impacting performance. As we are about to see, they do.
Without that index, we're bound to run into some really sneaky performance gotchas.
Let's dig in.
A Minimal Example
This minimal example is also a real-world example. Most software systems have users and those users need to be given a variety of roles. These roles help the system determine the access and permissions of each user.
If you want to skip over the example setup, you can jump right to the details.
The Data Model
First is the users table. It doesn't feature strongly in the example so I'll leave it at the id column.
Then we have the roles table, which can contain global roles like admin and support as well as roles tied to specific resources like team_owner and team_collaborator.
Creating a relationship between the two is the users_roles join table. A record in this table tying my user record to the admin role would tell the system that I am an admin. Another record in this table tying my user record to a team_collaborator role for the Red team would tell you that in addition to being an admin, I am a collaborator of the Red team.
Building The Schema
Here is some SQL that will generate our schema for reproducing the performance issue. The specifics of how this works are out of scope for this post. If that interests you, let me know and I'll write a follow-up.
The full queries are embedded in these code blocks, but you can also check out the code in this repo.
create table users (
id bigint generated by default as identity primary key
);
create table roles (
id bigint generated by default as identity primary key,
name varchar not null,
resource_type varchar,
resource_id bigint
);
create table users_roles (
user_id bigint references users,
role_id bigint references roles
);
I put this in a file called schema.sql. From a psql session I connect to an empty database where I can safely experiment. I then tell psql to execute the script, which will create these three tables and their sequences.
> \i schema.sql
CREATE TABLE
Time: 2.605 ms
CREATE TABLE
Time: 2.615 ms
CREATE TABLE
Time: 1.141 ms
Time: 0.099 ms
> \d
List of relations
Schema | Name | Type | Owner
--------+--------------+----------+------------
public | roles | table | jbranchaud
public | roles_id_seq | sequence | jbranchaud
public | users | table | jbranchaud
public | users_id_seq | sequence | jbranchaud
public | users_roles | table | jbranchaud
(5 rows)
Generating Some Data
Let's fill these tables with some data with help from the generate_series function. Again, I won't go into specifics here. Drop a note if you're interested in a follow-up post on this part.
-- create 50,000 users
insert into users
select from generate_series(1,50000);
-- create 100,000 roles: first, 50,000 global roles
insert into roles (
name,
resource_type,
resource_id
)
select
(array['admin', 'support', 'member'])[floor(random() * 3 + 1)],
null,
null
from generate_series(1,50000);
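-- then 50,000 team_collaborator roles tied to specific teams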
insert into roles (
name,
resource_type,
resource_id
)
select
'team_collaborator',
'Team',
floor(random() * 1000 + 1)
from generate_series(1,50000);
-- create 500,000 connections between users and roles
-- start with 225,000 random global roles
insert into users_roles (
user_id,
role_id
)
select
floor(random() * 50000 + 1),
floor(random() * 50000 + 1)
from generate_series(1,225000);
-- then 50,000 for the team collaborator role
insert into users_roles (
user_id,
role_id
)
select
floor(random() * 50000 + 1),
floor(50000 + g.id)
from generate_series(1,50000) as g(id);
-- then another 225,000 random global roles
insert into users_roles (
user_id,
role_id
)
select
floor(random() * 50000 + 1),
floor(random() * 50000 + 1)
from generate_series(1,225000);
This creates 50,000 users and 100,000 roles. It then creates 500,000 connections between them in the users_roles join table, with 50,000 of those connections specifically for the team_collaborator role.
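If you want to sanity-check the generated data before moving on, a few quick counts will do it. The expected values, assuming the inserts above all ran cleanly, are shown as comments.
select count(*) from users;        -- 50000
select count(*) from roles;        -- 100000
select count(*) from users_roles;  -- 500000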
With all the data in place, let's find some specific data and delete it. This is where we'll see a reasonable query and a drastically slower query.
Fast Query, Slow Query
In the actual production data set I was dealing with, I had a bunch of data that I needed to clear out of the roles table. It turned out to be about 50k records. Because of the foreign key relationship, I first had to clear dependent records out of the users_roles table.
The following might feel a bit contrived, but it is based on that real scenario.
Here are the 50k roles that we are targeting for deletion.
> select count(*)
from roles
join users_roles
on roles.id = users_roles.role_id
where roles.name = 'team_collaborator';
count
-------
50000
(1 row)
Why are we going to be deleting all the team_collaborator related data? Because that gives us about 50k rows. Like I said, a bit contrived. Nevertheless, we are going to learn some performance things from this.
The Quick Delete
Let's delete the dependent records from users_roles first.
Anytime I'm about to modify or delete data, I like to open a transaction. This makes it easy to safely roll back the changes if anything looks off. That is done with a begin.
> begin;
> delete from users_roles
using roles
where users_roles.role_id = roles.id
and roles.name = 'team_collaborator';
DELETE 50000
Time: 116.313 ms
That looks good. That query appears to have handled the 50k records from users_roles that I wanted deleted. Because I'm in a transaction, I can always dig into the data with a few select queries to be sure. And it was fast enough, clocking in at ~100ms in this instance.
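For example, a check like this (still inside the open transaction, and mirroring the count query from earlier) should now come back with zero, since every users_roles row tied to a team_collaborator role was just deleted:
select count(*)
from users_roles
join roles
on roles.id = users_roles.role_id
where roles.name = 'team_collaborator';
-- expect a count of 0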
I'm not ready to commit this transaction yet. Next I want to attempt to delete the roles records.
The Slow Delete
With all the foreign key dependencies taken care of, we are clear to get on with our main goal: deleting the roles records.
This query is even simpler than the previous. Typing it out, I was certain it would run just as quickly.
> delete from roles
where roles.name = 'team_collaborator';
-- ... keep waiting ...
For real though, if you are following along at home, don't hold your breath. This is going to take more or less 30 minutes depending on the specs of your machine.
What gives?
Can We Explain This?
This query isn't all that different from the previous one. It is targeting the same number of rows. What could account for the slowdown?
Let's start by taking a look at the explain.
> explain delete from roles
where roles.name = 'team_collaborator';
QUERY PLAN
------------------------------------------------------------------
Delete on roles (cost=0.00..1937.00 rows=49990 width=6)
-> Seq Scan on roles (cost=0.00..1937.00 rows=49990 width=6)
Filter: ((name)::text = 'team_collaborator'::text)
(3 rows)
There isn't a lot going on in this one. The seq scan stands out to me. This means the query planner expects the query will have to look at every row sequentially. For a mere ~50k rows, that doesn't seem like it should be an issue.
By contrast, here is the explain for the previous query.
> explain delete from users_roles
using roles
where users_roles.role_id = roles.id
and roles.name = 'team_collaborator';
QUERY PLAN
--------------------------------------------------------------------------------
Delete on users_roles (cost=2561.88..11577.43 rows=249950 width=12)
-> Hash Join (cost=2561.88..11577.43 rows=249950 width=12)
Hash Cond: (users_roles.role_id = roles.id)
-> Seq Scan on users_roles (cost=0.00..7703.00 rows=500000 width=14)
-> Hash (cost=1937.00..1937.00 rows=49990 width=14)
-> Seq Scan on roles (cost=0.00..1937.00 rows=49990 width=14)
Filter: ((name)::text = 'team_collaborator'::text)
(7 rows)
This one is more involved. I'm noticing that there is a seq scan of users_roles (500k rows) and nested under that is another seq scan of roles (~50k). In terms of predicted work, this one has a ton more to do, yet it finished in 100ms. Computers can be fast.
Despite that comparison, we know from running the queries that the first (simpler) one is going to take much longer than the second one. Some back-of-the-napkin math says the simpler query plan for deleting from roles is looking to take well over 10,000x the amount of time of the second query 😱.
So that's the second scream emoji in this post—we've got to figure out what is going on.
What's the deal with this "simple" delete query?
The Mystery Uncovered
> delete from roles
where roles.name = 'team_collaborator';
The issue with this two-line, 50k row query doesn't have anything to do with the roles table itself. Instead, it has to do with the relationship that the roles table has to other tables.
The users_roles table depends on the roles table, specifically on its id column. This dependency on roles.id is through the foreign key constraint on its own role_id column.
The foreign key is causing this
Kinda.
To demonstrate that the foreign key constraint is at the center of this performance issue, let's start by removing the foreign key constraint.
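We'll need the constraint's name for that. I never named it in the schema, so Postgres generated one; psql's \d meta-command lists it under the "Foreign-key constraints:" section of the table description, and Postgres's default naming convention gives us users_roles_role_id_fkey here.
> \d users_roles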
> begin;
> delete from users_roles
using roles
where users_roles.role_id = roles.id
and roles.name = 'team_collaborator';
DELETE 50000
Time: 120.267 ms
> alter table users_roles
drop constraint users_roles_role_id_fkey;
ALTER TABLE
Time: 0.561 ms
> delete from roles
where roles.name = 'team_collaborator';
DELETE 50000
Time: 54.115 ms
> rollback;
With the foreign key constraint out of the picture, the delete is quite speedy. Notice I immediately rolled these changes back. This was to highlight the impact of the constraint. This isn't the solution.
Keep the Foreign Key Constraint
Foreign key constraints are essential to solid schema design, and they are one of my favorite features of Postgres. They help ensure that the data relationships between two tables are always intact (that's called referential integrity). Because of this assurance, I never have to worry that data is going to be inadvertently orphaned.
How is the foreign key constraint having such an impact on performance?
For every single row you tell Postgres to delete from the roles table, it is going to first check with all dependent tables to make sure it is okay to do that. If the foreign key value isn't being used by any records in the associated table(s), then the delete can proceed. If even just one record depends on that value, the foreign key constraint is going to flag that and abort the delete action.
So, even though we preemptively deleted all the related data from users_roles, Postgres still must check at the moment of the delete if that is still true.
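One way to see which tables Postgres has to consult for that check (this isn't from the original walkthrough, just a quick peek at the pg_constraint system catalog) is to list the foreign keys that reference roles:
-- every foreign key constraint that points at the roles table
select conname as constraint_name,
       conrelid::regclass as referencing_table
from pg_constraint
where contype = 'f'
  and confrelid = 'roles'::regclass;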
Postgres goes through the users_roles table and does a sequential scan, checking every single record. For each of the ~50,000 roles records being deleted, Postgres is going to scan 500,000 users_roles records. That works out to somewhere around 50,000 × 500,000 = 25 billion row checks.
And there is the root of the problem.
The Query Planner Needs More Info
Postgres is going to use the information at its disposal to verify the foreign key constraint. As is, it doesn't have enough info to do anything better than look at every single row. If Postgres had an index, though, it would have the info necessary to do a lot less work.
A database index is roughly like the organizational system that a library uses. Books are categorized and then they have a placement relative to other books in their same category. Using this system, you can quickly navigate to where a book should be.
Not having an index on users_roles.role_id is kinda like walking into a library without an organizational system: you have to start on one end and go book by book, perhaps through tens of thousands of books, until you find what you're looking for.
We can cut our delete query down to around 1 second by adding an index to users_roles.role_id. The index provides Postgres with the info it needs to quickly shortcut its way to any dependent rows.
create index role_id_index
on users_roles (role_id);
CREATE INDEX
Time: 342.591 ms
This adds an index to the users_roles.role_id column. It is a B-Tree index. "The PostgreSQL query planner will consider using a B-tree index whenever an indexed column is involved in a comparison" (with operators like =, <, >, etc.).
With this index, we'll tend to get faster queries, especially with this delete. One tradeoff is that the index will take a little disk space. Since Postgres automatically updates the index on each write to the indexed column, that adds a teeny tiny bit of write time. Both of these tradeoffs are well worth it in the vast majority of cases.
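If you're curious what that disk-space tradeoff looks like in practice, Postgres can report the index's on-disk size. This isn't part of the original walkthrough; pg_relation_size works on any relation, including the index we just created:
select pg_size_pretty(pg_relation_size('role_id_index'));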
Let's see how the index performs.
Deleting Faster
With that index in place, let's give both of our delete queries another try.
> begin;
> delete from users_roles
using roles
where users_roles.role_id = roles.id
and roles.name = 'team_collaborator';
DELETE 50000
Time: 111.732 ms
> delete from roles
where roles.name = 'team_collaborator';
DELETE 50000
Time: 590.668 ms
In the blink of an eye it is done. The second delete query went from around 30 minutes to under 600ms. That's a massive difference. That's what indexes can do for us.
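To see the index earning its keep, here's roughly the shape of the per-row lookup the foreign key check performs for each deleted roles row. This is a simplification (the real query is generated internally by Postgres's referential integrity triggers), but with the index in place it should plan as an index scan instead of a sequential scan over users_roles:
explain
select 1
from users_roles
where role_id = 12345;  -- 12345 stands in for whichever role id is being checked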
That brings me back to the tl;dr of this post.
Foreign keys are essential for enforcing the shape and integrity of our data. Indexes are there to keep queries fast. We can combine the two to start getting the most out of our database.
Without that index, you're bound to run into some really sneaky performance gotchas.
But can we go even faster?!
Deleting Even Faster
Since we are in a transaction and we are doing a ton of deletes, what if we tell Postgres to hold off on verifying the foreign key constraint until we're done with the deletes?
> begin;
> alter table users_roles
alter constraint users_roles_role_id_fkey
initially deferred;
ALTER TABLE
Time: 0.455 ms
> delete from users_roles
using roles
where users_roles.role_id = roles.id
and roles.name = 'team_collaborator';
DELETE 50000
Time: 114.574 ms
> delete from roles
where roles.name = 'team_collaborator';
DELETE 50000
Time: 68.433 ms
Whoa, that made it nearly another 10x faster, coming in at ~68ms.
Instead of checking the foreign key constraint 50,000 times as the deletes run, even with the index, Postgres can defer those checks to the end of the transaction. Zoom!
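A related option, not used in this post: if the foreign key had been declared deferrable when the table was created, we could defer the check for just one transaction with set constraints, without altering the table at all. A minimal sketch, assuming we were writing the schema from scratch:
create table users_roles (
  user_id bigint references users,
  role_id bigint references roles deferrable initially immediate
);
-- later, for a one-off bulk delete, defer the check only for this transaction
begin;
set constraints all deferred;
-- ... run the two delete statements here ...
commit;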
Wrapping Up
After running into this incredibly slow delete query, I went on quite a journey trying to understand what was causing the performance hit. There is a lot to be learned by being willing to really dig into a hard problem and try to understand, even if you can only understand it a little bit at a time. I came out the other end better for it. And I hope you are at the other end of this article better for having read it.
Foreign keys help ensure the integrity of your data. Indexes help keep your queries fast. Combine the two to really take advantage of the speed and power of PostgreSQL.
If you enjoy my writing, consider joining my newsletter or following me on twitter.
Notes
- An improvement to this schema would be to add an on delete cascade directive to the role_id foreign key constraint. This would mean we could go right to deleting the roles record and the corresponding users_roles records would be automatically removed. This would still suffer from the same performance issues.
- Postgres allows you to add indexes concurrently to avoid locking writes to the entire table. Though adding an index is a pretty quick action, it is a good rule of thumb to add them concurrently. This is especially true when dealing with live tables in a production environment. I chose not to add our index concurrently in this article to avoid tossing one more concept into the mix. Both variations are sketched below.
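For the curious, here is roughly what those two variations would look like; neither is part of the main walkthrough above.
-- 1. the join table with cascading deletes: deleting a roles row would
--    remove its users_roles rows automatically (though finding those rows
--    still needs the index to be fast)
create table users_roles (
  user_id bigint references users on delete cascade,
  role_id bigint references roles on delete cascade
);
-- 2. adding the index without locking out writes; note that
--    create index concurrently cannot run inside a transaction block
create index concurrently role_id_index
on users_roles (role_id);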
Acknowledgements
A big thanks to several people who read through and provided feedback on earlier drafts of this post. They are awesome. Check out their stuff.
- Simon Willison, who is doing awesome work on datasette
- Chris Toomey, who does fantastic write ups on Vim and Postgres
- Jordan Lewis, who has an engaging weekly stream on Go and databases
- Dillon Hafer
- Max Rozen
Cover photo by Ryan Johnston on Unsplash
Top comments (6)
This was really well written up!
How do you find all the tables that are dependent on the current table you're dealing with?
Seems like it would be nice if the query planner also showed you the implicit checks it has to do before deleting the row. At least from what I gather about the explain output above, it's not doing that.
Thanks, Ian! Unfortunately, the query plan does not expose that kind of information which is part of what made this particular issue so hard to track down.
As for finding all the tables that are dependent on a column in a given table, here is a query, for example, that finds all tables dependent on users.id. It takes advantage of Postgres's information_schema. Just replace the column name and table name in the where clause with whatever you are looking for (source).
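The linked query isn't reproduced in the comment, but a sketch along those lines, using the standard information_schema views, looks something like this (the table and column names in the where clause are the parts to swap out):
-- list every column that has a foreign key constraint pointing at users.id
select
  tc.table_name as referencing_table,
  kcu.column_name as referencing_column,
  tc.constraint_name
from information_schema.table_constraints tc
join information_schema.key_column_usage kcu
  on kcu.constraint_name = tc.constraint_name
join information_schema.constraint_column_usage ccu
  on ccu.constraint_name = tc.constraint_name
where tc.constraint_type = 'FOREIGN KEY'
  and ccu.table_name = 'users'
  and ccu.column_name = 'id';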
Nice write up, thanks for sharing!
It's a shame you didn't use explain analyze, or explain (analyze, verbose), as it would have shown referential integrity triggers taking the majority of the time. Unfortunately it would have taken a full 30 minutes to execute, as it does run the query, so your instincts around using transactions would have been perfect for that too.
I've updated our explain glossary page to hopefully serve this case better too, thanks: pgmustard.com/docs/explain/trigger...
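For reference, the command being described would look something like this; explain analyze actually runs the delete, so wrapping it in a transaction and rolling back keeps the data intact:
begin;
explain (analyze, verbose)
delete from roles
where roles.name = 'team_collaborator';
rollback;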
Are you sure that with analyze, verbose you see the cost of checking integrity constraints? I always use those options and I'm not able to see them.
Also, I checked the link in your comment, but it's not clear to me which options the trigger information is related to.
Hi Joan, here's an example: db-fiddle.com/f/6odBaA7hYaEvD3WGVU...
If you click Run on that, you should see a line starting with "Trigger RI_ConstraintTrigger". Hope that helps.
Thanks for feedback on the link too, I'll have a think about how to make that better.
I love this! That's the best way to learn more. Every problem is an opportunity and you always discover something new.
At first I thought indexes were created automatically, like in MySQL.