I believe a relational database schema should be as independent as possible from the apps that use it, so I've been trying to strengthen DEV's database a bit lately (there are exceptions to this rule, but they're not for this post).
One way to do that is to make sure that Rails model statements like this:
validates :username, uniqueness: true
correspond to actual unique indexes in PostgreSQL.
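For example (the table and index names here are hypothetical, just to sketch the idea), the database-side counterpart of that validation is a unique index on the column:
CREATE UNIQUE INDEX "index_users_on_username" ON "users" ("username");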
Two reasons for that:
- let the DBMS do its job; it was built to check constraints
- data can "get in" in all sorts of ways (throwaway SQL scripts, for example)
Even if your database is used by a single app today, you might have more than one in the future, and adding indexes on existing tables or cleaning duplicate rows out of large tables is always a bit of a pain (because of locking; I might write another article about that...).
What happened then?
It seems straightforward, right? List the column(s) you need the index for, write a Rails migration for them, run the migration, forget about it.
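In plain SQL, the unique index that migration was meant to create boils down to something like this (reconstructed from the index name you'll see in the error below, so take the exact definition as an approximation):
CREATE UNIQUE INDEX "index_articles_on_body_markdown_and_user_id_and_title"
ON "articles"
USING btree ("body_markdown", "user_id", "title");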
That's where a random test literally saved me from an oversight.
We have a test in our codebase that imports 20+ items from an RSS feed, transforms them into articles, inserts them into the DB, and then checks the count to make sure it matches.
They are all different articles, but the database is going to check they are unique anyway (for obvious reasons).
The counts weren't matching, and after some very serious debugging magic (aka setting a breakpoint and printing stuff) I came across this:
[1] pry(#<RssReader>)> p e
#<ActiveRecord::StatementInvalid: PG::ProgramLimitExceeded: ERROR: index row size 7280 exceeds btree version 4 maximum 2704 for index "index_articles_on_body_markdown_and_user_id_and_title"
DETAIL: Index row references tuple (8,1) in relation "articles".
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
: INSERT INTO "articles" ("body_markdown", "boost_states", "cached_tag_list", "cached_user", "cached_user_name", "cached_user_username", "created_at", "description", "feed_source_url", "password", "path", "processed_html", "published_from_feed", "reading_time", "slug", "title", "updated_at"
Wait, what!?
After a bit of digging I realized my oversight: if the value to be indexed is too large, larger than roughly a third of a PostgreSQL buffer page as the hint says, indexing is not going to work.
The PostgreSQL buffer page size can be enlarged, but only by recompiling the server, which is beside the point and not a great idea anyway.
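If you're curious, you can check the page size of your own cluster from psql or pgcli; it's 8 kB unless PostgreSQL was compiled with a different block size:
SHOW block_size; -- 8192 bytes on a default build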
So, what's the solution?
The solution is to create a hash of the column and index that instead of the column itself.
There are many ways to go about this, but this is what I chose for our particular situation:
CREATE UNIQUE INDEX CONCURRENTLY "index_articles_on_digest_body_markdown_and_user_id_and_title"
ON "articles"
USING btree (digest("body_markdown", 'sha512'::text), "user_id", "title");
Let's break it down:
- CREATE UNIQUE INDEX is self explanatory: it creates an index on a column (or a set of columns), making sure you can't insert the same value twice
- CONCURRENTLY is a huge change in PostgreSQL land. In short: it adds the index asynchronously in the background, so it doesn't block operations on the table while the index is being built
- btree is the standard default index type for PostgreSQL
- digest("body_markdown", 'sha512'::text) is where the magic happens: we tell PostgreSQL to build a SHA512 hash of the column (go away MD5 😅) and use that for the index comparisons (more on digest() right after this list)
- "user_id", "title" are there because this is not an index on a single column, but a multi column index
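A note about digest(): it's not a built-in function, it comes with the pgcrypto extension, so pgcrypto has to be enabled before creating the index (assuming you have the privileges to do so):
CREATE EXTENSION IF NOT EXISTS pgcrypto;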
This is what happens when you try to insert the same values twice in the database:
$ pgcli PracticalDeveloper_development
PracticalDeveloper_development> insert into articles (body_markdown, user_id, title, created_at, updated_at) select body_markdown, user_id, title, now(), now() from articles order by random() limit 1;
duplicate key value violates unique constraint "index_articles_on_digest_body_markdown_and_user_id_and_title"
DETAIL: Key (digest(body_markdown, 'sha512'::text), user_id, title)=(\x1f40fc92da241694750979ee6cf582f2d5d7d28e18335de05abc54d0560e0f5302860c652bf08d560252aa5e74210546f369fbbbce8c12cfc7957b2652fe9a75, 10, The Curious Incident of the Dog in the Night-Time Voluptas quia) already exists.
Bonus tip: I use pgcli instead of the regular psql.
The result of this investigation is this commit.
Top comments (6)
In this use case, MD5 looks good IMO. I mean, the SHA-2 family is more robust, but that means more time to compute it, which is not really good; the CONCURRENTLY is the hero here. But... why choose SHA-512 instead of MD5, if your use case isn't related to crypto stuff or something similar? BTW, both could have a collision.
Hi kip, you're right, MD5 tends to be faster than SHA-2, but since we're not building a brute forcing app, the time difference of computing the hash in C (PostgreSQL's pgcrypto is written in C AFAIK) is not that big, especially since it's going to be done asynchronously.
Fun fact: SHA-512 is faster than SHA-256 on 64-bit processors:
(embedded link: "SHA512 faster than SHA256?", an answer with benchmarks run on an Intel Core i7-7700HQ under Linux Mint 19.3)
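If you want a quick (and very unscientific) comparison on your own machine, you can time the two digests directly from psql with \timing turned on; the payload size and the number of rows below are arbitrary:
\timing on
SELECT count(digest(g::text || repeat('x', 2000), 'sha256'))
  FROM generate_series(1, 100000) AS g;
SELECT count(digest(g::text || repeat('x', 2000), 'sha512'))
  FROM generate_series(1, 100000) AS g;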
Yes, I understand. My idea here is to give some thoughts about why the expression "go away MD5" exists; using MD5 for an ID signature isn't too bad.
Like you said, the async execution of the computation is a big advantage here, whatever the hash.
Let's dig into that!
Haha yeah, it's not too bad. My parenthetical remark was probably too harsh :)
Hi @rhymes,
Great post. TIL about the limit on the size of what can be indexed. A buffer page is generally around 8 kB, so it looks like ~2.7 kB is the maximum text size that can be indexed.
I had a few queries on this:
validates :article_hash, uniqueness: true
Hi Raunak!
Rails validations like that don't create CHECK constraints in the database, so it's one of those things that is still enforced only at the application level.
Hope this clears things up a bit :) Great questions!