DEV Community

rhymes
rhymes

Posted on • Updated on • Originally published at rhymes.dev

How to build unique indexes in PostgreSQL on large text

As I believe a relational database schema should be as independent as possible from the apps using it, I've been trying to strengthen the DEV's database a bit lately (there are exceptions to this rule but they are not for this post).

One way to do that is to make sure that Rails model statements like this:

validates username, uniqueness: true
Enter fullscreen mode Exit fullscreen mode

correspond to actual unique indexes in PostgreSQL.

Two reasons for that:

  • let the DBMS do its job, it was built to check constraints
  • data can "get in" from all sort of ways (throwaway SQL scripts for example)

Even if today your database is used only by a single app, you might have more than one in the future and adding indexes on existing tables or having to clean duplicate rows in large tables is always a bit of a pain (because of locking, I might write another article about that..).

What happened then?

It seems straigtforward, right? List the column(s) you need the index for, write a Rails migration for them, run the migration, forget about it.

That's where a random test literally saved me from an oversight.

We have a test in our codebase that imports 20+ items from a RSS feed, transforms them into articles and inserts them in the DB, then checks the count to make sure it matches.

They are all different articles, but the database is going to check they are unique anyway (for obvious reasons).

The counts weren't matching and after some very serious debugging magic (aka setting a breakpoint and printing stuff) I came across this:

[1] pry(#<RssReader>)> p e
#<ActiveRecord::StatementInvalid: PG::ProgramLimitExceeded: ERROR:  index row size 7280 exceeds btree version 4 maximum 2704 for index "index_articles_on_body_markdown_and_user_id_and_title"
DETAIL:  Index row references tuple (8,1) in relation "articles".
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
: INSERT INTO "articles" ("body_markdown", "boost_states", "cached_tag_list", "cached_user", "cached_user_name", "cached_user_username", "created_at", "description", "feed_source_url", "password", "path", "processed_html", "published_from_feed", "reading_time", "slug", "title", "updated_at"
Enter fullscreen mode Exit fullscreen mode

Wait, what!?

After a bit of digging I realized my oversight: if the text to be indexed is too large and doesn't fit PostgreSQL buffer page, indexing is not going to work.

PostgreSQL buffer page size can be enlarged but that's beside the point and also not a great idea.

So, what's the solution?

The solution is to create a hash of the column and index that instead of the column itself.

There are many ways to go about this but this is what I chose for our particular situation:

CREATE UNIQUE INDEX CONCURRENTLY "index_articles_on_digest_body_markdown_and_user_id_and_title"
ON "articles"
USING btree (digest("body_markdown", 'sha512'::text), "user_id", "title");
Enter fullscreen mode Exit fullscreen mode

Let's break it down:

  • CREATE UNIQUE INDEX is self explanatory: creates an index on a column, making sure you can't insert the same value twice
  • CONCURRENTLY is a huge change in PostgreSQL land. In short: it adds the index asynchronously in the background. Basically it doesn't block operations on the table while the index is being built.
  • btree is the standard default index for PostgreSQL
  • digest("body_markdown", 'sha512'::text) is where the magic happens: we tell PostgreSQL to build a SHA512 hash (go away MD5 😅) and use that for comparison of the index
  • "user_id", "title" are there because this is not an index on a single column, but a multi column index

This is what happens when you try to add the value twice in the database:

$ pgcli PracticalDeveloper_development
PracticalDeveloper_development> insert into articles (body_markdown, user_id, title, created_at, updated_at) select body_markdown, user_id, title, now(), now() from articles order by random() limit 1;
duplicate key value violates unique constraint "index_articles_on_digest_body_markdown_and_user_id_and_title"
DETAIL:  Key (digest(body_markdown, 'sha512'::text), user_id, title)=(\x1f40fc92da241694750979ee6cf582f2d5d7d28e18335de05abc54d0560e0f5302860c652bf08d560252aa5e74210546f369fbbbce8c12cfc7957b2652fe9a75, 10,  The Curious Incident of the Dog in the Night-Time Voluptas quia) already exists.
Enter fullscreen mode Exit fullscreen mode

bonus tip for pgcli which I use instead of the regular psql

The result of this investigation is this commit.

Top comments (6)

Collapse
 
kip13 profile image
kip • Edited

is where the magic happens: we tell PostgreSQL to build a SHA512 hash (go away MD5 😅)

In this use case, md5 looks good imo, i mean, the sha-2 familiy is more robust but this means more time to compute it, not really good, but the CONCURRENTLY is the heroe here, but...

Why choose sha-512 instead of md5 ? if your use case isnt related to crypt stuff or something similar, btw both could have a collision.

Collapse
 
rhymes profile image
rhymes • Edited

Hi kip, you're right, MD5 tends to be faster than SHA2, but as we're not building a brute forcing app the time difference of computing the hash in C (PostgreSQL pgcrypto is written in it AFAIK) is not that big, also given the fact it's going to be done asynchronously.

Fun fact: SHA-512 is faster than SHA-256 on 64bit processors:

Benchmarks

I would like to see also some real-life measurements here, so I hope you'll like it ;)


Intel Core i7-7700HQ (7th gen = Kaby Lake); RAM (DDR4)

HW / OS configuration:

Collapse
 
kip13 profile image
kip • Edited

Yes, i understand, my idea here is give some thoughts about why exists the expression go away MD5, use md5 for id signature isnt too bad.

Like you said the async execution of computing is a big advantage here, whatever was the hash.

SHA-512 is faster than SHA-256 on 64bit processors

Lets digging in that !

Thread Thread
 
rhymes profile image
rhymes

Yes, i understand, my idea here is give some thoughts about why exists the expression go away MD5, use md5 for id signature isnt too bad.

Ah ah yeah, it's not too bad. My parenthesis was probably too harsh :)

Collapse
 
rrampage profile image
Raunak Ramakrishnan • Edited

Hi @rhymes ,
Great post. TIL about the limit on size of what can be indexed. A buffer page is generally around 8kb so looks like 2.5 kb is maximum text size to be indexable.

I had a few queries on this:

  1. Are articles queried directly by their text in the code? If not so, is a unique index needed for the article text?
  2. The order of columns in the index matters. So, is there advantage in indexing (hash, title, user) vs (user, title, hash) as any query selecting by user will automatically use this index.
  3. I realized there is a limit to title of 128 char in dev.to. Is that application only or a DB contraint/data type?
  4. Finally, is there any advantage of adding the hash as a separate column and indexing it instead and adding the uniqueness of this column in Rails model like validates article_hash, uniqueness: true
Collapse
 
rhymes profile image
rhymes • Edited

Hi Raunak!

  1. No, it's just used to avoid accidental double posting mostly
  2. It's not queried so it doesn't really matter in this case but yeah, in general it does matter, great point!
  3. We don't have length CHECK constraints in the database, so it's one of those things that it's still only at the application level
  4. It would mean that the hash needs to be updated at the code level at each "article save", which means that we're back to depending on Rails to check uniqueness

Hope this clears a bit :) Great questions!