DEV Community

Cover image for In-Depth Analysis of Search Strategies for Momen
Yaokai Jiang for Momen

Posted on

In-Depth Analysis of Search Strategies for Momen

TL;DR

Install pg_trgm extension, use a GIN index and add an exclude constraint using hash if needed. GIN with pg_trgm will get you indexed operation on 'LIKE', 'ILIKE', 'SIMILAR TO' and '~', while the exclude constraint using hash gives you both uniqueness and super fast point selects.


Introduction

Momen is a cloud-native no-code IDE (integrated development environment) that we here at Momen have been developing since early 2020. It aims to significantly lower the hurdle one has to overcome to build an application that is capable of meeting the requirements of serving requests from a non-insignificant crowd. A very common requirement is text search. I would like to share some of my thoughts on this requirement.


Breaking Down the Requirement

A lot is implied when our customers ask the question: "Do you support search?". After some digging, I narrowed them down to the following.

  • Exact matches. e.g. SELECT * FROM account WHERE phone_number = '12312345678'. This is the easiest type to satisfy.
  • Prefix matches. e.g. SELECT * FROM post WHERE post.title ILIKE 'Momen%'. As we shall see later, these are quite easy as well.
  • Suffix matches. e.g. SELECT * FROM post WHERE post.title ILIKE '%Momen'. These look similar to prefix matches, and indeed are quite similar.
  • Infix matches. e.g. SELECT * FROM post WHERE post.title ILIKE '%Momen%'. These are quite a bit more difficult to deal with, and require a different strategy entirely.
  • Regular-expression matches. e.g. SELECT * FROM post WHERE post.title ~ '(Momen|mirror)'. Yeah, a tough nut to crack.
  • Full-text searches. e.g. SELECT * FROM post WHERE post.content @@ to_tsquery('Momen & mirror'). (Postgres syntax) This one is especially difficult to deal with, and we shall reserve this for a separate post.

In addition to those "search" requirements, our customers frequently use "unique" constraint on the content of the text. This one although seems innocuous on the surface, can become a huge pain under certain circumstances.


Relevant Background Information

Our product mirror is opinionated. We are a start-up team. We are very cognizant of the resources that we have, and being opinionated allows us to not be boggled down by the innumerable combinations of technology choices at every level of the stack. We also wanted to have as few components as possible while meeting our customers' needs.

Our choice for database is Postgres, and our API was initially generated by hasura. Hasura is a GraphQL backend that essentially provides a pass-through SQL interface, and does not offer the ability to inject custom logic for specific types of queries. That means whatever we decide to do to optimize the SQL queries generated by all those different kinds of "searches".

Image description


Route to Optimization

Setup

It is not scientific, but should be representative.
Surface Book 2, i7-8650U + 16GB + 512GB.
Postgres 14 installed on Windows' NTFS (not WSL2).
Populated via

CREATE DATABASE momen ENCODING 'UTF-8';
CREATE TABLE simple (id serial PRIMARY KEY, content TEXT);
INSERT INTO simple (content) SELECT gen_random_uuid() FROM generate_series(1, 1000000);
Enter fullscreen mode Exit fullscreen mode

Exact Matches

This one is simple enough. All it takes is to create any kind of index that supports equality test.
Before any optimization, let's establish the baseline.

momen=# explain analyze select * from simple where content = 'f0f453fd-8f7c-44f2-bcc0-124f706786cf';
QUERY PLAN
-------------------------------------------------------------------------
Gather  (cost=1000.00..15554.43 rows=1 width=41) (actual time=144.976..172.778 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on simple  (cost=0.00..14554.33 rows=1 width=41) (actual time=35.305..35.305 rows=0 loops=3)
         Filter: (content = 'f0f453fd-8f7c-44f2-bcc0-124f706786cf'::text)
         Rows Removed by Filter: 333333
 Planning Time: 1.331 ms
 Execution Time: 172.798 ms
(8 rows)
Enter fullscreen mode Exit fullscreen mode

Notice the parallel sequential scan. Now let's just create a vanilla B-tree index on content.

momen=# CREATE INDEX idx_simple_by_content ON simple (content);
momen=# explain analyze select * from simple where content = 'f0f453fd-8f7c-44f2-bcc0-124f706786cf';
QUERY PLAN
-------------------------------------------------------------------------
Index Scan using idx_simple_by_content_b_tree on simple  (cost=0.42..8.44 rows=1 width=41) 
(actual time=0.287..0.288 rows=1 loops=1)
Index Cond: (content = 'f0f453fd-8f7c-44f2-bcc0-124f706786cf'::text)
Planning Time: 0.394 ms
Execution Time: 0.323 ms
(4 rows)
Enter fullscreen mode Exit fullscreen mode

We now have an index scan. And execution time goes from hundreds of ms to hundreds of us, about a 1000x difference on this 1 million row dataset.

Prefix Matches

A bit more interesting obviously, so let's try these with the previously created B-tree index on content.

momen=# explain analyze select * from simple where content like 'f0f453fd-8f7c-44f2-bcc0-124f706786cf%';
QUERY PLAN
-------------------------------------------------------------------------
Gather  (cost=1000.00..15564.33 rows=100 width=41) (actual time=0.528..100.149 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
->  Parallel Seq Scan on simple  (cost=0.00..14554.33 rows=42 width=41) (actual time=26.435..52.645 rows=0 loops=3)
Filter: (content ~~ 'f0f453fd-8f7c-44f2-bcc0-124f706786cf%'::text)
Rows Removed by Filter: 333333
Planning Time: 1.960 ms
Execution Time: 100.194 ms
(8 rows)
Odd, we are not going through index at all. We ended up with a parallel sequential scan. One would think prefix matches should be indexable, since we can quite simply express this as a conjunction of two range conditions, and B-trees are perfect for those queries. So let's try that. 
momen=# explain analyze select * from simple where content >= 'f0f453fd-8f7c-44f2-bcc0-124f706786cf' and content < 'f0f453fd-8f7c-44f2-bcc0-124f706786cg';
QUERY PLAN
-------------------------------------------------------------------------
Index Scan using idx_simple_by_content_b_tree on simple  (cost=0.42..8.45 rows=1 width=41) (actual time=0.113..0.113 rows=1 loops=1)
Index Cond: ((content >= 'f0f453fd-8f7c-44f2-bcc0-124f706786cf'::text) AND (content < 'f0f453fd-8f7c-44f2-bcc0-124f706786cg'::text))
Planning Time: 0.167 ms
Execution Time: 0.128 ms
(4 rows)
Enter fullscreen mode Exit fullscreen mode

Amazing! The index is performing its magic tricks again! So it appears that Postgres wasn't smart enough to figure out this transformation. Oh, what happened to SQL's promise of "let users express what data to retrieve, let the engine underneath take care of seamlessly retrieving it"? It turns out that the reason behind Postgres' "inability" to translate "LIKE 'value%'" was due to the limitations of the >= and < operators. Both only operate on a byte value level, and once you have some collation other than "C", things can breakdown very quickly. So that "inability" was really Postgres' attempt to protect us from ourselves. The correct operators to use turns out to be "~>=~" and "~<~". Let's bask in the light of some magic:

momen=# create index idx_simple_by_content_b_tree_text_pattern_ops on simple (content text_pattern_ops);
momen=# explain analyze select * from simple where content like 'f0f453fd-8f7c-44f2-bcc0-124f706786cf%';
QUERY PLAN
-------------------------------------------------------------------------
Index Scan using idx_simple_by_content_b_tree_text_pattern_ops on simple  (cost=0.42..2.65 rows=100 width=41) (actual time=0.041..0.042 rows=1 loops=1)
Index Cond: ((content ~>=~ 'f0f453fd-8f7c-44f2-bcc0-124f706786cf'::text) AND (content ~<~ 'f0f453fd-8f7c-44f2-bcc0-124f706786cg'::text))
Filter: (content ~~ 'f0f453fd-8f7c-44f2-bcc0-124f706786cf%'::text)
Planning Time: 1.785 ms
Execution Time: 0.053 ms
(5 rows)

Enter fullscreen mode Exit fullscreen mode

That "text_pattern_ops" is all the "nudge" Postgres needed. "text_pattern_ops" is an "opclass". "The operator class identifies the operators to be used by the index for that column. ", says Postgres' official document. Usually we do not specify one, and that just means that Postgres will choose the default operator class for that column's data type. In addition to =, the default operator class supports the following operations: <, <=, >, and >=. And that was why the "LIKE" operator that operates on two text values, which is implemented by Postgres' "", wasn't getting the proper treatment. On line 6, we can clearly see that the index condition involves the text version of ">=" and "<", namely "~>=~" and "~<~". And Postgres does indeed translate the "" condition for us!

A natural follow up would be, what are all the operators supported by "text_pattern_ops"?

momen=# SELECT am.amname AS index_method,
       opf.opfname AS opfamily_name,
       amop.amopopr::regoperator AS opfamily_operator
  FROM pg_am am,
       pg_opfamily opf,
       pg_amop amop
 WHERE opf.opfmethod = am.oid AND amop.amopfamily = opf.oid
       AND opf.opfname = 'text_pattern_ops';
 index_method |  opfamily_name   | opfamily_operator
--------------+------------------+-------------------
 btree        | text_pattern_ops | ~<~(text,text)
 btree        | text_pattern_ops | ~<=~(text,text)
 btree        | text_pattern_ops | =(text,text)
 btree        | text_pattern_ops | ~>=~(text,text)
 btree        | text_pattern_ops | ~>~(text,text)
 hash         | text_pattern_ops | =(text,text)
(6 rows)
Enter fullscreen mode Exit fullscreen mode

You might have guessed, since we dropped the index that supported <, <=, >, and >= operators, some queries now suffer.

momen=# explain analyze select * from simple where content >= 'f0f453fd-8f7c-44f2-bcc0-124f706786cf' and content < 'f0f453fd-8f7c-44f2-bcc0-124f706786cg';
QUERY PLAN
-------------------------------------------------------------------------
Gather  (cost=1000.00..16596.10 rows=1 width=41) (actual time=0.648..204.607 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
->  Parallel Seq Scan on simple  (cost=0.00..15596.00 rows=1 width=41) (actual time=89.163..148.741 rows=0 loops=3)
 Filter: ((content >= 'f0f453fd-8f7c-44f2-bcc0-124f706786cf'::text) AND (content < 'f0f453fd-8f7c-44f2-bcc0-124f706786cg'::text))
 Rows Removed by Filter: 333333
Planning Time: 0.586 ms
Execution Time: 204.665 ms
(8 rows)
Enter fullscreen mode Exit fullscreen mode

And yes, now we are back to sequential scans. Though one could very reasonably argue, few requirements would involve byte-wise string range lookups. And you could always choose to keep the original B-tree index.

Image description

Suffix Matches

Now we know what operators "text_pattern_ops" support, it is not difficult to come to the conclusion that we can't express LIKE '%something' solely in terms of the text version of less-than, less-than-or-equal-to, equal-to, greater-than and greater-than-or-equal-to. (Give it a try if you don't believe me).
And that is reflected Postgres' query plan:

momen=# explain analyze select * from simple where content like '%f0f453fd-8f7c-44f2-bcc0-124f706786cf';
QUERY PLAN
-------------------------------------------------------------------------
Gather  (cost=1000.00..14388.26 rows=100 width=41) (actual time=0.599..178.746 rows=1 loops=1)
Workers Planned: 3
Workers Launched: 3
->  Parallel Seq Scan on simple  (cost=0.00..13378.26 rows=32 width=41) (actual time=27.477..52.040 rows=0 loops=4)
Filter: (content ~~ '%f0f453fd-8f7c-44f2-bcc0-124f706786cf'::text)
Rows Removed by Filter: 250000
Planning Time: 0.114 ms
Execution Time: 178.763 ms
(8 rows)

Enter fullscreen mode Exit fullscreen mode

As expected, we have reverted back to sequential scan. However, once we reverse the string, "%query" becomes effectively "query%". So let's go that route.

momen=# create index idx_simple_by_reverse_content_b_tree_text_pattern_ops on simple (reverse(content) text_pattern_ops);
momen=# explain analyze select * from simple where reverse(content) like 'fc687607f421-0ccb-2f44-c7f8-df354f0f%';                               QUERY PLAN
-------------------------------------------------------------------------
Bitmap Heap Scan on simple  (cost=92.38..4250.57 rows=5000 width=41) (actual time=0.023..0.023 rows=1 loops=1)
Filter: (reverse(content) ~~ 'fc687607f421-0ccb-2f44-c7f8-df354f0f%'::text)
Heap Blocks: exact=1
->  Bitmap Index Scan on idx_simple_by_reverse_content_b_tree_text_pattern_ops  (cost=0.00..91.13 rows=5000 width=0) (actual time=0.015..0.015 rows=1 loops=1)
Index Cond: ((reverse(content) ~>=~ 'fc687607f421-0ccb-2f44-c7f8-df354f0f'::text) AND (reverse(content) ~<~ 'fc687607f421-0ccb-2f44-c7f8-df354f0g'::text))
Planning Time: 0.162 ms
Execution Time: 0.047 ms
(7 rows)
Enter fullscreen mode Exit fullscreen mode

And yes, as expected, we have the beautiful performance afforded to us by the lovely index.

Infix Matches

Looking at infix matches, it is not possible to translate "LIKE '%query%'" in terms of any combination of "~>~" and "~<~" operators. What we now need is proper "" operator support natively in the index. To find such an index type, we need to first find the opclasses that support the "" operator, and then find the index types compatible with such opclasses.
It turns out that none of Postgres' default opclasses support "~~(text, text)", and one can run the following sql to verify:

SELECT am.amname AS index_method,
   opf.opfname AS opfamily_name,
   amop.amopopr::regoperator AS opfamily_operator
FROM pg_am am,
   pg_opfamily opf,
   pg_amop amop
WHERE opf.opfmethod = am.oid AND amop.amopfamily = opf.oid
   AND amop.amopopr::regoperator = '~~(text, text)'::regoperator;
However, an extension that is by default available in Postgres, called "pg_trgm", supports "~~(text, text)". As the name suggests, this extension gives Postgres some trigram-related abilities. So let us install that. 
momen=# create extension pg_trgm;
#CREATE EXTENSION
momen=# SELECT am.amname AS index_method,
      # opf.opfname AS opfamily_name,
      # amop.amopopr::regoperator AS opfamily_operator
 # FROM pg_am am,
 #      pg_opfamily opf,
 #      pg_amop amop
 # WHERE opf.opfmethod = am.oid AND amop.amopfamily = opf.oid
 #      AND amop.amopopr::regoperator = '~~(text, text)'::regoperator;
 index_method | opfamily_name | opfamily_operator
--------------+---------------+-------------------
 gist         | gist_trgm_ops | ~~(text,text)
 gin          | gin_trgm_ops  | ~~(text,text)
(2 rows)
Enter fullscreen mode Exit fullscreen mode

And we have two newly available options, using GIST + gist_trgm_ops or GIN + gin_trgm_ops, let's try out both.

momen=# create index idx_simple_by_content_gist_trgm_ops on simple USING GIST (content gist_trgm_ops);
CREATE INDEX
Time: 28290.219 ms (00:28.290)
momen=# explain analyze select * from simple where content like '%fc687607f421-0ccb-2f44-c7f8-df354f0f%';
QUERY PLAN
-------------------------------------------------------------------------
 Index Scan using idx_simple_by_content_gist_trgm_ops on simple  (cost=0.41..115.46 rows=100 width=41) (actual time=10.810..10.811 rows=0 loops=1)
   Index Cond: (content ~~ '%fc687607f421-0ccb-2f44-c7f8-df354f0f%'::text)
 Planning Time: 1.145 ms
 Execution Time: 10.829 ms
(4 rows)
Enter fullscreen mode Exit fullscreen mode
momen=# explain analyze select * from simple where content like '%0ccb%';
 QUERY PLAN
-------------------------------------------------------------------------
Index Scan using idx_simple_by_content_gist_trgm_ops on simple  (cost=0.41..115.46 rows=100 width=41) (actual time=1.774..104.431 rows=256 loops=1)
Index Cond: (content ~~ '%0ccb%'::text)
Rows Removed by Index Recheck: 24
Planning Time: 1.793 ms
Execution Time: 104.460 ms
(5 rows)
Enter fullscreen mode Exit fullscreen mode
momen=# create index idx_simple_by_content_gin_trgm_ops on simple USING GIN (content gin_trgm_ops);
CREATE INDEX
Time: 18051.048 ms (00:18.051)
momen=# explain analyze select * from simple where content like '%fc687607f421-0ccb-2f44-c7f8-df354f0f%';
QUERY PLAN
-------------------------------------------------------------------------------------
Bitmap Heap Scan on simple  (cost=200.98..311.19 rows=100 width=41) (actual time=11.378..11.379 rows=0 loops=1)
Recheck Cond: (content ~~ '%fc687607f421-0ccb-2f44-c7f8-df354f0f%'::text)
->  Bitmap Index Scan on idx_simple_by_content_gin_trgm_ops  (cost=0.00..200.95 rows=100 width=0) (actual time=11.375..11.375 rows=0 loops=1)
Index Cond: (content ~~ '%fc687607f421-0ccb-2f44-c7f8-df354f0f%'::text)
Planning Time: 0.294 ms
Execution Time: 11.505 ms
(6 rows)
Enter fullscreen mode Exit fullscreen mode
momen=# explain analyze select * from simple where content like '%0ccb%';
QUERY PLAN
-------------------------------------------------------------------------
Bitmap Heap Scan on simple  (cost=12.88..123.09 rows=100 width=41) (actual time=0.472..0.863 rows=256 loops=1)
Recheck Cond: (content ~~ '%0ccb%'::text)
Rows Removed by Index Recheck: 24
Heap Blocks: exact=279
->  Bitmap Index Scan on idx_simple_by_content_gin_trgm_ops  (cost=0.00..12.85 rows=100 width=0) (actual time=0.421..0.421 rows=280 loops=1)
Index Cond: (content ~~ '%0ccb%'::text)
Planning Time: 0.373 ms
Execution Time: 0.923 ms
(8 rows)
Enter fullscreen mode Exit fullscreen mode
Looks like GIST is slower to build than GIN, and much slower on shorter search patterns. So slow that GIST is only getting less than 2x performance compared to a parallel sequential scan on this 1M row table. In fact, GIST index takes more space, too, due to its tree structure. 
momen=# SELECT
    psai.indexrelname                              AS index_name,
    pg_size_pretty(pg_relation_size(c.oid))        AS table_size,
    pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size
FROM
    pg_tables t
LEFT JOIN pg_class c ON t.tablename = c.relname
LEFT JOIN pg_index i ON c.oid = i.indrelid
LEFT JOIN pg_stat_all_indexes psai ON i.indexrelid = psai.indexrelid
WHERE
    t.schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY 1;
 index_name                                        |table_size|index_size
-------------------------------------------------------+------------+----
 idx_simple_by_content_b_tree                       | 73 MB      | 56 MB
 idx_simple_by_content_b_tree_text_pattern_ops      | 73 MB      | 56 MB
 idx_simple_by_content_gin_trgm_ops                 | 73 MB      | 121 MB
 idx_simple_by_content_gist_trgm_ops                | 73 MB      | 200 MB
 idx_simple_by_reverse_content_b_tree_text_pattern_ops| 73 MB    | 56 MB
 simple_pkey                                        | 73 MB      | 21 MB
Enter fullscreen mode Exit fullscreen mode

So why do we even bother with GIST? Insert/update speed. Due to the "inverted index" nature, each insertion/update can trigger updates to a huge number of pages, significantly slowing the update speed. Further discussion of GIN indexes deserves its own post.

Image description

momen=# INSERT INTO simple_gin_only
SELECT nextval('simple_gin_only_id_seq'), gen_random_uuid() FROM generate_series(1, 100000);
INSERT 0 100000
Time: 2761.322 ms (00:02.761)
momen=# INSERT INTO simple_gist_only
SELECT nextval('simple_gist_only_id_seq'), gen_random_uuid() FROM generate_series(1, 100000);
INSERT 0 100000
Time: 4084.976 ms (00:04.085)
momen=# INSERT INTO simple_gin_only
SELECT nextval('simple_gin_only_id_seq'), 
gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text FROM simple order by id limit 10000;
INSERT 0 10000
Time: 2669.833 ms (00:02.670)
momen=# INSERT INTO simple_gist_only
SELECT nextval('simple_gist_only_id_seq'), gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text FROM simple order by id limit 10000;
INSERT 0 10000
Time: 1263.832 ms (00:01.264)
Enter fullscreen mode Exit fullscreen mode

What we observe is somewhat consistent with Postgres 9.4's documentation on this topic. As the size of the text grows, GIN's performance degradation is much more severe. So this now becomes a trade-off, is your data read-heavy or write-heavy?
Both GIN and GIST support a ton of text operations, and it seems other than size and insert/update performance, there now seems no particular reason to rely on B-tree indexes for text fields.

momen=# select * into simple_btree_only from simple;
momen=# create index idx_simple_btree_only_by_content on simple_btree_only using btree (content text_pattern_ops);
CREATE INDEX
momen=# INSERT INTO simple_btree_only
SELECT nextval('simple_btree_only_id_seq'), gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text FROM simple order by id limit 10000;
INSERT 0 10000
Time: 412.707 ms
Enter fullscreen mode Exit fullscreen mode

Regular-Expression Matches

Remember when we said "both GIN and GIST support a ton of text operations"? Well the payback is quick on this one.

momen=# explain analyze select * from simple_gin_only where content ~ 'ba1.*abc';
QUERY PLAN
-------------------------------------------------------------------------
Bitmap Heap Scan on simple_gin_only  (cost=249.65..398.44 rows=135 width=41) (actual time=2.126..2.338 rows=37 loops=1)
Recheck Cond: (content ~ 'ba1.*abc'::text)
Rows Removed by Index Recheck: 28
Heap Blocks: exact=65
->  Bitmap Index Scan on idx_simple_gin_only_by_content_gin_trgm_ops  (cost=0.00..249.61 rows=135 width=0) (actual time=2.106..2.106 rows=83 loops=1)
Index Cond: (content ~ 'ba1.*abc'::text)
Planning Time: 0.454 ms
Execution Time: 2.384 ms
(8 rows)
Enter fullscreen mode Exit fullscreen mode
momen=# explain analyze select * from simple_gist_only where content ~ 'ba1.*abc';
QUERY PLAN
-------------------------------------------------------------------------
Index Scan using idx_simple_gist_only_by_content_gist_trgm_ops on simple_gist_only  (cost=0.41..156.79 rows=136 width=41) (actual time=7.026..264.619 rows=42 loops=1)
Index Cond: (content ~ 'ba1.*abc'::text)
Rows Removed by Index Recheck: 28
Planning Time: 0.230 ms
Execution Time: 264.655 ms
(5 rows)
Enter fullscreen mode Exit fullscreen mode

Again, GIST seems like a rather clear loser, though still performing better than sequential scan. So the answer would most likely be, "just use a GIN with trigrams".

Uniqueness Constraint

Neither GIN nor GIST indexes can be used to support a unique constraint. In fact, according to Postgres' own documentation, "currently, only B-tree indexes can be declared unique."
That is usually not too much of a problem, but what about a huge string that doesn't fit in a single page? All B-tree entries must fit inside a page, which in most cases, is 8kB in size. And if we try to create a (unique) index on a column that does contain large strings greater than 8kB in size, we are greeted with the following error message.

momen=# insert into simple_btree_only values (nextval('simple_btree_only_id_seq'), repeat('a', 100000000));
ERROR:  index row requires 1144712 bytes, maximum size is 8191
Enter fullscreen mode Exit fullscreen mode

But there are ways out. We can shorten the content inside a B-tree index by first hashing them.

momen=# CREATE UNIQUE INDEX udx_simple_btree_only_by_content_hash ON simple_btree_only USING btree (digest(content, 'sha512'));
CREATE INDEX
momen=# insert into simple_btree_only values (1, 'f0f453fd-8f7c-44f2-bcc0-124f706786cf');
ERROR:  duplicate key value violates unique constraint "udx_simple_btree_only_by_content_hash"
DETAIL:  Key (digest(content, 'sha512'::text))=(\xcf187c6795e52b388d721afeae11e86d115742bf22deae5e360b028643b51a1a1d13dd646fa72b90c8fe16edd372c2eeb76cdebe6424b18ce0a0aecc76290889) already exists.
Enter fullscreen mode Exit fullscreen mode

The problem with this approach is that although we have a B-tree index that is orders of magnitude better at point selects than GIN or GIST indexes (reason is again worth another post), we can't really utilize this index unless we modify our query to operate on (digest(content, 'sha512')), rather than the vanilla content. Doable, not great though.

Since we are doing hashing using the "digest" method, why don't we just use a hash index? We can't directly create a unique index using hash, as said in the aforementioned documentation. However, for some reason that is yet unclear to me, we can create a constraint that automatically creates a hash index to support it. The same index can also be used for point selects.

momen=# alter table simple_hash_only add constraint udx_simple_hash_only_by_content exclude using hash (content with =);
ALTER TABLE
Time: 3299.922 ms (00:03.300)
momen=# explain analyze select * from simple_hash_only where content = 'f0f453fd-8f7c-44f2-bcc0-124f706786cf';
                                                                    QUERY PLAN
-------------------------------------------------------------------------
Index Scan using udx_simple_hash_only_by_content on simple_hash_only  (cost=0.00..2.22 rows=1 width=41) (actual time=0.037..0.039 rows=1 loops=1)
Index Cond: (content = 'f0f453fd-8f7c-44f2-bcc0-124f706786cf'::text)
Planning Time: 0.214 ms
Execution Time: 0.065 ms
(4 rows)
momen=# insert into simple_hash_only values (1, 'f0f453fd-8f7c-44f2-bcc0-124f706786cf');
ERROR:  conflicting key value violates exclusion constraint "udx_simple_hash_only_by_content"
DETAIL:  Key (content)=(f0f453fd-8f7c-44f2-bcc0-124f706786cf) conflicts with existing key (content)=(f0f453fd-8f7c-44f2-bcc0-124f706786cf).
Enter fullscreen mode Exit fullscreen mode

Conclusion

The Simple Version

Same as TL;DR.

The Longer Version

TL;DR plus the following:

If you are storing long strings (greater than at least a few hundred characters on average) and/or searching with long patterns, use GIST instead of GIN, especially when your workload is comparatively write-heavy. If you know or can enforce the maximum length of the stored strings, a B-tree can be used for both uniqueness and point selects.

Top comments (0)