Franck Pachot for YugabyteDB Distributed PostgreSQL Database

Posted on Oct 19 • Edited on Oct 22

Indexing for a Scalable Serialization Isolation Level

#sql #yugabytedb #postgres #database

The previous post used the primary key to record read intents in YugabyteDB. Serializable isolation can be scalable, but it requires optimal schema and indexes with a good knowledge of how the database works.
Indexes define the predicate locks for PostgreSQL Serializable Snapshot Isolation, and the primary key defines the range locks used by YugabyteDB's Two-Phase Commit.
Here an example taken from nathanl/postgresql_serializable_isolation.sql which simply inserts two rows from two transactions:


create table users(
  id serial not null primary key,
  username varchar not null
);

-- create index on users(username asc);

-- session 1

begin transaction isolation level serializable;
select * from users where username = 'alice';
insert into users ("username") values ('alice');

-- session 2
\! psql -c '\timing on' -c "begin transaction isolation level serializable; select * from users where username = 'bob'; insert into users ("username") values ('bob'); commit;" &

-- session 1

\! sleep 1
commit;
\! sleep 1
select * from users;

I've run this at a serializable isolation level. Let's examine the performance in PostgreSQL and YugabyteDB, with and without an index on "username".

PostgreSQL Seq Scan

Without an index on "username", the SELECT does a Full Table Scan.
Session 2 can commit, but session 1 detects a conflict on commit:


COMMIT
Time: 2.844 ms

postgres=*# commit;
ERROR:  could not serialize access due to read/write dependencies among transactions
DETAIL:  Reason code: Canceled on identification as a pivot, during commit attempt.
HINT:  The transaction might succeed if retried.

This serializable error is a false positive. The two sessions read different values, and they are serializable. PostgreSQL employs predicate locks but only records the predicates used for the scan operation. The lock covers the entire table since it was read with a Seq Scan.

PostgreSQL Index Scan

I run the same with the index (uncomment the CREATE INDEX in the script), and there's no error:


COMMIT
Time: 2.581 ms

postgres=*# commit;
COMMIT
postgres=# \! sleep 1
postgres=# select * from users;
 id | username
----+----------
  1 | alice
  2 | bob

In this case, PostgreSQL uses the index condition as a predicate lock, and they do not conflict between the two transactions.

Note that in this scenario, the "username" column should be declared unique instead of creating an additional index. This creates an implicit index and has the same effect:

create table users(
  id serial not null primary ley,
  username varchar not null unique
);

YugabyteDB

YugabyteDB uses Wait-on-Conflict to prevent the serializable error, allowing both transactions to commit even without the index.

yugabyte=*# commit;
COMMIT

COMMIT
Time: 926.484 ms

I waited one second before committing to the first session, and this delay is reflected in the response time of the second session, which waited for the first one. The reason for this is similar to what we have observed with PostgreSQL: the read lock intent is at the table level.

This is visible as STRONG_READ at the level of the "relation," which is blocked by the insert operation that acquires a WEAK_WRITE lock on the relation:

yugabyte=*# select
 substr(ybdetails->>'transactionid',1,8) transactionid , substr(ybdetails->>'blocked_by',1,8) blocked_by,
 relkind, locktype, mode, relname , ybdetails->'keyrangedetails'->'cols'
from pg_locks natural join (select oid relation, relname, relkind from pg_class) pg_class
order by transactionid, relkind desc, relname, case locktype when 'relation' then 1 when 'keyrange' then 2 when 'row' then 3 end, mode desc, locktype desc;
 transactionid | blocked_by | relkind | locktype |         mode         | relname | ?column?
---------------+------------+---------+----------+----------------------+---------+----------
 74e1329c      |            | r       | relation | WEAK_READ,WEAK_WRITE | users   | null
 74e1329c      |            | r       | relation | STRONG_READ          | users   | null
 74e1329c      |            | r       | row      | STRONG_WRITE         | users   | ["1"]
 74e1329c      |            | r       | row      | STRONG_READ          | users   | ["1"]
 c5611a22      | ["74e132   | r       | relation | STRONG_READ          | users   | null

The index will add the range lock intents in the index, but there's still a conflict with the table lock intents at "relation" level:

yugabyte=*# select
 substr(ybdetails->>'transactionid',1,8) transactionid , substr(ybdetails->>'blocked_by',1,8) blocked_by,
 relkind, locktype, mode, relname , ybdetails->'keyrangedetails'->'cols'
from pg_locks natural join (select oid relation, relname, relkind from pg_class) pg_class
order by transactionid, relkind desc, relname, case locktype when 'relation' then 1 when 'keyrange' then 2 when 'row' then 3 end, mode desc, locktype desc;
 transactionid | blocked_by | relkind | locktype |         mode         |      relname       |                  ?column?
---------------+------------+---------+----------+----------------------+--------------------+---------------------------------------------
 02a108e3      |            | r       | relation | WEAK_READ,WEAK_WRITE | users              | null
 02a108e3      |            | r       | relation | STRONG_READ          | users              | null
 02a108e3      |            | r       | row      | STRONG_WRITE         | users              | ["1"]
 02a108e3      |            | r       | row      | STRONG_READ          | users              | ["1"]
 02a108e3      |            | i       | relation | WEAK_WRITE           | users_username_idx | null
 02a108e3      |            | i       | keyrange | WEAK_WRITE           | users_username_idx | ["\"alice\""]
 02a108e3      |            | i       | row      | STRONG_WRITE         | users_username_idx | ["\"alice\"", "\"H\\x80\\x00\\x00\\x01!\""]
 5b82a9e3      | ["02a108   | r       | relation | STRONG_READ          | users              | null

As a consequence of it, simply adding the index in YugabyteDB doesn't reduce the wait. If we want to minimize the latency for this use case, it is preferable to define the username as a primary key rather than the generated identifier:

create table users(
  id serial not null unique,
  username varchar not null primary key
);

In PostgreSQL, table rows are stored in heap tables, and all indexes, including the primary key, are considered secondary indexes. However, in YugabyteDB, the rows are stored in the primary key LSM Tree, so defining the primary key for the main access pattern is essential. There are no limitations because all secondary indexes are global and shared on their key.
Another reason the primary key matters in YugabyteDB is to improve serializable locks. In PostgreSQL, even if the scan predicates define the read intents, they are stored in memory. On the other hand, YugabyteDB is distributed and doesn't use a single-node shared memory, so it stores the intents with the table row in the primary key index.

With such a definition, both transactions can be committed without waiting for the other:


COMMIT
Time: 147.917 ms

yugabyte=*# commit;
COMMIT

Here are the locks just before the commits:

 transactionid | blocked_by | relkind | locktype |         mode         |   relname    |    ?column?
---------------+------------+---------+----------+----------------------+--------------+-----------------
 350a076c      |            | r       | relation | WEAK_READ,WEAK_WRITE | users        | null
 350a076c      |            | r       | relation | WEAK_READ            | users        | null
 350a076c      |            | r       | row      | STRONG_WRITE         | users        | ["\"alice\""]
 350a076c      |            | r       | row      | STRONG_READ          | users        | ["\"alice\""]
 350a076c      |            | r       | row      | STRONG_READ          | users        | ["\"alice\""]
 350a076c      |            | i       | relation | WEAK_READ,WEAK_WRITE | users_id_key | null
 350a076c      |            | i       | keyrange | WEAK_READ,WEAK_WRITE | users_id_key | ["1"]
 350a076c      |            | i       | row      | STRONG_WRITE         | users_id_key | ["1", "null"]
 350a076c      |            | i       | row      | STRONG_READ          | users_id_key | ["1", "null"]
 97f5bd76      |            | r       | relation | WEAK_READ,WEAK_WRITE | users        | null
 97f5bd76      |            | r       | relation | WEAK_READ            | users        | null
 97f5bd76      |            | r       | row      | STRONG_WRITE         | users        | ["\"bob\""]
 97f5bd76      |            | r       | row      | STRONG_READ          | users        | ["\"bob\""]
 97f5bd76      |            | r       | row      | STRONG_READ          | users        | ["\"bob\""]
 97f5bd76      |            | i       | relation | WEAK_READ,WEAK_WRITE | users_id_key | null
 97f5bd76      |            | i       | keyrange | WEAK_READ,WEAK_WRITE | users_id_key | ["101"]
 97f5bd76      |            | i       | row      | STRONG_WRITE         | users_id_key | ["101", "null"]
 97f5bd76      |            | i       | row      | STRONG_READ          | users_id_key | ["101", "null"]

This situation does not involve any blocking. The STRONG_READ intents only affect individual rows because each transaction accesses different rows.

The serializable isolation level can be scalable. However, like many other aspects of SQL database performance, its performance depends on the definition of primary keys and secondary indexes. In SQL, "serializable" doesn't mean "serialized". The example above is serializable (with the same effect as if serialized), but the two transactions can run concurrently.

Top comments (2)

Rafael Ponte • Oct 25

Thanks for this article as well, Franck 👏🏻

At this moment, I am digesting this sentence below because it sounds essential:

Serializable isolation can be scalable, but it requires optimal schema and indexes with a good knowledge of how the database works.

Now, besides the retry logic the application has to deal with, we developers must pay attention to how to design the schema to take the best out of the SERIALIZABLE isolation level in each database.

If you have other examples about this specific subject, please write new blog posts about it 🙏🏻

Rafael Ponte • Oct 25 • Edited

By the way, the Postgres documentation is AMAZING! It has some sentences about those issues with SERIALIZABLE isolation level well documented:

postgresql.org/docs/current/transa...

For optimal performance when relying on Serializable transactions for concurrency control, these issues should be considered:

[...]

When the system is forced to combine multiple page-level predicate locks into a single relation-level predicate lock because the predicate lock table is short of memory, an increase in the rate of serialization failures may occur. You can avoid this by increasing max_pred_locks_per_transaction, max_pred_locks_per_relation, and/or max_pred_locks_per_page.

A sequential scan will always necessitate a relation-level predicate lock. This can result in an increased rate of serialization failures. It may be helpful to encourage the use of index scans by reducing random_page_cost and/or increasing cpu_tuple_cost. Be sure to weigh any decrease in transaction rollbacks and restarts against any overall change in query execution time.

DEV Community

Indexing for a Scalable Serialization Isolation Level

PostgreSQL Seq Scan

PostgreSQL Index Scan

YugabyteDB

Top comments (2)

Read next

The Strength of Vertica DB: A Case Study in Sales Analytics

Populating a Vertica Database with Random Data

How to Install Vertica DB Client (vsql) on Ubuntu 24.04

CTE | Common Table Expression | Real world scenarios examples