Franck Pachot for YugabyteDB Distributed PostgreSQL Database

Foreign Keys on Distributed SQL: don't worry, it scales

In a previous post, I demonstrated that joins across nodes can add some latency but still scale when the correct join method is used. We see something similar with foreign keys. There is a myth that SQL joins and foreign keys don't scale, but it comes from the incorrect assumption that they are processed row by row. When they are batched, the latency increases with the complexity of the SQL query and schema, but it still scales as tables grow larger and throughput higher.


I'll experiment on YugabyteDB Managed with three nodes in different availability zones.

I create 12 tables (demo1 to demo12) with 42 rows (column n from 1 to 42):

select format('
create table demo%1$s (n int primary key);
insert into  demo%1$s(n) select generate_series(1,42);
',generate_series(1,12)) ;
\gexec

I create one table (demo0) with 12 columns (n1 to n12) that are foreign keys referencing the 12 tables (demo1 to demo12):

drop table if exists demo0;
select format('
create table demo0 (%s) with (colocation=false);
', string_agg('n'||generate_series::text|| ' int references demo'||generate_series::text,','))
from generate_series(1,12);
\gexec

I insert 42 rows into this table, which has to check the 12 foreign keys for each row:

select format('
explain (analyze, costs off, summary on, dist)
 insert into  demo0(%s)
 select %s from generate_series(1,42) n;
', string_agg('n'||n::text,','),string_agg('n',','))
from generate_series(1,12) n;
\gexec

Here is my execution with its execution plan:

yugabyte=>  explain (analyze, costs off, summary on, dist)
  insert into  demo0(n1,n2,n3,n4,n5,n6,n7,n8,n9,n10,n11,n12)
  select n,n,n,n,n,n,n,n,n,n,n,n from generate_series(1,42) n;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Insert on demo0 (actual time=0.855..0.855 rows=0 loops=1)
   ->  Function Scan on generate_series n (actual time=0.012..0.020 rows=42 loops=1)
 Planning Time: 0.054 ms
 Trigger for constraint demo0_n1_fkey: time=15.284 calls=42
 Trigger for constraint demo0_n2_fkey: time=0.098 calls=42
 Trigger for constraint demo0_n3_fkey: time=0.060 calls=42
 Trigger for constraint demo0_n4_fkey: time=0.057 calls=42
 Trigger for constraint demo0_n5_fkey: time=0.063 calls=42
 Trigger for constraint demo0_n6_fkey: time=0.058 calls=42
 Trigger for constraint demo0_n7_fkey: time=0.057 calls=42
 Trigger for constraint demo0_n8_fkey: time=0.057 calls=42
 Trigger for constraint demo0_n9_fkey: time=0.071 calls=42
 Trigger for constraint demo0_n10_fkey: time=0.056 calls=42
 Trigger for constraint demo0_n11_fkey: time=0.059 calls=42
 Trigger for constraint demo0_n12_fkey: time=0.058 calls=42
 Execution Time: 16.939 ms
 Storage Read Requests: 12
 Storage Read Execution Time: 0.004 ms
 Storage Write Requests: 42
 Catalog Read Requests: 0
 Catalog Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 0.004 ms
 Peak Memory Usage: 151 kB
(24 rows)

The execution time is 17 milliseconds. In addition to the 42 write requests, there were 12 read requests, one per foreign key: they were all batched to reduce the roundtrips between nodes. 17 milliseconds is not perceptible by the user, and there are probably not many tables with so many foreign keys. I've also run it with different numbers of foreign keys.
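
If you want to try different numbers of foreign keys yourself, the same \gexec generators can be parameterized. Here is a sketch with 6 foreign keys instead of 12 (just change the upper bound of generate_series):

drop table if exists demo0;
select format('
create table demo0 (%s) with (colocation=false);
', string_agg('n'||generate_series::text||' int references demo'||generate_series::text,','))
from generate_series(1,6);
\gexec

select format('
explain (analyze, costs off, summary on, dist)
 insert into demo0(%s)
 select %s from generate_series(1,42) n;
', string_agg('n'||n::text,','),string_agg('n',','))
from generate_series(1,6) n;
\gexec

Recreate the 12-column demo0 with the original script before running the next examples.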

Checking the foreign keys introduces cross-node latency, but it is scalable: the number of read requests depends only on the number of foreign keys to check, not on the number of rows.

This is what you will generally see with YugabyteDB: when all tables are distributed, the latency depends on the complexity of the schema and the query, but it still scales.

To understand the performance beyond the read requests, you can use explain (analyze, dist, debug) to show the number of seek operations in the LSM-Tree.
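
For example, here is the same 42-row insert with the debug option added:

explain (analyze, dist, debug)
 insert into demo0(n1,n2,n3,n4,n5,n6,n7,n8,n9,n10,n11,n12)
 select n,n,n,n,n,n,n,n,n,n,n,n from generate_series(1,42) n;

For my 42-row insert with 12 foreign keys, I see: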

 Metric rocksdb_number_db_seek: 449
 Metric rocksdb_number_db_next: 863
 Metric rocksdb_number_db_seek_found: 449
 Metric rocksdb_number_db_next_found: 827
 Metric rocksdb_iter_bytes_read: 52384
 Metric docdb_keys_found: 504

There are 42x12=504 values to read from the 12 referenced tables, and that's what we see here, with a seek into the LSM-Tree for each of them. Those seeks involve random I/O, most of it from memory buffers, and CPU for key comparisons. This is no different between a monolithic and a distributed SQL database. Distributed SQL adds some network calls when all tables are distributed, but YugabyteDB sends them in batches so that they stay limited. This is usually acceptable in single-region deployments. For multi-region, you have the possibility to set leader preferences to control the data placement.

More details on the batch size

You may want to know more about this batching of the foreign key existence checks. The batch size is controlled by ysql_session_max_batch_size, which defaults to 3072 in my cluster and can be changed as a PostgreSQL session parameter. There's also a higher-level parameter, ysql_max_in_flight_ops, which defaults to 10000. Please be aware that the defaults are probably fine; I'm changing them in a lab only for this experiment.
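
For example, you can check the current value and lower it for your session only (a sketch; the SET form is demonstrated below, and SHOW reads back the current session value):

show ysql_session_max_batch_size;        -- 3072 by default in my cluster
set ysql_session_max_batch_size = 1024;  -- session-level change, only for experimentation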

You can observe this in the same demo if you run it with more than 42 rows. For example, with 300 rows (so that 300x12=3600 is higher than 3072), you will observe Storage Read Requests: 24 instead of Storage Read Requests: 12:

yugabyte=> explain (analyze, costs off, summary on, dist)
  insert into  demo0(n1,n2,n3,n4,n5,n6,n7,n8,n9,n10,n11,n12)
  select n,n,n,n,n,n,n,n,n,n,n,n from generate_series(1,300) n;

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Insert on demo0 (actual time=19.696..19.696 rows=0 loops=1)
   ->  Function Scan on generate_series n (actual time=0.035..0.090 rows=300 loops=1)
 Planning Time: 0.145 ms
 Trigger for constraint demo0_n1_fkey: time=60.139 calls=300
 Trigger for constraint demo0_n2_fkey: time=0.258 calls=300
 Trigger for constraint demo0_n3_fkey: time=0.247 calls=300
 Trigger for constraint demo0_n4_fkey: time=10.559 calls=300
 Trigger for constraint demo0_n5_fkey: time=0.275 calls=300
 Trigger for constraint demo0_n6_fkey: time=0.231 calls=300
 Trigger for constraint demo0_n7_fkey: time=0.247 calls=300
 Trigger for constraint demo0_n8_fkey: time=0.239 calls=300
 Trigger for constraint demo0_n9_fkey: time=0.232 calls=300
 Trigger for constraint demo0_n10_fkey: time=0.248 calls=300
 Trigger for constraint demo0_n11_fkey: time=0.228 calls=300
 Trigger for constraint demo0_n12_fkey: time=0.231 calls=300
 Execution Time: 93.230 ms
 Storage Read Requests: 24
 Storage Read Execution Time: 0.003 ms
 Storage Write Requests: 300
 Catalog Read Requests: 9
 Catalog Read Execution Time: 11.765 ms
 Catalog Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 11.769 ms
 Peak Memory Usage: 1768 kB
(25 rows)

You can also see that not all constraint checks were flushed by demo0_n1_fkey: demo0_n4_fkey also takes its share of the flush time.

If you set ysql_session_max_batch_size=3600 you will be back to 12 read requests:

yugabyte=> set ysql_session_max_batch_size=3600;
SET
Time: 33.527 ms
yugabyte=> explain (analyze, costs off, summary on, dist)
  insert into  demo0(n1,n2,n3,n4,n5,n6,n7,n8,n9,n10,n11,n12)
  select n,n,n,n,n,n,n,n,n,n,n,n from generate_series(1,300) n;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Insert on demo0 (actual time=5.445..5.445 rows=0 loops=1)
   ->  Function Scan on generate_series n (actual time=0.040..0.110 rows=300 loops=1)
 Planning Time: 0.050 ms
 Trigger for constraint demo0_n1_fkey: time=55.210 calls=300
 Trigger for constraint demo0_n2_fkey: time=0.296 calls=300
 Trigger for constraint demo0_n3_fkey: time=0.248 calls=300
 Trigger for constraint demo0_n4_fkey: time=0.275 calls=300
 Trigger for constraint demo0_n5_fkey: time=0.253 calls=300
 Trigger for constraint demo0_n6_fkey: time=0.248 calls=300
 Trigger for constraint demo0_n7_fkey: time=0.238 calls=300
 Trigger for constraint demo0_n8_fkey: time=0.241 calls=300
 Trigger for constraint demo0_n9_fkey: time=0.242 calls=300
 Trigger for constraint demo0_n10_fkey: time=0.245 calls=300
 Trigger for constraint demo0_n11_fkey: time=0.240 calls=300
 Trigger for constraint demo0_n12_fkey: time=0.252 calls=300
 Execution Time: 63.757 ms
 Storage Read Requests: 12
 Storage Read Execution Time: 0.003 ms
 Storage Write Requests: 300
 Catalog Read Requests: 0
 Catalog Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 0.003 ms
 Peak Memory Usage: 1597 kB
(24 rows)

This lab is designed to help you understand how things work. It's unlikely that your OLTP application inserts a batch of 300 rows into a table with 12 foreign keys, and if it does, a response time of 60 milliseconds is probably acceptable. So don't worry about optimizing prematurely if it's not necessary. Distributing the tables without placement constraints can be helpful for elasticity: when you add new nodes, some tablets move automatically, and when data grows, some tablets are automatically split. In a multi-region setup where you need to reduce latency, you can start thinking about placement preferences. This ensures that the tablet leaders you read from and write to are close to the node you are connected to, reducing latency.
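
As an illustration of leader preferences, here is a minimal sketch using a tablespace, assuming a YugabyteDB version that supports leader_preference in replica_placement (the cloud, region, and zone names are placeholders for your own topology):

create tablespace geo_preferred with (replica_placement='{
  "num_replicas": 3,
  "placement_blocks": [
    {"cloud":"gcp","region":"us-east1","zone":"us-east1-b","min_num_replicas":1,"leader_preference":1},
    {"cloud":"gcp","region":"us-central1","zone":"us-central1-b","min_num_replicas":1,"leader_preference":2},
    {"cloud":"gcp","region":"us-west1","zone":"us-west1-b","min_num_replicas":1}
  ]
}');
-- tables placed in this tablespace keep their tablet leaders in the preferred region when possible
create table demo_geo (n int primary key) tablespace geo_preferred;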
