
Franck Pachot for YugabyteDB Distributed PostgreSQL Database


Batched Nested Loop for Join With Large Pagination

TL;DR: You can skip to the last execution plan to see the best solution (Order-Preserving Batched Nested Loop)


In the previous post, we saw that in a Distributed SQL database, the response time is affected by the pagination LIMIT and the one-to-many cardinality. For larger paginations, or when many rows do not match (especially with an inner join), the number of Nested Loops increases, resulting in longer response times. To solve this issue with YugabyteDB, you can use Batched Nested Loops. However, you may need to modify your paginating query to apply the limit to the driving table.

Note: I mentioned LIMIT, but in PostgreSQL and YugabyteDB, it is the same as FETCH FIRST ... ROWS ONLY, which is the ANSI SQL way to do it.
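For example, the following two statements, on the "one" table created below, return the same result; the second is the standard spelling:

-- equivalent pagination syntaxes on the "one" table created below
select * from one order by created_at desc limit 10000;
select * from one order by created_at desc fetch first 10000 rows only;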

I'm building a similar example, but in the category '🍓', where none of the one million rows in table "one" have a matching row in table "many":

drop table if exists one, many;
create extension if not exists pgcrypto;

create table one (
 primary key (one_id)
 , one_id uuid default gen_random_uuid()
 , category text
 , created_at timestamptz default clock_timestamp()
);

create table many( 
 primary key (many_id)
 , many_id uuid default gen_random_uuid()
 , one_id uuid not null references one(one_id)
 , value float
);

insert into  one (category)
 select '🍓' from generate_series(1,1000000)
;

-- Access to "one" by category, ordered by "created_at"
create index one_category_hash_created_desc_id
 on one(category, created_at desc, one_id)
;

-- Access to "many" by its foreign key to "one"
create index many_one_asc
 on many ( one_id ) include ( value )
;

I have not analyzed the tables and will use hints to guarantee the Nested Loop from "one" to "many".
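As a side note, a similar effect can be sketched without pg_hint_plan by using the standard PostgreSQL planner switches, although they only rule out the other join methods and do not fix the join order the way the Leading hint does:

-- session-level alternative to the NestLoop hint: disable the competing join methods
set enable_hashjoin  = off;
set enable_mergejoin = off;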

Non-batched Nested Loops

I'm running a query that joins "one" to "many", sorts the result on a column from "one", and limits it to the first 10000 rows:

explain (costs off, analyze, dist)
/*+ Leading( (one many) ) NestLoop( one many) 
*/
select one.category, one.created_at, many.value
 from one
 left outer join many using(one_id)
 where one.category='🍓'
 order by one.created_at desc
 fetch first 10000 rows only;

                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Limit (actual time=1.951..4494.646 rows=10000 loops=1)
   ->  Nested Loop Left Join (actual time=1.950..4489.391 rows=10000 loops=1)
         ->  Index Only Scan using one_category_hash_created_desc_id on one
               (actual time=1.241..35.741 rows=10000 loops=1)
               Index Cond: (category = '🍓'::text)
               Heap Fetches: 0
               Storage Index Read Requests: 10
               Storage Index Read Execution Time: 1.103 ms
         ->  Index Only Scan using many_one_asc on many 
               (actual time=0.431..0.431 rows=0 loops=10000)
               Index Cond: (one_id = one.one_id)
               Heap Fetches: 0
               Storage Index Read Requests: 1
               Storage Index Read Execution Time: 0.394 ms
 Execution Time: 4499.898 ms
 Storage Read Requests: 10010
 Storage Read Execution Time: 3943.079 ms
 Storage Execution Time: 3943.079 ms
 Peak Memory Usage: 132 kB
(22 rows)

This takes a long time because of the 10000 loops. If you try the same with an inner join instead of an outer join, it would be one million loops, and given the latency in a distributed database, that is not practical.
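For reference, that inner-join variant (not executed here) is the same query without LEFT OUTER; since no '🍓' row has a match in "many", all one million outer rows would have to be probed before the limit could ever be satisfied:

/*+ Leading( (one many) ) NestLoop( one many) */
select one.category, one.created_at, many.value
 from one
 join many using(one_id)
 where one.category='🍓'
 order by one.created_at desc
 fetch first 10000 rows only;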

Batched Nested Loops

I can batch those loops with a Batched Nested Loop, setting the batch size to its maximum:

explain (costs off, analyze, dist)
/*+ Leading( (one many) ) NestLoop( one many) 
    Set( yb_bnl_batch_size 1024 )
*/
select one.category, one.created_at, many.value
 from one
 left outer join many using(one_id)
 where one.category='🍓'
 order by one.created_at desc
 fetch first 10000 rows only;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Limit (actual time=6736.806..6739.998 rows=10000 loops=1)
   ->  Sort (actual time=6736.804..6737.940 rows=10000 loops=1)
         Sort Key: one.created_at DESC
         Sort Method: top-N heapsort  Memory: 1247kB
         ->  YB Batched Nested Loop Left Join 
               (actual time=7.996..6450.373 rows=1000000 loops=1)
               Join Filter: (one.one_id = many.one_id)
               ->  Index Only Scan using one_category_hash_created_desc_id on one 
                     (actual time=1.282..511.405 rows=1000000 loops=1)
                     Index Cond: (category = '🍓'::text)
                     Heap Fetches: 0
                     Storage Index Read Requests: 977
                     Storage Index Read Execution Time: 1.905 ms
               ->  Index Only Scan using many_one_asc on many 
                     (actual time=4.907..4.907 rows=0 loops=977)
                     Index Cond: (one_id = ANY (ARRAY[one.one_id, $1, $2, ..., $1023]))
                     Heap Fetches: 0
                     Storage Index Read Requests: 1
                     Storage Index Read Execution Time: 4.043 ms
 Execution Time: 6741.948 ms
 Storage Read Requests: 1954
 Storage Read Execution Time: 3951.758 ms
 Storage Execution Time: 3951.758 ms
 Peak Memory Usage: 4480 kB
(26 rows)


This reduced the number of loops to loops=977, but it had to read too many rows: rows=1000000. Reading all rows from the outer table is not scalable. The reason is that the Batched Nested Loop doesn't preserve the ordering (before #19589), so all rows have to be sorted before the pagination limit can be applied.
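Note that yb_bnl_batch_size is a regular session parameter; instead of repeating the Set() hint in every query, the same batching can be enabled for the whole session:

-- enable Batched Nested Loops (batch size 1024) for all queries in this session,
-- same parameter as in the Set() hint above
set yb_bnl_batch_size = 1024;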

Batched Nested Loop and LIMIT in subquery

The solution in this case is to write the query differently and apply the pagination to the driving table. For this, I replace "one" with a subquery that contains all the filters (WHERE clause) and the pagination (ORDER BY ... LIMIT):

explain (costs off, analyze, dist)
/*+ Leading( (one many) ) NestLoop( one many) 
    Set( yb_bnl_batch_size 1024 )
*/
select one.category, one.created_at, many.value
 from (
    select * from one
    where one.category='🍓'
    order by one.created_at desc
    fetch first 10000 rows only
 ) one
 left outer join many using(one_id)
 order by one.created_at desc
;

                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Sort (actual time=71.749..72.805 rows=10000 loops=1)
   Sort Key: one.created_at DESC
   Sort Method: quicksort  Memory: 853kB
   ->  YB Batched Nested Loop Left Join
         (actual time=7.819..67.987 rows=10000 loops=1)
         Join Filter: (one.one_id = many.one_id)
         ->  Limit (actual time=1.391..8.820 rows=10000 loops=1)
               ->  Index Only Scan using one_category_hash_created_desc_id on one
                     (actual time=1.389..6.697 rows=10000 loops=1)
                     Index Cond: (category = '🍓'::text)
                     Heap Fetches: 0
                     Storage Index Read Requests: 10
                     Storage Index Read Execution Time: 1.232 ms
         ->  Index Only Scan using many_one_asc on many
               (actual time=4.778..4.778 rows=0 loops=10)
               Index Cond: (one_id = ANY (ARRAY[one.one_id, $1, $2, ..., $1023]))
               Heap Fetches: 0
               Storage Index Read Requests: 1
               Storage Index Read Execution Time: 3.823 ms
 Execution Time: 74.776 ms
 Storage Read Requests: 20
 Storage Read Execution Time: 39.462 ms
 Storage Execution Time: 39.462 ms
 Peak Memory Usage: 2806 kB
(26 rows)

Now we are back to reading only rows=10000 rows from "one" and only loops=10 loops to "many": 10 batches of up to 1024 identifiers each. The time complexity is back to O(1): it depends only on the LIMIT, with at most one remote call per thousand rows.
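The 10 loops follow directly from the batch size: 10000 identifiers from the driving table are split into batches of at most 1024, as a quick check shows:

-- 10000 outer-table ids divided into batches of 1024 -> 10 batches (loops)
select ceil(10000 / 1024.0) as batches;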


With the same indexing method as in the previous post, my join remains scalable by pushing down the pagination to the driving table and enabling yb_bnl_batch_size. From a developer's point of view, this is still easier than dealing with manual partitioning and physical colocation to reduce latency, or de-normalizing to avoid joins.

Order-Preserving Batched Nested Loop in 2.20.2

The issue mentioned above is fixed in 2.20.2. Look at the following execution plan, where Sort Keys: one.created_at DESC appears under the join node:

yugabyte=# select version();
                                                                                                                          version                                       
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 11.2-YB-2.20.2.0-b0 on aarch64-unknown-linux-gnu, compiled by clang version 16.0.6 (/opt/yb-build/llvm/yb-llvm-v16.0.6-yb-3-1695119965-1e6329f4-centos7-aarch64-build/src/llvm-project/clang 1e6329f40e5c531c09ade7015278078682293ebd), 64-bit
(1 row)

yugabyte=# explain (costs off, analyze, dist)
/*+ Set( yb_bnl_batch_size 1024 ) */
select one.category, one.created_at, many.value
 from one
 left outer join many using(one_id)
 where one.category='🍓'
 order by one.created_at desc
 fetch first 10000 rows only;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit (actual time=5.520..50.596 rows=10000 loops=1)
   ->  YB Batched Nested Loop Left Join (actual time=5.519..47.795 rows=10000 loops=1)
         Join Filter: (one.one_id = many.one_id)
         Sort Keys: one.created_at DESC
         ->  Index Only Scan using one_category_hash_created_desc_id on one (actual time=1.077..5.495 rows=10240 loops=1)
               Index Cond: (category = '🍓'::text)
               Heap Fetches: 0
               Storage Index Read Requests: 10
               Storage Index Read Execution Time: 0.959 ms
         ->  Index Only Scan using many_one_asc on many (actual time=2.865..2.865 rows=0 loops=10)
               Index Cond: (one_id = ANY (ARRAY[one.one_id, $1, $2, ..., $1023]))
               Heap Fetches: 0
               Storage Index Read Requests: 1
               Storage Index Read Execution Time: 2.037 ms
 Planning Time: 0.502 ms
 Execution Time: 53.284 ms
 Storage Read Requests: 20
 Storage Read Execution Time: 21.330 ms
 Storage Write Requests: 0
 Catalog Read Requests: 0
 Catalog Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 21.330 ms
 Peak Memory Usage: 2023 kB
(24 rows)
