Tomas@PawSQL

Posted on Jun 21, 2023

Order by random Optimization

#sql #mysql #postgres #pawsql

Problem

Functionrand(MySQL) or random(PostgreSQL) returns a random floating-point value in the range 0 to 1.0. It is useful to get a random sample of dataset using following query.

select * from orders order by rand() limit 1;

This works fine if customer table is less than 10,000 rows. But when you have 1,000,000 rows, the overhead for sorting the rows becomes unacceptable. The reason is quiet clear: we sort all rows, but only keep one.

MySQL Plan

-> Limit: 1 row(s)  (actual time=203.465..203.465 rows=1 loops=1)
    -> Sort: `rand()`, limit input to 1 row(s) per chunk  (actual time=203.464..203.464 rows=1 loops=1)
        -> Stream results  (cost=20540.71 rows=200128) (actual time=0.035..173.789 rows=201800 loops=1)
            -> Table scan on l  (cost=20540.71 rows=200128) (actual time=0.030..121.293 rows=201800 loops=1)

PostgreSQL Plan

Limit  (cost=4962.22..4962.22 rows=1 width=66) (actual time=58.080..58.082 rows=1 loops=1)
  ->  Sort  (cost=4962.22..5391.96 rows=171898 width=66) (actual time=58.078..58.079 rows=1 loops=1)
        Sort Key: (random())
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Seq Scan on orders l  (cost=0.00..4102.73 rows=171898 width=66) (actual time=0.010..26.818 rows=171898 loops=1)

Solutions

If there is an unique index on a numeric column with uniformly distributed values, the query can be rewrite to more efficient one to avoid full table scan and all row sorting. create unique index ord_idx_key on orders(o_orderkey)

Rewritten SQL

select
    *
from
    orders
where
    o_orderkey >= (
    select
        floor( RAND() * ((select MAX(o_orderkey) from orders)-(select MIN(o_orderkey) from orders)) + (select MIN(o_orderkey) from orders)))
order by
    o_orderkey
limit 10;

Analyze Plan of MySQL

-> Limit: 10 row(s)  (cost=0.03 rows=3) (actual time=76.877..124.157 rows=10 loops=1)
    -> Filter: (orders.O_ORDERKEY >= floor(((rand() * <cache>(((select #3) - (select #4)))) + (select #5))))  (cost=0.03 rows=3) (actual time=76.876..124.155 rows=10 loops=1)
        -> Index scan on orders using PAW_IDX0334337551  (cost=0.03 rows=10) (actual time=0.015..111.731 rows=106410 loops=1)
        -> Select #3 (subquery in condition; run only once)
            -> Rows fetched before execution  (cost=0.00..0.00 rows=1) (actual time=0.000..0.000 rows=1 loops=1)
        -> Select #4 (subquery in condition; run only once)
            -> Rows fetched before execution  (cost=0.00..0.00 rows=1) (actual time=0.000..0.000 rows=1 loops=1)
        -> Select #5 (subquery in condition; run only once)
            -> Rows fetched before execution  (cost=0.00..0.00 rows=1) (actual time=0.000..0.000 rows=1 loops=1)

Analyze Plan of PostgreSQL

Limit  (cost=1.27..1.44 rows=1 width=66) (actual time=27.840..27.847 rows=1 loops=1)
->  Nested Loop  (cost=1.27..9724.17 rows=57299 width=66) (actual time=27.838..27.845 rows=1 loops=1)
      Join Filter: ((t1.o_orderkey)::double precision >= (floor(((random() * (($1 - $3))::double precision) + ($5)::double precision))))
      Rows Removed by Join Filter: 91974
      ->  Index Scan using ord_idx_key on orders t1  (cost=0.29..6714.94 rows=171898 width=58) (actual time=0.022..9.406 rows=91975 loops=1)
      ->  Materialize  (cost=0.98..1.02 rows=1 width=8) (actual time=0.000..0.000 rows=1 loops=91975)
            ->  Result  (cost=0.98..1.01 rows=1 width=8) (actual time=0.112..0.117 rows=1 loops=1)
                  InitPlan 2 (returns $1)
                    ->  Result  (cost=0.32..0.33 rows=1 width=4) (actual time=0.080..0.082 rows=1 loops=1)
                          InitPlan 1 (returns $0)
                            ->  Limit  (cost=0.29..0.32 rows=1 width=4) (actual time=0.079..0.079 rows=1 loops=1)
                                  ->  Index Only Scan Backward using ord_idx_key on orders  (cost=0.29..3704.51 rows=171898 width=4) (actual time=0.078..0.079 rows=1 loops=1)
                                        Index Cond: (o_orderkey IS NOT NULL)
                                        Heap Fetches: 0
                  InitPlan 4 (returns $3)
                    ->  Result  (cost=0.32..0.33 rows=1 width=4) (actual time=0.013..0.014 rows=1 loops=1)
                          InitPlan 3 (returns $2)
                            ->  Limit  (cost=0.29..0.32 rows=1 width=4) (actual time=0.013..0.013 rows=1 loops=1)
                                  ->  Index Only Scan using ord_idx_key on orders orders_1  (cost=0.29..3704.51 rows=171898 width=4) (actual time=0.013..0.013 rows=1 loops=1)
                                        Index Cond: (o_orderkey IS NOT NULL)
                                        Heap Fetches: 0
                  InitPlan 6 (returns $5)
                    ->  Result  (cost=0.32..0.33 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=1)
                          InitPlan 5 (returns $4)
                            ->  Limit  (cost=0.29..0.32 rows=1 width=4) (actual time=0.006..0.007 rows=1 loops=1)
                                  ->  Index Only Scan using ord_idx_key on orders orders_2  (cost=0.29..3704.51 rows=171898 width=4) (actual time=0.006..0.007 rows=1 loops=1)
                                        Index Cond: (o_orderkey IS NOT NULL)
                                        Heap Fetches: 0

Otherwise, we need to create a map table to solve this problem.

   create table orders_key_map ( row_id int not NULL primary key, o_orderkey int not null);
   SET @id = 0;
   INSERT INTO orders_key_map SELECT @id := @id + 1, id FROM holes;

   > select * from orders_key_map;
   +--------+-----------+
   | row_id | o_orderkey |
   +--------+-----------+
   |      1 |         100 |
   |      2 |         102 |
   |      3 |         300 |
   |      4 |         833 |
   |      5 |        1116 |
   +--------+-----------+

And then we can use following SQL to get a random sample of orders table:

   select
    *
   from
    orders o, orders_key_map m
   where
       o.o_orderkey = m.o_orderkey
    m.row_id >= (
    select
        floor( RAND() * ((select MAX(row_id) from orders_key_map)-(select MIN(row_id) from orders_key_map)) + (select MIN(row_id) from orders_key_map)))
   order by
    row_id
   limit 1;

About PawSQL

PawSQL is dedicated to automatic and intelligent database performance optimization. The products provided by PawSQL include:

PawSQL Cloud, an online automated SQL optimization tool that supports SQL auditing, intelligent query rewriting, cost-based index recommendations, suitable for database administrators and data application developers.
PawSQL Advisor, an IntelliJ plugin that is suitable for data application developers and can be installed via the IDEA/DataGrip marketplace by searching for "PawSQL Advisor" by name.
PawSQL Engine, which is the backend optimization engine of the PawSQL series of products, and provides optimization capability through http/json interfaces. PawSQL Engine is provided for deployment and installation as a docker image.

Contact Us

Site: https://app.pawslq.com
Email: service@pawsql.com
Twitter: https://twitter.com/pawsql

DEV Community

Order by random Optimization

Problem

Solutions

About PawSQL

Contact Us

Top comments (0)

Read next

Patroni, ETCD ve HAProxy kullanarak Cluster Kurulumu ve Yapılandırması

How to Use JSONB in PostgreSQL with Step-by-Step Instructions

Comprehensive Weather Data Analysis Using Python: Temperature, Rainfall Trends, and Visualizations

Creating a CRUD Application with PHP, MySQL, and Lithe