Franck Pachot for YugabyteDB Distributed PostgreSQL Database

Approximate Count Distinct in YugabyteDB (and PostgreSQL) with HyperLogLog

Some extensions are very easy to use in YugabyteDB when they are bundled and accessible with a simple CREATE EXTENSION. That's the case for hll, which provides HyperLogLog functions for a fast approximate distinct count:

yugabyte=# \x
Expanded display is on.

yugabyte=# select * from pg_available_extensions
           where comment ilike '%hyperloglog%';

-[ RECORD 1 ]-----+----------------------------------
name              | hll
default_version   | 2.16
installed_version |
comment           | type for storing hyperloglog data

yugabyte=# \x
Expanded display is off.

yugabyte=# create extension if not exists hll;
CREATE EXTENSION

yugabyte=# \dx+ hll

                          Objects in extension "hll"
                              Object description
-------------------------------------------------------------------
 cast from bigint to hll_hashval
 cast from bytea to hll
 cast from hll to hll
 cast from integer to hll_hashval
 function hll(hll,integer,boolean)
 function hll_add(hll,hll_hashval)
 function hll_add_agg(hll_hashval)
 function hll_add_agg(hll_hashval,integer)
 function hll_add_agg(hll_hashval,integer,integer)
...

Example

I create a table with 10 million rows: id is unique (10 million distinct values), val1 has only 42 distinct values, and val2 has one million distinct values:

yugabyte=# create table demo (
            id bigint primary key, val1 int, val2 int
            );

CREATE TABLE

yugabyte=# insert into demo
           select n, mod(n,42) val1 , mod(n,1e6) val2
           from generate_series(1,1e7) n;

INSERT 0 10000000

Here is the exact number of distinct values for those 3 columns:

yugabyte=# \timing on

yugabyte=# select
              count(distinct id)
             ,count(distinct val1)
             ,count(distinct val2)
             from demo
            ;

  count   | count |  count
----------+-------+---------
 10000000 |    42 | 1000000
(1 row)

Time: 28592.003 ms (00:28.592)


This is exact but it takes time. Let's use the HyperLogLog functions.

hll_cardinality

The idea is to build a hash from the values (hll_hash_bigint), aggregate the hashes (hll_add_agg), and then estimate the count (hll_cardinality):

yugabyte=# select
              hll_cardinality(hll_add_agg(hll_hash_bigint(id)))
             ,hll_cardinality(hll_add_agg(hll_hash_bigint(val1)))
             ,hll_cardinality(hll_add_agg(hll_hash_bigint(val2)))
             from demo;

 hll_cardinality  | hll_cardinality | hll_cardinality
------------------+-----------------+------------------
 9656451.96827238 |              42 | 995263.714893331
(1 row)

Time: 20837.501 ms (00:20.838)


This gives an approximate result and is a bit faster. The interesting thing is that the error is proportional to the result: I have an exact result for 42, a 0.5% approximation for 1 million, and 3.5% for 10 million.
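For reference, the theoretical relative standard error of HyperLogLog is about 1.04/sqrt(m), where m is the number of registers (2^log2m). With the extension's default log2m of 11, that is 1.04/sqrt(2048), roughly 2.3%, which is the order of magnitude of the deviations seen here. Very small sets are kept in the hll type's explicit representation, which is why the result for 42 is exact.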

The reason for the timing difference is visible in the execution plan, especially with YugabyteDB, which shows the Peak Memory Usage:

yugabyte=# explain (costs off, analyze, buffers)
           select
              count(distinct id)
             ,count(distinct val1)
             ,count(distinct val2)
             from demo
            ;

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Aggregate (actual time=28207.491..28207.491 rows=1 loops=1)
   Buffers: temp read=90975 written=91146
   ->  Seq Scan on demo (actual time=2.592..19544.667 rows=10000000 loops=1)
 Planning Time: 5.891 ms
 Execution Time: 28210.625 ms
 Peak Memory Usage: 18619 kB
(6 rows)

Time: 28249.986 ms (00:28.250)

In both cases, it takes 19 seconds to scan all rows (actual time=2.592..19544.667 rows=10000000), but then the rows go to memory to be aggregated. The Peak Memory Usage is 18 MB. The work_mem is not sufficient, so the aggregation has to spill to temporary files on disk: about 700 MB have been read and written (read=90975 written=91146 are 8k buffers). This takes an additional 9 seconds here.
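As a side note, if the exact count is required, the spill to temporary files could be avoided by raising work_mem for the session. The 256MB below is just an illustrative value: the memory actually needed depends on the data and may not be acceptable on a busy OLTP node:

yugabyte=# set work_mem = '256MB';
SET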

Let's do the same with the approximate count:

yugabyte=# explain (costs off, analyze, buffers)
           select
              hll_cardinality(hll_add_agg(hll_hash_bigint(id)))
             ,hll_cardinality(hll_add_agg(hll_hash_bigint(val1)))
             ,hll_cardinality(hll_add_agg(hll_hash_bigint(val2)))
             from demo;

                                  QUERY PLAN
-----------------------------------------------------------------------------
 Aggregate (actual time=20028.517..20028.517 rows=1 loops=1)
   ->  Seq Scan on demo (actual time=2.180..18538.887 rows=10000000 loops=1)
 Planning Time: 2.562 ms
 Execution Time: 20028.832 ms
 Peak Memory Usage: 330 kB
(5 rows)

Time: 20049.146 ms (00:20.049)


The time to scan the table is the same but now, thanks to the HyperLogLog aggregation, the memory usage is small (Peak Memory Usage: 330 kB) and there's no need to read and write to temporary files. This explains why it is faster than the exact count.
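The precision/size trade-off can also be tuned: the hll_add_agg overloads listed in the extension objects above take additional parameters, the first one being log2m (the number of registers is 2^log2m), so a larger value lowers the error at the cost of a bigger hll value. A quick sketch, with 14 as an arbitrary example:

yugabyte=# select
              hll_cardinality(hll_add_agg(hll_hash_bigint(id), 14))
             from demo;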

Maintained aggregate

If I want to reduce the response time further, I can pre-build a table with the HyperLogLog aggregation:

yugabyte=# create table demo_hll (
              hash_code int primary key
              , id_hll hll, val1_hll hll, val2_hll hll
             );
CREATE TABLE

Time: 271.705 ms

yugabyte=# insert into demo_hll select  
             yb_hash_code(id)%16             
             ,hll_add_agg(hll_hash_bigint(id))
             ,hll_add_agg(hll_hash_bigint(val1))
             ,hll_add_agg(hll_hash_bigint(val2))
             from demo
             group by yb_hash_code(id)%16
             ;
INSERT 0 16

Time: 25106.458 ms (00:25.106)

Because I will update this table for each insert, I've created 16 buckets to keep it scalable. The table is still small:

yugabyte=# select pg_size_pretty(pg_table_size('demo_hll'));

 pg_size_pretty
----------------
 265 MB
(1 row)


The hll values from the different buckets can be combined with hll_union_agg before getting the hll_cardinality:

yugabyte=# select
              hll_cardinality(hll_union_agg(id_hll))
             ,hll_cardinality(hll_union_agg(val1_hll))
             ,hll_cardinality(hll_union_agg(val2_hll))
             from demo_hll;

 hll_cardinality  | hll_cardinality | hll_cardinality
------------------+-----------------+------------------
 9656451.96827238 |              42 | 995263.714893331
(1 row)

Time: 12.453 ms

yugabyte=# explain (costs off, analyze)
           select
              hll_cardinality(hll_union_agg(id_hll))
             ,hll_cardinality(hll_union_agg(val1_hll))
             ,hll_cardinality(hll_union_agg(val2_hll))
             from demo_hll;

                              QUERY PLAN
-----------------------------------------------------------------------
 Aggregate (actual time=2.220..2.220 rows=1 loops=1)
   ->  Seq Scan on demo_hll (actual time=0.413..1.921 rows=16 loops=1)
 Planning Time: 1.997 ms
 Execution Time: 2.283 ms
 Peak Memory Usage: 462 kB
(5 rows)


The approximation is still good even when combined from multiple buckets, but this would be of little interest if it had to be rebuilt each time we want a fresh count. The hll extension provides hll_add to update the HyperLogLog structures when adding new values, and I can use a trigger to do it automatically for new inserts:

CREATE OR REPLACE FUNCTION demo_hll()
  RETURNS TRIGGER AS
$$
BEGIN
 UPDATE demo_hll
 SET id_hll = hll_add(id_hll   , hll_hash_bigint(NEW.id)),
     val1_hll = hll_add(val1_hll , hll_hash_bigint(NEW.val1)),
     val2_hll = hll_add(val2_hll , hll_hash_bigint(NEW.val2))
  WHERE hash_code = yb_hash_code(NEW.id)%16;
 RETURN NEW;
END;
$$
LANGUAGE plpgsql;

CREATE TRIGGER demo_trigger
AFTER INSERT ON demo
FOR EACH ROW
EXECUTE FUNCTION demo_hll();


Now I test with some inserts to see whether the approximate distinct count stays accurate when read from the aggregate:

yugabyte=# insert into demo
           select n+1e7, mod(n,99) val1 , n val2
           from generate_series(1,1e3) n;

INSERT 0 1000
Time: 2048.925 ms (00:02.049)

yugabyte=# select
              hll_cardinality(hll_union_agg(id_hll))
             ,hll_cardinality(hll_union_agg(val1_hll))
             ,hll_cardinality(hll_union_agg(val2_hll))
             from demo_hll;

 hll_cardinality  | hll_cardinality | hll_cardinality
------------------+-----------------+------------------
 9656451.96827238 |              99 | 995263.714893331
(1 row)


The approximate count has been updated (the new rows have 99 distinct integers in val1, instead of the 42 before).
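To cross-check, the exact count on val1 alone is cheap to run and should also return 99, since the new values 0..98 include the previous 0..41:

yugabyte=# select count(distinct val1) from demo;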

Fast exact count distinct with YugabyteDB

YugabyteDB has an optimization for DISTINCT: it is pushed down to the storage layer and doesn't need to scan all rows because it can skip entries in the LSM-Tree. This works with range-sharded indexes:

create index demo_hll_id on demo (id asc);
create index demo_hll_val1 on demo (val1 asc);
create index demo_hll_val2 on demo (val2 asc);


To benefit from the skip scan, I need to force an Index Scan, and I do that by adding a ... >= (select min(...) from ...):

yugabyte=# select count(*) from (
            select distinct id from demo 
            where id>=(select min(id) from demo)
           ) distinct_pushdown ;

  count
----------
 10001000
(1 row)

Time: 17707.842 ms (00:17.708)

yugabyte=# select count(*) from (
            select distinct val2 from demo 
            where val2>=(select min(val2) from demo)
           ) distinct_pushdown ;

  count
---------
 1000000
(1 row)

Time: 6901.541 ms (00:06.902)


The DISTINCT is pushed down to DocDB, the distributed storage, but all distinct values are still fetched to be counted in YSQL, the PostgreSQL backend. The fewer distinct values, the faster it is:

yugabyte=# select count(*) from (
            select distinct val1 from demo 
            where val1>=(select min(val1) from demo)
           ) distinct_pushdown ;

 count
-------
    99
(1 row)

Time: 15.813 ms


This one is really fast. When you know that there are few distinct values and you have an ascending or descending index on the column, this is the fastest method and returns an exact value.
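To verify that the pushdown kicks in, you can look at the execution plan of the inner query (output not reproduced here, as it depends on the version):

yugabyte=# explain (costs off, analyze)
           select distinct val1 from demo
           where val1>=(select min(val1) from demo);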

Analyze

To help the query planner estimate predicate selectivity (the cost-based optimizer is still in beta in YugabyteDB 2.19 and requires running ANALYZE on the tables and setting yb_enable_optimizer_statistics=on), the number of distinct values is stored in the catalog and visible in pg_stats:

yugabyte=# analyze demo;
WARNING:  'analyze' is a beta feature!
LINE 1: analyze demo;
        ^
HINT:  Set 'ysql_beta_features' yb-tserver gflag to true to suppress the warning for all beta features.
ANALYZE
Time: 10054.772 ms (00:10.055)

yugabyte=# select schemaname, tablename, attname, n_distinct from pg_stats;
 schemaname | tablename | attname | n_distinct
------------+-----------+---------+------------
 public     | demo      | id      |         -1
 public     | demo      | val1    |         45
 public     | demo      | val2    |     972706
(3 rows)

Time: 15.473 ms


The values are approximate, gathered from a sample. A positive n_distinct is the estimated number of distinct values. A negative one is that estimate divided by the number of rows, negated (this helps the optimizer keep an accurate picture even when the number of rows increases). A value of -1 means that all rows are distinct.
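As an illustration, the planner's estimate can be reconstructed from pg_stats and the row count in pg_class; this is plain PostgreSQL and assumes, for simplicity, that the table name is unique across schemas:

yugabyte=# select s.attname,
              case when s.n_distinct >= 0
                   then s.n_distinct
                   else -s.n_distinct * c.reltuples
              end as estimated_n_distinct
           from pg_stats s
           join pg_class c on c.relname = s.tablename
           where s.tablename = 'demo';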

To summarize...

Queries that need to aggregate rows in an OLTP database should be designed to find the right balance between exactness and performance. YugabyteDB has many optimizations in the distributed storage, and knowing them helps to get the best performance, like an Index Scan that can skip through the index to read only the distinct values. When reading many rows is necessary, we rarely need an exact result on a database where data is constantly updated, and HyperLogLog functions help to reduce the memory and CPU usage when deduplicating the result.
