In the previous post we have seen a parallel scan for a count(*): all rows are read and counted on each tablet, in parallel, each returning a partial count that is aggregated into the exact total.
What if we don't need an exact count? Can we get an approximate count by reading only a sample of rows? Yes, if we use hash sharding and the rows are evenly distributed. YugabyteDB applies a hash function to the hash columns of the primary key, yielding a number between 0 and 65535, and rows are sharded by ranges of this hash value. If I read only the hash values between 0 and 65536/100, I read roughly 1% of the rows and hit only one or a few tablets.
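This extrapolation is only valid when rows spread evenly over the hash range, which is worth verifying once. Here is a sketch of such a sanity check, where t and k are placeholders for a hash-sharded table and its hash column (note that this check itself scans the whole table, so it is a one-off verification, not something to run routinely):

```sql
-- t and k are hypothetical: your hash-sharded table and its hash column.
-- Split the 0..65535 hash range into 16 buckets and count rows per bucket;
-- with an even distribution the per-bucket counts should be close.
select width_bucket(yb_hash_code(k), 0, 65536, 16) as bucket,
       count(*) as rows_in_bucket
from t
group by bucket
order by bucket;
```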
Example
On a 9-node cluster I've created the following table:
drop table if exists demo;
create extension if not exists pgcrypto;
create table demo (id uuid default gen_random_uuid() primary key, value float);
insert into demo(value) select random() from generate_series(1,1000000)\;
\watch interval=0 c=14
I've run the insert until I had 150 million rows (using the \watch count option added in PostgreSQL 16):
yugabyte=> \timing on
Timing is on.
yugabyte=> select count(*) from demo;
count
-----------
150000000
(1 row)
Time: 18980.257 ms (00:18.980)
You can see that the count(*) takes 18 seconds to count 150 million rows.
Approximate count
I can approximate the count by reading 0.1% of the rows:
yugabyte=> select 1000*count(*) from demo where yb_hash_code(id)<65536/1000;
?column?
-----------
148268000
(1 row)
Time: 854.053 ms
That's not bad. I get the approximate count in less than 1 second, with about 1% error.
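Part of that gap is systematic: 65536/1000 is integer division, which evaluates to 65, so the predicate reads hash codes 0 to 64, i.e. 65/65536 ≈ 0.099% of the range, while the result is still multiplied by 1000. A sketch of an unbiased variant extrapolates with the exact factor:

```sql
-- 65 hash codes out of 65536 were read, so scale by 65536/65.0
-- (float division) instead of the rounded factor 1000
select round((65536/65.0)*count(*)) as approx_count
from demo
where yb_hash_code(id) < 65;
```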
Even reading one hash code out of 65536 gives an acceptable approximate count in 33 milliseconds:
yugabyte=> select 65536*count(*) from demo where yb_hash_code(id)=1;
?column?
-----------
151388160
(1 row)
Time: 33.269 ms
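Of course, a single hash code is a tiny sample (on average about 150000000/65536 ≈ 2289 rows here), so individual estimates fluctuate. One way to gauge the sampling error is to compare estimates from a few different hash codes (the value 42 below is an arbitrary choice):

```sql
-- same extrapolation from another single hash code;
-- the spread between such estimates hints at the sampling error
select 65536*count(*) from demo where yb_hash_code(id)=42;
```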
yb_approximate_count()
This is easy to do manually but requires knowing which columns are hashed in the primary key. I can get them from pg_index.indoption with a few joins on the catalog tables. I've created a function that finds them automatically and runs the partial count:
drop function if exists yb_approximate_count;
create or replace function yb_approximate_count(
table_relid oid -- the oid of a hash partitioned table
, pct float default null -- percentage of rows (actually of hash codes)
) returns bigint as
$$
declare
count_sql text;
count_result bigint;
begin
with
/*+ Set(random_page_cost 1e42) */ -- https://github.com/yugabyte/yugabyte-db/issues/15224#issuecomment-1341239529
ind_att_option as (
select indrelid, indexrelid, schemaname, tablename, indexname, indisprimary, key_position, indoption, attname, n_distinct from
(
select indrelid, indexrelid, indisprimary, indkeys.indkey, indclass[indkeys.k-1], k as key_position,
-- indoption defined in https://github.com/yugabyte/yugabyte-db/blob/2.18/src/postgres/src/include/catalog/pg_index.h#L78
case indoption[indkeys.k-1]::int when 0 then 'ASC' when 1 then 'DESC' when 3 then 'ASC NULLS FIRST' when 4 then 'HASH' end as indoption
from pg_index
, unnest(indkey) with ordinality as indkeys(indkey,k)
) pg_index
natural left join (select oid as indexrelid, relname as indexname from pg_class) idx
natural left join (select oid as indrelid, relnamespace, relname as tablename from pg_class) tab
natural left join (select oid as relnamespace, nspname as schemaname from pg_namespace) nsp
natural left join (select attrelid as indrelid, attnum as indkey, attname as attname from pg_attribute) att
natural left join (select schemaname, tablename, attname, n_distinct from pg_stats) stat
),
ind_hash as (
select schemaname, tablename,string_agg(format('%I',attname), ',' order by key_position) hash_cols, indisprimary
, case when indisprimary then indrelid else indexrelid end relid
from ind_att_option
where indoption='HASH'
group by schemaname, tablename, indrelid, indexrelid, indisprimary
)
select
format(
case
when pct=100 -- read all (pushdown parallel scan)
then 'select count(*) as "%2$I.%3$I" from %2$I.%3$I'
when pct is null -- read only one hash code (1/65536th)
then 'select 65536*count(*) as "%2$I.%3$I" from %2$I.%3$I where yb_hash_code(%4$s)=1'
else -- read a percentage of hash code range
'select %1$s*count(*) as "%2$I.%3$I" from %2$I.%3$I where yb_hash_code(%4$s)< 65536/%1$s'
end , 100/pct , schemaname, tablename, hash_cols
) into count_sql
from ind_hash
--, lateral (select num_tablets from yb_table_properties(relid)) as yb_table_properties
where indisprimary and relid=table_relid
;
if count_sql is null then
raise warning '%','ERROR: Can''t find HASH columns for the table''s primary key';
return null;
end if;
raise debug 'DEBUG: Running: %',count_sql;
execute count_sql into count_result;
return count_result;
end;
$$ language plpgsql;
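To see the statement that the function builds before it runs, the RAISE DEBUG message can be surfaced by lowering client_min_messages, a standard PostgreSQL setting:

```sql
-- show the generated count query among the psql output
set client_min_messages = debug;
select yb_approximate_count('demo'::regclass, 0.1);
```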
The approximate count on 0.1% rows takes less than 1 second, and counting from one yb_hash_code() only takes 150 milliseconds:
yugabyte=> select yb_approximate_count('demo'::regclass,0.1);
yb_approximate_count
----------------------
148268000
(1 row)
Time: 994.010 ms
yugabyte=> select yb_approximate_count('demo'::regclass);
yb_approximate_count
----------------------
151388160
(1 row)
Time: 152.628 ms
yugabyte=> select yb_approximate_count('demo'::regclass,1);
yb_approximate_count
----------------------
149836900
(1 row)
Time: 8566.000 ms (00:08.566)
Note that the query with the where clause doesn't push down the count, and returns all the sampled rows to the PostgreSQL backend. So it beats the parallel pushed-down count only for a very small sample, like 0.1%. The function called without a percentage reads only one yb_hash_code() value, i.e. 1/65536 of the hash range, which may be sufficient for an approximation.
This function works only with hash sharding and an even distribution. With range sharding, a count over one range cannot be extrapolated to a relevant approximate count.