I was reading "pg_duckdb beta release: Even faster analytics in Postgres", which demonstrates that the execution of TPC-DS Query 01 is 1500 times faster on DuckDB compared to PostgreSQL. Naturally, I was curious to see how this query performs in YugabyteDB. However, when I examined the SQL query that was used, which repeatedly accesses the same table and conducts analytics without utilizing analytic functions, I wondered: should we be spending time, in 2024, examining queries from analytics benchmarks that were written for SQL-92 and ignore the window functions introduced in SQL:2003?
I reproduced what was run in that article, but on YugabyteDB.
Generate Data
I generated the TPC-DS data with DuckDB:
duckdb <<'SQL'
CALL dsdgen(sf=1);
EXPORT DATABASE 'tpcds' (FORMAT CSV, DELIMITER '|');
SQL
wc -l tpcds/*.csv | sort -n
This generates .csv files in the tpcds directory, with a header line and fields separated by |.
Load Data
I used the CREATE TABLE statements from the DuckDB repository to create the tables in a tpcds database and loaded the data into YugabyteDB with \copy:
{
echo 'drop database tpcds;'
echo 'create database tpcds;'
echo '\connect tpcds;'
for t in tpcds/*.csv
do
curl -Ls https://raw.githubusercontent.com/duckdb/duckdb/refs/heads/main/extension/tpcds/dsdgen/schema/$(basename $t .csv).sql
echo '\copy' $(basename $t .csv) from "'$t'" "(format 'csv', header 1, delimiter '|')"
done
echo 'analyze;'
} | psql -e
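As an optional sanity check after the load, the row counts can be compared with the wc -l output (each CSV file has one header line). This is just a quick sketch, not part of the benchmark:
-- Quick sanity check (sketch): row counts should match the wc -l output
-- minus one header line per file.
SELECT 'store_returns' AS table_name, count(*) AS row_count FROM store_returns
UNION ALL SELECT 'date_dim', count(*) FROM date_dim
UNION ALL SELECT 'store',    count(*) FROM store
UNION ALL SELECT 'customer', count(*) FROM customer;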
Original Query
The code for TPC-DS Query 01 in the DuckDB repository is the following:
WITH customer_total_return AS
(SELECT sr_customer_sk AS ctr_customer_sk,
sr_store_sk AS ctr_store_sk,
sum(sr_return_amt) AS ctr_total_return
FROM store_returns,
date_dim
WHERE sr_returned_date_sk = d_date_sk
AND d_year = 2000
GROUP BY sr_customer_sk,
sr_store_sk)
SELECT c_customer_id
FROM customer_total_return ctr1,
store,
customer
WHERE ctr1.ctr_total_return >
(SELECT avg(ctr_total_return)*1.2
FROM customer_total_return ctr2
WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
AND s_store_sk = ctr1.ctr_store_sk
AND s_state = 'TN'
AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id
LIMIT 100;
When I come across the condition WHERE ctr1.ctr_total_return > (SELECT avg(ctr_total_return) * 1.2 ...) on the same table, I immediately suspect that the query was written by someone who ignores SQL window functions. The correlated subquery is executed for each row, re-reading the same table, and each of these executions performs a full scan. The time complexity is O(N²), where N is the number of rows in the table.
Here is the relevant part of the PostgreSQL execution plan:
-> CTE Scan on customer_total_return ctr1 (cost=0.00..43236.24 rows=462 width=8) (actual time=93.217..222789.836 rows=12877 loops=1)
Filter: (ctr_total_return > (SubPlan 2))
Rows Removed by Filter: 37120
SubPlan 2
-> Aggregate (cost=31.18..31.20 rows=1 width=32) (actual time=4.454..4.454 rows=1 loops=49997)
-> CTE Scan on customer_total_return ctr2 (cost=0.00..31.16 rows=7 width=32) (actual time=0.030..3.630 rows=8168 loops=49997)
Filter: (ctr1.ctr_store_sk = ctr_store_sk)
Rows Removed by Filter: 41829
This reads 49997 rows from "customer_total_return" and, for each of them, scans the same 49997 rows again to eliminate on average 41829 rows and aggregate the remaining 8168 into an average.
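Before discussing whether such an execution should be benchmarked at all, here is the general rewrite pattern that avoids the O(N²) behavior. This is a minimal sketch on a hypothetical table t with a value column val and a grouping column grp, not the TPC-DS schema; the full rewrite of the TPC-DS query follows below:
-- Correlated form: the aggregate subquery runs once per outer row (O(N²))
-- SELECT * FROM t t1
--  WHERE t1.val > 1.2 * (SELECT avg(t2.val) FROM t t2 WHERE t2.grp = t1.grp);
-- Window form: a single pass computes the per-group average alongside each row
SELECT *
FROM (
  SELECT t.*, avg(val) OVER (PARTITION BY grp) AS grp_avg
  FROM t
) w
WHERE val > 1.2 * grp_avg;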
Should we compare database performance based on such an inefficient query execution? Probably not, especially since the PostgreSQL community has consistently decided not to add more complexity to the query planner when performance can be improved by rewriting the query more efficiently.
What do you want to achieve?
TPC benchmarks describe the business use case and let you write the queries. The TPC-DS specification describes this query as follows:
B.1 query1.tpl
Find customers who have returned items more than 20% more often than the average customer returns for a store in a given state for a given year.
Qualification Substitution Parameters:
- YEAR.01=2000
- STATE.01=TN
- AGG_FIELD.01 = SR_RETURN_AMT
Here is my first draft of an SQL statement to answer this question:
with sr_per_customer_store as (
-- Sum of return amounts for each customer per store
select
-- Aggregation for AGG_FIELD.01 = SR_RETURN_AMT
sum(sr_return_amt) sr_return_amt,
sr_customer_sk, sr_store_sk
from store_returns
where
sr_returned_date_sk in (
-- Filter for YEAR.01=2000
select d_date_sk from date_dim where d_year = 2000
)
and
sr_store_sk in (
-- Filter for STATE.01=TN
select s_store_sk from store where s_state = 'TN'
)
-- Additional filters could be uncommented (see later)
--and sr_customer_sk is not null
--and sr_store_sk is not null
group by
sr_customer_sk, sr_store_sk
)
, sr_per_customer_store_with_avg as (
-- adds the average customer returns for a store to each row
select
avg(sr_return_amt) over (partition by sr_store_sk) as avg_return_amt,
sr_customer_sk, sr_store_sk, sr_return_amt
from sr_per_customer_store
)
-- main query to filter on the 20% more often than the average
select c_customer_id
-- , avg_return_amt, sr_customer_sk, sr_store_sk, sr_return_amt
from sr_per_customer_store_with_avg
join customer on c_customer_sk=sr_customer_sk
where sr_return_amt > 1.20 * avg_return_amt
order by c_customer_id
limit 100
;
YugabyteDB
I ran my query on a small YugabyteDB cluster without any optimization (no indexes defined by the DDL I used) and got the result in 500 milliseconds.
tpcds=> select version();
version
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 11.2-YB-2024.1.3.0-b0 on x86_64-pc-linux-gnu, compiled by clang version 17.0.6 (https://github.com/yugabyte/llvm-project.git 9b881774e40024e901fc6f3d313607b071c08631), 64-bit
(1 row)
yugabyte=> explain (analyze, dist, costs off)
with sr_per_customer_store as (
-- Sum of return amounts for each customer per store
select
-- Aggregation for AGG_FIELD.01 = SR_RETURN_AMT
sum(sr_return_amt) sr_return_amt,
sr_customer_sk, sr_store_sk
from store_returns
where
sr_returned_date_sk in (
-- Filter for YEAR.01=2000
select d_date_sk from date_dim where d_year = 2000
)
and
sr_store_sk in (
-- Filter for STATE.01=TN
select s_store_sk from store where s_state = 'TN'
)
-- Additional filters could be uncommented (see later)
--and sr_customer_sk is not null
--and sr_store_sk is not null
group by
sr_customer_sk, sr_store_sk
)
, sr_per_customer_store_with_avg as (
-- adds the average customer returns for a store to each row
select
avg(sr_return_amt) over (partition by sr_store_sk) as avg_return_amt,
sr_customer_sk, sr_store_sk, sr_return_amt
from sr_per_customer_store
)
-- main query to filter on the 20% more often than the average
select c_customer_id
-- , avg_return_amt, sr_customer_sk, sr_store_sk, sr_return_amt
from sr_per_customer_store_with_avg
join customer on c_customer_sk=sr_customer_sk
where sr_return_amt > 1.20 * avg_return_amt
order by c_customer_id
limit 100
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Limit (actual time=522.301..522.315 rows=100 loops=1)
CTE sr_per_customer_store
-> GroupAggregate (actual time=325.035..352.718 rows=49499 loops=1)
Group Key: store_returns.sr_customer_sk, store_returns.sr_store_sk
-> Sort (actual time=325.023..330.820 rows=54438 loops=1)
Sort Key: store_returns.sr_customer_sk, store_returns.sr_store_sk
Sort Method: quicksort Memory: 3920kB
-> Hash Semi Join (actual time=13.765..304.687 rows=54438 loops=1)
Hash Cond: (store_returns.sr_store_sk = store.s_store_sk)
-> Hash Semi Join (actual time=12.338..292.033 rows=55440 loops=1)
Hash Cond: (store_returns.sr_returned_date_sk = date_dim.d_date_sk)
-> Seq Scan on store_returns (actual time=1.266..237.662 rows=287867 loops=1)
Storage Table Read Requests: 282
Storage Table Read Execution Time: 59.762 ms
Storage Table Rows Scanned: 287867
-> Hash (actual time=11.047..11.047 rows=366 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 21kB
-> Seq Scan on date_dim (actual time=10.952..10.985 rows=366 loops=1)
Storage Filter: (d_year = 2000)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 10.861 ms
Storage Table Rows Scanned: 73049
-> Hash (actual time=1.421..1.421 rows=12 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on store (actual time=1.404..1.407 rows=12 loops=1)
Storage Filter: ((s_state)::text = 'TN'::text)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 1.265 ms
Storage Table Rows Scanned: 12
CTE sr_per_customer_store_with_avg
-> WindowAgg (actual time=375.957..393.435 rows=49499 loops=1)
-> Sort (actual time=373.736..377.333 rows=49499 loops=1)
Sort Key: sr_per_customer_store.sr_store_sk
Sort Method: quicksort Memory: 3689kB
-> CTE Scan on sr_per_customer_store (actual time=325.039..365.754 rows=49499 loops=1)
-> Sort (actual time=522.300..522.303 rows=100 loops=1)
Sort Key: customer.c_customer_id
Sort Method: top-N heapsort Memory: 37kB
-> Hash Join (actual time=472.245..520.617 rows=12871 loops=1)
Hash Cond: (sr_per_customer_store_with_avg.sr_customer_sk = customer.c_customer_sk)
-> CTE Scan on sr_per_customer_store_with_avg (actual time=375.965..415.768 rows=12877 loops=1)
Filter: (sr_return_amt > (1.20 * avg_return_amt))
Rows Removed by Filter: 36622
-> Hash (actual time=96.147..96.147 rows=100000 loops=1)
Buckets: 65536 Batches: 2 Memory Usage: 3245kB
-> Seq Scan on customer (actual time=1.323..73.629 rows=100000 loops=1)
Storage Table Read Requests: 99
Storage Table Read Execution Time: 55.715 ms
Storage Table Rows Scanned: 100000
Planning Time: 0.224 ms
Execution Time: 523.047 ms
Storage Read Requests: 383
Storage Read Execution Time: 127.603 ms
Storage Rows Scanned: 460928
Storage Write Requests: 0
Catalog Read Requests: 0
Catalog Write Requests: 0
Storage Flush Requests: 0
Storage Execution Time: 127.603 ms
Peak Memory Usage: 24691 kB
(60 rows)
Looking at the execution plan, we can improve it with proper indexes. For example, I can gain 100 milliseconds with an index that covers the "date" and "store" filtering:
create index on store_returns ( sr_store_sk , sr_returned_date_sk , sr_customer_sk, sr_return_amt );
YugabyteDB improves this with a Batched Nested Loop Join:
-> YB Batched Nested Loop Join (actual time=13.446..90.820 rows=55440 loops=1)
Join Filter: (store_returns.sr_returned_date_sk = date_dim.d_date_sk)
-> HashAggregate (actual time=11.102..11.132 rows=366 loops=1)
Group Key: date_dim.d_date_sk
-> Seq Scan on date_dim (actual time=10.992..11.023 rows=366 loops=1)
Storage Filter: (d_year = 2000)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 10.918 ms
Storage Table Rows Scanned: 73049
-> Index Only Scan using store_returns_sr_store_sk_sr_returned_date_sk_sr_customer_s_idx on store_returns (actual time=2.196..55.979 rows=55440 loops=1)
Index Cond: (sr_returned_date_sk = ANY (ARRAY[date_dim.d_date_sk, $1, $2, ..., $1023]))
Heap Fetches: 0
Storage Index Read Requests: 55
Storage Index Read Execution Time: 9.239 ms
Storage Index Rows Scanned: 55440
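The ANY (ARRAY[date_dim.d_date_sk, $1, ..., $1023]) condition shows that the outer rows are pushed down in batches of 1024. If a similar plan doesn't show up, the batching behavior can be inspected or adjusted per session; a small sketch, assuming the yb_bnl_batch_size parameter available in recent YugabyteDB releases:
-- Sketch (assumes the yb_bnl_batch_size parameter of recent YugabyteDB releases):
SHOW yb_bnl_batch_size;        -- current batch size for Batched Nested Loop Joins
SET yb_bnl_batch_size = 1024;  -- 1024 matches the batch visible in the plan above;
                               -- a value of 1 falls back to a regular nested loop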
Improving the schema by properly indexing isn't the primary focus of this blog post. The original benchmark comparison between PostgreSQL and DuckDB was based on an execution time that exceeded one minute, caused by an inefficient SQL query, which is typically something PostgreSQL doesn't want to optimize (it has always been decided to avoid complicating the query planner code when the problem can be fixed in the query - see Discussion on missing optimizations).
I have written an SQL query that addresses the business question using standard SQL features established 20 years ago and implemented in all databases. Given the lack of indexes, the performance is acceptable, but it can be optimized further. There's no need to switch to another database engine to achieve improvements of hundreds of milliseconds.
DuckDB
The original query runs faster on DuckDB because the inefficiency comes not from reading a database table, but from re-reading the temporary result materialized by the Common Table Expression (CTE). DuckDB is specifically optimized for querying unindexed data and avoids executing correlated subqueries multiple times: it employs a Delimiting Join to eliminate the unnecessary repeated computations.
It is still possible to benefit from this DuckDB optimization on any PostgreSQL-compatible database by using the DuckDB POSTGRES_SCANNER. I started DuckDB and attached my YugabyteDB database by specifying its endpoint:
ATTACH 'dbname=tpcds user=admin host=eu-west-1.b808348e-9ae2-b506-43d5-5c44b86d5882.cloud.yugabyte.com port=5433 password=******' AS yb (TYPE POSTGRES, READ_ONLY);
USE yb;
I ran the original query with EXPLAIN and looked at the part of the execution plan that handles the correlated subquery comparing the "ctr_total_return" of each row to the average for the same "ctr_store_sk".
This optimization is interesting, but as shown above, refactoring the SQL query to utilize built-in window functions available in all SQL databases proves to be more effective.
Validation of the new query
When I refactor a query, I always verify that the results are the same. I ran both queries without the LIMIT 100 to compare the results, but I ended up with different rows.
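One simple way to compare the two result sets is to check that their symmetric difference is empty. This is a sketch assuming both queries (without ORDER BY and LIMIT) have been saved as hypothetical views q1_original and q1_rewritten:
-- Sketch with hypothetical view names q1_original and q1_rewritten:
-- an empty result means both queries return the same set of customer ids.
(SELECT c_customer_id FROM q1_original
 EXCEPT
 SELECT c_customer_id FROM q1_rewritten)
UNION ALL
(SELECT c_customer_id FROM q1_rewritten
 EXCEPT
 SELECT c_customer_id FROM q1_original);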
When I ran my query step-by-step, I observed that the first Common Table Expression has some null values for "sr_customer_sk" and "sr_store_sk":
tpcds=> with sr_per_customer_store as (
-- Sum of return amounts for each customer per store
select
-- Aggregation for AGG_FIELD.01 = SR_RETURN_AMT
sum(sr_return_amt) sr_return_amt,
sr_customer_sk, sr_store_sk
from store_returns
where
sr_returned_date_sk in (
-- Filter for YEAR.01=2000
select d_date_sk from date_dim where d_year = 2000
)
and
sr_store_sk in (
-- Filter for STATE.01=TN
select s_store_sk from store where s_state = 'TN'
)
-- Additional filters could be uncommented (see later)
--and sr_customer_sk is not null
--and sr_store_sk is not null
group by
sr_customer_sk, sr_store_sk
)
select * from sr_per_customer_store
;
sr_return_amt | sr_customer_sk | sr_store_sk
---------------+----------------+-------------
485.94 | 6 | 7
4303.04 | 16 | 2
113.10 | 19 | 8
428.80 | 23 | 7
...
757.68 | 99981 | 2
508.44 | 99981 | 8
75.32 | 99983 | 2
434.60 | 99983 | 7
97.02 | 99988 | 2
14.62 | 99992 | 7
42.99 | 100000 | 4
30596.45 | | 1
37542.86 | | 2
44972.55 | | 4
32704.92 | | 7
30601.72 | | 8
37231.63 | | 10
(49499 rows)
I don't know the meaning of those NULLs in the foreign keys to the "customer" and "store" tables. Maybe the generated data stores some aggregates, similar to what GROUPING SETS produce. Given the business question ("the average customer returns for a store in a given state"), we should not include the rows from "store_returns" where the customer or the store is unknown, which is why my query adds those predicates:
and sr_customer_sk is not null
and sr_store_sk is not null
The original query produces the same result when the same predicates are added. The issue with the original query is that the joins to the "store" and "customer" tables eliminate the rows with a null foreign key from the final result, but the correlated subquery still includes them when calculating the per-store average. Not only was the original benchmark query written with inefficient code, but it also appears to return wrong results.
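For reference, here is a sketch of the original query with the same two predicates added to its CTE; the rest is unchanged, and this is the modification that made both queries return the same rows:
WITH customer_total_return AS
  (SELECT sr_customer_sk AS ctr_customer_sk,
          sr_store_sk AS ctr_store_sk,
          sum(sr_return_amt) AS ctr_total_return
   FROM store_returns,
        date_dim
   WHERE sr_returned_date_sk = d_date_sk
     AND d_year = 2000
     -- same predicates as in my rewritten query, so that rows with an
     -- unknown customer or store no longer inflate the per-store average
     AND sr_customer_sk IS NOT NULL
     AND sr_store_sk IS NOT NULL
   GROUP BY sr_customer_sk,
            sr_store_sk)
SELECT c_customer_id
FROM customer_total_return ctr1,
     store,
     customer
WHERE ctr1.ctr_total_return >
    (SELECT avg(ctr_total_return)*1.2
     FROM customer_total_return ctr2
     WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
  AND s_store_sk = ctr1.ctr_store_sk
  AND s_state = 'TN'
  AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id
LIMIT 100;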
By crafting a query directly from the business question, and writing clean SQL with a WITH clause to define common table expressions, I find it easier to reason through each step, which increases the likelihood of obtaining correct results.
Conclusion
This clearly illustrates my answer to the question in the title: do not run your benchmarks with queries that ignore the past 30 years of SQL. I ran the original query to verify the results, and although it took a couple of minutes, I didn't spend additional time trying to understand the reasons for its slowness. Instead, I focused on writing a better-structured SQL query that improves readability for developers and gives the query planner more opportunities to optimize.
Some commercial databases have integrated complex transformations into their query planners to support legacy applications or inefficient SQL generators. For them, benchmarking with such outdated and inefficient SQL queries, like those from older TPC implementations, can be a valid choice. In contrast, PostgreSQL has prioritized simplicity in its planner code, which means it can perform badly on poorly written queries. Therefore, to compare performance fairly across PostgreSQL-compatible databases, it is crucial to focus on queries that avoid inefficient code.