Franck Pachot for YugabyteDB Distributed PostgreSQL Database

Posted on Oct 3, 2022

Real Application Testing on 🚀YugabyteDB 🐘pgreplay

#postgres #yugabytedb #rat #replay

You may want to capture a workload in a database and replay it. Oracle has RAT (Real Application Testing); PostgreSQL has pgreplay (https://github.com/laurenz/pgreplay), maintained by Laurenz Albe.

YugabyteDB, the PostgreSQL-compatible open-source distributed SQL database, uses the PostgreSQL query layer, so it can be the source of a pgreplay capture or the target of a replay. This is the beauty of adding distributed capabilities to the PostgreSQL code base. Here is the quick test I ran to verify that everything works as expected.

I run this example in a lab with only one node. I set the log destination to csvlog and set the logging parameters as recommended by the pgreplay documentation:

docker run -v /var/tmp/yb:/var/tmp \
 -d --rm --name yb yugabytedb/yugabyte:2.15.2.0-b87 \
bash -c '

cat > tserver.flagfile <<CAT
--ysql_pg_conf_csv=\
log_destination=csvlog,\
log_statement=all,\
log_min_messages=error,\
log_min_error_statement=log,\
log_connections=on,\
log_disconnections=on
CAT

yugabyted start --tserver_flags=flagfile=tserver.flagfile 

while true ; do tail -F /root/var/logs/tserver/postgres* ; done
'

I mount /var/tmp/yb from the host as /var/tmp in the container. This is where I will copy the log of the workload I want to replay. I'll rotate the log with pg_rotate_logfile() and get the file name with pg_current_logfile(). The YugabyteDB query layer (YSQL) is PostgreSQL, so you can use the same parameters and functions.

I initialize the pgbench tables (pgbench is ysql_bench in the YugabyteDB distribution, which adds a few features, but you can also use the PostgreSQL pgbench):

docker exec -i yb bash -c '

/home/yugabyte/postgres/bin/ysql_bench -i 

'

Capture

I run a simple pgbench, taking care to rotate the log file first and to copy it into /var/tmp at the end:

docker exec -i yb bash -c "

ysqlsh -c \"select pg_rotate_logfile()\"

/home/yugabyte/postgres/bin/ysql_bench -n

ysqlsh -tc \"
select format('cp %L /var/tmp/workload.csv',pg_current_logfile())
\" | sh -x

"

The csvlog covering this pgbench workload is /var/tmp/workload.csv, ready to be replayed.

Usually, you replay it on a clone of the database (see PITR snapshot: an easy flashback / backtrack for application releases) or on a database restored from a pg_dump (ysql_dump is the YugabyteDB version) of the initial state.

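Here, I will simply reinitialize the tables with ysql_bench -i. I also count the rows and sum the balances before and after the reinitialization. The first result is the one the replay should reproduce (pgbench uses random amounts):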

docker exec -i yb bash -c '
ysqlsh -c "select count(*), sum(abalance) from ysql_bench_accounts;"
/home/yugabyte/postgres/bin/ysql_bench -i 
ysqlsh -c "select count(*), sum(abalance) from ysql_bench_accounts;"
'

Replay

I build a docker image with pgreplay:

git clone https://github.com/laurenz/pgreplay.git
cd pgreplay
docker build -t laurenz/pgreplay -f Dockerfile .

It can be interesting to look at the csvlog. I'm using VisiData for that. You can install it with pip3 install visidata, or use a docker image that includes it, mounting the /var/tmp/yb volume:

docker run --rm -it -w /logs -v /var/tmp/yb:/logs \
jauderho/visidata:latest \
workload.csv

There's no header, but the format is described in https://www.postgresql.org/docs/current/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-CSVLOG
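Since the file has no header, it can help to attach column names when reading it from a script. The following is a minimal sketch, not part of pgreplay: it assumes the 23-column csvlog layout of PostgreSQL 11 (the version the YugabyteDB query layer was based on at the time), and the names read_csvlog and CSVLOG_COLUMNS are only illustrative. Newer PostgreSQL versions append more columns, which the sketch labels generically.

```python
import csv

# Column names of the PostgreSQL 11 csvlog layout, in file order.
# Assumption: the log was written by a PostgreSQL 11 compatible server;
# newer PostgreSQL versions append more columns at the end.
CSVLOG_COLUMNS = [
    "log_time", "user_name", "database_name", "process_id",
    "connection_from", "session_id", "session_line_num", "command_tag",
    "session_start_time", "virtual_transaction_id", "transaction_id",
    "error_severity", "sql_state_code", "message", "detail", "hint",
    "internal_query", "internal_query_pos", "context", "query",
    "query_pos", "location", "application_name",
]


def read_csvlog(lines):
    """Yield one dict per log entry from an iterable of csvlog lines.

    `lines` can be an open file (opened with newline="") or a list of
    strings. Fields beyond the known columns are named extra_0, extra_1...
    """
    for row in csv.reader(lines):
        extra = [f"extra_{i}" for i in range(len(row) - len(CSVLOG_COLUMNS))]
        yield dict(zip(CSVLOG_COLUMNS + extra, row))
```

For example, passing the open workload.csv file to read_csvlog() and printing entry["session_id"] and entry["message"] for each entry gives a quick view of which session ran which statement.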

I run pgreplay in a container from the image I've built above, accessing the volume to read the csvlog with -v and the database to run the statements with --link:

docker run --rm -ti -w /logs --link yb:yb \
-v /var/tmp/yb:/logs \
laurenz/pgreplay pgreplay -c -h yb -p 5433 \
/logs/workload.csv

I check that I have the same values to confirm the replay:

docker exec -i yb bash -c '
ysqlsh -c "select count(*), sum(abalance) from ysql_bench_accounts;"
'

This post was only meant to verify that pgreplay works with YugabyteDB. Check the pgreplay documentation for more: everything works the same in YugabyteDB.

If you want to capture a workload from connections on multiple database nodes, each node will have its own log file, and you can merge them. The session ID (the 6th field in the csvlog, built from the session start time and the backend PID) will probably not collide with one from another node, but you can make it unique by concatenating a node number if you want. The replay connects to one node, but through an HA proxy the connections can be distributed to multiple nodes. It all depends on what you want to capture and where you want to replay it.

Capturing from PostgreSQL and replaying to YugabyteDB is also a good way to check that everything works the same, without performance regressions.
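The multi-node merge mentioned above could start with something like the following sketch, which I have not run against a real multi-node capture. It checks whether any session ID appears in the logs of more than one node, then orders all entries by log time. The function names and the per-node dictionary are hypothetical. It assumes that all nodes log in the same time zone, so that comparing the log_time strings matches chronological order.

```python
import csv
from collections import defaultdict

LOG_TIME = 0     # 1st csvlog field
SESSION_ID = 5   # 6th csvlog field


def load(path):
    """Read one node's csvlog into a list of rows (lists of strings)."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


def colliding_sessions(rows_by_node):
    """Return the session IDs found in the logs of more than one node."""
    nodes_seen = defaultdict(set)
    for node, rows in rows_by_node.items():
        for row in rows:
            nodes_seen[row[SESSION_ID]].add(node)
    return sorted(sid for sid, nodes in nodes_seen.items() if len(nodes) > 1)


def merge_by_time(rows_by_node):
    """Return all rows ordered by log_time.

    Assumption: every node logs in the same time zone with the same
    timestamp format, so string order equals chronological order.
    The sort is stable, so entries with equal timestamps keep the
    order they had within their node's log.
    """
    all_rows = [row for rows in rows_by_node.values() for row in rows]
    return sorted(all_rows, key=lambda row: row[LOG_TIME])
```

With hypothetical files node1.csv and node2.csv, you would build {1: load("node1.csv"), 2: load("node2.csv")}, check that colliding_sessions() returns an empty list, and only rewrite session IDs if it does not. Writing the merged rows back out is left aside here: Python's csv.writer does not quote fields exactly as PostgreSQL does, so check that pgreplay reads the result before relying on it.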
