Measuring performance of distributed system is not easy task. It often takes time to generate dataset, write queries, running them and keep the performance result. In spite of the fixed form of measuring process, we often have to reinvent wheels because the difference of the interface of distributed systems. Every system does not support ANSI SQL. Even Hive and Presto do not provide compatible SQL syntax and functionality.
Benchto is a macro framework to define and run macro benchmark in clustered environment. It is a tool mainly developed by team of Presto in Facebook and Teradata. Although it was useful, the usage was a little complex to me. That’s the reason why I tried to clarify how to use Benchto to measure Presto cluster performance.
Launch Benchto Service
The overview of Benchto is here.
Mainly there are three components in Benchto required to run benchrmatk.
benchto-service
The data store to keep the result of benchmark. It needs RDBMS like PostgreSQL to persistent the result to be reviewed later. It provides a simple web UI to compare the result by environment. It shows:
- duration
- total cpu time
- total blocked time
- output data size
- processed input data size
- total scheduled time
- total planning time
- peak memory reservation
It provides sufficient informatio to review cluster performance. Though the UI is not rich itself, it is very easy to launch the service. First you need RDBMS to store the result. benchto-service recognizes the database in local machine as default. So you can use docker container for test.
$ docker run --name benchto-postgres -e POSTGRES_PASSWORD=postgres -p 5432:5432 postgres
Now you have a PostgreSQL which can be accessed through 5432 port. benchto-service can be launched from the repository.
$ cd benchto
$ env SERVER_PORT=8081 ./mvnw spring-boot:run -pl benchto-service
You will be able to see the UI by accessing localhost:8081.
benchto-driver
benchto-driver is a Java application which loads queries to be run and test it against distributed system you want to test. That’s the one to be created per distributed database system. Presto project already provides the driver for Presto. We can use the driver in order to run benchmark against your Presto cluster. Driver for Presto is installed as presto-benchto-benchmarks
module. We need to prepare several configurations to run the benchmark.
Profile is used for specifying the cluster uri and environment where the result is stored. The profile looks like this:
benchmark-service:
url: http://localhost:8081
data-sources:
presto:
url: jdbc:presto://your.presto.coordinator.com:8080
driver-class-name: com.facebook.presto.jdbc.PrestoDriver
environment:
name: test-env
presto:
url: http://your.presto.coordinator.com:8080
benchmark:
feature:
presto:
metrics.collection.enabled: true
macros:
sleep-4s:
command: echo "Sleeping for 4s" && sleep 4
The name of file should be application-<Profile Name>.yaml
. Please make sure the extension of the file to be .yaml
not .yml
. Then we need to define the benchmark spec in yaml file as well. Let’s assume we saved the file with name tpch.yaml
.
datasource: presto
query-names: presto/tpch/${query}.sql
runs: 3
prewarm-runs: 2
before-execution: sleep-4s
frequency: 7
database: tpch
prefix: ""
variables:
1:
query: q01, q02, q03, q04, q05, q06, q07, q08, q09, q10, q11, q12, q13, q14, q15
schema: sf100
If you want to run benchmark with TPC-H or TPC-DS, Presto already provides the neceessary resources in presto-benchto-benchmarks/src/main/resources/sql/presto
. query-names
specifies the SQLs to be run. You can embed variables in this file. database
, schema
will be rendered in the query file so that we change the table just by editing tpch.yaml
.
We can launch benchto-driver with this command.
$ cd presto
$ ./mvnw clean package -pl presto-benchto-benchmarks
$ java \
-jar presto-benchto-benchmarks/target/presto-benchto-benchmarks-0.196-SNAPSHOT-executable.jar \
--sql=presto-benchto-benchmarks/src/main/resources/sql \
--benchmarks=. \
--activeBenchmarks=tpch \
--profile=<Profile Name>
--sql
and --benchmarks
specify the directory for SQLs and benchmark spec files to be run respectively. --activeBenchmarks
and --profile
specify the benchmark spec and profile to be run. If all configuration is set properly, queries will be submitted to the cluster and you will be able to see the result stored in benchto-service.