Notes on TPC-H

2025-06-17
tpch

TPC-H is a database benchmark maintained by the Transaction Processing Performance Council (TPC). The H benchmark is the decision support one: it tests the database using business-relevant data and questions. The goal of the benchmark is to measure how well a database stores that data and answers those business questions.

⇒ https://www.tpc.org/tpch/default5.asp

This seems like a very useful test to do for multivalue and one that I'm surprised doesn't already exist. Though it could be that I just haven't found it yet.

I think the data first needs to be understood and mapped into a multivalue way of thinking, as modeling it exactly the way SQL does would defeat the purpose of using multivalue.

TPC-H - Benchmark Documents

The first thing to do is to request the TPC-H benchmark files from the TPC site. This involves submitting your e-mail and getting a download link.

The zip contains the dbgen tool to generate the data and a PDF with the specification. The specification has the schema of the tables and the business questions that will be used for the benchmark. It also includes sample SQL commands.

Sqlite Benchmark

Before porting anything to multivalue, though, I want to get the benchmark running with sqlite, as that will give me something to compare to. The end goal is to compare to postgres.

This is the link to a sqlite database that you can download that has the relevant data. This is the smallest case. To get the larger tests, I would need to run the data generation tool.

⇒ https://github.com/lovasoa/TPCH-sqlite

This link has the queries that need to be run. The benchmark has a total of 22 queries to execute.

⇒ https://github.com/CrunchyData/crunchy-bridge-for-analytics-examples

The queries are written for postgres so they need some massaging to work with sqlite.
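As an illustration of the kind of massaging involved, here is a minimal sketch that rewrites two date idioms into sqlite's date() form. It assumes the queries use the spec-style `date '...' - interval 'N' day` syntax and that the sqlite database stores dates as ISO-8601 text; the function name `to_sqlite` is my own, and other Postgres-specific syntax would need its own rules.

```python
import re
import sqlite3

# Spec-style date arithmetic: date '1998-12-01' - interval '90' day
DATE_ARITH = re.compile(
    r"\bdate\s+'(\d{4}-\d{2}-\d{2})'\s*([+-])\s*interval\s+'(\d+)'\s+(day|month|year)\b",
    re.IGNORECASE,
)
# Bare date literal: date '1994-01-01'
DATE_LITERAL = re.compile(r"\bdate\s+'(\d{4}-\d{2}-\d{2})'", re.IGNORECASE)


def to_sqlite(sql: str) -> str:
    """Rewrite two Postgres date idioms into sqlite's date() form (hypothetical helper)."""
    def arith(m: re.Match) -> str:
        day, sign, amount, unit = m.groups()
        # sqlite date modifiers look like '-90 days' or '+3 months'
        return f"date('{day}', '{sign}{amount} {unit.lower()}s')"

    sql = DATE_ARITH.sub(arith, sql)
    # Assumption: dates are stored as ISO-8601 text, so a plain string
    # literal compares correctly against the column.
    return DATE_LITERAL.sub(lambda m: f"'{m.group(1)}'", sql)


pg = "select count(*) from lineitem where l_shipdate <= date '1998-12-01' - interval '90' day"
print(to_sqlite(pg))

# Sanity check that sqlite accepts the rewritten expression
conn = sqlite3.connect(":memory:")
print(conn.execute("select " + to_sqlite("date '1998-12-01' - interval '90' day")).fetchone()[0])
```

A regex rewrite like this is fragile, so for only 22 queries editing them by hand may be just as reasonable.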

This link has the answers for each of the queries.

⇒ https://github.com/electrum/tpch-dbgen
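Checking a query result against an answer file could look roughly like the sketch below. It assumes the answers are pipe-delimited text with a header line, which I have not verified for every file, and the tolerance is an arbitrary choice of mine rather than the validation rule from the specification. The names `parse_answer` and `rows_match` are hypothetical.

```python
import math


def parse_answer(text: str):
    """Parse pipe-delimited answer text; the first non-empty line is treated as a header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return [[cell.strip() for cell in ln.split("|")] for ln in lines[1:]]


def rows_match(actual, expected, rel_tol=1e-6):
    """Compare rows cell by cell, numerically when both cells parse as floats."""
    if len(actual) != len(expected):
        return False
    for got_row, want_row in zip(actual, expected):
        if len(got_row) != len(want_row):
            return False
        for got, want in zip(got_row, want_row):
            try:
                if not math.isclose(float(got), float(want), rel_tol=rel_tol):
                    return False
            except (TypeError, ValueError):
                # Not numeric (flags, dates, names): fall back to text comparison
                if str(got).strip() != str(want).strip():
                    return False
    return True


# Made-up sample, not real TPC-H answers
expected_text = "l_returnflag|l_linestatus|sum_qty\nA|F|25.00\nN|O|36.00\n"
actual = [("A", "F", 25.0), ("N", "O", 36.0)]
print(rows_match(actual, parse_answer(expected_text)))
```

The comparison is order-sensitive, so it only works as-is for queries that have an order by clause.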

I was able to get this all loaded and working pretty easily. The great thing was that I didn't need to generate the test database as I just wanted to get a feel for the data first.
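For the actual comparison, each query needs to be timed. A minimal sketch of a timing helper using Python's sqlite3 module is below. The helper name `time_query` is my own, and the table here is a tiny stand-in with made-up rows so that the example runs by itself; for a real run, the connection would point at the downloaded database file instead of `:memory:`.

```python
import sqlite3
import time


def time_query(conn: sqlite3.Connection, sql: str, repeats: int = 3):
    """Run sql `repeats` times; return (rows from the last run, best wall-clock seconds)."""
    best = float("inf")
    rows = []
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()  # fetchall forces the full result to be computed
        best = min(best, time.perf_counter() - start)
    return rows, best


# Stand-in data, not TPC-H data
conn = sqlite3.connect(":memory:")
conn.execute("create table lineitem (l_returnflag text, l_quantity real)")
conn.executemany(
    "insert into lineitem values (?, ?)",
    [("A", 17.0), ("N", 36.0), ("A", 8.0)],
)
rows, seconds = time_query(
    conn,
    "select l_returnflag, sum(l_quantity) from lineitem "
    "group by l_returnflag order by l_returnflag",
)
print(rows, f"{seconds:.6f}s")
```

Taking the best of a few runs smooths out cache warm-up, though whether best, median, or first-run time is the right number to report is a judgment call.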

MongoDB Benchmark

There have been some papers that have looked at porting the TPC-H benchmark to MongoDB. They had to port the relational schema to a document model and also wrote programs to run the queries. MongoDB will likely be a better comparison to multivalue as its data model is more similar. It will be good to get the comparisons in general though.
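To make the schema port concrete, here is a rough sketch of one possible denormalization: embedding each order's line items inside the order document. This is my own illustration and not the schema from the paper. The column names come from the TPC-H schema, but the values are made up and `nest_orders` is a hypothetical helper.

```python
from collections import defaultdict


def nest_orders(orders, lineitems):
    """Embed each order's line items inside the order, one document per order."""
    items_by_order = defaultdict(list)
    for item in lineitems:
        # Drop the foreign key; the nesting itself now carries that relationship.
        nested = {k: v for k, v in item.items() if k != "l_orderkey"}
        items_by_order[item["l_orderkey"]].append(nested)
    return [
        {**order, "lineitems": items_by_order.get(order["o_orderkey"], [])}
        for order in orders
    ]


# Made-up rows using TPC-H column names
orders = [{"o_orderkey": 1, "o_custkey": 370, "o_orderstatus": "O"}]
lineitems = [
    {"l_orderkey": 1, "l_linenumber": 1, "l_quantity": 17},
    {"l_orderkey": 1, "l_linenumber": 2, "l_quantity": 36},
]
docs = nest_orders(orders, lineitems)
print(docs[0])
```

A multivalue record for an order could plausibly take a similar shape, with the line items held as multivalued attributes on the order record, which is part of why the MongoDB ports look like a useful reference.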

⇒ https://www.ifi.uzh.ch/dam/jcr:ffffffff-96c1-007c-0000-000010c732ce/VertiefungRutishauser.pdf

Future Plans

The goal is to first get the smallest dataset ported to multivalue and do a test.

The big caveat is that this will likely be an apples-to-oranges comparison: Pick basically requires using BASIC to answer the queries, as the query language itself is not that powerful.

This is looking to be a useful project, as it will also allow us to compare the flavors of Pick against each other in addition to comparing against SQL databases and NoSQL databases.

References

An example of getting the TPC-H dataset working in postgres. It has the file sizes of the generated table files and also has timing for postgres.

⇒ https://myfpgablog.blogspot.com/2016/08/tpc-h-queries-on-postgresql.html

Snowflake's document on TPC-H:

⇒ https://docs.snowflake.com/en/user-guide/sample-data-tpch