Buzz is now compatible with Delta Lake

An efficient and flexible data catalog is the key to a powerful query engine.

Written on July 1, 2021

A few months ago, we open-sourced Buzz, a serverless query engine. In that first version, users had to compile the data catalog into the binary before they could start querying their datasets. The natural next step would have been to discover data files through the listing API of the cloud storage and automatically parse Hive-style partitioning:

year=2020/month=02/day=02/data.parquet

year=2020/month=02/day=06/data.parquet

year=2020/month=04/day=08/data.parquet
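
As a minimal sketch of what parsing such paths involves, the function below extracts the `key=value` pairs from the directory segments of a path. This is an illustration only, not code from Buzz: the name `parse_hive_partitions` is hypothetical, and values are kept as plain strings without any type inference.

```rust
/// Extract the partition columns encoded in a Hive-style path.
/// Hypothetical helper for illustration; not part of Buzz.
fn parse_hive_partitions(path: &str) -> Vec<(String, String)> {
    // Drop the final segment (the file name) so only directories are inspected.
    let dirs = match path.rsplit_once('/') {
        Some((dirs, _file)) => dirs,
        None => return Vec::new(),
    };
    dirs.split('/')
        // Keep only segments shaped like `key=value`; others are skipped.
        .filter_map(|segment| segment.split_once('='))
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}
```

Applied to the second path above, this returns the pairs `year`/`2020`, `month`/`02` and `day`/`06`, in path order.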

This very classical feature is still on the roadmap, but we decided to first implement a more "state of the art" catalog standard. Delta Lake was the perfect candidate. It allows very efficient retrieval of the file list, along with many other features such as ACID transactions when writing to the data lake. You can learn more about all the great benefits of this catalog format in the official documentation [1].
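
To give an intuition of why the file list is cheap to retrieve: a Delta table keeps an ordered transaction log in which commits record files being added and removed, so the current file list can be obtained by replaying that log instead of listing the storage. The sketch below is a deliberately simplified model of that replay. The `Action` enum and `active_files` function are assumptions made for illustration, not the delta-rs API, and real commits carry far more metadata.

```rust
use std::collections::BTreeSet;

/// Simplified stand-in for the actions recorded in a commit (illustrative only).
enum Action {
    Add(String),
    Remove(String),
}

/// Replay commits in order to obtain the files of the current table version.
fn active_files(commits: &[Vec<Action>]) -> BTreeSet<String> {
    let mut files = BTreeSet::new();
    for commit in commits {
        for action in commit {
            match action {
                // A file written by this commit becomes part of the table.
                Action::Add(path) => {
                    files.insert(path.clone());
                }
                // A file logically deleted by this commit leaves the table.
                Action::Remove(path) => {
                    files.remove(path);
                }
            }
        }
    }
    files
}
```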

Even better, a Rust implementation of Delta Lake was recently developed and is actively maintained [2]. This allowed us to speed up the integration of this catalog format into Buzz.

You can now query any Delta table stored on S3 by simply specifying its URI in your query. You can also prune the files to be scanned with a partition filter, thus making Buzz compatible with petabyte datasets.
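
The idea behind partition pruning can be sketched as follows: each file carries its partition values, and only the files matching the filter are scanned. The `DataFile` struct and `prune` function are hypothetical names used for illustration, and the filter is restricted to equality on string values. Buzz's actual filter syntax and implementation may differ.

```rust
use std::collections::HashMap;

/// Hypothetical description of a data file and its partition values.
struct DataFile {
    path: String,
    partitions: HashMap<String, String>,
}

/// Keep only the files whose partition values match every
/// (column, value) pair of the filter.
fn prune<'a>(files: &'a [DataFile], filter: &[(&str, &str)]) -> Vec<&'a DataFile> {
    files
        .iter()
        .filter(|file| {
            filter.iter().all(|(column, expected)| {
                // A file with no value for the column never matches.
                file.partitions.get(*column).map(String::as_str) == Some(*expected)
            })
        })
        .collect()
}
```

Because the decision relies only on catalog metadata, files that do not match are never opened, which is what keeps the scan proportional to the selected partitions rather than to the whole table.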

If you try it out on your datasets, we would love to hear your feedback, either through an issue on GitHub [3] or by reaching out to us directly!

[1] https://docs.delta.io/latest/delta-intro.html

[2] https://github.com/delta-io/delta-rs

[3] https://github.com/cloudfuse-io/buzz-rust/issues
