Open sourcing Buzz
We are glad to announce that we have open sourced Buzz, a serverless query engine written in Rust.
Written on January 15, 2021
After several months of tests and hard work, the first version of Buzz was released this week. It is a very important step for Cloudfuse in its journey toward creating a new generation of highly scalable data tools.
First of all, you should definitely take a look at the code [1] and try it out! It is pretty simple to deploy and run, and everything is documented in the source repository.
The tool is still at an early stage, but it already demonstrates great capabilities. We tried it out on quite a few datasets, and it has consistently shown very competitive performance. For example, it can do most aggregations on the NYC Taxi dataset in under 5 seconds!
Buzz uses cloud functions, such as AWS Lambda, to scale the computing resource pool for each query. No more under- or over-provisioned clusters: the engine dynamically spins up the right amount of CPUs for the data you want to process. Thanks to the responsiveness of AWS Lambda, you can get hundreds of small virtual machines within milliseconds, which makes it possible to scan gigabytes of data in parallel from cloud storage and still be serverless.
Cloud functions are very flexible, but they cannot receive inbound connections, so we need a way to gather the intermediate results. Currently, we solve this by using a container-based component to act as the reducer. We run it on Fargate to avoid any dependency on external infrastructure while keeping boot latency low. But even though Fargate only takes a few seconds to provision the resource, and the Buzz container is lightweight and quick to boot, the buzzkill (pun intended) is the network setup time. Altogether, it takes up to 30 seconds for the container to be reachable through its private IP. This means that we need to keep the reducer running throughout the query session, which makes the system only partially serverless. Even if this latency is annoying, for workloads where the heavy lifting can be left to the cloud functions, the huge scaling capability of Buzz is still intact: with only one small long-running container, we can scan terabytes of data in seconds!
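To make the split between cloud functions and the reducer more concrete, here is a minimal sketch of a two-step aggregation in plain Rust. It is not Buzz's actual code or API: the function names `hbee`, `hcomb` and `run_query`, the `Partial` struct and the sample rows are all hypothetical, and threads stand in for the cloud functions. The idea it illustrates is that each worker computes a partial aggregate over its own slice of the data, and a single reducer merges those partials.

```rust
use std::collections::HashMap;
use std::thread;

/// Partial aggregate for one group. Keeping both sum and count
/// lets the reducer compute a correct average after merging.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Partial {
    sum: f64,
    count: u64,
}

/// Hypothetical first step: one worker aggregates one partition of (key, value) rows.
/// In this sketch it plays the role of a cloud function.
fn hbee(partition: &[(&'static str, f64)]) -> HashMap<&'static str, Partial> {
    let mut acc = HashMap::new();
    for &(key, value) in partition {
        let p = acc.entry(key).or_insert(Partial { sum: 0.0, count: 0 });
        p.sum += value;
        p.count += 1;
    }
    acc
}

/// Hypothetical second step: the reducer merges the partial aggregates of all workers.
fn hcomb(partials: Vec<HashMap<&'static str, Partial>>) -> HashMap<&'static str, Partial> {
    let mut merged = HashMap::new();
    for partial in partials {
        for (key, p) in partial {
            let m = merged.entry(key).or_insert(Partial { sum: 0.0, count: 0 });
            m.sum += p.sum;
            m.count += p.count;
        }
    }
    merged
}

/// Runs one worker per partition in parallel (threads here, an assumption
/// standing in for cloud functions), then merges their results.
fn run_query(partitions: Vec<Vec<(&'static str, f64)>>) -> HashMap<&'static str, Partial> {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|part| thread::spawn(move || hbee(&part)))
        .collect();
    let partials: Vec<_> = handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect();
    hcomb(partials)
}

fn main() {
    // Made-up sample rows: (payment type, fare amount).
    let partitions = vec![
        vec![("cash", 10.0), ("card", 20.0)],
        vec![("card", 30.0), ("cash", 5.0), ("cash", 15.0)],
    ];
    let result = run_query(partitions);
    for (key, p) in &result {
        println!("{}: avg = {}", key, p.sum / p.count as f64);
    }
}
```

In this sketch only the small per-group partials travel to the reducer, which is why a single small container can sit behind a large number of workers as long as the aggregation shrinks the data enough.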
Note that this first version has some limitations:
only SQL supported by DataFusion [2] is supported by Buzz
only single zone capacity is supported
only two-step queries are supported (HBee then HComb)
only single datasource queries can be run (no join)
a Buzz stack can only read S3 in its own region (because of the S3 Gateway Endpoint)
We hope to lift these limitations in the coming months. The issues opened on GitHub [1] show the new features we currently plan to add, but feel free to open new ones or comment on existing ones if you are interested in using the system.