Running seven Data Engines on Lambda

We successfully deployed and ran seven different open-source query engines on AWS Lambda. Read about what we learned.

Written on December 5, 2022

We just managed to lambdatize ClickHouse [1]. This brings to seven the number of query engines that can execute SQL queries from AWS Lambda on data stored in S3. Let us walk you through this amazing adventure!
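To give a flavor of what a lambdatized engine can look like, here is a minimal sketch of a Lambda handler that shells out to clickhouse-local to query a Parquet file in S3. The bucket, file name, and handler wiring are illustrative rather than our actual integration, and credential configuration is elided:

```python
import subprocess

def handler(event, context):
    # Illustrative sketch: assumes the clickhouse binary is bundled in the
    # Lambda container image and the target S3 object is readable.
    query = (
        "SELECT count(*) "
        "FROM s3('https://my-bucket.s3.amazonaws.com/data.parquet', 'Parquet')"
    )
    result = subprocess.run(
        ["clickhouse", "local", "--query", query],
        capture_output=True, text=True, check=True,
    )
    return {"rows": result.stdout.splitlines()}
```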

Choosing the engines

You might be wondering how we chose the seven query engines we decided to run on AWS Lambda. First, we have to take a step back and reconsider the big picture of what we are trying to achieve.

As discussed in our earlier blog post, Cloudfuse's vision is to bring serverless to existing data engines in order to provide a more open and flexible alternative to managed services. These engines differ along many dimensions, but for us, the most crucial ones were the following:

Open vs Closed

We only selected open-source solutions for this first round of integration. It seems fair to prioritize the players that share their knowledge with their communities, because it's an approach we believe in and strongly support. But having access to the source code is also very helpful in this kind of integration work. More than once, we had to lightly patch an engine to overcome AWS Lambda limitations, which wouldn't have been possible with a closed-source solution. If you would like to experiment with running a specific closed-source engine on AWS Lambda, don't hesitate to reach out so we can figure out how to collaborate on this.

Distributed vs Library

Libraries provide the tools a user needs to process data locally. A popular example among data scientists is the Pandas Python library [2]. Distributed query engines are often built on top of such libraries, but contain extra logic to schedule and distribute the processing across multiple servers. Dask [3], for instance, is a distributed query engine that enables running Pandas workloads on multiple machines. Data libraries cover many use cases and are already commonly used on cloud functions, but they hit limits as data volumes grow. This is why the real challenge is to run distributed query engines in a serverless fashion.
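The distinction is easy to see in code. Below is a rough sketch contrasting the two approaches; file paths and column names are made up for illustration:

```python
import pandas as pd
import dask.dataframe as dd

# Library: Pandas pulls the whole file into local memory before aggregating.
df = pd.read_parquet("data.parquet")
print(df.groupby("key")["value"].sum())

# Distributed engine: Dask builds the same computation as a task graph
# that a scheduler can spread over many workers (or cloud functions),
# only materializing the result on .compute(). Reading from an s3://
# path also requires the s3fs package.
ddf = dd.read_parquet("s3://my-bucket/data/*.parquet")
print(ddf.groupby("key")["value"].sum().compute())
```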

Data format compatibility

Most open query engines are capable of processing multiple data formats. By design, though, some engines only support querying data in their own storage format. This is particularly true for engines specialized in full-text search, which can be heavily optimized through dedicated index formats. This makes it harder to build a fair comparison in which all engines query the same data format, but that doesn't mean such engines couldn't take advantage of running serverless. If you are interested in seeing one of them integrated into our benchmarks, we would probably be interested in collaborating!

Our test case

When exploring a dataset, it is common to work through it dimension by dimension to better understand trends that are specific to given attributes. For instance, when exploring sales figures, an executive might:

  • first break down the numbers by quarter,

  • then select the quarter with the worst performance,

  • then group the data by department,

  • compare this to the same quarter the year before,

  • filter further down on the department that had an outstanding sales increase,

  • ...

This process is commonly called slicing and dicing and is a very efficient way to confront domain intuition with ground truth. This is why we chose a typical slicing query for benchmarking the engines:

  • SELECT payment_type, SUM(trip_distance) FROM file GROUP BY payment_type
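For reference, the same query can be reproduced locally with any Parquet-capable engine. Here is a quick sketch using DuckDB's Python API, where the file name is a placeholder for any Parquet file containing these two columns:

```python
import duckdb

# Run the benchmark query against a local Parquet file.
con = duckdb.connect()
rows = con.execute(
    "SELECT payment_type, SUM(trip_distance) "
    "FROM 'file.parquet' "
    "GROUP BY payment_type"
).fetchall()
print(rows)
```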

As for the file format, we selected Parquet because query engines are usually well optimized for it. It is the most common "generic performance" format, meaning that data engineers use it when they want to be able to scan their dataset efficiently without strong a priori knowledge of the queries that will be run against it.

We chose a medium-sized file of approximately 120 MB because we wanted a fair amount of processing to take place without running into memory issues.
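Part of what makes Parquet scan-efficient is that its footer describes every column chunk, so engines can read only the columns (and row groups) a query touches. A quick way to see this with pyarrow, again with an illustrative file name:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("file.parquet")
print(pf.metadata)  # row groups, per-column chunk sizes, encodings

# Only the two columns the benchmark query touches are read from storage,
# a small fraction of the bytes in a ~120 MB file.
table = pf.read(columns=["payment_type", "trip_distance"])
print(table.num_rows)
```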

Performance comparison

As a dashboard is worth a thousand words, here is the link to our results.

Disclaimer: measured performance differences between engines depend greatly on how the integration with AWS Lambda was written. Our ambition was not to go deep into optimization for this first round. If you believe that an engine should show different figures, feel free to contact us so we can troubleshoot it together. You can also directly submit contributions to our open-source benchmarking repository [4].

[1] https://github.com/cloudfuse-io/lambdatization/pull/73

[2] https://pandas.pydata.org/

[3] https://www.dask.org/

[4] https://github.com/cloudfuse-io/lambdatization/

Note about Influx IOx: We tried to add the new InfluxData engine called IOx to our list. Unfortunately, even though the engine uses Parquet as its storage format and object storage as its backend, we couldn't find a way to query external Parquet files with it. If you want to discuss this result, please reach out!