Open Source and Serverless Data
Serverless data should not be synonymous with black-box managed services.
Written on September 5, 2022
If you are reading these lines, you are probably convinced that value is hiding inside your data, and you are trying to answer the following question: how can I extract that value and make it actionable?
At Cloudfuse we define data analytics as the set of techniques that enable answering business questions from data.
Even though the definition is simple, it covers a very diverse landscape:
Different businesses and roles will have very different questions. Some use cases focus on high-level aggregated trends, whereas others are interested in specific low-occurrence events.
The types of questions also evolve through the different phases of an analysis. While an analyst's first interrogations are often about the boundaries of the dataset itself, they later narrow into much more concrete investigations of the domain-related patterns hidden inside it.
The data itself can take very different forms, from structured tabular data to free-form text or even photos and videos.
Unfortunately, the set of problems that must be solved to tackle this diversity of data analytics is extremely broad. As always with engineering, different problems imply different methods and tools to solve them.
Many projects have promised a one-stop shop for all analytics. But this is actually very hard to deliver from a technical perspective, because data engineering solutions require compromises: optimizing for one class of problems means giving up on another. Even though many automatic optimization strategies exist, none is able to cover the huge diversity of the problem space.
Spotlight on the Cloud
One of the very first questions that sets the stage for a data project is whether or not to use the cloud. This question is usually settled at an organizational level and can thus mostly be seen as an external given rather than a purely technical choice. Though we agree that on-premise and hybrid environments will keep evolving and growing in the coming years, it is in the cloud that we, at Cloudfuse, see the biggest opportunities.
Currently data analytics solutions in the cloud can be divided into two broad categories:
Engines managed by the users. These can be open source or not, but the user keeps control of the resources that are used.
Managed services. The data is loaded into a black-box service that takes care of its storage and querying.
There are two fundamental advantages to managed services:
Less operational burden: these systems are often easier to configure and maintain.
Serverless pricing: the service provider can mutualize data processing resources and thus offer usage-based pricing.
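To make the usage-based pricing argument concrete, here is a back-of-the-envelope comparison between an always-on cluster and a pay-per-use service. All rates and workload figures below are made-up assumptions for illustration, not real vendor pricing:

```python
# Illustrative cost comparison: always-on cluster vs usage-based billing.
# Every number here is a hypothetical assumption, not actual pricing.

HOURS_PER_MONTH = 730

def provisioned_cost(hourly_rate: float) -> float:
    """An always-on cluster is billed for every hour of the month."""
    return hourly_rate * HOURS_PER_MONTH

def serverless_cost(rate_per_query_second: float,
                    queries_per_month: int,
                    avg_query_seconds: float) -> float:
    """A usage-based service only bills for the seconds spent querying."""
    return rate_per_query_second * queries_per_month * avg_query_seconds

# Hypothetical workload: 5,000 queries per month, 10 seconds each.
always_on = provisioned_cost(hourly_rate=2.0)
on_demand = serverless_cost(0.01, 5_000, 10.0)

print(f"always-on:  ${always_on:,.0f}")   # $1,460
print(f"on-demand:  ${on_demand:,.0f}")   # $500
```

With a bursty workload like this one, the usage-based model wins; the gap shrinks as utilization approaches 100%.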
But they also come with an undeniable cost. As discussed earlier, data engineering goes with compromises, and no system is able to behave equally well across all use cases. When the analytical use case hits one of these limitations, engineering teams have no way to adjust the system to their needs.
At Cloudfuse, we believe that there is a path in the cloud that reconciles the best of both worlds. We believe in user-managed engines, especially open source ones, but with minimal operational burden and highly interactive serverless capabilities. The elasticity of compute resources in the cloud is still heavily underutilized by data analytics solutions. Our vision is to provide the tooling that makes it possible.
Let's dive a little bit deeper into technical details!
How managed services provide serverless capabilities
The central piece that answers this question is object storage: a simple data storage service that scales virtually infinitely. Because it is used for a large variety of use cases, all cloud providers have provisioned it at such a huge scale that it is unlikely that any individual user can hit its limits.
Until recently, data querying solutions all stated the same thing: “Object storage is slow! Only use it for your backups and cold storage.” But that has changed, with services like Snowflake proving that it is possible to decouple compute and storage without sacrificing latency. Once this is achieved, the natural next step is to shut down the compute resources when they are not used. Instead of pre-allocating resources for data processing use cases, they can be dynamically assigned to workloads. This effectively shifts the pricing to a usage-based model: when no processing is running, users only pay for the data storage.
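One reason engines can now query object storage with low latency is that they avoid downloading whole files: they issue ranged GET requests, for instance to fetch only the footer of a columnar file, where formats such as Parquet keep their metadata. A minimal sketch of building such a range request (the footer size is an illustrative assumption; the header format is standard HTTP):

```python
def footer_range_header(object_size: int, footer_bytes: int = 64 * 1024) -> str:
    """Build an HTTP Range header fetching only the tail of an object.

    Columnar formats such as Parquet store their metadata in a footer,
    so an engine can plan a query after reading just the last few KiB
    instead of downloading the whole object.
    """
    start = max(0, object_size - footer_bytes)
    return f"bytes={start}-{object_size - 1}"

# A hypothetical 1 GiB object: only the last 64 KiB are fetched.
print(footer_range_header(1 << 30))  # bytes=1073676288-1073741823
```

In practice this header would be passed to the storage API (e.g. the `Range` parameter of an S3 GetObject call).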
To be able to provision data processing resources for their users interactively (in just a few seconds), managed data services maintain pools of pre-initialized query engines. The challenge is then to strike a balance: the pool must be large enough to always meet demand without wasting too many idle resources. There is also an economy of scale: a larger number of users averages out the usage spikes of individual ones.
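This economy of scale can be illustrated with a small simulation: the peak of the aggregated demand of many users is much smaller than the sum of their individual peaks, so one shared pool needs far less headroom than per-user pools. The bursty traffic model below is a made-up assumption:

```python
import random

random.seed(42)  # fixed seed so the simulation is reproducible

N_USERS = 100
N_STEPS = 1_000

def user_demand() -> int:
    """Hypothetical bursty user: idle 95% of the time, otherwise
    requesting up to 10 query engines at once."""
    return random.randint(1, 10) if random.random() < 0.05 else 0

# For each time step, record every user's demand.
demands = [[user_demand() for _ in range(N_USERS)] for _ in range(N_STEPS)]

# Per-user sizing: each isolated pool must cover its own user's peak.
per_user_peaks = [max(step[u] for step in demands) for u in range(N_USERS)]
isolated_pool = sum(per_user_peaks)

# Shared sizing: one pool only needs to cover the aggregated peak.
shared_pool = max(sum(step) for step in demands)

print(f"isolated pools: {isolated_pool} engines")
print(f"shared pool:    {shared_pool} engines")
```

Because individual spikes rarely coincide, the shared pool ends up several times smaller than the sum of isolated pools, which is exactly the margin managed services monetize.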
Is that applicable to user managed engines?
At the level of an individual cloud account, it is not possible to use the same pooling mechanism as managed services. But cloud providers are actually getting better and better at provisioning compute resources within milliseconds. One type of cloud resource is particularly performant in that regard: cloud functions. Cloud functions are small virtual servers that are started ad hoc for small computing tasks such as serving API endpoint queries. Just as managed data services do with their data processing resources, cloud providers maintain large pools of these resources to meet demand. Interestingly, here the pool of resources is shared not only between users with data workloads but across a much wider variety of use cases.
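The pattern this enables is a fan-out: split the dataset into partitions, invoke one short-lived function per partition, and merge the partial results. The sketch below uses local threads as a stand-in for cloud functions so it runs anywhere; in a real deployment each task would be a function invocation (e.g. an AWS Lambda call) scanning one file, and `scan_partition` is a hypothetical worker, not a real API:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(partition_id: int) -> int:
    """Hypothetical worker: scan one partition, return a partial count.
    In a real setup this would be the body of a cloud function reading
    its partition from object storage."""
    rows_per_partition = 1_000  # made-up fixed size for the sketch
    return rows_per_partition

partitions = range(64)  # one task per data partition

# Launch all "functions" at once; the provider's pool absorbs the burst.
with ThreadPoolExecutor(max_workers=64) as pool:
    partial_counts = list(pool.map(scan_partition, partitions))

total = sum(partial_counts)
print(f"total rows: {total}")  # 64 partitions x 1,000 rows each
```

The key property is that the 64 workers exist only for the duration of the query; nothing is billed once the fan-out completes.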
Cloud functions do have some limitations, though. For instance, their network configurability is very limited. This is why, at Cloudfuse, we are building the tooling to overcome these limitations.
Once we have cleared the question of provisioning large amounts of compute interactively, the next challenge is to set up the appliance that performs the analytical workload. Many data frameworks were not optimized to start quickly: they were designed to be installed on traditional instances or even on-premise servers. This is particularly true for query engines that run on top of the JVM, such as Spark or Trino. Luckily, there is a new generation of engines, often written in lower-level languages, that can get ready to process data within a few seconds. ClickHouse and Databend are great examples of this. With the right configuration, they might even be tuned to start in a few hundred milliseconds. It is one of our tasks to collaborate with the communities and maintainers of these technologies to identify these optimization opportunities.
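Startup time is easy to measure empirically. The sketch below times how long a freshly spawned process takes to signal readiness; the command is a trivial stand-in (a Python one-liner), and to benchmark a real engine you would replace it with the engine binary, for example a `clickhouse local` invocation, assuming it is installed:

```python
import subprocess
import sys
import time

# Minimal cold-start benchmark. The command below is a trivial stand-in;
# substitute the engine binary you want to measure (assuming it is
# installed), e.g. ["clickhouse", "local", "--query", "SELECT 1"].
COMMAND = [sys.executable, "-c", "print('ready')"]

start = time.perf_counter()
result = subprocess.run(COMMAND, capture_output=True, text=True)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"cold start: {elapsed_ms:.0f} ms")
```

Running this a few times and keeping the median gives a fair picture of how close an engine gets to the few-hundred-millisecond range mentioned above.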
About our earlier work
We have been working on serverless query engines since 2020. If you want to learn more about our earlier work, for instance around our Buzz query engine, feel free to take a look at our blog archive.