Memory allocation in cloud functions
You can start thousands of concurrent cloud functions in milliseconds on AWS Lambda. But this comes with some compromises. Let's look at memory allocation performance.
Written on May 28, 2020
The whole point of cloud computing is to build a layer of abstraction over the hardware that allows customers to provision the computing resources they need when they need them. This means that in most cases, you will be executing your application in a virtually isolated environment, but the actual hardware is going to be shared with other customers. With cloud functions, this sharing of resources is pushed one step further, as your running environment can be as small as 128MB and run only for a few milliseconds. On the other side of the curtain, the cloud provider wants to pack as many customers as possible on a given piece of hardware to improve its profitability. But of course, this optimization is constrained by the quality of service (QoS) the provider wants to ensure on the cloud function service.
With most cloud function providers (e.g. AWS and GCP), you dimension the runtime by specifying a given amount of RAM. The amount of CPU and IO bandwidth (network, RAM access or even temp disk) is said to be allocated proportionally to the RAM, even if we have to admit this is not always very explicitly documented. It is quite easy to imagine how CPU and IO bandwidth can be divided between the competing workloads on the hardware, because those are "point in time" resources. By colocating different types of workloads on the same hardware, you can minimize the probability that everybody requests the same resource at the same time. But memory allocation is another deal. Memory persists for the whole duration of the function, and usually spans multiple executions as functions are kept warm for a given period of time. Does this mean that the cloud provider needs to strictly partition its hardware RAM according to the amounts provisioned by customers? Probably not. For instance Firecracker, the microVM engine that powers AWS Lambda, is designed to allow oversubscription of RAM [1]. Indeed, very few workloads actually touch memory pages up to the provisioned amount of RAM, and hypervisors have exploited this for a long time [2] (the sketch after the list below illustrates this lazy mapping at the process level). So you can make 1000 VMs, each provisioned with 1GB of RAM, coexist on a host with 500GB of RAM if:
you know that on average the VMs do not use more than half of their provisioned RAM
you have a way to dynamically allocate host RAM to the cloud function VMs when they need it
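Here is that sketch: a minimal Linux-only illustration (not related to Firecracker internals, and not part of the Lambda benchmark) of why provisioned memory is rarely all backed by physical RAM. It reserves 1GB with mmap and compares the process resident set size (VmRSS) before and after the region is actually written.

```cpp
// Minimal sketch (Linux only, illustrative sizes): virtual memory that is
// reserved but never written stays mostly unbacked by physical RAM.
#include <sys/mman.h>

#include <cstdio>
#include <cstring>
#include <fstream>
#include <string>

// Resident set size (VmRSS) of the current process, in kB.
static long resident_kb() {
  std::ifstream status("/proc/self/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("VmRSS:", 0) == 0) return std::stol(line.substr(6));
  }
  return -1;
}

int main() {
  const size_t size = 1UL << 30;  // reserve 1GB of virtual memory
  char* region = static_cast<char*>(mmap(nullptr, size, PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
  if (region == MAP_FAILED) return 1;

  std::printf("after mmap:   VmRSS = %ld kB\n", resident_kb());
  std::memset(region, 'x', size);  // first touch: pages get backed by RAM now
  std::printf("after memset: VmRSS = %ld kB\n", resident_kb());

  munmap(region, size);
  return 0;
}
```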
AWS probably has proprietary drivers on top of the open source codebase, so we decided not to dive too far into Firecracker. What really interests us is the performance impact. Indeed, it seems likely that this two-step memory allocation is slower than regular page allocation, or that it is throttled somehow.
Let's benchmark!
Memory is divided into pages (usually 4KB). Pages from virtual memory are allocated to physical memory on demand, only when they are used for the first time. We want to measure the difference in speed between allocating and writing into memory regions that are already mapped to physical memory, and doing the same into regions that require new mappings.
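As a quick side illustration of this on-demand mapping (not the Lambda benchmark itself), counting minor page faults around two successive memsets of the same buffer shows that pages are only mapped on the first write. The buffer size here is arbitrary.

```cpp
// Sketch (Linux/POSIX): minor page faults occur on the first write to a
// buffer, not on later writes to the same addresses.
#include <sys/resource.h>

#include <cstdio>
#include <cstdlib>
#include <cstring>

static long minor_faults() {
  struct rusage usage {};
  getrusage(RUSAGE_SELF, &usage);
  return usage.ru_minflt;
}

int main() {
  const size_t size = 64UL * 1024 * 1024;  // 64MB buffer
  char* buf = static_cast<char*>(std::malloc(size));
  if (buf == nullptr) return 1;

  long before = minor_faults();
  std::memset(buf, 'a', size);  // first touch: pages get mapped here
  long first_touch = minor_faults() - before;

  before = minor_faults();
  std::memset(buf, 'b', size);  // same addresses: almost no new faults
  long second_touch = minor_faults() - before;

  std::printf("minor faults: first touch %ld, second touch %ld\n",
              first_touch, second_touch);
  std::free(buf);
  return 0;
}
```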
We are going to focus on AWS for now, and we arbitrarily provision 2048MB of RAM. This will allow us to test how fast Lambda can actually provide us with the memory we provisioned. We use the memory pool provided by Arrow, which uses jemalloc as its allocator [3].
We allocate 100 buffers of 1MB and paint them with a given character using memset. 1MB is a reasonable size for an Arrow array (~100k values). The painting operation is important, because a page that is allocated but not touched is usually not mapped to physical memory. We then free the 100 buffers and re-allocate them, painting them with another character.
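Below is a minimal reconstruction of that procedure, assuming arrow::default_memory_pool() is the jemalloc-backed pool (which is the case when Arrow C++ is built with jemalloc support). It is a sketch of the benchmark described above, not the exact code we ran on Lambda; the timing of each round includes the memset, since touching the pages is precisely what triggers the mapping.

```cpp
// Sketch of the benchmark: allocate 100 x 1MB through Arrow's memory pool,
// paint the buffers, free them, then repeat and compare timings.
// Build with something like: g++ -O2 bench.cc -larrow
#include <arrow/memory_pool.h>

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

static int64_t run_round(arrow::MemoryPool* pool, char fill) {
  const int64_t buffer_size = 1 << 20;  // 1MB buffers, as in the article
  const int num_buffers = 100;
  std::vector<uint8_t*> buffers(num_buffers, nullptr);

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < num_buffers; ++i) {
    if (!pool->Allocate(buffer_size, &buffers[i]).ok()) return -1;
    std::memset(buffers[i], fill, buffer_size);  // touch every page
  }
  auto elapsed = std::chrono::steady_clock::now() - start;

  for (int i = 0; i < num_buffers; ++i) pool->Free(buffers[i], buffer_size);
  return std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
}

int main() {
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  std::printf("first round (fresh pages):   %lld us\n",
              static_cast<long long>(run_round(pool, 'a')));
  std::printf("second round (reused pages): %lld us\n",
              static_cast<long long>(run_round(pool, 'b')));
  return 0;
}
```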
You can see that on the second run, no new pages are mapped and allocation is 10x faster. This means two things:
jemalloc played nicely and reused the exact same memory addresses as for the first batch of allocations, great!
mapping to physical memory really does have a cost. If we try to allocate larger chunks, the allocation is proportionally slower. For instance, increasing the buffer size to 2MB (512 pages) increases the page creation time to 2750µs and the allocation on existing addresses to 220µs. That is roughly 5µs per 4KB page when new mappings are needed.
Do we care?
When playing with performance optimizations at this level, you should always ask yourself whether it really makes a difference for the actual application. A 10x speed-up is tempting, but the true question is whether this actually adds up to something significant or not. Here, we see that an allocation of 1MB takes approximately 1ms when it needs to claim the physical memory from the hypervisor, so we can say that we have an allocation speed for physical memory of approximately 1GB/s. To give you a few points of comparison:
When you download from S3, if you do it with 10 threads you can get speeds close to 100MB/s
Gzip typically decompresses at 200MB/s per core
1GB/s might not seem like a bottleneck in this context. But there is one enemy that could bring this threshold into play: memory copies. Arrow does a very good job of avoiding copies whenever possible, but they can still happen occasionally.
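To see why copies are the usual suspect, here is an illustrative sketch (arbitrary sizes, not part of the Lambda benchmark) that copies the same source buffer twice into the same destination: the first copy has to map the destination pages as it goes, while the second runs at full memory bandwidth.

```cpp
// Sketch: memcpy into untouched pages pays the page-mapping cost on the fly,
// memcpy into already-mapped pages does not.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

static double copy_gbps(const char* src, char* dst, size_t size) {
  auto start = std::chrono::steady_clock::now();
  std::memcpy(dst, src, size);
  std::chrono::duration<double> seconds = std::chrono::steady_clock::now() - start;
  return (size / 1e9) / seconds.count();
}

int main() {
  const size_t size = 256UL * 1024 * 1024;  // 256MB
  char* src = static_cast<char*>(std::malloc(size));
  char* dst = static_cast<char*>(std::malloc(size));
  if (src == nullptr || dst == nullptr) return 1;
  std::memset(src, 'x', size);  // make sure the source is mapped

  // First copy: destination pages are untouched and get mapped during the copy.
  std::printf("copy into fresh pages:  %.2f GB/s\n", copy_gbps(src, dst, size));
  // Second copy: same destination, pages already mapped.
  std::printf("copy into mapped pages: %.2f GB/s\n", copy_gbps(src, dst, size));

  std::free(dst);
  std::free(src);
  return 0;
}
```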
Could we mitigate it?
Video games and other real-time engines are familiar with this phenomenon and typically use the concept of "arena" allocators to pre-reserve physical memory on application start. This is definitely something that can be done in Buzz, because during the first few hundred milliseconds of their life, our workers are mainly communicating over the network to set up their context. This could easily be parallelized with a thread that initiates a large arena for Arrow to work in. But sadly, it is unlikely that this can be achieved by blindly allocating memory with jemalloc. Jemalloc also structures its allocations into arenas, but it would be hard to make sure that the mappings we created to physical memory are reused on subsequent allocations. In particular, jemalloc uses specific arenas for specific threads, which makes it even less likely that the pre-warmed mappings will be picked up.
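As a thought experiment, here is a hypothetical sketch of what such pre-warming could look like if we controlled the allocator ourselves: a background thread touches a large region while the worker does its setup, and a trivial bump allocator later hands out the pre-faulted pages. This is only an illustration of the idea, not how Buzz, Arrow or jemalloc actually manage memory.

```cpp
// Hypothetical sketch: pre-fault a large memory region in a background thread
// during worker startup, then serve allocations out of it with a trivial bump
// allocator. Illustration only; individual frees are not supported.
#include <sys/mman.h>

#include <atomic>
#include <cstddef>
#include <cstring>
#include <thread>

class PrewarmedArena {
 public:
  explicit PrewarmedArena(size_t capacity) : capacity_(capacity) {
    base_ = static_cast<char*>(mmap(nullptr, capacity_, PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (base_ == MAP_FAILED) { base_ = nullptr; return; }
    // Touch every page now so that physical memory is claimed during startup,
    // not on the latency-sensitive query path.
    warmer_ = std::thread([this] { std::memset(base_, 0, capacity_); });
  }

  // Block until the whole region has been touched (call once setup is done).
  void wait_warm() {
    if (warmer_.joinable()) warmer_.join();
  }

  ~PrewarmedArena() {
    wait_warm();
    if (base_ != nullptr) munmap(base_, capacity_);
  }

  // Bump allocation out of the pre-touched region.
  void* allocate(size_t size) {
    if (base_ == nullptr) return nullptr;
    size_t offset = next_.fetch_add(size);
    return offset + size <= capacity_ ? base_ + offset : nullptr;
  }

 private:
  size_t capacity_;
  char* base_ = nullptr;
  std::atomic<size_t> next_{0};
  std::thread warmer_;
};

int main() {
  PrewarmedArena arena(512UL * 1024 * 1024);  // pre-touch 512MB in background
  // ... the worker's network/context setup would run here, in parallel ...
  arena.wait_warm();                       // ensure warm-up has finished
  void* buffer = arena.allocate(1 << 20);  // this hits already-mapped pages
  (void)buffer;
  return 0;
}
```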
Conclusion
This discussion was very dense, and there are still many aspects to explore. If you have any questions about how we arrived at these observations, or if you want to explore the problem further with us, contact us!