Investigating a possible Linux bug: small-sized sbrk calls unexpectedly succeed despite low RLIMIT_AS

This article documents the investigatory process in detail. Skip to the patch for a summary of the situation.

Background§

I assume knowledge of the differences between the heap and the stack, along with experience using malloc to allocate memory.

Each process on Linux is given its own address space with the following layout:

+-----------+ <- Largest memory address (0xfffffffff..)
|   Stack   | 
+-----------+ <- stack pointer
|           |
|           | || stack grows down
|           | \/
|           |
|           | <-- Empty space
|           |
|           | /\
|           | || heap grows up
|           |
+-----------+ <- process break
|    Heap   | 
+-----------+
|    BSS    | <- Uninitialized global data
+-----------+
|    data   | <- Initialized global data
+-----------+
|    text   | <- Executable machine code
+-----------+ <- Lowest memory address (0x000000000..)

sbrk and brk both move the process break to grow or shrink the heap, changing the amount of memory that the program can access. malloc() implementations commonly use them to obtain memory from the kernel.

#include <unistd.h>
int brk(void *addr); // addr is the desired break address
void *sbrk(intptr_t incr); // incr is the number of bytes to grow/shrink the break by

Introduction§

An assignment to implement a custom memory allocator in C has the test_out_of_ram test, which is supposed to ensure that the implementation gracefully handles running out of heap memory.

The test does the following:

  1. It uses getrlimit and setrlimit with the RLIMIT_AS resource to cap the size of the address space, artificially limiting the memory available to the program so that its out-of-memory behavior can be tested. (RLIMIT_AS is discussed in more detail below.)
  2. Since our malloc implementation requests memory from sbrk in large chunks, the test allocates a large amount of memory to nearly fill the existing chunk, guaranteeing that the next my_malloc invocation triggers an sbrk call.
  3. It then calls malloc and asserts that it returned a NULL pointer, indicating an error.

Step 3 is implemented incorrectly: the test case calls malloc instead of my_malloc, so it ends up testing how libc's malloc behaves when out of memory rather than the custom implementation.

 1  #include "my_malloc.h"
 2  #include <stdio.h>
 3  #include <stdlib.h>
 4  #include <sys/resource.h>
 5
 6  int main () {
 7    // Limit address space size to 1 byte using setrlimit
 8    struct rlimit lim;
 9    getrlimit(RLIMIT_AS, &lim);
10    lim.rlim_cur = 1;
11    setrlimit(RLIMIT_AS, &lim);
12
13    // Alloc slightly less than arena size to guarantee
14    // that next allocation requires allocating a new chunk by calling sbrk.
15    void * p = my_malloc(ARENA_SIZE - 3 * ALLOC_HEADER_SIZE);
16
17    printf("Mallocing an additional 8 bytes, which requires more memory from "
18           "sbrk, but sbrk will fail\n");
19
20    // This erroneously tests `malloc` instead of `my_malloc`.
21    void * ptr = malloc(8);
22
23    if (ptr == NULL) {
24      perror("SUCCESS: Malloc Failed");
25    } else {
26      printf("Malloc should have failed due to sbrk returning failure\n");
27    }
28  }

This test case normally passes, since libc malloc demonstrates correct behaviour.


The extra-credit portion of the assignment involves using LD_PRELOAD to override the test's calls to the malloc symbol with my_malloc. I got Chrome running on my memory allocator! Unfortunately, the test now fails: my_malloc unexpectedly succeeds even after RLIMIT_AS has been lowered.

Analysis§

Analyzing the test success§

Normally, when malloc is not overridden (and points to libc's malloc), the following occurs on line 21 (void * ptr = malloc(8)):

Mallocing an additional 8 bytes, which requires more memory from sbrk, but sbrk will fail

Breakpoint 1, __GI___sbrk (increment=135168) at ./misc/sbrk.c:37
37    in ./misc/sbrk.c
(gdb) finish
Run till exit from #0  __GI___sbrk (increment=135168) at ./misc/sbrk.c:37
__glibc_morecore (increment=<optimized out>) at ./malloc/morecore.c:30
30    ./malloc/morecore.c: No such file or directory.
Value returned is $1 = (void *) 0xffffffffffffffff

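sbrk() is called with a large increment: 135168 bytes, or 33 pages of 4096 bytes, since glibc requests heap memory in padded, page-rounded chunks rather than 8 bytes at a time. The call returns (void *) -1, which gdb shows as 0xffffffffffffffff, because the RLIMIT_AS limit prevents the address space from growing. The failure propagates to libc malloc, which returns NULL. This is the expected behavior.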

Analyzing test failure when malloc -> my_malloc§

However, when malloc forwards to my_malloc (which is the case in the extra credit assignment), this call succeeds. Using gdb, we find:

  1. Inside my_malloc: There is no existing chunk large enough to allocate the requested amount of memory.
  2. We therefore allocate a new chunk of size 1024.
  3. The call sbrk(1024) unexpectedly succeeds!

Furthermore, if we call sbrk(1024) in gdb again, it succeeds... three more times. It fails every time after these four successful calls.

Explaining sbrk() success§

The setrlimit documentation says the following (emphasis mine):

RLIMIT_AS
This is the maximum size of a process' total available memory, in bytes. If this limit is exceeded, the brk(), malloc(), mmap() and sbrk() functions will fail with errno set to [ENOMEM].

Remember that we set RLIMIT_AS to 1, a very low value. If the documentation is correct, sbrk should not be able to grow the break by even a single byte, since the process's total available memory would then exceed 1 byte.

Digging into glibc's source code for sbrk, we find it internally calls brk. brk's documentation says:

int brk(void *addr);

The brk() function sets the break value to addr and changes the allocated space accordingly.

We now look into glibc's implementation of brk, but we find that glibc only contains a stub that calls the Linux kernel's brk syscall.

int
__brk (void *addr)
{
  __curbrk = __brk_call (addr);
  if (__curbrk < addr)
    {
      __set_errno (ENOMEM);
      return -1;
    }

  return 0;
}

To learn more, we'll have to look at the source code for the brk syscall in the Linux kernel.

The kernel manages memory in pages. We find that brk tracks the current break in a variable (mm->brk) and maps new pages only when the break crosses a page boundary. Early on, we see that newbrk (the requested break) and oldbrk (the current break) are each rounded up to a multiple of the page size and compared for equality.

newbrk = PAGE_ALIGN(brk);
oldbrk = PAGE_ALIGN(mm->brk);
if (oldbrk == newbrk) {
  mm->brk = brk;
  goto success;
}

That is, if the new break does not cross a page boundary relative to the old one, the brk() syscall reduces to updating the stored break value without any other action. This is valid because the additional memory already lies within a mapped page.

The Answer§

The kernel checks for RLIMIT_AS violations only when brk requires mapping a new page. The check is skipped if the change is small enough that the break stays within its current page.

getconf PAGESIZE on my machine returns 4096, so at most four 1024-byte increments fit in one page. This fits closely with sbrk(1024) never succeeding more than four times before failing consistently. Reading the kernel source confirms it: RLIMIT_AS is checked in the may_expand_vm call, which occurs after brk has been page-aligned.

RLIMIT_AS and RLIMIT_DATA are treated differently§

While the above conclusion explains how sbrk() surprisingly succeeds despite a very low RLIMIT_AS limit, we find an interesting snippet of code from a commit made in April 2006:

/*
 * Check against rlimit here. If this check is done later after the test
 * of oldbrk with newbrk then it can escape the test and let the data
 * segment grow beyond its set limit the in case where the limit is
 * not page aligned -Ram Gupta
 */
rlim = current->signal->rlim[RLIMIT_DATA].rlim_cur;
if (rlim < RLIM_INFINITY && brk - mm->start_data > rlim)
  goto out;

newbrk = PAGE_ALIGN(brk);
oldbrk = PAGE_ALIGN(mm->brk);

RLIMIT_DATA is another resource limit that can be set with setrlimit, and it is very similar to RLIMIT_AS:

RLIMIT_DATA
This is the maximum size of a process' data segment in bytes. If this limit is exceeded, the brk(), malloc() and sbrk() functions will fail with errno set to [ENOMEM].

RLIMIT_DATA is checked before brk is page-aligned, whereas RLIMIT_AS is checked afterwards. Substituting RLIMIT_DATA for RLIMIT_AS in the test case makes it behave as expected.

This difference between two very similar settings bothered me, so...

Submitted Kernel Patch§

I submitted a patch to the Linux kernel explaining the issue:

From linux-mm  Mon Aug 19 00:35:00 2024
From: Kartavya Vashishtha <sendtokartavya () gmail ! com>
Date: Mon, 19 Aug 2024 00:35:00 +0000
To: linux-mm
Subject: [PATCH] mm/mmap.c: make brk() check RLIMIT_AS before page-aligning requested amount
Message-Id: <66c29336.050a0220.395e9a.76bf () mx ! google ! com>
X-MARC-Message: https://marc.info/?l=linux-mm&m=172402763105157

Currently, brk() only checks against RLIMIT_DATA when validating whether
the requested amount of memory is valid with respect to rlimit.

RLIMIT_AS is checked later in the `may_expand_vm` call in `do_brk_flags`,
but that call occurs after aligning the new brk to a page boundary, making
the following possible:

1. Allocate a non-page-sized amount of memory with brk()

2. brk() will internally page-align the requested amount, and allocate
the necessary amount of pages.

3. Set RLIMIT_AS to 1 byte using setrlimit.

4. Calling brk() again with a small increment (such that it does not
overflow to the next page) will succeed.

This violates setrlimit RLIMIT_AS, since the call succeeds despite a
1 byte limit.

The following code snippet reproduces this behavior:

```
int main() {
        void * mem = malloc(4096);
        sbrk(32);

	    // set RLIMIT_AS for the process's address space to 1 byte
        // This causes all future calls to sbrk to fail
        struct rlimit lim;
        getrlimit(RLIMIT_AS, &lim);
        lim.rlim_cur = 1;
        printf("lim.rlim_max: %ld\n", lim.rlim_max);
        setrlimit(RLIMIT_AS, &lim);

        printf("Mallocing an additional 8 bytes, which requires more"
	           "memory from sbrk, but sbrk SHOULD fail\n");
        void * ptr = sbrk(8);
        printf("sbrk result: %p\n", ptr);
        if (ptr != -1) {
                printf("sbrk unexpectedly passed\n");
        } else {
                printf("sbrk expectedly failed\n");
        }
        free(mem);
}
```

Signed-off-by: Kartavya Vashishtha <sendtokartavya@gmail.com>
---
 mm/mmap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index d0dfc85b209b..5f7fc6591323 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -253,8 +253,8 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 	 * segment grow beyond its set limit the in case where the limit is
 	 * not page aligned -Ram Gupta
 	 */
-	if (check_data_rlimit(rlimit(RLIMIT_DATA), brk, mm->start_brk,
-			      mm->end_data, mm->start_data))
+	if (check_data_rlimit(min(rlimit(RLIMIT_AS), rlimit(RLIMIT_DATA)),
+			      brk, mm->start_brk, mm->end_data, mm->start_data))
 		goto out;
 
 	newbrk = PAGE_ALIGN(brk);
-- 

I received the following response:

From linux-mm  Mon Aug 19 09:06:06 2024
From: Lorenzo Stoakes <lorenzo.stoakes () oracle ! com>
Date: Mon, 19 Aug 2024 09:06:06 +0000
To: linux-mm
Subject: Re: [PATCH] mm/mmap.c: make brk() check RLIMIT_AS before page-aligning requested amount
Message-Id: <334b96f7-b01e-483b-86d3-a801c459a96b () lucifer ! local>
X-MARC-Message: https://marc.info/?l=linux-mm&m=172405830317307

Thanks for your patch (I think your first? :), I don't think this is
correct however. The address space usage is not changing between these
calls, it's staying exactly the same (in fact, your efforts to avoid going
over a page boundary ensure this).

All memory allocations in the kernel are performed at page granularity, as
this is the only way to actually allocate and map memory.

As per the man page:

    Lowering the soft limit for a resource below the process's current
    consumption of that resource will succeed (but will prevent the process
    from further increasing its consumption of the resource).

So the limit will not prevent usage of existing resource (i.e. the address
space usage).

It seems that we make an effort with RLIMIT_DATA to actually proactively
prevent usage beyond the limit, again as per the man page:

       RLIMIT_DATA
              This is the maximum size of the process's data segment
              (initialized data, uninitialized data, and heap).  The limit
              is specified in bytes, and is rounded down to the system page
              size.  This limit affects calls to brk(2), sbrk(2), and
              (since Linux 4.7) mmap(2), which fail with the error ENOMEM
              upon encountering the soft limit of this resource.

Note 'encountering' the soft limit of this resource. In the case of
RLIMIT_AS:

       RLIMIT_AS
              ...
              This limit affects calls to brk(2), mmap(2), and mremap(2),
              which fail with the error ENOMEM upon exceeding this limit.
              ...

Note 'exceeding' this limit.

Acknowledgements§

thomasqm spotted that the RLIMIT_DATA check before page-alignment was not an RLIMIT_AS check.



Feel free to write to me to point out an error, suggest a topic, or just say hi!