Investigating a possible Linux bug: small-sized sbrk calls unexpectedly succeed despite low RLIMIT_AS
This article documents the investigatory process in detail. Skip to the patch for a summary of the situation.
Background§
I assume knowledge of the differences between the heap and the stack, along with experience using malloc
to allocate memory.
Each process on Linux is given its own address space with the layout:
+-----------+ <- Largest memory address (0xfffffffff..)
|   Stack   |
+-----------+ <- stack pointer
|           |
|           |   ||  stack grows down
|           |   \/
|           |
|           | <-- Empty space
|           |
|           |   /\
|           |   ||  heap grows up
|           |
+-----------+ <- process break
|   Heap    |
+-----------+
|    BSS    | <- Uninitialized global data
+-----------+
|   data    | <- Initialized global data
+-----------+
|   text    | <- Executable machine code
+-----------+ <- Lowest memory address (0x000000000..)
sbrk and brk are both functions that modify the process break to either grow or shrink the heap - thus changing the amount of memory that the program can access. They are commonly used to implement memory allocation inside malloc().
int brk(void *addr);       // addr is the desired break address
void *sbrk(intptr_t incr); // incr is the number of bytes to grow/shrink the break by
Introduction§
An assignment to implement a custom memory allocator in C has the test_out_of_ram test, which is supposed to ensure that the implementation gracefully handles running out of heap memory.
The test does the following:
1. It uses the getrlimit and setrlimit functions with the RLIMIT_AS parameter to restrict address space growth, artificially limiting the amount of memory that the program can access to test its behavior when out of memory. (We will discuss this further later on.)
2. Since our malloc implementation allocates memory from sbrk in large chunks, the test allocates a large amount of memory to nearly fill up existing chunks, guaranteeing that the next my_malloc invocation will cause an sbrk call.
3. It then calls malloc and asserts that it returned a NULL pointer, indicating an error.
Step 3 listed above is implemented incorrectly: the test case erroneously calls malloc instead of my_malloc and ends up testing libc malloc's behavior when out of memory instead of the custom implementation.
This test case normally passes, since libc malloc demonstrates correct behaviour.
The extra-credit portion of the assignment involves using LD_PRELOAD to override the test's calls to the malloc symbol with my_malloc. I got Chrome running on my memory allocator! Unfortunately, the test now fails. That is to say, my_malloc unexpectedly succeeds after setting rlimit.
Analysis§
Analyzing the test success§
Normally, when malloc is not overridden (and points to libc's malloc), the following occurs on line 21 of the test (void * ptr = malloc(8)):
Mallocing an additional 8 bytes, which requires more memory from sbrk, but sbrk will fail
Breakpoint 1, __GI___sbrk (increment=135168) at ./misc/sbrk.c:37
37 in ./misc/sbrk.c
(gdb) finish
Run till exit from #0 __GI___sbrk (increment=135168) at ./misc/sbrk.c:37
__glibc_morecore (increment=<optimized out>) at ./malloc/morecore.c:30
30 ./malloc/morecore.c: No such file or directory.
Value returned is $1 = (void *) 0xffffffffffffffff
sbrk() is called with a large increment size (135168 bytes for some reason). It returns (void *) -1, which gdb prints as 0xffffffffffffffff, to indicate failure, because the limit set with setrlimit prevents the heap from growing. This failure is propagated to libc malloc, which returns NULL. This is the expected behavior.
Analyzing test failure when malloc -> my_malloc§
However, when malloc forwards to my_malloc (which is the case in the extra credit assignment), this call succeeds. Using gdb, we find:
- Inside my_malloc: There is no existing chunk large enough to allocate the requested amount of memory.
- We therefore allocate a new chunk of size 1024.
- The call sbrk(1024) unexpectedly succeeds!
Furthermore, if we call sbrk(1024) in gdb again, it succeeds... three more times. It fails every time after these four successful calls.
Explaining sbrk() success§
The setrlimit documentation says the following (emphasis mine):
RLIMIT_AS
This is the maximum size of a process' total available memory, in bytes. If this limit is exceeded, the brk(), malloc(), mmap() and sbrk() functions will fail with errno set to [ENOMEM].
Remember that we've set RLIMIT_AS to 1, which is a very low value. So low that, if the documentation is correct, sbrk should not succeed at incrementing the break by even a single byte, since the process' total available memory would then exceed 1 byte.
Digging into glibc's source code for sbrk, we find it internally calls brk. brk's documentation says:
int brk(void *addr);
The brk() function sets the break value to addr and changes the allocated space accordingly.
We now look into glibc's implementation of brk, but we find that glibc only contains a stub that calls the Linux kernel's brk syscall.
int
__brk (void *addr)
To learn more, we'll have to look at the source code for the brk syscall in the Linux kernel.
The kernel manages memory in pages. We find that brk internally maintains a variable that tracks where the current break is, and allocates new pages if that variable crosses a page boundary. Early on, we see that newbrk (the desired brk value) and oldbrk (the previous brk value) are rounded up to a multiple of the page size and compared for equality.
newbrk = PAGE_ALIGN(brk);
oldbrk = PAGE_ALIGN(mm->brk);
if (oldbrk == newbrk)
That is, if the new brk does not cross a page boundary relative to the old brk, the brk() syscall reduces to merely updating the value of the brk variable without any other action. This is valid since the additional memory is already part of an allocated page.
The Answer§
The kernel checks for RLIMIT_AS setrlimit violations when brk requires allocating a new page. This does not occur if the brk change is small enough to be contained in one page.
getconf PAGESIZE on my machine returns 4096. This fits closely with sbrk(1024) never succeeding more than four times before always failing. This is validated by going through the kernel source code and observing that RLIMIT_AS is checked in the may_expand_vm call, which occurs after brk has been page-aligned.
RLIMIT_AS and RLIMIT_DATA are treated differently§
While the above conclusion explains how sbrk() surprisingly succeeds despite a very low RLIMIT_AS limit, we find an interesting snippet of code from a commit made in April 2006:
/*
 * Check against rlimit here. If this check is done later after the test
 * of oldbrk with newbrk then it can escape the test and let the data
 * segment grow beyond its set limit the in case where the limit is
 * not page aligned -Ram Gupta
 */
rlim = current->signal->rlim[RLIMIT_DATA].rlim_cur;
if (rlim < RLIM_INFINITY && brk - mm->start_data > rlim)
	goto out;

newbrk = PAGE_ALIGN(brk);
oldbrk = PAGE_ALIGN(mm->brk);
RLIMIT_DATA is another value that can be manipulated using setrlimit that is very similar to RLIMIT_AS:
RLIMIT_DATA
This is the maximum size of a process' data segment in bytes. If this limit is exceeded, the brk(), malloc() and sbrk() functions will fail with errno set to [ENOMEM].
RLIMIT_DATA is checked for violations before brk is page-aligned, as opposed to RLIMIT_AS, which is checked after brk is page-aligned. Substituting RLIMIT_DATA for RLIMIT_AS in the test case makes it behave as expected.
This difference between two very similar settings bothered me, so...
Submitted Kernel Patch§
I submitted a patch to the Linux kernel explaining the issue:
From linux-mm Mon Aug 19 00:35:00 2024
From: Kartavya Vashishtha <sendtokartavya () gmail ! com>
Date: Mon, 19 Aug 2024 00:35:00 +0000
To: linux-mm
Subject: [PATCH] mm/mmap.c: make brk() check RLIMIT_AS before page-aligning requested amount
Message-Id: <66c29336.050a0220.395e9a.76bf () mx ! google ! com>
X-MARC-Message: https://marc.info/?l=linux-mm&m=172402763105157
Currently, brk() only checks against RLIMIT_DATA when validating whether
the requested amount of memory is valid with respect to rlimit.
RLIMIT_AS is checked later in the `may_expand_vm` call in `do_brk_flags`,
but that call occurs after aligning the new brk to a page boundary, making
the following possible:
1. Allocate a non-page-sized amount of memory with brk()
2. brk() will internally page-align the requested amount, and allocate
the necessary amount of pages.
3. Set RLIMIT_AS to 1 byte using setrlimit.
4. Calling brk() again with a small increment (such that it does not
overflow to the next page) will succeed.
This violates setrlimit RLIMIT_AS, since the call succeeds despite a
1 byte limit.
The following code snippet reproduces this behavior:
```
int main() {
void * mem = malloc(4096);
sbrk(32);
// set RLIMIT_AS for the process's address space to 1 byte
// This causes all future calls to sbrk to fail
struct rlimit lim;
getrlimit(RLIMIT_AS, &lim);
lim.rlim_cur = 1;
printf("lim.rlim_max: %ld\n", lim.rlim_max);
setrlimit(RLIMIT_AS, &lim);
printf("Mallocing an additional 8 bytes, which requires more"
"memory from sbrk, but sbrk SHOULD fail\n");
void * ptr = sbrk(8);
printf("sbrk result: %p\n", ptr);
if (ptr != -1) {
printf("sbrk unexpectedly passed\n");
} else {
printf("sbrk expectedly failed\n");
}
free(mem);
}
```
Signed-off-by: Kartavya Vashishtha <sendtokartavya@gmail.com>
---
mm/mmap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index d0dfc85b209b..5f7fc6591323 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -253,8 +253,8 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
* segment grow beyond its set limit the in case where the limit is
* not page aligned -Ram Gupta
*/
- if (check_data_rlimit(rlimit(RLIMIT_DATA), brk, mm->start_brk,
- mm->end_data, mm->start_data))
+ if (check_data_rlimit(min(rlimit(RLIMIT_AS), rlimit(RLIMIT_DATA)),
+ brk, mm->start_brk, mm->end_data, mm->start_data))
goto out;
newbrk = PAGE_ALIGN(brk);
--
I received the following response:
From linux-mm Mon Aug 19 09:06:06 2024
From: Lorenzo Stoakes <lorenzo.stoakes () oracle ! com>
Date: Mon, 19 Aug 2024 09:06:06 +0000
To: linux-mm
Subject: Re: [PATCH] mm/mmap.c: make brk() check RLIMIT_AS before page-aligning requested amount
Message-Id: <334b96f7-b01e-483b-86d3-a801c459a96b () lucifer ! local>
X-MARC-Message: https://marc.info/?l=linux-mm&m=172405830317307
Thanks for your patch (I think your first? :), I don't think this is
correct however. The address space usage is not changing between these
calls, it's staying exactly the same (in fact, your efforts to avoid going
over a page boundary ensure this).
All memory allocations in the kernel are performed at page granularity, as
this is the only way to actually allocate and map memory.
As per the man page:
Lowering the soft limit for a resource below the process's current
consumption of that resource will succeed (but will prevent the process
from further increasing its consumption of the resource).
So the limit will not prevent usage of existing resource (i.e. the address
space usage).
It seems that we make an effort with RLIMIT_DATA to actually proactively
prevent usage beyond the limit, again as per the man page:
RLIMIT_DATA
This is the maximum size of the process's data segment
(initialized data, uninitialized data, and heap). The limit
is specified in bytes, and is rounded down to the system page
size. This limit affects calls to brk(2), sbrk(2), and
(since Linux 4.7) mmap(2), which fail with the error ENOMEM
upon encountering the soft limit of this resource.
Note 'encountering' the soft limit of this resource. In the case of
RLIMIT_AS:
RLIMIT_AS
...
This limit affects calls to brk(2), mmap(2), and mremap(2),
which fail with the error ENOMEM upon exceeding this limit.
...
Note 'exceeding' this limit.
Acknowledgements§
thomasqm spotted that the RLIMIT_DATA check before page-alignment was not a RLIMIT_AS check.
Feel free to write to me to point out an error, suggest a topic, or just say hi!