[PATCH] linux-user: Reserve space for brk

Richard Henderson posted 1 patch 4 years, 3 months ago
Test docker-quick@centos7 passed
Test docker-mingw@fedora passed
Test FreeBSD passed
Test checkpatch passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20200117230245.5040-1-richard.henderson@linaro.org
Maintainers: Laurent Vivier <laurent@vivier.eu>, Riku Voipio <riku.voipio@iki.fi>
linux-user/qemu.h    |  1 +
linux-user/elfload.c | 73 +++++++++++++++++++++++++++++++++-----------
2 files changed, 57 insertions(+), 17 deletions(-)
[PATCH] linux-user: Reserve space for brk
Posted by Richard Henderson 4 years, 3 months ago
With bad luck, we can wind up with no space at all for brk,
which will generally cause the guest malloc to fail.

This bad luck is easier to come by with ET_DYN (PIE) binaries,
where either the stack or the interpreter (ld.so) gets placed
immediately after the main executable.

But there's nothing preventing this same thing from happening
with ET_EXEC (normal) binaries, during probe_guest_base().

In both cases, reserve some extra space via mmap and release
it back to the system after loading the interpreter and
allocating the stack.

The choice of 16MB is somewhat arbitrary.  It's enough for libc
to get going, but without being so large that 32-bit guests or
32-bit hosts are in danger of running out of virtual address space.
It is expected that libc will be able to fall back to mmap arenas
after the limited brk space is exhausted.

Launchpad: https://bugs.launchpad.net/qemu/+bug/1749393
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---

Note that the LP comments mention the fix for this in the kernel,
and about there being a "guaranteed 128MB gap" for x86_64.  As far
as I can see, this "gap" is part of the unmapped_area() algorithm.
For qemu, this would correspond to mmap_find_vma(), except that,
except when we fall back to mmap_find_vma_reserved(), we are not
100% in control over the allocation.


r~

---
 linux-user/qemu.h    |  1 +
 linux-user/elfload.c | 73 +++++++++++++++++++++++++++++++++-----------
 2 files changed, 57 insertions(+), 17 deletions(-)

diff --git a/linux-user/qemu.h b/linux-user/qemu.h
index f6f5fe5fbb..560a68090e 100644
--- a/linux-user/qemu.h
+++ b/linux-user/qemu.h
@@ -35,6 +35,7 @@ struct image_info {
         abi_ulong       end_data;
         abi_ulong       start_brk;
         abi_ulong       brk;
+        abi_ulong       reserve_brk;
         abi_ulong       start_mmap;
         abi_ulong       start_stack;
         abi_ulong       stack_limit;
diff --git a/linux-user/elfload.c b/linux-user/elfload.c
index 07b16cc0f4..2edb5d5b31 100644
--- a/linux-user/elfload.c
+++ b/linux-user/elfload.c
@@ -10,6 +10,7 @@
 #include "qemu/path.h"
 #include "qemu/queue.h"
 #include "qemu/guest-random.h"
+#include "qemu/units.h"
 
 #ifdef _ARCH_PPC64
 #undef ARCH_DLINFO
@@ -2364,24 +2365,51 @@ static void load_elf_image(const char *image_name, int image_fd,
         }
     }
 
-    load_addr = loaddr;
-    if (ehdr->e_type == ET_DYN) {
-        /* The image indicates that it can be loaded anywhere.  Find a
-           location that can hold the memory space required.  If the
-           image is pre-linked, LOADDR will be non-zero.  Since we do
-           not supply MAP_FIXED here we'll use that address if and
-           only if it remains available.  */
-        load_addr = target_mmap(loaddr, hiaddr - loaddr, PROT_NONE,
-                                MAP_PRIVATE | MAP_ANON | MAP_NORESERVE,
-                                -1, 0);
-        if (load_addr == -1) {
-            goto exit_perror;
+    if (pinterp_name != NULL) {
+        /*
+         * This is the main executable.
+         *
+         * Reserve extra space for brk.
+         * We hold on to this space while placing the interpreter
+         * and the stack, lest they be placed immediately after
+         * the data segment and block allocation from the brk.
+         *
+         * 16MB is chosen as "large enough" without being so large
+         * as to allow the result to not fit with a 32-bit guest on
+         * a 32-bit host.
+         */
+        info->reserve_brk = 16 * MiB;
+        hiaddr += info->reserve_brk;
+
+        if (ehdr->e_type == ET_EXEC) {
+            /*
+             * Make sure that the low address does not conflict with
+             * MMAP_MIN_ADDR or the QEMU application itself.
+             */
+            probe_guest_base(image_name, loaddr, hiaddr);
         }
-    } else if (pinterp_name != NULL) {
-        /* This is the main executable.  Make sure that the low
-           address does not conflict with MMAP_MIN_ADDR or the
-           QEMU application itself.  */
-        probe_guest_base(image_name, loaddr, hiaddr);
+    }
+
+    /*
+     * Reserve address space for all of this.
+     *
+     * In the case of ET_EXEC, we supply MAP_FIXED so that we get
+     * exactly the address range that is required.
+     *
+     * Otherwise this is ET_DYN, and we are searching for a location
+     * that can hold the memory space required.  If the image is
+     * pre-linked, LOADDR will be non-zero, and the kernel should
+     * honor that address if it happens to be free.
+     *
+     * In both cases, we will overwrite pages in this range with mappings
+     * from the executable.
+     */
+    load_addr = target_mmap(loaddr, hiaddr - loaddr, PROT_NONE,
+                            MAP_PRIVATE | MAP_ANON | MAP_NORESERVE |
+                            (ehdr->e_type == ET_EXEC ? MAP_FIXED : 0),
+                            -1, 0);
+    if (load_addr == -1) {
+        goto exit_perror;
     }
     load_bias = load_addr - loaddr;
 
@@ -2860,6 +2888,17 @@ int load_elf_binary(struct linux_binprm *bprm, struct image_info *info)
     bprm->core_dump = &elf_core_dump;
 #endif
 
+    /*
+     * If we reserved extra space for brk, release it now.
+     * The implementation of do_brk in syscalls.c expects to be able
+     * to mmap pages in this space.
+     */
+    if (info->reserve_brk) {
+        abi_ulong start_brk = HOST_PAGE_ALIGN(info->brk);
+        abi_ulong end_brk = HOST_PAGE_ALIGN(info->brk + info->reserve_brk);
+        target_munmap(start_brk, end_brk - start_brk);
+    }
+
     return 0;
 }
 
-- 
2.20.1


Re: [PATCH] linux-user: Reserve space for brk
Posted by Michael Tokarev 9 months ago
Replying to an old(ish) email..

18.01.2020 02:02, Richard Henderson wrote:
> With bad luck, we can wind up with no space at all for brk,
> which will generally cause the guest malloc to fail.
> 
> This bad luck is easier to come by with ET_DYN (PIE) binaries,
> where either the stack or the interpreter (ld.so) gets placed
> immediately after the main executable.
> 
> But there's nothing preventing this same thing from happening
> with ET_EXEC (normal) binaries, during probe_guest_base().
> 
> In both cases, reserve some extra space via mmap and release
> it back to the system after loading the interpreter and
> allocating the stack.
> 
> The choice of 16MB is somewhat arbitrary.  It's enough for libc
> to get going, but without being so large that 32-bit guests or
> 32-bit hosts are in danger of running out of virtual address space.
> It is expected that libc will be able to fall back to mmap arenas
> after the limited brk space is exhausted.
> 
> Launchpad: https://bugs.launchpad.net/qemu/+bug/1749393
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Tested-by: Alex Bennée <alex.bennee@linaro.org>
> Message-Id: <20200117230245.5040-1-richard.henderson@linaro.org>
> Signed-off-by: Laurent Vivier <laurent@vivier.eu>

This is commit 6fd5944980f4ccee728ce34bdaffc117db50b34d now, landed
before qemu v5.0.0.

This commit breaks binaries linked with klibc, a tiny libc and a set
of utils used in debian for initramfs.  It is broken on all architectures,
not just armhf where I tested it, - all qemu-user emulation is broken.

After this change and up to current 8.0.3, it behaves like this:

./arm-linux-user/qemu-arm /usr/lib/klibc/bin/fstype
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
Segmentation fault

I tested it with klibc & klibc-utils 2.0.4-2 packages for armhf
from http://snapshot.debian.org/package/klibc/2.0.4-2/ - this is an
old version which does not use any fancy syscalls like time64, so
it was possible to use it with old qemu versions. The binaries are
attached to this email, klibc.tar.gz - it has 2 files,
  usr/lib/klibc-sKNr1Fw-Rh9G1FYpGCXRnrwmP2A.so
  fstype
the lib should go to /usr/lib/ for the executable to find it, the
executable can be put at any place.

See also https://bugs.debian.org/1040981 for the original debian bug
report about this. It has more interesting information at the end.

What can be done with this one?

Thanks,

/mjt
Re: [PATCH] linux-user: Reserve space for brk
Posted by Michael Tokarev 9 months ago
See also

  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1040981#84

(this and next messages at least)

for some additional background and analysis.

There's also

  https://github.com/hdeller/qemu-hppa/commits/upx-fix

mentioned in subsequent message there.

/mjt
Re: [PATCH] linux-user: Reserve space for brk
Posted by Michael Tokarev 9 months ago
14.07.2023 22:43, Michael Tokarev wrote:
..
> ./arm-linux-user/qemu-arm /usr/lib/klibc/bin/fstype
> qemu: uncaught target signal 11 (Segmentation fault) - core dumped
> Segmentation fault

Note: the segfault happens before qemu tries to run the program,
apparently when it is doing the file sections processing.

/mjt
Re: [PATCH] linux-user: Reserve space for brk
Posted by Warner Losh 4 years, 3 months ago
On Fri, Jan 17, 2020 at 4:03 PM Richard Henderson <
richard.henderson@linaro.org> wrote:

> With bad luck, we can wind up with no space at all for brk,
> which will generally cause the guest malloc to fail.
>
...

> The choice of 16MB is somewhat arbitrary.  It's enough for libc
> to get going, but without being so large that 32-bit guests or
> 32-bit hosts are in danger of running out of virtual address space.
> It is expected that libc will be able to fall back to mmap arenas
> after the limited brk space is exhausted.
>

The biggest abuser of brk() is emacs, so according to one old wag
this should be 80MB. Eighty Megabytes And Continuously Swapping.

Sorry that I don't have anything constructive to add.

Warner
Re: [PATCH] linux-user: Reserve space for brk
Posted by Alex Bennée 4 years, 2 months ago
Warner Losh <imp@bsdimp.com> writes:

> On Fri, Jan 17, 2020 at 4:03 PM Richard Henderson <
> richard.henderson@linaro.org> wrote:
>
>> With bad luck, we can wind up with no space at all for brk,
>> which will generally cause the guest malloc to fail.
>>
> ...
>
>> The choice of 16MB is somewhat arbitrary.  It's enough for libc
>> to get going, but without being so large that 32-bit guests or
>> 32-bit hosts are in danger of running out of virtual address space.
>> It is expected that libc will be able to fall back to mmap arenas
>> after the limited brk space is exhausted.
>>
>
> The biggest abuser of brk() is emacs, so according to one old wag
> this should be 80MB. Eighty Megabytes And Continuously Swapping.

Is this related to the dumper that it uses? I guess the new dumper
system isn't a problem as I was able to run emacs -nw -q on the buster
image without problem.

-- 
Alex Bennée

Re: [PATCH] linux-user: Reserve space for brk
Posted by Alex Bennée 4 years, 2 months ago
Richard Henderson <richard.henderson@linaro.org> writes:

> With bad luck, we can wind up with no space at all for brk,
> which will generally cause the guest malloc to fail.
>
> This bad luck is easier to come by with ET_DYN (PIE) binaries,
> where either the stack or the interpreter (ld.so) gets placed
> immediately after the main executable.
>
> But there's nothing preventing this same thing from happening
> with ET_EXEC (normal) binaries, during probe_guest_base().
>
> In both cases, reserve some extra space via mmap and release
> it back to the system after loading the interpreter and
> allocating the stack.
>
> The choice of 16MB is somewhat arbitrary.  It's enough for libc
> to get going, but without being so large that 32-bit guests or
> 32-bit hosts are in danger of running out of virtual address space.
> It is expected that libc will be able to fall back to mmap arenas
> after the limited brk space is exhausted.
>
> Launchpad: https://bugs.launchpad.net/qemu/+bug/1749393
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

and I can debootstrap buster again ;-)

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>
> Note that the LP comments mention the fix for this in the kernel,
> and about there being a "guaranteed 128MB gap" for x86_64.  As far
> as I can see, this "gap" is part of the unmapped_area() algorithm.
> For qemu, this would correspond to mmap_find_vma(), except that,
> except when we fall back to mmap_find_vma_reserved(), we are not
> 100% in control over the allocation.
>
>
> r~
>
> ---
>  linux-user/qemu.h    |  1 +
>  linux-user/elfload.c | 73 +++++++++++++++++++++++++++++++++-----------
>  2 files changed, 57 insertions(+), 17 deletions(-)
>
> diff --git a/linux-user/qemu.h b/linux-user/qemu.h
> index f6f5fe5fbb..560a68090e 100644
> --- a/linux-user/qemu.h
> +++ b/linux-user/qemu.h
> @@ -35,6 +35,7 @@ struct image_info {
>          abi_ulong       end_data;
>          abi_ulong       start_brk;
>          abi_ulong       brk;
> +        abi_ulong       reserve_brk;
>          abi_ulong       start_mmap;
>          abi_ulong       start_stack;
>          abi_ulong       stack_limit;
> diff --git a/linux-user/elfload.c b/linux-user/elfload.c
> index 07b16cc0f4..2edb5d5b31 100644
> --- a/linux-user/elfload.c
> +++ b/linux-user/elfload.c
> @@ -10,6 +10,7 @@
>  #include "qemu/path.h"
>  #include "qemu/queue.h"
>  #include "qemu/guest-random.h"
> +#include "qemu/units.h"
>  
>  #ifdef _ARCH_PPC64
>  #undef ARCH_DLINFO
> @@ -2364,24 +2365,51 @@ static void load_elf_image(const char *image_name, int image_fd,
>          }
>      }
>  
> -    load_addr = loaddr;
> -    if (ehdr->e_type == ET_DYN) {
> -        /* The image indicates that it can be loaded anywhere.  Find a
> -           location that can hold the memory space required.  If the
> -           image is pre-linked, LOADDR will be non-zero.  Since we do
> -           not supply MAP_FIXED here we'll use that address if and
> -           only if it remains available.  */
> -        load_addr = target_mmap(loaddr, hiaddr - loaddr, PROT_NONE,
> -                                MAP_PRIVATE | MAP_ANON | MAP_NORESERVE,
> -                                -1, 0);
> -        if (load_addr == -1) {
> -            goto exit_perror;
> +    if (pinterp_name != NULL) {
> +        /*
> +         * This is the main executable.
> +         *
> +         * Reserve extra space for brk.
> +         * We hold on to this space while placing the interpreter
> +         * and the stack, lest they be placed immediately after
> +         * the data segment and block allocation from the brk.
> +         *
> +         * 16MB is chosen as "large enough" without being so large
> +         * as to allow the result to not fit with a 32-bit guest on
> +         * a 32-bit host.
> +         */
> +        info->reserve_brk = 16 * MiB;
> +        hiaddr += info->reserve_brk;
> +
> +        if (ehdr->e_type == ET_EXEC) {
> +            /*
> +             * Make sure that the low address does not conflict with
> +             * MMAP_MIN_ADDR or the QEMU application itself.
> +             */
> +            probe_guest_base(image_name, loaddr, hiaddr);
>          }
> -    } else if (pinterp_name != NULL) {
> -        /* This is the main executable.  Make sure that the low
> -           address does not conflict with MMAP_MIN_ADDR or the
> -           QEMU application itself.  */
> -        probe_guest_base(image_name, loaddr, hiaddr);
> +    }
> +
> +    /*
> +     * Reserve address space for all of this.
> +     *
> +     * In the case of ET_EXEC, we supply MAP_FIXED so that we get
> +     * exactly the address range that is required.
> +     *
> +     * Otherwise this is ET_DYN, and we are searching for a location
> +     * that can hold the memory space required.  If the image is
> +     * pre-linked, LOADDR will be non-zero, and the kernel should
> +     * honor that address if it happens to be free.
> +     *
> +     * In both cases, we will overwrite pages in this range with mappings
> +     * from the executable.
> +     */
> +    load_addr = target_mmap(loaddr, hiaddr - loaddr, PROT_NONE,
> +                            MAP_PRIVATE | MAP_ANON | MAP_NORESERVE |
> +                            (ehdr->e_type == ET_EXEC ? MAP_FIXED : 0),
> +                            -1, 0);
> +    if (load_addr == -1) {
> +        goto exit_perror;
>      }
>      load_bias = load_addr - loaddr;
>  
> @@ -2860,6 +2888,17 @@ int load_elf_binary(struct linux_binprm *bprm, struct image_info *info)
>      bprm->core_dump = &elf_core_dump;
>  #endif
>  
> +    /*
> +     * If we reserved extra space for brk, release it now.
> +     * The implementation of do_brk in syscalls.c expects to be able
> +     * to mmap pages in this space.
> +     */
> +    if (info->reserve_brk) {
> +        abi_ulong start_brk = HOST_PAGE_ALIGN(info->brk);
> +        abi_ulong end_brk = HOST_PAGE_ALIGN(info->brk + info->reserve_brk);
> +        target_munmap(start_brk, end_brk - start_brk);
> +    }
> +
>      return 0;
>  }


-- 
Alex Bennée

Re: [PATCH] linux-user: Reserve space for brk
Posted by Laurent Vivier 4 years, 2 months ago
Le 18/01/2020 à 00:02, Richard Henderson a écrit :
> With bad luck, we can wind up with no space at all for brk,
> which will generally cause the guest malloc to fail.
> 
> This bad luck is easier to come by with ET_DYN (PIE) binaries,
> where either the stack or the interpreter (ld.so) gets placed
> immediately after the main executable.
> 
> But there's nothing preventing this same thing from happening
> with ET_EXEC (normal) binaries, during probe_guest_base().
> 
> In both cases, reserve some extra space via mmap and release
> it back to the system after loading the interpreter and
> allocating the stack.
> 
> The choice of 16MB is somewhat arbitrary.  It's enough for libc
> to get going, but without being so large that 32-bit guests or
> 32-bit hosts are in danger of running out of virtual address space.
> It is expected that libc will be able to fall back to mmap arenas
> after the limited brk space is exhausted.
> 
> Launchpad: https://bugs.launchpad.net/qemu/+bug/1749393
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
> 
> Note that the LP comments mention the fix for this in the kernel,
> and about there being a "guaranteed 128MB gap" for x86_64.  As far
> as I can see, this "gap" is part of the unmapped_area() algorithm.
> For qemu, this would correspond to mmap_find_vma(), except that,
> except when we fall back to mmap_find_vma_reserved(), we are not
> 100% in control over the allocation.
> 
> 
> r~
> 
> ---
>  linux-user/qemu.h    |  1 +
>  linux-user/elfload.c | 73 +++++++++++++++++++++++++++++++++-----------
>  2 files changed, 57 insertions(+), 17 deletions(-)
> 
> diff --git a/linux-user/qemu.h b/linux-user/qemu.h
> index f6f5fe5fbb..560a68090e 100644
> --- a/linux-user/qemu.h
> +++ b/linux-user/qemu.h
> @@ -35,6 +35,7 @@ struct image_info {
>          abi_ulong       end_data;
>          abi_ulong       start_brk;
>          abi_ulong       brk;
> +        abi_ulong       reserve_brk;
>          abi_ulong       start_mmap;
>          abi_ulong       start_stack;
>          abi_ulong       stack_limit;
> diff --git a/linux-user/elfload.c b/linux-user/elfload.c
> index 07b16cc0f4..2edb5d5b31 100644
> --- a/linux-user/elfload.c
> +++ b/linux-user/elfload.c
> @@ -10,6 +10,7 @@
>  #include "qemu/path.h"
>  #include "qemu/queue.h"
>  #include "qemu/guest-random.h"
> +#include "qemu/units.h"
>  
>  #ifdef _ARCH_PPC64
>  #undef ARCH_DLINFO
> @@ -2364,24 +2365,51 @@ static void load_elf_image(const char *image_name, int image_fd,
>          }
>      }
>  
> -    load_addr = loaddr;
> -    if (ehdr->e_type == ET_DYN) {
> -        /* The image indicates that it can be loaded anywhere.  Find a
> -           location that can hold the memory space required.  If the
> -           image is pre-linked, LOADDR will be non-zero.  Since we do
> -           not supply MAP_FIXED here we'll use that address if and
> -           only if it remains available.  */
> -        load_addr = target_mmap(loaddr, hiaddr - loaddr, PROT_NONE,
> -                                MAP_PRIVATE | MAP_ANON | MAP_NORESERVE,
> -                                -1, 0);
> -        if (load_addr == -1) {
> -            goto exit_perror;
> +    if (pinterp_name != NULL) {
> +        /*
> +         * This is the main executable.
> +         *
> +         * Reserve extra space for brk.
> +         * We hold on to this space while placing the interpreter
> +         * and the stack, lest they be placed immediately after
> +         * the data segment and block allocation from the brk.
> +         *
> +         * 16MB is chosen as "large enough" without being so large
> +         * as to allow the result to not fit with a 32-bit guest on
> +         * a 32-bit host.
> +         */
> +        info->reserve_brk = 16 * MiB;
> +        hiaddr += info->reserve_brk;
> +
> +        if (ehdr->e_type == ET_EXEC) {
> +            /*
> +             * Make sure that the low address does not conflict with
> +             * MMAP_MIN_ADDR or the QEMU application itself.
> +             */
> +            probe_guest_base(image_name, loaddr, hiaddr);
>          }
> -    } else if (pinterp_name != NULL) {
> -        /* This is the main executable.  Make sure that the low
> -           address does not conflict with MMAP_MIN_ADDR or the
> -           QEMU application itself.  */
> -        probe_guest_base(image_name, loaddr, hiaddr);
> +    }
> +
> +    /*
> +     * Reserve address space for all of this.
> +     *
> +     * In the case of ET_EXEC, we supply MAP_FIXED so that we get
> +     * exactly the address range that is required.
> +     *
> +     * Otherwise this is ET_DYN, and we are searching for a location
> +     * that can hold the memory space required.  If the image is
> +     * pre-linked, LOADDR will be non-zero, and the kernel should
> +     * honor that address if it happens to be free.
> +     *
> +     * In both cases, we will overwrite pages in this range with mappings
> +     * from the executable.
> +     */
> +    load_addr = target_mmap(loaddr, hiaddr - loaddr, PROT_NONE,
> +                            MAP_PRIVATE | MAP_ANON | MAP_NORESERVE |
> +                            (ehdr->e_type == ET_EXEC ? MAP_FIXED : 0),
> +                            -1, 0);
> +    if (load_addr == -1) {
> +        goto exit_perror;
>      }
>      load_bias = load_addr - loaddr;
>  
> @@ -2860,6 +2888,17 @@ int load_elf_binary(struct linux_binprm *bprm, struct image_info *info)
>      bprm->core_dump = &elf_core_dump;
>  #endif
>  
> +    /*
> +     * If we reserved extra space for brk, release it now.
> +     * The implementation of do_brk in syscalls.c expects to be able
> +     * to mmap pages in this space.
> +     */
> +    if (info->reserve_brk) {
> +        abi_ulong start_brk = HOST_PAGE_ALIGN(info->brk);
> +        abi_ulong end_brk = HOST_PAGE_ALIGN(info->brk + info->reserve_brk);
> +        target_munmap(start_brk, end_brk - start_brk);
> +    }
> +
>      return 0;
>  }
>  
> 

Applied to my linux-user branch.

Thanks,
Laurent