[PATCH] util/mmap: optimize qemu_ram_mmap() alignment

Posted by Steve Sistare 1 year ago
Guest RAM created with memory-backend-memfd is aligned to a
QEMU_VMALLOC_ALIGN=2M boundary, and memory-backend-memfd does not support
the "align" parameter to change the default.  This is sub-optimal on
aarch64 kernels with a 64 KB page size and 512 MB huge page size, as the
range will not be backed by huge pages.  Moreover, any shared allocation
using qemu_ram_mmap() will be sub-optimal on such a system if the align
parameter is less than 512 MB.

The kernel is smart enough to return a hugely aligned pointer for MAP_SHARED
mappings when /sys/kernel/mm/transparent_hugepage/shmem_enabled allows it.
However, the QEMU function qemu_ram_mmap() calls mmap() twice to perform
its own alignment:

    guardptr = mmap(0, total, PROT_NONE, flags, ...
    flags |= shared ? MAP_SHARED : MAP_PRIVATE;
    ptr = mmap(guardptr + offset, size, prot, flags | map_sync_flags, ...

On the first call, flags has MAP_PRIVATE, hence the kernel does not apply
its shared memory policy, and returns a non-huge-aligned guardptr.

To fix, for shared mappings, pass MAP_SHARED to both mmap calls.
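
For illustration only (not part of the patch), here is a minimal standalone
program that compares the address the kernel returns for a private vs. a
shared anonymous PROT_NONE reservation.  On an affected aarch64 host with
64 KB pages and shmem THP enabled, the shared reservation is expected to
come back 512 MB aligned:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Reserve an anonymous PROT_NONE range with the given sharing flag. */
    static void *reserve(size_t size, int share_flag)
    {
        return mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | share_flag, -1, 0);
    }

    static int aligned_512m(void *p)
    {
        return p != MAP_FAILED && ((uintptr_t)p & ((512UL << 20) - 1)) == 0;
    }

    int main(void)
    {
        size_t size = 1UL << 30;   /* 1 GB, larger than one 512 MB huge page */
        void *priv = reserve(size, MAP_PRIVATE);
        void *shrd = reserve(size, MAP_SHARED);

        printf("MAP_PRIVATE reserve: %p  512M-aligned: %d\n",
               priv, aligned_512m(priv));
        printf("MAP_SHARED  reserve: %p  512M-aligned: %d\n",
               shrd, aligned_512m(shrd));
        return 0;
    }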

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
---
 util/mmap-alloc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 5ed7d29..37a0d1e 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -121,7 +121,7 @@ static bool map_noreserve_effective(int fd, uint32_t qemu_map_flags)
  * Reserve a new memory region of the requested size to be used for mapping
  * from the given fd (if any).
  */
-static void *mmap_reserve(size_t size, int fd)
+static void *mmap_reserve(size_t size, int fd, int final_flags)
 {
     int flags = MAP_PRIVATE;
 
@@ -144,6 +144,7 @@ static void *mmap_reserve(size_t size, int fd)
 #else
     fd = -1;
     flags |= MAP_ANONYMOUS;
+    flags |= final_flags & MAP_SHARED;
 #endif
 
     return mmap(0, size, PROT_NONE, flags, fd, 0);
@@ -232,7 +233,7 @@ void *qemu_ram_mmap(int fd,
      */
     total = size + align;
 
-    guardptr = mmap_reserve(total, fd);
+    guardptr = mmap_reserve(total, fd, qemu_map_flags);
     if (guardptr == MAP_FAILED) {
         return MAP_FAILED;
     }
-- 
1.8.3.1
Re: [PATCH] util/mmap: optimize qemu_ram_mmap() alignment
Posted by David Hildenbrand 1 year ago
On 10.04.23 17:46, Steve Sistare wrote:
> Guest RAM created with memory-backend-memfd is aligned to a
> QEMU_VMALLOC_ALIGN=2M boundary, and memory-backend-memfd does not support
> the "align" parameter to change the default.  This is sub-optimal on
> aarch64 kernels with a 64 KB page size and 512 MB huge page size, as the
> range will not be backed by huge pages.  Moreover, any shared allocation
> using qemu_ram_mmap() will be sub-optimal on such a system if the align
> parameter is less than 512 MB.
> 
> The kernel is smart enough to return a hugely aligned pointer for MAP_SHARED
> mappings when /sys/kernel/mm/transparent_hugepage/shmem_enabled allows it.
> However, the qemu function qemu_ram_mmap() mmap's twice to perform its own
> alignment:
> 
>      guardptr = mmap(0, total, PROT_NONE, flags, ...
>      flags |= shared ? MAP_SHARED : MAP_PRIVATE;
>      ptr = mmap(guardptr + offset, size, prot, flags | map_sync_flags, ...
> 
> On the first call, flags has MAP_PRIVATE, hence the kernel does not apply
> its shared memory policy, and returns a non-huge-aligned guardptr.
> 
> To fix, for shared mappings, pass MAP_SHARED to both mmap calls.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   util/mmap-alloc.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index 5ed7d29..37a0d1e 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -121,7 +121,7 @@ static bool map_noreserve_effective(int fd, uint32_t qemu_map_flags)
>    * Reserve a new memory region of the requested size to be used for mapping
>    * from the given fd (if any).
>    */
> -static void *mmap_reserve(size_t size, int fd)
> +static void *mmap_reserve(size_t size, int fd, int final_flags)
>   {
>       int flags = MAP_PRIVATE;
>   
> @@ -144,6 +144,7 @@ static void *mmap_reserve(size_t size, int fd)
>   #else
>       fd = -1;
>       flags |= MAP_ANONYMOUS;
> +    flags |= final_flags & MAP_SHARED;

Setting both, MAP_PRIVATE and MAP_SHARED sure sounds very wrong.

The traditional way to handle that is via QEMU_VMALLOC_ALIGN.

I think you'd actually want to turn QEMU_VMALLOC_ALIGN into a function 
that is able to special-case like 
hw/virtio/virtio-mem.c:virtio_mem_default_thp_size() does for 
"qemu_real_host_page_size() == 64 * KiB".

If you set QEMU_VMALLOC_ALIGN properly, you'll also have any DIMMs get 
that alignment in guest physical address space.
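
A rough sketch of what such a helper might look like (illustrative only;
the name and the exact per-architecture defaults are placeholders):

    #include "qemu/osdep.h"
    #include "qemu/units.h"

    /* Hypothetical replacement for the QEMU_VMALLOC_ALIGN constant. */
    static size_t qemu_vmalloc_align(void)
    {
    #if defined(__linux__) && defined(__aarch64__)
        /* 64 KiB base pages imply a 512 MiB PMD-sized THP on arm64. */
        return qemu_real_host_page_size() == 64 * KiB ? 512 * MiB : 2 * MiB;
    #elif defined(__linux__) && defined(__x86_64__)
        return 2 * MiB;
    #else
        return qemu_real_host_page_size();
    #endif
    }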

-- 
Thanks,

David / dhildenb
Re: [PATCH] util/mmap: optimize qemu_ram_mmap() alignment
Posted by Steven Sistare 1 year ago
On 4/11/2023 3:57 AM, David Hildenbrand wrote:
> On 10.04.23 17:46, Steve Sistare wrote:
>> Guest RAM created with memory-backend-memfd is aligned to a
>> QEMU_VMALLOC_ALIGN=2M boundary, and memory-backend-memfd does not support
>> the "align" parameter to change the default.  This is sub-optimal on
>> aarch64 kernels with a 64 KB page size and 512 MB huge page size, as the
>> range will not be backed by huge pages.  Moreover, any shared allocation
>> using qemu_ram_mmap() will be sub-optimal on such a system if the align
>> parameter is less than 512 MB.
>>
>> The kernel is smart enough to return a hugely aligned pointer for MAP_SHARED
>> mappings when /sys/kernel/mm/transparent_hugepage/shmem_enabled allows it.
>> However, the qemu function qemu_ram_mmap() mmap's twice to perform its own
>> alignment:
>>
>>      guardptr = mmap(0, total, PROT_NONE, flags, ...
>>      flags |= shared ? MAP_SHARED : MAP_PRIVATE;
>>      ptr = mmap(guardptr + offset, size, prot, flags | map_sync_flags, ...
>>
>> On the first call, flags has MAP_PRIVATE, hence the kernel does not apply
>> its shared memory policy, and returns a non-huge-aligned guardptr.
>>
>> To fix, for shared mappings, pass MAP_SHARED to both mmap calls.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   util/mmap-alloc.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
>> index 5ed7d29..37a0d1e 100644
>> --- a/util/mmap-alloc.c
>> +++ b/util/mmap-alloc.c
>> @@ -121,7 +121,7 @@ static bool map_noreserve_effective(int fd, uint32_t qemu_map_flags)
>>    * Reserve a new memory region of the requested size to be used for mapping
>>    * from the given fd (if any).
>>    */
>> -static void *mmap_reserve(size_t size, int fd)
>> +static void *mmap_reserve(size_t size, int fd, int final_flags)
>>   {
>>       int flags = MAP_PRIVATE;
>>
>> @@ -144,6 +144,7 @@ static void *mmap_reserve(size_t size, int fd)
>>   #else
>>       fd = -1;
>>       flags |= MAP_ANONYMOUS;
>> +    flags |= final_flags & MAP_SHARED;
> 
> Setting both, MAP_PRIVATE and MAP_SHARED sure sounds very wrong.

Yes, thanks.  I introduced that mistake when I ported the fix from an earlier qemu that did not
have mmap_reserve.  Should be:

    fd = -1;
    flags = MAP_ANONYMOUS;
    flags |= final_flags & (MAP_SHARED | MAP_PRIVATE);

> The traditional way to handle that is via QEMU_VMALLOC_ALIGN.
> 
> I think you'd actually want to turn QEMU_VMALLOC_ALIGN into a function that is able to special-case like hw/virtio/virtio-mem.c:virtio_mem_default_thp_size() does for "qemu_real_host_page_size() == 64 * KiB".
> 
> If you set QEMU_VMALLOC_ALIGN properly, you'll also have any DIMMs get that alignment in guest physical address space.

If we increase QEMU_VMALLOC_ALIGN, then all allocations will be 512MB aligned.  If we make many small
allocations, we could conceivably run out of VA space.  Further, the processor may use low order 
address bits in internal hashes, and now offset X in every allocated range will have the same value for
the low 29 bits, possibly causing more collisions and reducing performance.  Further, the huge alignment
is not even needed if huge pages for shmem have been disabled via sysfs.

We could avoid that by adding logic to also consider allocation size when choosing alignment, and
checking the sysfs tunables.  Or, we can just let the kernel do so by telling it the truth about the
memory flavor when calling mmap.
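
A hypothetical sketch of that alternative, just to make the trade-off
concrete (not proposed code; thp_size and shmem_thp_enabled would have to
come from sysfs):

    #include "qemu/osdep.h"

    /* Only request huge alignment when it can actually pay off. */
    static size_t pick_align(size_t size, size_t thp_size,
                             bool shmem_thp_enabled)
    {
        if (shmem_thp_enabled && size >= thp_size) {
            return thp_size;                /* e.g. 512 MB on arm64/64K */
        }
        return qemu_real_host_page_size();  /* small or THP-disabled case */
    }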

- Steve

Re: [PATCH] util/mmap: optimize qemu_ram_mmap() alignment
Posted by David Hildenbrand 1 year ago
On 11.04.23 16:39, Steven Sistare wrote:
> On 4/11/2023 3:57 AM, David Hildenbrand wrote:
>> On 10.04.23 17:46, Steve Sistare wrote:
>>> Guest RAM created with memory-backend-memfd is aligned to a
>>> QEMU_VMALLOC_ALIGN=2M boundary, and memory-backend-memfd does not support
>>> the "align" parameter to change the default.  This is sub-optimal on
>>> aarch64 kernels with a 64 KB page size and 512 MB huge page size, as the
>>> range will not be backed by huge pages.  Moreover, any shared allocation
>>> using qemu_ram_mmap() will be sub-optimal on such a system if the align
>>> parameter is less than 512 MB.
>>>
>>> The kernel is smart enough to return a hugely aligned pointer for MAP_SHARED
>>> mappings when /sys/kernel/mm/transparent_hugepage/shmem_enabled allows it.
>>> However, the qemu function qemu_ram_mmap() mmap's twice to perform its own
>>> alignment:
>>>
>>>       guardptr = mmap(0, total, PROT_NONE, flags, ...
>>>       flags |= shared ? MAP_SHARED : MAP_PRIVATE;
>>>       ptr = mmap(guardptr + offset, size, prot, flags | map_sync_flags, ...
>>>
>>> On the first call, flags has MAP_PRIVATE, hence the kernel does not apply
>>> its shared memory policy, and returns a non-huge-aligned guardptr.
>>>
>>> To fix, for shared mappings, pass MAP_SHARED to both mmap calls.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>>    util/mmap-alloc.c | 5 +++--
>>>    1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
>>> index 5ed7d29..37a0d1e 100644
>>> --- a/util/mmap-alloc.c
>>> +++ b/util/mmap-alloc.c
>>> @@ -121,7 +121,7 @@ static bool map_noreserve_effective(int fd, uint32_t qemu_map_flags)
>>>     * Reserve a new memory region of the requested size to be used for mapping
>>>     * from the given fd (if any).
>>>     */
>>> -static void *mmap_reserve(size_t size, int fd)
>>> +static void *mmap_reserve(size_t size, int fd, int final_flags)
>>>    {
>>>        int flags = MAP_PRIVATE;
>>>
>>> @@ -144,6 +144,7 @@ static void *mmap_reserve(size_t size, int fd)
>>>    #else
>>>        fd = -1;
>>>        flags |= MAP_ANONYMOUS;
>>> +    flags |= final_flags & MAP_SHARED;
>>
>> Setting both, MAP_PRIVATE and MAP_SHARED sure sounds very wrong.
> 
> Yes, thanks.  I introduced that mistake when I ported the fix from an earlier qemu that did not
> have mmap_reserve.  Should be:
> 
>      fd = -1;
>      flags = MAP_ANONYMOUS;
>      flags |= final_flags & (MAP_SHARED | MAP_PRIVATE);
> 

Better, but I still don't like bringing in some Linux kernel MAP_SHARED
logic that happens to do better in some cases right now ... because this
is supposed to work on POSIX and ideally optimize on various
systems+configurations.


>> The traditional way to handle that is via QEMU_VMALLOC_ALIGN.
>>
>> I think you'd actually want to turn QEMU_VMALLOC_ALIGN into a function that is able to special-case like hw/virtio/virtio-mem.c:virtio_mem_default_thp_size() does for "qemu_real_host_page_size() == 64 * KiB".
>>
>> If you set QEMU_VMALLOC_ALIGN properly, you'll also have any DIMMs get that alignment in guest physical address space.
> 
> If we increase QEMU_VMALLOC_ALIGN, then all allocations will be 512MB aligned.  If we make many small
> allocations, we could conceivably run out of VA space.  Further, the processor may use low order

Running out of VA space? I'm not so sure. We are talking about the 
qemu_ram_mmap() function here ... (well, and file_ram_alloc() which only 
uses QEMU_VMALLOC_ALIGN on s390x).

This is all about guest RAM, although the name suggests otherwise.


> address bits in internal hashes, and now offset X in every allocated range will have the same value for
> the low 29 bits, possibly causing more collisions and reducing performance.  Further, the huge alignment
> is not even needed if huge pages for shmem have been disabled in sysconfig.

I'm not convinced this is worth optimizing, or even special-casing just 
for arm64 when we're only talking about mapping guest RAM.

Besides, what about ordinary anon THP? See below.

> 
> We could avoid that by adding logic to also consider allocation size when choosing alignment, and
> checking sysconfig tunables.  Or, we can just let the kernel do so by telling it the truth about
> memory flavor when calling mmap.

It's probably best to not trust the kernel to do the right alignment 
thing because we learned that it's not easy to get such optimizations 
into the kernel:

https://lkml.kernel.org/r/20220809142457.4751229f@imladris.surriel.com

got reverted again, which would have done the right thing for anon THP.


I'd suggest that we either hard-code it for arm64 as well, by adjusting 
QEMU_VMALLOC_ALIGN (if we don't care about giving it a better name and 
making it a function).

#elif defined(__linux__) && defined(__aarch64__)
# define QEMU_VMALLOC_ALIGN \
    ((qemu_real_host_page_size() == 64 * KiB) ? (512 * MiB) : (2 * MiB))
#elif ...

Alternatively, we could rewrite into a proper function that caches the 
result and tries looking up the actual host THP size using 
"/sys/kernel/mm/transparent_hugepage/hpage_pmd_size" under Linux before 
falling back to the known defaults.
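
A rough sketch of that variant, loosely modeled on virtio_mem_thp_size()
in hw/virtio/virtio-mem.c (illustrative; the helper name and the fallback
values are placeholders):

    #include "qemu/osdep.h"
    #include "qemu/units.h"

    static size_t qemu_host_thp_size(void)
    {
        static size_t thp_size;
        gchar *content = NULL;

        if (thp_size) {
            return thp_size;    /* cached from a previous call */
        }

        /* Ask the kernel for the PMD THP size. */
        if (g_file_get_contents("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size",
                                &content, NULL, NULL)) {
            thp_size = strtoul(content, NULL, 0);
            g_free(content);
        }

        if (!thp_size) {
            /* Fall back to the known per-architecture defaults. */
    #if defined(__aarch64__)
            thp_size = qemu_real_host_page_size() == 64 * KiB
                       ? 512 * MiB : 2 * MiB;
    #else
            thp_size = 2 * MiB;
    #endif
        }
        return thp_size;
    }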

Sure, we could optimize in the caller (allocation size smaller than 
alignment?), but that requires good justification.

-- 
Thanks,

David / dhildenb