Simple reproducer:
qemu.git $ ./build/qemu-system-x86_64 \
    -m size=8389632k,slots=16,maxmem=25600000k \
    -object '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"/hugepages2M/","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \
    -numa node,nodeid=0,cpus=0,memdev=ram-node0
With current master I get:
qemu-system-x86_64: cannot bind memory to host NUMA nodes: Invalid argument
The problem is that the memory size (8193 MiB) is not an integer
multiple of the underlying page size (2 MiB), which trips a check
inside mbind(), since we can't really set a policy on just a
fraction of a page. As qemu_madvise() has the same expectation,
round the size up to the underlying page size.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
backends/hostmem.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/backends/hostmem.c b/backends/hostmem.c
index 1a6fd1c714..9b727699f6 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -179,6 +179,8 @@ static void host_memory_backend_set_merge(Object *obj, bool value, Error **errp)
void *ptr = memory_region_get_ram_ptr(&backend->mr);
uint64_t sz = memory_region_size(&backend->mr);
+ sz = ROUND_UP(sz, qemu_ram_pagesize(backend->mr.ram_block));
+
if (qemu_madvise(ptr, sz,
value ? QEMU_MADV_MERGEABLE : QEMU_MADV_UNMERGEABLE)) {
warn_report("Couldn't change property 'merge' on '%s': %s",
@@ -208,6 +210,8 @@ static void host_memory_backend_set_dump(Object *obj, bool value, Error **errp)
void *ptr = memory_region_get_ram_ptr(&backend->mr);
uint64_t sz = memory_region_size(&backend->mr);
+ sz = ROUND_UP(sz, qemu_ram_pagesize(backend->mr.ram_block));
+
if (qemu_madvise(ptr, sz,
value ? QEMU_MADV_DODUMP : QEMU_MADV_DONTDUMP)) {
warn_report("Couldn't change property 'dump' on '%s': %s",
@@ -344,6 +348,13 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
ptr = memory_region_get_ram_ptr(&backend->mr);
sz = memory_region_size(&backend->mr);
+ /*
+ * Round up size to be an integer multiple of pagesize, because
+ * neither madvise() nor mbind() really likes setting
+ * advice/policy on just a fraction of a page.
+ */
+ sz = ROUND_UP(sz, qemu_ram_pagesize(backend->mr.ram_block));
+
if (backend->merge &&
qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE)) {
warn_report("Couldn't set property 'merge' on '%s': %s",
--
2.44.1
On 5/28/24 09:15, Michal Privoznik wrote:
> +    sz = ROUND_UP(sz, qemu_ram_pagesize(backend->mr.ram_block));

Second argument is evaluated twice. You probably don't want that to be a
function call.

r~
Am 28.05.24 um 18:15 schrieb Michal Privoznik:
> ./build/qemu-system-x86_64 \
>   -m size=8389632k,slots=16,maxmem=25600000k \
>   -object '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"hugepages2M","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \
>   -numa node,nodeid=0,cpus=0,memdev=ram-node0
For DIMMs and friends we now (again) enforce that the size must be aligned to
the page size:
commit 540a1abbf0b243e4cfb4333c5d30a041f7080ba4
Author: David Hildenbrand <david@redhat.com>
Date: Wed Jan 17 14:55:54 2024 +0100
memory-device: reintroduce memory region size check
We used to check that the memory region size is multiples of the overall
requested address alignment for the device memory address.
We removed that check, because there are cases (i.e., hv-balloon) where
devices unconditionally request an address alignment that has a very large
alignment (i.e., 32 GiB), but the actual memory device size might not be
multiples of that alignment.
However, this change:
(a) allows for some practically impossible DIMM sizes, like "1GB+1 byte".
(b) allows for DIMMs that partially cover hugetlb pages, previously
reported in [1].
...
Partial hugetlb pages do not particularly make sense; they waste memory. Do we
expect people to actually have a working setup that we might break when
disallowing such configurations? Or would we actually help them identify that
something non-sensical is happening?
When using memory-backend-memfd, we already do get a proper error:
./build/qemu-system-x86_64 -m 2047m -object
memory-backend-memfd,id=ram-node0,hugetlb=on,size=2047m,reserve=off -numa
node,nodeid=0,cpus=0,memdev=ram-node0 -S
qemu-system-x86_64: failed to resize memfd to 2146435072: Invalid argument
So this only applies to memory-backend-file, where we maybe should fail in a
similar way?
--
Thanks,
David / dhildenb
On 5/28/24 18:47, David Hildenbrand wrote:
> [...]
>
> So this only applies to memory-backend-file, where we maybe should fail
> in a similar way?
>
Yeah, let's fail in that case. But non-aligned length is just one of
many reasons madvise()/mbind() can fail. What about the others? Should
we make them report an error or just a warning?
Michal
On 29.05.24 08:48, Michal Prívozník wrote:
> [...]
>
> Yeah, let's fail in that case. But non-aligned length is just one of
> many reasons madvise()/mbind() can fail. What about the others? Should
> we make them report an error or just a warning?
Regarding madvise(), we should report at least a warning.
In qemu_ram_setup_dump() we print an error if QEMU_MADV_DONTDUMP failed.
But we swallow any errors from memory_try_enable_merging() ... in
general, we likely have to distinguish the "not supported by the OS"
case from the "actually supported but failed" case.
In the second patch, maybe we should really fail if something unexpected
happens, instead of fake-changing the properties.
--
Cheers,
David / dhildenb