From: William Roche <william.roche@oracle.com>
When the VM reboots, a memory reset is performed calling
qemu_ram_remap() on all hwpoisoned pages.
While we take into account the recorded page sizes to repair the
memory locations, a large page also needs to punch a hole in the
backend file to regenerate a usable memory, cleaning the HW
poisoned section. This is mandatory for hugetlbfs case for example.
Signed-off-by: William Roche <william.roche@oracle.com>
---
system/physmem.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/system/physmem.c b/system/physmem.c
index 3757428336..3f6024a92d 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                 prot = PROT_READ;
                 prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
                 if (block->fd >= 0) {
+                    if (length > TARGET_PAGE_SIZE && fallocate(block->fd,
+                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+                        offset + block->fd_offset, length) != 0) {
+                        error_report("Could not recreate the file hole for "
+                                     "addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     length, addr);
+                        exit(1);
+                    }
                     area = mmap(vaddr, length, prot, flags, block->fd,
                                 offset + block->fd_offset);
                 } else {
--
2.43.5
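
For readers unfamiliar with the hole-punching call used in this patch, here is a minimal standalone sketch of what fallocate() with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE does to a file. It is not QEMU code: it assumes Linux with glibc, uses an ordinary temporary file under /tmp rather than hugetlbfs, and the function name punch_hole_demo() is made up for the illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes fallocate() and the FALLOC_FL_* flags */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo helper (not part of QEMU).
 * Returns 0 if the hole was punched, reads back as zeroes and the file
 * size is unchanged; 1 if this filesystem cannot punch holes;
 * -1 on any other error.
 */
int punch_hole_demo(void)
{
    enum { CHUNK = 4096, NCHUNKS = 4 };
    char path[] = "/tmp/punch-hole-XXXXXX";
    char buf[CHUNK];
    struct stat st;
    int ret = -1;
    int fd = mkstemp(path);

    if (fd < 0) {
        return -1;
    }
    unlink(path);                       /* file disappears once closed */

    /* Fill the file with a recognizable non-zero pattern. */
    memset(buf, 0xAA, sizeof(buf));
    for (int i = 0; i < NCHUNKS; i++) {
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            goto out;
        }
    }

    /* Drop the backing storage of the second chunk only. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  CHUNK, CHUNK) != 0) {
        ret = (errno == EOPNOTSUPP || errno == ENOSYS) ? 1 : -1;
        goto out;
    }

    /* KEEP_SIZE: the file length must not have changed. */
    if (fstat(fd, &st) != 0 || st.st_size != (off_t)CHUNK * NCHUNKS) {
        goto out;
    }
    /* The punched range now reads back as zeroes... */
    if (pread(fd, buf, sizeof(buf), CHUNK) != (ssize_t)sizeof(buf)) {
        goto out;
    }
    for (int i = 0; i < CHUNK; i++) {
        if (buf[i] != 0) {
            goto out;
        }
    }
    /* ...while the neighbouring chunk is untouched. */
    if (pread(fd, buf, 1, 0) != 1 || (unsigned char)buf[0] != 0xAA) {
        goto out;
    }
    ret = 0;
out:
    close(fd);
    return ret;
}
```

On hugetlbfs the same call releases the huge page backing the range, so the next access faults in a fresh one; that behaviour is what the patch relies on and is not exercised by this sketch.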
On 22.10.24 23:35, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
>
> When the VM reboots, a memory reset is performed calling
> qemu_ram_remap() on all hwpoisoned pages.
> While we take into account the recorded page sizes to repair the
> memory locations, a large page also needs to punch a hole in the
> backend file to regenerate a usable memory, cleaning the HW
> poisoned section. This is mandatory for hugetlbfs case for example.
>
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>  system/physmem.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/system/physmem.c b/system/physmem.c
> index 3757428336..3f6024a92d 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
>                  prot = PROT_READ;
>                  prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>                  if (block->fd >= 0) {
> +                    if (length > TARGET_PAGE_SIZE && fallocate(block->fd,
> +                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
> +                        offset + block->fd_offset, length) != 0) {
> +                        error_report("Could not recreate the file hole for "
> +                                     "addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                                     length, addr);
> +                        exit(1);
> +                    }
>                      area = mmap(vaddr, length, prot, flags, block->fd,
>                                  offset + block->fd_offset);
>                  } else {

Ah! Just what I commented to patch #3; we should be using
ram_discard_range(). It might be better to avoid the mmap() completely
if ram_discard_range() worked.

And as raised, there is the problem with memory preallocation (where we
should fail if it doesn't work) and ram discards being disabled because
something relies on long-term page pinning ...

--
Cheers,

David / dhildenb
On 10/23/24 09:30, David Hildenbrand wrote:
> On 22.10.24 23:35, William Roche wrote:
>> [...]
>>                      area = mmap(vaddr, length, prot, flags, block->fd,
>>                                  offset + block->fd_offset);
>>                  } else {
>
> Ah! Just what I commented to patch #3; we should be using
> ram_discard_range(). It might be better to avoid the mmap() completely
> if ram_discard_range() worked.

I think you are referring to ram_block_discard_range() here, as
ram_discard_range() seems to relate to VM migrations, maybe not a VM reset.

Remapping the page is needed to get rid of the poison. So if we want to
avoid the mmap(), we have to shrink the memory address space -- which
can be a real problem if we imagine a VM with 1G large pages for
example. qemu_ram_remap() is used to regenerate the lost memory and the
mmap() call looks mandatory on the reset phase.

> And as raised, there is the problem with memory preallocation (where
> we should fail if it doesn't work) and ram discards being disabled
> because something relies on long-term page pinning ...

Yes. Do you suggest that we add a call to qemu_prealloc_mem() for the
remapped area when backend->prealloc is true?

Or, as we are running on POSIX machines for this piece of code (ifndef
_WIN32), maybe we could simply add a MAP_POPULATE flag to the mmap call
done in qemu_ram_remap() in the case where the backend requires a
'prealloc'? Can you confirm if this flag could be used on all systems
running this code?

Unfortunately, I don't know how to get the MEMORY_BACKEND corresponding
to a given memory block. I'm not sure that MEMORY_BACKEND(block->mr) is
a valid way to retrieve the Backend object and its 'prealloc' property
here. Could you please give me a direction here?

I can send a new version using ram_block_discard_range() as you
suggested to replace the direct call to fallocate(), if you think it
would be better.
Please let me know what other enhancement(s) you'd like to see in this
code change.

Thanks in advance,
William.
On 26.10.24 01:27, William Roche wrote:
> On 10/23/24 09:30, David Hildenbrand wrote:
>> [...]
>> Ah! Just what I commented to patch #3; we should be using
>> ram_discard_range(). It might be better to avoid the mmap() completely
>> if ram_discard_range() worked.
>

Hi!

> I think you are referring to ram_block_discard_range() here, as
> ram_discard_range() seems to relate to VM migrations, maybe not a VM reset.

Please take a look at the users of ram_block_discard_range(), including
virtio-balloon to completely zap guest memory, so we will get fresh
memory on next access. It takes care of process-private and file-backed
(shared) memory.

> Remapping the page is needed to get rid of the poison. So if we want to
> avoid the mmap(), we have to shrink the memory address space -- which
> can be a real problem if we imagine a VM with 1G large pages for
> example. qemu_ram_remap() is used to regenerate the lost memory and the
> mmap() call looks mandatory on the reset phase.

Why can't we use ram_block_discard_range() to zap the poisoned page
(unmap from page tables + conditionally drop from the page cache)? Is
there anything important I am missing?

>> And as raised, there is the problem with memory preallocation (where
>> we should fail if it doesn't work) and ram discards being disabled
>> because something relies on long-term page pinning ...
>
> Yes. Do you suggest that we add a call to qemu_prealloc_mem() for the
> remapped area in case of a backend->prealloc being true ?

Yes. Otherwise, with hugetlb, you might run out of hugetlb pages at
runtime and SIGBUS QEMU :(

> Or as we are running on posix machines for this piece of code (ifndef
> _WIN32) maybe we could simply add a MAP_POPULATE flag to the mmap call
> done in qemu_ram_remap() in the case where the backend requires a
> 'prealloc' ? Can you confirm if this flag could be used on all systems
> running this code ?

Please use qemu_prealloc_mem(). MAP_POPULATE has no guarantees, it's
really weird :/ mmap() might succeed even though MAP_POPULATE didn't
work ... and it's problematic with NUMA policies because we essentially
lose (overwrite) them.

And the whole mmap(MAP_FIXED) is an ugly hack. For example, we wouldn't
reset the memory policy we apply in
host_memory_backend_memory_complete() ... that code really needs a
rewrite to do it properly.

Ideally, we'd do something high-level like

    if (ram_block_discard_is_disabled()) {
        /*
         * We cannot safely discard RAM, ... for example we might have
         * to remap all guest RAM into vfio after discarding the
         * problematic pages ... TODO.
         */
        exit(0);
    }

    /* Throw away the problematic (poisoned) page. */
    if (ram_block_discard_range()) {
        /* Conditionally fall back to MAP_FIXED workaround */
        ...
    }

    /* If preallocation was requested, we really must re-preallocate. */
    if (prealloc && qemu_prealloc_mem()) {
        /* Preallocation failed .... */
        exit(0);
    }

As you note the last part is tricky. See below.

> Unfortunately, I don't know how to get the MEMORY_BACKEND corresponding
> to a given memory block. I'm not sure that MEMORY_BACKEND(block->mr) is
> a valid way to retrieve the Backend object and its 'prealloc' property
> here. Could you please give me a direction here ?

We could add a RAM_PREALLOC flag to hint that this memory has "prealloc"
semantics.

I once had an alternative approach: Similar to ram_block_notify_resize()
we would implement ram_block_notify_remap().

That's where the backend could register and re-apply mmap properties
like NUMA policies (in case we have to fall back to MAP_FIXED) and handle
the preallocation.

So one would implement a ram_block_notify_remap() and maybe indicate if
we had to do MAP_FIXED or if we only discarded the page.

I once had a prototype for that, let me dig ...

> I can send a new version using ram_block_discard_range() as you
> suggested to replace the direct call to fallocate(), if you think it
> would be better.
> Please let me know what other enhancement(s) you'd like to see in this
> code change.

Something along the lines above. Please let me know if you see problems
with that approach that I am missing.

--
Cheers,

David / dhildenb
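
The high-level flow proposed in this reply can be exercised as a small standalone program. What follows is only a sketch under assumptions: the struct remap_env callbacks, the enum values and recover_poisoned_page() are hypothetical stand-ins for ram_block_discard_is_disabled(), ram_block_discard_range(), the MAP_FIXED fallback and qemu_prealloc_mem(), and errors are returned to the caller instead of calling exit().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the QEMU facilities named in the thread. */
struct remap_env {
    bool discard_disabled;                               /* discards forbidden? */
    bool prealloc;                                       /* backend wants prealloc */
    int (*discard_range)(size_t offset, size_t length);  /* 0 on success */
    int (*remap_fixed)(size_t offset, size_t length);    /* MAP_FIXED fallback */
    int (*prealloc_mem)(size_t offset, size_t length);   /* 0 on success */
};

enum remap_result {
    REMAP_OK,
    REMAP_ERR_DISCARD_DISABLED,
    REMAP_ERR_REMAP,
    REMAP_ERR_PREALLOC,
};

/* Replace one poisoned page following the order discussed above. */
enum remap_result recover_poisoned_page(const struct remap_env *env,
                                        size_t offset, size_t length)
{
    /* 1. Discarding may be unsafe (e.g. long-term pinned memory). */
    if (env->discard_disabled) {
        return REMAP_ERR_DISCARD_DISABLED;
    }
    /* 2. Throw away the poisoned page; fall back to remapping on failure. */
    if (env->discard_range(offset, length) != 0 &&
        env->remap_fixed(offset, length) != 0) {
        return REMAP_ERR_REMAP;
    }
    /* 3. Honor preallocation even when the discard itself succeeded. */
    if (env->prealloc && env->prealloc_mem(offset, length) != 0) {
        return REMAP_ERR_PREALLOC;
    }
    return REMAP_OK;
}

/* Trivial callbacks used to drive the flow in a test. */
static int stub_ok(size_t offset, size_t length)
{
    (void)offset;
    (void)length;
    return 0;
}

static int stub_fail(size_t offset, size_t length)
{
    (void)offset;
    (void)length;
    return -1;
}
```

Returning an error code keeps the policy decision (abort the VM or continue) with the caller, which is a choice made for this sketch and not something the thread settles.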
On 10/28/24 18:01, David Hildenbrand wrote:
> On 26.10.24 01:27, William Roche wrote:
>> I think you are referring to ram_block_discard_range() here, as
>> ram_discard_range() seems to relate to VM migrations, maybe not a VM
>> reset.
>
> Please take a look at the users of ram_block_discard_range(), including
> virtio-balloon to completely zap guest memory, so we will get fresh
> memory on next access. It takes care of process-private and file-backed
> (shared) memory.

The calls to madvise should take care of releasing the memory for the
mapped area, and it is called for standard page sized memory.

>> Remapping the page is needed to get rid of the poison. So if we want to
>> avoid the mmap(), we have to shrink the memory address space -- which
>> can be a real problem if we imagine a VM with 1G large pages for
>> example. qemu_ram_remap() is used to regenerate the lost memory and the
>> mmap() call looks mandatory on the reset phase.
>
> Why can't we use ram_block_discard_range() to zap the poisoned page
> (unmap from page tables + conditionally drop from the page cache)? Is
> there anything important I am missing?

Or maybe _I'm_ missing something important, but what I understand is that:

    need_madvise = (rb->page_size == qemu_real_host_page_size());

ensures that the madvise call in ram_block_discard_range() is not done
in the case of hugepages.
In this case, we need to call mmap to remap the hugetlbfs large page.

As I said in the previous email, recent kernels start to implement these
calls for hugetlbfs, but I'm not sure that changing the mechanism of
this ram_block_discard_range() function now is appropriate.
Do you agree with that?

> Please use qemu_prealloc_mem(). MAP_POPULATE has no guarantees, it's
> really weird :/ mmap() might succeed even though MAP_POPULATE didn't
> work ... and it's problematic with NUMA policies because we essentially
> lose (overwrite) them.
>
> And the whole mmap(MAP_FIXED) is an ugly hack. For example, we wouldn't
> reset the memory policy we apply in
> host_memory_backend_memory_complete() ... that code really needs a
> rewrite to do it properly.

Maybe I can try to call madvise on hugepages too, only in this VM reset
situation, and deal with the failure scenario of older kernels not
supporting it, leaving the behavior unchanged for every other location
calling this function.

But I'll need to verify the effect of these madvise calls on hugetlbfs
on the latest upstream kernel and some older kernels too.

> Ideally, we'd do something high-level like
> [...]
> I once had an alternative approach: Similar to ram_block_notify_resize()
> we would implement ram_block_notify_remap().
>
> That's where the backend could register and re-apply mmap properties
> like NUMA policies (in case we have to fallback to MAP_FIXED) and handle
> the preallocation.
>
> So one would implement a ram_block_notify_remap() and maybe indicate if
> we had to do MAP_FIXED or if we only discarded the page.
>
> I once had a prototype for that, let me dig ...

That would be great! Thanks.

>> I can send a new version using ram_block_discard_range() as you
>> suggested to replace the direct call to fallocate(), if you think it
>> would be better.
>> Please let me know what other enhancement(s) you'd like to see in this
>> code change.
>
> Something along the lines above. Please let me know if you see problems
> with that approach that I am missing.

Let me check the madvise use on hugetlbfs and if it works as expected,
I'll try to implement a V2 version of the fix proposal integrating a
modified ram_block_discard_range() function.

I'll also remove the page size information from the signal handlers
and only keep it in the kvm_hwpoison_page_add() function.

I'll investigate how to keep track of the 'prealloc' attribute to
optionally use when remapping the hugepages (on older kernels).
And if you find the prototype code you talked about that would
definitely help :)

Thanks a lot,
William.

>> Why can't we use ram_block_discard_range() to zap the poisoned page
>> (unmap from page tables + conditionally drop from the page cache)? Is
>> there anything important I am missing?
>
> Or maybe _I'm_ missing something important, but what I understand is that:
>     need_madvise = (rb->page_size == qemu_real_host_page_size());
>
> ensures that the madvise call in ram_block_discard_range() is not done
> in the case of hugepages.
> In this case, we need to call mmap to remap the hugetlbfs large page.

Right, madvise(DONTNEED) works ever since "90e7e7f5ef3f ("mm: enable
MADV_DONTNEED for hugetlb mappings")".

But as you note, in QEMU we never called madvise(DONTNEED) for hugetlb
as of today. But note that we always have an "fd" with hugetlb, because
we never use mmap(MAP_ANON|MAP_PRIVATE|MAP_HUGETLB) in QEMU.

The weird thing is that if you have a mmap(fd, MAP_PRIVATE) hugetlb
mapping, fallocate(fd, FALLOC_FL_PUNCH_HOLE) will *also* zap any private
pages. So in contrast to "ordinary" memory, the madvise(DONTNEED) is not
required.

(yes, it's very weird)

So the fallocate(fd, FALLOC_FL_PUNCH_HOLE) will zap the hugetlb page and
you will get a fresh one on next fault.

For all the glorious details, see:

https://lore.kernel.org/linux-mm/2ddd0a26-33fd-9cde-3501-f0584bbffefc@redhat.com/

> As I said in the previous email, recent kernels start to implement these
> calls for hugetlbfs, but I'm not sure that changing the mechanism of
> this ram_block_discard_range() function now is appropriate.
> Do you agree with that ?

The key point is that it works for hugetlb without madvise(DONTNEED),
which is weird :)

Which is also why the introducing kernel change added "Do note that
there is no compelling use case for adding this support. This was
discussed in the RFC [1]. However, adding support makes sense as it is
fairly trivial and brings hugetlb functionality more in line with
'normal' memory."

[...]

>> So one would implement a ram_block_notify_remap() and maybe indicate if
>> we had to do MAP_FIXED or if we only discarded the page.
>>
>> I once had a prototype for that, let me dig ...
>
> That would be great ! Thanks.

Found them:

https://gitlab.com/virtio-mem/qemu/-/commit/f528c861897d1086ae84ea1bcd6a0be43e8fea7d
https://gitlab.com/virtio-mem/qemu/-/commit/c5b0328654def8f168497715409d6364096eb63f
https://gitlab.com/virtio-mem/qemu/-/commit/15e9737907835105c132091ad10f9d0c9c68ea64

But note that I didn't realize back then that the mmap(MAP_FIXED) is the
wrong way to do it, and that we actually have to DONTNEED/PUNCH_HOLE to
do it properly. But to get the preallocation performed by the backend,
it should still be valuable.

Note that I wonder if we can get rid of the mmap(MAP_FIXED) handling
completely: likely we only support Linux with MCE recovery, and
ram_block_discard_range() should do what we need under Linux.

That would make it a lot simpler.

>>> I can send a new version using ram_block_discard_range() as you
>>> suggested to replace the direct call to fallocate(), if you think it
>>> would be better.
>>> Please let me know what other enhancement(s) you'd like to see in this
>>> code change.
>>
>> Something along the lines above. Please let me know if you see problems
>> with that approach that I am missing.
>
> Let me check the madvise use on hugetlbfs and if it works as expected,
> I'll try to implement a V2 version of the fix proposal integrating a
> modified ram_block_discard_range() function.

As discussed, it might all be working. If not, we would have to fix
ram_block_discard_range().

> I'll also remove the page size information from the signal handlers
> and only keep it in the kvm_hwpoison_page_add() function.

That's good. Especially because there was talk in the last bi-weekly MM
sync [1] about possibly indicating only the actually failed cachelines
in the future, not necessarily the full page. So relying on that
interface to return the actual pagesize would not be future proof.

That session was in general very interesting and very relevant for your
work; did you by any chance attend it? If not, we should find you the
recordings, because the idea is to be able to configure to
not-unmap-during-mce, and instead only inform the guest OS about the MCE
(forward it). Which avoids any HGM (high-granularity mapping) issues
completely.

Only during reboot of the VM we will have to do exactly what is being
done in this series: zap the whole *page* so our fresh OS will see "all
non-faulty" memory.

[1] https://lkml.kernel.org/r/9242f7cc-6b9d-b807-9079-db0ca81f3c6d@google.com

> I'll investigate how to keep track of the 'prealloc' attribute to
> optionally use when remapping the hugepages (on older kernels).
> And if you find the prototype code you talked about that would
> definitely help :)

Right, the above should help getting that sorted out (but the code is 4
years old, so it won't "just apply").

--
Cheers,

David / dhildenb
From: William Roche <william.roche@oracle.com>

Hi David,

Here is an updated description of the patch set:

---
This set of patches fixes several problems with hardware memory errors
impacting hugetlbfs memory backed VMs. When using hugetlbfs large
pages, any large page location being impacted by an HW memory error
results in poisoning the entire page, suddenly making a large chunk of
the VM memory unusable.

The main problem that currently exists in QEMU is the lack of backend
file repair before resetting the VM memory, leaving the impacted memory
silently unusable even after a VM reboot.

In order to fix this issue, we track the page size of the impacted
memory block with the associated poisoned page location.

Using the size information we also call ram_block_discard_range() to
regenerate the memory on VM reset when running qemu_ram_remap(), so
that poisoned memory backed by a hugetlbfs file is regenerated with a
hole punched in this file. A new page is loaded when the location is
first touched.

In case of a discard failure we fall back to unmap/remap the memory
location and reset the memory settings.

We also have to honor the 'prealloc' attribute even after a successful
discard, so we reapply the memory settings in this case too.

This memory setting is performed by a new remap notification mechanism
calling the host_memory_backend_ram_remapped() function when a region
of a memory block is remapped.

We also issue a message describing the impact of a large page memory
loss. It is reported only once, when the page is poisoned.
---

v1 -> v2:
. I removed the kernel SIGBUS siginfo provided lsb size information
  tracking. Only relying on the RAMBlock page_size instead.
. I adapted the 3 patches you pointed me to, to implement the
  notification mechanism on remap. Thank you for this code!
  I left them as Authored by you.
  But I haven't tested if the policy setting works as expected on VM
  reset, only that the replacement of physical memory works.
. I also removed the old memory setting that was kept in qemu_ram_remap()
  but this small last fix could probably be merged with your last commit.

I also got yesterday the recording of the mm-linux session about the
kernel modification on largepage poisoning, and discussed this topic
with a colleague of mine who attended the meeting.

About the use of -mem-path question you asked me, we communicated the
information about the deprecated aspect of this option and advise all
users to use the following options instead:

  -object memory-backend-file,id=pc.ram,mem-path=/dev/hugepages,prealloc,size=XXX
  -machine memory-backend=pc.ram

We could now add the request to use a share=on attribute too, to avoid
the additional message about dangerous discard situations.

This code is scripts/checkpatch.pl clean.
'make check' runs fine on both x86 and Arm.

David Hildenbrand (3):
  numa: Introduce and use ram_block_notify_remap()
  hostmem: Factor out applying settings
  hostmem: Handle remapping of RAM

William Roche (4):
  accel/kvm: Keep track of the HWPoisonPage page_size
  system/physmem: poisoned memory discard on reboot
  accel/kvm: Report the loss of a large memory page
  system/physmem: Memory settings applied on remap notification

 accel/kvm/kvm-all.c       |  17 +++-
 backends/hostmem.c        | 184 +++++++++++++++++++++++---------------
 hw/core/numa.c            |  11 +++
 include/exec/cpu-common.h |   1 +
 include/exec/ramlist.h    |   3 +
 include/sysemu/hostmem.h  |   1 +
 include/sysemu/kvm_int.h  |   4 +-
 system/physmem.c          |  62 ++++++++-----
 target/arm/kvm.c          |   2 +-
 target/i386/kvm/kvm.c     |   2 +-
 10 files changed, 189 insertions(+), 98 deletions(-)

--
2.43.5
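
The remap notification mechanism described in this cover letter follows the usual notifier-list pattern. As a rough standalone sketch only: the type and function names below (struct remap_notifier, remap_notifier_add(), notify_remap(), struct backend) are invented for the illustration and are not the ram_block_notify_remap() API of the series.

```c
#include <stddef.h>

/* Hypothetical notifier: one callback invoked after a range was remapped. */
struct remap_notifier {
    void (*remapped)(struct remap_notifier *n, void *host,
                     size_t offset, size_t size);
    struct remap_notifier *next;
};

static struct remap_notifier *remap_notifiers;

/* Register a subscriber at the head of the list. */
void remap_notifier_add(struct remap_notifier *n)
{
    n->next = remap_notifiers;
    remap_notifiers = n;
}

/* Tell every subscriber that [offset, offset + size) was remapped. */
void notify_remap(void *host, size_t offset, size_t size)
{
    struct remap_notifier *n;

    for (n = remap_notifiers; n; n = n->next) {
        if (n->remapped) {
            n->remapped(n, host, offset, size);
        }
    }
}

/* Example subscriber: a backend that re-applies its settings on remap. */
struct backend {
    struct remap_notifier notifier;   /* must stay the first member */
    int reapplied;                    /* how often settings were re-applied */
    size_t last_offset;
    size_t last_size;
};

static void backend_remapped(struct remap_notifier *n, void *host,
                             size_t offset, size_t size)
{
    /* Valid because the notifier is the first member of struct backend. */
    struct backend *b = (struct backend *)n;

    (void)host;
    /* A real backend would redo NUMA policy and preallocation here. */
    b->reapplied++;
    b->last_offset = offset;
    b->last_size = size;
}
```

Keeping the policy and preallocation logic in the subscriber means the remap code does not need to know which backend owns a block, which is the question raised earlier in the thread.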
From: William Roche <william.roche@oracle.com>
When a memory page is added to the hwpoison_page_list, include
the page size information. This size is the backend real page
size. To better deal with hugepages, we create a single entry
for the entire page.
Signed-off-by: William Roche <william.roche@oracle.com>
---
accel/kvm/kvm-all.c | 8 +++++++-
include/exec/cpu-common.h | 1 +
system/physmem.c | 13 +++++++++++++
3 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 801cff16a5..6dd06f5edf 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
  */
 typedef struct HWPoisonPage {
     ram_addr_t ram_addr;
+    size_t page_size;
     QLIST_ENTRY(HWPoisonPage) list;
 } HWPoisonPage;
 
@@ -1278,7 +1279,7 @@ static void kvm_unpoison_all(void *param)
 
     QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
         QLIST_REMOVE(page, list);
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr, page->page_size);
         g_free(page);
     }
 }
@@ -1286,6 +1287,10 @@ static void kvm_unpoison_all(void *param)
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
+    size_t sz = qemu_ram_pagesize_from_addr(ram_addr);
+
+    if (sz > TARGET_PAGE_SIZE)
+        ram_addr = ROUND_DOWN(ram_addr, sz);
 
     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
@@ -1294,6 +1299,7 @@ void kvm_hwpoison_page_add(ram_addr_t ram_addr)
     }
     page = g_new(HWPoisonPage, 1);
     page->ram_addr = ram_addr;
+    page->page_size = sz;
     QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
 }
 
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 638dc806a5..8f8f7ad567 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -108,6 +108,7 @@ bool qemu_ram_is_named_file(RAMBlock *rb);
 int qemu_ram_get_fd(RAMBlock *rb);
 
 size_t qemu_ram_pagesize(RAMBlock *block);
+size_t qemu_ram_pagesize_from_addr(ram_addr_t addr);
 size_t qemu_ram_pagesize_largest(void);
 
 /**
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..750604d47d 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1665,6 +1665,19 @@ size_t qemu_ram_pagesize(RAMBlock *rb)
     return rb->page_size;
 }
 
+/* Return backend real page size used for the given ram_addr. */
+size_t qemu_ram_pagesize_from_addr(ram_addr_t addr)
+{
+    RAMBlock *rb;
+
+    RCU_READ_LOCK_GUARD();
+    rb = qemu_get_ram_block(addr);
+    if (!rb) {
+        return TARGET_PAGE_SIZE;
+    }
+    return qemu_ram_pagesize(rb);
+}
+
 /* Returns the largest size of page in use */
 size_t qemu_ram_pagesize_largest(void)
 {
--
2.43.5
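
To make the "single entry for the entire page" behaviour of this patch concrete, here is a small standalone sketch that rounds a faulting address down to its backing page size before recording it, so that several errors inside one huge page collapse into one entry. The names poison_entry, poison_list and poison_add() are hypothetical; the real code uses QLIST, ROUND_DOWN() and g_new() as shown in the diff.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical simplified counterpart of HWPoisonPage. */
struct poison_entry {
    uint64_t addr;        /* start of the poisoned page */
    uint64_t page_size;   /* backing page size of that page */
    struct poison_entry *next;
};

static struct poison_entry *poison_list;

/*
 * Record a poisoned address. page_size must be non-zero.
 * Returns 1 if a new entry was added, 0 if the page was already
 * recorded, -1 if the allocation failed.
 */
int poison_add(uint64_t addr, uint64_t page_size)
{
    struct poison_entry *e;

    /* Align down so the entry always describes the whole page. */
    addr -= addr % page_size;

    for (e = poison_list; e; e = e->next) {
        if (e->addr == addr) {
            return 0;
        }
    }

    e = malloc(sizeof(*e));
    if (!e) {
        return -1;
    }
    e->addr = addr;
    e->page_size = page_size;
    e->next = poison_list;
    poison_list = e;
    return 1;
}
```

With a 2 MiB page size, errors reported at 0x40201000 and 0x403ff000 both map to the entry starting at 0x40200000, so the reset path remaps that huge page once.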
On 07.11.24 11:21, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
>
> When a memory page is added to the hwpoison_page_list, include
> the page size information. This size is the backend real page
> size. To better deal with hugepages, we create a single entry
> for the entire page.
>
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
> accel/kvm/kvm-all.c | 8 +++++++-
> include/exec/cpu-common.h | 1 +
> system/physmem.c | 13 +++++++++++++
> 3 files changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 801cff16a5..6dd06f5edf 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
> */
> typedef struct HWPoisonPage {
> ram_addr_t ram_addr;
> + size_t page_size;
> QLIST_ENTRY(HWPoisonPage) list;
> } HWPoisonPage;
>
> @@ -1278,7 +1279,7 @@ static void kvm_unpoison_all(void *param)
>
> QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
> QLIST_REMOVE(page, list);
> - qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
> + qemu_ram_remap(page->ram_addr, page->page_size);
> g_free(page);

I'm curious, can't we simply drop the size parameter from qemu_ram_remap()
completely and determine the page size internally from the RAMBlock that
we are looking up already?

This way, we avoid yet another lookup in qemu_ram_pagesize_from_addr(),
and can just handle it completely in qemu_ram_remap().

In particular, to be future proof, we should also align the offset down to
the pagesize.

I'm thinking about something like this:

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 801cff16a5..8a47aa7258 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1278,7 +1278,7 @@ static void kvm_unpoison_all(void *param)
QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
QLIST_REMOVE(page, list);
- qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+ qemu_ram_remap(page->ram_addr);
g_free(page);
}
}
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 638dc806a5..50a829d31f 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -67,7 +67,7 @@ typedef uintptr_t ram_addr_t;
/* memory API */
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
+void qemu_ram_remap(ram_addr_t addr);
/* This should not be used by devices. */
ram_addr_t qemu_ram_addr_from_host(void *ptr);
ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..5f19bec089 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2167,10 +2167,10 @@ void qemu_ram_free(RAMBlock *block)
}
#ifndef _WIN32
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+void qemu_ram_remap(ram_addr_t addr)
{
RAMBlock *block;
- ram_addr_t offset;
+ ram_addr_t offset, length;
int flags;
void *area, *vaddr;
int prot;
@@ -2178,6 +2178,10 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
RAMBLOCK_FOREACH(block) {
offset = addr - block->offset;
if (offset < block->max_length) {
+ /* Respect the pagesize of our RAMBlock. */
+ offset = QEMU_ALIGN_DOWN(offset, qemu_ram_pagesize(block));
+ length = qemu_ram_pagesize(block);
+
vaddr = ramblock_ptr(block, offset);
if (block->flags & RAM_PREALLOC) {
;
@@ -2206,6 +2210,8 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
memory_try_enable_merging(vaddr, length);
qemu_ram_setup_dump(vaddr, length);
}
+
+ break;
}
}
}
--
Cheers,
David / dhildenb
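The suggestion to align the offset down to the RAMBlock page size can be illustrated in isolation. The following is a minimal sketch, not QEMU code: `align_down()`, `struct remap_range` and `page_range_for()` are hypothetical names standing in for QEMU_ALIGN_DOWN() and the offset/length pair computed inside qemu_ram_remap().

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for QEMU_ALIGN_DOWN(): round `value` down to a
 * multiple of `page_size`. Assumes page_size is non-zero.
 */
static inline uint64_t align_down(uint64_t value, uint64_t page_size)
{
    return value - (value % page_size);
}

/* Start and length of the single backing page that contains an offset. */
struct remap_range {
    uint64_t start;
    uint64_t length;
};

/*
 * Given the offset of a poisoned address inside a RAM block and the
 * block's backing page size, return the whole page to be remapped.
 */
static inline struct remap_range page_range_for(uint64_t offset,
                                                uint64_t page_size)
{
    struct remap_range r = { align_down(offset, page_size), page_size };
    return r;
}
```

With a 2 MiB backing page, an offset of 0x11601234 yields a range starting at 0x11600000 with length 0x200000, so the caller only needs to pass an address and never a length.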
On 11/12/24 11:30, David Hildenbrand wrote:
> On 07.11.24 11:21, William Roche wrote:
>> - qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>> + qemu_ram_remap(page->ram_addr, page->page_size);
>> g_free(page);
>
> I'm curious, can't we simply drop the size parameter from qemu_ram_remap()
> completely and determine the page size internally from the RAMBlock that
> we are looking up already?
>
> This way, we avoid yet another lookup in qemu_ram_pagesize_from_addr(),
> and can just handle it completely in qemu_ram_remap().
>
> In particular, to be future proof, we should also align the offset down to
> the pagesize.

Yes this is a working possibility, and as you say it would provide the
advantage to avoid a size lookup (needed because the kernel siginfo can
be incorrect) and avoid tracking the poisoned pages size, with the
addresses.

But if we want to keep the information about the loss of a large page
(which I think is useful) we would have to introduce the page size
lookup when adding the page to the poison list. So according to me,
keeping track of the page size and reusing it on remap isn't so bad. But
if you prefer that we don't track the page size and do a lookup on page
insert into the poison list and another in qemu_ram_remap(), of course
we can do that.

There is also something to consider about the future: we'll also have to
deal with migration of VM that have been impacted by a memory error. And
knowing about the poisoned pages size could be useful too. But this is
another topic...

I would vote to keep this size tracking.
On 12.11.24 19:17, William Roche wrote:
> Yes this is a working possibility, and as you say it would provide the
> advantage to avoid a size lookup (needed because the kernel siginfo can
> be incorrect) and avoid tracking the poisoned pages size, with the
> addresses.
>
> But if we want to keep the information about the loss of a large page
> (which I think is useful) we would have to introduce the page size
> lookup when adding the page to the poison list. So according to me,

Right, that would be independent of the remap logic.

What I dislike about qemu_ram_remap() is that it looks like we could be
remapping a range that's possibly larger than a single page. But it
really only works on a single address, expanding that to the page.
Passing in a length that crosses RAMBlocks would not work as expected ...

So I'd prefer if we let qemu_ram_remap() do exactly that ... remap a
single page ...

> keeping track of the page size and reusing it on remap isn't so bad. But
> if you prefer that we don't track the page size and do a lookup on page
> insert into the poison list and another in qemu_ram_remap(), of course
> we can do that.

... and lookup the page size manually here if we really have to, for
example to warn/trace errors.

> There is also something to consider about the future: we'll also have to
> deal with migration of VM that have been impacted by a memory error. And
> knowing about the poisoned pages size could be useful too. But this is
> another topic...

Yes, although the destination should be able to derive the same thing
from the address I guess. We expect src and dst QEMU to use the same
memory backing.

--
Cheers,
David / dhildenb
From: William Roche <william.roche@oracle.com>
We take into account the recorded page sizes to repair the
memory locations, calling ram_block_discard_range() to punch a hole
in the backend file when necessary and regenerate a usable memory.
Fall back to unmap/remap the memory location(s) if the kernel doesn't
support the madvise calls used by ram_block_discard_range().
Hugetlbfs poison case is also taken into account as a hole punch
with fallocate will reload a new page when first touched.
Signed-off-by: William Roche <william.roche@oracle.com>
---
system/physmem.c | 50 +++++++++++++++++++++++++++++-------------------
1 file changed, 30 insertions(+), 20 deletions(-)
diff --git a/system/physmem.c b/system/physmem.c
index 750604d47d..dfea120cc5 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2197,27 +2197,37 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
} else if (xen_enabled()) {
abort();
} else {
- flags = MAP_FIXED;
- flags |= block->flags & RAM_SHARED ?
- MAP_SHARED : MAP_PRIVATE;
- flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
- prot = PROT_READ;
- prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
- if (block->fd >= 0) {
- area = mmap(vaddr, length, prot, flags, block->fd,
- offset + block->fd_offset);
- } else {
- flags |= MAP_ANONYMOUS;
- area = mmap(vaddr, length, prot, flags, -1, 0);
- }
- if (area != vaddr) {
- error_report("Could not remap addr: "
- RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
- length, addr);
- exit(1);
+ if (ram_block_discard_range(block, offset + block->fd_offset,
+ length) != 0) {
+ if (length > TARGET_PAGE_SIZE) {
+ /* punch hole is mandatory on hugetlbfs */
+ error_report("large page recovery failure addr: "
+ RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+ length, addr);
+ exit(1);
+ }
+ flags = MAP_FIXED;
+ flags |= block->flags & RAM_SHARED ?
+ MAP_SHARED : MAP_PRIVATE;
+ flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
+ prot = PROT_READ;
+ prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
+ if (block->fd >= 0) {
+ area = mmap(vaddr, length, prot, flags, block->fd,
+ offset + block->fd_offset);
+ } else {
+ flags |= MAP_ANONYMOUS;
+ area = mmap(vaddr, length, prot, flags, -1, 0);
+ }
+ if (area != vaddr) {
+ error_report("Could not remap addr: "
+ RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+ length, addr);
+ exit(1);
+ }
+ memory_try_enable_merging(vaddr, length);
+ qemu_ram_setup_dump(vaddr, length);
}
- memory_try_enable_merging(vaddr, length);
- qemu_ram_setup_dump(vaddr, length);
}
}
}
--
2.43.5
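The commit message above relies on a discard making the kernel hand out a fresh page on the next access. The following is a minimal sketch of that behaviour for the simplest case only: private anonymous memory on Linux, discarded with madvise(MADV_DONTNEED). It does not call any QEMU function, `discard_zeroes_anonymous_page()` is a hypothetical helper name, and it does not cover the file-backed or hugetlbfs case, where the patch depends on the fallocate() hole punch instead.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* MAP_ANONYMOUS and madvise() under strict -std= */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Dirty one private anonymous page, discard it with MADV_DONTNEED and
 * check what the next read returns.
 *
 * Returns 1 if the page reads back as all zeroes (a fresh page was
 * provided), 0 if the old contents survived, -1 on any error.
 */
static int discard_zeroes_anonymous_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) {
        return -1;
    }

    unsigned char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }

    /* Touch the page so that it is really backed by memory. */
    memset(p, 0xAB, (size_t)page);

    int result;
    if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) {
        result = -1;
    } else {
        /* The mapping is still valid; reading it faults in a new page. */
        result = 1;
        for (long i = 0; i < page; i++) {
            if (p[i] != 0) {
                result = 0;
                break;
            }
        }
    }

    munmap(p, (size_t)page);
    return result;
}
```

The point of the sketch is that the virtual mapping stays in place while the backing page is replaced, which is why a successful discard makes the mmap(MAP_FIXED) fallback unnecessary.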
On 07.11.24 11:21, William Roche wrote:
> + if (ram_block_discard_range(block, offset + block->fd_offset,
> + length) != 0) {
> + if (length > TARGET_PAGE_SIZE) {
> + /* punch hole is mandatory on hugetlbfs */
> + error_report("large page recovery failure addr: "
> + RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> + length, addr);
> + exit(1);
> + }

For shared memory we really need it.

Private file-backed is weird ... because we don't know if the shared or
the private page is problematic ... :(

Maybe we should just do:

    if (block->fd >= 0) {
        /* mmap(MAP_FIXED) cannot reliably zap our problematic page. */
        error_report(...);
        exit(-1);
    }

Or alternatively

    if (block->fd >= 0 && qemu_ram_is_shared(block)) {
        /* mmap() cannot possibly zap our problematic page. */
        error_report(...);
        exit(-1);
    } else if (block->fd >= 0) {
        /*
         * MAP_PRIVATE file-backed ... mmap() can only zap the private
         * page, not the shared one ... we don't know which one is
         * problematic.
         */
        warn_report(...);
    }

> + flags = MAP_FIXED;
> + flags |= block->flags & RAM_SHARED ?
> + MAP_SHARED : MAP_PRIVATE;
> + flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
> + prot = PROT_READ;
> + prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
> + if (block->fd >= 0) {
> + area = mmap(vaddr, length, prot, flags, block->fd,
> + offset + block->fd_offset);
> + } else {
> + flags |= MAP_ANONYMOUS;
> + area = mmap(vaddr, length, prot, flags, -1, 0);
> + }
> + if (area != vaddr) {
> + error_report("Could not remap addr: "
> + RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> + length, addr);
> + exit(1);
> + }
> + memory_try_enable_merging(vaddr, length);
> + qemu_ram_setup_dump(vaddr, length);

Can we factor the mmap hack out into a separate helper function to clean
this up a bit?

--
Cheers,
David / dhildenb
On 11/12/24 12:07, David Hildenbrand wrote:
> On 07.11.24 11:21, William Roche wrote:
>> + if (ram_block_discard_range(block, offset + block->fd_offset,
>> + length) != 0) {
>> + if (length > TARGET_PAGE_SIZE) {
>> + /* punch hole is mandatory on hugetlbfs */
>> + error_report("large page recovery failure addr: "
>> + RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> + length, addr);
>> + exit(1);
>> + }
>
> For shared memory we really need it.
>
> Private file-backed is weird ... because we don't know if the shared or
> the private page is problematic ... :(

I agree with you, and we have to decide when should we bail out if
ram_block_discard_range() doesn't work.
According to me, if discard doesn't work and we are dealing with
file-backed largepages (shared or not) we have to exit, because the
fallocate is mandatory. It is the case with hugetlbfs.

In the non-file-backed case, or the file-backed non-largepage private
case, according to me we can trust the mmap() method to put everything
back in place for the VM reset to work as expected.
Are there aspects I don't see, and for which mmap + the remap handler is
not sufficient and we should also bail out here ?

> Maybe we should just do:
>
>     if (block->fd >= 0) {
>         /* mmap(MAP_FIXED) cannot reliably zap our problematic page. */
>         error_report(...);
>         exit(-1);
>     }
>
> Or alternatively
>
>     if (block->fd >= 0 && qemu_ram_is_shared(block)) {
>         /* mmap() cannot possibly zap our problematic page. */
>         error_report(...);
>         exit(-1);
>     } else if (block->fd >= 0) {
>         /*
>          * MAP_PRIVATE file-backed ... mmap() can only zap the private
>          * page, not the shared one ... we don't know which one is
>          * problematic.
>          */
>         warn_report(...);
>     }

I also agree that any file-backed/shared case should bail out if discard
(fallocate) fails, no matter whether large or standard pages are used.

In the case of file-backed private standard pages, I think that a poison
on the private page can be fixed with a new mmap.
According to me, there are 2 cases to consider: at the moment the poison
is seen, the page was dirty (so it means that it was a pure private
page), or the page was not dirty, and in this case the poison could
replace this non-dirty page with a new copy of the file content.
In both cases, I'd say that the remap should clean up the poison.

So the conditions when discard fails, could be something like:

    if (block->fd >= 0 && (qemu_ram_is_shared(block) ||
        (length > TARGET_PAGE_SIZE))) {
        /* punch hole is mandatory, mmap() cannot possibly zap our page*/
        error_report("%spage recovery failure addr: "
                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
                     (length > TARGET_PAGE_SIZE) ? "large " : "",
                     length, addr);
        exit(1);
    }

> Can we factor the mmap hack out into a separate helper function to clean
> this up a bit?

Sure, I'll do that.

>> For shared memory we really need it.
>>
>> Private file-backed is weird ... because we don't know if the shared or
>> the private page is problematic ... :(
>
> I agree with you, and we have to decide when should we bail out if
> ram_block_discard_range() doesn't work.
> According to me, if discard doesn't work and we are dealing with
> file-backed largepages (shared or not) we have to exit, because the
> fallocate is mandatory. It is the case with hugetlbfs.
>
> In the non-file-backed case, or the file-backed non-largepage private
> case, according to me we can trust the mmap() method to put everything
> back in place for the VM reset to work as expected.
> Are there aspects I don't see, and for which mmap + the remap handler is
> not sufficient and we should also bail out here ?

mmap() will only zap anonymous pages, no pagecache pages. See below.

>> Maybe we should just do:
>>
>>     if (block->fd >= 0) {
>>         /* mmap(MAP_FIXED) cannot reliably zap our problematic page. */
>>         error_report(...);
>>         exit(-1);
>>     }
>>
>> Or alternatively
>>
>>     if (block->fd >= 0 && qemu_ram_is_shared(block)) {
>>         /* mmap() cannot possibly zap our problematic page. */
>>         error_report(...);
>>         exit(-1);
>>     } else if (block->fd >= 0) {
>>         /*
>>          * MAP_PRIVATE file-backed ... mmap() can only zap the private
>>          * page, not the shared one ... we don't know which one is
>>          * problematic.
>>          */
>>         warn_report(...);
>>     }
>
> I also agree that any file-backed/shared case should bail out if discard
> (fallocate) fails, no matter whether large or standard pages are used.
>
> In the case of file-backed private standard pages, I think that a poison
> on the private page can be fixed with a new mmap.
> According to me, there are 2 cases to consider: at the moment the poison
> is seen, the page was dirty (so it means that it was a pure private
> page), or the page was not dirty, and in this case the poison could
> replace this non-dirty page with a new copy of the file content.
> In both cases, I'd say that the remap should clean up the poison.

Let's assume we have mmap(MAP_PRIVATE, fd). The following scenarios are
possible:

(a) We only have a pagecache page (never written) that is poisoned
    -> mmap(MAP_FIXED) cannot resolve that
(b) We only have an anonymous page (e.g., pagecache truncated, or if the
    hugetlb file was empty) that is poisoned
    -> mmap(MAP_FIXED) can resolve that
(c) We have an anonymous and a pagecache page (written -> COW).
    (c1) Anonymous page is poisoned
         -> mmap(MAP_FIXED) can resolve that
    (c2) Pagecache page is poisoned
         -> mmap(MAP_FIXED) cannot resolve that

So mmap(MAP_FIXED) cannot sort out all cases. In practice, (a) and (c2)
are uncommon, but possible.

(b) is common with hugetlb. (a) and (c) are uncommon with hugetlb, just
because of the nature of hugetlb pages being a scarce resource.

And IIRC, (b) with hugetlb should be sorted out with mmap(MAP_FIXED).
Please double-check.

> So the conditions when discard fails, could be something like:
>
>     if (block->fd >= 0 && (qemu_ram_is_shared(block) ||
>         (length > TARGET_PAGE_SIZE))) {
>         /* punch hole is mandatory, mmap() cannot possibly zap our page*/
>         error_report("%spage recovery failure addr: "
>                      RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>                      (length > TARGET_PAGE_SIZE) ? "large " : "",
>                      length, addr);

I'm not sure if we should be special-casing hugetlb.

If we want to be 100% sure, we will do

    if (block->fd >= 0) {
        error_report();
        exit(1);
    }

But we could decide to be "nice" to hugetlb and assume (b) for them
above: that is, we would do

    /*
     * mmap() cannot zap pagecache pages, only anonymous pages. As soon as
     * we might have pagecache pages involved (either private or shared
     * mapping), we must be careful. However, MAP_PRIVATE on empty hugetlb
     * files is common, and extremely uncommon on non-empty hugetlb files,
     * so we'll special-case them here.
     */
    if (block->fd >= 0 && (qemu_ram_is_shared(block) ||
        length == TARGET_PAGE_SIZE)) {
        ...
    }

[in practice, we could use /proc/self/pagemap to see if we map an
anonymous page ... but I'd rather not go down that path just yet]

But, in the end the expectation is that madvise()+fallocate() will
usually not fail.

--
Cheers,
David / dhildenb
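The "nice to hugetlb" variant discussed above can be written down as a pure decision function, which makes the cases easy to check one by one. This is a sketch only: `on_discard_failure()` and its enum are hypothetical names, `base_page_size` stands in for TARGET_PAGE_SIZE, and the policy encoded is just one of the alternatives floated in the thread, not a settled outcome.

```c
#include <stdbool.h>
#include <stddef.h>

/* What to do once the discard of a poisoned page has failed. */
enum discard_failure_action {
    FALLBACK_TO_MMAP, /* assume only an anonymous page is involved */
    FATAL             /* a pagecache page may be involved: give up */
};

/*
 * fd             - backing file descriptor, negative for anonymous memory
 * shared         - true for a MAP_SHARED mapping
 * length         - size of the page being recovered
 * base_page_size - stand-in for TARGET_PAGE_SIZE (assumption)
 */
static enum discard_failure_action
on_discard_failure(int fd, bool shared, size_t length, size_t base_page_size)
{
    if (fd < 0) {
        /* Anonymous memory: mmap(MAP_FIXED) can zap the page. */
        return FALLBACK_TO_MMAP;
    }
    if (shared) {
        /* Shared file mapping: the poisoned page lives in the pagecache. */
        return FATAL;
    }
    if (length == base_page_size) {
        /* Private file mapping, small pages: scenarios (a)/(c2) possible. */
        return FATAL;
    }
    /* Private hugetlb mapping: optimistically assume scenario (b). */
    return FALLBACK_TO_MMAP;
}
```

The stricter alternative from the same message, treating every file-backed block as fatal, would drop the last two branches and return FATAL whenever `fd >= 0`.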
From: William Roche <william.roche@oracle.com>
When an entire large page is impacted by an error (hugetlbfs case),
report better the size and location of this large memory hole, so
give a warning message when this page is first hit:
Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST addr Z
Signed-off-by: William Roche <william.roche@oracle.com>
---
accel/kvm/kvm-all.c | 9 ++++++++-
include/sysemu/kvm_int.h | 4 +++-
target/arm/kvm.c | 2 +-
target/i386/kvm/kvm.c | 2 +-
4 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 6dd06f5edf..a572437115 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1284,7 +1284,7 @@ static void kvm_unpoison_all(void *param)
}
}
-void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, void *ha, hwaddr gpa)
{
HWPoisonPage *page;
size_t sz = qemu_ram_pagesize_from_addr(ram_addr);
@@ -1301,6 +1301,13 @@ void kvm_hwpoison_page_add(ram_addr_t ram_addr)
page->ram_addr = ram_addr;
page->page_size = sz;
QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
+
+ if (sz > TARGET_PAGE_SIZE) {
+ gpa = ROUND_DOWN(gpa, sz);
+ ha = (void *)ROUND_DOWN((uint64_t)ha, sz);
+ warn_report("Memory error: Loosing a large page (size: %zu) "
+ "at QEMU addr %p and GUEST addr 0x%" HWADDR_PRIx, sz, ha, gpa);
+ }
}
bool kvm_hwpoisoned_mem(void)
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index a1e72763da..ee34f1d225 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -178,10 +178,12 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size);
*
* Parameters:
* @ram_addr: the address in the RAM for the poisoned page
+ * @hva: host virtual address aka QEMU addr
+ * @gpa: guest physical address aka GUEST addr
*
* Add a poisoned page to the list
*
* Return: None.
*/
-void kvm_hwpoison_page_add(ram_addr_t ram_addr);
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, void *hva, hwaddr gpa);
#endif
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index f1f1b5b375..aae66dba41 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2359,7 +2359,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
ram_addr = qemu_ram_addr_from_host(addr);
if (ram_addr != RAM_ADDR_INVALID &&
kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
- kvm_hwpoison_page_add(ram_addr);
+ kvm_hwpoison_page_add(ram_addr, addr, paddr);
/*
* If this is a BUS_MCEERR_AR, we know we have been called
* synchronously from the vCPU thread, so we can easily
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index fd9f198892..fd7cd7008e 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -753,7 +753,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
ram_addr = qemu_ram_addr_from_host(addr);
if (ram_addr != RAM_ADDR_INVALID &&
kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
- kvm_hwpoison_page_add(ram_addr);
+ kvm_hwpoison_page_add(ram_addr, addr, paddr);
kvm_mce_inject(cpu, paddr, code);
/*
--
2.43.5
On 07.11.24 11:21, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
>
> When an entire large page is impacted by an error (hugetlbfs case),
> report better the size and location of this large memory hole, so
> give a warning message when this page is first hit:
> Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST addr Z
>

Hm, I wonder if we really want to special-case hugetlb here.

Why not make the warning independent of the underlying page size?

--
Cheers,
David / dhildenb
On 11/12/24 12:13, David Hildenbrand wrote:
> Hm, I wonder if we really want to special-case hugetlb here.
>
> Why not make the warning independent of the underlying page size?

We already have a warning provided by Qemu (in kvm_arch_on_sigbus_vcpu()):

    Guest MCE Memory Error at QEMU addr Y and GUEST addr Z of type
    BUS_MCEERR_AR/_AO injected

The one I suggest is an additional message provided before the above
message.

Here is an example:

    qemu-system-x86_64: warning: Memory error: Loosing a large page (size:
    2097152) at QEMU addr 0x7fdd7d400000 and GUEST addr 0x11600000
    qemu-system-x86_64: warning: Guest MCE Memory Error at QEMU addr
    0x7fdd7d400000 and GUEST addr 0x11600000 of type BUS_MCEERR_AO injected

According to me, this large page case additional message will help to
better understand the probable sudden proliferation of memory errors
that can be reported by Qemu on the impacted range.
Not only will the machine administrator identify better that a single
memory error had this large impact, it can also help us to better
measure the impact of fixing the large page memory error support in the
field (in the future).

These are some reasons why I do think this large page specific message
can be useful.
On 12.11.24 19:17, William Roche wrote:
> On 11/12/24 12:13, David Hildenbrand wrote:
>> On 07.11.24 11:21, William Roche wrote:
>>> From: William Roche <william.roche@oracle.com>
>>>
>>> When an entire large page is impacted by an error (hugetlbfs case),
>>> report better the size and location of this large memory hole, so
>>> give a warning message when this page is first hit:
>>> Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST
>>> addr Z
>>>
>>
>> Hm, I wonder if we really want to special-case hugetlb here.
>>
>> Why not make the warning independent of the underlying page size?
>
> We already have a warning provided by Qemu (in kvm_arch_on_sigbus_vcpu()):
>
> Guest MCE Memory Error at QEMU addr Y and GUEST addr Z of type
> BUS_MCEERR_AR/_AO injected
>
> The one I suggest is an additional message provided before the above
> message.
>
> Here is an example:
> qemu-system-x86_64: warning: Memory error: Loosing a large page (size:
> 2097152) at QEMU addr 0x7fdd7d400000 and GUEST addr 0x11600000
> qemu-system-x86_64: warning: Guest MCE Memory Error at QEMU addr
> 0x7fdd7d400000 and GUEST addr 0x11600000 of type BUS_MCEERR_AO injected
>

Hm, I think we should definitely be including the size in the existing
one. That code was written without huge pages in mind.

We should similarly warn in the arm implementation (where I don't see a
similar message yet).

>
> According to me, this large page case additional message will help to
> better understand the probable sudden proliferation of memory errors
> that can be reported by Qemu on the impacted range.
> Not only will the machine administrator identify better that a single
> memory error had this large impact, it can also help us to better
> measure the impact of fixing the large page memory error support in the
> field (in the future).

What about extending the existing one to something like

warning: Guest MCE Memory Error at QEMU addr $ADDR and GUEST $PADDR of
type BUS_MCEERR_AO and size $SIZE (large page) injected

With the "large page" hint you can highlight that this is special.

On a related note ... I think we have a problem. Assume we got a SIGBUS
on a huge page (e.g., somewhere in a 1 GiB page).

We will call kvm_mce_inject(cpu, paddr, code) /
acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)

But where is the size information? :// Won't the VM simply assume that
there was a MCE on a single 4k page starting at paddr?

I'm not sure if we can inject ranges, or if we would have to issue one
MCE per page ... hm, what's your take on this?

--
Cheers,

David / dhildenb
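A minimal sketch of the extended warning proposed above, assuming a 4 KiB base page size. The helper name format_mce_warning() and the BASE_PAGE_SIZE constant are hypothetical and not part of QEMU. The sketch keeps the "GUEST addr" wording of the existing message and only appends the size and the "(large page)" hint when the backing page is larger than a base page:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption for this sketch: 4 KiB base pages. */
#define BASE_PAGE_SIZE 4096u

/*
 * Hypothetical helper (not a QEMU function): build the warning text.
 * For a base page the wording matches the existing message; for a larger
 * page the size and a "(large page)" hint are appended.
 * Returns what snprintf() returns.
 */
static int format_mce_warning(char *buf, size_t len, uint64_t qemu_addr,
                              uint64_t guest_paddr, const char *type,
                              uint64_t page_size)
{
    if (page_size > BASE_PAGE_SIZE) {
        return snprintf(buf, len,
                        "Guest MCE Memory Error at QEMU addr 0x%" PRIx64
                        " and GUEST addr 0x%" PRIx64
                        " of type %s and size %" PRIu64
                        " (large page) injected",
                        qemu_addr, guest_paddr, type, page_size);
    }
    return snprintf(buf, len,
                    "Guest MCE Memory Error at QEMU addr 0x%" PRIx64
                    " and GUEST addr 0x%" PRIx64 " of type %s injected",
                    qemu_addr, guest_paddr, type);
}
```

With this shape, the standard 4k case keeps its current text, and only errors on larger pages produce the longer message.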
Thanks for the feedback on the patches, I'll send a new version in the
coming week.

But I just wanted to answer now the questions you asked on this specific
one, as they are related to the importance of fixing the large page
failure handling.

On 11/12/24 23:22, David Hildenbrand wrote:
> On 12.11.24 19:17, William Roche wrote:
>> On 11/12/24 12:13, David Hildenbrand wrote:
>>> On 07.11.24 11:21, William Roche wrote:
>>>> From: William Roche <william.roche@oracle.com>
>>>>
>>>> When an entire large page is impacted by an error (hugetlbfs case),
>>>> report better the size and location of this large memory hole, so
>>>> give a warning message when this page is first hit:
>>>> Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST
>>>> addr Z
>>>>
>>>
>>> Hm, I wonder if we really want to special-case hugetlb here.
>>>
>>> Why not make the warning independent of the underlying page size?
>>
>> We already have a warning provided by Qemu (in
>> kvm_arch_on_sigbus_vcpu()):
>>
>> Guest MCE Memory Error at QEMU addr Y and GUEST addr Z of type
>> BUS_MCEERR_AR/_AO injected
>>
>> The one I suggest is an additional message provided before the above
>> message.
>>
>> Here is an example:
>> qemu-system-x86_64: warning: Memory error: Loosing a large page (size:
>> 2097152) at QEMU addr 0x7fdd7d400000 and GUEST addr 0x11600000
>> qemu-system-x86_64: warning: Guest MCE Memory Error at QEMU addr
>> 0x7fdd7d400000 and GUEST addr 0x11600000 of type BUS_MCEERR_AO injected
>>
>
> Hm, I think we should definitely be including the size in the existing
> one. That code was written without huge pages in mind.

Yes, we can do that, and get the page size at this level to pass as a
'page_size' argument to kvm_hwpoison_page_add().

It would make the message longer, as we will have the extra information
about the large page on all messages when an error impacts a large page.
We could change the messages only when we are dealing with a large page,
so that the standard (4k) case isn't modified.

>
> We should similarly warn in the arm implementation (where I don't see a
> similar message yet).

Ok, I'll also add a message for the ARM platform.

>>
>> According to me, this large page case additional message will help to
>> better understand the probable sudden proliferation of memory errors
>> that can be reported by Qemu on the impacted range.
>> Not only will the machine administrator identify better that a single
>> memory error had this large impact, it can also help us to better
>> measure the impact of fixing the large page memory error support in the
>> field (in the future).
>
> What about extending the existing one to something like
>
> warning: Guest MCE Memory Error at QEMU addr $ADDR and GUEST $PADDR of
> type BUS_MCEERR_AO and size $SIZE (large page) injected
>
>
> With the "large page" hint you can highlight that this is special.

Right, we can do it that way. However, it also gives the impression that
we somehow inject errors on a large range of the memory, which is not
the case. I'll send a proposal with a different formulation, so that you
can choose.

> On a related note ...I think we have a problem. Assume we got a SIGBUS
> on a huge page (e.g., somewhere in a 1 GiB page).
>
> We will call kvm_mce_inject(cpu, paddr, code) /
> acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)
>
> But where is the size information? :// Won't the VM simply assume that
> there was a MCE on a single 4k page starting at paddr?

This is absolutely right!
It's exactly what happens: the VM kernel receives the information and
considers that only the impacted page has to be poisoned.

That's also the reason why QEMU repeats the error injections every time
the poisoned large page is accessed (for all other touched 4k pages
located on this "memory hole").

>
> I'm not sure if we can inject ranges, or if we would have to issue one
> MCE per page ... hm, what's your take on this?

I don't know of any size information about a memory error reported by
the hardware. The kernel doesn't seem to expect any such information.
It explains why there is no impact/blast size information provided when
an error is relayed to the VM.

We could take the "memory hole" size into account in QEMU, but repeating
error injections is not going to help a lot either: we'd need to give
the VM some time to deal with an error injection before producing a new
error for the next page, etc. In the case (x86 only) where an
asynchronous error is relayed with BUS_MCEERR_AO, we would also have to
repeat the error for all the 4k pages located on the lost large page.

We can see that the Linux kernel has some mechanisms to deal with an
occasional 4k page loss, but a larger blast is very likely to crash the
VM (which is fine). And as a significant part of the memory is no longer
accessible, dealing with the error itself can be impaired and we
increase the risk of losing data, even though most of the memory on the
large page could still be used.

Now if we can recover the 'still valid' memory of the impacted large
page, we can significantly reduce this blast and give the VM a much
better chance to survive the incident or crash more gracefully.

I've looked at the project you pointed me to, which is not ready to be
adopted:
https://lore.kernel.org/linux-mm/20240924043924.3562257-2-jiaqiyan@google.com/T/

But we see that this large page enhancement is needed, sometimes just
to give the VM a chance to survive a little longer before being
terminated or moved.
Injecting multiple MCEs or ACPI error records doesn't help, in my
opinion.

William.
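To put numbers on the "blast" described above, here is a small sketch of the arithmetic, assuming 4 KiB base pages and power-of-two huge page sizes. The helper names huge_page_start() and base_pages_lost() are hypothetical and only serve the illustration:

```c
#include <stdint.h>

/* Assumption for this sketch: 4 KiB base pages. */
#define BASE_PAGE_SIZE 4096ULL

/*
 * Start address of the huge page containing addr.
 * page_size must be a power of two.
 */
static uint64_t huge_page_start(uint64_t addr, uint64_t page_size)
{
    return addr & ~(page_size - 1);
}

/*
 * Number of base pages that become inaccessible when one huge page of
 * page_size bytes is lost, i.e. how many per-4k error injections would
 * be needed to describe the whole hole to the guest.
 */
static uint64_t base_pages_lost(uint64_t page_size)
{
    return page_size / BASE_PAGE_SIZE;
}
```

Under these assumptions a 2 MiB page covers 512 base pages and a 1 GiB page covers 262144, which illustrates why injecting one error per 4k page is not attractive.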
>> Hm, I think we should definitely be including the size in the existing
>> one. That code was written without huge pages in mind.
>
> Yes we can do that, and get the page size at this level to pass as a
> 'page_size' argument to kvm_hwpoison_page_add().
>
> It would make the message longer as we will have the extra information
> about the large page on all messages when an error impacts a large page.
> We could change the messages only when we are dealing with a large page,
> so that the standard (4k) case isn't modified.

Right. And likely we should call it "huge page" instead, which is the
Linux term for anything larger than a single page.

[...]

>>
>> With the "large page" hint you can highlight that this is special.
>
> Right, we can do it that way. It also gives the impression that we
> somehow inject errors on a large range of the memory. Which is not the
> case. I'll send a proposal with a different formulation, so that you can
> choose.
>

Makes sense.

>
>
>> On a related note ...I think we have a problem. Assume we got a SIGBUS
>> on a huge page (e.g., somewhere in a 1 GiB page).
>>
>> We will call kvm_mce_inject(cpu, paddr, code) /
>> acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)
>>
>> But where is the size information? :// Won't the VM simply assume that
>> there was a MCE on a single 4k page starting at paddr?
>
> This is absolutely right !
> It's exactly what happens: The VM kernel received the information and
> considers that only the impacted page has to be poisoned.
>
> That's also the reason why Qemu repeats the error injections every time
> the poisoned large page is accessed (for all other touched 4k pages
> located on this "memory hole").

:/

So we always get the full 1 GiB range from Linux and essentially always
report the first 4k page, on any such access, right?

BTW, should we handle duplicates in our poison list?

>
>>
>> I'm not sure if we can inject ranges, or if we would have to issue one
>> MCE per page ... hm, what's your take on this?
>
> I don't know of any size information about a memory error reported by
> the hardware. The kernel doesn't seem to expect any such information.
> It explains why there is no impact/blast size information provided when
> an error is relayed to the VM.
>
> We could take the "memory hole" size into account in Qemu, but repeating
> error injections is not going to help a lot either: We'd need to give
> the VM some time to deal with an error injection before producing a new
> error for the next page etc... in the case (x86 only) where an

I had the same thoughts.

> asynchronous error is relayed with BUS_MCEERR_AO, we would also have to
> repeat the error for all the 4k pages located on the lost large page too.
>
> We can see that the Linux kernel has some mechanisms to deal with a
> seldom 4k page loss, but a larger blast is very likely to crash the VM
> (which is fine).

Right, and that will inevitably happen when we get an MCE on a 1 GiB
hugetlb page, correct? The whole thing will be inaccessible.

> And as a significant part of the memory is no longer
> accessible, dealing with the error itself can be impaired and we
> increase the risk of losing data, even though most of the memory on the
> large page could still be used.
>
> Now if we can recover the 'still valid' memory of the impacted large
> page, we can significantly reduce this blast and give a much better
> chance to the VM to survive the incident or crash more gracefully.

Right. That cannot be sorted out in user space alone, unfortunately.

>
> I've looked at the project you indicated me, which is not ready to be
> adopted:
> https://lore.kernel.org/linux-mm/20240924043924.3562257-2-jiaqiyan@google.com/T/
>

Yes, that goes in a better direction, though.

> But we see that, this large page enhancement is needed, sometimes just
> to give a chance to the VM to survive a little longer before being
> terminated or moved.
> Injecting multiple MCEs or ACPI error records doesn't help, according to me.

I suspect that in most cases, when we get an MCE on a 1 GiB page in the
hypervisor, our running Linux guest will soon crash, because it really
lost 1 GiB of contiguous memory. :(

--
Cheers,

David / dhildenb
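On the question of duplicates in the poison list raised above, a minimal sketch of an add operation that skips addresses already recorded. The PoisonPage type and poison_list_add() are simplified, hypothetical stand-ins and not QEMU's actual list implementation; they only illustrate the check:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified poison list entry (singly linked). */
typedef struct PoisonPage {
    uint64_t ram_addr;
    struct PoisonPage *next;
} PoisonPage;

/*
 * Record ram_addr unless it is already on the list.
 * Returns true if a new entry was added, false if the address was
 * already recorded or the allocation failed.
 */
static bool poison_list_add(PoisonPage **head, uint64_t ram_addr)
{
    PoisonPage *page;

    for (page = *head; page; page = page->next) {
        if (page->ram_addr == ram_addr) {
            return false; /* duplicate: nothing to do */
        }
    }

    page = malloc(sizeof(*page));
    if (!page) {
        return false;
    }
    page->ram_addr = ram_addr;
    page->next = *head;
    *head = page;
    return true;
}
```

If entries were keyed on the start of the huge page rather than on the faulting 4k address, repeated accesses to the same lost huge page would collapse into a single entry. That keying is a design assumption of this sketch, not something the thread settles.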
From: David Hildenbrand <david@redhat.com>
Notify registered listeners about the remap at the end of
qemu_ram_remap() so e.g., a memory backend can re-apply its
settings correctly.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
hw/core/numa.c | 11 +++++++++++
include/exec/ramlist.h | 3 +++
system/physmem.c | 1 +
3 files changed, 15 insertions(+)
diff --git a/hw/core/numa.c b/hw/core/numa.c
index 1b5f44baea..4ca67db483 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -895,3 +895,14 @@ void ram_block_notify_resize(void *host, size_t old_size, size_t new_size)
}
}
}
+
+void ram_block_notify_remap(void *host, size_t offset, size_t size)
+{
+ RAMBlockNotifier *notifier;
+
+ QLIST_FOREACH(notifier, &ram_list.ramblock_notifiers, next) {
+ if (notifier->ram_block_remapped) {
+ notifier->ram_block_remapped(notifier, host, offset, size);
+ }
+ }
+}
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
index d9cfe530be..c1dc785a57 100644
--- a/include/exec/ramlist.h
+++ b/include/exec/ramlist.h
@@ -72,6 +72,8 @@ struct RAMBlockNotifier {
size_t max_size);
void (*ram_block_resized)(RAMBlockNotifier *n, void *host, size_t old_size,
size_t new_size);
+ void (*ram_block_remapped)(RAMBlockNotifier *n, void *host, size_t offset,
+ size_t size);
QLIST_ENTRY(RAMBlockNotifier) next;
};
@@ -80,6 +82,7 @@ void ram_block_notifier_remove(RAMBlockNotifier *n);
void ram_block_notify_add(void *host, size_t size, size_t max_size);
void ram_block_notify_remove(void *host, size_t size, size_t max_size);
void ram_block_notify_resize(void *host, size_t old_size, size_t new_size);
+void ram_block_notify_remap(void *host, size_t offset, size_t size);
GString *ram_block_format(void);
diff --git a/system/physmem.c b/system/physmem.c
index dfea120cc5..e72ca31451 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2228,6 +2228,7 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
memory_try_enable_merging(vaddr, length);
qemu_ram_setup_dump(vaddr, length);
}
+ ram_block_notify_remap(block->host, offset, length);
}
}
}
--
2.43.5
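The dispatch pattern used by ram_block_notify_remap() above can be sketched in isolation as follows. This is a simplified, self-contained illustration: the Notifier and CountingNotifier types, the plain linked list, and the function names are hypothetical stand-ins for RAMBlockNotifier and QLIST. The point it shows is that the callback is optional, so listeners that do not care about remaps leave it NULL and are skipped:

```c
#include <stddef.h>

/* Hypothetical stand-in for RAMBlockNotifier with a single optional hook. */
typedef struct Notifier Notifier;
struct Notifier {
    void (*remapped)(Notifier *n, void *host, size_t offset, size_t size);
    Notifier *next;
};

/* Call the remap hook of every listener that registered one. */
static void notify_remap(Notifier *head, void *host, size_t offset,
                         size_t size)
{
    Notifier *n;

    for (n = head; n; n = n->next) {
        if (n->remapped) {
            n->remapped(n, host, offset, size);
        }
    }
}

/* Example listener that counts notifications and remembers the last size. */
typedef struct CountingNotifier {
    Notifier base; /* must stay the first member for the cast below */
    unsigned calls;
    size_t last_size;
} CountingNotifier;

static void count_remap(Notifier *n, void *host, size_t offset, size_t size)
{
    CountingNotifier *c = (CountingNotifier *)n;

    (void)host;
    (void)offset;
    c->calls++;
    c->last_size = size;
}
```

A listener without a hook can sit on the same list without being called, which mirrors the `if (notifier->ram_block_remapped)` check in the patch.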
From: David Hildenbrand <david@redhat.com>
We want to reuse the functionality when remapping or resizing RAM.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
backends/hostmem.c | 155 ++++++++++++++++++++++++---------------------
1 file changed, 82 insertions(+), 73 deletions(-)
diff --git a/backends/hostmem.c b/backends/hostmem.c
index 181446626a..bf85d716e5 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -36,6 +36,87 @@ QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_BIND != MPOL_BIND);
QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_INTERLEAVE != MPOL_INTERLEAVE);
#endif
+static void host_memory_backend_apply_settings(HostMemoryBackend *backend,
+ void *ptr, uint64_t size,
+ Error **errp)
+{
+ bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
+
+ if (backend->merge) {
+ qemu_madvise(ptr, size, QEMU_MADV_MERGEABLE);
+ }
+ if (!backend->dump) {
+ qemu_madvise(ptr, size, QEMU_MADV_DONTDUMP);
+ }
+#ifdef CONFIG_NUMA
+ unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
+ /* lastbit == MAX_NODES means maxnode = 0 */
+ unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
+ /*
+ * Ensure policy won't be ignored in case memory is preallocated
+ * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
+ * this doesn't catch hugepage case.
+ */
+ unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
+ int mode = backend->policy;
+
+ /*
+ * Check for invalid host-nodes and policies and give more verbose
+ * error messages than mbind().
+ */
+ if (maxnode && backend->policy == MPOL_DEFAULT) {
+ error_setg(errp, "host-nodes must be empty for policy default,"
+ " or you should explicitly specify a policy other"
+ " than default");
+ return;
+ } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
+ error_setg(errp, "host-nodes must be set for policy %s",
+ HostMemPolicy_str(backend->policy));
+ return;
+ }
+
+ /*
+ * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
+ * as argument to mbind() due to an old Linux bug (feature?) which
+ * cuts off the last specified node. This means backend->host_nodes
+ * must have MAX_NODES+1 bits available.
+ */
+ assert(sizeof(backend->host_nodes) >=
+ BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
+ assert(maxnode <= MAX_NODES);
+
+#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
+ if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
+ /*
+ * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
+ * silently picks the first node.
+ */
+ mode = MPOL_PREFERRED_MANY;
+ }
+#endif
+
+ if (maxnode &&
+ mbind(ptr, size, mode, backend->host_nodes, maxnode + 1, flags)) {
+ if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
+ error_setg_errno(errp, errno,
+ "cannot bind memory to host NUMA nodes");
+ return;
+ }
+ }
+#endif
+ /*
+ * Preallocate memory after the NUMA policy has been instantiated.
+ * This is necessary to guarantee memory is allocated with
+ * specified NUMA policy in place.
+ */
+ if (backend->prealloc &&
+ !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
+ ptr, size, backend->prealloc_threads,
+ backend->prealloc_context, async, errp)) {
+ return;
+ }
+}
+
char *
host_memory_backend_get_name(HostMemoryBackend *backend)
{
@@ -337,7 +418,6 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
void *ptr;
uint64_t sz;
size_t pagesize;
- bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
if (!bc->alloc) {
return;
@@ -357,78 +437,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
return;
}
- if (backend->merge) {
- qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
- }
- if (!backend->dump) {
- qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
- }
-#ifdef CONFIG_NUMA
- unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
- /* lastbit == MAX_NODES means maxnode = 0 */
- unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
- /*
- * Ensure policy won't be ignored in case memory is preallocated
- * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
- * this doesn't catch hugepage case.
- */
- unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
- int mode = backend->policy;
-
- /* check for invalid host-nodes and policies and give more verbose
- * error messages than mbind(). */
- if (maxnode && backend->policy == MPOL_DEFAULT) {
- error_setg(errp, "host-nodes must be empty for policy default,"
- " or you should explicitly specify a policy other"
- " than default");
- return;
- } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
- error_setg(errp, "host-nodes must be set for policy %s",
- HostMemPolicy_str(backend->policy));
- return;
- }
-
- /*
- * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
- * as argument to mbind() due to an old Linux bug (feature?) which
- * cuts off the last specified node. This means backend->host_nodes
- * must have MAX_NODES+1 bits available.
- */
- assert(sizeof(backend->host_nodes) >=
- BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
- assert(maxnode <= MAX_NODES);
-
-#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
- if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
- /*
- * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
- * silently picks the first node.
- */
- mode = MPOL_PREFERRED_MANY;
- }
-#endif
-
- if (maxnode &&
- mbind(ptr, sz, mode, backend->host_nodes, maxnode + 1, flags)) {
- if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
- error_setg_errno(errp, errno,
- "cannot bind memory to host NUMA nodes");
- return;
- }
- }
-#endif
- /*
- * Preallocate memory after the NUMA policy has been instantiated.
- * This is necessary to guarantee memory is allocated with
- * specified NUMA policy in place.
- */
- if (backend->prealloc && !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
- ptr, sz,
- backend->prealloc_threads,
- backend->prealloc_context,
- async, errp)) {
- return;
- }
+ host_memory_backend_apply_settings(backend, ptr, sz, errp);
}
static bool
--
2.43.5
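The maxnode computation moved by this patch relies on the convention stated in its comment: when no bit is set in host_nodes, find_last_bit() returns MAX_NODES, and the modulo then yields 0. A small worked sketch of that arithmetic, assuming MAX_NODES is 128 for illustration; the helper name maxnode_from_lastbit() is hypothetical:

```c
/* Assumption for this sketch: 128 possible NUMA nodes. */
#define MAX_NODES 128

/*
 * Map the index of the last set bit in the node mask to "maxnode".
 * lastbit == MAX_NODES encodes "no bit set" and maps to 0; otherwise
 * the result is lastbit + 1, i.e. the number of bits up to and
 * including the last set one.
 */
static unsigned long maxnode_from_lastbit(unsigned long lastbit)
{
    return (lastbit + 1) % (MAX_NODES + 1);
}
```

For example, an empty mask gives 0, a mask with only node 0 set gives 1, and a mask whose last set bit is 127 gives 128. The patch then passes maxnode + 1 to mbind(), for the reason given in its comment.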
From: David Hildenbrand <david@redhat.com>
Let's register a RAM block notifier and react on remap notifications.
Simply re-apply the settings. Warn only when something goes wrong.
Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. Could be
that hostmem is still missing to update that flag ...
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
backends/hostmem.c | 29 +++++++++++++++++++++++++++++
include/sysemu/hostmem.h | 1 +
2 files changed, 30 insertions(+)
diff --git a/backends/hostmem.c b/backends/hostmem.c
index bf85d716e5..fbd8708664 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -361,11 +361,32 @@ static void host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
backend->prealloc_threads = value;
}
+static void host_memory_backend_ram_remapped(RAMBlockNotifier *n, void *host,
+ size_t offset, size_t size)
+{
+ HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
+ ram_notifier);
+ Error *err = NULL;
+
+ if (!host_memory_backend_mr_inited(backend) ||
+ memory_region_get_ram_ptr(&backend->mr) != host) {
+ return;
+ }
+
+ host_memory_backend_apply_settings(backend, host + offset, size, &err);
+ if (err) {
+ warn_report_err(err);
+ }
+}
+
static void host_memory_backend_init(Object *obj)
{
HostMemoryBackend *backend = MEMORY_BACKEND(obj);
MachineState *machine = MACHINE(qdev_get_machine());
+ backend->ram_notifier.ram_block_remapped = host_memory_backend_ram_remapped;
+ ram_block_notifier_add(&backend->ram_notifier);
+
/* TODO: convert access to globals to compat properties */
backend->merge = machine_mem_merge(machine);
backend->dump = machine_dump_guest_core(machine);
@@ -379,6 +400,13 @@ static void host_memory_backend_post_init(Object *obj)
object_apply_compat_props(obj);
}
+static void host_memory_backend_finalize(Object *obj)
+{
+ HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+
+ ram_block_notifier_remove(&backend->ram_notifier);
+}
+
bool host_memory_backend_mr_inited(HostMemoryBackend *backend)
{
/*
@@ -595,6 +623,7 @@ static const TypeInfo host_memory_backend_info = {
.instance_size = sizeof(HostMemoryBackend),
.instance_init = host_memory_backend_init,
.instance_post_init = host_memory_backend_post_init,
+ .instance_finalize = host_memory_backend_finalize,
.interfaces = (InterfaceInfo[]) {
{ TYPE_USER_CREATABLE },
{ }
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index de47ae59e4..062a68c8fc 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -81,6 +81,7 @@ struct HostMemoryBackend {
HostMemPolicy policy;
MemoryRegion mr;
+ RAMBlockNotifier ram_notifier;
};
bool host_memory_backend_mr_inited(HostMemoryBackend *backend);
--
2.43.5
On 07.11.24 11:21, William Roche wrote:
> From: David Hildenbrand <david@redhat.com>
>
> Let's register a RAM block notifier and react on remap notifications.
> Simply re-apply the settings. Warn only when something goes wrong.
>
> Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. Could be
> that hostmem is still missing to update that flag ...
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
> backends/hostmem.c | 29 +++++++++++++++++++++++++++++
> include/sysemu/hostmem.h | 1 +
> 2 files changed, 30 insertions(+)
>
> diff --git a/backends/hostmem.c b/backends/hostmem.c
> index bf85d716e5..fbd8708664 100644
> --- a/backends/hostmem.c
> +++ b/backends/hostmem.c
> @@ -361,11 +361,32 @@ static void host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
> backend->prealloc_threads = value;
> }
>
> +static void host_memory_backend_ram_remapped(RAMBlockNotifier *n, void *host,
> + size_t offset, size_t size)
> +{
> + HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
> + ram_notifier);
> + Error *err = NULL;
> +
> + if (!host_memory_backend_mr_inited(backend) ||
> + memory_region_get_ram_ptr(&backend->mr) != host) {
> + return;
> + }
> +
> + host_memory_backend_apply_settings(backend, host + offset, size, &err);
> + if (err) {
> + warn_report_err(err);

I wonder if we want to fail hard instead, or have a way to tell the
notifier that something went wrong.

--
Cheers,

David / dhildenb
On 11/12/24 14:45, David Hildenbrand wrote:
> On 07.11.24 11:21, William Roche wrote:
>> From: David Hildenbrand <david@redhat.com>
>>
>> Let's register a RAM block notifier and react on remap notifications.
>> Simply re-apply the settings. Warn only when something goes wrong.
>>
>> Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. Could be
>> that hostmem is still missing to update that flag ...
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>> backends/hostmem.c | 29 +++++++++++++++++++++++++++++
>> include/sysemu/hostmem.h | 1 +
>> 2 files changed, 30 insertions(+)
>>
>> diff --git a/backends/hostmem.c b/backends/hostmem.c
>> index bf85d716e5..fbd8708664 100644
>> --- a/backends/hostmem.c
>> +++ b/backends/hostmem.c
>> @@ -361,11 +361,32 @@ static void
>> host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
>> backend->prealloc_threads = value;
>> }
>> +static void host_memory_backend_ram_remapped(RAMBlockNotifier *n,
>> void *host,
>> + size_t offset, size_t size)
>> +{
>> + HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
>> + ram_notifier);
>> + Error *err = NULL;
>> +
>> + if (!host_memory_backend_mr_inited(backend) ||
>> + memory_region_get_ram_ptr(&backend->mr) != host) {
>> + return;
>> + }
>> +
>> + host_memory_backend_apply_settings(backend, host + offset, size,
>> &err);
>> + if (err) {
>> + warn_report_err(err);
>
> I wonder if we want to fail hard instead, or have a way to tell the
> notifier that something went wrong.
>

It depends on what the caller would do with this information. Is there a
way to work around the problem? (I don't think so)
Can the VM continue to run without doing anything about it? (Maybe?)

Currently, none of the NUMA notifiers return errors.

This function is only called from ram_block_notify_remap() in
qemu_ram_remap(), so I would vote for a "fail hard" in the case where
the settings are mandatory to continue.

HTH.
On 12.11.24 19:17, William Roche wrote:
> On 11/12/24 14:45, David Hildenbrand wrote:
>> On 07.11.24 11:21, William Roche wrote:
>>> From: David Hildenbrand <david@redhat.com>
>>>
>>> Let's register a RAM block notifier and react on remap notifications.
>>> Simply re-apply the settings. Warn only when something goes wrong.
>>>
>>> Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. Could be
>>> that hostmem is still missing to update that flag ...
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>> ---
>>> backends/hostmem.c | 29 +++++++++++++++++++++++++++++
>>> include/sysemu/hostmem.h | 1 +
>>> 2 files changed, 30 insertions(+)
>>>
>>> diff --git a/backends/hostmem.c b/backends/hostmem.c
>>> index bf85d716e5..fbd8708664 100644
>>> --- a/backends/hostmem.c
>>> +++ b/backends/hostmem.c
>>> @@ -361,11 +361,32 @@ static void
>>> host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
>>> backend->prealloc_threads = value;
>>> }
>>> +static void host_memory_backend_ram_remapped(RAMBlockNotifier *n,
>>> void *host,
>>> + size_t offset, size_t size)
>>> +{
>>> + HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
>>> + ram_notifier);
>>> + Error *err = NULL;
>>> +
>>> + if (!host_memory_backend_mr_inited(backend) ||
>>> + memory_region_get_ram_ptr(&backend->mr) != host) {
>>> + return;
>>> + }
>>> +
>>> + host_memory_backend_apply_settings(backend, host + offset, size,
>>> &err);
>>> + if (err) {
>>> + warn_report_err(err);
>>
>> I wonder if we want to fail hard instead, or have a way to tell the
>> notifier that something went wrong.
>>
>
> It depends on what the caller would do with this information. Is there a
> way to workaround the problem ? (I don't think so)

Primarily only preallocation will fail, and that ...

> Can the VM continue to run without doing anything about it ? (Maybe?)
>

... will make QEMU crash at some point later (SIGBUS), which is very
bad.

> Currently all numa notifiers don't return errors.
>
> This function is only called from ram_block_notify_remap() in
> qemu_ram_remap(), I would vote for a "fail hard" in case where the
> settings are mandatory to continue.

"fail hard" is likely the best approach for now.

--
Cheers,

David / dhildenb
From: William Roche <william.roche@oracle.com>
The remap notification now handles the merging and dump settings, in
addition to the memory policy and preallocation, so drop the direct
calls from qemu_ram_remap().
If preallocation is set on a memory block, qemu_prealloc_mem() must
also be called after ram_block_discard_range() is used on this block.
Signed-off-by: William Roche <william.roche@oracle.com>
---
system/physmem.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/system/physmem.c b/system/physmem.c
index e72ca31451..72129d5b1b 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2225,8 +2225,6 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
length, addr);
exit(1);
}
- memory_try_enable_merging(vaddr, length);
- qemu_ram_setup_dump(vaddr, length);
}
ram_block_notify_remap(block->host, offset, length);
}
--
2.43.5