[v1] hugetlbfs largepage RAS project

[RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by William Roche 1 year, 5 months ago

From: William Roche <william.roche@oracle.com>


Apologies for the noise; resending as I missed CC'ing the maintainers of the
changed files


Hello,

This is a Qemu RFC to introduce the possibility to deal with hardware
memory errors impacting hugetlbfs memory backed VMs. When using
hugetlbfs large pages, any large page location being impacted by an
HW memory error results in poisoning the entire page, suddenly making
a large chunk of the VM memory unusable.

The implemented proposal is simply a memory mapping change when an HW error
is reported to Qemu, to transform a hugetlbfs large page into a set of
standard sized pages. The failed large page is unmapped and a set of
standard sized pages are mapped in place.
This mechanism is triggered when a SIGBUS/BUS_MCEERR_Ax signal is received
by qemu and the reported location corresponds to a large page.

This gives the possibility to:
- Take advantage of newer hypervisor kernel providing a way to retrieve
still valid data on the impacted hugetlbfs poisoned large page.
If the backend file is MAP_SHARED, we can copy the valid data into the
set of standard sized pages. But if an error is returned when accessing
a location we consider it poisoned and mark the corresponding standard sized
memory page as poisoned with a MADV_HWPOISON madvise call. Hence, the VM
can also continue to use the possible valid pieces of information retrieved.
- Adjust the poison address information. When accessing a poison location,
an older Kernel version may only provide the address of the beginning of
the poisoned large page in the associated SIGBUS siginfo data. Pointing to
a more accurate touched poison location allows the VM kernel to trigger
the right memory error reaction.

A warning is given for hugetlbfs backed memory-regions that are mapped
without the 'share=on' option.
(This warning is also given when using the deprecated "-mem-path" option)

The hugetlbfs memory mapping option should look like this
(with XXX replaced with the actual size):
  -object memory-backend-file,id=pc.ram,mem-path=/dev/hugepages,prealloc=on,share=on,size=XXX -machine memory-backend=pc.ram

I'm introducing new system/hugetlbfs_ras.[ch] files to separate the specific
code for this feature. It is only compiled on Linux hosts.

Note that we have to be able to mark as "poison" a replacing valid standard
sized page. We currently do that calling madvise(..., MADV_HWPOISON).
But this requires the qemu process to have the CAP_SYS_ADMIN privilege.
Using userfaultfd instead of madvise() to mark the pages as poison could
remove this constraint, at the cost of complicating the code by adding
thread(s) to service the user page faults.


It's also worth mentioning the IO memory, vfio configured memory buffers
case. The Qemu memory remapping (if it succeeds) will not reconfigure any
device IO buffers locations (no dma unmap/remap is performed) and if a
hardware IO is supposed to access (read or write) a poisoned hugetlbfs
page, I would expect it to fail the same way as before (as its location
hasn't been updated to take into account the new mapping).
But can someone confirm this possible behavior? Or tell me what should
be done to deal with this type of memory buffers?

Details:
--------
The following problems had to be considered:

. kvm dealing with memory faults:
 - Address space mapping changes can't be handled in a signal handler (mmap
   is not async signal safe for example)
     We have a separate listener thread (only created when we use hugetlbfs)
     to deal with the mapping changes.
 - If a memory is not mapped when accessed, kvm fails with
   (exit_reason: KVM_EXIT_UNKNOWN)
     To avoid that, I needed to prevent the access to a changing memory
     region: pausing the VM is used to do so.
 - A fault on a poisoned hugetlbfs large page will report a hardcoded page
   size of 4k (See kernel kvm_send_hwpoison_signal() function)
     When a SIGBUS is received with a page size indication of 4k we have to
     verify if the impacted page is not a hugetlbfs page.
 - Asynchronous SIGBUS/BUS_MCEERR_AO signals provide the right page size,
   but the current Qemu version needs to take the information into account.

. system/physmem needed fixes:
 - When recreating the memory mapping on VM reset, we have to consider the
   memory size impacted.
 - In the case of a mapped file, punching a hole is necessary to clean the
   poison.

. Implementation details:
 - A SIGBUS signal received for a large page will trigger the page modification,
   but in order to pause the VM, the signal handlers have to terminate.
     So we return from the SIGBUS signal handler(s) when a VM has to be stopped.
     A memory access that generated a SIGBUS/BUS_MCEERR_AR signal before the
     VM pause will be repeated when the VM resumes. If the memory is still
     not accessible (poisoned) the signal will be generated again by the
     hypervisor kernel.
     In the case of an asynchronous SIGBUS/BUS_MCEERR_AO signal, the signal is
     not repeated by the kernel and will be recorded by qemu in order to be
     replayed when the VM resumes.
 - Poisoning a memory page with MADV_HWPOISON can generate a SIGBUS when
   called. The listener thread taking care of the memory modification needs
   to deal with this case. To do so, it sets a thread specific variable
   that is recognized by the sigbus handler.


Some questions:
---------------
. Should we take extra care for IO memory, vfio configured memory buffers ?

. My feature code is enclosed within "ifdef CONFIG_HUGETLBFS_RAS" and is only
  compiled on linux versions
  Should we have a configure option to prevent the introduction of this
  feature in the code (turning off CONFIG_HUGETLBFS_RAS) ?

. Should I include the content of my system/hugetlbfs_ras.[ch] files into
  another existing file ?

. Should we force 'sharing' when using "-mem-path" option, instead of the
  -object memory-backend-file,share=on,... ?


This prototype is scripts/checkpatch.pl clean (except for the MAINTAINERS
update for the 2 added files).
'make check' runs fine on both x86 and ARM.
Unit tests have been done on Intel, AMD and ARM platforms.



William Roche (6):
  accel/kvm: SIGBUS handler should also deal with si_addr_lsb
  accel/kvm: Keep track of the HWPoisonPage sizes
  system/physmem: Remap memory pages on reset based on the page size
  system: Introducing hugetlbfs largepage RAS feature
  system/hugetlb_ras: Handle madvise SIGBUS signal on listener
  system/hugetlb_ras: Replay lost BUS_MCEERR_AO signals on VM resume

 accel/kvm/kvm-all.c      |  24 +-
 accel/stubs/kvm-stub.c   |   4 +-
 include/qemu/osdep.h     |   5 +-
 include/sysemu/kvm.h     |   7 +-
 include/sysemu/kvm_int.h |   3 +-
 meson.build              |   2 +
 system/cpus.c            |  15 +-
 system/hugetlbfs_ras.c   | 645 +++++++++++++++++++++++++++++++++++++++
 system/hugetlbfs_ras.h   |   4 +
 system/meson.build       |   1 +
 system/physmem.c         |  30 ++
 target/arm/kvm.c         |  15 +-
 target/i386/kvm/kvm.c    |  15 +-
 util/oslib-posix.c       |   3 +
 14 files changed, 753 insertions(+), 20 deletions(-)
 create mode 100644 system/hugetlbfs_ras.c
 create mode 100644 system/hugetlbfs_ras.h

-- 
2.43.5

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by David Hildenbrand 1 year, 5 months ago

On 10.09.24 12:02, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 

Hi,

> 
> Apologies for the noise; resending as I missed CC'ing the maintainers of the
> changed files
> 
> 
> Hello,
> 
> This is a Qemu RFC to introduce the possibility to deal with hardware
> memory errors impacting hugetlbfs memory backed VMs. When using
> hugetlbfs large pages, any large page location being impacted by an
> HW memory error results in poisoning the entire page, suddenly making
> a large chunk of the VM memory unusable.
> 
> The implemented proposal is simply a memory mapping change when an HW error
> is reported to Qemu, to transform a hugetlbfs large page into a set of
> standard sized pages. The failed large page is unmapped and a set of
> standard sized pages are mapped in place.
> This mechanism is triggered when a SIGBUS/BUS_MCEERR_Ax signal is received
> by qemu and the reported location corresponds to a large page.
> 
> This gives the possibility to:
> - Take advantage of newer hypervisor kernel providing a way to retrieve
> still valid data on the impacted hugetlbfs poisoned large page.
> If the backend file is MAP_SHARED, we can copy the valid data into the

How are you dealing with other consumers of the shared memory, such as 
vhost-user processes, vm migration whereby RAM is migrated using file 
content, vfio that might have these pages pinned?

In general, you cannot simply replace pages by private copies when 
somebody else might be relying on these pages to go to actual guest RAM.

It sounds very hacky and incomplete at first.


-- 
Cheers,

David / dhildenb

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by William Roche 1 year, 5 months ago

On 9/10/24 13:36, David Hildenbrand wrote:

> On 10.09.24 12:02, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>
> Hi,
>
>>
>> Apologies for the noise; resending as I missed CC'ing the maintainers 
>> of the
>> changed files
>>
>>
>> Hello,
>>
>> This is a Qemu RFC to introduce the possibility to deal with hardware
>> memory errors impacting hugetlbfs memory backed VMs. When using
>> hugetlbfs large pages, any large page location being impacted by an
>> HW memory error results in poisoning the entire page, suddenly making
>> a large chunk of the VM memory unusable.
>>
>> The implemented proposal is simply a memory mapping change when an HW 
>> error
>> is reported to Qemu, to transform a hugetlbfs large page into a set of
>> standard sized pages. The failed large page is unmapped and a set of
>> standard sized pages are mapped in place.
>> This mechanism is triggered when a SIGBUS/BUS_MCEERR_Ax signal is 
>> received
>> by qemu and the reported location corresponds to a large page.
>>
>> This gives the possibility to:
>> - Take advantage of newer hypervisor kernel providing a way to retrieve
>> still valid data on the impacted hugetlbfs poisoned large page.
>> If the backend file is MAP_SHARED, we can copy the valid data into the

Thank you David for this first reaction on this proposal.

> How are you dealing with other consumers of the shared memory,
> such as vhost-user processes,

In the current proposal, I don't deal with this aspect.
In fact, any other process sharing the changed memory will
continue to map the poisoned large page. So any access to
this page will generate a SIGBUS to this other process.

In this situation vhost-user processes should continue to receive
SIGBUS signals (and probably continue to die because of that).

So I do see a real problem if 2 qemu processes are sharing the
same hugetlbfs segment -- in this case, error recovery should not
occur on this piece of the memory. Maybe dealing with this situation
with "ivshmem" options is doable (marking the shared segment
"not eligible" to hugetlbfs recovery, just like not "share=on"
hugetlbfs entries are not eligible)
-- I need to think about this specific case.

Please let me know if there is a better way to deal with this
shared memory aspect and have a better system reaction.

> vm migration whereby RAM is migrated using file content,

Migration doesn't currently work with memory poisoning.
You can give a look at the already integrated following commit:

06152b89db64 migration: prevent migration when VM has poisoned memory

This proposal doesn't change anything on this side.

> vfio that might have these pages pinned?

AFAIK even pinned memory can be impacted by memory error and poisoned
by the kernel. Now as I said in the cover letter, I'd like to know if
we should take extra care for IO memory, vfio configured memory buffers...

> In general, you cannot simply replace pages by private copies
> when somebody else might be relying on these pages to go to
> actual guest RAM.

This is correct, but the current proposal is dealing with a specific
shared memory type: poisoned large pages. So any other process mapping
this type of page can't access it without generating a SIGBUS.

> It sounds very hacky and incomplete at first.

As you can see, RAS features need to be completed.
And if this proposal is incomplete, what other changes should be
done to complete it ?

I do hope we can discuss this RFC to adapt what is incorrect, or
find a better way to address this situation.

Thanks in advance for your feedback,
William.

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by David Hildenbrand 1 year, 4 months ago

Hi again,

>>> This is a Qemu RFC to introduce the possibility to deal with hardware
>>> memory errors impacting hugetlbfs memory backed VMs. When using
>>> hugetlbfs large pages, any large page location being impacted by an
>>> HW memory error results in poisoning the entire page, suddenly making
>>> a large chunk of the VM memory unusable.
>>>
>>> The implemented proposal is simply a memory mapping change when an HW
>>> error
>>> is reported to Qemu, to transform a hugetlbfs large page into a set of
>>> standard sized pages. The failed large page is unmapped and a set of
>>> standard sized pages are mapped in place.
>>> This mechanism is triggered when a SIGBUS/BUS_MCEERR_Ax signal is
>>> received
>>> by qemu and the reported location corresponds to a large page.

One clarifying question: you simply replace the hugetlb page by multiple 
small pages using mmap(MAP_FIXED). So you

(a) are not able to recover any memory of the original page (as of now)
(b) no longer have a hugetlb page and, therefore, possibly a performance
     degradation, relevant in low-latency applications that really care
     about the usage of hugetlb pages.
(c) run into the described inconsistency issues

Why is what you propose beneficial over just fallocate(PUNCH_HOLE) the 
full page and get a fresh, non-poisoned page instead?

Sure, you have to reserve some pages if that ever happens, but what is 
the big selling point over PUNCH_HOLE + realloc? (sorry if I missed it 
and it was spelled out)

>>>
>>> This gives the possibility to:
>>> - Take advantage of newer hypervisor kernel providing a way to retrieve
>>> still valid data on the impacted hugetlbfs poisoned large page.

Reading that again, that shouldn't have to be hypervisor-specific. 
Really, if someone were to extract data from a poisoned hugetlb folio, 
it shouldn't be hypervisor-specific. The kernel should be able to know 
which regions are accessible and could allow ways for reading these, one 
way or the other.

It could just be a fairly hugetlb-special feature that would replace the 
poisoned page by a fresh hugetlb page where as much page content as 
possible has been recovered from the old one.

>>> If the backend file is MAP_SHARED, we can copy the valid data into the
> 
> 
> Thank you David for this first reaction on this proposal.
> 
> 
>> How are you dealing with other consumers of the shared memory,
>> such as vhost-user processes,
> 
> 
> In the current proposal, I don't deal with this aspect.
> In fact, any other process sharing the changed memory will
> continue to map the poisoned large page. So any access to
> this page will generate a SIGBUS to this other process.
> 
> In this situation vhost-user processes should continue to receive
> SIGBUS signals (and probably continue to die because of that).

That's ... suboptimal. :)

Assume you have a 1 GiB page. The guest OS can happily allocate buffers 
in there so they can end up in vhost-user and crash that process. 
Without any warning.

> 
> So I do see a real problem if 2 qemu processes are sharing the
> same hugetlbfs segment -- in this case, error recovery should not
> occur on this piece of the memory. Maybe dealing with this situation
> with "ivshmem" options is doable (marking the shared segment
> "not eligible" to hugetlbfs recovery, just like not "share=on"
> hugetlbfs entries are not eligible)
> -- I need to think about this specific case.
> 
> Please let me know if there is a better way to deal with this
> shared memory aspect and have a better system reaction.

Not creating the inconsistency in the first place :)

>> vm migration whereby RAM is migrated using file content,
> 
> 
> Migration doesn't currently work with memory poisoning.
> You can give a look at the already integrated following commit:
> 
> 06152b89db64 migration: prevent migration when VM has poisoned memory
> 
> This proposal doesn't change anything on this side.

That commit is fairly fresh and likely missed the option to *not* 
migrate RAM by reading it, but instead by migrating it through a shared 
file. For example, VM live-upgrade (CPR) wants to use that (or is 
already using that), to avoid RAM migration completely.

> 
>> vfio that might have these pages pinned?
> 
> AFAIK even pinned memory can be impacted by memory error and poisoned
> by the kernel. Now as I said in the cover letter, I'd like to know if
> we should take extra care for IO memory, vfio configured memory buffers...

Assume your GPU has a hugetlb folio pinned via vfio. As soon as you make 
the guest RAM point at anything other than what VFIO is aware of, we end 
up in the same problem we had when we learned about having to disable 
balloon inflation (MADV_DONTNEED) as soon as VFIO pinned pages.

We'd have to inform VFIO that the mapping is now different. Otherwise 
it's really better to crash the VM than having your GPU read/write 
different data than your CPU reads/writes.

> 
> 
>> In general, you cannot simply replace pages by private copies
>> when somebody else might be relying on these pages to go to
>> actual guest RAM.
> 
> This is correct, but the current proposal is dealing with a specific
> shared memory type: poisoned large pages. So any other process mapping
> this type of page can't access it without generating a SIGBUS.

Right, and that's the issue. Because, for example, how should the VM be 
aware that this memory is now special and must not be used for some 
purposes without leading to problems elsewhere?

> 
> 
>> It sounds very hacky and incomplete at first.
> 
> As you can see, RAS features need to be completed.
> And if this proposal is incomplete, what other changes should be
> done to complete it ?
> 
> I do hope we can discuss this RFC to adapt what is incorrect, or
> find a better way to address this situation.

One long-term goal people are working on is to allow remapping the 
hugetlb folios in smaller granularity, such that only a single affected 
PTE can be marked as poisoned. (used to be called high-granularity-mapping)

However, at the same time, the focus seems to shift towards using 
guest_memfd instead of hugetlb, once it supports 1 GiB pages and shared 
memory. It will likely be easier to support mapping 1 GiB pages using 
PTEs that way, and there are ongoing discussions how that can be 
achieved more easily.

There are also discussions [1] about not poisoning the mappings at all 
and handling it differently. But I haven't yet digested what exactly that 
could look like in reality.


[1] https://lkml.kernel.org/r/20240828234958.GE3773488@nvidia.com

-- 
Cheers,

David / dhildenb

[PATCH v5 0/6] Poisoned memory recovery on reboot

Posted by William Roche 1 year ago

From: William Roche <william.roche@oracle.com>

Hello David,

I'm keeping the description of the patch set you already reviewed:
 ---
This set of patches fixes several problems with hardware memory errors
impacting hugetlbfs memory backed VMs and the generic memory recovery
on VM reset.
When using hugetlbfs large pages, any large page location being impacted
by an HW memory error results in poisoning the entire page, suddenly
making a large chunk of the VM memory unusable.

The main problem that currently exists in Qemu is the lack of backend
file repair before resetting the VM memory, leaving the impacted
memory silently unusable even after a VM reboot.

In order to fix this issue, we take into account the page size of the
impacted memory block when dealing with the associated poisoned page
location.

Using the page size information, we also try to regenerate the memory by
calling ram_block_discard_range() on VM reset when running
qemu_ram_remap(), so that poisoned memory backed by a hugetlbfs
file is regenerated by punching a hole in this file. A new page is
loaded when the location is first touched.

In case of a discard failure we fall back to remapping the memory
location. We also have to reset the memory settings and honor the
'prealloc' attribute.

This memory setting is performed by a new remap notification mechanism
calling host_memory_backend_ram_remapped() function when a region of
a memory block is remapped.

We also enrich the messages used to report a memory error relayed to
the VM, providing an identification of the memory page and its size
when a large page is impacted.
 ----

v4->v5
. Updated commit messages (for patches 1, 5 and 6)
. Fixed comment typo of patch 2
. Changed the fall back function parameters to match the
  ram_block_discard_range() function.
. Removed the unused case of remapping a file in this function
. Added the assert(block->fd < 0) in this function too
. I merged my patch 7 with your patch 6 (we only have 6 patches now)

This code is scripts/checkpatch.pl clean
'make check' runs clean on both x86 and ARM.


David Hildenbrand (3):
  numa: Introduce and use ram_block_notify_remap()
  hostmem: Factor out applying settings
  hostmem: Handle remapping of RAM

William Roche (3):
  system/physmem: handle hugetlb correctly in qemu_ram_remap()
  system/physmem: poisoned memory discard on reboot
  accel/kvm: Report the loss of a large memory page

 accel/kvm/kvm-all.c       |   2 +-
 backends/hostmem.c        | 189 +++++++++++++++++++++++---------------
 hw/core/numa.c            |  11 +++
 include/exec/cpu-common.h |   3 +-
 include/exec/ramlist.h    |   3 +
 include/system/hostmem.h  |   1 +
 system/physmem.c          |  82 ++++++++++++-----
 target/arm/kvm.c          |  13 +++
 target/i386/kvm/kvm.c     |  18 +++-
 9 files changed, 218 insertions(+), 104 deletions(-)

-- 
2.43.5

Re: [PATCH v5 0/6] Poisoned memory recovery on reboot

Posted by David Hildenbrand 1 year ago

On 10.01.25 22:13, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> Hello David,
> 
> I'm keeping the description of the patch set you already reviewed:

Hi,

one request, can you send it out next time (v6) *not* as reply to the 
previous thread, but just as a new thread to the ML?

This way, it doesn't all get buried in an RFC thread that a couple of 
people might just ignore.

-- 
Cheers,

David / dhildenb

Re: [PATCH v5 0/6] Poisoned memory recovery on reboot

Posted by William Roche 1 year ago

On 1/14/25 15:12, David Hildenbrand wrote:
> On 10.01.25 22:13, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> Hello David,
>>
>> I'm keeping the description of the patch set you already reviewed:
> 
> Hi,
> 
> one request, can you send it out next time (v6) *not* as reply to the 
> previous thread, but just as a new thread to the ML?
> 
> This way, it doesn't all get buried in an RFC thread that a couple of 
> people might just ignore.
> 

Sure, I'll just send a v6 version now.

[PATCH v5 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap()

Posted by William Roche 1 year ago

From: William Roche <william.roche@oracle.com>

The list of hwpoison pages used to remap the memory on reset
is based on the backend real page size. When dealing with
hugepages, we create a single entry for the entire page.

To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete
hugetlb page; hugetlb pages cannot be partially mapped.

Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c       |  6 +++++-
 include/exec/cpu-common.h |  3 ++-
 system/physmem.c          | 32 ++++++++++++++++++++++++++------
 3 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index c65b790433..4f2abd5774 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1288,7 +1288,7 @@ static void kvm_unpoison_all(void *param)
 
     QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
         QLIST_REMOVE(page, list);
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr);
         g_free(page);
     }
 }
@@ -1296,6 +1296,10 @@ static void kvm_unpoison_all(void *param)
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
+    size_t page_size = qemu_ram_pagesize_from_addr(ram_addr);
+
+    if (page_size > TARGET_PAGE_SIZE)
+        ram_addr = QEMU_ALIGN_DOWN(ram_addr, page_size);
 
     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index b1d76d6985..dbdf22fded 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -67,7 +67,7 @@ typedef uintptr_t ram_addr_t;
 
 /* memory API */
 
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
+void qemu_ram_remap(ram_addr_t addr);
 /* This should not be used by devices.  */
 ram_addr_t qemu_ram_addr_from_host(void *ptr);
 ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
@@ -108,6 +108,7 @@ bool qemu_ram_is_named_file(RAMBlock *rb);
 int qemu_ram_get_fd(RAMBlock *rb);
 
 size_t qemu_ram_pagesize(RAMBlock *block);
+size_t qemu_ram_pagesize_from_addr(ram_addr_t addr);
 size_t qemu_ram_pagesize_largest(void);
 
 /**
diff --git a/system/physmem.c b/system/physmem.c
index c76503aea8..7a87548f99 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1665,6 +1665,19 @@ size_t qemu_ram_pagesize(RAMBlock *rb)
     return rb->page_size;
 }
 
+/* Return backend real page size used for the given ram_addr */
+size_t qemu_ram_pagesize_from_addr(ram_addr_t addr)
+{
+    RAMBlock *rb;
+
+    RCU_READ_LOCK_GUARD();
+    rb =  qemu_get_ram_block(addr);
+    if (!rb) {
+        return TARGET_PAGE_SIZE;
+    }
+    return qemu_ram_pagesize(rb);
+}
+
 /* Returns the largest size of page in use */
 size_t qemu_ram_pagesize_largest(void)
 {
@@ -2167,17 +2180,22 @@ void qemu_ram_free(RAMBlock *block)
 }
 
 #ifndef _WIN32
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+void qemu_ram_remap(ram_addr_t addr)
 {
     RAMBlock *block;
     ram_addr_t offset;
     int flags;
     void *area, *vaddr;
     int prot;
+    size_t page_size;
 
     RAMBLOCK_FOREACH(block) {
         offset = addr - block->offset;
         if (offset < block->max_length) {
+            /* Respect the pagesize of our RAMBlock */
+            page_size = qemu_ram_pagesize(block);
+            offset = QEMU_ALIGN_DOWN(offset, page_size);
+
             vaddr = ramblock_ptr(block, offset);
             if (block->flags & RAM_PREALLOC) {
                 ;
@@ -2191,21 +2209,23 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                 prot = PROT_READ;
                 prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
                 if (block->fd >= 0) {
-                    area = mmap(vaddr, length, prot, flags, block->fd,
+                    area = mmap(vaddr, page_size, prot, flags, block->fd,
                                 offset + block->fd_offset);
                 } else {
                     flags |= MAP_ANONYMOUS;
-                    area = mmap(vaddr, length, prot, flags, -1, 0);
+                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
                 }
                 if (area != vaddr) {
                     error_report("Could not remap addr: "
                                  RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
-                                 length, addr);
+                                 page_size, addr);
                     exit(1);
                 }
-                memory_try_enable_merging(vaddr, length);
-                qemu_ram_setup_dump(vaddr, length);
+                memory_try_enable_merging(vaddr, page_size);
+                qemu_ram_setup_dump(vaddr, page_size);
             }
+
+            break;
         }
     }
 }
-- 
2.43.5

Re: [PATCH v5 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap()

Posted by David Hildenbrand 1 year ago

On 10.01.25 22:14, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> The list of hwpoison pages used to remap the memory on reset
> is based on the backend real page size. When dealing with
> hugepages, we create a single entry for the entire page.
> 
> To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete
> hugetlb page; hugetlb pages cannot be partially mapped.
> 
> Co-developed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---

See my comments to v4 version and my patch proposal.

-- 
Cheers,

David / dhildenb

Re: [PATCH v5 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap()

Posted by William Roche 1 year ago

On 1/14/25 15:02, David Hildenbrand wrote:
> On 10.01.25 22:14, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> The list of hwpoison pages used to remap the memory on reset
>> is based on the backend real page size. When dealing with
>> hugepages, we create a single entry for the entire page.
>>
>> To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete
>> hugetlb page; hugetlb pages cannot be partially mapped.
>>
>> Co-developed-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
> 
> See my comments to v4 version and my patch proposal.

I'm copying and answering your comments here:


On 1/14/25 14:56, David Hildenbrand wrote:
> On 10.01.25 21:56, William Roche wrote:
>> On 1/8/25 22:34, David Hildenbrand wrote:
>>> On 14.12.24 14:45, William Roche wrote:
>>>> From: William Roche <william.roche@oracle.com>
>>>> [...]
>>>> @@ -1286,6 +1286,10 @@ static void kvm_unpoison_all(void *param)
>>>>    void kvm_hwpoison_page_add(ram_addr_t ram_addr)
>>>>    {
>>>>        HWPoisonPage *page;
>>>> +    size_t page_size = qemu_ram_pagesize_from_addr(ram_addr);
>>>> +
>>>> +    if (page_size > TARGET_PAGE_SIZE)
>>>> +        ram_addr = QEMU_ALIGN_DOWN(ram_addr, page_size);
>>>
>>> Is that part still required? I thought it would be sufficient (at least
>>> in the context of this patch) to handle it all in qemu_ram_remap().
>>>
>>> qemu_ram_remap() will calculate the range to process based on the
>>> RAMBlock page size. IOW, the QEMU_ALIGN_DOWN() we do now in
>>> qemu_ram_remap().
>>>
>>> Or am I missing something?
>>>
>>> (sorry if we discussed that already; if there is a good reason it might
>>> make sense to state it in the patch description)
>>
>> You are right, but at this patch level we still need to round up the
> 
> s/round up/align_down/
> 
>> address and doing it here is small enough.
> 
> Let me explain.
> 
> qemu_ram_remap() in this patch here doesn't need an aligned addr. It
> will compute the offset into the block and align that down.
> 
> The only case where we need the addr besides from that is the
> error_report(), where I am not 100% sure if that is actually what we
> want to print. We want to print something like ram_block_discard_range().
> 
> 
> Note that ram_addr_t is a weird, separate address space. The alignment
> does not have any guarantees / semantics there.
> 
> 
> See ram_block_add() where we set
>      new_block->offset = find_ram_offset(new_block->max_length);
> 
> independent of any other RAMBlock properties.
> 
> The only alignment we do is
>      candidate = ROUND_UP(candidate, BITS_PER_LONG << TARGET_PAGE_BITS);
> 
> There is no guarantee that new_block->offset will be aligned to 1 GiB with
> a 1 GiB hugetlb mapping.
> 
> 
> Note that there is another conceptual issue in this function: offset
> should be of type uint64_t, it's not really ram_addr_t, but an
> offset into the RAMBlock.

Ok.

> 
>> Of course, the code changes on patch 3/7 where we change both x86 and
>> ARM versions of the code to align the memory pointer correctly in both
>> cases.
> 
> Thinking about it more, we should never try aligning ram_addr_t, only
> the offset into the memory block or the virtual address.
> 
> So please remove this ram_addr_t alignment from this patch, and look into
> aligning the virtual address / offset for the other user. Again, aligning
> ram_addr_t is not guaranteed to work correctly.
> 

Thanks for the technical details.

The ram_addr_t value alignment on the beginning of the page was useful
to create a single entry in the hwpoison_page_list for a large page, but
I understand that this use of ram_addr alignment may not always be accurate.
Removing this alignment (without replacing it with something else) will
end up creating several page entries in this list for the same hugetlb
page, because when we lose a large page, we can receive several MCEs
for the sub-page locations touched on this large page before the VM crashes.
So the recovery phase on reset will go through the list to discard/remap
all the entries, and the same hugetlb page can be treated several times.
When we had a single entry for a large page, this multiple
discard/remap did not occur.

Now, it could be technically acceptable to discard/remap a hugetlb page
several times. Other than not being optimal and taking time, the same
page being mapped or discarded multiple times doesn't seem to be a problem.
So we can leave the code like that without complicating it, for example by
adding block and offset attributes to the hwpoison_page_list entries.
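
To illustrate the alignment concern discussed above, here is a minimal
sketch (not QEMU code; struct demo_block and the demo_* helpers are
hypothetical stand-ins for RAMBlock and QEMU_ALIGN_DOWN). It shows how
aligning the ram_addr_t value directly and aligning the offset into the
block can disagree when the block's own offset is not aligned to the
page size:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a RAMBlock. */
struct demo_block {
    uint64_t offset;     /* start of the block in the ram_addr_t space */
    uint64_t max_length; /* size of the block in bytes */
    uint64_t page_size;  /* backend page size, must be a power of two */
};

/* Align v down to a power-of-two boundary. */
static uint64_t demo_align_down(uint64_t v, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return v & ~(align - 1);
}

/*
 * Start of the page containing ram_addr, computed by aligning the
 * offset into the block. The result is expressed in ram_addr_t space.
 */
static uint64_t demo_page_start_by_offset(const struct demo_block *b,
                                          uint64_t ram_addr)
{
    uint64_t off = ram_addr - b->offset;

    assert(off < b->max_length);
    return b->offset + demo_align_down(off, b->page_size);
}

/* Same computation, but aligning the ram_addr_t value itself. */
static uint64_t demo_page_start_by_ram_addr(const struct demo_block *b,
                                            uint64_t ram_addr)
{
    return demo_align_down(ram_addr, b->page_size);
}
```

For example, with a block starting at 0x10000000 and 1 GiB pages,
ram_addr 0x50001000 lies 0x40001000 bytes into the block. Aligning the
offset gives a page starting at 0x50000000, whereas aligning the
ram_addr itself gives 0x40000000, which belongs to a different huge page
of the same block.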

> 
> So the patch itself should probably be (- patch description):
> 
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 801cff16a5..8a47aa7258 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1278,7 +1278,7 @@ static void kvm_unpoison_all(void *param)
> 
>       QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>           QLIST_REMOVE(page, list);
> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
> +        qemu_ram_remap(page->ram_addr);
>           g_free(page);
>       }
>   }
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 638dc806a5..50a829d31f 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -67,7 +67,7 @@ typedef uintptr_t ram_addr_t;
> 
>   /* memory API */
> 
> -void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
> +void qemu_ram_remap(ram_addr_t addr);
>   /* This should not be used by devices.  */
>   ram_addr_t qemu_ram_addr_from_host(void *ptr);
>   ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
> diff --git a/system/physmem.c b/system/physmem.c
> index 03d3618039..355588f5d5 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2167,17 +2167,35 @@ void qemu_ram_free(RAMBlock *block)
>   }
> 
>   #ifndef _WIN32
> -void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
> +/*
> + * qemu_ram_remap - remap a single RAM page
> + *
> + * @addr: address in ram_addr_t address space.
> + *
> + * This function will try remapping a single page of guest RAM identified by
> + * @addr, essentially discarding memory to recover from previously poisoned
> + * memory (MCE). The page size depends on the RAMBlock (i.e., hugetlb). @addr
> + * does not have to point at the start of the page.
> + *
> + * This function is only to be used during system resets; it will kill the
> + * VM if remapping failed.
> + */
> +void qemu_ram_remap(ram_addr_t addr)
>   {
>       RAMBlock *block;
> -    ram_addr_t offset;
> +    uint64_t offset;
>       int flags;
>       void *area, *vaddr;
>       int prot;
> +    size_t page_size;
> 
>       RAMBLOCK_FOREACH(block) {
>           offset = addr - block->offset;
>           if (offset < block->max_length) {
> +            /* Respect the pagesize of our RAMBlock */
> +            page_size = qemu_ram_pagesize(block);
> +            offset = QEMU_ALIGN_DOWN(offset, page_size);
> +
>               vaddr = ramblock_ptr(block, offset);
>               if (block->flags & RAM_PREALLOC) {
>                   ;
> @@ -2191,21 +2209,22 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
>                   prot = PROT_READ;
>                   prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>                   if (block->fd >= 0) {
> -                    area = mmap(vaddr, length, prot, flags, block->fd,
> +                    area = mmap(vaddr, page_size, prot, flags, block->fd,
>                                   offset + block->fd_offset);
>                   } else {
>                       flags |= MAP_ANONYMOUS;
> -                    area = mmap(vaddr, length, prot, flags, -1, 0);
> +                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
>                   }
>                   if (area != vaddr) {
> -                    error_report("Could not remap addr: "
> -                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> -                                 length, addr);
> +                    error_report("Could not remap RAM %s:%" PRIx64 " +%zx",
> +                                 block->idstr, offset, page_size);
>                       exit(1);
>                   }
> -                memory_try_enable_merging(vaddr, length);
> -                qemu_ram_setup_dump(vaddr, length);
> +                memory_try_enable_merging(vaddr, page_size);
> +                qemu_ram_setup_dump(vaddr, page_size);
>               }
> +
> +            break;
>           }
>       }
>   }

I'll use your suggested changes, Thanks.

Re: [PATCH v5 1/6] system/physmem: handle hugetlb correctly in qemu_ram_remap()

Posted by David Hildenbrand 1 year ago

On 27.01.25 22:16, William Roche wrote:
> On 1/14/25 15:02, David Hildenbrand wrote:
>> On 10.01.25 22:14, William Roche wrote:
>>> From: William Roche <william.roche@oracle.com>
>>>
>>> The list of hwpoison pages used to remap the memory on reset
>>> is based on the backend real page size. When dealing with
>>> hugepages, we create a single entry for the entire page.
>>>
>>> To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete
>>> hugetlb page; hugetlb pages cannot be partially mapped.
>>>
>>> Co-developed-by: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>> ---
>>
>> See my comments to v4 version and my patch proposal.
> 
> I'm copying and answering your comments here:
> 
> 
> On 1/14/25 14:56, David Hildenbrand wrote:
>> On 10.01.25 21:56, William Roche wrote:
>>> On 1/8/25 22:34, David Hildenbrand wrote:
>>>> On 14.12.24 14:45, William Roche wrote:
>>>>> From: William Roche <william.roche@oracle.com>
>>>>> [...]
>>>>> @@ -1286,6 +1286,10 @@ static void kvm_unpoison_all(void *param)
>>>>>     void kvm_hwpoison_page_add(ram_addr_t ram_addr)
>>>>>     {
>>>>>         HWPoisonPage *page;
>>>>> +    size_t page_size = qemu_ram_pagesize_from_addr(ram_addr);
>>>>> +
>>>>> +    if (page_size > TARGET_PAGE_SIZE)
>>>>> +        ram_addr = QEMU_ALIGN_DOWN(ram_addr, page_size);
>>>>
>>>> Is that part still required? I thought it would be sufficient (at least
>>>> in the context of this patch) to handle it all in qemu_ram_remap().
>>>>
>>>> qemu_ram_remap() will calculate the range to process based on the
>>>> RAMBlock page size. IOW, the QEMU_ALIGN_DOWN() we do now in
>>>> qemu_ram_remap().
>>>>
>>>> Or am I missing something?
>>>>
>>>> (sorry if we discussed that already; if there is a good reason it might
>>>> make sense to state it in the patch description)
>>>
>>> You are right, but at this patch level we still need to round up the
>>
>> s/round up/align_down/
>>
>>> address and doing it here is small enough.
>>
>> Let me explain.
>>
>> qemu_ram_remap() in this patch here doesn't need an aligned addr. It
>> will compute the offset into the block and align that down.
>>
>> The only other place where we need the addr is the
>> error_report(), where I am not 100% sure if that is actually what we
>> want to print. We want to print something like ram_block_discard_range().
>>
>>
>> Note that ram_addr_t is a weird, separate address space. The alignment
>> does not have any guarantees / semantics there.
>>
>>
>> See ram_block_add() where we set
>>       new_block->offset = find_ram_offset(new_block->max_length);
>>
>> independent of any other RAMBlock properties.
>>
>> The only alignment we do is
>>       candidate = ROUND_UP(candidate, BITS_PER_LONG << TARGET_PAGE_BITS);
>>
>> There is no guarantee that new_block->offset will be aligned to 1 GiB with
>> a 1 GiB hugetlb mapping.
>>
>>
>> Note that there is another conceptual issue in this function: offset
>> should be of type uint64_t, it's not really ram_addr_t, but an
>> offset into the RAMBlock.
> 
> Ok.
> 
>>
>>> Of course, the code changes on patch 3/7 where we change both x86 and
>>> ARM versions of the code to align the memory pointer correctly in both
>>> cases.
>>
>> Thinking about it more, we should never try aligning ram_addr_t, only
>> the offset into the memory block or the virtual address.
>>
>> So please remove this ram_addr_t alignment from this patch, and look into
>> aligning the virtual address / offset for the other user. Again, aligning
>> ram_addr_t is not guaranteed to work correctly.
>>
> 
> Thanks for the technical details.
> 
> The ram_addr_t value alignment on the beginning of the page was useful
> to create a single entry in the hwpoison_page_list for a large page, but
> I understand that this use of ram_addr alignment may not always be accurate.
> Removing this alignment (without replacing it with something else) will
> end up creating several page entries in this list for the same hugetlb
> page, because when we lose a large page, we can receive several MCEs
> for the sub-page locations touched on this large page before the VM crashes.

Right, although the kernel will currently only send a single event IIRC. At
least for hugetlb.

> So the recovery phase on reset will go through the list to discard/remap
> all the entries, and the same hugetlb page can be treated several times.
> When we had a single entry for a large page, this multiple
> discard/remap did not occur.
> 
> Now, it could be technically acceptable to discard/remap a hugetlb page
> several times. Other than not being optimal and taking time, the same
> page being mapped or discarded multiple times doesn't seem to be a problem.
> So we can leave the code like that without complicating it, for example by
> adding block and offset attributes to the hwpoison_page_list entries.

Right, this is something to optimize when it really becomes a problem I 
think.
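
Should it ever become a problem, one possible shape for such an
optimization is sketched below. This is only an illustration under
assumptions, not a proposal for the actual list: struct
demo_poison_entry, struct demo_poison_list and demo_poison_add() are
hypothetical, the block is identified by a plain integer, and a fixed
array stands in for the QLIST. Entries are keyed by (block, page-aligned
offset into the block), so several errors on the same huge page collapse
into one entry without aligning any ram_addr_t value:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ENTRIES 16

/* Hypothetical list entry: block identifier plus aligned block offset. */
struct demo_poison_entry {
    int block_id;
    uint64_t page_offset;
};

/* Hypothetical fixed-size list standing in for hwpoison_page_list. */
struct demo_poison_list {
    struct demo_poison_entry e[DEMO_MAX_ENTRIES];
    size_t n;
};

/*
 * Record a poisoned location. Returns true if a new entry was added,
 * false if the page was already recorded or the list is full.
 */
static bool demo_poison_add(struct demo_poison_list *l, int block_id,
                            uint64_t offset, uint64_t page_size)
{
    uint64_t page_offset;
    size_t i;

    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    /* Align the offset into the block, never the ram_addr_t value. */
    page_offset = offset & ~(page_size - 1);

    for (i = 0; i < l->n; i++) {
        if (l->e[i].block_id == block_id &&
            l->e[i].page_offset == page_offset) {
            return false;
        }
    }
    if (l->n == DEMO_MAX_ENTRIES) {
        return false;
    }
    l->e[l->n].block_id = block_id;
    l->e[l->n].page_offset = page_offset;
    l->n++;
    return true;
}
```

With 2 MiB pages, offsets 0x201000 and 0x3ff000 in the same block both
map to page offset 0x200000, so only the first call adds an entry.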

-- 
Cheers,

David / dhildenb

[PATCH v5 2/6] system/physmem: poisoned memory discard on reboot

Posted by William Roche 1 year ago

From: William Roche <william.roche@oracle.com>

Repair poisoned memory location(s) by calling ram_block_discard_range(),
which punches a hole in the backend file when necessary and regenerates
usable memory.
If the kernel doesn't support the madvise calls used by this function
and we are dealing with anonymous memory, fall back to remapping the
location(s).

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/physmem.c | 57 ++++++++++++++++++++++++++++++------------------
 1 file changed, 36 insertions(+), 21 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 7a87548f99..ae1caa97d8 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2180,13 +2180,32 @@ void qemu_ram_free(RAMBlock *block)
 }
 
 #ifndef _WIN32
+/* Simply remap the given VM memory location from start to start+length */
+static void qemu_ram_remap_mmap(RAMBlock *block, uint64_t start, size_t length)
+{
+    int flags, prot;
+    void *area;
+    void *host_startaddr = block->host + start;
+
+    assert(block->fd < 0);
+    flags = MAP_FIXED | MAP_ANONYMOUS;
+    flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE;
+    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
+    prot = PROT_READ;
+    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
+    area = mmap(host_startaddr, length, prot, flags, -1, 0);
+    if (area != host_startaddr) {
+        error_report("Could not remap addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                     length, start);
+        exit(1);
+    }
+}
+
 void qemu_ram_remap(ram_addr_t addr)
 {
     RAMBlock *block;
     ram_addr_t offset;
-    int flags;
-    void *area, *vaddr;
-    int prot;
+    void *vaddr;
     size_t page_size;
 
     RAMBLOCK_FOREACH(block) {
@@ -2202,24 +2221,20 @@ void qemu_ram_remap(ram_addr_t addr)
             } else if (xen_enabled()) {
                 abort();
             } else {
-                flags = MAP_FIXED;
-                flags |= block->flags & RAM_SHARED ?
-                         MAP_SHARED : MAP_PRIVATE;
-                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
-                prot = PROT_READ;
-                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
-                if (block->fd >= 0) {
-                    area = mmap(vaddr, page_size, prot, flags, block->fd,
-                                offset + block->fd_offset);
-                } else {
-                    flags |= MAP_ANONYMOUS;
-                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
-                }
-                if (area != vaddr) {
-                    error_report("Could not remap addr: "
-                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
-                                 page_size, addr);
-                    exit(1);
+                if (ram_block_discard_range(block, offset, page_size) != 0) {
+                    /*
+                     * Fall back to using mmap() only for anonymous mapping,
+                     * as if a backing file is associated we may not be able
+                     * to recover the memory in all cases.
+                     * So don't take the risk of using only mmap and fail now.
+                     */
+                    if (block->fd >= 0) {
+                        error_report("Memory poison recovery failure addr: "
+                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     page_size, addr);
+                        exit(1);
+                    }
+                    qemu_ram_remap_mmap(block, offset, page_size);
                 }
                 memory_try_enable_merging(vaddr, page_size);
                 qemu_ram_setup_dump(vaddr, page_size);
-- 
2.43.5

Re: [PATCH v5 2/6] system/physmem: poisoned memory discard on reboot

Posted by David Hildenbrand 1 year ago

On 10.01.25 22:14, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> Repair poisoned memory location(s) by calling ram_block_discard_range(),
> which punches a hole in the backend file when necessary and regenerates
> usable memory.
> If the kernel doesn't support the madvise calls used by this function
> and we are dealing with anonymous memory, fall back to remapping the
> location(s).
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   system/physmem.c | 57 ++++++++++++++++++++++++++++++------------------
>   1 file changed, 36 insertions(+), 21 deletions(-)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index 7a87548f99..ae1caa97d8 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2180,13 +2180,32 @@ void qemu_ram_free(RAMBlock *block)
>   }
>   
>   #ifndef _WIN32
> +/* Simply remap the given VM memory location from start to start+length */
> +static void qemu_ram_remap_mmap(RAMBlock *block, uint64_t start, size_t length)
> +{
> +    int flags, prot;
> +    void *area;
> +    void *host_startaddr = block->host + start;
> +
> +    assert(block->fd < 0);
> +    flags = MAP_FIXED | MAP_ANONYMOUS;
> +    flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE;
> +    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
> +    prot = PROT_READ;
> +    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
> +    area = mmap(host_startaddr, length, prot, flags, -1, 0);
> +    if (area != host_startaddr) {
> +        error_report("Could not remap addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                     length, start);
> +        exit(1);
> +    }

Can we return an error and have a single error printed in the caller?

return area != host_startaddr ? -errno : 0;

> +}
> +
>   void qemu_ram_remap(ram_addr_t addr)
>   {
>       RAMBlock *block;
>       ram_addr_t offset;
> -    int flags;
> -    void *area, *vaddr;
> -    int prot;
> +    void *vaddr;
>       size_t page_size;
>   
>       RAMBLOCK_FOREACH(block) {
> @@ -2202,24 +2221,20 @@ void qemu_ram_remap(ram_addr_t addr)
>               } else if (xen_enabled()) {
>                   abort();
>               } else {
> -                flags = MAP_FIXED;
> -                flags |= block->flags & RAM_SHARED ?
> -                         MAP_SHARED : MAP_PRIVATE;
> -                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
> -                prot = PROT_READ;
> -                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
> -                if (block->fd >= 0) {
> -                    area = mmap(vaddr, page_size, prot, flags, block->fd,
> -                                offset + block->fd_offset);
> -                } else {
> -                    flags |= MAP_ANONYMOUS;
> -                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
> -                }
> -                if (area != vaddr) {
> -                    error_report("Could not remap addr: "
> -                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> -                                 page_size, addr);
> -                    exit(1);
> +                if (ram_block_discard_range(block, offset, page_size) != 0) {
> +                    /*
> +                     * Fall back to using mmap() only for anonymous mapping,
> +                     * as if a backing file is associated we may not be able
> +                     * to recover the memory in all cases.
> +                     * So don't take the risk of using only mmap and fail now.
> +                     */
> +                    if (block->fd >= 0) {
> +                        error_report("Memory poison recovery failure addr: "
> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                                     page_size, addr);

See my error message proposal as reply to v4 patch #1: we should similarly
just print rb->idstr, offset and page_size like ram_block_discard_range() does.

ram_addr_t is weird and not helpful for users.


To have a single error

if (ram_block_discard_range(block, offset, page_size) != 0) {
	/* ...
	if (block->fd >= 0 || qemu_ram_remap_mmap(block, offset, page_size)) {
		error_report() ... // use proposal from my reply to v4 patch #1
		exit(1);
	}
}
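
A self-contained sketch of this "return an error, report once in the
caller" shape, assuming Linux/POSIX mmap(). The names demo_remap_anon()
and demo_recover_page() are hypothetical, the discard step is reduced to
a flag passed by the caller, and the sketch returns -1 where QEMU would
call exit(1):

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Replace [addr, addr + len) with fresh anonymous private memory.
 * Returns 0 on success or a negative errno value; prints nothing.
 */
static int demo_remap_anon(void *addr, size_t len)
{
    void *area = mmap(addr, len, PROT_READ | PROT_WRITE,
                      MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return area != addr ? -errno : 0;
}

/*
 * Caller side: the mmap() fallback is only attempted for anonymous
 * memory (fd < 0), and every failure path ends in one error message.
 * discard_failed stands in for a failing discard of the range.
 */
static int demo_recover_page(void *addr, size_t len, int fd,
                             int discard_failed)
{
    if (discard_failed) {
        if (fd >= 0 || demo_remap_anon(addr, len)) {
            fprintf(stderr, "Could not recover memory at %p +%zx\n",
                    addr, len);
            return -1;
        }
    }
    return 0;
}
```

Because demo_remap_anon() stays silent and only returns a status, the
file-backed case and the failed-mmap case share the same message in the
caller.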

-- 
Cheers,

David / dhildenb

Re: [PATCH v5 2/6] system/physmem: poisoned memory discard on reboot

Posted by William Roche 1 year ago

On 1/14/25 15:07, David Hildenbrand wrote:
> On 10.01.25 22:14, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> Repair poisoned memory location(s) by calling ram_block_discard_range(),
>> which punches a hole in the backend file when necessary and regenerates
>> usable memory.
>> If the kernel doesn't support the madvise calls used by this function
>> and we are dealing with anonymous memory, fall back to remapping the
>> location(s).
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   system/physmem.c | 57 ++++++++++++++++++++++++++++++------------------
>>   1 file changed, 36 insertions(+), 21 deletions(-)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 7a87548f99..ae1caa97d8 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -2180,13 +2180,32 @@ void qemu_ram_free(RAMBlock *block)
>>   }
>>   #ifndef _WIN32
>> +/* Simply remap the given VM memory location from start to start+length */
>> +static void qemu_ram_remap_mmap(RAMBlock *block, uint64_t start, size_t length)
>> +{
>> +    int flags, prot;
>> +    void *area;
>> +    void *host_startaddr = block->host + start;
>> +
>> +    assert(block->fd < 0);
>> +    flags = MAP_FIXED | MAP_ANONYMOUS;
>> +    flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE;
>> +    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
>> +    prot = PROT_READ;
>> +    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>> +    area = mmap(host_startaddr, length, prot, flags, -1, 0);
>> +    if (area != host_startaddr) {
>> +        error_report("Could not remap addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> +                     length, start);
>> +        exit(1);
>> +    }
> 
> Can we return an error and have a single error printed in the caller?
> 
> return area != host_startaddr ? -errno : 0;

Done.

> 
>> +}
>> +
>>   void qemu_ram_remap(ram_addr_t addr)
>>   {
>>       RAMBlock *block;
>>       ram_addr_t offset;
>> -    int flags;
>> -    void *area, *vaddr;
>> -    int prot;
>> +    void *vaddr;
>>       size_t page_size;
>>       RAMBLOCK_FOREACH(block) {
>> @@ -2202,24 +2221,20 @@ void qemu_ram_remap(ram_addr_t addr)
>>               } else if (xen_enabled()) {
>>                   abort();
>>               } else {
>> -                flags = MAP_FIXED;
>> -                flags |= block->flags & RAM_SHARED ?
>> -                         MAP_SHARED : MAP_PRIVATE;
>> -                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
>> -                prot = PROT_READ;
>> -                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>> -                if (block->fd >= 0) {
>> -                    area = mmap(vaddr, page_size, prot, flags, block->fd,
>> -                                offset + block->fd_offset);
>> -                } else {
>> -                    flags |= MAP_ANONYMOUS;
>> -                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
>> -                }
>> -                if (area != vaddr) {
>> -                    error_report("Could not remap addr: "
>> -                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> -                                 page_size, addr);
>> -                    exit(1);
>> +                if (ram_block_discard_range(block, offset, page_size) != 0) {
>> +                    /*
>> +                     * Fall back to using mmap() only for anonymous mapping,
>> +                     * as if a backing file is associated we may not be able
>> +                     * to recover the memory in all cases.
>> +                     * So don't take the risk of using only mmap and fail now.
>> +                     */
>> +                    if (block->fd >= 0) {
>> +                        error_report("Memory poison recovery failure addr: "
>> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> +                                     page_size, addr);
> 
> See my error message proposal as reply to v4 patch #1: we should similarly
> just print rb->idstr, offset and page_size like 
> ram_block_discard_range() does.
> 
> ram_addr_t is weird and not helpful for users.
> 
> 
> To have a single error
> 
> if (ram_block_discard_range(block, offset, page_size) != 0) {
>      /* ...
>      if (block->fd >= 0 || qemu_ram_remap_mmap(block, offset, page_size)) {
>          error_report() ... // use proposal from my reply to v4 patch #1
>          exit(1);
>      }
> }
> 

Done.

[PATCH v5 3/6] accel/kvm: Report the loss of a large memory page

Posted by William Roche 1 year ago

From: William Roche <william.roche@oracle.com>

When a large page is impacted by a memory error, enhance
the existing QEMU error message, which indicates that the error
is injected in the VM, by adding "on lost large page SIZE@ADDR".

Also add a similar message for the ARM platform.

In the case of a large page impacted, we now report:
...Memory Error at QEMU addr X and GUEST addr Y on lost large page SIZE@ADDR of type...

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c   |  4 ----
 target/arm/kvm.c      | 13 +++++++++++++
 target/i386/kvm/kvm.c | 18 ++++++++++++++----
 3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 4f2abd5774..f89568bfa3 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1296,10 +1296,6 @@ static void kvm_unpoison_all(void *param)
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
-    size_t page_size = qemu_ram_pagesize_from_addr(ram_addr);
-
-    if (page_size > TARGET_PAGE_SIZE)
-        ram_addr = QEMU_ALIGN_DOWN(ram_addr, page_size);
 
     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index a9444a2c7a..323ce0045d 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2366,6 +2366,8 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
 {
     ram_addr_t ram_addr;
     hwaddr paddr;
+    size_t page_size;
+    char lp_msg[54];
 
     assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
 
@@ -2373,6 +2375,14 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
+            page_size = qemu_ram_pagesize_from_addr(ram_addr);
+            if (page_size > TARGET_PAGE_SIZE) {
+                ram_addr = ROUND_DOWN(ram_addr, page_size);
+                snprintf(lp_msg, sizeof(lp_msg), " on lost large page "
+                    RAM_ADDR_FMT "@" RAM_ADDR_FMT "", page_size, ram_addr);
+            } else {
+                lp_msg[0] = '\0';
+            }
             kvm_hwpoison_page_add(ram_addr);
             /*
              * If this is a BUS_MCEERR_AR, we know we have been called
@@ -2389,6 +2399,9 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
                 kvm_cpu_synchronize_state(c);
                 if (!acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
                     kvm_inject_arm_sea(c);
+                    error_report("Guest Memory Error at QEMU addr %p and "
+                        "GUEST addr 0x%" HWADDR_PRIx "%s of type %s injected",
+                        addr, paddr, lp_msg, "BUS_MCEERR_AR");
                 } else {
                     error_report("failed to record the error");
                     abort();
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 2f66e63b88..7715cab7cf 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -741,6 +741,8 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
     CPUX86State *env = &cpu->env;
     ram_addr_t ram_addr;
     hwaddr paddr;
+    size_t page_size;
+    char lp_msg[54];
 
     /* If we get an action required MCE, it has been injected by KVM
      * while the VM was running.  An action optional MCE instead should
@@ -753,6 +755,14 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
+            page_size = qemu_ram_pagesize_from_addr(ram_addr);
+            if (page_size > TARGET_PAGE_SIZE) {
+                ram_addr = ROUND_DOWN(ram_addr, page_size);
+                snprintf(lp_msg, sizeof(lp_msg), " on lost large page "
+                        RAM_ADDR_FMT "@" RAM_ADDR_FMT "", page_size, ram_addr);
+            } else {
+                lp_msg[0] = '\0';
+            }
             kvm_hwpoison_page_add(ram_addr);
             kvm_mce_inject(cpu, paddr, code);
 
@@ -763,12 +773,12 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
              */
             if (code == BUS_MCEERR_AR) {
                 error_report("Guest MCE Memory Error at QEMU addr %p and "
-                    "GUEST addr 0x%" HWADDR_PRIx " of type %s injected",
-                    addr, paddr, "BUS_MCEERR_AR");
+                    "GUEST addr 0x%" HWADDR_PRIx "%s of type %s injected",
+                    addr, paddr, lp_msg, "BUS_MCEERR_AR");
             } else {
                  warn_report("Guest MCE Memory Error at QEMU addr %p and "
-                     "GUEST addr 0x%" HWADDR_PRIx " of type %s injected",
-                     addr, paddr, "BUS_MCEERR_AO");
+                     "GUEST addr 0x%" HWADDR_PRIx "%s of type %s injected",
+                     addr, paddr, lp_msg, "BUS_MCEERR_AO");
             }
 
             return;
-- 
2.43.5

Re: [PATCH v5 3/6] accel/kvm: Report the loss of a large memory page

Posted by David Hildenbrand 1 year ago

On 10.01.25 22:14, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> When a large page is impacted by a memory error, enhance
> the existing QEMU error message, which indicates that the error
> is injected in the VM, by adding "on lost large page SIZE@ADDR".
> 
> Also add a similar message for the ARM platform.
> 
> In the case of a large page impacted, we now report:
> ...Memory Error at QEMU addr X and GUEST addr Y on lost large page SIZE@ADDR of type...
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   accel/kvm/kvm-all.c   |  4 ----
>   target/arm/kvm.c      | 13 +++++++++++++
>   target/i386/kvm/kvm.c | 18 ++++++++++++++----
>   3 files changed, 27 insertions(+), 8 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 4f2abd5774..f89568bfa3 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1296,10 +1296,6 @@ static void kvm_unpoison_all(void *param)
>   void kvm_hwpoison_page_add(ram_addr_t ram_addr)
>   {
>       HWPoisonPage *page;
> -    size_t page_size = qemu_ram_pagesize_from_addr(ram_addr);
> -
> -    if (page_size > TARGET_PAGE_SIZE)
> -        ram_addr = QEMU_ALIGN_DOWN(ram_addr, page_size);
>   
>       QLIST_FOREACH(page, &hwpoison_page_list, list) {
>           if (page->ram_addr == ram_addr) {
> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> index a9444a2c7a..323ce0045d 100644
> --- a/target/arm/kvm.c
> +++ b/target/arm/kvm.c
> @@ -2366,6 +2366,8 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>   {
>       ram_addr_t ram_addr;
>       hwaddr paddr;
> +    size_t page_size;
> +    char lp_msg[54];
>   
>       assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
>   
> @@ -2373,6 +2375,14 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>           ram_addr = qemu_ram_addr_from_host(addr);
>           if (ram_addr != RAM_ADDR_INVALID &&
>               kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
> +            page_size = qemu_ram_pagesize_from_addr(ram_addr);
> +            if (page_size > TARGET_PAGE_SIZE) {
> +                ram_addr = ROUND_DOWN(ram_addr, page_size);
> +                snprintf(lp_msg, sizeof(lp_msg), " on lost large page "
> +                    RAM_ADDR_FMT "@" RAM_ADDR_FMT "", page_size, ram_addr);
> +            } else {
> +                lp_msg[0] = '\0';
> +            }
>               kvm_hwpoison_page_add(ram_addr);
>               /*
>                * If this is a BUS_MCEERR_AR, we know we have been called
> @@ -2389,6 +2399,9 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>                   kvm_cpu_synchronize_state(c);
>                   if (!acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
>                       kvm_inject_arm_sea(c);
> +                    error_report("Guest Memory Error at QEMU addr %p and "
> +                        "GUEST addr 0x%" HWADDR_PRIx "%s of type %s injected",
> +                        addr, paddr, lp_msg, "BUS_MCEERR_AR");
>                   } else {
>                       error_report("failed to record the error");
>                       abort();
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index 2f66e63b88..7715cab7cf 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -741,6 +741,8 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>       CPUX86State *env = &cpu->env;
>       ram_addr_t ram_addr;
>       hwaddr paddr;
> +    size_t page_size;
> +    char lp_msg[54];
>   
>       /* If we get an action required MCE, it has been injected by KVM
>        * while the VM was running.  An action optional MCE instead should
> @@ -753,6 +755,14 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>           ram_addr = qemu_ram_addr_from_host(addr);
>           if (ram_addr != RAM_ADDR_INVALID &&
>               kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
> +            page_size = qemu_ram_pagesize_from_addr(ram_addr);
> +            if (page_size > TARGET_PAGE_SIZE) {
> +                ram_addr = ROUND_DOWN(ram_addr, page_size);

As raised, aligning ram_addr_t addresses to page_size is wrong.

Maybe we really want to print block->idstr, offset, size like I proposed 
at the other place, here as well?

-- 
Cheers,

David / dhildenb

Re: [PATCH v5 3/6] accel/kvm: Report the loss of a large memory page

Posted by William Roche 1 year ago

On 1/14/25 15:09, David Hildenbrand wrote:
> On 10.01.25 22:14, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> In case of a large page impacted by a memory error, enhance
>> the existing Qemu error message which indicates that the error
>> is injected in the VM, adding "on lost large page SIZE@ADDR".
>>
>> Include also a similar message to the ARM platform.
>>
>> [...]
>> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
>> index a9444a2c7a..323ce0045d 100644
>> --- a/target/arm/kvm.c
>> +++ b/target/arm/kvm.c
>> @@ -2366,6 +2366,8 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>>   {
>>       ram_addr_t ram_addr;
>>       hwaddr paddr;
>> +    size_t page_size;
>> +    char lp_msg[54];
>>       assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
>> @@ -2373,6 +2375,14 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>>           ram_addr = qemu_ram_addr_from_host(addr);
>>           if (ram_addr != RAM_ADDR_INVALID &&
>>               kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
>> +            page_size = qemu_ram_pagesize_from_addr(ram_addr);
>> +            if (page_size > TARGET_PAGE_SIZE) {
>> +                ram_addr = ROUND_DOWN(ram_addr, page_size);
>> +                snprintf(lp_msg, sizeof(lp_msg), " on lost large page "
>> +                    RAM_ADDR_FMT "@" RAM_ADDR_FMT "", page_size, ram_addr);
>> +            } else {
>> +                lp_msg[0] = '\0';
>> +            }
>>               kvm_hwpoison_page_add(ram_addr);
>>               /*
>>                * If this is a BUS_MCEERR_AR, we know we have been called
>> @@ -2389,6 +2399,9 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>>                   kvm_cpu_synchronize_state(c);
>>                   if (!acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
>>                       kvm_inject_arm_sea(c);
>> +                    error_report("Guest Memory Error at QEMU addr %p and "
>> +                        "GUEST addr 0x%" HWADDR_PRIx "%s of type %s injected",
>> +                        addr, paddr, lp_msg, "BUS_MCEERR_AR");
>>                   } else {
>>                       error_report("failed to record the error");
>>                       abort();
>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>> index 2f66e63b88..7715cab7cf 100644
>> --- a/target/i386/kvm/kvm.c
>> +++ b/target/i386/kvm/kvm.c
>> @@ -741,6 +741,8 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>>       CPUX86State *env = &cpu->env;
>>       ram_addr_t ram_addr;
>>       hwaddr paddr;
>> +    size_t page_size;
>> +    char lp_msg[54];
>>       /* If we get an action required MCE, it has been injected by KVM
>>        * while the VM was running.  An action optional MCE instead should
>> @@ -753,6 +755,14 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>>           ram_addr = qemu_ram_addr_from_host(addr);
>>           if (ram_addr != RAM_ADDR_INVALID &&
>>               kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
>> +            page_size = qemu_ram_pagesize_from_addr(ram_addr);
>> +            if (page_size > TARGET_PAGE_SIZE) {
>> +                ram_addr = ROUND_DOWN(ram_addr, page_size);
> 
> As raised, aligning ram_addr_t addresses to page_size is wrong.
> 
> Maybe we really want to print block->idstr, offset, size like I proposed 
> at the other place, here as well?


Yes, we can collect the information from the block associated to this 
ram_addr. But instead of duplicating the necessary code into both i386 
and ARM, I came back to adding the change into the 
kvm_hwpoison_page_add() function called from both i386 and ARM specific 
code.

I also needed a new possibility to retrieve the information while we are 
dealing with the SIGBUS signal, and created a new function to gather the 
information from the RAMBlock:
qemu_ram_block_location_info_from_addr(ram_addr_t ram_addr,
                                        struct RAMBlockInfo *b_info)
with the associated struct.

So that we can use the RCU_READ_LOCK_GUARD() and retrieve all the data.
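
As an illustration of the point raised in review, here is a minimal sketch of
such a lookup. It is not the code from the patch: `toy_block`,
`toy_block_info` and `toy_block_info_from_addr()` are hypothetical stand-ins,
and the real RAMBlock walk would run under the RCU read lock. The sketch
aligns the offset inside the block to the block page size, rather than
aligning the ram_addr_t value itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a RAM block; not QEMU's RAMBlock. */
struct toy_block {
    const char *idstr;     /* block name */
    uint64_t ram_offset;   /* start of the block in the ram address space */
    uint64_t length;       /* block length in bytes */
    uint64_t page_size;    /* backing page size, a power of two */
};

/* Hypothetical result struct; the fields in the real patch may differ. */
struct toy_block_info {
    char idstr[64];
    uint64_t offset;       /* page-aligned offset inside the block */
    uint64_t page_size;
};

/* Returns 0 and fills *info if ram_addr falls in a block, -1 otherwise. */
static int toy_block_info_from_addr(const struct toy_block *blocks, size_t n,
                                    uint64_t ram_addr,
                                    struct toy_block_info *info)
{
    for (size_t i = 0; i < n; i++) {
        const struct toy_block *b = &blocks[i];

        if (ram_addr >= b->ram_offset &&
            ram_addr - b->ram_offset < b->length) {
            uint64_t off = ram_addr - b->ram_offset;

            assert(b->page_size && (b->page_size & (b->page_size - 1)) == 0);
            snprintf(info->idstr, sizeof(info->idstr), "%s", b->idstr);
            /* Align the offset within the block, not the ram address. */
            info->offset = off & ~(b->page_size - 1);
            info->page_size = b->page_size;
            return 0;
        }
    }
    return -1;
}
```

With a 2 MiB page block starting at ram offset 0x40001000, the address
0x4027a000 is at offset 0x279000 in the block, so the impacted page starts at
block offset 0x200000. Rounding the ram address itself down to 2 MiB would
give 0x40200000, which is block offset 0x1ff000 and not a page boundary of
the block.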


Note about ARM failing on large pages:
----------=====----------------------
In my tests, ARM VMs hit by a memory error on a large underlying memory
page can end up looping on reporting the error: the VM encountering the
error is likely to crash and may then try to save a vmcore in a kdump
phase.

This fix introduces qemu messages reporting errors when they are relayed
to the VM.
A large page poisoned by an error on ARM can make the VM loop in the
vmcore collection phase. Before the change, the console would show
messages like the following every 10 seconds:

  vvv
          Starting Kdump Vmcore Save Service...
[    3.095399] kdump[445]: Kdump is using the default log level(3).
[    3.173998] kdump[481]: saving to /sysroot/var/crash/127.0.0.1-2025-01-27-20:17:40/
[    3.189683] kdump[486]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2025-01-27-20:17:40/
[    3.213584] kdump[492]: saving vmcore-dmesg.txt complete
[    3.220295] kdump[494]: saving vmcore
[   10.029515] EDAC MC0: 1 UE unknown on unknown memory ( page:0x116c60 offset:0x0 grain:1 - APEI location: )
[   10.033647] [Firmware Warn]: GHES: Invalid address in generic error data: 0x116c60000
[   10.036974] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[   10.040514] {2}[Hardware Error]: event severity: recoverable
[   10.042911] {2}[Hardware Error]:  Error 0, type: recoverable
[   10.045310] {2}[Hardware Error]:   section_type: memory error
[   10.047666] {2}[Hardware Error]:   physical_address: 0x0000000116c60000
[   10.050486] {2}[Hardware Error]:   error_type: 0, unknown
[   20.053205] EDAC MC0: 1 UE unknown on unknown memory ( page:0x116c60 offset:0x0 grain:1 - APEI location: )
[   20.057416] [Firmware Warn]: GHES: Invalid address in generic error data: 0x116c60000
[   20.060781] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[   20.065472] {3}[Hardware Error]: event severity: recoverable
[   20.067878] {3}[Hardware Error]:  Error 0, type: recoverable
[   20.070273] {3}[Hardware Error]:   section_type: memory error
[   20.072686] {3}[Hardware Error]:   physical_address: 0x0000000116c60000
[   20.075590] {3}[Hardware Error]:   error_type: 0, unknown
  ^^^

With the fix, we now have a flood of messages like:

  vvv
qemu-system-aarch64: Memory Error on large page from ram-node1:d5e00000+0 +200000
qemu-system-aarch64: Guest Memory Error at QEMU addr 0xffff35c79000 and GUEST addr 0x115e79000 of type BUS_MCEERR_AR injected
qemu-system-aarch64: Memory Error on large page from ram-node1:d5e00000+0 +200000
qemu-system-aarch64: Guest Memory Error at QEMU addr 0xffff35c79000 and GUEST addr 0x115e79000 of type BUS_MCEERR_AR injected
qemu-system-aarch64: Memory Error on large page from ram-node1:d5e00000+0 +200000
qemu-system-aarch64: Guest Memory Error at QEMU addr 0xffff35c79000 and GUEST addr 0x115e79000 of type BUS_MCEERR_AR injected
  ^^^


In both cases, this situation loops indefinitely!

I'm just reporting a change of behavior. Fixing this issue would most
probably require VM kernel modifications or a work-around in qemu when
errors are reported too often, but that is out of the scope of the
current qemu fix.

Re: [PATCH v5 3/6] accel/kvm: Report the loss of a large memory page

Posted by David Hildenbrand 1 year ago

> Yes, we can collect the information from the block associated to this
> ram_addr. But instead of duplicating the necessary code into both i386
> and ARM, I came back to adding the change into the
> kvm_hwpoison_page_add() function called from both i386 and ARM specific
> code.
> 
> I also needed a new possibility to retrieve the information while we are
> dealing with the SIGBUS signal, and created a new function to gather the
> information from the RAMBlock:
> qemu_ram_block_location_info_from_addr(ram_addr_t ram_addr,
>                                          struct RAMBlockInfo *b_info)
> with the associated struct.
> 
> So that we can use the RCU_READ_LOCK_GUARD() and retrieve all the data.

Makes sense.

> 
> 
> Note about ARM failing on large pages:
> ----------=====----------------------
> I could test that ARM VMs impacted by memory errors on a large
> underlying memory page, can end up looping on reporting the error:
> The VM encountering an error has a high probability to crash and can try
> to save a vmcore with a kdump phase.

Yeah, that's what I thought. If you rip out 1 GiB of memory, your VM is 
going to have a bad time :/

> 
> This fix introduces qemu messages reporting errors when they are relayed
> to the VM.
> A large page being poisoned by an error on ARM can make a VM loop on the
> vmcore collection phase and the console would show messages like that
> appearing every 10 seconds (before the change):
> 
>    vvv
>            Starting Kdump Vmcore Save Service...
> [    3.095399] kdump[445]: Kdump is using the default log level(3).
> [    3.173998] kdump[481]: saving to
> /sysroot/var/crash/127.0.0.1-2025-01-27-20:17:40/
> [    3.189683] kdump[486]: saving vmcore-dmesg.txt to
> /sysroot/var/crash/127.0.0.1-2025-01-27-20:17:40/
> [    3.213584] kdump[492]: saving vmcore-dmesg.txt complete
> [    3.220295] kdump[494]: saving vmcore
> [   10.029515] EDAC MC0: 1 UE unknown on unknown memory ( page:0x116c60
> offset:0x0 grain:1 - APEI location: )
> [   10.033647] [Firmware Warn]: GHES: Invalid address in generic error
> data: 0x116c60000
> [   10.036974] {2}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 0
> [   10.040514] {2}[Hardware Error]: event severity: recoverable
> [   10.042911] {2}[Hardware Error]:  Error 0, type: recoverable
> [   10.045310] {2}[Hardware Error]:   section_type: memory error
> [   10.047666] {2}[Hardware Error]:   physical_address: 0x0000000116c60000
> [   10.050486] {2}[Hardware Error]:   error_type: 0, unknown
> [   20.053205] EDAC MC0: 1 UE unknown on unknown memory ( page:0x116c60
> offset:0x0 grain:1 - APEI location: )
> [   20.057416] [Firmware Warn]: GHES: Invalid address in generic error
> data: 0x116c60000
> [   20.060781] {3}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 0
> [   20.065472] {3}[Hardware Error]: event severity: recoverable
> [   20.067878] {3}[Hardware Error]:  Error 0, type: recoverable
> [   20.070273] {3}[Hardware Error]:   section_type: memory error
> [   20.072686] {3}[Hardware Error]:   physical_address: 0x0000000116c60000
> [   20.075590] {3}[Hardware Error]:   error_type: 0, unknown
>    ^^^
> 
> with the fix, we now have a flood of messages like:
> 
>    vvv
> qemu-system-aarch64: Memory Error on large page from
> ram-node1:d5e00000+0 +200000
> qemu-system-aarch64: Guest Memory Error at QEMU addr 0xffff35c79000 and
> GUEST addr 0x115e79000 of type BUS_MCEERR_AR injected
> qemu-system-aarch64: Memory Error on large page from
> ram-node1:d5e00000+0 +200000
> qemu-system-aarch64: Guest Memory Error at QEMU addr 0xffff35c79000 and
> GUEST addr 0x115e79000 of type BUS_MCEERR_AR injected
> qemu-system-aarch64: Memory Error on large page from
> ram-node1:d5e00000+0 +200000
> qemu-system-aarch64: Guest Memory Error at QEMU addr 0xffff35c79000 and
> GUEST addr 0x115e79000 of type BUS_MCEERR_AR injected
>    ^^^
> 
> 
> In both cases, this situation loops indefinitely !
> 
> I'm just informing of a change of behavior, fixing this issue would most
> probably require VM kernel modifications  or a work-around in qemu when
> errors are reported too often, but is out of the scope of this current
> qemu fix.

Agreed. I think one problem is that kdump cannot really cope with new 
memory errors (it tries to not touch pages that had a memory error in 
the old kernel).

Maybe this is also due to the fact that we inform the kernel only about 
a single page vanishing, whereby actually a whole 1 GiB is vanishing.

-- 
Cheers,

David / dhildenb

[PATCH v5 4/6] numa: Introduce and use ram_block_notify_remap()

Posted by William Roche 1 year ago

From: David Hildenbrand <david@redhat.com>

Notify registered listeners about the remap at the end of
qemu_ram_remap() so e.g., a memory backend can re-apply its
settings correctly.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 hw/core/numa.c         | 11 +++++++++++
 include/exec/ramlist.h |  3 +++
 system/physmem.c       |  1 +
 3 files changed, 15 insertions(+)

diff --git a/hw/core/numa.c b/hw/core/numa.c
index 218576f745..003bcd8a66 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -895,3 +895,14 @@ void ram_block_notify_resize(void *host, size_t old_size, size_t new_size)
         }
     }
 }
+
+void ram_block_notify_remap(void *host, size_t offset, size_t size)
+{
+    RAMBlockNotifier *notifier;
+
+    QLIST_FOREACH(notifier, &ram_list.ramblock_notifiers, next) {
+        if (notifier->ram_block_remapped) {
+            notifier->ram_block_remapped(notifier, host, offset, size);
+        }
+    }
+}
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
index d9cfe530be..c1dc785a57 100644
--- a/include/exec/ramlist.h
+++ b/include/exec/ramlist.h
@@ -72,6 +72,8 @@ struct RAMBlockNotifier {
                               size_t max_size);
     void (*ram_block_resized)(RAMBlockNotifier *n, void *host, size_t old_size,
                               size_t new_size);
+    void (*ram_block_remapped)(RAMBlockNotifier *n, void *host, size_t offset,
+                               size_t size);
     QLIST_ENTRY(RAMBlockNotifier) next;
 };
 
@@ -80,6 +82,7 @@ void ram_block_notifier_remove(RAMBlockNotifier *n);
 void ram_block_notify_add(void *host, size_t size, size_t max_size);
 void ram_block_notify_remove(void *host, size_t size, size_t max_size);
 void ram_block_notify_resize(void *host, size_t old_size, size_t new_size);
+void ram_block_notify_remap(void *host, size_t offset, size_t size);
 
 GString *ram_block_format(void);
 
diff --git a/system/physmem.c b/system/physmem.c
index ae1caa97d8..b8cd49a110 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2238,6 +2238,7 @@ void qemu_ram_remap(ram_addr_t addr)
                 }
                 memory_try_enable_merging(vaddr, page_size);
                 qemu_ram_setup_dump(vaddr, page_size);
+                ram_block_notify_remap(block->host, offset, page_size);
             }
 
             break;
-- 
2.43.5
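
The notification pattern used by this patch can be shown in isolation with a
small sketch. Everything below is hypothetical and self-contained
(`toy_notifier`, `toy_notify_remap()` and the sample listener are not QEMU
names); it only mirrors the idea that a listener may leave the remap callback
unset and is then skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature notifier; not QEMU's RAMBlockNotifier. */
struct toy_notifier {
    void (*remapped)(struct toy_notifier *n, void *host,
                     size_t offset, size_t size);
    struct toy_notifier *next;
};

static struct toy_notifier *toy_notifiers;

static void toy_notifier_add(struct toy_notifier *n)
{
    assert(n != NULL);
    n->next = toy_notifiers;
    toy_notifiers = n;
}

static void toy_notifier_remove(struct toy_notifier *n)
{
    for (struct toy_notifier **p = &toy_notifiers; *p; p = &(*p)->next) {
        if (*p == n) {
            *p = n->next;
            n->next = NULL;
            return;
        }
    }
}

/* Call every listener that registered a remap callback. */
static void toy_notify_remap(void *host, size_t offset, size_t size)
{
    for (struct toy_notifier *n = toy_notifiers; n; n = n->next) {
        if (n->remapped) {
            n->remapped(n, host, offset, size);
        }
    }
}

/* Sample listener: records the last range it was told about. */
static size_t toy_last_offset, toy_last_size;
static int toy_calls;

static void toy_on_remap(struct toy_notifier *n, void *host,
                         size_t offset, size_t size)
{
    (void)n;
    (void)host;
    toy_last_offset = offset;
    toy_last_size = size;
    toy_calls++;
}
```

Checking the callback pointer before calling it lets existing listeners that
only care about add, remove or resize events stay unchanged.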

[PATCH v5 5/6] hostmem: Factor out applying settings

Posted by William Roche 1 year ago

From: David Hildenbrand <david@redhat.com>

We want to reuse the functionality when remapping RAM.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 backends/hostmem.c | 155 ++++++++++++++++++++++++---------------------
 1 file changed, 82 insertions(+), 73 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index bceca1a8d9..46d80f98b4 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -36,6 +36,87 @@ QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_BIND != MPOL_BIND);
 QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_INTERLEAVE != MPOL_INTERLEAVE);
 #endif
 
+static void host_memory_backend_apply_settings(HostMemoryBackend *backend,
+                                               void *ptr, uint64_t size,
+                                               Error **errp)
+{
+    bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
+
+    if (backend->merge) {
+        qemu_madvise(ptr, size, QEMU_MADV_MERGEABLE);
+    }
+    if (!backend->dump) {
+        qemu_madvise(ptr, size, QEMU_MADV_DONTDUMP);
+    }
+#ifdef CONFIG_NUMA
+    unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
+    /* lastbit == MAX_NODES means maxnode = 0 */
+    unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
+    /*
+     * Ensure policy won't be ignored in case memory is preallocated
+     * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
+     * this doesn't catch hugepage case.
+     */
+    unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
+    int mode = backend->policy;
+
+    /*
+     * Check for invalid host-nodes and policies and give more verbose
+     * error messages than mbind().
+     */
+    if (maxnode && backend->policy == MPOL_DEFAULT) {
+        error_setg(errp, "host-nodes must be empty for policy default,"
+                   " or you should explicitly specify a policy other"
+                   " than default");
+        return;
+    } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
+        error_setg(errp, "host-nodes must be set for policy %s",
+                   HostMemPolicy_str(backend->policy));
+        return;
+    }
+
+    /*
+     * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
+     * as argument to mbind() due to an old Linux bug (feature?) which
+     * cuts off the last specified node. This means backend->host_nodes
+     * must have MAX_NODES+1 bits available.
+     */
+    assert(sizeof(backend->host_nodes) >=
+           BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
+    assert(maxnode <= MAX_NODES);
+
+#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
+    if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
+        /*
+         * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
+         * silently picks the first node.
+         */
+        mode = MPOL_PREFERRED_MANY;
+    }
+#endif
+
+    if (maxnode &&
+        mbind(ptr, size, mode, backend->host_nodes, maxnode + 1, flags)) {
+        if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
+            error_setg_errno(errp, errno,
+                             "cannot bind memory to host NUMA nodes");
+            return;
+        }
+    }
+#endif
+    /*
+     * Preallocate memory after the NUMA policy has been instantiated.
+     * This is necessary to guarantee memory is allocated with
+     * specified NUMA policy in place.
+     */
+    if (backend->prealloc &&
+        !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
+                           ptr, size, backend->prealloc_threads,
+                           backend->prealloc_context, async, errp)) {
+        return;
+    }
+}
+
 char *
 host_memory_backend_get_name(HostMemoryBackend *backend)
 {
@@ -337,7 +418,6 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
     void *ptr;
     uint64_t sz;
     size_t pagesize;
-    bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
 
     if (!bc->alloc) {
         return;
@@ -357,78 +437,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
         return;
     }
 
-    if (backend->merge) {
-        qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
-    }
-    if (!backend->dump) {
-        qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
-    }
-#ifdef CONFIG_NUMA
-    unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
-    /* lastbit == MAX_NODES means maxnode = 0 */
-    unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
-    /*
-     * Ensure policy won't be ignored in case memory is preallocated
-     * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
-     * this doesn't catch hugepage case.
-     */
-    unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
-    int mode = backend->policy;
-
-    /* check for invalid host-nodes and policies and give more verbose
-     * error messages than mbind(). */
-    if (maxnode && backend->policy == MPOL_DEFAULT) {
-        error_setg(errp, "host-nodes must be empty for policy default,"
-                   " or you should explicitly specify a policy other"
-                   " than default");
-        return;
-    } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
-        error_setg(errp, "host-nodes must be set for policy %s",
-                   HostMemPolicy_str(backend->policy));
-        return;
-    }
-
-    /*
-     * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
-     * as argument to mbind() due to an old Linux bug (feature?) which
-     * cuts off the last specified node. This means backend->host_nodes
-     * must have MAX_NODES+1 bits available.
-     */
-    assert(sizeof(backend->host_nodes) >=
-           BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
-    assert(maxnode <= MAX_NODES);
-
-#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
-    if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
-        /*
-         * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
-         * silently picks the first node.
-         */
-        mode = MPOL_PREFERRED_MANY;
-    }
-#endif
-
-    if (maxnode &&
-        mbind(ptr, sz, mode, backend->host_nodes, maxnode + 1, flags)) {
-        if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
-            error_setg_errno(errp, errno,
-                             "cannot bind memory to host NUMA nodes");
-            return;
-        }
-    }
-#endif
-    /*
-     * Preallocate memory after the NUMA policy has been instantiated.
-     * This is necessary to guarantee memory is allocated with
-     * specified NUMA policy in place.
-     */
-    if (backend->prealloc && !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
-                                                ptr, sz,
-                                                backend->prealloc_threads,
-                                                backend->prealloc_context,
-                                                async, errp)) {
-        return;
-    }
+    host_memory_backend_apply_settings(backend, ptr, sz, errp);
 }
 
 static bool
-- 
2.43.5

[PATCH v5 6/6] hostmem: Handle remapping of RAM

Posted by William Roche 1 year ago

From: David Hildenbrand <david@redhat.com>

Let's register a RAM block notifier and react on remap notifications.
Simply re-apply the settings. Exit if something goes wrong.

Merging and dump settings are handled by the remap notification
in addition to memory policy and preallocation.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 backends/hostmem.c       | 34 ++++++++++++++++++++++++++++++++++
 include/system/hostmem.h |  1 +
 system/physmem.c         |  4 ----
 3 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index 46d80f98b4..4589467c77 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -361,11 +361,37 @@ static void host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
     backend->prealloc_threads = value;
 }
 
+static void host_memory_backend_ram_remapped(RAMBlockNotifier *n, void *host,
+                                             size_t offset, size_t size)
+{
+    HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
+                                              ram_notifier);
+    Error *err = NULL;
+
+    if (!host_memory_backend_mr_inited(backend) ||
+        memory_region_get_ram_ptr(&backend->mr) != host) {
+        return;
+    }
+
+    host_memory_backend_apply_settings(backend, host + offset, size, &err);
+    if (err) {
+        /*
+         * If memory settings can't be successfully applied on remap,
+         * don't take the risk to continue without them.
+         */
+        error_report_err(err);
+        exit(1);
+    }
+}
+
 static void host_memory_backend_init(Object *obj)
 {
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
     MachineState *machine = MACHINE(qdev_get_machine());
 
+    backend->ram_notifier.ram_block_remapped = host_memory_backend_ram_remapped;
+    ram_block_notifier_add(&backend->ram_notifier);
+
     /* TODO: convert access to globals to compat properties */
     backend->merge = machine_mem_merge(machine);
     backend->dump = machine_dump_guest_core(machine);
@@ -379,6 +405,13 @@ static void host_memory_backend_post_init(Object *obj)
     object_apply_compat_props(obj);
 }
 
+static void host_memory_backend_finalize(Object *obj)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+
+    ram_block_notifier_remove(&backend->ram_notifier);
+}
+
 bool host_memory_backend_mr_inited(HostMemoryBackend *backend)
 {
     /*
@@ -595,6 +628,7 @@ static const TypeInfo host_memory_backend_info = {
     .instance_size = sizeof(HostMemoryBackend),
     .instance_init = host_memory_backend_init,
     .instance_post_init = host_memory_backend_post_init,
+    .instance_finalize = host_memory_backend_finalize,
     .interfaces = (InterfaceInfo[]) {
         { TYPE_USER_CREATABLE },
         { }
diff --git a/include/system/hostmem.h b/include/system/hostmem.h
index 5c21ca55c0..170849e8a4 100644
--- a/include/system/hostmem.h
+++ b/include/system/hostmem.h
@@ -83,6 +83,7 @@ struct HostMemoryBackend {
     HostMemPolicy policy;
 
     MemoryRegion mr;
+    RAMBlockNotifier ram_notifier;
 };
 
 bool host_memory_backend_mr_inited(HostMemoryBackend *backend);
diff --git a/system/physmem.c b/system/physmem.c
index b8cd49a110..306e71861d 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2205,7 +2205,6 @@ void qemu_ram_remap(ram_addr_t addr)
 {
     RAMBlock *block;
     ram_addr_t offset;
-    void *vaddr;
     size_t page_size;
 
     RAMBLOCK_FOREACH(block) {
@@ -2215,7 +2214,6 @@ void qemu_ram_remap(ram_addr_t addr)
             page_size = qemu_ram_pagesize(block);
             offset = QEMU_ALIGN_DOWN(offset, page_size);
 
-            vaddr = ramblock_ptr(block, offset);
             if (block->flags & RAM_PREALLOC) {
                 ;
             } else if (xen_enabled()) {
@@ -2236,8 +2234,6 @@ void qemu_ram_remap(ram_addr_t addr)
                     }
                     qemu_ram_remap_mmap(block, offset, page_size);
                 }
-                memory_try_enable_merging(vaddr, page_size);
-                qemu_ram_setup_dump(vaddr, page_size);
                 ram_block_notify_remap(block->host, offset, page_size);
             }
 
-- 
2.43.5
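
For readers less familiar with the embedded-notifier idiom used in this
patch, here is a reduced sketch. The names (`mini_backend`, `mini_notifier`,
`mini_container_of`) are hypothetical and the "settings" are replaced by
simple bookkeeping; only the shape of the callback follows the patch: recover
the owning backend from the notifier, ignore remaps of other RAM, then act on
`host + offset` for `size` bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical equivalent of container_of(). */
#define mini_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct mini_notifier {
    void (*remapped)(struct mini_notifier *n, void *host,
                     size_t offset, size_t size);
};

/* Hypothetical backend that embeds its notifier. */
struct mini_backend {
    void *ram_base;        /* start of the RAM owned by this backend */
    int reapply_count;     /* how many times settings were re-applied */
    void *last_ptr;        /* start of the last range handled */
    size_t last_size;      /* size of the last range handled */
    struct mini_notifier notifier;
};

static void mini_backend_remapped(struct mini_notifier *n, void *host,
                                  size_t offset, size_t size)
{
    struct mini_backend *b = mini_container_of(n, struct mini_backend,
                                               notifier);

    /* Not initialized, or the remap concerns RAM owned by someone else. */
    if (b->ram_base == NULL || b->ram_base != host) {
        return;
    }
    /* Stand-in for re-applying the memory settings to the range. */
    b->last_ptr = (char *)host + offset;
    b->last_size = size;
    b->reapply_count++;
}

static void mini_backend_init(struct mini_backend *b, void *ram_base)
{
    assert(b != NULL);
    b->ram_base = ram_base;
    b->reapply_count = 0;
    b->last_ptr = NULL;
    b->last_size = 0;
    b->notifier.remapped = mini_backend_remapped;
}
```

Because every backend registers its own notifier, each remap event reaches
all backends, and the host pointer comparison is what keeps a backend from
touching memory it does not own.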

Re: [PATCH v5 6/6] hostmem: Handle remapping of RAM

Posted by David Hildenbrand 1 year ago

On 10.01.25 22:14, William Roche wrote:
> From: David Hildenbrand <david@redhat.com>
> 

You can make yourself the author and just make me a Co-developed-by here.

LGTM!

-- 
Cheers,

David / dhildenb

Re: [PATCH v5 6/6] hostmem: Handle remapping of RAM

Posted by William Roche 1 year ago


On 1/14/25 15:11, David Hildenbrand wrote:
> On 10.01.25 22:14, William Roche wrote:
>> From: David Hildenbrand <david@redhat.com>
>>
> 
> You can make yourself the author and just make me a Co-developed-by here.
> 
> LGTM!
> 

Ok done.

Thanks.

[PATCH v4 0/7] Poisoned memory recovery on reboot

Posted by William Roche 1 year, 1 month ago

From: William Roche <william.roche@oracle.com>

Hello David,

Here is a new version of our code and an updated description of the
patch set:

 ---
This set of patches fixes several problems with hardware memory errors
impacting hugetlbfs memory backed VMs and the generic memory recovery
on VM reset.
When using hugetlbfs large pages, any large page location being impacted
by an HW memory error results in poisoning the entire page, suddenly
making a large chunk of the VM memory unusable.

The main problem that currently exists in Qemu is the lack of backend
file repair before resetting the VM memory, which leaves the impacted
memory silently unusable even after a VM reboot.

In order to fix this issue, we take into account the page size of the
impacted memory block when dealing with the associated poisoned page
location.

Using the page size information, we also try to regenerate the memory
by calling ram_block_discard_range() on VM reset when running
qemu_ram_remap(), so that poisoned memory backed by a hugetlbfs
file is regenerated by punching a hole in this file. A new page is
loaded when the location is first touched.

In case of a discard failure we fall back to remapping the memory
location. We also have to reset the memory settings and honor the
'prealloc' attribute.
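
The "discard first, remap as a fallback" sequence can be sketched for the
anonymous memory case only. This is not the qemu_ram_remap() code:
`toy_regenerate_range()` and its `force_fallback` knob are hypothetical, and
the file-backed case (hole punching in the hugetlbfs file) is left out. It
assumes a Linux host, where madvise(MADV_DONTNEED) on private anonymous
memory makes the next access return zero-filled pages.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper for private anonymous memory.
 * Returns 0 if the range was discarded, 1 if it had to be remapped,
 * -1 if both attempts failed. force_fallback exists only to exercise
 * the remap path in a test.
 */
static int toy_regenerate_range(void *base, size_t offset, size_t len,
                                int force_fallback)
{
    char *start = (char *)base + offset;
    void *p;

    assert(base != NULL && len > 0);

    /* First choice: drop the pages, fresh ones come on the next touch. */
    if (!force_fallback && madvise(start, len, MADV_DONTNEED) == 0) {
        return 0;
    }

    /* Fallback: map new anonymous memory over the same addresses. */
    p = mmap(start, len, PROT_READ | PROT_WRITE,
             MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == (void *)start ? 1 : -1;
}
```

After either path the range contains fresh pages, so per-range settings such
as the NUMA policy or preallocation would have to be applied again, which is
what the remap notification described next is for.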

This memory setting is performed by a new remap notification mechanism
calling host_memory_backend_ram_remapped() function when a region of
a memory block is remapped.

We also enrich the messages used to report a memory error relayed to
the VM, identifying the memory page and its size when a large page is
impacted.
 ----
 
 About patch 3/7, I still think that generating an additional message
 in the kvm_hwpoison_page_add() function makes for cleaner code, without
 the need to repeat it for x86 and ARM. The message would be displayed
 before all the injection messages issued because of the large page
 failure. But we could go with this version if you prefer the existing
 message to be enriched.
 
 About patch 7/7, I could merge it with your patch 6/7 if you agree.
 
 
v3 -> v4:
. Fixed some commit message typos
. Enhanced some code comments
. Changed the discard fall back conditions to consider only anonymous
  memory
. Fixed some missing variable name changes in intermediary patches
. Modified the error message given when an error is injected to report
  the case of a large page
. Used snprintf() to generate this message
. Added the same type of message in the ARM case too


This code is scripts/checkpatch.pl clean
'make check' runs fine on both x86 and Arm.


David Hildenbrand (3):
  numa: Introduce and use ram_block_notify_remap()
  hostmem: Factor out applying settings
  hostmem: Handle remapping of RAM

William Roche (4):
  hwpoison_page_list and qemu_ram_remap are based on pages
  system/physmem: poisoned memory discard on reboot
  accel/kvm: Report the loss of a large memory page
  system/physmem: Memory settings applied on remap notification

 accel/kvm/kvm-all.c       |   2 +-
 backends/hostmem.c        | 189 +++++++++++++++++++++++---------------
 hw/core/numa.c            |  11 +++
 include/exec/cpu-common.h |   3 +-
 include/exec/ramlist.h    |   3 +
 include/sysemu/hostmem.h  |   1 +
 system/physmem.c          |  88 +++++++++++++-----
 target/arm/kvm.c          |  13 +++
 target/i386/kvm/kvm.c     |  18 +++-
 9 files changed, 225 insertions(+), 103 deletions(-)

-- 
2.43.5

Re: [PATCH v4 0/7] Poisoned memory recovery on reboot

Posted by David Hildenbrand 1 year, 1 month ago

On 14.12.24 14:45, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> Hello David,

Hi!

Let me start reviewing a bit today (it's already late, and I'll continue 
tomorrow).

> 
> Here is a new version of our code and an updated description of the
> patch set:
> 
>   ---
> This set of patches fixes several problems with hardware memory errors
> impacting hugetlbfs memory backed VMs and the generic memory recovery
> on VM reset.
> When using hugetlbfs large pages, any large page location being impacted
> by an HW memory error results in poisoning the entire page, suddenly
> making a large chunk of the VM memory unusable.

I assume the problem that will remain is that a running VM will still 
lose that chunk (yet, we only indicate a single 4k page to the guest via 
an injected MCE :( ).

So the biggest point of this patch set is really the recovery on reboot.

And as I am writing this, I realize that the series subject correctly 
reflects that :)

-- 
Cheers,

David / dhildenb

Re: [PATCH v4 0/7] Poisoned memory recovery on reboot

Posted by William Roche 1 year ago

On 1/8/25 22:22, David Hildenbrand wrote:
> On 14.12.24 14:45, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> Hello David,
> 
> Hi!
> 
> Let me start reviewing a bit today (it's already late, and I'll continue 
> tomorrow).
> 
>>
>> Here is a new version of our code and an updated description of the
>> patch set:
>>
>>   ---
>> This set of patches fixes several problems with hardware memory errors
>> impacting hugetlbfs memory backed VMs and the generic memory recovery
>> on VM reset.
>> When using hugetlbfs large pages, any large page location being impacted
>> by an HW memory error results in poisoning the entire page, suddenly
>> making a large chunk of the VM memory unusable.
> 
> I assume the problem that will remain is that a running VM will still 
> lose that chunk (yet, we only indicate a single 4k page to the guest via 
> an injected MCE :( ).
> 
> So the biggest point of this patch set is really the recovery on reboot.
> 
> And as I am writing this, I realize that the series subject correctly 
> reflects that :)
> 

Yes, I'm sending a new version v5 taking into account your remarks after 
answering each email you sent on the different patches.

[PATCH v4 1/7] hwpoison_page_list and qemu_ram_remap are based on pages

Posted by William Roche 1 year, 1 month ago

From: William Roche <william.roche@oracle.com>

The list of hwpoison pages used to remap the memory on reset
is based on the backend real page size. When dealing with
hugepages, we create a single entry for the entire page.

Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c       |  6 +++++-
 include/exec/cpu-common.h |  3 ++-
 system/physmem.c          | 32 ++++++++++++++++++++++++++------
 3 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 801cff16a5..24c0c4ce3f 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1278,7 +1278,7 @@ static void kvm_unpoison_all(void *param)
 
     QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
         QLIST_REMOVE(page, list);
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr);
         g_free(page);
     }
 }
@@ -1286,6 +1286,10 @@ static void kvm_unpoison_all(void *param)
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
+    size_t page_size = qemu_ram_pagesize_from_addr(ram_addr);
+
+    if (page_size > TARGET_PAGE_SIZE)
+        ram_addr = QEMU_ALIGN_DOWN(ram_addr, page_size);
 
     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 638dc806a5..59fbb324fa 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -67,7 +67,7 @@ typedef uintptr_t ram_addr_t;
 
 /* memory API */
 
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
+void qemu_ram_remap(ram_addr_t addr);
 /* This should not be used by devices.  */
 ram_addr_t qemu_ram_addr_from_host(void *ptr);
 ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
@@ -108,6 +108,7 @@ bool qemu_ram_is_named_file(RAMBlock *rb);
 int qemu_ram_get_fd(RAMBlock *rb);
 
 size_t qemu_ram_pagesize(RAMBlock *block);
+size_t qemu_ram_pagesize_from_addr(ram_addr_t addr);
 size_t qemu_ram_pagesize_largest(void);
 
 /**
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..2c90cc2d78 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1665,6 +1665,19 @@ size_t qemu_ram_pagesize(RAMBlock *rb)
     return rb->page_size;
 }
 
+/* Return backend real page size used for the given ram_addr */
+size_t qemu_ram_pagesize_from_addr(ram_addr_t addr)
+{
+    RAMBlock *rb;
+
+    RCU_READ_LOCK_GUARD();
+    rb =  qemu_get_ram_block(addr);
+    if (!rb) {
+        return TARGET_PAGE_SIZE;
+    }
+    return qemu_ram_pagesize(rb);
+}
+
 /* Returns the largest size of page in use */
 size_t qemu_ram_pagesize_largest(void)
 {
@@ -2167,17 +2180,22 @@ void qemu_ram_free(RAMBlock *block)
 }
 
 #ifndef _WIN32
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+void qemu_ram_remap(ram_addr_t addr)
 {
     RAMBlock *block;
     ram_addr_t offset;
     int flags;
     void *area, *vaddr;
     int prot;
+    size_t page_size;
 
     RAMBLOCK_FOREACH(block) {
         offset = addr - block->offset;
         if (offset < block->max_length) {
+            /* Respect the pagesize of our RAMBlock */
+            page_size = qemu_ram_pagesize(block);
+            offset = QEMU_ALIGN_DOWN(offset, page_size);
+
             vaddr = ramblock_ptr(block, offset);
             if (block->flags & RAM_PREALLOC) {
                 ;
@@ -2191,21 +2209,23 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                 prot = PROT_READ;
                 prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
                 if (block->fd >= 0) {
-                    area = mmap(vaddr, length, prot, flags, block->fd,
+                    area = mmap(vaddr, page_size, prot, flags, block->fd,
                                 offset + block->fd_offset);
                 } else {
                     flags |= MAP_ANONYMOUS;
-                    area = mmap(vaddr, length, prot, flags, -1, 0);
+                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
                 }
                 if (area != vaddr) {
                     error_report("Could not remap addr: "
                                  RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
-                                 length, addr);
+                                 page_size, addr);
                     exit(1);
                 }
-                memory_try_enable_merging(vaddr, length);
-                qemu_ram_setup_dump(vaddr, length);
+                memory_try_enable_merging(vaddr, page_size);
+                qemu_ram_setup_dump(vaddr, page_size);
             }
+
+            break;
         }
     }
 }
-- 
2.43.5

Re: [PATCH v4 1/7] hwpoison_page_list and qemu_ram_remap are based on pages

Posted by David Hildenbrand 1 year, 1 month ago

On 14.12.24 14:45, William Roche wrote:
> From: William Roche <william.roche@oracle.com>

Subject should likely start with "system/physmem:".

Maybe

"system/physmem: handle hugetlb correctly in qemu_ram_remap()"

> 
> The list of hwpoison pages used to remap the memory on reset
> is based on the backend real page size. When dealing with
> hugepages, we create a single entry for the entire page.

Maybe add something like:

"To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete hugetlb 
page; hugetlb pages cannot be partially mapped."

> 
> Co-developed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   accel/kvm/kvm-all.c       |  6 +++++-
>   include/exec/cpu-common.h |  3 ++-
>   system/physmem.c          | 32 ++++++++++++++++++++++++++------
>   3 files changed, 33 insertions(+), 8 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 801cff16a5..24c0c4ce3f 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1278,7 +1278,7 @@ static void kvm_unpoison_all(void *param)
>   
>       QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>           QLIST_REMOVE(page, list);
> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
> +        qemu_ram_remap(page->ram_addr);
>           g_free(page);
>       }
>   }
> @@ -1286,6 +1286,10 @@ static void kvm_unpoison_all(void *param)
>   void kvm_hwpoison_page_add(ram_addr_t ram_addr)
>   {
>       HWPoisonPage *page;
> +    size_t page_size = qemu_ram_pagesize_from_addr(ram_addr);
> +
> +    if (page_size > TARGET_PAGE_SIZE)
> +        ram_addr = QEMU_ALIGN_DOWN(ram_addr, page_size);

Is that part still required? I thought it would be sufficient (at least 
in the context of this patch) to handle it all in qemu_ram_remap().

qemu_ram_remap() will calculate the range to process based on the 
RAMBlock page size. IOW, the QEMU_ALIGN_DOWN() we do now in 
qemu_ram_remap().

Or am I missing something?

(sorry if we discussed that already; if there is a good reason it might 
make sense to state it in the patch description)
-- 
Cheers,

David / dhildenb

Re: [PATCH v4 1/7] hwpoison_page_list and qemu_ram_remap are based on pages

Posted by William Roche 1 year ago

On 1/8/25 22:34, David Hildenbrand wrote:
> On 14.12.24 14:45, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
> 
> Subject should likely start with "system/physmem:".
> 
> Maybe
> 
> "system/physmem: handle hugetlb correctly in qemu_ram_remap()"

I updated the commit title

> 
>>
>> The list of hwpoison pages used to remap the memory on reset
>> is based on the backend real page size. When dealing with
>> hugepages, we create a single entry for the entire page.
> 
> Maybe add something like:
> 
> "To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete hugetlb 
> page; hugetlb pages cannot be partially mapped."
> 

Updated into the commit message

>>
>> Co-developed-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   accel/kvm/kvm-all.c       |  6 +++++-
>>   include/exec/cpu-common.h |  3 ++-
>>   system/physmem.c          | 32 ++++++++++++++++++++++++++------
>>   3 files changed, 33 insertions(+), 8 deletions(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 801cff16a5..24c0c4ce3f 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -1278,7 +1278,7 @@ static void kvm_unpoison_all(void *param)
>>       QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>>           QLIST_REMOVE(page, list);
>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>> +        qemu_ram_remap(page->ram_addr);
>>           g_free(page);
>>       }
>>   }
>> @@ -1286,6 +1286,10 @@ static void kvm_unpoison_all(void *param)
>>   void kvm_hwpoison_page_add(ram_addr_t ram_addr)
>>   {
>>       HWPoisonPage *page;
>> +    size_t page_size = qemu_ram_pagesize_from_addr(ram_addr);
>> +
>> +    if (page_size > TARGET_PAGE_SIZE)
>> +        ram_addr = QEMU_ALIGN_DOWN(ram_addr, page_size);
> 
> Is that part still required? I thought it would be sufficient (at least 
> in the context of this patch) to handle it all in qemu_ram_remap().
> 
> qemu_ram_remap() will calculate the range to process based on the 
> RAMBlock page size. IOW, the QEMU_ALIGN_DOWN() we do now in 
> qemu_ram_remap().
> 
> Or am I missing something?
> 
> (sorry if we discussed that already; if there is a good reason it might 
> make sense to state it in the patch description)

You are right, but at this patch level we still need to round up the 
address and doing it here is small enough.
Of course, the code changes on patch 3/7 where we change both x86 and 
ARM versions of the code to align the memory pointer correctly in both 
cases.

Re: [PATCH v4 1/7] hwpoison_page_list and qemu_ram_remap are based on pages

Posted by David Hildenbrand 1 year ago

On 10.01.25 21:56, William Roche wrote:
> On 1/8/25 22:34, David Hildenbrand wrote:
>> On 14.12.24 14:45, William Roche wrote:
>>> From: William Roche <william.roche@oracle.com>
>>
>> Subject should likely start with "system/physmem:".
>>
>> Maybe
>>
>> "system/physmem: handle hugetlb correctly in qemu_ram_remap()"
> 
> I updated the commit title
> 
>>
>>>
>>> The list of hwpoison pages used to remap the memory on reset
>>> is based on the backend real page size. When dealing with
>>> hugepages, we create a single entry for the entire page.
>>
>> Maybe add something like:
>>
>> "To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete hugetlb
>> page; hugetlb pages cannot be partially mapped."
>>
> 
> Updated into the commit message
> 
>>>
>>> Co-developed-by: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>> ---
>>>    accel/kvm/kvm-all.c       |  6 +++++-
>>>    include/exec/cpu-common.h |  3 ++-
>>>    system/physmem.c          | 32 ++++++++++++++++++++++++++------
>>>    3 files changed, 33 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>> index 801cff16a5..24c0c4ce3f 100644
>>> --- a/accel/kvm/kvm-all.c
>>> +++ b/accel/kvm/kvm-all.c
>>> @@ -1278,7 +1278,7 @@ static void kvm_unpoison_all(void *param)
>>>        QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>>>            QLIST_REMOVE(page, list);
>>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>>> +        qemu_ram_remap(page->ram_addr);
>>>            g_free(page);
>>>        }
>>>    }
>>> @@ -1286,6 +1286,10 @@ static void kvm_unpoison_all(void *param)
>>>    void kvm_hwpoison_page_add(ram_addr_t ram_addr)
>>>    {
>>>        HWPoisonPage *page;
>>> +    size_t page_size = qemu_ram_pagesize_from_addr(ram_addr);
>>> +
>>> +    if (page_size > TARGET_PAGE_SIZE)
>>> +        ram_addr = QEMU_ALIGN_DOWN(ram_addr, page_size);
>>
>> Is that part still required? I thought it would be sufficient (at least
>> in the context of this patch) to handle it all in qemu_ram_remap().
>>
>> qemu_ram_remap() will calculate the range to process based on the
>> RAMBlock page size. IOW, the QEMU_ALIGN_DOWN() we do now in
>> qemu_ram_remap().
>>
>> Or am I missing something?
>>
>> (sorry if we discussed that already; if there is a good reason it might
>> make sense to state it in the patch description)
> 
> You are right, but at this patch level we still need to round up the

s/round up/align_down/

> address and doing it here is small enough.

Let me explain.

qemu_ram_remap() in this patch here doesn't need an aligned addr. It
will compute the offset into the block and align that down.

The only other place where we need the addr is the
error_report(), where I am not 100% sure if that is actually what we
want to print. We want to print something like ram_block_discard_range().


Note that ram_addr_t is a weird, separate address space. The alignment
does not have any guarantees / semantics there.


See ram_block_add() where we set
	new_block->offset = find_ram_offset(new_block->max_length);

independent of any other RAMBlock properties.

The only alignment we do is
	candidate = ROUND_UP(candidate, BITS_PER_LONG << TARGET_PAGE_BITS);

There is no guarantee that new_block->offset will be aligned to 1 GiB with
a 1 GiB hugetlb mapping.
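
To make the alignment problem concrete, here is a minimal standalone sketch
(not QEMU code). DemoBlock and the demo_* helpers are hypothetical names that
only mimic the offset and page size of a RAMBlock as discussed above; the
numbers used with it are chosen so that the block starts at a ram_addr_t
value that is not aligned to 1 GiB.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the RAMBlock fields that matter here. */
typedef struct {
    uint64_t offset;     /* start of the block in ram_addr_t space */
    uint64_t page_size;  /* backend page size, e.g. 1 GiB for hugetlb */
} DemoBlock;

/* Round n down to a multiple of m (m must be non-zero). */
static uint64_t demo_align_down(uint64_t n, uint64_t m)
{
    return n / m * m;
}

/*
 * Robust variant: convert to a block-relative offset first, then align
 * that offset to the page size of the block.
 */
static uint64_t demo_page_start_offset(const DemoBlock *b, uint64_t ram_addr)
{
    return demo_align_down(ram_addr - b->offset, b->page_size);
}

/*
 * Fragile variant: align the ram_addr_t value itself, then convert to a
 * block-relative offset. Only correct if b->offset happens to be aligned
 * to b->page_size, which nothing guarantees.
 */
static uint64_t demo_naive_page_start_offset(const DemoBlock *b,
                                             uint64_t ram_addr)
{
    return demo_align_down(ram_addr, b->page_size) - b->offset;
}
```

With a block starting at 0x48000000, a 1 GiB page size and a poisoned
ram_addr of 0x98000000, aligning the offset yields 0x40000000 (the start of
the second 1 GiB page), while aligning the ram_addr first yields 0x38000000,
which lies inside the first page.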


Note that there is another conceptual issue in this function: offset
should be of type uint64_t, it's not really ram_addr_t, but an
offset into the RAMBlock.

> Of course, the code changes on patch 3/7 where we change both x86 and
> ARM versions of the code to align the memory pointer correctly in both
> cases.

Thinking about it more, we should never try aligning ram_addr_t, only
the offset into the memory block or the virtual address.

So please remove this ram_addr_t alignment from this patch, and look into
aligning the virtual address / offset for the other user. Again, aligning
ram_addr_t is not guaranteed to work correctly.


So the patch itself should probably be (- patch description):


diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 801cff16a5..8a47aa7258 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1278,7 +1278,7 @@ static void kvm_unpoison_all(void *param)
  
      QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
          QLIST_REMOVE(page, list);
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr);
          g_free(page);
      }
  }
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 638dc806a5..50a829d31f 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -67,7 +67,7 @@ typedef uintptr_t ram_addr_t;
  
  /* memory API */
  
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
+void qemu_ram_remap(ram_addr_t addr);
  /* This should not be used by devices.  */
  ram_addr_t qemu_ram_addr_from_host(void *ptr);
  ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
diff --git a/system/physmem.c b/system/physmem.c
index 03d3618039..355588f5d5 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2167,17 +2167,35 @@ void qemu_ram_free(RAMBlock *block)
  }
  
  #ifndef _WIN32
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+/*
+ * qemu_ram_remap - remap a single RAM page
+ *
+ * @addr: address in ram_addr_t address space.
+ *
+ * This function will try remapping a single page of guest RAM identified by
+ * @addr, essentially discarding memory to recover from previously poisoned
+ * memory (MCE). The page size depends on the RAMBlock (i.e., hugetlb). @addr
+ * does not have to point at the start of the page.
+ *
+ * This function is only to be used during system resets; it will kill the
+ * VM if remapping failed.
+ */
+void qemu_ram_remap(ram_addr_t addr)
  {
      RAMBlock *block;
-    ram_addr_t offset;
+    uint64_t offset;
      int flags;
      void *area, *vaddr;
      int prot;
+    size_t page_size;
  
      RAMBLOCK_FOREACH(block) {
          offset = addr - block->offset;
          if (offset < block->max_length) {
+            /* Respect the pagesize of our RAMBlock */
+            page_size = qemu_ram_pagesize(block);
+            offset = QEMU_ALIGN_DOWN(offset, page_size);
+
              vaddr = ramblock_ptr(block, offset);
              if (block->flags & RAM_PREALLOC) {
                  ;
@@ -2191,21 +2209,22 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                  prot = PROT_READ;
                  prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
                  if (block->fd >= 0) {
-                    area = mmap(vaddr, length, prot, flags, block->fd,
+                    area = mmap(vaddr, page_size, prot, flags, block->fd,
                                  offset + block->fd_offset);
                  } else {
                      flags |= MAP_ANONYMOUS;
-                    area = mmap(vaddr, length, prot, flags, -1, 0);
+                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
                  }
                  if (area != vaddr) {
-                    error_report("Could not remap addr: "
-                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
-                                 length, addr);
+                    error_report("Could not remap RAM %s:%" PRIx64 " +%zx",
+                                 block->idstr, offset, page_size);
                      exit(1);
                  }
-                memory_try_enable_merging(vaddr, length);
-                qemu_ram_setup_dump(vaddr, length);
+                memory_try_enable_merging(vaddr, page_size);
+                qemu_ram_setup_dump(vaddr, page_size);
              }
+
+            break;
          }
      }
  }
-- 
2.47.1

-- 
Cheers,

David / dhildenb

[PATCH v4 2/7] system/physmem: poisoned memory discard on reboot

Posted by William Roche 1 year, 1 month ago

From: William Roche <william.roche@oracle.com>

Repair poisoned memory location(s), calling ram_block_discard_range():
punching a hole in the backend file when necessary and regenerating
a usable memory.
If the kernel doesn't support the madvise calls used by this function
and we are dealing with anonymous memory, fall back to remapping the
location(s).
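
The discard-then-fallback idea can be sketched outside QEMU for the
anonymous-memory case. This is only an illustration under assumptions:
demo_discard_or_remap() is a hypothetical helper, it uses
madvise(MADV_DONTNEED) as a stand-in for the discard step on private
anonymous memory, and it does not reproduce ram_block_discard_range() or the
file-backed path.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: drop the contents of a page-aligned private
 * anonymous range so that the next access gets fresh zero-filled memory.
 * First try to discard; if the kernel refuses, fall back to mapping new
 * anonymous memory over the same virtual address with MAP_FIXED.
 * Returns 0 on success, -1 on failure.
 */
static int demo_discard_or_remap(void *vaddr, size_t len)
{
    void *area;

    if (madvise(vaddr, len, MADV_DONTNEED) == 0) {
        return 0;
    }

    /* Fallback, only meaningful for anonymous memory. */
    area = mmap(vaddr, len, PROT_READ | PROT_WRITE,
                MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return area == vaddr ? 0 : -1;
}
```

On Linux, a private anonymous page treated this way reads back as zeroes
afterwards, whichever of the two paths was taken.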

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/physmem.c | 63 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 42 insertions(+), 21 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 2c90cc2d78..b228a692f8 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2180,13 +2180,37 @@ void qemu_ram_free(RAMBlock *block)
 }
 
 #ifndef _WIN32
+/* Try to simply remap the given location */
+static void qemu_ram_remap_mmap(RAMBlock *block, void* vaddr, size_t size,
+                                ram_addr_t offset)
+{
+    int flags, prot;
+    void *area;
+
+    flags = MAP_FIXED;
+    flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE;
+    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
+    prot = PROT_READ;
+    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
+    if (block->fd >= 0) {
+        area = mmap(vaddr, size, prot, flags, block->fd,
+                    offset + block->fd_offset);
+    } else {
+        flags |= MAP_ANONYMOUS;
+        area = mmap(vaddr, size, prot, flags, -1, 0);
+    }
+    if (area != vaddr) {
+        error_report("Could not remap addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                     size, block->offset + offset);
+        exit(1);
+    }
+}
+
 void qemu_ram_remap(ram_addr_t addr)
 {
     RAMBlock *block;
     ram_addr_t offset;
-    int flags;
-    void *area, *vaddr;
-    int prot;
+    void *vaddr;
     size_t page_size;
 
     RAMBLOCK_FOREACH(block) {
@@ -2202,24 +2226,21 @@ void qemu_ram_remap(ram_addr_t addr)
             } else if (xen_enabled()) {
                 abort();
             } else {
-                flags = MAP_FIXED;
-                flags |= block->flags & RAM_SHARED ?
-                         MAP_SHARED : MAP_PRIVATE;
-                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
-                prot = PROT_READ;
-                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
-                if (block->fd >= 0) {
-                    area = mmap(vaddr, page_size, prot, flags, block->fd,
-                                offset + block->fd_offset);
-                } else {
-                    flags |= MAP_ANONYMOUS;
-                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
-                }
-                if (area != vaddr) {
-                    error_report("Could not remap addr: "
-                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
-                                 page_size, addr);
-                    exit(1);
+                if (ram_block_discard_range(block, offset + block->fd_offset,
+                                            page_size) != 0) {
+                    /*
+                     * Fold back to using mmap() only for anonymous mapping,
+                     * as if a backing file is associated we may not be able
+                     * to recover the memory in all cases.
+                     * So don't take the risk of using only mmap and fail now.
+                     */
+                    if (block->fd >= 0) {
+                        error_report("Memory poison recovery failure addr: "
+                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     page_size, addr);
+                        exit(1);
+                    }
+                    qemu_ram_remap_mmap(block, vaddr, page_size, offset);
                 }
                 memory_try_enable_merging(vaddr, page_size);
                 qemu_ram_setup_dump(vaddr, page_size);
-- 
2.43.5

Re: [PATCH v4 2/7] system/physmem: poisoned memory discard on reboot

Posted by David Hildenbrand 1 year, 1 month ago

On 14.12.24 14:45, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> Repair poisoned memory location(s), calling ram_block_discard_range():
> punching a hole in the backend file when necessary and regenerating
> a usable memory.
> If the kernel doesn't support the madvise calls used by this function
> and we are dealing with anonymous memory, fall back to remapping the
> location(s).
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   system/physmem.c | 63 ++++++++++++++++++++++++++++++++----------------
>   1 file changed, 42 insertions(+), 21 deletions(-)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index 2c90cc2d78..b228a692f8 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2180,13 +2180,37 @@ void qemu_ram_free(RAMBlock *block)
>   }
>   
>   #ifndef _WIN32
> +/* Try to simply remap the given location */
> +static void qemu_ram_remap_mmap(RAMBlock *block, void* vaddr, size_t size,
> +                                ram_addr_t offset)

Can you make the parameters match the ones of ram_block_discard_range() 
so the invocation gets easier to read? You can calculate vaddr easily 
internally.

Something like

static void qemu_ram_remap_mmap(RAMBlock *rb, uint64_t start,
				size_t length)

> +{
> +    int flags, prot;
> +    void *area;
> +
> +    flags = MAP_FIXED;
> +    flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE;
> +    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
> +    prot = PROT_READ;
> +    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
> +    if (block->fd >= 0) {

Heh, that case can no longer happen!

assert(block->fd < 0);

?

> +        area = mmap(vaddr, size, prot, flags, block->fd,
> +                    offset + block->fd_offset);
> +    } else {
> +        flags |= MAP_ANONYMOUS;
> +        area = mmap(vaddr, size, prot, flags, -1, 0);
> +    }
> +    if (area != vaddr) {
> +        error_report("Could not remap addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                     size, block->offset + offset);
> +        exit(1);
> +    }
> +}
> +
>   void qemu_ram_remap(ram_addr_t addr)
>   {
>       RAMBlock *block;
>       ram_addr_t offset;
> -    int flags;
> -    void *area, *vaddr;
> -    int prot;
> +    void *vaddr;
>       size_t page_size;
>   
>       RAMBLOCK_FOREACH(block) {
> @@ -2202,24 +2226,21 @@ void qemu_ram_remap(ram_addr_t addr)
>               } else if (xen_enabled()) {
>                   abort();
>               } else {
> -                flags = MAP_FIXED;
> -                flags |= block->flags & RAM_SHARED ?
> -                         MAP_SHARED : MAP_PRIVATE;
> -                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
> -                prot = PROT_READ;
> -                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
> -                if (block->fd >= 0) {
> -                    area = mmap(vaddr, page_size, prot, flags, block->fd,
> -                                offset + block->fd_offset);
> -                } else {
> -                    flags |= MAP_ANONYMOUS;
> -                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
> -                }
> -                if (area != vaddr) {
> -                    error_report("Could not remap addr: "
> -                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> -                                 page_size, addr);
> -                    exit(1);
> +                if (ram_block_discard_range(block, offset + block->fd_offset,
> +                                            page_size) != 0) {
> +                    /*
> +                     * Fold back to using mmap() only for anonymous mapping,

s/Fold/Fall/

> +                     * as if a backing file is associated we may not be able
> +                     * to recover the memory in all cases.
> +                     * So don't take the risk of using only mmap and fail now.
> +                     */
> +                    if (block->fd >= 0) {
> +                        error_report("Memory poison recovery failure addr: "
> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                                     page_size, addr);
> +                        exit(1);
> +                    }
> +                    qemu_ram_remap_mmap(block, vaddr, page_size, offset);
>                   }
>                   memory_try_enable_merging(vaddr, page_size);
>                   qemu_ram_setup_dump(vaddr, page_size);

These two can be moved into qemu_ram_remap_mmap(). They are not required 
if we didn't actually mess with mmap().

-- 
Cheers,

David / dhildenb

Re: [PATCH v4 2/7] system/physmem: poisoned memory discard on reboot

Posted by William Roche 1 year ago

On 1/8/25 22:44, David Hildenbrand wrote:
> On 14.12.24 14:45, William Roche wrote:
>> +/* Try to simply remap the given location */
>> +static void qemu_ram_remap_mmap(RAMBlock *block, void* vaddr, size_t 
>> size,
>> +                                ram_addr_t offset)
> 
> Can you make the parameters match the ones of ram_block_discard_range() 
> so the invocation gets easier to read? You can calculate vaddr easily 
> internally.
> 
> Something like
> 
> static void qemu_ram_remap_mmap(RAMBlock *rb, uint64_t start,
>                  size_t length)

I used the same arguments as ram_block_discard_range() as you asked.

> 
>> +{
>> +    int flags, prot;
>> +    void *area;
>> +
>> +    flags = MAP_FIXED;
>> +    flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE;
>> +    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
>> +    prot = PROT_READ;
>> +    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>> +    if (block->fd >= 0) {
> 
> Heh, that case can no longer happen!

I also removed the case of remapping a file from the 
qemu_ram_remap_mmap() function.

> 
> assert(block->fd < 0);

And added the assert() you suggested.

>> +                if (ram_block_discard_range(block, offset + block->fd_offset,
>> +                                            page_size) != 0) {

Studying some more the arguments used by ram_block_discard_range() and 
the need to fallocate/Punch the underlying file, I think that I should 
simply provide the 'offset' here and that block->fd_offset is missing in 
the ram_block_discard_range() function where we have to punch a hole in 
the file. Don't you agree ?
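
As a rough standalone illustration of why the file offset matters when
punching a hole: if a block is mapped starting at fd_offset inside its
backing file, a block-relative range has to be shifted by fd_offset before
it is handed to fallocate(). demo_punch_block_range() is a hypothetical
helper written for this sketch, not the QEMU function.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>   /* memfd_create(), handy for trying the helper */
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: punch a hole for the block-relative range
 * [start, start + length). 'fd_offset' is where the block's mapping
 * begins inside the file, so the hole in the file starts at
 * fd_offset + start. Returns 0 on success, -errno on failure.
 */
static int demo_punch_block_range(int fd, uint64_t fd_offset,
                                  uint64_t start, size_t length)
{
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  (off_t)(fd_offset + start), (off_t)length) != 0) {
        return -errno;
    }
    return 0;
}
```

For a three-page file with the block mapped from the second page
(fd_offset of one page), discarding the block's second page must clear the
third page of the file; passing the block-relative offset alone would clear
the file's second page instead.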

If we can get the current set of fixes integrated, I'll submit another 
fix proposal to take the fd_offset into account in a second time. (Not 
enlarging the current set)

But here is what I'm thinking about. That we can discuss later if you want:

@@ -3730,11 +3724,12 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
              }

              ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
-                            start, length);
+                            start + rb->fd_offset, length);
              if (ret) {
                  ret = -errno;
                  error_report("%s: Failed to fallocate %s:%" PRIx64 " +%zx (%d)",
-                             __func__, rb->idstr, start, length, ret);
+                             __func__, rb->idstr, start + rb->fd_offset,
+                             length, ret);
                  goto err;
              }

Or I can integrate that as an additional patch if you prefer.

>> +                    /*
>> +                     * Fold back to using mmap() only for anonymous mapping,
> 
> s/Fold/Fall/

typo fixed

>>                   }
>>                   memory_try_enable_merging(vaddr, page_size);
>>                   qemu_ram_setup_dump(vaddr, page_size);
> 
> These two can be moved into qemu_ram_remap_mmap(). They are not required 
> if we didn't actually mess with mmap().

These functions will be replaced by the ram_block_notify_remap() of 
patch 7, which is called whether or not ram_block_discard_range() 
succeeded.
So we should leave these two function calls here for now, as they mimic an 
aspect of what the notifier code will do.

Re: [PATCH v4 2/7] system/physmem: poisoned memory discard on reboot

Posted by David Hildenbrand 1 year ago

> If we can get the current set of fixes integrated, I'll submit another
> fix proposal to take the fd_offset into account in a second time. (Not
> enlarging the current set)
> 
> But here is what I'm thinking about. That we can discuss later if you want:
> 
> @@ -3730,11 +3724,12 @@ int ram_block_discard_range(RAMBlock *rb,
> uint64_t start, size_t length)
>                }
> 
>                ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE |
> FALLOC_FL_KEEP_SIZE,
> -                            start, length);
> +                            start + rb->fd_offset, length);
>                if (ret) {
>                    ret = -errno;
>                    error_report("%s: Failed to fallocate %s:%" PRIx64 "
> +%zx (%d)",
> -                             __func__, rb->idstr, start, length, ret);
> +                             __func__, rb->idstr, start + rb->fd_offset,
> +                            length, ret);
>                    goto err;
>                }
> 
> 
> Or I can integrate that as an addition patch if you prefer.

Very good point! We missed to take fd_offset into account here.

Can you send that out as a separate fix?

Fixes: 4b870dc4d0c0 ("hostmem-file: add offset option")

-- 
Cheers,

David / dhildenb

Re: [PATCH v4 2/7] system/physmem: poisoned memory discard on reboot

Posted by William Roche 1 year ago

On 1/14/25 15:00, David Hildenbrand wrote:
>> If we can get the current set of fixes integrated, I'll submit another
>> fix proposal to take the fd_offset into account in a second time. (Not
>> enlarging the current set)
>>
>> But here is what I'm thinking about. That we can discuss later if you 
>> want:
>>
>> @@ -3730,11 +3724,12 @@ int ram_block_discard_range(RAMBlock *rb,
>> uint64_t start, size_t length)
>>                }
>>
>>                ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE |
>> FALLOC_FL_KEEP_SIZE,
>> -                            start, length);
>> +                            start + rb->fd_offset, length);
>>                if (ret) {
>>                    ret = -errno;
>>                    error_report("%s: Failed to fallocate %s:%" PRIx64 "
>> +%zx (%d)",
>> -                             __func__, rb->idstr, start, length, ret);
>> +                             __func__, rb->idstr, start + rb->fd_offset,
>> +                            length, ret);
>>                    goto err;
>>                }
>>
>>
>> Or I can integrate that as an addition patch if you prefer.
> 
> Very good point! We missed to take fd_offset into account here.
> 
> Can you send that out as a separate fix?
> 
> Fixed: 4b870dc4d0c0 ("hostmem-file: add offset option")

Thanks to Peter Xu and to you for your reviews of my proposal for this 
separate fix that should be on track to be integrated soon.

[PATCH v4 3/7] accel/kvm: Report the loss of a large memory page

Posted by William Roche 1 year, 1 month ago

From: William Roche <william.roche@oracle.com>

When a large page is impacted by a memory error, enhance
the existing QEMU error message, which indicates that the error
is injected into the VM, by adding "on lost large page SIZE@ADDR".

Also add a similar message for the ARM platform.

In the case of a large page impacted, we now report:
...Memory Error at QEMU addr X and GUEST addr Y on lost large page SIZE@ADDR of type...

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c   |  4 ----
 target/arm/kvm.c      | 13 +++++++++++++
 target/i386/kvm/kvm.c | 18 ++++++++++++++----
 3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 24c0c4ce3f..8a47aa7258 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1286,10 +1286,6 @@ static void kvm_unpoison_all(void *param)
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
-    size_t page_size = qemu_ram_pagesize_from_addr(ram_addr);
-
-    if (page_size > TARGET_PAGE_SIZE)
-        ram_addr = QEMU_ALIGN_DOWN(ram_addr, page_size);
 
     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 7b6812c0de..db234f79cc 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2366,6 +2366,8 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
 {
     ram_addr_t ram_addr;
     hwaddr paddr;
+    size_t page_size;
+    char lp_msg[54];
 
     assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
 
@@ -2373,6 +2375,14 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
+            page_size = qemu_ram_pagesize_from_addr(ram_addr);
+            if (page_size > TARGET_PAGE_SIZE) {
+                ram_addr = ROUND_DOWN(ram_addr, page_size);
+                snprintf(lp_msg, sizeof(lp_msg), " on lost large page "
+                    RAM_ADDR_FMT "@" RAM_ADDR_FMT "", page_size, ram_addr);
+            } else {
+                lp_msg[0] = '\0';
+            }
             kvm_hwpoison_page_add(ram_addr);
             /*
              * If this is a BUS_MCEERR_AR, we know we have been called
@@ -2389,6 +2399,9 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
                 kvm_cpu_synchronize_state(c);
                 if (!acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
                     kvm_inject_arm_sea(c);
+                    error_report("Guest Memory Error at QEMU addr %p and "
+                        "GUEST addr 0x%" HWADDR_PRIx "%s of type %s injected",
+                        addr, paddr, lp_msg, "BUS_MCEERR_AR");
                 } else {
                     error_report("failed to record the error");
                     abort();
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 8e17942c3b..336646ed61 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -741,6 +741,8 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
     CPUX86State *env = &cpu->env;
     ram_addr_t ram_addr;
     hwaddr paddr;
+    size_t page_size;
+    char lp_msg[54];
 
     /* If we get an action required MCE, it has been injected by KVM
      * while the VM was running.  An action optional MCE instead should
@@ -753,6 +755,14 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
+            page_size = qemu_ram_pagesize_from_addr(ram_addr);
+            if (page_size > TARGET_PAGE_SIZE) {
+                ram_addr = ROUND_DOWN(ram_addr, page_size);
+                snprintf(lp_msg, sizeof(lp_msg), " on lost large page "
+                        RAM_ADDR_FMT "@" RAM_ADDR_FMT "", page_size, ram_addr);
+            } else {
+                lp_msg[0] = '\0';
+            }
             kvm_hwpoison_page_add(ram_addr);
             kvm_mce_inject(cpu, paddr, code);
 
@@ -763,12 +773,12 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
              */
             if (code == BUS_MCEERR_AR) {
                 error_report("Guest MCE Memory Error at QEMU addr %p and "
-                    "GUEST addr 0x%" HWADDR_PRIx " of type %s injected",
-                    addr, paddr, "BUS_MCEERR_AR");
+                    "GUEST addr 0x%" HWADDR_PRIx "%s of type %s injected",
+                    addr, paddr, lp_msg, "BUS_MCEERR_AR");
             } else {
                  warn_report("Guest MCE Memory Error at QEMU addr %p and "
-                     "GUEST addr 0x%" HWADDR_PRIx " of type %s injected",
-                     addr, paddr, "BUS_MCEERR_AO");
+                     "GUEST addr 0x%" HWADDR_PRIx "%s of type %s injected",
+                     addr, paddr, lp_msg, "BUS_MCEERR_AO");
             }
 
             return;
-- 
2.43.5
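
The alignment and message suffix introduced by this patch can be tried out in isolation. The following is a minimal sketch under stated assumptions, not the QEMU code: lost_page_suffix() is a hypothetical helper, addresses are plain uint64_t values printed with PRIx64 rather than RAM_ADDR_FMT, and the page size is assumed to be a power of two.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-alone helper: align ram_addr down to the start of
 * its backing page and build the optional " on lost large page ..."
 * suffix. Returns the (possibly aligned) address.
 */
uint64_t lost_page_suffix(char *buf, size_t len, uint64_t ram_addr,
                          uint64_t page_size, uint64_t target_page_size)
{
    /* The mask below only works for power-of-two page sizes. */
    assert(page_size && (page_size & (page_size - 1)) == 0);

    if (page_size > target_page_size) {
        ram_addr &= ~(page_size - 1);
        snprintf(buf, len, " on lost large page %" PRIx64 "@%" PRIx64,
                 page_size, ram_addr);
    } else if (len) {
        buf[0] = '\0';          /* standard page: no suffix */
    }
    return ram_addr;
}
```

With a 2 MiB page (0x200000) and an address of 0x19fd6000, the aligned address is 0x19e00000 and the suffix reads " on lost large page 200000@19e00000".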

[PATCH v4 4/7] numa: Introduce and use ram_block_notify_remap()

Posted by William Roche 1 year, 1 month ago

From: David Hildenbrand <david@redhat.com>

Notify registered listeners about the remap at the end of
qemu_ram_remap() so e.g., a memory backend can re-apply its
settings correctly.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 hw/core/numa.c         | 11 +++++++++++
 include/exec/ramlist.h |  3 +++
 system/physmem.c       |  1 +
 3 files changed, 15 insertions(+)

diff --git a/hw/core/numa.c b/hw/core/numa.c
index 1b5f44baea..4ca67db483 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -895,3 +895,14 @@ void ram_block_notify_resize(void *host, size_t old_size, size_t new_size)
         }
     }
 }
+
+void ram_block_notify_remap(void *host, size_t offset, size_t size)
+{
+    RAMBlockNotifier *notifier;
+
+    QLIST_FOREACH(notifier, &ram_list.ramblock_notifiers, next) {
+        if (notifier->ram_block_remapped) {
+            notifier->ram_block_remapped(notifier, host, offset, size);
+        }
+    }
+}
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
index d9cfe530be..c1dc785a57 100644
--- a/include/exec/ramlist.h
+++ b/include/exec/ramlist.h
@@ -72,6 +72,8 @@ struct RAMBlockNotifier {
                               size_t max_size);
     void (*ram_block_resized)(RAMBlockNotifier *n, void *host, size_t old_size,
                               size_t new_size);
+    void (*ram_block_remapped)(RAMBlockNotifier *n, void *host, size_t offset,
+                               size_t size);
     QLIST_ENTRY(RAMBlockNotifier) next;
 };
 
@@ -80,6 +82,7 @@ void ram_block_notifier_remove(RAMBlockNotifier *n);
 void ram_block_notify_add(void *host, size_t size, size_t max_size);
 void ram_block_notify_remove(void *host, size_t size, size_t max_size);
 void ram_block_notify_resize(void *host, size_t old_size, size_t new_size);
+void ram_block_notify_remap(void *host, size_t offset, size_t size);
 
 GString *ram_block_format(void);
 
diff --git a/system/physmem.c b/system/physmem.c
index b228a692f8..9fc74a5699 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2244,6 +2244,7 @@ void qemu_ram_remap(ram_addr_t addr)
                 }
                 memory_try_enable_merging(vaddr, page_size);
                 qemu_ram_setup_dump(vaddr, page_size);
+                ram_block_notify_remap(block->host, offset, page_size);
             }
 
             break;
-- 
2.43.5

[PATCH v4 5/7] hostmem: Factor out applying settings

Posted by William Roche 1 year, 1 month ago

From: David Hildenbrand <david@redhat.com>

We want to reuse the functionality when remapping or resizing RAM.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 backends/hostmem.c | 155 ++++++++++++++++++++++++---------------------
 1 file changed, 82 insertions(+), 73 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index 181446626a..bf85d716e5 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -36,6 +36,87 @@ QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_BIND != MPOL_BIND);
 QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_INTERLEAVE != MPOL_INTERLEAVE);
 #endif
 
+static void host_memory_backend_apply_settings(HostMemoryBackend *backend,
+                                               void *ptr, uint64_t size,
+                                               Error **errp)
+{
+    bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
+
+    if (backend->merge) {
+        qemu_madvise(ptr, size, QEMU_MADV_MERGEABLE);
+    }
+    if (!backend->dump) {
+        qemu_madvise(ptr, size, QEMU_MADV_DONTDUMP);
+    }
+#ifdef CONFIG_NUMA
+    unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
+    /* lastbit == MAX_NODES means maxnode = 0 */
+    unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
+    /*
+     * Ensure policy won't be ignored in case memory is preallocated
+     * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
+     * this doesn't catch hugepage case.
+     */
+    unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
+    int mode = backend->policy;
+
+    /*
+     * Check for invalid host-nodes and policies and give more verbose
+     * error messages than mbind().
+     */
+    if (maxnode && backend->policy == MPOL_DEFAULT) {
+        error_setg(errp, "host-nodes must be empty for policy default,"
+                   " or you should explicitly specify a policy other"
+                   " than default");
+        return;
+    } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
+        error_setg(errp, "host-nodes must be set for policy %s",
+                   HostMemPolicy_str(backend->policy));
+        return;
+    }
+
+    /*
+     * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
+     * as argument to mbind() due to an old Linux bug (feature?) which
+     * cuts off the last specified node. This means backend->host_nodes
+     * must have MAX_NODES+1 bits available.
+     */
+    assert(sizeof(backend->host_nodes) >=
+           BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
+    assert(maxnode <= MAX_NODES);
+
+#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
+    if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
+        /*
+         * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
+         * silently picks the first node.
+         */
+        mode = MPOL_PREFERRED_MANY;
+    }
+#endif
+
+    if (maxnode &&
+        mbind(ptr, size, mode, backend->host_nodes, maxnode + 1, flags)) {
+        if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
+            error_setg_errno(errp, errno,
+                             "cannot bind memory to host NUMA nodes");
+            return;
+        }
+    }
+#endif
+    /*
+     * Preallocate memory after the NUMA policy has been instantiated.
+     * This is necessary to guarantee memory is allocated with
+     * specified NUMA policy in place.
+     */
+    if (backend->prealloc &&
+        !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
+                           ptr, size, backend->prealloc_threads,
+                           backend->prealloc_context, async, errp)) {
+        return;
+    }
+}
+
 char *
 host_memory_backend_get_name(HostMemoryBackend *backend)
 {
@@ -337,7 +418,6 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
     void *ptr;
     uint64_t sz;
     size_t pagesize;
-    bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
 
     if (!bc->alloc) {
         return;
@@ -357,78 +437,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
         return;
     }
 
-    if (backend->merge) {
-        qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
-    }
-    if (!backend->dump) {
-        qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
-    }
-#ifdef CONFIG_NUMA
-    unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
-    /* lastbit == MAX_NODES means maxnode = 0 */
-    unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
-    /*
-     * Ensure policy won't be ignored in case memory is preallocated
-     * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
-     * this doesn't catch hugepage case.
-     */
-    unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
-    int mode = backend->policy;
-
-    /* check for invalid host-nodes and policies and give more verbose
-     * error messages than mbind(). */
-    if (maxnode && backend->policy == MPOL_DEFAULT) {
-        error_setg(errp, "host-nodes must be empty for policy default,"
-                   " or you should explicitly specify a policy other"
-                   " than default");
-        return;
-    } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
-        error_setg(errp, "host-nodes must be set for policy %s",
-                   HostMemPolicy_str(backend->policy));
-        return;
-    }
-
-    /*
-     * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
-     * as argument to mbind() due to an old Linux bug (feature?) which
-     * cuts off the last specified node. This means backend->host_nodes
-     * must have MAX_NODES+1 bits available.
-     */
-    assert(sizeof(backend->host_nodes) >=
-           BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
-    assert(maxnode <= MAX_NODES);
-
-#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
-    if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
-        /*
-         * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
-         * silently picks the first node.
-         */
-        mode = MPOL_PREFERRED_MANY;
-    }
-#endif
-
-    if (maxnode &&
-        mbind(ptr, sz, mode, backend->host_nodes, maxnode + 1, flags)) {
-        if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
-            error_setg_errno(errp, errno,
-                             "cannot bind memory to host NUMA nodes");
-            return;
-        }
-    }
-#endif
-    /*
-     * Preallocate memory after the NUMA policy has been instantiated.
-     * This is necessary to guarantee memory is allocated with
-     * specified NUMA policy in place.
-     */
-    if (backend->prealloc && !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
-                                                ptr, sz,
-                                                backend->prealloc_threads,
-                                                backend->prealloc_context,
-                                                async, errp)) {
-        return;
-    }
+    host_memory_backend_apply_settings(backend, ptr, sz, errp);
 }
 
 static bool
-- 
2.43.5
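
The maxnode computation in the factored-out function is compact, so a small worked sketch may help. It assumes MAX_NODES is 128 and uses a hypothetical last_bit() helper that follows the convention the comment in the patch relies on: it returns the bitmap size when no bit is set. mbind_maxnode() is likewise a made-up name.

```c
#include <limits.h>

#define MAX_NODES 128           /* assumed value for this sketch */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
/* Number of longs needed for MAX_NODES + 1 bits. */
#define NODE_LONGS ((MAX_NODES + 1 + BITS_PER_LONG - 1) / BITS_PER_LONG)

/*
 * Hypothetical helper: index of the highest set bit below nbits,
 * or nbits when the bitmap is empty.
 */
unsigned long last_bit(const unsigned long *map, unsigned long nbits)
{
    for (unsigned long i = nbits; i-- > 0; ) {
        if (map[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG))) {
            return i;
        }
    }
    return nbits;
}

/*
 * Empty bitmap: lastbit == MAX_NODES, so (MAX_NODES + 1) % (MAX_NODES + 1)
 * gives 0. Otherwise the result is lastbit + 1, at most MAX_NODES.
 */
unsigned long mbind_maxnode(const unsigned long *host_nodes)
{
    unsigned long lastbit = last_bit(host_nodes, MAX_NODES);

    return (lastbit + 1) % (MAX_NODES + 1);
}
```

An empty host_nodes bitmap yields 0, so the mbind() call is skipped; a bitmap whose highest node is 3 yields 4, and the patch then passes maxnode + 1 to mbind().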

Re: [PATCH v4 5/7] hostmem: Factor out applying settings

Posted by David Hildenbrand 1 year, 1 month ago

On 14.12.24 14:45, William Roche wrote:
> From: David Hildenbrand <david@redhat.com>
> 
> We want to reuse the functionality when remapping or resizing RAM.

We should drop the "or resizing of RAM." part, as that does no longer apply.

> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---

-- 
Cheers,

David / dhildenb

Re: [PATCH v4 5/7] hostmem: Factor out applying settings

Posted by William Roche 1 year ago

On 1/8/25 22:58, David Hildenbrand wrote:
> On 14.12.24 14:45, William Roche wrote:
>> From: David Hildenbrand <david@redhat.com>
>>
>> We want to reuse the functionality when remapping or resizing RAM.
> 
> We should drop the "or resizing of RAM." part, as that does no longer 
> apply.

Commit message corrected.

[PATCH v4 6/7] hostmem: Handle remapping of RAM

Posted by William Roche 1 year, 1 month ago

From: David Hildenbrand <david@redhat.com>

Let's register a RAM block notifier and react on remap notifications.
Simply re-apply the settings. Exit if something goes wrong.

Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. Could be
that hostmem is still missing to update that flag ...

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 backends/hostmem.c       | 34 ++++++++++++++++++++++++++++++++++
 include/sysemu/hostmem.h |  1 +
 2 files changed, 35 insertions(+)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index bf85d716e5..863f6da11d 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -361,11 +361,37 @@ static void host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
     backend->prealloc_threads = value;
 }
 
+static void host_memory_backend_ram_remapped(RAMBlockNotifier *n, void *host,
+                                             size_t offset, size_t size)
+{
+    HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
+                                              ram_notifier);
+    Error *err = NULL;
+
+    if (!host_memory_backend_mr_inited(backend) ||
+        memory_region_get_ram_ptr(&backend->mr) != host) {
+        return;
+    }
+
+    host_memory_backend_apply_settings(backend, host + offset, size, &err);
+    if (err) {
+        /*
+         * If memory settings can't be successfully applied on remap,
+         * don't take the risk to continue without them.
+         */
+        error_report_err(err);
+        exit(1);
+    }
+}
+
 static void host_memory_backend_init(Object *obj)
 {
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
     MachineState *machine = MACHINE(qdev_get_machine());
 
+    backend->ram_notifier.ram_block_remapped = host_memory_backend_ram_remapped;
+    ram_block_notifier_add(&backend->ram_notifier);
+
     /* TODO: convert access to globals to compat properties */
     backend->merge = machine_mem_merge(machine);
     backend->dump = machine_dump_guest_core(machine);
@@ -379,6 +405,13 @@ static void host_memory_backend_post_init(Object *obj)
     object_apply_compat_props(obj);
 }
 
+static void host_memory_backend_finalize(Object *obj)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+
+    ram_block_notifier_remove(&backend->ram_notifier);
+}
+
 bool host_memory_backend_mr_inited(HostMemoryBackend *backend)
 {
     /*
@@ -595,6 +628,7 @@ static const TypeInfo host_memory_backend_info = {
     .instance_size = sizeof(HostMemoryBackend),
     .instance_init = host_memory_backend_init,
     .instance_post_init = host_memory_backend_post_init,
+    .instance_finalize = host_memory_backend_finalize,
     .interfaces = (InterfaceInfo[]) {
         { TYPE_USER_CREATABLE },
         { }
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index 67f45abe39..98309a9457 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -83,6 +83,7 @@ struct HostMemoryBackend {
     HostMemPolicy policy;
 
     MemoryRegion mr;
+    RAMBlockNotifier ram_notifier;
 };
 
 bool host_memory_backend_mr_inited(HostMemoryBackend *backend);
-- 
2.43.5
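
The notifier pattern used by patches 4/7 and 6/7 can be reduced to a few lines. This is a simplified sketch, not the QEMU implementation: Notifier and Backend are hypothetical stand-ins for RAMBlockNotifier and HostMemoryBackend, a hand-rolled singly linked list replaces QLIST, and re-applying the settings is modelled by a byte counter.

```c
#include <stddef.h>

/* Hypothetical stand-in for RAMBlockNotifier. */
typedef struct Notifier Notifier;
struct Notifier {
    void (*remapped)(Notifier *n, void *host, size_t offset, size_t size);
    Notifier *next;
};

/* Hypothetical stand-in for HostMemoryBackend. */
typedef struct Backend {
    char *ram;              /* start of this backend's RAM block */
    size_t reapplied;       /* bytes whose settings were re-applied */
    Notifier notifier;      /* embedded so the callback can find its Backend */
} Backend;

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static Notifier *notifiers;

static void backend_remapped(Notifier *n, void *host, size_t offset,
                             size_t size)
{
    Backend *b = container_of(n, Backend, notifier);

    (void)offset;
    if (b->ram != host) {
        return;             /* the remap concerns some other RAM block */
    }
    b->reapplied += size;   /* stand-in for re-applying the settings */
}

void backend_init(Backend *b, char *ram)
{
    b->ram = ram;
    b->reapplied = 0;
    b->notifier.remapped = backend_remapped;
    b->notifier.next = notifiers;
    notifiers = &b->notifier;
}

void notify_remap(void *host, size_t offset, size_t size)
{
    for (Notifier *n = notifiers; n; n = n->next) {
        if (n->remapped) {  /* the callback is optional */
            n->remapped(n, host, offset, size);
        }
    }
}
```

Every registered backend is called for every remap, which is why the callback has to compare the host pointer before doing any work.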

Re: [PATCH v4 6/7] hostmem: Handle remapping of RAM

Posted by David Hildenbrand 1 year, 1 month ago

On 14.12.24 14:45, William Roche wrote:
> From: David Hildenbrand <david@redhat.com>
> 
> Let's register a RAM block notifier and react on remap notifications.
> Simply re-apply the settings. Exit if something goes wrong.
> 
> Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. Could be
> that hostmem is still missing to update that flag ...

I think we can drop this comment, I was probably confused when writing 
that :)

> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   backends/hostmem.c       | 34 ++++++++++++++++++++++++++++++++++
>   include/sysemu/hostmem.h |  1 +
>   2 files changed, 35 insertions(+)
> 
> diff --git a/backends/hostmem.c b/backends/hostmem.c
> index bf85d716e5..863f6da11d 100644
> --- a/backends/hostmem.c
> +++ b/backends/hostmem.c
> @@ -361,11 +361,37 @@ static void host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
>       backend->prealloc_threads = value;
>   }
>   
> +static void host_memory_backend_ram_remapped(RAMBlockNotifier *n, void *host,
> +                                             size_t offset, size_t size)
> +{
> +    HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
> +                                              ram_notifier);
> +    Error *err = NULL;
> +
> +    if (!host_memory_backend_mr_inited(backend) ||
> +        memory_region_get_ram_ptr(&backend->mr) != host) {
> +        return;
> +    }

I think this should work, yes.

-- 
Cheers,

David / dhildenb

Re: [PATCH v4 6/7] hostmem: Handle remapping of RAM

Posted by William Roche 1 year ago

On 1/8/25 22:51, David Hildenbrand wrote:
> On 14.12.24 14:45, William Roche wrote:
>> From: David Hildenbrand <david@redhat.com>
>>
>> Let's register a RAM block notifier and react on remap notifications.
>> Simply re-apply the settings. Exit if something goes wrong.
>>
>> Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. Could be
>> that hostmem is still missing to update that flag ...
> 
> I think we can drop this comment, I was probably confused when writing 
> that :)

Note deleted from the commit message

> 
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   backends/hostmem.c       | 34 ++++++++++++++++++++++++++++++++++
>>   include/sysemu/hostmem.h |  1 +
>>   2 files changed, 35 insertions(+)
>>
>> diff --git a/backends/hostmem.c b/backends/hostmem.c
>> index bf85d716e5..863f6da11d 100644
>> --- a/backends/hostmem.c
>> +++ b/backends/hostmem.c
>> @@ -361,11 +361,37 @@ static void 
>> host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
>>       backend->prealloc_threads = value;
>>   }
>> +static void host_memory_backend_ram_remapped(RAMBlockNotifier *n, 
>> void *host,
>> +                                             size_t offset, size_t size)
>> +{
>> +    HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
>> +                                              ram_notifier);
>> +    Error *err = NULL;
>> +
>> +    if (!host_memory_backend_mr_inited(backend) ||
>> +        memory_region_get_ram_ptr(&backend->mr) != host) {
>> +        return;
>> +    }
> 
> I think this should work, yes.

Good :)
I'm just leaving this unchanged.

[PATCH v4 7/7] system/physmem: Memory settings applied on remap notification

Posted by William Roche 1 year, 1 month ago

From: William Roche <william.roche@oracle.com>

Merging and dump settings are handled by the remap notification
in addition to memory policy and preallocation.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/physmem.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 9fc74a5699..c0bfa20efc 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2242,8 +2242,6 @@ void qemu_ram_remap(ram_addr_t addr)
                     }
                     qemu_ram_remap_mmap(block, vaddr, page_size, offset);
                 }
-                memory_try_enable_merging(vaddr, page_size);
-                qemu_ram_setup_dump(vaddr, page_size);
                 ram_block_notify_remap(block->host, offset, page_size);
             }
 
-- 
2.43.5

Re: [PATCH v4 7/7] system/physmem: Memory settings applied on remap notification

Posted by David Hildenbrand 1 year, 1 month ago

On 14.12.24 14:45, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> Merging and dump settings are handled by the remap notification
> in addition to memory policy and preallocation.
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   system/physmem.c | 2 --
>   1 file changed, 2 deletions(-)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index 9fc74a5699..c0bfa20efc 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2242,8 +2242,6 @@ void qemu_ram_remap(ram_addr_t addr)
>                       }
>                       qemu_ram_remap_mmap(block, vaddr, page_size, offset);
>                   }
> -                memory_try_enable_merging(vaddr, page_size);
> -                qemu_ram_setup_dump(vaddr, page_size);
>                   ram_block_notify_remap(block->host, offset, page_size);
>               }
>   

Ah yes, indeed.

-- 
Cheers,

David / dhildenb

Re: [PATCH v4 7/7] system/physmem: Memory settings applied on remap notification

Posted by William Roche 1 year ago

On 1/8/25 22:53, David Hildenbrand wrote:
> On 14.12.24 14:45, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> Merging and dump settings are handled by the remap notification
>> in addition to memory policy and preallocation.
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   system/physmem.c | 2 --
>>   1 file changed, 2 deletions(-)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 9fc74a5699..c0bfa20efc 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -2242,8 +2242,6 @@ void qemu_ram_remap(ram_addr_t addr)
>>                       }
>>                       qemu_ram_remap_mmap(block, vaddr, page_size, 
>> offset);
>>                   }
>> -                memory_try_enable_merging(vaddr, page_size);
>> -                qemu_ram_setup_dump(vaddr, page_size);
>>                   ram_block_notify_remap(block->host, offset, page_size);
>>               }
> 
> Ah yes, indeed.

I also merged this patch 7/7 [system/physmem: Memory settings applied on 
remap notification] into your patch 6/7 [hostmem: Handle remapping of 
RAM], also removing the unneeded vaddr.

So now we are down to 6 patches (unless you want me to integrate the 
fix for ram_block_discard_range() I talked about for patch 2/7)

I'm sending my version v5 now.

Re: [PATCH v4 7/7] system/physmem: Memory settings applied on remap notification

Posted by David Hildenbrand 1 year ago

On 10.01.25 21:57, William Roche wrote:
> On 1/8/25 22:53, David Hildenbrand wrote:
>> On 14.12.24 14:45, William Roche wrote:
>>> From: William Roche <william.roche@oracle.com>
>>>
>>> Merging and dump settings are handled by the remap notification
>>> in addition to memory policy and preallocation.
>>>
>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>> ---
>>>    system/physmem.c | 2 --
>>>    1 file changed, 2 deletions(-)
>>>
>>> diff --git a/system/physmem.c b/system/physmem.c
>>> index 9fc74a5699..c0bfa20efc 100644
>>> --- a/system/physmem.c
>>> +++ b/system/physmem.c
>>> @@ -2242,8 +2242,6 @@ void qemu_ram_remap(ram_addr_t addr)
>>>                        }
>>>                        qemu_ram_remap_mmap(block, vaddr, page_size,
>>> offset);
>>>                    }
>>> -                memory_try_enable_merging(vaddr, page_size);
>>> -                qemu_ram_setup_dump(vaddr, page_size);
>>>                    ram_block_notify_remap(block->host, offset, page_size);
>>>                }
>>
>> Ah yes, indeed.
> 
> I also merged this patch 7/7 [system/physmem: Memory settings applied on
> remap notification] into your patch 6/7 [hostmem: Handle remapping of
> RAM], removing also the unneeded vaddr.
> 
> So now we are down to 6 patches  (unless you want me to integrate the
> fix for ram_block_discard_range() I talked about for patch 2/7)
> 
> I'm sending my version v5 now.

Sorry for the delayed reply to v4.

-- 
Cheers,

David / dhildenb

[PATCH v3 0/7] hugetlbfs memory HW error fixes

Posted by William Roche 1 year, 2 months ago

From: William Roche <william.roche@oracle.com>

Hi David,

Here is a new version of our code, but I still need to double check
the mmap behavior in case of a memory error impact on:
- a clean page of an empty file or populated file
- already mapped using MAP_SHARED or MAP_PRIVATE
to see if mmap() can recover the area or not.

But I wanted to provide this version to know if this is the kind of
implementation you were expecting.

And here is a slightly updated description of the patch set:

 ---
This set of patches fixes several problems with hardware memory errors
impacting hugetlbfs memory backed VMs. When using hugetlbfs large
pages, any large page location being impacted by an HW memory error
results in poisoning the entire page, suddenly making a large chunk of
the VM memory unusable.

The main problem that currently exists in Qemu is the lack of backend
file repair before resetting the VM memory, which leaves the impacted
memory silently unusable even after a VM reboot.

In order to fix this issue, we take into account the page size of the
impacted memory block when dealing with the associated poisoned page
location.

Using the page size information, we also try to regenerate the memory
by calling ram_block_discard_range() on VM reset when running
qemu_ram_remap(), so that poisoned memory backed by a hugetlbfs
file is regenerated with a hole punched in this file. A new page is
loaded when the location is first touched.

If the discard fails, we fall back to unmapping/remapping the memory
location and resetting the memory settings.

We also have to honor the 'prealloc' attribute even after a successful
discard, so we reapply the memory settings in this case too.

This memory setting is performed by a new remap notification mechanism
calling host_memory_backend_ram_remapped() function when a region of
a memory block is remapped.

We also enrich the messages used to report a memory error relayed to
the VM, identifying the memory page and its size when a large page
is impacted.
 ----


v2 -> v3:
. dropped the size parameter from qemu_ram_remap() and determine the page
  size when adding it to the poison list, aligning the offset down to the
  pagesize. Multiple sub-pages poisoned on a large page lead to a single
  poison entry.

. introduction of a helper function for the mmap code

. adding "on lost large page <size>@<ram_addr>" to the error injection
  msg (notation used in qemu_ram_remap() too).
  So only in the case of a large page, it looks like:
qemu-system-x86_64: Guest MCE Memory Error at QEMU addr 0x7fc1f5dd6000 and GUEST addr 0x19fd6000 on lost large page 200000@19e00000 of type BUS_MCEERR_AR injected

. as we need the page_size value for the above message, I retrieve the
  value in kvm_arch_on_sigbus_vcpu() to pass the appropriate pointer
  to kvm_hwpoison_page_add() that doesn't need to align it anymore.

. added a similar message for the ARM platform (removing the MCE
  keyword)

. I also introduced a "fail hard" in the remap notification:
  host_memory_backend_ram_remapped()
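
For illustration only (this is not the patch code), the alignment
described in the first v2 -> v3 item can be sketched as below. The names
page_align_down(), struct poison_list and poison_list_add() are
hypothetical, and the fixed-size array is just an assumption to keep the
sketch short. With the values from the example message above, GUEST addr
0x19fd6000 on a 0x200000 page aligns down to 0x19e00000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round an address down to its page boundary. */
static uint64_t page_align_down(uint64_t addr, uint64_t page_size)
{
    /* Only valid for power-of-two page sizes (4k, 2M, 1G, ...). */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}

/* Hypothetical fixed-size poison list, keyed by aligned page address. */
#define MAX_POISON 16
struct poison_list {
    uint64_t page[MAX_POISON];
    size_t count;
};

/*
 * Record the page containing addr.
 * Returns 1 if a new entry was added, 0 if the page was already
 * listed or the list is full.
 */
static int poison_list_add(struct poison_list *l, uint64_t addr,
                           uint64_t page_size)
{
    uint64_t base = page_align_down(addr, page_size);
    size_t i;

    for (i = 0; i < l->count; i++) {
        if (l->page[i] == base) {
            return 0; /* another sub-page of the same large page */
        }
    }
    if (l->count == MAX_POISON) {
        return 0;
    }
    l->page[l->count++] = base;
    return 1;
}
```

Two errors hitting different 4k sub-pages of the same 2M page therefore
end up as a single entry in this sketch.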


This code is scripts/checkpatch.pl clean
'make check' runs fine on both x86 and Arm.


David Hildenbrand (3):
  numa: Introduce and use ram_block_notify_remap()
  hostmem: Factor out applying settings
  hostmem: Handle remapping of RAM

William Roche (4):
  hwpoison_page_list and qemu_ram_remap are based of pages
  system/physmem: poisoned memory discard on reboot
  accel/kvm: Report the loss of a large memory page
  system/physmem: Memory settings applied on remap notification

 accel/kvm/kvm-all.c       |   2 +-
 backends/hostmem.c        | 189 +++++++++++++++++++++++---------------
 hw/core/numa.c            |  11 +++
 include/exec/cpu-common.h |   3 +-
 include/exec/ramlist.h    |   3 +
 include/sysemu/hostmem.h  |   1 +
 system/physmem.c          |  90 +++++++++++++-----
 target/arm/kvm.c          |  13 +++
 target/i386/kvm/kvm.c     |  18 +++-
 9 files changed, 227 insertions(+), 103 deletions(-)

-- 
2.43.5

Re: [PATCH v3 0/7] hugetlbfs memory HW error fixes

Posted by William Roche 1 year, 2 months ago

Hello David,

I've finally tested many page mapping possibilities and tried to 
identify the error injection reaction on these pages to see if mmap() 
can be used to recover the impacted area.
I'm using the latest upstream kernel I have for that: 
6.12.0-rc7.master.20241117.ol9.x86_64
But I also got similar results with a kernel not supporting 
MADV_DONTNEED, for example: 5.15.0-301.163.5.2.el9uek.x86_64


Let's start with mapping a file without modifying the mapped area:
In this case we should have a clean page cache mapped in the process.
If an error is injected on this page, the kernel doesn't even inform the 
process about the error as the page is replaced (no matter if the 
mapping was shared or not).

The kernel indicates this situation with the following messages:

[10759.371701] Injecting memory failure at pfn 0x10d88e
[10759.374922] Memory failure: 0x10d88e: corrupted page was clean: 
dropped without side effects
[10759.377525] Memory failure: 0x10d88e: recovery action for clean LRU 
page: Recovered


Now when the page content is modified, in the case of standard page 
size, we need to consider a MAP_PRIVATE or MAP_SHARED
- in the case of a MAP_PRIVATE page, this page is corrupted and the 
modified data are lost, the kernel will use the SIGBUS mechanism to 
inform this process if needed.
   But remapping the area sweeps away the poisoned page, and allows the 
process to use the area.

- In the case of a MAP_SHARED page, if the content hasn't been sync'ed 
with the file backend, we also lose the modified data, and the kernel 
can also raise SIGBUS.
   Remapping the area recreates a page cache from the "on disk" file 
content, clearing the error.

In both cases, the kernel indicates messages like:
[41589.578750] Injecting memory failure for pfn 0x122105 at process 
virtual address 0x7f13bad55000
[41589.582237] Memory failure: 0x122105: Sending SIGBUS to testdh:7343 
due to hardware memory corruption
[41589.584907] Memory failure: 0x122105: recovery action for dirty LRU 
page: Recovered


Now in the case of hugetlbfs pages:
This case behaves the same way as the standard page size when using 
MAP_PRIVATE: mmap of the underlying file is able to sweep away the 
poisoned page.
But the MAP_SHARED case is different: mmap() doesn't clear anything. 
fallocate() must be used.


In both cases, the kernel indicates messages like:
[89141.724295] Injecting memory failure for pfn 0x117800 at process 
virtual address 0x7fd148800000
[89141.727103] Memory failure: 0x117800: Sending SIGBUS to testdh:9480 
due to hardware memory corruption
[89141.729829] Memory failure: 0x117800: recovery action for huge page: 
Recovered

Conclusion:
We can't count on the mmap() method only for the hugetlbfs case with 
MAP_SHARED.

So according to these test results, we should change the part of the 
qemu_ram_remap() function (in the 2nd patch) to something like:

+                if (ram_block_discard_range(block, offset + block->fd_offset,
+                                            length) != 0) {
+                    /*
+                     * Fall back to using mmap(), but it cannot repair a
+                     * shared hugetlbfs region. In this case we fail.
+                     */
+                    if (block->fd >= 0 && qemu_ram_is_shared(block) &&
+                        (length > TARGET_PAGE_SIZE)) {
+                        error_report("Memory hugetlbfs poison recovery failure addr: "
+                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     length, addr);
+                        exit(1);
+                    }
+                    qemu_ram_remap_mmap(block, vaddr, page_size, offset);
+                    memory_try_enable_merging(vaddr, size);
+                    qemu_ram_setup_dump(vaddr, size);
                  }

This should also change the subsequent patch accordingly.

(As a side note about the 3rd patch, I'll also adjust the lp_msg[57] 
message size to 54 bytes instead: there is no '0x' prefix on the 
hexadecimal values and the message ends with a zero.)
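
The 54 bytes can be checked with a small sketch. It assumes the suffix is
formatted as " on lost large page <size>@<ram_addr>" with both values
printed as 64-bit hexadecimal numbers without a '0x' prefix;
lp_msg_length() is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Length (without the terminating zero) of the assumed large page
 * suffix for the given page size and ram address.
 */
static int lp_msg_length(uint64_t page_size, uint64_t ram_addr)
{
    /* snprintf(NULL, 0, ...) only computes the formatted length. */
    int n = snprintf(NULL, 0, " on lost large page %" PRIx64 "@%" PRIx64,
                     page_size, ram_addr);
    assert(n > 0);
    return n;
}
```

In the worst case both values take 16 hexadecimal digits, so the text is
20 + 16 + 1 + 16 = 53 characters long, plus one byte for the terminating
zero.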

So if you agree with this v3 proposal (including the above 
modifications), I can submit a v4 version for integration.

Please let me know what you think about that, and if you see any 
additional change we should consider before the integration.

Thanks in advance,
William.

Re: [PATCH v3 0/7] hugetlbfs memory HW error fixes

Posted by David Hildenbrand 1 year, 2 months ago

On 02.12.24 16:41, William Roche wrote:
> Hello David,

Hi,

sorry for not reviewing yet, I was rather sick the last 1.5 weeks.

> 
> I've finally tested many page mapping possibilities and tried to
> identify the error injection reaction on these pages to see if mmap()
> can be used to recover the impacted area.
> I'm using the latest upstream kernel I have for that:
> 6.12.0-rc7.master.20241117.ol9.x86_64
> But I also got similar results with a kernel not supporting
> MADV_DONTNEED, for example: 5.15.0-301.163.5.2.el9uek.x86_64
> 
> 
> Let's start with mapping a file without modifying the mapped area:
> In this case we should have a clean page cache mapped in the process.
> If an error is injected on this page, the kernel doesn't even inform the
> process about the error as the page is replaced (no matter if the
> mapping was shared or not).
> 
> The kernel indicates this situation with the following messages:
> 
> [10759.371701] Injecting memory failure at pfn 0x10d88e
> [10759.374922] Memory failure: 0x10d88e: corrupted page was clean:
> dropped without side effects
> [10759.377525] Memory failure: 0x10d88e: recovery action for clean LRU
> page: Recovered

Right. The reason here is that we can simply allocate a new page and 
load data from disk. No corruption.

> 
> 
> Now when the page content is modified, in the case of standard page
> size, we need to consider a MAP_PRIVATE or MAP_SHARED
> - in the case of a MAP_PRIVATE page, this page is corrupted and the
> modified data are lost, the kernel will use the SIGBUS mechanism to
> inform this process if needed.
>     But remapping the area sweeps away the poisoned page, and allows the
> process to use the area.
> 
> - In the case of a MAP_SHARED page, if the content hasn't been sync'ed
> with the file backend, we also lose the modified data, and the kernel
> can also raise SIGBUS.
>     Remapping the area recreates a page cache from the "on disk" file
> content, clearing the error.

In a mmap(MAP_SHARED, fd) region that will also require fallocate IIUC.

> 
> In both cases, the kernel indicates messages like:
> [41589.578750] Injecting memory failure for pfn 0x122105 at process
> virtual address 0x7f13bad55000
> [41589.582237] Memory failure: 0x122105: Sending SIGBUS to testdh:7343
> due to hardware memory corruption
> [41589.584907] Memory failure: 0x122105: recovery action for dirty LRU
> page: Recovered
>
> Now in the case of hugetlbfs pages:
> This case behaves the same way as the standard page size when using
> MAP_PRIVATE: mmap of the underlying file is able to sweep away the
> poisoned page.
> But the MAP_SHARED case is different: mmap() doesn't clear anything.
> fallocate() must be used.

Yes, I recall that is what I initially said. The behavior with 
MAP_SHARED should be consistent between hugetlb and !hugetlb.


> 
> 
> In both cases, the kernel indicates messages like:
> [89141.724295] Injecting memory failure for pfn 0x117800 at process
> virtual address 0x7fd148800000
> [89141.727103] Memory failure: 0x117800: Sending SIGBUS to testdh:9480
> due to hardware memory corruption
> [89141.729829] Memory failure: 0x117800: recovery action for huge page:
> Recovered
> 
> Conclusion:
> We can't count on the mmap() method only for the hugetlbfs case with
> MAP_SHARED.
> 
> So according to these test results, we should change the part of the
> qemu_ram_remap() function (in the 2nd patch) to something like:
> 
> +                if (ram_block_discard_range(block, offset + block->fd_offset,
> +                                            length) != 0) {
> +                    /*
> +                     * Fall back to using mmap(), but it cannot repair a
> +                     * shared hugetlbfs region. In this case we fail.
> +                     */


But why do we special-case hugetlb here? How would mmap(MAP_FIXED) help 
to discard dirty pagecache data in a mmap(MAP_SHARED, fd) mapping?

-- 
Cheers,

David / dhildenb

Re: [PATCH v3 0/7] hugetlbfs memory HW error fixes

Posted by William Roche 1 year, 2 months ago

On 12/2/24 17:00, David Hildenbrand wrote:
> On 02.12.24 16:41, William Roche wrote:
>> Hello David,
> 
> Hi,
> 
> sorry for not reviewing yet, I was rather sick the last 1.5 weeks.

I hope you get well soon!

>> I've finally tested many page mapping possibilities and tried to
>> identify the error injection reaction on these pages to see if mmap()
>> can be used to recover the impacted area.
>> I'm using the latest upstream kernel I have for that:
>> 6.12.0-rc7.master.20241117.ol9.x86_64
>> But I also got similar results with a kernel not supporting
>> MADV_DONTNEED, for example: 5.15.0-301.163.5.2.el9uek.x86_64
>>
>>
>> Let's start with mapping a file without modifying the mapped area:
>> In this case we should have a clean page cache mapped in the process.
>> If an error is injected on this page, the kernel doesn't even inform the
>> process about the error as the page is replaced (no matter if the
>> mapping was shared or not).
>>
>> The kernel indicates this situation with the following messages:
>>
>> [10759.371701] Injecting memory failure at pfn 0x10d88e
>> [10759.374922] Memory failure: 0x10d88e: corrupted page was clean:
>> dropped without side effects
>> [10759.377525] Memory failure: 0x10d88e: recovery action for clean LRU
>> page: Recovered
> 
> Right. The reason here is that we can simply allocate a new page and 
> load data from disk. No corruption.
> 
>>
>>
>> Now when the page content is modified, in the case of standard page
>> size, we need to consider a MAP_PRIVATE or MAP_SHARED
>> - in the case of a MAP_PRIVATE page, this page is corrupted and the
>> modified data are lost, the kernel will use the SIGBUS mechanism to
>> inform this process if needed.
>>     But remapping the area sweeps away the poisoned page, and allows the
>> process to use the area.
>>
>> - In the case of a MAP_SHARED page, if the content hasn't been sync'ed
>> with the file backend, we also lose the modified data, and the kernel
>> can also raise SIGBUS.
>>     Remapping the area recreates a page cache from the "on disk" file
>> content, clearing the error.
> 
> In a mmap(MAP_SHARED, fd) region that will also require fallocate IIUC.

I would have expected the same thing, but what I noticed is that in the 
case of !hugetlb, even poisoned shared memory seems to be recovered with:
mmap(location, size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, fd, 0)

But we can decide that the normal behavior in this case would be to 
require an fallocate() call, and if this call fails, we fail the recovery.
My tests showed that a standard sized page can be replaced by a new one 
by calling the above mmap(), whereas the shared hugetlb case doesn't work
this way.

>>
>> In both cases, the kernel indicates messages like:
>> [41589.578750] Injecting memory failure for pfn 0x122105 at process
>> virtual address 0x7f13bad55000
>> [41589.582237] Memory failure: 0x122105: Sending SIGBUS to testdh:7343
>> due to hardware memory corruption
>> [41589.584907] Memory failure: 0x122105: recovery action for dirty LRU
>> page: Recovered
>>
>> Now in the case of hugetlbfs pages:
>> This case behaves the same way as the standard page size when using
>> MAP_PRIVATE: mmap of the underlying file is able to sweep away the
>> poisoned page.
>> But the MAP_SHARED case is different: mmap() doesn't clear anything.
>> fallocate() must be used.
> 
> Yes, I recall that is what I initially said. The behavior with 
> MAP_SHARED should be consistent between hugetlb and !hugetlb.

The tests showed that they are different.

>>
>>
>> In both cases, the kernel indicates messages like:
>> [89141.724295] Injecting memory failure for pfn 0x117800 at process
>> virtual address 0x7fd148800000
>> [89141.727103] Memory failure: 0x117800: Sending SIGBUS to testdh:9480
>> due to hardware memory corruption
>> [89141.729829] Memory failure: 0x117800: recovery action for huge page:
>> Recovered
>>
>> Conclusion:
>> We can't count on the mmap() method only for the hugetlbfs case with
>> MAP_SHARED.
>>

At the end of this email, I included the source code of a simplistic 
test case that shows that the page is replaced in the case of standard 
page size.

The idea of this test is simple:

1/ Create a local FILE with:
# dd if=/dev/zero of=./FILE bs=4k count=2
2+0 records in
2+0 records out
8192 bytes (8.2 kB, 8.0 KiB) copied, 0.000337674 s, 24.3 MB/s

2/ As root run:
# ./poisonedShared4k
Mapping 8192 bytes from file FILE
Reading and writing the first 2 pages content:
Read: Read: Wrote: Initial mem page 0
Wrote: Initial mem page 1
Data pages at 0x7f71a19d6000  physically 0x124fb0000
Data pages at 0x7f71a19d7000  physically 0x128ce4000
Poisoning 4k at 0x7f71a19d6000
Signal 7 received
	code 4		Signal code
	addr 0x7f71a19d6000	Memory location
	si_addr_lsb 12
siglongjmp used
Remapping the poisoned page
Reading and writing the first 2 pages content:
Read: Read: Initial mem page 1
Wrote: Rewrite mem page 0
Wrote: Rewrite mem page 1
Data pages at 0x7f71a19d6000  physically 0x10c367000
Data pages at 0x7f71a19d7000  physically 0x128ce4000


  ---

As we can see, this process:
- maps the FILE,
- tries to read and write the beginning of the first 2 pages
- gives their physical addresses
- poisons the first page with a madvise(MADV_HWPOISON) call
- shows the SIGBUS signal received and recovers from it
- simply remaps the same page from the file
- tries again to read and write the beginning of the first 2 pages
- gives their physical addresses
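
As an aside, the two siginfo fields printed in the output above can be
interpreted as sketched below. The enum names are local to this sketch
(the values 4 and 5 are the Linux BUS_MCEERR_AR and BUS_MCEERR_AO
si_code values), and both helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of the Linux machine-check si_code values for SIGBUS. */
enum {
    SKETCH_BUS_MCEERR_AR = 4, /* action required */
    SKETCH_BUS_MCEERR_AO = 5  /* action optional */
};

/*
 * si_addr_lsb is the least significant valid bit of si_addr, so the
 * reported corruption covers 1 << si_addr_lsb bytes.
 */
static size_t poisoned_region_size(int si_addr_lsb)
{
    assert(si_addr_lsb >= 0 && si_addr_lsb < (int)(sizeof(size_t) * 8));
    return (size_t)1 << si_addr_lsb;
}

/* An "action required" error must be handled before the access is retried. */
static int is_action_required(int si_code)
{
    return si_code == SKETCH_BUS_MCEERR_AR;
}
```

With si_addr_lsb 12 as in the output above, the region is one 4k page; a
value of 21 would describe a 2M large page.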

  ---

The test (run on 6.12.0-rc7.master.20241117.ol9.x86_64) showed the 
memory is usable after the remap.
Do you see a different behavior with an even more recent kernel?

>> So according to these test results, we should change the part of the
>> qemu_ram_remap() function (in the 2nd patch) to something like:
>>
>> +                if (ram_block_discard_range(block, offset + block->fd_offset,
>> +                                            length) != 0) {
>> +                    /*
>> +                     * Fall back to using mmap(), but it cannot repair a
>> +                     * shared hugetlbfs region. In this case we fail.
>> +                     */
>
>
> But why do we special-case hugetlb here? How would mmap(MAP_FIXED) help
> to discard dirty pagecache data in a mmap(MAP_SHARED, fd) mapping?

You can see the behavior with the test case.

But for Qemu, we could decide to ignore that, and choose to fail in the 
generic case:

+                    /*
+                     * Fall back to using mmap(), but it should not repair a
+                     * shared file memory region. In this case we fail.
+                     */
+                    if (block->fd >= 0 && qemu_ram_is_shared(block)) {
+                        error_report("Shared memory poison recovery failure addr: "
+                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     length, addr);
+                        exit(1);
+                    }

Do you think this would be more secure?

HTH,
William.


  ---------------------------------

#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <stdint.h>
#include <signal.h>
#include <string.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <setjmp.h>

#define PAGEMAP_ENTRY 8
#define GET_BIT(X,Y) (((X) & ((uint64_t)1 << (Y))) >> (Y))
#define GET_PFN(X) ((X) & 0x7FFFFFFFFFFFFFULL)
const int __endian_bit = 1;
#define is_bigendian() ( (*(char*)&__endian_bit) == 0 )

#define ALLOC_PAGES 2
#define myFile "FILE"
static sigjmp_buf jmpbuf;

/*
  * Generate an error on the given page.
  */
static void memory_error_advise(void* virtual_page) {
    int ret;

    printf("Poisoning 4k at %p\n", virtual_page);
    if (sigsetjmp(jmpbuf, 1) == 0) {
       ret = madvise(virtual_page, 4096, MADV_HWPOISON);
       if (ret)
          printf("Poisoning failed - madvise: %s\n", strerror(errno));
    }
}

static void print_physical_address(uint64_t virt_addr) {
    char path_buf [0x100];
    FILE * f;
    uint64_t read_val, file_offset, pfn = 0;
    long pgsz;
    unsigned char c_buf[PAGEMAP_ENTRY];
    pid_t my_pid = getpid();
    int status, i;

    sprintf(path_buf, "/proc/%u/pagemap", my_pid);

    f = fopen(path_buf, "rb");
    if(!f){
       printf("Error! Cannot open %s\n", path_buf);
       exit(EXIT_FAILURE);
    }

    pgsz = getpagesize();
    file_offset = virt_addr / pgsz * PAGEMAP_ENTRY;
    status = fseek(f, file_offset, SEEK_SET);
    if(status){
       perror("Failed to do fseek!");
       fclose(f);
       exit(EXIT_FAILURE);
    }

    for(i=0; i < PAGEMAP_ENTRY; i++){
       int c = getc(f);
       if(c==EOF){
          fclose(f);
          exit(EXIT_FAILURE);
       }
       if(is_bigendian())
            c_buf[i] = c;
       else
            c_buf[PAGEMAP_ENTRY - i - 1] = c;
    }
    fclose(f);

    read_val = 0;
    for(i=0; i < PAGEMAP_ENTRY; i++){
       read_val = (read_val << 8) + c_buf[i];
    }

    if(GET_BIT(read_val, 63)) { // Bit 63 page present
       pfn = GET_PFN(read_val);
    } else {
       printf("Page not present !\n");
    }
    if(GET_BIT(read_val, 62)) // Bit 62 page swapped
       printf("Page swapped\n");

    if (pfn == 0) {
       printf("Virt address translation 0x%llx failed\n",
              (unsigned long long)virt_addr);
       exit(EXIT_FAILURE);
    }

    printf("Data pages at 0x%llx  physically 0x%llx\n",
          (unsigned long long)virt_addr, (unsigned long long)pfn * pgsz);
}

/*
  * SIGBUS handler to display the given information.
  */
static void sigbus_action(int signum, siginfo_t *siginfo, void *ctx) {
    printf("Signal %d received\n", signum);
    printf("\tcode %d\t\tSignal code\n", siginfo->si_code);
    printf("\taddr %p\tMemory location\n", siginfo->si_addr);
    printf("\tsi_addr_lsb %d\n", siginfo->si_addr_lsb);

   if (siginfo->si_code == 4) { /* BUS_MCEERR_AR */
	fprintf(stderr, "siglongjmp used\n");
	siglongjmp(jmpbuf, 1);
   }
}

static void read_write(void* addr, int nb_pages, char* prefix) {
    int i;
    fprintf(stderr, "Reading and writing the first %d pages content:\n",
            nb_pages);
    if (sigsetjmp(jmpbuf, 1) == 0) {
       // read the strings at the beginning of each page.
       for (i=0; i < nb_pages; i++) {
          printf("Read: %s", ((char *)addr+ i*4096));
       }
       // also write something
       for (i=0; i < nb_pages; i++) {
          sprintf(((char *)addr + i*4096), "%s %d\n", prefix, i);
          printf("Wrote: %s %d\n", prefix, i);
       }
    }
}

int main(int argc, char ** argv) {
    int opt, fd, i;
    struct sigaction my_sigaction;
    uint64_t virt_addr, phys_addr;
    void *local_pnt, *v;
    struct stat statbuf;
    off_t s;

    // Need to have the CAP_SYS_ADMIN capability to get PFN values in
    // pagemap.
    if (getuid() != 0) {
       fprintf(stderr, "Usage: %s needs to run as root\n", argv[0]);
       exit(EXIT_FAILURE);
    }

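    // attach our SIGBUS handler (with a fully initialized sigaction).
    memset(&my_sigaction, 0, sizeof(my_sigaction));
    sigemptyset(&my_sigaction.sa_mask);
    my_sigaction.sa_sigaction = sigbus_action;
    my_sigaction.sa_flags = SA_SIGINFO | SA_NODEFER;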
    if (sigaction(SIGBUS, &my_sigaction, NULL) == -1) {
       perror("Signal handler attach failed");
       exit(EXIT_FAILURE);
    }

    fd = open(myFile, O_RDWR);
    if (fd == -1) {
       perror("open");
       exit(EXIT_FAILURE);
    }
    if (fstat(fd, &statbuf) == -1) {
       perror("fstat");
       exit(EXIT_FAILURE);
    }
    s = statbuf.st_size;
    if (s < 2*4096) {
      fprintf(stderr, "File must be at least 2 pages large\n");
      exit(EXIT_FAILURE);
    }

    printf("Mapping %lld bytes from file %s\n", (long long)s, myFile);
    local_pnt = mmap(NULL, s, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (local_pnt == MAP_FAILED) {
      perror("mmap");
      exit(EXIT_FAILURE);
    }
    read_write(local_pnt, 2, "Initial mem page");

    virt_addr = (uint64_t)local_pnt;
    print_physical_address(virt_addr);
    print_physical_address(virt_addr+getpagesize());

    // Explicit error
    memory_error_advise((void*)virt_addr);

    // Remap the poisoned page
    fprintf(stderr, "Remapping the poisoned page\n");
    v = mmap(local_pnt, 4096, PROT_READ|PROT_WRITE,
             MAP_SHARED|MAP_FIXED, fd, 0);
    if ((v == MAP_FAILED) || (v != local_pnt)) {
       perror("mmap");
    }

    read_write(local_pnt, 2, "Rewrite mem page");
    print_physical_address(virt_addr);
    print_physical_address(virt_addr+getpagesize());
    return 0;
}
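
For readers of the program above, the pagemap decoding done in
print_physical_address() can also be expressed on a native 64-bit value,
as in this sketch. The PM_* macro names and the helper functions are
hypothetical; the bit layout (bit 63 present, bit 62 swapped, bits 0-54
PFN) is the one the program relies on.

```c
#include <assert.h>
#include <stdint.h>

#define PM_PRESENT  (UINT64_C(1) << 63)       /* bit 63: page present */
#define PM_SWAPPED  (UINT64_C(1) << 62)       /* bit 62: page swapped */
#define PM_PFN_MASK ((UINT64_C(1) << 55) - 1) /* bits 0-54: PFN */

static int pagemap_present(uint64_t entry)
{
    return (entry & PM_PRESENT) != 0;
}

static int pagemap_swapped(uint64_t entry)
{
    return (entry & PM_SWAPPED) != 0;
}

/*
 * Physical address of a present page, or 0 when the entry carries no
 * usable PFN (not present or swapped).
 */
static uint64_t pagemap_phys_addr(uint64_t entry, uint64_t page_size)
{
    assert(page_size != 0);
    if (!pagemap_present(entry) || pagemap_swapped(entry)) {
        return 0;
    }
    return (entry & PM_PFN_MASK) * page_size;
}
```

A present entry with PFN 0x124fb0 and 4k pages gives 0x124fb0000, the
first physical address shown in the test output.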

Re: [PATCH v3 0/7] hugetlbfs memory HW error fixes

Posted by David Hildenbrand 1 year, 2 months ago

On 03.12.24 01:15, William Roche wrote:
> On 12/2/24 17:00, David Hildenbrand wrote:
>> On 02.12.24 16:41, William Roche wrote:
>>> Hello David,
>>
>> Hi,
>>
>> sorry for not reviewing yet, I was rather sick the last 1.5 weeks.
> 
> I hope you get well soon!

Getting there, thanks! :)

> 
>>> I've finally tested many page mapping possibilities and tried to
>>> identify the error injection reaction on these pages to see if mmap()
>>> can be used to recover the impacted area.
>>> I'm using the latest upstream kernel I have for that:
>>> 6.12.0-rc7.master.20241117.ol9.x86_64
>>> But I also got similar results with a kernel not supporting
>>> MADV_DONTNEED, for example: 5.15.0-301.163.5.2.el9uek.x86_64
>>>
>>>
>>> Let's start with mapping a file without modifying the mapped area:
>>> In this case we should have a clean page cache mapped in the process.
>>> If an error is injected on this page, the kernel doesn't even inform the
>>> process about the error as the page is replaced (no matter if the
>>> mapping was shared or not).
>>>
>>> The kernel indicates this situation with the following messages:
>>>
>>> [10759.371701] Injecting memory failure at pfn 0x10d88e
>>> [10759.374922] Memory failure: 0x10d88e: corrupted page was clean:
>>> dropped without side effects
>>> [10759.377525] Memory failure: 0x10d88e: recovery action for clean LRU
>>> page: Recovered
>>
>> Right. The reason here is that we can simply allocate a new page and
>> load data from disk. No corruption.
>>
>>>
>>>
>>> Now when the page content is modified, in the case of standard page
>>> size, we need to consider a MAP_PRIVATE or MAP_SHARED
>>> - in the case of a MAP_PRIVATE page, this page is corrupted and the
>>> modified data are lost, the kernel will use the SIGBUS mechanism to
>>> inform this process if needed.
>>>      But remapping the area sweeps away the poisoned page, and allows the
>>> process to use the area.
>>>
>>> - In the case of a MAP_SHARED page, if the content hasn't been sync'ed
>>> with the file backend, we also lose the modified data, and the kernel
>>> can also raise SIGBUS.
>>>      Remapping the area recreates a page cache from the "on disk" file
>>> content, clearing the error.
>>
>> In a mmap(MAP_SHARED, fd) region that will also require fallocate IIUC.
> 
> I would have expected the same thing, but what I noticed is that in the
> case of !hugetlb, even poisoned shared memory seem to be recovered with:
> mmap(location, size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, fd, 0)


Let me take a look at your tool below to see if I can find an explanation
of what is happening, because it's weird :)

[...]

> 
> At the end of this email, I included the source code of a simplistic
> test case that shows that the page is replaced in the case of standard
> page size.
> 
> The idea of this test is simple:
> 
> 1/ Create a local FILE with:
> # dd if=/dev/zero of=./FILE bs=4k count=2
> 2+0 records in
> 2+0 records out
> 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.000337674 s, 24.3 MB/s
> 
> 2/ As root run:
> # ./poisonedShared4k
> Mapping 8192 bytes from file FILE
> Reading and writing the first 2 pages content:
> Read: Read: Wrote: Initial mem page 0
> Wrote: Initial mem page 1
> Data pages at 0x7f71a19d6000  physically 0x124fb0000
> Data pages at 0x7f71a19d7000  physically 0x128ce4000
> Poisoning 4k at 0x7f71a19d6000
> Signal 7 received
> 	code 4		Signal code
> 	addr 0x7f71a19d6000	Memory location
> 	si_addr_lsb 12
> siglongjmp used
> Remapping the poisoned page
> Reading and writing the first 2 pages content:
> Read: Read: Initial mem page 1
> Wrote: Rewrite mem page 0
> Wrote: Rewrite mem page 1
> Data pages at 0x7f71a19d6000  physically 0x10c367000
> Data pages at 0x7f71a19d7000  physically 0x128ce4000
> 
> 
>    ---
> 
> As we can see, this process:
> - maps the FILE,
> - tries to read and write the beginning of the first 2 pages
> - gives their physical addresses
> - poison the first page with a madvise(MADV_HWPOISON) call
> - shows the SIGBUS signal received and recovers from it
> - simply remaps the same page from the file
> - tries again to read and write the beginning of the first 2 pages
> - gives their physical addresses
> 


Turns out the code will try to truncate the pagecache page using 
mapping->a_ops->error_remove_folio().

That, however, is only implemented on *some* filesystems.

Most prominently, it is not implemented on shmem.


So if you run your test with shmem (e.g., /tmp/FILE), it doesn't work.

Using fallocate+MADV_DONTNEED seems to work on shmem.
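
A minimal sketch of that fallocate+MADV_DONTNEED sequence, for
illustration: discard_shared_range() is a hypothetical helper (it is not
QEMU's ram_block_discard_range()), and it assumes 'map' is a MAP_SHARED
mapping of 'fd' starting at file offset 0.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Discard [offset, offset + length) of a shared file mapping.
 * Returns 0 on success or a negative errno value.
 */
static int discard_shared_range(int fd, void *map, off_t offset,
                                size_t length)
{
    assert(offset >= 0 && length > 0);

    /* Punch a hole in the backing file; the file size is kept. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, (off_t)length) != 0) {
        return -errno;
    }
    /* Drop the page table entries so the next access faults in a new page. */
    if (madvise((char *)map + offset, length, MADV_DONTNEED) != 0) {
        return -errno;
    }
    return 0;
}
```

After a successful call, reading the range returns zeroes from a freshly
allocated page. On hugetlbfs, offset and length are expected to be
multiples of the huge page size.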

-- 
Cheers,

David / dhildenb

Re: [PATCH v3 0/7] hugetlbfs memory HW error fixes

Posted by William Roche 1 year, 2 months ago

On 12/3/24 15:08, David Hildenbrand wrote:
> [...]
> 
> Let me take a look at your tool below if I can find an explanation of 
> what is happening, because it's weird :)
> 
> [...]
> 
>>
>> At the end of this email, I included the source code of a simplistic
>> test case that shows that the page is replaced in the case of standard
>> page size.
>>
>> The idea of this test is simple:
>>
>> 1/ Create a local FILE with:
>> # dd if=/dev/zero of=./FILE bs=4k count=2
>> 2+0 records in
>> 2+0 records out
>> 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.000337674 s, 24.3 MB/s
>>
>> 2/ As root run:
>> # ./poisonedShared4k
>> Mapping 8192 bytes from file FILE
>> Reading and writing the first 2 pages content:
>> Read: Read: Wrote: Initial mem page 0
>> Wrote: Initial mem page 1
>> Data pages at 0x7f71a19d6000  physically 0x124fb0000
>> Data pages at 0x7f71a19d7000  physically 0x128ce4000
>> Poisoning 4k at 0x7f71a19d6000
>> Signal 7 received
>>     code 4        Signal code
>>     addr 0x7f71a19d6000    Memory location
>>     si_addr_lsb 12
>> siglongjmp used
>> Remapping the poisoned page
>> Reading and writing the first 2 pages content:
>> Read: Read: Initial mem page 1
>> Wrote: Rewrite mem page 0
>> Wrote: Rewrite mem page 1
>> Data pages at 0x7f71a19d6000  physically 0x10c367000
>> Data pages at 0x7f71a19d7000  physically 0x128ce4000
>>
>>
>>    ---
>>
>> As we can see, this process:
>> - maps the FILE,
>> - tries to read and write the beginning of the first 2 pages
>> - gives their physical addresses
>> - poison the first page with a madvise(MADV_HWPOISON) call
>> - shows the SIGBUS signal received and recovers from it
>> - simply remaps the same page from the file
>> - tries again to read and write the beginning of the first 2 pages
>> - gives their physical addresses
>>
> 
> 
> Turns out the code will try to truncate the pagecache page using 
> mapping->a_ops->error_remove_folio().
> 
> That, however, is only implemented on *some* filesystems.
> 
> Most prominently, it is not implemented on shmem as well.
> 
> 
> So if you run your test with shmem (e.g., /tmp/FILE), it doesn't work.

Correct, on tmpfs the test case fails to continue to use the memory area 
and gets a SIGBUS.  And it works with xfs.



> 
> Using fallocate+MADV_DONTNEED seems to work on shmem.
> 

Our new Qemu code first tries the fallocate+MADV_DONTNEED procedure 
for standard sized pages (in ram_block_discard_range()) and only falls 
back to the mmap() use if it fails. So maybe my proposal to implement:

+                    /*
+                     * Fall back to using mmap(), but it should not repair a
+                     * shared file memory region. In this case we fail.
+                     */
+                    if (block->fd >= 0 && qemu_ram_is_shared(block)) {
+                        error_report("Shared memory poison recovery failure addr: "
+                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     length, addr);
+                        exit(1);
+                    }

could be the right choice.

William.

Re: [PATCH v3 0/7] hugetlbfs memory HW error fixes

Posted by David Hildenbrand 1 year, 2 months ago

On 03.12.24 15:39, William Roche wrote:
> On 12/3/24 15:08, David Hildenbrand wrote:
>> [...]
>>
>> Let me take a look at your tool below if I can find an explanation of
>> what is happening, because it's weird :)
>>
>> [...]
>>
>>>
>>> At the end of this email, I included the source code of a simplistic
>>> test case that shows that the page is replaced in the case of standard
>>> page size.
>>>
>>> The idea of this test is simple:
>>>
>>> 1/ Create a local FILE with:
>>> # dd if=/dev/zero of=./FILE bs=4k count=2
>>> 2+0 records in
>>> 2+0 records out
>>> 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.000337674 s, 24.3 MB/s
>>>
>>> 2/ As root run:
>>> # ./poisonedShared4k
>>> Mapping 8192 bytes from file FILE
>>> Reading and writing the first 2 pages content:
>>> Read: Read: Wrote: Initial mem page 0
>>> Wrote: Initial mem page 1
>>> Data pages at 0x7f71a19d6000  physically 0x124fb0000
>>> Data pages at 0x7f71a19d7000  physically 0x128ce4000
>>> Poisoning 4k at 0x7f71a19d6000
>>> Signal 7 received
>>>      code 4        Signal code
>>>      addr 0x7f71a19d6000    Memory location
>>>      si_addr_lsb 12
>>> siglongjmp used
>>> Remapping the poisoned page
>>> Reading and writing the first 2 pages content:
>>> Read: Read: Initial mem page 1
>>> Wrote: Rewrite mem page 0
>>> Wrote: Rewrite mem page 1
>>> Data pages at 0x7f71a19d6000  physically 0x10c367000
>>> Data pages at 0x7f71a19d7000  physically 0x128ce4000
>>>
>>>
>>>     ---
>>>
>>> As we can see, this process:
>>> - maps the FILE,
>>> - tries to read and write the beginning of the first 2 pages
>>> - gives their physical addresses
>>> - poisons the first page with a madvise(MADV_HWPOISON) call
>>> - shows the SIGBUS signal received and recovers from it
>>> - simply remaps the same page from the file
>>> - tries again to read and write the beginning of the first 2 pages
>>> - gives their physical addresses
>>>
>>
>>
>> Turns out the code will try to truncate the pagecache page using
>> mapping->a_ops->error_remove_folio().
>>
>> That, however, is only implemented on *some* filesystems.
>>
>> Most prominently, it is not implemented on shmem.
>>
>>
>> So if you run your test with shmem (e.g., /tmp/FILE), it doesn't work.
> 
> Correct, on tmpfs the test case fails to continue to use the memory area
> and gets a SIGBUS.  And it works with xfs.
> 
> 
> 
>>
>> Using fallocate+MADV_DONTNEED seems to work on shmem.
>>
> 
> Our new Qemu code first tries the fallocate+MADV_DONTNEED procedure
> for standard sized pages (in ram_block_discard_range()) and only falls
> back to mmap() if that fails. So maybe my proposal to implement:
> 
> +                    /*
> +                     * Fold back to using mmap(), but it should not repair a
> +                     * shared file memory region. In this case we fail.
> +                     */
> +                    if (block->fd >= 0 && qemu_ram_is_shared(block)) {
> +                        error_report("Shared memory poison recovery failure addr: "
> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                                     length, addr);
> +                        exit(1);
> +                    }
> 
> Could be the right choice.

Right. But then, what about a mmap(MAP_PRIVATE, shmem), where the 
pagecache page is poisoned and needs an explicit fallocate? :)

It's all tricky. I wonder if we should just say "if it's backed by a 
file, and we cannot discard, then mmap() can't fix it reliably".

if (block->fd >= 0) {
	...
}

After all, we don't even expect the fallocate/MADV_DONTNEED to ever fail 
:) So I was also wondering if we could get rid of the mmap(MAP_FIXED) 
completely ... but who knows what older Linux kernels do.

-- 
Cheers,

David / dhildenb

Re: [PATCH v3 0/7] hugetlbfs memory HW error fixes

Posted by William Roche 1 year, 2 months ago

On 12/3/24 16:00, David Hildenbrand wrote:
> On 03.12.24 15:39, William Roche wrote:
>> [...]
>> Our new Qemu code first tries the fallocate+MADV_DONTNEED procedure
>> for standard sized pages (in ram_block_discard_range()) and only falls
>> back to mmap() if that fails. So maybe my proposal to implement:
>>
>> +                    /*
>> +                     * Fold back to using mmap(), but it should not repair a
>> +                     * shared file memory region. In this case we fail.
>> +                     */
>> +                    if (block->fd >= 0 && qemu_ram_is_shared(block)) {
>> +                        error_report("Shared memory poison recovery failure addr: "
>> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> +                                     length, addr);
>> +                        exit(1);
>> +                    }
>>
>> Could be the right choice.
> 
> Right. But then, what about a mmap(MAP_PRIVATE, shmem), where the 
> pagecache page is poisoned and needs an explicit fallocate? :)
> 
> It's all tricky. I wonder if we should just say "if it's backed by a 
> file, and we cannot discard, then mmap() can't fix it reliably".
> 
> if (block->fd >= 0) {
>      ...
> }
> 
> After all, we don't even expect the fallocate/MADV_DONTNEED to ever 
> fail :) So I was also wondering if we could get rid of the 
> mmap(MAP_FIXED) completely ... but who knows what older Linux kernels do.

I agree that we expect the ram_block_discard_range() function to be 
working on recent kernels, and to do what's necessary to reuse the 
poisoned memory area.

The case where older kernels could be a problem is where fallocate() 
would not work on a given file, or madvise(MADV_DONTNEED or MADV_REMOVE) 
would not work on standard sized pages (ram_block_discard_range() 
currently avoids using madvise on huge pages).

In this case, the generic/minimal way to make the memory usable again 
(in all cases) would be to:
- fallocate/PUNCH_HOLE the given file (if any)
- and remap the area
even if it's not _mandatory_ in all cases.

Would you like me to add an fallocate(PUNCH_HOLE) call in the helper 
function qemu_ram_remap_mmap() when a file descriptor is provided 
(before remapping the area) ?

This way, we don't need to know if ram_block_discard_range() failed on 
the fallocate() or the madvise(); in the worst case scenario, we would 
PUNCH twice. If fallocate fails or mmap fails, we exit.
I haven't seen a problem punching a file twice - do you see any problem ?

Do you find this possibility acceptable ? Or should I just go for the 
immediate failure when ram_block_discard_range() fails on a case with a 
file descriptor as you suggest ?

Please let me know if you find any problem with this approach, as it 
could help this poison recovery scenario work on more kernels.

Thanks in advance for your feedback.
William.

Re: [PATCH v3 0/7] hugetlbfs memory HW error fixes

Posted by David Hildenbrand 1 year, 2 months ago

On 06.12.24 19:26, William Roche wrote:
> On 12/3/24 16:00, David Hildenbrand wrote:
>> On 03.12.24 15:39, William Roche wrote:
>>> [...]
>>> Our new Qemu code first tries the fallocate+MADV_DONTNEED procedure
>>> for standard sized pages (in ram_block_discard_range()) and only falls
>>> back to mmap() if that fails. So maybe my proposal to implement:
>>>
>>> +                    /*
>>> +                     * Fold back to using mmap(), but it should not repair a
>>> +                     * shared file memory region. In this case we fail.
>>> +                     */
>>> +                    if (block->fd >= 0 && qemu_ram_is_shared(block)) {
>>> +                        error_report("Shared memory poison recovery failure addr: "
>>> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>>> +                                     length, addr);
>>> +                        exit(1);
>>> +                    }
>>>
>>> Could be the right choice.
>>
>> Right. But then, what about a mmap(MAP_PRIVATE, shmem), where the
>> pagecache page is poisoned and needs an explicit fallocate? :)
>>
>> It's all tricky. I wonder if we should just say "if it's backed by a
>> file, and we cannot discard, then mmap() can't fix it reliably".
>>
>> if (block->fd >= 0) {
>>       ...
>> }
>>
>> After all, we don't even expect the fallocate/MADV_DONTNEED to ever
>> fail :) So I was also wondering if we could get rid of the
>> mmap(MAP_FIXED) completely ... but who knows what older Linux kernels do.
> 

Hi,

> I agree that we expect the ram_block_discard_range() function to be
> working on recent kernels, and to do what's necessary to reuse the
> poisoned memory area.
> 
> The case where older kernels could be a problem is where fallocate()
> would not work on a given file

In that case we cannot possibly handle it correctly in all cases with 
mmap(MAP_FIXED), especially not with shmem/hugetlb :/

So failing is correct.

> , or madvise(MADV_DONTNEED or MADV_REMOVE)
> would not work on standard sized pages.

I assume both can be expected to work in any reasonable Linux setup 
today (-hugetlb) :)

> As ram_block_discard_range()
> currently avoids using madvise on huge pages.

Right, because support might not be around for hugetlb.

> 
> In this case, the generic/minimal way to make the memory usable again
> (in all cases) would be to:
> - fallocate/PUNCH_HOLE the given file (if any)
> - and remap the area
> even if it's not _mandatory_ in all cases.
> 
> Would you like me to add an fallocate(PUNCH_HOLE) call in the helper
> function qemu_ram_remap_mmap() when a file descriptor is provided
> (before remapping the area) ?
> 
> This way, we don't need to know if ram_block_discard_range() failed on
> the fallocate() or the madvise(); in the worst case scenario, we would
> PUNCH twice. If fallocate fails or mmap fails, we exit.
> I haven't seen a problem punching a file twice - do you see any problem ?

Hm. I'd like to avoid another fallocate(). It would really only make 
sense if we expect fallocate() to work but madvise() to fail; and I 
don't think that's our expectation.

virtio-balloon has been using ram_block_discard_range() forever, and so 
far nobody really ever reported any of the errors from the function to 
the best of my knowledge ...


> 
> Do you find this possibility acceptable ? Or should I just go for the
> immediate failure when ram_block_discard_range() fails on a case with a
> file descriptor as you suggest ?
> 
> Please let me know if you find any problem with this approach, as it
> could help to have this poison recovery scenario to work on more kernels.

I'd say, let's keep it simple. Try ram_block_discard_range(); if that 
fails and (a) we have a file, bail out, or (b) we don't have a file, do 
the remap().

If we ever run into issues with that approach, we can investigate why it 
fails and what to do about it (e.g., fallocate).
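
The rule above can be written down as a small decision function. This is
only a sketch of the policy as described in this mail, not code from the
series; the enum and the function name choose_recovery() are made up for
illustration:

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only; not taken from QEMU. */
enum recovery_action {
    RECOVERY_DONE,      /* discard succeeded, nothing else to do */
    RECOVERY_BAIL_OUT,  /* discard failed on file-backed memory: give up */
    RECOVERY_REMAP,     /* discard failed on anonymous memory: mmap() again */
};

/*
 * discard_ok: whether the discard step (ram_block_discard_range() in the
 *             thread) succeeded.
 * fd:         backing file descriptor, or a negative value if there is none.
 */
static enum recovery_action choose_recovery(bool discard_ok, int fd)
{
    if (discard_ok) {
        return RECOVERY_DONE;
    }
    if (fd >= 0) {
        return RECOVERY_BAIL_OUT;
    }
    return RECOVERY_REMAP;
}
```

Keeping the decision separate from the system calls makes it easy to see
that the mmap() path is only ever reached for memory without a file.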

My best guess is that the whole remap part can be avoided completely.

-- 
Cheers,

David / dhildenb

[PATCH v3 1/7] hwpoison_page_list and qemu_ram_remap are based on pages

Posted by William Roche 1 year, 2 months ago

From: William Roche <william.roche@oracle.com>

The list of hwpoison pages used to remap the memory on reset
is based on the backend real page size. When dealing with
hugepages, we create a single entry for the entire page.
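
To illustrate the "single entry for the entire page" idea, here is a
minimal sketch of the alignment step, assuming the address is rounded
down to the start of its backing page before the list lookup. The helper
name align_down_to_page() is hypothetical (the patch itself uses
QEMU_ALIGN_DOWN()):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: round addr down to the
 * start of the page of size page_size that contains it.
 */
static uintptr_t align_down_to_page(uintptr_t addr, size_t page_size)
{
    return addr / page_size * page_size;
}
```

Two poisoned addresses inside the same 2 MiB huge page then map to the
same list key, so only one entry is created for that page.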

Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c       |  6 +++++-
 include/exec/cpu-common.h |  3 ++-
 system/physmem.c          | 32 ++++++++++++++++++++++++++------
 3 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 801cff16a5..24c0c4ce3f 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1278,7 +1278,7 @@ static void kvm_unpoison_all(void *param)
 
     QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
         QLIST_REMOVE(page, list);
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr);
         g_free(page);
     }
 }
@@ -1286,6 +1286,10 @@ static void kvm_unpoison_all(void *param)
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
+    size_t page_size = qemu_ram_pagesize_from_addr(ram_addr);
+
+    if (page_size > TARGET_PAGE_SIZE)
+        ram_addr = QEMU_ALIGN_DOWN(ram_addr, page_size);
 
     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 638dc806a5..59fbb324fa 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -67,7 +67,7 @@ typedef uintptr_t ram_addr_t;
 
 /* memory API */
 
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
+void qemu_ram_remap(ram_addr_t addr);
 /* This should not be used by devices.  */
 ram_addr_t qemu_ram_addr_from_host(void *ptr);
 ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
@@ -108,6 +108,7 @@ bool qemu_ram_is_named_file(RAMBlock *rb);
 int qemu_ram_get_fd(RAMBlock *rb);
 
 size_t qemu_ram_pagesize(RAMBlock *block);
+size_t qemu_ram_pagesize_from_addr(ram_addr_t addr);
 size_t qemu_ram_pagesize_largest(void);
 
 /**
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..410eabd29d 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1665,6 +1665,19 @@ size_t qemu_ram_pagesize(RAMBlock *rb)
     return rb->page_size;
 }
 
+/* Return backend real page size used for the given ram_addr. */
+size_t qemu_ram_pagesize_from_addr(ram_addr_t addr)
+{
+    RAMBlock *rb;
+
+    RCU_READ_LOCK_GUARD();
+    rb =  qemu_get_ram_block(addr);
+    if (!rb) {
+        return TARGET_PAGE_SIZE;
+    }
+    return qemu_ram_pagesize(rb);
+}
+
 /* Returns the largest size of page in use */
 size_t qemu_ram_pagesize_largest(void)
 {
@@ -2167,17 +2180,22 @@ void qemu_ram_free(RAMBlock *block)
 }
 
 #ifndef _WIN32
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+void qemu_ram_remap(ram_addr_t addr)
 {
     RAMBlock *block;
     ram_addr_t offset;
     int flags;
     void *area, *vaddr;
     int prot;
+    size_t page_size;
 
     RAMBLOCK_FOREACH(block) {
         offset = addr - block->offset;
         if (offset < block->max_length) {
+            /* Respect the pagesize of our RAMBlock */
+            page_size = qemu_ram_pagesize(block);
+            offset = QEMU_ALIGN_DOWN(offset, page_size);
+
             vaddr = ramblock_ptr(block, offset);
             if (block->flags & RAM_PREALLOC) {
                 ;
@@ -2191,21 +2209,23 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                 prot = PROT_READ;
                 prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
                 if (block->fd >= 0) {
-                    area = mmap(vaddr, length, prot, flags, block->fd,
+                    area = mmap(vaddr, page_size, prot, flags, block->fd,
                                 offset + block->fd_offset);
                 } else {
                     flags |= MAP_ANONYMOUS;
-                    area = mmap(vaddr, length, prot, flags, -1, 0);
+                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
                 }
                 if (area != vaddr) {
                     error_report("Could not remap addr: "
                                  RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
-                                 length, addr);
+                                 page_size, addr);
                     exit(1);
                 }
-                memory_try_enable_merging(vaddr, length);
-                qemu_ram_setup_dump(vaddr, length);
+                memory_try_enable_merging(vaddr, page_size);
+                qemu_ram_setup_dump(vaddr, page_size);
             }
+
+            break;
         }
     }
 }
-- 
2.43.5

[PATCH v3 2/7] system/physmem: poisoned memory discard on reboot

Posted by William Roche 1 year, 2 months ago

From: William Roche <william.roche@oracle.com>

Repair poisoned memory locations by calling ram_block_discard_range(),
which punches a hole in the backend file when necessary and regenerates
usable memory.
Fall back to unmapping/remapping the memory location(s) if the kernel
doesn't support the madvise calls used by ram_block_discard_range().

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/physmem.c | 69 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 46 insertions(+), 23 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 410eabd29d..26711df2d2 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2180,13 +2180,37 @@ void qemu_ram_free(RAMBlock *block)
 }
 
 #ifndef _WIN32
+/* Try to recover the given location using mmap */
+static void qemu_ram_remap_mmap(RAMBlock *block, void* vaddr, size_t size,
+                                ram_addr_t offset)
+{
+    int flags, prot;
+    void *area;
+
+    flags = MAP_FIXED;
+    flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE;
+    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
+    prot = PROT_READ;
+    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
+    if (block->fd >= 0) {
+        area = mmap(vaddr, size, prot, flags, block->fd,
+                    offset + block->fd_offset);
+    } else {
+        flags |= MAP_ANONYMOUS;
+        area = mmap(vaddr, size, prot, flags, -1, 0);
+    }
+    if (area != vaddr) {
+        error_report("Could not remap addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                     size, addr);
+        exit(1);
+    }
+}
+
 void qemu_ram_remap(ram_addr_t addr)
 {
     RAMBlock *block;
     ram_addr_t offset;
-    int flags;
-    void *area, *vaddr;
-    int prot;
+    void *vaddr;
     size_t page_size;
 
     RAMBLOCK_FOREACH(block) {
@@ -2202,27 +2226,26 @@ void qemu_ram_remap(ram_addr_t addr)
             } else if (xen_enabled()) {
                 abort();
             } else {
-                flags = MAP_FIXED;
-                flags |= block->flags & RAM_SHARED ?
-                         MAP_SHARED : MAP_PRIVATE;
-                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
-                prot = PROT_READ;
-                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
-                if (block->fd >= 0) {
-                    area = mmap(vaddr, page_size, prot, flags, block->fd,
-                                offset + block->fd_offset);
-                } else {
-                    flags |= MAP_ANONYMOUS;
-                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
-                }
-                if (area != vaddr) {
-                    error_report("Could not remap addr: "
-                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
-                                 page_size, addr);
-                    exit(1);
+                if (ram_block_discard_range(block, offset + block->fd_offset,
+                                            length) != 0) {
+                    /*
+                     * Fold back to using mmap(), but it cannot zap pagecache
+                     * pages, only anonymous pages. As soon as we might have
+                     * pagecache pages involved (either private or shared
+                     * mapping), we must be careful.
+                     * We don't take the risk of using mmap and fail now.
+                     */
+                    if (block->fd >= 0 && (qemu_ram_is_shared(block) ||
+                        (length > TARGET_PAGE_SIZE))) {
+                        error_report("Memory poison recovery failure addr: "
+                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     length, addr);
+                        exit(1);
+                    }
+                    qemu_ram_remap_mmap(block, vaddr, page_size, offset);
+                    memory_try_enable_merging(vaddr, size);
+                    qemu_ram_setup_dump(vaddr, size);
                 }
-                memory_try_enable_merging(vaddr, page_size);
-                qemu_ram_setup_dump(vaddr, page_size);
             }
 
             break;
-- 
2.43.5

[PATCH v3 3/7] accel/kvm: Report the loss of a large memory page

Posted by William Roche 1 year, 2 months ago

From: William Roche <william.roche@oracle.com>

When a large page is impacted by a memory error, complete
the existing Qemu error message to indicate that the error is
injected in the VM. Also add a similar message for the ARM
platform.
Only when a large page is impacted, we now report:
...Memory Error at QEMU addr X and GUEST addr Y on lost large page SIZE@ADDR of type...
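
As an illustration of how the " on lost large page SIZE@ADDR" fragment
can be built, here is a minimal standalone sketch. It is not the code of
this patch: the function name format_large_page_msg() and its parameters
are hypothetical, it uses snprintf() with an explicit buffer length, and
it assumes that sizes and addresses are printed as plain hexadecimal:

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, for illustration only. Writes the large page
 * fragment into buf when page_size is larger than small_page_size,
 * otherwise writes an empty string.
 */
static void format_large_page_msg(char *buf, size_t buflen,
                                  size_t page_size, size_t small_page_size,
                                  uintptr_t ram_addr)
{
    if (page_size > small_page_size) {
        /* Report the start of the large page, not the faulting address. */
        uintptr_t base = ram_addr / page_size * page_size;

        snprintf(buf, buflen, " on lost large page %" PRIxPTR "@%" PRIxPTR,
                 (uintptr_t)page_size, base);
    } else if (buflen > 0) {
        buf[0] = '\0';
    }
}
```

The resulting string is meant to be spliced into the existing message
through an extra "%s", so the standard page case prints exactly what it
printed before.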

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c   |  4 ----
 system/physmem.c      | 12 ++++++------
 target/arm/kvm.c      | 13 +++++++++++++
 target/i386/kvm/kvm.c | 18 ++++++++++++++----
 4 files changed, 33 insertions(+), 14 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 24c0c4ce3f..8a47aa7258 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1286,10 +1286,6 @@ static void kvm_unpoison_all(void *param)
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
-    size_t page_size = qemu_ram_pagesize_from_addr(ram_addr);
-
-    if (page_size > TARGET_PAGE_SIZE)
-        ram_addr = QEMU_ALIGN_DOWN(ram_addr, page_size);
 
     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
diff --git a/system/physmem.c b/system/physmem.c
index 26711df2d2..b8daf42d20 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2201,7 +2201,7 @@ static void qemu_ram_remap_mmap(RAMBlock *block, void* vaddr, size_t size,
     }
     if (area != vaddr) {
         error_report("Could not remap addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
-                     size, addr);
+                     size, block->offset + offset);
         exit(1);
     }
 }
@@ -2227,7 +2227,7 @@ void qemu_ram_remap(ram_addr_t addr)
                 abort();
             } else {
                 if (ram_block_discard_range(block, offset + block->fd_offset,
-                                            length) != 0) {
+                                            page_size) != 0) {
                     /*
                      * Fold back to using mmap(), but it cannot zap pagecache
                      * pages, only anonymous pages. As soon as we might have
@@ -2236,15 +2236,15 @@ void qemu_ram_remap(ram_addr_t addr)
                      * We don't take the risk of using mmap and fail now.
                      */
                     if (block->fd >= 0 && (qemu_ram_is_shared(block) ||
-                        (length > TARGET_PAGE_SIZE))) {
+                        (page_size > TARGET_PAGE_SIZE))) {
                         error_report("Memory poison recovery failure addr: "
                                      RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
-                                     length, addr);
+                                     page_size, addr);
                         exit(1);
                     }
                     qemu_ram_remap_mmap(block, vaddr, page_size, offset);
-                    memory_try_enable_merging(vaddr, size);
-                    qemu_ram_setup_dump(vaddr, size);
+                    memory_try_enable_merging(vaddr, page_size);
+                    qemu_ram_setup_dump(vaddr, page_size);
                 }
             }
 
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 7b6812c0de..d92b195851 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2366,6 +2366,8 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
 {
     ram_addr_t ram_addr;
     hwaddr paddr;
+    size_t page_size;
+    char lp_msg[57];
 
     assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
 
@@ -2373,6 +2375,14 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
+            page_size = qemu_ram_pagesize_from_addr(ram_addr);
+            if (page_size > TARGET_PAGE_SIZE) {
+                ram_addr = ROUND_DOWN(ram_addr, page_size);
+                sprintf(lp_msg, " on lost large page "
+                    RAM_ADDR_FMT "@" RAM_ADDR_FMT "", page_size, ram_addr);
+            } else {
+                lp_msg[0] = '\0';
+            }
             kvm_hwpoison_page_add(ram_addr);
             /*
              * If this is a BUS_MCEERR_AR, we know we have been called
@@ -2389,6 +2399,9 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
                 kvm_cpu_synchronize_state(c);
                 if (!acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
                     kvm_inject_arm_sea(c);
+                    error_report("Guest Memory Error at QEMU addr %p and "
+                        "GUEST addr 0x%" HWADDR_PRIx "%s of type %s injected",
+                        addr, paddr, lp_msg, "BUS_MCEERR_AR");
                 } else {
                     error_report("failed to record the error");
                     abort();
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 8e17942c3b..182985b159 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -741,6 +741,8 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
     CPUX86State *env = &cpu->env;
     ram_addr_t ram_addr;
     hwaddr paddr;
+    size_t page_size;
+    char lp_msg[57];
 
     /* If we get an action required MCE, it has been injected by KVM
      * while the VM was running.  An action optional MCE instead should
@@ -753,6 +755,14 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
+            page_size = qemu_ram_pagesize_from_addr(ram_addr);
+            if (page_size > TARGET_PAGE_SIZE) {
+                ram_addr = ROUND_DOWN(ram_addr, page_size);
+                sprintf(lp_msg, " on lost large page "
+                        RAM_ADDR_FMT "@" RAM_ADDR_FMT "", page_size, ram_addr);
+            } else {
+                lp_msg[0] = '\0';
+            }
             kvm_hwpoison_page_add(ram_addr);
             kvm_mce_inject(cpu, paddr, code);
 
@@ -763,12 +773,12 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
              */
             if (code == BUS_MCEERR_AR) {
                 error_report("Guest MCE Memory Error at QEMU addr %p and "
-                    "GUEST addr 0x%" HWADDR_PRIx " of type %s injected",
-                    addr, paddr, "BUS_MCEERR_AR");
+                    "GUEST addr 0x%" HWADDR_PRIx "%s of type %s injected",
+                    addr, paddr, lp_msg, "BUS_MCEERR_AR");
             } else {
                  warn_report("Guest MCE Memory Error at QEMU addr %p and "
-                     "GUEST addr 0x%" HWADDR_PRIx " of type %s injected",
-                     addr, paddr, "BUS_MCEERR_AO");
+                     "GUEST addr 0x%" HWADDR_PRIx "%s of type %s injected",
+                     addr, paddr, lp_msg, "BUS_MCEERR_AO");
             }
 
             return;
-- 
2.43.5

[PATCH v3 4/7] numa: Introduce and use ram_block_notify_remap()

Posted by William Roche 1 year, 2 months ago

From: David Hildenbrand <david@redhat.com>

Notify registered listeners about the remap at the end of
qemu_ram_remap() so e.g., a memory backend can re-apply its
settings correctly.
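
The notification pattern can be sketched in isolation as follows. This
is a simplified standalone example, not the RAMBlockNotifier code: the
type RemapNotifier, the plain singly linked list and the recording
callback are all hypothetical and only show how listeners with an
optional callback get called:

```c
#include <stddef.h>

/* Hypothetical, simplified notifier type; not the QEMU structure. */
typedef struct RemapNotifier RemapNotifier;
struct RemapNotifier {
    /* Optional callback; listeners that don't care leave it NULL. */
    void (*remapped)(RemapNotifier *n, void *host, size_t offset,
                     size_t size);
    RemapNotifier *next;
};

static RemapNotifier *remap_notifiers;

static void remap_notifier_add(RemapNotifier *n)
{
    n->next = remap_notifiers;
    remap_notifiers = n;
}

/* Call every registered listener that implements the callback. */
static void notify_remap(void *host, size_t offset, size_t size)
{
    for (RemapNotifier *n = remap_notifiers; n; n = n->next) {
        if (n->remapped) {
            n->remapped(n, host, offset, size);
        }
    }
}

/* Example listener: remembers the last remapped range. */
static int remap_calls;
static size_t last_offset, last_size;

static void record_remap(RemapNotifier *n, void *host, size_t offset,
                         size_t size)
{
    (void)n;
    (void)host;
    remap_calls++;
    last_offset = offset;
    last_size = size;
}
```

A memory backend would use its callback to re-apply its settings to the
remapped range instead of merely recording it.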

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 hw/core/numa.c         | 11 +++++++++++
 include/exec/ramlist.h |  3 +++
 system/physmem.c       |  1 +
 3 files changed, 15 insertions(+)

diff --git a/hw/core/numa.c b/hw/core/numa.c
index 1b5f44baea..4ca67db483 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -895,3 +895,14 @@ void ram_block_notify_resize(void *host, size_t old_size, size_t new_size)
         }
     }
 }
+
+void ram_block_notify_remap(void *host, size_t offset, size_t size)
+{
+    RAMBlockNotifier *notifier;
+
+    QLIST_FOREACH(notifier, &ram_list.ramblock_notifiers, next) {
+        if (notifier->ram_block_remapped) {
+            notifier->ram_block_remapped(notifier, host, offset, size);
+        }
+    }
+}
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
index d9cfe530be..c1dc785a57 100644
--- a/include/exec/ramlist.h
+++ b/include/exec/ramlist.h
@@ -72,6 +72,8 @@ struct RAMBlockNotifier {
                               size_t max_size);
     void (*ram_block_resized)(RAMBlockNotifier *n, void *host, size_t old_size,
                               size_t new_size);
+    void (*ram_block_remapped)(RAMBlockNotifier *n, void *host, size_t offset,
+                               size_t size);
     QLIST_ENTRY(RAMBlockNotifier) next;
 };
 
@@ -80,6 +82,7 @@ void ram_block_notifier_remove(RAMBlockNotifier *n);
 void ram_block_notify_add(void *host, size_t size, size_t max_size);
 void ram_block_notify_remove(void *host, size_t size, size_t max_size);
 void ram_block_notify_resize(void *host, size_t old_size, size_t new_size);
+void ram_block_notify_remap(void *host, size_t offset, size_t size);
 
 GString *ram_block_format(void);
 
diff --git a/system/physmem.c b/system/physmem.c
index b8daf42d20..6b948c0a88 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2246,6 +2246,7 @@ void qemu_ram_remap(ram_addr_t addr)
                     memory_try_enable_merging(vaddr, page_size);
                     qemu_ram_setup_dump(vaddr, page_size);
                 }
+                ram_block_notify_remap(block->host, offset, page_size);
             }
 
             break;
-- 
2.43.5

[PATCH v3 5/7] hostmem: Factor out applying settings

Posted by William Roche 1 year, 2 months ago

From: David Hildenbrand <david@redhat.com>

We want to reuse the functionality when remapping or resizing RAM.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 backends/hostmem.c | 155 ++++++++++++++++++++++++---------------------
 1 file changed, 82 insertions(+), 73 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index 181446626a..bf85d716e5 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -36,6 +36,87 @@ QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_BIND != MPOL_BIND);
 QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_INTERLEAVE != MPOL_INTERLEAVE);
 #endif
 
+static void host_memory_backend_apply_settings(HostMemoryBackend *backend,
+                                               void *ptr, uint64_t size,
+                                               Error **errp)
+{
+    bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
+
+    if (backend->merge) {
+        qemu_madvise(ptr, size, QEMU_MADV_MERGEABLE);
+    }
+    if (!backend->dump) {
+        qemu_madvise(ptr, size, QEMU_MADV_DONTDUMP);
+    }
+#ifdef CONFIG_NUMA
+    unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
+    /* lastbit == MAX_NODES means maxnode = 0 */
+    unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
+    /*
+     * Ensure policy won't be ignored in case memory is preallocated
+     * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
+     * this doesn't catch hugepage case.
+     */
+    unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
+    int mode = backend->policy;
+
+    /*
+     * Check for invalid host-nodes and policies and give more verbose
+     * error messages than mbind().
+     */
+    if (maxnode && backend->policy == MPOL_DEFAULT) {
+        error_setg(errp, "host-nodes must be empty for policy default,"
+                   " or you should explicitly specify a policy other"
+                   " than default");
+        return;
+    } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
+        error_setg(errp, "host-nodes must be set for policy %s",
+                   HostMemPolicy_str(backend->policy));
+        return;
+    }
+
+    /*
+     * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
+     * as argument to mbind() due to an old Linux bug (feature?) which
+     * cuts off the last specified node. This means backend->host_nodes
+     * must have MAX_NODES+1 bits available.
+     */
+    assert(sizeof(backend->host_nodes) >=
+           BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
+    assert(maxnode <= MAX_NODES);
+
+#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
+    if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
+        /*
+         * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
+         * silently picks the first node.
+         */
+        mode = MPOL_PREFERRED_MANY;
+    }
+#endif
+
+    if (maxnode &&
+        mbind(ptr, size, mode, backend->host_nodes, maxnode + 1, flags)) {
+        if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
+            error_setg_errno(errp, errno,
+                             "cannot bind memory to host NUMA nodes");
+            return;
+        }
+    }
+#endif
+    /*
+     * Preallocate memory after the NUMA policy has been instantiated.
+     * This is necessary to guarantee memory is allocated with
+     * specified NUMA policy in place.
+     */
+    if (backend->prealloc &&
+        !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
+                           ptr, size, backend->prealloc_threads,
+                           backend->prealloc_context, async, errp)) {
+        return;
+    }
+}
+
 char *
 host_memory_backend_get_name(HostMemoryBackend *backend)
 {
@@ -337,7 +418,6 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
     void *ptr;
     uint64_t sz;
     size_t pagesize;
-    bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
 
     if (!bc->alloc) {
         return;
@@ -357,78 +437,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
         return;
     }
 
-    if (backend->merge) {
-        qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
-    }
-    if (!backend->dump) {
-        qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
-    }
-#ifdef CONFIG_NUMA
-    unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
-    /* lastbit == MAX_NODES means maxnode = 0 */
-    unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
-    /*
-     * Ensure policy won't be ignored in case memory is preallocated
-     * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
-     * this doesn't catch hugepage case.
-     */
-    unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
-    int mode = backend->policy;
-
-    /* check for invalid host-nodes and policies and give more verbose
-     * error messages than mbind(). */
-    if (maxnode && backend->policy == MPOL_DEFAULT) {
-        error_setg(errp, "host-nodes must be empty for policy default,"
-                   " or you should explicitly specify a policy other"
-                   " than default");
-        return;
-    } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
-        error_setg(errp, "host-nodes must be set for policy %s",
-                   HostMemPolicy_str(backend->policy));
-        return;
-    }
-
-    /*
-     * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
-     * as argument to mbind() due to an old Linux bug (feature?) which
-     * cuts off the last specified node. This means backend->host_nodes
-     * must have MAX_NODES+1 bits available.
-     */
-    assert(sizeof(backend->host_nodes) >=
-           BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
-    assert(maxnode <= MAX_NODES);
-
-#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
-    if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
-        /*
-         * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
-         * silently picks the first node.
-         */
-        mode = MPOL_PREFERRED_MANY;
-    }
-#endif
-
-    if (maxnode &&
-        mbind(ptr, sz, mode, backend->host_nodes, maxnode + 1, flags)) {
-        if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
-            error_setg_errno(errp, errno,
-                             "cannot bind memory to host NUMA nodes");
-            return;
-        }
-    }
-#endif
-    /*
-     * Preallocate memory after the NUMA policy has been instantiated.
-     * This is necessary to guarantee memory is allocated with
-     * specified NUMA policy in place.
-     */
-    if (backend->prealloc && !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
-                                                ptr, sz,
-                                                backend->prealloc_threads,
-                                                backend->prealloc_context,
-                                                async, errp)) {
-        return;
-    }
+    host_memory_backend_apply_settings(backend, ptr, sz, errp);
 }
 
 static bool
-- 
2.43.5

[PATCH v3 6/7] hostmem: Handle remapping of RAM

Posted by William Roche 1 year, 2 months ago

From: David Hildenbrand <david@redhat.com>

Let's register a RAM block notifier and react to remap notifications.
Simply re-apply the settings. Exit if something goes wrong.

Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. It could be
that hostmem still needs to update that flag ...

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 backends/hostmem.c       | 34 ++++++++++++++++++++++++++++++++++
 include/sysemu/hostmem.h |  1 +
 2 files changed, 35 insertions(+)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index bf85d716e5..863f6da11d 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -361,11 +361,37 @@ static void host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
     backend->prealloc_threads = value;
 }
 
+static void host_memory_backend_ram_remapped(RAMBlockNotifier *n, void *host,
+                                             size_t offset, size_t size)
+{
+    HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
+                                              ram_notifier);
+    Error *err = NULL;
+
+    if (!host_memory_backend_mr_inited(backend) ||
+        memory_region_get_ram_ptr(&backend->mr) != host) {
+        return;
+    }
+
+    host_memory_backend_apply_settings(backend, host + offset, size, &err);
+    if (err) {
+        /*
+         * If memory settings can't be successfully applied on remap,
+         * don't take the risk to continue without them.
+         */
+        error_report_err(err);
+        exit(1);
+    }
+}
+
 static void host_memory_backend_init(Object *obj)
 {
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
     MachineState *machine = MACHINE(qdev_get_machine());
 
+    backend->ram_notifier.ram_block_remapped = host_memory_backend_ram_remapped;
+    ram_block_notifier_add(&backend->ram_notifier);
+
     /* TODO: convert access to globals to compat properties */
     backend->merge = machine_mem_merge(machine);
     backend->dump = machine_dump_guest_core(machine);
@@ -379,6 +405,13 @@ static void host_memory_backend_post_init(Object *obj)
     object_apply_compat_props(obj);
 }
 
+static void host_memory_backend_finalize(Object *obj)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+
+    ram_block_notifier_remove(&backend->ram_notifier);
+}
+
 bool host_memory_backend_mr_inited(HostMemoryBackend *backend)
 {
     /*
@@ -595,6 +628,7 @@ static const TypeInfo host_memory_backend_info = {
     .instance_size = sizeof(HostMemoryBackend),
     .instance_init = host_memory_backend_init,
     .instance_post_init = host_memory_backend_post_init,
+    .instance_finalize = host_memory_backend_finalize,
     .interfaces = (InterfaceInfo[]) {
         { TYPE_USER_CREATABLE },
         { }
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index 67f45abe39..98309a9457 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -83,6 +83,7 @@ struct HostMemoryBackend {
     HostMemPolicy policy;
 
     MemoryRegion mr;
+    RAMBlockNotifier ram_notifier;
 };
 
 bool host_memory_backend_mr_inited(HostMemoryBackend *backend);
-- 
2.43.5

[PATCH v3 7/7] system/physmem: Memory settings applied on remap notification

Posted by William Roche 1 year, 2 months ago

From: William Roche <william.roche@oracle.com>

Merging and dump settings are handled by the remap notification
in addition to memory policy and preallocation.
If preallocation is set on a memory block, a qemu_prealloc_mem()
call is also needed after ram_block_discard_range() is used on
this block.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/physmem.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 6b948c0a88..f37c280db2 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2243,8 +2243,6 @@ void qemu_ram_remap(ram_addr_t addr)
                         exit(1);
                     }
                     qemu_ram_remap_mmap(block, vaddr, page_size, offset);
-                    memory_try_enable_merging(vaddr, page_size);
-                    qemu_ram_setup_dump(vaddr, page_size);
                 }
                 ram_block_notify_remap(block->host, offset, page_size);
             }
-- 
2.43.5

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by William Roche 1 year, 4 months ago

On 9/12/24 00:07, David Hildenbrand wrote:

> Hi again,
>
>>>> This is a Qemu RFC to introduce the possibility to deal with hardware
>>>> memory errors impacting hugetlbfs memory backed VMs. When using
>>>> hugetlbfs large pages, any large page location being impacted by an
>>>> HW memory error results in poisoning the entire page, suddenly making
>>>> a large chunk of the VM memory unusable.
>>>>
>>>> The implemented proposal is simply a memory mapping change when an HW
>>>> error
>>>> is reported to Qemu, to transform a hugetlbfs large page into a set of
>>>> standard sized pages. The failed large page is unmapped and a set of
>>>> standard sized pages are mapped in place.
>>>> This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is
>>>> received
>>>> by qemu and the reported location corresponds to a large page.
>
> One clarifying question: you simply replace the hugetlb page by 
> multiple small pages using mmap(MAP_FIXED).

That's right.

> So you
>
> (a) are not able to recover any memory of the original page (as of now)
Once poisoned by the kernel, the original large page is no longer accessible
at all, but the kernel can provide what remains of the poisoned hugetlbfs
page through the backend file (when this file was mapped MAP_SHARED).

> (b) no longer have a hugetlb page and, therefore, possibly a performance
>     degradation, relevant in low-latency applications that really care
>     about the usage of hugetlb pages.
This is correct.

> (c) run into the described inconsistency issues
The inconsistency I agreed on is the case of 2 qemu processes sharing a
piece of the memory (through the ivshmem mechanism), which can be fixed by
disabling recovery for ivshmem-associated hugetlbfs segments.
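
For illustration only, the in-place replacement confirmed above can be
sketched with a plain mmap(MAP_FIXED) call. This is a minimal sketch and not
the code of this series: the helper names are hypothetical, ordinary
anonymous memory stands in for the hugetlbfs page, and the
MAP_PRIVATE | MAP_ANONYMOUS flags are an assumption about how the
replacement pages would be obtained.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: replace [addr, addr + len) with fresh anonymous
 * standard-sized pages at the same virtual address.
 * Returns 0 on success, -1 on failure.
 */
int remap_with_small_pages(void *addr, size_t len)
{
    void *p;

    assert(addr != NULL && len > 0);
    /* MAP_FIXED discards whatever was previously mapped in the range. */
    p = mmap(addr, len, PROT_READ | PROT_WRITE,
             MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == addr ? 0 : -1;
}

/* Self-check: ordinary anonymous memory stands in for the failed page. */
int remap_demo(void)
{
    size_t len = 2 * 1024 * 1024;
    unsigned char *area;
    int ok;

    area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED) {
        return -1;
    }
    memset(area, 0xaa, len);            /* old content of the "failed" page */
    if (remap_with_small_pages(area, len) != 0) {
        munmap(area, len);
        return -1;
    }
    /* Same address, but the replacement pages start out zero-filled. */
    ok = area[0] == 0 && area[len - 1] == 0;
    munmap(area, len);
    return ok ? 0 : -1;
}
```

The virtual address stays valid across the call, but the new pages are
zero-filled, so any still-valid data has to be copied in afterwards.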

> Why is what you propose beneficial over just fallocate(PUNCH_HOLE) the 
> full page and get a fresh, non-poisoned page instead?
>
> Sure, you have to reserve some pages if that ever happens, but what is 
> the big selling point over PUNCH_HOLE + realloc? (sorry if I missed it 
> and it was spelled out)
This project provides an essential component that can't be achieved by
replacing a failed large page with another large page: an uncorrected memory
error on a memory page is a lost piece of memory, and that piece needs to be
identified so that the loss can be indicated to any user. The kernel
granularity for that is the entire page. It marks the page 'poisoned',
making it inaccessible (no matter what the page size, or the size of the
lost memory piece). So recovering an area of a large page impacted by a
memory error has to keep track of the lost area, and there is no other way
but to lower the granularity and split the page into smaller pieces, so that
only the lost area is marked 'poisoned'.

That's the reason why we can't replace a failed large page with another
large page.

We need smaller pages.
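
The bookkeeping this implies can be sketched as follows. This is a
hypothetical illustration, not the code of this series: the callback type,
the function names and the tiny page sizes used in the demo are all
assumptions. In practice the reader would be something like a read of the
backend file that fails on the poisoned piece.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: copy small page 'index' into dst; false if lost. */
typedef bool (*ReadSmallPageFn)(void *opaque, size_t index,
                                void *dst, size_t len);

/*
 * Walk one large page in small-page steps. Readable pieces are copied to
 * 'dst'; unreadable ones are flagged in 'lost' (one entry per small page)
 * so that only those pieces need to be marked poisoned afterwards.
 * Returns the number of lost small pages.
 */
size_t recover_large_page(ReadSmallPageFn read_page, void *opaque,
                          unsigned char *dst, bool *lost,
                          size_t large_size, size_t small_size)
{
    size_t nr, i, nr_lost = 0;

    assert(small_size > 0 && large_size % small_size == 0);
    nr = large_size / small_size;
    for (i = 0; i < nr; i++) {
        lost[i] = !read_page(opaque, i, dst + i * small_size, small_size);
        if (lost[i]) {
            nr_lost++;
        }
    }
    return nr_lost;
}

/* Demo reader: every small page is readable except the one at *opaque. */
bool demo_reader(void *opaque, size_t index, void *dst, size_t len)
{
    if (index == *(size_t *)opaque) {
        return false;
    }
    memset(dst, 0x5a, len);
    return true;
}

/*
 * Demo: a "large page" made of 8 pieces of 64 bytes, with piece 3 lost.
 * Returns the index of the single lost piece, or -1 if the result is
 * not exactly one lost piece.
 */
int recover_demo(void)
{
    static unsigned char dst[8 * 64];
    bool lost[8];
    size_t bad = 3, i;
    int found = -1;

    if (recover_large_page(demo_reader, &bad, dst, lost,
                           sizeof(dst), 64) != 1) {
        return -1;
    }
    for (i = 0; i < 8; i++) {
        if (lost[i]) {
            found = (int)i;
        }
    }
    return found;
}
```

With a single large replacement page there would be nowhere to record that
piece 3 is gone; with small pages, only that piece needs the MADV_HWPOISON
marking described in the cover letter.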


>>>>
>>>> This gives the possibility to:
>>>> - Take advantage of newer hypervisor kernel providing a way to 
>>>> retrieve
>>>> still valid data on the impacted hugetlbfs poisoned large page.
>
> Reading that again, that shouldn't have to be hypervisor-specific. 
> Really, if someone were to extract data from a poisoned hugetlb folio, 
> it shouldn't be hypervisor-specific. The kernel should be able to know 
> which regions are accessible and could allow ways for reading these, 
> one way or the other.
>
> It could just be a fairly hugetlb-special feature that would replace 
> the poisoned page by a fresh hugetlb page where as much page content 
> as possible has been recovered from the old one.
I totally agree that it should be the kernel's role to split the
page and keep track of the valid and lost pieces. This was an aspect of the
high-granularity-mapping (HGM) project you are referring to. But HGM is not
there yet (and may never be), and currently the only automatic memory split
done by the kernel occurs when we are using Transparent Huge Pages (THP).
Unfortunately THP doesn't show (for the moment) all the performance and
memory optimisation possibilities that hugetlbfs use provides. And it's
a large topic I'd prefer not to get into.


>>> How are you dealing with other consumers of the shared memory,
>>> such as vhost-user processes,
>>
>>
>> In the current proposal, I don't deal with this aspect.
>> In fact, any other process sharing the changed memory will
>> continue to map the poisoned large page. So any access to
>> this page will generate a SIGBUS to this other process.
>>
>> In this situation vhost-user processes should continue to receive
>> SIGBUS signals (and probably continue to die because of that).
>
> That's ... suboptimal. :)
True.

>
> Assume you have a 1 GiB page. The guest OS can happily allocate 
> buffers in there so they can end up in vhost-user and crash that 
> process. Without any warning.
I confess that I don't know how, when and where vhost-user processes get
their shared memory locations.
But I agree that a recovered large page is currently not usable to associate
new shared buffers between qemu and external processes.

Note that previously allocated buffers that could have been located on this
page are marked 'poisoned' (after a memory error) on the vhost-user process
the same way they were before this project. The only difference is that,
after a recovered memory error, qemu may continue to see the recovered
address space and use it. But the receiving side (on vhost-user) will fail
when accessing the location.

Can a vhost-user process fail without any warning being reported?
I hope not.

>> So I do see a real problem if 2 qemu processes are sharing the
>> same hugetlbfs segment -- in this case, error recovery should not
>> occur on this piece of the memory. Maybe dealing with this situation
>> with "ivshmem" options is doable (marking the shared segment
>> "not eligible" to hugetlbfs recovery, just like not "share=on"
>> hugetlbfs entries are not eligible)
>> -- I need to think about this specific case.
>>
>> Please let me know if there is a better way to deal with this
>> shared memory aspect and have a better system reaction.
>
> Not creating the inconsistency in the first place :)
Yes :)
Of course I don't want to introduce any inconsistency situation leading to
a memory corruption.
But if we consider that 'ivshmem' memory is not eligible for a recovery,
it means that we still leave the entire large page location poisoned and
there would not be any inconsistency for this memory component. Other
hugetlbfs memory components would still have the possibility to be
partially recovered, and give a higher chance to the VM not to crash
immediately.

>>> vm migration whereby RAM is migrated using file content,
>>
>>
>> Migration doesn't currently work with memory poisoning.
>> You can give a look at the already integrated following commit:
>>
>> 06152b89db64 migration: prevent migration when VM has poisoned memory
>>
>> This proposal doesn't change anything on this side.
>
> That commit is fairly fresh and likely missed the option to *not* 
> migrate RAM by reading it, but instead by migrating it through a 
> shared file. For example, VM live-upgrade (CPR) wants to use that (or 
> is already using that), to avoid RAM migration completely.
When a memory error occurs on a dirty page used for a mapped file,
the data is lost and the file synchronisation should fail with EIO.
You can't rely on the file content to reflect the latest memory content.
So even a migration using such a file should be avoided, in my opinion.


>>> vfio that might have these pages pinned?
>>
>> AFAIK even pinned memory can be impacted by memory error and poisoned
>> by the kernel. Now as I said in the cover letter, I'd like to know if
>> we should take extra care for IO memory, vfio configured memory 
>> buffers...
>
> Assume your GPU has a hugetlb folio pinned via vfio. As soon as you 
> make the guest RAM point at anything else as VFIO is aware of, we end 
> up in the same problem we had when we learned about having to disable 
> balloon inflation (MADV_DONTNEED) as soon as VFIO pinned pages.
>
> We'd have to inform VFIO that the mapping is now different. Otherwise 
> it's really better to crash the VM than having your GPU read/write 
> different data than your CPU reads/writes,
Absolutely true, and fortunately this is not what would happen when the
large poisoned page is still used by the VFIO. After a successful recovery,
the CPU may still be able to read/write on a location where we had a vfio
buffer, but the other side (the device for example) would fail reading or
writing to any location of the poisoned large page.

>>> In general, you cannot simply replace pages by private copies
>>> when somebody else might be relying on these pages to go to
>>> actual guest RAM.
>>
>> This is correct, but the current proposal is dealing with a specific
>> shared memory type: poisoned large pages. So any other process mapping
>> this type of page can't access it without generating a SIGBUS.
>
> Right, and that's the issue. Because, for example, how should the VM 
> be aware that this memory is now special and must not be used for some 
> purposes without leading to problems elsewhere?
That's an excellent question, that I don't have the full answer to. We are
dealing here with a hardware fault situation; the hugetlbfs backend file
still has a poisoned large page, so neither a process that maps it now nor
a process that mapped it before the error will be able to use the segment.
It doesn't mean that they get their own private copy of a page. The only one
getting a private copy (to get what was still valid on the faulted large
page) is qemu. So if we imagine that ivshmem segments (between 2 qemu
processes) don't get this recovery, I'm expecting the data exchange on this
shared memory to fail, just like it does without the recovery mechanism. So
I don't expect any established communication to continue to work, or any new
segment using the recovered area to be successfully created.

But of course I could be missing something here and be too optimistic...

So let's take a step back.

I guess these "sharing" questions would not relate to memory segments that
are not defined as 'share=on', am I right ?

Do ivshmem, vhost-user processes or even vfio only use 'share=on' memory
segments ?

If yes, we could also imagine only enabling recovery for hugetlbfs segments
that do not have the 'share=on' attribute, but we would have to map them
MAP_SHARED in the qemu address space anyway. This may create other kinds of
problems (?), but if these inconsistency questions do not appear with this
approach, it would be easy to adapt and would still enhance hugetlbfs use,
at least for a first version of this feature.

>>> It sounds very hacky and incomplete at first.
>>
>> As you can see, RAS features need to be completed.
>> And if this proposal is incomplete, what other changes should be
>> done to complete it ?
>>
>> I do hope we can discuss this RFC to adapt what is incorrect, or
>> find a better way to address this situation.
>
> One long-term goal people are working on is to allow remapping the 
> hugetlb folios in smaller granularity, such that only a single 
> affected PTE can be marked as poisoned. (used to be called 
> high-granularity-mapping)
I look forward to seeing this implemented, but it seems that it will take
time to appear, and if hugetlbfs RAS can be enhanced for qemu it would be
very useful.

The day a kernel solution works, we can disable CONFIG_HUGETLBFS_RAS and
rely on the kernel to provide the appropriate information. The first
commits will continue to be necessary (dealing with the si_addr_lsb value
of the SIGBUS siginfo, tracking the page size information in the
hwpoison_page_list, and the memory remap on reset with the missing
PUNCH_HOLE).

> However, at the same time, the focus seems to shift towards using 
> guest_memfd instead of hugetlb, once it supports 1 GiB pages and 
> shared memory. It will likely be easier to support mapping 1 GiB pages 
> using PTEs that way, and there are ongoing discussions how that can be 
> achieved more easily.
>
> There are also discussions [1] about not poisoning the mappings at all 
> and handling it differently. But I haven't yet digested how exactly 
> that could look like in reality.
>
>
> [1] https://lkml.kernel.org/r/20240828234958.GE3773488@nvidia.com

Thank you very much for this pointer. I hope a kernel solution (this one or
another) can be implemented and widely adopted within the next 5 to 10
years ;)

In the meantime, we can try to enhance qemu using hugetlbfs for VM memory
which is more and more deployed.

Best regards,
William.

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by William Roche 1 year, 4 months ago

Hello David,

I hope my last week email answered your interrogations about:
     - retrieving the valid data from the lost hugepage
     - the need of smaller pages to replace a failed large page
     - the interaction of memory error and VM migration
     - the non-symmetrical access to a poisoned memory area after a recovery
       Qemu would be able to continue to access the still valid data
       location of the formerly poisoned hugepage, but any other entity
       mapping the large page would not be allowed to use the location.

I understand that this last item _is_ some kind of "inconsistency".
So if I want to make sure that a "shared" memory region (used for vhost-user
processes, vfio or ivshmem) is not recovered, how can I identify what 
region(s)
of a guest memory could be used for such a shared location ?
Is there a way for qemu to identify the memory locations that have been 
shared ?

Could you please let me know if there is an entry point I should consider ?

Thanks in advance for your feedback.
William.

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by David Hildenbrand 1 year, 3 months ago

On 19.09.24 18:52, William Roche wrote:
> Hello David,

Hi William,

sorry for not replying earlier, it somehow fell through the cracks as my 
inbox got flooded :(

> 
> I hope my last week email answered your interrogations about:
>       - retrieving the valid data from the lost hugepage
>       - the need of smaller pages to replace a failed large page
>       - the interaction of memory error and VM migration
>       - the non-symmetrical access to a poisoned memory area after a recovery
>         Qemu would be able to continue to access the still valid data
>         location of the formerly poisoned hugepage, but any other entity
>         mapping the large page would not be allowed to use the location.
> 
> I understand that this last item _is_ some kind of "inconsistency".

That's my biggest concern. Physical memory and its properties are 
described by the QEMU RAMBlock, which includes page size, 
shared/private, and sometimes properties (e.g., uffd).

Adding inconsistency there is really suboptimal :(

> So if I want to make sure that a "shared" memory region (used for vhost-user
> processes, vfio or ivshmem) is not recovered, how can I identify what
> region(s)
> of a guest memory could be used for such a shared location ?
> Is there a way for qemu to identify the memory locations that have been
> shared ?

I'll reply to your other cleanups/improvements, but we can detect if we 
must not discard arbitrary memory (because likely something is relying 
on long-term pinnings) using ram_block_discard_is_disabled().

-- 
Cheers,

David / dhildenb

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by Peter Xu 1 year, 4 months ago

On Thu, Sep 19, 2024 at 06:52:37PM +0200, William Roche wrote:
> Hello David,
> 
> I hope my last week email answered your interrogations about:
>     - retrieving the valid data from the lost hugepage
>     - the need of smaller pages to replace a failed large page
>     - the interaction of memory error and VM migration
>     - the non-symmetrical access to a poisoned memory area after a recovery
>       Qemu would be able to continue to access the still valid data
>       location of the formerly poisoned hugepage, but any other entity
>       mapping the large page would not be allowed to use the location.
> 
> I understand that this last item _is_ some kind of "inconsistency".
> So if I want to make sure that a "shared" memory region (used for vhost-user
> processes, vfio or ivshmem) is not recovered, how can I identify what
> region(s)
> of a guest memory could be used for such a shared location ?
> Is there a way for qemu to identify the memory locations that have been
> shared ?

When there's no vIOMMU, I think all guest pages need to be shared.  With a
vIOMMU it depends on what was mapped by the guest drivers, while in most
sane setups they can still always be shared because the guest OS (if
Linux) should normally have iommu=pt speeding up kernel drivers.

> 
> Could you please let me know if there is an entry point I should consider ?

IMHO it would still be more reasonable to tackle this issue in the kernel
rather than in userspace, simply because it's a problem shared by all
userspaces rather than by the QEMU process alone.

With that, the kernel should guarantee consistency across the different
processes accessing these pages, so logically all these complexities would
be better handled in the kernel once and for all.

There are indeed difficulties in providing it in hugetlbfs with the mm
community, and this is also not the only effort trying to fix 1G page
poisoning with userspace workarounds, see:

https://lore.kernel.org/r/20240924043924.3562257-1-jiaqiyan@google.com

My gut feeling is that either hugetlbfs needs to be fixed (with less hope) or
QEMU in general needs to move over to other file systems for consuming huge
pages.  Poisoning is not the only driving force; we also want to make
postcopy work, which has a similar goal, as David said: being able to map
hugetlbfs pages differently.

You may consider having a look at the gmemfd 1G proposal, posted here:

https://lore.kernel.org/r/cover.1726009989.git.ackerleytng@google.com

We probably need that in one way or another for CoCo, and chances are it
can easily support non-CoCo with the same interface ultimately.  Then 1G
hugetlbfs can be abandoned in QEMU.  It'll also need to tackle the same
challenge here either on page poisoning, or postcopy, with/without QEMU's
specific solution, because QEMU is also not the only userspace hypervisor.

That said, the initial few small patches seem to be standalone small fixes
which may still be good.  So if you think that's the case you can at least
consider sending them separately without the RFC tag.

Thanks,

-- 
Peter Xu

[PATCH v1 0/4] hugetlbfs memory HW error fixes

Posted by William Roche 1 year, 3 months ago

From: William Roche <william.roche@oracle.com>

This set of patches fixes several problems with hardware memory errors
impacting hugetlbfs memory backed VMs. When using hugetlbfs large
pages, any large page location being impacted by an HW memory error
results in poisoning the entire page, suddenly making a large chunk of
the VM memory unusable.

The main problem that currently exists in Qemu is the lack of backend
file repair before resetting the VM memory, which leaves the impacted
memory silently unusable even after a VM reboot.

In order to fix this issue, we track the SIGBUS page size information
when informed of a HW error (with si_addr_lsb) and record the size with
the appropriate poisoned page location. On recording a large page
position, we take note of the beginning of the page and its size. The
size information is taken from the backend file page_size value.
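
As a rough sketch of the arithmetic involved (not the code of these
patches), si_addr_lsb is the bit position of the least significant valid bit
of the reported address, so the affected range is 1 << si_addr_lsb bytes
long and starts at the address rounded down to that size. The PoisonRange
type and the helper name below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: the range of memory hit by a reported error. */
typedef struct {
    uintptr_t start;   /* first byte of the affected range */
    size_t size;       /* extent of the affected range, in bytes */
} PoisonRange;

/*
 * Derive the affected range from the SIGBUS address and si_addr_lsb.
 * addr_lsb is log2 of the extent: 12 for a 4 KiB page, 21 for a 2 MiB
 * large page, 30 for a 1 GiB large page.
 */
PoisonRange poison_range_from_lsb(uintptr_t addr, short addr_lsb)
{
    PoisonRange r;

    assert(addr_lsb >= 0 && addr_lsb < (short)(sizeof(uintptr_t) * 8));
    r.size = (size_t)1 << addr_lsb;
    /* Clear the low bits to get the beginning of the page. */
    r.start = addr & ~((uintptr_t)r.size - 1);
    return r;
}
```

For example, an error reported at 0x40201234 with si_addr_lsb == 21 covers
the 2 MiB range starting at 0x40200000.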

Also report the impact of losing a large page of memory, only once
when the page is poisoned, to make these situations easier to debug.

This code is scripts/checkpatch.pl clean.
'make check' runs fine on both x86 and ARM.
Unit tests have been successfully run on x86,
but the ARM VM doesn't deal with several errors on different memory
locations triggered in quick succession (which is the case
with a hugetlbfs page being poisoned) and either aborts after a
"failed to record the error" message or becomes unresponsive.


William Roche (4):
  accel/kvm: SIGBUS handler should also deal with si_addr_lsb
  accel/kvm: Keep track of the HWPoisonPage page_size
  system/physmem: Largepage punch hole before reset of memory pages
  accel/kvm: Report the loss of a large memory page

 accel/kvm/kvm-all.c       | 27 +++++++++++++++++++++------
 accel/stubs/kvm-stub.c    |  4 ++--
 include/exec/cpu-common.h |  1 +
 include/qemu/osdep.h      |  5 +++--
 include/sysemu/kvm.h      |  7 ++++---
 include/sysemu/kvm_int.h  |  7 +++++--
 system/cpus.c             |  6 ++++--
 system/physmem.c          | 28 ++++++++++++++++++++++++++++
 target/arm/kvm.c          |  8 ++++++--
 target/i386/kvm/kvm.c     |  8 ++++++--
 util/oslib-posix.c        |  3 +++
 11 files changed, 83 insertions(+), 21 deletions(-)

-- 
2.43.5

[PATCH v1 1/4] accel/kvm: SIGBUS handler should also deal with si_addr_lsb

Posted by William Roche 1 year, 3 months ago

From: William Roche <william.roche@oracle.com>

The SIGBUS signal siginfo reporting a HW memory error
provides a si_addr_lsb field with an indication of the
impacted memory page size.
This information should be used to track the hwpoisoned
page sizes.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c    | 6 ++++--
 accel/stubs/kvm-stub.c | 4 ++--
 include/qemu/osdep.h   | 5 +++--
 include/sysemu/kvm.h   | 4 ++--
 system/cpus.c          | 6 ++++--
 util/oslib-posix.c     | 3 +++
 6 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 801cff16a5..2adc4d9c24 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2940,6 +2940,7 @@ void kvm_cpu_synchronize_pre_loadvm(CPUState *cpu)
 #ifdef KVM_HAVE_MCE_INJECTION
 static __thread void *pending_sigbus_addr;
 static __thread int pending_sigbus_code;
+static __thread short pending_sigbus_addr_lsb;
 static __thread bool have_sigbus_pending;
 #endif
 
@@ -3651,7 +3652,7 @@ void kvm_init_cpu_signals(CPUState *cpu)
 }
 
 /* Called asynchronously in VCPU thread.  */
-int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
+int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr, short addr_lsb)
 {
 #ifdef KVM_HAVE_MCE_INJECTION
     if (have_sigbus_pending) {
@@ -3660,6 +3661,7 @@ int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
     have_sigbus_pending = true;
     pending_sigbus_addr = addr;
     pending_sigbus_code = code;
+    pending_sigbus_addr_lsb = addr_lsb;
     qatomic_set(&cpu->exit_request, 1);
     return 0;
 #else
@@ -3668,7 +3670,7 @@ int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
 }
 
 /* Called synchronously (via signalfd) in main thread.  */
-int kvm_on_sigbus(int code, void *addr)
+int kvm_on_sigbus(int code, void *addr, short addr_lsb)
 {
 #ifdef KVM_HAVE_MCE_INJECTION
     /* Action required MCE kills the process if SIGBUS is blocked.  Because
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 8e0eb22e61..80780433d8 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -38,12 +38,12 @@ bool kvm_has_sync_mmu(void)
     return false;
 }
 
-int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
+int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr, short addr_lsb)
 {
     return 1;
 }
 
-int kvm_on_sigbus(int code, void *addr)
+int kvm_on_sigbus(int code, void *addr, short addr_lsb)
 {
     return 1;
 }
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index fe7c3c5f67..838271c4b8 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -585,8 +585,9 @@ struct qemu_signalfd_siginfo {
     uint64_t ssi_stime;   /* System CPU time consumed (SIGCHLD) */
     uint64_t ssi_addr;    /* Address that generated signal
                              (for hardware-generated signals) */
-    uint8_t  pad[48];     /* Pad size to 128 bytes (allow for
-                             additional fields in the future) */
+    uint16_t ssi_addr_lsb;/* Least significant bit of address (SIGBUS) */
+    uint8_t  pad[46];     /* Pad size to 128 bytes (allow for */
+                          /* additional fields in the future) */
 };
 
 int qemu_signalfd(const sigset_t *mask);
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index c3a60b2890..1bde598404 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -207,8 +207,8 @@ int kvm_has_gsi_routing(void);
 bool kvm_arm_supports_user_irq(void);
 
 
-int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr);
-int kvm_on_sigbus(int code, void *addr);
+int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr, short addr_lsb);
+int kvm_on_sigbus(int code, void *addr, short addr_lsb);
 
 #ifdef COMPILING_PER_TARGET
 #include "cpu.h"
diff --git a/system/cpus.c b/system/cpus.c
index 1c818ff682..12e630f760 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -376,12 +376,14 @@ static void sigbus_handler(int n, siginfo_t *siginfo, void *ctx)
 
     if (current_cpu) {
         /* Called asynchronously in VCPU thread.  */
-        if (kvm_on_sigbus_vcpu(current_cpu, siginfo->si_code, siginfo->si_addr)) {
+        if (kvm_on_sigbus_vcpu(current_cpu, siginfo->si_code,
+                               siginfo->si_addr, siginfo->si_addr_lsb)) {
             sigbus_reraise();
         }
     } else {
         /* Called synchronously (via signalfd) in main thread.  */
-        if (kvm_on_sigbus(siginfo->si_code, siginfo->si_addr)) {
+        if (kvm_on_sigbus(siginfo->si_code,
+                          siginfo->si_addr, siginfo->si_addr_lsb)) {
             sigbus_reraise();
         }
     }
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 11b35e48fb..64517d1e40 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -767,6 +767,9 @@ void sigaction_invoke(struct sigaction *action,
     } else if (info->ssi_signo == SIGILL || info->ssi_signo == SIGFPE ||
                info->ssi_signo == SIGSEGV || info->ssi_signo == SIGBUS) {
         si.si_addr = (void *)(uintptr_t)info->ssi_addr;
+        if (info->ssi_signo == SIGBUS) {
+            si.si_addr_lsb = (short int)info->ssi_addr_lsb;
+        }
     } else if (info->ssi_signo == SIGCHLD) {
         si.si_pid = info->ssi_pid;
         si.si_status = info->ssi_status;
-- 
2.43.5
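
As a reading aid for the patch above: si_addr_lsb holds the least significant bit of the reported address, so the impacted size is 1 << si_addr_lsb. A minimal sketch of that conversion follows. The function name and the fallback parameter are assumptions made for illustration and are not part of the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Convert the si_addr_lsb value of a SIGBUS siginfo into a size in bytes.
 * 'fallback' (hypothetical parameter) is returned when lsb is unusable.
 */
size_t poison_size_from_lsb(short lsb, size_t fallback)
{
    /* Reject non-positive values and shifts that would overflow size_t. */
    if (lsb <= 0 || (size_t)lsb >= sizeof(size_t) * CHAR_BIT) {
        return fallback;
    }
    /* Shift a size_t, not an int, so large page sizes do not overflow. */
    return (size_t)1 << lsb;
}
```

With a 4 KiB fallback, an lsb of 21 yields 2 MiB, and an lsb of 0 yields the fallback.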

[PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by William Roche 1 year, 3 months ago

From: William Roche <william.roche@oracle.com>

Add the page size information to the hwpoison_page_list elements.
As the kernel doesn't always report the actual poisoned page size,
we adjust this size from the backend real page size.
We take into account the recorded page size to adjust the size
and location of the memory hole.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c       | 14 ++++++++++----
 include/exec/cpu-common.h |  1 +
 include/sysemu/kvm.h      |  3 ++-
 include/sysemu/kvm_int.h  |  3 ++-
 system/physmem.c          | 20 ++++++++++++++++++++
 target/arm/kvm.c          |  8 ++++++--
 target/i386/kvm/kvm.c     |  8 ++++++--
 7 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 2adc4d9c24..40117eefa7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
  */
 typedef struct HWPoisonPage {
     ram_addr_t ram_addr;
+    size_t     page_size;
     QLIST_ENTRY(HWPoisonPage) list;
 } HWPoisonPage;
 
@@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
 
     QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
         QLIST_REMOVE(page, list);
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr, page->page_size);
         g_free(page);
     }
 }
 
-void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz)
 {
     HWPoisonPage *page;
 
+    if (sz > TARGET_PAGE_SIZE)
+        ram_addr = ROUND_DOWN(ram_addr, sz);
+
     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
             return;
@@ -1294,6 +1298,7 @@ void kvm_hwpoison_page_add(ram_addr_t ram_addr)
     }
     page = g_new(HWPoisonPage, 1);
     page->ram_addr = ram_addr;
+    page->page_size = sz;
     QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
 }
 
@@ -3140,7 +3145,8 @@ int kvm_cpu_exec(CPUState *cpu)
         if (unlikely(have_sigbus_pending)) {
             bql_lock();
             kvm_arch_on_sigbus_vcpu(cpu, pending_sigbus_code,
-                                    pending_sigbus_addr);
+                                    pending_sigbus_addr,
+                                    pending_sigbus_addr_lsb);
             have_sigbus_pending = false;
             bql_unlock();
         }
@@ -3678,7 +3684,7 @@ int kvm_on_sigbus(int code, void *addr, short addr_lsb)
      * we can only get action optional here.
      */
     assert(code != BUS_MCEERR_AR);
-    kvm_arch_on_sigbus_vcpu(first_cpu, code, addr);
+    kvm_arch_on_sigbus_vcpu(first_cpu, code, addr, addr_lsb);
     return 0;
 #else
     return 1;
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 638dc806a5..b971b13306 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -108,6 +108,7 @@ bool qemu_ram_is_named_file(RAMBlock *rb);
 int qemu_ram_get_fd(RAMBlock *rb);
 
 size_t qemu_ram_pagesize(RAMBlock *block);
+size_t qemu_ram_pagesize_from_host(void *addr);
 size_t qemu_ram_pagesize_largest(void);
 
 /**
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 1bde598404..4106a7ec07 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -383,7 +383,8 @@ bool kvm_vcpu_id_is_valid(int vcpu_id);
 unsigned long kvm_arch_vcpu_id(CPUState *cpu);
 
 #ifdef KVM_HAVE_MCE_INJECTION
-void kvm_arch_on_sigbus_vcpu(CPUState *cpu, int code, void *addr);
+void kvm_arch_on_sigbus_vcpu(CPUState *cpu, int code, void *addr,
+                             short addr_lsb);
 #endif
 
 void kvm_arch_init_irq_routing(KVMState *s);
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index a1e72763da..d2160be0ae 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -178,10 +178,11 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size);
  *
  * Parameters:
  *  @ram_addr: the address in the RAM for the poisoned page
+ *  @sz: size of the poisoned page as reported by the kernel
  *
  * Add a poisoned page to the list
  *
  * Return: None.
  */
-void kvm_hwpoison_page_add(ram_addr_t ram_addr);
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz);
 #endif
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..3757428336 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1665,6 +1665,26 @@ size_t qemu_ram_pagesize(RAMBlock *rb)
     return rb->page_size;
 }
 
+/* Returns backend real page size used for the given address */
+size_t qemu_ram_pagesize_from_host(void *addr)
+{
+    RAMBlock *rb;
+    ram_addr_t offset;
+
+    /*
+     * XXX The kernel-provided size is not reliable:
+     * kvm_send_hwpoison_signal() reports a hard-coded PAGE_SHIFT
+     * in the hwpoison signal, whatever the real page size is.
+     * So we must identify the actual size to consider from the
+     * mapping block pagesize.
+     */
+    rb = qemu_ram_block_from_host(addr, false, &offset);
+    if (!rb) {
+        return TARGET_PAGE_SIZE;
+    }
+    return qemu_ram_pagesize(rb);
+}
+
 /* Returns the largest size of page in use */
 size_t qemu_ram_pagesize_largest(void)
 {
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index f1f1b5b375..11579e170b 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2348,10 +2348,11 @@ int kvm_arch_get_registers(CPUState *cs, Error **errp)
     return ret;
 }
 
-void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
+void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr, short addr_lsb)
 {
     ram_addr_t ram_addr;
     hwaddr paddr;
+    size_t sz = (addr_lsb > 0) ? ((size_t)1 << addr_lsb) : TARGET_PAGE_SIZE;
 
     assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
 
@@ -2359,7 +2360,10 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-            kvm_hwpoison_page_add(ram_addr);
+            if (sz == TARGET_PAGE_SIZE) {
+                sz = qemu_ram_pagesize_from_host(addr);
+            }
+            kvm_hwpoison_page_add(ram_addr, sz);
             /*
              * If this is a BUS_MCEERR_AR, we know we have been called
              * synchronously from the vCPU thread, so we can easily
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index fd9f198892..71e674bca0 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -735,12 +735,13 @@ static void hardware_memory_error(void *host_addr)
     exit(1);
 }
 
-void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
+void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr, short addr_lsb)
 {
     X86CPU *cpu = X86_CPU(c);
     CPUX86State *env = &cpu->env;
     ram_addr_t ram_addr;
     hwaddr paddr;
+    size_t sz = (addr_lsb > 0) ? ((size_t)1 << addr_lsb) : TARGET_PAGE_SIZE;
 
     /* If we get an action required MCE, it has been injected by KVM
      * while the VM was running.  An action optional MCE instead should
@@ -753,7 +754,10 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-            kvm_hwpoison_page_add(ram_addr);
+            if (sz == TARGET_PAGE_SIZE) {
+                sz = qemu_ram_pagesize_from_host(addr);
+            }
+            kvm_hwpoison_page_add(ram_addr, sz);
             kvm_mce_inject(cpu, paddr, code);
 
             /*
-- 
2.43.5
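
The patch above rounds the poisoned address down to the start of its large page before searching hwpoison_page_list, so that several errors inside one large page produce a single entry. A minimal sketch of that arithmetic follows. The helper names are assumptions for illustration; QEMU itself uses its ROUND_DOWN() macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round addr down to a multiple of size; size must be a power of two. */
uint64_t round_down_pow2(uint64_t addr, uint64_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return addr & ~(size - 1);
}

/* True when two addresses fall inside the same page of the given size. */
bool same_poisoned_page(uint64_t a, uint64_t b, uint64_t page_size)
{
    return round_down_pow2(a, page_size) == round_down_pow2(b, page_size);
}
```

For a 2 MiB page, addresses 0x200000 and 0x3fffff map to the same entry, while 0x1fffff belongs to the previous page.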

Re: [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by David Hildenbrand 1 year, 3 months ago

On 22.10.24 23:35, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> Add the page size information to the hwpoison_page_list elements.
> As the kernel doesn't always report the actual poisoned page size,
> we adjust this size from the backend real page size.
> We take into account the recorded page size to adjust the size
> and location of the memory hole.
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   accel/kvm/kvm-all.c       | 14 ++++++++++----
>   include/exec/cpu-common.h |  1 +
>   include/sysemu/kvm.h      |  3 ++-
>   include/sysemu/kvm_int.h  |  3 ++-
>   system/physmem.c          | 20 ++++++++++++++++++++
>   target/arm/kvm.c          |  8 ++++++--
>   target/i386/kvm/kvm.c     |  8 ++++++--
>   7 files changed, 47 insertions(+), 10 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 2adc4d9c24..40117eefa7 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
>    */
>   typedef struct HWPoisonPage {
>       ram_addr_t ram_addr;
> +    size_t     page_size;
>       QLIST_ENTRY(HWPoisonPage) list;
>   } HWPoisonPage;
>   
> @@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
>   
>       QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>           QLIST_REMOVE(page, list);
> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
> +        qemu_ram_remap(page->ram_addr, page->page_size);

Can't we just use the page size from the RAMBlock in qemu_ram_remap? 
There we lookup the RAMBlock, and all pages in a RAMBlock have the same 
size.

I'll note that qemu_ram_remap() is rather stupid and optimized only for 
private memory (not shmem etc).

mmap(MAP_FIXED|MAP_SHARED, fd) will give you the same poisoned page from 
the pagecache; you'd have to punch a hole instead.
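
To illustrate the hole punching mentioned above, here is a minimal sketch, assuming Linux with glibc and using a memfd rather than a hugetlbfs file. The function names are made up for this example and this is not QEMU code. It fills a shared mapping, punches the first page out of the backing file, and checks that the same mapping then reads zeroes there.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Deallocate [offset, offset + len) in fd while keeping the file size. */
int punch_hole(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}

/* Returns 0 when the punched page reads back as zeroes, -1 otherwise. */
int punch_hole_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int fd = memfd_create("punch-demo", 0);
    unsigned char *p;
    int ret = -1;

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, 2 * page) != 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memset(p, 0xab, 2 * page);
    /* After the punch, the first page is fresh; the second is untouched. */
    if (punch_hole(fd, 0, page) == 0 && p[0] == 0 && p[page] == 0xab) {
        ret = 0;
    }
    munmap(p, 2 * page);
    close(fd);
    return ret;
}
```

The mapping itself is never replaced here: only the backing page is dropped, which is the effect a plain mmap(MAP_FIXED|MAP_SHARED, fd) does not give.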

It might be better to use ram_block_discard_range() in the long run. 
Memory preallocation + page pinning is tricky, but we could simply bail 
out in these cases (preallocation failing, ram discard being disabled).

qemu_ram_remap() might be problematic with page pinning (vfio) as is in 
any way :(

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by William Roche 1 year, 3 months ago

On 10/23/24 09:28, David Hildenbrand wrote:
> On 22.10.24 23:35, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> Add the page size information to the hwpoison_page_list elements.
>> As the kernel doesn't always report the actual poisoned page size,
>> we adjust this size from the backend real page size.
>> We take into account the recorded page size to adjust the size
>> and location of the memory hole.
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   accel/kvm/kvm-all.c       | 14 ++++++++++----
>>   include/exec/cpu-common.h |  1 +
>>   include/sysemu/kvm.h      |  3 ++-
>>   include/sysemu/kvm_int.h  |  3 ++-
>>   system/physmem.c          | 20 ++++++++++++++++++++
>>   target/arm/kvm.c          |  8 ++++++--
>>   target/i386/kvm/kvm.c     |  8 ++++++--
>>   7 files changed, 47 insertions(+), 10 deletions(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 2adc4d9c24..40117eefa7 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned 
>> int extension)
>>    */
>>   typedef struct HWPoisonPage {
>>       ram_addr_t ram_addr;
>> +    size_t     page_size;
>>       QLIST_ENTRY(HWPoisonPage) list;
>>   } HWPoisonPage;
>> @@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
>>       QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>>           QLIST_REMOVE(page, list);
>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>> +        qemu_ram_remap(page->ram_addr, page->page_size);
> 
> Can't we just use the page size from the RAMBlock in qemu_ram_remap? 
> There we lookup the RAMBlock, and all pages in a RAMBlock have the same 
> size.

Yes, we could use the page size from the RAMBlock in qemu_ram_remap(),
which is called when the VM is resetting. I think that knowing the size
of the poisoned chunk of memory when the poison is created is useful to
give a trace of what is going on, before possibly seeing other pages
reported as poisoned. That is the goal of the 4th patch: to give this
information as soon as we get it.
It also helps to filter the new errors reported and only create an entry
in the hwpoison_page_list for new large pages.

Now we could delay the page size retrieval until we are resetting and
present the information then (post mortem). I do think that having the
information earlier is better in this case.

> 
> I'll note that qemu_ram_remap() is rather stupid and optimized only for 
> private memory (not shmem etc).
> 
> mmap(MAP_FIXED|MAP_SHARED, fd) will give you the same poisoned page from 
> the pagecache; you'd have to punch a hole instead.
> 
> It might be better to use ram_block_discard_range() in the long run. 
> Memory preallocation + page pinning is tricky, but we could simply bail 
> out in these cases (preallocation failing, ram discard being disabled).

I see that ram_block_discard_range() adds more checks before discarding
the RAM region, and can also call madvise() in addition to the fallocate
punch hole for standard sized memory pages. Now, as the range is supposed
to be recreated, I'm not convinced that these madvise calls are necessary.

But we can also notice that this function reports the following warning
for every file backend that is not shared:
"ram_block_discard_range: Discarding RAM in private file mappings is
possibly dangerous, because it will modify the underlying file and will
affect other users of the file"
This means that a hugetlbfs configuration sees this new, cryptic warning
message on reboot if it was impacted by memory poisoning.
So I would prefer to leave the fallocate call in the qemu_ram_remap()
function. Or would you prefer to enhance the ram_block_discard_range()
code to avoid the message in a reset situation (when called from
qemu_ram_remap())?

> 
> qemu_ram_remap() might be problematic with page pinning (vfio) as is in 
> any way :(
> 

I agree. If qemu_ram_remap() fails, Qemu ends with either abort() or
exit(1). Are you saying that memory pinning could be detected by
ram_block_discard_range(), or maybe by the mmap call for the impacted
region, and make one of them fail? This would be an additional reason to
call ram_block_discard_range() from qemu_ram_remap(). Is that what you
are suggesting?

Re: [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by William Roche 1 year, 3 months ago

On 10/23/24 09:28, David Hildenbrand wrote:

> On 22.10.24 23:35, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> Add the page size information to the hwpoison_page_list elements.
>> As the kernel doesn't always report the actual poisoned page size,
>> we adjust this size from the backend real page size.
>> We take into account the recorded page size to adjust the size
>> and location of the memory hole.
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   accel/kvm/kvm-all.c       | 14 ++++++++++----
>>   include/exec/cpu-common.h |  1 +
>>   include/sysemu/kvm.h      |  3 ++-
>>   include/sysemu/kvm_int.h  |  3 ++-
>>   system/physmem.c          | 20 ++++++++++++++++++++
>>   target/arm/kvm.c          |  8 ++++++--
>>   target/i386/kvm/kvm.c     |  8 ++++++--
>>   7 files changed, 47 insertions(+), 10 deletions(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 2adc4d9c24..40117eefa7 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, 
>> unsigned int extension)
>>    */
>>   typedef struct HWPoisonPage {
>>       ram_addr_t ram_addr;
>> +    size_t     page_size;
>>       QLIST_ENTRY(HWPoisonPage) list;
>>   } HWPoisonPage;
>>   @@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
>>         QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>>           QLIST_REMOVE(page, list);
>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>> +        qemu_ram_remap(page->ram_addr, page->page_size);
>
> Can't we just use the page size from the RAMBlock in qemu_ram_remap? 
> There we lookup the RAMBlock, and all pages in a RAMBlock have the 
> same size.

Yes, we could use the page size from the RAMBlock in qemu_ram_remap(),
which is called when the VM is resetting. I think that knowing the size
of the poisoned chunk of memory when the poison is created is useful to
give a trace of what is going on, before possibly seeing other pages
reported as poisoned. That is the goal of the 4th patch: to give this
information as soon as we get it.
It also helps to filter the new errors reported and only create an entry
in the hwpoison_page_list for new large pages.
Now we could delay the page size retrieval until we are resetting and
present the information then (post mortem). I do think that having the
information earlier is better in this case.

>
> I'll note that qemu_ram_remap() is rather stupid and optimized only 
> for private memory (not shmem etc).
>
> mmap(MAP_FIXED|MAP_SHARED, fd) will give you the same poisoned page 
> from the pagecache; you'd have to punch a hole instead.
>
> It might be better to use ram_block_discard_range() in the long run. 
> Memory preallocation + page pinning is tricky, but we could simply 
> bail out in these cases (preallocation failing, ram discard being 
> disabled).

I see that ram_block_discard_range() adds more checks before discarding
the RAM region, and can also call madvise() in addition to the fallocate
punch hole for standard sized memory pages. Now, as the range is supposed
to be recreated, I'm not convinced that these madvise calls are necessary.

But we can also notice that this function reports the following warning
for every file backend that is not shared:
"ram_block_discard_range: Discarding RAM in private file mappings is
possibly dangerous, because it will modify the underlying file and will
affect other users of the file"
This means that a hugetlbfs configuration sees this new, cryptic warning
message on reboot if it was impacted by memory poisoning.
So I would prefer to leave the fallocate call in the qemu_ram_remap()
function. Or would you prefer to enhance the ram_block_discard_range()
code to avoid the message in a reset situation (when called from
qemu_ram_remap())?

>
> qemu_ram_remap() might be problematic with page pinning (vfio) as is 
> in any way :(

I agree. If qemu_ram_remap() fails, Qemu ends with either abort() or
exit(1). Are you saying that memory pinning could be detected by
ram_block_discard_range(), or maybe by the mmap call for the impacted
region, and make one of them fail? This would be an additional reason to
call ram_block_discard_range() from qemu_ram_remap(). Is that what you
are suggesting?

Re: [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by David Hildenbrand 1 year, 3 months ago

On 26.10.24 01:27, William Roche wrote:
> On 10/23/24 09:28, David Hildenbrand wrote:
> 
>> On 22.10.24 23:35, William Roche wrote:
>>> From: William Roche <william.roche@oracle.com>
>>>
>>> Add the page size information to the hwpoison_page_list elements.
>>> As the kernel doesn't always report the actual poisoned page size,
>>> we adjust this size from the backend real page size.
>>> We take into account the recorded page size to adjust the size
>>> and location of the memory hole.
>>>
>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>> ---
>>>   accel/kvm/kvm-all.c       | 14 ++++++++++----
>>>   include/exec/cpu-common.h |  1 +
>>>   include/sysemu/kvm.h      |  3 ++-
>>>   include/sysemu/kvm_int.h  |  3 ++-
>>>   system/physmem.c          | 20 ++++++++++++++++++++
>>>   target/arm/kvm.c          |  8 ++++++--
>>>   target/i386/kvm/kvm.c     |  8 ++++++--
>>>   7 files changed, 47 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>> index 2adc4d9c24..40117eefa7 100644
>>> --- a/accel/kvm/kvm-all.c
>>> +++ b/accel/kvm/kvm-all.c
>>> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, 
>>> unsigned int extension)
>>>    */
>>>   typedef struct HWPoisonPage {
>>>       ram_addr_t ram_addr;
>>> +    size_t     page_size;
>>>       QLIST_ENTRY(HWPoisonPage) list;
>>>   } HWPoisonPage;
>>>   @@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
>>>         QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>>>           QLIST_REMOVE(page, list);
>>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>>> +        qemu_ram_remap(page->ram_addr, page->page_size);
>>
>> Can't we just use the page size from the RAMBlock in qemu_ram_remap? 
>> There we lookup the RAMBlock, and all pages in a RAMBlock have the 
>> same size.
> 
> 
> Yes, we could use the page size from the RAMBlock in qemu_ram_remap() 
> that is called when the VM is resetting. I think that knowing the 
> information about the size of poisoned chunk of memory when the poison 
> is created is useful to give a trace of what is going on, before seeing 
> maybe other pages being reported as poisoned. That's the 4th patch goal 
> to give an information as soon as we get it.
> It also helps to filter the new errors reported and only create an entry 
> in the hwpoison_page_list for new large pages.
> Now we could delay the page size retrieval until we are resetting and 
> present the information (post mortem). I do think that having the 
> information earlier is better in this case.

If it is not required for this patch, then please move the other stuff 
to patch #4.

Here, we really only have to discard a large page, which we can derive 
from the QEMU RAMBlock page size.

> 
> 
>>
>> I'll note that qemu_ram_remap() is rather stupid and optimized only 
>> for private memory (not shmem etc).
>>
>> mmap(MAP_FIXED|MAP_SHARED, fd) will give you the same poisoned page 
>> from the pagecache; you'd have to punch a hole instead.
>>
>> It might be better to use ram_block_discard_range() in the long run. 
>> Memory preallocation + page pinning is tricky, but we could simply 
>> bail out in these cases (preallocation failing, ram discard being 
>> disabled).
> 
> 
> I see that ram_block_discard_range() adds more control before discarding 
> the RAM region and can also call madvise() in addition to the fallocate 
> punch hole for standard sized memory pages. Now as the range is supposed 
> to be recreated, I'm not convinced that these madvise calls are necessary.

They are the proper replacement for the mmap(MAP_FIXED) + fallocate.

That function handles all cases of properly discarding guest RAM.

> 
> But we can also notice that this function will report the following 
> warning in all cases of not shared file backends:
> "ram_block_discard_range: Discarding RAM in private file mappings is 
> possibly dangerous, because it will modify the underlying file and will 
> affect other users of the file"

Yes, because it's a clear warning sign that something weird is 
happening. You might be throwing away data that some other process might 
be relying on.

How are you making QEMU consume hugetlbs?

We could suppress these warnings, but let's first see how you are able 
to trigger it.

> Which means that hugetlbfs configurations do see this new cryptic 
> warning message on reboot if it is impacted by a memory poisoning.
> So I would prefer to leave the fallocate call in the qemu_ram_remap() 
> function. Or would you prefer to enhance ram_block_discard_range()code 
> to avoid the message in a reset situation (when called from qemu_ram_remap)?

Please try reusing the mechanism to discard guest RAM instead of 
open-coding this. We still have to use mmap(MAP_FIXED) as a backup, but 
otherwise this function should mostly do+check what you need.

(minus the warnings, which we might want to report differently or suppress)

If you want, I can start a quick prototype of what it could look like 
when using ram_block_discard_range() + ram_block_discard_is_disabled() + 
fallback to existing mmap(MAP_FIXED).
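
One possible reading of that control flow is sketched below, with hypothetical function pointers standing in for ram_block_discard_is_disabled(), ram_block_discard_range() and the existing mmap(MAP_FIXED) path. None of these names or signatures are QEMU's, and the ordering of the checks is an assumption drawn from this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hooks standing in for the QEMU helpers named in the thread. */
typedef struct RemapOps {
    bool (*discard_is_disabled)(void);               /* e.g. pinned memory */
    int (*discard_range)(size_t offset, size_t len); /* 0 on success */
    int (*remap_fixed)(size_t offset, size_t len);   /* 0 on success */
} RemapOps;

/* Returns 0 when the range was replaced, -1 when the caller should give up. */
int replace_poisoned_range(const RemapOps *ops, size_t offset, size_t len)
{
    assert(ops && len > 0);
    if (ops->discard_is_disabled()) {
        return -1;   /* discarding could desynchronize pinned memory */
    }
    if (ops->discard_range(offset, len) == 0) {
        return 0;    /* preferred path */
    }
    return ops->remap_fixed(offset, len) == 0 ? 0 : -1;   /* backup path */
}

/* Trivial stubs used only to exercise the possible outcomes. */
bool stub_yes(void) { return true; }
bool stub_no(void) { return false; }
int stub_ok(size_t offset, size_t len) { (void)offset; (void)len; return 0; }
int stub_fail(size_t offset, size_t len) { (void)offset; (void)len; return -1; }
```

A -1 result corresponds to the "bail out" case discussed in this thread, where stopping the VM is preferred over risking later data corruption.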

> 
> 
>>
>> qemu_ram_remap() might be problematic with page pinning (vfio) as is 
>> in any way :(
> 
> I agree. If qemu_ram_remap() fails, Qemu is ended either abort() or 
> exit(1). Do you say that memory pinning could be detected by 
> ram_block_discard_range() or maybe mmap call for the impacted region and 
> make one of them fail ? This would be an additional reason to call 
> ram_block_discard_range() from qemu_ram_remap().   Is it what you are 
> suggesting ?

ram_block_discard_is_disabled() might be the right test. If discarding
is disabled, then rebooting might create an inconsistency with, e.g.,
vfio, resulting in the issues we know from memory ballooning, where
the state vfio sees will be different from the state the guest kernel
sees. It's tricky ... and we would much rather quit the VM early instead
of corrupting data later :/

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by William Roche 1 year, 3 months ago

On 10/28/24 17:42, David Hildenbrand wrote:
> On 26.10.24 01:27, William Roche wrote:
>> On 10/23/24 09:28, David Hildenbrand wrote:
>>
>>> On 22.10.24 23:35, William Roche wrote:
>>>> From: William Roche <william.roche@oracle.com>
>>>>
>>>> Add the page size information to the hwpoison_page_list elements.
>>>> As the kernel doesn't always report the actual poisoned page size,
>>>> we adjust this size from the backend real page size.
>>>> We take into account the recorded page size to adjust the size
>>>> and location of the memory hole.
>>>>
>>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>>> ---
>>>>   accel/kvm/kvm-all.c       | 14 ++++++++++----
>>>>   include/exec/cpu-common.h |  1 +
>>>>   include/sysemu/kvm.h      |  3 ++-
>>>>   include/sysemu/kvm_int.h  |  3 ++-
>>>>   system/physmem.c          | 20 ++++++++++++++++++++
>>>>   target/arm/kvm.c          |  8 ++++++--
>>>>   target/i386/kvm/kvm.c     |  8 ++++++--
>>>>   7 files changed, 47 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>>> index 2adc4d9c24..40117eefa7 100644
>>>> --- a/accel/kvm/kvm-all.c
>>>> +++ b/accel/kvm/kvm-all.c
>>>> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, 
>>>> unsigned int extension)
>>>>    */
>>>>   typedef struct HWPoisonPage {
>>>>       ram_addr_t ram_addr;
>>>> +    size_t     page_size;
>>>>       QLIST_ENTRY(HWPoisonPage) list;
>>>>   } HWPoisonPage;
>>>>   @@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
>>>>         QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, 
>>>> next_page) {
>>>>           QLIST_REMOVE(page, list);
>>>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>>>> +        qemu_ram_remap(page->ram_addr, page->page_size);
>>>
>>> Can't we just use the page size from the RAMBlock in qemu_ram_remap? 
>>> There we lookup the RAMBlock, and all pages in a RAMBlock have the 
>>> same size.
>>
>>
>> Yes, we could use the page size from the RAMBlock in qemu_ram_remap() 
>> that is called when the VM is resetting. I think that knowing the 
>> information about the size of poisoned chunk of memory when the poison 
>> is created is useful to give a trace of what is going on, before 
>> seeing maybe other pages being reported as poisoned. That's the 4th 
>> patch goal to give an information as soon as we get it.
>> It also helps to filter the new errors reported and only create an 
>> entry in the hwpoison_page_list for new large pages.
>> Now we could delay the page size retrieval until we are resetting and 
>> present the information (post mortem). I do think that having the 
>> information earlier is better in this case.
> 
> If it is not required for this patch, then please move the other stuff 
> to patch #4.
> 
> Here, we really only have to discard a large page, which we can derive 
> from the QEMU RAMBlock page size.


OK, I can remove the first patch, which was created to track the
kernel-provided page size and pass it to the kvm_hwpoison_page_add()
function. We could instead deal with the page size at the
kvm_hwpoison_page_add() level, as we would not rely on the
kernel-provided information but only on the RAMBlock page size.

I'll send a new version with this modification.


>>
>>
>>>
>>> I'll note that qemu_ram_remap() is rather stupid and optimized only 
>>> for private memory (not shmem etc).
>>>
>>> mmap(MAP_FIXED|MAP_SHARED, fd) will give you the same poisoned page 
>>> from the pagecache; you'd have to punch a hole instead.
>>>
>>> It might be better to use ram_block_discard_range() in the long run. 
>>> Memory preallocation + page pinning is tricky, but we could simply 
>>> bail out in these cases (preallocation failing, ram discard being 
>>> disabled).
>>
>>
>> I see that ram_block_discard_range() adds more control before 
>> discarding the RAM region and can also call madvise() in addition to 
>> the fallocate punch hole for standard sized memory pages. Now as the 
>> range is supposed to be recreated, I'm not convinced that these 
>> madvise calls are necessary.
> 
> They are the proper replacement for the mmap(MAP_FIXED) + fallocate.
> 
> That function handles all cases of properly discarding guest RAM.

In the case of hugetlbfs pages, ram_block_discard_range() does the 
punch-hole fallocate call (and prints out the warning messages).
The madvise call is only done when (rb->page_size == 
qemu_real_host_page_size()) which isn't true for hugetlbfs.
So need_madvise is false and neither QEMU_MADV_REMOVE nor 
QEMU_MADV_DONTNEED madvise calls is performed.
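
To make the hugetlbfs case concrete, here is a minimal standalone model of 
the decision described above. Only the need_madvise condition is taken from 
the discussion; the struct, the function name and the need_fallocate rule 
(punch a hole whenever a backing fd exists) are assumptions made for 
illustration, not QEMU's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the RAMBlock fields used here. */
struct block_model {
    size_t page_size;   /* page size backing the block */
    int fd;             /* backing file descriptor, -1 if anonymous */
};

struct discard_plan {
    bool need_madvise;
    bool need_fallocate;
};

/* host_page_size plays the role of qemu_real_host_page_size(). */
static struct discard_plan plan_discard(const struct block_model *rb,
                                        size_t host_page_size)
{
    struct discard_plan plan;

    assert(rb != NULL && rb->page_size >= host_page_size);
    /* Condition quoted in the thread: madvise only for host-sized pages. */
    plan.need_madvise = (rb->page_size == host_page_size);
    /* Assumption: a hole is punched whenever the block is file-backed. */
    plan.need_fallocate = (rb->fd != -1);
    return plan;
}
```

With a 2M page size and a valid fd, this model yields need_madvise == false 
and need_fallocate == true, which matches the behavior described for 
hugetlbfs.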


> 
>>
>> But we can also notice that this function will report the following 
>> warning in all cases of not shared file backends:
>> "ram_block_discard_range: Discarding RAM in private file mappings is 
>> possibly dangerous, because it will modify the underlying file and 
>> will affect other users of the file"
> 
> Yes, because it's a clear warning sign that something weird is 
> happening. You might be throwing away data that some other process might 
> be relying on.
> 
> How are you making QEMU consume hugetlbs?

A classical way to consume (not shared) hugetlbfs pages is done with the 
creation of a file that is opened, mmapped by the Qemu instance but we 
also delete the file system entry so that if the Qemu instance dies, the 
resources are released. This file is usually not shared.
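
As an illustration of that pattern, a minimal sketch follows. The function 
name and the mkstemp() template are hypothetical; on a real setup the 
template would point into a hugetlbfs mount and the length would be a 
multiple of the huge page size, which this sketch does not check.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a file from a mkstemp() template, unlink it right away, size it
 * and map it MAP_PRIVATE. Returns the mapping (or MAP_FAILED) and stores
 * the still-open fd in *fd_out.
 */
static void *map_unlinked_private(char *path_template, size_t len, int *fd_out)
{
    void *area;
    int fd;

    assert(fd_out != NULL);
    fd = mkstemp(path_template);
    if (fd < 0) {
        return MAP_FAILED;
    }
    /* Drop the name: storage is released once the last fd/mapping is gone. */
    unlink(path_template);
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return MAP_FAILED;
    }
    area = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (area == MAP_FAILED) {
        close(fd);
        return MAP_FAILED;
    }
    *fd_out = fd;
    return area;
}
```

After the unlink() the file has no directory entry (st_nlink is 0), so no 
other process can open it by name, but nothing tells the process holding the 
mapping that it is the only user.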


> 
> We could suppress these warnings, but let's first see how you are able 
> to trigger it.

The warning is always displayed when such a hugetlbfs VM impacted by a 
memory error is rebooted.
I understand the reason why we have this message, but in the case of 
hugetlbfs classical use this (new) message on reboot is probably too 
worrying...  But losing memory is already very worrying ;)


> 
>> Which means that hugetlbfs configurations do see this new cryptic 
>> warning message on reboot if it is impacted by a memory poisoning.
>> So I would prefer to leave the fallocate call in the qemu_ram_remap() 
>> function. Or would you prefer to enhance ram_block_discard_range() code 
>> to avoid the message in a reset situation (when called from 
>> qemu_ram_remap)?
> 
> Please try reusing the mechanism to discard guest RAM instead of open- 
> coding this. We still have to use mmap(MAP_FIXED) as a backup, but 
> otherwise this function should mostly do+check what you need.
> 
> (-warnings we might want to report differently / suppress)
> 
> If you want, I can start a quick prototype of what it could look like 
> when using ram_block_discard_range() + ram_block_discard_is_disabled() + 
> fallback to existing mmap(MAP_FIXED).

I just want to note that the reason why need_madvise was used was 
because "DONTNEED fails for hugepages but fallocate works on hugepages 
and shmem". In fact, MADV_REMOVE support on hugetlbfs only appeared in 
kernel v4.3 and MADV_DONTNEED support only appeared in v5.18.

Our Qemu code avoids calling these madvise for hugepages, as we need to 
have:
(rb->page_size == qemu_real_host_page_size())

That's a reason why we have to remap the "hole-punched" section of the 
file when using hugepages.

>>
>>
>>>
>>> qemu_ram_remap() might be problematic with page pinning (vfio) as is 
>>> in any way :(
>>
>> I agree. If qemu_ram_remap() fails, Qemu is ended by either abort() or 
>> exit(1). Do you say that memory pinning could be detected by 
>> ram_block_discard_range() or maybe mmap call for the impacted region 
>> and make one of them fail ? This would be an additional reason to call 
>> ram_block_discard_range() from qemu_ram_remap().   Is it what you are 
>> suggesting ?
> 
> ram_block_discard_is_disabled() might be the right test. If discarding 
> is disabled, then rebooting might create an inconsistency with 
> e.g.,vfio, resulting in the issues we know from memory ballooning where 
> the state vfio sees will be different from the state the guest kernel 
> sees. It's tricky ... and we much rather quit the VM early instead of 
> corrupting data later :/


Alright, we can check whether ram_block_discard_is_disabled() is true and 
exit Qemu with a message in that case, instead of trying to recreate the 
memory area (as we do in the other case).

Re: [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by David Hildenbrand 1 year, 3 months ago

> 
> Ok, I can remove the first patch that is created to track the kernel
> provided page size and pass it to the kvm_hwpoison_page_add() function,
> but we could deal with the page size at the kvm_hwpoison_page_add()
> function level as we don't rely on the kernel provided info, but just
> the RAMBlock page size.

Great!

>>> I see that ram_block_discard_range() adds more control before
>>> discarding the RAM region and can also call madvise() in addition to
>>> the fallocate punch hole for standard sized memory pages. Now as the
>>> range is supposed to be recreated, I'm not convinced that these
>>> madvise calls are necessary.
>>
>> They are the proper replacement for the mmap(MAP_FIXED) + fallocate.
>>
>> That function handles all cases of properly discarding guest RAM.
> 
> In the case of hugetlbfs pages, ram_block_discard_range() does the
> punch-hole fallocate call (and prints out the warning messages).
> The madvise call is only done when (rb->page_size ==
> qemu_real_host_page_size()) which isn't true for hugetlbfs.
> So need_madvise is false and neither QEMU_MADV_REMOVE nor
> QEMU_MADV_DONTNEED madvise calls is performed.

See my other mail regarding fallocate()+hugetlb oddities.

The warning is for MAP_PRIVATE mappings where we cannot be sure that we are
not doing harm to somebody else that is mapping the file :(

See

commit 1d44ff586f8a8e113379430750b5a0a2a3f64cf9
Author: David Hildenbrand <david@redhat.com>
Date:   Thu Jul 6 09:56:06 2023 +0200

     softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
     
     ram_block_discard_range() cannot possibly do the right thing in
     MAP_PRIVATE file mappings in the general case.
     
     To achieve the documented semantics, we also have to punch a hole into
     the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
     of such a file.
     
     For example, using VM templating -- see commit b17fbbe55cba ("migration:
     allow private destination ram with x-ignore-shared") -- in combination with
     any mechanism that relies on discarding of RAM is problematic. This
     includes:
     * Postcopy live migration
     * virtio-balloon inflation/deflation or free-page-reporting
     * virtio-mem
     
     So at least warn that there is something possibly dangerous is going on
     when using ram_block_discard_range() in these cases.

So the warning is the best we can do to say "this is possibly very
problematic, and it might be undesirable".

For hugetlb, users should switch to using memory-backend-memfd or
memory-backend-file,share=on.

> 
> 
>>
>>>
>>> But we can also notice that this function will report the following
>>> warning in all cases of not shared file backends:
>>> "ram_block_discard_range: Discarding RAM in private file mappings is
>>> possibly dangerous, because it will modify the underlying file and
>>> will affect other users of the file"
>>
>> Yes, because it's a clear warning sign that something weird is
>> happening. You might be throwing away data that some other process might
>> be relying on.
>>
>> How are you making QEMU consume hugetlbs?
> 
> A classical way to consume (not shared) hugetlbfs pages is done with the
> creation of a file that is opened, mmapped by the Qemu instance but we
> also delete the file system entry so that if the Qemu instance dies, the
> resources are released. This file is usually not shared.

Right, see above. We should be using memory-backend-file,share=on with that,
just like we would with shmem/tmpfs :(

The ugly bit is that the legacy "-mem-path" option translates to
"memory-backend-file,share=off", and we cannot easily change that.

That option really should not be used anymore.

> 
> 
>>
>> We could suppress these warnings, but let's first see how you are able
>> to trigger it.
> 
> The warning is always displayed when such a hugetlbfs VM impacted by a
> memory error is rebooted.
> I understand the reason why we have this message, but in the case of
> hugetlbfs classical use this (new) message on reboot is probably too
> worrying...  But losing memory is already very worrying ;)

See above; we cannot easily identify "we map this file MAP_PRIVATE
but we are guaranteed to be the single user", so punching a hole in that
file might just corrupt data for another user (e.g., VM templating) without
any warning.

Again, we could suppress the warning, but not using MAP_PRIVATE with
a hugetlb file would be even better.

(hugetlb contains other hacks that make sure that MAP_PRIVATE on a file
won't result in a double memory consumption -- with shmem/tmpfs it would
result in a double memory consumption!)

Are the users you are aware of using "-mem-path" or "-object memory-backend-file"?

We might be able to change the default for the latter with a new QEMU version,
maybe ...


-- 
Cheers,

David / dhildenb

[PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages

Posted by William Roche 1 year, 3 months ago

From: William Roche <william.roche@oracle.com>

When the VM reboots, a memory reset is performed calling
qemu_ram_remap() on all hwpoisoned pages.
While we take into account the recorded page sizes to repair the
memory locations, a large page also needs to punch a hole in the
backend file to regenerate a usable memory, cleaning the HW
poisoned section. This is mandatory for hugetlbfs case for example.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/physmem.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/system/physmem.c b/system/physmem.c
index 3757428336..3f6024a92d 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                 prot = PROT_READ;
                 prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
                 if (block->fd >= 0) {
+                    if (length > TARGET_PAGE_SIZE && fallocate(block->fd,
+                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+                        offset + block->fd_offset, length) != 0) {
+                        error_report("Could not recreate the file hole for "
+                                     "addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     length, addr);
+                        exit(1);
+                    }
                     area = mmap(vaddr, length, prot, flags, block->fd,
                                 offset + block->fd_offset);
                 } else {
-- 
2.43.5

Re: [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages

Posted by David Hildenbrand 1 year, 3 months ago

On 22.10.24 23:35, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> When the VM reboots, a memory reset is performed calling
> qemu_ram_remap() on all hwpoisoned pages.
> While we take into account the recorded page sizes to repair the
> memory locations, a large page also needs to punch a hole in the
> backend file to regenerate a usable memory, cleaning the HW
> poisoned section. This is mandatory for hugetlbfs case for example.
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   system/physmem.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index 3757428336..3f6024a92d 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
>                   prot = PROT_READ;
>                   prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>                   if (block->fd >= 0) {
> +                    if (length > TARGET_PAGE_SIZE && fallocate(block->fd,
> +                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
> +                        offset + block->fd_offset, length) != 0) {
> +                        error_report("Could not recreate the file hole for "
> +                                     "addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                                     length, addr);
> +                        exit(1);
> +                    }
>                       area = mmap(vaddr, length, prot, flags, block->fd,
>                                   offset + block->fd_offset);
>                   } else {

Ah! Just what I commented to patch #3; we should be using 
ram_discard_range(). It might be better to avoid the mmap() completely 
if ram_discard_range() worked.

And as raised, there is the problem with memory preallocation (where we 
should fail if it doesn't work) and ram discards being disabled because 
something relies on long-term page pinning ...

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages

Posted by William Roche 1 year, 3 months ago

On 10/23/24 09:30, David Hildenbrand wrote:

> On 22.10.24 23:35, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> When the VM reboots, a memory reset is performed calling
>> qemu_ram_remap() on all hwpoisoned pages.
>> While we take into account the recorded page sizes to repair the
>> memory locations, a large page also needs to punch a hole in the
>> backend file to regenerate a usable memory, cleaning the HW
>> poisoned section. This is mandatory for hugetlbfs case for example.
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   system/physmem.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 3757428336..3f6024a92d 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr, 
>> ram_addr_t length)
>>                   prot = PROT_READ;
>>                   prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>>                   if (block->fd >= 0) {
>> +                    if (length > TARGET_PAGE_SIZE && 
>> fallocate(block->fd,
>> +                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
>> +                        offset + block->fd_offset, length) != 0) {
>> +                        error_report("Could not recreate the file 
>> hole for "
>> +                                     "addr: " RAM_ADDR_FMT "@" 
>> RAM_ADDR_FMT "",
>> +                                     length, addr);
>> +                        exit(1);
>> +                    }
>>                       area = mmap(vaddr, length, prot, flags, block->fd,
>>                                   offset + block->fd_offset);
>>                   } else {
>
> Ah! Just what I commented to patch #3; we should be using 
> ram_discard_range(). It might be better to avoid the mmap() completely 
> if ram_discard_range() worked.

I think you are referring to ram_block_discard_range() here, as 
ram_discard_range() seems to relate to VM migrations, maybe not a VM reset.

Remapping the page is needed to get rid of the poison. So if we want to 
avoid the mmap(), we have to shrink the memory address space -- which 
can be a real problem if we imagine a VM with 1G large pages for 
example. qemu_ram_remap() is used to regenerate the lost memory and the 
mmap() call looks mandatory on the reset phase.

>
> And as raised, there is the problem with memory preallocation (where 
> we should fail if it doesn't work) and ram discards being disabled 
> because something relies on long-term page pinning ...

Yes. Do you suggest that we add a call to qemu_prealloc_mem() for the 
remapped area in case of a backend->prealloc being true ?

Or as we are running on posix machines for this piece of code (ifndef 
_WIN32) maybe we could simply add a MAP_POPULATE flag to the mmap call 
done in qemu_ram_remap() in the case where the backend requires a 
'prealloc' ?  Can you confirm if this flag could be used on all systems 
running this code ?

Unfortunately, I don't know how to get the MEMORY_BACKEND corresponding 
to a given memory block. I'm not sure that MEMORY_BACKEND(block->mr) is 
a valid way to retrieve the Backend object and its 'prealloc' property 
here. Could you please give me a direction here ?

I can send a new version using ram_block_discard_range() as you 
suggested to replace the direct call to fallocate(), if you think it 
would be better.
Please let me know what other enhancement(s) you'd like to see in this 
code change.

Thanks in advance,
William.

Re: [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages

Posted by David Hildenbrand 1 year, 3 months ago

On 26.10.24 01:27, William Roche wrote:
> On 10/23/24 09:30, David Hildenbrand wrote:
> 
>> On 22.10.24 23:35, William Roche wrote:
>>> From: William Roche <william.roche@oracle.com>
>>>
>>> When the VM reboots, a memory reset is performed calling
>>> qemu_ram_remap() on all hwpoisoned pages.
>>> While we take into account the recorded page sizes to repair the
>>> memory locations, a large page also needs to punch a hole in the
>>> backend file to regenerate a usable memory, cleaning the HW
>>> poisoned section. This is mandatory for hugetlbfs case for example.
>>>
>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>> ---
>>>    system/physmem.c | 8 ++++++++
>>>    1 file changed, 8 insertions(+)
>>>
>>> diff --git a/system/physmem.c b/system/physmem.c
>>> index 3757428336..3f6024a92d 100644
>>> --- a/system/physmem.c
>>> +++ b/system/physmem.c
>>> @@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr,
>>> ram_addr_t length)
>>>                    prot = PROT_READ;
>>>                    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>>>                    if (block->fd >= 0) {
>>> +                    if (length > TARGET_PAGE_SIZE &&
>>> fallocate(block->fd,
>>> +                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
>>> +                        offset + block->fd_offset, length) != 0) {
>>> +                        error_report("Could not recreate the file
>>> hole for "
>>> +                                     "addr: " RAM_ADDR_FMT "@"
>>> RAM_ADDR_FMT "",
>>> +                                     length, addr);
>>> +                        exit(1);
>>> +                    }
>>>                        area = mmap(vaddr, length, prot, flags, block->fd,
>>>                                    offset + block->fd_offset);
>>>                    } else {
>>
>> Ah! Just what I commented to patch #3; we should be using
>> ram_discard_range(). It might be better to avoid the mmap() completely
>> if ram_discard_range() worked.
> 

Hi!

> 
> I think you are referring to ram_block_discard_range() here, as
> ram_discard_range() seems to relate to VM migrations, maybe not a VM reset.

Please take a look at the users of ram_block_discard_range(), including 
virtio-balloon to completely zap guest memory, so we will get fresh 
memory on next access. It takes care of process-private and file-backed 
(shared) memory.

> 
> Remapping the page is needed to get rid of the poison. So if we want to
> avoid the mmap(), we have to shrink the memory address space -- which
> can be a real problem if we imagine a VM with 1G large pages for
> example. qemu_ram_remap() is used to regenerate the lost memory and the
> mmap() call looks mandatory on the reset phase.

Why can't we use ram_block_discard_range() to zap the poisoned page 
(unmap from page tables + conditionally drop from the page cache)? Is 
there anything important I am missing?

> 
> 
>>
>> And as raised, there is the problem with memory preallocation (where
>> we should fail if it doesn't work) and ram discards being disabled
>> because something relies on long-term page pinning ...
> 
> 
> Yes. Do you suggest that we add a call to qemu_prealloc_mem() for the
> remapped area in case of a backend->prealloc being true ?

Yes. Otherwise, with hugetlb, you might run out of hugetlb pages at 
runtime and SIGBUS QEMU :(

> 
> Or as we are running on posix machines for this piece of code (ifndef
> _WIN32) maybe we could simply add a MAP_POPULATE flag to the mmap call
> done in qemu_ram_remap() in the case where the backend requires a
> 'prealloc' ?  Can you confirm if this flag could be used on all systems
> running this code ?

Please use qemu_prealloc_mem(). MAP_POPULATE has no guarantees, it's 
really weird :/ mmap() might succeed even though MAP_POPULATE didn't 
work ... and it's problematic with NUMA policies because we essentially 
lose (overwrite) them.
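
To illustrate the difference with MAP_POPULATE, here is a minimal sketch of 
explicit prefaulting that reports failure to the caller. It is not 
qemu_prealloc_mem(): the function name is hypothetical, it uses Linux's 
madvise(MADV_POPULATE_WRITE) (available since 5.14) and, as an assumption, 
treats EINVAL as "not supported" and falls back to touching each page.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23   /* value from the Linux uapi headers */
#endif

/*
 * Prefault [addr, addr + len) for writing. Returns 0 on success, or -1 with
 * errno set when the kernel reports that population failed.
 */
static int prefault_range(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    volatile unsigned char *p = addr;
    size_t off;

    assert(addr != NULL && len > 0);
    if (madvise(addr, len, MADV_POPULATE_WRITE) == 0) {
        return 0;
    }
    if (errno != EINVAL) {
        return -1;   /* e.g. ENOMEM: population really failed */
    }
    /*
     * Assumed fallback for older kernels: touch every page while keeping
     * its content. A failure here arrives as SIGBUS, which this sketch
     * does not handle.
     */
    for (off = 0; off < len; off += page) {
        p[off] = p[off];
    }
    return 0;
}
```

Unlike mmap(MAP_POPULATE), the madvise() path gives the caller an error code 
to act on, and it does not replace the mapping, so policies already applied 
to the range stay in place.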

And the whole mmap(MAP_FIXED) is an ugly hack. For example, we wouldn't 
reset the memory policy we apply in 
host_memory_backend_memory_complete() ... that code really needs a 
rewrite to do it properly.


Ideally, we'd do something high-level like


if (ram_block_discard_is_disabled()) {
	/*
	 * We cannot safely discard RAM,  ... for example we might have
	 * to remap all guest RAM into vfio after discarding the 	
	 * problematic pages ... TODO.
	 */
	exit(0);
}

/* Throw away the problematic (poisoned) page. */
if (ram_block_discard_range()) {
	/* Conditionally fallback to MAP_FIXED workaround */
	...
}

/* If preallocation was requested, we really must re-preallocate. */
if (prealloc && qemu_prealloc_mem()) {
	/* Preallocation failed .... */
	exit(0);
}

As you note the last part is tricky. See below.
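
The same flow can be written as a small runnable model. Everything here is 
hypothetical scaffolding: the hooks stand in for 
ram_block_discard_is_disabled(), ram_block_discard_range(), the 
mmap(MAP_FIXED) fallback and qemu_prealloc_mem(), and the model returns an 
error code so the caller decides how to exit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum remap_result {
    REMAP_OK,
    REMAP_ERR_DISCARD_DISABLED,
    REMAP_ERR_REMAP_FAILED,
    REMAP_ERR_PREALLOC_FAILED,
};

/* Hypothetical hooks; the int-returning ones return 0 on success. */
struct remap_ops {
    bool (*discard_is_disabled)(void);
    int (*discard_range)(size_t offset, size_t length);
    int (*map_fixed_fallback)(size_t offset, size_t length);
    int (*prealloc)(size_t offset, size_t length);
};

static enum remap_result remap_poisoned(const struct remap_ops *ops,
                                        size_t offset, size_t length,
                                        bool prealloc)
{
    assert(ops != NULL);
    if (ops->discard_is_disabled()) {
        /* Discarding could desynchronize pinned memory (e.g. vfio). */
        return REMAP_ERR_DISCARD_DISABLED;
    }
    /* Throw away the poisoned page; fall back to MAP_FIXED if that fails. */
    if (ops->discard_range(offset, length) != 0 &&
        ops->map_fixed_fallback(offset, length) != 0) {
        return REMAP_ERR_REMAP_FAILED;
    }
    /* If preallocation was requested, it must be redone. */
    if (prealloc && ops->prealloc(offset, length) != 0) {
        return REMAP_ERR_PREALLOC_FAILED;
    }
    return REMAP_OK;
}

/* Trivial stubs to exercise the flow. */
static bool never_disabled(void) { return false; }
static bool always_disabled(void) { return true; }
static int always_ok(size_t o, size_t l) { (void)o; (void)l; return 0; }
static int always_fail(size_t o, size_t l) { (void)o; (void)l; return -1; }
```

For example, with discard_range set to always_fail and map_fixed_fallback 
set to always_ok, the model still succeeds through the fallback path.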

> 
> Unfortunately, I don't know how to get the MEMORY_BACKEND corresponding
> to a given memory block. I'm not sure that MEMORY_BACKEND(block->mr) is
> a valid way to retrieve the Backend object and its 'prealloc' property
> here. Could you please give me a direction here ?

We could add a RAM_PREALLOC flag to hint that this memory has "prealloc" 
semantics.

I once had an alternative approach: Similar to ram_block_notify_resize() 
we would implement ram_block_notify_remap().

That's where the backend could register and re-apply mmap properties 
like NUMA policies (in case we have to fallback to MAP_FIXED) and handle 
the preallocation.

So one would implement a ram_block_notify_remap() and maybe indicate if 
we had to do MAP_FIXED or if we only discarded the page.

I once had a prototype for that, let me dig ...

> 
> I can send a new version using ram_block_discard_range() as you
> suggested to replace the direct call to fallocate(), if you think it
> would be better.
> Please let me know what other enhancement(s) you'd like to see in this
> code change.

Something along the lines above. Please let me know if you see problems 
with that approach that I am missing.

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages

Posted by William Roche 1 year, 3 months ago

On 10/28/24 18:01, David Hildenbrand wrote:
> On 26.10.24 01:27, William Roche wrote:
>> On 10/23/24 09:30, David Hildenbrand wrote:
>>
>>> On 22.10.24 23:35, William Roche wrote:
>>>> From: William Roche <william.roche@oracle.com>
>>>>
>>>> When the VM reboots, a memory reset is performed calling
>>>> qemu_ram_remap() on all hwpoisoned pages.
>>>> While we take into account the recorded page sizes to repair the
>>>> memory locations, a large page also needs to punch a hole in the
>>>> backend file to regenerate a usable memory, cleaning the HW
>>>> poisoned section. This is mandatory for hugetlbfs case for example.
>>>>
>>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>>> ---
>>>>    system/physmem.c | 8 ++++++++
>>>>    1 file changed, 8 insertions(+)
>>>>
>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>> index 3757428336..3f6024a92d 100644
>>>> --- a/system/physmem.c
>>>> +++ b/system/physmem.c
>>>> @@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr,
>>>> ram_addr_t length)
>>>>                    prot = PROT_READ;
>>>>                    prot |= block->flags & RAM_READONLY ? 0 : 
>>>> PROT_WRITE;
>>>>                    if (block->fd >= 0) {
>>>> +                    if (length > TARGET_PAGE_SIZE &&
>>>> fallocate(block->fd,
>>>> +                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
>>>> +                        offset + block->fd_offset, length) != 0) {
>>>> +                        error_report("Could not recreate the file
>>>> hole for "
>>>> +                                     "addr: " RAM_ADDR_FMT "@"
>>>> RAM_ADDR_FMT "",
>>>> +                                     length, addr);
>>>> +                        exit(1);
>>>> +                    }
>>>>                        area = mmap(vaddr, length, prot, flags, 
>>>> block->fd,
>>>>                                    offset + block->fd_offset);
>>>>                    } else {
>>>
>>> Ah! Just what I commented to patch #3; we should be using
>>> ram_discard_range(). It might be better to avoid the mmap() completely
>>> if ram_discard_range() worked.
>>
> 
> Hi!
> 
>>
>> I think you are referring to ram_block_discard_range() here, as
>> ram_discard_range() seems to relate to VM migrations, maybe not a VM 
>> reset.
> 
> Please take a look at the users of ram_block_discard_range(), including 
> virtio-balloon to completely zap guest memory, so we will get fresh 
> memory on next access. It takes care of process-private and file-backed 
> (shared) memory.

The calls to madvise should take care of releasing the memory for the 
mapped area, and it is called for standard page sized memory.

>>
>> Remapping the page is needed to get rid of the poison. So if we want to
>> avoid the mmap(), we have to shrink the memory address space -- which
>> can be a real problem if we imagine a VM with 1G large pages for
>> example. qemu_ram_remap() is used to regenerate the lost memory and the
>> mmap() call looks mandatory on the reset phase.
> 
> Why can't we use ram_block_discard_range() to zap the poisoned page 
> (unmap from page tables + conditionally drop from the page cache)? Is 
> there anything important I am missing?

Or maybe _I'm_ missing something important, but what I understand is that:
    need_madvise = (rb->page_size == qemu_real_host_page_size());

ensures that the madvise call on ram_block_discard_range() is not done 
in the case of hugepages.
In this case, we need to call mmap to remap the hugetlbfs large page.

As I said in the previous email, recent kernels start to implement these 
calls for hugetlbfs, but I'm not sure that changing the mechanism of 
this ram_block_discard_range() function now is appropriate.
Do you agree with that ?


>>
>>
>>>
>>> And as raised, there is the problem with memory preallocation (where
>>> we should fail if it doesn't work) and ram discards being disabled
>>> because something relies on long-term page pinning ...
>>
>>
>> Yes. Do you suggest that we add a call to qemu_prealloc_mem() for the
>> remapped area in case of a backend->prealloc being true ?
> 
> Yes. Otherwise, with hugetlb, you might run out of hugetlb pages at 
> runtime and SIGBUS QEMU :(
> 
>>
>> Or as we are running on posix machines for this piece of code (ifndef
>> _WIN32) maybe we could simply add a MAP_POPULATE flag to the mmap call
>> done in qemu_ram_remap() in the case where the backend requires a
>> 'prealloc' ?  Can you confirm if this flag could be used on all systems
>> running this code ?
> 
> Please use qemu_prealloc_mem(). MAP_POPULATE has no guarantees, it's 
> really weird :/ mmap() might succeed even though MAP_POPULATE didn't 
> work ... and it's problematic with NUMA policies because we essentially 
> lose (overwrite) them.
> 
> And the whole mmap(MAP_FIXED) is an ugly hack. For example, we wouldn't 
> reset the memory policy we apply in 
> host_memory_backend_memory_complete() ... that code really needs a 
> rewrite to do it properly.

Maybe I can try to call madvise on hugepages too, only in this VM reset 
situation, and deal with the failure scenario of older kernels not 
supporting it... Leaving the behavior unchanged for all other 
locations calling this function.

But I'll need to verify the effect of these madvise calls on hugetlbfs 
on the latest upstream kernel and some older kernels too.



> 
> Ideally, we'd do something high-level like
> 
> 
> if (ram_block_discard_is_disabled()) {
>      /*
>       * We cannot safely discard RAM,  ... for example we might have
>       * to remap all guest RAM into vfio after discarding the
>       * problematic pages ... TODO.
>       */
>      exit(0);
> }
> 
> /* Throw away the problematic (poisoned) page. */
> if (ram_block_discard_range()) {
>      /* Conditionally fallback to MAP_FIXED workaround */
>      ...
> }
> 
> /* If preallocation was requested, we really must re-preallocate. */
> if (prealloc && qemu_prealloc_mem()) {
>      /* Preallocation failed .... */
>      exit(0);
> }
> 
> As you note the last part is tricky. See below.
> 
>>
>> Unfortunately, I don't know how to get the MEMORY_BACKEND corresponding
>> to a given memory block. I'm not sure that MEMORY_BACKEND(block->mr) is
>> a valid way to retrieve the Backend object and its 'prealloc' property
>> here. Could you please give me a direction here ?
> 
> We could add a RAM_PREALLOC flag to hint that this memory has "prealloc" 
> semantics.
> 
> I once had an alternative approach: Similar to ram_block_notify_resize() 
> we would implement ram_block_notify_remap().
> 
> That's where the backend could register and re-apply mmap properties 
> like NUMA policies (in case we have to fallback to MAP_FIXED) and handle 
> the preallocation.
> 
> So one would implement a ram_block_notify_remap() and maybe indicate if 
> we had to do MAP_FIXED or if we only discarded the page.
> 
> I once had a prototype for that, let me dig ...

That would be great !  Thanks.

> 
>>
>> I can send a new version using ram_block_discard_range() as you
>> suggested to replace the direct call to fallocate(), if you think it
>> would be better.
>> Please let me know what other enhancement(s) you'd like to see in this
>> code change.
> 
> Something along the lines above. Please let me know if you see problems 
> with that approach that I am missing.


Let me check the madvise use on hugetlbfs and if it works as expected,
I'll try to implement a V2 version of the fix proposal integrating a 
modified ram_block_discard_range() function.

I'll also remove the page size information from the signal handlers
and only keep it in the kvm_hwpoison_page_add() function.

I'll investigate how to keep track of the 'prealloc' attribute to 
optionally use when remapping the hugepages (on older kernels).
And if you find the prototype code you talked about that would 
definitely help :)

Thanks a lot,
William.

Re: [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages

Posted by David Hildenbrand 1 year, 3 months ago

>>>
>>> Remapping the page is needed to get rid of the poison. So if we want to
>>> avoid the mmap(), we have to shrink the memory address space -- which
>>> can be a real problem if we imagine a VM with 1G large pages for
>>> example. qemu_ram_remap() is used to regenerate the lost memory and the
>>> mmap() call looks mandatory on the reset phase.
>>
>> Why can't we use ram_block_discard_range() to zap the poisoned page
>> (unmap from page tables + conditionally drop from the page cache)? Is
>> there anything important I am missing?
> 
> Or maybe _I'm_ missing something important, but what I understand is that:
>      need_madvise = (rb->page_size == qemu_real_host_page_size());
> 
> ensures that the madvise call in ram_block_discard_range() is not done
> in the case of hugepages.
> In this case, we need to call mmap to remap the hugetlbfs large page.

Right, madvise(DONTNEED) works ever since 90e7e7f5ef3f ("mm: enable 
MADV_DONTNEED for hugetlb mappings").

But as you note, in QEMU we never called madvise(DONTNEED) for hugetlb 
as of today. But note that we always have an "fd" with hugetlb, because 
we never use mmap(MAP_ANON|MAP_PRIVATE|MAP_HUGETLB) in QEMU.

The weird thing is that if you have a mmap(fd, MAP_PRIVATE) hugetlb 
mapping, fallocate(fd, FALLOC_FL_PUNCH_HOLE) will *also* zap any private 
pages. So in contrast to "ordinary" memory, the madvise(DONTNEED) is not 
required.

(yes, it's very weird)

So the fallocate(fd, FALLOC_FL_PUNCH_HOLE) will zap the hugetlb page and 
you will get a fresh one on next fault.

For all the glorious details, see:

https://lore.kernel.org/linux-mm/2ddd0a26-33fd-9cde-3501-f0584bbffefc@redhat.com/
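
[Editor's sketch] The behaviour described above can be summarised as a small
decision model. This is a simplified illustration, not QEMU code: the
madvise condition mirrors the need_madvise line quoted in this thread, while
the "fallocate whenever there is an fd" condition, the struct and the
function name are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type: which calls a discard would issue. */
struct discard_plan {
    bool use_fallocate; /* fallocate(fd, FALLOC_FL_PUNCH_HOLE | ...) */
    bool use_madvise;   /* madvise(MADV_DONTNEED) */
};

/*
 * Simplified model of the discard decision. fd < 0 means "no backing
 * file". madvise is only planned when the block uses host-sized pages;
 * hugetlb blocks rely on the hole punch alone, which (as noted in the
 * thread) also zaps private pages of a MAP_PRIVATE hugetlb mapping.
 */
static struct discard_plan plan_discard(size_t block_page_size,
                                        size_t host_page_size, int fd)
{
    struct discard_plan plan;

    assert(block_page_size >= host_page_size);
    plan.use_fallocate = (fd >= 0);
    plan.use_madvise = (block_page_size == host_page_size);
    return plan;
}
```

For a 2 MiB hugetlb block with a valid fd, the model plans only the hole
punch; for anonymous memory with host-sized pages it plans only the madvise.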


> 
> As I said in the previous email, recent kernels start to implement these
> calls for hugetlbfs, but I'm not sure that changing the mechanism of
> this ram_block_discard_range() function now is appropriate.
> Do you agree with that ?

The key point is that it works for hugetlb without madvise(DONTNEED), 
which is weird :)

Which is also why the kernel change that introduced it noted: "Do note that 
there is no compelling use case for adding this support.
This was discussed in the RFC [1].  However, adding support makes sense
as it is fairly trivial and brings hugetlb functionality more in line
with 'normal' memory."

[...]

>>
>> So one would implement a ram_block_notify_remap() and maybe indicate if
>> we had to do MAP_FIXED or if we only discarded the page.
>>
>> I once had a prototype for that, let me dig ...
> 
> That would be great !  Thanks.

Found them:

https://gitlab.com/virtio-mem/qemu/-/commit/f528c861897d1086ae84ea1bcd6a0be43e8fea7d

https://gitlab.com/virtio-mem/qemu/-/commit/c5b0328654def8f168497715409d6364096eb63f

https://gitlab.com/virtio-mem/qemu/-/commit/15e9737907835105c132091ad10f9d0c9c68ea64

But note that I didn't realize back then that the mmap(MAP_FIXED) is the 
wrong way to do it, and that we actually have to DONTNEED/PUNCH_HOLE to 
do it properly. But to get the preallocation performed by the backend, 
it should still be valuable.

Note that I wonder if we can get rid of the mmap(MAP_FIXED) handling 
completely: likely we only support Linux with MCE recovery, and 
ram_block_discard_range() should do what we need under Linux.

That would make it a lot simpler.

> 
>>
>>>
>>> I can send a new version using ram_block_discard_range() as you
>>> suggested to replace the direct call to fallocate(), if you think it
>>> would be better.
>>> Please let me know what other enhancement(s) you'd like to see in this
>>> code change.
>>
>> Something along the lines above. Please let me know if you see problems
>> with that approach that I am missing.
> 
> 
> Let me check the madvise use on hugetlbfs and if it works as expected,
> I'll try to implement a V2 version of the fix proposal integrating a
> modified ram_block_discard_range() function.

As discussed, it might all be working. If not, we would have to fix 
ram_block_discard_range().

> 
> I'll also remove the page size information from the signal handlers
> and only keep it in the kvm_hwpoison_page_add() function.

That's good. Especially because there was talk in the last bi-weekly MM 
sync [1] about possibly indicating only the actually failed cachelines 
in the future, not necessarily the full page.

So relying on that interface to return the actual pagesize would not be 
future proof.

That session was in general very interesting and very relevant for your 
work; did you by any chance attend it? If not, we should find you the 
recordings, because the idea is to be able to configure to 
not-unmap-during-mce, and instead only inform the guest OS about the MCE 
(forward it). Which avoids any HGM (high-granularity mapping) issues 
completely.

Only during reboot of the VM will we have to do exactly what is being 
done in this series: zap the whole *page* so our fresh OS will see "all 
non-faulty" memory.

[1] 
https://lkml.kernel.org/r/9242f7cc-6b9d-b807-9079-db0ca81f3c6d@google.com

> 
> I'll investigate how to keep track of the 'prealloc' attribute to
> optionally use when remapping the hugepages (on older kernels).
> And if you find the prototype code you talked about that would
> definitely help :)

Right, the above should help getting that sorted out (but the code is 4 
years old, so it won't "just apply").

-- 
Cheers,

David / dhildenb

[PATCH v2 0/7] hugetlbfs memory HW error fixes

Posted by William Roche 1 year, 3 months ago

From: William Roche <william.roche@oracle.com>

Hi David,

Here is an updated description of the patch set:
 ---
This set of patches fixes several problems with hardware memory errors
impacting hugetlbfs memory backed VMs. When using hugetlbfs large
pages, any large page location being impacted by an HW memory error
results in poisoning the entire page, suddenly making a large chunk of
the VM memory unusable.

The main problem that currently exists in Qemu is the lack of backend
file repair before resetting the VM memory, which leaves the impacted
memory silently unusable even after a VM reboot.

In order to fix this issue, we track the page size of the impacted
memory block with the associated poisoned page location.

Using the size information we also call ram_block_discard_range() to
regenerate the memory on VM reset when running qemu_ram_remap(), so
that poisoned memory backed by a hugetlbfs file is regenerated by
punching a hole in this file. A new page is loaded when the location
is first touched.

In case of a discard failure we fall back to unmap/remap the memory
location and reset the memory settings.

We also have to honor the 'prealloc' attribute even after a successful
discard, so we reapply the memory settings in this case too.

This memory setting is performed by a new remap notification mechanism
calling host_memory_backend_ram_remapped() function when a region of
a memory block is remapped.

We also issue a message describing the impact of a large page memory
loss. It is reported only once, when the page is poisoned.
 ---
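
[Editor's sketch] The remap notification idea described above can be
pictured as a list of callbacks invoked for each remapped region. This is
only an illustration under assumptions: the type and function names below
are hypothetical and do not reproduce the ram_block_notify_remap() or
host_memory_backend_ram_remapped() code from the series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: called for each remapped region. */
typedef void (*remap_fn)(void *opaque, void *host, size_t offset,
                         size_t size);

struct remap_notifier {
    remap_fn fn;
    void *opaque;
    struct remap_notifier *next;
};

/* Add a notifier at the head of the list. */
static void remap_notifier_register(struct remap_notifier **list,
                                    struct remap_notifier *n)
{
    assert(n && n->fn);
    n->next = *list;
    *list = n;
}

/* Tell every registered notifier that a region was remapped. */
static void remap_notify(struct remap_notifier *list, void *host,
                         size_t offset, size_t size)
{
    struct remap_notifier *n;

    for (n = list; n; n = n->next) {
        n->fn(n->opaque, host, offset, size);
    }
}

/*
 * Example callback. A memory backend would reapply its settings
 * (policy, preallocation) here; this one only sums the sizes.
 */
static void count_remapped_bytes(void *opaque, void *host, size_t offset,
                                 size_t size)
{
    (void)host;
    (void)offset;
    *(size_t *)opaque += size;
}
```

The point of the indirection is that the remap code does not need to know
which settings a backend wants to reapply; it only reports the region.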


v1 -> v2:
. I removed the kernel SIGBUS siginfo provided lsb size information
  tracking. Only relying on the RAMBlock page_size instead.

. I adapted the 3 patches you pointed me to, in order to implement the
  notification mechanism on remap.  Thank you for this code!
  I left them as authored by you.
  But I haven't tested if the policy setting works as expected on VM
  reset, only that the replacement of physical memory works.

. I also removed the old memory setting that was kept in qemu_ram_remap()
  but this small last fix could probably be merged with your last commit.


Yesterday I also got the recording of the mm-linux session about the
kernel modification on largepage poisoning, and discussed this topic
with a colleague of mine who attended the meeting.

Regarding your question about the use of -mem-path: we communicated
that this option is deprecated and advised all users to use the
following options instead:
-object memory-backend-file,id=pc.ram,mem-path=/dev/hugepages,prealloc,size=XXX -machine memory-backend=pc.ram 

We could now add the request to use a share=on attribute too, to avoid
the additional message about dangerous discard situations.


This code is scripts/checkpatch.pl clean, and 'make check' runs fine
on both x86 and Arm.


David Hildenbrand (3):
  numa: Introduce and use ram_block_notify_remap()
  hostmem: Factor out applying settings
  hostmem: Handle remapping of RAM

William Roche (4):
  accel/kvm: Keep track of the HWPoisonPage page_size
  system/physmem: poisoned memory discard on reboot
  accel/kvm: Report the loss of a large memory page
  system/physmem: Memory settings applied on remap notification

 accel/kvm/kvm-all.c       |  17 +++-
 backends/hostmem.c        | 184 +++++++++++++++++++++++---------------
 hw/core/numa.c            |  11 +++
 include/exec/cpu-common.h |   1 +
 include/exec/ramlist.h    |   3 +
 include/sysemu/hostmem.h  |   1 +
 include/sysemu/kvm_int.h  |   4 +-
 system/physmem.c          |  62 ++++++++-----
 target/arm/kvm.c          |   2 +-
 target/i386/kvm/kvm.c     |   2 +-
 10 files changed, 189 insertions(+), 98 deletions(-)

-- 
2.43.5

[PATCH v2 1/7] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by William Roche 1 year, 3 months ago

From: William Roche <william.roche@oracle.com>

When a memory page is added to the hwpoison_page_list, include
the page size information.  This size is the backend real page
size. To better deal with hugepages, we create a single entry
for the entire page.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c       |  8 +++++++-
 include/exec/cpu-common.h |  1 +
 system/physmem.c          | 13 +++++++++++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 801cff16a5..6dd06f5edf 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
  */
 typedef struct HWPoisonPage {
     ram_addr_t ram_addr;
+    size_t     page_size;
     QLIST_ENTRY(HWPoisonPage) list;
 } HWPoisonPage;
 
@@ -1278,7 +1279,7 @@ static void kvm_unpoison_all(void *param)
 
     QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
         QLIST_REMOVE(page, list);
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr, page->page_size);
         g_free(page);
     }
 }
@@ -1286,6 +1287,10 @@ static void kvm_unpoison_all(void *param)
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
+    size_t sz = qemu_ram_pagesize_from_addr(ram_addr);
+
+    if (sz > TARGET_PAGE_SIZE)
+        ram_addr = ROUND_DOWN(ram_addr, sz);
 
     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
@@ -1294,6 +1299,7 @@ void kvm_hwpoison_page_add(ram_addr_t ram_addr)
     }
     page = g_new(HWPoisonPage, 1);
     page->ram_addr = ram_addr;
+    page->page_size = sz;
     QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
 }
 
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 638dc806a5..8f8f7ad567 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -108,6 +108,7 @@ bool qemu_ram_is_named_file(RAMBlock *rb);
 int qemu_ram_get_fd(RAMBlock *rb);
 
 size_t qemu_ram_pagesize(RAMBlock *block);
+size_t qemu_ram_pagesize_from_addr(ram_addr_t addr);
 size_t qemu_ram_pagesize_largest(void);
 
 /**
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..750604d47d 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1665,6 +1665,19 @@ size_t qemu_ram_pagesize(RAMBlock *rb)
     return rb->page_size;
 }
 
+/* Return backend real page size used for the given ram_addr. */
+size_t qemu_ram_pagesize_from_addr(ram_addr_t addr)
+{
+    RAMBlock *rb;
+
+    RCU_READ_LOCK_GUARD();
+    rb =  qemu_get_ram_block(addr);
+    if (!rb) {
+        return TARGET_PAGE_SIZE;
+    }
+    return qemu_ram_pagesize(rb);
+}
+
 /* Returns the largest size of page in use */
 size_t qemu_ram_pagesize_largest(void)
 {
-- 
2.43.5
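
[Editor's sketch] The "single entry for the entire page" bookkeeping in this
patch can be illustrated with a standalone list. This is a minimal sketch,
not the QEMU implementation: it uses a plain linked list instead of QLIST,
takes the page size as a parameter instead of looking it up from the
RAMBlock, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct poison_entry {
    uint64_t addr;      /* start of the backing page */
    uint64_t page_size; /* size of the backing page */
    struct poison_entry *next;
};

/*
 * Record a poisoned address. The address is rounded down to its backing
 * page so that several errors in one large page share a single entry.
 * Returns 1 if an entry was added, 0 if the page was already tracked,
 * -1 on allocation failure.
 */
static int poison_list_add(struct poison_entry **head, uint64_t addr,
                           uint64_t page_size)
{
    struct poison_entry *e;

    /* Page sizes are powers of two, so masking rounds down. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    addr &= ~(page_size - 1);

    for (e = *head; e; e = e->next) {
        if (e->addr == addr) {
            return 0;
        }
    }
    e = malloc(sizeof(*e));
    if (!e) {
        return -1;
    }
    e->addr = addr;
    e->page_size = page_size;
    e->next = *head;
    *head = e;
    return 1;
}

/* Release all entries and reset the list head. */
static void poison_list_free(struct poison_entry **head)
{
    while (*head) {
        struct poison_entry *e = *head;

        *head = e->next;
        free(e);
    }
}
```

With a 2 MiB page size, addresses 0x201000 and 0x3ff000 both round down to
0x200000 and therefore produce a single entry.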

Re: [PATCH v2 1/7] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by David Hildenbrand 1 year, 2 months ago

On 07.11.24 11:21, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> When a memory page is added to the hwpoison_page_list, include
> the page size information.  This size is the backend real page
> size. To better deal with hugepages, we create a single entry
> for the entire page.
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   accel/kvm/kvm-all.c       |  8 +++++++-
>   include/exec/cpu-common.h |  1 +
>   system/physmem.c          | 13 +++++++++++++
>   3 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 801cff16a5..6dd06f5edf 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
>    */
>   typedef struct HWPoisonPage {
>       ram_addr_t ram_addr;
> +    size_t     page_size;
>       QLIST_ENTRY(HWPoisonPage) list;
>   } HWPoisonPage;
>   
> @@ -1278,7 +1279,7 @@ static void kvm_unpoison_all(void *param)
>   
>       QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>           QLIST_REMOVE(page, list);
> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
> +        qemu_ram_remap(page->ram_addr, page->page_size);
>           g_free(page);

I'm curious, can't we simply drop the size parameter from qemu_ram_remap()
completely and determine the page size internally from the RAMBlock that
we are looking up already?

This way, we avoid yet another lookup in qemu_ram_pagesize_from_addr(),
and can just handle it completely in qemu_ram_remap().

In particular, to be future proof, we should also align the offset down to
the pagesize.

I'm thinking about something like this:

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 801cff16a5..8a47aa7258 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1278,7 +1278,7 @@ static void kvm_unpoison_all(void *param)
  
      QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
          QLIST_REMOVE(page, list);
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr);
          g_free(page);
      }
  }
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 638dc806a5..50a829d31f 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -67,7 +67,7 @@ typedef uintptr_t ram_addr_t;
  
  /* memory API */
  
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
+void qemu_ram_remap(ram_addr_t addr);
  /* This should not be used by devices.  */
  ram_addr_t qemu_ram_addr_from_host(void *ptr);
  ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..5f19bec089 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2167,10 +2167,10 @@ void qemu_ram_free(RAMBlock *block)
  }
  
  #ifndef _WIN32
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+void qemu_ram_remap(ram_addr_t addr)
  {
      RAMBlock *block;
-    ram_addr_t offset;
+    ram_addr_t offset, length;
      int flags;
      void *area, *vaddr;
      int prot;
@@ -2178,6 +2178,10 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
      RAMBLOCK_FOREACH(block) {
          offset = addr - block->offset;
          if (offset < block->max_length) {
+            /* Respect the pagesize of our RAMBlock. */
+            offset = QEMU_ALIGN_DOWN(offset, qemu_ram_pagesize(block));
+            length = qemu_ram_pagesize(block);
+
              vaddr = ramblock_ptr(block, offset);
              if (block->flags & RAM_PREALLOC) {
                  ;
@@ -2206,6 +2210,8 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                  memory_try_enable_merging(vaddr, length);
                  qemu_ram_setup_dump(vaddr, length);
              }
+
+            break;
          }
      }
  }
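
[Editor's sketch] The alignment step in the proposal above amounts to
rounding the offset down to the block's page size and remapping exactly one
page. A minimal standalone illustration, assuming power-of-two page sizes;
the struct and function names are hypothetical and the masking stands in
for QEMU_ALIGN_DOWN():

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the single page that would be remapped. */
struct remap_range {
    uint64_t offset; /* page-aligned offset within the block */
    uint64_t length; /* always exactly one page */
};

/* Expand an arbitrary offset inside a block to its containing page. */
static struct remap_range remap_range_for(uint64_t offset,
                                          uint64_t page_size)
{
    struct remap_range r;

    /* Masking only rounds down correctly for power-of-two sizes. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    r.offset = offset & ~(page_size - 1);
    r.length = page_size;
    return r;
}
```

For example, offset 0x201234 in a block backed by 2 MiB pages maps to the
range starting at 0x200000 with length 0x200000.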


-- 
Cheers,

David / dhildenb

Re: [PATCH v2 1/7] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by William Roche 1 year, 2 months ago

On 11/12/24 11:30, David Hildenbrand wrote:
> On 07.11.24 11:21, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> When a memory page is added to the hwpoison_page_list, include
>> the page size information.  This size is the backend real page
>> size. To better deal with hugepages, we create a single entry
>> for the entire page.
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   accel/kvm/kvm-all.c       |  8 +++++++-
>>   include/exec/cpu-common.h |  1 +
>>   system/physmem.c          | 13 +++++++++++++
>>   3 files changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 801cff16a5..6dd06f5edf 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned 
>> int extension)
>>    */
>>   typedef struct HWPoisonPage {
>>       ram_addr_t ram_addr;
>> +    size_t     page_size;
>>       QLIST_ENTRY(HWPoisonPage) list;
>>   } HWPoisonPage;
>> @@ -1278,7 +1279,7 @@ static void kvm_unpoison_all(void *param)
>>       QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>>           QLIST_REMOVE(page, list);
>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>> +        qemu_ram_remap(page->ram_addr, page->page_size);
>>           g_free(page);
> 
> I'm curious, can't we simply drop the size parameter from qemu_ram_remap()
> completely and determine the page size internally from the RAMBlock that
> we are looking up already?
> 
> This way, we avoid yet another lookup in qemu_ram_pagesize_from_addr(),
> and can just handle it completely in qemu_ram_remap().
> 
> In particular, to be future proof, we should also align the offset down to
> the pagesize.
> 
> I'm thinking about something like this:
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 801cff16a5..8a47aa7258 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1278,7 +1278,7 @@ static void kvm_unpoison_all(void *param)
> 
>       QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>           QLIST_REMOVE(page, list);
> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
> +        qemu_ram_remap(page->ram_addr);
>           g_free(page);
>       }
>   }
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 638dc806a5..50a829d31f 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -67,7 +67,7 @@ typedef uintptr_t ram_addr_t;
> 
>   /* memory API */
> 
> -void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
> +void qemu_ram_remap(ram_addr_t addr);
>   /* This should not be used by devices.  */
>   ram_addr_t qemu_ram_addr_from_host(void *ptr);
>   ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
> diff --git a/system/physmem.c b/system/physmem.c
> index dc1db3a384..5f19bec089 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2167,10 +2167,10 @@ void qemu_ram_free(RAMBlock *block)
>   }
> 
>   #ifndef _WIN32
> -void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
> +void qemu_ram_remap(ram_addr_t addr)
>   {
>       RAMBlock *block;
> -    ram_addr_t offset;
> +    ram_addr_t offset, length;
>       int flags;
>       void *area, *vaddr;
>       int prot;
> @@ -2178,6 +2178,10 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t 
> length)
>       RAMBLOCK_FOREACH(block) {
>           offset = addr - block->offset;
>           if (offset < block->max_length) {
> +            /* Respect the pagesize of our RAMBlock. */
> +            offset = QEMU_ALIGN_DOWN(offset, qemu_ram_pagesize(block));
> +            length = qemu_ram_pagesize(block);
> +
>               vaddr = ramblock_ptr(block, offset);
>               if (block->flags & RAM_PREALLOC) {
>                   ;
> @@ -2206,6 +2210,8 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t 
> length)
>                   memory_try_enable_merging(vaddr, length);
>                   qemu_ram_setup_dump(vaddr, length);
>               }
> +
> +            break;
>           }
>       }
>   }
> 
> 


Yes, this is a workable option, and as you say it would avoid a size 
lookup (needed because the kernel siginfo can be incorrect) and avoid 
tracking the size of the poisoned pages along with their addresses.

But if we want to keep the information about the loss of a large page 
(which I think is useful), we would have to introduce the page size 
lookup when adding the page to the poison list. So in my opinion, 
keeping track of the page size and reusing it on remap isn't so bad. But 
if you prefer that we don't track the page size and do one lookup when 
inserting the page into the poison list and another in qemu_ram_remap(), 
of course we can do that.

There is also something to consider about the future: we'll also have to 
deal with migration of VMs that have been impacted by a memory error. And 
knowing the size of the poisoned pages could be useful there too. But this 
is another topic...

I would vote to keep this size tracking.

Re: [PATCH v2 1/7] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by David Hildenbrand 1 year, 2 months ago

On 12.11.24 19:17, William Roche wrote:
> On 11/12/24 11:30, David Hildenbrand wrote:
>> On 07.11.24 11:21, William Roche wrote:
>>> From: William Roche <william.roche@oracle.com>
>>>
>>> When a memory page is added to the hwpoison_page_list, include
>>> the page size information.  This size is the backend real page
>>> size. To better deal with hugepages, we create a single entry
>>> for the entire page.
>>>
>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>> ---
>>>    accel/kvm/kvm-all.c       |  8 +++++++-
>>>    include/exec/cpu-common.h |  1 +
>>>    system/physmem.c          | 13 +++++++++++++
>>>    3 files changed, 21 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>> index 801cff16a5..6dd06f5edf 100644
>>> --- a/accel/kvm/kvm-all.c
>>> +++ b/accel/kvm/kvm-all.c
>>> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned
>>> int extension)
>>>     */
>>>    typedef struct HWPoisonPage {
>>>        ram_addr_t ram_addr;
>>> +    size_t     page_size;
>>>        QLIST_ENTRY(HWPoisonPage) list;
>>>    } HWPoisonPage;
>>> @@ -1278,7 +1279,7 @@ static void kvm_unpoison_all(void *param)
>>>        QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>>>            QLIST_REMOVE(page, list);
>>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>>> +        qemu_ram_remap(page->ram_addr, page->page_size);
>>>            g_free(page);
>>
>> I'm curious, can't we simply drop the size parameter from qemu_ram_remap()
>> completely and determine the page size internally from the RAMBlock that
>> we are looking up already?
>>
>> This way, we avoid yet another lookup in qemu_ram_pagesize_from_addr(),
>> and can just handle it completely in qemu_ram_remap().
>>
>> In particular, to be future proof, we should also align the offset down to
>> the pagesize.
>>
>> I'm thinking about something like this:
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 801cff16a5..8a47aa7258 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -1278,7 +1278,7 @@ static void kvm_unpoison_all(void *param)
>>
>>        QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>>            QLIST_REMOVE(page, list);
>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>> +        qemu_ram_remap(page->ram_addr);
>>            g_free(page);
>>        }
>>    }
>> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
>> index 638dc806a5..50a829d31f 100644
>> --- a/include/exec/cpu-common.h
>> +++ b/include/exec/cpu-common.h
>> @@ -67,7 +67,7 @@ typedef uintptr_t ram_addr_t;
>>
>>    /* memory API */
>>
>> -void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
>> +void qemu_ram_remap(ram_addr_t addr);
>>    /* This should not be used by devices.  */
>>    ram_addr_t qemu_ram_addr_from_host(void *ptr);
>>    ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
>> diff --git a/system/physmem.c b/system/physmem.c
>> index dc1db3a384..5f19bec089 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -2167,10 +2167,10 @@ void qemu_ram_free(RAMBlock *block)
>>    }
>>
>>    #ifndef _WIN32
>> -void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
>> +void qemu_ram_remap(ram_addr_t addr)
>>    {
>>        RAMBlock *block;
>> -    ram_addr_t offset;
>> +    ram_addr_t offset, length;
>>        int flags;
>>        void *area, *vaddr;
>>        int prot;
>> @@ -2178,6 +2178,10 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t
>> length)
>>        RAMBLOCK_FOREACH(block) {
>>            offset = addr - block->offset;
>>            if (offset < block->max_length) {
>> +            /* Respect the pagesize of our RAMBlock. */
>> +            offset = QEMU_ALIGN_DOWN(offset, qemu_ram_pagesize(block));
>> +            length = qemu_ram_pagesize(block);
>> +
>>                vaddr = ramblock_ptr(block, offset);
>>                if (block->flags & RAM_PREALLOC) {
>>                    ;
>> @@ -2206,6 +2210,8 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t
>> length)
>>                    memory_try_enable_merging(vaddr, length);
>>                    qemu_ram_setup_dump(vaddr, length);
>>                }
>> +
>> +            break;
>>            }
>>        }
>>    }
>>
>>
> 
> 
> Yes this is a working possibility, and as you say it would provide the
> advantage to avoid a size lookup (needed because the kernel siginfo can
> be incorrect) and avoid tracking the poisoned pages size, with the
> addresses.
> 
> But if we want to keep the information about the loss of a large page
> (which I think is useful) we would have to introduce the page size
> lookup when adding the page to the poison list. So according to me,

Right, that would be independent of the remap logic.

What I dislike about qemu_ram_remap() is that it looks like we could be 
remapping a range that's possibly larger than a single page.

But it really only works on a single address, expanding that to the 
page. Passing in a length that crosses RAMBlocks would not work as 
expected ...

So I'd prefer if we let qemu_ram_remap() do exactly that ... remap a 
single page ...

> keeping track of the page size and reusing it on remap isn't so bad. But
> if you prefer that we don't track the page size and do a lookup on page
> insert into the poison list and another in qemu_ram_remap(), of course
> we can do that.

... and lookup the page size manually here if we really have to, for 
example to warn/trace errors.
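
[Editor's sketch] The concern about lengths crossing RAMBlocks follows from
how the lookup works: the address is matched against one block with a single
unsigned comparison, and nothing checks where the range ends. A standalone
illustration; the struct and function names are hypothetical, and only the
"offset = addr - start; offset < max_length" test is taken from the diff
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a RAMBlock's address range. */
struct block {
    uint64_t start;
    uint64_t max_length;
};

/* Return the index of the block containing addr, or -1 if none does. */
static int find_block(const struct block *blocks, size_t n, uint64_t addr)
{
    size_t i;

    assert(blocks != NULL);
    for (i = 0; i < n; i++) {
        /*
         * Unsigned subtraction wraps to a huge value when
         * addr < start, so one comparison checks both bounds.
         */
        uint64_t offset = addr - blocks[i].start;

        if (offset < blocks[i].max_length) {
            return (int)i;
        }
    }
    return -1;
}

/* Nonzero when [addr, addr + length) lies entirely within one block. */
static int range_in_one_block(const struct block *blocks, size_t n,
                              uint64_t addr, uint64_t length)
{
    int i = find_block(blocks, n, addr);

    if (i < 0 || length == 0) {
        return 0;
    }
    return length <= blocks[i].max_length - (addr - blocks[i].start);
}
```

A caller-supplied length can thus start in one block and run past its end,
which is the case a single-page interface rules out by construction.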

> There is also something to consider about the future: we'll also have to
> deal with migration of VM that have been impacted by a memory error. And
> knowing about the poisoned pages size could be useful too. But this is
> another topic...

Yes, although the destination should be able to derive the same thing 
from the address I guess. We expect src and dst QEMU to use the same 
memory backing.

-- 
Cheers,

David / dhildenb

[PATCH v2 2/7] system/physmem: poisoned memory discard on reboot

Posted by William Roche 1 year, 3 months ago

From: William Roche <william.roche@oracle.com>

We take into account the recorded page sizes to repair the
memory locations, calling ram_block_discard_range() to punch a hole
in the backend file when necessary and regenerate a usable memory.
Fall back to unmap/remap the memory location(s) if the kernel doesn't
support the madvise calls used by ram_block_discard_range().

Hugetlbfs poison case is also taken into account as a hole punch
with fallocate will reload a new page when first touched.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/physmem.c | 50 +++++++++++++++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 750604d47d..dfea120cc5 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2197,27 +2197,37 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
             } else if (xen_enabled()) {
                 abort();
             } else {
-                flags = MAP_FIXED;
-                flags |= block->flags & RAM_SHARED ?
-                         MAP_SHARED : MAP_PRIVATE;
-                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
-                prot = PROT_READ;
-                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
-                if (block->fd >= 0) {
-                    area = mmap(vaddr, length, prot, flags, block->fd,
-                                offset + block->fd_offset);
-                } else {
-                    flags |= MAP_ANONYMOUS;
-                    area = mmap(vaddr, length, prot, flags, -1, 0);
-                }
-                if (area != vaddr) {
-                    error_report("Could not remap addr: "
-                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
-                                 length, addr);
-                    exit(1);
+                if (ram_block_discard_range(block, offset + block->fd_offset,
+                                            length) != 0) {
+                    if (length > TARGET_PAGE_SIZE) {
+                        /* punch hole is mandatory on hugetlbfs */
+                        error_report("large page recovery failure addr: "
+                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     length, addr);
+                        exit(1);
+                    }
+                    flags = MAP_FIXED;
+                    flags |= block->flags & RAM_SHARED ?
+                             MAP_SHARED : MAP_PRIVATE;
+                    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
+                    prot = PROT_READ;
+                    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
+                    if (block->fd >= 0) {
+                        area = mmap(vaddr, length, prot, flags, block->fd,
+                                    offset + block->fd_offset);
+                    } else {
+                        flags |= MAP_ANONYMOUS;
+                        area = mmap(vaddr, length, prot, flags, -1, 0);
+                    }
+                    if (area != vaddr) {
+                        error_report("Could not remap addr: "
+                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     length, addr);
+                        exit(1);
+                    }
+                    memory_try_enable_merging(vaddr, length);
+                    qemu_ram_setup_dump(vaddr, length);
                 }
-                memory_try_enable_merging(vaddr, length);
-                qemu_ram_setup_dump(vaddr, length);
             }
         }
     }
-- 
2.43.5

Re: [PATCH v2 2/7] system/physmem: poisoned memory discard on reboot

Posted by David Hildenbrand 1 year, 2 months ago

On 07.11.24 11:21, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> We take into account the recorded page sizes to repair the
> memory locations, calling ram_block_discard_range() to punch a hole
> in the backend file when necessary and regenerate a usable memory.
> Fall back to unmap/remap the memory location(s) if the kernel doesn't
> support the madvise calls used by ram_block_discard_range().
> 
> Hugetlbfs poison case is also taken into account as a hole punch
> with fallocate will reload a new page when first touched.
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   system/physmem.c | 50 +++++++++++++++++++++++++++++-------------------
>   1 file changed, 30 insertions(+), 20 deletions(-)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index 750604d47d..dfea120cc5 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2197,27 +2197,37 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
>               } else if (xen_enabled()) {
>                   abort();
>               } else {
> -                flags = MAP_FIXED;
> -                flags |= block->flags & RAM_SHARED ?
> -                         MAP_SHARED : MAP_PRIVATE;
> -                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
> -                prot = PROT_READ;
> -                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
> -                if (block->fd >= 0) {
> -                    area = mmap(vaddr, length, prot, flags, block->fd,
> -                                offset + block->fd_offset);
> -                } else {
> -                    flags |= MAP_ANONYMOUS;
> -                    area = mmap(vaddr, length, prot, flags, -1, 0);
> -                }
> -                if (area != vaddr) {
> -                    error_report("Could not remap addr: "
> -                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> -                                 length, addr);
> -                    exit(1);
> +                if (ram_block_discard_range(block, offset + block->fd_offset,
> +                                            length) != 0) {
> +                    if (length > TARGET_PAGE_SIZE) {
> +                        /* punch hole is mandatory on hugetlbfs */
> +                        error_report("large page recovery failure addr: "
> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                                     length, addr);
> +                        exit(1);
> +                    }

For shared memory we really need it.

Private file-backed is weird ... because we don't know if the shared or 
the private page is problematic ... :(

Maybe we should just do:

if (block->fd >= 0) {
	/* mmap(MAP_FIXED) cannot reliably zap our problematic page. */
	error_report(...);
	exit(-1);
}

Or alternatively

if (block->fd >= 0 && qemu_ram_is_shared(block)) {
	/* mmap() cannot possibly zap our problematic page. */
	error_report(...);
	exit(-1);
} else if (block->fd >= 0) {
	/*
	 * MAP_PRIVATE file-backed ... mmap() can only zap the private
	 * page, not the shared one ... we don't know which one is
	 * problematic.
	 */
	warn_report(...);
}


> +                    flags = MAP_FIXED;
> +                    flags |= block->flags & RAM_SHARED ?
> +                             MAP_SHARED : MAP_PRIVATE;
> +                    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
> +                    prot = PROT_READ;
> +                    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
> +                    if (block->fd >= 0) {
> +                        area = mmap(vaddr, length, prot, flags, block->fd,
> +                                    offset + block->fd_offset);
> +                    } else {
> +                        flags |= MAP_ANONYMOUS;
> +                        area = mmap(vaddr, length, prot, flags, -1, 0);
> +                    }
> +                    if (area != vaddr) {
> +                        error_report("Could not remap addr: "
> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                                     length, addr);
> +                        exit(1);
> +                    }
> +                    memory_try_enable_merging(vaddr, length);
> +                    qemu_ram_setup_dump(vaddr, length);

Can we factor the mmap hack out into a separate helper function to clean 
this up a bit?


-- 
Cheers,

David / dhildenb

Re: [PATCH v2 2/7] system/physmem: poisoned memory discard on reboot

Posted by William Roche 1 year, 2 months ago

On 11/12/24 12:07, David Hildenbrand wrote:
> On 07.11.24 11:21, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> We take into account the recorded page sizes to repair the
>> memory locations, calling ram_block_discard_range() to punch a hole
>> in the backend file when necessary and regenerate a usable memory.
>> Fall back to unmap/remap the memory location(s) if the kernel doesn't
>> support the madvise calls used by ram_block_discard_range().
>>
>> Hugetlbfs poison case is also taken into account as a hole punch
>> with fallocate will reload a new page when first touched.
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   system/physmem.c | 50 +++++++++++++++++++++++++++++-------------------
>>   1 file changed, 30 insertions(+), 20 deletions(-)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 750604d47d..dfea120cc5 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -2197,27 +2197,37 @@ void qemu_ram_remap(ram_addr_t addr, 
>> ram_addr_t length)
>>               } else if (xen_enabled()) {
>>                   abort();
>>               } else {
>> -                flags = MAP_FIXED;
>> -                flags |= block->flags & RAM_SHARED ?
>> -                         MAP_SHARED : MAP_PRIVATE;
>> -                flags |= block->flags & RAM_NORESERVE ? 
>> MAP_NORESERVE : 0;
>> -                prot = PROT_READ;
>> -                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>> -                if (block->fd >= 0) {
>> -                    area = mmap(vaddr, length, prot, flags, block->fd,
>> -                                offset + block->fd_offset);
>> -                } else {
>> -                    flags |= MAP_ANONYMOUS;
>> -                    area = mmap(vaddr, length, prot, flags, -1, 0);
>> -                }
>> -                if (area != vaddr) {
>> -                    error_report("Could not remap addr: "
>> -                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> -                                 length, addr);
>> -                    exit(1);
>> +                if (ram_block_discard_range(block, offset + block->fd_offset,
>> +                                            length) != 0) {
>> +                    if (length > TARGET_PAGE_SIZE) {
>> +                        /* punch hole is mandatory on hugetlbfs */
>> +                        error_report("large page recovery failure addr: "
>> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> +                                     length, addr);
>> +                        exit(1);
>> +                    }
> 
> For shared memory we really need it.
> 
> Private file-backed is weird ... because we don't know if the shared or 
> the private page is problematic ... :(


I agree with you, and we have to decide when should we bail out if 
ram_block_discard_range() doesn't work.
According to me, if discard doesn't work and we are dealing with 
file-backed largepages (shared or not) we have to exit, because the 
fallocate is mandatory. It is the case with hugetlbfs.

In the non-file-backed case, or the file-backed non-largepage private 
case, according to me we can trust the mmap() method to put everything 
back in place for the VM reset to work as expected.
Are there aspects I don't see, and for which mmap + the remap handler is 
not sufficient and we should also bail out here ?



> 
> Maybe we should just do:
> 
> if (block->fd >= 0) {
>      /* mmap(MAP_FIXED) cannot reliably zap our problematic page. */
>      error_report(...);
>      exit(-1);
> }
> 
> Or alternatively
> 
> if (block->fd >= 0 && qemu_ram_is_shared(block)) {
>      /* mmap() cannot possibly zap our problematic page. */
>      error_report(...);
>      exit(-1);
> } else if (block->fd >= 0) {
>      /*
>       * MAP_PRIVATE file-backed ... mmap() can only zap the private
>       * page, not the shared one ... we don't know which one is
>       * problematic.
>       */
>      warn_report(...);
> }

I also agree that any file-backed/shared case should bail out if discard 
(fallocate) fails, no matter whether large or standard pages are used.

In the case of file-backed private standard pages, I think that a poison 
on the private page can be fixed with a new mmap.
According to me, there are 2 cases to consider: at the moment the poison 
is seen, the page was dirty (so it means that it was a pure private 
page), or the page was not dirty, and in this case the poison could 
replace this non-dirty page with a new copy of the file content.
In both cases, I'd say that the remap should clean up the poison.

So the conditions when discard fails, could be something like:

    if (block->fd >= 0 && (qemu_ram_is_shared(block) ||
        (length > TARGET_PAGE_SIZE))) {
        /* punch hole is mandatory, mmap() cannot possibly zap our page*/
         error_report("%spage recovery failure addr: "
                      RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
                      (length > TARGET_PAGE_SIZE) ? "large " : "",
                      length, addr);
         exit(1);
     }


>> +                    flags = MAP_FIXED;
>> +                    flags |= block->flags & RAM_SHARED ?
>> +                             MAP_SHARED : MAP_PRIVATE;
>> +                    flags |= block->flags & RAM_NORESERVE ? 
>> MAP_NORESERVE : 0;
>> +                    prot = PROT_READ;
>> +                    prot |= block->flags & RAM_READONLY ? 0 : 
>> PROT_WRITE;
>> +                    if (block->fd >= 0) {
>> +                        area = mmap(vaddr, length, prot, flags, 
>> block->fd,
>> +                                    offset + block->fd_offset);
>> +                    } else {
>> +                        flags |= MAP_ANONYMOUS;
>> +                        area = mmap(vaddr, length, prot, flags, -1, 0);
>> +                    }
>> +                    if (area != vaddr) {
>> +                        error_report("Could not remap addr: "
>> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> +                                     length, addr);
>> +                        exit(1);
>> +                    }
>> +                    memory_try_enable_merging(vaddr, length);
>> +                    qemu_ram_setup_dump(vaddr, length);
> 
> Can we factor the mmap hack out into a separate helper function to clean 
> this up a bit?

Sure, I'll do that.
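
For illustration, here is a minimal sketch of what such a factored-out helper could look like. The name remap_fixed() and its boolean parameters are hypothetical stand-ins for the RAMBlock flags used in the patch (RAM_SHARED, RAM_READONLY, RAM_NORESERVE); this is not the helper that was eventually posted, only the mmap(MAP_FIXED) logic from the diff above moved into one function.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Hypothetical helper: replace the mapping at [vaddr, vaddr + length)
 * with a fresh one, mirroring the fallback path of qemu_ram_remap().
 * fd < 0 means "not file-backed", so an anonymous mapping is used.
 * Returns 0 on success and -1 on failure.
 */
static int remap_fixed(void *vaddr, size_t length, int fd, off_t fd_offset,
                       bool shared, bool readonly, bool noreserve)
{
    int flags = MAP_FIXED | (shared ? MAP_SHARED : MAP_PRIVATE);
    int prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
    void *area;

    if (noreserve) {
        flags |= MAP_NORESERVE;
    }
    if (fd < 0) {
        /* Anonymous memory: no backing file and no offset. */
        flags |= MAP_ANONYMOUS;
        fd = -1;
        fd_offset = 0;
    }

    area = mmap(vaddr, length, prot, flags, fd, fd_offset);
    /* MAP_FAILED never equals vaddr, so this also covers mmap() errors. */
    return area == vaddr ? 0 : -1;
}
```

The caller would keep the error_report()/exit(1) handling and the memory_try_enable_merging()/qemu_ram_setup_dump() calls, so the helper stays free of policy.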

Re: [PATCH v2 2/7] system/physmem: poisoned memory discard on reboot

Posted by David Hildenbrand 1 year, 2 months ago

>> For shared memory we really need it.
>>
>> Private file-backed is weird ... because we don't know if the shared or
>> the private page is problematic ... :(
> 
> 
> I agree with you, and we have to decide when should we bail out if
> ram_block_discard_range() doesn't work.
> According to me, if discard doesn't work and we are dealing with
> file-backed largepages (shared or not) we have to exit, because the
> fallocate is mandatory. It is the case with hugetlbfs.
> 
> In the non-file-backed case, or the file-backed non-largepage private
> case, according to me we can trust the mmap() method to put everything
> back in place for the VM reset to work as expected.
> Are there aspects I don't see, and for which mmap + the remap handler is
> not sufficient and we should also bail out here ?

mmap() will only zap anonymous pages, no pagecache pages. See below.

>>
>> Maybe we should just do:
>>
>> if (block->fd >= 0) {
>>       /* mmap(MAP_FIXED) cannot reliably zap our problematic page. */
>>       error_report(...);
>>       exit(-1);
>> }
>>
>> Or alternatively
>>
>> if (block->fd >= 0 && qemu_ram_is_shared(block)) {
>>       /* mmap() cannot possibly zap our problematic page. */
>>       error_report(...);
>>       exit(-1);
>> } else if (block->fd >= 0) {
>>       /*
>>        * MAP_PRIVATE file-backed ... mmap() can only zap the private
>>        * page, not the shared one ... we don't know which one is
>>        * problematic.
>>        */
>>       warn_report(...);
>> }
> 
> I also agree that any file-backed/shared case should bail out if discard
> (fallocate) fails, no matter whether large or standard pages are used.
> 
> In the case of file-backed private standard pages, I think that a poison
> on the private page can be fixed with a new mmap.
> According to me, there are 2 cases to consider: at the moment the poison
> is seen, the page was dirty (so it means that it was a pure private
> page), or the page was not dirty, and in this case the poison could
> replace this non-dirty page with a new copy of the file content.
> In both cases, I'd say that the remap should clean up the poison.

Let's assume we have mmap(MAP_PRIVATE, fd). The following scenarios are 
possible:

(a) We only have a pagecache page (never written) that is poisoned
	-> mmap(MAP_FIXED) cannot resolve that

(b) We only have an anonymous page (e.g., pagecache truncated, or if the
     hugetlb file was empty) that is poisoned
	-> mmap(MAP_FIXED) can resolve that

(c) We have an anonymous and a pagecache page (written -> COW).
(c1) Anonymous page is poisoned -> mmap(MAP_FIXED) can resolve that
(c2) Pagecache page is poisoned -> mmap(MAP_FIXED) cannot resolve that


So mmap(MAP_FIXED) cannot sort out all cases. In practice, (a) and (c2) 
are uncommon, but possible.

(b) is common with hugetlb. (a) and (c) are uncommon with hugetlb, just 
because of the nature of hugetlb pages being a scarce resource.

And IIRC, (b) with hugetlb should be sorted out with 
mmap(MAP_FIXED). Please double-check.
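
The scenarios above can be written down as a small decision function. This is only an illustrative model: the struct, the enum and the function name are hypothetical and exist nowhere in QEMU or the kernel. It encodes the single rule stated in this mail, that mmap(MAP_FIXED) zaps anonymous pages but not pagecache pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Which of the two possible copies of the page took the hardware error. */
enum poisoned_copy { POISON_PAGECACHE, POISON_ANON };

/* Illustrative state of one page in a MAP_PRIVATE file-backed mapping. */
struct private_page_state {
    bool has_pagecache;        /* a pagecache page exists for this offset */
    bool has_anon;             /* a private (COW) anonymous page exists */
    enum poisoned_copy poisoned;
};

/*
 * mmap(MAP_FIXED) drops the anonymous page but leaves the pagecache
 * untouched, so it only helps when the anonymous copy is the bad one.
 */
static bool mmap_fixed_can_resolve(const struct private_page_state *s)
{
    /* The poisoned copy must actually exist in this model. */
    assert(s->poisoned == POISON_ANON ? s->has_anon : s->has_pagecache);
    return s->poisoned == POISON_ANON;
}
```

With this model, (b) and (c1) return true while (a) and (c2) return false, which matches the list above.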

> 
> So the conditions when discard fails, could be something like:
> 
>      if (block->fd >= 0 && (qemu_ram_is_shared(block) ||
>          (length > TARGET_PAGE_SIZE))) {
>          /* punch hole is mandatory, mmap() cannot possibly zap our page*/
>           error_report("%spage recovery failure addr: "
>                        RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>                        (length > TARGET_PAGE_SIZE) ? "large " : "",
>                        length, addr);

I'm not sure if we should be special-casing hugetlb.

If we want to be 100% sure, we will do

if (block->fd >= 0) {
	error_report();
	exit(1);
}

But we could decide to be "nice" to hugetlb and assume (b) for them 
above: that is, we would do

/*
  * mmap() cannot zap pagecache pages, only anonymous pages. As soon as
  * we might have pagecache pages involved (either private or shared
  * mapping), we must be careful. However, MAP_PRIVATE on empty hugetlb
  * files is common, and extremely uncommon on non-empty hugetlb files,
  * so we'll special-case them here.
  */
if (block->fd >= 0 && (qemu_ram_is_shared(block) ||
     length == TARGET_PAGE_SIZE)) {
	...
}

[in practice, we could use /proc/self/pagemap to see if we map an 
anonymous page ... but I'd rather not go down that path just yet]

But, in the end the expectation is that madvise()+fallocate() will 
usually not fail.
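
To compare the two policies side by side, here is a self-contained sketch. struct block_model, BASE_PAGE_SIZE and both function names are hypothetical stand-ins for RAMBlock, TARGET_PAGE_SIZE and whatever the final patch uses; the conditions themselves are the ones written out above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BASE_PAGE_SIZE 4096u   /* stand-in for TARGET_PAGE_SIZE */

/* Only the two RAMBlock properties the decision depends on. */
struct block_model {
    int fd;        /* >= 0 when the block is file-backed */
    bool shared;   /* true for MAP_SHARED mappings */
};

/* "100% sure" variant: any file-backed block is fatal when discard fails. */
static bool must_exit_strict(const struct block_model *b, size_t length)
{
    (void)length;
    return b->fd >= 0;
}

/*
 * "Nice to hugetlb" variant: private hugetlb mappings are assumed to hold
 * only anonymous pages (scenario (b)), so the mmap() fallback is allowed.
 */
static bool must_exit_hugetlb_lenient(const struct block_model *b,
                                      size_t length)
{
    return b->fd >= 0 && (b->shared || length == BASE_PAGE_SIZE);
}
```

The only input where the two variants disagree is a private file-backed block with a length larger than the base page size, that is, the MAP_PRIVATE hugetlb case.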

-- 
Cheers,

David / dhildenb

[PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

Posted by William Roche 1 year, 3 months ago

From: William Roche <william.roche@oracle.com>

When an entire large page is impacted by an error (hugetlbfs case),
report better the size and location of this large memory hole, so
give a warning message when this page is first hit:
Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST addr Z

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c      | 9 ++++++++-
 include/sysemu/kvm_int.h | 4 +++-
 target/arm/kvm.c         | 2 +-
 target/i386/kvm/kvm.c    | 2 +-
 4 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 6dd06f5edf..a572437115 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1284,7 +1284,7 @@ static void kvm_unpoison_all(void *param)
     }
 }
 
-void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, void *ha, hwaddr gpa)
 {
     HWPoisonPage *page;
     size_t sz = qemu_ram_pagesize_from_addr(ram_addr);
@@ -1301,6 +1301,13 @@ void kvm_hwpoison_page_add(ram_addr_t ram_addr)
     page->ram_addr = ram_addr;
     page->page_size = sz;
     QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
+
+    if (sz > TARGET_PAGE_SIZE) {
+        gpa = ROUND_DOWN(gpa, sz);
+        ha = (void *)ROUND_DOWN((uint64_t)ha, sz);
+        warn_report("Memory error: Loosing a large page (size: %zu) "
+            "at QEMU addr %p and GUEST addr 0x%" HWADDR_PRIx, sz, ha, gpa);
+    }
 }
 
 bool kvm_hwpoisoned_mem(void)
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index a1e72763da..ee34f1d225 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -178,10 +178,12 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size);
  *
  * Parameters:
  *  @ram_addr: the address in the RAM for the poisoned page
+ *  @hva: host virtual address aka QEMU addr
+ *  @gpa: guest physical address aka GUEST addr
  *
  * Add a poisoned page to the list
  *
  * Return: None.
  */
-void kvm_hwpoison_page_add(ram_addr_t ram_addr);
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, void *hva, hwaddr gpa);
 #endif
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index f1f1b5b375..aae66dba41 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2359,7 +2359,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-            kvm_hwpoison_page_add(ram_addr);
+            kvm_hwpoison_page_add(ram_addr, addr, paddr);
             /*
              * If this is a BUS_MCEERR_AR, we know we have been called
              * synchronously from the vCPU thread, so we can easily
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index fd9f198892..fd7cd7008e 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -753,7 +753,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-            kvm_hwpoison_page_add(ram_addr);
+            kvm_hwpoison_page_add(ram_addr, addr, paddr);
             kvm_mce_inject(cpu, paddr, code);
 
             /*
-- 
2.43.5

Re: [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

Posted by David Hildenbrand 1 year, 2 months ago

On 07.11.24 11:21, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> When an entire large page is impacted by an error (hugetlbfs case),
> report better the size and location of this large memory hole, so
> give a warning message when this page is first hit:
> Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST addr Z
> 

Hm, I wonder if we really want to special-case hugetlb here.

Why not make the warning independent of the underlying page size?

-- 
Cheers,

David / dhildenb

Re: [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

Posted by William Roche 1 year, 2 months ago

On 11/12/24 12:13, David Hildenbrand wrote:
> On 07.11.24 11:21, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> When an entire large page is impacted by an error (hugetlbfs case),
>> report better the size and location of this large memory hole, so
>> give a warning message when this page is first hit:
>> Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST 
>> addr Z
>>
> 
> Hm, I wonder if we really want to special-case hugetlb here.
> 
> Why not make the warning independent of the underlying page size?

We already have a warning provided by Qemu (in kvm_arch_on_sigbus_vcpu()):

Guest MCE Memory Error at QEMU addr Y and GUEST addr Z of type 
BUS_MCEERR_AR/_AO injected

The one I suggest is an additional message provided before the above 
message.

Here is an example:
qemu-system-x86_64: warning: Memory error: Loosing a large page (size: 
2097152) at QEMU addr 0x7fdd7d400000 and GUEST addr 0x11600000
qemu-system-x86_64: warning: Guest MCE Memory Error at QEMU addr 
0x7fdd7d400000 and GUEST addr 0x11600000 of type BUS_MCEERR_AO injected

According to me, this large page case additional message will help to 
better understand the probable sudden proliferation of memory errors 
that can be reported by Qemu on the impacted range.
Not only will the machine administrator identify better that a single 
memory error had this large impact, it can also help us to better 
measure the impact of fixing the large page memory error support in the 
field (in the future).

These are some reasons why I do think this large page specific message 
can be useful.

Re: [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

Posted by David Hildenbrand 1 year, 2 months ago

On 12.11.24 19:17, William Roche wrote:
> On 11/12/24 12:13, David Hildenbrand wrote:
>> On 07.11.24 11:21, William Roche wrote:
>>> From: William Roche <william.roche@oracle.com>
>>>
>>> When an entire large page is impacted by an error (hugetlbfs case),
>>> report better the size and location of this large memory hole, so
>>> give a warning message when this page is first hit:
>>> Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST
>>> addr Z
>>>
>>
>> Hm, I wonder if we really want to special-case hugetlb here.
>>
>> Why not make the warning independent of the underlying page size?
> 
> We already have a warning provided by Qemu (in kvm_arch_on_sigbus_vcpu()):
> 
> Guest MCE Memory Error at QEMU addr Y and GUEST addr Z of type
> BUS_MCEERR_AR/_AO injected
> 
> The one I suggest is an additional message provided before the above
> message.
> 
> Here is an example:
> qemu-system-x86_64: warning: Memory error: Loosing a large page (size:
> 2097152) at QEMU addr 0x7fdd7d400000 and GUEST addr 0x11600000
> qemu-system-x86_64: warning: Guest MCE Memory Error at QEMU addr
> 0x7fdd7d400000 and GUEST addr 0x11600000 of type BUS_MCEERR_AO injected
> 

Hm, I think we should definitely be including the size in the existing 
one. That code was written without huge pages in mind.

We should similarly warn in the arm implementation (where I don't see a 
similar message yet).

> 
> According to me, this large page case additional message will help to
> better understand the probable sudden proliferation of memory errors
> that can be reported by Qemu on the impacted range.
> Not only will the machine administrator identify better that a single
> memory error had this large impact, it can also help us to better
> measure the impact of fixing the large page memory error support in the
> field (in the future).

What about extending the existing one to something like

warning: Guest MCE Memory Error at QEMU addr $ADDR and GUEST $PADDR of 
type BUS_MCEERR_AO and size $SIZE (large page) injected


With the "large page" hint you can highlight that this is special.
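
As a rough sketch of the extended message, assuming the wording proposed above: format_mce_warning() and BASE_PAGE_SIZE are hypothetical names, and the code formats into a buffer instead of calling warn_report() so that it stands alone. The size and the "(large page)" hint are only added when the page is larger than the base page, so the 4k message keeps its existing shape.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BASE_PAGE_SIZE 4096u   /* stand-in for TARGET_PAGE_SIZE */

/*
 * Hypothetical formatter for the "Guest MCE Memory Error" warning.
 * type is the SIGBUS code name, e.g. "BUS_MCEERR_AO".
 * Returns the snprintf() result.
 */
static int format_mce_warning(char *buf, size_t len, uint64_t hva,
                              uint64_t gpa, const char *type,
                              size_t page_size)
{
    if (page_size > BASE_PAGE_SIZE) {
        return snprintf(buf, len,
                        "Guest MCE Memory Error at QEMU addr 0x%" PRIx64
                        " and GUEST addr 0x%" PRIx64
                        " of type %s and size %zu (large page) injected",
                        hva, gpa, type, page_size);
    }
    /* Base page: same text as the existing message, no size information. */
    return snprintf(buf, len,
                    "Guest MCE Memory Error at QEMU addr 0x%" PRIx64
                    " and GUEST addr 0x%" PRIx64 " of type %s injected",
                    hva, gpa, type);
}
```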


On a related note ...I think we have a problem. Assume we got a SIGBUS 
on a huge page (e.g., somewhere in a 1 GiB page).

We will call kvm_mce_inject(cpu, paddr, code) / 
acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)

But where is the size information? :// Won't the VM simply assume that 
there was a MCE on a single 4k page starting at paddr?

I'm not sure if we can inject ranges, or if we would have to issue one 
MCE per page ... hm, what's your take on this?


-- 
Cheers,

David / dhildenb

Re: [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

Posted by William Roche 1 year, 2 months ago

Thanks for the feedback on the patches, I'll send a new version in the 
coming week.

But I just wanted to answer now the questions you asked on this specific 
one as they are related to the importance of fixing the large page 
failures handling.

On 11/12/24 23:22, David Hildenbrand wrote:
> On 12.11.24 19:17, William Roche wrote:
>> On 11/12/24 12:13, David Hildenbrand wrote:
>>> On 07.11.24 11:21, William Roche wrote:
>>>> From: William Roche <william.roche@oracle.com>
>>>>
>>>> When an entire large page is impacted by an error (hugetlbfs case),
>>>> report better the size and location of this large memory hole, so
>>>> give a warning message when this page is first hit:
>>>> Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST
>>>> addr Z
>>>>
>>>
>>> Hm, I wonder if we really want to special-case hugetlb here.
>>>
>>> Why not make the warning independent of the underlying page size?
>>
>> We already have a warning provided by Qemu (in 
>> kvm_arch_on_sigbus_vcpu()):
>>
>> Guest MCE Memory Error at QEMU addr Y and GUEST addr Z of type
>> BUS_MCEERR_AR/_AO injected
>>
>> The one I suggest is an additional message provided before the above
>> message.
>>
>> Here is an example:
>> qemu-system-x86_64: warning: Memory error: Loosing a large page (size:
>> 2097152) at QEMU addr 0x7fdd7d400000 and GUEST addr 0x11600000
>> qemu-system-x86_64: warning: Guest MCE Memory Error at QEMU addr
>> 0x7fdd7d400000 and GUEST addr 0x11600000 of type BUS_MCEERR_AO injected
>>
> 
> Hm, I think we should definitely be including the size in the existing 
> one. That code was written without huge pages in mind.

Yes we can do that, and get the page size at this level to pass as a 
'page_size' argument to kvm_hwpoison_page_add().

It would make the message longer as we will have the extra information 
about the large page on all messages when an error impacts a large page.
We could change the messages only when we are dealing with a large page, 
so that the standard (4k) case isn't modified.

> 
> We should similarly warn in the arm implementation (where I don't see a 
> similar message yet).

Ok, I'll also add a message for the ARM platform.

>>
>> According to me, this large page case additional message will help to
>> better understand the probable sudden proliferation of memory errors
>> that can be reported by Qemu on the impacted range.
>> Not only will the machine administrator identify better that a single
>> memory error had this large impact, it can also help us to better
>> measure the impact of fixing the large page memory error support in the
>> field (in the future).
> 
> What about extending the existing one to something like
> 
> warning: Guest MCE Memory Error at QEMU addr $ADDR and GUEST $PADDR of 
> type BUS_MCEERR_AO and size $SIZE (large page) injected
> 
> 
> With the "large page" hint you can highlight that this is special.

Right, we can do it that way. It also gives the impression that we 
somehow inject errors on a large range of the memory. Which is not the 
case. I'll send a proposal with a different formulation, so that you can 
choose.

> On a related note ...I think we have a problem. Assume we got a SIGBUS 
> on a huge page (e.g., somewhere in a 1 GiB page).
> 
> We will call kvm_mce_inject(cpu, paddr, code) / 
> acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)
> 
> But where is the size information? :// Won't the VM simply assume that 
> there was a MCE on a single 4k page starting at paddr?

This is absolutely right !
It's exactly what happens: The VM kernel received the information and 
considers that only the impacted page has to be poisoned.

That's also the reason why Qemu repeats the error injections every time 
the poisoned large page is accessed (for all other touched 4k pages 
located on this "memory hole").

> 
> I'm not sure if we can inject ranges, or if we would have to issue one 
> MCE per page ... hm, what's your take on this?

I don't know of any size information about a memory error reported by 
the hardware. The kernel doesn't seem to expect any such information.
It explains why there is no impact/blast size information provided when 
an error is relayed to the VM.

We could take the "memory hole" size into account in Qemu, but repeating 
error injections is not going to help a lot either: We'd need to give 
the VM some time to deal with an error injection before producing a new 
error for the next page etc... in the case (x86 only) where an 
asynchronous error is relayed with BUS_MCEERR_AO, we would also have to 
repeat the error for all the 4k pages located on the lost large page too.
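
To give an idea of the numbers involved, here is a small arithmetic sketch. The helper names are hypothetical; round_down() does the same alignment as the ROUND_DOWN() call in the patch, and a 4 KiB base page is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SIZE 4096ULL   /* assumed guest base page size */

/* Align addr down to a power-of-two size, like ROUND_DOWN() in the patch. */
static uint64_t round_down(uint64_t addr, uint64_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return addr & ~(size - 1);
}

/* Number of per-4k error injections needed to cover one lost huge page. */
static uint64_t base_pages_in(uint64_t huge_page_size)
{
    return huge_page_size / BASE_PAGE_SIZE;
}
```

Under these assumptions a 2 MiB page would need 512 injections and a 1 GiB page 262144, each of which the guest would have to process before the next one.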

We can see that the Linux kernel has some mechanisms to deal with an 
occasional 4k page loss, but a larger blast is very likely to crash the VM 
(which is fine). And as a significant part of the memory is no longer 
accessible, dealing with the error itself can be impaired and we 
increase the risk of losing data, even though most of the memory on the 
large page could still be used.

Now if we can recover the 'still valid' memory of the impacted large 
page, we can significantly reduce this blast and give a much better 
chance to the VM to survive the incident or crash more gracefully.

I've looked at the project you pointed me to, which is not ready to be 
adopted:
https://lore.kernel.org/linux-mm/20240924043924.3562257-2-jiaqiyan@google.com/T/

But we see that this large page enhancement is needed, sometimes just 
to give a chance to the VM to survive a little longer before being 
terminated or moved.
Injecting multiple MCEs or ACPI error records doesn't help, according to me.

William.

Re: [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

Posted by David Hildenbrand 1 year, 2 months ago

>> Hm, I think we should definitely be including the size in the existing
>> one. That code was written without huge pages in mind.
> 
> Yes we can do that, and get the page size at this level to pass as a
> 'page_size' argument to kvm_hwpoison_page_add().
> 
> It would make the message longer as we will have the extra information
> about the large page on all messages when an error impacts a large page.
> We could change the messages only when we are dealing with a large page,
> so that the standard (4k) case isn't modified.

Right. And likely we should call it "huge page" instead, which is the 
Linux term for anything larger than a single page.

[...]

>>
>> With the "large page" hint you can highlight that this is special.
> 
> Right, we can do it that way. It also gives the impression that we
> somehow inject errors on a large range of the memory. Which is not the
> case. I'll send a proposal with a different formulation, so that you can
> choose.
> 

Makes sense.

> 
> 
>> On a related note ...I think we have a problem. Assume we got a SIGBUS
>> on a huge page (e.g., somewhere in a 1 GiB page).
>>
>> We will call kvm_mce_inject(cpu, paddr, code) /
>> acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)
>>
>> But where is the size information? :// Won't the VM simply assume that
>> there was a MCE on a single 4k page starting at paddr?
> 
> This is absolutely right !
> It's exactly what happens: The VM kernel received the information and
> considers that only the impacted page has to be poisoned.
> 
> That's also the reason why Qemu repeats the error injections every time
> the poisoned large page is accessed (for all other touched 4k pages
> located on this "memory hole").

:/

So we always get from Linux the full 1Gig range and always report the 
first 4k page essentially, on any such access, right?


BTW, should we handle duplicates in our poison list?
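
One possible shape for such a duplicate check, as a sketch only: it uses a plain singly linked list instead of QEMU's QLIST macros, and poison_entry / poison_list_add() are hypothetical names. The address is aligned to the recorded page size first, so repeated SIGBUS reports inside the same huge page collapse into one entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for HWPoisonPage. */
struct poison_entry {
    uint64_t ram_addr;            /* start of the poisoned page */
    size_t page_size;             /* backing page size, power of two */
    struct poison_entry *next;
};

/*
 * Record a poisoned page unless it is already listed.
 * Returns true when a new entry was added, false for a duplicate.
 */
static bool poison_list_add(struct poison_entry **head, uint64_t ram_addr,
                            size_t page_size)
{
    struct poison_entry *e;
    uint64_t start;

    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    start = ram_addr & ~((uint64_t)page_size - 1);

    for (e = *head; e != NULL; e = e->next) {
        if (e->ram_addr == start) {
            return false;         /* same page reported again */
        }
    }

    e = malloc(sizeof(*e));
    if (e == NULL) {
        abort();
    }
    e->ram_addr = start;
    e->page_size = page_size;
    e->next = *head;
    *head = e;
    return true;
}
```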

> 
>>
>> I'm not sure if we can inject ranges, or if we would have to issue one
>> MCE per page ... hm, what's your take on this?
> 
> I don't know of any size information about a memory error reported by
> the hardware. The kernel doesn't seem to expect any such information.
> It explains why there is no impact/blast size information provided when
> an error is relayed to the VM.
> 
> We could take the "memory hole" size into account in Qemu, but repeating
> error injections is not going to help a lot either: We'd need to give
> the VM some time to deal with an error injection before producing a new
> error for the next page etc... in the case (x86 only) where an

I had the same thoughts.

> asynchronous error is relayed with BUS_MCEERR_AO, we would also have to
> repeat the error for all the 4k pages located on the lost large page too.
> 
> We can see that the Linux kernel has some mechanisms to deal with an
> occasional 4k page loss, but a larger blast is very likely to crash the VM
> (which is fine).

Right, and that will inevitably happen when we get an MCE on a 1 GiB 
hugetlb page, correct? The whole thing will be inaccessible.

> And as a significant part of the memory is no longer
> accessible, dealing with the error itself can be impaired and we
> increase the risk of losing data, even though most of the memory on the
> large page could still be used.
> 
> Now if we can recover the 'still valid' memory of the impacted large
> page, we can significantly reduce this blast and give a much better
> chance to the VM to survive the incident or crash more gracefully.

Right. That cannot be sorted out in user space alone, unfortunately.

> 
> I've looked at the project you pointed me to, which is not ready to be
> adopted:
> https://lore.kernel.org/linux-mm/20240924043924.3562257-2-jiaqiyan@google.com/T/
> 

Yes, that goes into a better direction, though.

> But we see that this large page enhancement is needed, sometimes just
> to give a chance to the VM to survive a little longer before being
> terminated or moved.
> Injecting multiple MCEs or ACPI error records doesn't help, in my opinion.

I suspect that in most cases, when we get an MCE on a 1Gig page in the 
hypervisor, our running Linux guest will soon crash, because it really 
lost 1 Gig of contiguous memory. :(

-- 
Cheers,

David / dhildenb

[PATCH v2 4/7] numa: Introduce and use ram_block_notify_remap()

Posted by William Roche 1 year, 3 months ago

From: David Hildenbrand <david@redhat.com>

Notify registered listeners about the remap at the end of
qemu_ram_remap() so e.g., a memory backend can re-apply its
settings correctly.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 hw/core/numa.c         | 11 +++++++++++
 include/exec/ramlist.h |  3 +++
 system/physmem.c       |  1 +
 3 files changed, 15 insertions(+)

diff --git a/hw/core/numa.c b/hw/core/numa.c
index 1b5f44baea..4ca67db483 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -895,3 +895,14 @@ void ram_block_notify_resize(void *host, size_t old_size, size_t new_size)
         }
     }
 }
+
+void ram_block_notify_remap(void *host, size_t offset, size_t size)
+{
+    RAMBlockNotifier *notifier;
+
+    QLIST_FOREACH(notifier, &ram_list.ramblock_notifiers, next) {
+        if (notifier->ram_block_remapped) {
+            notifier->ram_block_remapped(notifier, host, offset, size);
+        }
+    }
+}
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
index d9cfe530be..c1dc785a57 100644
--- a/include/exec/ramlist.h
+++ b/include/exec/ramlist.h
@@ -72,6 +72,8 @@ struct RAMBlockNotifier {
                               size_t max_size);
     void (*ram_block_resized)(RAMBlockNotifier *n, void *host, size_t old_size,
                               size_t new_size);
+    void (*ram_block_remapped)(RAMBlockNotifier *n, void *host, size_t offset,
+                               size_t size);
     QLIST_ENTRY(RAMBlockNotifier) next;
 };
 
@@ -80,6 +82,7 @@ void ram_block_notifier_remove(RAMBlockNotifier *n);
 void ram_block_notify_add(void *host, size_t size, size_t max_size);
 void ram_block_notify_remove(void *host, size_t size, size_t max_size);
 void ram_block_notify_resize(void *host, size_t old_size, size_t new_size);
+void ram_block_notify_remap(void *host, size_t offset, size_t size);
 
 GString *ram_block_format(void);
 
diff --git a/system/physmem.c b/system/physmem.c
index dfea120cc5..e72ca31451 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2228,6 +2228,7 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                     memory_try_enable_merging(vaddr, length);
                     qemu_ram_setup_dump(vaddr, length);
                 }
+                ram_block_notify_remap(block->host, offset, length);
             }
         }
     }
-- 
2.43.5
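
The notifier pattern used by this patch can be illustrated with a small standalone sketch. It is not QEMU code: `Notifier`, `notifier_add()`, `notify_remap()` and `Backend` are hypothetical names, and a plain singly linked list stands in for QLIST. The point it shows is that the callback is optional, so listeners that do not care about remaps simply leave it NULL.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Notifier Notifier;
struct Notifier {
    /* Optional callback; listeners not interested in remaps leave it NULL. */
    void (*remapped)(Notifier *n, void *host, size_t offset, size_t size);
    Notifier *next;
};

static Notifier *notifiers;     /* head of the registered listener list */

static void notifier_add(Notifier *n)
{
    n->next = notifiers;
    notifiers = n;
}

/* Call every registered listener that provides a remap callback. */
static void notify_remap(void *host, size_t offset, size_t size)
{
    for (Notifier *n = notifiers; n; n = n->next) {
        if (n->remapped) {
            n->remapped(n, host, offset, size);
        }
    }
}

/* Hypothetical listener that embeds its notifier, as a backend might. */
typedef struct Backend {
    size_t remapped_bytes;      /* bytes seen through remap notifications */
    Notifier notifier;
} Backend;

static void backend_remapped(Notifier *n, void *host, size_t offset,
                             size_t size)
{
    /* Recover the containing object from the embedded notifier. */
    Backend *b = (Backend *)((char *)n - offsetof(Backend, notifier));

    (void)host;
    (void)offset;
    b->remapped_bytes += size;
}
```

A listener sets `notifier.remapped = backend_remapped` and registers with `notifier_add()`; a later `notify_remap(host, offset, size)` then reaches it with the remapped range.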

[PATCH v2 5/7] hostmem: Factor out applying settings

Posted by William Roche 1 year, 3 months ago

From: David Hildenbrand <david@redhat.com>

We want to reuse the functionality when remapping or resizing RAM.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 backends/hostmem.c | 155 ++++++++++++++++++++++++---------------------
 1 file changed, 82 insertions(+), 73 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index 181446626a..bf85d716e5 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -36,6 +36,87 @@ QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_BIND != MPOL_BIND);
 QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_INTERLEAVE != MPOL_INTERLEAVE);
 #endif
 
+static void host_memory_backend_apply_settings(HostMemoryBackend *backend,
+                                               void *ptr, uint64_t size,
+                                               Error **errp)
+{
+    bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
+
+    if (backend->merge) {
+        qemu_madvise(ptr, size, QEMU_MADV_MERGEABLE);
+    }
+    if (!backend->dump) {
+        qemu_madvise(ptr, size, QEMU_MADV_DONTDUMP);
+    }
+#ifdef CONFIG_NUMA
+    unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
+    /* lastbit == MAX_NODES means maxnode = 0 */
+    unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
+    /*
+     * Ensure policy won't be ignored in case memory is preallocated
+     * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
+     * this doesn't catch hugepage case.
+     */
+    unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
+    int mode = backend->policy;
+
+    /*
+     * Check for invalid host-nodes and policies and give more verbose
+     * error messages than mbind().
+     */
+    if (maxnode && backend->policy == MPOL_DEFAULT) {
+        error_setg(errp, "host-nodes must be empty for policy default,"
+                   " or you should explicitly specify a policy other"
+                   " than default");
+        return;
+    } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
+        error_setg(errp, "host-nodes must be set for policy %s",
+                   HostMemPolicy_str(backend->policy));
+        return;
+    }
+
+    /*
+     * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
+     * as argument to mbind() due to an old Linux bug (feature?) which
+     * cuts off the last specified node. This means backend->host_nodes
+     * must have MAX_NODES+1 bits available.
+     */
+    assert(sizeof(backend->host_nodes) >=
+           BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
+    assert(maxnode <= MAX_NODES);
+
+#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
+    if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
+        /*
+         * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
+         * silently picks the first node.
+         */
+        mode = MPOL_PREFERRED_MANY;
+    }
+#endif
+
+    if (maxnode &&
+        mbind(ptr, size, mode, backend->host_nodes, maxnode + 1, flags)) {
+        if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
+            error_setg_errno(errp, errno,
+                             "cannot bind memory to host NUMA nodes");
+            return;
+        }
+    }
+#endif
+    /*
+     * Preallocate memory after the NUMA policy has been instantiated.
+     * This is necessary to guarantee memory is allocated with
+     * specified NUMA policy in place.
+     */
+    if (backend->prealloc &&
+        !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
+                           ptr, size, backend->prealloc_threads,
+                           backend->prealloc_context, async, errp)) {
+        return;
+    }
+}
+
 char *
 host_memory_backend_get_name(HostMemoryBackend *backend)
 {
@@ -337,7 +418,6 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
     void *ptr;
     uint64_t sz;
     size_t pagesize;
-    bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
 
     if (!bc->alloc) {
         return;
@@ -357,78 +437,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
         return;
     }
 
-    if (backend->merge) {
-        qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
-    }
-    if (!backend->dump) {
-        qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
-    }
-#ifdef CONFIG_NUMA
-    unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
-    /* lastbit == MAX_NODES means maxnode = 0 */
-    unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
-    /*
-     * Ensure policy won't be ignored in case memory is preallocated
-     * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
-     * this doesn't catch hugepage case.
-     */
-    unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
-    int mode = backend->policy;
-
-    /* check for invalid host-nodes and policies and give more verbose
-     * error messages than mbind(). */
-    if (maxnode && backend->policy == MPOL_DEFAULT) {
-        error_setg(errp, "host-nodes must be empty for policy default,"
-                   " or you should explicitly specify a policy other"
-                   " than default");
-        return;
-    } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
-        error_setg(errp, "host-nodes must be set for policy %s",
-                   HostMemPolicy_str(backend->policy));
-        return;
-    }
-
-    /*
-     * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
-     * as argument to mbind() due to an old Linux bug (feature?) which
-     * cuts off the last specified node. This means backend->host_nodes
-     * must have MAX_NODES+1 bits available.
-     */
-    assert(sizeof(backend->host_nodes) >=
-           BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
-    assert(maxnode <= MAX_NODES);
-
-#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
-    if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
-        /*
-         * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
-         * silently picks the first node.
-         */
-        mode = MPOL_PREFERRED_MANY;
-    }
-#endif
-
-    if (maxnode &&
-        mbind(ptr, sz, mode, backend->host_nodes, maxnode + 1, flags)) {
-        if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
-            error_setg_errno(errp, errno,
-                             "cannot bind memory to host NUMA nodes");
-            return;
-        }
-    }
-#endif
-    /*
-     * Preallocate memory after the NUMA policy has been instantiated.
-     * This is necessary to guarantee memory is allocated with
-     * specified NUMA policy in place.
-     */
-    if (backend->prealloc && !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
-                                                ptr, sz,
-                                                backend->prealloc_threads,
-                                                backend->prealloc_context,
-                                                async, errp)) {
-        return;
-    }
+    host_memory_backend_apply_settings(backend, ptr, sz, errp);
 }
 
 static bool
-- 
2.43.5
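
The `maxnode` computation moved by this patch is compact, so here is a worked standalone sketch of it. Assumptions: `MAX_NODES` is set to 128 purely for illustration, and `last_bit()` is a simplified stand-in for `find_last_bit()`, which returns the bitmap size when no bit is set.

```c
#include <assert.h>
#include <limits.h>

#define MAX_NODES 128                       /* assumed value, for illustration */
#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Simplified stand-in for find_last_bit(): index of the highest set bit
 * among the first 'size' bits, or 'size' when no bit is set.
 */
static unsigned long last_bit(const unsigned long *map, unsigned long size)
{
    for (unsigned long i = size; i-- > 0; ) {
        if (map[i / BITS_PER_ULONG] & (1UL << (i % BITS_PER_ULONG))) {
            return i;
        }
    }
    return size;
}

/*
 * Same arithmetic as in the patch: an empty bitmap gives
 * lastbit == MAX_NODES, and (MAX_NODES + 1) % (MAX_NODES + 1) == 0,
 * so maxnode is 0. Otherwise maxnode is the highest node index plus one.
 */
static unsigned long compute_maxnode(const unsigned long *host_nodes)
{
    unsigned long lastbit = last_bit(host_nodes, MAX_NODES);

    return (lastbit + 1) % (MAX_NODES + 1);
}
```

For example, a bitmap with only node 3 set yields `maxnode == 4`, and the code then passes `maxnode + 1` to mbind() for the reason given in the comment about the cut-off last node.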

[PATCH v2 6/7] hostmem: Handle remapping of RAM

Posted by William Roche 1 year, 3 months ago

From: David Hildenbrand <david@redhat.com>

Let's register a RAM block notifier and react to remap notifications.
Simply re-apply the settings. Warn only when something goes wrong.

Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. It could be
that hostmem still fails to update that flag ...

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 backends/hostmem.c       | 29 +++++++++++++++++++++++++++++
 include/sysemu/hostmem.h |  1 +
 2 files changed, 30 insertions(+)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index bf85d716e5..fbd8708664 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -361,11 +361,32 @@ static void host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
     backend->prealloc_threads = value;
 }
 
+static void host_memory_backend_ram_remapped(RAMBlockNotifier *n, void *host,
+                                             size_t offset, size_t size)
+{
+    HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
+                                              ram_notifier);
+    Error *err = NULL;
+
+    if (!host_memory_backend_mr_inited(backend) ||
+        memory_region_get_ram_ptr(&backend->mr) != host) {
+        return;
+    }
+
+    host_memory_backend_apply_settings(backend, host + offset, size, &err);
+    if (err) {
+        warn_report_err(err);
+    }
+}
+
 static void host_memory_backend_init(Object *obj)
 {
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
     MachineState *machine = MACHINE(qdev_get_machine());
 
+    backend->ram_notifier.ram_block_remapped = host_memory_backend_ram_remapped;
+    ram_block_notifier_add(&backend->ram_notifier);
+
     /* TODO: convert access to globals to compat properties */
     backend->merge = machine_mem_merge(machine);
     backend->dump = machine_dump_guest_core(machine);
@@ -379,6 +400,13 @@ static void host_memory_backend_post_init(Object *obj)
     object_apply_compat_props(obj);
 }
 
+static void host_memory_backend_finalize(Object *obj)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+
+    ram_block_notifier_remove(&backend->ram_notifier);
+}
+
 bool host_memory_backend_mr_inited(HostMemoryBackend *backend)
 {
     /*
@@ -595,6 +623,7 @@ static const TypeInfo host_memory_backend_info = {
     .instance_size = sizeof(HostMemoryBackend),
     .instance_init = host_memory_backend_init,
     .instance_post_init = host_memory_backend_post_init,
+    .instance_finalize = host_memory_backend_finalize,
     .interfaces = (InterfaceInfo[]) {
         { TYPE_USER_CREATABLE },
         { }
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index de47ae59e4..062a68c8fc 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -81,6 +81,7 @@ struct HostMemoryBackend {
     HostMemPolicy policy;
 
     MemoryRegion mr;
+    RAMBlockNotifier ram_notifier;
 };
 
 bool host_memory_backend_mr_inited(HostMemoryBackend *backend);
-- 
2.43.5

Re: [PATCH v2 6/7] hostmem: Handle remapping of RAM

Posted by David Hildenbrand 1 year, 2 months ago

On 07.11.24 11:21, William Roche wrote:
> From: David Hildenbrand <david@redhat.com>
> 
> Let's register a RAM block notifier and react on remap notifications.
> Simply re-apply the settings. Warn only when something goes wrong.
> 
> Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. Could be
> that hostmem is still missing to update that flag ...
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   backends/hostmem.c       | 29 +++++++++++++++++++++++++++++
>   include/sysemu/hostmem.h |  1 +
>   2 files changed, 30 insertions(+)
> 
> diff --git a/backends/hostmem.c b/backends/hostmem.c
> index bf85d716e5..fbd8708664 100644
> --- a/backends/hostmem.c
> +++ b/backends/hostmem.c
> @@ -361,11 +361,32 @@ static void host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
>       backend->prealloc_threads = value;
>   }
>   
> +static void host_memory_backend_ram_remapped(RAMBlockNotifier *n, void *host,
> +                                             size_t offset, size_t size)
> +{
> +    HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
> +                                              ram_notifier);
> +    Error *err = NULL;
> +
> +    if (!host_memory_backend_mr_inited(backend) ||
> +        memory_region_get_ram_ptr(&backend->mr) != host) {
> +        return;
> +    }
> +
> +    host_memory_backend_apply_settings(backend, host + offset, size, &err);
> +    if (err) {
> +        warn_report_err(err);

I wonder if we want to fail hard instead, or have a way to tell the 
notifier that something went wrong.

-- 
Cheers,

David / dhildenb

Re: [PATCH v2 6/7] hostmem: Handle remapping of RAM

Posted by William Roche 1 year, 2 months ago

On 11/12/24 14:45, David Hildenbrand wrote:
> On 07.11.24 11:21, William Roche wrote:
>> From: David Hildenbrand <david@redhat.com>
>>
>> Let's register a RAM block notifier and react on remap notifications.
>> Simply re-apply the settings. Warn only when something goes wrong.
>>
>> Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. Could be
>> that hostmem is still missing to update that flag ...
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   backends/hostmem.c       | 29 +++++++++++++++++++++++++++++
>>   include/sysemu/hostmem.h |  1 +
>>   2 files changed, 30 insertions(+)
>>
>> diff --git a/backends/hostmem.c b/backends/hostmem.c
>> index bf85d716e5..fbd8708664 100644
>> --- a/backends/hostmem.c
>> +++ b/backends/hostmem.c
>> @@ -361,11 +361,32 @@ static void 
>> host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
>>       backend->prealloc_threads = value;
>>   }
>> +static void host_memory_backend_ram_remapped(RAMBlockNotifier *n, 
>> void *host,
>> +                                             size_t offset, size_t size)
>> +{
>> +    HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
>> +                                              ram_notifier);
>> +    Error *err = NULL;
>> +
>> +    if (!host_memory_backend_mr_inited(backend) ||
>> +        memory_region_get_ram_ptr(&backend->mr) != host) {
>> +        return;
>> +    }
>> +
>> +    host_memory_backend_apply_settings(backend, host + offset, size, 
>> &err);
>> +    if (err) {
>> +        warn_report_err(err);
> 
> I wonder if we want to fail hard instead, or have a way to tell the 
> notifier that something went wrong.
> 

It depends on what the caller would do with this information. Is there a
way to work around the problem? (I don't think so)
Can the VM continue to run without doing anything about it? (Maybe?)

Currently, none of the NUMA notifiers return errors.

This function is only called from ram_block_notify_remap() in
qemu_ram_remap(), so I would vote for a "fail hard" in the case where the
settings are mandatory to continue.

HTH.

Re: [PATCH v2 6/7] hostmem: Handle remapping of RAM

Posted by David Hildenbrand 1 year, 2 months ago

On 12.11.24 19:17, William Roche wrote:
> On 11/12/24 14:45, David Hildenbrand wrote:
>> On 07.11.24 11:21, William Roche wrote:
>>> From: David Hildenbrand <david@redhat.com>
>>>
>>> Let's register a RAM block notifier and react on remap notifications.
>>> Simply re-apply the settings. Warn only when something goes wrong.
>>>
>>> Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. Could be
>>> that hostmem is still missing to update that flag ...
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>> ---
>>>    backends/hostmem.c       | 29 +++++++++++++++++++++++++++++
>>>    include/sysemu/hostmem.h |  1 +
>>>    2 files changed, 30 insertions(+)
>>>
>>> diff --git a/backends/hostmem.c b/backends/hostmem.c
>>> index bf85d716e5..fbd8708664 100644
>>> --- a/backends/hostmem.c
>>> +++ b/backends/hostmem.c
>>> @@ -361,11 +361,32 @@ static void
>>> host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
>>>        backend->prealloc_threads = value;
>>>    }
>>> +static void host_memory_backend_ram_remapped(RAMBlockNotifier *n,
>>> void *host,
>>> +                                             size_t offset, size_t size)
>>> +{
>>> +    HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
>>> +                                              ram_notifier);
>>> +    Error *err = NULL;
>>> +
>>> +    if (!host_memory_backend_mr_inited(backend) ||
>>> +        memory_region_get_ram_ptr(&backend->mr) != host) {
>>> +        return;
>>> +    }
>>> +
>>> +    host_memory_backend_apply_settings(backend, host + offset, size,
>>> &err);
>>> +    if (err) {
>>> +        warn_report_err(err);
>>
>> I wonder if we want to fail hard instead, or have a way to tell the
>> notifier that something went wrong.
>>
> 
> It depends on what the caller would do with this information. Is there a
> way to work around the problem? (I don't think so)

Primarily only preallocation will fail, and that ...

> Can the VM continue to run without doing anything about it? (Maybe?)
> 

... will make QEMU crash at some point later (SIGBUS), which is very
bad.

> Currently, none of the NUMA notifiers return errors.
> 
> This function is only called from ram_block_notify_remap() in
> qemu_ram_remap(), so I would vote for a "fail hard" in the case where the
> settings are mandatory to continue.

"fail hard" is likely the best approach for now.

-- 
Cheers,

David / dhildenb

[PATCH v2 7/7] system/physmem: Memory settings applied on remap notification

Posted by William Roche 1 year, 3 months ago

From: William Roche <william.roche@oracle.com>

Merging and dump settings are handled by the remap notification
in addition to memory policy and preallocation.
If preallocation is set on a memory block, a qemu_prealloc_mem()
call is also needed after ram_block_discard_range() is used on
this block.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/physmem.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index e72ca31451..72129d5b1b 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2225,8 +2225,6 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                                      length, addr);
                         exit(1);
                     }
-                    memory_try_enable_merging(vaddr, length);
-                    qemu_ram_setup_dump(vaddr, length);
                 }
                 ram_block_notify_remap(block->host, offset, length);
             }
-- 
2.43.5

[PATCH v1 4/4] accel/kvm: Report the loss of a large memory page

Posted by William Roche 1 year, 3 months ago

From: William Roche <william.roche@oracle.com>

On HW memory error, we need to report better what the impact of this
error is. So when an entire large page is impacted by an error (like the
hugetlbfs case), we give a warning message when this page is first hit:
Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST addr Z

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c      | 9 ++++++++-
 include/sysemu/kvm_int.h | 6 ++++--
 target/arm/kvm.c         | 2 +-
 target/i386/kvm/kvm.c    | 2 +-
 4 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 40117eefa7..bddaf1e981 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1284,7 +1284,7 @@ static void kvm_unpoison_all(void *param)
     }
 }
 
-void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz)
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz, void *ha, hwaddr gpa)
 {
     HWPoisonPage *page;
 
@@ -1300,6 +1300,13 @@ void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz)
     page->ram_addr = ram_addr;
     page->page_size = sz;
     QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
+
+    if (sz > TARGET_PAGE_SIZE) {
+        gpa = ROUND_DOWN(gpa, sz);
+        ha = (void *)ROUND_DOWN((uint64_t)ha, sz);
+        warn_report("Memory error: Loosing a large page (size: %zu) "
+            "at QEMU addr %p and GUEST addr 0x%" HWADDR_PRIx, sz, ha, gpa);
+    }
 }
 
 bool kvm_hwpoisoned_mem(void)
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index d2160be0ae..af569380ca 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -177,12 +177,14 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size);
  * kvm_hwpoison_page_add:
  *
  * Parameters:
- *  @ram_addr: the address in the RAM for the poisoned page
+ *  @addr: the address in the RAM for the poisoned page
  *  @sz: size of the poisoned page as reported by the kernel
+ *  @hva: host virtual address aka QEMU addr
+ *  @gpa: guest physical address aka GUEST addr
  *
  * Add a poisoned page to the list
  *
  * Return: None.
  */
-void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz);
+void kvm_hwpoison_page_add(ram_addr_t addr, size_t sz, void *hva, hwaddr gpa);
 #endif
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 11579e170b..f8eb553f7c 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2363,7 +2363,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr, short addr_lsb)
             if (sz == TARGET_PAGE_SIZE) {
                 sz = qemu_ram_pagesize_from_host(addr);
             }
-            kvm_hwpoison_page_add(ram_addr, sz);
+            kvm_hwpoison_page_add(ram_addr, sz, addr, paddr);
             /*
              * If this is a BUS_MCEERR_AR, we know we have been called
              * synchronously from the vCPU thread, so we can easily
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 71e674bca0..34cfa8b764 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -757,7 +757,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr, short addr_lsb)
             if (sz == TARGET_PAGE_SIZE) {
                 sz = qemu_ram_pagesize_from_host(addr);
             }
-            kvm_hwpoison_page_add(ram_addr, sz);
+            kvm_hwpoison_page_add(ram_addr, sz, addr, paddr);
             kvm_mce_inject(cpu, paddr, code);
 
             /*
-- 
2.43.5
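
The address rounding done before the warning can be shown in isolation. This is a minimal sketch, not the patch's code: `page_base()` and `page_offset()` are hypothetical helpers that stand in for the ROUND_DOWN use above, and they assume the page size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round addr down to the start of the page containing it.
 * page_size must be a power of two (true for 4 KiB, 2 MiB and 1 GiB pages).
 */
static uint64_t page_base(uint64_t addr, uint64_t page_size)
{
    assert(page_size && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}

/* Offset of addr inside its page. */
static uint64_t page_offset(uint64_t addr, uint64_t page_size)
{
    return addr - page_base(addr, page_size);
}
```

Applied to both the host virtual address and the guest physical address with the large page size, this gives the two page-start addresses that the warning reports, whichever 4k location inside the large page was actually touched.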

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by William Roche 1 year, 4 months ago

On 10/9/24 17:45, Peter Xu wrote:
> On Thu, Sep 19, 2024 at 06:52:37PM +0200, William Roche wrote:
>> Hello David,
>>
>> I hope my email from last week answered your questions about:
>>      - retrieving the valid data from the lost hugepage
>>      - the need for smaller pages to replace a failed large page
>>      - the interaction of memory error and VM migration
>>      - the non-symmetrical access to a poisoned memory area after a recovery
>>        Qemu would be able to continue to access the still valid data
>>        location of the formerly poisoned hugepage, but any other entity
>>        mapping the large page would not be allowed to use the location.
>>
>> I understand that this last item _is_ some kind of "inconsistency".
>> So if I want to make sure that a "shared" memory region (used for vhost-user
>> processes, vfio or ivshmem) is not recovered, how can I identify which
>> region(s) of guest memory could be used for such a shared location?
>> Is there a way for qemu to identify the memory locations that have been
>> shared?
> 
> When there's no vIOMMU I think all guest pages need to be shared.  With
> a vIOMMU it depends on what was mapped by the guest drivers, while in
> most sane setups they can still always be shared because the guest OS (if
> Linux) should normally have iommu=pt speeding up kernel drivers.
> 
>>
>> Could you please let me know if there is an entry point I should consider ?
> 
> IMHO it'll still be more reasonable that this issue be tackled from the
> kernel not userspace, simply because it's a shared problem of all
> userspaces rather than QEMU process alone.
> 
> With that, the kernel should guarantee consistency across the different
> processes accessing these pages, so logically all this complexity is
> better handled in the kernel once and for all.
> 
> There are indeed difficulties in providing it in hugetlbfs with the mm community,
> and this is also not the only effort trying to fix 1G page poisoning with
> userspace workarounds, see:
> 
> https://lore.kernel.org/r/20240924043924.3562257-1-jiaqiyan@google.com
> 
> My gut feeling is either hugetlbfs needs to be fixed (with less hope) or
> QEMU in general needs to move over to other file systems for consuming huge
> pages.  Poisoning is not the only driving force, but at least we want to
> also work out postcopy, which has a similar goal, as David said, of being able
> to map hugetlbfs pages differently.
> 
> You may consider having a look at the gmemfd 1G proposal, posted here:
> 
> https://lore.kernel.org/r/cover.1726009989.git.ackerleytng@google.com
> 
> We probably need that in one way or another for CoCo, and the chance is it
> can easily support non-CoCo with the same interface ultimately.  Then 1G
> hugetlbfs can be abandoned in QEMU.  It'll also need to tackle the same
> challenge here either on page poisoning, or postcopy, with/without QEMU's
> specific solution, because QEMU is also not the only userspace hypervisor.
> 
> That said, the initial few small patches seem to be standalone small fixes
> which may still be good.  So if you think that's the case you can at least
> consider sending them separately without RFC tag.
> 
> Thanks,

Thank you very much Peter for your answer, pointers and explanations.

I understand and agree that having the kernel deal with huge page
errors is a much better approach.
Not an easy one...

I'll submit a trimmed down version of my first patches to fix some
problems that currently exist in Qemu.

Thanks again,
William.