[v1] hugetlbfs largepage RAS project

[RFC 0/6] hugetlbfs largepage RAS project

Posted by William Roche 2 months, 2 weeks ago

From: William Roche <william.roche@oracle.com>

Hello,

This is a Qemu RFC introducing a way to deal with hardware memory errors
impacting VMs backed by hugetlbfs memory. With hugetlbfs large pages, a
HW memory error hitting any location of a large page results in the
entire page being poisoned, suddenly making a large chunk of the VM
memory unusable.

The implemented proposal is simply a memory mapping change performed when
a HW error is reported to Qemu: the hugetlbfs large page is transformed into
a set of standard sized pages. The failed large page is unmapped and a set
of standard sized pages is mapped in its place.
This mechanism is triggered when a SIGBUS/BUS_MCEERR_Ax signal is received
by qemu and the reported location corresponds to a large page.
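
To illustrate the remapping step only, here is a minimal sketch, not the
code from the series. The helper name remap_with_small_pages() is
hypothetical, and the sketch assumes the range covers whole large page(s)
and that anonymous private memory is an acceptable replacement.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace the range [start, start + len) with fresh
 * anonymous standard sized pages. MAP_FIXED discards whatever was mapped
 * there before and installs the new mapping at the same address.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int remap_with_small_pages(void *start, size_t len)
{
    long psz = sysconf(_SC_PAGESIZE);
    void *p;

    /* Address and length must both be aligned on the standard page size. */
    if (psz <= 0 || len == 0 ||
        (uintptr_t)start % (uintptr_t)psz != 0 || len % (size_t)psz != 0) {
        errno = EINVAL;
        return -1;
    }

    p = mmap(start, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    assert(p == start); /* guaranteed by MAP_FIXED on success */
    return 0;
}
```

As noted in the cover letter, mmap() is not async-signal-safe, so a helper
like this one would have to run outside of the SIGBUS handler.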

This gives the possibility to:
- Take advantage of newer hypervisor kernels that provide a way to retrieve
the still valid data of the impacted, poisoned hugetlbfs large page.
If the backend file is MAP_SHARED, we can copy the valid data into the
set of standard sized pages. But if an error is returned when accessing
a location, we consider it poisoned and mark the corresponding standard sized
memory page as poisoned with a MADV_HWPOISON madvise call. Hence, the VM
can continue to use whatever valid data could be retrieved.
- Adjust the poison address information. When accessing a poisoned location,
an older kernel version may only provide the address of the beginning of
the poisoned large page in the associated SIGBUS siginfo data. Pointing to
the more accurate location that was actually touched allows the VM kernel to
trigger the right memory error reaction.
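
The data recovery described in the first item can be sketched as follows.
This is only an illustration under stated assumptions: it assumes the valid
data is retrieved with pread() on the backend file descriptor, that
large_sz is a multiple of small_sz, and that a short read is treated like
an error. The function names are hypothetical and do not come from the
series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy a large page from the backend file into dst,
 * one standard sized chunk at a time. lost[i] is set when chunk i could not
 * be read completely; such chunks are zeroed. Returns the number of lost
 * chunks.
 */
static size_t copy_valid_chunks(int fd, off_t file_off, unsigned char *dst,
                                size_t large_sz, size_t small_sz, bool *lost)
{
    size_t nlost = 0;

    assert(small_sz != 0 && large_sz % small_sz == 0);

    for (size_t i = 0; i * small_sz < large_sz; i++) {
        unsigned char *chunk = dst + i * small_sz;
        size_t done = 0;

        while (done < small_sz) {
            ssize_t n = pread(fd, chunk + done, small_sz - done,
                              file_off + (off_t)(i * small_sz + done));
            if (n < 0 && errno == EINTR) {
                continue;           /* interrupted, retry */
            }
            if (n <= 0) {
                break;              /* error or end of file: chunk is lost */
            }
            done += (size_t)n;
        }

        lost[i] = done < small_sz;
        if (lost[i]) {
            memset(chunk, 0, small_sz); /* do not keep partial data */
            nlost++;
        }
    }
    return nlost;
}

/*
 * Hypothetical helper: mark every lost chunk as poisoned.
 * madvise(MADV_HWPOISON) requires CAP_SYS_ADMIN.
 */
static int poison_lost_chunks(unsigned char *dst, size_t nchunks,
                              size_t small_sz, const bool *lost)
{
    for (size_t i = 0; i < nchunks; i++) {
        if (lost[i] &&
            madvise(dst + i * small_sz, small_sz, MADV_HWPOISON) != 0) {
            return -1;
        }
    }
    return 0;
}
```

Splitting the copy from the poisoning keeps the privileged madvise() call
in one place, which would make it easier to swap in another marking
mechanism later.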

A warning is given for hugetlbfs backed memory-regions that are mapped
without the 'share=on' option.
(This warning is also given when using the deprecated "-mem-path" option)

The hugetlbfs memory mapping option should look like this
(with XXX replaced with the actual size):
  -object memory-backend-file,id=pc.ram,mem-path=/dev/hugepages,prealloc=on,share=on,size=XXX -machine memory-backend=pc.ram

I'm introducing new system/hugetlbfs_ras.[ch] files to separate the specific
code for this feature. It is only compiled on Linux.

Note that we have to be able to mark a valid replacement standard sized
page as "poison". We currently do that by calling madvise(..., MADV_HWPOISON),
but this requires the qemu process to have the CAP_SYS_ADMIN privilege.
Using userfaultfd instead of madvise() to mark the pages as poison could
remove this constraint, at the cost of complicating the code with additional
thread(s) servicing the user page faults.


It's also worth mentioning the case of IO memory, i.e. vfio configured memory
buffers. The Qemu memory remapping (if it succeeds) will not reconfigure any
device IO buffer locations (no dma unmap/remap is performed), and if a
hardware IO is supposed to access (read or write) a poisoned hugetlbfs
page, I would expect it to fail the same way as before (as its location
hasn't been updated to take the new mapping into account).
But can someone confirm this behavior? Or tell me what should
be done to deal with this type of memory buffer?

Details:
--------
The following problems had to be considered:

. kvm dealing with memory faults:
 - Address space mapping changes can't be handled in a signal handler (mmap
   is not async-signal-safe, for example)
     We have a separate listener thread (only created when we use hugetlbfs)
     to deal with the mapping changes.
 - If memory is not mapped when accessed, kvm fails with
   (exit_reason: KVM_EXIT_UNKNOWN)
     To avoid that, I needed to prevent access to a memory region while it
     is changing: pausing the VM is used to do so.
 - A fault on a poisoned hugetlbfs large page will report a hardcoded page
   size of 4k (See kernel kvm_send_hwpoison_signal() function)
     When a SIGBUS is received with a page size indication of 4k, we have to
     check whether the impacted page is in fact a hugetlbfs page.
 - Asynchronous SIGBUS/BUS_MCEERR_AO signals provide the right page size,
   but the current Qemu version does not take this information into account
   and needs to do so.
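
As an illustration of the page size information carried by the signal, here
is a small sketch decoding si_addr and si_addr_lsb from a SIGBUS siginfo.
The struct and function names are hypothetical. Note that, as described
above, a KVM reported fault may carry a hardcoded 4k indication, so the
size decoded here would still need to be checked against the real backing
page size.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the memory range reported as poisoned. */
struct poison_range {
    uintptr_t start;        /* start address, aligned on the reported size */
    size_t size;            /* 1 << si_addr_lsb */
    bool action_required;   /* true for BUS_MCEERR_AR, false for _AO */
};

/*
 * Decode a memory error SIGBUS. Returns false when the siginfo does not
 * describe a BUS_MCEERR_AR/BUS_MCEERR_AO event.
 */
static bool decode_mce_siginfo(const siginfo_t *si, struct poison_range *out)
{
    unsigned lsb;
    uintptr_t size;

    assert(si != NULL && out != NULL);

    if (si->si_signo != SIGBUS ||
        (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO)) {
        return false;
    }

    /* si_addr_lsb is the least significant valid bit of si_addr. */
    lsb = (unsigned)si->si_addr_lsb;
    if (lsb >= sizeof(uintptr_t) * 8) {
        return false;
    }

    size = (uintptr_t)1 << lsb;
    out->start = (uintptr_t)si->si_addr & ~(size - 1);
    out->size = (size_t)size;
    out->action_required = (si->si_code == BUS_MCEERR_AR);
    return true;
}
```

For example, an address of 0x40201234 reported with si_addr_lsb = 21
describes the 2M range starting at 0x40200000.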

. system/physmem needed fixes:
 - When recreating the memory mapping on VM reset, we have to consider the
   memory size impacted.
 - In the case of a mapped file, punching a hole is necessary to clean the
   poison.
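
For reference, punching a hole in a mapped file can be sketched as below.
This is a generic illustration of the fallocate() call, not the physmem
change itself. The helper name is hypothetical, and for hugetlbfs the
offset and length are assumed to be aligned on the large page size.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: deallocate [offset, offset + len) in the backend
 * file without changing the file size. Pages of the range are dropped from
 * the page cache and from existing shared mappings; the next access gets a
 * freshly allocated, zero-filled page.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int punch_hole(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len > 0);
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}
```

FALLOC_FL_PUNCH_HOLE has to be combined with FALLOC_FL_KEEP_SIZE, and not
every filesystem supports it, so the return value should be checked.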

. Implementation details:
 - A SIGBUS signal received for a large page will trigger the page modification,
   but in order to pause the VM, the signal handlers have to terminate.
     So we return from the SIGBUS signal handler(s) when a VM has to be stopped.
     A memory access that generated a SIGBUS/BUS_MCEERR_AR signal before the
     VM pause will be repeated when the VM resumes. If the memory is still
     not accessible (poisoned), the signal will be generated again by the
     hypervisor kernel.
     In the case of an asynchronous SIGBUS/BUS_MCEERR_AO signal, the signal is
     not repeated by the kernel, so it is recorded by qemu in order to be
     replayed when the VM resumes.
 - Poisoning a memory page with MADV_HWPOISON can generate a SIGBUS when
   called. The listener thread taking care of the memory modification needs
   to deal with this case. To do so, it sets a thread specific variable
   that is recognized by the sigbus handler.
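
The thread specific variable mentioned in the last item can be sketched as
follows. This is a simplified illustration, not the code from the series:
all names are hypothetical, and the sketch only records that a SIGBUS
arrived while the calling thread was poisoning a page.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by a thread while it is calling madvise(MADV_HWPOISON). */
static __thread volatile sig_atomic_t poisoning_in_progress;
/* Set by the handler when a SIGBUS hits a thread that is poisoning. */
static __thread volatile sig_atomic_t sigbus_during_poisoning;
/* Counts the SIGBUS signals that would follow the normal handling path. */
static volatile sig_atomic_t sigbus_forwarded;

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig;
    (void)si;
    (void)ctx;

    if (poisoning_in_progress) {
        /* Expected side effect of the poisoning: just take note of it. */
        sigbus_during_poisoning = 1;
        return;
    }
    /* Placeholder for the regular memory error handling. */
    sigbus_forwarded = sigbus_forwarded + 1;
}

static int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* To be called by the listener thread right before poisoning a page. */
static void begin_poisoning(void)
{
    assert(!poisoning_in_progress);
    sigbus_during_poisoning = 0;
    poisoning_in_progress = 1;
}

/* Returns 1 if a SIGBUS was received since begin_poisoning(), else 0. */
static int end_poisoning(void)
{
    poisoning_in_progress = 0;
    return sigbus_during_poisoning ? 1 : 0;
}
```

The listener thread would wrap its madvise(..., MADV_HWPOISON) call between
begin_poisoning() and end_poisoning(), so that only its own SIGBUS is
absorbed while other threads keep the regular handling.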


Some questions:
---------------
. Should we take extra care of IO memory, vfio configured memory buffers?

. My feature code is enclosed within "ifdef CONFIG_HUGETLBFS_RAS" and is only
  compiled on Linux.
  Should we have a configure option to prevent the introduction of this
  feature in the code (turning off CONFIG_HUGETLBFS_RAS)?

. Should I include the content of my system/hugetlbfs_ras.[ch] files into
  another existing file?

. Should we force 'sharing' when using the "-mem-path" option, instead of
  -object memory-backend-file,share=on,...?


This prototype is scripts/checkpatch.pl clean (except for the MAINTAINERS
update for the 2 added files).
'make check' runs fine on both x86 and ARM.
Unit tests have been done on Intel, AMD and ARM platforms.



William Roche (6):
  accel/kvm: SIGBUS handler should also deal with si_addr_lsb
  accel/kvm: Keep track of the HWPoisonPage sizes
  system/physmem: Remap memory pages on reset based on the page size
  system: Introducing hugetlbfs largepage RAS feature
  system/hugetlb_ras: Handle madvise SIGBUS signal on listener
  system/hugetlb_ras: Replay lost BUS_MCEERR_AO signals on VM resume

 accel/kvm/kvm-all.c      |  24 +-
 accel/stubs/kvm-stub.c   |   4 +-
 include/qemu/osdep.h     |   5 +-
 include/sysemu/kvm.h     |   7 +-
 include/sysemu/kvm_int.h |   3 +-
 meson.build              |   2 +
 system/cpus.c            |  15 +-
 system/hugetlbfs_ras.c   | 645 +++++++++++++++++++++++++++++++++++++++
 system/hugetlbfs_ras.h   |   4 +
 system/meson.build       |   1 +
 system/physmem.c         |  30 ++
 target/arm/kvm.c         |  15 +-
 target/i386/kvm/kvm.c    |  15 +-
 util/oslib-posix.c       |   3 +
 14 files changed, 753 insertions(+), 20 deletions(-)
 create mode 100644 system/hugetlbfs_ras.c
 create mode 100644 system/hugetlbfs_ras.h

-- 
2.43.5

[RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by William Roche 2 months, 2 weeks ago

From: William Roche <william.roche@oracle.com>


Apologies for the noise; resending as I missed CC'ing the maintainers of the
changed files



Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by David Hildenbrand 2 months, 2 weeks ago

On 10.09.24 12:02, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 

Hi,

> 
> Apologies for the noise; resending as I missed CC'ing the maintainers of the
> changed files
> 
> 
> Hello,
> 
> This is a Qemu RFC to introduce the possibility to deal with hardware
> memory errors impacting hugetlbfs memory backed VMs. When using
> hugetlbfs large pages, any large page location being impacted by an
> HW memory error results in poisoning the entire page, suddenly making
> a large chunk of the VM memory unusable.
> 
> The implemented proposal is simply a memory mapping change when an HW error
> is reported to Qemu, to transform a hugetlbfs large page into a set of
> standard sized pages. The failed large page is unmapped and a set of
> standard sized pages are mapped in place.
> This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is received
> by qemu and the reported location corresponds to a large page.
> 
> This gives the possibility to:
> - Take advantage of newer hypervisor kernel providing a way to retrieve
> still valid data on the impacted hugetlbfs poisoned large page.
> If the backend file is MAP_SHARED, we can copy the valid data into the

How are you dealing with other consumers of the shared memory, such as 
vhost-user processes, vm migration whereby RAM is migrated using file 
content, vfio that might have these pages pinned?

In general, you cannot simply replace pages by private copies when 
somebody else might be relying on these pages to go to actual guest RAM.

It sounds very hacky and incomplete at first.


-- 
Cheers,

David / dhildenb

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by William Roche 2 months, 2 weeks ago

On 9/10/24 13:36, David Hildenbrand wrote:

> On 10.09.24 12:02, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>
> Hi,
>
>>
>> Apologies for the noise; resending as I missed CC'ing the maintainers 
>> of the
>> changed files
>>
>>
>> Hello,
>>
>> This is a Qemu RFC to introduce the possibility to deal with hardware
>> memory errors impacting hugetlbfs memory backed VMs. When using
>> hugetlbfs large pages, any large page location being impacted by an
>> HW memory error results in poisoning the entire page, suddenly making
>> a large chunk of the VM memory unusable.
>>
>> The implemented proposal is simply a memory mapping change when an HW 
>> error
>> is reported to Qemu, to transform a hugetlbfs large page into a set of
>> standard sized pages. The failed large page is unmapped and a set of
>> standard sized pages are mapped in place.
>> This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is 
>> received
>> by qemu and the reported location corresponds to a large page.
>>
>> This gives the possibility to:
>> - Take advantage of newer hypervisor kernel providing a way to retrieve
>> still valid data on the impacted hugetlbfs poisoned large page.
>> If the backend file is MAP_SHARED, we can copy the valid data into the

Thank you David for this first reaction on this proposal.

> How are you dealing with other consumers of the shared memory,
> such as vhost-user processes,

In the current proposal, I don't deal with this aspect.
In fact, any other process sharing the changed memory will
continue to map the poisoned large page. So any access to
this page will generate a SIGBUS to this other process.

In this situation vhost-user processes should continue to receive
SIGBUS signals (and probably continue to die because of that).

So I do see a real problem if 2 qemu processes are sharing the
same hugetlbfs segment -- in this case, error recovery should not
occur on this piece of the memory. Maybe dealing with this situation
with "ivshmem" options is doable (marking the shared segment
"not eligible" for hugetlbfs recovery, just as hugetlbfs entries
without "share=on" are not eligible)
-- I need to think about this specific case.

Please let me know if there is a better way to deal with this
shared memory aspect and have a better system reaction.

> vm migration whereby RAM is migrated using file content,

Migration doesn't currently work with memory poisoning.
You can have a look at the following commit, which is already merged:

06152b89db64 migration: prevent migration when VM has poisoned memory

This proposal doesn't change anything on this side.

> vfio that might have these pages pinned?

AFAIK even pinned memory can be impacted by a memory error and poisoned
by the kernel. Now, as I said in the cover letter, I'd like to know if
we should take extra care of IO memory, vfio configured memory buffers...

> In general, you cannot simply replace pages by private copies
> when somebody else might be relying on these pages to go to
> actual guest RAM.

This is correct, but the current proposal is dealing with a specific
shared memory type: poisoned large pages. So any other process mapping
this type of page can't access it without generating a SIGBUS.

> It sounds very hacky and incomplete at first.

As you can see, RAS features need to be completed.
And if this proposal is incomplete, what other changes should be
made to complete it?

I do hope we can discuss this RFC to adapt what is incorrect, or
find a better way to address this situation.

Thanks in advance for your feedback,
William.

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by David Hildenbrand 2 months, 1 week ago

Hi again,

>>> This is a Qemu RFC to introduce the possibility to deal with hardware
>>> memory errors impacting hugetlbfs memory backed VMs. When using
>>> hugetlbfs large pages, any large page location being impacted by an
>>> HW memory error results in poisoning the entire page, suddenly making
>>> a large chunk of the VM memory unusable.
>>>
>>> The implemented proposal is simply a memory mapping change when an HW
>>> error
>>> is reported to Qemu, to transform a hugetlbfs large page into a set of
>>> standard sized pages. The failed large page is unmapped and a set of
>>> standard sized pages are mapped in place.
>>> This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is
>>> received
>>> by qemu and the reported location corresponds to a large page.

One clarifying question: you simply replace the hugetlb page by multiple 
small pages using mmap(MAP_FIXED). So you

(a) are not able to recover any memory of the original page (as of now)
(b) no longer have a hugetlb page and, therefore, possibly a performance
     degradation, relevant in low-latency applications that really care
     about the usage of hugetlb pages.
(c) run into the described inconsistency issues

Why is what you propose beneficial over just fallocate(PUNCH_HOLE) the 
full page and get a fresh, non-poisoned page instead?

Sure, you have to reserve some pages if that ever happens, but what is 
the big selling point over PUNCH_HOLE + realloc? (sorry if I missed it 
and it was spelled out)

>>>
>>> This gives the possibility to:
>>> - Take advantage of newer hypervisor kernel providing a way to retrieve
>>> still valid data on the impacted hugetlbfs poisoned large page.

Reading that again, that shouldn't have to be hypervisor-specific. 
Really, if someone were to extract data from a poisoned hugetlb folio, 
it shouldn't be hypervisor-specific. The kernel should be able to know 
which regions are accessible and could allow ways for reading these, one 
way or the other.

It could just be a fairly hugetlb-special feature that would replace the 
poisoned page by a fresh hugetlb page where as much page content as 
possible has been recovered from the old one.

>>> If the backend file is MAP_SHARED, we can copy the valid data into the
> 
> 
> Thank you David for this first reaction on this proposal.
> 
> 
>> How are you dealing with other consumers of the shared memory,
>> such as vhost-user processes,
> 
> 
> In the current proposal, I don't deal with this aspect.
> In fact, any other process sharing the changed memory will
> continue to map the poisoned large page. So any access to
> this page will generate a SIGBUS to this other process.
> 
> In this situation vhost-user processes should continue to receive
> SIGBUS signals (and probably continue to die because of that).

That's ... suboptimal. :)

Assume you have a 1 GiB page. The guest OS can happily allocate buffers 
in there so they can end up in vhost-user and crash that process. 
Without any warning.

> 
> So I do see a real problem if 2 qemu processes are sharing the
> same hugetlbfs segment -- in this case, error recovery should not
> occur on this piece of the memory. Maybe dealing with this situation
> with "ivshmem" options is doable (marking the shared segment
> "not eligible" to hugetlbfs recovery, just like not "share=on"
> hugetlbfs entries are not eligible)
> -- I need to think about this specific case.
> 
> Please let me know if there is a better way to deal with this
> shared memory aspect and have a better system reaction.

Not creating the inconsistency in the first place :)

>> vm migration whereby RAM is migrated using file content,
> 
> 
> Migration doesn't currently work with memory poisoning.
> You can give a look at the already integrated following commit:
> 
> 06152b89db64 migration: prevent migration when VM has poisoned memory
> 
> This proposal doesn't change anything on this side.

That commit is fairly fresh and likely missed the option to *not* 
migrate RAM by reading it, but instead by migrating it through a shared 
file. For example, VM live-upgrade (CPR) wants to use that (or is
already using that), to avoid RAM migration completely.

> 
>> vfio that might have these pages pinned?
> 
> AFAIK even pinned memory can be impacted by memory error and poisoned
> by the kernel. Now as I said in the cover letter, I'd like to know if
> we should take extra care for IO memory, vfio configured memory buffers...

Assume your GPU has a hugetlb folio pinned via vfio. As soon as you make
the guest RAM point at anything other than what VFIO is aware of, we end up
in the same problem we had when we learned about having to disable balloon
inflation (MADV_DONTNEED) as soon as VFIO pinned pages.

We'd have to inform VFIO that the mapping is now different. Otherwise
it's really better to crash the VM than having your GPU read/write
different data than your CPU reads/writes.

> 
> 
>> In general, you cannot simply replace pages by private copies
>> when somebody else might be relying on these pages to go to
>> actual guest RAM.
> 
> This is correct, but the current proposal is dealing with a specific
> shared memory type: poisoned large pages. So any other process mapping
> this type of page can't access it without generating a SIGBUS.

Right, and that's the issue. Because, for example, how should the VM be 
aware that this memory is now special and must not be used for some 
purposes without leading to problems elsewhere?

> 
> 
>> It sounds very hacky and incomplete at first.
> 
> As you can see, RAS features need to be completed.
> And if this proposal is incomplete, what other changes should be
> done to complete it ?
> 
> I do hope we can discuss this RFC to adapt what is incorrect, or
> find a better way to address this situation.

One long-term goal people are working on is to allow remapping the 
hugetlb folios in smaller granularity, such that only a single affected 
PTE can be marked as poisoned. (used to be called high-granularity-mapping)

However, at the same time, the focus seems to shift towards using
guest_memfd instead of hugetlb, once it supports 1 GiB pages and shared 
memory. It will likely be easier to support mapping 1 GiB pages using 
PTEs that way, and there are ongoing discussions how that can be 
achieved more easily.

There are also discussions [1] about not poisoning the mappings at all
and handling it differently. But I haven't yet digested what exactly that
could look like in reality.


[1] https://lkml.kernel.org/r/20240828234958.GE3773488@nvidia.com

-- 
Cheers,

David / dhildenb

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by William Roche 2 months, 1 week ago

On 9/12/24 00:07, David Hildenbrand wrote:

> Hi again,
>
>>>> This is a Qemu RFC to introduce the possibility to deal with hardware
>>>> memory errors impacting hugetlbfs memory backed VMs. When using
>>>> hugetlbfs large pages, any large page location being impacted by an
>>>> HW memory error results in poisoning the entire page, suddenly making
>>>> a large chunk of the VM memory unusable.
>>>>
>>>> The implemented proposal is simply a memory mapping change when an HW
>>>> error
>>>> is reported to Qemu, to transform a hugetlbfs large page into a set of
>>>> standard sized pages. The failed large page is unmapped and a set of
>>>> standard sized pages are mapped in place.
>>>> This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is
>>>> received
>>>> by qemu and the reported location corresponds to a large page.
>
> One clarifying question: you simply replace the hugetlb page by 
> multiple small pages using mmap(MAP_FIXED).

That's right.

> So you
>
> (a) are not able to recover any memory of the original page (as of now)
Once poisoned by the kernel, the original large page is not accessible at
all anymore, but the kernel can provide what remains of the poisoned
hugetlbfs page through the backend file (when this file was mapped
MAP_SHARED).

> (b) no longer have a hugetlb page and, therefore, possibly a performance
>     degradation, relevant in low-latency applications that really care
>     about the usage of hugetlb pages.
This is correct.

> (c) run into the described inconsistency issues
The inconsistency I agreed upon is the case of 2 qemu processes sharing a
piece of the memory (through the ivshmem mechanism), which can be fixed by
disabling recovery for ivshmem associated hugetlbfs segments.

> Why is what you propose beneficial over just fallocate(PUNCH_HOLE) the 
> full page and get a fresh, non-poisoned page instead?
>
> Sure, you have to reserve some pages if that ever happens, but what is 
> the big selling point over PUNCH_HOLE + realloc? (sorry if I missed it 
> and it was spelled out)
This project provides an essential component that can't be achieved by
replacing a failed large page with another large page: an uncorrected memory
error on a memory page is a lost piece of memory, and it needs to be
identified so that any user is informed of the loss. The kernel granularity
for that is the entire page: it marks the page 'poisoned', making it
inaccessible (no matter what the page size or the size of the lost memory
piece). So recovering part of a large page impacted by a memory error
requires keeping track of the lost area, and there is no other way but to
lower the granularity and split the page into smaller pieces, so that those
covering the lost area can be marked 'poisoned'.

That's the reason why we can't replace a failed large page with another
large page.

We need smaller pages.


>>>>
>>>> This gives the possibility to:
>>>> - Take advantage of newer hypervisor kernel providing a way to 
>>>> retrieve
>>>> still valid data on the impacted hugetlbfs poisoned large page.
>
> Reading that again, that shouldn't have to be hypervisor-specific. 
> Really, if someone were to extract data from a poisoned hugetlb folio, 
> it shouldn't be hypervisor-specific. The kernel should be able to know 
> which regions are accessible and could allow ways for reading these, 
> one way or the other.
>
> It could just be a fairly hugetlb-special feature that would replace 
> the poisoned page by a fresh hugetlb page where as much page content 
> as possible has been recoverd from the old one.
I totally agree that it should be the kernel's role to split the
page and keep track of the valid and lost pieces. This was an aspect of the
high-granularity-mapping (HGM) project you are referring to. But HGM is not
there yet (and may never be), and currently the only automatic memory split
done by the kernel occurs when we are using Transparent Huge Pages (THP).
Unfortunately THP doesn't show (for the moment) all the performance and
memory optimisation possibilities that using hugetlbfs provides. And it's
a large topic I'd prefer not to get into.


>>> How are you dealing with other consumers of the shared memory,
>>> such as vhost-user processes,
>>
>>
>> In the current proposal, I don't deal with this aspect.
>> In fact, any other process sharing the changed memory will
>> continue to map the poisoned large page. So any access to
>> this page will generate a SIGBUS to this other process.
>>
>> In this situation vhost-user processes should continue to receive
>> SIGBUS signals (and probably continue to die because of that).
>
> That's ... suboptimal. :)
True.

>
> Assume you have a 1 GiB page. The guest OS can happily allocate 
> buffers in there so they can end up in vhost-user and crash that 
> process. Without any warning.
I confess that I don't know how, when and where vhost-user processes get
their shared memory locations.
But I agree that a recovered large page is currently not usable for
associating new shared buffers between qemu and external processes.

Note that previously allocated buffers that could have been located on this
page are marked 'poisoned' (after a memory error) on the vhost-user process
the same way they were before this project. The only difference is that,
after a recovered memory error, qemu may continue to see the recovered
address space and use it. But the receiving side (on vhost-user) will fail
when accessing the location.

Can a vhost-user process fail without any warning being reported?
I hope not.

>> So I do see a real problem if 2 qemu processes are sharing the
>> same hugetlbfs segment -- in this case, error recovery should not
>> occur on this piece of the memory. Maybe dealing with this situation
>> with "ivshmem" options is doable (marking the shared segment
>> "not eligible" to hugetlbfs recovery, just like not "share=on"
>> hugetlbfs entries are not eligible)
>> -- I need to think about this specific case.
>>
>> Please let me know if there is a better way to deal with this
>> shared memory aspect and have a better system reaction.
>
> Not creating the inconsistency in the first place :)
Yes :)
Of course I don't want to introduce any inconsistency that could lead to
memory corruption.
But if we consider that 'ivshmem' memory is not eligible for recovery,
it means that we still leave the entire large page location poisoned and
there would not be any inconsistency for this memory component. Other
hugetlbfs memory components would still have the possibility to be
partially recovered, giving the VM a higher chance not to crash
immediately.

>>> vm migration whereby RAM is migrated using file content,
>>
>>
>> Migration doesn't currently work with memory poisoning.
>> You can give a look at the already integrated following commit:
>>
>> 06152b89db64 migration: prevent migration when VM has poisoned memory
>>
>> This proposal doesn't change anything on this side.
>
> That commit is fairly fresh and likely missed the option to *not* 
> migrate RAM by reading it, but instead by migrating it through a 
> shared file. For example, VM live-upgrade (CPR) wants to use that (or 
> is already using that), to avoid RAM migration completely.
When a memory error occurs on a dirty page used for a mapped file,
the data is lost and the file synchronisation should fail with EIO.
You can't rely on the file content to reflect the latest memory content.
So, in my opinion, even a migration using such a file should be avoided.


>>> vfio that might have these pages pinned?
>>
>> AFAIK even pinned memory can be impacted by memory error and poisoned
>> by the kernel. Now as I said in the cover letter, I'd like to know if
>> we should take extra care for IO memory, vfio configured memory 
>> buffers...
>
> Assume your GPU has a hugetlb folio pinned via vfio. As soon as you
> make the guest RAM point at anything other than what VFIO is aware of,
> we end up in the same problem we had when we learned about having to
> disable balloon inflation (MADV_DONTNEED) as soon as VFIO pinned pages.
>
> We'd have to inform VFIO that the mapping is now different. Otherwise 
> it's really better to crash the VM than having your GPU read/write 
> different data than your CPU reads/writes,
Absolutely true, and fortunately this is not what would happen when the
large poisoned page is still used by the VFIO. After a successful recovery,
the CPU may still be able to read/write on a location where we had a vfio
buffer, but the other side (the device for example) would fail reading or
writing to any location of the poisoned large page.

>>> In general, you cannot simply replace pages by private copies
>>> when somebody else might be relying on these pages to go to
>>> actual guest RAM.
>>
>> This is correct, but the current proposal is dealing with a specific
>> shared memory type: poisoned large pages. So any other process mapping
>> this type of page can't access it without generating a SIGBUS.
>
> Right, and that's the issue. Because, for example, how should the VM 
> be aware that this memory is now special and must not be used for some 
> purposes without leading to problems elsewhere?
That's an excellent question that I don't have the full answer to. We are
dealing here with a hardware fault situation; the hugetlbfs backend file
still has a poisoned large page, so any attempt to map it in a process, or
any process mapping it before the error, will not be able to use the segment.
It doesn't mean that they get their own private copy of a page. The only one
getting a private copy (to get what was still valid on the faulted large
page) is qemu. So if we imagine that ivshmem segments (between 2 qemu
processes) don't get this recovery, I'm expecting the data exchange on this
shared memory to fail, just like it does without the recovery mechanism. So I
don't expect any established communication to continue to work or any new
segment using the recovered area to be successfully created.

But of course I could be missing something here and be too optimistic...

So let's take a step back.

I guess these "sharing" questions would not relate to memory segments that
are not defined as 'share=on', am I right ?

Do ivshmem, vhost-user processes or even vfio only use 'share=on' memory
segments?

If yes, we could also imagine only enabling recovery for hugetlbfs segments
that do not have the 'share=on' attribute, but we would have to map them
MAP_SHARED in the qemu address space anyway. This could create other kinds of
problems (?), but if these inconsistency questions do not appear with this
approach, it would be easy to adapt and would still enhance hugetlbfs use,
at least for a first version of this feature.

>>> It sounds very hacky and incomplete at first.
>>
>> As you can see, RAS features need to be completed.
>> And if this proposal is incomplete, what other changes should be
>> done to complete it ?
>>
>> I do hope we can discuss this RFC to adapt what is incorrect, or
>> find a better way to address this situation.
>
> One long-term goal people are working on is to allow remapping the 
> hugetlb folios in smaller granularity, such that only a single 
> affected PTE can be marked as poisoned. (used to be called 
> high-granularity-mapping)
I look forward to seeing this implemented, but it seems that it will take
time to appear, and if hugetlbfs RAS can be enhanced for qemu it would be
very useful.

The day a kernel solution works, we can disable CONFIG_HUGETLBFS_RAS and rely
on the kernel to provide the appropriate information. The first commits will
continue to be necessary (dealing with the si_addr_lsb value of the SIGBUS
siginfo, tracking the page size information in the hwpoison_page_list and the
memory remap on reset with the missing PUNCH_HOLE).

> However, at the same time, the focus seems to shift towards using 
> guest_memfd instead of hugetlb, once it supports 1 GiB pages and 
> shared memory. It will likely be easier to support mapping 1 GiB pages 
> using PTEs that way, and there are ongoing discussions how that can be 
> achieved more easily.
>
> There are also discussions [1] about not poisoning the mappings at all 
> and handling it differently. But I haven't yet digested how exactly 
> that could look like in reality.
>
>
> [1] https://lkml.kernel.org/r/20240828234958.GE3773488@nvidia.com

Thank you very much for this pointer. I hope a kernel solution (this one or
another) can be implemented and widely adopted within the next 5 to 10
years ;)

In the meantime, we can try to enhance qemu's use of hugetlbfs for VM memory,
which is more and more widely deployed.

Best regards,
William.

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by William Roche 2 months ago

Hello David,

I hope my email from last week answered your questions about:
     - retrieving the valid data from the lost hugepage
     - the need of smaller pages to replace a failed large page
     - the interaction of memory error and VM migration
     - the non-symmetrical access to a poisoned memory area after a recovery
       Qemu would be able to continue to access the still valid data
       location of the formerly poisoned hugepage, but any other entity
       mapping the large page would not be allowed to use the location.
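
To make the first item more concrete, here is a minimal sketch of a
chunk-by-chunk retrieval. It is not the code of this series. It assumes a
kernel where reading the backend file returns the still valid chunks and
fails (typically with EIO) on the poisoned ones; recover_large_page() and
its parameters are hypothetical names used only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy one large page from the backend file into dst, chunk by chunk.
 * lost[i] is set to 1 when chunk i could not be read completely (for
 * instance EIO on a poisoned location) and to 0 otherwise.
 * Returns the number of lost chunks.
 */
size_t recover_large_page(int fd, off_t base, size_t large_sz,
                          size_t chunk_sz, unsigned char *dst,
                          unsigned char *lost)
{
    assert(chunk_sz != 0 && large_sz % chunk_sz == 0);

    size_t nchunks = large_sz / chunk_sz;
    size_t nlost = 0;

    for (size_t i = 0; i < nchunks; i++) {
        unsigned char *out = dst + i * chunk_sz;
        size_t done = 0;

        while (done < chunk_sz) {
            ssize_t n = pread(fd, out + done, chunk_sz - done,
                              base + (off_t)(i * chunk_sz + done));
            if (n > 0) {
                done += (size_t)n;
            } else if (n == -1 && errno == EINTR) {
                continue;           /* interrupted, retry */
            } else {
                break;              /* read error or end of file */
            }
        }

        lost[i] = (done != chunk_sz);
        if (lost[i]) {
            /* Do not leave partial data behind for a lost chunk. */
            memset(out, 0, chunk_sz);
            fprintf(stderr, "chunk %zu of the large page is lost\n", i);
            nlost++;
        }
    }
    return nlost;
}
```

In such a scheme, each chunk flagged in lost[] would be the one marked as
poisoned in the replacement mapping, while the others keep their data.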

I understand that this last item _is_ some kind of "inconsistency".
So if I want to make sure that a "shared" memory region (used for vhost-user
processes, vfio or ivshmem) is not recovered, how can I identify which
region(s) of guest memory could be used for such a shared location?
Is there a way for qemu to identify the memory locations that have been
shared?

Could you please let me know if there is an entry point I should consider ?

Thanks in advance for your feedback.
William.

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by David Hildenbrand 3 weeks, 5 days ago

On 19.09.24 18:52, William Roche wrote:
> Hello David,

Hi William,

sorry for not replying earlier, it somehow fell through the cracks as my 
inbox got flooded :(

> 
> I hope my last week email answered your interrogations about:
>       - retrieving the valid data from the lost hugepage
>       - the need of smaller pages to replace a failed large page
>       - the interaction of memory error and VM migration
>       - the non-symmetrical access to a poisoned memory area after a recovery
>         Qemu would be able to continue to access the still valid data
>         location of the formerly poisoned hugepage, but any other entity
>         mapping the large page would not be allowed to use the location.
> 
> I understand that this last item _is_ some kind of "inconsistency".

That's my biggest concern. Physical memory and its properties are 
described by the QEMU RAMBlock, which includes page size, 
shared/private, and sometimes properties (e.g., uffd).

Adding inconsistency there is really suboptimal :(

> So if I want to make sure that a "shared" memory region (used for vhost-user
> processes, vfio or ivshmem) is not recovered, how can I identify what
> region(s)
> of a guest memory could be used for such a shared location ?
> Is there a way for qemu to identify the memory locations that have been
> shared ?

I'll reply to your other cleanups/improvements, but we can detect if we 
must not discard arbitrary memory (because likely something is relying 
on long-term pinnings) using ram_block_discard_is_disabled().
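
As an illustration only, such a check could gate the recovery path roughly
as follows. The sketch is standalone: discard_disabled stands in for the
value returned by ram_block_discard_is_disabled(), and struct region_info
and recovery_allowed() are hypothetical names, not existing QEMU code. The
share=on condition follows the eligibility rule described in the RFC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one guest RAM region. */
struct region_info {
    bool is_hugetlbfs;  /* backed by hugetlbfs large pages */
    bool is_shared;     /* mapped with share=on */
};

/*
 * Decide whether a poisoned large page in this region may be replaced.
 * discard_disabled is expected to carry the result of
 * ram_block_discard_is_disabled(): when true, something probably relies
 * on long-term pinnings, so the mapping must not change.
 */
bool recovery_allowed(const struct region_info *r, bool discard_disabled)
{
    assert(r != NULL);

    if (discard_disabled) {
        return false;
    }
    return r->is_hugetlbfs && r->is_shared;
}
```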

-- 
Cheers,

David / dhildenb

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by Peter Xu 1 month, 2 weeks ago

On Thu, Sep 19, 2024 at 06:52:37PM +0200, William Roche wrote:
> Hello David,
> 
> I hope my last week email answered your interrogations about:
>     - retrieving the valid data from the lost hugepage
>     - the need of smaller pages to replace a failed large page
>     - the interaction of memory error and VM migration
>     - the non-symmetrical access to a poisoned memory area after a recovery
>       Qemu would be able to continue to access the still valid data
>       location of the formerly poisoned hugepage, but any other entity
>       mapping the large page would not be allowed to use the location.
> 
> I understand that this last item _is_ some kind of "inconsistency".
> So if I want to make sure that a "shared" memory region (used for vhost-user
> processes, vfio or ivshmem) is not recovered, how can I identify what
> region(s)
> of a guest memory could be used for such a shared location ?
> Is there a way for qemu to identify the memory locations that have been
> shared ?

When there's no vIOMMU I think all guest pages need to be shared.  With a
vIOMMU it depends on what was mapped by the guest drivers, while in
most sane setups they can still always be shared because the guest OS (if
Linux) should normally have iommu=pt speeding up kernel drivers.

> 
> Could you please let me know if there is an entry point I should consider ?

IMHO it would still be more reasonable to tackle this issue in the
kernel rather than in userspace, simply because it's a problem shared by
all userspaces rather than by the QEMU process alone.

With that, the kernel should guarantee consistency across the different
processes accessing these pages, so logically all this complexity is
better handled in the kernel once and for all.

There are indeed difficulties in providing it in hugetlbfs with the mm
community, and this is also not the only effort trying to fix 1G page
poisoning with userspace workarounds, see:

https://lore.kernel.org/r/20240924043924.3562257-1-jiaqiyan@google.com

My gut feeling is that either hugetlbfs needs to be fixed (with less hope) or
QEMU in general needs to move over to other file systems for consuming huge
pages.  Poisoning is not the only driving force; at least we also want to
work out postcopy, which, as David said, has a similar goal of being able
to map hugetlbfs pages differently.

You may want to have a look at the gmemfd 1G proposal, posted here:

https://lore.kernel.org/r/cover.1726009989.git.ackerleytng@google.com

We probably need that in one way or another for CoCo, and the chance is it
can easily support non-CoCo with the same interface ultimately.  Then 1G
hugetlbfs can be abandoned in QEMU.  It'll also need to tackle the same
challenge here either on page poisoning, or postcopy, with/without QEMU's
specific solution, because QEMU is also not the only userspace hypervisor.

That said, the initial few small patches seem to be standalone small fixes
which may still be good.  So if you think that's the case you can at least
consider sending them separately without RFC tag.

Thanks,

-- 
Peter Xu

[PATCH v1 0/4] hugetlbfs memory HW error fixes

Posted by William Roche 1 month ago

From: William Roche <william.roche@oracle.com>

This set of patches fixes several problems with hardware memory errors
impacting hugetlbfs memory backed VMs. When using hugetlbfs large
pages, any large page location being impacted by an HW memory error
results in poisoning the entire page, suddenly making a large chunk of
the VM memory unusable.

The main problem that currently exists in Qemu is the lack of backend
file repair before resetting the VM memory, which leaves the impacted
memory silently unusable even after a VM reboot.

In order to fix this issue, we track the SIGBUS page size information
when informed of a HW error (with si_addr_lsb) and record the size with
the appropriate poisoned page location. On recording a large page
position, we take note of the beginning of the page and its size. The
size information is taken from the backend file page_size value.
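
As a rough standalone illustration of this bookkeeping (not the code of the
patches), the size can be derived from si_addr_lsb and the start of the
page obtained by rounding the reported address down. The helper names
poison_size_from_lsb() and poison_page_start() are hypothetical, and the
fallback argument plays the role of the default page size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert the si_addr_lsb value of a SIGBUS siginfo into a size in bytes.
 * A value of 0 (or less) is treated as "not reported" and the caller's
 * fallback page size is returned instead.
 */
size_t poison_size_from_lsb(short addr_lsb, size_t fallback)
{
    if (addr_lsb <= 0) {
        return fallback;
    }
    /* Shift a size_t rather than an int so large pages cannot overflow. */
    assert((size_t)addr_lsb < sizeof(size_t) * 8);
    return (size_t)1 << addr_lsb;
}

/* Start address of the poisoned page; sz must be a power of two. */
uintptr_t poison_page_start(uintptr_t addr, size_t sz)
{
    assert(sz != 0 && (sz & (sz - 1)) == 0);
    return addr & ~((uintptr_t)sz - 1);
}
```

For a 2 MiB page, si_addr_lsb would be 21, and any address inside the page
rounds down to the same start address.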

We also report the loss of a large page of memory, only once, when the
page is poisoned, to make these situations easier to debug.

This code is scripts/checkpatch.pl clean.
'make check' runs fine on both x86 and ARM.
Unit tests have been successfully run on x86,
but the ARM VM doesn't deal with several errors on different memory
locations triggered too quickly one after another (which is the case
when a hugetlbfs page is poisoned) and either aborts after a
"failed to record the error" message or becomes unresponsive.


William Roche (4):
  accel/kvm: SIGBUS handler should also deal with si_addr_lsb
  accel/kvm: Keep track of the HWPoisonPage page_size
  system/physmem: Largepage punch hole before reset of memory pages
  accel/kvm: Report the loss of a large memory page

 accel/kvm/kvm-all.c       | 27 +++++++++++++++++++++------
 accel/stubs/kvm-stub.c    |  4 ++--
 include/exec/cpu-common.h |  1 +
 include/qemu/osdep.h      |  5 +++--
 include/sysemu/kvm.h      |  7 ++++---
 include/sysemu/kvm_int.h  |  7 +++++--
 system/cpus.c             |  6 ++++--
 system/physmem.c          | 28 ++++++++++++++++++++++++++++
 target/arm/kvm.c          |  8 ++++++--
 target/i386/kvm/kvm.c     |  8 ++++++--
 util/oslib-posix.c        |  3 +++
 11 files changed, 83 insertions(+), 21 deletions(-)

-- 
2.43.5

[PATCH v1 1/4] accel/kvm: SIGBUS handler should also deal with si_addr_lsb

Posted by William Roche 1 month ago

From: William Roche <william.roche@oracle.com>

The SIGBUS signal siginfo reporting a HW memory error
provides a si_addr_lsb field with an indication of the
impacted memory page size.
This information should be used to track the hwpoisoned
page sizes.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c    | 6 ++++--
 accel/stubs/kvm-stub.c | 4 ++--
 include/qemu/osdep.h   | 5 +++--
 include/sysemu/kvm.h   | 4 ++--
 system/cpus.c          | 6 ++++--
 util/oslib-posix.c     | 3 +++
 6 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 801cff16a5..2adc4d9c24 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2940,6 +2940,7 @@ void kvm_cpu_synchronize_pre_loadvm(CPUState *cpu)
 #ifdef KVM_HAVE_MCE_INJECTION
 static __thread void *pending_sigbus_addr;
 static __thread int pending_sigbus_code;
+static __thread short pending_sigbus_addr_lsb;
 static __thread bool have_sigbus_pending;
 #endif
 
@@ -3651,7 +3652,7 @@ void kvm_init_cpu_signals(CPUState *cpu)
 }
 
 /* Called asynchronously in VCPU thread.  */
-int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
+int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr, short addr_lsb)
 {
 #ifdef KVM_HAVE_MCE_INJECTION
     if (have_sigbus_pending) {
@@ -3660,6 +3661,7 @@ int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
     have_sigbus_pending = true;
     pending_sigbus_addr = addr;
     pending_sigbus_code = code;
+    pending_sigbus_addr_lsb = addr_lsb;
     qatomic_set(&cpu->exit_request, 1);
     return 0;
 #else
@@ -3668,7 +3670,7 @@ int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
 }
 
 /* Called synchronously (via signalfd) in main thread.  */
-int kvm_on_sigbus(int code, void *addr)
+int kvm_on_sigbus(int code, void *addr, short addr_lsb)
 {
 #ifdef KVM_HAVE_MCE_INJECTION
     /* Action required MCE kills the process if SIGBUS is blocked.  Because
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 8e0eb22e61..80780433d8 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -38,12 +38,12 @@ bool kvm_has_sync_mmu(void)
     return false;
 }
 
-int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
+int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr, short addr_lsb)
 {
     return 1;
 }
 
-int kvm_on_sigbus(int code, void *addr)
+int kvm_on_sigbus(int code, void *addr, short addr_lsb)
 {
     return 1;
 }
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index fe7c3c5f67..838271c4b8 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -585,8 +585,9 @@ struct qemu_signalfd_siginfo {
     uint64_t ssi_stime;   /* System CPU time consumed (SIGCHLD) */
     uint64_t ssi_addr;    /* Address that generated signal
                              (for hardware-generated signals) */
-    uint8_t  pad[48];     /* Pad size to 128 bytes (allow for
-                             additional fields in the future) */
+    uint16_t ssi_addr_lsb;/* Least significant bit of address (SIGBUS) */
+    uint8_t  pad[46];     /* Pad size to 128 bytes (allow for */
+                          /* additional fields in the future) */
 };
 
 int qemu_signalfd(const sigset_t *mask);
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index c3a60b2890..1bde598404 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -207,8 +207,8 @@ int kvm_has_gsi_routing(void);
 bool kvm_arm_supports_user_irq(void);
 
 
-int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr);
-int kvm_on_sigbus(int code, void *addr);
+int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr, short addr_lsb);
+int kvm_on_sigbus(int code, void *addr, short addr_lsb);
 
 #ifdef COMPILING_PER_TARGET
 #include "cpu.h"
diff --git a/system/cpus.c b/system/cpus.c
index 1c818ff682..12e630f760 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -376,12 +376,14 @@ static void sigbus_handler(int n, siginfo_t *siginfo, void *ctx)
 
     if (current_cpu) {
         /* Called asynchronously in VCPU thread.  */
-        if (kvm_on_sigbus_vcpu(current_cpu, siginfo->si_code, siginfo->si_addr)) {
+        if (kvm_on_sigbus_vcpu(current_cpu, siginfo->si_code,
+                               siginfo->si_addr, siginfo->si_addr_lsb)) {
             sigbus_reraise();
         }
     } else {
         /* Called synchronously (via signalfd) in main thread.  */
-        if (kvm_on_sigbus(siginfo->si_code, siginfo->si_addr)) {
+        if (kvm_on_sigbus(siginfo->si_code,
+                          siginfo->si_addr, siginfo->si_addr_lsb)) {
             sigbus_reraise();
         }
     }
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 11b35e48fb..64517d1e40 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -767,6 +767,9 @@ void sigaction_invoke(struct sigaction *action,
     } else if (info->ssi_signo == SIGILL || info->ssi_signo == SIGFPE ||
                info->ssi_signo == SIGSEGV || info->ssi_signo == SIGBUS) {
         si.si_addr = (void *)(uintptr_t)info->ssi_addr;
+        if (info->ssi_signo == SIGBUS) {
+            si.si_addr_lsb = (short int)info->ssi_addr_lsb;
+        }
     } else if (info->ssi_signo == SIGCHLD) {
         si.si_pid = info->ssi_pid;
         si.si_status = info->ssi_status;
-- 
2.43.5

[PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by William Roche 1 month ago

From: William Roche <william.roche@oracle.com>

Add the page size information to the hwpoison_page_list elements.
As the kernel doesn't always report the actual poisoned page size,
we adjust this size from the backend real page size.
We take into account the recorded page size to adjust the size
and location of the memory hole.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c       | 14 ++++++++++----
 include/exec/cpu-common.h |  1 +
 include/sysemu/kvm.h      |  3 ++-
 include/sysemu/kvm_int.h  |  3 ++-
 system/physmem.c          | 20 ++++++++++++++++++++
 target/arm/kvm.c          |  8 ++++++--
 target/i386/kvm/kvm.c     |  8 ++++++--
 7 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 2adc4d9c24..40117eefa7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
  */
 typedef struct HWPoisonPage {
     ram_addr_t ram_addr;
+    size_t     page_size;
     QLIST_ENTRY(HWPoisonPage) list;
 } HWPoisonPage;
 
@@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
 
     QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
         QLIST_REMOVE(page, list);
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr, page->page_size);
         g_free(page);
     }
 }
 
-void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz)
 {
     HWPoisonPage *page;
 
+    if (sz > TARGET_PAGE_SIZE)
+        ram_addr = ROUND_DOWN(ram_addr, sz);
+
     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
             return;
@@ -1294,6 +1298,7 @@ void kvm_hwpoison_page_add(ram_addr_t ram_addr)
     }
     page = g_new(HWPoisonPage, 1);
     page->ram_addr = ram_addr;
+    page->page_size = sz;
     QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
 }
 
@@ -3140,7 +3145,8 @@ int kvm_cpu_exec(CPUState *cpu)
         if (unlikely(have_sigbus_pending)) {
             bql_lock();
             kvm_arch_on_sigbus_vcpu(cpu, pending_sigbus_code,
-                                    pending_sigbus_addr);
+                                    pending_sigbus_addr,
+                                    pending_sigbus_addr_lsb);
             have_sigbus_pending = false;
             bql_unlock();
         }
@@ -3678,7 +3684,7 @@ int kvm_on_sigbus(int code, void *addr, short addr_lsb)
      * we can only get action optional here.
      */
     assert(code != BUS_MCEERR_AR);
-    kvm_arch_on_sigbus_vcpu(first_cpu, code, addr);
+    kvm_arch_on_sigbus_vcpu(first_cpu, code, addr, addr_lsb);
     return 0;
 #else
     return 1;
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 638dc806a5..b971b13306 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -108,6 +108,7 @@ bool qemu_ram_is_named_file(RAMBlock *rb);
 int qemu_ram_get_fd(RAMBlock *rb);
 
 size_t qemu_ram_pagesize(RAMBlock *block);
+size_t qemu_ram_pagesize_from_host(void *addr);
 size_t qemu_ram_pagesize_largest(void);
 
 /**
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 1bde598404..4106a7ec07 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -383,7 +383,8 @@ bool kvm_vcpu_id_is_valid(int vcpu_id);
 unsigned long kvm_arch_vcpu_id(CPUState *cpu);
 
 #ifdef KVM_HAVE_MCE_INJECTION
-void kvm_arch_on_sigbus_vcpu(CPUState *cpu, int code, void *addr);
+void kvm_arch_on_sigbus_vcpu(CPUState *cpu, int code, void *addr,
+                             short addr_lsb);
 #endif
 
 void kvm_arch_init_irq_routing(KVMState *s);
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index a1e72763da..d2160be0ae 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -178,10 +178,11 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size);
  *
  * Parameters:
  *  @ram_addr: the address in the RAM for the poisoned page
+ *  @sz: size of the poisoned page as reported by the kernel
  *
  * Add a poisoned page to the list
  *
  * Return: None.
  */
-void kvm_hwpoison_page_add(ram_addr_t ram_addr);
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz);
 #endif
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..3757428336 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1665,6 +1665,26 @@ size_t qemu_ram_pagesize(RAMBlock *rb)
     return rb->page_size;
 }
 
+/* Returns backend real page size used for the given address */
+size_t qemu_ram_pagesize_from_host(void *addr)
+{
+    RAMBlock *rb;
+    ram_addr_t offset;
+
+    /*
+     * XXX kernel provided size is not reliable...
+     * As kvm_send_hwpoison_signal() uses a hard-coded PAGE_SHIFT
+     * signal value on hwpoison signal.
+     * So we must identify the actual size to consider from the
+     * mapping block pagesize.
+     */
+    rb =  qemu_ram_block_from_host(addr, false, &offset);
+    if (!rb) {
+        return TARGET_PAGE_SIZE;
+    }
+    return qemu_ram_pagesize(rb);
+}
+
 /* Returns the largest size of page in use */
 size_t qemu_ram_pagesize_largest(void)
 {
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index f1f1b5b375..11579e170b 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2348,10 +2348,11 @@ int kvm_arch_get_registers(CPUState *cs, Error **errp)
     return ret;
 }
 
-void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
+void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr, short addr_lsb)
 {
     ram_addr_t ram_addr;
     hwaddr paddr;
+    size_t sz = (addr_lsb > 0) ? (1 << addr_lsb) : TARGET_PAGE_SIZE;
 
     assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
 
@@ -2359,7 +2360,10 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-            kvm_hwpoison_page_add(ram_addr);
+            if (sz == TARGET_PAGE_SIZE) {
+                sz = qemu_ram_pagesize_from_host(addr);
+            }
+            kvm_hwpoison_page_add(ram_addr, sz);
             /*
              * If this is a BUS_MCEERR_AR, we know we have been called
              * synchronously from the vCPU thread, so we can easily
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index fd9f198892..71e674bca0 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -735,12 +735,13 @@ static void hardware_memory_error(void *host_addr)
     exit(1);
 }
 
-void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
+void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr, short addr_lsb)
 {
     X86CPU *cpu = X86_CPU(c);
     CPUX86State *env = &cpu->env;
     ram_addr_t ram_addr;
     hwaddr paddr;
+    size_t sz = (addr_lsb > 0) ? (1 << addr_lsb) : TARGET_PAGE_SIZE;
 
     /* If we get an action required MCE, it has been injected by KVM
      * while the VM was running.  An action optional MCE instead should
@@ -753,7 +754,10 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-            kvm_hwpoison_page_add(ram_addr);
+            if (sz == TARGET_PAGE_SIZE) {
+                sz = qemu_ram_pagesize_from_host(addr);
+            }
+            kvm_hwpoison_page_add(ram_addr, sz);
             kvm_mce_inject(cpu, paddr, code);
 
             /*
-- 
2.43.5

Re: [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by David Hildenbrand 1 month ago

On 22.10.24 23:35, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> Add the page size information to the hwpoison_page_list elements.
> As the kernel doesn't always report the actual poisoned page size,
> we adjust this size from the backend real page size.
> We take into account the recorded page size to adjust the size
> and location of the memory hole.
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   accel/kvm/kvm-all.c       | 14 ++++++++++----
>   include/exec/cpu-common.h |  1 +
>   include/sysemu/kvm.h      |  3 ++-
>   include/sysemu/kvm_int.h  |  3 ++-
>   system/physmem.c          | 20 ++++++++++++++++++++
>   target/arm/kvm.c          |  8 ++++++--
>   target/i386/kvm/kvm.c     |  8 ++++++--
>   7 files changed, 47 insertions(+), 10 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 2adc4d9c24..40117eefa7 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
>    */
>   typedef struct HWPoisonPage {
>       ram_addr_t ram_addr;
> +    size_t     page_size;
>       QLIST_ENTRY(HWPoisonPage) list;
>   } HWPoisonPage;
>   
> @@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
>   
>       QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>           QLIST_REMOVE(page, list);
> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
> +        qemu_ram_remap(page->ram_addr, page->page_size);

Can't we just use the page size from the RAMBlock in qemu_ram_remap? 
There we lookup the RAMBlock, and all pages in a RAMBlock have the same 
size.

I'll note that qemu_ram_remap() is rather stupid and optimized only for 
private memory (not shmem etc).

mmap(MAP_FIXED|MAP_SHARED, fd) will give you the same poisoned page from 
the pagecache; you'd have to punch a hole instead.

It might be better to use ram_block_discard_range() in the long run. 
Memory preallocation + page pinning is tricky, but we could simply bail 
out in these cases (preallocation failing, ram discard being disabled).

qemu_ram_remap() might be problematic with page pinning (vfio) as is in 
any way :(
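
For reference, punching a hole in the backing file on Linux could look
roughly like the sketch below. This is not what qemu_ram_remap() or
ram_block_discard_range() do today; punch_poisoned_page() and
page_align_down() are hypothetical helper names used for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stddef.h>
#include <sys/types.h>

/* Round offset down to the start of its page; page_sz is a power of two. */
off_t page_align_down(off_t offset, size_t page_sz)
{
    assert(page_sz != 0 && (page_sz & (page_sz - 1)) == 0);
    return offset & ~((off_t)page_sz - 1);
}

/*
 * Release the backing page covering offset in fd while keeping the file
 * size unchanged, so that the next fault allocates a fresh page.
 * Returns 0 on success, -1 with errno set on failure.
 */
int punch_poisoned_page(int fd, off_t offset, size_t page_sz)
{
    off_t start = page_align_down(offset, page_sz);

    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     start, (off_t)page_sz);
}
```

Whether the hole can be punched depends on the filesystem backing the
file, so a caller would still need a fallback when the call fails.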

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by William Roche 4 weeks, 1 day ago

On 10/23/24 09:28, David Hildenbrand wrote:
> On 22.10.24 23:35, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> Add the page size information to the hwpoison_page_list elements.
>> As the kernel doesn't always report the actual poisoned page size,
>> we adjust this size from the backend real page size.
>> We take into account the recorded page size to adjust the size
>> and location of the memory hole.
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   accel/kvm/kvm-all.c       | 14 ++++++++++----
>>   include/exec/cpu-common.h |  1 +
>>   include/sysemu/kvm.h      |  3 ++-
>>   include/sysemu/kvm_int.h  |  3 ++-
>>   system/physmem.c          | 20 ++++++++++++++++++++
>>   target/arm/kvm.c          |  8 ++++++--
>>   target/i386/kvm/kvm.c     |  8 ++++++--
>>   7 files changed, 47 insertions(+), 10 deletions(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 2adc4d9c24..40117eefa7 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned 
>> int extension)
>>    */
>>   typedef struct HWPoisonPage {
>>       ram_addr_t ram_addr;
>> +    size_t     page_size;
>>       QLIST_ENTRY(HWPoisonPage) list;
>>   } HWPoisonPage;
>> @@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
>>       QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>>           QLIST_REMOVE(page, list);
>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>> +        qemu_ram_remap(page->ram_addr, page->page_size);
> 
> Can't we just use the page size from the RAMBlock in qemu_ram_remap? 
> There we lookup the RAMBlock, and all pages in a RAMBlock have the same 
> size.

Yes, we could use the page size from the RAMBlock in qemu_ram_remap(), 
which is called when the VM is resetting. But I think that knowing the 
size of the poisoned chunk of memory when the poison is created is 
useful to give a trace of what is going on, before possibly seeing 
other pages being reported as poisoned. That is the goal of the 4th 
patch: to report this information as soon as we get it.
It also helps to filter the newly reported errors and only create an 
entry in the hwpoison_page_list for new large pages.

Now, we could delay the page size retrieval until we are resetting and 
present the information then (post mortem). I do think that having the 
information earlier is better in this case.
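
As a rough sketch of the filtering idea described above (not the code from the patch): the reported address is aligned down to the backend page size, and a list entry is only created if that large page is not already recorded. PoisonEntry and poison_list_add() are hypothetical names used for illustration; the real list in QEMU uses HWPoisonPage and the QLIST macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an hwpoison_page_list element. */
typedef struct PoisonEntry {
    uint64_t addr;            /* start of the page, aligned to page_size */
    size_t page_size;         /* backend page size, e.g. 2 MiB for hugetlbfs */
    struct PoisonEntry *next;
} PoisonEntry;

/*
 * Record a poisoned address. Returns true if a new entry was created,
 * false if the enclosing (large) page was already in the list.
 */
bool poison_list_add(PoisonEntry **head, uint64_t addr, size_t page_size)
{
    uint64_t start;
    PoisonEntry *e;

    /* The alignment trick below requires a power-of-two page size. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    start = addr & ~((uint64_t)page_size - 1);

    for (e = *head; e != NULL; e = e->next) {
        if (e->addr == start) {
            return false;     /* same large page reported again */
        }
    }

    e = malloc(sizeof(*e));
    if (e == NULL) {
        abort();
    }
    e->addr = start;
    e->page_size = page_size;
    e->next = *head;
    *head = e;
    return true;
}
```

With a 2 MiB page size, two errors at different offsets inside the same large page produce a single entry, so the size can be traced once when the first error arrives.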

> 
> I'll note that qemu_ram_remap() is rather stupid and optimized only for 
> private memory (not shmem etc).
> 
> mmap(MAP_FIXED|MAP_SHARED, fd) will give you the same poisoned page from 
> the pagecache; you'd have to punch a hole instead.
> 
> It might be better to use ram_block_discard_range() in the long run. 
> Memory preallocation + page pinning is tricky, but we could simply bail 
> out in these cases (preallocation failing, ram discard being disabled).

I see that ram_block_discard_range() performs more checks before discarding 
the RAM region and can also call madvise() in addition to the fallocate 
punch hole for standard sized memory pages. Now, as the range is supposed 
to be recreated, I'm not convinced that these madvise calls are necessary.

But we can also notice that this function reports the following 
warning for all non-shared file backends:
"ram_block_discard_range: Discarding RAM in private file mappings is 
possibly dangerous, because it will modify the underlying file and will 
affect other users of the file"
This means that hugetlbfs configurations will see this new cryptic 
warning message on reboot if they are impacted by memory poisoning.
So I would prefer to leave the fallocate call in the qemu_ram_remap() 
function. Or would you prefer to enhance the ram_block_discard_range() 
code to avoid the message in a reset situation (when called from 
qemu_ram_remap())?

> 
> qemu_ram_remap() might be problematic with page pinning (vfio) as is in 
> any way :(
> 

I agree. If qemu_ram_remap() fails, QEMU is terminated with either abort() 
or exit(1). Are you saying that memory pinning could be detected by 
ram_block_discard_range(), or maybe by the mmap call for the impacted 
region, and make one of them fail? That would be an additional reason to 
call ram_block_discard_range() from qemu_ram_remap(). Is that what you are 
suggesting?

Re: [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by David Hildenbrand 3 weeks, 5 days ago

On 26.10.24 01:27, William Roche wrote:
> On 10/23/24 09:28, David Hildenbrand wrote:
> 
>> On 22.10.24 23:35, William Roche wrote:
>>> From: William Roche <william.roche@oracle.com>
>>>
>>> Add the page size information to the hwpoison_page_list elements.
>>> As the kernel doesn't always report the actual poisoned page size,
>>> we adjust this size from the backend real page size.
>>> We take into account the recorded page size to adjust the size
>>> and location of the memory hole.
>>>
>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>> ---
>>>   accel/kvm/kvm-all.c       | 14 ++++++++++----
>>>   include/exec/cpu-common.h |  1 +
>>>   include/sysemu/kvm.h      |  3 ++-
>>>   include/sysemu/kvm_int.h  |  3 ++-
>>>   system/physmem.c          | 20 ++++++++++++++++++++
>>>   target/arm/kvm.c          |  8 ++++++--
>>>   target/i386/kvm/kvm.c     |  8 ++++++--
>>>   7 files changed, 47 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>> index 2adc4d9c24..40117eefa7 100644
>>> --- a/accel/kvm/kvm-all.c
>>> +++ b/accel/kvm/kvm-all.c
>>> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, 
>>> unsigned int extension)
>>>    */
>>>   typedef struct HWPoisonPage {
>>>       ram_addr_t ram_addr;
>>> +    size_t     page_size;
>>>       QLIST_ENTRY(HWPoisonPage) list;
>>>   } HWPoisonPage;
>>>   @@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
>>>         QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>>>           QLIST_REMOVE(page, list);
>>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>>> +        qemu_ram_remap(page->ram_addr, page->page_size);
>>
>> Can't we just use the page size from the RAMBlock in qemu_ram_remap? 
>> There we lookup the RAMBlock, and all pages in a RAMBlock have the 
>> same size.
> 
> 
> Yes, we could use the page size from the RAMBlock in qemu_ram_remap() 
> that is called when the VM is resetting. I think that knowing the 
> information about the size of poisoned chunk of memory when the poison 
> is created is useful to give a trace of what is going on, before seeing 
> maybe other pages being reported as poisoned. That's the 4th patch goal 
> to give an information as soon as we get it.
> It also helps to filter the new errors reported and only create an entry 
> in the hwpoison_page_list for new large pages.
> Now we could delay the page size retrieval until we are resetting and 
> present the information (post mortem). I do think that having the 
> information earlier is better in this case.

If it is not required for this patch, then please move the other stuff 
to patch #4.

Here, we really only have to discard a large page, which we can derive 
from the QEMU RAMBlock page size.

> 
> 
>>
>> I'll note that qemu_ram_remap() is rather stupid and optimized only 
>> for private memory (not shmem etc).
>>
>> mmap(MAP_FIXED|MAP_SHARED, fd) will give you the same poisoned page 
>> from the pagecache; you'd have to punch a hole instead.
>>
>> It might be better to use ram_block_discard_range() in the long run. 
>> Memory preallocation + page pinning is tricky, but we could simply 
>> bail out in these cases (preallocation failing, ram discard being 
>> disabled).
> 
> 
> I see that ram_block_discard_range() adds more control before discarding 
> the RAM region and can also call madvise() in addition to the fallocate 
> punch hole for standard sized memory pages. Now as the range is supposed 
> to be recreated, I'm not convinced that these madvise calls are necessary.

They are the proper replacement for the mmap(MAP_FIXED) + fallocate.

That function handles all cases of properly discarding guest RAM.

> 
> But we can also notice that this function will report the following 
> warning in all cases of not shared file backends:
> "ram_block_discard_range: Discarding RAM in private file mappings is 
> possibly dangerous, because it will modify the underlying file and will 
> affect other users of the file"

Yes, because it's a clear warning sign that something weird is 
happening. You might be throwing away data that some other process might 
be relying on.

How are you making QEMU consume hugetlbs?

We could suppress these warnings, but let's first see how you are able 
to trigger it.

> Which means that hugetlbfs configurations do see this new cryptic 
> warning message on reboot if it is impacted by a memory poisoning.
> So I would prefer to leave the fallocate call in the qemu_ram_remap() 
> function. Or would you prefer to enhance ram_block_discard_range() code 
> to avoid the message in a reset situation (when called from qemu_ram_remap)?

Please try reusing the mechanism to discard guest RAM instead of 
open-coding this. We still have to use mmap(MAP_FIXED) as a backup, but 
otherwise this function should mostly do+check what you need.

(-warnings we might want to report differently / suppress)

If you want, I can start a quick prototype of what it could look like 
when using ram_block_discard_range() + ram_block_discard_is_disabled() + 
fallback to existing mmap(MAP_FIXED).

> 
> 
>>
>> qemu_ram_remap() might be problematic with page pinning (vfio) as is 
>> in any way :(
> 
> I agree. If qemu_ram_remap() fails, Qemu is ended either abort() or 
> exit(1). Do you say that memory pinning could be detected by 
> ram_block_discard_range() or maybe mmap call for the impacted region and 
> make one of them fail ? This would be an additional reason to call 
> ram_block_discard_range() from qemu_ram_remap().   Is it what you are 
> suggesting ?

ram_block_discard_is_disabled() might be the right test. If discarding 
is disabled, then rebooting might create an inconsistency with, 
e.g., vfio, resulting in the issues we know from memory ballooning where 
the state vfio sees will be different from the state the guest kernel 
sees. It's tricky ... and we'd much rather quit the VM early instead of 
corrupting data later :/

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by William Roche 3 weeks, 4 days ago

On 10/28/24 17:42, David Hildenbrand wrote:
> On 26.10.24 01:27, William Roche wrote:
>> On 10/23/24 09:28, David Hildenbrand wrote:
>>
>>> On 22.10.24 23:35, William Roche wrote:
>>>> From: William Roche <william.roche@oracle.com>
>>>>
>>>> Add the page size information to the hwpoison_page_list elements.
>>>> As the kernel doesn't always report the actual poisoned page size,
>>>> we adjust this size from the backend real page size.
>>>> We take into account the recorded page size to adjust the size
>>>> and location of the memory hole.
>>>>
>>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>>> ---
>>>>   accel/kvm/kvm-all.c       | 14 ++++++++++----
>>>>   include/exec/cpu-common.h |  1 +
>>>>   include/sysemu/kvm.h      |  3 ++-
>>>>   include/sysemu/kvm_int.h  |  3 ++-
>>>>   system/physmem.c          | 20 ++++++++++++++++++++
>>>>   target/arm/kvm.c          |  8 ++++++--
>>>>   target/i386/kvm/kvm.c     |  8 ++++++--
>>>>   7 files changed, 47 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>>> index 2adc4d9c24..40117eefa7 100644
>>>> --- a/accel/kvm/kvm-all.c
>>>> +++ b/accel/kvm/kvm-all.c
>>>> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, 
>>>> unsigned int extension)
>>>>    */
>>>>   typedef struct HWPoisonPage {
>>>>       ram_addr_t ram_addr;
>>>> +    size_t     page_size;
>>>>       QLIST_ENTRY(HWPoisonPage) list;
>>>>   } HWPoisonPage;
>>>>   @@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
>>>>         QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, 
>>>> next_page) {
>>>>           QLIST_REMOVE(page, list);
>>>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>>>> +        qemu_ram_remap(page->ram_addr, page->page_size);
>>>
>>> Can't we just use the page size from the RAMBlock in qemu_ram_remap? 
>>> There we lookup the RAMBlock, and all pages in a RAMBlock have the 
>>> same size.
>>
>>
>> Yes, we could use the page size from the RAMBlock in qemu_ram_remap() 
>> that is called when the VM is resetting. I think that knowing the 
>> information about the size of poisoned chunk of memory when the poison 
>> is created is useful to give a trace of what is going on, before 
>> seeing maybe other pages being reported as poisoned. That's the 4th 
>> patch goal to give an information as soon as we get it.
>> It also helps to filter the new errors reported and only create an 
>> entry in the hwpoison_page_list for new large pages.
>> Now we could delay the page size retrieval until we are resetting and 
>> present the information (post mortem). I do think that having the 
>> information earlier is better in this case.
> 
> If it is not required for this patch, then please move the other stuff 
> to patch #4.
> 
> Here, we really only have to discard a large page, which we can derive 
> from the QEMU RAMBlock page size.


OK, I can remove the first patch, which was created to track the page size 
provided by the kernel and pass it to the kvm_hwpoison_page_add() function. 
Instead, we can deal with the page size at the kvm_hwpoison_page_add() 
function level, as we don't rely on the info provided by the kernel but 
just on the RAMBlock page size.

I'll send a new version with this modification.


>>
>>
>>>
>>> I'll note that qemu_ram_remap() is rather stupid and optimized only 
>>> for private memory (not shmem etc).
>>>
>>> mmap(MAP_FIXED|MAP_SHARED, fd) will give you the same poisoned page 
>>> from the pagecache; you'd have to punch a hole instead.
>>>
>>> It might be better to use ram_block_discard_range() in the long run. 
>>> Memory preallocation + page pinning is tricky, but we could simply 
>>> bail out in these cases (preallocation failing, ram discard being 
>>> disabled).
>>
>>
>> I see that ram_block_discard_range() adds more control before 
>> discarding the RAM region and can also call madvise() in addition to 
>> the fallocate punch hole for standard sized memory pages. Now as the 
>> range is supposed to be recreated, I'm not convinced that these 
>> madvise calls are necessary.
> 
> They are the proper replacement for the mmap(MAP_FIXED) + fallocate.
> 
> That function handles all cases of properly discarding guest RAM.

In the case of hugetlbfs pages, ram_block_discard_range() does the 
punch-hole fallocate call (and prints out the warning messages).
The madvise call is only done when (rb->page_size == 
qemu_real_host_page_size()), which isn't true for hugetlbfs.
So need_madvise is false and neither the QEMU_MADV_REMOVE nor the 
QEMU_MADV_DONTNEED madvise call is performed.
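
The decision described above can be sketched as follows. This is a simplified model, not the actual ram_block_discard_range() code: BlockInfo, DiscardPlan and plan_discard() are hypothetical names, and the rule that fallocate is needed whenever the block has a backing file descriptor is an assumption that is not spelled out in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the RAM block fields discussed here. */
typedef struct {
    size_t page_size;   /* page size of the backend (e.g. 2 MiB hugetlbfs) */
    int fd;             /* backing file descriptor, or -1 for anonymous RAM */
} BlockInfo;

typedef struct {
    bool need_fallocate;   /* punch a hole in the backing file */
    bool need_madvise;     /* also issue a madvise() discard */
} DiscardPlan;

/* Decide which discard operations apply to a block. */
DiscardPlan plan_discard(const BlockInfo *rb, size_t host_page_size)
{
    DiscardPlan plan;

    assert(rb != NULL);
    /* madvise only when the block uses the host's standard page size */
    plan.need_madvise = (rb->page_size == host_page_size);
    /* assumption: any file-backed block gets the fallocate punch hole */
    plan.need_fallocate = (rb->fd != -1);
    return plan;
}
```

For a hugetlbfs-backed block on a host with 4 KiB pages, this yields need_fallocate = true and need_madvise = false, which matches the behavior described above.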


> 
>>
>> But we can also notice that this function will report the following 
>> warning in all cases of not shared file backends:
>> "ram_block_discard_range: Discarding RAM in private file mappings is 
>> possibly dangerous, because it will modify the underlying file and 
>> will affect other users of the file"
> 
> Yes, because it's a clear warning sign that something weird is 
> happening. You might be throwing away data that some other process might 
> be relying on.
> 
> How are you making QEMU consume hugetlbs?

A classical way to consume (not shared) hugetlbfs pages is to create a 
file that is opened and mmapped by the QEMU instance, and then to delete 
the file system entry so that the resources are released if the QEMU 
instance dies. This file is usually not shared.
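
A minimal sketch of that pattern, assuming a regular temporary file as a stand-in for a file on a hugetlbfs mount (on hugetlbfs the length would have to be a multiple of the huge page size). The helper name map_unlinked_private() is hypothetical and this is not how QEMU itself creates its backend.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a file from a mkstemp() template, map it
 * MAP_PRIVATE, then remove its directory entry. The mapping and the fd
 * keep the file alive until the process exits or closes them.
 * Returns the mapping (and the fd through fd_out), or NULL on failure.
 */
void *map_unlinked_private(char *path_template, size_t len, int *fd_out)
{
    void *area;
    int fd;

    assert(path_template != NULL && fd_out != NULL && len > 0);

    fd = mkstemp(path_template);          /* creates and opens the file */
    if (fd < 0) {
        return NULL;
    }
    if (ftruncate(fd, (off_t)len) != 0) {
        goto fail;
    }
    area = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (area == MAP_FAILED) {
        goto fail;
    }
    unlink(path_template);                /* no name left in the file system */
    *fd_out = fd;
    return area;

fail:
    unlink(path_template);
    close(fd);
    return NULL;
}
```

Because the name is removed right after the mapping is created, no other process can open the file by path afterwards, which is why such a file is usually not shared in practice.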


> 
> We could suppress these warnings, but let's first see how you are able 
> to trigger it.

The warning is always displayed when such a hugetlbfs VM impacted by a 
memory error is rebooted.
I understand the reason why we have this message, but in the classical 
hugetlbfs use case this (new) message on reboot is probably too 
worrying... But losing memory is already very worrying ;)


> 
>> Which means that hugetlbfs configurations do see this new cryptic 
>> warning message on reboot if it is impacted by a memory poisoning.
>> So I would prefer to leave the fallocate call in the qemu_ram_remap() 
>> function. Or would you prefer to enhance ram_block_discard_range() code 
>> to avoid the message in a reset situation (when called from 
>> qemu_ram_remap)?
> 
> Please try reusing the mechanism to discard guest RAM instead of 
> open-coding this. We still have to use mmap(MAP_FIXED) as a backup, but 
> otherwise this function should mostly do+check what you need.
> 
> (-warnings we might want to report differently / suppress)
> 
> If you want, I can start a quick prototype of what it could look like 
> when using ram_block_discard_range() + ram_block_discard_is_disabled() + 
> fallback to existing mmap(MAP_FIXED).

I just want to note that the reason why need_madvise is used is that 
"DONTNEED fails for hugepages but fallocate works on hugepages 
and shmem". In fact, MADV_REMOVE support on hugetlbfs only appeared in 
kernel v4.3, and MADV_DONTNEED support only appeared in v5.18.

Our QEMU code avoids calling these madvise operations for hugepages, as 
we need to have:
(rb->page_size == qemu_real_host_page_size())

That's one reason why we have to remap the "hole-punched" section of the 
file when using hugepages.

>>
>>
>>>
>>> qemu_ram_remap() might be problematic with page pinning (vfio) as is 
>>> in any way :(
>>
>> I agree. If qemu_ram_remap() fails, Qemu is ended either abort() or 
>> exit(1). Do you say that memory pinning could be detected by 
>> ram_block_discard_range() or maybe mmap call for the impacted region 
>> and make one of them fail ? This would be an additional reason to call 
>> ram_block_discard_range() from qemu_ram_remap().   Is it what you are 
>> suggesting ?
> 
> ram_block_discard_is_disabled() might be the right test. If discarding 
> is disabled, then rebooting might create an inconsistency with 
> e.g.,vfio, resulting in the issues we know from memory ballooning where 
> the state vfio sees will be different from the state the guest kernel 
> sees. It's tricky ... and we much rather quit the VM early instead of 
> corrupting data later :/


Alright, we can check whether ram_block_discard_is_disabled() is true and, 
in that case, exit QEMU with a message. Otherwise, we try to recreate the 
memory area.
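
One possible reading of the flow discussed above, written as a small decision function. It is only a sketch under assumptions: the callbacks stand in for ram_block_discard_is_disabled() and ram_block_discard_range(), whose real signatures are not reproduced here, and RemapAction, RemapOps, choose_remap_action() and the stubs are hypothetical names. Whether a remap is still needed after a successful discard on hugetlbfs is left open in this thread and is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    REMAP_QUIT_VM,        /* discards disabled: stop instead of risking
                             an inconsistent state (e.g. with vfio) */
    REMAP_DISCARDED,      /* the discard mechanism handled the range */
    REMAP_FALLBACK_MMAP   /* discard failed: fall back to mmap(MAP_FIXED) */
} RemapAction;

/* Hypothetical callbacks standing in for the QEMU helpers. */
typedef struct {
    bool (*discard_is_disabled)(void);
    int (*discard_range)(uint64_t start, uint64_t length); /* 0 on success */
} RemapOps;

RemapAction choose_remap_action(const RemapOps *ops,
                                uint64_t start, uint64_t length)
{
    assert(ops && ops->discard_is_disabled && ops->discard_range);

    if (ops->discard_is_disabled()) {
        return REMAP_QUIT_VM;
    }
    if (ops->discard_range(start, length) == 0) {
        return REMAP_DISCARDED;
    }
    return REMAP_FALLBACK_MMAP;
}

/* Trivial stubs, only here so the sketch can be exercised on its own. */
bool stub_discard_disabled(void) { return true; }
bool stub_discard_enabled(void) { return false; }
int stub_discard_ok(uint64_t start, uint64_t length)
{
    (void)start; (void)length;
    return 0;
}
int stub_discard_fail(uint64_t start, uint64_t length)
{
    (void)start; (void)length;
    return -1;
}
```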

Re: [PATCH v1 2/4] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by David Hildenbrand 2 weeks, 5 days ago

> 
> Ok, I can remove the first patch that is created to track the kernel
> provided page size and pass it to the kvm_hwpoison_page_add() function,
> but we could deal with the page size at the kvm_hwpoison_page_add()
> function level as we don't rely on the kernel provided info, but just
> the RAMBlock page size.

Great!

>>> I see that ram_block_discard_range() adds more control before
>>> discarding the RAM region and can also call madvise() in addition to
>>> the fallocate punch hole for standard sized memory pages. Now as the
>>> range is supposed to be recreated, I'm not convinced that these
>>> madvise calls are necessary.
>>
>> They are the proper replacement for the mmap(MAP_FIXED) + fallocate.
>>
>> That function handles all cases of properly discarding guest RAM.
> 
> In the case of hugetlbfs pages, ram_block_discard_range() does the
> punch-hole fallocate call (and prints out the warning messages).
> The madvise call is only done when (rb->page_size ==
> qemu_real_host_page_size()) which isn't true for hugetlbfs.
> So need_madvise is false and neither QEMU_MADV_REMOVE nor
> QEMU_MADV_DONTNEED madvise calls is performed.

See my other mail regarding fallocate()+hugetlb oddities.

The warning is for MAP_PRIVATE mappings where we cannot be sure that we are
not doing harm to somebody else that is mapping the file :(

See

commit 1d44ff586f8a8e113379430750b5a0a2a3f64cf9
Author: David Hildenbrand <david@redhat.com>
Date:   Thu Jul 6 09:56:06 2023 +0200

     softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
     
     ram_block_discard_range() cannot possibly do the right thing in
     MAP_PRIVATE file mappings in the general case.
     
     To achieve the documented semantics, we also have to punch a hole into
     the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
     of such a file.
     
     For example, using VM templating -- see commit b17fbbe55cba ("migration:
     allow private destination ram with x-ignore-shared") -- in combination with
     any mechanism that relies on discarding of RAM is problematic. This
     includes:
     * Postcopy live migration
     * virtio-balloon inflation/deflation or free-page-reporting
     * virtio-mem
     
     So at least warn that there is something possibly dangerous is going on
     when using ram_block_discard_range() in these cases.

So the warning is the best we can do to say "this is possibly very
problematic, and it might be undesirable".

For hugetlb, users should switch to using memory-backend-memfd or
memory-backend-file,share=on.

> 
> 
>>
>>>
>>> But we can also notice that this function will report the following
>>> warning in all cases of not shared file backends:
>>> "ram_block_discard_range: Discarding RAM in private file mappings is
>>> possibly dangerous, because it will modify the underlying file and
>>> will affect other users of the file"
>>
>> Yes, because it's a clear warning sign that something weird is
>> happening. You might be throwing away data that some other process might
>> be relying on.
>>
>> How are you making QEMU consume hugetlbs?
> 
> A classical way to consume (not shared) hugetlbfs pages is done with the
> creation of a file that is opened, mmapped by the Qemu instance but we
> also delete the file system entry so that if the Qemu instance dies, the
> resources are released. This file is usually not shared.

Right, see above. We should be using memory-backend-file,share=on with that,
just like we would with shmem/tmpfs :(

The ugly bit is that the legacy "-mem-path" option translates to
"memory-backend-file,share=off", and we cannot easily change that.

That option really should not be used anymore.

> 
> 
>>
>> We could suppress these warnings, but let's first see how you are able
>> to trigger it.
> 
> The warning is always displayed when such a hugetlbfs VM impacted by a
> memory error is rebooted.
> I understand the reason why we have this message, but in the case of
> hugetlbfs classical use this (new) message on reboot is probably too
> worrying...  But loosing memory is already very worrying ;)

See above; we cannot easily identify "we map this file MAP_PRIVATE
but we are guaranteed to be the single user", so punching a hole in that
file might just corrupt data for another user (e.g., VM templating) without
any warning.
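
A small illustration of that risk, assuming a memfd as a stand-in for the shared file and two mappings inside one process playing the roles of the two users (a sketch only; the function name is hypothetical). The "other user" maps the file MAP_PRIVATE and has not written to it, so it still reads from the page cache and loses its data when the hole is punched.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo. Returns 1 if the other mapping saw its data replaced
 * by zeroes after the hole punch, 0 if it kept the data, -1 on error.
 */
int punch_hole_affects_other_user(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t len;
    int ret = -1;
    unsigned char *writer = MAP_FAILED;
    unsigned char *other = MAP_FAILED;
    int fd;

    assert(ps > 0);
    len = (size_t)ps;

    fd = memfd_create("template-demo", 0);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)len) != 0) {
        goto out;
    }
    /* First user: fills the file through a shared mapping. */
    writer = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    /* Second user: read-only private view of the same file. */
    other = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (writer == MAP_FAILED || other == MAP_FAILED) {
        goto out;
    }
    memset(writer, 0x5A, len);
    if (other[0] != 0x5A) {
        goto out;               /* the second user should see the data */
    }

    /* Someone discards the range by punching a hole in the file. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, (off_t)len) != 0) {
        goto out;
    }
    ret = (other[0] == 0) ? 1 : 0;
out:
    if (writer != MAP_FAILED) {
        munmap(writer, len);
    }
    if (other != MAP_FAILED) {
        munmap(other, len);
    }
    close(fd);
    return ret;
}
```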

Again, we could suppress the warning, but not using MAP_PRIVATE with
a hugetlb file would be even better.

(hugetlb contains other hacks that make sure that MAP_PRIVATE on a file
won't result in a double memory consumption -- with shmem/tmpfs it would
result in a double memory consumption!)

Are the users you are aware of using "-mem-path" or "-object memory-backend-file"?

We might be able to change the default for the latter with a new QEMU version,
maybe ...


-- 
Cheers,

David / dhildenb

[PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages

Posted by William Roche 1 month ago

From: William Roche <william.roche@oracle.com>

When the VM reboots, a memory reset is performed calling
qemu_ram_remap() on all hwpoisoned pages.
While we take into account the recorded page sizes to repair the
memory locations, a large page also needs to punch a hole in the
backend file to regenerate a usable memory, cleaning the HW
poisoned section. This is mandatory for hugetlbfs case for example.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/physmem.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/system/physmem.c b/system/physmem.c
index 3757428336..3f6024a92d 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                 prot = PROT_READ;
                 prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
                 if (block->fd >= 0) {
+                    if (length > TARGET_PAGE_SIZE && fallocate(block->fd,
+                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+                        offset + block->fd_offset, length) != 0) {
+                        error_report("Could not recreate the file hole for "
+                                     "addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     length, addr);
+                        exit(1);
+                    }
                     area = mmap(vaddr, length, prot, flags, block->fd,
                                 offset + block->fd_offset);
                 } else {
-- 
2.43.5

Re: [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages

Posted by David Hildenbrand 1 month ago

On 22.10.24 23:35, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> When the VM reboots, a memory reset is performed calling
> qemu_ram_remap() on all hwpoisoned pages.
> While we take into account the recorded page sizes to repair the
> memory locations, a large page also needs to punch a hole in the
> backend file to regenerate a usable memory, cleaning the HW
> poisoned section. This is mandatory for hugetlbfs case for example.
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   system/physmem.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index 3757428336..3f6024a92d 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
>                   prot = PROT_READ;
>                   prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>                   if (block->fd >= 0) {
> +                    if (length > TARGET_PAGE_SIZE && fallocate(block->fd,
> +                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
> +                        offset + block->fd_offset, length) != 0) {
> +                        error_report("Could not recreate the file hole for "
> +                                     "addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                                     length, addr);
> +                        exit(1);
> +                    }
>                       area = mmap(vaddr, length, prot, flags, block->fd,
>                                   offset + block->fd_offset);
>                   } else {

Ah! Just what I commented on patch #2; we should be using 
ram_discard_range(). It might be better to avoid the mmap() completely 
if ram_discard_range() worked.

And as raised, there is the problem with memory preallocation (where we 
should fail if it doesn't work) and ram discards being disabled because 
something relies on long-term page pinning ...

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages

Posted by William Roche 4 weeks, 1 day ago

On 10/23/24 09:30, David Hildenbrand wrote:

> On 22.10.24 23:35, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> When the VM reboots, a memory reset is performed calling
>> qemu_ram_remap() on all hwpoisoned pages.
>> While we take into account the recorded page sizes to repair the
>> memory locations, a large page also needs to punch a hole in the
>> backend file to regenerate a usable memory, cleaning the HW
>> poisoned section. This is mandatory for hugetlbfs case for example.
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   system/physmem.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 3757428336..3f6024a92d 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr, 
>> ram_addr_t length)
>>                   prot = PROT_READ;
>>                   prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>>                   if (block->fd >= 0) {
>> +                    if (length > TARGET_PAGE_SIZE && 
>> fallocate(block->fd,
>> +                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
>> +                        offset + block->fd_offset, length) != 0) {
>> +                        error_report("Could not recreate the file 
>> hole for "
>> +                                     "addr: " RAM_ADDR_FMT "@" 
>> RAM_ADDR_FMT "",
>> +                                     length, addr);
>> +                        exit(1);
>> +                    }
>>                       area = mmap(vaddr, length, prot, flags, block->fd,
>>                                   offset + block->fd_offset);
>>                   } else {
>
> Ah! Just what I commented to patch #3; we should be using 
> ram_discard_range(). It might be better to avoid the mmap() completely 
> if ram_discard_range() worked.

I think you are referring to ram_block_discard_range() here, as 
ram_discard_range() seems to relate to VM migrations, maybe not a VM reset.

Remapping the page is needed to get rid of the poison. So if we want to 
avoid the mmap(), we have to shrink the memory address space -- which 
can be a real problem if we imagine a VM with 1G large pages for 
example. qemu_ram_remap() is used to regenerate the lost memory and the 
mmap() call looks mandatory on the reset phase.

>
> And as raised, there is the problem with memory preallocation (where 
> we should fail if it doesn't work) and ram discards being disabled 
> because something relies on long-term page pinning ...

Yes. Do you suggest that we add a call to qemu_prealloc_mem() for the 
remapped area in case of a backend->prealloc being true ?

Or as we are running on posix machines for this piece of code (ifndef 
_WIN32) maybe we could simply add a MAP_POPULATE flag to the mmap call 
done in qemu_ram_remap() in the case where the backend requires a 
'prealloc' ?  Can you confirm if this flag could be used on all systems 
running this code ?

Unfortunately, I don't know how to get the MEMORY_BACKEND corresponding 
to a given memory block. I'm not sure that MEMORY_BACKEND(block->mr) is 
a valid way to retrieve the Backend object and its 'prealloc' property 
here. Could you please give me a direction here ?

I can send a new version using ram_block_discard_range() as you 
suggested to replace the direct call to fallocate(), if you think it 
would be better.
Please let me know what other enhancement(s) you'd like to see in this 
code change.

Thanks in advance,
William.

Re: [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages

Posted by David Hildenbrand 3 weeks, 5 days ago

On 26.10.24 01:27, William Roche wrote:
> On 10/23/24 09:30, David Hildenbrand wrote:
> 
>> On 22.10.24 23:35, William Roche wrote:
>>> From: William Roche <william.roche@oracle.com>
>>>
>>> When the VM reboots, a memory reset is performed calling
>>> qemu_ram_remap() on all hwpoisoned pages.
>>> While we take into account the recorded page sizes to repair the
>>> memory locations, a large page also needs to punch a hole in the
>>> backend file to regenerate a usable memory, cleaning the HW
>>> poisoned section. This is mandatory for hugetlbfs case for example.
>>>
>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>> ---
>>>    system/physmem.c | 8 ++++++++
>>>    1 file changed, 8 insertions(+)
>>>
>>> diff --git a/system/physmem.c b/system/physmem.c
>>> index 3757428336..3f6024a92d 100644
>>> --- a/system/physmem.c
>>> +++ b/system/physmem.c
>>> @@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr,
>>> ram_addr_t length)
>>>                    prot = PROT_READ;
>>>                    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>>>                    if (block->fd >= 0) {
>>> +                    if (length > TARGET_PAGE_SIZE &&
>>> fallocate(block->fd,
>>> +                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
>>> +                        offset + block->fd_offset, length) != 0) {
>>> +                        error_report("Could not recreate the file
>>> hole for "
>>> +                                     "addr: " RAM_ADDR_FMT "@"
>>> RAM_ADDR_FMT "",
>>> +                                     length, addr);
>>> +                        exit(1);
>>> +                    }
>>>                        area = mmap(vaddr, length, prot, flags, block->fd,
>>>                                    offset + block->fd_offset);
>>>                    } else {
>>
>> Ah! Just what I commented to patch #3; we should be using
>> ram_discard_range(). It might be better to avoid the mmap() completely
>> if ram_discard_range() worked.
> 

Hi!

> 
> I think you are referring to ram_block_discard_range() here, as
> ram_discard_range() seems to relate to VM migrations, maybe not a VM reset.

Please take a look at the users of ram_block_discard_range(), including 
virtio-balloon to completely zap guest memory, so we will get fresh 
memory on next access. It takes care of process-private and file-backed 
(shared) memory.
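
To illustrate the "fresh memory on next access" behaviour for ordinary process-private memory, here is a minimal standalone sketch. It is not QEMU code: zap_private_range() and zap_demo() are hypothetical helpers, and the sketch only covers an anonymous private mapping with the base page size, using the Linux madvise(MADV_DONTNEED) call.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a QEMU function): drop the pages backing
 * [addr, addr + len) of a private anonymous mapping. On Linux the next
 * access faults in zero-filled memory. Returns 0 on success, -1 on error.
 */
static int zap_private_range(void *addr, size_t len)
{
    return madvise(addr, len, MADV_DONTNEED);
}

/*
 * Fill one page with a pattern, zap it, and return the first byte seen
 * afterwards (0 when the page was replaced), or -1 if a step failed.
 */
static int zap_demo(void)
{
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int first;

    if (p == MAP_FAILED) {
        return -1;
    }
    memset(p, 0xAA, pagesize);
    if (zap_private_range(p, pagesize) != 0) {
        munmap(p, pagesize);
        return -1;
    }
    first = p[0];   /* this access faults in a fresh page */
    munmap(p, pagesize);
    return first;
}
```

The mapping itself stays in place; only the backing page is replaced, which is the property of interest for a poisoned page.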

> 
> Remapping the page is needed to get rid of the poison. So if we want to
> avoid the mmap(), we have to shrink the memory address space -- which
> can be a real problem if we imagine a VM with 1G large pages for
> example. qemu_ram_remap() is used to regenerate the lost memory and the
> mmap() call looks mandatory on the reset phase.

Why can't we use ram_block_discard_range() to zap the poisoned page 
(unmap from page tables + conditionally drop from the page cache)? Is 
there anything important I am missing?

> 
> 
>>
>> And as raised, there is the problem with memory preallocation (where
>> we should fail if it doesn't work) and ram discards being disabled
>> because something relies on long-term page pinning ...
> 
> 
> Yes. Do you suggest that we add a call to qemu_prealloc_mem() for the
> remapped area in case of a backend->prealloc being true ?

Yes. Otherwise, with hugetlb, you might run out of hugetlb pages at 
runtime and SIGBUS QEMU :(

> 
> Or as we are running on posix machines for this piece of code (ifndef
> _WIN32) maybe we could simply add a MAP_POPULATE flag to the mmap call
> done in qemu_ram_remap() in the case where the backend requires a
> 'prealloc' ?  Can you confirm if this flag could be used on all systems
> running this code ?

Please use qemu_prealloc_mem(). MAP_POPULATE has no guarantees, it's 
really weird :/ mmap() might succeed even though MAP_POPULATE didn't 
work ... and it's problematic with NUMA policies because we essentially 
lose (overwrite) them.
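
As a rough illustration of explicit preallocation, as opposed to relying on MAP_POPULATE, the sketch below faults in every page of a range by touching it. It is only an assumption about the general technique, not the qemu_prealloc_mem() implementation: touch_pages() and touch_demo() are hypothetical names, and the sketch has no SIGBUS handling, threading or NUMA awareness.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: fault in every page of [area, area + len) by
 * rewriting its first byte with the value already there, so existing data
 * is preserved. Returns the number of pages touched.
 * Unlike a real preallocator, this does not catch SIGBUS, so a failed
 * allocation (for example no free hugetlb page) would kill the process.
 */
static size_t touch_pages(void *area, size_t len, size_t pagesize)
{
    volatile unsigned char *p = area;
    size_t touched = 0;

    assert(pagesize > 0);
    for (size_t off = 0; off < len; off += pagesize) {
        p[off] = p[off];   /* volatile access: the write is not optimized away */
        touched++;
    }
    return touched;
}

/* Map npages of anonymous memory, touch it, and return the pages touched
 * (0 if the mapping could not be created). */
static size_t touch_demo(size_t npages)
{
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * pagesize;
    void *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    size_t n;

    if (area == MAP_FAILED) {
        return 0;
    }
    n = touch_pages(area, len, pagesize);
    munmap(area, len);
    return n;
}
```

Because the touch happens after mmap() has returned, a failure shows up at a well-defined point instead of being hidden behind a successful mmap().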

And the whole mmap(MAP_FIXED) is an ugly hack. For example, we wouldn't 
reset the memory policy we apply in 
host_memory_backend_memory_complete() ... that code really needs a 
rewrite to do it properly.


Ideally, we'd do something high-level like


if (ram_block_discard_is_disabled()) {
	/*
	 * We cannot safely discard RAM,  ... for example we might have
	 * to remap all guest RAM into vfio after discarding the 	
	 * problematic pages ... TODO.
	 */
	exit(0);
}

/* Throw away the problematic (poisoned) page. */
if (ram_block_discard_range()) {
	/* Conditionally fallback to MAP_FIXED workaround */
	...
}

/* If preallocation was requested, we really must re-preallocate. */
if (prealloc && qemu_prealloc_mem()) {
	/* Preallocation failed .... */
	exit(0);
}

As you note the last part is tricky. See below.

> 
> Unfortunately, I don't know how to get the MEMORY_BACKEND corresponding
> to a given memory block. I'm not sure that MEMORY_BACKEND(block->mr) is
> a valid way to retrieve the Backend object and its 'prealloc' property
> here. Could you please give me a direction here ?

We could add a RAM_PREALLOC flag to hint that this memory has "prealloc" 
semantics.

I once had an alternative approach: Similar to ram_block_notify_resize() 
we would implement ram_block_notify_remap().

That's where the backend could register and re-apply mmap properties 
like NUMA policies (in case we have to fallback to MAP_FIXED) and handle 
the preallocation.

So one would implement a ram_block_notify_remap() and maybe indicate if 
we had to do MAP_FIXED or if we only discarded the page.
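
A minimal sketch of what such a remap notification could look like, under stated assumptions: the types and functions below (RemapNotifier, remap_notifier_add(), notify_remap(), DemoBackend) are made up for illustration and are much simpler than QEMU's real RAMBlockNotifier machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified notifier; QEMU's real RAMBlockNotifier differs. */
typedef struct RemapNotifier RemapNotifier;
struct RemapNotifier {
    /* Called after [host, host + size) was discarded or remapped. */
    void (*ram_block_remapped)(RemapNotifier *n, void *host, size_t size,
                               bool was_map_fixed);
    RemapNotifier *next;
};

static RemapNotifier *remap_notifiers;

static void remap_notifier_add(RemapNotifier *n)
{
    n->next = remap_notifiers;
    remap_notifiers = n;
}

static void notify_remap(void *host, size_t size, bool was_map_fixed)
{
    for (RemapNotifier *n = remap_notifiers; n; n = n->next) {
        if (n->ram_block_remapped) {
            n->ram_block_remapped(n, host, size, was_map_fixed);
        }
    }
}

/* Example backend: only counts what it would have to redo after a remap. */
typedef struct DemoBackend {
    RemapNotifier notifier;   /* must stay first: cast back in the callback */
    bool prealloc;
    int policy_reapplied;
    int prealloc_redone;
} DemoBackend;

static void demo_backend_remapped(RemapNotifier *n, void *host, size_t size,
                                  bool was_map_fixed)
{
    DemoBackend *b = (DemoBackend *)n;

    (void)host;
    (void)size;
    if (was_map_fixed) {
        b->policy_reapplied++;   /* a fresh mapping lost its NUMA policy */
    }
    if (b->prealloc) {
        b->prealloc_redone++;    /* the new page must be allocated again */
    }
}
```

Passing the "was MAP_FIXED used" flag lets the backend skip the policy work when the page was only discarded.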

I once had a prototype for that, let me dig ...

> 
> I can send a new version using ram_block_discard_range() as you
> suggested to replace the direct call to fallocate(), if you think it
> would be better.
> Please let me know what other enhancement(s) you'd like to see in this
> code change.

Something along the lines above. Please let me know if you see problems 
with that approach that I am missing.

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages

Posted by William Roche 3 weeks, 4 days ago

On 10/28/24 18:01, David Hildenbrand wrote:
> On 26.10.24 01:27, William Roche wrote:
>> On 10/23/24 09:30, David Hildenbrand wrote:
>>
>>> On 22.10.24 23:35, William Roche wrote:
>>>> From: William Roche <william.roche@oracle.com>
>>>>
>>>> When the VM reboots, a memory reset is performed calling
>>>> qemu_ram_remap() on all hwpoisoned pages.
>>>> While we take into account the recorded page sizes to repair the
>>>> memory locations, a large page also needs to punch a hole in the
>>>> backend file to regenerate a usable memory, cleaning the HW
>>>> poisoned section. This is mandatory for hugetlbfs case for example.
>>>>
>>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>>> ---
>>>>    system/physmem.c | 8 ++++++++
>>>>    1 file changed, 8 insertions(+)
>>>>
>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>> index 3757428336..3f6024a92d 100644
>>>> --- a/system/physmem.c
>>>> +++ b/system/physmem.c
>>>> @@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr,
>>>> ram_addr_t length)
>>>>                    prot = PROT_READ;
>>>>                    prot |= block->flags & RAM_READONLY ? 0 : 
>>>> PROT_WRITE;
>>>>                    if (block->fd >= 0) {
>>>> +                    if (length > TARGET_PAGE_SIZE &&
>>>> fallocate(block->fd,
>>>> +                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
>>>> +                        offset + block->fd_offset, length) != 0) {
>>>> +                        error_report("Could not recreate the file
>>>> hole for "
>>>> +                                     "addr: " RAM_ADDR_FMT "@"
>>>> RAM_ADDR_FMT "",
>>>> +                                     length, addr);
>>>> +                        exit(1);
>>>> +                    }
>>>>                        area = mmap(vaddr, length, prot, flags, 
>>>> block->fd,
>>>>                                    offset + block->fd_offset);
>>>>                    } else {
>>>
>>> Ah! Just what I commented to patch #3; we should be using
>>> ram_discard_range(). It might be better to avoid the mmap() completely
>>> if ram_discard_range() worked.
>>
> 
> Hi!
> 
>>
>> I think you are referring to ram_block_discard_range() here, as
>> ram_discard_range() seems to relate to VM migrations, maybe not a VM 
>> reset.
> 
> Please take a look at the users of ram_block_discard_range(), including 
> virtio-balloon to completely zap guest memory, so we will get fresh 
> memory on next access. It takes care of process-private and file-backed 
> (shared) memory.

The madvise calls should take care of releasing the memory for the 
mapped area, and they are made for standard page sized memory.

>>
>> Remapping the page is needed to get rid of the poison. So if we want to
>> avoid the mmap(), we have to shrink the memory address space -- which
>> can be a real problem if we imagine a VM with 1G large pages for
>> example. qemu_ram_remap() is used to regenerate the lost memory and the
>> mmap() call looks mandatory on the reset phase.
> 
> Why can't we use ram_block_discard_range() to zap the poisoned page 
> (unmap from page tables + conditionally drop from the page cache)? Is 
> there anything important I am missing?

Or maybe _I'm_ missing something important, but what I understand is that:
    need_madvise = (rb->page_size == qemu_real_host_page_size());

ensures that the madvise call in ram_block_discard_range() is not done 
in the case of hugepages.
In this case, we need to call mmap to remap the hugetlbfs large page.

As I said in the previous email, recent kernels start to implement these 
calls for hugetlbfs, but I'm not sure that changing the mechanism of 
this ram_block_discard_range() function now is appropriate.
Do you agree with that ?
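
The condition quoted above can be restated as a small decision helper. This is an illustrative sketch, not the QEMU implementation: plan_discard() and DiscardPlan are made-up names, and the rule that a valid file descriptor selects the hole punch is an assumption about how the function is organised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which mechanisms a discard would use for a given RAM block (hypothetical). */
typedef struct DiscardPlan {
    bool use_fallocate;   /* punch a hole in the backing file */
    bool use_madvise;     /* drop the pages from the process mapping */
} DiscardPlan;

/*
 * fd is the backing file descriptor (negative for anonymous memory),
 * block_page_size the page size of the RAM block, host_page_size the
 * base page size of the host.
 */
static DiscardPlan plan_discard(int fd, size_t block_page_size,
                                size_t host_page_size)
{
    DiscardPlan plan = {
        .use_fallocate = fd >= 0,
        /* Mirrors need_madvise: skipped when the block uses huge pages. */
        .use_madvise = block_page_size == host_page_size,
    };
    return plan;
}
```

With these rules a hugetlbfs-backed block gets the hole punch only, while a base-page file-backed block gets both calls.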


>>
>>
>>>
>>> And as raised, there is the problem with memory preallocation (where
>>> we should fail if it doesn't work) and ram discards being disabled
>>> because something relies on long-term page pinning ...
>>
>>
>> Yes. Do you suggest that we add a call to qemu_prealloc_mem() for the
>> remapped area in case of a backend->prealloc being true ?
> 
> Yes. Otherwise, with hugetlb, you might run out of hugetlb pages at 
> runtime and SIGBUS QEMU :(
> 
>>
>> Or as we are running on posix machines for this piece of code (ifndef
>> _WIN32) maybe we could simply add a MAP_POPULATE flag to the mmap call
>> done in qemu_ram_remap() in the case where the backend requires a
>> 'prealloc' ?  Can you confirm if this flag could be used on all systems
>> running this code ?
> 
> Please use qemu_prealloc_mem(). MAP_POPULATE has no guarantees, it's 
> really weird :/ mmap() might succeed even though MAP_POPULATE didn't 
> work ... and it's problematic with NUMA policies because we essentially 
> lose (overwrite) them.
> 
> And the whole mmap(MAP_FIXED) is an ugly hack. For example, we wouldn't 
> reset the memory policy we apply in 
> host_memory_backend_memory_complete() ... that code really needs a 
> rewrite to do it properly.

Maybe I can try to call madvise on hugepages too, only in this VM reset 
situation, and deal with the failure scenario of older kernels not 
supporting it, leaving the behavior unchanged for all other callers of 
this function.

But I'll need to verify the effect of this madvise call on hugetlbfs on 
the latest upstream kernel and on some older kernels too.



> 
> Ideally, we'd do something high-level like
> 
> 
> if (ram_block_discard_is_disabled()) {
>      /*
>       * We cannot safely discard RAM,  ... for example we might have
>       * to remap all guest RAM into vfio after discarding the
>       * problematic pages ... TODO.
>       */
>      exit(0);
> }
> 
> /* Throw away the problematic (poisoned) page. */
> if (ram_block_discard_range()) {
>      /* Conditionally fallback to MAP_FIXED workaround */
>      ...
> }
> 
> /* If preallocation was requested, we really must re-preallocate. */
> if (prealloc && qemu_prealloc_mem()) {
>      /* Preallocation failed .... */
>      exit(0);
> }
> 
> As you note the last part is tricky. See below.
> 
>>
>> Unfortunately, I don't know how to get the MEMORY_BACKEND corresponding
>> to a given memory block. I'm not sure that MEMORY_BACKEND(block->mr) is
>> a valid way to retrieve the Backend object and its 'prealloc' property
>> here. Could you please give me a direction here ?
> 
> We could add a RAM_PREALLOC flag to hint that this memory has "prealloc" 
> semantics.
> 
> I once had an alternative approach: Similar to ram_block_notify_resize() 
> we would implement ram_block_notify_remap().
> 
> That's where the backend could register and re-apply mmap properties 
> like NUMA policies (in case we have to fallback to MAP_FIXED) and handle 
> the preallocation.
> 
> So one would implement a ram_block_notify_remap() and maybe indicate if 
> we had to do MAP_FIXED or if we only discarded the page.
> 
> I once had a prototype for that, let me dig ...

That would be great !  Thanks.

> 
>>
>> I can send a new version using ram_block_discard_range() as you
>> suggested to replace the direct call to fallocate(), if you think it
>> would be better.
>> Please let me know what other enhancement(s) you'd like to see in this
>> code change.
> 
> Something along the lines above. Please let me know if you see problems 
> with that approach that I am missing.


Let me check the madvise use on hugetlbfs and if it works as expected,
I'll try to implement a V2 version of the fix proposal integrating a 
modified ram_block_discard_range() function.

I'll also remove the page size information from the signal handlers
and only keep it in the kvm_hwpoison_page_add() function.

I'll investigate how to keep track of the 'prealloc' attribute to 
optionally use when remapping the hugepages (on older kernels).
And if you find the prototype code you talked about that would 
definitely help :)

Thanks a lot,
William.

Re: [PATCH v1 3/4] system/physmem: Largepage punch hole before reset of memory pages

Posted by David Hildenbrand 2 weeks, 5 days ago

>>>
>>> Remapping the page is needed to get rid of the poison. So if we want to
>>> avoid the mmap(), we have to shrink the memory address space -- which
>>> can be a real problem if we imagine a VM with 1G large pages for
>>> example. qemu_ram_remap() is used to regenerate the lost memory and the
>>> mmap() call looks mandatory on the reset phase.
>>
>> Why can't we use ram_block_discard_range() to zap the poisoned page
>> (unmap from page tables + conditionally drop from the page cache)? Is
>> there anything important I am missing?
> 
> Or maybe _I'm_ missing something important, but what I understand is that:
>      need_madvise = (rb->page_size == qemu_real_host_page_size());
> 
> ensures that the madvise call in ram_block_discard_range() is not done
> in the case of hugepages.
> In this case, we need to call mmap to remap the hugetlbfs large page.

Right, madvise(DONTNEED) works ever since "90e7e7f5ef3f ("mm: enable 
MADV_DONTNEED for hugetlb mappings")".

But as you note, in QEMU we never called madvise(DONTNEED) for hugetlb 
as of today. But note that we always have an "fd" with hugetlb, because 
we never use mmap(MAP_ANON|MAP_PRIVATE|MAP_HUGETLB) in QEMU.

The weird thing is that if you have a mmap(fd, MAP_PRIVATE) hugetlb 
mapping, fallocate(fd, FALLOC_FL_PUNCH_HOLE) will *also* zap any private 
pages. So in contrast to "ordinary" memory, the madvise(DONTNEED) is not 
required.

(yes, it's very weird)

So the fallocate(fd, FALLOC_FL_PUNCH_HOLE) will zap the hugetlb page and 
you will get a fresh one on next fault.
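
The hole-punch effect can be tried on an ordinary file as a stand-in. The sketch below is an assumption-laden demo, not QEMU code: punch_hole_demo() is a hypothetical name, it uses a regular temporary file with a MAP_SHARED mapping rather than hugetlbfs, and it does not reproduce the private-mapping hugetlb behaviour described above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 0 if the page reads back as zero after the hole punch,
 * 1 if the environment could not run the demo (setup failed or hole
 * punching is unsupported), and -1 if stale data was still visible.
 */
static int punch_hole_demo(void)
{
    char path[] = "/tmp/punch-demo-XXXXXX";
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    int result = 1;
    unsigned char *p;
    int fd = mkstemp(path);

    if (fd < 0) {
        return 1;
    }
    unlink(path);   /* the file lives on until fd is closed */
    if (ftruncate(fd, (off_t)pagesize) != 0) {
        close(fd);
        return 1;
    }
    p = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return 1;
    }
    memset(p, 0xAA, pagesize);
    /* Drop the page from the file; KEEP_SIZE leaves the file length alone. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, (off_t)pagesize) == 0) {
        result = (p[0] == 0) ? 0 : -1;   /* next access sees a fresh page */
    }
    munmap(p, pagesize);
    close(fd);
    return result;
}
```

The mapping is never torn down between the write and the read, which is the point: the page is replaced underneath an unchanged virtual address.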

For all the glorious details, see:

https://lore.kernel.org/linux-mm/2ddd0a26-33fd-9cde-3501-f0584bbffefc@redhat.com/


> 
> As I said in the previous email, recent kernels start to implement these
> calls for hugetlbfs, but I'm not sure that changing the mechanism of
> this ram_block_discard_range() function now is appropriate.
> Do you agree with that ?

The key point is that it works for hugetlb without madvise(DONTNEED), 
which is weird :)

Which is also why the introducing kernel change added "Do note that 
there is no compelling use case for adding this support.
This was discussed in the RFC [1].  However, adding support makes sense
as it is fairly trivial and brings hugetlb functionality more in line
with 'normal' memory."

[...]

>>
>> So one would implement a ram_block_notify_remap() and maybe indicate if
>> we had to do MAP_FIXED or if we only discarded the page.
>>
>> I once had a prototype for that, let me dig ...
> 
> That would be great !  Thanks.

Found them:

https://gitlab.com/virtio-mem/qemu/-/commit/f528c861897d1086ae84ea1bcd6a0be43e8fea7d

https://gitlab.com/virtio-mem/qemu/-/commit/c5b0328654def8f168497715409d6364096eb63f

https://gitlab.com/virtio-mem/qemu/-/commit/15e9737907835105c132091ad10f9d0c9c68ea64

But note that I didn't realize back then that the mmap(MAP_FIXED) is the 
wrong way to do it, and that we actually have to DONTNEED/PUNCH_HOLE to 
do it properly. But to get the preallocation performed by the backend, 
it should still be valuable.

Note that I wonder if we can get rid of the mmap(MAP_FIXED) handling 
completely: likely we only support Linux with MCE recovery, and 
ram_block_discard_range() should do what we need under Linux.

That would make it a lot simpler.

> 
>>
>>>
>>> I can send a new version using ram_block_discard_range() as you
>>> suggested to replace the direct call to fallocate(), if you think it
>>> would be better.
>>> Please let me know what other enhancement(s) you'd like to see in this
>>> code change.
>>
>> Something along the lines above. Please let me know if you see problems
>> with that approach that I am missing.
> 
> 
> Let me check the madvise use on hugetlbfs and if it works as expected,
> I'll try to implement a V2 version of the fix proposal integrating a
> modified ram_block_discard_range() function.

As discussed, it might all be working. If not, we would have to fix 
ram_block_discard_range().

> 
> I'll also remove the page size information from the signal handlers
> and only keep it in the kvm_hwpoison_page_add() function.

That's good. Especially because there was talk in the last bi-weekly MM 
sync [1] about possibly indicating only the actually failed cachelines 
in the future, not necessarily the full page.

So relying on that interface to return the actual pagesize would not be 
future proof.

That session was in general very interesting and very relevant for your 
work; did you by any chance attend it? If not, we should find you the 
recordings, because the idea is to be able to configure to 
not-unmap-during-mce, and instead only inform the guest OS about the MCE 
(forward it). Which avoids any HGM (high-granularity mapping) issues 
completely.

Only during reboot of the VM we will have to do exactly what is being 
done in this series: zap the whole *page* so our fresh OS will see "all 
non-faulty" memory.

[1] 
https://lkml.kernel.org/r/9242f7cc-6b9d-b807-9079-db0ca81f3c6d@google.com

> 
> I'll investigate how to keep track of the 'prealloc' attribute to
> optionally use when remapping the hugepages (on older kernels).
> And if you find the prototype code you talked about that would
> definitely help :)

Right, the above should help getting that sorted out (but the code is 4 
years old, so it won't "just apply").

-- 
Cheers,

David / dhildenb

[PATCH v2 0/7] hugetlbfs memory HW error fixes

Posted by William Roche 2 weeks, 2 days ago

From: William Roche <william.roche@oracle.com>

Hi David,

Here is an updated description of the patch set:
 ---
This set of patches fixes several problems with hardware memory errors
impacting hugetlbfs memory backed VMs. When using hugetlbfs large
pages, any large page location being impacted by an HW memory error
results in poisoning the entire page, suddenly making a large chunk of
the VM memory unusable.

The main problem that currently exists in Qemu is the lack of backend
file repair before resetting the VM memory, which leaves the impacted
memory silently unusable even after a VM reboot.

In order to fix this issue, we track the page size of the impacted
memory block with the associated poisoned page location.

Using the size information, we also call ram_block_discard_range() to
regenerate the memory on VM reset when running qemu_ram_remap(), so
that poisoned memory backed by a hugetlbfs file is regenerated by
punching a hole in this file. A new page is loaded when the location
is first touched.

In case of a discard failure we fall back to unmap/remap the memory
location and reset the memory settings.

We also have to honor the 'prealloc' attribute even after a successful
discard, so we reapply the memory settings in this case too.

This memory setting is performed by a new remap notification mechanism
calling host_memory_backend_ram_remapped() function when a region of
a memory block is remapped.

We also issue a message describing the impact of a large page memory
loss. It is reported only once, when the page is poisoned.
 ---
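
The flow described above (discard first, fall back to unmap/remap, then reapply the memory settings in both cases) can be sketched as follows. This is only an illustration with hypothetical names (RemapOps, regenerate_poisoned_range(), DemoState and the demo_* stubs); the real steps are replaced by callbacks so the control flow can be exercised on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hooks standing in for the real discard/remap/notify steps. */
typedef struct RemapOps {
    int (*discard)(void *opaque);          /* 0 on success */
    int (*remap_fixed)(void *opaque);      /* fallback, 0 on success */
    void (*notify_remapped)(void *opaque); /* reapply settings, prealloc */
    void *opaque;
} RemapOps;

/* Returns 0 if the range was regenerated, -1 if discard and fallback failed. */
static int regenerate_poisoned_range(const RemapOps *ops)
{
    if (ops->discard(ops->opaque) != 0) {
        /* Discard failed: fall back to unmapping and remapping. */
        if (ops->remap_fixed(ops->opaque) != 0) {
            return -1;
        }
    }
    /* Settings are reapplied even after a successful discard. */
    ops->notify_remapped(ops->opaque);
    return 0;
}

/* Demo stubs that record what happened instead of touching memory. */
typedef struct DemoState {
    bool discard_fails;
    bool remap_fails;
    int remaps;
    int notifications;
} DemoState;

static int demo_discard(void *opaque)
{
    DemoState *s = opaque;
    return s->discard_fails ? -1 : 0;
}

static int demo_remap_fixed(void *opaque)
{
    DemoState *s = opaque;
    s->remaps++;
    return s->remap_fails ? -1 : 0;
}

static void demo_notify(void *opaque)
{
    DemoState *s = opaque;
    s->notifications++;
}
```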


v1 -> v2:
. I removed the kernel SIGBUS siginfo provided lsb size information
  tracking. Only relying on the RAMBlock page_size instead.

. I adapted the 3 patches you pointed me to in order to implement the
  notification mechanism on remap.  Thank you for this code!
  I left them as authored by you.
  But I haven't tested if the policy setting works as expected on VM
  reset, only that the replacement of physical memory works.

. I also removed the old memory setting that was kept in qemu_ram_remap()
  but this small last fix could probably be merged with your last commit.


Yesterday I also got the recording of the mm-linux session about the
kernel modification on largepage poisoning, and discussed this topic
with a colleague of mine who attended the meeting.

Regarding your question about the use of -mem-path: we have communicated
that this option is deprecated and advise all users to use the following
options instead.
-object memory-backend-file,id=pc.ram,mem-path=/dev/hugepages,prealloc,size=XXX -machine memory-backend=pc.ram 

We could now add the request to use a share=on attribute too, to avoid
the additional message about dangerous discard situations.


This code is scripts/checkpatch.pl clean
'make check' runs fine on both x86 and Arm.


David Hildenbrand (3):
  numa: Introduce and use ram_block_notify_remap()
  hostmem: Factor out applying settings
  hostmem: Handle remapping of RAM

William Roche (4):
  accel/kvm: Keep track of the HWPoisonPage page_size
  system/physmem: poisoned memory discard on reboot
  accel/kvm: Report the loss of a large memory page
  system/physmem: Memory settings applied on remap notification

 accel/kvm/kvm-all.c       |  17 +++-
 backends/hostmem.c        | 184 +++++++++++++++++++++++---------------
 hw/core/numa.c            |  11 +++
 include/exec/cpu-common.h |   1 +
 include/exec/ramlist.h    |   3 +
 include/sysemu/hostmem.h  |   1 +
 include/sysemu/kvm_int.h  |   4 +-
 system/physmem.c          |  62 ++++++++-----
 target/arm/kvm.c          |   2 +-
 target/i386/kvm/kvm.c     |   2 +-
 10 files changed, 189 insertions(+), 98 deletions(-)

-- 
2.43.5

[PATCH v2 1/7] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by William Roche 2 weeks, 2 days ago

From: William Roche <william.roche@oracle.com>

When a memory page is added to the hwpoison_page_list, include
the page size information.  This size is the backend real page
size. To better deal with hugepages, we create a single entry
for the entire page.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c       |  8 +++++++-
 include/exec/cpu-common.h |  1 +
 system/physmem.c          | 13 +++++++++++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 801cff16a5..6dd06f5edf 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
  */
 typedef struct HWPoisonPage {
     ram_addr_t ram_addr;
+    size_t     page_size;
     QLIST_ENTRY(HWPoisonPage) list;
 } HWPoisonPage;
 
@@ -1278,7 +1279,7 @@ static void kvm_unpoison_all(void *param)
 
     QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
         QLIST_REMOVE(page, list);
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr, page->page_size);
         g_free(page);
     }
 }
@@ -1286,6 +1287,10 @@ static void kvm_unpoison_all(void *param)
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
+    size_t sz = qemu_ram_pagesize_from_addr(ram_addr);
+
+    if (sz > TARGET_PAGE_SIZE)
+        ram_addr = ROUND_DOWN(ram_addr, sz);
 
     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
@@ -1294,6 +1299,7 @@ void kvm_hwpoison_page_add(ram_addr_t ram_addr)
     }
     page = g_new(HWPoisonPage, 1);
     page->ram_addr = ram_addr;
+    page->page_size = sz;
     QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
 }
 
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 638dc806a5..8f8f7ad567 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -108,6 +108,7 @@ bool qemu_ram_is_named_file(RAMBlock *rb);
 int qemu_ram_get_fd(RAMBlock *rb);
 
 size_t qemu_ram_pagesize(RAMBlock *block);
+size_t qemu_ram_pagesize_from_addr(ram_addr_t addr);
 size_t qemu_ram_pagesize_largest(void);
 
 /**
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..750604d47d 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1665,6 +1665,19 @@ size_t qemu_ram_pagesize(RAMBlock *rb)
     return rb->page_size;
 }
 
+/* Return backend real page size used for the given ram_addr. */
+size_t qemu_ram_pagesize_from_addr(ram_addr_t addr)
+{
+    RAMBlock *rb;
+
+    RCU_READ_LOCK_GUARD();
+    rb =  qemu_get_ram_block(addr);
+    if (!rb) {
+        return TARGET_PAGE_SIZE;
+    }
+    return qemu_ram_pagesize(rb);
+}
+
 /* Returns the largest size of page in use */
 size_t qemu_ram_pagesize_largest(void)
 {
-- 
2.43.5
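
The bookkeeping introduced by this patch (one entry per backing page, keyed by the rounded-down address) can be sketched standalone. The names below (PoisonEntry, poison_add(), page_start()) are hypothetical and a plain singly linked list replaces QLIST; unlike ROUND_DOWN, the helper here assumes the page size is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct PoisonEntry {
    uint64_t addr;        /* start of the backing page */
    size_t page_size;     /* size of the backing page */
    struct PoisonEntry *next;
} PoisonEntry;

static PoisonEntry *poison_list;

/* Round addr down to a multiple of page_size (must be a power of two). */
static uint64_t page_start(uint64_t addr, size_t page_size)
{
    return addr & ~((uint64_t)page_size - 1);
}

/*
 * Record the page containing addr. Returns 1 if a new entry was added,
 * 0 if that page was already tracked, -1 on allocation failure.
 */
static int poison_add(uint64_t addr, size_t page_size)
{
    uint64_t start = page_start(addr, page_size);
    PoisonEntry *e;

    for (e = poison_list; e; e = e->next) {
        if (e->addr == start) {
            return 0;
        }
    }
    e = malloc(sizeof(*e));
    if (!e) {
        return -1;
    }
    e->addr = start;
    e->page_size = page_size;
    e->next = poison_list;
    poison_list = e;
    return 1;
}
```

Two errors reported inside the same 2 MiB page therefore end up as a single entry covering the whole page.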

Re: [PATCH v2 1/7] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by David Hildenbrand 1 week, 4 days ago

On 07.11.24 11:21, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> When a memory page is added to the hwpoison_page_list, include
> the page size information.  This size is the backend real page
> size. To better deal with hugepages, we create a single entry
> for the entire page.
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   accel/kvm/kvm-all.c       |  8 +++++++-
>   include/exec/cpu-common.h |  1 +
>   system/physmem.c          | 13 +++++++++++++
>   3 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 801cff16a5..6dd06f5edf 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
>    */
>   typedef struct HWPoisonPage {
>       ram_addr_t ram_addr;
> +    size_t     page_size;
>       QLIST_ENTRY(HWPoisonPage) list;
>   } HWPoisonPage;
>   
> @@ -1278,7 +1279,7 @@ static void kvm_unpoison_all(void *param)
>   
>       QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>           QLIST_REMOVE(page, list);
> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
> +        qemu_ram_remap(page->ram_addr, page->page_size);
>           g_free(page);

I'm curious, can't we simply drop the size parameter from qemu_ram_remap()
completely and determine the page size internally from the RAMBlock that
we are looking up already?

This way, we avoid yet another lookup in qemu_ram_pagesize_from_addr(),
and can just handle it completely in qemu_ram_remap().

In particular, to be future proof, we should also align the offset down to
the pagesize.

I'm thinking about something like this:

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 801cff16a5..8a47aa7258 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1278,7 +1278,7 @@ static void kvm_unpoison_all(void *param)
  
      QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
          QLIST_REMOVE(page, list);
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr);
          g_free(page);
      }
  }
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 638dc806a5..50a829d31f 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -67,7 +67,7 @@ typedef uintptr_t ram_addr_t;
  
  /* memory API */
  
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
+void qemu_ram_remap(ram_addr_t addr);
  /* This should not be used by devices.  */
  ram_addr_t qemu_ram_addr_from_host(void *ptr);
  ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..5f19bec089 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2167,10 +2167,10 @@ void qemu_ram_free(RAMBlock *block)
  }
  
  #ifndef _WIN32
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+void qemu_ram_remap(ram_addr_t addr)
  {
      RAMBlock *block;
-    ram_addr_t offset;
+    ram_addr_t offset, length;
      int flags;
      void *area, *vaddr;
      int prot;
@@ -2178,6 +2178,10 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
      RAMBLOCK_FOREACH(block) {
          offset = addr - block->offset;
          if (offset < block->max_length) {
+            /* Respect the pagesize of our RAMBlock. */
+            offset = QEMU_ALIGN_DOWN(offset, qemu_ram_pagesize(block));
+            length = qemu_ram_pagesize(block);
+
              vaddr = ramblock_ptr(block, offset);
              if (block->flags & RAM_PREALLOC) {
                  ;
@@ -2206,6 +2210,8 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                  memory_try_enable_merging(vaddr, length);
                  qemu_ram_setup_dump(vaddr, length);
              }
+
+            break;
          }
      }
  }
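
For readers following the suggestion above, here is a minimal standalone sketch of the lookup-then-align step. It is not QEMU code: `struct demo_block`, `align_down()` and `find_page()` are hypothetical stand-ins for `RAMBlock`, `QEMU_ALIGN_DOWN()` and the loop in `qemu_ram_remap()`, and the page size is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a RAMBlock: only the fields the lookup needs. */
struct demo_block {
    uint64_t offset;      /* start of the block in ram_addr space */
    uint64_t max_length;  /* size of the block in bytes */
    uint64_t page_size;   /* backing page size, assumed a power of two */
};

/* Round v down to a multiple of a power-of-two alignment. */
static uint64_t align_down(uint64_t v, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return v & ~(alignment - 1);
}

/*
 * Find the block containing addr and report the page-aligned offset and
 * the length to remap. Returns the block index, or -1 if nothing matches.
 */
static int find_page(const struct demo_block *blocks, size_t n, uint64_t addr,
                     uint64_t *page_offset, uint64_t *length)
{
    for (size_t i = 0; i < n; i++) {
        /*
         * Unsigned subtraction wraps around when addr < blocks[i].offset,
         * so a single comparison checks both bounds.
         */
        uint64_t offset = addr - blocks[i].offset;

        if (offset < blocks[i].max_length) {
            *page_offset = align_down(offset, blocks[i].page_size);
            *length = blocks[i].page_size;
            return (int)i;
        }
    }
    return -1;
}
```

With a 2 MiB page size, an address 0x201234 bytes into a block yields a page offset of 0x200000 and a length of 0x200000, so the caller no longer has to pass a size.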


-- 
Cheers,

David / dhildenb

Re: [PATCH v2 1/7] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by William Roche 1 week, 4 days ago

On 11/12/24 11:30, David Hildenbrand wrote:
> On 07.11.24 11:21, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> When a memory page is added to the hwpoison_page_list, include
>> the page size information.  This size is the backend real page
>> size. To better deal with hugepages, we create a single entry
>> for the entire page.
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   accel/kvm/kvm-all.c       |  8 +++++++-
>>   include/exec/cpu-common.h |  1 +
>>   system/physmem.c          | 13 +++++++++++++
>>   3 files changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 801cff16a5..6dd06f5edf 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned 
>> int extension)
>>    */
>>   typedef struct HWPoisonPage {
>>       ram_addr_t ram_addr;
>> +    size_t     page_size;
>>       QLIST_ENTRY(HWPoisonPage) list;
>>   } HWPoisonPage;
>> @@ -1278,7 +1279,7 @@ static void kvm_unpoison_all(void *param)
>>       QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>>           QLIST_REMOVE(page, list);
>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>> +        qemu_ram_remap(page->ram_addr, page->page_size);
>>           g_free(page);
> 
> I'm curious, can't we simply drop the size parameter from qemu_ram_remap()
> completely and determine the page size internally from the RAMBlock that
> we are looking up already?
> 
> This way, we avoid yet another lookup in qemu_ram_pagesize_from_addr(),
> and can just handle it completely in qemu_ram_remap().
> 
> In particular, to be future proof, we should also align the offset down to
> the pagesize.
> 
> I'm thinking about something like this:
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 801cff16a5..8a47aa7258 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1278,7 +1278,7 @@ static void kvm_unpoison_all(void *param)
> 
>       QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>           QLIST_REMOVE(page, list);
> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
> +        qemu_ram_remap(page->ram_addr);
>           g_free(page);
>       }
>   }
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 638dc806a5..50a829d31f 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -67,7 +67,7 @@ typedef uintptr_t ram_addr_t;
> 
>   /* memory API */
> 
> -void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
> +void qemu_ram_remap(ram_addr_t addr);
>   /* This should not be used by devices.  */
>   ram_addr_t qemu_ram_addr_from_host(void *ptr);
>   ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
> diff --git a/system/physmem.c b/system/physmem.c
> index dc1db3a384..5f19bec089 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2167,10 +2167,10 @@ void qemu_ram_free(RAMBlock *block)
>   }
> 
>   #ifndef _WIN32
> -void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
> +void qemu_ram_remap(ram_addr_t addr)
>   {
>       RAMBlock *block;
> -    ram_addr_t offset;
> +    ram_addr_t offset, length;
>       int flags;
>       void *area, *vaddr;
>       int prot;
> @@ -2178,6 +2178,10 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t 
> length)
>       RAMBLOCK_FOREACH(block) {
>           offset = addr - block->offset;
>           if (offset < block->max_length) {
> +            /* Respect the pagesize of our RAMBlock. */
> +            offset = QEMU_ALIGN_DOWN(offset, qemu_ram_pagesize(block));
> +            length = qemu_ram_pagesize(block);
> +
>               vaddr = ramblock_ptr(block, offset);
>               if (block->flags & RAM_PREALLOC) {
>                   ;
> @@ -2206,6 +2210,8 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t 
> length)
>                   memory_try_enable_merging(vaddr, length);
>                   qemu_ram_setup_dump(vaddr, length);
>               }
> +
> +            break;
>           }
>       }
>   }
> 
> 


Yes this is a working possibility, and as you say it would provide the 
advantage to avoid a size lookup (needed because the kernel siginfo can 
be incorrect) and avoid tracking the poisoned pages size, with the 
addresses.

But if we want to keep the information about the loss of a large page 
(which I think is useful) we would have to introduce the page size 
lookup when adding the page to the poison list. So according to me, 
keeping track of the page size and reusing it on remap isn't so bad. But 
if you prefer that we don't track the page size and do a lookup on page 
insert into the poison list and another in qemu_ram_remap(), of course 
we can do that.

There is also something to consider about the future: we'll also have to 
deal with migration of VM that have been impacted by a memory error. And 
knowing about the poisoned pages size could be useful too. But this is 
another topic...

I would vote to keep this size tracking.

Re: [PATCH v2 1/7] accel/kvm: Keep track of the HWPoisonPage page_size

Posted by David Hildenbrand 1 week, 4 days ago

On 12.11.24 19:17, William Roche wrote:
> On 11/12/24 11:30, David Hildenbrand wrote:
>> On 07.11.24 11:21, William Roche wrote:
>>> From: William Roche <william.roche@oracle.com>
>>>
>>> When a memory page is added to the hwpoison_page_list, include
>>> the page size information.  This size is the backend real page
>>> size. To better deal with hugepages, we create a single entry
>>> for the entire page.
>>>
>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>> ---
>>>    accel/kvm/kvm-all.c       |  8 +++++++-
>>>    include/exec/cpu-common.h |  1 +
>>>    system/physmem.c          | 13 +++++++++++++
>>>    3 files changed, 21 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>> index 801cff16a5..6dd06f5edf 100644
>>> --- a/accel/kvm/kvm-all.c
>>> +++ b/accel/kvm/kvm-all.c
>>> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned
>>> int extension)
>>>     */
>>>    typedef struct HWPoisonPage {
>>>        ram_addr_t ram_addr;
>>> +    size_t     page_size;
>>>        QLIST_ENTRY(HWPoisonPage) list;
>>>    } HWPoisonPage;
>>> @@ -1278,7 +1279,7 @@ static void kvm_unpoison_all(void *param)
>>>        QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>>>            QLIST_REMOVE(page, list);
>>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>>> +        qemu_ram_remap(page->ram_addr, page->page_size);
>>>            g_free(page);
>>
>> I'm curious, can't we simply drop the size parameter from qemu_ram_remap()
>> completely and determine the page size internally from the RAMBlock that
>> we are looking up already?
>>
>> This way, we avoid yet another lookup in qemu_ram_pagesize_from_addr(),
>> and can just handle it completely in qemu_ram_remap().
>>
>> In particular, to be future proof, we should also align the offset down to
>> the pagesize.
>>
>> I'm thinking about something like this:
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 801cff16a5..8a47aa7258 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -1278,7 +1278,7 @@ static void kvm_unpoison_all(void *param)
>>
>>        QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>>            QLIST_REMOVE(page, list);
>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>> +        qemu_ram_remap(page->ram_addr);
>>            g_free(page);
>>        }
>>    }
>> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
>> index 638dc806a5..50a829d31f 100644
>> --- a/include/exec/cpu-common.h
>> +++ b/include/exec/cpu-common.h
>> @@ -67,7 +67,7 @@ typedef uintptr_t ram_addr_t;
>>
>>    /* memory API */
>>
>> -void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
>> +void qemu_ram_remap(ram_addr_t addr);
>>    /* This should not be used by devices.  */
>>    ram_addr_t qemu_ram_addr_from_host(void *ptr);
>>    ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
>> diff --git a/system/physmem.c b/system/physmem.c
>> index dc1db3a384..5f19bec089 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -2167,10 +2167,10 @@ void qemu_ram_free(RAMBlock *block)
>>    }
>>
>>    #ifndef _WIN32
>> -void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
>> +void qemu_ram_remap(ram_addr_t addr)
>>    {
>>        RAMBlock *block;
>> -    ram_addr_t offset;
>> +    ram_addr_t offset, length;
>>        int flags;
>>        void *area, *vaddr;
>>        int prot;
>> @@ -2178,6 +2178,10 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t
>> length)
>>        RAMBLOCK_FOREACH(block) {
>>            offset = addr - block->offset;
>>            if (offset < block->max_length) {
>> +            /* Respect the pagesize of our RAMBlock. */
>> +            offset = QEMU_ALIGN_DOWN(offset, qemu_ram_pagesize(block));
>> +            length = qemu_ram_pagesize(block);
>> +
>>                vaddr = ramblock_ptr(block, offset);
>>                if (block->flags & RAM_PREALLOC) {
>>                    ;
>> @@ -2206,6 +2210,8 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t
>> length)
>>                    memory_try_enable_merging(vaddr, length);
>>                    qemu_ram_setup_dump(vaddr, length);
>>                }
>> +
>> +            break;
>>            }
>>        }
>>    }
>>
>>
> 
> 
> Yes this is a working possibility, and as you say it would provide the
> advantage to avoid a size lookup (needed because the kernel siginfo can
> be incorrect) and avoid tracking the poisoned pages size, with the
> addresses.
>
> But if we want to keep the information about the loss of a large page
> (which I think is useful) we would have to introduce the page size
> lookup when adding the page to the poison list. So according to me,

Right, that would be independent of the remap logic.

What I dislike about qemu_ram_remap() is that it looks like we could be 
remapping a range that's possibly larger than a single page.

But it really only works on a single address, expanding that to the 
page. Passing in a length that crosses RAMBlocks would not work as 
expected ...

So I'd prefer if we let qemu_ram_remap() do exactly that ... remap a 
single page ...

> keeping track of the page size and reusing it on remap isn't so bad. But
> if you prefer that we don't track the page size and do a lookup on page
> insert into the poison list and another in qemu_ram_remap(), of course
> we can do that.

... and lookup the page size manually here if we really have to, for 
example to warn/trace errors.

>
> There is also something to consider about the future: we'll also have to
> deal with migration of VM that have been impacted by a memory error. And
> knowing about the poisoned pages size could be useful too. But this is
> another topic...

Yes, although the destination should be able to derive the same thing 
from the address I guess. We expect src and dst QEMU to use the same 
memory backing.

-- 
Cheers,

David / dhildenb

[PATCH v2 2/7] system/physmem: poisoned memory discard on reboot

Posted by William Roche 2 weeks, 2 days ago

From: William Roche <william.roche@oracle.com>

We take into account the recorded page sizes to repair the
memory locations, calling ram_block_discard_range() to punch a hole
in the backend file when necessary and regenerate a usable memory.
Fall back to unmap/remap the memory location(s) if the kernel doesn't
support the madvise calls used by ram_block_discard_range().

Hugetlbfs poison case is also taken into account as a hole punch
with fallocate will reload a new page when first touched.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/physmem.c | 50 +++++++++++++++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 750604d47d..dfea120cc5 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2197,27 +2197,37 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
             } else if (xen_enabled()) {
                 abort();
             } else {
-                flags = MAP_FIXED;
-                flags |= block->flags & RAM_SHARED ?
-                         MAP_SHARED : MAP_PRIVATE;
-                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
-                prot = PROT_READ;
-                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
-                if (block->fd >= 0) {
-                    area = mmap(vaddr, length, prot, flags, block->fd,
-                                offset + block->fd_offset);
-                } else {
-                    flags |= MAP_ANONYMOUS;
-                    area = mmap(vaddr, length, prot, flags, -1, 0);
-                }
-                if (area != vaddr) {
-                    error_report("Could not remap addr: "
-                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
-                                 length, addr);
-                    exit(1);
+                if (ram_block_discard_range(block, offset + block->fd_offset,
+                                            length) != 0) {
+                    if (length > TARGET_PAGE_SIZE) {
+                        /* punch hole is mandatory on hugetlbfs */
+                        error_report("large page recovery failure addr: "
+                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     length, addr);
+                        exit(1);
+                    }
+                    flags = MAP_FIXED;
+                    flags |= block->flags & RAM_SHARED ?
+                             MAP_SHARED : MAP_PRIVATE;
+                    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
+                    prot = PROT_READ;
+                    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
+                    if (block->fd >= 0) {
+                        area = mmap(vaddr, length, prot, flags, block->fd,
+                                    offset + block->fd_offset);
+                    } else {
+                        flags |= MAP_ANONYMOUS;
+                        area = mmap(vaddr, length, prot, flags, -1, 0);
+                    }
+                    if (area != vaddr) {
+                        error_report("Could not remap addr: "
+                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     length, addr);
+                        exit(1);
+                    }
+                    memory_try_enable_merging(vaddr, length);
+                    qemu_ram_setup_dump(vaddr, length);
                 }
-                memory_try_enable_merging(vaddr, length);
-                qemu_ram_setup_dump(vaddr, length);
             }
         }
     }
-- 
2.43.5
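
The commit message relies on the fact that discarding a page from a shared mapping makes the kernel supply a fresh page on the next touch. The sketch below illustrates that effect only; it is not QEMU's `ram_block_discard_range()`. To stay self-contained it uses `madvise(MADV_REMOVE)` on a shared anonymous mapping as a stand-in for the `fallocate()` hole punch on a hugetlbfs file, and `discard_demo()` is a hypothetical name.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 0 if the discarded page reads back as zero while its neighbour
 * keeps its data, 1 if stale data is still visible, and -1 if the mapping
 * or the madvise() call is not available on this system.
 */
static int discard_demo(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    if (psz <= 0) {
        return -1;
    }
    size_t page = (size_t)psz;

    /* Two pages of shared memory, both filled with a known pattern. */
    unsigned char *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    memset(p, 0xAA, 2 * page);

    int ret = -1;
    /* Drop the second page from the backing object (stand-in for the hole punch). */
    if (madvise(p + page, page, MADV_REMOVE) == 0) {
        /* The next access faults in a fresh page; the first page is untouched. */
        ret = (p[page] == 0 && p[0] == 0xAA) ? 0 : 1;
    }

    munmap(p, 2 * page);
    return ret;
}
```

A plain `mmap(MAP_FIXED)` over the same range would not have this effect on shared memory, which is why the discard path is tried first.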

Re: [PATCH v2 2/7] system/physmem: poisoned memory discard on reboot

Posted by David Hildenbrand 1 week, 4 days ago

On 07.11.24 11:21, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> We take into account the recorded page sizes to repair the
> memory locations, calling ram_block_discard_range() to punch a hole
> in the backend file when necessary and regenerate a usable memory.
> Fall back to unmap/remap the memory location(s) if the kernel doesn't
> support the madvise calls used by ram_block_discard_range().
> 
> Hugetlbfs poison case is also taken into account as a hole punch
> with fallocate will reload a new page when first touched.
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   system/physmem.c | 50 +++++++++++++++++++++++++++++-------------------
>   1 file changed, 30 insertions(+), 20 deletions(-)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index 750604d47d..dfea120cc5 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2197,27 +2197,37 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
>               } else if (xen_enabled()) {
>                   abort();
>               } else {
> -                flags = MAP_FIXED;
> -                flags |= block->flags & RAM_SHARED ?
> -                         MAP_SHARED : MAP_PRIVATE;
> -                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
> -                prot = PROT_READ;
> -                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
> -                if (block->fd >= 0) {
> -                    area = mmap(vaddr, length, prot, flags, block->fd,
> -                                offset + block->fd_offset);
> -                } else {
> -                    flags |= MAP_ANONYMOUS;
> -                    area = mmap(vaddr, length, prot, flags, -1, 0);
> -                }
> -                if (area != vaddr) {
> -                    error_report("Could not remap addr: "
> -                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> -                                 length, addr);
> -                    exit(1);
> +                if (ram_block_discard_range(block, offset + block->fd_offset,
> +                                            length) != 0) {
> +                    if (length > TARGET_PAGE_SIZE) {
> +                        /* punch hole is mandatory on hugetlbfs */
> +                        error_report("large page recovery failure addr: "
> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                                     length, addr);
> +                        exit(1);
> +                    }

For shared memory we really need it.

Private file-backed is weird ... because we don't know if the shared or 
the private page is problematic ... :(

Maybe we should just do:

if (block->fd >= 0) {
	/* mmap(MAP_FIXED) cannot reliably zap our problematic page. */
	error_report(...);
	exit(-1);
}

Or alternatively

if (block->fd >= 0 && qemu_ram_is_shared(block)) {
	/* mmap() cannot possibly zap our problematic page. */
	error_report(...);
	exit(-1);
} else if (block->fd >= 0) {
	/*
	 * MAP_PRIVATE file-backed ... mmap() can only zap the private
	 * page, not the shared one ... we don't know which one is
	 * problematic.
	 */
	warn_report(...);
}


> +                    flags = MAP_FIXED;
> +                    flags |= block->flags & RAM_SHARED ?
> +                             MAP_SHARED : MAP_PRIVATE;
> +                    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
> +                    prot = PROT_READ;
> +                    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
> +                    if (block->fd >= 0) {
> +                        area = mmap(vaddr, length, prot, flags, block->fd,
> +                                    offset + block->fd_offset);
> +                    } else {
> +                        flags |= MAP_ANONYMOUS;
> +                        area = mmap(vaddr, length, prot, flags, -1, 0);
> +                    }
> +                    if (area != vaddr) {
> +                        error_report("Could not remap addr: "
> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                                     length, addr);
> +                        exit(1);
> +                    }
> +                    memory_try_enable_merging(vaddr, length);
> +                    qemu_ram_setup_dump(vaddr, length);

Can we factor the mmap hack out into a separate helper function to clean 
this up a bit?


-- 
Cheers,

David / dhildenb

Re: [PATCH v2 2/7] system/physmem: poisoned memory discard on reboot

Posted by William Roche 1 week, 4 days ago

On 11/12/24 12:07, David Hildenbrand wrote:
> On 07.11.24 11:21, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> We take into account the recorded page sizes to repair the
>> memory locations, calling ram_block_discard_range() to punch a hole
>> in the backend file when necessary and regenerate a usable memory.
>> Fall back to unmap/remap the memory location(s) if the kernel doesn't
>> support the madvise calls used by ram_block_discard_range().
>>
>> Hugetlbfs poison case is also taken into account as a hole punch
>> with fallocate will reload a new page when first touched.
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   system/physmem.c | 50 +++++++++++++++++++++++++++++-------------------
>>   1 file changed, 30 insertions(+), 20 deletions(-)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 750604d47d..dfea120cc5 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -2197,27 +2197,37 @@ void qemu_ram_remap(ram_addr_t addr, 
>> ram_addr_t length)
>>               } else if (xen_enabled()) {
>>                   abort();
>>               } else {
>> -                flags = MAP_FIXED;
>> -                flags |= block->flags & RAM_SHARED ?
>> -                         MAP_SHARED : MAP_PRIVATE;
>> -                flags |= block->flags & RAM_NORESERVE ? 
>> MAP_NORESERVE : 0;
>> -                prot = PROT_READ;
>> -                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>> -                if (block->fd >= 0) {
>> -                    area = mmap(vaddr, length, prot, flags, block->fd,
>> -                                offset + block->fd_offset);
>> -                } else {
>> -                    flags |= MAP_ANONYMOUS;
>> -                    area = mmap(vaddr, length, prot, flags, -1, 0);
>> -                }
>> -                if (area != vaddr) {
>> -                    error_report("Could not remap addr: "
>> -                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> -                                 length, addr);
>> -                    exit(1);
>> +                if (ram_block_discard_range(block, offset + block- 
>> >fd_offset,
>> +                                            length) != 0) {
>> +                    if (length > TARGET_PAGE_SIZE) {
>> +                        /* punch hole is mandatory on hugetlbfs */
>> +                        error_report("large page recovery failure 
>> addr: "
>> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> +                                     length, addr);
>> +                        exit(1);
>> +                    }
> 
> For shared memory we really need it.
> 
> Private file-backed is weird ... because we don't know if the shared or 
> the private page is problematic ... :(


I agree with you, and we have to decide when we should bail out if
ram_block_discard_range() doesn't work.
According to me, if discard doesn't work and we are dealing with 
file-backed largepages (shared or not) we have to exit, because the 
fallocate is mandatory. It is the case with hugetlbfs.

In the non-file-backed case, or the file-backed non-largepage private 
case, according to me we can trust the mmap() method to put everything 
back in place for the VM reset to work as expected.
Are there aspects I don't see, and for which mmap + the remap handler is 
not sufficient and we should also bail out here ?



> 
> Maybe we should just do:
> 
> if (block->fd >= 0) {
>      /* mmap(MAP_FIXED) cannot reliably zap our problematic page. */
>      error_report(...);
>      exit(-1);
> }
> 
> Or alternatively
> 
> if (block->fd >= 0 && qemu_ram_is_shared(block)) {
>      /* mmap() cannot possibly zap our problematic page. */
>      error_report(...);
>      exit(-1);
> } else if (block->fd >= 0) {
>      /*
>       * MAP_PRIVATE file-backed ... mmap() can only zap the private
>       * page, not the shared one ... we don't know which one is
>       * problematic.
>       */
>      warn_report(...);
> }

I also agree that any file-backed/shared case should bail out if discard
(fallocate) fails, no matter whether large or standard pages are used.

In the case of file-backed private standard pages, I think that a poison 
on the private page can be fixed with a new mmap.
According to me, there are 2 cases to consider: at the moment the poison 
is seen, the page was dirty (so it means that it was a pure private 
page), or the page was not dirty, and in this case the poison could 
replace this non-dirty page with a new copy of the file content.
In both cases, I'd say that the remap should clean up the poison.

So the conditions when discard fails, could be something like:

    if (block->fd >= 0 && (qemu_ram_is_shared(block) ||
        (length > TARGET_PAGE_SIZE))) {
        /* punch hole is mandatory, mmap() cannot possibly zap our page */
        error_report("%spage recovery failure addr: "
                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
                     (length > TARGET_PAGE_SIZE) ? "large " : "",
                     length, addr);
        exit(1);
    }


>> +                    flags = MAP_FIXED;
>> +                    flags |= block->flags & RAM_SHARED ?
>> +                             MAP_SHARED : MAP_PRIVATE;
>> +                    flags |= block->flags & RAM_NORESERVE ? 
>> MAP_NORESERVE : 0;
>> +                    prot = PROT_READ;
>> +                    prot |= block->flags & RAM_READONLY ? 0 : 
>> PROT_WRITE;
>> +                    if (block->fd >= 0) {
>> +                        area = mmap(vaddr, length, prot, flags, 
>> block->fd,
>> +                                    offset + block->fd_offset);
>> +                    } else {
>> +                        flags |= MAP_ANONYMOUS;
>> +                        area = mmap(vaddr, length, prot, flags, -1, 0);
>> +                    }
>> +                    if (area != vaddr) {
>> +                        error_report("Could not remap addr: "
>> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> +                                     length, addr);
>> +                        exit(1);
>> +                    }
>> +                    memory_try_enable_merging(vaddr, length);
>> +                    qemu_ram_setup_dump(vaddr, length);
> 
> Can we factor the mmap hack out into a separate helper function to clean 
> this up a bit?

Sure, I'll do that.

Re: [PATCH v2 2/7] system/physmem: poisoned memory discard on reboot

Posted by David Hildenbrand 1 week, 4 days ago

>> For shared memory we really need it.
>>
>> Private file-backed is weird ... because we don't know if the shared or
>> the private page is problematic ... :(
> 
> 
> I agree with you, and we have to decide when should we bail out if
> ram_block_discard_range() doesn't work.
> According to me, if discard doesn't work and we are dealing with
> file-backed largepages (shared or not) we have to exit, because the
> fallocate is mandatory. It is the case with hugetlbfs.
>
> In the non-file-backed case, or the file-backed non-largepage private
> case, according to me we can trust the mmap() method to put everything
> back in place for the VM reset to work as expected.
> Are there aspects I don't see, and for which mmap + the remap handler is
> not sufficient and we should also bail out here ?

mmap() will only zap anonymous pages, no pagecache pages. See below.

>>
>> Maybe we should just do:
>>
>> if (block->fd >= 0) {
>>       /* mmap(MAP_FIXED) cannot reliably zap our problematic page. */
>>       error_report(...);
>>       exit(-1);
>> }
>>
>> Or alternatively
>>
>> if (block->fd >= 0 && qemu_ram_is_shared(block)) {
>>       /* mmap() cannot possibly zap our problematic page. */
>>       error_report(...);
>>       exit(-1);
>> } else if (block->fd >= 0) {
>>       /*
>>        * MAP_PRIVATE file-backed ... mmap() can only zap the private
>>        * page, not the shared one ... we don't know which one is
>>        * problematic.
>>        */
>>       warn_report(...);
>> }
> 
> I also agree that any file-backed/shared case should bail out if discard
> (fallocate) fails, no matter whether large or standard pages are used.
> 
> In the case of file-backed private standard pages, I think that a poison
> on the private page can be fixed with a new mmap.
> According to me, there are 2 cases to consider: at the moment the poison
> is seen, the page was dirty (so it means that it was a pure private
> page), or the page was not dirty, and in this case the poison could
> replace this non-dirty page with a new copy of the file content.
> In both cases, I'd say that the remap should clean up the poison.

Let's assume we have mmap(MAP_PRIVATE, fd). The following scenarios are
possible:

(a) We only have a pagecache page (never written) that is poisoned
	-> mmap(MAP_FIXED) cannot resolve that

(b) We only have an anonymous page (e.g., pagecache truncated, or if the
     hugetlb file was empty) that is poisoned
	-> mmap(MAP_FIXED) can resolve that

(c) We have an anonymous and a pagecache page (written -> COW).
(c1) Anonymous page is poisoned -> mmap(MAP_FIXED) can resolve that
(c2) Pagecache page is poisoned -> mmap(MAP_FIXED) cannot resolve that


So mmap(MAP_FIXED) cannot sort out all cases. In practice, (a) and (c2) 
are uncommon, but possible.

(b) is common with hugetlb. (a) and (c) are uncommon with hugetlb, just 
because of the nature of hugetlb pages being a scarce resource.

And IIRC, (b) with hugetlb should be sorted out with
mmap(MAP_FIXED). Please double-check.
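
The scenarios (a), (b), (c1) and (c2) above can be condensed into a small predicate. This is only an illustrative model of the reasoning, not kernel or QEMU code; `struct private_mapping_state` and `remap_can_clear_poison()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Which pages back a MAP_PRIVATE file mapping at the poisoned address. */
struct private_mapping_state {
    bool has_pagecache_page; /* file content is present in the pagecache */
    bool has_anon_page;      /* a private anonymous copy exists */
    bool anon_is_poisoned;   /* only meaningful when both pages exist */
};

/*
 * mmap(MAP_FIXED) drops the anonymous page but leaves the pagecache page
 * in place, so it only helps when the anonymous page is the poisoned one.
 */
static bool remap_can_clear_poison(struct private_mapping_state s)
{
    if (!s.has_anon_page) {
        return false;              /* (a): the pagecache page is poisoned */
    }
    if (!s.has_pagecache_page) {
        return true;               /* (b): the anonymous page is poisoned */
    }
    return s.anon_is_poisoned;     /* (c1): true, (c2): false */
}
```

In this model the remap fallback is sufficient exactly for (b) and (c1), which matches the list above.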

> 
> So the conditions when discard fails, could be something like:
> 
>      if (block->fd >= 0 && (qemu_ram_is_shared(block) ||
>          (length > TARGET_PAGE_SIZE))) {
>          /* punch hole is mandatory, mmap() cannot possibly zap our page*/
>           error_report("%spage recovery failure addr: "
>                        RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>                        (length > TARGET_PAGE_SIZE) ? "large " : "",
>                        length, addr);

I'm not sure if we should be special-casing hugetlb.

If we want to be 100% sure, we will do

if (block->fd >= 0) {
	error_report();
	exit(1);
}

But we could decide to be "nice" to hugetlb and assume (b) for them 
above: that is, we would do

/*
  * mmap() cannot zap pagecache pages, only anonymous pages. As soon as
  * we might have pagecache pages involved (either private or shared
  * mapping), we must be careful. However, MAP_PRIVATE on empty hugetlb
  * files is common, and extremely uncommon on non-empty hugetlb files,
  * so we'll special-case them here.
  */
if (block->fd >= 0 && (qemu_ram_is_shared(block) ||
     length == TARGET_PAGE_SIZE)) {
	...
}

[in practice, we could use /proc/self/pagemap to see if we map an 
anonymous page ... but I'd rather not go down that path just yet]

But, in the end the expectation is that madvise()+fallocate() will 
usually not fail.
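
To compare the bail-out conditions discussed in this thread side by side, here is a small sketch that expresses each one as a predicate answering "is a failed discard fatal?". The struct, the function names and the 4 KiB `DEMO_TARGET_PAGE_SIZE` are assumptions made for illustration; only the boolean conditions are taken from the messages above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_TARGET_PAGE_SIZE 4096u /* assumption: 4 KiB base pages */

/* Hypothetical stand-in for the RAMBlock fields the checks look at. */
struct policy_block {
    int fd;      /* >= 0 when the block is file-backed */
    bool shared; /* true for MAP_SHARED, false for MAP_PRIVATE */
};

/* "100% sure" variant: every file-backed block is fatal. */
static bool fatal_strict(const struct policy_block *b, size_t length)
{
    (void)length;
    return b->fd >= 0;
}

/* Variant from the earlier reply: shared, or large pages, is fatal. */
static bool fatal_shared_or_large(const struct policy_block *b, size_t length)
{
    return b->fd >= 0 && (b->shared || length > DEMO_TARGET_PAGE_SIZE);
}

/* "Nice to hugetlb" variant: shared, or base pages, is fatal. */
static bool fatal_shared_or_base(const struct policy_block *b, size_t length)
{
    return b->fd >= 0 && (b->shared || length == DEMO_TARGET_PAGE_SIZE);
}
```

The three variants agree for anonymous memory (never fatal) and for shared file-backed memory (always fatal). They differ only for private file-backed blocks: the second variant exits for large pages, while the third exits for base pages and lets private hugetlb fall back to mmap().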

-- 
Cheers,

David / dhildenb

[PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

Posted by William Roche 2 weeks, 2 days ago

From: William Roche <william.roche@oracle.com>

When an entire large page is impacted by an error (hugetlbfs case),
report the size and location of this large memory hole more precisely
by giving a warning message when this page is first hit:
Memory error: Losing a large page (size: X) at QEMU addr Y and GUEST addr Z

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c      | 9 ++++++++-
 include/sysemu/kvm_int.h | 4 +++-
 target/arm/kvm.c         | 2 +-
 target/i386/kvm/kvm.c    | 2 +-
 4 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 6dd06f5edf..a572437115 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1284,7 +1284,7 @@ static void kvm_unpoison_all(void *param)
     }
 }
 
-void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, void *ha, hwaddr gpa)
 {
     HWPoisonPage *page;
     size_t sz = qemu_ram_pagesize_from_addr(ram_addr);
@@ -1301,6 +1301,13 @@ void kvm_hwpoison_page_add(ram_addr_t ram_addr)
     page->ram_addr = ram_addr;
     page->page_size = sz;
     QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
+
+    if (sz > TARGET_PAGE_SIZE) {
+        gpa = ROUND_DOWN(gpa, sz);
+        ha = (void *)ROUND_DOWN((uint64_t)ha, sz);
+        warn_report("Memory error: Loosing a large page (size: %zu) "
+            "at QEMU addr %p and GUEST addr 0x%" HWADDR_PRIx, sz, ha, gpa);
+    }
 }
 
 bool kvm_hwpoisoned_mem(void)
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index a1e72763da..ee34f1d225 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -178,10 +178,12 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size);
  *
  * Parameters:
  *  @ram_addr: the address in the RAM for the poisoned page
+ *  @hva: host virtual address aka QEMU addr
+ *  @gpa: guest physical address aka GUEST addr
  *
  * Add a poisoned page to the list
  *
  * Return: None.
  */
-void kvm_hwpoison_page_add(ram_addr_t ram_addr);
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, void *hva, hwaddr gpa);
 #endif
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index f1f1b5b375..aae66dba41 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2359,7 +2359,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-            kvm_hwpoison_page_add(ram_addr);
+            kvm_hwpoison_page_add(ram_addr, addr, paddr);
             /*
              * If this is a BUS_MCEERR_AR, we know we have been called
              * synchronously from the vCPU thread, so we can easily
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index fd9f198892..fd7cd7008e 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -753,7 +753,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-            kvm_hwpoison_page_add(ram_addr);
+            kvm_hwpoison_page_add(ram_addr, addr, paddr);
             kvm_mce_inject(cpu, paddr, code);
 
             /*
-- 
2.43.5

Re: [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

Posted by David Hildenbrand 1 week, 4 days ago

On 07.11.24 11:21, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> When an entire large page is impacted by an error (hugetlbfs case),
> report better the size and location of this large memory hole, so
> give a warning message when this page is first hit:
> Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST addr Z
> 

Hm, I wonder if we really want to special-case hugetlb here.

Why not make the warning independent of the underlying page size?

-- 
Cheers,

David / dhildenb

Re: [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

Posted by William Roche 1 week, 4 days ago

On 11/12/24 12:13, David Hildenbrand wrote:
> On 07.11.24 11:21, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> When an entire large page is impacted by an error (hugetlbfs case),
>> report better the size and location of this large memory hole, so
>> give a warning message when this page is first hit:
>> Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST 
>> addr Z
>>
> 
> Hm, I wonder if we really want to special-case hugetlb here.
> 
> Why not make the warning independent of the underlying page size?

We already have a warning provided by Qemu (in kvm_arch_on_sigbus_vcpu()):

Guest MCE Memory Error at QEMU addr Y and GUEST addr Z of type 
BUS_MCEERR_AR/_AO injected

The one I suggest is an additional message provided before the above 
message.

Here is an example:
qemu-system-x86_64: warning: Memory error: Loosing a large page (size: 
2097152) at QEMU addr 0x7fdd7d400000 and GUEST addr 0x11600000
qemu-system-x86_64: warning: Guest MCE Memory Error at QEMU addr 
0x7fdd7d400000 and GUEST addr 0x11600000 of type BUS_MCEERR_AO injected

In my opinion, this additional message for the large page case will help to 
better understand the probable sudden proliferation of memory errors 
that can be reported by Qemu on the impacted range.
Not only will it help the machine administrator identify that a single 
memory error had this large impact, it can also help us to better 
measure the impact of fixing the large page memory error support in the 
field (in the future).

These are some reasons why I do think this large page specific message 
can be useful.
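
To illustrate how the reported addresses relate to the faulting ones, here is a small standalone sketch of the rounding step. The helper name round_down_pow2 and the faulting addresses used in the note below are hypothetical; the patch itself relies on QEMU's ROUND_DOWN macro.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round addr down to the start of the page containing it.
 * page_size must be a non-zero power of two (e.g. 2 MiB or 1 GiB).
 */
static uint64_t round_down_pow2(uint64_t addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}
```

With a 2 MiB page (2097152 bytes), a hypothetical faulting host address 0x7fdd7d412345 rounds down to 0x7fdd7d400000 and a guest address 0x11612345 rounds down to 0x11600000, which are the values shown in the example messages above.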

Re: [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

Posted by David Hildenbrand 1 week, 4 days ago

On 12.11.24 19:17, William Roche wrote:
> On 11/12/24 12:13, David Hildenbrand wrote:
>> On 07.11.24 11:21, William Roche wrote:
>>> From: William Roche <william.roche@oracle.com>
>>>
>>> When an entire large page is impacted by an error (hugetlbfs case),
>>> report better the size and location of this large memory hole, so
>>> give a warning message when this page is first hit:
>>> Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST
>>> addr Z
>>>
>>
>> Hm, I wonder if we really want to special-case hugetlb here.
>>
>> Why not make the warning independent of the underlying page size?
> 
> We already have a warning provided by Qemu (in kvm_arch_on_sigbus_vcpu()):
> 
> Guest MCE Memory Error at QEMU addr Y and GUEST addr Z of type
> BUS_MCEERR_AR/_AO injected
> 
> The one I suggest is an additional message provided before the above
> message.
> 
> Here is an example:
> qemu-system-x86_64: warning: Memory error: Loosing a large page (size:
> 2097152) at QEMU addr 0x7fdd7d400000 and GUEST addr 0x11600000
> qemu-system-x86_64: warning: Guest MCE Memory Error at QEMU addr
> 0x7fdd7d400000 and GUEST addr 0x11600000 of type BUS_MCEERR_AO injected
> 

Hm, I think we should definitely be including the size in the existing 
one. That code was written without huge pages in mind.

We should similarly warn in the arm implementation (where I don't see a 
similar message yet).

> 
> According to me, this large page case additional message will help to
> better understand the probable sudden proliferation of memory errors
> that can be reported by Qemu on the impacted range.
> Not only will the machine administrator identify better that a single
> memory error had this large impact, it can also help us to better
> measure the impact of fixing the large page memory error support in the
> field (in the future).

What about extending the existing one to something like

warning: Guest MCE Memory Error at QEMU addr $ADDR and GUEST $PADDR of 
type BUS_MCEERR_AO and size $SIZE (large page) injected


With the "large page" hint you can highlight that this is special.


On a related note ...I think we have a problem. Assume we got a SIGBUS 
on a huge page (e.g., somewhere in a 1 GiB page).

We will call kvm_mce_inject(cpu, paddr, code) / 
acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)

But where is the size information? :// Won't the VM simply assume that 
there was a MCE on a single 4k page starting at paddr?

I'm not sure if we can inject ranges, or if we would have to issue one 
MCE per page ... hm, what's your take on this?


-- 
Cheers,

David / dhildenb

Re: [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

Posted by William Roche 1 week, 1 day ago

Thanks for the feedback on the patches, I'll send a new version in the 
coming week.

But I wanted to answer right away the questions you asked on this specific 
patch, as they relate to the importance of fixing the handling of large 
page failures.

On 11/12/24 23:22, David Hildenbrand wrote:
> On 12.11.24 19:17, William Roche wrote:
>> On 11/12/24 12:13, David Hildenbrand wrote:
>>> On 07.11.24 11:21, William Roche wrote:
>>>> From: William Roche <william.roche@oracle.com>
>>>>
>>>> When an entire large page is impacted by an error (hugetlbfs case),
>>>> report better the size and location of this large memory hole, so
>>>> give a warning message when this page is first hit:
>>>> Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST
>>>> addr Z
>>>>
>>>
>>> Hm, I wonder if we really want to special-case hugetlb here.
>>>
>>> Why not make the warning independent of the underlying page size?
>>
>> We already have a warning provided by Qemu (in 
>> kvm_arch_on_sigbus_vcpu()):
>>
>> Guest MCE Memory Error at QEMU addr Y and GUEST addr Z of type
>> BUS_MCEERR_AR/_AO injected
>>
>> The one I suggest is an additional message provided before the above
>> message.
>>
>> Here is an example:
>> qemu-system-x86_64: warning: Memory error: Loosing a large page (size:
>> 2097152) at QEMU addr 0x7fdd7d400000 and GUEST addr 0x11600000
>> qemu-system-x86_64: warning: Guest MCE Memory Error at QEMU addr
>> 0x7fdd7d400000 and GUEST addr 0x11600000 of type BUS_MCEERR_AO injected
>>
> 
> Hm, I think we should definitely be including the size in the existing 
> one. That code was written without huge pages in mind.

Yes we can do that, and get the page size at this level to pass as a 
'page_size' argument to kvm_hwpoison_page_add().

It would make the message longer as we will have the extra information 
about the large page on all messages when an error impacts a large page.
We could change the messages only when we are dealing with a large page, 
so that the standard (4k) case isn't modified.

> 
> We should similarly warn in the arm implementation (where I don't see a 
> similar message yet).

Ok, I'll also add a message for the ARM platform.

>>
>> According to me, this large page case additional message will help to
>> better understand the probable sudden proliferation of memory errors
>> that can be reported by Qemu on the impacted range.
>> Not only will the machine administrator identify better that a single
>> memory error had this large impact, it can also help us to better
>> measure the impact of fixing the large page memory error support in the
>> field (in the future).
> 
> What about extending the existing one to something like
> 
> warning: Guest MCE Memory Error at QEMU addr $ADDR and GUEST $PADDR of 
> type BUS_MCEERR_AO and size $SIZE (large page) injected
> 
> 
> With the "large page" hint you can highlight that this is special.

Right, we can do it that way. However, it also gives the impression that we 
somehow inject errors on a large range of the memory, which is not the 
case. I'll send a proposal with a different formulation, so that you can 
choose.

> On a related note ...I think we have a problem. Assume we got a SIGBUS 
> on a huge page (e.g., somewhere in a 1 GiB page).
> 
> We will call kvm_mce_inject(cpu, paddr, code) / 
> acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)
> 
> But where is the size information? :// Won't the VM simply assume that 
> there was a MCE on a single 4k page starting at paddr?

This is absolutely right!
It's exactly what happens: the VM kernel receives the information and 
considers that only the impacted page has to be poisoned.

That's also the reason why Qemu repeats the error injections every time 
the poisoned large page is accessed (for all other touched 4k pages 
located on this "memory hole").

> 
> I'm not sure if we can inject ranges, or if we would have to issue one 
> MCE per page ... hm, what's your take on this?

I don't know of any size information about a memory error reported by 
the hardware. The kernel doesn't seem to expect any such information.
It explains why there is no impact/blast size information provided when 
an error is relayed to the VM.

We could take the "memory hole" size into account in Qemu, but repeating 
error injections is not going to help a lot either: we'd need to give 
the VM some time to deal with an error injection before producing a new 
error for the next page, etc. In the case (x86 only) where an 
asynchronous error is relayed with BUS_MCEERR_AO, we would also have to 
repeat the error for all the 4k pages located on the lost large page.
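
To put numbers on the "one error per page" idea, here is a trivial standalone sketch. The function name injections_needed is hypothetical, and it assumes one injected error per 4k guest page, which is only the scenario discussed here and not something Qemu does.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of per-page error injections that would be needed to cover one
 * lost huge page, assuming one injection per base page.
 */
static uint64_t injections_needed(uint64_t huge_page_size,
                                  uint64_t base_page_size)
{
    assert(base_page_size != 0 && huge_page_size % base_page_size == 0);
    return huge_page_size / base_page_size;
}
```

For 4k base pages this gives 512 injections for a 2 MiB page and 262144 for a 1 GiB page, each of which the VM would need time to process.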

We can see that the Linux kernel has some mechanisms to deal with an 
occasional 4k page loss, but a larger blast is very likely to crash the VM 
(which is fine). And as a significant part of the memory is no longer 
accessible, dealing with the error itself can be impaired and we 
increase the risk of losing data, even though most of the memory on the 
large page could still be used.

Now if we can recover the 'still valid' memory of the impacted large 
page, we can significantly reduce this blast and give a much better 
chance to the VM to survive the incident or crash more gracefully.

I've looked at the project you pointed me to, which is not ready to be 
adopted:
https://lore.kernel.org/linux-mm/20240924043924.3562257-2-jiaqiyan@google.com/T/

But we see that this large page enhancement is needed, sometimes just 
to give the VM a chance to survive a little longer before being 
terminated or moved.
Injecting multiple MCEs or ACPI error records doesn't help, in my opinion.

William.

Re: [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

Posted by David Hildenbrand 5 days, 18 hours ago

>> Hm, I think we should definitely be including the size in the existing
>> one. That code was written without huge pages in mind.
> 
> Yes we can do that, and get the page size at this level to pass as a
> 'page_size' argument to kvm_hwpoison_page_add().
> 
> It would make the message longer as we will have the extra information
> about the large page on all messages when an error impacts a large page.
> We could change the messages only when we are dealing with a large page,
> so that the standard (4k) case isn't modified.

Right. And likely we should call it "huge page" instead, which is the 
Linux term for anything larger than a single page.

[...]

>>
>> With the "large page" hint you can highlight that this is special.
> 
> Right, we can do it that way. It also gives the impression that we
> somehow inject errors on a large range of the memory. Which is not the
> case. I'll send a proposal with a different formulation, so that you can
> choose.
> 

Makes sense.

> 
> 
>> On a related note ...I think we have a problem. Assume we got a SIGBUS
>> on a huge page (e.g., somewhere in a 1 GiB page).
>>
>> We will call kvm_mce_inject(cpu, paddr, code) /
>> acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)
>>
>> But where is the size information? :// Won't the VM simply assume that
>> there was a MCE on a single 4k page starting at paddr?
> 
> This is absolutely right !
> It's exactly what happens: The VM kernel received the information and
> considers that only the impacted page has to be poisoned.
> 
> That's also the reason why Qemu repeats the error injections every time
> the poisoned large page is accessed (for all other touched 4k pages
> located on this "memory hole").

:/

So we always get from Linux the full 1Gig range and always report the 
first 4k page essentially, on any such access, right?


BTW, should we handle duplicates in our poison list?
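
To sketch what handling duplicates could look like: the standalone example below keys each entry on the start of the (huge) page, so repeated faults anywhere in the same page map to a single entry. It deliberately uses a plain singly linked list with hypothetical names (struct poison_entry, poison_list_add_unique) instead of QEMU's QLIST and HWPoisonPage, and it says nothing about what the existing list code already does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical list entry: one per poisoned (possibly huge) page. */
struct poison_entry {
    uint64_t ram_addr;   /* start of the poisoned page */
    uint64_t page_size;  /* size of the backing page */
    struct poison_entry *next;
};

/*
 * Record the page containing ram_addr unless it is already listed.
 * Returns true if a new entry was added, false for a duplicate.
 */
static bool poison_list_add_unique(struct poison_entry **head,
                                   uint64_t ram_addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    uint64_t start = ram_addr & ~(page_size - 1);

    for (struct poison_entry *e = *head; e != NULL; e = e->next) {
        if (e->ram_addr == start) {
            return false; /* this page is already on the list */
        }
    }

    struct poison_entry *e = malloc(sizeof(*e));
    if (e == NULL) {
        abort();
    }
    e->ram_addr = start;
    e->page_size = page_size;
    e->next = *head;
    *head = e;
    return true;
}
```

Keying on the page start rather than the raw faulting address is the design choice that matters here: without it, every distinct 4k offset touched inside the same huge page would add its own entry.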

> 
>>
>> I'm not sure if we can inject ranges, or if we would have to issue one
>> MCE per page ... hm, what's your take on this?
> 
> I don't know of any size information about a memory error reported by
> the hardware. The kernel doesn't seem to expect any such information.
> It explains why there is no impact/blast size information provided when
> an error is relayed to the VM.
> 
> We could take the "memory hole" size into account in Qemu, but repeating
> error injections is not going to help a lot either: We'd need to give
> the VM some time to deal with an error injection before producing a new
> error for the next page etc... in the case (x86 only) where an

I had the same thoughts.

> asynchronous error is relayed with BUS_MCEERR_AO, we would also have to
> repeat the error for all the 4k pages located on the lost large page too.
> 
> We can see that the Linux kernel has some mechanisms to deal with a
> seldom 4k page loss, but a larger blast is very likely to crash the VM
> (which is fine).

Right, and that will inevitably happen when we get an MCE on a 1 GiB 
hugetlb page, correct? The whole thing will be inaccessible.

> And as a significant part of the memory is no longer
> accessible, dealing with the error itself can be impaired and we
> increase the risk of losing data, even though most of the memory on the
> large page could still be used.
> 
> Now if we can recover the 'still valid' memory of the impacted large
> page, we can significantly reduce this blast and give a much better
> chance to the VM to survive the incident or crash more gracefully.

Right. That cannot be sorted out in user space alone, unfortunately.

> 
> I've looked at the project you indicated me, which is not ready to be
> adopted:
> https://lore.kernel.org/linux-mm/20240924043924.3562257-2-jiaqiyan@google.com/T/
> 

Yes, that goes in a better direction, though.

> But we see that, this large page enhancement is needed, sometimes just
> to give a chance to the VM to survive a little longer before being
> terminated or moved.
> Injecting multiple MCEs or ACPI error records doesn't help, according to me.

I suspect that in most cases, when we get an MCE on a 1Gig page in the 
hypervisor, our running Linux guest will soon crash, because it really 
lost 1 Gig of contiguous memory. :(

-- 
Cheers,

David / dhildenb

[PATCH v2 4/7] numa: Introduce and use ram_block_notify_remap()

Posted by William Roche 2 weeks, 2 days ago

From: David Hildenbrand <david@redhat.com>

Notify registered listeners about the remap at the end of
qemu_ram_remap() so e.g., a memory backend can re-apply its
settings correctly.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 hw/core/numa.c         | 11 +++++++++++
 include/exec/ramlist.h |  3 +++
 system/physmem.c       |  1 +
 3 files changed, 15 insertions(+)

diff --git a/hw/core/numa.c b/hw/core/numa.c
index 1b5f44baea..4ca67db483 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -895,3 +895,14 @@ void ram_block_notify_resize(void *host, size_t old_size, size_t new_size)
         }
     }
 }
+
+void ram_block_notify_remap(void *host, size_t offset, size_t size)
+{
+    RAMBlockNotifier *notifier;
+
+    QLIST_FOREACH(notifier, &ram_list.ramblock_notifiers, next) {
+        if (notifier->ram_block_remapped) {
+            notifier->ram_block_remapped(notifier, host, offset, size);
+        }
+    }
+}
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
index d9cfe530be..c1dc785a57 100644
--- a/include/exec/ramlist.h
+++ b/include/exec/ramlist.h
@@ -72,6 +72,8 @@ struct RAMBlockNotifier {
                               size_t max_size);
     void (*ram_block_resized)(RAMBlockNotifier *n, void *host, size_t old_size,
                               size_t new_size);
+    void (*ram_block_remapped)(RAMBlockNotifier *n, void *host, size_t offset,
+                               size_t size);
     QLIST_ENTRY(RAMBlockNotifier) next;
 };
 
@@ -80,6 +82,7 @@ void ram_block_notifier_remove(RAMBlockNotifier *n);
 void ram_block_notify_add(void *host, size_t size, size_t max_size);
 void ram_block_notify_remove(void *host, size_t size, size_t max_size);
 void ram_block_notify_resize(void *host, size_t old_size, size_t new_size);
+void ram_block_notify_remap(void *host, size_t offset, size_t size);
 
 GString *ram_block_format(void);
 
diff --git a/system/physmem.c b/system/physmem.c
index dfea120cc5..e72ca31451 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2228,6 +2228,7 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                     memory_try_enable_merging(vaddr, length);
                     qemu_ram_setup_dump(vaddr, length);
                 }
+                ram_block_notify_remap(block->host, offset, length);
             }
         }
     }
-- 
2.43.5

[PATCH v2 5/7] hostmem: Factor out applying settings

Posted by William Roche 2 weeks, 2 days ago

From: David Hildenbrand <david@redhat.com>

We want to reuse the functionality when remapping or resizing RAM.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 backends/hostmem.c | 155 ++++++++++++++++++++++++---------------------
 1 file changed, 82 insertions(+), 73 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index 181446626a..bf85d716e5 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -36,6 +36,87 @@ QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_BIND != MPOL_BIND);
 QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_INTERLEAVE != MPOL_INTERLEAVE);
 #endif
 
+static void host_memory_backend_apply_settings(HostMemoryBackend *backend,
+                                               void *ptr, uint64_t size,
+                                               Error **errp)
+{
+    bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
+
+    if (backend->merge) {
+        qemu_madvise(ptr, size, QEMU_MADV_MERGEABLE);
+    }
+    if (!backend->dump) {
+        qemu_madvise(ptr, size, QEMU_MADV_DONTDUMP);
+    }
+#ifdef CONFIG_NUMA
+    unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
+    /* lastbit == MAX_NODES means maxnode = 0 */
+    unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
+    /*
+     * Ensure policy won't be ignored in case memory is preallocated
+     * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
+     * this doesn't catch hugepage case.
+     */
+    unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
+    int mode = backend->policy;
+
+    /*
+     * Check for invalid host-nodes and policies and give more verbose
+     * error messages than mbind().
+     */
+    if (maxnode && backend->policy == MPOL_DEFAULT) {
+        error_setg(errp, "host-nodes must be empty for policy default,"
+                   " or you should explicitly specify a policy other"
+                   " than default");
+        return;
+    } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
+        error_setg(errp, "host-nodes must be set for policy %s",
+                   HostMemPolicy_str(backend->policy));
+        return;
+    }
+
+    /*
+     * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
+     * as argument to mbind() due to an old Linux bug (feature?) which
+     * cuts off the last specified node. This means backend->host_nodes
+     * must have MAX_NODES+1 bits available.
+     */
+    assert(sizeof(backend->host_nodes) >=
+           BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
+    assert(maxnode <= MAX_NODES);
+
+#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
+    if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
+        /*
+         * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
+         * silently picks the first node.
+         */
+        mode = MPOL_PREFERRED_MANY;
+    }
+#endif
+
+    if (maxnode &&
+        mbind(ptr, size, mode, backend->host_nodes, maxnode + 1, flags)) {
+        if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
+            error_setg_errno(errp, errno,
+                             "cannot bind memory to host NUMA nodes");
+            return;
+        }
+    }
+#endif
+    /*
+     * Preallocate memory after the NUMA policy has been instantiated.
+     * This is necessary to guarantee memory is allocated with
+     * specified NUMA policy in place.
+     */
+    if (backend->prealloc &&
+        !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
+                           ptr, size, backend->prealloc_threads,
+                           backend->prealloc_context, async, errp)) {
+        return;
+    }
+}
+
 char *
 host_memory_backend_get_name(HostMemoryBackend *backend)
 {
@@ -337,7 +418,6 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
     void *ptr;
     uint64_t sz;
     size_t pagesize;
-    bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
 
     if (!bc->alloc) {
         return;
@@ -357,78 +437,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
         return;
     }
 
-    if (backend->merge) {
-        qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
-    }
-    if (!backend->dump) {
-        qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
-    }
-#ifdef CONFIG_NUMA
-    unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
-    /* lastbit == MAX_NODES means maxnode = 0 */
-    unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
-    /*
-     * Ensure policy won't be ignored in case memory is preallocated
-     * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
-     * this doesn't catch hugepage case.
-     */
-    unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
-    int mode = backend->policy;
-
-    /* check for invalid host-nodes and policies and give more verbose
-     * error messages than mbind(). */
-    if (maxnode && backend->policy == MPOL_DEFAULT) {
-        error_setg(errp, "host-nodes must be empty for policy default,"
-                   " or you should explicitly specify a policy other"
-                   " than default");
-        return;
-    } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
-        error_setg(errp, "host-nodes must be set for policy %s",
-                   HostMemPolicy_str(backend->policy));
-        return;
-    }
-
-    /*
-     * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
-     * as argument to mbind() due to an old Linux bug (feature?) which
-     * cuts off the last specified node. This means backend->host_nodes
-     * must have MAX_NODES+1 bits available.
-     */
-    assert(sizeof(backend->host_nodes) >=
-           BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
-    assert(maxnode <= MAX_NODES);
-
-#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
-    if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
-        /*
-         * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
-         * silently picks the first node.
-         */
-        mode = MPOL_PREFERRED_MANY;
-    }
-#endif
-
-    if (maxnode &&
-        mbind(ptr, sz, mode, backend->host_nodes, maxnode + 1, flags)) {
-        if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
-            error_setg_errno(errp, errno,
-                             "cannot bind memory to host NUMA nodes");
-            return;
-        }
-    }
-#endif
-    /*
-     * Preallocate memory after the NUMA policy has been instantiated.
-     * This is necessary to guarantee memory is allocated with
-     * specified NUMA policy in place.
-     */
-    if (backend->prealloc && !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
-                                                ptr, sz,
-                                                backend->prealloc_threads,
-                                                backend->prealloc_context,
-                                                async, errp)) {
-        return;
-    }
+    host_memory_backend_apply_settings(backend, ptr, sz, errp);
 }
 
 static bool
-- 
2.43.5

[PATCH v2 6/7] hostmem: Handle remapping of RAM

Posted by William Roche 2 weeks, 2 days ago

From: David Hildenbrand <david@redhat.com>

Let's register a RAM block notifier and react on remap notifications.
Simply re-apply the settings. Warn only when something goes wrong.

Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. Could be
that hostmem is still missing to update that flag ...

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 backends/hostmem.c       | 29 +++++++++++++++++++++++++++++
 include/sysemu/hostmem.h |  1 +
 2 files changed, 30 insertions(+)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index bf85d716e5..fbd8708664 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -361,11 +361,32 @@ static void host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
     backend->prealloc_threads = value;
 }
 
+static void host_memory_backend_ram_remapped(RAMBlockNotifier *n, void *host,
+                                             size_t offset, size_t size)
+{
+    HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
+                                              ram_notifier);
+    Error *err = NULL;
+
+    if (!host_memory_backend_mr_inited(backend) ||
+        memory_region_get_ram_ptr(&backend->mr) != host) {
+        return;
+    }
+
+    host_memory_backend_apply_settings(backend, host + offset, size, &err);
+    if (err) {
+        warn_report_err(err);
+    }
+}
+
 static void host_memory_backend_init(Object *obj)
 {
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
     MachineState *machine = MACHINE(qdev_get_machine());
 
+    backend->ram_notifier.ram_block_remapped = host_memory_backend_ram_remapped;
+    ram_block_notifier_add(&backend->ram_notifier);
+
     /* TODO: convert access to globals to compat properties */
     backend->merge = machine_mem_merge(machine);
     backend->dump = machine_dump_guest_core(machine);
@@ -379,6 +400,13 @@ static void host_memory_backend_post_init(Object *obj)
     object_apply_compat_props(obj);
 }
 
+static void host_memory_backend_finalize(Object *obj)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+
+    ram_block_notifier_remove(&backend->ram_notifier);
+}
+
 bool host_memory_backend_mr_inited(HostMemoryBackend *backend)
 {
     /*
@@ -595,6 +623,7 @@ static const TypeInfo host_memory_backend_info = {
     .instance_size = sizeof(HostMemoryBackend),
     .instance_init = host_memory_backend_init,
     .instance_post_init = host_memory_backend_post_init,
+    .instance_finalize = host_memory_backend_finalize,
     .interfaces = (InterfaceInfo[]) {
         { TYPE_USER_CREATABLE },
         { }
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index de47ae59e4..062a68c8fc 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -81,6 +81,7 @@ struct HostMemoryBackend {
     HostMemPolicy policy;
 
     MemoryRegion mr;
+    RAMBlockNotifier ram_notifier;
 };
 
 bool host_memory_backend_mr_inited(HostMemoryBackend *backend);
-- 
2.43.5

Re: [PATCH v2 6/7] hostmem: Handle remapping of RAM

Posted by David Hildenbrand 1 week, 4 days ago

On 07.11.24 11:21, William Roche wrote:
> From: David Hildenbrand <david@redhat.com>
> 
> Let's register a RAM block notifier and react on remap notifications.
> Simply re-apply the settings. Warn only when something goes wrong.
> 
> Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. Could be
> that hostmem is still missing to update that flag ...
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   backends/hostmem.c       | 29 +++++++++++++++++++++++++++++
>   include/sysemu/hostmem.h |  1 +
>   2 files changed, 30 insertions(+)
> 
> diff --git a/backends/hostmem.c b/backends/hostmem.c
> index bf85d716e5..fbd8708664 100644
> --- a/backends/hostmem.c
> +++ b/backends/hostmem.c
> @@ -361,11 +361,32 @@ static void host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
>       backend->prealloc_threads = value;
>   }
>   
> +static void host_memory_backend_ram_remapped(RAMBlockNotifier *n, void *host,
> +                                             size_t offset, size_t size)
> +{
> +    HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
> +                                              ram_notifier);
> +    Error *err = NULL;
> +
> +    if (!host_memory_backend_mr_inited(backend) ||
> +        memory_region_get_ram_ptr(&backend->mr) != host) {
> +        return;
> +    }
> +
> +    host_memory_backend_apply_settings(backend, host + offset, size, &err);
> +    if (err) {
> +        warn_report_err(err);

I wonder if we want to fail hard instead, or have a way to tell the
notifier that something went wrong.

-- 
Cheers,

David / dhildenb

Re: [PATCH v2 6/7] hostmem: Handle remapping of RAM

Posted by William Roche 1 week, 4 days ago

On 11/12/24 14:45, David Hildenbrand wrote:
> On 07.11.24 11:21, William Roche wrote:
>> From: David Hildenbrand <david@redhat.com>
>>
>> Let's register a RAM block notifier and react on remap notifications.
>> Simply re-apply the settings. Warn only when something goes wrong.
>>
>> Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. Could be
>> that hostmem is still missing to update that flag ...
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   backends/hostmem.c       | 29 +++++++++++++++++++++++++++++
>>   include/sysemu/hostmem.h |  1 +
>>   2 files changed, 30 insertions(+)
>>
>> diff --git a/backends/hostmem.c b/backends/hostmem.c
>> index bf85d716e5..fbd8708664 100644
>> --- a/backends/hostmem.c
>> +++ b/backends/hostmem.c
>> @@ -361,11 +361,32 @@ static void 
>> host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
>>       backend->prealloc_threads = value;
>>   }
>> +static void host_memory_backend_ram_remapped(RAMBlockNotifier *n, 
>> void *host,
>> +                                             size_t offset, size_t size)
>> +{
>> +    HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
>> +                                              ram_notifier);
>> +    Error *err = NULL;
>> +
>> +    if (!host_memory_backend_mr_inited(backend) ||
>> +        memory_region_get_ram_ptr(&backend->mr) != host) {
>> +        return;
>> +    }
>> +
>> +    host_memory_backend_apply_settings(backend, host + offset, size, 
>> &err);
>> +    if (err) {
>> +        warn_report_err(err);
> 
> I wonder if we want to fail hard instead, or have a way to tell the 
> notifier that something went wrong.
> 

It depends on what the caller would do with this information. Is there a
way to work around the problem? (I don't think so)
Can the VM continue to run without doing anything about it? (Maybe?)

Currently none of the numa notifiers return errors.

This function is only called from ram_block_notify_remap() in
qemu_ram_remap(), so I would vote for a "fail hard" in the case where the
settings are mandatory to continue.

HTH.

Re: [PATCH v2 6/7] hostmem: Handle remapping of RAM

Posted by David Hildenbrand 1 week, 4 days ago

On 12.11.24 19:17, William Roche wrote:
> On 11/12/24 14:45, David Hildenbrand wrote:
>> On 07.11.24 11:21, William Roche wrote:
>>> From: David Hildenbrand <david@redhat.com>
>>>
>>> Let's register a RAM block notifier and react on remap notifications.
>>> Simply re-apply the settings. Warn only when something goes wrong.
>>>
>>> Note: qemu_ram_remap() will not remap when RAM_PREALLOC is set. Could be
>>> that hostmem is still missing to update that flag ...
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: William Roche <william.roche@oracle.com>
>>> ---
>>>    backends/hostmem.c       | 29 +++++++++++++++++++++++++++++
>>>    include/sysemu/hostmem.h |  1 +
>>>    2 files changed, 30 insertions(+)
>>>
>>> diff --git a/backends/hostmem.c b/backends/hostmem.c
>>> index bf85d716e5..fbd8708664 100644
>>> --- a/backends/hostmem.c
>>> +++ b/backends/hostmem.c
>>> @@ -361,11 +361,32 @@ static void
>>> host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
>>>        backend->prealloc_threads = value;
>>>    }
>>> +static void host_memory_backend_ram_remapped(RAMBlockNotifier *n,
>>> void *host,
>>> +                                             size_t offset, size_t size)
>>> +{
>>> +    HostMemoryBackend *backend = container_of(n, HostMemoryBackend,
>>> +                                              ram_notifier);
>>> +    Error *err = NULL;
>>> +
>>> +    if (!host_memory_backend_mr_inited(backend) ||
>>> +        memory_region_get_ram_ptr(&backend->mr) != host) {
>>> +        return;
>>> +    }
>>> +
>>> +    host_memory_backend_apply_settings(backend, host + offset, size,
>>> &err);
>>> +    if (err) {
>>> +        warn_report_err(err);
>>
>> I wonder if we want to fail hard instead, or have a way to tell the
>> notifier that something went wrong.
>>
> 
> It depends on what the caller would do with this information. Is there a
> way to workaround the problem ? (I don't think so)

Primarily only preallocation will fail, and that ...

> Can the VM continue to run without doing anything about it ? (Maybe?)
> 

... will make QEMU crash at some point later (SIGBUS), which is very
bad.

> Currently all numa notifiers don't return errors.
> 
> This function is only called from ram_block_notify_remap() in
> qemu_ram_remap(), I would vote for a "fail hard" in case where the
> settings are mandatory to continue.

"fail hard" is likely the best approach for now.

-- 
Cheers,

David / dhildenb

[PATCH v2 7/7] system/physmem: Memory settings applied on remap notification

Posted by William Roche 2 weeks, 2 days ago

From: William Roche <william.roche@oracle.com>

Merging and dump settings are handled by the remap notification
in addition to memory policy and preallocation.
If preallocation is set on a memory block, a qemu_prealloc_mem()
call is also needed after ram_block_discard_range() has been used
on this block.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/physmem.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index e72ca31451..72129d5b1b 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2225,8 +2225,6 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                                      length, addr);
                         exit(1);
                     }
-                    memory_try_enable_merging(vaddr, length);
-                    qemu_ram_setup_dump(vaddr, length);
                 }
                 ram_block_notify_remap(block->host, offset, length);
             }
-- 
2.43.5

[PATCH v1 4/4] accel/kvm: Report the loss of a large memory page

Posted by William Roche 1 month ago

From: William Roche <william.roche@oracle.com>

On HW memory error, we need to report better what the impact of this
error is. So when an entire large page is impacted by an error (like the
hugetlbfs case), we give a warning message when this page is first hit:
Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST addr Z
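
The address alignment behind this warning can be sketched as follows. This is
a minimal standalone illustration, not the patch code: round_down_pow2() is a
hypothetical helper that assumes the page size is a power of two, whereas the
patch itself relies on QEMU's ROUND_DOWN macro.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): align an address down to the
 * start of the page that contains it.
 * Assumption: page_size is a non-zero power of two.
 */
static uint64_t round_down_pow2(uint64_t addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}
```

With a 2 MiB page, round_down_pow2(0x40212345, 0x200000) gives 0x40200000, so
any error location inside the same large page would be reported with the same
base address.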

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c      | 9 ++++++++-
 include/sysemu/kvm_int.h | 6 ++++--
 target/arm/kvm.c         | 2 +-
 target/i386/kvm/kvm.c    | 2 +-
 4 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 40117eefa7..bddaf1e981 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1284,7 +1284,7 @@ static void kvm_unpoison_all(void *param)
     }
 }
 
-void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz)
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz, void *ha, hwaddr gpa)
 {
     HWPoisonPage *page;
 
@@ -1300,6 +1300,13 @@ void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz)
     page->ram_addr = ram_addr;
     page->page_size = sz;
     QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
+
+    if (sz > TARGET_PAGE_SIZE) {
+        gpa = ROUND_DOWN(gpa, sz);
+        ha = (void *)ROUND_DOWN((uint64_t)ha, sz);
+        warn_report("Memory error: Loosing a large page (size: %zu) "
+            "at QEMU addr %p and GUEST addr 0x%" HWADDR_PRIx, sz, ha, gpa);
+    }
 }
 
 bool kvm_hwpoisoned_mem(void)
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index d2160be0ae..af569380ca 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -177,12 +177,14 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size);
  * kvm_hwpoison_page_add:
  *
  * Parameters:
- *  @ram_addr: the address in the RAM for the poisoned page
+ *  @addr: the address in the RAM for the poisoned page
  *  @sz: size of the poisoned page as reported by the kernel
+ *  @hva: host virtual address aka QEMU addr
+ *  @gpa: guest physical address aka GUEST addr
  *
  * Add a poisoned page to the list
  *
  * Return: None.
  */
-void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz);
+void kvm_hwpoison_page_add(ram_addr_t addr, size_t sz, void *hva, hwaddr gpa);
 #endif
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 11579e170b..f8eb553f7c 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2363,7 +2363,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr, short addr_lsb)
             if (sz == TARGET_PAGE_SIZE) {
                 sz = qemu_ram_pagesize_from_host(addr);
             }
-            kvm_hwpoison_page_add(ram_addr, sz);
+            kvm_hwpoison_page_add(ram_addr, sz, addr, paddr);
             /*
              * If this is a BUS_MCEERR_AR, we know we have been called
              * synchronously from the vCPU thread, so we can easily
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 71e674bca0..34cfa8b764 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -757,7 +757,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr, short addr_lsb)
             if (sz == TARGET_PAGE_SIZE) {
                 sz = qemu_ram_pagesize_from_host(addr);
             }
-            kvm_hwpoison_page_add(ram_addr, sz);
+            kvm_hwpoison_page_add(ram_addr, sz, addr, paddr);
             kvm_mce_inject(cpu, paddr, code);
 
             /*
-- 
2.43.5

Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

Posted by William Roche 1 month, 2 weeks ago

On 10/9/24 17:45, Peter Xu wrote:
> On Thu, Sep 19, 2024 at 06:52:37PM +0200, William Roche wrote:
>> Hello David,
>>
>> I hope my email from last week answered your questions about:
>>      - retrieving the valid data from the lost hugepage
>>      - the need of smaller pages to replace a failed large page
>>      - the interaction of memory error and VM migration
>>      - the non-symmetrical access to a poisoned memory area after a recovery
>>        Qemu would be able to continue to access the still valid data
>>        location of the formerly poisoned hugepage, but any other entity
>>        mapping the large page would not be allowed to use the location.
>>
>> I understand that this last item _is_ some kind of "inconsistency".
>> So if I want to make sure that a "shared" memory region (used for vhost-user
>> processes, vfio or ivshmem) is not recovered, how can I identify what
>> region(s)
>> of a guest memory could be used for such a shared location ?
>> Is there a way for qemu to identify the memory locations that have been
>> shared ?
> 
> When there's no vIOMMU I think all guest pages need to be shared.  When
> with vIOMMU it depends on what was mapped by the guest drivers, while in
> most sane setups they can still always be shared because the guest OS (if
> Linux) should normally have iommu=pt speeding up kernel drivers.
> 
>>
>> Could you please let me know if there is an entry point I should consider ?
> 
> IMHO it'll still be more reasonable that this issue be tackled from the
> kernel not userspace, simply because it's a shared problem of all
> userspaces rather than QEMU process alone.
> 
> When with that the kernel should guarantee consistencies on different
> processes accessing these pages properly, so logically all these
> complexities should be better done in the kernel once for all.
> 
> There's indeed difficulties on providing it in hugetlbfs with mm community,
> and this is also not the only effort trying to fix 1G page poisoning with
> userspace workarounds, see:
> 
> https://lore.kernel.org/r/20240924043924.3562257-1-jiaqiyan@google.com
> 
> My gut feeling is either hugetlbfs needs to be fixed (with less hope) or
> QEMU in general needs to move over to other file systems on consuming huge
> pages.  Poisoning is not the only driven force, but at least we want to
> also work out postcopy which has similar goal as David said, on being able
> to map hugetlbfs pages differently.
> 
> May consider having a look at gmemfd 1G proposal, posted here:
> 
> https://lore.kernel.org/r/cover.1726009989.git.ackerleytng@google.com
> 
> We probably need that in one way or another for CoCo, and the chance is it
> can easily support non-CoCo with the same interface ultimately.  Then 1G
> hugetlbfs can be abandoned in QEMU.  It'll also need to tackle the same
> challenge here either on page poisoning, or postcopy, with/without QEMU's
> specific solution, because QEMU is also not the only userspace hypervisor.
> 
> Said that, the initial few small patches seem to be standalone small fixes
> which may still be good.  So if you think that's the case you can at least
> consider sending them separately without RFC tag.
> 
> Thanks,

Thank you very much Peter for your answer, pointers and explanations.

I understand and agree that having the kernel deal with huge page
errors is a much better approach.
Not an easy one...

I'll submit a trimmed down version of my first patches to fix some
problems that currently exist in Qemu.

Thanks again,
William.

[RFC RESEND 1/6] accel/kvm: SIGBUS handler should also deal with si_addr_lsb

Posted by William Roche 2 months, 2 weeks ago

From: William Roche <william.roche@oracle.com>

The SIGBUS signal siginfo reporting a HW memory error
provides an si_addr_lsb field indicating the size of the
impacted memory page.
This information should be used to track the hwpoisoned
page sizes.
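
As a rough sketch of how this field maps to a size: si_addr_lsb holds the
least significant valid bit of the reported address, i.e. log2 of the
impacted extent. The helper below is hypothetical (its name and the fallback
parameter are assumptions for illustration) and shifts a size_t rather than
an int so that large values cannot overflow.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper: turn a SIGBUS si_addr_lsb value into a byte count.
 * A missing (<= 0) or out-of-range value falls back to fallback_size,
 * which a caller would typically set to the base page size.
 */
static size_t poison_size_from_lsb(short addr_lsb, size_t fallback_size)
{
    if (addr_lsb <= 0 || (size_t)addr_lsb >= sizeof(size_t) * CHAR_BIT) {
        return fallback_size;
    }
    /* Shift a size_t so that the result is not limited to the int range. */
    return (size_t)1 << addr_lsb;
}
```

For example, a value of 12 maps to 4096 bytes and a value of 21 to 2 MiB.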

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c    | 6 ++++--
 accel/stubs/kvm-stub.c | 4 ++--
 include/qemu/osdep.h   | 5 +++--
 include/sysemu/kvm.h   | 4 ++--
 system/cpus.c          | 6 ++++--
 util/oslib-posix.c     | 3 +++
 6 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 75d11a07b2..409c5d3ce6 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2836,6 +2836,7 @@ void kvm_cpu_synchronize_pre_loadvm(CPUState *cpu)
 #ifdef KVM_HAVE_MCE_INJECTION
 static __thread void *pending_sigbus_addr;
 static __thread int pending_sigbus_code;
+static __thread short pending_sigbus_addr_lsb;
 static __thread bool have_sigbus_pending;
 #endif
 
@@ -3542,7 +3543,7 @@ void kvm_init_cpu_signals(CPUState *cpu)
 }
 
 /* Called asynchronously in VCPU thread.  */
-int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
+int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr, short addr_lsb)
 {
 #ifdef KVM_HAVE_MCE_INJECTION
     if (have_sigbus_pending) {
@@ -3551,6 +3552,7 @@ int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
     have_sigbus_pending = true;
     pending_sigbus_addr = addr;
     pending_sigbus_code = code;
+    pending_sigbus_addr_lsb = addr_lsb;
     qatomic_set(&cpu->exit_request, 1);
     return 0;
 #else
@@ -3559,7 +3561,7 @@ int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
 }
 
 /* Called synchronously (via signalfd) in main thread.  */
-int kvm_on_sigbus(int code, void *addr)
+int kvm_on_sigbus(int code, void *addr, short addr_lsb)
 {
 #ifdef KVM_HAVE_MCE_INJECTION
     /* Action required MCE kills the process if SIGBUS is blocked.  Because
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 8e0eb22e61..80780433d8 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -38,12 +38,12 @@ bool kvm_has_sync_mmu(void)
     return false;
 }
 
-int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
+int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr, short addr_lsb)
 {
     return 1;
 }
 
-int kvm_on_sigbus(int code, void *addr)
+int kvm_on_sigbus(int code, void *addr, short addr_lsb)
 {
     return 1;
 }
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index fe7c3c5f67..838271c4b8 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -585,8 +585,9 @@ struct qemu_signalfd_siginfo {
     uint64_t ssi_stime;   /* System CPU time consumed (SIGCHLD) */
     uint64_t ssi_addr;    /* Address that generated signal
                              (for hardware-generated signals) */
-    uint8_t  pad[48];     /* Pad size to 128 bytes (allow for
-                             additional fields in the future) */
+    uint16_t ssi_addr_lsb;/* Least significant bit of address (SIGBUS) */
+    uint8_t  pad[46];     /* Pad size to 128 bytes (allow for */
+                          /* additional fields in the future) */
 };
 
 int qemu_signalfd(const sigset_t *mask);
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 9cf14ca3d5..21262eb970 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -207,8 +207,8 @@ int kvm_has_gsi_routing(void);
 bool kvm_arm_supports_user_irq(void);
 
 
-int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr);
-int kvm_on_sigbus(int code, void *addr);
+int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr, short addr_lsb);
+int kvm_on_sigbus(int code, void *addr, short addr_lsb);
 
 #ifdef COMPILING_PER_TARGET
 #include "cpu.h"
diff --git a/system/cpus.c b/system/cpus.c
index 1c818ff682..12e630f760 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -376,12 +376,14 @@ static void sigbus_handler(int n, siginfo_t *siginfo, void *ctx)
 
     if (current_cpu) {
         /* Called asynchronously in VCPU thread.  */
-        if (kvm_on_sigbus_vcpu(current_cpu, siginfo->si_code, siginfo->si_addr)) {
+        if (kvm_on_sigbus_vcpu(current_cpu, siginfo->si_code,
+                               siginfo->si_addr, siginfo->si_addr_lsb)) {
             sigbus_reraise();
         }
     } else {
         /* Called synchronously (via signalfd) in main thread.  */
-        if (kvm_on_sigbus(siginfo->si_code, siginfo->si_addr)) {
+        if (kvm_on_sigbus(siginfo->si_code,
+                          siginfo->si_addr, siginfo->si_addr_lsb)) {
             sigbus_reraise();
         }
     }
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 11b35e48fb..64517d1e40 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -767,6 +767,9 @@ void sigaction_invoke(struct sigaction *action,
     } else if (info->ssi_signo == SIGILL || info->ssi_signo == SIGFPE ||
                info->ssi_signo == SIGSEGV || info->ssi_signo == SIGBUS) {
         si.si_addr = (void *)(uintptr_t)info->ssi_addr;
+        if (info->ssi_signo == SIGBUS) {
+            si.si_addr_lsb = (short int)info->ssi_addr_lsb;
+        }
     } else if (info->ssi_signo == SIGCHLD) {
         si.si_pid = info->ssi_pid;
         si.si_status = info->ssi_status;
-- 
2.43.5

[RFC RESEND 2/6] accel/kvm: Keep track of the HWPoisonPage sizes

Posted by William Roche 2 months, 2 weeks ago

From: William Roche <william.roche@oracle.com>

Add the page size information to the hwpoison_page_list elements.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c      | 11 +++++++----
 include/sysemu/kvm.h     |  3 ++-
 include/sysemu/kvm_int.h |  3 ++-
 target/arm/kvm.c         |  5 +++--
 target/i386/kvm/kvm.c    |  5 +++--
 5 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 409c5d3ce6..bcccf80bd7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1200,6 +1200,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
  */
 typedef struct HWPoisonPage {
     ram_addr_t ram_addr;
+    size_t     page_size;
     QLIST_ENTRY(HWPoisonPage) list;
 } HWPoisonPage;
 
@@ -1212,12 +1213,12 @@ static void kvm_unpoison_all(void *param)
 
     QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
         QLIST_REMOVE(page, list);
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr, page->page_size);
         g_free(page);
     }
 }
 
-void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz)
 {
     HWPoisonPage *page;
 
@@ -1228,6 +1229,7 @@ void kvm_hwpoison_page_add(ram_addr_t ram_addr)
     }
     page = g_new(HWPoisonPage, 1);
     page->ram_addr = ram_addr;
+    page->page_size = sz;
     QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
 }
 
@@ -3031,7 +3033,8 @@ int kvm_cpu_exec(CPUState *cpu)
         if (unlikely(have_sigbus_pending)) {
             bql_lock();
             kvm_arch_on_sigbus_vcpu(cpu, pending_sigbus_code,
-                                    pending_sigbus_addr);
+                                    pending_sigbus_addr,
+                                    pending_sigbus_addr_lsb);
             have_sigbus_pending = false;
             bql_unlock();
         }
@@ -3569,7 +3572,7 @@ int kvm_on_sigbus(int code, void *addr, short addr_lsb)
      * we can only get action optional here.
      */
     assert(code != BUS_MCEERR_AR);
-    kvm_arch_on_sigbus_vcpu(first_cpu, code, addr);
+    kvm_arch_on_sigbus_vcpu(first_cpu, code, addr, addr_lsb);
     return 0;
 #else
     return 1;
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 21262eb970..c8c0d52bed 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -383,7 +383,8 @@ bool kvm_vcpu_id_is_valid(int vcpu_id);
 unsigned long kvm_arch_vcpu_id(CPUState *cpu);
 
 #ifdef KVM_HAVE_MCE_INJECTION
-void kvm_arch_on_sigbus_vcpu(CPUState *cpu, int code, void *addr);
+void kvm_arch_on_sigbus_vcpu(CPUState *cpu, int code, void *addr,
+                             short addr_lsb);
 #endif
 
 void kvm_arch_init_irq_routing(KVMState *s);
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 1d8fb1473b..753e4bc6ef 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -168,10 +168,11 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size);
  *
  * Parameters:
  *  @ram_addr: the address in the RAM for the poisoned page
+ *  @sz: size of the poison page as reported by the kernel
  *
  * Add a poisoned page to the list
  *
  * Return: None.
  */
-void kvm_hwpoison_page_add(ram_addr_t ram_addr);
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz);
 #endif
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 849e2e21b3..f62e53e423 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2348,10 +2348,11 @@ int kvm_arch_get_registers(CPUState *cs)
     return ret;
 }
 
-void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
+void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr, short addr_lsb)
 {
     ram_addr_t ram_addr;
     hwaddr paddr;
+    size_t sz = (addr_lsb > 0) ? (1 << addr_lsb) : TARGET_PAGE_SIZE;
 
     assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
 
@@ -2359,7 +2360,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-            kvm_hwpoison_page_add(ram_addr);
+            kvm_hwpoison_page_add(ram_addr, sz);
             /*
              * If this is a BUS_MCEERR_AR, we know we have been called
              * synchronously from the vCPU thread, so we can easily
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 2fa88ef1e3..99b87140cc 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -714,12 +714,13 @@ static void hardware_memory_error(void *host_addr)
     exit(1);
 }
 
-void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
+void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr, short addr_lsb)
 {
     X86CPU *cpu = X86_CPU(c);
     CPUX86State *env = &cpu->env;
     ram_addr_t ram_addr;
     hwaddr paddr;
+    size_t sz = (addr_lsb > 0) ? (1 << addr_lsb) : TARGET_PAGE_SIZE;
 
     /* If we get an action required MCE, it has been injected by KVM
      * while the VM was running.  An action optional MCE instead should
@@ -732,7 +733,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-            kvm_hwpoison_page_add(ram_addr);
+            kvm_hwpoison_page_add(ram_addr, sz);
             kvm_mce_inject(cpu, paddr, code);
 
             /*
-- 
2.43.5

[RFC RESEND 3/6] system/physmem: Remap memory pages on reset based on the page size

Posted by William Roche 2 months, 2 weeks ago

From: William Roche <william.roche@oracle.com>

When the VM reboots, a memory reset is performed calling
qemu_ram_remap() on all hwpoisoned pages.
We take into account the recorded page size to adjust the
size and location of the memory hole.
When a large page is used, we also need to punch a hole
in the backend file to regenerate usable memory, clearing
the HW poisoned section.
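
The size and location adjustment described above can be sketched in
isolation. RemapRange and widen_to_block_page() are hypothetical names used
only for this illustration; the hole punching in the backend file is not
shown here.

```c
#include <stdint.h>

/* Hypothetical pair describing the range that would be remapped. */
typedef struct RemapRange {
    uint64_t offset;   /* offset inside the RAM block */
    uint64_t length;   /* number of bytes to remap */
} RemapRange;

/*
 * If the requested length is smaller than the block's page size, the
 * address may be a subpage of a former large page: widen the range to the
 * whole containing page. Otherwise the range is returned unchanged.
 * Assumption: block_page_size is non-zero.
 */
static RemapRange widen_to_block_page(uint64_t offset, uint64_t length,
                                      uint64_t block_page_size)
{
    RemapRange r = { offset, length };

    if (length < block_page_size) {
        r.offset = offset / block_page_size * block_page_size;
        r.length = block_page_size;
    }
    return r;
}
```

A 4 KiB request at offset 0x201000 in a block backed by 2 MiB pages would
become a 2 MiB request at offset 0x200000.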

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/physmem.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/system/physmem.c b/system/physmem.c
index 94600a33ec..5c176146c0 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2195,6 +2195,11 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
     RAMBLOCK_FOREACH(block) {
         offset = addr - block->offset;
         if (offset < block->max_length) {
+            /* addr could be a subpage of a former large page */
+            if (length < block->page_size) {
+                offset = ROUND_DOWN(offset, block->page_size);
+                length = block->page_size;
+            }
             vaddr = ramblock_ptr(block, offset);
             if (block->flags & RAM_PREALLOC) {
                 ;
@@ -2208,6 +2213,14 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                 prot = PROT_READ;
                 prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
                 if (block->fd >= 0) {
+                    if (length > TARGET_PAGE_SIZE && fallocate(block->fd,
+                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+                        offset + block->fd_offset, length) != 0) {
+                        error_report("Could not recreate the file hole for "
+                                     "addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     length, addr);
+                        exit(1);
+                    }
                     area = mmap(vaddr, length, prot, flags, block->fd,
                                 offset + block->fd_offset);
                 } else {
-- 
2.43.5

[RFC RESEND 4/6] system: Introducing hugetlbfs largepage RAS feature

Posted by William Roche 2 months, 2 weeks ago

From: William Roche <william.roche@oracle.com>

We need to deal with hugetlbfs large pages facing HW errors,
to increase the probability of surviving a memory poisoning.
When an error is detected, the platform kernel marks the entire
hugetlbfs large page as "poisoned" and reports the event to all
potential users using SIGBUS.

This change introduces 2 aspects:
. trying to recover the non-HWPOISON data from the error impacted large page
. splitting this large page into standard sized pages

When Qemu receives this SIGBUS, it will try to recover as much data
as possible from the hugepage backend file (which has to be a MAP_SHARED
mapping) with the help of the following kernel feature:
 linux commit 38c1ddbde6c6 ("hugetlbfs: improve read HWPOISON hugepage")

The impacted hugetlbfs large page is replaced by a set of standard pages.
Each one is populated by reading the backend file: it receives the
recovered content, or is poisoned if its location is unrecoverable.
Any error reading this file results in the corresponding standard
sized page being poisoned. And if this file mapping is not set with
"share=on", the entire replacing set of pages is poisoned.

We pause the VM to perform the memory replacement. To do so we have
to get out of the SIGBUS handler(s). But the signal will be
reraised on VM resume.

The platform kernel may report the beginning of the error impacted
large page in the SIGBUS siginfo data, so we may have to adjust this
poison address information to point to the finer grain actual
poison location based on the replacing standard sized pages.
Aiming at more precise poison information reported to the VM
makes it possible to react better to this situation, improving
the overall RAS of hugetlbfs VMs.
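
The page-by-page recovery idea can be illustrated with a small standalone
sketch. It is not the code of this patch: recover_pages() and its parameters
are hypothetical, and where the description above poisons an unreadable page,
this sketch only zero-fills it and records a flag so that it can run without
privileges.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy npages pages of page_size bytes from fd,
 * starting at file_offset, into dst. A page whose read fails or comes back
 * short is zero-filled and flagged in bad[].
 * Returns the number of flagged pages.
 */
static size_t recover_pages(int fd, off_t file_offset, unsigned char *dst,
                            size_t npages, size_t page_size, bool *bad)
{
    size_t nbad = 0;

    for (size_t i = 0; i < npages; i++) {
        unsigned char *page = dst + i * page_size;
        ssize_t got = pread(fd, page, page_size,
                            file_offset + (off_t)(i * page_size));

        /* Anything other than a full page counts as unrecoverable. */
        bad[i] = (got != (ssize_t)page_size);
        if (bad[i]) {
            memset(page, 0, page_size);
            nbad++;
        }
    }
    return nbad;
}
```

Reading one standard page at a time keeps a failure local: one unreadable
location costs a single small page instead of the whole large page.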

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c    |   7 +
 meson.build            |   2 +
 system/hugetlbfs_ras.c | 546 +++++++++++++++++++++++++++++++++++++++++
 system/hugetlbfs_ras.h |   3 +
 system/meson.build     |   1 +
 system/physmem.c       |  17 ++
 target/arm/kvm.c       |  10 +
 target/i386/kvm/kvm.c  |  10 +
 8 files changed, 596 insertions(+)
 create mode 100644 system/hugetlbfs_ras.c
 create mode 100644 system/hugetlbfs_ras.h

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index bcccf80bd7..6c6841f935 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -57,6 +57,10 @@
 #include <sys/eventfd.h>
 #endif
 
+#ifdef CONFIG_HUGETLBFS_RAS
+#include "system/hugetlbfs_ras.h"
+#endif
+
 /* KVM uses PAGE_SIZE in its definition of KVM_COALESCED_MMIO_MAX. We
  * need to use the real host PAGE_SIZE, as that's what KVM will use.
  */
@@ -1211,6 +1215,9 @@ static void kvm_unpoison_all(void *param)
 {
     HWPoisonPage *page, *next_page;
 
+#ifdef CONFIG_HUGETLBFS_RAS
+    hugetlbfs_ras_empty();
+#endif
     QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
         QLIST_REMOVE(page, list);
         qemu_ram_remap(page->ram_addr, page->page_size);
diff --git a/meson.build b/meson.build
index fbda17c987..03214586c4 100644
--- a/meson.build
+++ b/meson.build
@@ -3029,6 +3029,8 @@ if host_os == 'windows'
   endif
 endif
 
+config_host_data.set('CONFIG_HUGETLBFS_RAS', host_os == 'linux')
+
 ########################
 # Target configuration #
 ########################
diff --git a/system/hugetlbfs_ras.c b/system/hugetlbfs_ras.c
new file mode 100644
index 0000000000..2f7e550f56
--- /dev/null
+++ b/system/hugetlbfs_ras.c
@@ -0,0 +1,546 @@
+/*
+ * Deal with memory hugetlbfs largepage errors in userland.
+ *
+ * Copyright (c) 2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * (at your option) any later version.
+ */
+
+#include "qemu/osdep.h"
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <poll.h>
+#include <errno.h>
+#include <string.h>
+
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "exec/ramblock.h"
+#include "qemu/thread.h"
+#include "qemu/queue.h"
+#include "qemu/error-report.h"
+#include "block/thread-pool.h"
+#include "sysemu/runstate.h"
+#include "sysemu/kvm.h"
+#include "qemu/main-loop.h"
+#include "block/aio-wait.h"
+
+#include "hugetlbfs_ras.h"
+
+/* #define DEBUG_HUGETLBFS_RAS */
+
+#ifdef DEBUG_HUGETLBFS_RAS
+#define DPRINTF(fmt, ...) \
+    do { fprintf(stderr, "lpg_ras[%s]: " fmt, __func__, ## __VA_ARGS__); \
+    } while (0)
+#else
+#define DPRINTF(fmt, ...) do {} while (0)
+#endif
+
+/*
+ * We track the Large Poisoned Pages to be able to:
+ * - Identify if a faulting page is already under replacement
+ * - Trigger a replacement for new pages
+ * - Inform the suspended signal handlers that they can continue
+ */
+typedef enum LPP_state {
+    LPP_SUBMITTED = 1,
+    LPP_PREPARING,
+    LPP_DONE,
+    LPP_FAILED,
+} LPP_state;
+
+typedef struct LargeHWPoisonPage {
+    void     *page_addr; /* hva of the poisoned large page */
+    size_t    page_size;
+    LPP_state page_state;
+    void     *first_poison; /* location of the first poison found */
+    struct timespec creation_time;
+    QLIST_ENTRY(LargeHWPoisonPage) list;
+} LargeHWPoisonPage;
+
+static QLIST_HEAD(, LargeHWPoisonPage) large_hwpoison_page_list =
+    QLIST_HEAD_INITIALIZER(large_hwpoison_page_list);
+static int large_hwpoison_vm_stop; /* indicate that VM is stopping */
+static QemuCond large_hwpoison_cv;
+static QemuCond large_hwpoison_new;
+static QemuCond large_hwpoison_vm_running;
+static QemuMutex large_hwpoison_mtx;
+static QemuThread thread;
+static void *hugetlbfs_ras_listener(void *arg);
+static int vm_running;
+static bool hugetlbfs_ras_initialized;
+static int _PAGE_SIZE = 4096;
+static int _PAGE_SHIFT = 12;
+
+/* size should be a power of 2 */
+static int
+shift(int sz)
+{
+    int e, s = 0;
+
+    for (e = 0; (s < sz) && (e < 31); e++) {
+        s = (1 << e);
+        if (s == sz) {
+            return e;
+        }
+    }
+    return -1;
+}
+
+static int
+hugetlbfs_ras_init(void)
+{
+    _PAGE_SIZE = qemu_real_host_page_size();
+    _PAGE_SHIFT = shift(_PAGE_SIZE);
+    if (_PAGE_SHIFT < 0) {
+        warn_report("No support for hugetlbfs largepage errors: "
+                    "Bad page size (%d)", _PAGE_SIZE);
+        return -EIO;
+    }
+    qemu_cond_init(&large_hwpoison_cv);
+    qemu_cond_init(&large_hwpoison_new);
+    qemu_cond_init(&large_hwpoison_vm_running);
+    qemu_mutex_init(&large_hwpoison_mtx);
+
+    qemu_thread_create(&thread, "hugetlbfs_error", hugetlbfs_ras_listener,
+                       NULL, QEMU_THREAD_DETACHED);
+
+    hugetlbfs_ras_initialized = true;
+    return 0;
+}
+
+bool
+hugetlbfs_ras_use(void)
+{
+    static bool answered;
+
+    if (answered) {
+        return hugetlbfs_ras_initialized;
+    }
+
+    /* XXX we could verify if ras feature should be used or not (on ARM) */
+
+    /* CAP_SYS_ADMIN capability needed for madvise(MADV_HWPOISON) */
+    if (getuid() != 0) {
+        warn_report("Privileges needed to deal with hugetlbfs memory poison");
+    } else {
+        hugetlbfs_ras_init();
+    }
+
+    answered = true;
+    return hugetlbfs_ras_initialized;
+}
+
+/* return the backend real page size used for the given address */
+static size_t
+hugetlbfs_ras_backend_sz(void *addr)
+{
+    ram_addr_t offset;
+    RAMBlock *rb;
+
+    rb = qemu_ram_block_from_host(addr, false, &offset);
+    if (!rb) {
+        warn_report("No associated RAMBlock to addr: %p", addr);
+        return _PAGE_SIZE;
+    }
+    return rb->page_size;
+}
+
+/*
+ * Report if this std page address of the given faulted large page should be
+ * retried or if the current signal handler should continue to deal with it.
+ * Once the mapping is replaced, we retry the errors that appeared before
+ * the 'page struct' creation, to deal with earlier errors that haven't
+ * been taken into account yet.
+ * But the 4k pages of the new mapping can also experience HW errors
+ * during their lifetime; those are handled, not retried.
+ */
+static int
+hugetlbfs_ras_retry(void *addr, LargeHWPoisonPage *page,
+                    struct timespec *entry_time)
+{
+    if (addr == page->first_poison) {
+        return 0;
+    }
+    if (entry_time->tv_sec < page->creation_time.tv_sec) {
+        return 1;
+    }
+    if ((entry_time->tv_sec == page->creation_time.tv_sec) &&
+        (entry_time->tv_nsec <= page->creation_time.tv_nsec)) {
+        return 1;
+    }
+    return 0;
+}
+
+/*
+ * If the given address is a large page, we try to replace it
+ * with a set of standard sized pages where we copy what remains
+ * valid from the failed large page.
+ * We reset the two values pointed by paddr and psz to point
+ * to the first poisoned page in the new set, and the size
+ * of this poisoned page.
+ * Return true when it's done: the handler continues with the
+ * possibly corrected values.
+ * Return false when the signal handler has no further action
+ * to take and should exit.
+ */
+bool
+hugetlbfs_ras_correct(void **paddr, size_t *psz, int code)
+{
+    void *p, *reported_addr;
+    size_t reported_sz, real_sz;
+    LargeHWPoisonPage *page;
+    int found = 0;
+    struct timespec et;
+
+    assert(psz != NULL && paddr != NULL);
+
+    DPRINTF("SIGBUS (%s) at %p (size: %zu)\n",
+        (code == BUS_MCEERR_AR ? "AR" : "AO"), *paddr, *psz);
+
+    if (!hugetlbfs_ras_initialized) {
+        return true;
+    }
+
+    /*
+     * XXX The kernel-provided size is not reliable, because
+     * kvm_send_hwpoison_signal() uses a hard-coded PAGE_SHIFT
+     * value in the hwpoison signal.
+     * So when a std page size is reported, we must identify the
+     * actual size to consider (from the mapping block size, or
+     * whether we already reduced the page to 4k chunks).
+     */
+    reported_sz = *psz;
+
+    p = *paddr;
+    reported_addr = p;
+
+    if (clock_gettime(CLOCK_MONOTONIC, &et) != 0) {
+        et.tv_sec = 0;
+        et.tv_nsec = 1;
+    }
+    qemu_mutex_lock(&large_hwpoison_mtx);
+
+    if (large_hwpoison_vm_stop) {
+        qemu_mutex_unlock(&large_hwpoison_mtx);
+        return false;
+    }
+
+    QLIST_FOREACH(page, &large_hwpoison_page_list, list) {
+        if (page->page_addr <= p &&
+            page->page_addr + page->page_size > p) {
+            found = 1;
+            break;
+        }
+    }
+
+    if (!found) {
+        if (reported_sz > _PAGE_SIZE) {
+            /* we trust the kernel in this case */
+            real_sz = reported_sz;
+        } else {
+            real_sz = hugetlbfs_ras_backend_sz(p);
+            if (real_sz <= _PAGE_SIZE) {
+                /* not part of a large page */
+                qemu_mutex_unlock(&large_hwpoison_mtx);
+                return true;
+            }
+        }
+        page = g_new0(LargeHWPoisonPage, 1);
+        p = (void *)ROUND_DOWN((unsigned long long)p, real_sz);
+        page->page_addr = p;
+        page->page_size = real_sz;
+        page->page_state = LPP_SUBMITTED;
+        QLIST_INSERT_HEAD(&large_hwpoison_page_list, page, list);
+        qemu_cond_signal(&large_hwpoison_new);
+    } else {
+        if ((code == BUS_MCEERR_AR) && (reported_sz <= _PAGE_SIZE) &&
+            hugetlbfs_ras_retry(p, page, &et)) {
+            *paddr = NULL;
+        }
+    }
+
+    while (page->page_state < LPP_DONE && !large_hwpoison_vm_stop) {
+        qemu_cond_wait(&large_hwpoison_cv, &large_hwpoison_mtx);
+    }
+
+    if (large_hwpoison_vm_stop) {
+        DPRINTF("Handler exit requested as on page %p\n", page->page_addr);
+        *paddr = NULL;
+    }
+    qemu_mutex_unlock(&large_hwpoison_mtx);
+
+    if (page->page_state == LPP_FAILED) {
+        warn_report("Failed recovery for page: %p (error at %p)",
+                    page->page_addr, reported_addr);
+        return (*paddr == NULL ? false : true);
+    }
+
+    *psz = (size_t)_PAGE_SIZE;
+
+    DPRINTF("SIGBUS (%s) corrected from %p to %p (size %zu to %zu)\n",
+            (code == BUS_MCEERR_AR ? "AR" : "AO"),
+            reported_addr, *paddr, reported_sz, *psz);
+
+    return (*paddr == NULL ? false : true);
+}
+
+/*
+ * Sequentially read the valid data from the failed large page (shared) backend
+ * file and copy that into our set of standard sized pages.
+ * Any error reading this file (not only EIO) means that we don't retrieve
+ * valid data for the read location, so it results in the corresponding
+ * standard page to be marked as poisoned.
+ * And if this file mapping is not set with "share=on", we can't rely on the
+ * content on the backend file, so the entire replacing set of pages
+ * is poisoned in this case.
+ */
+static int take_valid_data_lpg(LargeHWPoisonPage *page, const char **err)
+{
+    int fd, i, ps = _PAGE_SIZE, slot_num, poison_count = 0;
+    ram_addr_t offset;
+    RAMBlock *rb;
+    uint64_t fd_offset;
+    ssize_t count, retrieved;
+
+    /* find the backend to get the associated fd and offset */
+    rb = qemu_ram_block_from_host(page->page_addr, false, &offset);
+    if (!rb) {
+        if (err) {
+            *err = "No associated RAMBlock";
+        }
+        return -1;
+    }
+    fd = qemu_ram_get_fd(rb);
+    fd_offset = rb->fd_offset;
+    offset += fd_offset;
+    assert(page->page_size == qemu_ram_pagesize(rb));
+    slot_num = page->page_size / ps;
+
+    if (!qemu_ram_is_shared(rb)) { /* we can't use the backend file */
+        if (madvise(page->page_addr, page->page_size, MADV_HWPOISON) == 0) {
+            page->first_poison = page->page_addr;
+            warn_report("Large memory error, unrecoverable section "
+                "(unshared hugetlbfs): start:%p length: %ld",
+                page->page_addr, page->page_size);
+            return 0;
+        } else {
+            if (err) {
+                *err = "large poison injection failed";
+            }
+            return -1;
+        }
+    }
+
+    for (i = 0; i < slot_num; i++) {
+        retrieved = 0;
+        while (retrieved < ps) {
+            count = pread(fd, page->page_addr + i * ps + retrieved,
+                ps - retrieved, offset + i * ps + retrieved);
+            if (count == 0) {
+                DPRINTF("read reached the end of the file\n");
+                break;
+            } else if (count < 0) {
+                DPRINTF("read backend failed: %s\n", strerror(errno));
+                break;
+            }
+            retrieved += count;
+        }
+        if (retrieved < ps) { /* consider this page as poisoned */
+            if (madvise(page->page_addr + i * ps, ps, MADV_HWPOISON)) {
+                if (err) {
+                    *err = "poison injection failed";
+                }
+                return -1;
+            }
+            if (page->first_poison == NULL) {
+                page->first_poison = page->page_addr + i * ps;
+            }
+            poison_count++;
+            DPRINTF("Found a poison at index %d = addr %p\n",
+                i, page->page_addr + i * ps);
+        }
+    }
+
+    /*
+     * A large page without at least a 4k poison will not have an
+     * entry into hwpoison_page_list, so won't be correctly replaced
+     * with a new large page on VM reset with qemu_ram_remap().
+     * Any new error on this area will fail to be handled.
+     */
+    if (poison_count == 0) {
+        if (err) {
+            *err = "No Poison found";
+        }
+        return -1;
+    }
+
+    DPRINTF("Num poison for page %p : %d / %d\n",
+            page->page_addr, poison_count, i);
+    return 0;
+}
+
+/*
+ * Empty the large_hwpoison_page_list -- to be called on address space
+ * poison cleanup outside of concurrent memory access.
+ */
+void hugetlbfs_ras_empty(void)
+{
+    LargeHWPoisonPage *page, *next_page;
+
+    if (!hugetlbfs_ras_initialized) {
+        return;
+    }
+    qemu_mutex_lock(&large_hwpoison_mtx);
+    QLIST_FOREACH_SAFE(page, &large_hwpoison_page_list, list, next_page) {
+        QLIST_REMOVE(page, list);
+        g_free(page);
+    }
+    qemu_mutex_unlock(&large_hwpoison_mtx);
+}
+
+/*
+ * Deal with the given page, initializing its data.
+ */
+static void
+hugetlbfs_ras_transform_page(LargeHWPoisonPage *page, const char **err_info)
+{
+    const char *err_msg;
+    int fd;
+    ram_addr_t offset;
+    RAMBlock *rb;
+
+    /* find the backend to get the associated fd and offset */
+    rb = qemu_ram_block_from_host(page->page_addr, false, &offset);
+    if (!rb) {
+        DPRINTF("No associated RAMBlock to %p\n", page->page_addr);
+        err_msg = "qemu_ram_block_from_host error";
+        goto err;
+    }
+    fd = qemu_ram_get_fd(rb);
+
+    if (sync_file_range(fd, offset, page->page_size,
+                        SYNC_FILE_RANGE_WAIT_AFTER) != 0) {
+        err_msg = "sync_file_range error on the backend";
+        perror("sync_file_range");
+        goto err;
+    }
+    if (fsync(fd) != 0) {
+        err_msg = "fsync error on the backend";
+        perror("fsync");
+        goto err;
+    }
+    if (msync(page->page_addr, page->page_size, MS_SYNC) != 0) {
+        err_msg = "msync error on the backend";
+        perror("msync");
+        goto err;
+    }
+    page->page_state = LPP_PREPARING;
+
+    if (munmap(page->page_addr, page->page_size) != 0) {
+        err_msg = "Failed to unmap";
+        perror("munmap");
+        goto err;
+    }
+
+    /* replace the large page with standard pages */
+    if (mmap(page->page_addr, page->page_size, PROT_READ | PROT_WRITE,
+            MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0)
+            == MAP_FAILED) {
+        err_msg = "Failed to map std page";
+        perror("mmap");
+        goto err;
+    }
+
+    /* take a copy of still valid data and mark the failed pages as poisoned */
+    if (take_valid_data_lpg(page, &err_msg) != 0) {
+        goto err;
+    }
+
+    if (clock_gettime(CLOCK_MONOTONIC, &page->creation_time) != 0) {
+        err_msg = "Failed to set creation time";
+        perror("clock_gettime");
+        goto err;
+    }
+
+    page->page_state = LPP_DONE;
+    return;
+
+err:
+    if (err_info) {
+        *err_info = err_msg;
+    }
+    page->page_state = LPP_FAILED;
+}
+
+/* attempt to vm_stop the entire VM in the IOthread */
+static void coroutine_hugetlbfs_ras_vmstop_bh(void *opaque)
+{
+    vm_stop(RUN_STATE_PAUSED);
+    DPRINTF("VM STOPPED\n");
+    qemu_mutex_lock(&large_hwpoison_mtx);
+    vm_running = 0;
+    qemu_cond_signal(&large_hwpoison_vm_running);
+    qemu_mutex_unlock(&large_hwpoison_mtx);
+}
+
+static void coroutine_hugetlbfs_ras_vmstart_bh(void *opaque)
+{
+    vm_start();
+}
+
+static void *
+hugetlbfs_ras_listener(void *arg)
+{
+    LargeHWPoisonPage *page;
+    int new;
+    const char *err;
+
+    /* monitor any newly submitted element in the list */
+    qemu_mutex_lock(&large_hwpoison_mtx);
+    while (1) {
+        new = 0;
+        QLIST_FOREACH(page, &large_hwpoison_page_list, list) {
+            if (page->page_state == LPP_SUBMITTED) {
+                new++;
+                vm_running = 1;
+                DPRINTF("Stopping the VM\n");
+                aio_bh_schedule_oneshot(qemu_get_aio_context(),
+                                coroutine_hugetlbfs_ras_vmstop_bh, NULL);
+                /* inform all SIGBUS threads that they have to return */
+                large_hwpoison_vm_stop++;
+                qemu_cond_broadcast(&large_hwpoison_cv);
+
+                /* wait until VM is stopped */
+                while (vm_running) {
+                    DPRINTF("waiting for vm to stop\n");
+                    qemu_cond_wait(&large_hwpoison_vm_running,
+                                   &large_hwpoison_mtx);
+                }
+
+                hugetlbfs_ras_transform_page(page, &err);
+                if (page->page_state == LPP_FAILED) {
+                    error_report("fatal: unrecoverable hugepage memory error"
+                                 " at %p (%s)", page->page_addr, err);
+                    exit(1);
+                }
+
+                large_hwpoison_vm_stop--;
+
+                DPRINTF("Restarting the VM\n");
+                aio_bh_schedule_oneshot(qemu_get_aio_context(),
+                                coroutine_hugetlbfs_ras_vmstart_bh, NULL);
+            }
+        }
+        if (new) {
+            qemu_cond_broadcast(&large_hwpoison_cv);
+        }
+
+        qemu_cond_wait(&large_hwpoison_new, &large_hwpoison_mtx);
+    }
+    qemu_mutex_unlock(&large_hwpoison_mtx);
+    return NULL;
+}
diff --git a/system/hugetlbfs_ras.h b/system/hugetlbfs_ras.h
new file mode 100644
index 0000000000..324228bda3
--- /dev/null
+++ b/system/hugetlbfs_ras.h
@@ -0,0 +1,3 @@
+bool hugetlbfs_ras_use(void);
+bool hugetlbfs_ras_correct(void **paddr, size_t *psz, int code);
+void hugetlbfs_ras_empty(void);
diff --git a/system/meson.build b/system/meson.build
index a296270cb0..eda92f55a9 100644
--- a/system/meson.build
+++ b/system/meson.build
@@ -37,4 +37,5 @@ system_ss.add(when: 'CONFIG_DEVICE_TREE',
               if_false: files('device_tree-stub.c'))
 if host_os == 'linux'
   system_ss.add(files('async-teardown.c'))
+  system_ss.add(files('hugetlbfs_ras.c'))
 endif
diff --git a/system/physmem.c b/system/physmem.c
index 5c176146c0..78de507bd0 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -82,6 +82,10 @@
 #include <daxctl/libdaxctl.h>
 #endif
 
+#ifdef CONFIG_HUGETLBFS_RAS
+#include "system/hugetlbfs_ras.h"
+#endif
+
 //#define DEBUG_SUBPAGE
 
 /* ram_list is read under rcu_read_lock()/rcu_read_unlock().  Writes
@@ -2061,6 +2065,19 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
         return NULL;
     }
 
+#ifdef CONFIG_HUGETLBFS_RAS
+    {
+        QemuFsType ftyp = qemu_fd_getfs(fd);
+
+        if (ftyp == QEMU_FS_TYPE_HUGETLBFS) {
+            if (hugetlbfs_ras_use() && !(ram_flags & RAM_SHARED)) {
+                warn_report("'share=on' option must be set to support "
+                            "hugetlbfs memory error handling");
+            }
+        }
+    }
+#endif
+
     block = qemu_ram_alloc_from_fd(size, mr, ram_flags, fd, offset, errp);
     if (!block) {
         if (created) {
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index f62e53e423..6215d1acb5 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -40,6 +40,10 @@
 #include "hw/acpi/ghes.h"
 #include "target/arm/gtimer.h"
 
+#ifdef CONFIG_HUGETLBFS_RAS
+#include "system/hugetlbfs_ras.h"
+#endif
+
 const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
     KVM_CAP_LAST_INFO
 };
@@ -2356,6 +2360,12 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr, short addr_lsb)
 
     assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
 
+#ifdef CONFIG_HUGETLBFS_RAS
+    if (!hugetlbfs_ras_correct(&addr, &sz, code)) {
+        return;
+    }
+#endif
+
     if (acpi_ghes_present() && addr) {
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 99b87140cc..c99095cb1f 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -69,6 +69,10 @@
 #include "exec/memattrs.h"
 #include "trace.h"
 
+#ifdef CONFIG_HUGETLBFS_RAS
+#include "system/hugetlbfs_ras.h"
+#endif
+
 #include CONFIG_DEVICES
 
 //#define DEBUG_KVM
@@ -729,6 +733,12 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr, short addr_lsb)
      */
     assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
 
+#ifdef CONFIG_HUGETLBFS_RAS
+    if (!hugetlbfs_ras_correct(&addr, &sz, code)) {
+        return;
+    }
+#endif
+
     if ((env->mcg_cap & MCG_SER_P) && addr) {
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
-- 
2.43.5

[RFC RESEND 5/6] system/hugetlb_ras: Handle madvise SIGBUS signal on listener

Posted by William Roche 2 months, 2 weeks ago

From: William Roche <william.roche@oracle.com>

Calling madvise(MADV_HWPOISON) can itself generate a SIGBUS, so the
listener thread (the caller) needs to deal with this signal.
The signal handler checks a thread-specific variable to recognize a
SIGBUS raised on the listener thread and, in that case, jumps straight
back to the listener instead of going through the normal handling.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/cpus.c          |  9 +++++++++
 system/hugetlbfs_ras.c | 43 ++++++++++++++++++++++++++++++++++++++++--
 system/hugetlbfs_ras.h |  1 +
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/system/cpus.c b/system/cpus.c
index 12e630f760..642055f729 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -47,6 +47,10 @@
 #include "hw/hw.h"
 #include "trace.h"
 
+#ifdef CONFIG_HUGETLBFS_RAS
+#include "system/hugetlbfs_ras.h"
+#endif
+
 #ifdef CONFIG_LINUX
 
 #include <sys/prctl.h>
@@ -374,6 +378,11 @@ static void sigbus_handler(int n, siginfo_t *siginfo, void *ctx)
         sigbus_reraise();
     }
 
+#ifdef CONFIG_HUGETLBFS_RAS
+    /* skip error on the listener thread - does not return in this case */
+    hugetlbfs_ras_signal_from_listener();
+#endif
+
     if (current_cpu) {
         /* Called asynchronously in VCPU thread.  */
         if (kvm_on_sigbus_vcpu(current_cpu, siginfo->si_code,
diff --git a/system/hugetlbfs_ras.c b/system/hugetlbfs_ras.c
index 2f7e550f56..90e399bbad 100644
--- a/system/hugetlbfs_ras.c
+++ b/system/hugetlbfs_ras.c
@@ -70,6 +70,8 @@ static QemuCond large_hwpoison_vm_running;
 static QemuMutex large_hwpoison_mtx;
 static QemuThread thread;
 static void *hugetlbfs_ras_listener(void *arg);
+static pthread_key_t id_key;
+static sigjmp_buf listener_jmp_buf;
 static int vm_running;
 static bool hugetlbfs_ras_initialized;
 static int _PAGE_SIZE = 4096;
@@ -105,6 +107,10 @@ hugetlbfs_ras_init(void)
     qemu_cond_init(&large_hwpoison_vm_running);
     qemu_mutex_init(&large_hwpoison_mtx);
 
+    if (pthread_key_create(&id_key, NULL) != 0) {
+        warn_report("No support for hugetlbfs largepage errors - no id_key");
+        return -EIO;
+    }
     qemu_thread_create(&thread, "hugetlbfs_error", hugetlbfs_ras_listener,
                        NULL, QEMU_THREAD_DETACHED);
 
@@ -288,6 +294,19 @@ hugetlbfs_ras_correct(void **paddr, size_t *psz, int code)
     return (*paddr == NULL ? false : true);
 }
 
+/* this madvise can generate a SIGBUS, use the jump buffer to deal with it */
+static bool poison_location(void *addr, int size)
+{
+    if (sigsetjmp(listener_jmp_buf, 1) == 0) {
+        if (madvise(addr, size, MADV_HWPOISON)) {
+            DPRINTF("poison injection failed: %s (addr:%p sz:%d)\n",
+                    strerror(errno), addr, size);
+            return false;
+        }
+    }
+    return true;
+}
+
 /*
  * Sequentially read the valid data from the failed large page (shared) backend
  * file and copy that into our set of standard sized pages.
@@ -321,7 +340,7 @@ static int take_valid_data_lpg(LargeHWPoisonPage *page, const char **err)
     slot_num = page->page_size / ps;
 
     if (!qemu_ram_is_shared(rb)) { /* we can't use the backend file */
-        if (madvise(page->page_addr, page->page_size, MADV_HWPOISON) == 0) {
+        if (poison_location(page->page_addr, page->page_size)) {
             page->first_poison = page->page_addr;
             warn_report("Large memory error, unrecoverable section "
                 "(unshared hugetlbfs): start:%p length: %ld",
@@ -350,7 +369,7 @@ static int take_valid_data_lpg(LargeHWPoisonPage *page, const char **err)
             retrieved += count;
         }
         if (retrieved < ps) { /* consider this page as poisoned */
-            if (madvise(page->page_addr + i * ps, ps, MADV_HWPOISON)) {
+            if (!poison_location(page->page_addr + i * ps, ps)) {
                 if (err) {
                     *err = "poison injection failed";
                 }
@@ -402,6 +421,19 @@ void hugetlbfs_ras_empty(void)
     qemu_mutex_unlock(&large_hwpoison_mtx);
 }
 
+/*
+ * Check if the signal is taken from the listener thread,
+ * in this thread we don't return as we jump after the madvise call.
+ */
+void
+hugetlbfs_ras_signal_from_listener(void)
+{
+    /* check if we take the SIGBUS in the listener */
+    if (pthread_getspecific(id_key) != NULL) {
+        siglongjmp(listener_jmp_buf, 1);
+    }
+}
+
 /*
  * Deal with the given page, initializing its data.
  */
@@ -498,6 +530,13 @@ hugetlbfs_ras_listener(void *arg)
     LargeHWPoisonPage *page;
     int new;
     const char *err;
+    sigset_t set;
+
+    pthread_setspecific(id_key, (void *)1);
+    /* unblock SIGBUS */
+    sigemptyset(&set);
+    sigaddset(&set, SIGBUS);
+    pthread_sigmask(SIG_UNBLOCK, &set, NULL);
 
     /* monitor any newly submitted element in the list */
     qemu_mutex_lock(&large_hwpoison_mtx);
diff --git a/system/hugetlbfs_ras.h b/system/hugetlbfs_ras.h
index 324228bda3..9c2a6e49a1 100644
--- a/system/hugetlbfs_ras.h
+++ b/system/hugetlbfs_ras.h
@@ -1,3 +1,4 @@
 bool hugetlbfs_ras_use(void);
 bool hugetlbfs_ras_correct(void **paddr, size_t *psz, int code);
 void hugetlbfs_ras_empty(void);
+void hugetlbfs_ras_signal_from_listener(void);
-- 
2.43.5

[RFC RESEND 6/6] system/hugetlb_ras: Replay lost BUS_MCEERR_AO signals on VM resume

Posted by William Roche 2 months, 2 weeks ago

From: William Roche <william.roche@oracle.com>

When the SIGBUS handler is triggered by a BUS_MCEERR_AO signal and has
to exit to let the VM pause during the memory mapping change, this
SIGBUS is not regenerated when the VM resumes.
In this case we record the signal before exiting the handler, and
replay it when the VM resumes.
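
The record-then-replay idea can be sketched on its own as follows. This is
only an illustration under stated assumptions, not the patch code: all demo_*
names are hypothetical, a plain pthread mutex and a singly linked list replace
the Qemu primitives, and a callback stands in for the kvm_on_sigbus() call.
Unlike the patch, where the caller already holds the lock when recording, the
sketch takes the lock inside demo_record().

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* One deferred notification (hypothetical type). */
typedef struct DemoPending {
    void *addr;
    struct DemoPending *next;
} DemoPending;

static DemoPending *demo_pending_head;
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_delivered;          /* counter used by the sample callback */

/* Sample delivery callback: only counts what it receives. */
static void demo_count_deliver(void *addr)
{
    (void)addr;
    demo_delivered++;
}

/* Record an address for later replay. Returns 0, or -1 on allocation failure. */
static int demo_record(void *addr)
{
    DemoPending *cel = malloc(sizeof(*cel));

    if (cel == NULL) {
        return -1;
    }
    cel->addr = addr;
    pthread_mutex_lock(&demo_lock);
    cel->next = demo_pending_head;
    demo_pending_head = cel;
    pthread_mutex_unlock(&demo_lock);
    return 0;
}

/*
 * Replay every recorded address. The list is detached while holding the
 * lock, then delivered without it, so the callback never runs under the
 * lock. Returns the number of entries replayed.
 */
static int demo_replay(void (*deliver)(void *addr))
{
    DemoPending *local, *next;
    int n = 0;

    assert(deliver != NULL);
    pthread_mutex_lock(&demo_lock);
    local = demo_pending_head;
    demo_pending_head = NULL;
    pthread_mutex_unlock(&demo_lock);

    for (; local != NULL; local = next) {
        next = local->next;
        deliver(local->addr);
        free(local);
        n++;
    }
    return n;
}
```

Detaching the list before delivery keeps the critical section short and
avoids calling into other subsystems while the lock is held.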

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/hugetlbfs_ras.c | 60 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/system/hugetlbfs_ras.c b/system/hugetlbfs_ras.c
index 90e399bbad..50f810f836 100644
--- a/system/hugetlbfs_ras.c
+++ b/system/hugetlbfs_ras.c
@@ -155,6 +155,56 @@ hugetlbfs_ras_backend_sz(void *addr)
     return rb->page_size;
 }
 
+
+/*
+ *  List of BUS_MCEERR_AO signals received before replaying.
+ *  Addition is serialized under large_hwpoison_mtx, but replay is
+ *  asynchronous.
+ */
+typedef struct LargeHWPoisonAO {
+    void  *addr;
+    QLIST_ENTRY(LargeHWPoisonAO) list;
+} LargeHWPoisonAO;
+
+static QLIST_HEAD(, LargeHWPoisonAO) large_hwpoison_ao =
+    QLIST_HEAD_INITIALIZER(large_hwpoison_ao);
+
+static void
+large_hwpoison_ao_record(void *addr)
+{
+    LargeHWPoisonAO *cel;
+
+    cel = g_new(LargeHWPoisonAO, 1);
+    cel->addr = addr;
+    QLIST_INSERT_HEAD(&large_hwpoison_ao, cel, list);
+}
+
+/* replay the possible BUS_MCEERR_AO recorded signal(s) */
+static void
+hugetlbfs_ras_ao_replay_bh(void)
+{
+    LargeHWPoisonAO *cel, *next;
+    QLIST_HEAD(, LargeHWPoisonAO) local_list =
+        QLIST_HEAD_INITIALIZER(local_list);
+
+    /*
+     * Copy to a local list to avoid holding large_hwpoison_mtx
+     * when calling kvm_on_sigbus().
+     */
+    qemu_mutex_lock(&large_hwpoison_mtx);
+    QLIST_FOREACH_SAFE(cel, &large_hwpoison_ao, list, next) {
+        QLIST_REMOVE(cel, list);
+        QLIST_INSERT_HEAD(&local_list, cel, list);
+    }
+    qemu_mutex_unlock(&large_hwpoison_mtx);
+
+    QLIST_FOREACH_SAFE(cel, &local_list, list, next) {
+        DPRINTF("AO on %p\n", cel->addr);
+        kvm_on_sigbus(BUS_MCEERR_AO, cel->addr, _PAGE_SHIFT);
+        g_free(cel);
+    }
+}
+
 /*
  * Report if this std page address of the given faulted large page should be
  * retried or if the current signal handler should continue to deal with it.
@@ -276,6 +326,15 @@ hugetlbfs_ras_correct(void **paddr, size_t *psz, int code)
     if (large_hwpoison_vm_stop) {
         DPRINTF("Handler exit requested as on page %p\n", page->page_addr);
         *paddr = NULL;
+        /*
+         * BUS_MCEERR_AO specific case: this signal is not regenerated,
+         * we keep it to replay when the VM is ready to take it.
+         */
+        if (code == BUS_MCEERR_AO) {
+            large_hwpoison_ao_record(page->first_poison ? page->first_poison :
+                reported_addr);
+        }
+
     }
     qemu_mutex_unlock(&large_hwpoison_mtx);
 
@@ -522,6 +581,7 @@ static void coroutine_hugetlbfs_ras_vmstop_bh(void *opaque)
 static void coroutine_hugetlbfs_ras_vmstart_bh(void *opaque)
 {
     vm_start();
+    hugetlbfs_ras_ao_replay_bh();
 }
 
 static void *
-- 
2.43.5