From: William Roche <william.roche@oracle.com>

Hello David,

Here is the version with the small nits corrected.

---
This set of patches fixes several problems with hardware memory errors
impacting hugetlbfs memory backed VMs and the generic memory recovery
on VM reset.
When using hugetlbfs large pages, any large page location being impacted
...
We also enrich the messages used to report a memory error relayed to
the VM, providing an identification of the memory page and its size in
case of a large page impacted.
----

v1 -> v2:
. Removed the kernel SIGBUS siginfo-provided lsb size information
  tracking; only relying on the RAMBlock page_size instead.
. Adapted the 3 patches you pointed me to in order to implement the
  notification mechanism on remap. Thank you for this code!
  I left them as authored by you.
  But I haven't tested whether the policy setting works as expected on VM
  reset, only that the replacement of physical memory works.
. Also removed the old memory setting that was kept in qemu_ram_remap(),
  but this small last fix could probably be merged with your last commit.

v2 -> v3:
. Dropped the size parameter from qemu_ram_remap() and determined the page
  size when adding it to the poison list, aligning the offset down to the
  page size. Multiple sub-pages poisoned on a large page lead to a single
  poison entry.
. Introduced a helper function for the mmap code.
. Added "on lost large page <size>@<ram_addr>" to the error injection
  message (notation used in qemu_ram_remap() too).
  So only in the case of a large page, it looks like:
  Guest MCE Memory Error at QEMU addr 0x7fc1f5dd6000 and GUEST addr 0x19fd6000 on lost large page 200000@19e00000 of type BUS_MCEERR_AR injected
. As we need the page_size value for the above message, I retrieve the
  value in kvm_arch_on_sigbus_vcpu() to pass the appropriate pointer
  to kvm_hwpoison_page_add(), which doesn't need to align it anymore.
. Added a similar message for the ARM platform (removing the MCE
  keyword).
. Also introduced a "fail hard" in the remap notification:
  host_memory_backend_ram_remapped().

v3 -> v4:
. Fixed some commit message typos.
. Enhanced some code comments.
. Changed the discard fallback conditions to consider only anonymous
  memory.
. Fixed some missing variable name changes in intermediary patches.
. Modified the error message given when an error is injected to report
  the case of a large page.
. Used snprintf() to generate this message.
. Added this same type of message in the ARM case too.

v4 -> v5:
. Updated commit messages (for patches 1, 5 and 6).
. Fixed a comment typo in patch 2.
. Changed the fallback function parameters to match the
  ram_block_discard_range() function.
. Removed the unused case of remapping a file in this function.
. Added the assert(block->fd < 0) in this function too.
. Merged my patch 7 with your patch 6 (we only have 6 patches now).

v5 -> v6:
. Don't align down ram_addr in kvm_hwpoison_page_add(), but create
  a new entry for each subpage reported as poisoned.
. Introduced memory error messages similar to discard_range()'s.
. Introduced a function to retrieve more information about a RAMBlock
  experiencing an error than just its associated page size.
. File offset as a uint64_t instead of a ram_addr_t.
. Changed ownership of patch 6/6.

v6 -> v7:
. Changed the block location information collection function name to
  qemu_ram_block_info_from_addr().
. Display the fd_offset value only when dealing with a file backend
  in kvm_hwpoison_page_add() and qemu_ram_remap().
. Better placed the offset alignment computation.
. Added two missing empty separation lines.

This code is scripts/checkpatch.pl clean.
'make check' runs clean on both x86 and ARM.


David Hildenbrand (2):
  numa: Introduce and use ram_block_notify_remap()
  hostmem: Factor out applying settings

William Roche (4):
  system/physmem: handle hugetlb correctly in qemu_ram_remap()
  system/physmem: poisoned memory discard on reboot
  accel/kvm: Report the loss of a large memory page
  hostmem: Handle remapping of RAM

 accel/kvm/kvm-all.c       |  20 +++-
 backends/hostmem.c        | 189 +++++++++++++++++++++++---------------
 hw/core/numa.c            |  11 +++
 include/exec/cpu-common.h |  12 ++-
 include/exec/ramlist.h    |   3 +
 include/system/hostmem.h  |   1 +
 system/physmem.c          | 107 +++++++++++++------
 target/arm/kvm.c          |   3 +
 8 files changed, 244 insertions(+), 102 deletions(-)

--
2.43.5
From: William Roche <william.roche@oracle.com>

The list of hwpoison pages used to remap the memory on reset
is based on the backend's real page size.
To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete
hugetlb page; hugetlb pages cannot be partially mapped.

Signed-off-by: William Roche <william.roche@oracle.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 accel/kvm/kvm-all.c       |  2 +-
 include/exec/cpu-common.h |  2 +-
 system/physmem.c          | 38 +++++++++++++++++++++++++++++---------
 3 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index XXXXXXX..XXXXXXX 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
...
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr);
         g_free(page);
     }
 }
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index XXXXXXX..XXXXXXX 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -XXX,XX +XXX,XX @@ typedef uintptr_t ram_addr_t;
...
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
+void qemu_ram_remap(ram_addr_t addr);
 /* This should not be used by devices. */
 ram_addr_t qemu_ram_addr_from_host(void *ptr);
 ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
diff --git a/system/physmem.c b/system/physmem.c
index XXXXXXX..XXXXXXX 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -XXX,XX +XXX,XX @@ void qemu_ram_free(RAMBlock *block)
 }

 #ifndef _WIN32
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+/*
+ * qemu_ram_remap - remap a single RAM page
+ *
+ * @addr: address in ram_addr_t address space.
+ *
+ * This function will try remapping a single page of guest RAM identified by
+ * @addr, essentially discarding memory to recover from previously poisoned
+ * memory (MCE). The page size depends on the RAMBlock (i.e., hugetlb). @addr
+ * does not have to point at the start of the page.
+ *
+ * This function is only to be used during system resets; it will kill the
+ * VM if remapping failed.
+ */
+void qemu_ram_remap(ram_addr_t addr)
 {
     RAMBlock *block;
-    ram_addr_t offset;
+    uint64_t offset;
     int flags;
     void *area, *vaddr;
     int prot;
+    size_t page_size;

...
             flags |= MAP_ANONYMOUS;
-            area = mmap(vaddr, length, prot, flags, -1, 0);
+            area = mmap(vaddr, page_size, prot, flags, -1, 0);
         }
         if (area != vaddr) {
-            error_report("Could not remap addr: "
-                         RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
-                         length, addr);
+            error_report("Could not remap RAM %s:%" PRIx64 "+%" PRIx64
+                         " +%zx", block->idstr, offset,
+                         block->fd_offset, page_size);
             exit(1);
         }
-        memory_try_enable_merging(vaddr, length);
-        qemu_ram_setup_dump(vaddr, length);
+        memory_try_enable_merging(vaddr, page_size);
...
If the kernel doesn't support the madvise calls used by this function
and we are dealing with anonymous memory, fall back to remapping the
location(s).

Signed-off-by: William Roche <william.roche@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 system/physmem.c | 58 ++++++++++++++++++++++++++++++------------------
 1 file changed, 36 insertions(+), 22 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index XXXXXXX..XXXXXXX 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -XXX,XX +XXX,XX @@ void qemu_ram_free(RAMBlock *block)
 }

 #ifndef _WIN32
+/* Simply remap the given VM memory location from start to start+length */
+static int qemu_ram_remap_mmap(RAMBlock *block, uint64_t start, size_t length)
+{
+    int flags, prot;
+    void *area;
+    void *host_startaddr = block->host + start;
+
...
+    flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE;
+    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
+    prot = PROT_READ;
+    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
+    area = mmap(host_startaddr, length, prot, flags, -1, 0);
+    return area != host_startaddr ? -errno : 0;
+}
+
 /*
  * qemu_ram_remap - remap a single RAM page
  *
@@ -XXX,XX +XXX,XX @@ void qemu_ram_remap(ram_addr_t addr)
 {
     RAMBlock *block;
     uint64_t offset;
-    int flags;
-    void *area, *vaddr;
-    int prot;
+    void *vaddr;
     size_t page_size;

     RAMBLOCK_FOREACH(block) {
@@ -XXX,XX +XXX,XX @@ void qemu_ram_remap(ram_addr_t addr)
             ;
         } else if (xen_enabled()) {
             abort();
-        } else {
-            flags = MAP_FIXED;
-            flags |= block->flags & RAM_SHARED ?
-                     MAP_SHARED : MAP_PRIVATE;
-            flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
-            prot = PROT_READ;
...
-            } else {
-                flags |= MAP_ANONYMOUS;
-                area = mmap(vaddr, page_size, prot, flags, -1, 0);
-            }
-            if (area != vaddr) {
-                error_report("Could not remap RAM %s:%" PRIx64 "+%" PRIx64
-                             " +%zx", block->idstr, offset,
-                             block->fd_offset, page_size);
-                exit(1);
+        if (ram_block_discard_range(block, offset, page_size) != 0) {
+            /*
+             * Fall back to using mmap() only for anonymous mapping,
+             * as if a backing file is associated we may not be able
+             * to recover the memory in all cases.
+             * So don't take the risk of using only mmap and fail now.
+             */
+            if (block->fd >= 0) {
+                error_report("Could not remap RAM %s:%" PRIx64 "+%"
+                             PRIx64 " +%zx", block->idstr, offset,
+                             block->fd_offset, page_size);
+                exit(1);
+            }
+            if (qemu_ram_remap_mmap(block, offset, page_size) != 0) {
+                error_report("Could not remap RAM %s:%" PRIx64 " +%zx",
+                             block->idstr, offset, page_size);
+                exit(1);
+            }
         }
         memory_try_enable_merging(vaddr, page_size);
         qemu_ram_setup_dump(vaddr, page_size);
--
2.43.5
From: William Roche <william.roche@oracle.com>

In case of a large page impacted by a memory error, provide
information about the impacted large page before the memory
error injection message.

This message also appears on RAS-enabled ARM platforms, with the
introduction of an error injection message similar to the x86 one.

In the case of a large page impacted, we now report:
Memory Error on large page from <backend>:<address>+<fd_offset> +<page_size>

The +<fd_offset> information is only provided with a file backend.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c       | 18 ++++++++++++++++++
 include/exec/cpu-common.h | 10 ++++++++++
 system/physmem.c          | 22 ++++++++++++++++++++++
 target/arm/kvm.c          |  3 +++
 4 files changed, 53 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index XXXXXXX..XXXXXXX 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -XXX,XX +XXX,XX @@ static void kvm_unpoison_all(void *param)
 void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 {
     HWPoisonPage *page;
+    struct RAMBlockInfo rb_info;
+
+    if (qemu_ram_block_info_from_addr(ram_addr, &rb_info)) {
+        size_t ps = rb_info.page_size;
+
+        if (ps > TARGET_PAGE_SIZE) {
+            uint64_t offset = QEMU_ALIGN_DOWN(ram_addr - rb_info.offset, ps);
+
+            if (rb_info.fd >= 0) {
+                error_report("Memory Error on large page from %s:%" PRIx64
+                             "+%" PRIx64 " +%zx", rb_info.idstr, offset,
+                             rb_info.fd_offset, ps);
+            } else {
+                error_report("Memory Error on large page from %s:%" PRIx64
+                             " +%zx", rb_info.idstr, offset, ps);
+            }
+        }
+    }

     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index XXXXXXX..XXXXXXX 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -XXX,XX +XXX,XX @@ int qemu_ram_get_fd(RAMBlock *rb);
 size_t qemu_ram_pagesize(RAMBlock *block);
 size_t qemu_ram_pagesize_largest(void);

+struct RAMBlockInfo {
+    char idstr[256];
+    ram_addr_t offset;
+    int fd;
+    uint64_t fd_offset;
+    size_t page_size;
+};
+bool qemu_ram_block_info_from_addr(ram_addr_t ram_addr,
+                                   struct RAMBlockInfo *block);
+
 /**
  * cpu_address_space_init:
  * @cpu: CPU to add this address space to
diff --git a/system/physmem.c b/system/physmem.c
index XXXXXXX..XXXXXXX 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -XXX,XX +XXX,XX @@ size_t qemu_ram_pagesize_largest(void)
     return largest;
 }

+/* Copy RAMBlock information associated to the given ram_addr location */
+bool qemu_ram_block_info_from_addr(ram_addr_t ram_addr,
+                                   struct RAMBlockInfo *b_info)
+{
+    RAMBlock *rb;
+
+    assert(b_info);
+
+    RCU_READ_LOCK_GUARD();
+    rb = qemu_get_ram_block(ram_addr);
+    if (!rb) {
+        return false;
+    }
+
+    pstrcat(b_info->idstr, sizeof(b_info->idstr), rb->idstr);
+    b_info->offset = rb->offset;
+    b_info->fd = rb->fd;
+    b_info->fd_offset = rb->fd_offset;
+    b_info->page_size = rb->page_size;
+    return true;
+}
+
 static int memory_try_enable_merging(void *addr, size_t len)
 {
     if (!machine_mem_merge(current_machine)) {
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index XXXXXXX..XXXXXXX 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -XXX,XX +XXX,XX @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
             kvm_cpu_synchronize_state(c);
             if (!acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
                 kvm_inject_arm_sea(c);
+                error_report("Guest Memory Error at QEMU addr %p and "
+                             "GUEST addr 0x%" HWADDR_PRIx " of type %s injected",
+                             addr, paddr, "BUS_MCEERR_AR");
             } else {
                 error_report("failed to record the error");
                 abort();
--
2.43.5
From: David Hildenbrand <david@redhat.com>

Notify registered listeners about the remap at the end of
qemu_ram_remap() so e.g., a memory backend can re-apply its
settings correctly.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 hw/core/numa.c         | 11 +++++++++++
 include/exec/ramlist.h |  3 +++
 system/physmem.c       |  1 +
 3 files changed, 15 insertions(+)

diff --git a/hw/core/numa.c b/hw/core/numa.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -XXX,XX +XXX,XX @@ void ram_block_notify_resize(void *host, size_t old_size, size_t new_size)
         }
     }
 }
+
+void ram_block_notify_remap(void *host, size_t offset, size_t size)
+{
+    RAMBlockNotifier *notifier;
+
+    QLIST_FOREACH(notifier, &ram_list.ramblock_notifiers, next) {
+        if (notifier->ram_block_remapped) {
+            notifier->ram_block_remapped(notifier, host, offset, size);
+        }
+    }
+}
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
index XXXXXXX..XXXXXXX 100644
--- a/include/exec/ramlist.h
+++ b/include/exec/ramlist.h
@@ -XXX,XX +XXX,XX @@ struct RAMBlockNotifier {
                             size_t max_size);
39 | size_t max_size); | 39 | size_t max_size); |
40 | void (*ram_block_resized)(RAMBlockNotifier *n, void *host, size_t old_size, | 40 | void (*ram_block_resized)(RAMBlockNotifier *n, void *host, size_t old_size, |
41 | size_t new_size); | 41 | size_t new_size); |
42 | + void (*ram_block_remapped)(RAMBlockNotifier *n, void *host, size_t offset, | 42 | + void (*ram_block_remapped)(RAMBlockNotifier *n, void *host, size_t offset, |
43 | + size_t size); | 43 | + size_t size); |
44 | QLIST_ENTRY(RAMBlockNotifier) next; | 44 | QLIST_ENTRY(RAMBlockNotifier) next; |
45 | }; | 45 | }; |
46 | 46 | ||
47 | @@ -XXX,XX +XXX,XX @@ void ram_block_notifier_remove(RAMBlockNotifier *n); | 47 | @@ -XXX,XX +XXX,XX @@ void ram_block_notifier_remove(RAMBlockNotifier *n); |
48 | void ram_block_notify_add(void *host, size_t size, size_t max_size); | 48 | void ram_block_notify_add(void *host, size_t size, size_t max_size); |
49 | void ram_block_notify_remove(void *host, size_t size, size_t max_size); | 49 | void ram_block_notify_remove(void *host, size_t size, size_t max_size); |
50 | void ram_block_notify_resize(void *host, size_t old_size, size_t new_size); | 50 | void ram_block_notify_resize(void *host, size_t old_size, size_t new_size); |
51 | +void ram_block_notify_remap(void *host, size_t offset, size_t size); | 51 | +void ram_block_notify_remap(void *host, size_t offset, size_t size); |
52 | 52 | ||
53 | GString *ram_block_format(void); | 53 | GString *ram_block_format(void); |
54 | 54 | ||
55 | diff --git a/system/physmem.c b/system/physmem.c | 55 | diff --git a/system/physmem.c b/system/physmem.c |
56 | index XXXXXXX..XXXXXXX 100644 | 56 | index XXXXXXX..XXXXXXX 100644 |
57 | --- a/system/physmem.c | 57 | --- a/system/physmem.c |
58 | +++ b/system/physmem.c | 58 | +++ b/system/physmem.c |
59 | @@ -XXX,XX +XXX,XX @@ void qemu_ram_remap(ram_addr_t addr) | 59 | @@ -XXX,XX +XXX,XX @@ void qemu_ram_remap(ram_addr_t addr) |
60 | } | 60 | } |
61 | memory_try_enable_merging(vaddr, page_size); | 61 | memory_try_enable_merging(vaddr, page_size); |
62 | qemu_ram_setup_dump(vaddr, page_size); | 62 | qemu_ram_setup_dump(vaddr, page_size); |
63 | + ram_block_notify_remap(block->host, offset, page_size); | 63 | + ram_block_notify_remap(block->host, offset, page_size); |
64 | } | 64 | } |
65 | 65 | ||
66 | break; | 66 | break; |
67 | -- | 67 | -- |
68 | 2.43.5 | 68 | 2.43.5 | diff view generated by jsdifflib |
From: David Hildenbrand <david@redhat.com>

We want to reuse the functionality when remapping RAM.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 backends/hostmem.c | 155 ++++++++++++++++++++++++---------------------
 1 file changed, 82 insertions(+), 73 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index XXXXXXX..XXXXXXX 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -XXX,XX +XXX,XX @@ QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_BIND != MPOL_BIND);
 QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_INTERLEAVE != MPOL_INTERLEAVE);
 #endif

+static void host_memory_backend_apply_settings(HostMemoryBackend *backend,
+                                               void *ptr, uint64_t size,
+                                               Error **errp)
+{
+    bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
+
+    if (backend->merge) {
+        qemu_madvise(ptr, size, QEMU_MADV_MERGEABLE);
+    }
+    if (!backend->dump) {
+        qemu_madvise(ptr, size, QEMU_MADV_DONTDUMP);
+    }
+#ifdef CONFIG_NUMA
+    unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
+    /* lastbit == MAX_NODES means maxnode = 0 */
+    unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
+    /*
+     * Ensure policy won't be ignored in case memory is preallocated
+     * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
+     * this doesn't catch hugepage case.
+     */
+    unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
+    int mode = backend->policy;
+
+    /*
+     * Check for invalid host-nodes and policies and give more verbose
+     * error messages than mbind().
+     */
+    if (maxnode && backend->policy == MPOL_DEFAULT) {
+        error_setg(errp, "host-nodes must be empty for policy default,"
+                   " or you should explicitly specify a policy other"
+                   " than default");
+        return;
+    } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
+        error_setg(errp, "host-nodes must be set for policy %s",
+                   HostMemPolicy_str(backend->policy));
+        return;
+    }
+
+    /*
+     * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
+     * as argument to mbind() due to an old Linux bug (feature?) which
+     * cuts off the last specified node. This means backend->host_nodes
+     * must have MAX_NODES+1 bits available.
+     */
+    assert(sizeof(backend->host_nodes) >=
+           BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
+    assert(maxnode <= MAX_NODES);
+
+#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
+    if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
+        /*
+         * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
+         * silently picks the first node.
+         */
+        mode = MPOL_PREFERRED_MANY;
+    }
+#endif
+
+    if (maxnode &&
+        mbind(ptr, size, mode, backend->host_nodes, maxnode + 1, flags)) {
+        if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
+            error_setg_errno(errp, errno,
+                             "cannot bind memory to host NUMA nodes");
+            return;
+        }
+    }
+#endif
+    /*
+     * Preallocate memory after the NUMA policy has been instantiated.
+     * This is necessary to guarantee memory is allocated with
+     * specified NUMA policy in place.
+     */
+    if (backend->prealloc &&
+        !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
+                           ptr, size, backend->prealloc_threads,
+                           backend->prealloc_context, async, errp)) {
+        return;
+    }
+}
+
 char *
 host_memory_backend_get_name(HostMemoryBackend *backend)
 {
@@ -XXX,XX +XXX,XX @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
     void *ptr;
     uint64_t sz;
     size_t pagesize;
-    bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);

     if (!bc->alloc) {
         return;
@@ -XXX,XX +XXX,XX @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
         return;
     }

-    if (backend->merge) {
-        qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
-    }
-    if (!backend->dump) {
-        qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
-    }
-#ifdef CONFIG_NUMA
-    unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
-    /* lastbit == MAX_NODES means maxnode = 0 */
-    unsigned long maxnode = (lastbit + 1) % (MAX_NODES + 1);
-    /*
-     * Ensure policy won't be ignored in case memory is preallocated
-     * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
-     * this doesn't catch hugepage case.
-     */
-    unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
-    int mode = backend->policy;
-
-    /* check for invalid host-nodes and policies and give more verbose
-     * error messages than mbind(). */
-    if (maxnode && backend->policy == MPOL_DEFAULT) {
-        error_setg(errp, "host-nodes must be empty for policy default,"
-                   " or you should explicitly specify a policy other"
-                   " than default");
-        return;
-    } else if (maxnode == 0 && backend->policy != MPOL_DEFAULT) {
-        error_setg(errp, "host-nodes must be set for policy %s",
-                   HostMemPolicy_str(backend->policy));
-        return;
-    }
-
-    /*
-     * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
-     * as argument to mbind() due to an old Linux bug (feature?) which
-     * cuts off the last specified node. This means backend->host_nodes
-     * must have MAX_NODES+1 bits available.
-     */
-    assert(sizeof(backend->host_nodes) >=
-           BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
-    assert(maxnode <= MAX_NODES);
-
-#ifdef HAVE_NUMA_HAS_PREFERRED_MANY
-    if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
-        /*
-         * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
-         * silently picks the first node.
-         */
-        mode = MPOL_PREFERRED_MANY;
-    }
-#endif
-
-    if (maxnode &&
-        mbind(ptr, sz, mode, backend->host_nodes, maxnode + 1, flags)) {
-        if (backend->policy != MPOL_DEFAULT || errno != ENOSYS) {
-            error_setg_errno(errp, errno,
-                             "cannot bind memory to host NUMA nodes");
-            return;
-        }
-    }
-#endif
-    /*
-     * Preallocate memory after the NUMA policy has been instantiated.
-     * This is necessary to guarantee memory is allocated with
-     * specified NUMA policy in place.
-     */
-    if (backend->prealloc && !qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
-                                                ptr, sz,
-                                                backend->prealloc_threads,
-                                                backend->prealloc_context,
-                                                async, errp)) {
-        return;
-    }
+    host_memory_backend_apply_settings(backend, ptr, sz, errp);
 }

 static bool
--
2.43.5
From: William Roche <william.roche@oracle.com>

Let's register a RAM block notifier and react on remap notifications.
Simply re-apply the settings. Exit if something goes wrong.

Merging and dump settings are handled by the remap notification
in addition to memory policy and preallocation.

Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: William Roche <william.roche@oracle.com>
---
 backends/hostmem.c       | 34 ++++++++++++++++++++++++++++++++++
 include/system/hostmem.h |  1 +
 system/physmem.c         |  4 ----
...
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -XXX,XX +XXX,XX @@ void qemu_ram_remap(ram_addr_t addr)
 {
     RAMBlock *block;
     uint64_t offset;
-    void *vaddr;
     size_t page_size;

     RAMBLOCK_FOREACH(block) {
@@ -XXX,XX +XXX,XX @@ void qemu_ram_remap(ram_addr_t addr)
...
-        vaddr = ramblock_ptr(block, offset);
         if (block->flags & RAM_PREALLOC) {
             ;
         } else if (xen_enabled()) {
@@ -XXX,XX +XXX,XX @@ void qemu_ram_remap(ram_addr_t addr)
                 exit(1);
             }
         }
-        memory_try_enable_merging(vaddr, page_size);
-        qemu_ram_setup_dump(vaddr, page_size);
         ram_block_notify_remap(block->host, offset, page_size);
     }

--
2.43.5