From: William Roche <william.roche@oracle.com>

Hi David,

Here is a new version of our code, but I still need to double-check the
mmap behavior in case of a memory error impacting:
- a clean page of an empty or populated file
- already mapped using MAP_SHARED or MAP_PRIVATE
to see if mmap() can recover the area or not. But I wanted to provide
this version to know if this is the kind of implementation you were
expecting.

And here is a slightly updated description of the patch set:

---

This set of patches fixes several problems with hardware memory errors
impacting hugetlbfs memory backed VMs. When using hugetlbfs large pages,
any large page location being impacted by an HW memory error results in
poisoning the entire page, suddenly making a large chunk of the VM
memory unusable.

The main problem that currently exists in QEMU is the lack of backend
file repair before resetting the VM memory, resulting in the impacted
memory being silently unusable even after a VM reboot.

In order to fix this issue, we take into account the page size of the
impacted memory block when dealing with the associated poisoned page
location.

Using the page size information we also try to regenerate the memory by
calling ram_block_discard_range() on VM reset, when running
qemu_ram_remap(), so that poisoned memory backed by a hugetlbfs file is
regenerated with a hole punched in this file. A new page is loaded when
the location is first touched. In case of a discard failure we fall back
to unmapping and remapping the memory location, and reset the memory
settings.

We also have to honor the 'prealloc' attribute even after a successful
discard, so we reapply the memory settings in this case too. This memory
setting is performed by a new remap notification mechanism calling the
host_memory_backend_ram_remapped() function when a region of a memory
block is remapped.
We also enrich the messages used to report a memory error relayed to the
VM, providing an identification of the memory page and its size in case
of an impacted large page.

----
v2 -> v3:
. dropped the size parameter from qemu_ram_remap() and determine the
  page size when adding it to the poison list, aligning the offset down
  to the page size. Multiple sub-pages poisoned on a large page lead to
  a single poison entry.
. introduced a helper function for the mmap code
. added "on lost large page <size>@<ram_addr>" to the error injection
  message (notation used in qemu_ram_remap() too). So, only in the case
  of a large page, it looks like:
  qemu-system-x86_64: Guest MCE Memory Error at QEMU addr 0x7fc1f5dd6000
  and GUEST addr 0x19fd6000 on lost large page 200000@19e00000 of type
  BUS_MCEERR_AR injected
. as we need the page_size value for the above message, I retrieve the
  value in kvm_arch_on_sigbus_vcpu() to pass the appropriate pointer to
  kvm_hwpoison_page_add(), which doesn't need to align it anymore.
. added a similar message for the ARM platform (removing the MCE keyword)
. also introduced a "fail hard" in the remap notification:
  host_memory_backend_ram_remapped()

This code is scripts/checkpatch.pl clean.
'make check' runs fine on both x86 and ARM.
David Hildenbrand (3):
  numa: Introduce and use ram_block_notify_remap()
  hostmem: Factor out applying settings
  hostmem: Handle remapping of RAM

William Roche (4):
  hwpoison_page_list and qemu_ram_remap are based of pages
  system/physmem: poisoned memory discard on reboot
  accel/kvm: Report the loss of a large memory page
  system/physmem: Memory settings applied on remap notification

 accel/kvm/kvm-all.c       |   2 +-
 backends/hostmem.c        | 189 +++++++++++++++++++++++--------------
 hw/core/numa.c            |  11 +++
 include/exec/cpu-common.h |   3 +-
 include/exec/ramlist.h    |   3 +
 include/sysemu/hostmem.h  |   1 +
 system/physmem.c          |  90 +++++++++++++-----
 target/arm/kvm.c          |  13 +++
 target/i386/kvm/kvm.c     |  18 +++-
 9 files changed, 227 insertions(+), 103 deletions(-)

-- 
2.43.5
Hello David,
I've finally tested many page mapping possibilities and tried to
identify the error injection reaction on these pages to see if mmap()
can be used to recover the impacted area.
I'm using the latest upstream kernel I have for that:
6.12.0-rc7.master.20241117.ol9.x86_64
But I also got similar results with a kernel not supporting
MADV_DONTNEED, for example: 5.15.0-301.163.5.2.el9uek.x86_64
Let's start with mapping a file without modifying the mapped area:
In this case we should have a clean page cache mapped in the process.
If an error is injected on this page, the kernel doesn't even inform the
process about the error, as the page is simply replaced (no matter
whether the mapping was shared or not).
The kernel indicates this situation with the following messages:
[10759.371701] Injecting memory failure at pfn 0x10d88e
[10759.374922] Memory failure: 0x10d88e: corrupted page was clean:
dropped without side effects
[10759.377525] Memory failure: 0x10d88e: recovery action for clean LRU
page: Recovered
Now, when the page content has been modified, in the case of standard
page size we need to distinguish between MAP_PRIVATE and MAP_SHARED:
- in the case of a MAP_PRIVATE page, the page is corrupted and the
modified data is lost; the kernel will use the SIGBUS mechanism to
inform the process if needed.
But remapping the area sweeps away the poisoned page and allows the
process to use the area again.
- in the case of a MAP_SHARED page, if the content hasn't been synced
with the file backend, we also lose the modified data, and the kernel
can also raise SIGBUS.
Remapping the area recreates a page cache page from the "on disk" file
content, clearing the error.
In both cases, the kernel indicates messages like:
[41589.578750] Injecting memory failure for pfn 0x122105 at process
virtual address 0x7f13bad55000
[41589.582237] Memory failure: 0x122105: Sending SIGBUS to testdh:7343
due to hardware memory corruption
[41589.584907] Memory failure: 0x122105: recovery action for dirty LRU
page: Recovered
Now, in the case of hugetlbfs pages:
This case behaves the same way as the standard page size when using
MAP_PRIVATE: mmap() of the underlying file is able to sweep away the
poisoned page.
But the MAP_SHARED case is different: mmap() doesn't clear anything;
fallocate() must be used.
In both cases, the kernel indicates messages like:
[89141.724295] Injecting memory failure for pfn 0x117800 at process
virtual address 0x7fd148800000
[89141.727103] Memory failure: 0x117800: Sending SIGBUS to testdh:9480
due to hardware memory corruption
[89141.729829] Memory failure: 0x117800: recovery action for huge page:
Recovered
Conclusion:
We can't count on the mmap() method alone for the hugetlbfs case with
MAP_SHARED.
So, according to these test results, we should change the part of the
qemu_ram_remap() function (in the 2nd patch) to something like:

+        if (ram_block_discard_range(block, offset + block->fd_offset,
+                                    length) != 0) {
+            /*
+             * Fall back to using mmap(), but it cannot repair a
+             * shared hugetlbfs region. In this case we fail.
+             */
+            if (block->fd >= 0 && qemu_ram_is_shared(block) &&
+                (length > TARGET_PAGE_SIZE)) {
+                error_report("Memory hugetlbfs poison recovery failure addr: "
+                             RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                             length, addr);
+                exit(1);
+            }
+            qemu_ram_remap_mmap(block, vaddr, page_size, offset);
+            memory_try_enable_merging(vaddr, size);
+            qemu_ram_setup_dump(vaddr, size);
         }
This should also change the subsequent patch accordingly.
(As a side note about the 3rd patch, I'll also adjust the lp_msg[57]
message size to 54 bytes instead, as there is no '0x' prefix on the
hexadecimal values and the message ends with a zero.)
So if you agree with this v3 proposal (including the above
modifications), I can submit a v4 version for integration.
Please let me know what you think about that, and if you see any
additional change we should consider before the integration.
Thanks in advance,
William.
On 02.12.24 16:41, William Roche wrote:
> Hello David,
Hi,
sorry for not reviewing yet, I was rather sick the last 1.5 weeks.
>
> I've finally tested many page mapping possibilities and tried to
> identify the error injection reaction on these pages to see if mmap()
> can be used to recover the impacted area.
> I'm using the latest upstream kernel I have for that:
> 6.12.0-rc7.master.20241117.ol9.x86_64
> But I also got similar results with a kernel not supporting
> MADV_DONTNEED, for example: 5.15.0-301.163.5.2.el9uek.x86_64
>
>
> Let's start with mapping a file without modifying the mapped area:
> In this case we should have a clean page cache mapped in the process.
> If an error is injected on this page, the kernel doesn't even inform the
> process about the error as the page is replaced (no matter if the
> mapping was shared of not).
>
> The kernel indicates this situation with the following messages:
>
> [10759.371701] Injecting memory failure at pfn 0x10d88e
> [10759.374922] Memory failure: 0x10d88e: corrupted page was clean:
> dropped without side effects
> [10759.377525] Memory failure: 0x10d88e: recovery action for clean LRU
> page: Recovered
Right. The reason here is that we can simply allocate a new page and
load data from disk. No corruption.
>
>
> Now when the page content is modified, in the case of standard page
> size, we need to consider a MAP_PRIVATE or MAP_SHARED
> - in the case of a MAP_PRIVATE page, this page is corrupted and the
> modified data are lost, the kernel will use the SIGBUS mechanism to
> inform this process if needed.
> But remapping the area sweeps away the poisoned page, and allows the
> process to use the area.
>
> - In the case of a MAP_SHARED page, if the content hasn't been sync'ed
> with the file backend, we also loose the modified data, and the kernel
> can also raise SIGBUS.
> Remapping the area recreates a page cache from the "on disk" file
> content, clearing the error.
In a mmap(MAP_SHARED, fd) region that will also require fallocate IIUC.
>
> In both cases, the kernel indicates messages like:
> [41589.578750] Injecting memory failure for pfn 0x122105 at process
> virtual address 0x7f13bad55000
> [41589.582237] Memory failure: 0x122105: Sending SIGBUS to testdh:7343
> due to hardware memory corruption
> [41589.584907] Memory failure: 0x122105: recovery action for dirty LRU
> page: Recovered
> >
> Now in the case of hugetlbfs pages:
> This case behaves the same way as the standard page size when using
> MAP_PRIVATE: mmap of the underlying file is able to sweep away the
> poisoned page.
> But the MAP_SHARED case is different: mmap() doesn't clear anything.
> fallocate() must be used.
Yes, I recall that is what I initially said. The behavior with
MAP_SHARED should be consistent between hugetlb and !hugetlb.
>
>
> In both cases, the kernel indicates messages like:
> [89141.724295] Injecting memory failure for pfn 0x117800 at process
> virtual address 0x7fd148800000
> [89141.727103] Memory failure: 0x117800: Sending SIGBUS to testdh:9480
> due to hardware memory corruption
> [89141.729829] Memory failure: 0x117800: recovery action for huge page:
> Recovered
>
> Conclusion:
> We can't count on the mmap() method only for the hugetlbfs case with
> MAP_SHARED.
>
> So According to these tests results, we should change the part of the
> qemu_ram_remap() function (in the 2nd patch) to something like:
>
> + if (ram_block_discard_range(block, offset +
> block->fd_offset,
> + length) != 0) {
> + /*
> + * Fold back to using mmap(), but it cannot repair a
> + * shared hugetlbfs region. In this case we fail.
> + */
But why do we special-case hugetlb here? How would mmap(MAP_FIXED) help
to discard dirty pagecache data in a mmap(MAP_SHARED, fd) mapping?
--
Cheers,
David / dhildenb
On 12/2/24 17:00, David Hildenbrand wrote:
> On 02.12.24 16:41, William Roche wrote:
>> Hello David,
>
> Hi,
>
> sorry for reviewing yet, I was rather sick the last 1.5 weeks.
I hope you get well soon!
>> I've finally tested many page mapping possibilities and tried to
>> identify the error injection reaction on these pages to see if mmap()
>> can be used to recover the impacted area.
>> I'm using the latest upstream kernel I have for that:
>> 6.12.0-rc7.master.20241117.ol9.x86_64
>> But I also got similar results with a kernel not supporting
>> MADV_DONTNEED, for example: 5.15.0-301.163.5.2.el9uek.x86_64
>>
>>
>> Let's start with mapping a file without modifying the mapped area:
>> In this case we should have a clean page cache mapped in the process.
>> If an error is injected on this page, the kernel doesn't even inform the
>> process about the error as the page is replaced (no matter if the
>> mapping was shared of not).
>>
>> The kernel indicates this situation with the following messages:
>>
>> [10759.371701] Injecting memory failure at pfn 0x10d88e
>> [10759.374922] Memory failure: 0x10d88e: corrupted page was clean:
>> dropped without side effects
>> [10759.377525] Memory failure: 0x10d88e: recovery action for clean LRU
>> page: Recovered
>
> Right. The reason here is that we can simply allocate a new page and
> load data from disk. No corruption.
>
>>
>>
>> Now when the page content is modified, in the case of standard page
>> size, we need to consider a MAP_PRIVATE or MAP_SHARED
>> - in the case of a MAP_PRIVATE page, this page is corrupted and the
>> modified data are lost, the kernel will use the SIGBUS mechanism to
>> inform this process if needed.
>> But remapping the area sweeps away the poisoned page, and allows the
>> process to use the area.
>>
>> - In the case of a MAP_SHARED page, if the content hasn't been sync'ed
>> with the file backend, we also loose the modified data, and the kernel
>> can also raise SIGBUS.
>> Remapping the area recreates a page cache from the "on disk" file
>> content, clearing the error.
>
> In a mmap(MAP_SHARED, fd) region that will also require fallocate IIUC.
I would have expected the same thing, but what I noticed is that in the
case of !hugetlb, even poisoned shared memory seems to be recovered with:
mmap(location, size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, fd, 0)
But we can decide that the normal behavior in this case would be to
require an fallocate() call, and if this call fails, we fail the recovery.
My tests showed that a standard sized page can be replaced by a new one
calling the above mmap(), while the shared hugetlb case doesn't work
this way.
>>
>> In both cases, the kernel indicates messages like:
>> [41589.578750] Injecting memory failure for pfn 0x122105 at process
>> virtual address 0x7f13bad55000
>> [41589.582237] Memory failure: 0x122105: Sending SIGBUS to testdh:7343
>> due to hardware memory corruption
>> [41589.584907] Memory failure: 0x122105: recovery action for dirty LRU
>> page: Recovered
> > >
>> Now in the case of hugetlbfs pages:
>> This case behaves the same way as the standard page size when using
>> MAP_PRIVATE: mmap of the underlying file is able to sweep away the
>> poisoned page.
>> But the MAP_SHARED case is different: mmap() doesn't clear anything.
>> fallocate() must be used.
>
> Yes, I recall that is what I initially said. The behavior with
> MAP_SHARED should be consistent between hugetlb and !hugetlb.
The tests showed that they are different.
>>
>>
>> In both cases, the kernel indicates messages like:
>> [89141.724295] Injecting memory failure for pfn 0x117800 at process
>> virtual address 0x7fd148800000
>> [89141.727103] Memory failure: 0x117800: Sending SIGBUS to testdh:9480
>> due to hardware memory corruption
>> [89141.729829] Memory failure: 0x117800: recovery action for huge page:
>> Recovered
>>
>> Conclusion:
>> We can't count on the mmap() method only for the hugetlbfs case with
>> MAP_SHARED.
>>
At the end of this email, I included the source code of a simplistic
test case that shows that the page is replaced in the case of standard
page size.
The idea of this test is simple:
1/ Create a local FILE with:
# dd if=/dev/zero of=./FILE bs=4k count=2
2+0 records in
2+0 records out
8192 bytes (8.2 kB, 8.0 KiB) copied, 0.000337674 s, 24.3 MB/s
2/ As root run:
# ./poisonedShared4k
Mapping 8192 bytes from file FILE
Reading and writing the first 2 pages content:
Read: Read: Wrote: Initial mem page 0
Wrote: Initial mem page 1
Data pages at 0x7f71a19d6000 physically 0x124fb0000
Data pages at 0x7f71a19d7000 physically 0x128ce4000
Poisoning 4k at 0x7f71a19d6000
Signal 7 received
code 4 Signal code
addr 0x7f71a19d6000 Memory location
si_addr_lsb 12
siglongjmp used
Remapping the poisoned page
Reading and writing the first 2 pages content:
Read: Read: Initial mem page 1
Wrote: Rewrite mem page 0
Wrote: Rewrite mem page 1
Data pages at 0x7f71a19d6000 physically 0x10c367000
Data pages at 0x7f71a19d7000 physically 0x128ce4000
---
As we can see, this process:
- maps the FILE,
- tries to read and write the beginning of the first 2 pages
- gives their physical addresses
- poisons the first page with a madvise(MADV_HWPOISON) call
- shows the SIGBUS signal received and recovers from it
- simply remaps the same page from the file
- tries again to read and write the beginning of the first 2 pages
- gives their physical addresses
---
The test (run on 6.12.0-rc7.master.20241117.ol9.x86_64) showed the
memory is usable after the remap.
Do you see a different behavior with an even more recent kernel?
>> So According to these tests results, we should change the part of the
>> qemu_ram_remap() function (in the 2nd patch) to something like:
>>
>> +        if (ram_block_discard_range(block, offset + block->fd_offset,
>> +                                    length) != 0) {
>> +            /*
>> +             * Fall back to using mmap(), but it cannot repair a
>> +             * shared hugetlbfs region. In this case we fail.
>> +             */
>
>
> But why do we special-case hugetlb here? How would mmap(MAP_FIXED) help
> to discard dirty pagecache data in a mmap(MAP_SHARED, fd) mapping?
You can see the behavior with the test case.
But for Qemu, we could decide to ignore that, and choose to fail in the
generic case:
+            /*
+             * Fall back to using mmap(), but it should not repair a
+             * shared file memory region. In this case we fail.
+             */
+            if (block->fd >= 0 && qemu_ram_is_shared(block)) {
+                error_report("Shared memory poison recovery failure addr: "
+                             RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                             length, addr);
+                exit(1);
+            }
Do you think this would be more secure?
HTH,
William.
---------------------------------
#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <stdint.h>
#include <signal.h>
#include <string.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <setjmp.h>

#define PAGEMAP_ENTRY 8
#define GET_BIT(X, Y) (((X) >> (Y)) & 1)
#define GET_PFN(X) ((X) & 0x7FFFFFFFFFFFFF)

const int __endian_bit = 1;
#define is_bigendian() ((*(char *)&__endian_bit) == 0)

#define ALLOC_PAGES 2
#define myFile "FILE"

static sigjmp_buf jmpbuf;

/*
 * Generate an error on the given page.
 */
static void memory_error_advise(void *virtual_page) {
    int ret;

    printf("Poisoning 4k at %p\n", virtual_page);
    if (sigsetjmp(jmpbuf, 1) == 0) {
        ret = madvise(virtual_page, 4096, MADV_HWPOISON);
        if (ret)
            printf("Poisoning failed - madvise: %s\n", strerror(errno));
    }
}

static void print_physical_address(uint64_t virt_addr) {
    char path_buf[0x100];
    FILE *f;
    uint64_t read_val, file_offset, pfn = 0;
    long pgsz;
    unsigned char c_buf[PAGEMAP_ENTRY];
    pid_t my_pid = getpid();
    int status, i;

    sprintf(path_buf, "/proc/%u/pagemap", my_pid);
    f = fopen(path_buf, "rb");
    if (!f) {
        printf("Error! Cannot open %s\n", path_buf);
        exit(EXIT_FAILURE);
    }
    pgsz = getpagesize();
    /* each page has an 8-byte entry in pagemap */
    file_offset = virt_addr / pgsz * PAGEMAP_ENTRY;
    status = fseek(f, file_offset, SEEK_SET);
    if (status) {
        perror("Failed to do fseek!");
        fclose(f);
        exit(EXIT_FAILURE);
    }
    for (i = 0; i < PAGEMAP_ENTRY; i++) {
        int c = getc(f);
        if (c == EOF) {
            fclose(f);
            exit(EXIT_FAILURE);
        }
        if (is_bigendian())
            c_buf[i] = c;
        else
            c_buf[PAGEMAP_ENTRY - i - 1] = c;
    }
    fclose(f);
    read_val = 0;
    for (i = 0; i < PAGEMAP_ENTRY; i++) {
        read_val = (read_val << 8) + c_buf[i];
    }
    if (GET_BIT(read_val, 63)) { /* bit 63: page present */
        pfn = GET_PFN(read_val);
    } else {
        printf("Page not present!\n");
    }
    if (GET_BIT(read_val, 62)) /* bit 62: page swapped */
        printf("Page swapped\n");
    if (pfn == 0) {
        printf("Virt address translation 0x%llx failed\n",
               (unsigned long long)virt_addr);
        exit(EXIT_FAILURE);
    }
    printf("Data pages at 0x%llx physically 0x%llx\n",
           (unsigned long long)virt_addr, (unsigned long long)pfn * pgsz);
}

/*
 * SIGBUS handler to display the given information.
 */
static void sigbus_action(int signum, siginfo_t *siginfo, void *ctx) {
    printf("Signal %d received\n", signum);
    printf("\tcode %d\t\tSignal code\n", siginfo->si_code);
    printf("\taddr %p\tMemory location\n", siginfo->si_addr);
    printf("\tsi_addr_lsb %d\n", (int)siginfo->si_addr_lsb);
    if (siginfo->si_code == 4) { /* BUS_MCEERR_AR */
        fprintf(stderr, "siglongjmp used\n");
        siglongjmp(jmpbuf, 1);
    }
}

static void read_write(void *addr, int nb_pages, char *prefix) {
    int i;

    fprintf(stderr, "Reading and writing the first %d pages content:\n",
            nb_pages);
    if (sigsetjmp(jmpbuf, 1) == 0) {
        /* read the strings at the beginning of each page */
        for (i = 0; i < nb_pages; i++) {
            printf("Read: %s", ((char *)addr + i * 4096));
        }
        /* also write something at the beginning of each page */
        for (i = 0; i < nb_pages; i++) {
            sprintf(((char *)addr + i * 4096), "%s %d\n", prefix, i);
            printf("Wrote: %s %d\n", prefix, i);
        }
    }
}

int main(int argc, char **argv) {
    int fd;
    struct sigaction my_sigaction;
    uint64_t virt_addr;
    void *local_pnt, *v;
    struct stat statbuf;
    off_t s;

    /* Need the CAP_SYS_ADMIN capability to get PFN values in pagemap. */
    if (getuid() != 0) {
        fprintf(stderr, "Usage: %s needs to run as root\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    /* attach our SIGBUS handler */
    memset(&my_sigaction, 0, sizeof(my_sigaction));
    my_sigaction.sa_sigaction = sigbus_action;
    my_sigaction.sa_flags = SA_SIGINFO | SA_NODEFER;
    if (sigaction(SIGBUS, &my_sigaction, NULL) == -1) {
        perror("Signal handler attach failed");
        exit(EXIT_FAILURE);
    }
    fd = open(myFile, O_RDWR);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }
    if (fstat(fd, &statbuf) == -1) {
        perror("fstat");
        exit(EXIT_FAILURE);
    }
    s = statbuf.st_size;
    if (s < 2 * 4096) {
        fprintf(stderr, "File must be at least 2 pages large\n");
        exit(EXIT_FAILURE);
    }
    printf("Mapping %lld bytes from file %s\n", (long long)s, myFile);
    local_pnt = mmap(NULL, s, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (local_pnt == MAP_FAILED) {
        perror("mmap");
        exit(EXIT_FAILURE);
    }
    read_write(local_pnt, 2, "Initial mem page");
    virt_addr = (uint64_t)local_pnt;
    print_physical_address(virt_addr);
    print_physical_address(virt_addr + getpagesize());
    /* Explicit error */
    memory_error_advise((void *)virt_addr);
    /* Remap the poisoned page */
    fprintf(stderr, "Remapping the poisoned page\n");
    v = mmap(local_pnt, 4096, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);
    if ((v == MAP_FAILED) || (v != local_pnt)) {
        perror("mmap");
    }
    read_write(local_pnt, 2, "Rewrite mem page");
    print_physical_address(virt_addr);
    print_physical_address(virt_addr + getpagesize());
    return 0;
}
On 03.12.24 01:15, William Roche wrote:
> On 12/2/24 17:00, David Hildenbrand wrote:
>> sorry for not reviewing yet, I was rather sick the last 1.5 weeks.
>
> I hope you get well soon!

Getting there, thanks! :)

[...]

>> In a mmap(MAP_SHARED, fd) region that will also require fallocate IIUC.
>
> I would have expected the same thing, but what I noticed is that in the
> case of !hugetlb, even poisoned shared memory seems to be recovered with:
> mmap(location, size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, fd, 0)

Let me take a look at your tool below if I can find an explanation of
what is happening, because it's weird :)

[...]

> As we can see, this process:
> - maps the FILE,
> - tries to read and write the beginning of the first 2 pages
> - gives their physical addresses
> - poisons the first page with a madvise(MADV_HWPOISON) call
> - shows the SIGBUS signal received and recovers from it
> - simply remaps the same page from the file
> - tries again to read and write the beginning of the first 2 pages
> - gives their physical addresses

Turns out the code will try to truncate the pagecache page using
mapping->a_ops->error_remove_folio().

That, however, is only implemented on *some* filesystems.

Most prominently, it is not implemented on shmem as well.

So if you run your test with shmem (e.g., /tmp/FILE), it doesn't work.

Using fallocate+MADV_DONTNEED seems to work on shmem.

-- 
Cheers,
David / dhildenb
On 12/3/24 15:08, David Hildenbrand wrote:
> [...]
>
> Let me take a look at your tool below if I can find an explanation of
> what is happening, because it's weird :)
>
> [...]
>
>>
>> At the end of this email, I included the source code of a simplistic
>> test case that shows that the page is replaced in the case of standard
>> page size.
>>
>> The idea of this test is simple:
>>
>> 1/ Create a local FILE with:
>> # dd if=/dev/zero of=./FILE bs=4k count=2
>> 2+0 records in
>> 2+0 records out
>> 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.000337674 s, 24.3 MB/s
>>
>> 2/ As root run:
>> # ./poisonedShared4k
>> Mapping 8192 bytes from file FILE
>> Reading and writing the first 2 pages content:
>> Read: Read: Wrote: Initial mem page 0
>> Wrote: Initial mem page 1
>> Data pages at 0x7f71a19d6000 physically 0x124fb0000
>> Data pages at 0x7f71a19d7000 physically 0x128ce4000
>> Poisoning 4k at 0x7f71a19d6000
>> Signal 7 received
>> code 4 Signal code
>> addr 0x7f71a19d6000 Memory location
>> si_addr_lsb 12
>> siglongjmp used
>> Remapping the poisoned page
>> Reading and writing the first 2 pages content:
>> Read: Read: Initial mem page 1
>> Wrote: Rewrite mem page 0
>> Wrote: Rewrite mem page 1
>> Data pages at 0x7f71a19d6000 physically 0x10c367000
>> Data pages at 0x7f71a19d7000 physically 0x128ce4000
>>
>>
>> ---
>>
>> As we can see, this process:
>> - maps the FILE,
>> - tries to read and write the beginning of the first 2 pages
>> - gives their physical addresses
>> - poison the first page with a madvise(MADV_HWPOISON) call
>> - shows the SIGBUS signal received and recovers from it
>> - simply remaps the same page from the file
>> - tries again to read and write the beginning of the first 2 pages
>> - gives their physical addresses
>>
>
>
> Turns out the code will try to truncate the pagecache page using
> mapping->a_ops->error_remove_folio().
>
> That, however, is only implemented on *some* filesystems.
>
> Most prominently, it is not implemented on shmem as well.
>
>
> So if you run your test with shmem (e.g., /tmp/FILE), it doesn't work.
Correct, on tmpfs the test case fails to continue to use the memory area
and gets a SIGBUS. And it works with xfs.
>
> Using fallocate+MADV_DONTNEED seems to work on shmem.
>
Our new QEMU code first tries the fallocate+MADV_DONTNEED procedure
for standard sized pages (in ram_block_discard_range()) and only falls
back to the mmap() use if it fails. So maybe my proposal to implement:
+            /*
+             * Fall back to using mmap(), but it should not repair a
+             * shared file memory region. In this case we fail.
+             */
+            if (block->fd >= 0 && qemu_ram_is_shared(block)) {
+                error_report("Shared memory poison recovery failure addr: "
+                             RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                             length, addr);
+                exit(1);
+            }
Could be the right choice.
William.
On 03.12.24 15:39, William Roche wrote:
> On 12/3/24 15:08, David Hildenbrand wrote:
>> [...]
>>
>> Let me take a look at your tool below if I can find an explanation of
>> what is happening, because it's weird :)
>>
>> [...]
>>
>>>
>>> At the end of this email, I included the source code of a simplistic
>>> test case that shows that the page is replaced in the case of standard
>>> page size.
>>>
>>> The idea of this test is simple:
>>>
>>> 1/ Create a local FILE with:
>>> # dd if=/dev/zero of=./FILE bs=4k count=2
>>> 2+0 records in
>>> 2+0 records out
>>> 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.000337674 s, 24.3 MB/s
>>>
>>> 2/ As root run:
>>> # ./poisonedShared4k
>>> Mapping 8192 bytes from file FILE
>>> Reading and writing the first 2 pages content:
>>> Read: Read: Wrote: Initial mem page 0
>>> Wrote: Initial mem page 1
>>> Data pages at 0x7f71a19d6000 physically 0x124fb0000
>>> Data pages at 0x7f71a19d7000 physically 0x128ce4000
>>> Poisoning 4k at 0x7f71a19d6000
>>> Signal 7 received
>>> code 4 Signal code
>>> addr 0x7f71a19d6000 Memory location
>>> si_addr_lsb 12
>>> siglongjmp used
>>> Remapping the poisoned page
>>> Reading and writing the first 2 pages content:
>>> Read: Read: Initial mem page 1
>>> Wrote: Rewrite mem page 0
>>> Wrote: Rewrite mem page 1
>>> Data pages at 0x7f71a19d6000 physically 0x10c367000
>>> Data pages at 0x7f71a19d7000 physically 0x128ce4000
>>>
>>>
>>> ---
>>>
>>> As we can see, this process:
>>> - maps the FILE,
>>> - tries to read and write the beginning of the first 2 pages
>>> - gives their physical addresses
>>> - poison the first page with a madvise(MADV_HWPOISON) call
>>> - shows the SIGBUS signal received and recovers from it
>>> - simply remaps the same page from the file
>>> - tries again to read and write the beginning of the first 2 pages
>>> - gives their physical addresses
>>>
>>
>>
>> Turns out the code will try to truncate the pagecache page using
>> mapping->a_ops->error_remove_folio().
>>
>> That, however, is only implemented on *some* filesystems.
>>
>> Most prominently, it is not implemented on shmem as well.
>>
>>
>> So if you run your test with shmem (e.g., /tmp/FILE), it doesn't work.
>
> Correct, on tmpfs the test case fails to continue to use the memory area
> and gets a SIGBUS. And it works with xfs.
>
>
>
>>
>> Using fallocate+MADV_DONTNEED seems to work on shmem.
>>
>
> Our new Qemu code is testing first the fallocate+MADV_DONTNEED procedure
> for standard sized pages (in ram_block_discard_range()) and only falls
> back to the mmap() use if it fails. So maybe my proposal to implement:
>
> +                /*
> +                 * Fall back to using mmap(), but it should not repair a
> +                 * shared file memory region. In this case we fail.
> +                 */
> +                if (block->fd >= 0 && qemu_ram_is_shared(block)) {
> +                    error_report("Shared memory poison recovery failure addr: "
> +                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                                 length, addr);
> +                    exit(1);
> +                }
>
> Could be the right choice.
Right. But then, what about a mmap(MAP_PRIVATE, shmem), where the
pagecache page is poisoned and needs an explicit fallocate? :)
It's all tricky. I wonder if we should just say "if it's backed by a
file, and we cannot discard, then mmap() can't fix it reliably".
if (block->fd >= 0) {
...
}
After all, we don't even expect the fallocate/MADV_DONTNEED to ever fail
:) So I was also wondering if we could get rid of the mmap(MAP_FIXED)
completely ... but who knows what older Linux kernels do.
--
Cheers,
David / dhildenb
On 12/3/24 16:00, David Hildenbrand wrote:
> On 03.12.24 15:39, William Roche wrote:
>> [...]
>> Our new Qemu code is testing first the fallocate+MADV_DONTNEED procedure
>> for standard sized pages (in ram_block_discard_range()) and only falls
>> back to the mmap() use if it fails. So maybe my proposal to implement:
>>
>> +                /*
>> +                 * Fall back to using mmap(), but it should not repair a
>> +                 * shared file memory region. In this case we fail.
>> +                 */
>> +                if (block->fd >= 0 && qemu_ram_is_shared(block)) {
>> +                    error_report("Shared memory poison recovery failure addr: "
>> +                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> +                                 length, addr);
>> +                    exit(1);
>> +                }
>>
>> Could be the right choice.
>
> Right. But then, what about a mmap(MAP_PRIVATE, shmem), where the
> pagecache page is poisoned and needs an explicit fallocate? :)
>
> It's all tricky. I wonder if we should just say "if it's backed by a
> file, and we cannot discard, then mmap() can't fix it reliably".
>
> if (block->fd >= 0) {
> ...
> }
>
> After all, we don't even expect the fallocate/MADV_DONTNEED to ever
> fail :) So I was also wondering if we could get rid of the
> mmap(MAP_FIXED) completely ... but who knows what older Linux kernels do.
I agree that we expect the ram_block_discard_range() function to be
working on recent kernels, and to do what's necessary to reuse the
poisoned memory area.
The case where older kernels could be a problem is where fallocate()
would not work on a given file, or madvise(MADV_DONTNEED or MADV_REMOVE)
would not work on standard sized pages, as ram_block_discard_range()
currently avoids using madvise on huge pages.
In this case, the generic/minimal way to make the memory usable again
(in all cases) would be to:
- fallocate/PUNCH_HOLE the given file (if any)
- and remap the area
even if it's not _mandatory_ in all cases.
Would you like me to add a fallocate(PUNCH_HOLE) call in the helper
function qemu_ram_remap_mmap() when a file descriptor is provided
(before remapping the area)?
This way, we don't need to know if ram_block_discard_range() failed on
the fallocate() or the madvise(); in the worst case scenario, we would
PUNCH twice. If fallocate fails or mmap fails, we exit.
I haven't seen a problem punching a file twice - do you see any problem?
Do you find this possibility acceptable? Or should I just go for the
immediate failure when ram_block_discard_range() fails on a case with a
file descriptor, as you suggest?
Please let me know if you find any problem with this approach, as it
could help to have this poison recovery scenario to work on more kernels.
Thanks in advance for your feedback.
William.
On 06.12.24 19:26, William Roche wrote:
> On 12/3/24 16:00, David Hildenbrand wrote:
>> On 03.12.24 15:39, William Roche wrote:
>>> [...]
>>> Our new Qemu code is testing first the fallocate+MADV_DONTNEED procedure
>>> for standard sized pages (in ram_block_discard_range()) and only falls
>>> back to the mmap() use if it fails. So maybe my proposal to implement:
>>>
>>> +                /*
>>> +                 * Fall back to using mmap(), but it should not repair a
>>> +                 * shared file memory region. In this case we fail.
>>> +                 */
>>> +                if (block->fd >= 0 && qemu_ram_is_shared(block)) {
>>> +                    error_report("Shared memory poison recovery failure addr: "
>>> +                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>>> +                                 length, addr);
>>> +                    exit(1);
>>> +                }
>>>
>>> Could be the right choice.
>>
>> Right. But then, what about a mmap(MAP_PRIVATE, shmem), where the
>> pagecache page is poisoned and needs an explicit fallocate? :)
>>
>> It's all tricky. I wonder if we should just say "if it's backed by a
>> file, and we cannot discard, then mmap() can't fix it reliably".
>>
>> if (block->fd >= 0) {
>> ...
>> }
>>
>> After all, we don't even expect the fallocate/MADV_DONTNEED to ever
>> fail :) So I was also wondering if we could get rid of the
>> mmap(MAP_FIXED) completely ... but who knows what older Linux kernels do.
>
Hi,
> I agree that we expect the ram_block_discard_range() function to be
> working on recent kernels, and to do what's necessary to reuse the
> poisoned memory area.
>
> The case where older kernels could be a problem is where fallocate()
> would not work on a given file
In that case we cannot possibly handle it correctly in all cases with
mmap(MAP_FIXED), especially not with shmem/hugetlb :/
So failing is correct.
> , or madvise(MADV_DONTNEED or MADV_REMOVE)
> would not work on standard sized pages.
I assume both can be expected to work in any reasonable Linux setup
today (-hugetlb) :)
> As ram_block_discard_range()
> currently avoids using madvise on huge pages.
Right, because support might not be around for hugetlb.
>
> In this case, the generic/minimal way to make the memory usable again
> (in all cases) would be to:
> - fallocate/PUNCH_HOLE the given file (if any)
> - and remap the area
> even if it's not _mandatory_ in all cases.
>
> Would you like me to add a fallocate(PUNCH_HOLE) call in the helper
> function qemu_ram_remap_mmap() when a file descriptor is provided
> (before remapping the area)?
>
> This way, we don't need to know if ram_block_discard_range() failed on
> the fallocate() or the madvise(); in the worst case scenario, we would
> PUNCH twice. If fallocate fails or mmap fails, we exit.
> I haven't seen a problem punching a file twice - do you see any problem?
Hm. I'd like to avoid another fallocate(). It would really only make
sense if we expect fallocate() to work but madvise() to fail; and I
don't think that's our expectation.
virtio-balloon has been using ram_block_discard_range() forever, and so
far nobody really ever reported any of the errors from the function to
the best of my knowledge ...
>
> Do you find this possibility acceptable? Or should I just go for the
> immediate failure when ram_block_discard_range() fails on a case with a
> file descriptor, as you suggest?
>
> Please let me know if you find any problem with this approach, as it
> could help to have this poison recovery scenario to work on more kernels.
I'd say, let's keep it simple. Try ram_block_discard_range(); if that
fails and (a) we have a file, bail out; (b) we don't have a file, do the
remap().
If we ever run into issues with that approach, we can investigate why it
fails and what to do about it (e.g., fallocate).
My best guess is that the whole remap part can be avoided completely.
--
Cheers,
David / dhildenb