From: Ankit Agrawal <ankita@nvidia.com>

The kernel MM currently handles ECC errors / poison only on memory pages
backed by struct page. The handling is currently missing for PFNMAP
memory that does not have struct pages. This series adds such support.

Implement new ECC handling for memory without struct pages. Kernel MM
exposes registration APIs to allow modules that manage the device to
register its device memory region. MM then tracks such regions using an
interval tree.

The mechanism is largely similar to that of ECC on PFNs with struct
pages. If there is an ECC error on a PFN, all the mappings to it are
identified and a SIGBUS is sent to the user space processes owning those
mappings. Note that there is one primary difference versus the handling
of poison on struct pages, which is to skip unmapping the poisoned PFN.
This is done to handle the huge PFNMAP support added recently [1] that
enables VM_PFNMAP vmas to map at PMD level. Otherwise, poison on a PFN
would require breaking the PMD mapping into PTEs to unmap only the
poisoned PFN. This can have a major performance impact.

The nvgrace-gpu-vfio-pci module maps the device memory to user VA (Qemu)
using remap_pfn_range without it being added to the kernel [2]. These
device memory PFNs are not backed by struct page. So make the
nvgrace-gpu-vfio-pci module use this mechanism to get poison handling
support on the device memory.

Patch rebased to v6.17-rc7.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---

Link: https://lore.kernel.org/all/20231123003513.24292-1-ankita@nvidia.com/ [v2]

v2 -> v3
- Rebased to v6.17-rc7.
- Skipped the unmapping of PFNMAP during reception of poison. Suggested by
  Jason Gunthorpe, Jiaqi Yan, Vikram Sethi (Thanks!)
- Updated the check to prevent multiple registrations to the same PFN
  range using interval_tree_iter_first. Thanks Shameer Kolothum for the
  suggestion.
- Removed the callback function in nvgrace-gpu requiring tracking of
  poisoned PFNs as it isn't required anymore.
- Introduced separate collect_procs_pfn function to collect the list of
  processes mapping to the poisoned PFN.

v1 -> v2
- Changed poisoned page tracking from bitmap to hashtable.
- Addressed miscellaneous comments in v1.

Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]
Link: https://lore.kernel.org/all/20240220115055.23546-1-ankita@nvidia.com/ [2]

Ankit Agrawal (3):
  mm: handle poisoning of pfn without struct pages
  mm: Change ghes code to allow poison of non-struct pfn
  vfio/nvgrace-gpu: register device memory for poison handling

 MAINTAINERS                         |   1 +
 drivers/acpi/apei/ghes.c            |   6 --
 drivers/vfio/pci/nvgrace-gpu/main.c |  45 +++++++++-
 include/linux/memory-failure.h      |  17 ++++
 include/linux/mm.h                  |   1 +
 include/ras/ras_event.h             |   1 +
 mm/Kconfig                          |   1 +
 mm/memory-failure.c                 | 128 +++++++++++++++++++++++++++-
 8 files changed, 192 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/memory-failure.h

--
2.34.1
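To illustrate the mechanism the cover letter describes, here is a
minimal userspace C sketch of range registration with an overlap check
and the poison-time lookup. This is not the kernel code: the series uses
an interval tree (interval_tree_iter_first) and sends SIGBUS to mapping
processes, while this sketch uses a flat array, and all function and
variable names here are hypothetical.

```c
/* Hypothetical sketch of registering device-memory PFN ranges and
 * checking a poisoned PFN against them. A flat array stands in for
 * the kernel's interval tree. */
#include <assert.h>
#include <stddef.h>

struct pfn_range {
	unsigned long start_pfn;	/* inclusive */
	unsigned long end_pfn;		/* inclusive */
};

#define MAX_RANGES 16
static struct pfn_range registered[MAX_RANGES];
static size_t nr_registered;

/* Mirrors the v3 change: reject a registration that overlaps an
 * already-registered range. Returns 0 on success, -1 on failure. */
int register_pfn_range(unsigned long start, unsigned long end)
{
	size_t i;

	if (start > end || nr_registered == MAX_RANGES)
		return -1;
	for (i = 0; i < nr_registered; i++)
		if (start <= registered[i].end_pfn &&
		    end >= registered[i].start_pfn)
			return -1;	/* overlaps an existing range */
	registered[nr_registered].start_pfn = start;
	registered[nr_registered].end_pfn = end;
	nr_registered++;
	return 0;
}

/* On an ECC error, the handler would first check whether the PFN lies
 * in a registered region; only then would it collect the processes
 * mapping it and deliver SIGBUS (without splitting the PMD mapping). */
int pfn_is_registered(unsigned long pfn)
{
	size_t i;

	for (i = 0; i < nr_registered; i++)
		if (pfn >= registered[i].start_pfn &&
		    pfn <= registered[i].end_pfn)
			return 1;
	return 0;
}
```

The overlap test above plays the role that interval_tree_iter_first
plays in the series: a non-empty intersection with any registered range
means the new registration must be refused.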
* ankita@nvidia.com <ankita@nvidia.com> [251021 06:23]:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The kernel MM currently handles ECC errors / poison only on memory pages
> backed by struct page. The handling is currently missing for PFNMAP
> memory that does not have struct pages. This series adds such support.
>
> Implement new ECC handling for memory without struct pages. Kernel MM
> exposes registration APIs to allow modules that manage the device to
> register its device memory region. MM then tracks such regions using an
> interval tree.
>
> The mechanism is largely similar to that of ECC on PFNs with struct
> pages. If there is an ECC error on a PFN, all the mappings to it are
> identified and a SIGBUS is sent to the user space processes owning those
> mappings. Note that there is one primary difference versus the handling
> of poison on struct pages, which is to skip unmapping the poisoned PFN.
> This is done to handle the huge PFNMAP support added recently [1] that
> enables VM_PFNMAP vmas to map at PMD level. Otherwise, poison on a PFN
> would require breaking the PMD mapping into PTEs to unmap only the
> poisoned PFN. This can have a major performance impact.

Is the performance impact really a concern in the event of failed
memory? Does this happen enough to warrant this special case?

Surely it's not failing hardware that may cause performance impacts, so
is this triggered in some other way that I'm missing or a conversation
pointer?

> The nvgrace-gpu-vfio-pci module maps the device memory to user VA (Qemu)
> using remap_pfn_range without it being added to the kernel [2]. These
> device memory PFNs are not backed by struct page. So make the
> nvgrace-gpu-vfio-pci module use this mechanism to get poison handling
> support on the device memory.
>
> Patch rebased to v6.17-rc7.
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>
> Link: https://lore.kernel.org/all/20231123003513.24292-1-ankita@nvidia.com/ [v2]
>
> v2 -> v3
> - Rebased to v6.17-rc7.
> - Skipped the unmapping of PFNMAP during reception of poison. Suggested by
>   Jason Gunthorpe, Jiaqi Yan, Vikram Sethi (Thanks!)
> - Updated the check to prevent multiple registrations to the same PFN
>   range using interval_tree_iter_first. Thanks Shameer Kolothum for the
>   suggestion.
> - Removed the callback function in nvgrace-gpu requiring tracking of
>   poisoned PFNs as it isn't required anymore.
> - Introduced separate collect_procs_pfn function to collect the list of
>   processes mapping to the poisoned PFN.
>
> v1 -> v2
> - Changed poisoned page tracking from bitmap to hashtable.
> - Addressed miscellaneous comments in v1.
>
> Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]
> Link: https://lore.kernel.org/all/20240220115055.23546-1-ankita@nvidia.com/ [2]
>
> Ankit Agrawal (3):
>   mm: handle poisoning of pfn without struct pages
>   mm: Change ghes code to allow poison of non-struct pfn
>   vfio/nvgrace-gpu: register device memory for poison handling
>
>  MAINTAINERS                         |   1 +
>  drivers/acpi/apei/ghes.c            |   6 --
>  drivers/vfio/pci/nvgrace-gpu/main.c |  45 +++++++++-
>  include/linux/memory-failure.h      |  17 ++++
>  include/linux/mm.h                  |   1 +
>  include/ras/ras_event.h             |   1 +
>  mm/Kconfig                          |   1 +
>  mm/memory-failure.c                 | 128 +++++++++++++++++++++++++++-
>  8 files changed, 192 insertions(+), 8 deletions(-)
>  create mode 100644 include/linux/memory-failure.h
>
> --
> 2.34.1
>
On Tue, Oct 21, 2025 at 12:30:48PM -0400, Liam R. Howlett wrote:
> > enables VM_PFNMAP vmas to map at PMD level. Otherwise, poison on a PFN
> > would require breaking the PMD mapping into PTEs to unmap only the
> > poisoned PFN. This can have a major performance impact.
>
> Is the performance impact really a concern in the event of failed
> memory?

Yes, something like the KVM S2 is very sensitive to page size for TLB
performance.

> Does this happen enough to warrant this special case?

If you have a 100k sized cluster it happens constantly :\

> Surely it's not failing hardware that may cause performance impacts, so
> is this triggered in some other way that I'm missing or a conversation
> pointer?

It is the splitting of a pgd/pmd level into PTEs that gets mirrored
into the S2 and then greatly increases the cost of table walks inside
a guest. The HW caches are sized for 1G S2 PTEs, not 4k.

Jason
* Jason Gunthorpe <jgg@nvidia.com> [251021 12:44]:
> On Tue, Oct 21, 2025 at 12:30:48PM -0400, Liam R. Howlett wrote:
> > > enables VM_PFNMAP vmas to map at PMD level. Otherwise, poison on a PFN
> > > would require breaking the PMD mapping into PTEs to unmap only the
> > > poisoned PFN. This can have a major performance impact.
> >
> > Is the performance impact really a concern in the event of failed
> > memory?
>
> Yes, something like the KVM S2 is very sensitive to page size for TLB
> performance.
>
> > Does this happen enough to warrant this special case?
>
> If you have a 100k sized cluster it happens constantly :\
>
> > Surely it's not failing hardware that may cause performance impacts, so
> > is this triggered in some other way that I'm missing or a conversation
> > pointer?
>
> It is the splitting of a pgd/pmd level into PTEs that gets mirrored
> into the S2 and then greatly increases the cost of table walks inside
> a guest. The HW caches are sized for 1G S2 PTEs, not 4k.

Ah, I see. Seems like a worthy addition to the commit message? I mean,
this is really a choice of throwing away memory for the benefit of TLB
performance. Seems like a valid choice in your use case but less so for
the average laptop.

Won't leaving the poisoned memory mapped cause migration issues? Even
if the machine is migrated, my understanding is the poison follows
through checkpoint restore.

Thanks,
Liam
On Tue, Oct 21, 2025 at 02:54:10PM -0400, Liam R. Howlett wrote:
> > > Surely it's not failing hardware that may cause performance impacts, so
> > > is this triggered in some other way that I'm missing or a conversation
> > > pointer?
> >
> > It is the splitting of a pgd/pmd level into PTEs that gets mirrored
> > into the S2 and then greatly increases the cost of table walks inside
> > a guest. The HW caches are sized for 1G S2 PTEs, not 4k.
>
> Ah, I see. Seems like a worthy addition to the commit message? I mean,
> this is really a choice of throwing away memory for the benefit of TLB
> performance. Seems like a valid choice in your use case but less so for
> the average laptop.

No memory is being thrown away; the choice is whether the kernel will
protect itself from userspace issuing repeated reads to the bad memory.

Ankit, please include some of these details in the commit message.

> Won't leaving the poisoned memory mapped cause migration issues? Even
> if the machine is migrated, my understanding is the poison follows
> through checkpoint restore.

The VMM has to keep track of this and not try to read the bad memory
during migration.

Jason