[PATCH v5 0/3] mm: Implement ECC handling for pfn with no struct page

ankita@nvidia.com posted 3 patches 3 months ago
Posted by ankita@nvidia.com 3 months ago
From: Ankit Agrawal <ankita@nvidia.com>

Poison (or ECC) errors can be very common on a large cluster.
The kernel MM currently handles ECC errors / poison only on memory pages
backed by struct page. Handling is missing for PFNMAP memory that does
not have struct pages. This series adds that support.

Implement ECC handling for memory without struct pages. The kernel MM
exposes registration APIs that allow a module managing a device to
register its device memory region. MM then tracks such regions using an
interval tree.
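
To illustrate the registration idea, here is a minimal userspace sketch. It
is not the code from this series: the names pfn_region,
pfn_region_register(), pfn_region_unregister() and pfn_region_find() are
hypothetical, and a plain linked list stands in for the kernel interval
tree. It only shows the intended behaviour: a region is an inclusive PFN
range, overlapping registrations are refused, and a poisoned PFN can be
looked up to find the owning region.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of one registered device memory region. */
struct pfn_region {
	unsigned long start;		/* first PFN, inclusive */
	unsigned long last;		/* last PFN, inclusive */
	struct pfn_region *next;	/* list linkage (stand-in for a tree node) */
};

/* Head of the list of registered regions. */
struct pfn_region *pfn_regions;

/* Return the first registered region overlapping [start, last], or NULL. */
struct pfn_region *pfn_region_find(unsigned long start, unsigned long last)
{
	for (struct pfn_region *r = pfn_regions; r; r = r->next)
		if (r->start <= last && start <= r->last)
			return r;
	return NULL;
}

/* Register a region; refuse ranges that overlap an existing one. */
int pfn_region_register(struct pfn_region *region)
{
	if (region->start > region->last)
		return -EINVAL;
	if (pfn_region_find(region->start, region->last))
		return -EBUSY;
	region->next = pfn_regions;
	pfn_regions = region;
	return 0;
}

/* Unregister a region; fail if it was never registered. */
int pfn_region_unregister(struct pfn_region *region)
{
	for (struct pfn_region **p = &pfn_regions; *p; p = &(*p)->next) {
		if (*p == region) {
			*p = region->next;
			region->next = NULL;
			return 0;
		}
	}
	return -ENOENT;
}
```

A lookup for a single poisoned PFN is pfn_region_find(pfn, pfn). An interval
tree answers the same query in logarithmic rather than linear time, which
matters once many regions are registered.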

The mechanism is largely similar to that of ECC on PFNs with struct pages.
If there is an ECC error on a PFN, all the mappings to it are identified
and a SIGBUS is sent to the user space processes owning those mappings.
Note that there is one primary difference from the handling of poison on
struct pages: the faulty PFN is not unmapped. This is done to handle the
huge PFNMAP support added recently [1], which enables VM_PFNMAP vmas to
map at PMD or PUD level. Poison on a PFN mapped in such a way would
require breaking the PMD/PUD mapping into PTEs, which would get mirrored
into the S2. This can greatly increase the cost of table walks and have a
major performance impact.
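
As a rough illustration of that cost, the arithmetic below counts how many
PTE-level entries replace a single huge mapping once it is split. It
assumes 4 KiB base pages and 512 entries per page-table level (as on x86-64
and on arm64 with a 4 KiB granule); other configurations give different
numbers. The helper names are hypothetical.

```c
#include <stdint.h>

#define BASE_PAGE_SHIFT		12	/* assumption: 4 KiB base pages */
#define ENTRIES_PER_TABLE_SHIFT	9	/* assumption: 512 entries per table */

/*
 * Number of PTE-level entries needed to cover one mapping at the given
 * level: 0 = PTE, 1 = PMD, 2 = PUD.
 */
uint64_t ptes_per_mapping(unsigned int level)
{
	return (uint64_t)1 << (ENTRIES_PER_TABLE_SHIFT * level);
}

/* Size in bytes covered by one mapping at the given level. */
uint64_t mapping_size_bytes(unsigned int level)
{
	return ptes_per_mapping(level) << BASE_PAGE_SHIFT;
}
```

Under these assumptions a 2 MiB PMD mapping becomes 512 PTEs and a 1 GiB PUD
mapping becomes 262144 PTEs, so a single poisoned PFN would turn one
translation entry into hundreds or hundreds of thousands.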

The nvgrace-gpu-vfio-pci module maps the device memory to user VA (QEMU)
using remap_pfn_range without it being added to the kernel [2]. These
device memory PFNs are not backed by struct page. So make the
nvgrace-gpu-vfio-pci module use this mechanism to get poison handling
support on the device memory.
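
From the user space side, a process that maps the device memory (for
example a VMM) would see the poison as a SIGBUS. The sketch below shows one
way such a process might recognise it. This cover letter does not say which
si_code is delivered. The sketch assumes BUS_MCEERR_AR / BUS_MCEERR_AO, the
codes the existing memory-failure handling reports for struct page memory.
The function and variable names are hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>

/* Flags set by the handler; sig_atomic_t keeps the handler signal-safe. */
volatile sig_atomic_t sigbus_seen;
volatile sig_atomic_t sigbus_is_mce;

/* True if the SIGBUS si_code indicates a hardware memory error. */
int sigbus_code_is_memory_error(int si_code)
{
	return si_code == BUS_MCEERR_AR || si_code == BUS_MCEERR_AO;
}

/* Record that a SIGBUS arrived and whether it reports a memory error. */
static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;
	sigbus_seen = 1;
	sigbus_is_mce = sigbus_code_is_memory_error(info->si_code);
}

/* Install the handler with SA_SIGINFO so si_code is available. */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

A real consumer would also read info->si_addr to learn which virtual
address was affected, and would decide from there whether to forward the
error to a guest or terminate.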

Patch rebased to v6.17-rc7.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---

Link: https://lore.kernel.org/all/20251026141919.2261-1-ankita@nvidia.com/ [v4]

v4 -> v5
- Removed pfn_space NULL checks. Instead a wrong parameter would cause
a panic. (Thanks Andrew Morton for suggestion)
- Log message to mention kmalloc allocation error and the failure to
kill a process. (Thanks Andrew Morton)
- Wrapped comments at 80 chars.

v3 -> v4
- Added guards in memory_failure_pfn, register, unregister function to
simplify code. (Thanks Ira Weiny for suggestion).
- Collected reviewed-by from Shuai Xue (Thanks!) on the mm GHES patch. Also
moved it to the front of the series.
- Added check for interval_tree_iter_first before removing the device
memory region. (Thanks Jiaqi Yan for suggestion)
- Return MF_IGNORED if the pfn doesn't belong to any address space
mapping. (Thanks Miaohe Lin for suggestion).
- Updated patch commit to add more details on the perf impact on
HUGE PFNMAP (Thanks Jason Gunthorpe, Tony Luck for suggestion).

v2 -> v3
- Rebased to v6.17-rc7.
- Skipped the unmapping of PFNMAP during reception of poison. Suggested by
Jason Gunthorpe, Jiaqi Yan, Vikram Sethi (Thanks!)
- Updated the check to prevent multiple registration to the same PFN
range using interval_tree_iter_first. Thanks Shameer Kolothum for the
suggestion.
- Removed the callback function in the nvgrace-gpu requiring tracking of
poisoned PFN as it isn't required anymore.
- Introduced separate collect_procs_pfn function to collect the list of
processes mapping to the poisoned PFN.

v1 -> v2
- Change poisoned page tracking from bitmap to hashtable.
- Addressed miscellaneous comments in v1.

Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]
Link: https://lore.kernel.org/all/20240220115055.23546-1-ankita@nvidia.com/ [2]

Ankit Agrawal (3):
  mm: Change ghes code to allow poison of non-struct pfn
  mm: handle poisoning of pfn without struct pages
  vfio/nvgrace-gpu: register device memory for poison handling

 MAINTAINERS                         |   1 +
 drivers/acpi/apei/ghes.c            |   6 --
 drivers/vfio/pci/nvgrace-gpu/main.c |  45 ++++++++-
 include/linux/memory-failure.h      |  17 ++++
 include/linux/mm.h                  |   1 +
 include/ras/ras_event.h             |   1 +
 mm/Kconfig                          |   1 +
 mm/memory-failure.c                 | 145 +++++++++++++++++++++++++++-
 8 files changed, 209 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/memory-failure.h

-- 
2.34.1
Re: [PATCH v5 0/3] mm: Implement ECC handling for pfn with no struct page
Posted by Jiaqi Yan 3 weeks ago
On Sun, Nov 2, 2025 at 10:45 AM <ankita@nvidia.com> wrote:
>
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Poison (or ECC) errors can be very common on a large size cluster.
> The kernel MM currently handles ECC errors / poison only on memory page
> backed by struct page. The handling is currently missing for the PFNMAP
> memory that does not have struct pages. The series adds such support.
>
> Implement a new ECC handling for memory without struct pages. Kernel MM
> expose registration APIs to allow modules that are managing the device
> to register its device memory region. MM then tracks such regions using
> interval tree.
>
> The mechanism is largely similar to that of ECC on pfn with struct pages.
> If there is an ECC error on a pfn, all the mapping to it are identified
> and a SIGBUS is sent to the user space processes owning those mappings.
> Note that there is one primary difference versus the handling of the
> poison on struct pages, which is to skip unmapping to the faulty PFN.
> This is done to handle the huge PFNMAP support added recently [1] that
> enables VM_PFNMAP vmas to map at PMD or PUD level. A poison to a PFN
> mapped in such as way would need breaking the PMD/PUD mapping into PTEs
> that will get mirrored into the S2. This can greatly increase the cost
> of table walks and have a major performance impact.
>
> nvgrace-gpu-vfio-pci module maps the device memory to user VA (Qemu) using
> remap_pfn_range without being added to the kernel [2]. These device memory
> PFNs are not backed by struct page. So make nvgrace-gpu-vfio-pci module
> make use of the mechanism to get poison handling support on the device
> memory.
>
> Patch rebased to v6.17-rc7.
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>
> Link: https://lore.kernel.org/all/20251026141919.2261-1-ankita@nvidia.com/ [v4]
>
> v4 -> v5
> - Removed pfn_space NULL checks. Instead a wrong parameter would cause
> a panic. (Thanks Andrew Morton for suggestion)
> - Log message to mention kmalloc allocation error and the failure to
> kill a process. (Thanks Andrew Morton)
> - Comments with 80 chars.
>
> v3 -> v4
> - Added guards in memory_failure_pfn, register, unregister function to
> simplify code. (Thanks Ira Weiny for suggestion).
> - Collected reviewed-by from Shuai Xue (Thanks!) on the mm GHES patch. Also
> moved it to the front of the series.
> - Added check for interval_tree_iter_first before removing the device
> memory region. (Thanks Jiaqi Yan for suggestion)
> - If pfn doesn't belong to any address space mapping, returning
> MF_IGNORED (Thanks Miaohe Lin for suggestion).
> - Updated patch commit to add more details on the perf impact on
> HUGE PFNMAP (Thanks Jason Gunthorpe, Tony Luck for suggestion).
>
> v2 -> v3
> - Rebased to v6.17-rc7.
> - Skipped the unmapping of PFNMAP during reception of poison. Suggested by
> Jason Gunthorpe, Jiaqi Yan, Vikram Sethi (Thanks!)
> - Updated the check to prevent multiple registration to the same PFN
> range using interval_tree_iter_first. Thanks Shameer Kolothum for the
> suggestion.
> - Removed the callback function in the nvgrace-gpu requiring tracking of
> poisoned PFN as it isn't required anymore.

Hi Ankit,

I get that for nvgrace-gpu driver, you removed pfn_address_space_ops
because there is no need to unmap poisoned HBM page.

What about the nvgrace-egm driver? Now that you removed the
pfn_address_space_ops callback from pfn_address_space in [1], how can
nvgrace-egm driver know the poisoned EGM pages at runtime?

I expect the functionality to return retired pages should also include
runtime poisoned pages, which are not in the list queried from
egm-retired-pages-data-base during initialization. Or maybe my
expectation is wrong/obsolete?

[1] https://lore.kernel.org/linux-mm/20230920140210.12663-2-ankita@nvidia.com
[2] https://lore.kernel.org/kvm/20250904040828.319452-12-ankita@nvidia.com

> - Introduced seperate collect_procs_pfn function to collect the list of
> processes mapping to the poisoned PFN.
>
> v1 -> v2
> - Change poisoned page tracking from bitmap to hashtable.
> - Addressed miscellaneous comments in v1.
>
> Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]
> Link: https://lore.kernel.org/all/20240220115055.23546-1-ankita@nvidia.com/ [2]
>
> Ankit Agrawal (3):
>   mm: Change ghes code to allow poison of non-struct pfn
>   mm: handle poisoning of pfn without struct pages
>   vfio/nvgrace-gpu: register device memory for poison handling
>
>  MAINTAINERS                         |   1 +
>  drivers/acpi/apei/ghes.c            |   6 --
>  drivers/vfio/pci/nvgrace-gpu/main.c |  45 ++++++++-
>  include/linux/memory-failure.h      |  17 ++++
>  include/linux/mm.h                  |   1 +
>  include/ras/ras_event.h             |   1 +
>  mm/Kconfig                          |   1 +
>  mm/memory-failure.c                 | 145 +++++++++++++++++++++++++++-
>  8 files changed, 209 insertions(+), 8 deletions(-)
>  create mode 100644 include/linux/memory-failure.h
>
> --
> 2.34.1
>
>
Re: [PATCH v5 0/3] mm: Implement ECC handling for pfn with no struct page
Posted by Andrew Morton 3 months ago
On Sun, 2 Nov 2025 18:44:31 +0000 <ankita@nvidia.com> wrote:

> Poison (or ECC) errors can be very common on a large size cluster.
> The kernel MM currently handles ECC errors / poison only on memory page
> backed by struct page. The handling is currently missing for the PFNMAP
> memory that does not have struct pages. The series adds such support.
> 
> Implement a new ECC handling for memory without struct pages. Kernel MM
> expose registration APIs to allow modules that are managing the device
> to register its device memory region. MM then tracks such regions using
> interval tree.

Thanks.  My knowledge of this material is weaker than usual :( But the
series looks good to my eye so I'll toss it into mm.git's mm-new branch
for some testing exposure.  If that goes OK then I'll later move it
into the mm-unstable branch where it will get linux-next exposure.  At
that point I'll monitor reviewer and tester feedback (please).