[PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn

ankita@nvidia.com posted 3 patches 3 months, 2 weeks ago
There is a newer version of this series
[PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn
Posted by ankita@nvidia.com 3 months, 2 weeks ago
From: Ankit Agrawal <ankita@nvidia.com>

The GHES code allows calling of memory_failure() on the PFNs that pass the
pfn_valid() check. This contract is broken for the remapped PFNs which
fails the check and ghes_do_memory_failure() returns without triggering
memory_failure().

Update code to allow memory_failure() call on PFNs failing pfn_valid().

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/acpi/apei/ghes.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a0d54993edb3..bc4d0f2b3e9d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -505,12 +505,6 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
 		return false;
 
 	pfn = PHYS_PFN(physical_addr);
-	if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
-		pr_warn_ratelimited(FW_WARN GHES_PFX
-		"Invalid address in generic error data: %#llx\n",
-		physical_addr);
-		return false;
-	}
 
 	if (flags == MF_ACTION_REQUIRED && current->mm) {
 		twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
-- 
2.34.1
Re: [PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn
Posted by Ira Weiny 3 months, 2 weeks ago
ankita@ wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> The GHES code allows calling of memory_failure() on the PFNs that pass the
> pfn_valid() check. This contract is broken for the remapped PFNs which
> fails the check and ghes_do_memory_failure() returns without triggering
> memory_failure().
> 
> Update code to allow memory_failure() call on PFNs failing pfn_valid().
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  drivers/acpi/apei/ghes.c | 6 ------
>  1 file changed, 6 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index a0d54993edb3..bc4d0f2b3e9d 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -505,12 +505,6 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>  		return false;
>  
>  	pfn = PHYS_PFN(physical_addr);
> -	if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {

Tony,

I'm not an SGX expert but does this break SGX by removing
arch_is_platform_page()?

See:

40e0e7843e23 ("x86/sgx: Add infrastructure to identify SGX EPC pages")
Cc: Tony Luck <tony.luck@intel.com>

Ira
RE: [PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn
Posted by Luck, Tony 3 months, 2 weeks ago
> >     pfn = PHYS_PFN(physical_addr);
> > -   if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
>
> Tony,
>
> I'm not an SGX expert but does this break SGX by removing
> arch_is_platform_page()?
>
> See:
>
> 40e0e7843e23 ("x86/sgx: Add infrastructure to identify SGX EPC pages")
> Cc: Tony Luck <tony.luck@intel.com>
>
Ira,

I think this deletion makes the GHES code always call memory_failure()
instead of bailing out here on "bad" page frame numbers.

That centralizes the checks for different types of memory into
memory_failure().

-Tony
Re: [PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn
Posted by Shuai Xue 3 months, 2 weeks ago

在 2025/10/22 01:19, Luck, Tony 写道:
>>>      pfn = PHYS_PFN(physical_addr);
>>> -   if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
>>
>> Tony,
>>
>> I'm not an SGX expert but does this break SGX by removing
>> arch_is_platform_page()?
>>
>> See:
>>
>> 40e0e7843e23 ("x86/sgx: Add infrastructure to identify SGX EPC pages")
>> Cc: Tony Luck <tony.luck@intel.com>
>>
> Ira,
> 
> I think this deletion makes the GHES code always call memory_failure()
> instead of bailing out here on "bad" page frame numbers.
> 
> That centralizes the checks for different types of memory into
> memory_failure().
> 
> -Tony

Hi, Tony, Ankit and Ira,

Finally, we're seeing other use cases that need to handle errors for
non-struct page PFNs :)

IMHO, non-struct page PFNs are common in production environments.
Besides NVIDIA Grace GPU device memory, we also use reserved DRAM memory
managed by a separate VMEM allocator. This VMEM allocator is designed
for virtual machine memory allocation, significantly reducing kernel
memory management overhead by minimizing page table maintenance.

To enable hardware error isolation for these memory pages, we've already
removed this sanity check internally. This change makes memory_failure()
the central point for handling all memory types, which is a much cleaner
architecture.

Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>

Thanks.
Shuai
Re: [PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn
Posted by Ira Weiny 3 months, 2 weeks ago
Shuai Xue wrote:
> 
> 
> 在 2025/10/22 01:19, Luck, Tony 写道:
> >>>      pfn = PHYS_PFN(physical_addr);
> >>> -   if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
> >>
> >> Tony,
> >>
> >> I'm not an SGX expert but does this break SGX by removing
> >> arch_is_platform_page()?
> >>
> >> See:
> >>
> >> 40e0e7843e23 ("x86/sgx: Add infrastructure to identify SGX EPC pages")
> >> Cc: Tony Luck <tony.luck@intel.com>
> >>
> > Ira,
> > 
> > I think this deletion makes the GHES code always call memory_failure()
> > instead of bailing out here on "bad" page frame numbers.
> > 
> > That centralizes the checks for different types of memory into
> > memory_failure().
> > 
> > -Tony
> 
> Hi, Tony, Ankit and Ira,
> 
> Finally, we're seeing other use cases that need to handle errors for
> non-struct page PFNs :)
> 
> IMHO, non-struct page PFNs are common in production environments.
> Besides NVIDIA Grace GPU device memory, we also use reserved DRAM memory
> managed by a separate VMEM allocator.

Can you elaborate on this more?

Ira

>
> This VMEM allocator is designed
> for virtual machine memory allocation, significantly reducing kernel
> memory management overhead by minimizing page table maintenance.
> 
> To enable hardware error isolation for these memory pages, we've already
> removed this sanity check internally. This change makes memory_failure()
> the central point for handling all memory types, which is a much cleaner
> architecture.
> 
> Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
> 
> Thanks.
> Shuai


Re: [PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn
Posted by Shuai Xue 3 months, 2 weeks ago

在 2025/10/22 23:03, Ira Weiny 写道:
> Shuai Xue wrote:
>>
>>
>> 在 2025/10/22 01:19, Luck, Tony 写道:
>>>>>       pfn = PHYS_PFN(physical_addr);
>>>>> -   if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
>>>>
>>>> Tony,
>>>>
>>>> I'm not an SGX expert but does this break SGX by removing
>>>> arch_is_platform_page()?
>>>>
>>>> See:
>>>>
>>>> 40e0e7843e23 ("x86/sgx: Add infrastructure to identify SGX EPC pages")
>>>> Cc: Tony Luck <tony.luck@intel.com>
>>>>
>>> Ira,
>>>
>>> I think this deletion makes the GHES code always call memory_failure()
>>> instead of bailing out here on "bad" page frame numbers.
>>>
>>> That centralizes the checks for different types of memory into
>>> memory_failure().
>>>
>>> -Tony
>>
>> Hi, Tony, Ankit and Ira,
>>
>> Finally, we're seeing other use cases that need to handle errors for
>> non-struct page PFNs :)
>>
>> IMHO, non-struct page PFNs are common in production environments.
>> Besides NVIDIA Grace GPU device memory, we also use reserved DRAM memory
>> managed by a separate VMEM allocator.
> 
> Can you elaborate on this more?

We reserve a significant portion of DRAM memory at boot time using
kernel command line parameters. This reserved memory is then managed by
our internal VMEM allocator, which handles memory allocation and
deallocation for virtual machines.

To minimize memory overhead, we intentionally avoid creating struct
pages for this reserved memory region. Instead, we've implemented the
following approach:

- Our VMEM allocator directly manages the physical memory without the
   overhead of struct page metadata.
- Error Handling: We register custom RAS operations (ras_ops) with the
   memory failure infrastructure. When poisoned memory is accessed within
   this region, our registered handler: Tags the affected memory area as
   poisoned Isolates the memory to prevent further access Terminates any
   tasks that were using the poisoned memory

This approach allows us to handle memory errors effectively while
maintaining minimal memory overhead for large reserved regions. It's
similar in concept to how device memory (like NVIDIA Grace GPU memory
mentioned earlier) needs error handling without struct page backing.

Thanks.
Shuai