[PATCH] RAS/AMD/FMPM: Add option to ignore CEs

Yazen Ghannam posted 1 patch 1 month, 1 week ago
drivers/ras/amd/fmpm.c | 11 +++++++++++
1 file changed, 11 insertions(+)
[PATCH] RAS/AMD/FMPM: Add option to ignore CEs
Posted by Yazen Ghannam 1 month, 1 week ago
Generally, FMPM will handle all memory errors as it is expected that
"upstream" entities, like hardware thresholding or other Linux notifier
blocks, will filter out errors.

However, some users prefer that correctable errors are not filtered out
but only that FMPM does not take action on them.

Add a module parameter to ignore correctable errors.

When set, FMPM will not retire memory nor will it save FRU records for
correctable errors.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
 drivers/ras/amd/fmpm.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index 8877c6ff64c4..08b16a133f20 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -129,6 +129,14 @@ static struct dentry *fmpm_dfs_entries;
 	GUID_INIT(0x5e4706c1, 0x5356, 0x48c6, 0x93, 0x0b, 0x52, 0xf2,	\
 		  0x12, 0x0a, 0x44, 0x58)
 
+/**
+ * DOC: ignore_ce (bool)
+ * Switch to handle or ignore correctable errors.
+ */
+static bool ignore_ce;
+module_param(ignore_ce, bool, 0644);
+MODULE_PARM_DESC(ignore_ce, "Ignore correctable errors");
+
 /**
  * DOC: max_nr_entries (byte)
  * Maximum number of descriptor entries possible for each FRU.
@@ -413,6 +421,9 @@ static int fru_handle_mem_poison(struct notifier_block *nb, unsigned long val, v
 	if (!mce_is_memory_error(m))
 		return NOTIFY_DONE;
 
+	if (ignore_ce && mce_is_correctable(m))
+		return NOTIFY_DONE;
+
 	retire_dram_row(m->addr, m->ipid, m->extcpu);
 
 	/*

base-commit: fd94619c43360eb44d28bd3ef326a4f85c600a07
-- 
2.51.0
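
For readers following the control flow, here is a minimal userspace sketch of the filtering order the patch introduces. It is not kernel code: `struct err_model` and `fmpm_would_act()` are hypothetical names invented for illustration, and the two boolean fields stand in for the results of `mce_is_memory_error()` and `mce_is_correctable()`.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of an error record; this is NOT struct mce. */
struct err_model {
	bool is_memory_error;	/* stands in for mce_is_memory_error(m) */
	bool is_correctable;	/* stands in for mce_is_correctable(m) */
};

/*
 * Returns true if, under this model, the handler would go on to retire
 * memory and save a record, and false if it would return early.
 */
bool fmpm_would_act(const struct err_model *e, bool ignore_ce)
{
	/* Existing check: non-memory errors are never handled. */
	if (!e->is_memory_error)
		return false;

	/* New check from the patch: optionally skip correctable errors. */
	if (ignore_ce && e->is_correctable)
		return false;

	return true;
}
```

With `ignore_ce` left at its default of false, the model behaves as before the patch. Since the parameter is declared with mode 0644, it should also be changeable at runtime through the module's `parameters` directory in sysfs; the exact path depends on the module name, which is not shown in this patch.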
Re: [PATCH] RAS/AMD/FMPM: Add option to ignore CEs
Posted by Borislav Petkov 1 month, 1 week ago
On Mon, Oct 06, 2025 at 03:17:31PM +0000, Yazen Ghannam wrote:
> Generally, FMPM will handle all memory errors as it is expected that
> "upstream" entities, like hardware thresholding or other Linux notifier
> blocks, will filter out errors.
> 
> However, some users prefer that correctable errors are not filtered out
> but only that FMPM does not take action on them.

That's a pretty shallow use case if you ask me...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH] RAS/AMD/FMPM: Add option to ignore CEs
Posted by Yazen Ghannam 1 month, 1 week ago
On Mon, Oct 06, 2025 at 11:34:06PM +0200, Borislav Petkov wrote:
> On Mon, Oct 06, 2025 at 03:17:31PM +0000, Yazen Ghannam wrote:
> > Generally, FMPM will handle all memory errors as it is expected that
> > "upstream" entities, like hardware thresholding or other Linux notifier
> > blocks, will filter out errors.
> > 
> > However, some users prefer that correctable errors are not filtered out
> > but only that FMPM does not take action on them.
> 
> That's a pretty shallow use case if you ask me...
> 
> -- 

I think it's a common use case without FMPM.

IOW, log correctable errors but don't offline memory because of them.

Does that sound better or about the same?

Thanks,
Yazen
RE: [PATCH] RAS/AMD/FMPM: Add option to ignore CEs
Posted by Luck, Tony 1 month, 1 week ago
> I think it's a common use case without FMPM.
>
> IOW, log correctable errors but don't offline memory because of them.
>
> Does that sound better or about the same?

Linux has a /proc/sys/vm/enable_soft_offline toggle for that case.

-Tony
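
As a rough illustration of checking that toggle from userspace, the sketch below reads a 0/1 value from a file. The helper name `read_bool_toggle()` is hypothetical, and the path is passed in by the caller because the file may be absent on kernels that predate the sysctl.

```c
#include <stdio.h>

/*
 * Read a 0/1 toggle from the given file.
 * Returns 0 or 1 on success, or -1 if the file cannot be opened
 * or does not contain 0 or 1.
 */
int read_bool_toggle(const char *path)
{
	FILE *f = fopen(path, "r");
	int val;

	if (!f)
		return -1;

	/* Treat anything that does not parse as an integer as an error. */
	if (fscanf(f, "%d", &val) != 1)
		val = -1;

	fclose(f);

	return (val == 0 || val == 1) ? val : -1;
}
```

A caller would pass "/proc/sys/vm/enable_soft_offline" and treat a return of -1 as "toggle not available" rather than as disabled.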
Re: [PATCH] RAS/AMD/FMPM: Add option to ignore CEs
Posted by Yazen Ghannam 1 month, 1 week ago
On Tue, Oct 07, 2025 at 04:52:55PM +0000, Luck, Tony wrote:
> > I think it's a common use case without FMPM.
> >
> > IOW, log correctable errors but don't offline memory because of them.
> >
> > Does that sound better or about the same?
> 
> Linux has a /proc/sys/vm/enable_soft_offline toggle for that case.
> 

Thanks, that's a good suggestion.

We would still need a check in fru_handle_mem_poison() to skip saving
records to persistent storage.

And we would need a code update in _retire_row_mi300() to use the
soft_offline path.

Thanks,
Yazen
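
One possible reading of the two follow-up changes described above, sketched as a userspace model. Everything here is an assumption made for illustration: `struct fmpm_actions` and `plan_actions()` are hypothetical names, and the choice to tie record saving to the same toggle for correctable errors is a guess at the intent, not something stated in the thread.

```c
#include <stdbool.h>

/* Hypothetical summary of what the handler would do for one error. */
struct fmpm_actions {
	bool retire_row;	/* would memory be retired? */
	bool save_record;	/* would a record go to persistent storage? */
};

/*
 * Model: uncorrectable errors are always fully handled. For correctable
 * errors, both steps follow the soft-offline toggle (assumed behavior).
 */
struct fmpm_actions plan_actions(bool correctable, bool soft_offline_enabled)
{
	struct fmpm_actions a = { .retire_row = true, .save_record = true };

	if (correctable) {
		a.retire_row = soft_offline_enabled;
		a.save_record = soft_offline_enabled;
	}

	return a;
}
```

Under this model, the soft-offline toggle would take the place of a separate FMPM module parameter for correctable errors.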
Re: [PATCH] RAS/AMD/FMPM: Add option to ignore CEs
Posted by Naik, Avadhut 1 month, 1 week ago

On 10/6/2025 10:17, Yazen Ghannam wrote:
> Generally, FMPM will handle all memory errors as it is expected that
> "upstream" entities, like hardware thresholding or other Linux notifier
> blocks, will filter out errors.
> 
> However, some users prefer that correctable errors are not filtered out
> but only that FMPM does not take action on them.
> 
> Add a module parameter to ignore correctable errors.
> 
> When set, FMPM will not retire memory nor will it save FRU records for
> correctable errors.
> 
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> ---
>  drivers/ras/amd/fmpm.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
> index 8877c6ff64c4..08b16a133f20 100644
> --- a/drivers/ras/amd/fmpm.c
> +++ b/drivers/ras/amd/fmpm.c
> @@ -129,6 +129,14 @@ static struct dentry *fmpm_dfs_entries;
>  	GUID_INIT(0x5e4706c1, 0x5356, 0x48c6, 0x93, 0x0b, 0x52, 0xf2,	\
>  		  0x12, 0x0a, 0x44, 0x58)
>  
> +/**
> + * DOC: ignore_ce (bool)
> + * Switch to handle or ignore correctable errors.
> + */
> +static bool ignore_ce;
> +module_param(ignore_ce, bool, 0644);
> +MODULE_PARM_DESC(ignore_ce, "Ignore correctable errors");
> +
>  /**
>   * DOC: max_nr_entries (byte)
>   * Maximum number of descriptor entries possible for each FRU.
> @@ -413,6 +421,9 @@ static int fru_handle_mem_poison(struct notifier_block *nb, unsigned long val, v
>  	if (!mce_is_memory_error(m))
>  		return NOTIFY_DONE;
>  
> +	if (ignore_ce && mce_is_correctable(m))
> +		return NOTIFY_DONE;
> +
>  	retire_dram_row(m->addr, m->ipid, m->extcpu);
>  
>  	/*
> 
> base-commit: fd94619c43360eb44d28bd3ef326a4f85c600a07

LGTM!

Reviewed-by: Avadhut Naik <avadhut.naik@amd.com>

-- 
Thanks,
Avadhut Naik