Generally, FMPM will handle all memory errors as it is expected that
"upstream" entities, like hardware thresholding or other Linux notifier
blocks, will filter out errors.

However, some users prefer that correctable errors are not filtered out
but only that FMPM does not take action on them.

Add a module parameter to ignore correctable errors.

When set, FMPM will not retire memory nor will it save FRU records for
correctable errors.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
drivers/ras/amd/fmpm.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index 8877c6ff64c4..08b16a133f20 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -129,6 +129,14 @@ static struct dentry *fmpm_dfs_entries;
 	GUID_INIT(0x5e4706c1, 0x5356, 0x48c6, 0x93, 0x0b, 0x52, 0xf2, \
 		  0x12, 0x0a, 0x44, 0x58)
 
+/**
+ * DOC: ignore_ce (bool)
+ * Switch to handle or ignore correctable errors.
+ */
+static bool ignore_ce;
+module_param(ignore_ce, bool, 0644);
+MODULE_PARM_DESC(ignore_ce, "Ignore correctable errors");
+
 /**
  * DOC: max_nr_entries (byte)
  * Maximum number of descriptor entries possible for each FRU.
@@ -413,6 +421,9 @@ static int fru_handle_mem_poison(struct notifier_block *nb, unsigned long val, v
 	if (!mce_is_memory_error(m))
 		return NOTIFY_DONE;
 
+	if (ignore_ce && mce_is_correctable(m))
+		return NOTIFY_DONE;
+
 	retire_dram_row(m->addr, m->ipid, m->extcpu);
 
 	/*
base-commit: fd94619c43360eb44d28bd3ef326a4f85c600a07
--
2.51.0
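
For readers following the thread, the early-return logic added by the patch can be modelled in userspace roughly as below. This is only a sketch under stated assumptions: struct mem_err, handle_mem_poison(), the rows_retired counter and the local NOTIFY_* values are hypothetical stand-ins, not the kernel's definitions. In the real handler the checks are mce_is_memory_error() and mce_is_correctable() on a struct mce, and the side effects are row retirement plus saving FRU records.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's notifier return codes. */
enum { NOTIFY_DONE = 0, NOTIFY_OK = 1 };

/* Hypothetical, simplified error record; the real handler gets a struct mce. */
struct mem_err {
	bool is_memory_error;	/* models mce_is_memory_error(m) */
	bool is_correctable;	/* models mce_is_correctable(m) */
};

/* Models the ignore_ce module parameter (default: off). */
static bool ignore_ce;

/* Counter standing in for the retire-row and save-record side effects. */
static unsigned int rows_retired;

/* Models the decision order in fru_handle_mem_poison() after the patch. */
static int handle_mem_poison(const struct mem_err *m)
{
	assert(m != NULL);

	/* Non-memory errors are never handled here. */
	if (!m->is_memory_error)
		return NOTIFY_DONE;

	/* New check: skip correctable errors when the parameter is set. */
	if (ignore_ce && m->is_correctable)
		return NOTIFY_DONE;

	/* Otherwise take action (modelled here as a counter bump). */
	rows_retired++;
	return NOTIFY_OK;
}
```

With ignore_ce left at its default, a correctable memory error still reaches the action path; once it is set, only uncorrectable memory errors do. Because the parameter is registered with mode 0644, it should also be writable at runtime through the module's parameters directory in sysfs.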
On Mon, Oct 06, 2025 at 03:17:31PM +0000, Yazen Ghannam wrote:
> Generally, FMPM will handle all memory errors as it is expected that
> "upstream" entities, like hardware thresholding or other Linux notifier
> blocks, will filter out errors.
>
> However, some users prefer that correctable errors are not filtered out
> but only that FMPM does not take action on them.
That's a pretty shallow use case if you ask me...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Oct 06, 2025 at 11:34:06PM +0200, Borislav Petkov wrote:
> On Mon, Oct 06, 2025 at 03:17:31PM +0000, Yazen Ghannam wrote:
> > Generally, FMPM will handle all memory errors as it is expected that
> > "upstream" entities, like hardware thresholding or other Linux notifier
> > blocks, will filter out errors.
> >
> > However, some users prefer that correctable errors are not filtered out
> > but only that FMPM does not take action on them.
>
> That's a pretty shallow use case if you ask me...
>
> --

I think it's a common use case without FMPM.

IOW, log correctable errors but don't offline memory because of them.

Does that sound better or about the same?

Thanks,
Yazen
> I think it's a common use case without FMPM.
>
> IOW, log correctable errors but don't offline memory because of them.
>
> Does that sound better or about the same?

Linux has /proc/sys/vm/enable_soft_offline toggle for that case.

-Tony
On Tue, Oct 07, 2025 at 04:52:55PM +0000, Luck, Tony wrote:
> > I think it's a common use case without FMPM.
> >
> > IOW, log correctable errors but don't offline memory because of them.
> >
> > Does that sound better or about the same?
>
> Linux has /proc/sys/vm/enable_soft_offline toggle for that case.
>

Thanks, that's a good suggestion.

We would still need a check in fru_handle_mem_poison() to skip saving
records to persistent storage.

And we would need a code update in _retire_row_mi300() to use the
soft_offline path.

Thanks,
Yazen
On 10/6/2025 10:17, Yazen Ghannam wrote:
> Generally, FMPM will handle all memory errors as it is expected that
> "upstream" entities, like hardware thresholding or other Linux notifier
> blocks, will filter out errors.
>
> However, some users prefer that correctable errors are not filtered out
> but only that FMPM does not take action on them.
>
> Add a module parameter to ignore correctable errors.
>
> When set, FMPM will not retire memory nor will it save FRU records for
> correctable errors.
>
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> ---
>  drivers/ras/amd/fmpm.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
> index 8877c6ff64c4..08b16a133f20 100644
> --- a/drivers/ras/amd/fmpm.c
> +++ b/drivers/ras/amd/fmpm.c
> @@ -129,6 +129,14 @@ static struct dentry *fmpm_dfs_entries;
>  	GUID_INIT(0x5e4706c1, 0x5356, 0x48c6, 0x93, 0x0b, 0x52, 0xf2, \
>  		  0x12, 0x0a, 0x44, 0x58)
>
> +/**
> + * DOC: ignore_ce (bool)
> + * Switch to handle or ignore correctable errors.
> + */
> +static bool ignore_ce;
> +module_param(ignore_ce, bool, 0644);
> +MODULE_PARM_DESC(ignore_ce, "Ignore correctable errors");
> +
>  /**
>   * DOC: max_nr_entries (byte)
>   * Maximum number of descriptor entries possible for each FRU.
> @@ -413,6 +421,9 @@ static int fru_handle_mem_poison(struct notifier_block *nb, unsigned long val, v
>  	if (!mce_is_memory_error(m))
>  		return NOTIFY_DONE;
>
> +	if (ignore_ce && mce_is_correctable(m))
> +		return NOTIFY_DONE;
> +
>  	retire_dram_row(m->addr, m->ipid, m->extcpu);
>
>  	/*
>
> base-commit: fd94619c43360eb44d28bd3ef326a4f85c600a07

LGTM!

Reviewed-by: Avadhut Naik <avadhut.naik@amd.com>

--
Thanks,
Avadhut Naik