[v6] AMD MCA interrupts rework

[PATCH v6 15/15] x86/mce: Save and use APEI corrected threshold limit

Posted by Yazen Ghannam 5 months ago

The MCA threshold limit generally is not something that needs to change
during runtime. It is common for a system administrator to decide on a
policy for their managed systems.

If MCA thresholding is OS-managed, then the threshold limit must be set
at every boot. However, many systems allow the user to set a value in
their BIOS. And this is reported through an APEI HEST entry even if
thresholding is not in FW-First mode.

Use this value, if available, to set the OS-managed threshold limit.
Users can still override it through sysfs if desired for testing or
debug.

APEI is parsed after MCE is initialized. So reset the thresholding
blocks later to pick up the threshold limit.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---

Notes:
    Link:
    https://lore.kernel.org/r/20250825-wip-mca-updates-v5-20-865768a2eef8@amd.com
    
    v5->v6:
    * No change.
    
    v4->v5:
    * No change.
    
    v3->v4:
    * New in v4.

 arch/x86/include/asm/mce.h          |  6 ++++++
 arch/x86/kernel/acpi/apei.c         |  2 ++
 arch/x86/kernel/cpu/mce/amd.c       | 18 ++++++++++++++++--
 arch/x86/kernel/cpu/mce/internal.h  |  2 ++
 arch/x86/kernel/cpu/mce/threshold.c | 13 +++++++++++++
 5 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 7d6588195d56..1cfbfff0be3f 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -308,6 +308,12 @@ DECLARE_PER_CPU(struct mce, injectm);
 /* Disable CMCI/polling for MCA bank claimed by firmware */
 extern void mce_disable_bank(int bank);
 
+#ifdef CONFIG_X86_MCE_THRESHOLD
+void mce_save_apei_thr_limit(u32 thr_limit);
+#else
+static inline void mce_save_apei_thr_limit(u32 thr_limit) { }
+#endif /* CONFIG_X86_MCE_THRESHOLD */
+
 /*
  * Exception handler
  */
diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
index 0916f00a992e..e21419e686eb 100644
--- a/arch/x86/kernel/acpi/apei.c
+++ b/arch/x86/kernel/acpi/apei.c
@@ -19,6 +19,8 @@ int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data)
 	if (!cmc->enabled)
 		return 0;
 
+	mce_save_apei_thr_limit(cmc->notify.error_threshold_value);
+
 	/*
 	 * We expect HEST to provide a list of MC banks that report errors
 	 * in firmware first mode. Otherwise, return non-zero value to
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index b895559e80ad..9b746080351f 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -489,6 +489,18 @@ static void threshold_restart_bank(unsigned int bank, bool intr_en)
 	}
 }
 
+/* Try to use the threshold limit reported through APEI. */
+static u16 get_thr_limit(void)
+{
+	u32 thr_limit = mce_get_apei_thr_limit();
+
+	/* Fallback to old default if APEI limit is not available. */
+	if (!thr_limit)
+		return THRESHOLD_MAX;
+
+	return min(thr_limit, THRESHOLD_MAX);
+}
+
 static void mce_threshold_block_init(struct threshold_block *b, int offset)
 {
 	struct thresh_restart tr = {
@@ -497,7 +509,7 @@ static void mce_threshold_block_init(struct threshold_block *b, int offset)
 		.lvt_off		= offset,
 	};
 
-	b->threshold_limit		= THRESHOLD_MAX;
+	b->threshold_limit		= get_thr_limit();
 	threshold_restart_block(&tr);
 };
 
@@ -1071,7 +1083,7 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
 	b->address		= address;
 	b->interrupt_enable	= 0;
 	b->interrupt_capable	= lvt_interrupt_supported(bank, high);
-	b->threshold_limit	= THRESHOLD_MAX;
+	b->threshold_limit	= get_thr_limit();
 
 	if (b->interrupt_capable) {
 		default_attrs[2] = &interrupt_enable.attr;
@@ -1082,6 +1094,8 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
 
 	list_add(&b->miscj, &tb->miscj);
 
+	mce_threshold_block_init(b, (high & MASK_LVTOFF_HI) >> 20);
+
 	err = kobject_init_and_add(&b->kobj, &threshold_ktype, tb->kobj, get_name(cpu, bank, b));
 	if (err)
 		goto out_free;
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 9920ee5fb34c..a31cf984619c 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -67,6 +67,7 @@ void mce_track_storm(struct mce *mce);
 void mce_inherit_storm(unsigned int bank);
 bool mce_get_storm_mode(void);
 void mce_set_storm_mode(bool storm);
+u32  mce_get_apei_thr_limit(void);
 #else
 static inline void cmci_storm_begin(unsigned int bank) {}
 static inline void cmci_storm_end(unsigned int bank) {}
@@ -74,6 +75,7 @@ static inline void mce_track_storm(struct mce *mce) {}
 static inline void mce_inherit_storm(unsigned int bank) {}
 static inline bool mce_get_storm_mode(void) { return false; }
 static inline void mce_set_storm_mode(bool storm) {}
+static inline u32  mce_get_apei_thr_limit(void) { return 0; }
 #endif
 
 /*
diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index 45144598ec74..d00d5bf9959d 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -13,6 +13,19 @@
 
 #include "internal.h"
 
+static u32 mce_apei_thr_limit;
+
+void mce_save_apei_thr_limit(u32 thr_limit)
+{
+	mce_apei_thr_limit = thr_limit;
+	pr_info("HEST: Corrected error threshold limit = %u\n", thr_limit);
+}
+
+u32 mce_get_apei_thr_limit(void)
+{
+	return mce_apei_thr_limit;
+}
+
 static void default_threshold_interrupt(void)
 {
 	pr_err("Unexpected threshold interrupt at vector %x\n",

-- 
2.51.0

Re: [PATCH v6 15/15] x86/mce: Save and use APEI corrected threshold limit

Posted by Nikolay Borisov 5 months ago


On 9/8/25 18:40, Yazen Ghannam wrote:
> The MCA threshold limit generally is not something that needs to change
> during runtime. It is common for a system administrator to decide on a
> policy for their managed systems.
> 
> If MCA thresholding is OS-managed, then the threshold limit must be set
> at every boot. However, many systems allow the user to set a value in
> their BIOS. And this is reported through an APEI HEST entry even if
> thresholding is not in FW-First mode.
> 
> Use this value, if available, to set the OS-managed threshold limit.
> Users can still override it through sysfs if desired for testing or
> debug.
> 
> APEI is parsed after MCE is initialized. So reset the thresholding
> blocks later to pick up the threshold limit.
> 
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> ---
> 
> Notes:
>      Link:
>      https://lore.kernel.org/r/20250825-wip-mca-updates-v5-20-865768a2eef8@amd.com
>      
>      v5->v6:
>      * No change.
>      
>      v4->v5:
>      * No change.
>      
>      v3->v4:
>      * New in v4.
> 
>   arch/x86/include/asm/mce.h          |  6 ++++++
>   arch/x86/kernel/acpi/apei.c         |  2 ++
>   arch/x86/kernel/cpu/mce/amd.c       | 18 ++++++++++++++++--
>   arch/x86/kernel/cpu/mce/internal.h  |  2 ++
>   arch/x86/kernel/cpu/mce/threshold.c | 13 +++++++++++++
>   5 files changed, 39 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index 7d6588195d56..1cfbfff0be3f 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -308,6 +308,12 @@ DECLARE_PER_CPU(struct mce, injectm);
>   /* Disable CMCI/polling for MCA bank claimed by firmware */
>   extern void mce_disable_bank(int bank);
>   
> +#ifdef CONFIG_X86_MCE_THRESHOLD
> +void mce_save_apei_thr_limit(u32 thr_limit);
> +#else
> +static inline void mce_save_apei_thr_limit(u32 thr_limit) { }
> +#endif /* CONFIG_X86_MCE_THRESHOLD */
> +
>   /*
>    * Exception handler
>    */
> diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
> index 0916f00a992e..e21419e686eb 100644
> --- a/arch/x86/kernel/acpi/apei.c
> +++ b/arch/x86/kernel/acpi/apei.c
> @@ -19,6 +19,8 @@ int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data)
>   	if (!cmc->enabled)
>   		return 0;
>   
> +	mce_save_apei_thr_limit(cmc->notify.error_threshold_value);
> +
>   	/*
>   	 * We expect HEST to provide a list of MC banks that report errors
>   	 * in firmware first mode. Otherwise, return non-zero value to
> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> index b895559e80ad..9b746080351f 100644
> --- a/arch/x86/kernel/cpu/mce/amd.c
> +++ b/arch/x86/kernel/cpu/mce/amd.c
> @@ -489,6 +489,18 @@ static void threshold_restart_bank(unsigned int bank, bool intr_en)
>   	}
>   }
>   
> +/* Try to use the threshold limit reported through APEI. */
> +static u16 get_thr_limit(void)
> +{
> +	u32 thr_limit = mce_get_apei_thr_limit();
> +
> +	/* Fallback to old default if APEI limit is not available. */
> +	if (!thr_limit)
> +		return THRESHOLD_MAX;
> +
> +	return min(thr_limit, THRESHOLD_MAX);
> +}
> +
>   static void mce_threshold_block_init(struct threshold_block *b, int offset)
>   {
>   	struct thresh_restart tr = {
> @@ -497,7 +509,7 @@ static void mce_threshold_block_init(struct threshold_block *b, int offset)
>   		.lvt_off		= offset,
>   	};
>   
> -	b->threshold_limit		= THRESHOLD_MAX;
> +	b->threshold_limit		= get_thr_limit();
>   	threshold_restart_block(&tr);
>   };
>   
> @@ -1071,7 +1083,7 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
>   	b->address		= address;
>   	b->interrupt_enable	= 0;
>   	b->interrupt_capable	= lvt_interrupt_supported(bank, high);
> -	b->threshold_limit	= THRESHOLD_MAX;
> +	b->threshold_limit	= get_thr_limit();
>   
>   	if (b->interrupt_capable) {
>   		default_attrs[2] = &interrupt_enable.attr;
> @@ -1082,6 +1094,8 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
>   
>   	list_add(&b->miscj, &tb->miscj);
>   
> +	mce_threshold_block_init(b, (high & MASK_LVTOFF_HI) >> 20);

Why is this necessary? Shouldn't this patch consist of mainly 
s/THRESHOLD_MAX/get_thr_limit();


In allocate_threshold_block have already properly set threshold_limit. 
So this change really ensures threshold_restart_block is being called 
for the given block being initialized. Ignoring the changed threshold 
limit logic, why is this extra call necessary now and wasn't before?


> +
>   	err = kobject_init_and_add(&b->kobj, &threshold_ktype, tb->kobj, get_name(cpu, bank, b));
>   	if (err)
>   		goto out_free;
> diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
> index 9920ee5fb34c..a31cf984619c 100644
> --- a/arch/x86/kernel/cpu/mce/internal.h
> +++ b/arch/x86/kernel/cpu/mce/internal.h
> @@ -67,6 +67,7 @@ void mce_track_storm(struct mce *mce);
>   void mce_inherit_storm(unsigned int bank);
>   bool mce_get_storm_mode(void);
>   void mce_set_storm_mode(bool storm);
> +u32  mce_get_apei_thr_limit(void);
>   #else
>   static inline void cmci_storm_begin(unsigned int bank) {}
>   static inline void cmci_storm_end(unsigned int bank) {}
> @@ -74,6 +75,7 @@ static inline void mce_track_storm(struct mce *mce) {}
>   static inline void mce_inherit_storm(unsigned int bank) {}
>   static inline bool mce_get_storm_mode(void) { return false; }
>   static inline void mce_set_storm_mode(bool storm) {}
> +static inline u32  mce_get_apei_thr_limit(void) { return 0; }
>   #endif
>   
>   /*
> diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
> index 45144598ec74..d00d5bf9959d 100644
> --- a/arch/x86/kernel/cpu/mce/threshold.c
> +++ b/arch/x86/kernel/cpu/mce/threshold.c
> @@ -13,6 +13,19 @@
>   
>   #include "internal.h"
>   
> +static u32 mce_apei_thr_limit;
> +
> +void mce_save_apei_thr_limit(u32 thr_limit)
> +{
> +	mce_apei_thr_limit = thr_limit;
> +	pr_info("HEST: Corrected error threshold limit = %u\n", thr_limit);
> +}
> +
> +u32 mce_get_apei_thr_limit(void)
> +{
> +	return mce_apei_thr_limit;
> +}
> +
>   static void default_threshold_interrupt(void)
>   {
>   	pr_err("Unexpected threshold interrupt at vector %x\n",
>

Re: [PATCH v6 15/15] x86/mce: Save and use APEI corrected threshold limit

Posted by Yazen Ghannam 4 months, 3 weeks ago

On Thu, Sep 11, 2025 at 08:01:17PM +0300, Nikolay Borisov wrote:
> 
> 
> On 9/8/25 18:40, Yazen Ghannam wrote:
> > The MCA threshold limit generally is not something that needs to change
> > during runtime. It is common for a system administrator to decide on a
> > policy for their managed systems.
> > 
> > If MCA thresholding is OS-managed, then the threshold limit must be set
> > at every boot. However, many systems allow the user to set a value in
> > their BIOS. And this is reported through an APEI HEST entry even if
> > thresholding is not in FW-First mode.
> > 
> > Use this value, if available, to set the OS-managed threshold limit.
> > Users can still override it through sysfs if desired for testing or
> > debug.
> > 
> > APEI is parsed after MCE is initialized. So reset the thresholding
> > blocks later to pick up the threshold limit.
> > 
> > Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> > ---
> > 
> > Notes:
> >      Link:
> >      https://lore.kernel.org/r/20250825-wip-mca-updates-v5-20-865768a2eef8@amd.com
> >      v5->v6:
> >      * No change.
> >      v4->v5:
> >      * No change.
> >      v3->v4:
> >      * New in v4.
> > 
> >   arch/x86/include/asm/mce.h          |  6 ++++++
> >   arch/x86/kernel/acpi/apei.c         |  2 ++
> >   arch/x86/kernel/cpu/mce/amd.c       | 18 ++++++++++++++++--
> >   arch/x86/kernel/cpu/mce/internal.h  |  2 ++
> >   arch/x86/kernel/cpu/mce/threshold.c | 13 +++++++++++++
> >   5 files changed, 39 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> > index 7d6588195d56..1cfbfff0be3f 100644
> > --- a/arch/x86/include/asm/mce.h
> > +++ b/arch/x86/include/asm/mce.h
> > @@ -308,6 +308,12 @@ DECLARE_PER_CPU(struct mce, injectm);
> >   /* Disable CMCI/polling for MCA bank claimed by firmware */
> >   extern void mce_disable_bank(int bank);
> > +#ifdef CONFIG_X86_MCE_THRESHOLD
> > +void mce_save_apei_thr_limit(u32 thr_limit);
> > +#else
> > +static inline void mce_save_apei_thr_limit(u32 thr_limit) { }
> > +#endif /* CONFIG_X86_MCE_THRESHOLD */
> > +
> >   /*
> >    * Exception handler
> >    */
> > diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
> > index 0916f00a992e..e21419e686eb 100644
> > --- a/arch/x86/kernel/acpi/apei.c
> > +++ b/arch/x86/kernel/acpi/apei.c
> > @@ -19,6 +19,8 @@ int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data)
> >   	if (!cmc->enabled)
> >   		return 0;
> > +	mce_save_apei_thr_limit(cmc->notify.error_threshold_value);
> > +
> >   	/*
> >   	 * We expect HEST to provide a list of MC banks that report errors
> >   	 * in firmware first mode. Otherwise, return non-zero value to
> > diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> > index b895559e80ad..9b746080351f 100644
> > --- a/arch/x86/kernel/cpu/mce/amd.c
> > +++ b/arch/x86/kernel/cpu/mce/amd.c
> > @@ -489,6 +489,18 @@ static void threshold_restart_bank(unsigned int bank, bool intr_en)
> >   	}
> >   }
> > +/* Try to use the threshold limit reported through APEI. */
> > +static u16 get_thr_limit(void)
> > +{
> > +	u32 thr_limit = mce_get_apei_thr_limit();
> > +
> > +	/* Fallback to old default if APEI limit is not available. */
> > +	if (!thr_limit)
> > +		return THRESHOLD_MAX;
> > +
> > +	return min(thr_limit, THRESHOLD_MAX);
> > +}
> > +
> >   static void mce_threshold_block_init(struct threshold_block *b, int offset)
> >   {
> >   	struct thresh_restart tr = {
> > @@ -497,7 +509,7 @@ static void mce_threshold_block_init(struct threshold_block *b, int offset)
> >   		.lvt_off		= offset,
> >   	};
> > -	b->threshold_limit		= THRESHOLD_MAX;
> > +	b->threshold_limit		= get_thr_limit();
> >   	threshold_restart_block(&tr);
> >   };
> > @@ -1071,7 +1083,7 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
> >   	b->address		= address;
> >   	b->interrupt_enable	= 0;
> >   	b->interrupt_capable	= lvt_interrupt_supported(bank, high);
> > -	b->threshold_limit	= THRESHOLD_MAX;
> > +	b->threshold_limit	= get_thr_limit();
> >   	if (b->interrupt_capable) {
> >   		default_attrs[2] = &interrupt_enable.attr;
> > @@ -1082,6 +1094,8 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
> >   	list_add(&b->miscj, &tb->miscj);
> > +	mce_threshold_block_init(b, (high & MASK_LVTOFF_HI) >> 20);
> 
> Why is this necessary? Shouldn't this patch consist of mainly
> s/THRESHOLD_MAX/get_thr_limit();
> 
> 
> In allocate_threshold_block have already properly set threshold_limit. So
> this change really ensures threshold_restart_block is being called for the
> given block being initialized. Ignoring the changed threshold limit logic,
> why is this extra call necessary now and wasn't before?
> 

It is necessary to apply the threshold limit to the hardware register.
The MCA thresholding registers are accessed in two passes: first during
per-CPU init, and second when the MCE subsystem devices are created.

The hardware registers are updated in the first pass, and they are left
as-is in the second pass assuming no configuration has changed. That's
why there isn't a "reset" in the second pass.

The APEI tables are parsed between the first and second passes. So now
we need to update the registers during the second pass to apply the
value found from HEST.

Thanks,
Yazen

Re: [PATCH v6 15/15] x86/mce: Save and use APEI corrected threshold limit

Posted by Nikolay Borisov 4 months, 3 weeks ago


On 9/15/25 20:33, Yazen Ghannam wrote:
> On Thu, Sep 11, 2025 at 08:01:17PM +0300, Nikolay Borisov wrote:
>>
>>
>> On 9/8/25 18:40, Yazen Ghannam wrote:
>>> The MCA threshold limit generally is not something that needs to change
>>> during runtime. It is common for a system administrator to decide on a
>>> policy for their managed systems.
>>>
>>> If MCA thresholding is OS-managed, then the threshold limit must be set
>>> at every boot. However, many systems allow the user to set a value in
>>> their BIOS. And this is reported through an APEI HEST entry even if
>>> thresholding is not in FW-First mode.
>>>
>>> Use this value, if available, to set the OS-managed threshold limit.
>>> Users can still override it through sysfs if desired for testing or
>>> debug.
>>>
>>> APEI is parsed after MCE is initialized. So reset the thresholding
>>> blocks later to pick up the threshold limit.
>>>
>>> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
>>> ---
>>>
>>> Notes:
>>>       Link:
>>>       https://lore.kernel.org/r/20250825-wip-mca-updates-v5-20-865768a2eef8@amd.com
>>>       v5->v6:
>>>       * No change.
>>>       v4->v5:
>>>       * No change.
>>>       v3->v4:
>>>       * New in v4.
>>>
>>>    arch/x86/include/asm/mce.h          |  6 ++++++
>>>    arch/x86/kernel/acpi/apei.c         |  2 ++
>>>    arch/x86/kernel/cpu/mce/amd.c       | 18 ++++++++++++++++--
>>>    arch/x86/kernel/cpu/mce/internal.h  |  2 ++
>>>    arch/x86/kernel/cpu/mce/threshold.c | 13 +++++++++++++
>>>    5 files changed, 39 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
>>> index 7d6588195d56..1cfbfff0be3f 100644
>>> --- a/arch/x86/include/asm/mce.h
>>> +++ b/arch/x86/include/asm/mce.h
>>> @@ -308,6 +308,12 @@ DECLARE_PER_CPU(struct mce, injectm);
>>>    /* Disable CMCI/polling for MCA bank claimed by firmware */
>>>    extern void mce_disable_bank(int bank);
>>> +#ifdef CONFIG_X86_MCE_THRESHOLD
>>> +void mce_save_apei_thr_limit(u32 thr_limit);
>>> +#else
>>> +static inline void mce_save_apei_thr_limit(u32 thr_limit) { }
>>> +#endif /* CONFIG_X86_MCE_THRESHOLD */
>>> +
>>>    /*
>>>     * Exception handler
>>>     */
>>> diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
>>> index 0916f00a992e..e21419e686eb 100644
>>> --- a/arch/x86/kernel/acpi/apei.c
>>> +++ b/arch/x86/kernel/acpi/apei.c
>>> @@ -19,6 +19,8 @@ int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data)
>>>    	if (!cmc->enabled)
>>>    		return 0;
>>> +	mce_save_apei_thr_limit(cmc->notify.error_threshold_value);
>>> +
>>>    	/*
>>>    	 * We expect HEST to provide a list of MC banks that report errors
>>>    	 * in firmware first mode. Otherwise, return non-zero value to
>>> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
>>> index b895559e80ad..9b746080351f 100644
>>> --- a/arch/x86/kernel/cpu/mce/amd.c
>>> +++ b/arch/x86/kernel/cpu/mce/amd.c
>>> @@ -489,6 +489,18 @@ static void threshold_restart_bank(unsigned int bank, bool intr_en)
>>>    	}
>>>    }
>>> +/* Try to use the threshold limit reported through APEI. */
>>> +static u16 get_thr_limit(void)
>>> +{
>>> +	u32 thr_limit = mce_get_apei_thr_limit();
>>> +
>>> +	/* Fallback to old default if APEI limit is not available. */
>>> +	if (!thr_limit)
>>> +		return THRESHOLD_MAX;
>>> +
>>> +	return min(thr_limit, THRESHOLD_MAX);
>>> +}
>>> +
>>>    static void mce_threshold_block_init(struct threshold_block *b, int offset)
>>>    {
>>>    	struct thresh_restart tr = {
>>> @@ -497,7 +509,7 @@ static void mce_threshold_block_init(struct threshold_block *b, int offset)
>>>    		.lvt_off		= offset,
>>>    	};
>>> -	b->threshold_limit		= THRESHOLD_MAX;
>>> +	b->threshold_limit		= get_thr_limit();
>>>    	threshold_restart_block(&tr);
>>>    };
>>> @@ -1071,7 +1083,7 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
>>>    	b->address		= address;
>>>    	b->interrupt_enable	= 0;
>>>    	b->interrupt_capable	= lvt_interrupt_supported(bank, high);
>>> -	b->threshold_limit	= THRESHOLD_MAX;
>>> +	b->threshold_limit	= get_thr_limit();
>>>    	if (b->interrupt_capable) {
>>>    		default_attrs[2] = &interrupt_enable.attr;
>>> @@ -1082,6 +1094,8 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
>>>    	list_add(&b->miscj, &tb->miscj);
>>> +	mce_threshold_block_init(b, (high & MASK_LVTOFF_HI) >> 20);
>>
>> Why is this necessary? Shouldn't this patch consist of mainly
>> s/THRESHOLD_MAX/get_thr_limit();
>>
>>
>> In allocate_threshold_block have already properly set threshold_limit. So
>> this change really ensures threshold_restart_block is being called for the
>> given block being initialized. Ignoring the changed threshold limit logic,
>> why is this extra call necessary now and wasn't before?
>>
> 
> It is necessary to apply the threshold limit to the hardware register.
> The MCA thresholding registers are accessed in two passes: first during
> per-CPU init, and second when the MCE subsystem devices are created.
> 
> The hardware registers are updated in the first pass, and they are left
> as-is in the second pass assuming no configuration has changed. That's
> why there isn't a "reset" in the second pass.
> 
> The APEI tables are parsed between the first and second passes. So now
> we need to update the registers during the second pass to apply the
> value found from HEST.

So APEI is initialized as part of the subsys_initcall which is processed 
via:

start_kernel
rest_init
kernel_init
kernel_init_freeable
do_basic_setup
do_initcalls

And the first mce_threshold_block_init() happens from :

start_kernel
arch_cpu_finalize_init <---- way before rest_init()
identify_boot_cpu
identify_cpu
mcheck_cpu_init
mcheck_cpu_init_vendor
mce_amd_feature_init
prepare_threshold_block
mce_threshold_block_init


Finally the per-cpu hotplug callback is installed via:

mcheck_init_device <- initiated from a device_initcall, happens after 
APEI subsys init.
mce_cpu_online  - called on every cpu from the HP machinery
mce_threshold_create_device
threshold_create_bank
allocate_threshold_blocks
mce_threshold_block_init - newly added call in alloc block, used to 
update the register with the new limit


Given that mce_cpu_online is already called on every online cpu at the 
time of installation of the callback, and every subsequent cpu that will 
come online I can't help but wonder why do we have to do the mce 
initialization from start_kernel, can't we move the mcheck_cpu_init call 
into mce_cpu_online, or have the latter subsume the former?

Sorry if I'm being too nitpicky, I just want to have proper 
understanding of the subsystem and the various (implicit) requirements 
it has.


At the very least I believe this commit message should at least allude 
to the fact that mce threshold devices have a strict requirement to be 
created after the APEI subsys has been created.


> 
> Thanks,
> Yazen

Re: [PATCH v6 15/15] x86/mce: Save and use APEI corrected threshold limit

Posted by Yazen Ghannam 4 months, 2 weeks ago

On Fri, Sep 19, 2025 at 01:42:57PM +0300, Nikolay Borisov wrote:
> 
> 
> On 9/15/25 20:33, Yazen Ghannam wrote:
> > On Thu, Sep 11, 2025 at 08:01:17PM +0300, Nikolay Borisov wrote:
> > > 
> > > 
> > > On 9/8/25 18:40, Yazen Ghannam wrote:
> > > > The MCA threshold limit generally is not something that needs to change
> > > > during runtime. It is common for a system administrator to decide on a
> > > > policy for their managed systems.
> > > > 
> > > > If MCA thresholding is OS-managed, then the threshold limit must be set
> > > > at every boot. However, many systems allow the user to set a value in
> > > > their BIOS. And this is reported through an APEI HEST entry even if
> > > > thresholding is not in FW-First mode.
> > > > 
> > > > Use this value, if available, to set the OS-managed threshold limit.
> > > > Users can still override it through sysfs if desired for testing or
> > > > debug.
> > > > 
> > > > APEI is parsed after MCE is initialized. So reset the thresholding
> > > > blocks later to pick up the threshold limit.
> > > > 
> > > > Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> > > > ---
> > > > 
> > > > Notes:
> > > >       Link:
> > > >       https://lore.kernel.org/r/20250825-wip-mca-updates-v5-20-865768a2eef8@amd.com
> > > >       v5->v6:
> > > >       * No change.
> > > >       v4->v5:
> > > >       * No change.
> > > >       v3->v4:
> > > >       * New in v4.
> > > > 
> > > >    arch/x86/include/asm/mce.h          |  6 ++++++
> > > >    arch/x86/kernel/acpi/apei.c         |  2 ++
> > > >    arch/x86/kernel/cpu/mce/amd.c       | 18 ++++++++++++++++--
> > > >    arch/x86/kernel/cpu/mce/internal.h  |  2 ++
> > > >    arch/x86/kernel/cpu/mce/threshold.c | 13 +++++++++++++
> > > >    5 files changed, 39 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> > > > index 7d6588195d56..1cfbfff0be3f 100644
> > > > --- a/arch/x86/include/asm/mce.h
> > > > +++ b/arch/x86/include/asm/mce.h
> > > > @@ -308,6 +308,12 @@ DECLARE_PER_CPU(struct mce, injectm);
> > > >    /* Disable CMCI/polling for MCA bank claimed by firmware */
> > > >    extern void mce_disable_bank(int bank);
> > > > +#ifdef CONFIG_X86_MCE_THRESHOLD
> > > > +void mce_save_apei_thr_limit(u32 thr_limit);
> > > > +#else
> > > > +static inline void mce_save_apei_thr_limit(u32 thr_limit) { }
> > > > +#endif /* CONFIG_X86_MCE_THRESHOLD */
> > > > +
> > > >    /*
> > > >     * Exception handler
> > > >     */
> > > > diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
> > > > index 0916f00a992e..e21419e686eb 100644
> > > > --- a/arch/x86/kernel/acpi/apei.c
> > > > +++ b/arch/x86/kernel/acpi/apei.c
> > > > @@ -19,6 +19,8 @@ int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data)
> > > >    	if (!cmc->enabled)
> > > >    		return 0;
> > > > +	mce_save_apei_thr_limit(cmc->notify.error_threshold_value);
> > > > +
> > > >    	/*
> > > >    	 * We expect HEST to provide a list of MC banks that report errors
> > > >    	 * in firmware first mode. Otherwise, return non-zero value to
> > > > diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> > > > index b895559e80ad..9b746080351f 100644
> > > > --- a/arch/x86/kernel/cpu/mce/amd.c
> > > > +++ b/arch/x86/kernel/cpu/mce/amd.c
> > > > @@ -489,6 +489,18 @@ static void threshold_restart_bank(unsigned int bank, bool intr_en)
> > > >    	}
> > > >    }
> > > > +/* Try to use the threshold limit reported through APEI. */
> > > > +static u16 get_thr_limit(void)
> > > > +{
> > > > +	u32 thr_limit = mce_get_apei_thr_limit();
> > > > +
> > > > +	/* Fallback to old default if APEI limit is not available. */
> > > > +	if (!thr_limit)
> > > > +		return THRESHOLD_MAX;
> > > > +
> > > > +	return min(thr_limit, THRESHOLD_MAX);
> > > > +}
> > > > +
> > > >    static void mce_threshold_block_init(struct threshold_block *b, int offset)
> > > >    {
> > > >    	struct thresh_restart tr = {
> > > > @@ -497,7 +509,7 @@ static void mce_threshold_block_init(struct threshold_block *b, int offset)
> > > >    		.lvt_off		= offset,
> > > >    	};
> > > > -	b->threshold_limit		= THRESHOLD_MAX;
> > > > +	b->threshold_limit		= get_thr_limit();
> > > >    	threshold_restart_block(&tr);
> > > >    };
> > > > @@ -1071,7 +1083,7 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
> > > >    	b->address		= address;
> > > >    	b->interrupt_enable	= 0;
> > > >    	b->interrupt_capable	= lvt_interrupt_supported(bank, high);
> > > > -	b->threshold_limit	= THRESHOLD_MAX;
> > > > +	b->threshold_limit	= get_thr_limit();
> > > >    	if (b->interrupt_capable) {
> > > >    		default_attrs[2] = &interrupt_enable.attr;
> > > > @@ -1082,6 +1094,8 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
> > > >    	list_add(&b->miscj, &tb->miscj);
> > > > +	mce_threshold_block_init(b, (high & MASK_LVTOFF_HI) >> 20);
> > > 
> > > Why is this necessary? Shouldn't this patch consist of mainly
> > > s/THRESHOLD_MAX/get_thr_limit();
> > > 
> > > 
> > > In allocate_threshold_block have already properly set threshold_limit. So
> > > this change really ensures threshold_restart_block is being called for the
> > > given block being initialized. Ignoring the changed threshold limit logic,
> > > why is this extra call necessary now and wasn't before?
> > > 
> > 
> > It is necessary to apply the threshold limit to the hardware register.
> > The MCA thresholding registers are accessed in two passes: first during
> > per-CPU init, and second when the MCE subsystem devices are created.
> > 
> > The hardware registers are updated in the first pass, and they are left
> > as-is in the second pass assuming no configuration has changed. That's
> > why there isn't a "reset" in the second pass.
> > 
> > The APEI tables are parsed between the first and second passes. So now
> > we need to update the registers during the second pass to apply the
> > value found from HEST.
> 
> So APEI is initialized as part of the subsys_initcall which is processed
> via:
> 
> start_kernel
> rest_init
> kernel_init
> kernel_init_freeable
> do_basic_setup
> do_initcalls
> 
> And the first mce_threshold_block_init() happens from :
> 
> start_kernel
> arch_cpu_finalize_init <---- way before rest_init()
> identify_boot_cpu
> identify_cpu
> mcheck_cpu_init
> mcheck_cpu_init_vendor
> mce_amd_feature_init
> prepare_threshold_block
> mce_threshold_block_init
> 
> 
> Finally the per-cpu hotplug callback is installed via:
> 
> mcheck_init_device <- initiated from a device_initcall, happens after APEI
> subsys init.
> mce_cpu_online  - called on every cpu from the HP machinery
> mce_threshold_create_device
> threshold_create_bank
> allocate_threshold_blocks
> mce_threshold_block_init - newly added call in alloc block, used to update
> the register with the new limit
> 
> 
> Given that mce_cpu_online is already called on every online cpu at the time
> of installation of the callback, and every subsequent cpu that will come
> online I can't help but wonder why do we have to do the mce initialization
> from start_kernel, can't we move the mcheck_cpu_init call into
> mce_cpu_online, or have the latter subsume the former?
> 
> Sorry if I'm being too nitpicky, I just want to have proper understanding of
> the subsystem and the various (implicit) requirements it has.
>

No worries. It is a bit convoluted. I'd like to clean this up
eventually. I have an old attempt here:
https://github.com/AMDESE/linux/commit/640db92eca07804e889fac88856904552a4466bd

In general, I'd like to do the hardware and data structure init in a
single pass. And the sysfs interface can be optionally added separately.

> 
> At the very least I believe this commit message should at least allude to
> the fact that mce threshold devices have a strict requirement to be created
> after the APEI subsys has been created.
> 
> 

The commit message has this:

"APEI is parsed after MCE is initialized. So reset the thresholding
blocks later to pick up the threshold limit."

The "devices" are already created after APEI subsys init. The missing
step is updating the hardware registers.

Thanks,
Yazen