Today, checking for non-fatal MCE errors on AMD is very invasive: it
involves a periodic timer interrupting the physical CPU execution at
regular intervals. Moreover, when the timer fires, the handler sends an
IPI to all physical CPUs.

Both these actions are disruptive in terms of latency and deterministic
execution times for real-time workloads. Such workloads might miss a
deadline due to one of these IPIs. Make it possible to disable non-fatal
MCE error checking with a new Kconfig option (AMD_MCE_NONFATAL).

Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>
---
RFC. I couldn't find a better way to do this.
---
xen/arch/x86/Kconfig.cpu | 15 +++++++++++++++
xen/arch/x86/cpu/mcheck/amd_nonfatal.c | 3 ++-
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/xen/arch/x86/Kconfig.cpu b/xen/arch/x86/Kconfig.cpu
index 5fb18db1aa..14e20ad19d 100644
--- a/xen/arch/x86/Kconfig.cpu
+++ b/xen/arch/x86/Kconfig.cpu
@@ -10,6 +10,21 @@ config AMD
May be turned off in builds targetting other vendors. Otherwise,
must be enabled for Xen to work suitably on AMD platforms.
+config AMD_MCE_NONFATAL
+ bool "Check for non-fatal MCEs on AMD CPUs"
+ default y
+ depends on AMD
+ help
+ Check for non-fatal MCE errors.
+
+ When this option is on (default), Xen regularly checks for
+ non-fatal MCEs potentially occurring on all physical CPUs. The
+ checking is done via timers and IPI interrupts, which is
+ acceptable in most configurations, but not for real-time.
+
+ Turn this option off if you plan on deploying real-time workloads
+ on Xen.
+
config INTEL
bool "Support Intel CPUs"
default y
diff --git a/xen/arch/x86/cpu/mcheck/amd_nonfatal.c b/xen/arch/x86/cpu/mcheck/amd_nonfatal.c
index 7d48c9ab5f..812e18f612 100644
--- a/xen/arch/x86/cpu/mcheck/amd_nonfatal.c
+++ b/xen/arch/x86/cpu/mcheck/amd_nonfatal.c
@@ -191,7 +191,8 @@ static void cf_check mce_amd_work_fn(void *data)
void __init amd_nonfatal_mcheck_init(struct cpuinfo_x86 *c)
{
- if (!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)))
+ if ( !IS_ENABLED(CONFIG_AMD_MCE_NONFATAL) ||
+ (!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON))) )
return;
/* Assume we are on K8 or newer AMD or Hygon CPU here */
--
2.25.1
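
For readers less familiar with the IS_ENABLED() idiom used in the hunk
above, here is a minimal standalone sketch. It is not Xen's code: the
DEMO_* macros, the vendor bit values and nonfatal_polling_wanted() are
made up for illustration, and the IS_ENABLED() definition is a simplified
form of the common preprocessor trick. The point is that the option check
folds to a constant at build time, so with the option off the function
always returns early, while its body still has to compile.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified form of the usual IS_ENABLED() preprocessor trick: expands
 * to 1 when the option macro is defined as 1, and to 0 when undefined.
 */
#define DEMO_PLACEHOLDER_1 0,
#define DEMO_SECOND_ARG(ignored, val, ...) val
#define DEMO_ENABLED_3(arg1_or_junk) DEMO_SECOND_ARG(arg1_or_junk 1, 0, 0)
#define DEMO_ENABLED_2(value) DEMO_ENABLED_3(DEMO_PLACEHOLDER_##value)
#define DEMO_ENABLED_1(cfg) DEMO_ENABLED_2(cfg)
#define IS_ENABLED(option) DEMO_ENABLED_1(option)

/* Option under test, defined as 1 the way a generated config header would. */
#define CONFIG_DEMO_MCE_NONFATAL 1
/* CONFIG_DEMO_ABSENT is deliberately left undefined. */

/* Hypothetical one-bit-per-vendor values; not Xen's actual constants. */
#define DEMO_VENDOR_INTEL (1u << 0)
#define DEMO_VENDOR_AMD   (1u << 1)
#define DEMO_VENDOR_HYGON (1u << 2)

/* Mirrors the shape of the guard in the hunk: bail out early if unwanted. */
static bool nonfatal_polling_wanted(unsigned int vendor)
{
    if ( !IS_ENABLED(CONFIG_DEMO_MCE_NONFATAL) ||
         !(vendor & (DEMO_VENDOR_AMD | DEMO_VENDOR_HYGON)) )
        return false;

    return true;
}

/* Usage example: exercises both the macro and the guard. */
static void demo_selftest(void)
{
    assert(IS_ENABLED(CONFIG_DEMO_MCE_NONFATAL) == 1);
    assert(IS_ENABLED(CONFIG_DEMO_ABSENT) == 0);
    assert(nonfatal_polling_wanted(DEMO_VENDOR_AMD));
    assert(nonfatal_polling_wanted(DEMO_VENDOR_HYGON));
    assert(!nonfatal_polling_wanted(DEMO_VENDOR_INTEL));
}
```

Removing the #define of CONFIG_DEMO_MCE_NONFATAL makes the guard return
false for every vendor, which is the behaviour the patch relies on.
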
On Tue Jul 8, 2025 at 2:07 AM CEST, Stefano Stabellini wrote:
> Today, checking for non-fatal MCE errors on AMD is very invasive: it
> involves a periodic timer interrupting the physical CPU execution at
> regular intervals. Moreover, when the timer fires, the handler sends an
> IPI to all physical CPUs.
>
> Both these actions are disruptive in terms of latency and deterministic
> execution times for real-time workloads. Such workloads might miss a
> deadline due to one of these IPIs. Make it possible to disable non-fatal
> MCE error checking with a new Kconfig option (AMD_MCE_NONFATAL).
>
> Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>
> ---
> RFC. I couldn't find a better way to do this.
> ---
> xen/arch/x86/Kconfig.cpu | 15 +++++++++++++++
> xen/arch/x86/cpu/mcheck/amd_nonfatal.c | 3 ++-
> 2 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/xen/arch/x86/Kconfig.cpu b/xen/arch/x86/Kconfig.cpu
> index 5fb18db1aa..14e20ad19d 100644
> --- a/xen/arch/x86/Kconfig.cpu
> +++ b/xen/arch/x86/Kconfig.cpu
> @@ -10,6 +10,21 @@ config AMD
> May be turned off in builds targetting other vendors. Otherwise,
> must be enabled for Xen to work suitably on AMD platforms.
>
> +config AMD_MCE_NONFATAL
> + bool "Check for non-fatal MCEs on AMD CPUs"
> + default y
> + depends on AMD
> + help
> + Check for non-fatal MCE errors.
> +
> + When this option is on (default), Xen regularly checks for
> + non-fatal MCEs potentially occurring on all physical CPUs. The
> + checking is done via timers and IPI interrupts, which is
> + acceptable in most configurations, but not for real-time.
> +
> + Turn this option off if you plan on deploying real-time workloads
> + on Xen.
> +
This being in the CPU vendor submenu seems off. I'd expect only a list of
silicon vendors here. I think it ought to be in the regular Kconfig file.
> config INTEL
> bool "Support Intel CPUs"
> default y
> diff --git a/xen/arch/x86/cpu/mcheck/amd_nonfatal.c b/xen/arch/x86/cpu/mcheck/amd_nonfatal.c
> index 7d48c9ab5f..812e18f612 100644
> --- a/xen/arch/x86/cpu/mcheck/amd_nonfatal.c
> +++ b/xen/arch/x86/cpu/mcheck/amd_nonfatal.c
> @@ -191,7 +191,8 @@ static void cf_check mce_amd_work_fn(void *data)
>
> void __init amd_nonfatal_mcheck_init(struct cpuinfo_x86 *c)
> {
> - if (!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)))
> + if ( !IS_ENABLED(CONFIG_AMD_MCE_NONFATAL) ||
> + (!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON))) )
> return;
>
> /* Assume we are on K8 or newer AMD or Hygon CPU here */
It can be made more general to remove more code. What do you think of removing
all non-fatals and getting rid of the initcall altogether?
diff --git a/xen/arch/x86/Kconfig.cpu b/xen/arch/x86/Kconfig.cpu
index 5fb18db1aa..a4b892a1aa 100644
--- a/xen/arch/x86/Kconfig.cpu
+++ b/xen/arch/x86/Kconfig.cpu
@@ -10,6 +10,20 @@ config AMD
May be turned off in builds targetting other vendors. Otherwise,
must be enabled for Xen to work suitably on AMD platforms.
+config MCE_NONFATAL
+ bool "Check for non-fatal MCEs"
+ default y
+ help
+ Check for non-fatal MCE errors.
+
+ When this option is on (default), Xen regularly checks for
+ non-fatal MCEs potentially occurring on all physical CPUs. The
+ checking is done via timers and IPI interrupts, which is
+ acceptable in most configurations, but not for real-time.
+
+ Turn this option off if you plan on deploying real-time workloads
+ on Xen.
+
config INTEL
bool "Support Intel CPUs"
default y
diff --git a/xen/arch/x86/cpu/mcheck/Makefile b/xen/arch/x86/cpu/mcheck/Makefile
index e6cb4dd503..c70b441888 100644
--- a/xen/arch/x86/cpu/mcheck/Makefile
+++ b/xen/arch/x86/cpu/mcheck/Makefile
@@ -1,12 +1,12 @@
-obj-$(CONFIG_AMD) += amd_nonfatal.o
+obj-$(filter $(CONFIG_AMD),$(CONFIG_MCE_NONFATAL)) += amd_nonfatal.o
obj-$(CONFIG_AMD) += mce_amd.o
obj-y += mcaction.o
obj-y += barrier.o
-obj-$(CONFIG_INTEL) += intel-nonfatal.o
+obj-$(filter $(CONFIG_INTEL),$(CONFIG_MCE_NONFATAL)) += intel-nonfatal.o
obj-y += mctelem.o
obj-y += mce.o
obj-y += mce-apei.o
obj-$(CONFIG_INTEL) += mce_intel.o
-obj-y += non-fatal.o
+obj-$(CONFIG_MCE_NONFATAL) += non-fatal.o
obj-y += util.o
obj-y += vmce.o
... with the Kconfig option probably in the regular x86 Kconfig rather than
Kconfig.cpu
Thoughts?
Cheers,
Alejandro
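
The difference between gating at the Makefile level and returning early
at run time can be sketched as follows. This is a toy model, not Xen's
initcall machinery: the table, run_initcalls() and init_nonfatal_polling()
are hypothetical names, and an #ifdef stands in for the object file being
left out of the build. Built as is (option undefined), no polling initcall
exists at all, so there is nothing left to run and do nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an initcall: a function run once during boot. */
typedef int (*initcall_t)(void);

/* Set by the polling initcall, so callers can observe whether it ran. */
static int polling_started;

#ifdef CONFIG_DEMO_MCE_NONFATAL
/* Compiled only when the option is on, like an object gated in a Makefile. */
static int init_nonfatal_polling(void)
{
    polling_started = 1; /* real code would arm the periodic timer here */
    return 0;
}
#endif

/* NULL-terminated table of every initcall that made it into the build. */
static const initcall_t initcalls[] = {
#ifdef CONFIG_DEMO_MCE_NONFATAL
    init_nonfatal_polling,
#endif
    NULL,
};

/* Run all registered initcalls and return how many were executed. */
static size_t run_initcalls(void)
{
    size_t n = 0;

    for ( ; initcalls[n] != NULL; n++ )
    {
        int rc = initcalls[n]();

        assert(rc == 0);
        (void)rc; /* keep builds with NDEBUG free of unused warnings */
    }

    return n;
}
```

Building the same file with -DCONFIG_DEMO_MCE_NONFATAL adds one entry to
the table, which is the toy equivalent of the object being linked in.
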
On 08.07.2025 12:25, Alejandro Vallejo wrote:
> On Tue Jul 8, 2025 at 2:07 AM CEST, Stefano Stabellini wrote:
>> --- a/xen/arch/x86/Kconfig.cpu
>> +++ b/xen/arch/x86/Kconfig.cpu
>> @@ -10,6 +10,21 @@ config AMD
>> May be turned off in builds targetting other vendors. Otherwise,
>> must be enabled for Xen to work suitably on AMD platforms.
>>
>> +config AMD_MCE_NONFATAL
>> + bool "Check for non-fatal MCEs on AMD CPUs"
>> + default y
>> + depends on AMD
>> + help
>> + Check for non-fatal MCE errors.
>> +
>> + When this option is on (default), Xen regularly checks for
>> + non-fatal MCEs potentially occurring on all physical CPUs. The
>> + checking is done via timers and IPI interrupts, which is
>> + acceptable in most configurations, but not for real-time.
>> +
>> + Turn this option off if you plan on deploying real-time workloads
>> + on Xen.
>> +
>
> This being in the CPU vendor submenu seems off. I'd expect only a list of
> silicon vendors here. I think it ought to be in the regular Kconfig file.
Whether in this file or the regular one is up for discussion, but yes,
definitely not inside the vendor menu.
>> --- a/xen/arch/x86/cpu/mcheck/amd_nonfatal.c
>> +++ b/xen/arch/x86/cpu/mcheck/amd_nonfatal.c
>> @@ -191,7 +191,8 @@ static void cf_check mce_amd_work_fn(void *data)
>>
>> void __init amd_nonfatal_mcheck_init(struct cpuinfo_x86 *c)
>> {
>> - if (!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)))
>> + if ( !IS_ENABLED(CONFIG_AMD_MCE_NONFATAL) ||
>> + (!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON))) )
>> return;
>>
>> /* Assume we are on K8 or newer AMD or Hygon CPU here */
>
> It can be made more general to remove more code. What do you think of removing
> all non-fatals and getting rid of the initcall altogether?
I think such a more general approach would be quite a bit better.
Jan
> diff --git a/xen/arch/x86/Kconfig.cpu b/xen/arch/x86/Kconfig.cpu
> index 5fb18db1aa..a4b892a1aa 100644
> --- a/xen/arch/x86/Kconfig.cpu
> +++ b/xen/arch/x86/Kconfig.cpu
> @@ -10,6 +10,20 @@ config AMD
> May be turned off in builds targetting other vendors. Otherwise,
> must be enabled for Xen to work suitably on AMD platforms.
>
> +config MCE_NONFATAL
> + bool "Check for non-fatal MCEs"
> + default y
> + help
> + Check for non-fatal MCE errors.
> +
> + When this option is on (default), Xen regularly checks for
> + non-fatal MCEs potentially occurring on all physical CPUs. The
> + checking is done via timers and IPI interrupts, which is
> + acceptable in most configurations, but not for real-time.
> +
> + Turn this option off if you plan on deploying real-time workloads
> + on Xen.
> +
> config INTEL
> bool "Support Intel CPUs"
> default y
> diff --git a/xen/arch/x86/cpu/mcheck/Makefile b/xen/arch/x86/cpu/mcheck/Makefile
> index e6cb4dd503..c70b441888 100644
> --- a/xen/arch/x86/cpu/mcheck/Makefile
> +++ b/xen/arch/x86/cpu/mcheck/Makefile
> @@ -1,12 +1,12 @@
> -obj-$(CONFIG_AMD) += amd_nonfatal.o
> +obj-$(filter $(CONFIG_AMD),$(CONFIG_MCE_NONFATAL)) += amd_nonfatal.o
> obj-$(CONFIG_AMD) += mce_amd.o
> obj-y += mcaction.o
> obj-y += barrier.o
> -obj-$(CONFIG_INTEL) += intel-nonfatal.o
> +obj-$(filter $(CONFIG_INTEL),$(CONFIG_MCE_NONFATAL)) += intel-nonfatal.o
> obj-y += mctelem.o
> obj-y += mce.o
> obj-y += mce-apei.o
> obj-$(CONFIG_INTEL) += mce_intel.o
> -obj-y += non-fatal.o
> +obj-$(CONFIG_MCE_NONFATAL) += non-fatal.o
> obj-y += util.o
> obj-y += vmce.o
>
> ... with the Kconfig option probably in the regular x86 Kconfig rather than
> Kconfig.cpu
>
> Thoughts?
>
> Cheers,
> Alejandro
On Tue, 8 Jul 2025, Jan Beulich wrote:
> On 08.07.2025 12:25, Alejandro Vallejo wrote:
> > On Tue Jul 8, 2025 at 2:07 AM CEST, Stefano Stabellini wrote:
> >> --- a/xen/arch/x86/Kconfig.cpu
> >> +++ b/xen/arch/x86/Kconfig.cpu
> >> @@ -10,6 +10,21 @@ config AMD
> >> May be turned off in builds targetting other vendors. Otherwise,
> >> must be enabled for Xen to work suitably on AMD platforms.
> >>
> >> +config AMD_MCE_NONFATAL
> >> + bool "Check for non-fatal MCEs on AMD CPUs"
> >> + default y
> >> + depends on AMD
> >> + help
> >> + Check for non-fatal MCE errors.
> >> +
> >> + When this option is on (default), Xen regularly checks for
> >> + non-fatal MCEs potentially occurring on all physical CPUs. The
> >> + checking is done via timers and IPI interrupts, which is
> >> + acceptable in most configurations, but not for real-time.
> >> +
> >> + Turn this option off if you plan on deploying real-time workloads
> >> + on Xen.
> >> +
> >
> > This being in the CPU vendor submenu seems off. I'd expect only a list of
> > silicon vendors here. I think it ought to be in the regular Kconfig file.
>
> Whether in this file or the regular one is up for discussion, but yes,
> definitely not inside the vendor menu.
>
> >> --- a/xen/arch/x86/cpu/mcheck/amd_nonfatal.c
> >> +++ b/xen/arch/x86/cpu/mcheck/amd_nonfatal.c
> >> @@ -191,7 +191,8 @@ static void cf_check mce_amd_work_fn(void *data)
> >>
> >> void __init amd_nonfatal_mcheck_init(struct cpuinfo_x86 *c)
> >> {
> >> - if (!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)))
> >> + if ( !IS_ENABLED(CONFIG_AMD_MCE_NONFATAL) ||
> >> + (!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON))) )
> >> return;
> >>
> >> /* Assume we are on K8 or newer AMD or Hygon CPU here */
> >
> > It can be made more general to remove more code. What do you think of removing
> > all non-fatals and getting rid of the initcall altogether?
>
> I think such a more general approach would be quite a bit better.
I am fine with that. It is actually better to remove the code than to
leave it around doing nothing.
> > diff --git a/xen/arch/x86/Kconfig.cpu b/xen/arch/x86/Kconfig.cpu
> > index 5fb18db1aa..a4b892a1aa 100644
> > --- a/xen/arch/x86/Kconfig.cpu
> > +++ b/xen/arch/x86/Kconfig.cpu
> > @@ -10,6 +10,20 @@ config AMD
> > May be turned off in builds targetting other vendors. Otherwise,
> > must be enabled for Xen to work suitably on AMD platforms.
> >
> > +config MCE_NONFATAL
> > + bool "Check for non-fatal MCEs"
> > + default y
> > + help
> > + Check for non-fatal MCE errors.
> > +
> > + When this option is on (default), Xen regularly checks for
> > + non-fatal MCEs potentially occurring on all physical CPUs. The
> > + checking is done via timers and IPI interrupts, which is
> > + acceptable in most configurations, but not for real-time.
> > +
> > + Turn this option off if you plan on deploying real-time workloads
> > + on Xen.
> > +
> > config INTEL
> > bool "Support Intel CPUs"
> > default y
> > diff --git a/xen/arch/x86/cpu/mcheck/Makefile b/xen/arch/x86/cpu/mcheck/Makefile
> > index e6cb4dd503..c70b441888 100644
> > --- a/xen/arch/x86/cpu/mcheck/Makefile
> > +++ b/xen/arch/x86/cpu/mcheck/Makefile
> > @@ -1,12 +1,12 @@
> > -obj-$(CONFIG_AMD) += amd_nonfatal.o
> > +obj-$(filter $(CONFIG_AMD),$(CONFIG_MCE_NONFATAL)) += amd_nonfatal.o
> > obj-$(CONFIG_AMD) += mce_amd.o
> > obj-y += mcaction.o
> > obj-y += barrier.o
> > -obj-$(CONFIG_INTEL) += intel-nonfatal.o
> > +obj-$(filter $(CONFIG_INTEL),$(CONFIG_MCE_NONFATAL)) += intel-nonfatal.o
> > obj-y += mctelem.o
> > obj-y += mce.o
> > obj-y += mce-apei.o
> > obj-$(CONFIG_INTEL) += mce_intel.o
> > -obj-y += non-fatal.o
> > +obj-$(CONFIG_MCE_NONFATAL) += non-fatal.o
> > obj-y += util.o
> > obj-y += vmce.o
> >
> > ... with the Kconfig option probably in the regular x86 Kconfig rather than
> > Kconfig.cpu
> >
> > Thoughts?
> >
> > Cheers,
> > Alejandro
>