[PATCH 01/12] x86/mce: Fix hw MCE injection feature detection

isaku.yamahata@intel.com posted 12 patches 2 years, 2 months ago
[PATCH 01/12] x86/mce: Fix hw MCE injection feature detection
Posted by isaku.yamahata@intel.com 2 years, 2 months ago
From: Isaku Yamahata <isaku.yamahata@intel.com>

When initializing x86 MCE injection framework, it checks if hardware mce
injection is available or not.  When it's not available on AMD, set the
boolean variable to false to not use it.  The variable is on by default and
the feature is AMD specific based on the code.

Because the variable is default on, it is true on Intel platform (probably
on other non-AMD x86 platform).  It results in unchecked msr access of
MSR_K7_HWCR=0xc0010015 when injecting MCE on Intel platform.  (Probably on
other x86 platform.)

Make the variable of by default, and set the variable on when the hardware
feature is usable.

The procedure to produce on Intel platform.
  echo hw > /sys/kernel/debug/mce-inject/flags
    This succeeds.
    set other MCE parameters if necessary.
  echo 0 > /sys/kernel/debug/mce-inject/bank
    This triggers mce injection.

The kernel accesses MSR_K7_HWCR=0xc0010015 resulting unchecked MSR access.

Fixes: 891e465a1bd8 ("x86/mce: Check whether writes to MCA_STATUS are getting ignored")
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kernel/cpu/mce/inject.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/inject.c b/arch/x86/kernel/cpu/mce/inject.c
index 4d8d4bcf915d..881898a1d2f4 100644
--- a/arch/x86/kernel/cpu/mce/inject.c
+++ b/arch/x86/kernel/cpu/mce/inject.c
@@ -33,7 +33,7 @@
 
 #include "internal.h"
 
-static bool hw_injection_possible = true;
+static bool hw_injection_possible;
 
 /*
  * Collect all the MCi_XXX settings
@@ -732,6 +732,7 @@ static void check_hw_inj_possible(void)
 	if (!cpu_feature_enabled(X86_FEATURE_SMCA))
 		return;
 
+	hw_injection_possible = true;
 	cpu = get_cpu();
 
 	for (bank = 0; bank < MAX_NR_BANKS; ++bank) {
-- 
2.25.1
Re: [PATCH 01/12] x86/mce: Fix hw MCE injection feature detection
Posted by Dave Hansen 2 years, 2 months ago
Isaku, when you report a bug, it would be great to include the folks who
authored and worked on the original patch that introduced the bug.  I've
gone ahead and done that for you here.

On 10/10/23 01:35, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> When initializing x86 MCE injection framework, it checks if hardware mce
> injection is available or not.  When it's not available on AMD, set the
> boolean variable to false to not use it.  The variable is on by default and
> the feature is AMD specific based on the code.
> 
> Because the variable is default on, it is true on Intel platform (probably
> on other non-AMD x86 platform).  It results in unchecked msr access of
> MSR_K7_HWCR=0xc0010015 when injecting MCE on Intel platform.  (Probably on
> other x86 platform.)
> 
> Make the variable of by default, and set the variable on when the hardware
> feature is usable.

Gah, I'm finding that changelog impenetrable.  Here's what's missing:

  * The entirety of check_hw_inj_possible() is for AMD hardware:
    X86_FEATURE_SMCA, the MSRs, hw_injection_possible, everything.
  * Only AMD systems with SMCA support hardware error injection
    (anything other than "echo sw > /sys/kernel/debug/mce-inject/flags")
  * That AMD-only restriction is enforced by 'hw_injection_possible'
  * 'hw_injection_possible' is true by default and only set to false in
    check_hw_inj_possible() ... the AMD-only code

The end result is that everyone except SMCA-enabled AMD systems (Intel
included) leaves hw_injection_possible=true.  They are free to try and
inject hardware errors.  If they do, they'll get errors when writing to
the MSRs.

To fix this, make disable hw_injection_possible by default.  Only enable
it on SMCA hardware that actually succeeds in ... whatever:

                wrmsrl_safe(mca_msr_reg(bank, MCA_STATUS), status);
                rdmsrl_safe(mca_msr_reg(bank, MCA_STATUS), &status);

is doing.

... and don't do it at the top of the function.  Why bother setting it
to true only to disable it a moment later?

Do something like the following instead.