[PATCH v2 1/5] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY

Shuai Xue posted 5 patches 10 months ago
There is a newer version of this series
[PATCH v2 1/5] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY
Posted by Shuai Xue 10 months ago
Currently, mce_no_way_out() only collects error messages when the error
severity is equal to `MCE_PANIC_SEVERITY`. To improve diagnostics,
modify the behavior to also collect error messages when the severity is
less than `MCE_PANIC_SEVERITY`.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 arch/x86/kernel/cpu/mce/core.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 0dc00c9894c7..f2e730d4acc5 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -925,12 +925,13 @@ static __always_inline void quirk_zen_ifu(int bank, struct mce *m, struct pt_reg
  * Do a quick check if any of the events requires a panic.
  * This decides if we keep the events around or clear them.
  */
-static __always_inline int mce_no_way_out(struct mce_hw_err *err, char **msg, unsigned long *validp,
-					  struct pt_regs *regs)
+static __always_inline bool mce_no_way_out(struct mce_hw_err *err, char **msg,
+					   unsigned long *validp,
+					   struct pt_regs *regs)
 {
 	struct mce *m = &err->m;
 	char *tmp = *msg;
-	int i;
+	int i, cur_sev = MCE_NO_SEVERITY, sev;
 
 	for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
 		m->status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));
@@ -945,13 +946,17 @@ static __always_inline int mce_no_way_out(struct mce_hw_err *err, char **msg, un
 			quirk_zen_ifu(i, m, regs);
 
 		m->bank = i;
-		if (mce_severity(m, regs, &tmp, true) >= MCE_PANIC_SEVERITY) {
+		sev = mce_severity(m, regs, &tmp, true);
+		if (sev >= cur_sev) {
 			mce_read_aux(err, i);
 			*msg = tmp;
-			return 1;
+			cur_sev = sev;
 		}
+
+		if (cur_sev == MCE_PANIC_SEVERITY)
+			return true;
 	}
-	return 0;
+	return false;
 }
 
 /*
-- 
2.39.3
Re: [PATCH v2 1/5] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY
Posted by Borislav Petkov 10 months ago
On Mon, Feb 17, 2025 at 02:33:31PM +0800, Shuai Xue wrote:
> Currently, mce_no_way_out() only collects error messages when the error
> severity is equal to `MCE_PANIC_SEVERITY`. To improve diagnostics,
> modify the behavior to also collect error messages when the severity is
> less than `MCE_PANIC_SEVERITY`.

That function is literally called "no way out" as in, is the error severe
enough so that there is no way out.

Now you went and stomped all over that to "improve diagnostics". What
diagnostsics? Your commit messages need to explain in detail why exactly
a patch exists.

So nope.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v2 1/5] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY
Posted by Shuai Xue 10 months ago

在 2025/2/18 15:58, Borislav Petkov 写道:
> On Mon, Feb 17, 2025 at 02:33:31PM +0800, Shuai Xue wrote:
>> Currently, mce_no_way_out() only collects error messages when the error
>> severity is equal to `MCE_PANIC_SEVERITY`. To improve diagnostics,
>> modify the behavior to also collect error messages when the severity is
>> less than `MCE_PANIC_SEVERITY`.
> 
> That function is literally called "no way out" as in, is the error severe
> enough so that there is no way out.
> 
> Now you went and stomped all over that to "improve diagnostics". What
> diagnostsics? Your commit messages need to explain in detail why exactly
> a patch exists.
> 
> So nope.
> 

Hi, Borislav,

Thank you for reply.

The msg in predefined `severities`, e.g.

	MCESEV(
		AO, "Action optional: last level cache writeback error",
		SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
		),

is helpful for users to know what kind of MCE is happened. For a fatal machine
check, kernel panic use the message and I want to extend to collect the message
and print it out for non-fatal one.

If you don't object, let's go on to discuss how to implement it.
Otherwise, you can ignore the following response.

Yes, mce_no_way_out() means "no way out" literally. It only collects message
for MCE_PANIC_SEVERITY but use in common path. So I used this function to
extend it to non-fatal, assuming it was obvious.
  
Is __mc_scan_banks() a proper function to extend?

Thanks.
Shuai
Re: [PATCH v2 1/5] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY
Posted by Borislav Petkov 10 months ago
On Tue, Feb 18, 2025 at 05:39:33PM +0800, Shuai Xue wrote:
> Is __mc_scan_banks() a proper function to extend?

No, first you need to explain the big picture:

https://lore.kernel.org/r/20250218082727.GCZ7REb7OG6NTAY-V-@fat_crate.local

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette