[PATCH v3 00/10] x86/mce: Clean up some x86/mce code

Qiuxu Zhuo posted 10 patches 1 month ago
There is a newer version of this series
arch/x86/include/asm/mce.h           |   4 +-
arch/x86/kernel/cpu/mce/amd.c        |  18 +--
arch/x86/kernel/cpu/mce/core.c       | 230 +++++++++++++++------------
arch/x86/kernel/cpu/mce/dev-mcelog.c |  11 +-
arch/x86/kernel/cpu/mce/genpool.c    |   9 +-
arch/x86/kernel/cpu/mce/inject.c     |   2 +-
arch/x86/kernel/cpu/mce/intel.c      |  11 +-
arch/x86/kernel/cpu/mce/threshold.c  |   2 +-
8 files changed, 149 insertions(+), 138 deletions(-)
[PATCH v3 00/10] x86/mce: Clean up some x86/mce code
Posted by Qiuxu Zhuo 1 month ago
Clean up some x86/mce code as below. No functional changes intended.

  - Simplify some code.

  - Remove some unnecessary code.

  - Improve readability for some code.

  - Convert some family/model mixed checks to VFM-based checks.

  - Fix some typos.

Pass the following basic tests:

  - Tested on an Intel Sapphire Rapids server.

  - Compile test.

  - System boot test.

  - Correctable/uncorrectable memory errors can be notified via CMCI/MCE interrupts.

  - Correctable/uncorrectable memory errors can be dispatched to the mcelog daemon and the EDAC driver.

Changes in v3:

  - Collect "Reviewed-by:" from Nikolay & Sohil.

  - Drop the "x86/mce: Remove the redundant zeroing assignments" patch.

  - 0003: Rename mce_notify_irq() to mce_notify_user().

  - 0005: Move the 'int ret' variable along with the other int variables.

  - 0006: New patch. Break up __mcheck_cpu_apply_quirks().

  - 0007: New patch. Convert family/model mixed checks to VFM-based checks.

  - 0009: Remove the variable names from the commit message.

  - 0010: Remove the detail typos from the commit message.

Changes in v2:

  - Collect "Reviewed-by:" from Tony.

  - 0009: Add the missing variable names to the commit message.

This series is based on v6.12-rc4.

Thanks Tony, Dave, Sohil and Nikolay for your review and discussion on this series.

Qiuxu Zhuo (9):
  x86/mce/dev-mcelog: Use xchg() to get and clear the flags
  x86/mce/intel: Use MCG_BANKCNT_MASK instead of 0xff
  x86/mce: Make several functions return bool and rename a function
  x86/mce/threshold: Remove the redundant this_cpu_dec_return()
  x86/mce/genpool: Make mce_gen_pool_create() return explicit error codes
  x86/mce: Convert family/model mixed checks to VFM-based checks
  x86/mce: Remove the unnecessary {}
  x86/mce/amd: Remove unnecessary NULL pointer initializations
  x86/mce: Fix typos

Tony Luck (1):
  x86/mce: Break up __mcheck_cpu_apply_quirks()

 arch/x86/include/asm/mce.h           |   4 +-
 arch/x86/kernel/cpu/mce/amd.c        |  18 +--
 arch/x86/kernel/cpu/mce/core.c       | 230 +++++++++++++++------------
 arch/x86/kernel/cpu/mce/dev-mcelog.c |  11 +-
 arch/x86/kernel/cpu/mce/genpool.c    |   9 +-
 arch/x86/kernel/cpu/mce/inject.c     |   2 +-
 arch/x86/kernel/cpu/mce/intel.c      |  11 +-
 arch/x86/kernel/cpu/mce/threshold.c  |   2 +-
 8 files changed, 149 insertions(+), 138 deletions(-)


base-commit: 42f7652d3eb527d03665b09edac47f85fb600924
-- 
2.17.1
[PATCH v4 0/8] x86/mce: Clean up some x86/mce code
Posted by Qiuxu Zhuo 2 weeks ago
Clean up some x86/mce code as below. No functional changes intended.
  - Simplify some code.
  - Remove some unnecessary code.
  - Improve readability for some code.
  - Convert some family/model mixed checks to VFM-based checks.
  - Fix some typos.

Pass the following basic tests:
  - Tested on an Intel Sapphire Rapids server.
  - Compile test.
  - System boot test.
  - Correctable/uncorrectable memory errors can be notified via CMCI/MCE interrupts.
  - Correctable/uncorrectable memory errors can be dispatched to the mcelog daemon and the EDAC driver.

Changes in v4:
  - Drop the first two patches as they've landed in the TIP ras/core branch.
  - Drop "Make mce_gen_pool_create() return explicit error codes" patch.
  - 0001: Don't rename mce_notify_irq().
  - 0003: New patch. Make four functions return bool.
  - 0004: Add necessary blank lines and directly use 'mca_cfg'.

Changes in v3:
  - Collect "Reviewed-by:" from Nikolay & Sohil.
  - Drop the "x86/mce: Remove the redundant zeroing assignments" patch.
  - 0003: Rename mce_notify_irq() to mce_notify_user().
  - 0005: Move the 'int ret' variable along with the other int variables.
  - 0006: New patch. Break up __mcheck_cpu_apply_quirks().
  - 0007: New patch. Convert family/model mixed checks to VFM-based checks.
  - 0009: Remove the variables' names from the commit message.
  - 0010: Remove the detail typos from the commit message.

Changes in v2:
  - Collect "Reviewed-by:" from Tony.
  - Update the commit message of patch 9 to include the names of all
    variables that don't need NULL pointer initializations.

This series is based on the TIP branch ras/core with top commit 612c2addff36.

Thanks Thomas, Boris, Tony, Dave, Sohil and Nikolay for your review and discussion on this series.

Qiuxu Zhuo (7):
  x86/mce: Make several functions return bool
  x86/mce/threshold: Remove the redundant this_cpu_dec_return()
  x86/mce: Make four functions return bool
  x86/mce: Convert family/model mixed checks to VFM-based checks
  x86/mce: Remove the unnecessary {}
  x86/mce/amd: Remove unnecessary NULL pointer initializations
  x86/mce: Fix typos

Tony Luck (1):
  x86/mce: Break up __mcheck_cpu_apply_quirks()

 arch/x86/include/asm/mce.h          |   4 +-
 arch/x86/kernel/cpu/mce/amd.c       |  18 +--
 arch/x86/kernel/cpu/mce/core.c      | 234 +++++++++++++++-------------
 arch/x86/kernel/cpu/mce/genpool.c   |  29 ++--
 arch/x86/kernel/cpu/mce/intel.c     |   9 +-
 arch/x86/kernel/cpu/mce/internal.h  |   4 +-
 arch/x86/kernel/cpu/mce/threshold.c |   2 +-
 7 files changed, 158 insertions(+), 142 deletions(-)


base-commit: 612c2addff367ee461dc99ffca2bc786f105d2ec
-- 
2.17.1
[PATCH v4 1/8] x86/mce: Make several functions return bool
Posted by Qiuxu Zhuo 2 weeks ago
Make several functions that return 0 or 1 return a boolean value for
better readability.

No functional changes are intended.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
Changes in v4:
  - Don't rename mce_notify_irq() (Boris).

Changes in v3:
  - Collect "Reviewed-by:" from Nikolay & Sohil.
  - Rename mce_notify_irq() to mce_notify_user() (Sohil).

Changes in v2:
  - Collect "Reviewed-by:" from Tony.

 arch/x86/include/asm/mce.h      |  4 ++--
 arch/x86/kernel/cpu/mce/amd.c   | 10 +++++-----
 arch/x86/kernel/cpu/mce/core.c  | 22 +++++++++++-----------
 arch/x86/kernel/cpu/mce/intel.c |  9 +++++----
 4 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 4543cf2eb5e8..ea9ca7689f6b 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -276,7 +276,7 @@ static inline void cmci_rediscover(void) {}
 static inline void cmci_recheck(void) {}
 #endif
 
-int mce_available(struct cpuinfo_x86 *c);
+bool mce_available(struct cpuinfo_x86 *c);
 bool mce_is_memory_error(struct mce *m);
 bool mce_is_correctable(struct mce *m);
 bool mce_usable_address(struct mce *m);
@@ -296,7 +296,7 @@ enum mcp_flags {
 
 void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
 
-int mce_notify_irq(void);
+bool mce_notify_irq(void);
 
 DECLARE_PER_CPU(struct mce, injectm);
 
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 6ca80fff1fea..018874b554cb 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -381,7 +381,7 @@ static bool lvt_interrupt_supported(unsigned int bank, u32 msr_high_bits)
 	return msr_high_bits & BIT(28);
 }
 
-static int lvt_off_valid(struct threshold_block *b, int apic, u32 lo, u32 hi)
+static bool lvt_off_valid(struct threshold_block *b, int apic, u32 lo, u32 hi)
 {
 	int msr = (hi & MASK_LVTOFF_HI) >> 20;
 
@@ -389,7 +389,7 @@ static int lvt_off_valid(struct threshold_block *b, int apic, u32 lo, u32 hi)
 		pr_err(FW_BUG "cpu %d, failed to setup threshold interrupt "
 		       "for bank %d, block %d (MSR%08X=0x%x%08x)\n", b->cpu,
 		       b->bank, b->block, b->address, hi, lo);
-		return 0;
+		return false;
 	}
 
 	if (apic != msr) {
@@ -399,15 +399,15 @@ static int lvt_off_valid(struct threshold_block *b, int apic, u32 lo, u32 hi)
 		 * was set is reserved. Return early here:
 		 */
 		if (mce_flags.smca)
-			return 0;
+			return false;
 
 		pr_err(FW_BUG "cpu %d, invalid threshold interrupt offset %d "
 		       "for bank %d, block %d (MSR%08X=0x%x%08x)\n",
 		       b->cpu, apic, b->bank, b->block, b->address, hi, lo);
-		return 0;
+		return false;
 	}
 
-	return 1;
+	return true;
 };
 
 /* Reprogram MCx_MISC MSR behind this threshold bank. */
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7fb5556a0b53..167965bd2ac0 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -492,10 +492,10 @@ static noinstr void mce_gather_info(struct mce_hw_err *err, struct pt_regs *regs
 	}
 }
 
-int mce_available(struct cpuinfo_x86 *c)
+bool mce_available(struct cpuinfo_x86 *c)
 {
 	if (mca_cfg.disabled)
-		return 0;
+		return false;
 	return cpu_has(c, X86_FEATURE_MCE) && cpu_has(c, X86_FEATURE_MCA);
 }
 
@@ -1778,7 +1778,7 @@ static void mce_timer_delete_all(void)
  * Can be called from interrupt context, but not from machine check/NMI
  * context.
  */
-int mce_notify_irq(void)
+bool mce_notify_irq(void)
 {
 	/* Not more than two messages every minute */
 	static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
@@ -1789,9 +1789,9 @@ int mce_notify_irq(void)
 		if (__ratelimit(&ratelimit))
 			pr_info(HW_ERR "Machine check events logged\n");
 
-		return 1;
+		return true;
 	}
-	return 0;
+	return false;
 }
 EXPORT_SYMBOL_GPL(mce_notify_irq);
 
@@ -2015,25 +2015,25 @@ static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
 	return 0;
 }
 
-static int __mcheck_cpu_ancient_init(struct cpuinfo_x86 *c)
+static bool __mcheck_cpu_ancient_init(struct cpuinfo_x86 *c)
 {
 	if (c->x86 != 5)
-		return 0;
+		return false;
 
 	switch (c->x86_vendor) {
 	case X86_VENDOR_INTEL:
 		intel_p5_mcheck_init(c);
 		mce_flags.p5 = 1;
-		return 1;
+		return true;
 	case X86_VENDOR_CENTAUR:
 		winchip_mcheck_init(c);
 		mce_flags.winchip = 1;
-		return 1;
+		return true;
 	default:
-		return 0;
+		return false;
 	}
 
-	return 0;
+	return false;
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index b3cd2c61b11d..f863df0ff42c 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -75,12 +75,12 @@ static u16 cmci_threshold[MAX_NR_BANKS];
  */
 #define CMCI_STORM_THRESHOLD	32749
 
-static int cmci_supported(int *banks)
+static bool cmci_supported(int *banks)
 {
 	u64 cap;
 
 	if (mca_cfg.cmci_disabled || mca_cfg.ignore_ce)
-		return 0;
+		return false;
 
 	/*
 	 * Vendor check is not strictly needed, but the initial
@@ -89,10 +89,11 @@ static int cmci_supported(int *banks)
 	 */
 	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
 	    boot_cpu_data.x86_vendor != X86_VENDOR_ZHAOXIN)
-		return 0;
+		return false;
 
 	if (!boot_cpu_has(X86_FEATURE_APIC) || lapic_get_maxlvt() < 6)
-		return 0;
+		return false;
+
 	rdmsrl(MSR_IA32_MCG_CAP, cap);
 	*banks = min_t(unsigned, MAX_NR_BANKS, cap & MCG_BANKCNT_MASK);
 	return !!(cap & MCG_CMCI_P);
-- 
2.17.1
[PATCH v4 2/8] x86/mce/threshold: Remove the redundant this_cpu_dec_return()
Posted by Qiuxu Zhuo 2 weeks ago
The 'storm' variable points to this_cpu_ptr(&storm_desc). Access the
'stormy_bank_count' field through the 'storm' to avoid calling
this_cpu_*() on the same per-CPU variable twice.

This minor optimization reduces the text size by 16 bytes.

  $ size threshold.o.*
     text	   data	    bss	    dec	    hex	filename
     1395	   1664	      0	   3059	    bf3	threshold.o.old
     1379	   1664	      0	   3043	    be3	threshold.o.new

No functional changes intended.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
Changes in v4:
  - No changes.

Changes in v3:
  - Collect "Reviewed-by:" from Nikolay & Sohil.

Changes in v2:
  - Collect "Reviewed-by:" from Tony.

 arch/x86/kernel/cpu/mce/threshold.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index 89e31e1e5c9c..f4a007616468 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -90,7 +90,7 @@ void cmci_storm_end(unsigned int bank)
 	storm->banks[bank].in_storm_mode = false;
 
 	/* If no banks left in storm mode, stop polling. */
-	if (!this_cpu_dec_return(storm_desc.stormy_bank_count))
+	if (!--storm->stormy_bank_count)
 		mce_timer_kick(false);
 }
 
-- 
2.17.1
[PATCH v4 3/8] x86/mce: Make four functions return bool
Posted by Qiuxu Zhuo 2 weeks ago
Make those functions whose callers only care about success or failure
return a boolean value for better readability. Also, update the call
sites accordingly.

No functional changes.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
Changes in v4:
  - New patch.

 arch/x86/kernel/cpu/mce/core.c     | 12 ++++++------
 arch/x86/kernel/cpu/mce/genpool.c  | 29 ++++++++++++++---------------
 arch/x86/kernel/cpu/mce/internal.h |  4 ++--
 3 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 167965bd2ac0..ce6fe5e20805 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -151,7 +151,7 @@ EXPORT_PER_CPU_SYMBOL_GPL(injectm);
 
 void mce_log(struct mce_hw_err *err)
 {
-	if (!mce_gen_pool_add(err))
+	if (mce_gen_pool_add(err))
 		irq_work_queue(&mce_irq_work);
 }
 EXPORT_SYMBOL_GPL(mce_log);
@@ -1911,14 +1911,14 @@ static void __mcheck_cpu_check_banks(void)
 }
 
 /* Add per CPU specific workarounds here */
-static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
+static bool __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
 {
 	struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
 	struct mca_config *cfg = &mca_cfg;
 
 	if (c->x86_vendor == X86_VENDOR_UNKNOWN) {
 		pr_info("unknown CPU type - not enabling MCE support\n");
-		return -EOPNOTSUPP;
+		return false;
 	}
 
 	/* This should be disabled by the BIOS, but isn't always */
@@ -2012,7 +2012,7 @@ static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
 	if (cfg->bootlog != 0)
 		cfg->panic_timeout = 30;
 
-	return 0;
+	return true;
 }
 
 static bool __mcheck_cpu_ancient_init(struct cpuinfo_x86 *c)
@@ -2279,12 +2279,12 @@ void mcheck_cpu_init(struct cpuinfo_x86 *c)
 
 	__mcheck_cpu_cap_init();
 
-	if (__mcheck_cpu_apply_quirks(c) < 0) {
+	if (!__mcheck_cpu_apply_quirks(c)) {
 		mca_cfg.disabled = 1;
 		return;
 	}
 
-	if (mce_gen_pool_init()) {
+	if (!mce_gen_pool_init()) {
 		mca_cfg.disabled = 1;
 		pr_emerg("Couldn't allocate MCE records pool!\n");
 		return;
diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index d0be6dda0c14..3ca9c007a666 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -94,64 +94,63 @@ bool mce_gen_pool_empty(void)
 	return llist_empty(&mce_event_llist);
 }
 
-int mce_gen_pool_add(struct mce_hw_err *err)
+bool mce_gen_pool_add(struct mce_hw_err *err)
 {
 	struct mce_evt_llist *node;
 
 	if (filter_mce(&err->m))
-		return -EINVAL;
+		return false;
 
 	if (!mce_evt_pool)
-		return -EINVAL;
+		return false;
 
 	node = (void *)gen_pool_alloc(mce_evt_pool, sizeof(*node));
 	if (!node) {
 		pr_warn_ratelimited("MCE records pool full!\n");
-		return -ENOMEM;
+		return false;
 	}
 
 	memcpy(&node->err, err, sizeof(*err));
 	llist_add(&node->llnode, &mce_event_llist);
 
-	return 0;
+	return true;
 }
 
-static int mce_gen_pool_create(void)
+static bool mce_gen_pool_create(void)
 {
 	int mce_numrecords, mce_poolsz, order;
 	struct gen_pool *gpool;
-	int ret = -ENOMEM;
 	void *mce_pool;
 
 	order = order_base_2(sizeof(struct mce_evt_llist));
 	gpool = gen_pool_create(order, -1);
 	if (!gpool)
-		return ret;
+		return false;
 
 	mce_numrecords = max(MCE_MIN_ENTRIES, num_possible_cpus() * MCE_PER_CPU);
 	mce_poolsz = mce_numrecords * (1 << order);
 	mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
 	if (!mce_pool) {
 		gen_pool_destroy(gpool);
-		return ret;
+		return false;
 	}
-	ret = gen_pool_add(gpool, (unsigned long)mce_pool, mce_poolsz, -1);
-	if (ret) {
+
+	if (gen_pool_add(gpool, (unsigned long)mce_pool, mce_poolsz, -1)) {
 		gen_pool_destroy(gpool);
 		kfree(mce_pool);
-		return ret;
+		return false;
 	}
 
 	mce_evt_pool = gpool;
 
-	return ret;
+	return true;
 }
 
-int mce_gen_pool_init(void)
+bool mce_gen_pool_init(void)
 {
 	/* Just init mce_gen_pool once. */
 	if (mce_evt_pool)
-		return 0;
+		return true;
 
 	return mce_gen_pool_create();
 }
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 84f810598231..95a504ece43e 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -31,8 +31,8 @@ struct mce_evt_llist {
 
 void mce_gen_pool_process(struct work_struct *__unused);
 bool mce_gen_pool_empty(void);
-int mce_gen_pool_add(struct mce_hw_err *err);
-int mce_gen_pool_init(void);
+bool mce_gen_pool_add(struct mce_hw_err *err);
+bool mce_gen_pool_init(void);
 struct llist_node *mce_gen_pool_prepare_records(void);
 
 int mce_severity(struct mce *a, struct pt_regs *regs, char **msg, bool is_excp);
-- 
2.17.1
Re: [PATCH v4 3/8] x86/mce: Make four functions return bool
Posted by Sohil Mehta 1 week, 6 days ago
On 11/10/2024 10:04 PM, Qiuxu Zhuo wrote:
> Make those functions whose callers only care about success or failure
> return a boolean value for better readability. Also, update the call
> sites accordingly.
> 
> No functional changes.
> 
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> ---

Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>

> Changes in v4:
>   - New patch.
> 
>  arch/x86/kernel/cpu/mce/core.c     | 12 ++++++------
>  arch/x86/kernel/cpu/mce/genpool.c  | 29 ++++++++++++++---------------
>  arch/x86/kernel/cpu/mce/internal.h |  4 ++--
>  3 files changed, 22 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 167965bd2ac0..ce6fe5e20805 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -151,7 +151,7 @@ EXPORT_PER_CPU_SYMBOL_GPL(injectm);
>  
>  void mce_log(struct mce_hw_err *err)
>  {
> -	if (!mce_gen_pool_add(err))
> +	if (mce_gen_pool_add(err))
>  		irq_work_queue(&mce_irq_work);
>  }

The polarities of all the return values have been flipped. They all look
correct but took me a few minutes to figure out. Might be useful to make
a mention in the commit message if you end up doing another version (Not
needed otherwise).

>  EXPORT_SYMBOL_GPL(mce_log);
[PATCH v4 4/8] x86/mce: Break up __mcheck_cpu_apply_quirks()
Posted by Qiuxu Zhuo 2 weeks ago
From: Tony Luck <tony.luck@intel.com>

Split each vendor specific part into its own helper function.

Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
Changes in v4:
  - Add necessary blank lines in apply_quirks_amd() (Yazen).
  - Use 'mca_cfg' instead of 'cfg' in apply_quirks_*(). (Yazen).

Changes in v3:
  - New patch.

 arch/x86/kernel/cpu/mce/core.c | 192 ++++++++++++++++++---------------
 1 file changed, 104 insertions(+), 88 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index ce6fe5e20805..3855ec2ed0e0 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1910,101 +1910,117 @@ static void __mcheck_cpu_check_banks(void)
 	}
 }
 
+static void apply_quirks_amd(struct cpuinfo_x86 *c)
+{
+	struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
+
+	/* This should be disabled by the BIOS, but isn't always */
+	if (c->x86 == 15 && this_cpu_read(mce_num_banks) > 4) {
+		/*
+		 * disable GART TBL walk error reporting, which
+		 * trips off incorrectly with the IOMMU & 3ware
+		 * & Cerberus:
+		 */
+		clear_bit(10, (unsigned long *)&mce_banks[4].ctl);
+	}
+
+	if (c->x86 < 0x11 && mca_cfg.bootlog < 0) {
+		/*
+		 * Lots of broken BIOS around that don't clear them
+		 * by default and leave crap in there. Don't log:
+		 */
+		mca_cfg.bootlog = 0;
+	}
+
+	/*
+	 * Various K7s with broken bank 0 around. Always disable
+	 * by default.
+	 */
+	if (c->x86 == 6 && this_cpu_read(mce_num_banks) > 0)
+		mce_banks[0].ctl = 0;
+
+	/*
+	 * overflow_recov is supported for F15h Models 00h-0fh
+	 * even though we don't have a CPUID bit for it.
+	 */
+	if (c->x86 == 0x15 && c->x86_model <= 0xf)
+		mce_flags.overflow_recov = 1;
+
+	if (c->x86 >= 0x17 && c->x86 <= 0x1A)
+		mce_flags.zen_ifu_quirk = 1;
+}
+
+static void apply_quirks_intel(struct cpuinfo_x86 *c)
+{
+	struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
+
+	/*
+	 * SDM documents that on family 6 bank 0 should not be written
+	 * because it aliases to another special BIOS controlled
+	 * register.
+	 * But it's not aliased anymore on model 0x1a+
+	 * Don't ignore bank 0 completely because there could be a
+	 * valid event later, merely don't write CTL0.
+	 */
+	if (c->x86 == 6 && c->x86_model < 0x1A && this_cpu_read(mce_num_banks) > 0)
+		mce_banks[0].init = false;
+
+	/*
+	 * All newer Intel systems support MCE broadcasting. Enable
+	 * synchronization with a one second timeout.
+	 */
+	if ((c->x86 > 6 || (c->x86 == 6 && c->x86_model >= 0xe)) &&
+	    mca_cfg.monarch_timeout < 0)
+		mca_cfg.monarch_timeout = USEC_PER_SEC;
+
+	/*
+	 * There are also broken BIOSes on some Pentium M and
+	 * earlier systems:
+	 */
+	if (c->x86 == 6 && c->x86_model <= 13 && mca_cfg.bootlog < 0)
+		mca_cfg.bootlog = 0;
+
+	if (c->x86_vfm == INTEL_SANDYBRIDGE_X)
+		mce_flags.snb_ifu_quirk = 1;
+
+	/*
+	 * Skylake, Cascacde Lake and Cooper Lake require a quirk on
+	 * rep movs.
+	 */
+	if (c->x86_vfm == INTEL_SKYLAKE_X)
+		mce_flags.skx_repmov_quirk = 1;
+}
+
+static void apply_quirks_zhaoxin(struct cpuinfo_x86 *c)
+{
+	/*
+	 * All newer Zhaoxin CPUs support MCE broadcasting. Enable
+	 * synchronization with a one second timeout.
+	 */
+	if (c->x86 > 6 || (c->x86_model == 0x19 || c->x86_model == 0x1f)) {
+		if (mca_cfg.monarch_timeout < 0)
+			mca_cfg.monarch_timeout = USEC_PER_SEC;
+	}
+}
+
 /* Add per CPU specific workarounds here */
 static bool __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
 {
-	struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
 	struct mca_config *cfg = &mca_cfg;
 
-	if (c->x86_vendor == X86_VENDOR_UNKNOWN) {
+	switch (c->x86_vendor) {
+	case X86_VENDOR_UNKNOWN:
 		pr_info("unknown CPU type - not enabling MCE support\n");
 		return false;
-	}
-
-	/* This should be disabled by the BIOS, but isn't always */
-	if (c->x86_vendor == X86_VENDOR_AMD) {
-		if (c->x86 == 15 && this_cpu_read(mce_num_banks) > 4) {
-			/*
-			 * disable GART TBL walk error reporting, which
-			 * trips off incorrectly with the IOMMU & 3ware
-			 * & Cerberus:
-			 */
-			clear_bit(10, (unsigned long *)&mce_banks[4].ctl);
-		}
-		if (c->x86 < 0x11 && cfg->bootlog < 0) {
-			/*
-			 * Lots of broken BIOS around that don't clear them
-			 * by default and leave crap in there. Don't log:
-			 */
-			cfg->bootlog = 0;
-		}
-		/*
-		 * Various K7s with broken bank 0 around. Always disable
-		 * by default.
-		 */
-		if (c->x86 == 6 && this_cpu_read(mce_num_banks) > 0)
-			mce_banks[0].ctl = 0;
-
-		/*
-		 * overflow_recov is supported for F15h Models 00h-0fh
-		 * even though we don't have a CPUID bit for it.
-		 */
-		if (c->x86 == 0x15 && c->x86_model <= 0xf)
-			mce_flags.overflow_recov = 1;
-
-		if (c->x86 >= 0x17 && c->x86 <= 0x1A)
-			mce_flags.zen_ifu_quirk = 1;
-
-	}
-
-	if (c->x86_vendor == X86_VENDOR_INTEL) {
-		/*
-		 * SDM documents that on family 6 bank 0 should not be written
-		 * because it aliases to another special BIOS controlled
-		 * register.
-		 * But it's not aliased anymore on model 0x1a+
-		 * Don't ignore bank 0 completely because there could be a
-		 * valid event later, merely don't write CTL0.
-		 */
-
-		if (c->x86 == 6 && c->x86_model < 0x1A && this_cpu_read(mce_num_banks) > 0)
-			mce_banks[0].init = false;
-
-		/*
-		 * All newer Intel systems support MCE broadcasting. Enable
-		 * synchronization with a one second timeout.
-		 */
-		if ((c->x86 > 6 || (c->x86 == 6 && c->x86_model >= 0xe)) &&
-			cfg->monarch_timeout < 0)
-			cfg->monarch_timeout = USEC_PER_SEC;
-
-		/*
-		 * There are also broken BIOSes on some Pentium M and
-		 * earlier systems:
-		 */
-		if (c->x86 == 6 && c->x86_model <= 13 && cfg->bootlog < 0)
-			cfg->bootlog = 0;
-
-		if (c->x86_vfm == INTEL_SANDYBRIDGE_X)
-			mce_flags.snb_ifu_quirk = 1;
-
-		/*
-		 * Skylake, Cascacde Lake and Cooper Lake require a quirk on
-		 * rep movs.
-		 */
-		if (c->x86_vfm == INTEL_SKYLAKE_X)
-			mce_flags.skx_repmov_quirk = 1;
-	}
-
-	if (c->x86_vendor == X86_VENDOR_ZHAOXIN) {
-		/*
-		 * All newer Zhaoxin CPUs support MCE broadcasting. Enable
-		 * synchronization with a one second timeout.
-		 */
-		if (c->x86 > 6 || (c->x86_model == 0x19 || c->x86_model == 0x1f)) {
-			if (cfg->monarch_timeout < 0)
-				cfg->monarch_timeout = USEC_PER_SEC;
-		}
+	case X86_VENDOR_AMD:
+		apply_quirks_amd(c);
+		break;
+	case X86_VENDOR_INTEL:
+		apply_quirks_intel(c);
+		break;
+	case X86_VENDOR_ZHAOXIN:
+		apply_quirks_zhaoxin(c);
+		break;
 	}
 
 	if (cfg->monarch_timeout < 0)
-- 
2.17.1
Re: [PATCH v4 4/8] x86/mce: Break up __mcheck_cpu_apply_quirks()
Posted by Sohil Mehta 1 week, 6 days ago
On 11/10/2024 10:04 PM, Qiuxu Zhuo wrote:
> From: Tony Luck <tony.luck@intel.com>
> 
> Split each vendor specific part into its own helper function.
> 
> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> ---

Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
[PATCH v4 5/8] x86/mce: Convert family/model mixed checks to VFM-based checks
Posted by Qiuxu Zhuo 2 weeks ago
Convert family/model mixed checks to VFM-based checks to make
the code more compact.

Suggested-by: Sohil Mehta <sohil.mehta@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
Changes in v4:
  - No changes but rebased.

Changes in v3:
  - Newly added.

 arch/x86/kernel/cpu/mce/core.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 3855ec2ed0e0..d288cc7390f6 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1954,6 +1954,10 @@ static void apply_quirks_intel(struct cpuinfo_x86 *c)
 {
 	struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
 
+	/* Older CPUs (prior to family 6) don't need quirks. */
+	if (c->x86_vfm < INTEL_PENTIUM_PRO)
+		return;
+
 	/*
 	 * SDM documents that on family 6 bank 0 should not be written
 	 * because it aliases to another special BIOS controlled
@@ -1962,22 +1966,21 @@ static void apply_quirks_intel(struct cpuinfo_x86 *c)
 	 * Don't ignore bank 0 completely because there could be a
 	 * valid event later, merely don't write CTL0.
 	 */
-	if (c->x86 == 6 && c->x86_model < 0x1A && this_cpu_read(mce_num_banks) > 0)
+	if (c->x86_vfm < INTEL_NEHALEM_EP && this_cpu_read(mce_num_banks) > 0)
 		mce_banks[0].init = false;
 
 	/*
 	 * All newer Intel systems support MCE broadcasting. Enable
 	 * synchronization with a one second timeout.
 	 */
-	if ((c->x86 > 6 || (c->x86 == 6 && c->x86_model >= 0xe)) &&
-	    mca_cfg.monarch_timeout < 0)
+	if (c->x86_vfm >= INTEL_CORE_YONAH && mca_cfg.monarch_timeout < 0)
 		mca_cfg.monarch_timeout = USEC_PER_SEC;
 
 	/*
 	 * There are also broken BIOSes on some Pentium M and
 	 * earlier systems:
 	 */
-	if (c->x86 == 6 && c->x86_model <= 13 && mca_cfg.bootlog < 0)
+	if (c->x86_vfm < INTEL_CORE_YONAH && mca_cfg.bootlog < 0)
 		mca_cfg.bootlog = 0;
 
 	if (c->x86_vfm == INTEL_SANDYBRIDGE_X)
-- 
2.17.1
Re: [PATCH v4 5/8] x86/mce: Convert family/model mixed checks to VFM-based checks
Posted by Yazen Ghannam 1 week, 6 days ago
On Mon, Nov 11, 2024 at 02:04:25PM +0800, Qiuxu Zhuo wrote:
> Convert family/model mixed checks to VFM-based checks to make
> the code more compact.
> 
> Suggested-by: Sohil Mehta <sohil.mehta@intel.com>
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> ---
> Changes in v4:
>   - No changes but rebased.
> 
> Changes in v3:
>   - Newly added.
> 
>  arch/x86/kernel/cpu/mce/core.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 3855ec2ed0e0..d288cc7390f6 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1954,6 +1954,10 @@ static void apply_quirks_intel(struct cpuinfo_x86 *c)
>  {
>  	struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
>  
> +	/* Older CPUs (prior to family 6) don't need quirks. */
> +	if (c->x86_vfm < INTEL_PENTIUM_PRO)
> +		return;
> +

Is it possible for pre-"family 6" to get here?

Family 5 is "ancient" which has its own MCE init path. And I assume
anything older doesn't support MCE/MCA. Is this correct?

>  	/*
>  	 * SDM documents that on family 6 bank 0 should not be written
>  	 * because it aliases to another special BIOS controlled
> @@ -1962,22 +1966,21 @@ static void apply_quirks_intel(struct cpuinfo_x86 *c)
>  	 * Don't ignore bank 0 completely because there could be a
>  	 * valid event later, merely don't write CTL0.
>  	 */
> -	if (c->x86 == 6 && c->x86_model < 0x1A && this_cpu_read(mce_num_banks) > 0)
> +	if (c->x86_vfm < INTEL_NEHALEM_EP && this_cpu_read(mce_num_banks) > 0)

The "> 0" is not needed, since mce_num_banks is unsigned int. 

Otherwise, looks good.

Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>

Thanks,
Yazen
RE: [PATCH v4 5/8] x86/mce: Convert family/model mixed checks to VFM-based checks
Posted by Zhuo, Qiuxu 1 week, 5 days ago
Hi Yazen,

> From: Yazen Ghannam <yazen.ghannam@amd.com>
> [...]
> > @@ -1954,6 +1954,10 @@ static void apply_quirks_intel(struct
> > cpuinfo_x86 *c)  {
> >  	struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
> >
> > +	/* Older CPUs (prior to family 6) don't need quirks. */
> > +	if (c->x86_vfm < INTEL_PENTIUM_PRO)
> > +		return;
> > +
> 
> Is it possible for pre-"family 6" to get here?
> 
> Family 5 is "ancient" which has its own MCE init path. And I assume anything
> older doesn't support MCE/MCA. Is this correct?

Yes, there is an early return in __mcheck_cpu_ancient_init() for Family 5. 
However, this code explicitly indicates that "prior to families 6 don't need quirks"
and addresses concerns like:

   https://lore.kernel.org/all/dcfdba92-7004-413d-8011-12771636d11f@intel.com/

> >  	/*
> >  	 * SDM documents that on family 6 bank 0 should not be written
> >  	 * because it aliases to another special BIOS controlled @@ -1962,22
> > +1966,21 @@ static void apply_quirks_intel(struct cpuinfo_x86 *c)
> >  	 * Don't ignore bank 0 completely because there could be a
> >  	 * valid event later, merely don't write CTL0.
> >  	 */
> > -	if (c->x86 == 6 && c->x86_model < 0x1A &&
> this_cpu_read(mce_num_banks) > 0)
> > +	if (c->x86_vfm < INTEL_NEHALEM_EP &&
> this_cpu_read(mce_num_banks) >
> > +0)
> 
> The "> 0" is not needed, since mce_num_banks is unsigned int.

I don't get your point here. 
But it needs to check for the case where mce_num_banks == 0.

> Otherwise, looks good.
> 
> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>

Thanks!
-Qiuxu
Re: [PATCH v4 5/8] x86/mce: Convert family/model mixed checks to VFM-based checks
Posted by Yazen Ghannam 1 week, 5 days ago
On Wed, Nov 13, 2024 at 12:10:31PM +0000, Zhuo, Qiuxu wrote:
> Hi Yazen,
> 
> > From: Yazen Ghannam <yazen.ghannam@amd.com>
> > [...]
> > > @@ -1954,6 +1954,10 @@ static void apply_quirks_intel(struct
> > > cpuinfo_x86 *c)  {
> > >  	struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
> > >
> > > +	/* Older CPUs (prior to family 6) don't need quirks. */
> > > +	if (c->x86_vfm < INTEL_PENTIUM_PRO)
> > > +		return;
> > > +
> > 
> > Is it possible for pre-"family 6" to get here?
> > 
> > Family 5 is "ancient" which has its own MCE init path. And I assume anything
> > older doesn't support MCE/MCA. Is this correct?
> 
> Yes, there is an early return in __mcheck_cpu_ancient_init() for Family 5. 
> However, this code explicitly indicates that "prior to families 6 don't need quirks"
> and addresses concerns like:
> 
>    https://lore.kernel.org/all/dcfdba92-7004-413d-8011-12771636d11f@intel.com/
>

Right, but my point is that this check would never be executed, since
the older systems would not get here during init. So this seems like
dead code.

> > >  	/*
> > >  	 * SDM documents that on family 6 bank 0 should not be written
> > >  	 * because it aliases to another special BIOS controlled @@ -1962,22
> > > +1966,21 @@ static void apply_quirks_intel(struct cpuinfo_x86 *c)
> > >  	 * Don't ignore bank 0 completely because there could be a
> > >  	 * valid event later, merely don't write CTL0.
> > >  	 */
> > > -	if (c->x86 == 6 && c->x86_model < 0x1A &&
> > this_cpu_read(mce_num_banks) > 0)
> > > +	if (c->x86_vfm < INTEL_NEHALEM_EP &&
> > this_cpu_read(mce_num_banks) >
> > > +0)
> > 
> > The "> 0" is not needed, since mce_num_banks is unsigned int.
> 
> I don't get your point here. 
> But it needs to check for the case where mce_num_banks == 0.
>

The check is "mce_num_banks > 0", and mce_num_banks is an unsigned int.
Therefore, the check is reduced to "mce_num_banks != 0". In this case,
you can just do "if (mce_num_banks)" to the same effect.

Thanks,
Yazen
RE: [PATCH v4 5/8] x86/mce: Convert family/model mixed checks to VFM-based checks
Posted by Zhuo, Qiuxu 1 week, 4 days ago
Hi Yazen,

> From: Yazen Ghannam <yazen.ghannam@amd.com>
> [...]
> > > > +	if (c->x86_vfm < INTEL_NEHALEM_EP &&
> > > this_cpu_read(mce_num_banks) >
> > > > +0)
> > >
> > > The "> 0" is not needed, since mce_num_banks is unsigned int.
> >
> > I don't get your point here.
> > But it needs to check for the case where mce_num_banks == 0.
> >
> 
> The check is "mce_num_banks > 0", and mce_num_banks is an unsigned int.
> Therefore, the check is reduced to "mce_num_banks != 0". In this case, you
> can just do "if (mce_num_banks)" to the same effect.

I got you. OK, if nobody else objects, I'll update it in the next version.
[ Hope others won't blame this as over-optimization. ]

-Qiuxu
Re: [PATCH v4 5/8] x86/mce: Convert family/model mixed checks to VFM-based checks
Posted by Sohil Mehta 1 week, 6 days ago
On 11/10/2024 10:04 PM, Qiuxu Zhuo wrote:
> Convert family/model mixed checks to VFM-based checks to make
> the code more compact.
> 
> Suggested-by: Sohil Mehta <sohil.mehta@intel.com>
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> ---

Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
[PATCH v4 6/8] x86/mce: Remove the unnecessary {}
Posted by Qiuxu Zhuo 2 weeks ago
Remove the unnecessary {} from the case statement.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
Changes in v4:
  - No changes.

Changes in v3:
  - Collect "Reviewed-by:" from Nikolay & Sohil.

Changes in v2:
  - Collect "Reviewed-by:" from Tony.

 arch/x86/kernel/cpu/mce/core.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index d288cc7390f6..0f0c6e9d9183 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2118,10 +2118,9 @@ static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
 		mce_intel_feature_init(c);
 		break;
 
-	case X86_VENDOR_AMD: {
+	case X86_VENDOR_AMD:
 		mce_amd_feature_init(c);
 		break;
-		}
 
 	case X86_VENDOR_HYGON:
 		mce_hygon_feature_init(c);
-- 
2.17.1
Re: [PATCH v4 6/8] x86/mce: Remove the unnecessary {}
Posted by Yazen Ghannam 1 week, 6 days ago
On Mon, Nov 11, 2024 at 02:04:26PM +0800, Qiuxu Zhuo wrote:
> Remove the unnecessary {} from the case statement.
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>

Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>

But please see note below.

> ---
> Changes in v4:
>   - No changes.
> 
> Changes in v3:
>   - Collect "Reviewed-by:" from Nikolay & Sohil.
> 
> Changes in v2:
>   - Collect "Reviewed-by:" from Tony.
> 
>  arch/x86/kernel/cpu/mce/core.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index d288cc7390f6..0f0c6e9d9183 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -2118,10 +2118,9 @@ static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
>  		mce_intel_feature_init(c);
>  		break;
>  
> -	case X86_VENDOR_AMD: {
> +	case X86_VENDOR_AMD:
>  		mce_amd_feature_init(c);
>  		break;
> -		}
>  
>  	case X86_VENDOR_HYGON:
>  		mce_hygon_feature_init(c);
> --

I think this could be a bit more substantive if you also combine the AMD
and HYGON cases. And remove mce_hygon_feature_init() which just calls
mce_amd_feature_init() anyway.

Thanks,
Yazen
RE: [PATCH v4 6/8] x86/mce: Remove the unnecessary {}
Posted by Zhuo, Qiuxu 1 week, 5 days ago
Hi Yazen,

> From: Yazen Ghannam <yazen.ghannam@amd.com>
> [...]
> 
> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>

Thanks!

> But please see note below.
> [...]
> > -	case X86_VENDOR_AMD: {
> > +	case X86_VENDOR_AMD:
> >  		mce_amd_feature_init(c);
> >  		break;
> > -		}
> >
> >  	case X86_VENDOR_HYGON:
> >  		mce_hygon_feature_init(c);
> > --
> 
> I think this could be a bit more substantive if you also combine the AMD and
> HYGON cases. And remove mce_hygon_feature_init() which just calls
> mce_amd_feature_init() anyway.

How about a separate patch for this? 
If It's OK for you, I'll follow up on it after the current patch series has settled.

-Qiuxu
Re: [PATCH v4 6/8] x86/mce: Remove the unnecessary {}
Posted by Borislav Petkov 1 week, 5 days ago
On Wed, Nov 13, 2024 at 01:32:08PM +0000, Zhuo, Qiuxu wrote:
> How about a separate patch for this? 

Can you drop this micro-change per patch? Just do a single patch here which
fixes up everything mentioned during review in that area and be done with it.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
RE: [PATCH v4 6/8] x86/mce: Remove the unnecessary {}
Posted by Zhuo, Qiuxu 1 week, 4 days ago
Hi Boris,

> From: Borislav Petkov <bp@alien8.de>
> [...]
> 
> Can you drop this micro-change per patch? Just do a single patch here which
> fixes up everything mentioned during review in that area and be done with it.

OK. I'll drop this one and replace it with Yazen's suggestion in next version.

-Qiuxu
Re: [PATCH v4 6/8] x86/mce: Remove the unnecessary {}
Posted by Yazen Ghannam 1 week, 5 days ago
On Wed, Nov 13, 2024 at 01:32:08PM +0000, Zhuo, Qiuxu wrote:
> Hi Yazen,
> 
> > From: Yazen Ghannam <yazen.ghannam@amd.com>
> > [...]
> > 
> > Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
> 
> Thanks!
> 
> > But please see note below.
> > [...]
> > > -	case X86_VENDOR_AMD: {
> > > +	case X86_VENDOR_AMD:
> > >  		mce_amd_feature_init(c);
> > >  		break;
> > > -		}
> > >
> > >  	case X86_VENDOR_HYGON:
> > >  		mce_hygon_feature_init(c);
> > > --
> > 
> > I think this could be a bit more substantive if you also combine the AMD and
> > HYGON cases. And remove mce_hygon_feature_init() which just calls
> > mce_amd_feature_init() anyway.
> 
> How about a separate patch for this? 
> If It's OK for you, I'll follow up on it after the current patch series has settled.
>

Sure thing. No problem.

Thanks,
Yazen
[PATCH v4 7/8] x86/mce/amd: Remove unnecessary NULL pointer initializations
Posted by Qiuxu Zhuo 2 weeks ago
Remove unnecessary NULL pointer initializations from variables that
are already initialized before use.

Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
Changes in v4:
 - No changes.

Changes in v3:
 - Collect "Reviewed-by:" Nikolay & Sohil.
 - Remove the variables' names from the commit message (Sohil).

 arch/x86/kernel/cpu/mce/amd.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 018874b554cb..c79a82912d38 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -921,8 +921,8 @@ static void log_and_reset_block(struct threshold_block *block)
  */
 static void amd_threshold_interrupt(void)
 {
-	struct threshold_block *first_block = NULL, *block = NULL, *tmp = NULL;
 	struct threshold_bank **bp = this_cpu_read(threshold_banks);
+	struct threshold_block *first_block, *block, *tmp;
 	unsigned int bank, cpu = smp_processor_id();
 
 	/*
@@ -1201,8 +1201,7 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
 static int __threshold_add_blocks(struct threshold_bank *b)
 {
 	struct list_head *head = &b->blocks->miscj;
-	struct threshold_block *pos = NULL;
-	struct threshold_block *tmp = NULL;
+	struct threshold_block *pos, *tmp;
 	int err = 0;
 
 	err = kobject_add(&b->blocks->kobj, b->kobj, b->blocks->kobj.name);
@@ -1312,8 +1311,7 @@ static void deallocate_threshold_blocks(struct threshold_bank *bank)
 
 static void __threshold_remove_blocks(struct threshold_bank *b)
 {
-	struct threshold_block *pos = NULL;
-	struct threshold_block *tmp = NULL;
+	struct threshold_block *pos, *tmp;
 
 	kobject_put(b->kobj);
 
-- 
2.17.1
Re: [PATCH v4 7/8] x86/mce/amd: Remove unnecessary NULL pointer initializations
Posted by Yazen Ghannam 1 week, 6 days ago
On Mon, Nov 11, 2024 at 02:04:27PM +0800, Qiuxu Zhuo wrote:
> Remove unnecessary NULL pointer initializations from variables that
> are already initialized before use.
> 
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> ---

Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>

Thanks,
Yazen
[PATCH v4 8/8] x86/mce: Fix typos
Posted by Qiuxu Zhuo 2 weeks ago
Fix typos in comments.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
Changes in v4:
  - No changes.

Changes in v3:
  - Collect "Reviewed-by:" from Nikolay & Sohil.
  - Remove the detail typos from the commit message (Sohil).

Changes in v2:
  - Collect "Reviewed-by:" from Tony.

 arch/x86/kernel/cpu/mce/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 0f0c6e9d9183..6e194ccffc7c 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1144,7 +1144,7 @@ static noinstr int mce_start(int *no_way_out)
 	} else {
 		/*
 		 * Subject: Now start the scanning loop one by one in
-		 * the original callin order.
+		 * the original calling order.
 		 * This way when there are any shared banks it will be
 		 * only seen by one CPU before cleared, avoiding duplicates.
 		 */
@@ -1917,7 +1917,7 @@ static void apply_quirks_amd(struct cpuinfo_x86 *c)
 	/* This should be disabled by the BIOS, but isn't always */
 	if (c->x86 == 15 && this_cpu_read(mce_num_banks) > 4) {
 		/*
-		 * disable GART TBL walk error reporting, which
+		 * disable GART TLB walk error reporting, which
 		 * trips off incorrectly with the IOMMU & 3ware
 		 * & Cerberus:
 		 */
-- 
2.17.1
Re: [PATCH v4 8/8] x86/mce: Fix typos
Posted by Yazen Ghannam 1 week, 6 days ago
On Mon, Nov 11, 2024 at 02:04:28PM +0800, Qiuxu Zhuo wrote:
> Fix typos in comments.
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> ---
> Changes in v4:
>   - No changes.
> 
> Changes in v3:
>   - Collect "Reviewed-by:" from Nikolay & Sohil.
>   - Remove the detail typos from the commit message (Sohil).
> 
> Changes in v2:
>   - Collect "Reviewed-by:" from Tony.
> 
>  arch/x86/kernel/cpu/mce/core.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 0f0c6e9d9183..6e194ccffc7c 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1144,7 +1144,7 @@ static noinstr int mce_start(int *no_way_out)
>  	} else {
>  		/*
>  		 * Subject: Now start the scanning loop one by one in
> -		 * the original callin order.
> +		 * the original calling order.

I don't think this is a typo. It seems to refer to the mce_callin
variable/idea.

For example, each CPU "calls in" when ready. This is independent of when
each CPU is "called" to do something.

	CPUs are called in this order 0, 1, 2.
	CPUs "call in" in this order 1, 0, 2.

When a CPU is called can be different from when it responds.

Maybe I'm reading too much into this. :/

>  		 * This way when there are any shared banks it will be
>  		 * only seen by one CPU before cleared, avoiding duplicates.
>  		 */
> @@ -1917,7 +1917,7 @@ static void apply_quirks_amd(struct cpuinfo_x86 *c)
>  	/* This should be disabled by the BIOS, but isn't always */
>  	if (c->x86 == 15 && this_cpu_read(mce_num_banks) > 4) {
>  		/*
> -		 * disable GART TBL walk error reporting, which
> +		 * disable GART TLB walk error reporting, which

This also is not a typo. TBL -> table

From old AMD K8 BKDG document:
	10 GartTblWkEn GART Table Walk Error Reporting Enable R/W 0

Thanks,
Yazen
RE: [PATCH v4 8/8] x86/mce: Fix typos
Posted by Zhuo, Qiuxu 1 week, 5 days ago
Hi Yazen,

> From: Yazen Ghannam <yazen.ghannam@amd.com>
> [...]
> > @@ -1144,7 +1144,7 @@ static noinstr int mce_start(int *no_way_out)
> >  	} else {
> >  		/*
> >  		 * Subject: Now start the scanning loop one by one in
> > -		 * the original callin order.
> > +		 * the original calling order.
> 
> I don't think this is a typo. It seems to refer to the mce_callin variable/idea.
> 
> For example, each CPU "calls in" when ready. This is independent of when
> each CPU is "called" to do something.
> 
> 	CPUs are called in this order 0, 1, 2.
> 	CPUs "call in" in this order 1, 0, 2.
> 
> When a CPU is called can be different from when it responds.
> 
> Maybe I'm reading too much into this. :/

Too finicky to me :(
But I appreciate you sharing your reading and thoughts. 😊

-Qiuxu
Re: [PATCH v4 8/8] x86/mce: Fix typos
Posted by Sohil Mehta 1 week, 5 days ago
On 11/12/2024 7:38 AM, Yazen Ghannam wrote:
> 
>>  		 * This way when there are any shared banks it will be
>>  		 * only seen by one CPU before cleared, avoiding duplicates.
>>  		 */
>> @@ -1917,7 +1917,7 @@ static void apply_quirks_amd(struct cpuinfo_x86 *c)
>>  	/* This should be disabled by the BIOS, but isn't always */
>>  	if (c->x86 == 15 && this_cpu_read(mce_num_banks) > 4) {
>>  		/*
>> -		 * disable GART TBL walk error reporting, which
>> +		 * disable GART TLB walk error reporting, which
> 
> This also is not a typo. TBL -> table
> 
> From old AMD K8 BKDG document:
> 	10 GartTblWkEn GART Table Walk Error Reporting Enable R/W 0
> 

There is another comment in init_amd_gh() that seems to be related to
the same thing and similarly worded. That seems to refer to TLB instead
of TBL(table).

/*
 * Disable GART TLB Walk Errors on Fam10h. We do this here because this
 * is always needed when GART is enabled, even in a kernel which has no
 * MCE support built in. BIOS should disable GartTlbWlk Errors already.
 * If it doesn't, we do it here as suggested by the BKDG.
 *
 * Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=33012
 */
msr_set_bit(MSR_AMD64_MCx_MASK(4), 10);
Re: [PATCH v4 8/8] x86/mce: Fix typos
Posted by Yazen Ghannam 1 week, 5 days ago
On Tue, Nov 12, 2024 at 02:35:32PM -0800, Sohil Mehta wrote:
> On 11/12/2024 7:38 AM, Yazen Ghannam wrote:
> > 
> >>  		 * This way when there are any shared banks it will be
> >>  		 * only seen by one CPU before cleared, avoiding duplicates.
> >>  		 */
> >> @@ -1917,7 +1917,7 @@ static void apply_quirks_amd(struct cpuinfo_x86 *c)
> >>  	/* This should be disabled by the BIOS, but isn't always */
> >>  	if (c->x86 == 15 && this_cpu_read(mce_num_banks) > 4) {
> >>  		/*
> >> -		 * disable GART TBL walk error reporting, which
> >> +		 * disable GART TLB walk error reporting, which
> > 
> > This also is not a typo. TBL -> table
> > 
> > From old AMD K8 BKDG document:
> > 	10 GartTblWkEn GART Table Walk Error Reporting Enable R/W 0
> > 
> 
> There is another comment in init_amd_gh() that seems to be related to
> the same thing and similarly worded. That seems to refer to TLB instead
> of TBL(table).
> 
> /*
>  * Disable GART TLB Walk Errors on Fam10h. We do this here because this
>  * is always needed when GART is enabled, even in a kernel which has no
>  * MCE support built in. BIOS should disable GartTlbWlk Errors already.
>  * If it doesn't, we do it here as suggested by the BKDG.
>  *
>  * Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=33012
>  */
> msr_set_bit(MSR_AMD64_MCx_MASK(4), 10);
> 
>

Now I think *that* is a typo, since it doesn't match the documentation.
:)

Thanks,
Yazen
RE: [PATCH v4 8/8] x86/mce: Fix typos
Posted by Zhuo, Qiuxu 1 week, 4 days ago
> From: Yazen Ghannam <yazen.ghannam@amd.com>
> [...]
> > >> -		 * disable GART TBL walk error reporting, which
> > >> +		 * disable GART TLB walk error reporting, which
> > >
> > > This also is not a typo. TBL -> table
> > >
> > > From old AMD K8 BKDG document:
> > > 	10 GartTblWkEn GART Table Walk Error Reporting Enable R/W 0
> > >
> >
> > There is another comment in init_amd_gh() that seems to be related to
> > the same thing and similarly worded. That seems to refer to TLB
> > instead of TBL(table).
> >
> > /*
> >  * Disable GART TLB Walk Errors on Fam10h. We do this here because
> > this
> >  * is always needed when GART is enabled, even in a kernel which has
> > no
> >  * MCE support built in. BIOS should disable GartTlbWlk Errors already.
> >  * If it doesn't, we do it here as suggested by the BKDG.
> >  *
> >  * Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=33012
> >  */

Thanks Sohil for this reference. 

> > msr_set_bit(MSR_AMD64_MCx_MASK(4), 10);
> >
> >
> 
> Now I think *that* is a typo, since it doesn't match the documentation.
> :)

As Yazen has the documentation that says TBL (table) is the intended meaning. 
I'll drop this patch in next version. 

-Qiuxu
Re: [PATCH v4 8/8] x86/mce: Fix typos
Posted by Sohil Mehta 1 week, 6 days ago
On 11/10/2024 10:04 PM, Qiuxu Zhuo wrote:
> Fix typos in comments.
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> ---

It might have been fine to merge this with patch 6/8. They both touch
the same file and the changes are mostly superficial. Probably consider
it if you do another revision but not needed otherwise.

>  arch/x86/kernel/cpu/mce/core.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 0f0c6e9d9183..6e194ccffc7c 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1144,7 +1144,7 @@ static noinstr int mce_start(int *no_way_out)
>  	} else {
>  		/*
>  		 * Subject: Now start the scanning loop one by one in
> -		 * the original callin order.
> +		 * the original calling order.
>  		 * This way when there are any shared banks it will be
>  		 * only seen by one CPU before cleared, avoiding duplicates.
>  		 */
> @@ -1917,7 +1917,7 @@ static void apply_quirks_amd(struct cpuinfo_x86 *c)
>  	/* This should be disabled by the BIOS, but isn't always */
>  	if (c->x86 == 15 && this_cpu_read(mce_num_banks) > 4) {
>  		/*
> -		 * disable GART TBL walk error reporting, which
> +		 * disable GART TLB walk error reporting, which
>  		 * trips off incorrectly with the IOMMU & 3ware
>  		 * & Cerberus:
>  		 */