Hi AceLan,
> From: AceLan Kao <acelan.kao@canonical.com>
> [...]
> > Which CPU did you test it on?
> It's an on going project, there is no CPU name on it.
> $ lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Address sizes: 52 bits physical, 57 bits virtual
> Byte Order: Little Endian
> CPU(s): 172
> On-line CPU(s) list: 0-171
> Vendor ID: GenuineIntel
> Model name: Genuine Intel(R) 0000
> CPU family: 6
> Model: 173
This is the CPU with the code name "Granite Rapids".
> Thread(s) per core: 2
> Core(s) per socket: 86
> Socket(s): 1
> Stepping: 1
> CPU(s) scaling MHz: 18%
> CPU max MHz: 4800.0000
> CPU min MHz: 800.0000
> BogoMIPS: 3800.00
>
> > Would you mind taking a complete dmesg log with the kernel option
> > CONFIG_EDAC_DEBUG=y (your current log showed this option had been
> > enabled)?
> Sure, here you are.
Thanks so much for your log.
We've encountered the same issue recently due to the BIOS disabling the
memory controller when no DIMMs are populated, leading to invalid values
of the disabled memory controller register and the call trace you reported.
Attached is a patch that skips DIMM enumeration on a disabled memory
controller to fix the call trace. Could you please test this patch on your machines
and share the dmesg log?
Thanks!
-Qiuxu
From 4de20bd2e7e669c7a16be33c1ebb4106a5479b69 Mon Sep 17 00:00:00 2001
From: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Date: Wed, 30 Jul 2025 22:32:33 +0800
Subject: [PATCH 1/1] EDAC/i10nm: Skip DIMM enumeration on a disabled memory
 controller
When loading the i10nm_edac driver on some Intel Granite Rapids servers,
a call trace may appear as follows:
  UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:453:16
  shift exponent -66 is negative
  ...
  __ubsan_handle_shift_out_of_bounds+0x1e3/0x390
  skx_get_dimm_info.cold+0x47/0xd40 [skx_edac_common]
  i10nm_get_dimm_config+0x23e/0x390 [i10nm_edac]
  skx_register_mci+0x159/0x220 [skx_edac_common]
  i10nm_init+0xcb0/0x1ff0 [i10nm_edac]
  ...
This occurs because some BIOS may disable a memory controller if there
aren't any memory DIMMs populated on this memory controller. The DIMMMTR
register of this disabled memory controller contains the invalid value
~0, resulting in the call trace above.
Fix this call trace by skipping DIMM enumeration on a disabled memory
controller.
Fixes: ba987eaaabf9 ("EDAC/i10nm: Add Intel Granite Rapids server support")
Reported-by: Jose Jesus Ambriz Meza <jose.jesus.ambriz.meza@intel.com>
Reported-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
Closes: https://lore.kernel.org/all/20250730063155.2612379-1-acelan.kao@canonical.com/
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
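Note (illustration only, not part of the commit message): one plausible
reading of "shift exponent -66" is that three range-checked field decodes
each fail with -EINVAL (22 on Linux) when the register reads back as ~0,
and their sum is then used as a shift count. The userspace sketch below
models that. The helper names, bit positions, and valid ranges are
assumptions made for the example; they are not quoted from this thread.

```c
#include <errno.h>
#include <stdint.h>

/* Extract bits lo..hi (inclusive) of v. Userspace stand-in for a
 * kernel-style bitfield helper; hypothetical, for illustration only. */
static uint32_t get_bitfield(uint32_t v, unsigned int lo, unsigned int hi)
{
	unsigned int width = hi - lo + 1;
	uint32_t mask = (width >= 32) ? 0xffffffffu : ((1u << width) - 1u);

	return (v >> lo) & mask;
}

/* Range-checked decode: returns -EINVAL if the raw field is out of range,
 * otherwise the raw value plus a fixed offset. */
static int dimm_attr(uint32_t reg, unsigned int lo, unsigned int hi,
		     int add, uint32_t minval, uint32_t maxval)
{
	uint32_t val = get_bitfield(reg, lo, hi);

	if (val < minval || val > maxval)
		return -EINVAL;

	return (int)val + add;
}

/* Sum of the three decoded attributes, i.e. the value that would be used
 * as a shift exponent. Field layout below is an ASSUMPTION. */
static int shift_exponent(uint32_t mtr)
{
	int ranks = dimm_attr(mtr, 12, 13, 0, 0, 2);	/* assumed: bits 13:12 */
	int rows  = dimm_attr(mtr, 2, 4, 12, 1, 6);	/* assumed: bits 4:2 */
	int cols  = dimm_attr(mtr, 0, 1, 10, 0, 2);	/* assumed: bits 1:0 */

	return rows + cols + ranks;
}
```

With an all-ones register every field is out of range, so
shift_exponent(0xffffffff) evaluates to -3 * EINVAL. Skipping the channel
before any decode, as the patch does, avoids reaching the shift at all.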
drivers/edac/i10nm_base.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/drivers/edac/i10nm_base.c b/drivers/edac/i10nm_base.c
index a3fca2567752..103fbf602d1b 100644
--- a/drivers/edac/i10nm_base.c
+++ b/drivers/edac/i10nm_base.c
@@ -1047,6 +1047,15 @@ static bool i10nm_check_ecc(struct skx_imc *imc, int chan)
 	return !!GET_BITFIELD(mcmtr, 2, 2);
 }

+static bool i10nm_channel_disabled(struct skx_imc *imc, int chan)
+{
+	u32 mcmtr = I10NM_GET_MCMTR(imc, chan);
+
+	edac_dbg(1, "ch%d mcmtr reg %x\n", chan, mcmtr);
+
+	return (mcmtr == ~0 || GET_BITFIELD(mcmtr, 18, 18));
+}
+
 static int i10nm_get_dimm_config(struct mem_ctl_info *mci,
 				 struct res_config *cfg)
 {
@@ -1060,6 +1069,11 @@ static int i10nm_get_dimm_config(struct mem_ctl_info *mci,
 			if (!imc->mbase)
 				continue;

+			if (i10nm_channel_disabled(imc, i)) {
+				edac_dbg(1, "mc%d ch%d is disabled.\n", imc->mc, i);
+				continue;
+			}
+
 			ndimms = 0;

 			if (res_cfg->type != GNR)
base-commit: 038d61fd642278bab63ee8ef722c50d10ab01e8f
--
2.43.0
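
For anyone who wants to reason about the new check outside the kernel, the
predicate added by the patch can be sketched in userspace roughly as
follows. This is a minimal sketch: it takes the register value directly
instead of reading it through I10NM_GET_MCMTR(), and the helper names
bit_set() and channel_disabled() are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the given bit (0..31) is set in v. Hypothetical helper. */
static bool bit_set(uint32_t v, unsigned int bit)
{
	return (v >> bit) & 1u;
}

/* Models the condition in the patch's i10nm_channel_disabled():
 * treat the channel as disabled when the MCMTR value reads back as
 * all ones, or when bit 18 is set. */
static bool channel_disabled(uint32_t mcmtr)
{
	return mcmtr == UINT32_MAX || bit_set(mcmtr, 18);
}
```

The all-ones comparison comes first so that an invalid read is classified
as disabled without interpreting any individual bit of the bogus value.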
On Thu, Jul 31, 2025 at 12:33 AM Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
>
> Hi AceLan,
>
> > From: AceLan Kao <acelan.kao@canonical.com>
> > [...]
> > > Which CPU did you test it on?
> > It's an on going project, there is no CPU name on it.
> > $ lscpu
> > Architecture: x86_64
> > CPU op-mode(s): 32-bit, 64-bit
> > Address sizes: 52 bits physical, 57 bits virtual
> > Byte Order: Little Endian
> > CPU(s): 172
> > On-line CPU(s) list: 0-171
> > Vendor ID: GenuineIntel
> > Model name: Genuine Intel(R) 0000
> > CPU family: 6
> > Model: 173
>
> This is the CPU with the code name "Granite Rapids".
>
> > Thread(s) per core: 2
> > Core(s) per socket: 86
> > Socket(s): 1
> > Stepping: 1
> > CPU(s) scaling MHz: 18%
> > CPU max MHz: 4800.0000
> > CPU min MHz: 800.0000
> > BogoMIPS: 3800.00
> >
> > > Would you mind taking a complete dmesg log with the kernel option
> > > CONFIG_EDAC_DEBUG=y (your current log showed this option had been
> > > enabled)?
> > Sure, here you are.
>
> Thanks so much for your log.
>
> We've encountered the same issue recently due to the BIOS disabling the
> memory controller when no DIMMs are populated, leading to invalid values
> of the disabled memory controller register and the call trace you reported.
>
> Attached is a patch that skips DIMM enumeration on a disabled memory
> controller to fix the call trace. Could you please test this patch on your machines
> and share the dmesg log?

Yes, this works for me.

Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>

>
> Thanks!
> -Qiuxu
> From: AceLan Kao <acelan.kao@canonical.com>
> [...]
> > Attached is a patch that skips DIMM enumeration on a disabled memory
> > controller to fix the call trace. Could you please test this patch on
> > your machines and share the dmesg log?
> Yes, this works for me.

Thanks for your testing feedback.

>
> Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>

I'll add your "Tested-by" in the commit message.

Thanks!
-Qiuxu