The skx_get_dimm_attr() function can return a negative error code,
which is then assigned to 'ranks', 'rows', or 'cols'.
[ 9.344702] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)
[ 9.344703] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff)
[ 9.344703] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff)
[ 9.344704] ------------[ cut here ]------------
[ 9.344705] UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:453:2
[ 9.344707] shift exponent -66 is negative
All three values (rows, cols, and ranks) are -EINVAL (-22), so the expression
1ull << (rows + cols + ranks)
becomes
1ull << ((-22) + (-22) + (-22))
which triggers the "shift exponent -66 is negative" error.
Add a check to ensure that 'ranks', 'rows', and 'cols' are not
negative before they are used in the size calculation. This prevents
the use of invalid values.
Fixes: 88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac")
Signed-off-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
---
drivers/edac/skx_common.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c
index 39c733dbc5b9..36dd14320d70 100644
--- a/drivers/edac/skx_common.c
+++ b/drivers/edac/skx_common.c
@@ -436,6 +436,9 @@ int skx_get_dimm_info(u32 mtr, u32 mcmtr, u32 amap, struct dimm_info *dimm,
rows = numrow(mtr);
cols = imc->hbm_mc ? 6 : numcol(mtr);
+ if (ranks < 0 || rows < 0 || cols < 0)
+ return 0;
+
if (imc->hbm_mc) {
banks = 32;
mtype = MEM_HBM2;
--
2.43.0
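For readers following the arithmetic above, here is a minimal user-space sketch of how a register value of 0xffffffff ends up as three -EINVAL results and why a guard before the shift avoids the negative exponent. It is not the driver code: get_bitfield(), dimm_attr(), dimm_shift_exponent() and dimm_units() are hypothetical stand-ins, and the bit positions and valid ranges are assumptions chosen only so that an all-ones register decodes to the logged values (ranks = 3, rows = 7, cols = 3).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Extract bits lo..hi (inclusive) of a 32-bit register value. */
static uint32_t get_bitfield(uint32_t v, int lo, int hi)
{
	assert(lo >= 0 && lo <= hi && hi < 32);
	uint32_t width = (uint32_t)(hi - lo + 1);
	uint32_t mask = (width == 32) ? UINT32_MAX : ((UINT32_C(1) << width) - 1);

	return (v >> lo) & mask;
}

/*
 * Hypothetical stand-in for skx_get_dimm_attr(): decode one field,
 * return -EINVAL for an out-of-range encoding, else field + add.
 */
static int dimm_attr(uint32_t reg, int lo, int hi, int add,
		     int minval, int maxval)
{
	int val = (int)get_bitfield(reg, lo, hi);

	if (val < minval || val > maxval)
		return -EINVAL;
	return val + add;
}

/*
 * Shift exponent rows + cols + ranks, or -EINVAL if any attribute
 * failed to decode. Field positions and limits are assumptions.
 */
static int dimm_shift_exponent(uint32_t mtr)
{
	int ranks = dimm_attr(mtr, 12, 13, 0, 0, 2);
	int rows = dimm_attr(mtr, 2, 4, 12, 1, 6);
	int cols = dimm_attr(mtr, 0, 1, 10, 0, 2);

	/* Same idea as the check added by the patch. */
	if (ranks < 0 || rows < 0 || cols < 0)
		return -EINVAL;
	return rows + cols + ranks;
}

/* 1 << exponent, or 0 when the register could not be decoded. */
static uint64_t dimm_units(uint32_t mtr)
{
	int exp = dimm_shift_exponent(mtr);

	if (exp < 0 || exp >= 64)
		return 0;
	return 1ull << exp;
}
```

Without the guard, three -EINVAL results would sum to -66 and be used directly as the shift count, which is undefined behavior in C. With it, dimm_units(0xffffffff) returns 0 instead of shifting.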
Hi AceLan,

> From: AceLan Kao <acelan@gmail.com> On Behalf Of Chia-Lin Kao (AceLan)
> Sent: Wednesday, July 30, 2025 2:32 PM
> To: Luck, Tony <tony.luck@intel.com>; Borislav Petkov <bp@alien8.de>; James
> Morse <james.morse@arm.com>; Mauro Carvalho Chehab
> <mchehab@kernel.org>; Robert Richter <rric@kernel.org>; Zhuo, Qiuxu
> <qiuxu.zhuo@intel.com>; linux-edac@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Subject: [PATCH] EDAC/skx_common: Fix potential negative values in DIMM
> size calculation
>
> [...]

Thanks for reporting this.

Which CPU did you test it on?
Would you mind taking a complete dmesg log with the kernel option
CONFIG_EDAC_DEBUG=y (your current log showed this option had been enabled)?

Thanks!
-Qiuxu
On Wed, Jul 30, 2025 at 3:56 PM Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
>
> Hi AceLan,
>
> > From: AceLan Kao <acelan@gmail.com> On Behalf Of Chia-Lin Kao (AceLan)
> > Sent: Wednesday, July 30, 2025 2:32 PM
> > Subject: [PATCH] EDAC/skx_common: Fix potential negative values in DIMM
> > size calculation
> >
> > [...]
>
> Thanks for reporting this.
>
> Which CPU did you test it on?

It's an ongoing project; there is no CPU name on it.

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 172
On-line CPU(s) list: 0-171
Vendor ID: GenuineIntel
Model name: Genuine Intel(R) 0000
CPU family: 6
Model: 173
Thread(s) per core: 2
Core(s) per socket: 86
Socket(s): 1
Stepping: 1
CPU(s) scaling MHz: 18%
CPU max MHz: 4800.0000
CPU min MHz: 800.0000
BogoMIPS: 3800.00

> Would you mind taking a complete dmesg log with the kernel option
> CONFIG_EDAC_DEBUG=y (your current log showed this option had been enabled)?

Sure, here you are.
I masked the product name in the log.

> Thanks!
> -Qiuxu
Hi AceLan,
> From: AceLan Kao <acelan.kao@canonical.com>
> [...]
> > Which CPU did you test it on?
> It's an ongoing project; there is no CPU name on it.
> $ lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Address sizes: 52 bits physical, 57 bits virtual
> Byte Order: Little Endian
> CPU(s): 172
> On-line CPU(s) list: 0-171
> Vendor ID: GenuineIntel
> Model name: Genuine Intel(R) 0000
> CPU family: 6
> Model: 173
This is the CPU with the code name "Granite Rapids".
> Thread(s) per core: 2
> Core(s) per socket: 86
> Socket(s): 1
> Stepping: 1
> CPU(s) scaling MHz: 18%
> CPU max MHz: 4800.0000
> CPU min MHz: 800.0000
> BogoMIPS: 3800.00
>
> > Would you mind taking a complete dmesg log with the kernel option
> > CONFIG_EDAC_DEBUG=y (your current log showed this option had been
> > enabled)?
> Sure, here you are.
Thanks so much for your log.
We recently hit the same issue: the BIOS disables a memory controller when
no DIMMs are populated on it, so the registers of the disabled memory
controller read back invalid values, which leads to the call trace you reported.
Attached is a patch that skips DIMM enumeration on a disabled memory
controller to fix the call trace. Could you please test this patch on your machines
and share the dmesg log?
Thanks!
-Qiuxu
From 4de20bd2e7e669c7a16be33c1ebb4106a5479b69 Mon Sep 17 00:00:00 2001
From: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Date: Wed, 30 Jul 2025 22:32:33 +0800
Subject: [PATCH 1/1] EDAC/i10nm: Skip DIMM enumeration on a disabled memory
controller
When loading the i10nm_edac driver on some Intel Granite Rapids servers,
a call trace may appear as follows:
UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:453:16
shift exponent -66 is negative
...
__ubsan_handle_shift_out_of_bounds+0x1e3/0x390
skx_get_dimm_info.cold+0x47/0xd40 [skx_edac_common]
i10nm_get_dimm_config+0x23e/0x390 [i10nm_edac]
skx_register_mci+0x159/0x220 [skx_edac_common]
i10nm_init+0xcb0/0x1ff0 [i10nm_edac]
...
This occurs because some BIOSes disable a memory controller when no
DIMMs are populated on it. The DIMMMTR register of such a disabled
memory controller reads back the invalid value ~0, resulting in the
call trace above.
Fix this call trace by skipping DIMM enumeration on a disabled memory
controller.
Fixes: ba987eaaabf9 ("EDAC/i10nm: Add Intel Granite Rapids server support")
Reported-by: Jose Jesus Ambriz Meza <jose.jesus.ambriz.meza@intel.com>
Reported-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
Closes: https://lore.kernel.org/all/20250730063155.2612379-1-acelan.kao@canonical.com/
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
drivers/edac/i10nm_base.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/drivers/edac/i10nm_base.c b/drivers/edac/i10nm_base.c
index a3fca2567752..103fbf602d1b 100644
--- a/drivers/edac/i10nm_base.c
+++ b/drivers/edac/i10nm_base.c
@@ -1047,6 +1047,15 @@ static bool i10nm_check_ecc(struct skx_imc *imc, int chan)
return !!GET_BITFIELD(mcmtr, 2, 2);
}
+static bool i10nm_channel_disabled(struct skx_imc *imc, int chan)
+{
+ u32 mcmtr = I10NM_GET_MCMTR(imc, chan);
+
+ edac_dbg(1, "ch%d mcmtr reg %x\n", chan, mcmtr);
+
+ return (mcmtr == ~0 || GET_BITFIELD(mcmtr, 18, 18));
+}
+
static int i10nm_get_dimm_config(struct mem_ctl_info *mci,
struct res_config *cfg)
{
@@ -1060,6 +1069,11 @@ static int i10nm_get_dimm_config(struct mem_ctl_info *mci,
if (!imc->mbase)
continue;
+ if (i10nm_channel_disabled(imc, i)) {
+ edac_dbg(1, "mc%d ch%d is disabled.\n", imc->mc, i);
+ continue;
+ }
+
ndimms = 0;
if (res_cfg->type != GNR)
base-commit: 038d61fd642278bab63ee8ef722c50d10ab01e8f
--
2.43.0
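As a rough illustration of the check the patch above adds, the following user-space sketch treats a channel as disabled when its MCMTR value reads back as all ones or has bit 18 set, and skips such channels during enumeration. The function names (get_bitfield(), channel_disabled(), count_enabled_channels()) are hypothetical; only the two conditions are taken from the patch, and the sketch makes no claim about what bit 18 means in hardware beyond the patch treating it as a disable indication.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Extract bits lo..hi (inclusive) of a 32-bit register value. */
static uint32_t get_bitfield(uint32_t v, int lo, int hi)
{
	assert(lo >= 0 && lo <= hi && hi < 32);
	uint32_t width = (uint32_t)(hi - lo + 1);
	uint32_t mask = (width == 32) ? UINT32_MAX : ((UINT32_C(1) << width) - 1);

	return (v >> lo) & mask;
}

/*
 * Hypothetical analogue of i10nm_channel_disabled(): an all-ones
 * read or a set bit 18 marks the channel as disabled.
 */
static bool channel_disabled(uint32_t mcmtr)
{
	return mcmtr == UINT32_MAX || get_bitfield(mcmtr, 18, 18) != 0;
}

/* Count the channels that DIMM enumeration would still visit. */
static int count_enabled_channels(const uint32_t *mcmtr, int nchan)
{
	int enabled = 0;

	for (int i = 0; i < nchan; i++) {
		if (channel_disabled(mcmtr[i]))
			continue;	/* skip, as the patch does */
		enabled++;
	}
	return enabled;
}
```

Checking for the all-ones value first matters because such a read also has bit 18 set. Testing it explicitly makes the "register not backed by an enabled controller" case visible in its own right.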
On Thu, Jul 31, 2025 at 12:33 AM Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
>
> Hi AceLan,
>
> [...]
>
> Thanks so much for your log.
>
> We've encountered the same issue recently due to the BIOS disabling the
> memory controller when no DIMMs are populated, leading to invalid values
> of the disabled memory controller register and the call trace you reported.
>
> Attached is a patch that skips DIMM enumeration on a disabled memory
> controller to fix the call trace. Could you please test this patch on your machines
> and share the dmesg log?

Yes, this works for me.

Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>

> Thanks!
> -Qiuxu
> From: AceLan Kao <acelan.kao@canonical.com>
> [...]
> > Attached is a patch that skips DIMM enumeration on a disabled memory
> > controller to fix the call trace. Could you please test this patch on
> > your machines and share the dmesg log?
> Yes, this works for me.

Thanks for your testing feedback.

> Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>

I'll add your "Tested-by" in the commit message.

Thanks!
-Qiuxu