[PATCH] EDAC/mce_amd: Fix Hygon UMC ECC error decoding with logical_die_id

Aichun Shi posted 1 patch 1 month, 2 weeks ago
[PATCH] EDAC/mce_amd: Fix Hygon UMC ECC error decoding with logical_die_id
Posted by Aichun Shi 1 month, 2 weeks ago
cpuinfo_topology.amd_node_id is populated via CPUID or MSR, as introduced
by commit f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology
parser") and commit 03fa6bea5a3e ("x86/cpu: Make topology_amd_node_id()
use the actual node info"). However, this value may be non-contiguous on
Hygon processors while EDAC expects contiguous node IDs, which leads to
incorrect UMC ECC error decoding.

In contrast, cpuinfo_topology.logical_die_id always provides contiguous
die (or node) IDs. Fix this by replacing topology_amd_node_id() with
topology_logical_die_id() when decoding UMC ECC errors for Hygon
processors.

Signed-off-by: Aichun Shi <shiaichun@open-hieco.net>
---
 drivers/edac/mce_amd.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index af3c12284a1e..4a23c1d6488e 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -746,8 +746,13 @@ static void decode_smca_error(struct mce *m)
 	pr_emerg(HW_ERR "%s Ext. Error Code: %d", smca_get_long_name(bank_type), xec);
 
 	if ((bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) &&
-	    xec == 0 && decode_dram_ecc)
-		decode_dram_ecc(topology_amd_node_id(m->extcpu), m);
+	    xec == 0 && decode_dram_ecc) {
+		if (boot_cpu_data.x86_vendor == X86_VENDOR_HYGON &&
+		    boot_cpu_data.x86 == 0x18)
+			decode_dram_ecc(topology_logical_die_id(m->extcpu), m);
+		else
+			decode_dram_ecc(topology_amd_node_id(m->extcpu), m);
+	}
 }
 
 static inline void amd_decode_err_code(u16 ec)
-- 
2.47.3
Re: [PATCH] EDAC/mce_amd: Fix Hygon UMC ECC error decoding with logical_die_id
Posted by Yazen Ghannam 1 month, 2 weeks ago
On Sat, Feb 14, 2026 at 02:42:03PM +0800, Aichun Shi wrote:
> cpuinfo_topology.amd_node_id is populated via CPUID or MSR, as introduced
> by commit f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology
> parser") and commit 03fa6bea5a3e ("x86/cpu: Make topology_amd_node_id()
> use the actual node info"). However, this value may be non-contiguous on
> Hygon processors while EDAC expects contiguous node IDs, which leads to
> incorrect UMC ECC error decoding.

Can you please share an example?

> 
> In contrast, cpuinfo_topology.logical_die_id always provides contiguous
> die (or node) IDs. Fix this by replacing topology_amd_node_id() with
> topology_logical_die_id() when decoding UMC ECC errors for Hygon
> processors.
> 
> Signed-off-by: Aichun Shi <shiaichun@open-hieco.net>
> ---
>  drivers/edac/mce_amd.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
> index af3c12284a1e..4a23c1d6488e 100644
> --- a/drivers/edac/mce_amd.c
> +++ b/drivers/edac/mce_amd.c
> @@ -746,8 +746,13 @@ static void decode_smca_error(struct mce *m)
>  	pr_emerg(HW_ERR "%s Ext. Error Code: %d", smca_get_long_name(bank_type), xec);
>  
>  	if ((bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) &&
> -	    xec == 0 && decode_dram_ecc)
> -		decode_dram_ecc(topology_amd_node_id(m->extcpu), m);
> +	    xec == 0 && decode_dram_ecc) {
> +		if (boot_cpu_data.x86_vendor == X86_VENDOR_HYGON &&
> +		    boot_cpu_data.x86 == 0x18)

Is the family check necessary? You did not mention a specific family in
the commit message. So it seems the intent is to apply to all Hygon
systems.

Thanks,
Yazen
Re: [PATCH] EDAC/mce_amd: Fix Hygon UMC ECC error decoding with logical_die_id
Posted by Aichun Shi 1 month ago
On Mon, Feb 16, 2026 at 03:32:11PM -0500, Yazen Ghannam wrote:
> On Sat, Feb 14, 2026 at 02:42:03PM +0800, Aichun Shi wrote:
> > cpuinfo_topology.amd_node_id is populated via CPUID or MSR, as introduced
> > by commit f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology
> > parser") and commit 03fa6bea5a3e ("x86/cpu: Make topology_amd_node_id()
> > use the actual node info"). However, this value may be non-contiguous on
> > Hygon processors while EDAC expects contiguous node IDs, which leads to
> > incorrect UMC ECC error decoding.
> 
> Can you please share an example?

Yazen, thanks for your reply!

Certainly. For example, on some Hygon processors with 2 sockets and 4 dies
per socket, amd_node_id is populated as 0,1,2,3 for the 4 dies on socket 0,
and 16,17,18,19 for the 4 dies on socket 1, which is non-contiguous.
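
To illustrate why these values are a problem, here is a minimal standalone
sketch, not kernel code. It assumes a consumer that keeps one table entry per
node (8 entries for this 2-socket, 4-die example) and indexes it by node ID.
The table size and the helper names node_id_in_range() and logical_id_of() are
hypothetical and are not taken from the EDAC drivers.

```c
#include <stddef.h>

#define NUM_NODES 8 /* 2 sockets x 4 dies, as in the example above */

/* amd_node_id values from the example: socket 0 first, then socket 1. */
const unsigned int amd_node_ids[NUM_NODES] = { 0, 1, 2, 3, 16, 17, 18, 19 };

/* Returns 1 if node_id can index a table with NUM_NODES entries. */
int node_id_in_range(unsigned int node_id)
{
	return node_id < NUM_NODES;
}

/*
 * Hypothetical stand-in for a contiguous logical ID: the position of the
 * hardware node ID in enumeration order. Returns -1 if it is not found.
 */
int logical_id_of(unsigned int node_id)
{
	for (size_t i = 0; i < NUM_NODES; i++)
		if (amd_node_ids[i] == node_id)
			return (int)i;
	return -1;
}
```

With these assumptions, node IDs 16..19 fall outside an 8-entry table, while
their enumeration positions 4..7 stay in range.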

> > 
> > In contrast, cpuinfo_topology.logical_die_id always provides contiguous
> > die (or node) IDs. Fix this by replacing topology_amd_node_id() with
> > topology_logical_die_id() when decoding UMC ECC errors for Hygon
> > processors.

On Hygon processors without CPUID leaf 0x80000026, the logical_die_id
obtained from topology_get_logical_id(apicid, TOPO_DIE_DOMAIN) is
incorrect. This is caused by the absence of die topology information
in the APIC ID space.

I have sent another patch to fix this issue:
https://lore.kernel.org/lkml/20260301141157.241770-1-shiaichun@open-hieco.net/
Could you please review that patch first?

> > 
> > Signed-off-by: Aichun Shi <shiaichun@open-hieco.net>
> > ---
> >  drivers/edac/mce_amd.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
> > index af3c12284a1e..4a23c1d6488e 100644
> > --- a/drivers/edac/mce_amd.c
> > +++ b/drivers/edac/mce_amd.c
> > @@ -746,8 +746,13 @@ static void decode_smca_error(struct mce *m)
> >  	pr_emerg(HW_ERR "%s Ext. Error Code: %d", smca_get_long_name(bank_type), xec);
> >  
> >  	if ((bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) &&
> > -	    xec == 0 && decode_dram_ecc)
> > -		decode_dram_ecc(topology_amd_node_id(m->extcpu), m);
> > +	    xec == 0 && decode_dram_ecc) {
> > +		if (boot_cpu_data.x86_vendor == X86_VENDOR_HYGON &&
> > +		    boot_cpu_data.x86 == 0x18)
> 
> Is the family check necessary? You did not mention a specific family in
> the commit message. So it seems the intent is to apply to all Hygon
> systems.

You are right, the family check (0x18) is overly restrictive and can be removed.
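
As a rough standalone sketch of the selection logic with only the vendor
check, not the actual follow-up patch: the enum, the struct, and the function
ecc_node_id() below are hypothetical stand-ins for X86_VENDOR_*,
the per-CPU topology data, and the call site in decode_smca_error().

```c
/* Hypothetical stand-ins; the kernel uses X86_VENDOR_* and boot_cpu_data. */
enum vendor { VENDOR_AMD, VENDOR_HYGON };

/* Hypothetical container for the two IDs discussed in this thread. */
struct cpu_ids {
	unsigned int amd_node_id;
	unsigned int logical_die_id;
};

/* Pick the node ID to hand to the DRAM ECC decoder: vendor check only. */
unsigned int ecc_node_id(enum vendor v, const struct cpu_ids *ids)
{
	if (v == VENDOR_HYGON)
		return ids->logical_die_id;
	return ids->amd_node_id;
}
```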

> Thanks,
> Yazen

Thanks for your review and valuable comments!

Aichun Shi