[PATCH v2] EDAC/amd64: Fix possible module load failure on some UMC usage combinations

Avadhut Naik posted 1 patch 2 weeks, 2 days ago
Posted by Avadhut Naik 2 weeks, 2 days ago
Starting with Zen4, AMD SOCs have 12 Unified Memory Controllers (UMCs) per
socket.

When the amd64_edac module is being loaded, these UMCs are traversed to
determine if they have SdpInit (SdpCtrl[31]) and EccEnabled (UmcCapHi[30])
bits set and create masks in umc_en_mask and ecc_en_mask respectively.

However, the current data type of these variables is u8. As a result, if
only the last 4 UMCs (UMC8 - UMC11) of the system have been utilized,
umc_ecc_enabled() will return false. Consequently, the module may fail to
load on these systems.

Fixes: e2be5955a886 ("EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh")
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Cc: stable@vger.kernel.org
---
Changes in v2:
1. Change data type of variables from u16 to int. (Boris)
2. Modify commit message per feedback. (Boris)
3. Add Fixes: and CC:stable tags. (Boris)
---
 drivers/edac/amd64_edac.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index ddfbdb66b794..b1c034214a8d 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3362,7 +3362,7 @@ static bool dct_ecc_enabled(struct amd64_pvt *pvt)
 
 static bool umc_ecc_enabled(struct amd64_pvt *pvt)
 {
-	u8 umc_en_mask = 0, ecc_en_mask = 0;
+	int umc_en_mask = 0, ecc_en_mask = 0;
 	u16 nid = pvt->mc_node_id;
 	struct amd64_umc *umc;
 	u8 ecc_en = 0, i;

base-commit: f84722cbed6c2b2094ad8bbe48be2c5900752935
-- 
2.43.0
Re: [PATCH v2] EDAC/amd64: Fix possible module load failure on some UMC usage combinations
Posted by Borislav Petkov 2 weeks, 2 days ago
On Tue, Dec 10, 2024 at 09:20:00PM +0000, Avadhut Naik wrote:
> Starting Zen4, AMD SOCs have 12 Unified Memory Controllers (UMCs) per
> socket.
> 
> When the amd64_edac module is being loaded, these UMCs are traversed to
> determine if they have SdpInit (SdpCtrl[31]) and EccEnabled (UmcCapHi[30])
> bits set and create masks in umc_en_mask and ecc_en_mask respectively.
> 
> However, the current data type of these variables is u8. As a result, if
> only the last 4 UMCs (UMC8 - UMC11) of the system have been utilized,
> umc_ecc_enabled() will return false. Consequently, the module may fail to
> load on these systems.
> 
> Fixes: e2be5955a886 ("EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh")
> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
> Cc: stable@vger.kernel.org
> ---
> Changes in v2:
> 1. Change data type of variables from u16 to int. (Boris)
> 2. Modify commit message per feedback. (Boris)
> 3. Add Fixes: and CC:stable tags. (Boris)
> ---
>  drivers/edac/amd64_edac.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index ddfbdb66b794..b1c034214a8d 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -3362,7 +3362,7 @@ static bool dct_ecc_enabled(struct amd64_pvt *pvt)
>  
>  static bool umc_ecc_enabled(struct amd64_pvt *pvt)
>  {
> -	u8 umc_en_mask = 0, ecc_en_mask = 0;
> +	int umc_en_mask = 0, ecc_en_mask = 0;
>  	u16 nid = pvt->mc_node_id;
>  	struct amd64_umc *umc;
>  	u8 ecc_en = 0, i;

Hmm, looking at that whole function, it looks kinda clumsy to me. If the point
is to check whether at least one UMC is enabled, why aren't we doing simply
that instead of those silly masks?

Yazen? Did you think about checking anything else here, in addition?

Because if not, this can be written as simply as:

static bool umc_ecc_enabled(struct amd64_pvt *pvt)
{
        u16 nid = pvt->mc_node_id;
        struct amd64_umc *umc;
        bool ecc_en = false;
        int i;

        /* Check whether at least one enabled UMC has ECC enabled: */
        for_each_umc(i) {
                umc = &pvt->umc[i];

                if (umc->sdp_ctrl & UMC_SDP_INIT &&
                    umc->umc_cap_hi & UMC_ECC_ENABLED) {
                        ecc_en = true;
                        break;
                }
        }

        edac_dbg(3, "Node %d: DRAM ECC %s.\n", nid, (ecc_en ? "enabled" : "disabled"));

        return ecc_en;
}

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v2] EDAC/amd64: Fix possible module load failure on some UMC usage combinations
Posted by Yazen Ghannam 2 weeks, 1 day ago
On Wed, Dec 11, 2024 at 12:07:29PM +0100, Borislav Petkov wrote:
> On Tue, Dec 10, 2024 at 09:20:00PM +0000, Avadhut Naik wrote:
> > Starting Zen4, AMD SOCs have 12 Unified Memory Controllers (UMCs) per
> > socket.
> > 
> > When the amd64_edac module is being loaded, these UMCs are traversed to
> > determine if they have SdpInit (SdpCtrl[31]) and EccEnabled (UmcCapHi[30])
> > bits set and create masks in umc_en_mask and ecc_en_mask respectively.
> > 
> > However, the current data type of these variables is u8. As a result, if
> > only the last 4 UMCs (UMC8 - UMC11) of the system have been utilized,
> > umc_ecc_enabled() will return false. Consequently, the module may fail to
> > load on these systems.
> > 
> > Fixes: e2be5955a886 ("EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh")
> > Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
> > Cc: stable@vger.kernel.org
> > ---
> > Changes in v2:
> > 1. Change data type of variables from u16 to int. (Boris)
> > 2. Modify commit message per feedback. (Boris)
> > 3. Add Fixes: and CC:stable tags. (Boris)
> > ---
> >  drivers/edac/amd64_edac.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> > index ddfbdb66b794..b1c034214a8d 100644
> > --- a/drivers/edac/amd64_edac.c
> > +++ b/drivers/edac/amd64_edac.c
> > @@ -3362,7 +3362,7 @@ static bool dct_ecc_enabled(struct amd64_pvt *pvt)
> >  
> >  static bool umc_ecc_enabled(struct amd64_pvt *pvt)
> >  {
> > -	u8 umc_en_mask = 0, ecc_en_mask = 0;
> > +	int umc_en_mask = 0, ecc_en_mask = 0;
> >  	u16 nid = pvt->mc_node_id;
> >  	struct amd64_umc *umc;
> >  	u8 ecc_en = 0, i;
> 
> Hmm, looking at that whole function, it looks kinda clumsy to me. If the point
> is to check whether at least one UMC is enabled, why aren't we doing simply
> that instead of those silly masks?
> 
> Yazen? Did you think about checking anything else here, in addition?
>

I think we used the masks because we would only read registers as
needed.

  196b79fcc8ed ("EDAC, amd64: Extend ecc_enabled() to Fam17h")

Now we cache all the registers at init time. So yeah, I agree that this
can be simplified.

> Because if not, this can be written as simple as:
> 
> static bool umc_ecc_enabled(struct amd64_pvt *pvt)
> {
>         u16 nid = pvt->mc_node_id;
>         struct amd64_umc *umc;
>         bool ecc_en = false; 
>         int i;
> 
>         /* Check whether at least one UMC is enabled: */
>         for_each_umc(i) {
>                 umc = &pvt->umc[i];
>                 
>                 if (umc->sdp_ctrl & UMC_SDP_INIT && 
>                     umc->umc_cap_hi & UMC_ECC_ENABLED) {
>                         ecc_en = true;
>                         break; 
>                 }       
>         }
> 
>         edac_dbg(3, "Node %d: DRAM ECC %s.\n", nid, (ecc_en ? "enabled" : "disabled"));
>         
>         return ecc_en;
> }
>

Looks good overall. We can even remove the "nid" variable and just use
"pvt->mc_node_id" directly in the debug message. This is another remnant
from when this function did register accesses.

Thanks,
Yazen
Re: [PATCH v2] EDAC/amd64: Fix possible module load failure on some UMC usage combinations
Posted by Borislav Petkov 2 weeks, 1 day ago
On Wed, Dec 11, 2024 at 10:46:37AM -0500, Yazen Ghannam wrote:
> Looks good overall. We can even remove the "nid" variable and just use
> "pvt->mc_node_id" directly in the debug message. This is another remnant
> from when this function did register accesses.

Ok, done.

Avadhut, can you pls verify this fixes your issue too?

I'll run it on my boxes too, to make sure nothing breaks.

Thx.

---
From: "Borislav Petkov (AMD)" <bp@alien8.de>
Date: Wed, 11 Dec 2024 12:07:42 +0100
Subject: [PATCH] EDAC/amd64: Simplify ECC check on unified memory controllers

The intent of the check is to see whether at least one UMC has ECC
enabled. So do that instead of tracking which ones are enabled in masks
which are too small in size anyway and lead to not loading the driver on
Zen4 machines with UMCs enabled over UMC8.

Fixes: e2be5955a886 ("EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh")
Reported-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20241210212054.3895697-1-avadhut.naik@amd.com
---
 drivers/edac/amd64_edac.c | 32 ++++++++++----------------------
 1 file changed, 10 insertions(+), 22 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index ddfbdb66b794..5d356b7c4589 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3362,36 +3362,24 @@ static bool dct_ecc_enabled(struct amd64_pvt *pvt)
 
 static bool umc_ecc_enabled(struct amd64_pvt *pvt)
 {
-	u8 umc_en_mask = 0, ecc_en_mask = 0;
-	u16 nid = pvt->mc_node_id;
 	struct amd64_umc *umc;
-	u8 ecc_en = 0, i;
+	bool ecc_en = false;
+	int i;
 
+	/* Check whether at least one UMC is enabled: */
 	for_each_umc(i) {
 		umc = &pvt->umc[i];
 
-		/* Only check enabled UMCs. */
-		if (!(umc->sdp_ctrl & UMC_SDP_INIT))
-			continue;
-
-		umc_en_mask |= BIT(i);
-
-		if (umc->umc_cap_hi & UMC_ECC_ENABLED)
-			ecc_en_mask |= BIT(i);
+		if (umc->sdp_ctrl & UMC_SDP_INIT &&
+		    umc->umc_cap_hi & UMC_ECC_ENABLED) {
+			ecc_en = true;
+			break;
+		}
 	}
 
-	/* Check whether at least one UMC is enabled: */
-	if (umc_en_mask)
-		ecc_en = umc_en_mask == ecc_en_mask;
-	else
-		edac_dbg(0, "Node %d: No enabled UMCs.\n", nid);
-
-	edac_dbg(3, "Node %d: DRAM ECC %s.\n", nid, (ecc_en ? "enabled" : "disabled"));
+	edac_dbg(3, "Node %d: DRAM ECC %s.\n", pvt->mc_node_id, (ecc_en ? "enabled" : "disabled"));
 
-	if (!ecc_en)
-		return false;
-	else
-		return true;
+	return ecc_en;
 }
 
 static inline void
-- 
2.43.0

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v2] EDAC/amd64: Fix possible module load failure on some UMC usage combinations
Posted by Naik, Avadhut 2 weeks, 1 day ago

On 12/11/2024 12:51, Borislav Petkov wrote:
> On Wed, Dec 11, 2024 at 10:46:37AM -0500, Yazen Ghannam wrote:
>> Looks good overall. We can even remove the "nid" variable and just use
>> "pvt->mc_node_id" directly in the debug message. This is another remnant
>> from when this function did register accesses.
> 
> Ok, done.
> 
> Avadhut, can you pls verify this fixes your issue too?
> 
Yes, this fixes the issue of module not loading with some UMC
configurations.

If relevant, then for the below patch:

Tested-by: Avadhut Naik <avadhut.naik@amd.com>
Reviewed-by: Avadhut Naik <avadhut.naik@amd.com>

> I'll run it on my boxes too, to make sure nothing breaks.
> 
> Thx.
> 
> ---
> From: "Borislav Petkov (AMD)" <bp@alien8.de>
> Date: Wed, 11 Dec 2024 12:07:42 +0100
> Subject: [PATCH] EDAC/amd64: Simplify ECC check on unified memory controllers
> 
> The intent of the check is to see whether at least one UMC has ECC
> enabled. So do that instead of tracking which ones are enabled in masks
> which are too small in size anyway and lead to not loading the driver on
> Zen4 machines with UMCs enabled over UMC8.
> 
> Fixes: e2be5955a886 ("EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh")
> Reported-by: Avadhut Naik <avadhut.naik@amd.com>
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> Cc: <stable@kernel.org>
> Link: https://lore.kernel.org/r/20241210212054.3895697-1-avadhut.naik@amd.com
> ---
>  drivers/edac/amd64_edac.c | 32 ++++++++++----------------------
>  1 file changed, 10 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index ddfbdb66b794..5d356b7c4589 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -3362,36 +3362,24 @@ static bool dct_ecc_enabled(struct amd64_pvt *pvt)
>  
>  static bool umc_ecc_enabled(struct amd64_pvt *pvt)
>  {
> -	u8 umc_en_mask = 0, ecc_en_mask = 0;
> -	u16 nid = pvt->mc_node_id;
>  	struct amd64_umc *umc;
> -	u8 ecc_en = 0, i;
> +	bool ecc_en = false;
> +	int i;
>  
> +	/* Check whether at least one UMC is enabled: */
>  	for_each_umc(i) {
>  		umc = &pvt->umc[i];
>  
> -		/* Only check enabled UMCs. */
> -		if (!(umc->sdp_ctrl & UMC_SDP_INIT))
> -			continue;
> -
> -		umc_en_mask |= BIT(i);
> -
> -		if (umc->umc_cap_hi & UMC_ECC_ENABLED)
> -			ecc_en_mask |= BIT(i);
> +		if (umc->sdp_ctrl & UMC_SDP_INIT &&
> +		    umc->umc_cap_hi & UMC_ECC_ENABLED) {
> +			ecc_en = true;
> +			break;
> +		}
>  	}
>  
> -	/* Check whether at least one UMC is enabled: */
> -	if (umc_en_mask)
> -		ecc_en = umc_en_mask == ecc_en_mask;
> -	else
> -		edac_dbg(0, "Node %d: No enabled UMCs.\n", nid);
> -
> -	edac_dbg(3, "Node %d: DRAM ECC %s.\n", nid, (ecc_en ? "enabled" : "disabled"));
> +	edac_dbg(3, "Node %d: DRAM ECC %s.\n", pvt->mc_node_id, (ecc_en ? "enabled" : "disabled"));
>  
> -	if (!ecc_en)
> -		return false;
> -	else
> -		return true;
> +	return ecc_en;
>  }
>  
>  static inline void

-- 
Thanks,
Avadhut Naik
Re: [PATCH v2] EDAC/amd64: Fix possible module load failure on some UMC usage combinations
Posted by Borislav Petkov 2 weeks, 1 day ago
On Wed, Dec 11, 2024 at 01:18:39PM -0600, Naik, Avadhut wrote:
> Yes, this fixes the issue of module not loading with some UMC
> configurations.

Thanks!

> If relevant, then for the below patch:
> 
> Tested-by: Avadhut Naik <avadhut.naik@amd.com>
> Reviewed-by: Avadhut Naik <avadhut.naik@amd.com>

Added.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette