[PATCH] EDAC/amd64: Fix possible module load failure on some UMC usage combinations

Avadhut Naik posted 1 patch 2 weeks, 2 days ago
There is a newer version of this series
drivers/edac/amd64_edac.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
[PATCH] EDAC/amd64: Fix possible module load failure on some UMC usage combinations
Posted by Avadhut Naik 2 weeks, 2 days ago
Starting Zen4, AMD SOCs have 12 Unified Memory Controllers (UMCs) per
socket.

When the amd64_edac module is being loaded, these UMCs are traversed to
determine if they have SdpInit (SdpCtrl[31]) and EccEnabled (UmcCapHi[30])
bits set and create masks in umc_en_mask and ecc_en_mask respectively.

However, the current data type of these variables is u8. As a result, if
only the last 4 UMCs (UMC8 - UMC11) of the system have been utilized,
umc_ecc_enabled() will return false. Consequently, the module may fail to
load on these systems.

Change the data type of these variables to u16.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
---
 drivers/edac/amd64_edac.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index ddfbdb66b794..583685e8e60f 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3362,7 +3362,7 @@ static bool dct_ecc_enabled(struct amd64_pvt *pvt)
 
 static bool umc_ecc_enabled(struct amd64_pvt *pvt)
 {
-	u8 umc_en_mask = 0, ecc_en_mask = 0;
+	u16 umc_en_mask = 0, ecc_en_mask = 0;
 	u16 nid = pvt->mc_node_id;
 	struct amd64_umc *umc;
 	u8 ecc_en = 0, i;
-- 
2.43.0
Re: [PATCH] EDAC/amd64: Fix possible module load failure on some UMC usage combinations
Posted by Borislav Petkov 2 weeks, 2 days ago
On Mon, Dec 09, 2024 at 09:55:10PM +0000, Avadhut Naik wrote:
> Starting Zen4, AMD SOCs have 12 Unified Memory Controllers (UMCs) per
> socket.
> 
> When the amd64_edac module is being loaded, these UMCs are traversed to
> determine if they have SdpInit (SdpCtrl[31]) and EccEnabled (UmcCapHi[30])
> bits set and create masks in umc_en_mask and ecc_en_mask respectively.
> 
> However, the current data type of these variables is u8. As a result, if
> only the last 4 UMCs (UMC8 - UMC11) of the system have been utilized,
> umc_ecc_enabled() will return false. Consequently, the module may fail to
> load on these systems.
> 
> Change the data type of these variables to u16.

No need to explain what the patch does. The "why" is enough.
 
> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
> ---
>  drivers/edac/amd64_edac.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

This looks like it needs a CC:stable and a Fixes: tag, right?

While at it, you can simply make those vars int and be done with it.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH] EDAC/amd64: Fix possible module load failure on some UMC usage combinations
Posted by Naik, Avadhut 2 weeks, 2 days ago

On 12/10/2024 03:55, Borislav Petkov wrote:
> On Mon, Dec 09, 2024 at 09:55:10PM +0000, Avadhut Naik wrote:
>> Starting Zen4, AMD SOCs have 12 Unified Memory Controllers (UMCs) per
>> socket.
>>
>> When the amd64_edac module is being loaded, these UMCs are traversed to
>> determine if they have SdpInit (SdpCtrl[31]) and EccEnabled (UmcCapHi[30])
>> bits set and create masks in umc_en_mask and ecc_en_mask respectively.
>>
>> However, the current data type of these variables is u8. As a result, if
>> only the last 4 UMCs (UMC8 - UMC11) of the system have been utilized,
>> umc_ecc_enabled() will return false. Consequently, the module may fail to
>> load on these systems.
>>
>> Change the data type of these variables to u16.
> 
> No need to explain what the patch does. The "why" is enough.
>  
Will remove this line.

>> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
>> ---
>>  drivers/edac/amd64_edac.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> This looks like it needs a CC:stable and a Fixes: tag, right?
> 
It does need a Fixes: tag. Will add that.
But not so sure about the CC:stable.
Quoting from stable-kernel-rules about the patches that are accepted into stable.

It must fix a real bug that bothers people (not a, “This could be a problem...” type thing).

This has not been reported by anyone yet. IMO, it classifies as
"This could be a problem" type issue.
Would you agree?

> While at it, you can simply make those vars int and be done with it.
> 
Okay, will change the data type of these 2 variables to int.
> Thx.
> 

-- 
Thanks,
Avadhut Naik

Re: [PATCH] EDAC/amd64: Fix possible module load failure on some UMC usage combinations
Posted by Borislav Petkov 2 weeks, 2 days ago
On Tue, Dec 10, 2024 at 11:32:06AM -0600, Naik, Avadhut wrote:
> But not so sure about the CC:stable.
> Quoting from stable-kernel-rules about the patches that are accepted into stable.
> 
> It must fix a real bug that bothers people (not a, “This could be a problem...” type thing).
> 
> This has not been reported by anyone yet. IMO, it classifies as
> "This could be a problem" type issue.
> Would you agree?

So if someone populates the last 4 UMCs, he won't have EDAC load on long-term
kernels. So are you sure about "this could be a problem"?

Or is it more like a "this is real problem for some configurations" issue?

IOW, at least "oh, that's not good" issue.

Or?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH] EDAC/amd64: Fix possible module load failure on some UMC usage combinations
Posted by Naik, Avadhut 2 weeks, 2 days ago

On 12/10/2024 12:04, Borislav Petkov wrote:
> On Tue, Dec 10, 2024 at 11:32:06AM -0600, Naik, Avadhut wrote:
>> But not so sure about the CC:stable.
>> Quoting from stable-kernel-rules about the patches that are accepted into stable.
>>
>> It must fix a real bug that bothers people (not a, “This could be a problem...” type thing).
>>
>> This has not been reported by anyone yet. IMO, it classifies as
>> "This could be a problem" type issue.
>> Would you agree?
> 
> So if someone populates the last 4 UMCs, he won't have EDAC load on long-term
> kernels. So are you sure about "this could be a problem"?
> 
> Or is it more like a "this is real problem for some configurations" issue?
> 
> IOW, at least "oh, that's not good" issue.
> 
> Or?

Agreed that this is a real problem for some configurations, particularly since
the module wont load, even on long term kernels. Will add stable to Cc and resend.
> 

-- 
Thanks,
Avadhut Naik