[v1] Incorporate DRAM address in EDAC messages

[PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message

Posted by Avadhut Naik 2 months, 3 weeks ago

Currently, the amd64_edac module only provides UMC normalized and system
physical address when a DRAM ECC error occurs. DRAM Address on which the
error has occurred is not provided since the support required to translate
the normalized address into DRAM address has not yet been implemented.

This support however, has now been implemented through an earlier patch
(RAS/AMD/ATL: Translate UMC normalized address to DRAM address using PRM)
and DRAM address, which provides additional debugging information relating
to the error received, can now be logged by the module.

Add the required support to log DRAM address on which the error has been
received in dmesg and through the RAS tracepoint.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
---
 drivers/edac/amd64_edac.c | 23 +++++++++++++++++++++++
 drivers/edac/amd64_edac.h |  1 +
 2 files changed, 24 insertions(+)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 07f1e9dc1ca7..36b46cd81bb2 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2724,6 +2724,22 @@ static void __log_ecc_error(struct mem_ctl_info *mci, struct err_info *err,
 	switch (err->err_code) {
 	case DECODE_OK:
 		string = "";
+		if (err->dram_addr) {
+			char s[100];
+
+			memset(s, 0, 100);
+			sprintf(s, "Cs: 0x%x Bank Grp: 0x%x Bank Addr: 0x%x"
+					   " Row: 0x%x Column: 0x%x"
+					   " RankMul: 0x%x SubChannel: 0x%x",
+					   err->dram_addr->chip_select,
+					   err->dram_addr->bank_group,
+					   err->dram_addr->bank_addr,
+					   err->dram_addr->row_addr,
+					   err->dram_addr->col_addr,
+					   err->dram_addr->rank_mul,
+					   err->dram_addr->sub_ch);
+			string = s;
+		}
 		break;
 	case ERR_NODE:
 		string = "Failed to map error addr to a node";
@@ -2808,11 +2824,13 @@ static void umc_get_err_info(struct mce *m, struct err_info *err)
 static void decode_umc_error(int node_id, struct mce *m)
 {
 	u8 ecc_type = (m->status >> 45) & 0x3;
+	struct dram_addr dram_addr;
 	struct mem_ctl_info *mci;
 	unsigned long sys_addr;
 	struct amd64_pvt *pvt;
 	struct atl_err a_err;
 	struct err_info err;
+	int ret;
 
 	node_id = fixup_node_id(node_id, m);
 
@@ -2822,6 +2840,7 @@ static void decode_umc_error(int node_id, struct mce *m)
 
 	pvt = mci->pvt_info;
 
+	memset(&dram_addr, 0, sizeof(dram_addr));
 	memset(&err, 0, sizeof(err));
 
 	if (m->status & MCI_STATUS_DEFERRED)
@@ -2853,6 +2872,10 @@ static void decode_umc_error(int node_id, struct mce *m)
 		goto log_error;
 	}
 
+	ret = amd_convert_umc_mca_addr_to_dram_addr(&a_err, &dram_addr);
+	if (!ret)
+		err.dram_addr = &dram_addr;
+
 	error_address_to_page_and_offset(sys_addr, &err);
 
 log_error:
diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
index 17228d07de4c..88b0b8425ab3 100644
--- a/drivers/edac/amd64_edac.h
+++ b/drivers/edac/amd64_edac.h
@@ -399,6 +399,7 @@ struct err_info {
 	u16 syndrome;
 	u32 page;
 	u32 offset;
+	struct dram_addr *dram_addr;
 };
 
 static inline u32 get_umc_base(u8 channel)
-- 
2.43.0

Re: [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message

Posted by Yazen Ghannam 2 months, 1 week ago

On Thu, Jul 17, 2025 at 04:48:43PM +0000, Avadhut Naik wrote:
> Currently, the amd64_edac module only provides UMC normalized and system

The amd64_edac module provides data for the EDAC interface. This is only
the system physical address (PFN + offset). The UMC normalized address
comes from MCA error decoding.

> physical address when a DRAM ECC error occurs. DRAM Address on which the
> error has occurred is not provided since the support required to translate
> the normalized address into DRAM address has not yet been implemented.

I don't think this last sentence is necessary.

> 
> This support however, has now been implemented through an earlier patch
> (RAS/AMD/ATL: Translate UMC normalized address to DRAM address using PRM)
> and DRAM address, which provides additional debugging information relating
> to the error received, can now be logged by the module.
> 
> Add the required support to log DRAM address on which the error has been
> received in dmesg and through the RAS tracepoint.

These last two paragraphs could be something like this:

"Use the new PRM call in the AMD Address Translation Library to gather the
DRAM address of an error. Include this data in the EDAC 'string' so it
is available in the kernel messages and EDAC trace event." 

> 
> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
> ---
>  drivers/edac/amd64_edac.c | 23 +++++++++++++++++++++++
>  drivers/edac/amd64_edac.h |  1 +
>  2 files changed, 24 insertions(+)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 07f1e9dc1ca7..36b46cd81bb2 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -2724,6 +2724,22 @@ static void __log_ecc_error(struct mem_ctl_info *mci, struct err_info *err,
>  	switch (err->err_code) {
>  	case DECODE_OK:
>  		string = "";
> +		if (err->dram_addr) {
> +			char s[100];
> +
> +			memset(s, 0, 100);
> +			sprintf(s, "Cs: 0x%x Bank Grp: 0x%x Bank Addr: 0x%x"
> +					   " Row: 0x%x Column: 0x%x"
> +					   " RankMul: 0x%x SubChannel: 0x%x",

There's a checkpatch warning here about splitting user-visible strings.

Why not use scnprintf() or the like?

> +					   err->dram_addr->chip_select,
> +					   err->dram_addr->bank_group,
> +					   err->dram_addr->bank_addr,
> +					   err->dram_addr->row_addr,
> +					   err->dram_addr->col_addr,
> +					   err->dram_addr->rank_mul,
> +					   err->dram_addr->sub_ch);
> +			string = s;

Looking at the description for 'edac_mc_handle_error()', it seems that
"other_detail" would be more appropriate for this info.

I do think we should consider updating the EDAC interface if multiple
vendors are gathering this data.

> +		}
>  		break;
>  	case ERR_NODE:
>  		string = "Failed to map error addr to a node";
> @@ -2808,11 +2824,13 @@ static void umc_get_err_info(struct mce *m, struct err_info *err)
>  static void decode_umc_error(int node_id, struct mce *m)
>  {
>  	u8 ecc_type = (m->status >> 45) & 0x3;
> +	struct dram_addr dram_addr;
>  	struct mem_ctl_info *mci;
>  	unsigned long sys_addr;
>  	struct amd64_pvt *pvt;
>  	struct atl_err a_err;
>  	struct err_info err;
> +	int ret;
>  
>  	node_id = fixup_node_id(node_id, m);
>  
> @@ -2822,6 +2840,7 @@ static void decode_umc_error(int node_id, struct mce *m)
>  
>  	pvt = mci->pvt_info;
>  
> +	memset(&dram_addr, 0, sizeof(dram_addr));
>  	memset(&err, 0, sizeof(err));
>  
>  	if (m->status & MCI_STATUS_DEFERRED)
> @@ -2853,6 +2872,10 @@ static void decode_umc_error(int node_id, struct mce *m)
>  		goto log_error;
>  	}
>  
> +	ret = amd_convert_umc_mca_addr_to_dram_addr(&a_err, &dram_addr);
> +	if (!ret)
> +		err.dram_addr = &dram_addr;

I feel like it is not necessary to pass a second struct if it is already
contained in another.

You could just clear (or not set) that field if an error occurs.

> +
>  	error_address_to_page_and_offset(sys_addr, &err);
>  
>  log_error:
> diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
> index 17228d07de4c..88b0b8425ab3 100644
> --- a/drivers/edac/amd64_edac.h
> +++ b/drivers/edac/amd64_edac.h
> @@ -399,6 +399,7 @@ struct err_info {
>  	u16 syndrome;
>  	u32 page;
>  	u32 offset;
> +	struct dram_addr *dram_addr;
>  };
>  
>  static inline u32 get_umc_base(u8 channel)
> -- 

Thanks,
Yazen

[PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message

Posted by Naik, Avadhut 2 months ago

Hi,

On 7/28/2025 10:14, Yazen Ghannam wrote:
> On Thu, Jul 17, 2025 at 04:48:43PM +0000, Avadhut Naik wrote:
>> Currently, the amd64_edac module only provides UMC normalized and system
> 
> The amd64_edac module provides data for the EDAC interface. This is only
> the system physical address (PFN + offset). The UMC normalized address
> comes from MCA error decoding.
> 
Will reword this part.
>> physical address when a DRAM ECC error occurs. DRAM Address on which the
>> error has occurred is not provided since the support required to translate
>> the normalized address into DRAM address has not yet been implemented.
> 
> I don't think this last sentence is necessary.
> 
Noted.
>>
>> This support however, has now been implemented through an earlier patch
>> (RAS/AMD/ATL: Translate UMC normalized address to DRAM address using PRM)
>> and DRAM address, which provides additional debugging information relating
>> to the error received, can now be logged by the module.
>>
>> Add the required support to log DRAM address on which the error has been
>> received in dmesg and through the RAS tracepoint.
> 
> These last two paragraphs could be something like this:
> 
> "Use the new PRM call in the AMD Address Translation Library to gather the
> DRAM address of an error. Include this data in the EDAC 'string' so it
> is available in the kernel messages and EDAC trace event." 
> 
Okay. Will change them.
>>
>> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
>> ---
>>  drivers/edac/amd64_edac.c | 23 +++++++++++++++++++++++
>>  drivers/edac/amd64_edac.h |  1 +
>>  2 files changed, 24 insertions(+)
>>
>> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
>> index 07f1e9dc1ca7..36b46cd81bb2 100644
>> --- a/drivers/edac/amd64_edac.c
>> +++ b/drivers/edac/amd64_edac.c
>> @@ -2724,6 +2724,22 @@ static void __log_ecc_error(struct mem_ctl_info *mci, struct err_info *err,
>>  	switch (err->err_code) {
>>  	case DECODE_OK:
>>  		string = "";
>> +		if (err->dram_addr) {
>> +			char s[100];
>> +
>> +			memset(s, 0, 100);
>> +			sprintf(s, "Cs: 0x%x Bank Grp: 0x%x Bank Addr: 0x%x"
>> +					   " Row: 0x%x Column: 0x%x"
>> +					   " RankMul: 0x%x SubChannel: 0x%x",
> 
> There's a checkpatch warning here about splitting user-visible strings.
> 
> Why not use scnprintf() or the like?
> 
Had noticed the checkpatch warning initially.
IIRC, it was for splitting the quoted string across multiple lines.
Can use scnprintf here. But wont the warning still prevail?
One way I can think of for getting rid of the warning is to generate the string
through multiple scnprintf calls. Something like below:

            memset(s, 0, len);
            n = scnprintf(s + n, len - n, "Cs: 0x%x Bank Grp: 0x%x",
                      err->dram_addr->chip_select,
                      err->dram_addr->bank_group);
            n += scnprintf(s + n, len - n, " Bank Addr: 0x%x",
                      err->dram_addr->bank_addr);
            n += scnprintf(s + n, len - n, " Row: 0x%x Column: 0x%x",
                      err->dram_addr->row_addr,
                      err->dram_addr->col_addr);
            n += scnprintf(s + n, len - n, " RankMul: 0x%x SubChannel: 0x%x",
                      err->dram_addr->rank_mul,
                      err->dram_addr->sub_ch);

            pr_err("%s: s: %s\n", __func__, s);
            string = s;

Is this acceptable?

>> +					   err->dram_addr->chip_select,
>> +					   err->dram_addr->bank_group,
>> +					   err->dram_addr->bank_addr,
>> +					   err->dram_addr->row_addr,
>> +					   err->dram_addr->col_addr,
>> +					   err->dram_addr->rank_mul,
>> +					   err->dram_addr->sub_ch);
>> +			string = s;
> 
> Looking at the description for 'edac_mc_handle_error()', it seems that
> "other_detail" would be more appropriate for this info.
> 
> I do think we should consider updating the EDAC interface if multiple
> vendors are gathering this data.
> 
Okay, will use "other_detail" parameter of edac_mc_handle_error() for this.

>> +		}
>>  		break;
>>  	case ERR_NODE:
>>  		string = "Failed to map error addr to a node";
>> @@ -2808,11 +2824,13 @@ static void umc_get_err_info(struct mce *m, struct err_info *err)
>>  static void decode_umc_error(int node_id, struct mce *m)
>>  {
>>  	u8 ecc_type = (m->status >> 45) & 0x3;
>> +	struct dram_addr dram_addr;
>>  	struct mem_ctl_info *mci;
>>  	unsigned long sys_addr;
>>  	struct amd64_pvt *pvt;
>>  	struct atl_err a_err;
>>  	struct err_info err;
>> +	int ret;
>>  
>>  	node_id = fixup_node_id(node_id, m);
>>  
>> @@ -2822,6 +2840,7 @@ static void decode_umc_error(int node_id, struct mce *m)
>>  
>>  	pvt = mci->pvt_info;
>>  
>> +	memset(&dram_addr, 0, sizeof(dram_addr));
>>  	memset(&err, 0, sizeof(err));
>>  
>>  	if (m->status & MCI_STATUS_DEFERRED)
>> @@ -2853,6 +2872,10 @@ static void decode_umc_error(int node_id, struct mce *m)
>>  		goto log_error;
>>  	}
>>  
>> +	ret = amd_convert_umc_mca_addr_to_dram_addr(&a_err, &dram_addr);
>> +	if (!ret)
>> +		err.dram_addr = &dram_addr;
> 
> I feel like it is not necessary to pass a second struct if it is already
> contained in another.
> 
> You could just clear (or not set) that field if an error occurs.
>
Slightly confused here.
Do you mean we should avoid passing dram_addr as second parameter
for amd_convert_umc_mca_addr_to_dram_addr() and instead just pass
struct err_info instance err?

And, in case some error occurs, we should just do
	err.dram_addr = 0x0;
?
 
>> +
>>  	error_address_to_page_and_offset(sys_addr, &err);
>>  
>>  log_error:
>> diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
>> index 17228d07de4c..88b0b8425ab3 100644
>> --- a/drivers/edac/amd64_edac.h
>> +++ b/drivers/edac/amd64_edac.h
>> @@ -399,6 +399,7 @@ struct err_info {
>>  	u16 syndrome;
>>  	u32 page;
>>  	u32 offset;
>> +	struct dram_addr *dram_addr;
>>  };
>>  
>>  static inline u32 get_umc_base(u8 channel)
>> -- 
> 
> Thanks,
> Yazen

-- 
Thanks,
Avadhut Naik

Re: [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message

Posted by Yazen Ghannam 1 month, 3 weeks ago

On Wed, Aug 06, 2025 at 04:08:07PM -0500, Naik, Avadhut wrote:
[...]
> >> @@ -2808,11 +2824,13 @@ static void umc_get_err_info(struct mce *m, struct err_info *err)
> >>  static void decode_umc_error(int node_id, struct mce *m)
> >>  {
> >>  	u8 ecc_type = (m->status >> 45) & 0x3;
> >> +	struct dram_addr dram_addr;
> >>  	struct mem_ctl_info *mci;
> >>  	unsigned long sys_addr;
> >>  	struct amd64_pvt *pvt;
> >>  	struct atl_err a_err;
> >>  	struct err_info err;
> >> +	int ret;
> >>  
> >>  	node_id = fixup_node_id(node_id, m);
> >>  
> >> @@ -2822,6 +2840,7 @@ static void decode_umc_error(int node_id, struct mce *m)
> >>  
> >>  	pvt = mci->pvt_info;
> >>  
> >> +	memset(&dram_addr, 0, sizeof(dram_addr));
> >>  	memset(&err, 0, sizeof(err));
> >>  
> >>  	if (m->status & MCI_STATUS_DEFERRED)
> >> @@ -2853,6 +2872,10 @@ static void decode_umc_error(int node_id, struct mce *m)
> >>  		goto log_error;
> >>  	}
> >>  
> >> +	ret = amd_convert_umc_mca_addr_to_dram_addr(&a_err, &dram_addr);
> >> +	if (!ret)
> >> +		err.dram_addr = &dram_addr;
> > 
> > I feel like it is not necessary to pass a second struct if it is already
> > contained in another.
> > 
> > You could just clear (or not set) that field if an error occurs.
> >
> Slightly confused here.
> Do you mean we should avoid passing dram_addr as second parameter
> for amd_convert_umc_mca_addr_to_dram_addr() and instead just pass
> struct err_info instance err?
> 
> And, in case some error occurs, we should just do
> 	err.dram_addr = 0x0;
> ?
>  

Sorry, I think I misread this before.

I was thinking you can add 'struct dram_addr' to 'struct atl_err'. The
intent of 'struct atl_err' is to collect all needed parameters for the
translation functions.

Additionally, I just realized, we should have an 'invalid' default value
for dram_addr. Technically, bank=0, row=0, col=0, etc., would be a valid
DRAM address.

Thanks,
Yazen

Re: [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message

Posted by Borislav Petkov 2 months ago

On Wed, Aug 06, 2025 at 04:08:07PM -0500, Naik, Avadhut wrote:
> >> +			sprintf(s, "Cs: 0x%x Bank Grp: 0x%x Bank Addr: 0x%x"
> >> +					   " Row: 0x%x Column: 0x%x"
> >> +					   " RankMul: 0x%x SubChannel: 0x%x",
> > 
> > There's a checkpatch warning here about splitting user-visible strings.
> > 
> > Why not use scnprintf() or the like?
> > 
> Had noticed the checkpatch warning initially.
> IIRC, it was for splitting the quoted string across multiple lines.
> Can use scnprintf here. But wont the warning still prevail?

Just do one long line. The idea is to be able to grep the code when you see
the string issued in dmesg.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message

Posted by Naik, Avadhut 2 months ago


On 8/6/2025 16:15, Borislav Petkov wrote:
> On Wed, Aug 06, 2025 at 04:08:07PM -0500, Naik, Avadhut wrote:
>>>> +			sprintf(s, "Cs: 0x%x Bank Grp: 0x%x Bank Addr: 0x%x"
>>>> +					   " Row: 0x%x Column: 0x%x"
>>>> +					   " RankMul: 0x%x SubChannel: 0x%x",
>>>
>>> There's a checkpatch warning here about splitting user-visible strings.
>>>
>>> Why not use scnprintf() or the like?
>>>
>> Had noticed the checkpatch warning initially.
>> IIRC, it was for splitting the quoted string across multiple lines.
>> Can use scnprintf here. But wont the warning still prevail?
> 
> Just do one long line. The idea is to be able to grep the code when you see
> the string issued in dmesg.
> 
Okay, sounds good! Will change it to something like below:

scnprintf(s, 100, "Cs: 0x%x Bank Grp: 0x%x Bank Addr: 0x%x Row: 0x%x Column: 0x%x RankMul: 0x%x SubChannel: 0x%x",
          err->dram_addr->chip_select,
          err->dram_addr->bank_group,
          err->dram_addr->bank_addr,
          err->dram_addr->row_addr,
          err->dram_addr->col_addr,
          err->dram_addr->rank_mul,
          err->dram_addr->sub_ch);

-- 
Thanks,
Avadhut Naik