[PATCH v3 0/2] Incorporate DRAM address in EDAC messages

Avadhut Naik posted 2 patches 3 months, 3 weeks ago
Posted by Avadhut Naik 3 months, 3 weeks ago
Currently, the amd64_edac module only provides the UMC normalized address
and the system physical address when a DRAM ECC error occurs. The DRAM
address is neither logged nor exported through a tracepoint.

Modern AMD SoCs provide a UEFI PRM module that implements various address
translation PRM handlers. These handlers can be leveraged to convert a
UMC normalized address into a DRAM address at runtime when a DRAM ECC
error occurs. The translated DRAM address can then be logged and exported
through tracepoints. This set adds the required support.

The first patch adds support in the Address Translation Library to invoke
the appropriate PRM handler to perform the translation.

The second patch leverages the support added in the first patch to log
the DRAM address and export it through the RAS tracepoint when a DRAM
ECC error occurs.

Changes in v2:
 - Modify commit messages per feedback received.
 - Remove unnecessary variables.
 - Rename struct dram_addr to atl_dram_addr.
 - Replace sprintf call in __log_ecc_error() with scnprintf.
 - Pass the DRAM Address to edac_mc_handle_error() through "other_detail"
   parameter instead of "msg".
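
As an illustration of the sprintf-to-scnprintf change listed above, here is
a minimal userspace sketch of bounded formatting for a DRAM address. The
struct layout, the field names, the message format, and both helper names
are assumptions made for this example, not taken from the patches.
scnprintf_example() only mimics the kernel scnprintf() contract of
returning the number of characters actually stored.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical field layout, for illustration only. The real
 * struct atl_dram_addr in the series may differ.
 */
struct dram_addr_example {
	uint8_t  chip_select;
	uint8_t  bank_group;
	uint8_t  bank_addr;
	uint32_t row_addr;
	uint16_t col_addr;
};

/*
 * Userspace stand-in for the kernel's scnprintf(): returns the number of
 * characters actually stored in buf, not counting the trailing NUL.
 */
static int scnprintf_example(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int len;

	if (size == 0)
		return 0;

	va_start(args, fmt);
	len = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (len < 0) {
		buf[0] = '\0';
		return 0;
	}
	/* vsnprintf() reports the would-be length; clamp it on truncation. */
	if ((size_t)len >= size)
		return (int)(size - 1);
	return len;
}

/* Format the (assumed) DRAM address fields into a bounded buffer. */
static int format_dram_addr(char *buf, size_t size,
			    const struct dram_addr_example *da)
{
	return scnprintf_example(buf, size,
				 "cs=%u bank_group=%u bank=%u row=0x%x col=0x%x",
				 (unsigned int)da->chip_select,
				 (unsigned int)da->bank_group,
				 (unsigned int)da->bank_addr,
				 (unsigned int)da->row_addr,
				 (unsigned int)da->col_addr);
}
```

With a buffer that is too small, format_dram_addr() stores a truncated,
NUL-terminated string and returns the stored length rather than the
would-be length, so the return value can be used as an offset for further
appends into the same buffer.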

Changes in v3:
 - Rebase on top of edac-for-next.
 - Add Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>

Links:
v1: https://lore.kernel.org/all/20250717165622.1162091-1-avadhut.naik@amd.com/
v2: https://lore.kernel.org/all/20250915212244.886668-1-avadhut.naik@amd.com/

Avadhut Naik (2):
  RAS/AMD/ATL: Translate UMC normalized address to DRAM address using
    PRM
  EDAC/amd64: Incorporate DRAM Address in EDAC message

 drivers/edac/amd64_edac.c      | 23 +++++++++++++++++++++-
 drivers/edac/amd64_edac.h      |  1 +
 drivers/ras/amd/atl/core.c     |  3 ++-
 drivers/ras/amd/atl/internal.h |  9 +++++++++
 drivers/ras/amd/atl/prm.c      | 36 ++++++++++++++++++++++++++++++----
 drivers/ras/amd/atl/umc.c      |  9 +++++++++
 drivers/ras/ras.c              | 18 +++++++++++++++--
 include/linux/ras.h            | 19 +++++++++++++++++-
 8 files changed, 109 insertions(+), 9 deletions(-)


base-commit: 79c0a2b7abc906c7cf3c793256c6b638d7dc477f
-- 
2.43.0
Re: [PATCH v3 0/2] Incorporate DRAM address in EDAC messages
Posted by Borislav Petkov 3 months, 3 weeks ago
On Mon, Oct 13, 2025 at 07:34:47PM +0000, Avadhut Naik wrote:
> Currently, the amd64_edac module only provides the UMC normalized address
> and the system physical address when a DRAM ECC error occurs. The DRAM
> address is neither logged nor exported through a tracepoint.
> 
> Modern AMD SoCs provide a UEFI PRM module that implements various address
> translation PRM handlers. These handlers can be leveraged to convert a
> UMC normalized address into a DRAM address at runtime when a DRAM ECC
> error occurs. The translated DRAM address can then be logged and exported
> through tracepoints.

And?

I read all three commit messages to figure out *why* those DRAM addresses want
to be logged. But it seems they don't want to be logged. Because there's not
a single reason why they should be, AFAICT. Without a proper justification,
this looks like a bunch of unnecessary code to me...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v3 0/2] Incorporate DRAM address in EDAC messages
Posted by Yazen Ghannam 3 months, 3 weeks ago
On Tue, Oct 14, 2025 at 12:00:19AM +0200, Borislav Petkov wrote:
> On Mon, Oct 13, 2025 at 07:34:47PM +0000, Avadhut Naik wrote:
> > Currently, the amd64_edac module only provides the UMC normalized address
> > and the system physical address when a DRAM ECC error occurs. The DRAM
> > address is neither logged nor exported through a tracepoint.
> > 
> > Modern AMD SoCs provide a UEFI PRM module that implements various address
> > translation PRM handlers. These handlers can be leveraged to convert a
> > UMC normalized address into a DRAM address at runtime when a DRAM ECC
> > error occurs. The translated DRAM address can then be logged and exported
> > through tracepoints.
> 
> And?
> 
> I read all three commit messages to figure out *why* those DRAM addresses want
> to be logged. But it seems they don't want to be logged. Because there's not
> a single reason why they should be, AFAICT. Without a proper justification,
> this looks like a bunch of unnecessary code to me...
> 

Good point. I overlooked this myself.

The "DRAM address" helps memory vendors analyze failures. System
builders want to collect this data and pass it along to the memory
vendors. The DRAM address is not contained in architectural data like
MCA info, and getting the address from MCA requires using additional
system-specific hardware info. It's much more reliable to get the DRAM
address from the system with the error rather than try to post-process
it later.

Thanks,
Yazen
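
To illustrate why the DRAM address cannot be recovered from the normalized
address alone, the following toy sketch decodes the same address under two
made-up bit layouts. The layouts, struct names, and helper functions are
hypothetical and do not describe any real AMD memory controller. Real
translation depends on the system configuration, and the series relies on
the platform's PRM handlers for it.

```c
#include <stdint.h>

/* Hypothetical description of where each field sits in the address. */
struct toy_layout {
	unsigned int col_shift, col_bits;
	unsigned int bank_shift, bank_bits;
	unsigned int row_shift, row_bits;
};

struct toy_dram_addr {
	uint32_t row;
	uint32_t bank;
	uint32_t col;
};

/* Two invented layouts that differ only in bank/column placement. */
static const struct toy_layout toy_layout_a = {
	.col_shift = 3,  .col_bits = 10,
	.bank_shift = 13, .bank_bits = 2,
	.row_shift = 15, .row_bits = 16,
};

static const struct toy_layout toy_layout_b = {
	.bank_shift = 3, .bank_bits = 2,
	.col_shift = 5,  .col_bits = 10,
	.row_shift = 15, .row_bits = 16,
};

/* Extract 'width' bits of 'val' starting at bit 'shift' (width < 64). */
static uint32_t toy_bits(uint64_t val, unsigned int shift, unsigned int width)
{
	return (uint32_t)((val >> shift) & ((1ULL << width) - 1));
}

/* Split a normalized address into row/bank/column per the given layout. */
static struct toy_dram_addr toy_decode(uint64_t norm_addr,
				       const struct toy_layout *l)
{
	struct toy_dram_addr da;

	da.row  = toy_bits(norm_addr, l->row_shift, l->row_bits);
	da.bank = toy_bits(norm_addr, l->bank_shift, l->bank_bits);
	da.col  = toy_bits(norm_addr, l->col_shift, l->col_bits);
	return da;
}
```

Decoding 0x12345678 with toy_layout_a and toy_layout_b yields the same row
but different bank and column values, so a log that carries only the
normalized address is ambiguous without knowing which layout was in
effect on the machine that reported the error.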