[PATCH v6 0/5] Fix issues with ARM Processor CPER records

Daniel Ferguson posted 5 patches 5 months, 3 weeks ago
Documentation/driver-api/firmware/efi/index.rst | 11 +++--
drivers/acpi/apei/ghes.c                        | 27 +++++------
drivers/firmware/efi/cper-arm.c                 | 52 ++++++++++-----------
drivers/firmware/efi/cper.c                     | 62 ++++++++++++++++++++++++-
drivers/ras/ras.c                               | 40 +++++++++++++++-
include/linux/cper.h                            | 12 +++--
include/linux/ras.h                             | 16 +++++--
include/ras/ras_event.h                         | 49 +++++++++++++++++--
8 files changed, 210 insertions(+), 59 deletions(-)
[PATCH v6 0/5] Fix issues with ARM Processor CPER records
Posted by Daniel Ferguson 5 months, 3 weeks ago
This series is needed for both kernelspace and userspace to properly handle
ARM processor CPER events.

Patch 1 of this series fixes the UEFI 2.6+ implementation of the ARM
trace event, as the original implementation was incomplete.
Changeset e9279e83ad1f ("trace, ras: add ARM processor error trace event")
added the event, but it reports only some fields of the CPER record
defined in UEFI 2.6+ appendix N, table N.16.  Those are not enough to
actually parse such events in userspace, as not even the event type
is exported.

Patch 2 fixes a compilation breakage when building with W=1;

Patch 3 adds a new helper function to be used by cper and ghes drivers to
display CPER bitmaps;

Patch 4 fixes the CPER logic according to the UEFI 2.9A errata. Before it, there
was no description of how the processor type field was encoded. The errata
defines it as a bitmask and specifies how it should
be encoded.
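
To illustrate what treating the type field as a bitmask means, here is a minimal
userspace sketch in Python. It is not the kernel's cper_bits_to_str() helper. The
bit positions (bit 1 cache, bit 2 TLB, bit 3 bus, bit 4 micro-architectural) are
an assumption based on my reading of the errata, and the table and function names
are made up for the example. The mapping does agree with the "error_type: 0x1a"
line in the log further down.

```python
# Assumed bit layout of the ARM processor error "type" field (UEFI 2.9A errata).
# The names and this table are illustrative, not copied from the kernel sources.
ARM_ERR_TYPE_BITS = {
    1: "cache error",
    2: "TLB error",
    3: "bus error",
    4: "micro-architectural error",
}


def decode_arm_error_type(value: int) -> str:
    """Return the names of all set bits, lowest bit first, joined by '|'."""
    names = [
        name
        for bit, name in sorted(ARM_ERR_TYPE_BITS.items())
        if value & (1 << bit)
    ]
    return "|".join(names)


if __name__ == "__main__":
    # 0x1a has bits 1, 3 and 4 set.
    print(decode_arm_error_type(0x1A))
```

With a plain enumerated value only one error class could be reported per record;
with a bitmask, several classes can be flagged at once, which is what the test
injection with "-t cache,bus,micro-arch" exercises.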

Patch 5 adds the CPER functions to kernel-doc.

This series was validated with the help of ARM EINJ code for QEMU:

	https://gitlab.com/mchehab_kernel/qemu/-/tree/qemu_submission

$ scripts/ghes_inject.py -d arm -p 0xdeadbeef -t cache,bus,micro-arch

[   11.094205] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[   11.095009] {1}[Hardware Error]: event severity: recoverable
[   11.095486] {1}[Hardware Error]:  Error 0, type: recoverable
[   11.096090] {1}[Hardware Error]:   section_type: ARM processor error
[   11.096399] {1}[Hardware Error]:   MIDR: 0x00000000000f0510
[   11.097135] {1}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000080000000
[   11.097811] {1}[Hardware Error]:   running state: 0x0
[   11.098193] {1}[Hardware Error]:   Power State Coordination Interface state: 0
[   11.098699] {1}[Hardware Error]:   Error info structure 0:
[   11.099174] {1}[Hardware Error]:   num errors: 2
[   11.099682] {1}[Hardware Error]:    error_type: 0x1a: cache error|bus error|micro-architectural error
[   11.100150] {1}[Hardware Error]:    physical fault address: 0x00000000deadbeef
[   11.111214] Memory failure: 0xdeadb: recovery action for free buddy page: Recovered
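
As a reading aid for the last two log lines: the "Memory failure" message reports
a page frame number rather than a byte address. The sketch below shows the
arithmetic, assuming 4 KiB pages (a page shift of 12); that page size is my
assumption about the test guest, and the helper name is hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages on the test guest


def phys_to_pfn(addr: int, page_shift: int = PAGE_SHIFT) -> int:
    """Drop the in-page offset bits to get the page frame number."""
    return addr >> page_shift


if __name__ == "__main__":
    # The injected physical fault address from the command line above.
    print(hex(phys_to_pfn(0xDEADBEEF)))
```

Under that assumption the injected address 0xdeadbeef falls in frame 0xdeadb, the
value shown in the "Memory failure" line.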

I also tested the ghes and cper reports both with and without this
change, using different versions of rasdaemon, with and without
support for the extended trace event. Here is a summary of the
test results:

- adding more fields to the trace events didn't break userspace API:
  both versions of rasdaemon handled it;

- the rasdaemon patches that handle the new trace report were missing
  backward-compatibility logic. I have already fixed that, so rasdaemon
  can now handle both old and new trace events.

Btw, rasdaemon has had support for the extended trace since
version 0.5.8 (released in 2021). I didn't see any issues reported there
about it, so either distros used on ARM servers
are using an old version of rasdaemon, or they're carrying the trace
event changes as well.

---
v6:
 - fix typo in Jonathan's "reviewed-by" in patch 3
 - Link to v5: https://lore.kernel.org/linux-acpi/20250813-mauro_v3-v6-16-rev2-v5-0-954db8ccfbe6@os.amperecomputing.com

v5:
 - fix a few code formatting issues
 - remove "Co-developed-by: danielf" because his/my contribution was
   removed in v2.
 - adjust tag block
 - Link to v4: https://lore.kernel.org/linux-acpi/20250805-mauro_v3-v6-16-rev2-v4-0-ea538759841c@os.amperecomputing.com

v4:
 - rebase to kernel v6.16
 - modify commit message of patch 1, and adjust white spaces
   per Boris' suggestions.
 - Link to v3: https://lore.kernel.org/linux-acpi/cover.1725429659.git.mchehab+huawei@kernel.org

v3:
 - history of patch 1 improved with a chain of co-developed-by;
 - add a better description and an example in patch 3;
 - use BIT_ULL() in patch 3;
 - add a missing include in patch 4.

v2:
  - removed an unneeded patch adding #ifdef for CONFIG_ARM/ARM64;
  - cper_bits_to_str() now returns the number of chars written to the buffer;
  - made a cosmetic (blank lines) improvement to include/linux/ras.h;
  - arm_event trace dynamic arrays renamed to pei_buf/ctx_buf/oem_buf.

    

Jason Tian (1):
      RAS: Report all ARM processor CPER information to userspace

Mauro Carvalho Chehab (4):
      efi/cper: Adjust infopfx size to accept an extra space
      efi/cper: Add a new helper function to print bitmasks
      efi/cper: align ARM CPER type with UEFI 2.9A/2.10 specs
      docs: efi: add CPER functions to driver-api

 Documentation/driver-api/firmware/efi/index.rst | 11 +++--
 drivers/acpi/apei/ghes.c                        | 27 +++++------
 drivers/firmware/efi/cper-arm.c                 | 52 ++++++++++-----------
 drivers/firmware/efi/cper.c                     | 62 ++++++++++++++++++++++++-
 drivers/ras/ras.c                               | 40 +++++++++++++++-
 include/linux/cper.h                            | 12 +++--
 include/linux/ras.h                             | 16 +++++--
 include/ras/ras_event.h                         | 49 +++++++++++++++++--
 8 files changed, 210 insertions(+), 59 deletions(-)
Re: [PATCH v6 0/5] Fix issues with ARM Processor CPER records
Posted by Mauro Carvalho Chehab 2 months, 2 weeks ago
On Thu, 14 Aug 2025 09:52:51 -0700
Daniel Ferguson <danielf@os.amperecomputing.com> wrote:

> This series is needed for both kernelspace and userspace to properly handle
> ARM processor CPER events.
> 
> Patch 1 of this series fixes the UEFI 2.6+ implementation of the ARM
> trace event, as the original implementation was incomplete.
> Changeset e9279e83ad1f ("trace, ras: add ARM processor error trace event")
> added the event, but it reports only some fields of the CPER record
> defined in UEFI 2.6+ appendix N, table N.16.  Those are not enough to
> actually parse such events in userspace, as not even the event type
> is exported.

Hi Rafael/Ard,

What's the status of this series? I'm not seeing it yet on linux-next.

Regards,
Mauro

Re: [PATCH v6 0/5] Fix issues with ARM Processor CPER records
Posted by Ard Biesheuvel 2 months, 2 weeks ago
On Fri, 21 Nov 2025 at 09:30, Mauro Carvalho Chehab
<mchehab+huawei@kernel.org> wrote:
>
> On Thu, 14 Aug 2025 09:52:51 -0700
> Daniel Ferguson <danielf@os.amperecomputing.com> wrote:
>
> > This series is needed for both kernelspace and userspace to properly handle
> > ARM processor CPER events.
> >
> > Patch 1 of this series fixes the UEFI 2.6+ implementation of the ARM
> > trace event, as the original implementation was incomplete.
> > Changeset e9279e83ad1f ("trace, ras: add ARM processor error trace event")
> > added the event, but it reports only some fields of the CPER record
> > defined in UEFI 2.6+ appendix N, table N.16.  Those are not enough to
> > actually parse such events in userspace, as not even the event type
> > is exported.
>
> Hi Rafael/Ard,
>
> What's the status of this series? I'm not seeing it yet on linux-next.
>

I'll queue it up - thanks for the reminder.