[PATCH v7 00/16] Support Armv8 RAS Extensions for Kernel-first error handling

Ruidong Tian posted 16 patches 5 days, 20 hours ago
Documentation/ABI/testing/debugfs-arm64-ras |   87 ++
MAINTAINERS                                 |   11 +
arch/arm64/include/asm/ras.h                |   99 ++
drivers/acpi/arm64/Kconfig                  |   10 +
drivers/acpi/arm64/Makefile                 |    1 +
drivers/acpi/arm64/aest.c                   |  423 ++++++++
drivers/ras/Kconfig                         |    1 +
drivers/ras/Makefile                        |    1 +
drivers/ras/arm64/Kconfig                   |   16 +
drivers/ras/arm64/Makefile                  |    9 +
drivers/ras/arm64/ras-cmn.c                 |  479 +++++++++
drivers/ras/arm64/ras-core.c                | 1026 +++++++++++++++++++
drivers/ras/arm64/ras-inject.c              |  130 +++
drivers/ras/arm64/ras-storm.c               |  198 ++++
drivers/ras/arm64/ras-sysfs.c               |  319 ++++++
drivers/ras/arm64/ras.h                     |  374 +++++++
drivers/ras/debugfs.c                       |    3 +-
drivers/ras/ras.c                           |    3 +
include/linux/acpi_aest.h                   |   36 +
include/linux/cpuhotplug.h                  |    1 +
include/linux/ras.h                         |   10 +
include/ras/ras_event.h                     |   79 ++
22 files changed, 3315 insertions(+), 1 deletion(-)
create mode 100644 Documentation/ABI/testing/debugfs-arm64-ras
create mode 100644 arch/arm64/include/asm/ras.h
create mode 100644 drivers/acpi/arm64/aest.c
create mode 100644 drivers/ras/arm64/Kconfig
create mode 100644 drivers/ras/arm64/Makefile
create mode 100644 drivers/ras/arm64/ras-cmn.c
create mode 100644 drivers/ras/arm64/ras-core.c
create mode 100644 drivers/ras/arm64/ras-inject.c
create mode 100644 drivers/ras/arm64/ras-storm.c
create mode 100644 drivers/ras/arm64/ras-sysfs.c
create mode 100644 drivers/ras/arm64/ras.h
create mode 100644 include/linux/acpi_aest.h
This series introduces an arm64 platform RAS driver that supports the Armv8 RAS Extensions[0].
It supports the following features:

1. ACPI frontend[1]
2. Driver probing and interrupt handling
3. CMN-700 support
4. Error thresholding and error statistics
5. Hardware error interrupt storm mitigation
6. Tracepoints for userspace monitoring

The following features are still required but are not included in this series,
because they are highly platform-specific:

1. Error decoding
2. Address decoding

(This functionality is also known as the EDAC driver.)

Devicetree frontend support will be implemented later by Umang Chheda[3].

Motivation and Background
=========================

On current ARM platforms, RAS functionality is mainly implemented through the
APEI framework, also known as Firmware-First Mode (FFM). In this model, RAS
errors are first handled by firmware and then reported to the kernel.

However, this model does not fit all use cases well. In particular:

1. Some platforms want to collect more error events with lower overhead,
avoiding the expensive firmware-kernel context switching.
2. Some platforms prefer to handle error collection directly in the kernel[3].

To address these requirements, this series implements a kernel-first error
collection mechanism based on the Armv8 RAS Extensions. This model is also
referred to as Kernel-First Mode (KFM).

Extended Use Cases
==================

With support for both Kernel-First Mode (KFM) and Firmware-First Mode (FFM),
users can flexibly choose the most appropriate error handling path for their
platforms and workloads.

This enables additional use cases such as:

1. Predictive failure analysis (PFA)[4]
2. Large-scale error prediction in cloud environments[5][6][7][8]

Maintenance
===========
This series is based on Tyler Baicar's preliminary patches [9]. I attempted
to follow up with Tyler in 2022 but received no reply. As he no longer
appears active on the mailing list, I have picked up this work, updated it
to align with the latest AEST v2.0 specification, and addressed pending
feedback to ensure this critical feature is integrated into the mainline.

Testing
=======
I have tested this series on a THead Yitian710 SoC with a customized BIOS.
QEMU[10] can also be used for preliminary driver testing.

1. Boot QEMU

qemu-system-aarch64 -smp 4 -m 32G \
  -cpu host --enable-kvm -machine virt,gic-version=3 \
  -kernel Image -initrd initrd.cpio.gz \
  -device virtio-net-pci,netdev=t0 -netdev user,id=t0 \
  -bios /usr/share/edk2/aarch64/QEMU_EFI.fd  \
  -append "rdinit=/sbin/init earlycon verbose debug console=ttyAMA0 aest.dyndbg='+pt'" \
  -nographic -d guest_errors -D qemu.log

2. Inject an error

echo 0xc4800390 > /sys/kernel/debug/ras/arm64/memory.90d0000/record0/inject/hard_inject
[13581.756132] arm64_ras: {119}[Hardware Error]: Hardware error from AEST memory.90d0000
[13581.756158] arm64_ras: {119}[Hardware Error]:  Error from memory at SRAT proximity domain 0x0
[13581.756162] arm64_ras: {119}[Hardware Error]:   ERR0FR: 0x40000080044081
[13581.756164] arm64_ras: {119}[Hardware Error]:   ERR0CTRL: 0x108
[13581.756165] arm64_ras: {119}[Hardware Error]:   ERR0STATUS: 0xc4800390
[13581.756169] arm64_ras: {119}[Hardware Error]:   ERR0ADDR: 0x8400000043344521
[13581.756170] arm64_ras: {119}[Hardware Error]:   ERR0MISC0: 0x7fff00000000
[13581.756171] arm64_ras: {119}[Hardware Error]:   ERR0MISC1: 0x0
[13581.756172] arm64_ras: {119}[Hardware Error]:   ERR0MISC2: 0x0
[13581.756173] arm64_ras: {119}[Hardware Error]:   ERR0MISC3: 0x0
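
The ERR0STATUS value in the dump above can be decoded by hand. The following is
a minimal userspace sketch, not code from this series: the bit positions follow
the ERR<n>STATUS layout of the Armv8 RAS Extensions, while the struct and
function names (err_status_fields, decode_err_status, print_err_status) are
made up for illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* ERR<n>STATUS bit positions per the Armv8 RAS Extensions. */
#define ERR_STATUS_AV		(UINT64_C(1) << 31)	/* ERR<n>ADDR valid */
#define ERR_STATUS_V		(UINT64_C(1) << 30)	/* record valid */
#define ERR_STATUS_UE		(UINT64_C(1) << 29)	/* uncorrected error */
#define ERR_STATUS_ER		(UINT64_C(1) << 28)	/* error reported */
#define ERR_STATUS_OF		(UINT64_C(1) << 27)	/* overflow */
#define ERR_STATUS_MV		(UINT64_C(1) << 26)	/* ERR<n>MISCx valid */
#define ERR_STATUS_CE_SHIFT	24			/* corrected error, 2 bits */
#define ERR_STATUS_DE		(UINT64_C(1) << 23)	/* deferred error */
#define ERR_STATUS_PN		(UINT64_C(1) << 22)	/* poison */
#define ERR_STATUS_UET_SHIFT	20			/* uncorrected type, 2 bits */
#define ERR_STATUS_IERR_SHIFT	8			/* implementation code, 8 bits */
#define ERR_STATUS_SERR_MASK	UINT64_C(0xff)		/* architected code, 8 bits */

/* Illustrative container; hypothetical, not a structure from the series. */
struct err_status_fields {
	unsigned int av, v, ue, er, of, mv, ce, de, pn, uet, ierr, serr;
};

/* Split a raw ERR<n>STATUS value into its individual fields. */
static struct err_status_fields decode_err_status(uint64_t status)
{
	struct err_status_fields f = {
		.av   = !!(status & ERR_STATUS_AV),
		.v    = !!(status & ERR_STATUS_V),
		.ue   = !!(status & ERR_STATUS_UE),
		.er   = !!(status & ERR_STATUS_ER),
		.of   = !!(status & ERR_STATUS_OF),
		.mv   = !!(status & ERR_STATUS_MV),
		.ce   = (status >> ERR_STATUS_CE_SHIFT) & 0x3,
		.de   = !!(status & ERR_STATUS_DE),
		.pn   = !!(status & ERR_STATUS_PN),
		.uet  = (status >> ERR_STATUS_UET_SHIFT) & 0x3,
		.ierr = (status >> ERR_STATUS_IERR_SHIFT) & 0xff,
		.serr = status & ERR_STATUS_SERR_MASK,
	};

	return f;
}

/* Print one line summarizing the decoded fields. */
static void print_err_status(FILE *out, uint64_t status)
{
	struct err_status_fields f = decode_err_status(status);

	assert(out != NULL);
	fprintf(out,
		"ERR<n>STATUS 0x%" PRIx64 ": V=%u AV=%u MV=%u UE=%u ER=%u OF=%u "
		"CE=%u DE=%u PN=%u UET=%u IERR=0x%02x SERR=0x%02x\n",
		status, f.v, f.av, f.mv, f.ue, f.er, f.of,
		f.ce, f.de, f.pn, f.uet, f.ierr, f.serr);
}
```

Calling decode_err_status(0xc4800390) yields V=1, AV=1, MV=1 and DE=1 with
IERR=0x03 and SERR=0x90, i.e. a valid deferred error whose address and
miscellaneous registers are also valid, which is consistent with ERR0ADDR and
ERR0MISC0 being populated in the dump.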

[0]: https://developer.arm.com/documentation/ihi0100/
[1]: https://developer.arm.com/documentation/den0085/0201/
[3]: https://lore.kernel.org/all/20260505-aest-devicetree-support-v1-0-d5d6ffacf0a5@oss.qualcomm.com/
[4]: http://www.mcelog.org/glossary.html#pfa
[5]: Intel: Predicting Uncorrectable Memory Errors from the Correctable Error History
[6]: Alibaba: Predicting DRAM-Caused Risky VMs in Large-Scale Clouds. Published in HPCA 2025
[7]: AMD: Physics-informed machine learning for DRAM error modeling
[8]: Tencent: Predicting uncorrectable memory errors for proactive replacement: An empirical study
     on large-scale field data
[9]: https://lore.kernel.org/all/20211124170708.3874-1-baicar@os.amperecomputing.com/
[10]: https://github.com/winterddd/qemu/tree/error_record

Changes from V6:
https://lore.kernel.org/all/20260122094656.73399-1-tianruidong@linux.alibaba.com/
1. Comments from Robin Murphy:
  - Use the information in the AEST table to calculate PERIPHBASE instead of
    looking it up from the DSDT.
  - Support erratum #2732981: some nodes do not support ERRGSR, so fall back to
    polling for those nodes, while other nodes should continue using ERRGSR.
  - Other minor changes.
2. Architecture refactor:
  - Renamed the driver to the arm64_ras driver.
  - To support Umang Chheda’s changes[3] in the devicetree, the interaction logic between
    the driver (backend) and the device (frontend) was updated. The device property
    infrastructure is now used uniformly to pass properties, and the extra aest_device
    abstraction has been removed.
  - Removed the use of genpool, since it is not needed in interrupt context.
3. New features:
  - Added support for hardware error interrupt storm mitigation.
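
The ERRGSR fallback mentioned in the Robin Murphy item above can be pictured
with a small model. This is only a sketch under stated assumptions: it runs in
userspace on an in-memory copy of the registers, and the ras_node structure,
its has_errgsr flag and the ras_node_pending() helper are hypothetical names,
not the ones used by the driver. The only architectural facts relied on are
that bit n of ERRGSR reports the status of error record n and that
ERR<n>STATUS.V (bit 30) marks a record as valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ERR_STATUS_V	(UINT64_C(1) << 30)	/* ERR<n>STATUS.V: record valid */
#define MAX_RECORDS	64			/* one ERRGSR covers 64 records */

/* Hypothetical in-memory model of one error node. */
struct ras_node {
	bool has_errgsr;		/* false for nodes affected by the erratum */
	uint64_t errgsr;		/* bit n set: record n reports an error */
	unsigned int nr_records;	/* implemented records, at most 64 */
	uint64_t status[MAX_RECORDS];	/* snapshot of ERR<n>STATUS */
};

/*
 * Return a bitmap of the records that need processing.
 *
 * Nodes with a usable ERRGSR answer with a single register read. Nodes
 * without one are polled record by record by checking the V bit.
 */
static uint64_t ras_node_pending(const struct ras_node *node)
{
	uint64_t pending = 0;
	unsigned int i;

	assert(node != NULL && node->nr_records <= MAX_RECORDS);

	if (node->has_errgsr) {
		/* Mask off bits beyond the implemented records. */
		uint64_t implemented = node->nr_records == MAX_RECORDS ?
			~UINT64_C(0) : (UINT64_C(1) << node->nr_records) - 1;

		return node->errgsr & implemented;
	}

	for (i = 0; i < node->nr_records; i++) {
		if (node->status[i] & ERR_STATUS_V)
			pending |= UINT64_C(1) << i;
	}

	return pending;
}
```

Both paths return the same kind of bitmap, so the record-processing loop that
consumes it does not need to know which kind of node it is dealing with.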

Changes from V5:
https://lore.kernel.org/all/20251230090945.43969-1-tianru...
1. Based on feedback from Borislav Petkov, I've dropped the idea of a
   unified address translation interface across ARM and AMD.

Changes from V4:
https://lore.kernel.org/all/20251222094351.38792-1-tianru...
1. Fixed build warnings in patches 0010 and 0014 reported by the kernel test robot:
    https://lore.kernel.org/all/202512230122.CfXZcF76-lkp@int...
    https://lore.kernel.org/all/202512230007.Vs6IvFVD-lkp@int...
2. Dropped the extra patch (0014) that was mistakenly included in v4.

Changes from V3:
https://lore.kernel.org/all/20250115084228.107573-1-tianr...
1. Added a vendor AEST node framework and CMN-700 support.
2. Comments from Borislav Petkov:
    - Split the series into multiple smaller patches for easier review.
    - Refined the English in the cover letter for better flow.
3. Addressed Tomohiro Misono's comments.

Changes from V2:
https://lore.kernel.org/all/20240321025317.114621-1-tianr...
1. Comment from Tomohiro Misono:
    - Dump registers before panic.
2. Baolin Wang & Shuai Xue: accepted all comments.
3. Added support for AEST V2.

Changes from V1:
https://lore.kernel.org/all/20240304111517.33001-1-tianru...
1. Comments from Marc Zyngier:
  - Use readq/writeq_relaxed instead of readq/writeq for MMIO addresses.
  - Add sync for system register operations.
  - Use the irq_is_percpu_devid() helper to identify a per-CPU interrupt.
  - Other fixes.
2. Set the RAS CE threshold in the AEST driver.
3. Enable the RAS interrupt explicitly in the driver.
4. UER and UEO errors trigger memory_failure() rather than a panic.
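
The UER/UEO policy from the last item can be expressed as a small
classification step. The sketch below is an assumption about how such a
decision might look, not the logic in this series: the UET encodings (UC, UEU,
UEO, UER) come from the Armv8 RAS Extensions, but the ras_action enum, the
classify_err_status() name and the choice to panic on UC and UEU are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ERR_STATUS_V		(UINT64_C(1) << 30)	/* record valid */
#define ERR_STATUS_UE		(UINT64_C(1) << 29)	/* uncorrected error */
#define ERR_STATUS_UET_SHIFT	20
#define ERR_STATUS_UET_MASK	UINT64_C(0x3)

/* ERR<n>STATUS.UET encodings, meaningful only when UE is set. */
#define UET_UC	0x0	/* uncontainable */
#define UET_UEU	0x1	/* unrecoverable */
#define UET_UEO	0x2	/* latent or restartable */
#define UET_UER	0x3	/* signaled or recoverable */

/* Hypothetical outcome of handling one error record. */
enum ras_action {
	RAS_ACTION_NONE,		/* record not valid, nothing to do */
	RAS_ACTION_LOG,			/* corrected or deferred: report only */
	RAS_ACTION_MEMORY_FAILURE,	/* UER/UEO: try to isolate the page */
	RAS_ACTION_PANIC,		/* UC/UEU: assumed fatal in this sketch */
};

/*
 * Map a raw ERR<n>STATUS value to an action. A real handler would also
 * need a valid physical address (ERR<n>STATUS.AV) before it could call
 * memory_failure(); that check is left out to keep the sketch short.
 */
static enum ras_action classify_err_status(uint64_t status)
{
	unsigned int uet;

	if (!(status & ERR_STATUS_V))
		return RAS_ACTION_NONE;

	if (!(status & ERR_STATUS_UE))
		return RAS_ACTION_LOG;

	uet = (status >> ERR_STATUS_UET_SHIFT) & ERR_STATUS_UET_MASK;
	if (uet == UET_UEO || uet == UET_UER)
		return RAS_ACTION_MEMORY_FAILURE;

	return RAS_ACTION_PANIC;
}
```

With this mapping, the status value 0xc4800390 from the Testing section is
classified as log-only, because it has DE set and UE clear.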

Ruidong Tian (16):
  ACPI/AEST: Register arm64_ras platform devices from AEST v2
  arm64: ras: Add probe/remove for arm64_ras driver
  arm64: ras: Unify the read/write interface for system and MMIO
    registers
  arm64: ras: Support RAS Common Fault Injection Model Extension
  arm64: ras: Plumb AEST interrupts as platform IRQ resources
  arm64: ras: Enable error reporting
  arm64: ras: Add error record processing and interrupt handling
  arm64: ras: Handle memory failure for uncorrectable errors
  arm64: ras: Probe RAS architecture version
  arm64: ras: Support CE threshold of error record
  arm64: ras: Add RAS decode notifier chain
  arm64: ras: Expose config abi through debugfs
  arm64: ras: Introduce ras inject interface
  arm64: ras: support vendor node CMN700
  arm64: ras: Introduce ras error storm mitigation
  trace, ras: add ARM RAS extension trace event

-- 
2.51.2.612.gdc70283dfc