Hello,

This is an attempt to revive [v5] series. I have attempted to address comments
and suggestions from Marc Zyngier since [v5]. Additionally, I have extended
support for A72 processors. Testing the driver on a problematic A72 SoC
has led to the detection of Correctable Errors (CEs). Below are logs captured
from the problematic SoC during various boot instances.

[ 876.896022] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'

[ 3700.978086] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'

[ 976.956158] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'

[ 1427.933606] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'

[ 192.959911] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'

Our primary focus is on A72. We have a significant number of A72-based systems
in our fleet, and timely replacements via monitoring CEs will be instrumental
in managing them effectively.

I am eager to hear your suggestions and feedback on this series.

Thanks,
Vijay

[v5] https://lore.kernel.org/all/20210401110615.15326-1-s.hauer@pengutronix.de/#t
[v6] https://lore.kernel.org/all/1744241785-20256-1-git-send-email-vijayb@linux.microsoft.com/
[v7] https://lore.kernel.org/all/1744409319-24912-1-git-send-email-vijayb@linux.microsoft.com/#t

Changes since v7:
- v5 was based on the internal product kernel, identified following upon review
- correct format specifier to print CPUID/WAY
- removal of unused dynamic attributes for edac_device_alloc_ctl_info()
- driver remove callback return type is void

Changes since v6:
- restore the change made in [v5] to clear CPU/L2 syndrome registers
  back to read_errors()
- upon detecting a valid error, clear syndrome registers immediately
  to avoid clobbering between the read and write (Marc)
- NULL return check for of_get_cpu_node() (Tyler)
- of_node_put() to avoid refcount issue (Tyler)
- quotes are dropped in yaml file (Krzysztof)

Changes since v5:
- rebase on v6.15-rc1
- the syndrome registers for CPU/L2 memory errors are cleared only upon
  detecting an error and an isb() after for synchronization (Marc)
- "edac-enabled" hunk moved to initial patch to avoid breaking virtual
  environments (Marc)
- to ensure compatibility across all three families, we are not reporting
  "L1 Dirty RAM," documented only in the A53 TRM
- above prompted changing default CPU L1 error message from "unknown"
  to "Unspecified"
- capturing CPUID/WAY information in L2 memory error log (Marc)
- module license from "GPL v2" to "GPL" (checkpatch.pl warning)
- extend support for A72

Changes since v4:
- Rebase on v5.12-rc5

Changes since v3:
- Add edac-enabled property to make EDAC support optional

Changes since v2:
- drop usage of virtual dt node (Robh)
- use read_sysreg_s instead of open coded variant (James Morse)
- separate error retrieving from error reporting
- use smp_call_function_single rather than smp_call_function_single_async
- make driver single instance and register all 'cpu' hierarchy up front once

Changes since v1:
- Split dt-binding into separate patch
- Sort local function variables in reverse-xmas tree order
- drop unnecessary comparison and make variable bool

Sascha Hauer (2):
  drivers/edac: Add L1 and L2 error detection for A53, A57 and A72
  dt-bindings: arm: cpus: Add edac-enabled property

 .../devicetree/bindings/arm/cpus.yaml |   6 +
 drivers/edac/Kconfig                  |   9 +
 drivers/edac/Makefile                 |   1 +
 drivers/edac/cortex_arm64_l1_l2.c     | 229 ++++++++++++++++++
 4 files changed, 245 insertions(+)
 create mode 100644 drivers/edac/cortex_arm64_l1_l2.c

base-commit: 59c9ab3e8cc7f56cd65608f6e938b5ae96eb9cd2
-- 
2.49.0
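
For readers wondering how the CE reports quoted in the cover letter might feed fleet monitoring, here is a minimal userspace sketch (not part of the series) that pulls the CPU number, block name, and count out of one such dmesg line. The struct ce_event type and the parse_ce_line() helper are hypothetical names introduced only for illustration, and the match string assumes the "cortex-arm64-edac" control name seen in the logs above, which may change if the driver is renamed.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one parsed correctable-error (CE) report. */
struct ce_event {
	unsigned int cpu;	/* from "instance: cpu<N>" */
	char block[8];		/* from "block: <name>", e.g. "L1" */
	unsigned int count;	/* from "count: <N>" */
};

/*
 * Parse one dmesg line in the format quoted in the cover letter.
 * Returns 0 on success, -1 if the line is not a matching CE report.
 * The tag below is an assumption taken from those log lines.
 */
int parse_ce_line(const char *line, struct ce_event *ev)
{
	static const char tag[] = "CE: cortex-arm64-edac instance: cpu";
	const char *p = strstr(line, tag);

	if (!p)
		return -1;

	/* Skip past the tag so that p points at the CPU number. */
	p += sizeof(tag) - 1;

	if (sscanf(p, "%u block: %7s count: %u",
		   &ev->cpu, ev->block, &ev->count) != 3)
		return -1;

	return 0;
}
```

A monitoring agent could call parse_ce_line() on each kernel log line and accumulate ev.count per CPU; what threshold should trigger a replacement is a policy decision that this thread does not specify.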
On Sun, May 04, 2025 at 05:27:38PM -0700, Vijay Balakrishna wrote:
> Hello,
>
> This is an attempt to revive [v5] series. I have attempted to address comments
> and suggestions from Marc Zyngier since [v5]. Additionally, I have extended

I'd like to hear from ARM folks here, whether this makes sense to have still.

> support for A72 processors. Testing the driver on a problematic A72 SoC
> has led to the detection of Correctable Errors (CEs). Below are logs captured
> from the problematic SoC during various boot instances.
>
> [ 876.896022] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
>
> [ 3700.978086] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
>
> [ 976.956158] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
>
> [ 1427.933606] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
>
> [ 192.959911] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
>
> Our primary focus is on A72. We have a significant number of A72-based systems

Then zap the support for the other CPUs as supporting those is futile.

cortex_arm64_l1_l2.c - I don't want an EDAC driver per RAS functional unit.
Call this edac_a72 or whatever, which will contain all A72 RAS functionality
support. ARM folks will give you a good idea here if you don't have.

Also, I'd need at least a reviewer entry to MAINTAINERS for patches to this
driver because you'll be the only ones testing this as you have vested
interest in this working.

The dt patch needs a reviewed-by too.

Once that is addressed, I'll take a look.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

On 5/5/25 02:10, Borislav Petkov wrote:
> On Sun, May 04, 2025 at 05:27:38PM -0700, Vijay Balakrishna wrote:
>> Hello,
>>
>> This is an attempt to revive [v5] series. I have attempted to address comments
>> and suggestions from Marc Zyngier since [v5]. Additionally, I have extended
>
> I'd like to hear from ARM folks here, whether this makes sense to have still.
>
>> support for A72 processors. Testing the driver on a problematic A72 SoC
>> has led to the detection of Correctable Errors (CEs). Below are logs captured
>> from the problematic SoC during various boot instances.
>>
>> [ 876.896022] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
>>
>> [ 3700.978086] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
>>
>> [ 976.956158] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
>>
>> [ 1427.933606] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
>>
>> [ 192.959911] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
>>
>> Our primary focus is on A72. We have a significant number of A72-based systems
>
> Then zap the support for the other CPUs as supporting those is futile.
>
> cortex_arm64_l1_l2.c - I don't want an EDAC driver per RAS functional unit.
> Call this edac_a72 or whatever, which will contain all A72 RAS functionality
> support. ARM folks will give you a good idea here if you don't have.
>
> Also, I'd need at least a reviewer entry to MAINTAINERS for patches to this
> driver because you'll be the only ones testing this as you have vested
> interest in this working.
>
> The dt patch needs a reviewed-by too.
>
> Once that is addressed, I'll take a look.
>
> Thx.
>

Thank you, Boris.

I will soon be posting a new series featuring only A72 functionality. Could
the ARM folks on Cc please comment on additional changes we can include for
A72?

Tyler and I can serve as joint reviewers, and I'll update the MAINTAINERS
file accordingly.

Krzysztof, I would appreciate your reviewed-by for the DT patch when I post
the next version.

Thanks,
Vijay