[v8 PATCH 0/2] Add L1 and L2 error detection for A53, A57 and A72

Vijay Balakrishna posted 2 patches 7 months, 2 weeks ago
[v8 PATCH 0/2] Add L1 and L2 error detection for A53, A57 and A72
Posted by Vijay Balakrishna 7 months, 2 weeks ago
Hello,

This is an attempt to revive [v5] series. I have attempted to address comments
and suggestions from Marc Zyngier since [v5]. Additionally, I have extended
support for A72 processors. Testing the driver on a problematic A72 SoC
has led to the detection of Correctable Errors (CEs). Below are logs captured
from the problematic SoC during various boot instances.

[  876.896022] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'

[ 3700.978086] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'

[  976.956158] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'

[ 1427.933606] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'

[  192.959911] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'

Our primary focus is on A72. We have a significant number of A72-based systems
in our fleet, and timely replacements via monitoring CEs will be instrumental
in managing them effectively.
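
As an illustration of the monitoring side, here is a minimal user-space sketch of how a fleet agent might pull the CPU, block and count out of log lines shaped like the samples above. It assumes exactly the line format shown in those samples; struct ce_event and parse_ce_line() are hypothetical names for this example and are not part of the series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields recovered from one EDAC correctable-error (CE) log line. */
struct ce_event {
	int cpu;            /* from "instance: cpu<N>" */
	char block[8];      /* from "block: L1" or "block: L2" */
	unsigned int count; /* from "count: <N>" */
};

/*
 * Parse a line shaped like the sample logs. Returns 1 on success and 0 if
 * the line is not a CE report in that shape. The format string is inferred
 * from the samples only (an assumption), not from the EDAC core sources.
 */
int parse_ce_line(const char *line, struct ce_event *out)
{
	const char *p;

	assert(line != NULL && out != NULL);

	/* Skip the timestamp and "EDAC DEVICE0:" prefix. */
	p = strstr(line, "CE: cortex-arm64-edac instance: cpu");
	if (!p)
		return 0;

	return sscanf(p,
		      "CE: cortex-arm64-edac instance: cpu%d block: %7s count: %u",
		      &out->cpu, out->block, &out->count) == 3;
}
```

A caller could feed each kernel log line to parse_ce_line() and keep a per-CPU running total to decide when a part is due for replacement.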

I am eager to hear your suggestions and feedback on this series.

Thanks,
Vijay

[v5] https://lore.kernel.org/all/20210401110615.15326-1-s.hauer@pengutronix.de/#t
[v6] https://lore.kernel.org/all/1744241785-20256-1-git-send-email-vijayb@linux.microsoft.com/
[v7] https://lore.kernel.org/all/1744409319-24912-1-git-send-email-vijayb@linux.microsoft.com/#t

Changes since v7: 
- v5 was based on the internal product kernel; review identified the following:
- correct format specifier to print CPUID/WAY
- removal of unused dynamic attributes for edac_device_alloc_ctl_info() 
- driver remove callback return type is void

Changes since v6:
- restore the [v5] behavior of clearing the CPU/L2 syndrome registers
  in read_errors()
- upon detecting a valid error, clear syndrome registers immediately
  to avoid clobbering between the read and write (Marc)
- NULL return check for of_get_cpu_node() (Tyler)
- of_node_put() to avoid refcount issue (Tyler)
- quotes are dropped in yaml file (Krzysztof)

Changes since v5:
- rebase on v6.15-rc1
- the syndrome registers for CPU/L2 memory errors are cleared only upon
  detecting an error, followed by an isb() for synchronization (Marc)
- "edac-enabled" hunk moved to initial patch to avoid breaking virtual
  environments (Marc)
- to ensure compatibility across all three families, we are not reporting
  "L1 Dirty RAM", which is documented only in the A53 TRM
- the above prompted changing the default CPU L1 error message from
  "unknown" to "Unspecified"
- capturing CPUID/WAY information in L2 memory error log (Marc)
- module license from "GPL v2" to "GPL" (checkpatch.pl warning)
- extend support for A72
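
To make the "Unspecified" default concrete, the RAM identifier decode could look roughly like the sketch below. The field position (bits [30:24]) and the encodings are assumptions taken from my reading of the TRMs and should be checked against them; l1_ramid_to_str() is a hypothetical name, not necessarily what the patch uses.

```c
#include <stdint.h>

/* Assumed layout: RAM identifier in bits [30:24] of the syndrome value. */
#define MERRSR_RAMID_SHIFT 24
#define MERRSR_RAMID_MASK  UINT64_C(0x7f)

/*
 * Map the RAM identifier to a description. Encodings that are not common
 * to all supported cores (for example the A53-only L1 Dirty RAM) fall
 * through to the "Unspecified" default.
 */
const char *l1_ramid_to_str(uint64_t merrsr)
{
	switch ((merrsr >> MERRSR_RAMID_SHIFT) & MERRSR_RAMID_MASK) {
	case 0x00:
		return "L1-I Tag RAM";
	case 0x01:
		return "L1-I Data RAM";
	case 0x08:
		return "L1-D Tag RAM";
	case 0x09:
		return "L1-D Data RAM";
	case 0x18:
		return "L2 TLB RAM";
	default:
		return "Unspecified";
	}
}
```

With these assumed encodings, a value carrying RAM identifier 0x09 maps to the "L1-D Data RAM" text seen in the logs above.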

Changes since v4:
- Rebase on v5.12-rc5

Changes since v3:
- Add edac-enabled property to make EDAC support optional

Changes since v2:
- drop usage of virtual dt node (Robh)
- use read_sysreg_s instead of open coded variant (James Morse)
- separate error retrieving from error reporting
- use smp_call_function_single rather than smp_call_function_single_async
- make driver single instance and register all 'cpu' hierarchy up front once

Changes since v1:
- Split dt-binding into separate patch
- Sort local function variables in reverse-xmas tree order
- drop unnecessary comparison and make variable bool

Sascha Hauer (2):
  drivers/edac: Add L1 and L2 error detection for A53, A57 and A72
  dt-bindings: arm: cpus: Add edac-enabled property

 .../devicetree/bindings/arm/cpus.yaml         |   6 +
 drivers/edac/Kconfig                          |   9 +
 drivers/edac/Makefile                         |   1 +
 drivers/edac/cortex_arm64_l1_l2.c             | 229 ++++++++++++++++++
 4 files changed, 245 insertions(+)
 create mode 100644 drivers/edac/cortex_arm64_l1_l2.c


base-commit: 59c9ab3e8cc7f56cd65608f6e938b5ae96eb9cd2
-- 
2.49.0
Re: [v8 PATCH 0/2] Add L1 and L2 error detection for A53, A57 and A72
Posted by Borislav Petkov 7 months, 2 weeks ago
On Sun, May 04, 2025 at 05:27:38PM -0700, Vijay Balakrishna wrote:
> Hello,
> 
> This is an attempt to revive [v5] series. I have attempted to address comments
> and suggestions from Marc Zyngier since [v5]. Additionally, I have extended

I'd like to hear from ARM folks here, whether this makes sense to have still.

> support for A72 processors. Testing the driver on a problematic A72 SoC
> has led to the detection of Correctable Errors (CEs). Below are logs captured
> from the problematic SoC during various boot instances.
> 
> [  876.896022] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
> 
> [ 3700.978086] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
> 
> [  976.956158] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
> 
> [ 1427.933606] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
> 
> [  192.959911] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
> 
> Our primary focus is on A72. We have a significant number of A72-based systems

Then zap the support for the other CPUs as supporting those is futile.

cortex_arm64_l1_l2.c - I don't want an EDAC driver per RAS functional unit.
Call this edac_a72 or whatever, which will contain all A72 RAS functionality
support. ARM folks will give you a good idea here if you don't have.

Also, I'd need at least a reviewer entry to MAINTAINERS for patches to this
driver because you'll be the only ones testing this as you have vested
interest in this working.

The dt patch needs a reviewed-by too.

Once that is addressed, I'll take a look.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [v8 PATCH 0/2] Add L1 and L2 error detection for A53, A57 and A72
Posted by Vijay Balakrishna 7 months, 2 weeks ago
On 5/5/25 02:10, Borislav Petkov wrote:
> On Sun, May 04, 2025 at 05:27:38PM -0700, Vijay Balakrishna wrote:
>> Hello,
>>
>> This is an attempt to revive [v5] series. I have attempted to address comments
>> and suggestions from Marc Zyngier since [v5]. Additionally, I have extended
> 
> I'd like to hear from ARM folks here, whether this makes sense to have still.
> 
>> support for A72 processors. Testing the driver on a problematic A72 SoC
>> has led to the detection of Correctable Errors (CEs). Below are logs captured
>> from the problematic SoC during various boot instances.
>>
>> [  876.896022] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
>>
>> [ 3700.978086] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
>>
>> [  976.956158] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
>>
>> [ 1427.933606] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
>>
>> [  192.959911] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
>>
>> Our primary focus is on A72. We have a significant number of A72-based systems
> 
> Then zap the support for the other CPUs as supporting those is futile.
> 
> cortex_arm64_l1_l2.c - I don't want an EDAC driver per RAS functional unit.
> Call this edac_a72 or whatever, which will contain all A72 RAS functionality
> support. ARM folks will give you a good idea here if you don't have.
> 
> Also, I'd need at least a reviewer entry to MAINTAINERS for patches to this
> driver because you'll be the only ones testing this as you have vested
> interest in this working.
> 
> The dt patch needs a reviewed-by too.
> 
> Once that is addressed, I'll take a look.
> 
> Thx.
> 

Thank you, Boris.

I will soon be posting a new series featuring only A72 functionality. 
Could the ARM folks on Cc please comment on additional changes we can 
include for A72?

Tyler and I can serve as joint reviewers, and I'll update the 
MAINTAINERS file accordingly.

Krzysztof, I would appreciate your reviewed-by for the DT patch when I 
post the next version.

Thanks,
Vijay