[PATCH] ACPI: APEI: Handle repeated SEA error interrupts storm scenarios

[PATCH] ACPI: APEI: Handle repeated SEA error interrupts storm scenarios
Posted by Junhao He 3 months, 1 week ago
The do_sea() function defaults to using firmware-first mode, if supported.
It invokes ghes_notify_sea() in acpi/apei/ghes to report and handle the SEA
error. GHES uses a buffer to cache the most recent 4 kinds of SEA errors.
If the same kind of SEA error keeps occurring, GHES skips reporting it and
does not add it to the "ghes_estatus_llist" list until the cache entry
times out after 10 seconds, at which point the SEA error is processed
again.
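
Conceptually, the duplicate suppression behaves like the simplified,
self-contained sketch below (illustrative only: the real logic lives in
ghes_estatus_cached()/ghes_estatus_cache_add() in drivers/acpi/apei/ghes.c,
and the names sea_cache_entry, sea_error_cached() etc. are made up here;
the eviction policy is also simplified):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SEA_CACHE_SLOTS      4                        /* most recent 4 kinds */
#define SEA_CACHE_TIMEOUT_NS (10ULL * 1000000000ULL)  /* 10 seconds */

struct sea_cache_entry {
        uint64_t expire_ns;   /* when this entry stops suppressing reports */
        uint32_t len;         /* length of the cached error record */
        const void *record;   /* cached copy of the error status block */
};

static struct sea_cache_entry sea_cache[SEA_CACHE_SLOTS];

/* True if the record matches a still-valid cached entry, i.e. the same
 * kind of SEA error was already reported within the last 10 seconds. */
static bool sea_error_cached(const void *record, uint32_t len, uint64_t now_ns)
{
        for (int i = 0; i < SEA_CACHE_SLOTS; i++) {
                struct sea_cache_entry *e = &sea_cache[i];

                if (!e->record || now_ns >= e->expire_ns)
                        continue;               /* empty or timed out */
                if (e->len == len && !memcmp(e->record, record, len))
                        return true;            /* duplicate: skip reporting */
        }
        return false;
}

static void sea_error_cache_add(const void *record, uint32_t len, uint64_t now_ns)
{
        struct sea_cache_entry *victim = &sea_cache[0];

        /* simplified eviction: reuse the slot that expires soonest */
        for (int i = 1; i < SEA_CACHE_SLOTS; i++)
                if (sea_cache[i].expire_ns < victim->expire_ns)
                        victim = &sea_cache[i];

        victim->record = record;
        victim->len = len;
        victim->expire_ns = now_ns + SEA_CACHE_TIMEOUT_NS;
}

int main(void)
{
        const char rec[] = "same SEA error record";

        sea_error_cache_add(rec, sizeof(rec), 0);
        printf(" 1s later: %s\n", sea_error_cached(rec, sizeof(rec), 1ULL * 1000000000ULL) ?
               "suppressed as duplicate" : "reported");
        printf("11s later: %s\n", sea_error_cached(rec, sizeof(rec), 11ULL * 1000000000ULL) ?
               "suppressed as duplicate" : "reported");
        return 0;
}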

GHES invokes ghes_proc_in_irq() to handle the SEA error, which ultimately
calls memory_failure() to process the page with hardware memory corruption.
If the same SEA error appears multiple times in a row, it indicates that
the previous handling was incomplete or unable to resolve the fault. In
such cases it is more appropriate to return a failure when the same error
is encountered again, and then proceed to arm64_do_kernel_sea() for further
processing.
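
The effect on the caller can be modelled with the small standalone sketch
below (hypothetical names, not kernel code): before this patch a repeated,
already-cached error was still treated as handled, so the faulting
instruction was retried and the same SEA fired again; with this patch the
duplicate is reported as a failure and the caller can fall back to the
arm64_do_kernel_sea()/SIGBUS recovery path instead.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for ghes_in_nmi_queue_one_entry(): 0 means the error was queued
 * for processing, nonzero means it was suppressed as a duplicate of a
 * recently cached error (the new -ECANCELED case). */
static int queue_one_entry(bool already_cached, bool patched)
{
        if (already_cached)
                return patched ? -ECANCELED : 0;  /* old behaviour: silent "ok" */
        return 0;
}

/* Stand-in for the notify/do_sea() decision made on that return value. */
static const char *handle_sea(bool already_cached, bool patched)
{
        if (queue_one_entry(already_cached, patched) == 0)
                return "claimed as handled, return to faulting instruction";
        return "not handled, fall back to arm64_do_kernel_sea()/SIGBUS";
}

int main(void)
{
        printf("repeated error, before: %s\n", handle_sea(true, false));
        printf("repeated error, after:  %s\n", handle_sea(true, true));
        return 0;
}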

When hardware memory corruption occurs, a memory error interrupt is
triggered. If the kernel then accesses the erroneous data, the SEA error
exception handler is triggered as well. Both handlers call memory_failure()
to handle the faulty page.

If a memory error interrupt occurs first, followed by an SEA error
interrupt, the faulty page is first marked as poisoned by the memory error
interrupt process, and then the SEA error interrupt handling process will
send a SIGBUS signal to the process accessing the poisoned page.

However, if the SEA interrupt is reported first, the following exceptional
scenario occurs:

When a user process directly maps and accesses a page with hardware memory
corruption via mmap (for example with devmem), the page containing this
address may still be a free buddy page in the kernel. At this point, the
page is marked as "poisoned" when the SEA handling claims it via
memory_failure(). However, since the process does not request the page
through the kernel's normal memory management, the kernel cannot send a
SIGBUS signal to it, and the memory error interrupt handling path does not
support sending SIGBUS either. As a result, the process keeps accessing the
faulty page, repeatedly re-entering the SEA exception handler and leading
to an SEA error interrupt storm.
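
A devmem-style reproducer roughly looks like the sketch below (assumptions:
run as root, /dev/mem access to the range is permitted, and PHYS_ADDR is the
known faulty physical address, e.g. 0x1000093c00 from the logs). Because the
mapping goes straight to physical memory, memory_failure() has no VMA to
unmap and no owning task to signal:

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define PHYS_ADDR 0x1000093c00ULL       /* example faulty physical address */

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        off_t base = PHYS_ADDR & ~(off_t)(page - 1);
        int fd = open("/dev/mem", O_RDONLY | O_SYNC);

        if (fd < 0)
                return 1;

        volatile uint8_t *p = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, base);
        if (p == MAP_FAILED)
                return 1;

        /* Every read of the poisoned line raises another synchronous
         * external abort, so this loop keeps re-entering the SEA handler. */
        for (int i = 0; i < 16; i++)
                (void)p[PHYS_ADDR - base];

        return 0;
}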

Fix this by returning a failure when the same error is encountered again.

The following error logs illustrate the scenario using the devmem process:
  NOTICE:  SEA Handle
  NOTICE:  SpsrEl3 = 0x60001000, ELR_EL3 = 0xffffc6ab42671400
  NOTICE:  skt[0x0]die[0x0]cluster[0x0]core[0x1]
  NOTICE:  EsrEl3 = 0x92000410
  NOTICE:  PA is valid: 0x1000093c00
  NOTICE:  Hest Set GenericError Data
  [ 1419.542401][    C1] {57}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
  [ 1419.551435][    C1] {57}[Hardware Error]: event severity: recoverable
  [ 1419.557865][    C1] {57}[Hardware Error]:  Error 0, type: recoverable
  [ 1419.564295][    C1] {57}[Hardware Error]:   section_type: ARM processor error
  [ 1419.571421][    C1] {57}[Hardware Error]:   MIDR: 0x0000000000000000
  [ 1419.571434][    C1] {57}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000081000100
  [ 1419.586813][    C1] {57}[Hardware Error]:   error affinity level: 0
  [ 1419.586821][    C1] {57}[Hardware Error]:   running state: 0x1
  [ 1419.602714][    C1] {57}[Hardware Error]:   Power State Coordination Interface state: 0
  [ 1419.602724][    C1] {57}[Hardware Error]:   Error info structure 0:
  [ 1419.614797][    C1] {57}[Hardware Error]:   num errors: 1
  [ 1419.614804][    C1] {57}[Hardware Error]:    error_type: 0, cache error
  [ 1419.629226][    C1] {57}[Hardware Error]:    error_info: 0x0000000020400014
  [ 1419.629234][    C1] {57}[Hardware Error]:     cache level: 1
  [ 1419.642006][    C1] {57}[Hardware Error]:     the error has not been corrected
  [ 1419.642013][    C1] {57}[Hardware Error]:    physical fault address: 0x0000001000093c00
  [ 1419.654001][    C1] {57}[Hardware Error]:   Vendor specific error info has 48 bytes:
  [ 1419.654014][    C1] {57}[Hardware Error]:    00000000: 00000000 00000000 00000000 00000000  ................
  [ 1419.670685][    C1] {57}[Hardware Error]:    00000010: 00000000 00000000 00000000 00000000  ................
  [ 1419.670692][    C1] {57}[Hardware Error]:    00000020: 00000000 00000000 00000000 00000000  ................
  [ 1419.783606][T54990] Memory failure: 0x1000093: recovery action for free buddy page: Recovered
  [ 1419.919580][ T9955] EDAC MC0: 1 UE Multi-bit ECC on unknown memory (node:0 card:1 module:71 bank:7 row:0 col:0 page:0x1000093 offset:0xc00 grain:1 - APEI location: node:0 card:257 module:71 bank:7 row:0 col:0)
  NOTICE:  SEA Handle
  NOTICE:  SpsrEl3 = 0x60001000, ELR_EL3 = 0xffffc6ab42671400
  NOTICE:  skt[0x0]die[0x0]cluster[0x0]core[0x1]
  NOTICE:  EsrEl3 = 0x92000410
  NOTICE:  PA is valid: 0x1000093c00
  NOTICE:  Hest Set GenericError Data
  NOTICE:  SEA Handle
  NOTICE:  SpsrEl3 = 0x60001000, ELR_EL3 = 0xffffc6ab42671400
  NOTICE:  skt[0x0]die[0x0]cluster[0x0]core[0x1]
  NOTICE:  EsrEl3 = 0x92000410
  NOTICE:  PA is valid: 0x1000093c00
  NOTICE:  Hest Set GenericError Data
  ...
  ...        ---> SEA error interrupt storm happens
  ...
  NOTICE:  SEA Handle
  NOTICE:  SpsrEl3 = 0x60001000, ELR_EL3 = 0xffffc6ab42671400
  NOTICE:  skt[0x0]die[0x0]cluster[0x0]core[0x1]
  NOTICE:  EsrEl3 = 0x92000410
  NOTICE:  PA is valid: 0x1000093c00
  NOTICE:  Hest Set GenericError Data
  [ 1429.818080][ T9955] Memory failure: 0x1000093: already hardware poisoned
  [ 1429.825760][    C1] ghes_print_estatus: 1 callbacks suppressed
  [ 1429.825763][    C1] {59}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
  [ 1429.843731][    C1] {59}[Hardware Error]: event severity: recoverable
  [ 1429.861800][    C1] {59}[Hardware Error]:  Error 0, type: recoverable
  [ 1429.874658][    C1] {59}[Hardware Error]:   section_type: ARM processor error
  [ 1429.887516][    C1] {59}[Hardware Error]:   MIDR: 0x0000000000000000
  [ 1429.901159][    C1] {59}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000081000100
  [ 1429.901166][    C1] {59}[Hardware Error]:   error affinity level: 0
  [ 1429.914896][    C1] {59}[Hardware Error]:   running state: 0x1
  [ 1429.914903][    C1] {59}[Hardware Error]:   Power State Coordination Interface state: 0
  [ 1429.933319][    C1] {59}[Hardware Error]:   Error info structure 0:
  [ 1429.946261][    C1] {59}[Hardware Error]:   num errors: 1
  [ 1429.946269][    C1] {59}[Hardware Error]:    error_type: 0, cache error
  [ 1429.970847][    C1] {59}[Hardware Error]:    error_info: 0x0000000020400014
  [ 1429.970854][    C1] {59}[Hardware Error]:     cache level: 1
  [ 1429.988406][    C1] {59}[Hardware Error]:     the error has not been corrected
  [ 1430.013419][    C1] {59}[Hardware Error]:    physical fault address: 0x0000001000093c00
  [ 1430.013425][    C1] {59}[Hardware Error]:   Vendor specific error info has 48 bytes:
  [ 1430.025424][    C1] {59}[Hardware Error]:    00000000: 00000000 00000000 00000000 00000000  ................
  [ 1430.053736][    C1] {59}[Hardware Error]:    00000010: 00000000 00000000 00000000 00000000  ................
  [ 1430.066341][    C1] {59}[Hardware Error]:    00000020: 00000000 00000000 00000000 00000000  ................
  [ 1430.294255][T54990] Memory failure: 0x1000093: already hardware poisoned
  [ 1430.305518][T54990] 0x1000093: Sending SIGBUS to devmem:54990 due to hardware memory corruption

Signed-off-by: Junhao He <hejunhao3@h-partners.com>
---
 drivers/acpi/apei/ghes.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 005de10d80c3..eebda39bfc30 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -1343,8 +1343,10 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
 	ghes_clear_estatus(ghes, &tmp_header, buf_paddr, fixmap_idx);
 
 	/* This error has been reported before, don't process it again. */
-	if (ghes_estatus_cached(estatus))
+	if (ghes_estatus_cached(estatus)) {
+		rc = -ECANCELED;
 		goto no_work;
+	}
 
 	llist_add(&estatus_node->llnode, &ghes_estatus_llist);
 
-- 
2.33.0
Re: [PATCH] ACPI: APEI: Handle repeated SEA error interrupts storm scenarios
Posted by Rafael J. Wysocki 3 months ago
On Thu, Oct 30, 2025 at 8:13 AM Junhao He <hejunhao3@h-partners.com> wrote:
> [...]

This needs a response from the APEI reviewers as per MAINTAINERS, thanks!
Re: [PATCH] ACPI: APEI: Handle repeated SEA error interrupts storm scenarios
Posted by Shuai Xue 3 months ago

On 2025/11/4 00:19, Rafael J. Wysocki wrote:
> On Thu, Oct 30, 2025 at 8:13 AM Junhao He <hejunhao3@h-partners.com> wrote:
>> [...]
> 
> This needs a response from the APEI reviewers as per MAINTAINERS, thanks!

Hi, Rafael and Junhao,

Sorry for the late response. I tried to reproduce the issue, but it seems
that EINJ is broken in 6.18.0-rc1+.
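
For reference, the oops below is hit on the standard EINJ debugfs injection
sequence (roughly what a tool like einj_mem_uc drives; file names per
Documentation/firmware-guide/acpi/apei/einj.rst, the target address is just
an example):

#include <stdio.h>

static int einj_write(const char *file, const char *val)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/kernel/debug/apei/einj/%s", file);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fputs(val, f);
        return fclose(f);
}

int main(void)
{
        einj_write("error_type", "0x10");            /* memory uncorrectable, non-fatal */
        einj_write("flags", "0x2");                  /* memory address/mask valid */
        einj_write("param1", "0x1000093c00");        /* target physical address */
        einj_write("param2", "0xfffffffffffff000");  /* address mask */
        return einj_write("error_inject", "1");      /* trigger; the oops fires here */
}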

[ 3950.741186] CPU: 36 UID: 0 PID: 74112 Comm: einj_mem_uc Tainted: G            E       6.18.0-rc1+ #227 PREEMPT(none)
[ 3950.751749] Tainted: [E]=UNSIGNED_MODULE
[ 3950.755655] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS 1.91 07/29/2022
[ 3950.763797] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3950.770729] pc : acpi_os_write_memory+0x108/0x150
[ 3950.775419] lr : acpi_os_write_memory+0x28/0x150
[ 3950.780017] sp : ffff800093fbba40
[ 3950.783319] x29: ffff800093fbba40 x28: 0000000000000000 x27: 0000000000000000
[ 3950.790425] x26: 0000000000000002 x25: ffffffffffffffff x24: 000000403f20e400
[ 3950.797530] x23: 0000000000000000 x22: 0000000000000008 x21: 000000000000ffff
[ 3950.804635] x20: 0000000000000040 x19: 000000002f7d0018 x18: 0000000000000000
[ 3950.811741] x17: 0000000000000000 x16: ffffae52d36ae5d0 x15: 000000001ba8e890
[ 3950.818847] x14: 0000000000000000 x13: 0000000000000000 x12: 0000005fffffffff
[ 3950.825952] x11: 0000000000000001 x10: ffff00400d761b90 x9 : ffffae52d365b198
[ 3950.833058] x8 : 0000280000000000 x7 : 000000002f7d0018 x6 : ffffae52d5198548
[ 3950.840164] x5 : 000000002f7d1000 x4 : 0000000000000018 x3 : ffff204016735060
[ 3950.847269] x2 : 0000000000000040 x1 : 0000000000000000 x0 : ffff8000845bd018
[ 3950.854376] Call trace:
[ 3950.856814]  acpi_os_write_memory+0x108/0x150 (P)
[ 3950.861500]  apei_write+0xb4/0xd0
[ 3950.864806]  apei_exec_write_register_value+0x88/0xc0
[ 3950.869838]  __apei_exec_run+0xac/0x120
[ 3950.873659]  __einj_error_inject+0x88/0x408 [einj]
[ 3950.878434]  einj_error_inject+0x168/0x1f0 [einj]
[ 3950.883120]  error_inject_set+0x48/0x60 [einj]
[ 3950.887548]  simple_attr_write_xsigned.constprop.0.isra.0+0x14c/0x1d0
[ 3950.893964]  simple_attr_write+0x1c/0x30
[ 3950.897873]  debugfs_attr_write+0x54/0xa0
[ 3950.901870]  vfs_write+0xc4/0x240
[ 3950.905173]  ksys_write+0x70/0x108
[ 3950.908562]  __arm64_sys_write+0x20/0x30
[ 3950.912471]  invoke_syscall+0x4c/0x110
[ 3950.916207]  el0_svc_common.constprop.0+0x44/0xe8
[ 3950.920893]  do_el0_svc+0x20/0x30
[ 3950.924194]  el0_svc+0x38/0x160
[ 3950.927324]  el0t_64_sync_handler+0x98/0xe0
[ 3950.931491]  el0t_64_sync+0x184/0x188
[ 3950.935140] Code: 14000006 7101029f 54000221 d50332bf (f9000015)
[ 3950.941210] ---[ end trace 0000000000000000 ]---
[ 3950.945807] Kernel panic - not syncing: Oops: Fatal exception

We need to fix it first.

Thanks.
Shuai