[PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs

Gavin Shan posted 8 patches 1 week, 2 days ago
Maintainers: Dongjiu Geng <gengdongjiu1@gmail.com>, "Michael S. Tsirkin" <mst@redhat.com>, Igor Mammedov <imammedo@redhat.com>, Ani Sinha <anisinha@redhat.com>, Peter Maydell <peter.maydell@linaro.org>, Paolo Bonzini <pbonzini@redhat.com>
Posted by Gavin Shan 1 week, 2 days ago
When the host and guest use 64KiB and 4KiB page sizes respectively,
one problematic host page affects 16 guest pages, so we need to send
16 consecutive errors in this specific case.

Extend acpi_ghes_memory_errors() to support multiple CPERs after the
hunk of code to generate the GHES error status is pulled out from
ghes_gen_err_data_uncorrectable_recoverable(). The status field of
generic error status block is also updated accordingly if multiple
error data entries are contained in the generic error status block.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 hw/acpi/ghes-stub.c    |  2 +-
 hw/acpi/ghes.c         | 60 +++++++++++++++++++++++-------------------
 include/hw/acpi/ghes.h |  2 +-
 target/arm/kvm.c       |  4 ++-
 4 files changed, 38 insertions(+), 30 deletions(-)

diff --git a/hw/acpi/ghes-stub.c b/hw/acpi/ghes-stub.c
index 40f660c246..4faf573aeb 100644
--- a/hw/acpi/ghes-stub.c
+++ b/hw/acpi/ghes-stub.c
@@ -12,7 +12,7 @@
 #include "hw/acpi/ghes.h"
 
 int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
-                            uint64_t physical_address)
+                            uint64_t *addresses, uint32_t num_of_addresses)
 {
     return -1;
 }
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index a9c08e73c0..527b85c8d8 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -57,8 +57,12 @@
 /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error Section */
 #define ACPI_GHES_MEM_CPER_LENGTH           80
 
-/* Masks for block_status flags */
-#define ACPI_GEBS_UNCORRECTABLE         1
+/* Bits for block_status flags */
+#define ACPI_GEBS_UNCORRECTABLE           0
+#define ACPI_GEBS_CORRECTABLE             1
+#define ACPI_GEBS_MULTIPLE_UNCORRECTABLE  2
+#define ACPI_GEBS_MULTIPLE_CORRECTABLE    3
+#define ACPI_GEBS_ERROR_DATA_ENTRIES      4
 
 /*
  * Total size for Generic Error Status Block except Generic Error Data Entries
@@ -212,26 +216,6 @@ static void acpi_ghes_build_append_mem_cper(GArray *table,
     build_append_int_noprefix(table, 0, 7);
 }
 
-static void
-ghes_gen_err_data_uncorrectable_recoverable(GArray *block,
-                                            const uint8_t *section_type,
-                                            int data_length)
-{
-    /* invalid fru id: ACPI 4.0: 17.3.2.6.1 Generic Error Data,
-     * Table 17-13 Generic Error Data Entry
-     */
-    QemuUUID fru_id = {};
-
-    /* Build the new generic error status block header */
-    acpi_ghes_generic_error_status(block, ACPI_GEBS_UNCORRECTABLE,
-        0, 0, data_length, ACPI_CPER_SEV_RECOVERABLE);
-
-    /* Build this new generic error data entry header */
-    acpi_ghes_generic_error_data(block, section_type,
-        ACPI_CPER_SEV_RECOVERABLE, 0, 0,
-        ACPI_GHES_MEM_CPER_LENGTH, fru_id, 0);
-}
-
 /*
  * Build table for the hardware error fw_cfg blob.
  * Initialize "etc/hardware_errors" and "etc/hardware_errors_addr" fw_cfg blobs.
@@ -557,19 +541,26 @@ void ghes_record_cper_errors(AcpiGhesState *ags, const void *cper, size_t len,
 }
 
 int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
-                            uint64_t physical_address)
+                            uint64_t *addresses, uint32_t num_of_addresses)
 {
     /* Memory Error Section Type */
     const uint8_t guid[] =
           UUID_LE(0xA5BC1114, 0x6F64, 0x4EDE, 0xB8, 0x63, 0x3E, 0x83, \
                   0xED, 0x7C, 0x83, 0xB1);
+    /*
+     * invalid fru id: ACPI 4.0: 17.3.2.6.1 Generic Error Data,
+     * Table 17-13 Generic Error Data Entry
+     */
+    QemuUUID fru_id = {};
     Error *errp = NULL;
     int data_length;
     GArray *block;
+    uint32_t block_status, i;
 
     block = g_array_new(false, true /* clear */, 1);
 
-    data_length = ACPI_GHES_DATA_LENGTH + ACPI_GHES_MEM_CPER_LENGTH;
+    data_length = num_of_addresses *
+                  (ACPI_GHES_DATA_LENGTH + ACPI_GHES_MEM_CPER_LENGTH);
     /*
      * It should not run out of the preallocated memory if adding a new generic
      * error data entry
@@ -577,10 +568,25 @@ int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
     assert((data_length + ACPI_GHES_GESB_SIZE) <=
             ACPI_GHES_MAX_RAW_DATA_LENGTH);
 
-    ghes_gen_err_data_uncorrectable_recoverable(block, guid, data_length);
+    /* Build the new generic error status block header */
+    block_status = (1 << ACPI_GEBS_UNCORRECTABLE) |
+                   (num_of_addresses << ACPI_GEBS_ERROR_DATA_ENTRIES);
+    if (num_of_addresses > 1) {
+        block_status |= (1 << ACPI_GEBS_MULTIPLE_UNCORRECTABLE);
+    }
+
+    acpi_ghes_generic_error_status(block, block_status, 0, 0,
+                                   data_length, ACPI_CPER_SEV_RECOVERABLE);
 
-    /* Build the memory section CPER for above new generic error data entry */
-    acpi_ghes_build_append_mem_cper(block, physical_address);
+    for (i = 0; i < num_of_addresses; i++) {
+        /* Build generic error data entries */
+        acpi_ghes_generic_error_data(block, guid,
+                                     ACPI_CPER_SEV_RECOVERABLE, 0, 0,
+                                     ACPI_GHES_MEM_CPER_LENGTH, fru_id, 0);
+
+        /* Memory section CPER on top of the generic error data entry */
+        acpi_ghes_build_append_mem_cper(block, addresses[i]);
+    }
 
     /* Report the error */
     ghes_record_cper_errors(ags, block->data, block->len, source_id, &errp);
diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
index df2ecbf6e4..f73908985d 100644
--- a/include/hw/acpi/ghes.h
+++ b/include/hw/acpi/ghes.h
@@ -99,7 +99,7 @@ void acpi_build_hest(AcpiGhesState *ags, GArray *table_data,
 void acpi_ghes_add_fw_cfg(AcpiGhesState *vms, FWCfgState *s,
                           GArray *hardware_errors);
 int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
-                            uint64_t error_physical_addr);
+                            uint64_t *addresses, uint32_t num_of_addresses);
 void ghes_record_cper_errors(AcpiGhesState *ags, const void *cper, size_t len,
                              uint16_t source_id, Error **errp);
 
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 0d57081e69..459ca4a9b0 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2434,6 +2434,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
     ram_addr_t ram_addr;
     hwaddr paddr;
     AcpiGhesState *ags;
+    uint64_t addresses[16];
 
     assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
 
@@ -2454,10 +2455,11 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
              * later from the main thread, so doing the injection of
              * the error would be more complicated.
              */
+            addresses[0] = paddr;
             if (code == BUS_MCEERR_AR) {
                 kvm_cpu_synchronize_state(c);
                 if (!acpi_ghes_memory_errors(ags, ACPI_HEST_SRC_ID_SYNC,
-                                             paddr)) {
+                                             addresses, 1)) {
                     kvm_inject_arm_sea(c);
                 } else {
                     error_report("failed to record the error");
-- 
2.51.0
Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Philippe Mathieu-Daudé 4 days ago
On 5/11/25 12:44, Gavin Shan wrote:
> When the host and guest use 64KiB and 4KiB page sizes respectively,
> one problematic host page affects 16 guest pages, so we need to send
> 16 consecutive errors in this specific case.
> 
> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
> hunk of code to generate the GHES error status is pulled out from
> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
> generic error status block is also updated accordingly if multiple
> error data entries are contained in the generic error status block.
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>   hw/acpi/ghes-stub.c    |  2 +-
>   hw/acpi/ghes.c         | 60 +++++++++++++++++++++++-------------------
>   include/hw/acpi/ghes.h |  2 +-
>   target/arm/kvm.c       |  4 ++-
>   4 files changed, 38 insertions(+), 30 deletions(-)


> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> index a9c08e73c0..527b85c8d8 100644
> --- a/hw/acpi/ghes.c
> +++ b/hw/acpi/ghes.c
> @@ -57,8 +57,12 @@
>   /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error Section */
>   #define ACPI_GHES_MEM_CPER_LENGTH           80
>   
> -/* Masks for block_status flags */
> -#define ACPI_GEBS_UNCORRECTABLE         1
> +/* Bits for block_status flags */
> +#define ACPI_GEBS_UNCORRECTABLE           0
> +#define ACPI_GEBS_CORRECTABLE             1
> +#define ACPI_GEBS_MULTIPLE_UNCORRECTABLE  2
> +#define ACPI_GEBS_MULTIPLE_CORRECTABLE    3
> +#define ACPI_GEBS_ERROR_DATA_ENTRIES      4

Alternatively using "hw/registerfields.h" API:

   ...
   FIELD(ACPI_GEBS, MULTIPLE_CORRECTABLE, 3, 1)
   FIELD(ACPI_GEBS, ERROR_DATA_ENTRIES, 4, 10)

then use FIELD_DP32() to only set the correct bits.
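
A minimal, self-contained sketch of that idea follows. It does not include
"hw/registerfields.h"; GEBS_FIELD(), GEBS_DP32(), gebs_deposit() and
gebs_build_status() are hypothetical stand-ins written for illustration, so
the snippet compiles outside the QEMU tree. The field positions are the ones
from the patch (bits 0-3 for the flags, a 10-bit entry count starting at
bit 4).

```c
#include <stdint.h>

/* Stand-in for FIELD(): records the shift and width of each field. */
#define GEBS_FIELD(name, shift, length)            \
    enum { GEBS_##name##_SHIFT = (shift) };        \
    enum { GEBS_##name##_LENGTH = (length) };

GEBS_FIELD(UNCORRECTABLE,          0, 1)
GEBS_FIELD(CORRECTABLE,            1, 1)
GEBS_FIELD(MULTIPLE_UNCORRECTABLE, 2, 1)
GEBS_FIELD(MULTIPLE_CORRECTABLE,   3, 1)
GEBS_FIELD(ERROR_DATA_ENTRIES,     4, 10)

/* Deposit 'value' into the field, touching no bits outside of it. */
static inline uint32_t gebs_deposit(uint32_t status, unsigned shift,
                                    unsigned length, uint32_t value)
{
    uint32_t mask = ((UINT32_C(1) << length) - 1) << shift;

    return (status & ~mask) | ((value << shift) & mask);
}

/* Stand-in for FIELD_DP32(). */
#define GEBS_DP32(status, name, value)                         \
    gebs_deposit((status), GEBS_##name##_SHIFT,                \
                 GEBS_##name##_LENGTH, (value))

/* Block status for 'num_entries' uncorrectable error data entries. */
static uint32_t gebs_build_status(uint32_t num_entries)
{
    uint32_t status = 0;

    status = GEBS_DP32(status, UNCORRECTABLE, 1);
    status = GEBS_DP32(status, MULTIPLE_UNCORRECTABLE, num_entries > 1);
    status = GEBS_DP32(status, ERROR_DATA_ENTRIES, num_entries);
    return status;
}
```

Because every deposit is masked to its own field, an oversized count cannot
spill into the reserved bits above bit 13; it is truncated instead, which is
why a separate range check is still worth having.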
Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Gavin Shan 3 days, 11 hours ago
Hi Philippe,

On 11/11/25 12:48 AM, Philippe Mathieu-Daudé wrote:
> On 5/11/25 12:44, Gavin Shan wrote:
>> When the host and guest use 64KiB and 4KiB page sizes respectively,
>> one problematic host page affects 16 guest pages, so we need to send
>> 16 consecutive errors in this specific case.
>>
>> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
>> hunk of code to generate the GHES error status is pulled out from
>> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
>> generic error status block is also updated accordingly if multiple
>> error data entries are contained in the generic error status block.
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   hw/acpi/ghes-stub.c    |  2 +-
>>   hw/acpi/ghes.c         | 60 +++++++++++++++++++++++-------------------
>>   include/hw/acpi/ghes.h |  2 +-
>>   target/arm/kvm.c       |  4 ++-
>>   4 files changed, 38 insertions(+), 30 deletions(-)
> 
> 
>> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
>> index a9c08e73c0..527b85c8d8 100644
>> --- a/hw/acpi/ghes.c
>> +++ b/hw/acpi/ghes.c
>> @@ -57,8 +57,12 @@
>>   /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error Section */
>>   #define ACPI_GHES_MEM_CPER_LENGTH           80
>> -/* Masks for block_status flags */
>> -#define ACPI_GEBS_UNCORRECTABLE         1
>> +/* Bits for block_status flags */
>> +#define ACPI_GEBS_UNCORRECTABLE           0
>> +#define ACPI_GEBS_CORRECTABLE             1
>> +#define ACPI_GEBS_MULTIPLE_UNCORRECTABLE  2
>> +#define ACPI_GEBS_MULTIPLE_CORRECTABLE    3
>> +#define ACPI_GEBS_ERROR_DATA_ENTRIES      4
> 
> Alternatively using "hw/registerfields.h" API:
> 
>    ...
>    FIELD(ACPI_GEBS, MULTIPLE_CORRECTABLE, 3, 1)
>    FIELD(ACPI_GEBS, ERROR_DATA_ENTRIES, 4, 10)
> 
> then use FIELD_DP32() to only set the correct bits.
> 

Acked. It's a good point and I will do that in the next revision.

Thanks,
Gavin


Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Philippe Mathieu-Daudé 4 days ago
On 5/11/25 12:44, Gavin Shan wrote:
> When the host and guest use 64KiB and 4KiB page sizes respectively,
> one problematic host page affects 16 guest pages, so we need to send
> 16 consecutive errors in this specific case.
> 
> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
> hunk of code to generate the GHES error status is pulled out from
> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
> generic error status block is also updated accordingly if multiple
> error data entries are contained in the generic error status block.
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>   hw/acpi/ghes-stub.c    |  2 +-
>   hw/acpi/ghes.c         | 60 +++++++++++++++++++++++-------------------
>   include/hw/acpi/ghes.h |  2 +-
>   target/arm/kvm.c       |  4 ++-
>   4 files changed, 38 insertions(+), 30 deletions(-)


>   int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
> -                            uint64_t physical_address)
> +                            uint64_t *addresses, uint32_t num_of_addresses)
>   {
>       /* Memory Error Section Type */
>       const uint8_t guid[] =
>             UUID_LE(0xA5BC1114, 0x6F64, 0x4EDE, 0xB8, 0x63, 0x3E, 0x83, \
>                     0xED, 0x7C, 0x83, 0xB1);
> +    /*
> +     * invalid fru id: ACPI 4.0: 17.3.2.6.1 Generic Error Data,
> +     * Table 17-13 Generic Error Data Entry
> +     */
> +    QemuUUID fru_id = {};
>       Error *errp = NULL;
>       int data_length;
>       GArray *block;
> +    uint32_t block_status, i;
>   
>       block = g_array_new(false, true /* clear */, 1);
>   
> -    data_length = ACPI_GHES_DATA_LENGTH + ACPI_GHES_MEM_CPER_LENGTH;
> +    data_length = num_of_addresses *
> +                  (ACPI_GHES_DATA_LENGTH + ACPI_GHES_MEM_CPER_LENGTH);

Should we check num_of_addresses is in range?
Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Gavin Shan 3 days, 15 hours ago
Hi Philippe,

On 11/11/25 12:43 AM, Philippe Mathieu-Daudé wrote:
> On 5/11/25 12:44, Gavin Shan wrote:
>> When the host and guest use 64KiB and 4KiB page sizes respectively,
>> one problematic host page affects 16 guest pages, so we need to send
>> 16 consecutive errors in this specific case.
>>
>> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
>> hunk of code to generate the GHES error status is pulled out from
>> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
>> generic error status block is also updated accordingly if multiple
>> error data entries are contained in the generic error status block.
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   hw/acpi/ghes-stub.c    |  2 +-
>>   hw/acpi/ghes.c         | 60 +++++++++++++++++++++++-------------------
>>   include/hw/acpi/ghes.h |  2 +-
>>   target/arm/kvm.c       |  4 ++-
>>   4 files changed, 38 insertions(+), 30 deletions(-)
> 
> 
>>   int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
>> -                            uint64_t physical_address)
>> +                            uint64_t *addresses, uint32_t num_of_addresses)
>>   {
>>       /* Memory Error Section Type */
>>       const uint8_t guid[] =
>>             UUID_LE(0xA5BC1114, 0x6F64, 0x4EDE, 0xB8, 0x63, 0x3E, 0x83, \
>>                     0xED, 0x7C, 0x83, 0xB1);
>> +    /*
>> +     * invalid fru id: ACPI 4.0: 17.3.2.6.1 Generic Error Data,
>> +     * Table 17-13 Generic Error Data Entry
>> +     */
>> +    QemuUUID fru_id = {};
>>       Error *errp = NULL;
>>       int data_length;
>>       GArray *block;
>> +    uint32_t block_status, i;
>>       block = g_array_new(false, true /* clear */, 1);
>> -    data_length = ACPI_GHES_DATA_LENGTH + ACPI_GHES_MEM_CPER_LENGTH;
>> +    data_length = num_of_addresses *
>> +                  (ACPI_GHES_DATA_LENGTH + ACPI_GHES_MEM_CPER_LENGTH);
> 
> Should we check num_of_addresses is in range?
> 

The check is already done by the following assert().

    /*
      * It should not run out of the preallocated memory if adding a new generic
      * error data entry
      */
     assert((data_length + ACPI_GHES_GESB_SIZE) <=
             ACPI_GHES_MAX_RAW_DATA_LENGTH);

Thanks,
Gavin


Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Gavin Shan 3 days, 11 hours ago
Hi Philippe,

On 11/11/25 9:38 AM, Gavin Shan wrote:
> On 11/11/25 12:43 AM, Philippe Mathieu-Daudé wrote:
>> On 5/11/25 12:44, Gavin Shan wrote:
>>> When the host and guest use 64KiB and 4KiB page sizes respectively,
>>> one problematic host page affects 16 guest pages, so we need to send
>>> 16 consecutive errors in this specific case.
>>>
>>> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
>>> hunk of code to generate the GHES error status is pulled out from
>>> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
>>> generic error status block is also updated accordingly if multiple
>>> error data entries are contained in the generic error status block.
>>>
>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>> ---
>>>   hw/acpi/ghes-stub.c    |  2 +-
>>>   hw/acpi/ghes.c         | 60 +++++++++++++++++++++++-------------------
>>>   include/hw/acpi/ghes.h |  2 +-
>>>   target/arm/kvm.c       |  4 ++-
>>>   4 files changed, 38 insertions(+), 30 deletions(-)
>>
>>
>>>   int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
>>> -                            uint64_t physical_address)
>>> +                            uint64_t *addresses, uint32_t num_of_addresses)
>>>   {
>>>       /* Memory Error Section Type */
>>>       const uint8_t guid[] =
>>>             UUID_LE(0xA5BC1114, 0x6F64, 0x4EDE, 0xB8, 0x63, 0x3E, 0x83, \
>>>                     0xED, 0x7C, 0x83, 0xB1);
>>> +    /*
>>> +     * invalid fru id: ACPI 4.0: 17.3.2.6.1 Generic Error Data,
>>> +     * Table 17-13 Generic Error Data Entry
>>> +     */
>>> +    QemuUUID fru_id = {};
>>>       Error *errp = NULL;
>>>       int data_length;
>>>       GArray *block;
>>> +    uint32_t block_status, i;
>>>       block = g_array_new(false, true /* clear */, 1);
>>> -    data_length = ACPI_GHES_DATA_LENGTH + ACPI_GHES_MEM_CPER_LENGTH;
>>> +    data_length = num_of_addresses *
>>> +                  (ACPI_GHES_DATA_LENGTH + ACPI_GHES_MEM_CPER_LENGTH);
>>
>> Should we check num_of_addresses is in range?
>>
> 
> The check is already done by the following assert().
> 
>     /*
>       * It should not run out of the preallocated memory if adding a new generic
>       * error data entry
>       */
>      assert((data_length + ACPI_GHES_GESB_SIZE) <=
>              ACPI_GHES_MAX_RAW_DATA_LENGTH);
> 

I may have misunderstood your point. We probably need to ensure @num_of_addresses
doesn't overflow the 10-bit error data entry count field (Bit#4 - Bit#13), as Igor
suggested in another reply.
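
A rough sketch of a combined check, under a few assumptions:
ghes_num_entries_ok() is a hypothetical helper, the sizes are passed in as
parameters instead of using the ACPI_GHES_* macros, and a count of zero is
rejected. It covers both the 10-bit field limit and the raw data size limit
that the existing assert() guards.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Width of the error data entry count field: bits 4..13 of block status. */
#define GEBS_ENTRY_COUNT_BITS 10
#define GEBS_ENTRY_COUNT_MAX  ((UINT32_C(1) << GEBS_ENTRY_COUNT_BITS) - 1)

/*
 * Return true if 'num' entries of 'entry_size' bytes each, plus a status
 * block header of 'header_size' bytes, fit both the bit field and a raw
 * data area of 'max_raw_len' bytes.
 */
static bool ghes_num_entries_ok(uint32_t num, size_t entry_size,
                                size_t header_size, size_t max_raw_len)
{
    if (num == 0 || num > GEBS_ENTRY_COUNT_MAX) {
        return false;
    }
    if (entry_size == 0 || header_size > max_raw_len) {
        return false;
    }
    /* Divide instead of multiplying so the comparison cannot overflow. */
    return num <= (max_raw_len - header_size) / entry_size;
}
```

The existing assert() only bounds the total length, and data_length is a
signed int computed from an unsigned multiplication, so bounding the count
before the multiplication avoids relying on the product being representable.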

Thanks,
Gavin


Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Igor Mammedov 4 days ago
On Wed,  5 Nov 2025 21:44:49 +1000
Gavin Shan <gshan@redhat.com> wrote:

> When the host and guest use 64KiB and 4KiB page sizes respectively,
> one problematic host page affects 16 guest pages, so we need to send
> 16 consecutive errors in this specific case.

I still don't like it, since it won't fix anything in case of more than
one broken host page (in v2 the discussion quickly went the hugepages
route and the futility of recovering from them).

If having a per-vCPU source is not desirable,
can we stall all other vCPUs that touch poisoned pages until the
error is acked by the guest, and then let another vCPU queue its own error?


> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
> hunk of code to generate the GHES error status is pulled out from
> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
> generic error status block is also updated accordingly if multiple
> error data entries are contained in the generic error status block.

I don't mind much translating 64K page error into several 4K CPER
records, so this part is fine. But it's hardly a solution to the generic
problem.

> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  hw/acpi/ghes-stub.c    |  2 +-
>  hw/acpi/ghes.c         | 60 +++++++++++++++++++++++-------------------
>  include/hw/acpi/ghes.h |  2 +-
>  target/arm/kvm.c       |  4 ++-
>  4 files changed, 38 insertions(+), 30 deletions(-)
> 
...
> @@ -577,10 +568,25 @@ int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
>      assert((data_length + ACPI_GHES_GESB_SIZE) <=
>              ACPI_GHES_MAX_RAW_DATA_LENGTH);
>  
> -    ghes_gen_err_data_uncorrectable_recoverable(block, guid, data_length);
> +    /* Build the new generic error status block header */
> +    block_status = (1 << ACPI_GEBS_UNCORRECTABLE) |
> +                   (num_of_addresses << ACPI_GEBS_ERROR_DATA_ENTRIES);
                       ^^^^^^^^^^^^^^
maybe assert in case it won't fit into bit field 

> +    if (num_of_addresses > 1) {
> +        block_status |= (1 << ACPI_GEBS_MULTIPLE_UNCORRECTABLE);
> +    }
> +
> +    acpi_ghes_generic_error_status(block, block_status, 0, 0,
> +                                   data_length, ACPI_CPER_SEV_RECOVERABLE);
>  
> -    /* Build the memory section CPER for above new generic error data entry */
> -    acpi_ghes_build_append_mem_cper(block, physical_address);
> +    for (i = 0; i < num_of_addresses; i++) {
> +        /* Build generic error data entries */
> +        acpi_ghes_generic_error_data(block, guid,
> +                                     ACPI_CPER_SEV_RECOVERABLE, 0, 0,
> +                                     ACPI_GHES_MEM_CPER_LENGTH, fru_id, 0);
> +
> +        /* Memory section CPER on top of the generic error data entry */
> +        acpi_ghes_build_append_mem_cper(block, addresses[i]);
> +    }
>  
>      /* Report the error */
>      ghes_record_cper_errors(ags, block->data, block->len, source_id, &errp);
> diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
> index df2ecbf6e4..f73908985d 100644
> --- a/include/hw/acpi/ghes.h
> +++ b/include/hw/acpi/ghes.h
> @@ -99,7 +99,7 @@ void acpi_build_hest(AcpiGhesState *ags, GArray *table_data,
>  void acpi_ghes_add_fw_cfg(AcpiGhesState *vms, FWCfgState *s,
>                            GArray *hardware_errors);
>  int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
> -                            uint64_t error_physical_addr);
> +                            uint64_t *addresses, uint32_t num_of_addresses);
>  void ghes_record_cper_errors(AcpiGhesState *ags, const void *cper, size_t len,
>                               uint16_t source_id, Error **errp);
>  
> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> index 0d57081e69..459ca4a9b0 100644
> --- a/target/arm/kvm.c
> +++ b/target/arm/kvm.c
> @@ -2434,6 +2434,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>      ram_addr_t ram_addr;
>      hwaddr paddr;
>      AcpiGhesState *ags;
> +    uint64_t addresses[16];
>  
>      assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
>  
> @@ -2454,10 +2455,11 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>               * later from the main thread, so doing the injection of
>               * the error would be more complicated.
>               */
> +            addresses[0] = paddr;
>              if (code == BUS_MCEERR_AR) {
>                  kvm_cpu_synchronize_state(c);
>                  if (!acpi_ghes_memory_errors(ags, ACPI_HEST_SRC_ID_SYNC,
> -                                             paddr)) {
> +                                             addresses, 1)) {
>                      kvm_inject_arm_sea(c);
>                  } else {
>                      error_report("failed to record the error");
Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Gavin Shan 3 days, 10 hours ago
Hi Igor,

On 11/11/25 12:38 AM, Igor Mammedov wrote:
> On Wed,  5 Nov 2025 21:44:49 +1000
> Gavin Shan <gshan@redhat.com> wrote:
> 
>> When the host and guest use 64KiB and 4KiB page sizes respectively,
>> one problematic host page affects 16 guest pages, so we need to send
>> 16 consecutive errors in this specific case.
> 
> I still don't like it, since it won't fix anything in case of more than
> one broken host page (in v2 the discussion quickly went the hugepages
> route and the futility of recovering from them).
> 
> If having a per-vCPU source is not desirable,
> can we stall all other vCPUs that touch poisoned pages until the
> error is acked by the guest, and then let another vCPU queue its own error?
> 

We're trying to prevent the guest from suddenly disappearing due to a QEMU
crash, rather than to recover from the memory errors. As long as the guest
stays accessible, system administrators still get a chance to collect
important information from it.

The idea of stalling the vCPU that is accessing any poisoned page and
retrying the error delivery was proposed in v1, but was rejected.

https://lists.nongnu.org/archive/html/qemu-arm/2025-02/msg01071.html

As the intention of this series is just to improve the memory error
reporting, to avoid QEMU crash if possible, it sounds reasonable to send
16x consecutive CPERs in one shot for this specific case (4KB guest on
64KB host). As to hugetlb cases, it's a different story. If the hugetlb
folio (page) size is small enough (like 64KB), we can leverage the current
design to send consecutive CPERs. I don't think there is much we
can do if the hugetlb folio size is large (from 2MB to 16GB).
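
For illustration only, the 16 addresses for the 4KB-guest-on-64KB-host case
could be derived along these lines. split_host_page() is a hypothetical
helper, not code from this series. It assumes both page sizes are powers of
two and that the faulting host page maps to a naturally aligned, contiguous
guest physical range.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Fill 'addresses' with the guest-page-sized addresses covered by the host
 * page that contains 'fault_addr'. Returns the number of entries written,
 * or 0 if they do not fit into 'max_addresses'.
 */
static size_t split_host_page(uint64_t fault_addr,
                              uint64_t host_page_size,
                              uint64_t guest_page_size,
                              uint64_t *addresses, size_t max_addresses)
{
    uint64_t base, count, i;

    if (max_addresses == 0) {
        return 0;
    }

    /* Host page not larger than guest page: report the address as is. */
    if (guest_page_size == 0 || host_page_size <= guest_page_size) {
        addresses[0] = fault_addr;
        return 1;
    }

    count = host_page_size / guest_page_size;
    if (count > max_addresses) {
        return 0;
    }

    /* Align down to the start of the host page, then step per guest page. */
    base = fault_addr & ~(host_page_size - 1);
    for (i = 0; i < count; i++) {
        addresses[i] = base + i * guest_page_size;
    }
    return (size_t)count;
}
```

With a 64KB host page and 4KB guest pages this yields 16 entries, matching
the addresses[16] array added to kvm_arch_on_sigbus_vcpu().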

> 
>> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
>> hunk of code to generate the GHES error status is pulled out from
>> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
>> generic error status block is also updated accordingly if multiple
>> error data entries are contained in the generic error status block.
> 
> I don't mind much translating 64K page error into several 4K CPER
> records, so this part is fine. But it's hardly a solution to the generic
> problem.
> 

Note that I don't expect a memory error storm from the hardware level.
If that happens, it's a good sign that the memory DIMM is totally
broken and needs a replacement :-)

>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   hw/acpi/ghes-stub.c    |  2 +-
>>   hw/acpi/ghes.c         | 60 +++++++++++++++++++++++-------------------
>>   include/hw/acpi/ghes.h |  2 +-
>>   target/arm/kvm.c       |  4 ++-
>>   4 files changed, 38 insertions(+), 30 deletions(-)
>>
> ...
>> @@ -577,10 +568,25 @@ int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
>>       assert((data_length + ACPI_GHES_GESB_SIZE) <=
>>               ACPI_GHES_MAX_RAW_DATA_LENGTH);
>>   
>> -    ghes_gen_err_data_uncorrectable_recoverable(block, guid, data_length);
>> +    /* Build the new generic error status block header */
>> +    block_status = (1 << ACPI_GEBS_UNCORRECTABLE) |
>> +                   (num_of_addresses << ACPI_GEBS_ERROR_DATA_ENTRIES);
>                         ^^^^^^^^^^^^^^
> maybe assert in case it won't fit into bit field
> 

Yep, the same thing was suggested by Philippe.

>> +    if (num_of_addresses > 1) {
>> +        block_status |= (1 << ACPI_GEBS_MULTIPLE_UNCORRECTABLE);
>> +    }
>> +
>> +    acpi_ghes_generic_error_status(block, block_status, 0, 0,
>> +                                   data_length, ACPI_CPER_SEV_RECOVERABLE);
>>   
>> -    /* Build the memory section CPER for above new generic error data entry */
>> -    acpi_ghes_build_append_mem_cper(block, physical_address);
>> +    for (i = 0; i < num_of_addresses; i++) {
>> +        /* Build generic error data entries */
>> +        acpi_ghes_generic_error_data(block, guid,
>> +                                     ACPI_CPER_SEV_RECOVERABLE, 0, 0,
>> +                                     ACPI_GHES_MEM_CPER_LENGTH, fru_id, 0);
>> +
>> +        /* Memory section CPER on top of the generic error data entry */
>> +        acpi_ghes_build_append_mem_cper(block, addresses[i]);
>> +    }
>>   
>>       /* Report the error */
>>       ghes_record_cper_errors(ags, block->data, block->len, source_id, &errp);
>> diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
>> index df2ecbf6e4..f73908985d 100644
>> --- a/include/hw/acpi/ghes.h
>> +++ b/include/hw/acpi/ghes.h
>> @@ -99,7 +99,7 @@ void acpi_build_hest(AcpiGhesState *ags, GArray *table_data,
>>   void acpi_ghes_add_fw_cfg(AcpiGhesState *vms, FWCfgState *s,
>>                             GArray *hardware_errors);
>>   int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
>> -                            uint64_t error_physical_addr);
>> +                            uint64_t *addresses, uint32_t num_of_addresses);
>>   void ghes_record_cper_errors(AcpiGhesState *ags, const void *cper, size_t len,
>>                                uint16_t source_id, Error **errp);
>>   
>> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
>> index 0d57081e69..459ca4a9b0 100644
>> --- a/target/arm/kvm.c
>> +++ b/target/arm/kvm.c
>> @@ -2434,6 +2434,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>>       ram_addr_t ram_addr;
>>       hwaddr paddr;
>>       AcpiGhesState *ags;
>> +    uint64_t addresses[16];
>>   
>>       assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
>>   
>> @@ -2454,10 +2455,11 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>>                * later from the main thread, so doing the injection of
>>                * the error would be more complicated.
>>                */
>> +            addresses[0] = paddr;
>>               if (code == BUS_MCEERR_AR) {
>>                   kvm_cpu_synchronize_state(c);
>>                   if (!acpi_ghes_memory_errors(ags, ACPI_HEST_SRC_ID_SYNC,
>> -                                             paddr)) {
>> +                                             addresses, 1)) {
>>                       kvm_inject_arm_sea(c);
>>                   } else {
>>                       error_report("failed to record the error");
> 

Thanks,
Gavin
Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Igor Mammedov 2 days, 2 hours ago
On Tue, 11 Nov 2025 14:40:42 +1000
Gavin Shan <gshan@redhat.com> wrote:

> Hi Igor,
> 
> On 11/11/25 12:38 AM, Igor Mammedov wrote:
> > On Wed,  5 Nov 2025 21:44:49 +1000
> > Gavin Shan <gshan@redhat.com> wrote:
> >   
> >> When the host and guest use 64KiB and 4KiB page sizes respectively,
> >> one problematic host page affects 16 guest pages, so we need to send
> >> 16 consecutive errors in this specific case.
> > 
> > I still don't like it, since it won't fix anything in case of more than
> > one broken host page (in v2 the discussion quickly went the hugepages
> > route and the futility of recovering from them).
> > 
> > If having a per-vCPU source is not desirable,
> > can we stall all other vCPUs that touch poisoned pages until the
> > error is acked by the guest, and then let another vCPU queue its own error?
> >   
> 
> We're trying to prevent the guest from suddenly disappearing due to a QEMU
> crash, rather than to recover from the memory errors. As long as the guest
> stays accessible, system administrators still get a chance to collect
> important information from it.
> 
> The idea of stalling the vCPU that is accessing any poisoned page and
> retrying the error delivery was proposed in v1, but was rejected.
> 
> https://lists.nongnu.org/archive/html/qemu-arm/2025-02/msg01071.html

That depends on what outcome we wish for.
The described deadlock might even be preferable to a QEMU abort(), as it
lets the guest admin collect a VM crash dump.

But honestly I'd go with the per-vCPU approach if it's possible,
as that still gives the guest side a chance to recover.


> As the intention of this series is just to improve the memory error
> reporting, to avoid QEMU crash if possible, it sounds reasonable to send
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
About that:
this series doesn't achieve it, as QEMU would still crash if another
vCPU faults on another faulty host page (i.e. not the one we've generated CPERs for).

You also mentioned in the previous review that with the per-vCPU error
source variant QEMU would abort elsewhere (is that fixable?).

> 16x consecutive CPERs in one shot for this specific case (4KB guest on
> 64KB host).

I don't object to generating 16 CPERs per fault, as that obviously
should reduce the number of guest exits.
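
For illustration, the 4KiB-guest-on-64KiB-host case can be sketched as
below. This is only a sketch: split_host_page() is a hypothetical helper
that is not part of the posted series, and it assumes that both page sizes
are powers of two and that guest physical memory is mapped with host page
alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the posted patch): fill 'addresses' with one
 * guest-page-sized address for every guest page inside the host page that
 * contains 'paddr'. Returns the number of addresses written, or 0 if they
 * do not fit into 'max' slots.
 */
static size_t split_host_page(uint64_t paddr, uint64_t host_page_size,
                              uint64_t guest_page_size,
                              uint64_t *addresses, size_t max)
{
    uint64_t base;
    size_t count, i;

    /* Both sizes must be non-zero powers of two. */
    assert(host_page_size && !(host_page_size & (host_page_size - 1)));
    assert(guest_page_size && !(guest_page_size & (guest_page_size - 1)));

    if (guest_page_size >= host_page_size) {
        /* The guest page is at least as large: report the address as-is. */
        base = paddr;
        count = 1;
    } else {
        /* Start from the first guest page inside the affected host page. */
        base = paddr & ~(host_page_size - 1);
        count = (size_t)(host_page_size / guest_page_size);
    }

    if (count > max) {
        return 0;
    }

    for (i = 0; i < count; i++) {
        addresses[i] = base + i * guest_page_size;
    }
    return count;
}
```

With a 64KiB host page and a 4KiB guest page this produces 16 addresses,
which is the array/count pair that the extended acpi_ghes_memory_errors()
in this patch accepts.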



Given it's rather late in the release cycle,
we can probably handle the single-page case first, as in this series,
with a follow-up series to switch to the per-vCPU variant once the new merge
window opens (assuming I can coax a promise from you to follow up on that).

> As to hugetlb cases, it's different story. If the hugetlb
> folio (page) size is small enough (like 64KB), we can leverage current
> design to send consecutive CPERs. I don't think there are too much we
> can do if hugetlb folio size is large enough (from 2MB to 16GB).
> 
> >   
> >> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
> >> hunk of code to generate the GHES error status is pulled out from
> >> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
> >> generic error status block is also updated accordingly if multiple
> >> error data entries are contained in the generic error status block.  
> > 
> > I don't mind much translating 64K page error into several 4K CPER
> > records, so this part is fine. But it's hardly a solution to the generic
> > problem.
> >   
> 
> Note that I don't expect a memory error storm from the hardware level.
> In that case, it's a good sign indicating the memory DIMM has been totally
> broken and needs a replacement :-)
> 
> >>
> >> Signed-off-by: Gavin Shan <gshan@redhat.com>
> >> ---
> >>   hw/acpi/ghes-stub.c    |  2 +-
> >>   hw/acpi/ghes.c         | 60 +++++++++++++++++++++++-------------------
> >>   include/hw/acpi/ghes.h |  2 +-
> >>   target/arm/kvm.c       |  4 ++-
> >>   4 files changed, 38 insertions(+), 30 deletions(-)
> >>  
> > ...  
> >> @@ -577,10 +568,25 @@ int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
> >>       assert((data_length + ACPI_GHES_GESB_SIZE) <=
> >>               ACPI_GHES_MAX_RAW_DATA_LENGTH);
> >>   
> >> -    ghes_gen_err_data_uncorrectable_recoverable(block, guid, data_length);
> >> +    /* Build the new generic error status block header */
> >> +    block_status = (1 << ACPI_GEBS_UNCORRECTABLE) |
> >> +                   (num_of_addresses << ACPI_GEBS_ERROR_DATA_ENTRIES);  
> >                         ^^^^^^^^^^^^^^
> > maybe assert in case it won't fit into bit field
> >   
> 
> Yep, Same thing was suggested by Philippe.
> 
> >> +    if (num_of_addresses > 1) {
> >> +        block_status |= ACPI_GEBS_MULTIPLE_UNCORRECTABLE;
> >> +    }
> >> +
> >> +    acpi_ghes_generic_error_status(block, block_status, 0, 0,
> >> +                                   data_length, ACPI_CPER_SEV_RECOVERABLE);
> >>   
> >> -    /* Build the memory section CPER for above new generic error data entry */
> >> -    acpi_ghes_build_append_mem_cper(block, physical_address);
> >> +    for (i = 0; i < num_of_addresses; i++) {
> >> +        /* Build generic error data entries */
> >> +        acpi_ghes_generic_error_data(block, guid,
> >> +                                     ACPI_CPER_SEV_RECOVERABLE, 0, 0,
> >> +                                     ACPI_GHES_MEM_CPER_LENGTH, fru_id, 0);
> >> +
> >> +        /* Memory section CPER on top of the generic error data entry */
> >> +        acpi_ghes_build_append_mem_cper(block, addresses[i]);
> >> +    }
> >>   
> >>       /* Report the error */
> >>       ghes_record_cper_errors(ags, block->data, block->len, source_id, &errp);
> >> diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
> >> index df2ecbf6e4..f73908985d 100644
> >> --- a/include/hw/acpi/ghes.h
> >> +++ b/include/hw/acpi/ghes.h
> >> @@ -99,7 +99,7 @@ void acpi_build_hest(AcpiGhesState *ags, GArray *table_data,
> >>   void acpi_ghes_add_fw_cfg(AcpiGhesState *vms, FWCfgState *s,
> >>                             GArray *hardware_errors);
> >>   int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
> >> -                            uint64_t error_physical_addr);
> >> +                            uint64_t *addresses, uint32_t num_of_addresses);
> >>   void ghes_record_cper_errors(AcpiGhesState *ags, const void *cper, size_t len,
> >>                                uint16_t source_id, Error **errp);
> >>   
> >> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> >> index 0d57081e69..459ca4a9b0 100644
> >> --- a/target/arm/kvm.c
> >> +++ b/target/arm/kvm.c
> >> @@ -2434,6 +2434,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
> >>       ram_addr_t ram_addr;
> >>       hwaddr paddr;
> >>       AcpiGhesState *ags;
> >> +    uint64_t addresses[16];
> >>   
> >>       assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
> >>   
> >> @@ -2454,10 +2455,11 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
> >>                * later from the main thread, so doing the injection of
> >>                * the error would be more complicated.
> >>                */
> >> +            addresses[0] = paddr;
> >>               if (code == BUS_MCEERR_AR) {
> >>                   kvm_cpu_synchronize_state(c);
> >>                   if (!acpi_ghes_memory_errors(ags, ACPI_HEST_SRC_ID_SYNC,
> >> -                                             paddr)) {
> >> +                                             addresses, 1)) {
> >>                       kvm_inject_arm_sea(c);
> >>                   } else {
> >>                       error_report("failed to record the error");  
> >   
> 
> Thanks,
> Gavin
>
Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Jonathan Cameron via 1 week, 2 days ago
On Wed,  5 Nov 2025 21:44:49 +1000
Gavin Shan <gshan@redhat.com> wrote:

> In the situation where host and guest has 64KiB and 4KiB page sizes,
> one problematic host page affects 16 guest pages. we need to send 16
> consective errors in this specific case.
> 
> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
> hunk of code to generate the GHES error status is pulled out from
> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
> generic error status block is also updated accordingly if multiple
> error data entries are contained in the generic error status block.
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
Hi Gavin,

Mostly fine, but a few comments on the defines added and a
question on what the MULTIPLE variants are meant to mean.

> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> index a9c08e73c0..527b85c8d8 100644
> --- a/hw/acpi/ghes.c
> +++ b/hw/acpi/ghes.c
> @@ -57,8 +57,12 @@
>  /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error Section */
>  #define ACPI_GHES_MEM_CPER_LENGTH           80
>  
> -/* Masks for block_status flags */
> -#define ACPI_GEBS_UNCORRECTABLE         1
> +/* Bits for block_status flags */
> +#define ACPI_GEBS_UNCORRECTABLE           0
> +#define ACPI_GEBS_CORRECTABLE             1
> +#define ACPI_GEBS_MULTIPLE_UNCORRECTABLE  2
> +#define ACPI_GEBS_MULTIPLE_CORRECTABLE    3

So this maps to the bits in block status. 

I'm not actually sure what these multiple variants are meant to tell us.
The multiple error blocks example referred to by the spec is a way to represent
the same error applying to multiple places.  So that's one error, many blocks.
I have no idea if we set these bits in that case.

Based on a quick look I don't think Linux even takes any notice.  There
are defines in actbl1.h but I'm not seeing any use made of them.

> +#define ACPI_GEBS_ERROR_DATA_ENTRIES      4

This is bits 4-13 and the define isn't used. I'd drop it.
Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Gavin Shan 1 week, 1 day ago
Hi Jonathan and Igor,

On 11/6/25 12:14 AM, Jonathan Cameron wrote:
> On Wed,  5 Nov 2025 21:44:49 +1000
> Gavin Shan <gshan@redhat.com> wrote:
> 
>> In the situation where host and guest has 64KiB and 4KiB page sizes,
>> one problematic host page affects 16 guest pages. we need to send 16
>> consective errors in this specific case.
>>
>> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
>> hunk of code to generate the GHES error status is pulled out from
>> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
>> generic error status block is also updated accordingly if multiple
>> error data entries are contained in the generic error status block.
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
> Hi Gavin,
> 
> Mostly fine, but a few comments on the defines added and a
> question on what the multiple things are meant to mean?
> 

Thanks for your review and comments, replies as below.

>> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
>> index a9c08e73c0..527b85c8d8 100644
>> --- a/hw/acpi/ghes.c
>> +++ b/hw/acpi/ghes.c
>> @@ -57,8 +57,12 @@
>>   /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error Section */
>>   #define ACPI_GHES_MEM_CPER_LENGTH           80
>>   
>> -/* Masks for block_status flags */
>> -#define ACPI_GEBS_UNCORRECTABLE         1
>> +/* Bits for block_status flags */
>> +#define ACPI_GEBS_UNCORRECTABLE           0
>> +#define ACPI_GEBS_CORRECTABLE             1
>> +#define ACPI_GEBS_MULTIPLE_UNCORRECTABLE  2
>> +#define ACPI_GEBS_MULTIPLE_CORRECTABLE    3
> 
> So this maps to the bits in block status.
> 
> I'm not actually sure what these multiple variants are meant to tell us.
> The multiple error blocks example referred to by the spec is a way to represent
> the same error applying to multiple places.  So that's one error, many blocks.
> I have no idea if we set these bits in that case.
> 
> Based on a quick look I don't think linux even takes any notice.  THere
> are defines in actbl1.h but I'm not seeing any use made of them.
> 

I hope Igor can confirm since it was suggested by him.

It's hard to understand from the spec how exactly these multiple variants
are used. In ACPI 6.5 Table 18.11, it's explained as below.

Bit [2] - Multiple Uncorrectable Errors: If set to one, indicates that more
than one uncorrectable errors have been detected.

I don't see those multiple variants being used by Linux. So I think it's
safe to drop them.

>> +#define ACPI_GEBS_ERROR_DATA_ENTRIES      4
> 
> This is bits 4-13 and the define isn't used. I'd drop it.
> 

The definition is used in acpi_ghes_memory_errors() of this patch. However,
I don't see it being used by Linux. This field isn't used by Linux to determine
the total number of error entries. So I think I can drop it too, if Igor is OK.
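
To illustrate how a consumer can find the number of entries without the
entry count field, here is a minimal sketch that walks the entries by their
lengths instead. It is not Linux's code. The 72-byte entry header and the
offset of the Error Data Length field are assumptions based on the
revision 0x300 Generic Error Data Entry layout, and all names are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of a revision 0x300 Generic Error Data Entry header. */
#define GEDE_HEADER_SIZE         72  /* bytes before the section payload */
#define GEDE_ERROR_DATA_LEN_OFF  24  /* offset of the Error Data Length field */

/* Read a little-endian 32-bit value regardless of host endianness. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Count the error data entries found in 'data_length' bytes that follow
 * the generic error status block header. Returns -1 if an entry is
 * truncated.
 */
static int count_error_data_entries(const uint8_t *data, size_t data_length)
{
    size_t off = 0;
    int count = 0;

    while (off < data_length) {
        uint32_t payload;

        if (data_length - off < GEDE_HEADER_SIZE) {
            return -1;
        }
        payload = read_le32(data + off + GEDE_ERROR_DATA_LEN_OFF);
        if (payload > data_length - off - GEDE_HEADER_SIZE) {
            return -1;
        }
        /* Skip the header and the CPER section that follows it. */
        off += GEDE_HEADER_SIZE + payload;
        count++;
    }
    return count;
}
```

For two entries that each carry an 80-byte memory CPER, data_length is
2 * (72 + 80) = 304 bytes and the walk above finds both entries.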

Thanks,
Gavin
Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Igor Mammedov 4 days ago
On Thu, 6 Nov 2025 13:15:52 +1000
Gavin Shan <gshan@redhat.com> wrote:

> Hi Jonathan and Igor,
> 
> On 11/6/25 12:14 AM, Jonathan Cameron wrote:
> > On Wed,  5 Nov 2025 21:44:49 +1000
> > Gavin Shan <gshan@redhat.com> wrote:
> >   
> >> In the situation where host and guest has 64KiB and 4KiB page sizes,
> >> one problematic host page affects 16 guest pages. we need to send 16
> >> consective errors in this specific case.
> >>
> >> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
> >> hunk of code to generate the GHES error status is pulled out from
> >> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
> >> generic error status block is also updated accordingly if multiple
> >> error data entries are contained in the generic error status block.
> >>
> >> Signed-off-by: Gavin Shan <gshan@redhat.com>  
> > Hi Gavin,
> > 
> > Mostly fine, but a few comments on the defines added and a
> > question on what the multiple things are meant to mean?
> >   
> 
> Thanks for your review and comments, replies as below.
> 
> >> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> >> index a9c08e73c0..527b85c8d8 100644
> >> --- a/hw/acpi/ghes.c
> >> +++ b/hw/acpi/ghes.c
> >> @@ -57,8 +57,12 @@
> >>   /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error Section */
> >>   #define ACPI_GHES_MEM_CPER_LENGTH           80
> >>   
> >> -/* Masks for block_status flags */
> >> -#define ACPI_GEBS_UNCORRECTABLE         1
> >> +/* Bits for block_status flags */
> >> +#define ACPI_GEBS_UNCORRECTABLE           0
> >> +#define ACPI_GEBS_CORRECTABLE             1
> >> +#define ACPI_GEBS_MULTIPLE_UNCORRECTABLE  2
> >> +#define ACPI_GEBS_MULTIPLE_CORRECTABLE    3  
> > 
> > So this maps to the bits in block status.
> > 
> > I'm not actually sure what these multiple variants are meant to tell us.
> > The multiple error blocks example referred to by the spec is a way to represent
> > the same error applying to multiple places.  So that's one error, many blocks.
> > I have no idea if we set these bits in that case.
> > 
> > Based on a quick look I don't think linux even takes any notice.  THere
> > are defines in actbl1.h but I'm not seeing any use made of them.
> >   
> 
> I hope Igor can confirm since it was suggested by him.
> 
> It's hard to understand how exactly these multiple variants are used from the
> spec. In ACPI 6.5 Table 18.11, it's explained as below.
> 
> Bit [2] - Multiple Uncorrectable Errors: If set to one, indicates that more
> than one uncorrectable errors have been detected.
> 
> I don't see those multiple variants have been used by Linux. So I think it's
> safe to drop them.

Even though the example describes the 'same' error at different components,
the bit field descriptions don't set any limits on what 'more than one' means.

Also, from the guest POV it's multiple different pages that we are reporting
here as multiple CPERs.
It seems to me that setting *_MULTIPLE_* here is the correct thing to do.


> >> +#define ACPI_GEBS_ERROR_DATA_ENTRIES      4  
> > 
> > This is bits 4-13 and the define isn't used. I'd drop it.
> >   
> 
> The definition is used in acpi_ghes_memory_errors() of this patch. However,
> I don't see it has been used by Linux. This field isn't used by Linux to determine
> the total number of error entries. So I think I can drop it either if Igor is ok.
> 
> Thanks,
> Gavin
> 
> 
>
Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Gavin Shan 3 days, 11 hours ago
Hi Igor and Jonathan,

On 11/11/25 12:49 AM, Igor Mammedov wrote:
> On Thu, 6 Nov 2025 13:15:52 +1000
> Gavin Shan <gshan@redhat.com> wrote:
>> On 11/6/25 12:14 AM, Jonathan Cameron wrote:
>>> On Wed,  5 Nov 2025 21:44:49 +1000
>>> Gavin Shan <gshan@redhat.com> wrote:
>>>    
>>>> In the situation where host and guest has 64KiB and 4KiB page sizes,
>>>> one problematic host page affects 16 guest pages. we need to send 16
>>>> consective errors in this specific case.
>>>>
>>>> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
>>>> hunk of code to generate the GHES error status is pulled out from
>>>> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
>>>> generic error status block is also updated accordingly if multiple
>>>> error data entries are contained in the generic error status block.
>>>>
>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>> Hi Gavin,
>>>
>>> Mostly fine, but a few comments on the defines added and a
>>> question on what the multiple things are meant to mean?
>>>    
>>
>> Thanks for your review and comments, replies as below.
>>
>>>> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
>>>> index a9c08e73c0..527b85c8d8 100644
>>>> --- a/hw/acpi/ghes.c
>>>> +++ b/hw/acpi/ghes.c
>>>> @@ -57,8 +57,12 @@
>>>>    /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error Section */
>>>>    #define ACPI_GHES_MEM_CPER_LENGTH           80
>>>>    
>>>> -/* Masks for block_status flags */
>>>> -#define ACPI_GEBS_UNCORRECTABLE         1
>>>> +/* Bits for block_status flags */
>>>> +#define ACPI_GEBS_UNCORRECTABLE           0
>>>> +#define ACPI_GEBS_CORRECTABLE             1
>>>> +#define ACPI_GEBS_MULTIPLE_UNCORRECTABLE  2
>>>> +#define ACPI_GEBS_MULTIPLE_CORRECTABLE    3
>>>
>>> So this maps to the bits in block status.
>>>
>>> I'm not actually sure what these multiple variants are meant to tell us.
>>> The multiple error blocks example referred to by the spec is a way to represent
>>> the same error applying to multiple places.  So that's one error, many blocks.
>>> I have no idea if we set these bits in that case.
>>>
>>> Based on a quick look I don't think linux even takes any notice.  THere
>>> are defines in actbl1.h but I'm not seeing any use made of them.
>>>    
>>
>> I hope Igor can confirm since it was suggested by him.
>>
>> It's hard to understand how exactly these multiple variants are used from the
>> spec. In ACPI 6.5 Table 18.11, it's explained as below.
>>
>> Bit [2] - Multiple Uncorrectable Errors: If set to one, indicates that more
>> than one uncorrectable errors have been detected.
>>
>> I don't see those multiple variants have been used by Linux. So I think it's
>> safe to drop them.
> 
> even though example describes 'same' error at different components,
> the bit fields descriptions doesn't set any limits on what 'more than one' means.
> 
> Also from guest POV it's multiple different pages that we are reporting here
> as multiple CPERs.
> It seems to me that setting *_MULTIPLE_* here is correct thing to do.
> 

I don't have a strong opinion. Let's keep setting the _MULTIPLE_ flag if
Jonathan is fine with it. Again, this field isn't used by the Linux guest.

>>>> +#define ACPI_GEBS_ERROR_DATA_ENTRIES      4
>>>
>>> This is bits 4-13 and the define isn't used. I'd drop it.
>>>    
>>
>> The definition is used in acpi_ghes_memory_errors() of this patch. However,
>> I don't see it has been used by Linux. This field isn't used by Linux to determine
>> the total number of error entries. So I think I can drop it either if Igor is ok.
>>

Let's keep this field too in the next revision, if Jonathan is fine with it.

Thanks,
Gavin
Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Jonathan Cameron via 3 days, 5 hours ago
On Tue, 11 Nov 2025 14:08:13 +1000
Gavin Shan <gshan@redhat.com> wrote:

> Hi Igor and Jonathan,
> 
> On 11/11/25 12:49 AM, Igor Mammedov wrote:
> > On Thu, 6 Nov 2025 13:15:52 +1000
> > Gavin Shan <gshan@redhat.com> wrote:  
> >> On 11/6/25 12:14 AM, Jonathan Cameron wrote:  
> >>> On Wed,  5 Nov 2025 21:44:49 +1000
> >>> Gavin Shan <gshan@redhat.com> wrote:
> >>>      
> >>>> In the situation where host and guest has 64KiB and 4KiB page sizes,
> >>>> one problematic host page affects 16 guest pages. we need to send 16
> >>>> consective errors in this specific case.
> >>>>
> >>>> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
> >>>> hunk of code to generate the GHES error status is pulled out from
> >>>> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
> >>>> generic error status block is also updated accordingly if multiple
> >>>> error data entries are contained in the generic error status block.
> >>>>
> >>>> Signed-off-by: Gavin Shan <gshan@redhat.com>  
> >>> Hi Gavin,
> >>>
> >>> Mostly fine, but a few comments on the defines added and a
> >>> question on what the multiple things are meant to mean?
> >>>      
> >>
> >> Thanks for your review and comments, replies as below.
> >>  
> >>>> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> >>>> index a9c08e73c0..527b85c8d8 100644
> >>>> --- a/hw/acpi/ghes.c
> >>>> +++ b/hw/acpi/ghes.c
> >>>> @@ -57,8 +57,12 @@
> >>>>    /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error Section */
> >>>>    #define ACPI_GHES_MEM_CPER_LENGTH           80
> >>>>    
> >>>> -/* Masks for block_status flags */
> >>>> -#define ACPI_GEBS_UNCORRECTABLE         1
> >>>> +/* Bits for block_status flags */
> >>>> +#define ACPI_GEBS_UNCORRECTABLE           0
> >>>> +#define ACPI_GEBS_CORRECTABLE             1
> >>>> +#define ACPI_GEBS_MULTIPLE_UNCORRECTABLE  2
> >>>> +#define ACPI_GEBS_MULTIPLE_CORRECTABLE    3  
> >>>
> >>> So this maps to the bits in block status.
> >>>
> >>> I'm not actually sure what these multiple variants are meant to tell us.
> >>> The multiple error blocks example referred to by the spec is a way to represent
> >>> the same error applying to multiple places.  So that's one error, many blocks.
> >>> I have no idea if we set these bits in that case.
> >>>
> >>> Based on a quick look I don't think linux even takes any notice.  THere
> >>> are defines in actbl1.h but I'm not seeing any use made of them.
> >>>      
> >>
> >> I hope Igor can confirm since it was suggested by him.
> >>
> >> It's hard to understand how exactly these multiple variants are used from the
> >> spec. In ACPI 6.5 Table 18.11, it's explained as below.
> >>
> >> Bit [2] - Multiple Uncorrectable Errors: If set to one, indicates that more
> >> than one uncorrectable errors have been detected.
> >>
> >> I don't see those multiple variants have been used by Linux. So I think it's
> >> safe to drop them.  
> > 
> > even though example describes 'same' error at different components,
> > the bit fields descriptions doesn't set any limits on what 'more than one' means.
> > 
> > Also from guest POV it's multiple different pages that we are reporting here
> > as multiple CPERs.
> > It seems to me that setting *_MULTIPLE_* here is correct thing to do.
> >   
> 
> I don't have strong opinions. Lets keep to set _MULTIPLE_ flag if Jonathan
> is fine. Again, this field isn't used by Linux guest.
I don't care strongly.  Maybe we should ask for a spec clarification as I doubt
implementations will be consistent on this given the vague description and that
Linux ignores it today.

> 
> >>>> +#define ACPI_GEBS_ERROR_DATA_ENTRIES      4  
> >>>
> >>> This is bits 4-13 and the define isn't used. I'd drop it.
> >>>      
> >>
> >> The definition is used in acpi_ghes_memory_errors() of this patch. However,
> >> I don't see it has been used by Linux. This field isn't used by Linux to determine
> >> the total number of error entries. So I think I can drop it either if Igor is ok.
> >>  
> 
> Lets keep this field either in next revision if Jonathan is fine.

I'm fine with the field, but not the value.  As far as I can tell from the spec, it should
be a mask, not a single bit.

> 
> Thanks,
> Gavin
> 
> 
>
Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Gavin Shan 3 days, 4 hours ago
Hi Jonathan,

On 11/11/25 8:07 PM, Jonathan Cameron wrote:
> On Tue, 11 Nov 2025 14:08:13 +1000
> Gavin Shan <gshan@redhat.com> wrote:
>> On 11/11/25 12:49 AM, Igor Mammedov wrote:
>>> On Thu, 6 Nov 2025 13:15:52 +1000
>>> Gavin Shan <gshan@redhat.com> wrote:
>>>> On 11/6/25 12:14 AM, Jonathan Cameron wrote:
>>>>> On Wed,  5 Nov 2025 21:44:49 +1000
>>>>> Gavin Shan <gshan@redhat.com> wrote:
>>>>>       
>>>>>> In the situation where host and guest has 64KiB and 4KiB page sizes,
>>>>>> one problematic host page affects 16 guest pages. we need to send 16
>>>>>> consective errors in this specific case.
>>>>>>
>>>>>> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
>>>>>> hunk of code to generate the GHES error status is pulled out from
>>>>>> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
>>>>>> generic error status block is also updated accordingly if multiple
>>>>>> error data entries are contained in the generic error status block.
>>>>>>
>>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>>> Hi Gavin,
>>>>>
>>>>> Mostly fine, but a few comments on the defines added and a
>>>>> question on what the multiple things are meant to mean?
>>>>>       
>>>>
>>>> Thanks for your review and comments, replies as below.
>>>>   
>>>>>> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
>>>>>> index a9c08e73c0..527b85c8d8 100644
>>>>>> --- a/hw/acpi/ghes.c
>>>>>> +++ b/hw/acpi/ghes.c
>>>>>> @@ -57,8 +57,12 @@
>>>>>>     /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error Section */
>>>>>>     #define ACPI_GHES_MEM_CPER_LENGTH           80
>>>>>>     
>>>>>> -/* Masks for block_status flags */
>>>>>> -#define ACPI_GEBS_UNCORRECTABLE         1
>>>>>> +/* Bits for block_status flags */
>>>>>> +#define ACPI_GEBS_UNCORRECTABLE           0
>>>>>> +#define ACPI_GEBS_CORRECTABLE             1
>>>>>> +#define ACPI_GEBS_MULTIPLE_UNCORRECTABLE  2
>>>>>> +#define ACPI_GEBS_MULTIPLE_CORRECTABLE    3
>>>>>
>>>>> So this maps to the bits in block status.
>>>>>
>>>>> I'm not actually sure what these multiple variants are meant to tell us.
>>>>> The multiple error blocks example referred to by the spec is a way to represent
>>>>> the same error applying to multiple places.  So that's one error, many blocks.
>>>>> I have no idea if we set these bits in that case.
>>>>>
>>>>> Based on a quick look I don't think linux even takes any notice.  THere
>>>>> are defines in actbl1.h but I'm not seeing any use made of them.
>>>>>       
>>>>
>>>> I hope Igor can confirm since it was suggested by him.
>>>>
>>>> It's hard to understand how exactly these multiple variants are used from the
>>>> spec. In ACPI 6.5 Table 18.11, it's explained as below.
>>>>
>>>> Bit [2] - Multiple Uncorrectable Errors: If set to one, indicates that more
>>>> than one uncorrectable errors have been detected.
>>>>
>>>> I don't see those multiple variants have been used by Linux. So I think it's
>>>> safe to drop them.
>>>
>>> even though example describes 'same' error at different components,
>>> the bit fields descriptions doesn't set any limits on what 'more than one' means.
>>>
>>> Also from guest POV it's multiple different pages that we are reporting here
>>> as multiple CPERs.
>>> It seems to me that setting *_MULTIPLE_* here is correct thing to do.
>>>    
>>
>> I don't have strong opinions. Lets keep to set _MULTIPLE_ flag if Jonathan
>> is fine. Again, this field isn't used by Linux guest.
> I don't care strongly.  Maybe we should ask for a spec clarification as I doubt
> implementations will be consistent on this given the vague description and that
> Linux ignores it today.
> 

Google Gemini has the following answer. If it can be trusted, the flag should
be set when @num_of_addresses is larger than 1.

Quote from Google Gemini:

The system firmware sets this bit to indicate to the Operating System Power Management (OSPM)
that more than one correctable error condition has been detected and logged for the associated
hardware component since the last time the status was cleared by the software. This is crucial
because a high frequency of correctable errors often indicates a potential underlying hardware
issue that could lead to uncorrectable (and potentially fatal) errors if not addressed (e.g.,
in memory, where multiple correctable errors might trigger a spare memory operation).

>>
>>>>>> +#define ACPI_GEBS_ERROR_DATA_ENTRIES      4
>>>>>
>>>>> This is bits 4-13 and the define isn't used. I'd drop it.
>>>>>       
>>>>
>>>> The definition is used in acpi_ghes_memory_errors() of this patch. However,
>>>> I don't see it has been used by Linux. This field isn't used by Linux to determine
>>>> the total number of error entries. So I think I can drop it either if Igor is ok.
>>>>   
>>
>> Lets keep this field either in next revision if Jonathan is fine.
> 
> I'm fine with the field, but not the value.  As far as I can tell form the spec, it should
> be a mask, not a single bit.
> 

Agreed, let's keep ACPI_HEST_ERROR_ENTRY_COUNT as zero in the next revision.

Thanks,
Gavin
Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Jonathan Cameron via 3 days, 3 hours ago
On Tue, 11 Nov 2025 20:55:17 +1000
Gavin Shan <gshan@redhat.com> wrote:

> Hi Jonathan,
> 
> On 11/11/25 8:07 PM, Jonathan Cameron wrote:
> > On Tue, 11 Nov 2025 14:08:13 +1000
> > Gavin Shan <gshan@redhat.com> wrote:  
> >> On 11/11/25 12:49 AM, Igor Mammedov wrote:  
> >>> On Thu, 6 Nov 2025 13:15:52 +1000
> >>> Gavin Shan <gshan@redhat.com> wrote:  
> >>>> On 11/6/25 12:14 AM, Jonathan Cameron wrote:  
> >>>>> On Wed,  5 Nov 2025 21:44:49 +1000
> >>>>> Gavin Shan <gshan@redhat.com> wrote:
> >>>>>         
> >>>>>> In the situation where host and guest has 64KiB and 4KiB page sizes,
> >>>>>> one problematic host page affects 16 guest pages. we need to send 16
> >>>>>> consective errors in this specific case.
> >>>>>>
> >>>>>> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
> >>>>>> hunk of code to generate the GHES error status is pulled out from
> >>>>>> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
> >>>>>> generic error status block is also updated accordingly if multiple
> >>>>>> error data entries are contained in the generic error status block.
> >>>>>>
> >>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>  
> >>>>> Hi Gavin,
> >>>>>
> >>>>> Mostly fine, but a few comments on the defines added and a
> >>>>> question on what the multiple things are meant to mean?
> >>>>>         
> >>>>
> >>>> Thanks for your review and comments, replies as below.
> >>>>     
> >>>>>> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> >>>>>> index a9c08e73c0..527b85c8d8 100644
> >>>>>> --- a/hw/acpi/ghes.c
> >>>>>> +++ b/hw/acpi/ghes.c
> >>>>>> @@ -57,8 +57,12 @@
> >>>>>>     /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error Section */
> >>>>>>     #define ACPI_GHES_MEM_CPER_LENGTH           80
> >>>>>>     
> >>>>>> -/* Masks for block_status flags */
> >>>>>> -#define ACPI_GEBS_UNCORRECTABLE         1
> >>>>>> +/* Bits for block_status flags */
> >>>>>> +#define ACPI_GEBS_UNCORRECTABLE           0
> >>>>>> +#define ACPI_GEBS_CORRECTABLE             1
> >>>>>> +#define ACPI_GEBS_MULTIPLE_UNCORRECTABLE  2
> >>>>>> +#define ACPI_GEBS_MULTIPLE_CORRECTABLE    3  
> >>>>>
> >>>>> So this maps to the bits in block status.
> >>>>>
> >>>>> I'm not actually sure what these multiple variants are meant to tell us.
> >>>>> The multiple error blocks example referred to by the spec is a way to represent
> >>>>> the same error applying to multiple places.  So that's one error, many blocks.
> >>>>> I have no idea if we set these bits in that case.
> >>>>>
> >>>>> Based on a quick look I don't think linux even takes any notice.  THere
> >>>>> are defines in actbl1.h but I'm not seeing any use made of them.
> >>>>>         
> >>>>
> >>>> I hope Igor can confirm since it was suggested by him.
> >>>>
> >>>> It's hard to understand how exactly these multiple variants are used from the
> >>>> spec. In ACPI 6.5 Table 18.11, it's explained as below.
> >>>>
> >>>> Bit [2] - Multiple Uncorrectable Errors: If set to one, indicates that more
> >>>> than one uncorrectable errors have been detected.
> >>>>
> >>>> I don't see those multiple variants have been used by Linux. So I think it's
> >>>> safe to drop them.  
> >>>
> >>> even though example describes 'same' error at different components,
> >>> the bit fields descriptions doesn't set any limits on what 'more than one' means.
> >>>
> >>> Also from guest POV it's multiple different pages that we are reporting here
> >>> as multiple CPERs.
> >>> It seems to me that setting *_MULTIPLE_* here is correct thing to do.
> >>>      
> >>
> >> I don't have strong opinions. Lets keep to set _MULTIPLE_ flag if Jonathan
> >> is fine. Again, this field isn't used by Linux guest.  
> > I don't care strongly.  Maybe we should ask for a spec clarification as I doubt
> > implementations will be consistent on this given the vague description and that
> > Linux ignores it today.
> >   
> 
> Google Gemini has the following question. If it can be trusted, it should be
> set when @num_of_addresses is larger than 1.
> 
> Quota from Google Gemini:
> 
> The system firmware sets this bit to indicate to the Operating System Power Management (OSPM)
> that more than one correctable error condition has been detected and logged for the associated
> hardware component since the last time the status was cleared by the software. This is crucial
> because a high frequency of correctable errors often indicates a potential underlying hardware
> issue that could lead to uncorrectable (and potentially fatal) errors if not addressed (e.g.,
> in memory, where multiple correctable errors might trigger a spare memory operation).
> 
> >>  
> >>>>>> +#define ACPI_GEBS_ERROR_DATA_ENTRIES      4  
> >>>>>
> >>>>> This is bits 4-13 and the define isn't used. I'd drop it.
> >>>>>         
> >>>>
> >>>> The definition is used in acpi_ghes_memory_errors() of this patch. However,
> >>>> I don't see it being used by Linux, which doesn't rely on this field to determine
> >>>> the total number of error entries. So I think I can drop it as well if Igor is ok.
> >>>>     
> >>
> >> Let's keep this field as well in the next revision if Jonathan is fine.
> > 
> > I'm fine with the field, but not the value.  As far as I can tell from the spec, it should
> > be a mask, not a single bit.
> >   
> 
> Agreed, let's keep ACPI_HEST_ERROR_ENTRY_COUNT as zero in the next revision.

I'm even more confused now.  The GEBS Error Data Entry Count should be a field spanning
bits 13:4, and its value should be the number of entries in the record, so 1, 4, or 16
depending on the page size.

So a define with the value 4 is garbage as a field value. If it were named
DATA_ENTRIES_SHIFT then I'd be much happier.
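
To make the arithmetic behind "1, 4, or 16" concrete, here is a minimal standalone
sketch in plain C. It is not QEMU code: guest_pages_per_host_page() and
fill_error_addresses() are hypothetical helpers made up for illustration, and
both page sizes are assumed to be powers of two.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of QEMU: how many guest pages does one
 * poisoned host page cover?  Both sizes are assumed to be powers of two.
 */
static uint32_t guest_pages_per_host_page(uint64_t host_page_size,
                                          uint64_t guest_page_size)
{
    assert(host_page_size != 0 && guest_page_size != 0);
    if (host_page_size <= guest_page_size) {
        return 1;   /* one host page never spans more than one guest page */
    }
    return (uint32_t)(host_page_size / guest_page_size);
}

/*
 * Hypothetical helper: fill @addresses with the guest-page-sized slices of
 * the host page that contains @fault_addr.  At most @max_entries slots are
 * written; the number of entries actually written is returned.
 */
static uint32_t fill_error_addresses(uint64_t fault_addr,
                                     uint64_t host_page_size,
                                     uint64_t guest_page_size,
                                     uint64_t *addresses,
                                     uint32_t max_entries)
{
    uint32_t n = guest_pages_per_host_page(host_page_size, guest_page_size);
    uint64_t base = fault_addr & ~(host_page_size - 1);  /* host page start */

    if (n > max_entries) {
        n = max_entries;
    }
    for (uint32_t i = 0; i < n; i++) {
        addresses[i] = base + (uint64_t)i * guest_page_size;
    }
    return n;
}
```

With a 64KiB host page and a 4KiB guest page this yields 16 addresses, which is
the count that would go into bits 13:4 of the block status.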


Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Gavin Shan 3 days, 3 hours ago
Hi Jonathan,

On 11/11/25 9:55 PM, Jonathan Cameron wrote:
> I'm even more confused now.  The GEBS Error Data entry count should be field from 13:4
> and the value taken should be the number of entries in the record, so 1, 4, 16 depending
> on the page size.
> 
> So that define of the value 4 is garbage. If it were DATA_ENTRIES_SHIFT then I'd be much happier.
> 

My bad. I misunderstood your point. It will be fixed by using APIs from
"hw/registerfields.h" as suggested by Philippe in another reply.

   ...
   FIELD(ACPI_GEBS, MULTIPLE_CORRECTABLE, 3, 1)
   FIELD(ACPI_GEBS, ERROR_DATA_ENTRIES, 4, 10)

   then use FIELD_DP32() to only set the correct bits.
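
As a rough illustration of that plan, here is a minimal standalone sketch in
plain C. It does not use the real "hw/registerfields.h" macros: gebs_deposit32()
and gebs_uncorrectable_status() are hypothetical names, and the GEBS_* constants
simply restate the bit positions discussed in this thread. Setting the
"multiple" bit whenever more than one entry is reported is an assumption that
follows Igor's reading; the thread treats it as a judgment call.

```c
#include <assert.h>
#include <stdint.h>

/* Block status layout as discussed in this thread (names are illustrative). */
#define GEBS_UNCORRECTABLE_SHIFT           0
#define GEBS_MULTIPLE_UNCORRECTABLE_SHIFT  2
#define GEBS_ERROR_DATA_ENTRIES_SHIFT      4
#define GEBS_ERROR_DATA_ENTRIES_LENGTH     10   /* bits 13:4 */

/*
 * Hypothetical helper: write @field into the @length bits of @value that
 * start at bit @shift, leaving all other bits untouched.
 */
static uint32_t gebs_deposit32(uint32_t value, unsigned shift,
                               unsigned length, uint32_t field)
{
    uint32_t mask;

    assert(length > 0 && shift < 32 && length <= 32 - shift);
    mask = (~0U >> (32 - length)) << shift;
    return (value & ~mask) | ((field << shift) & mask);
}

/* Build a block status for @num_entries uncorrectable error data entries. */
static uint32_t gebs_uncorrectable_status(uint32_t num_entries)
{
    uint32_t status = 0;

    status = gebs_deposit32(status, GEBS_UNCORRECTABLE_SHIFT, 1, 1);
    /* Assumption: "multiple" is set when more than one entry is reported. */
    status = gebs_deposit32(status, GEBS_MULTIPLE_UNCORRECTABLE_SHIFT, 1,
                            num_entries > 1);
    status = gebs_deposit32(status, GEBS_ERROR_DATA_ENTRIES_SHIFT,
                            GEBS_ERROR_DATA_ENTRIES_LENGTH, num_entries);
    return status;
}
```

The point of the shift-plus-length form is that the entry count lands in
bits 13:4 as a value, rather than a single bit being set at position 4.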

Thanks,
Gavin
Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Posted by Jonathan Cameron 3 days, 2 hours ago
On Tue, 11 Nov 2025 22:19:18 +1000
Gavin Shan <gshan@redhat.com> wrote:

> My bad. I misunderstood your point. It will be fixed by using APIs from
> "hw/registerfields.h" as suggested by Philippe in another reply.
> 
>    ...
>    FIELD(ACPI_GEBS, MULTIPLE_CORRECTABLE, 3, 1)
>    FIELD(ACPI_GEBS, ERROR_DATA_ENTRIES, 4, 10)
> 
>    then use FIELD_DP32() to only set the correct bits.
> 
Perfect. Thanks!

J