[PATCH v4 0/7] Add ACPI CPER firmware first error injection on ARM emulation

Mauro Carvalho Chehab posted 7 patches 3 months, 3 weeks ago
There is a newer version of this series
Testing OS kernel ACPI APEI CPER support is tricky, as it depends on
having special-purpose hardware and/or a special-purpose BIOS.

With QEMU, it becomes a lot easier, as it can be done via QMP.

This series adds support for injecting CPER records on ARM emulation.

The QEMU side changes add a QAPI command able to inject CPER errors
on ARM. It takes a raw data parameter, which makes it very flexible.
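
As a rough illustration of that interface, the sketch below builds the JSON
for such a command. The command name and the argument names ("ghes-cper",
"cper", "notification-type", "raw-data") are the ones visible in the QMP
transcript in this letter; the helper name build_ghes_cper_command is
hypothetical and is not part of the series.

```python
import json


def build_ghes_cper_command(guid_bytes: bytes, payload: bytes) -> str:
    """Return the JSON text of a ghes-cper QMP command (illustrative helper)."""
    if len(guid_bytes) != 16:
        raise ValueError("the GUID must be exactly 16 bytes")
    command = {
        "execute": "ghes-cper",
        "arguments": {
            "cper": {
                # Both fields travel as plain lists of byte values.
                "notification-type": list(guid_bytes),
                "raw-data": list(payload),
            }
        },
    }
    return json.dumps(command)


if __name__ == "__main__":
    # Dummy GUID and payload, only to show the shape of the message.
    print(build_ghes_cper_command(bytes(16), b"\x01\x02\x03"))
```

Sending the resulting text requires a QMP connection that has already
completed the qmp_capabilities handshake, which is not shown here.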

The final patch provides a script that implements ARM Processor CPER
error injection according to the ACPI 6.x and UEFI 2.9A/2.10 specs,
via QMP.

Injecting such errors can be done using the provided script:

	$ ./scripts/ghes_inject.py arm 
		 {"QMP": {"version": {"qemu": {"micro": 50, "minor": 0, "major": 9}, "package": "v9.0.0-2621-g3de6991b870a"}, "capabilities": ["oob"]}}
	{ "execute": "qmp_capabilities" } 
		 {"return": {}}
	{ "execute": "ghes-cper", "arguments": {"cper": {"notification-type": [22, 61, 158, 225, 17, 188, 228, 17, 156, 170, 194, 5, 29, 93, 70, 176], "raw-data": [0, 0, 0, 0, 1, 0, 0, 0, 72, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 32, 4, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 239, 190, 173, 222, 0, 0, 0, 0, 173, 11, 186, 171, 0, 0, 0, 0]}} }
		 {"return": {}}
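
The 16 bytes passed as "notification-type" above can be read as a GUID in
the usual mixed-endian layout. A minimal sketch using only the Python
standard library; as far as I can tell, the decoded value matches the ARM
Processor Error Section type GUID from the UEFI spec:

```python
import uuid

# Byte list copied from the "notification-type" field of the transcript.
guid_bytes = bytes([22, 61, 158, 225, 17, 188, 228, 17,
                    156, 170, 194, 5, 29, 93, 70, 176])

# bytes_le treats the first three GUID fields as little-endian.
guid = uuid.UUID(bytes_le=guid_bytes)
print(guid)
```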

This produces a simple CPER record, which is properly handled by the
Linux kernel:

[ 5876.041410] {18}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 5876.041775] {18}[Hardware Error]: event severity: recoverable
[ 5876.042023] {18}[Hardware Error]:  Error 0, type: recoverable
[ 5876.042280] {18}[Hardware Error]:   section_type: ARM processor error
[ 5876.042538] {18}[Hardware Error]:   MIDR: 0x0000000000000000
[ 5876.042781] {18}[Hardware Error]:   Error info structure 0:
[ 5876.043013] {18}[Hardware Error]:   num errors: 2
[ 5876.043222] {18}[Hardware Error]:    error_type: 0x02: cache error
[ 5876.043500] {18}[Hardware Error]:    error_info: 0x0000000000000000
[ 5876.043800] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
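
The fields printed above can be recovered from the raw-data bytes. A minimal
sketch, assuming the packed little-endian layout of the UEFI ARM Processor
Error Section (a 40-byte header followed by 32-byte error information
entries). The field names and the decode() helper are illustrative and are
not taken from the series; the payload is rebuilt with struct instead of
being copied byte by byte.

```python
import struct

# Assumed layouts: validation bits, err_info_num, context_info_num,
# section length, affinity (+3 reserved), MPIDR, MIDR, running state, PSCI.
HEADER = struct.Struct("<IHHIB3xQQII")      # 40 bytes
# version, length, validation bits, type, multiple_error, flags,
# error_info, virtual fault address, physical fault address.
ERR_INFO = struct.Struct("<BBHBHBQQQ")      # 32 bytes

# Rebuild the 72-byte payload of the simple example.
raw = (HEADER.pack(0, 1, 0, 72, 0, 0, 0, 0, 0) +
       ERR_INFO.pack(0, 32, 0x0004, 0x02, 1, 0, 0, 0xDEADBEEF, 0xABBA0BAD))


def decode(data: bytes) -> dict:
    """Unpack the section header and every error information entry."""
    (_valid, n_err, _n_ctx, length, _affinity,
     mpidr, midr, _running, _psci) = HEADER.unpack_from(data, 0)
    errors = []
    for i in range(n_err):
        offset = HEADER.size + i * ERR_INFO.size
        (_ver, _len, _evalid, etype, _multi, _flags,
         info, vaddr, paddr) = ERR_INFO.unpack_from(data, offset)
        errors.append({"type": etype, "error_info": info,
                       "virt_addr": vaddr, "phys_addr": paddr})
    return {"section_length": length, "mpidr": mpidr,
            "midr": midr, "errors": errors}


decoded = decode(raw)
print(len(raw), hex(decoded["errors"][0]["type"]))
```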

More complex use cases are also possible, such as:

	$ ./scripts/ghes_inject.py arm --mpidr 0x444 --running --affinity 1 --error-info 12345678 --vendor 0x13,123,4,5,1 --ctx-array 0,1,2,3,4,5 -t cache tlb bus vendor tlb,vendor
		 {"QMP": {"version": {"qemu": {"micro": 50, "minor": 0, "major": 9}, "package": "v9.0.0-2621-g3de6991b870a"}, "capabilities": ["oob"]}}
	{ "execute": "qmp_capabilities" } 
		 {"return": {}}
	{ "execute": "ghes-cper", "arguments": {"cper": {"notification-type": [22, 61, 158, 225, 17, 188, 228, 17, 156, 170, 194, 5, 29, 93, 70, 176], "raw-data": [7, 0, 0, 0, 5, 0, 1, 0, 13, 1, 0, 0, 1, 0, 0, 0, 68, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 32, 4, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 239, 190, 173, 222, 0, 0, 0, 0, 173, 11, 186, 171, 0, 0, 0, 0, 0, 32, 4, 0, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 239, 190, 173, 222, 0, 0, 0, 0, 173, 11, 186, 171, 0, 0, 0, 0, 0, 32, 4, 0, 8, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 239, 190, 173, 222, 0, 0, 0, 0, 173, 11, 186, 171, 0, 0, 0, 0, 0, 32, 4, 0, 16, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 239, 190, 173, 222, 0, 0, 0, 0, 173, 11, 186, 171, 0, 0, 0, 0, 0, 32, 0, 0, 20, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 239, 190, 173, 222, 0, 0, 0, 0, 173, 11, 186, 171, 0, 0, 0, 0, 0, 0, 5, 0, 56, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19, 123, 4, 5, 1]}} }
		 {"return": {}}
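
The section length field in that payload (bytes 13, 1, 0, 0) can be
cross-checked by hand. A small sketch of the arithmetic, assuming a 40-byte
section header, 32-byte error information entries and an 8-byte header in
front of each register context; the constant and function names are
illustrative only:

```python
HEADER_LEN = 40        # assumed fixed part of the ARM processor section
ERR_INFO_LEN = 32      # assumed size of one error information entry
CTX_HEADER_LEN = 8     # assumed header before each register context


def section_length(n_err_info: int, ctx_sizes: list, vendor_len: int) -> int:
    """Total section size: header, error infos, contexts, vendor bytes."""
    contexts = sum(CTX_HEADER_LEN + size for size in ctx_sizes)
    return HEADER_LEN + n_err_info * ERR_INFO_LEN + contexts + vendor_len


# Five error infos, one 56-byte register context, five vendor bytes.
total = section_length(5, [56], 5)
print(total, list(total.to_bytes(4, "little")))
```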

[ 5964.134325] {19}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 5964.134692] {19}[Hardware Error]: event severity: recoverable
[ 5964.134942] {19}[Hardware Error]:  Error 0, type: recoverable
[ 5964.135200] {19}[Hardware Error]:   section_type: ARM processor error
[ 5964.135466] {19}[Hardware Error]:   MIDR: 0x0000000000000000
[ 5964.135700] {19}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000000000444
[ 5964.136025] {19}[Hardware Error]:   error affinity level: 1
[ 5964.136255] {19}[Hardware Error]:   running state: 0x1
[ 5964.136468] {19}[Hardware Error]:   Power State Coordination Interface state: 0
[ 5964.136767] {19}[Hardware Error]:   Error info structure 0:
[ 5964.137001] {19}[Hardware Error]:   num errors: 2
[ 5964.137210] {19}[Hardware Error]:    error_type: 0x02: cache error
[ 5964.137472] {19}[Hardware Error]:    error_info: 0x0000000000000000
[ 5964.137737] {19}[Hardware Error]:   Error info structure 1:
[ 5964.137976] {19}[Hardware Error]:   num errors: 2
[ 5964.138192] {19}[Hardware Error]:    error_type: 0x04: TLB error
[ 5964.138459] {19}[Hardware Error]:    error_info: 0x0000000000000000
[ 5964.138727] {19}[Hardware Error]:   Error info structure 2:
[ 5964.138967] {19}[Hardware Error]:   num errors: 2
[ 5964.139185] {19}[Hardware Error]:    error_type: 0x08: bus error
[ 5964.139451] {19}[Hardware Error]:    error_info: 0x0000000000000000
[ 5964.139751] {19}[Hardware Error]:   Error info structure 3:
[ 5964.139993] {19}[Hardware Error]:   num errors: 2
[ 5964.140210] {19}[Hardware Error]:    error_type: 0x10: micro-architectural error
[ 5964.140522] {19}[Hardware Error]:    error_info: 0x0000000000000000
[ 5964.140790] {19}[Hardware Error]:   Error info structure 4:
[ 5964.141030] {19}[Hardware Error]:   num errors: 2
[ 5964.141261] {19}[Hardware Error]:    error_type: 0x14: TLB error|micro-architectural error
[ 5964.141599] {19}[Hardware Error]:   Context info structure 0:
[ 5964.141843] {19}[Hardware Error]:    register context type: AArch64 EL1 context registers
[ 5964.142195] {19}[Hardware Error]:    00000000: 00000000 00000000 00000001 00000000
[ 5964.142534] {19}[Hardware Error]:    00000010: 00000002 00000000 00000003 00000000
[ 5964.142867] {19}[Hardware Error]:    00000020: 00000004 00000000 00000005 00000000
[ 5964.143193] {19}[Hardware Error]:    00000030: 00000000 00000000
[ 5964.143464] {19}[Hardware Error]:   Vendor specific error info has 5 bytes:
[ 5964.143750] {19}[Hardware Error]:    00000000: 13 7b 04 05 01                                   .{...
[ 5964.144164] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
[ 5964.144483] [Firmware Warn]: GHES: Unhandled processor error type 0x04: TLB error
[ 5964.144793] [Firmware Warn]: GHES: Unhandled processor error type 0x08: bus error
[ 5964.145099] [Firmware Warn]: GHES: Unhandled processor error type 0x10: micro-architectural error
[ 5964.145454] [Firmware Warn]: GHES: Unhandled processor error type 0x14: TLB error|micro-architectural error
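
The error_type values in the log above are bit masks, which is why 0x14
shows up as a combination of two names. A short sketch of that decoding,
using only the bit/name pairs visible in the kernel output; the table and
the helper name are illustrative, not code from the kernel or from the
series:

```python
# Bit values and names as they appear in the kernel log above.
ARM_ERR_TYPES = {
    0x02: "cache error",
    0x04: "TLB error",
    0x08: "bus error",
    0x10: "micro-architectural error",
}


def describe_error_type(value: int) -> str:
    """Format an error type bit mask in the style of the kernel log."""
    names = [name for bit, name in ARM_ERR_TYPES.items() if value & bit]
    return f"0x{value:02x}: " + ("|".join(names) if names else "unknown")


for error_type in (0x02, 0x04, 0x08, 0x10, 0x14):
    print(describe_error_type(error_type))
```

Note that the script's "vendor" type ends up as bit 0x10, which the kernel
reports as a micro-architectural error.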

---

v4:
- CPER generation moved to happen outside QEMU;
- One patch adding support for mpidr query was removed.

v3:
- patch 1: cleanups with some comment changes, plus one more place where
  the poweroff GPIO define should be used. No changes to the other patches
  (except for conflict resolution).

v2:
- added a new patch using a define for GPIO power pin;
- patch 2 changed to also use a define for generic error GPIO pin;
- a couple of cleanups in patch 2, removing unneeded else clauses.



Jonathan Cameron (1):
  acpi/ghes: Support GPIO error source

Mauro Carvalho Chehab (6):
  arm/virt: place power button pin number on a define
  acpi/generic_event_device: add an APEI error device
  arm/virt: Wire up GPIO error source for ACPI / GHES
  qapi/ghes-cper: add an interface to do generic CPER error injection
  acpi/ghes: add support for generic error injection via QAPI
  scripts/ghes_inject: add a script to generate GHES error inject

 MAINTAINERS                            |   8 +
 hw/acpi/Kconfig                        |   5 +
 hw/acpi/generic_event_device.c         |  17 +
 hw/acpi/ghes.c                         | 178 ++++++-
 hw/acpi/ghes_cper.c                    |  53 ++
 hw/acpi/meson.build                    |   2 +
 hw/arm/Kconfig                         |   5 +
 hw/arm/virt-acpi-build.c               |  25 +-
 hw/arm/virt.c                          |  33 +-
 include/hw/acpi/acpi_dev_interface.h   |   1 +
 include/hw/acpi/generic_event_device.h |   3 +
 include/hw/acpi/ghes.h                 |  14 +-
 include/hw/arm/virt.h                  |   5 +
 qapi/ghes-cper.json                    |  54 ++
 qapi/meson.build                       |   1 +
 qapi/qapi-schema.json                  |   1 +
 scripts/ghes_inject.py                 | 673 +++++++++++++++++++++++++
 17 files changed, 1048 insertions(+), 30 deletions(-)
 create mode 100644 hw/acpi/ghes_cper.c
 create mode 100644 qapi/ghes-cper.json
 create mode 100755 scripts/ghes_inject.py

-- 
2.45.2