Problem
=======
When host APEI is unable to claim a synchronous external abort (SEA)
during a guest abort, KVM today directly injects an asynchronous SError
into the VCPU and then resumes it. The injected SError usually results in
an unpleasant guest kernel panic.
One of the major causes of guest SEA is a VCPU consuming a recoverable
uncorrected memory error (UER), which is not uncommon at all in modern
datacenter servers with large amounts of physical memory. Although an SError
and guest panic are sufficient to stop the propagation of corrupted memory,
there is room to recover from a UER in a more graceful manner.
Proposed Solution
=================
The idea is that we can replay the SEA to the faulting VCPU. If the memory
error consumption or the fault that causes the SEA is not from the guest
kernel, the blast radius can be limited to the poison-consuming guest process,
while the VM can keep running.
In addition, instead of doing this under the hood without involving userspace,
there are benefits to redirecting the SEA to VMM:
- VM customers care about the disruptions caused by memory errors, and
VMM usually has the responsibility to start the process of notifying
the customers of memory error events in their VMs. For example, some
cloud providers emit a critical log in their observability UI [1] and
provide a playbook for customers on how to mitigate disruptions to
their workloads.
- VMM can protect against future memory error consumption by unmapping the
poisoned pages from the stage-2 page table with KVM userfault [2], or by
splitting the memslot that contains the poisoned pages.
- VMM can keep track of SEA events in the VM. When VMM thinks the status
of the host or the VM is bad enough, e.g. the number of distinct SEAs
exceeds a threshold, it can restart the VM on another healthy host.
- Behavior parity with the x86 architecture. When a machine check exception
(MCE) is caused by a VCPU, the kernel or KVM sends SIGBUS to userspace to
let VMM either recover from the MCE or terminate itself along with the VM.
The prior RFC proposed implementing SIGBUS on arm64 as well, but
Marc preferred a KVM exit over a signal [3]. However, implementation
aside, returning SEA to VMM is on par with returning MCE to VMM.
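As a rough illustration of the SEA-tracking point above, a VMM could keep a
small table of distinct poisoned pages and compare its size against a
threshold. This is only a sketch: the demo_ names, the fixed-size table, and
the choice to define "distinct" at page granularity are assumptions made for
this example and are not part of the patchset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on tracked pages, chosen for illustration. */
#define DEMO_MAX_TRACKED 64

struct demo_sea_tracker {
	uint64_t pages[DEMO_MAX_TRACKED]; /* page-aligned GPAs seen so far */
	size_t count;                     /* number of distinct pages recorded */
	size_t threshold;                 /* distinct pages at which to give up */
};

static void demo_tracker_init(struct demo_sea_tracker *t, size_t threshold)
{
	assert(threshold > 0 && threshold <= DEMO_MAX_TRACKED);
	t->count = 0;
	t->threshold = threshold;
}

/*
 * Record one SEA at the given guest physical address. Returns true once the
 * number of distinct poisoned pages has reached the threshold, i.e. when the
 * VMM policy in this sketch would restart the VM on another host.
 */
static bool demo_tracker_record(struct demo_sea_tracker *t, uint64_t gpa,
				uint64_t page_size)
{
	uint64_t page;
	size_t i;

	/* page_size must be a non-zero power of two for the mask below. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	page = gpa & ~(page_size - 1);

	/* A repeated SEA on an already known page does not count again. */
	for (i = 0; i < t->count; i++) {
		if (t->pages[i] == page)
			return t->count >= t->threshold;
	}

	if (t->count < DEMO_MAX_TRACKED)
		t->pages[t->count++] = page;

	return t->count >= t->threshold;
}
```

With a threshold of 2, two SEAs on the same 4K page would leave the VM in
place, while an SEA on a second page would trip the policy.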
Once SEA is redirected to VMM, among other actions, VMM is encouraged
to inject external aborts into the faulting VCPU.
New UAPIs
=========
This patchset introduces the following userspace-visible changes to empower
VMM to control what happens for SEA on guest memory:
- KVM_CAP_ARM_SEA_TO_USER. While taking an SEA, if userspace has enabled
this new capability at VM creation, and the SEA did not hit memory owned
by the kernel (for example, memory allocated to stage-2 translation
tables), KVM returns KVM_EXIT_ARM_SEA to userspace instead of injecting
an SError.
- KVM_EXIT_ARM_SEA. This is the VM exit reason VMM gets. The details
about the SEA are provided in arm_sea as far as possible, including
the sanitized ESR value at EL2 and the faulting guest virtual and
physical addresses if available.
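To make the exit details more concrete, the following sketch shows how a VMM
might decode them after a KVM_EXIT_ARM_SEA. It deliberately does not use the
real UAPI header: struct demo_arm_sea and the DEMO_SEA_*_VALID flags are
hypothetical stand-ins for the arm_sea payload described above, and the
authoritative layout is whatever patch 1/3 adds to include/uapi/linux/kvm.h.
The only architectural facts relied on are that ESR_ELx.EC lives in bits
[31:26] and that EC values 0x20 and 0x24 denote instruction and data aborts
taken from a lower exception level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical validity flags; the real UAPI may name and number these differently. */
#define DEMO_SEA_GVA_VALID (1ULL << 0)
#define DEMO_SEA_GPA_VALID (1ULL << 1)

/* Hypothetical mirror of the SEA details handed to userspace. */
struct demo_arm_sea {
	uint64_t flags; /* which of gva/gpa are valid */
	uint64_t esr;   /* sanitized ESR_EL2 */
	uint64_t gva;   /* faulting guest virtual address, if valid */
	uint64_t gpa;   /* faulting guest physical address, if valid */
};

/* ESR_ELx exception class field and the two abort classes of interest. */
#define DEMO_ESR_EC_SHIFT 26
#define DEMO_ESR_EC_MASK  0x3fULL
#define DEMO_EC_IABT_LOW  0x20ULL /* instruction abort from a lower EL */
#define DEMO_EC_DABT_LOW  0x24ULL /* data abort from a lower EL */

struct demo_sea_report {
	bool is_instruction_abort;
	bool have_gva;
	bool have_gpa;
	uint64_t poisoned_page_gpa; /* page-aligned GPA, meaningful only if have_gpa */
};

/* Turn the raw exit payload into the facts a VMM policy would act on. */
static struct demo_sea_report demo_decode_sea(const struct demo_arm_sea *sea,
					      uint64_t page_size)
{
	struct demo_sea_report r = { 0 };
	uint64_t ec = (sea->esr >> DEMO_ESR_EC_SHIFT) & DEMO_ESR_EC_MASK;

	/* page_size must be a non-zero power of two for the mask below. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	r.is_instruction_abort = (ec == DEMO_EC_IABT_LOW);
	r.have_gva = (sea->flags & DEMO_SEA_GVA_VALID) != 0;
	r.have_gpa = (sea->flags & DEMO_SEA_GPA_VALID) != 0;
	if (r.have_gpa)
		r.poisoned_page_gpa = sea->gpa & ~(page_size - 1);

	return r;
}
```

A VMM would typically log such a report, stop mapping poisoned_page_gpa when
it is valid, and then inject an external abort into the VCPU. For data
aborts, the existing KVM_SET_VCPU_EVENTS ioctl with exception.ext_dabt_pending
set is one likely way to do the injection.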
* From v3 [4]:
- Rebased on commit 3a8660878839 ("Linux 6.18-rc1").
- In selftest, print a message if GVA or GPA is expected to be valid.
* From v2 [5]:
- Rebased on "[PATCH] KVM: arm64: nv: Handle SEAs due to VNCR redirection" [6]
and kvmarm/next commit 7b8346bd9fce6 ("KVM: arm64: Don't attempt vLPI
mappings when vPE allocation is disabled")
- Took the host_owns_sea implementation from Oliver [7, 8].
- Excluded the guest SEA injection patches.
- Updated selftest.
* From v1 [9]:
- Rebased on commit 4d62121ce9b5 ("KVM: arm64: vgic-debug: Avoid
dereferencing NULL ITE pointer").
- Sanitize ESR_EL2 before reporting it to userspace.
- Do not do KVM_EXIT_ARM_SEA when SEA is caused by memory allocated to
stage-2 translation table.
[1] https://cloud.google.com/solutions/sap/docs/manage-host-errors
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com
[3] https://lore.kernel.org/kvm/86pljbqqh0.wl-maz@kernel.org
[4] https://lore.kernel.org/kvmarm/20250731205844.1346839-1-jiaqiyan@google.com
[5] https://lore.kernel.org/kvm/20250604050902.3944054-1-jiaqiyan@google.com
[6] https://lore.kernel.org/kvmarm/20250729182342.3281742-1-oliver.upton@linux.dev
[7] https://lore.kernel.org/kvm/aHFohmTb9qR_JG1E@linux.dev
[8] https://lore.kernel.org/kvm/aHK-DPufhLy5Dtuk@linux.dev
[9] https://lore.kernel.org/kvm/20250505161412.1926643-1-jiaqiyan@google.com
Jiaqi Yan (3):
KVM: arm64: VM exit to userspace to handle SEA
KVM: selftests: Test for KVM_EXIT_ARM_SEA
Documentation: kvm: new UAPI for handling SEA
Documentation/virt/kvm/api.rst | 61 ++++
arch/arm64/include/asm/kvm_host.h | 2 +
arch/arm64/kvm/arm.c | 5 +
arch/arm64/kvm/mmu.c | 68 +++-
include/uapi/linux/kvm.h | 10 +
tools/arch/arm64/include/asm/esr.h | 2 +
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../testing/selftests/kvm/arm64/sea_to_user.c | 331 ++++++++++++++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 1 +
9 files changed, 480 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c
--
2.51.0.760.g7b8bcc2412-goog
On Mon, 13 Oct 2025 18:59:00 +0000, Jiaqi Yan wrote:
> Problem
> =======
>
> When host APEI is unable to claim a synchronous external abort (SEA)
> during guest abort, today KVM directly injects an asynchronous SError
> into the VCPU then resumes it. The injected SError usually results in
> unpleasant guest kernel panic.
>
> [...]
I've gone ahead and done some cleanups, especially around documentation.
Applied to next, thanks!
[1/3] KVM: arm64: VM exit to userspace to handle SEA
https://git.kernel.org/kvmarm/kvmarm/c/ad9c62bd8946
[2/3] KVM: selftests: Test for KVM_EXIT_ARM_SEA
https://git.kernel.org/kvmarm/kvmarm/c/feee9ef7ac16
[3/3] Documentation: kvm: new UAPI for handling SEA
https://git.kernel.org/kvmarm/kvmarm/c/4debb5e8952e
--
Best,
Oliver
On Thu, Nov 13, 2025 at 1:06 PM Oliver Upton <oupton@kernel.org> wrote:
>
> On Mon, 13 Oct 2025 18:59:00 +0000, Jiaqi Yan wrote:
> > Problem
> > =======
> >
> > When host APEI is unable to claim a synchronous external abort (SEA)
> > during guest abort, today KVM directly injects an asynchronous SError
> > into the VCPU then resumes it. The injected SError usually results in
> > unpleasant guest kernel panic.
> >
> > [...]
>
> I've gone ahead and done some cleanups, especially around documentation.
>
> Applied to next, thanks!

Many thanks, Oliver!

I assume I still need to send out v5 with typo fixed, comments
addressed, and your cleanups applied? If so, what specific tag/release
you want me to rebase v5 onto?

>
> [1/3] KVM: arm64: VM exit to userspace to handle SEA
> https://git.kernel.org/kvmarm/kvmarm/c/ad9c62bd8946
> [2/3] KVM: selftests: Test for KVM_EXIT_ARM_SEA
> https://git.kernel.org/kvmarm/kvmarm/c/feee9ef7ac16
> [3/3] Documentation: kvm: new UAPI for handling SEA
> https://git.kernel.org/kvmarm/kvmarm/c/4debb5e8952e
>
> --
> Best,
> Oliver
On Thu, Nov 13, 2025 at 02:14:08PM -0800, Jiaqi Yan wrote:
> On Thu, Nov 13, 2025 at 1:06 PM Oliver Upton <oupton@kernel.org> wrote:
> >
> > On Mon, 13 Oct 2025 18:59:00 +0000, Jiaqi Yan wrote:
> > > Problem
> > > =======
> > >
> > > When host APEI is unable to claim a synchronous external abort (SEA)
> > > during guest abort, today KVM directly injects an asynchronous SError
> > > into the VCPU then resumes it. The injected SError usually results in
> > > unpleasant guest kernel panic.
> > >
> > > [...]
> >
> > I've gone ahead and done some cleanups, especially around documentation.
> >
> > Applied to next, thanks!
>
> Many thanks, Oliver!
>
> I assume I still need to send out v5 with typo fixed, comments
> addressed, and your cleanups applied? If so, what specific tag/release
> you want me to rebase v5 onto?

No need -- I took care of the issues I spotted when applying, LMK if
anything looks amiss on kvmarm/next.

Thanks,
Oliver
On Thu, Nov 13, 2025 at 2:34 PM Oliver Upton <oupton@kernel.org> wrote:
>
> On Thu, Nov 13, 2025 at 02:14:08PM -0800, Jiaqi Yan wrote:
> > On Thu, Nov 13, 2025 at 1:06 PM Oliver Upton <oupton@kernel.org> wrote:
> > >
> > > On Mon, 13 Oct 2025 18:59:00 +0000, Jiaqi Yan wrote:
> > > > Problem
> > > > =======
> > > >
> > > > When host APEI is unable to claim a synchronous external abort (SEA)
> > > > during guest abort, today KVM directly injects an asynchronous SError
> > > > into the VCPU then resumes it. The injected SError usually results in
> > > > unpleasant guest kernel panic.
> > > >
> > > > [...]
> > >
> > > I've gone ahead and done some cleanups, especially around documentation.
> > >
> > > Applied to next, thanks!
> >
> > Many thanks, Oliver!
> >
> > I assume I still need to send out v5 with typo fixed, comments
> > addressed, and your cleanups applied? If so, what specific tag/release
> > you want me to rebase v5 onto?
>
> No need -- I took care of the issues I spotted when applying, LMK if
> anything looks amiss on kvmarm/next.

I took a look and everything looks fixed, and thanks for nearly
rewriting the documentation!

>
> Thanks,
> Oliver
On Mon, Oct 13, 2025 at 06:59:00PM +0000, Jiaqi Yan wrote:
> Problem
> =======
>
> When host APEI is unable to claim a synchronous external abort (SEA)
> during guest abort, today KVM directly injects an asynchronous SError
> into the VCPU then resumes it. The injected SError usually results in
> unpleasant guest kernel panic.
>
> One of the major situation of guest SEA is when VCPU consumes recoverable
> uncorrected memory error (UER), which is not uncommon at all in modern
> datacenter servers with large amounts of physical memory. Although SError
> and guest panic is sufficient to stop the propagation of corrupted memory,
> there is room to recover from an UER in a more graceful manner.
>
> Proposed Solution
> =================
>
> The idea is, we can replay the SEA to the faulting VCPU. If the memory
> error consumption or the fault that cause SEA is not from guest kernel,
> the blast radius can be limited to the poison-consuming guest process,
> while the VM can keep running.
>
> In addition, instead of doing under the hood without involving userspace,
> there are benefits to redirect the SEA to VMM:
>
> - VM customers care about the disruptions caused by memory errors, and
> VMM usually has the responsibility to start the process of notifying
> the customers of memory error events in their VMs. For example some
> cloud provider emits a critical log in their observability UI [1], and
> provides a playbook for customers on how to mitigate disruptions to
> their workloads.
>
> - VMM can protect future memory error consumption by unmapping the poisoned
> pages from stage-2 page table with KVM userfault [2], or by splitting the
> memslot that contains the poisoned pages.
>
> - VMM can keep track of SEA events in the VM. When VMM thinks the status
> on the host or the VM is bad enough, e.g. number of distinct SEAs
> exceeds a threshold, it can restart the VM on another healthy host.
>
> - Behavior parity with x86 architecture. When machine check exception
> (MCE) is caused by VCPU, kernel or KVM signals userspace SIGBUS to
> let VMM either recover from the MCE, or terminate itself with VM.
> The prior RFC proposes to implement SIGBUS on arm64 as well, but
> Marc preferred KVM exit over signal [3]. However, implementation
> aside, returning SEA to VMM is on par with returning MCE to VMM.
>
> Once SEA is redirected to VMM, among other actions, VMM is encouraged
> to inject external aborts into the faulting VCPU.

I don't know much about the KVM details but this explanation makes
sense to me and we also have use cases for all of what is written
here.

Thanks,
Jason
On Mon, Oct 20, 2025 at 7:46 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Oct 13, 2025 at 06:59:00PM +0000, Jiaqi Yan wrote:
> > Problem
> > =======
> >
> > When host APEI is unable to claim a synchronous external abort (SEA)
> > during guest abort, today KVM directly injects an asynchronous SError
> > into the VCPU then resumes it. The injected SError usually results in
> > unpleasant guest kernel panic.
> >
> > One of the major situation of guest SEA is when VCPU consumes recoverable
> > uncorrected memory error (UER), which is not uncommon at all in modern
> > datacenter servers with large amounts of physical memory. Although SError
> > and guest panic is sufficient to stop the propagation of corrupted memory,
> > there is room to recover from an UER in a more graceful manner.
> >
> > Proposed Solution
> > =================
> >
> > The idea is, we can replay the SEA to the faulting VCPU. If the memory
> > error consumption or the fault that cause SEA is not from guest kernel,
> > the blast radius can be limited to the poison-consuming guest process,
> > while the VM can keep running.
> >
> > In addition, instead of doing under the hood without involving userspace,
> > there are benefits to redirect the SEA to VMM:
> >
> > - VM customers care about the disruptions caused by memory errors, and
> > VMM usually has the responsibility to start the process of notifying
> > the customers of memory error events in their VMs. For example some
> > cloud provider emits a critical log in their observability UI [1], and
> > provides a playbook for customers on how to mitigate disruptions to
> > their workloads.
> >
> > - VMM can protect future memory error consumption by unmapping the poisoned
> > pages from stage-2 page table with KVM userfault [2], or by splitting the
> > memslot that contains the poisoned pages.
> >
> > - VMM can keep track of SEA events in the VM. When VMM thinks the status
> > on the host or the VM is bad enough, e.g. number of distinct SEAs
> > exceeds a threshold, it can restart the VM on another healthy host.
> >
> > - Behavior parity with x86 architecture. When machine check exception
> > (MCE) is caused by VCPU, kernel or KVM signals userspace SIGBUS to
> > let VMM either recover from the MCE, or terminate itself with VM.
> > The prior RFC proposes to implement SIGBUS on arm64 as well, but
> > Marc preferred KVM exit over signal [3]. However, implementation
> > aside, returning SEA to VMM is on par with returning MCE to VMM.
> >
> > Once SEA is redirected to VMM, among other actions, VMM is encouraged
> > to inject external aborts into the faulting VCPU.
>
> I don't know much about the KVM details but this explanation makes
> sense to me and we also have use cases for all of what is written
> here.
>
> Thanks,
> Jason

Thanks for your feedback Jason. And thanks for the comments from Jose,
Randy, and Marc.

Just wondering if there are any concerns or comments on the API and
implementation? If no, I will fix the typos in 1/3 and 3/3 then send
out v5.

Thanks,
Jiaqi
Hi,
On Mon, Nov 10, 2025 at 09:41:33AM -0800, Jiaqi Yan wrote:
> On Mon, Oct 20, 2025 at 7:46 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Mon, Oct 13, 2025 at 06:59:00PM +0000, Jiaqi Yan wrote:
> > > Problem
> > > =======
> > >
> > > When host APEI is unable to claim a synchronous external abort (SEA)
> > > during guest abort, today KVM directly injects an asynchronous SError
> > > into the VCPU then resumes it. The injected SError usually results in
> > > unpleasant guest kernel panic.
> > >
> > > One of the major situation of guest SEA is when VCPU consumes recoverable
> > > uncorrected memory error (UER), which is not uncommon at all in modern
> > > datacenter servers with large amounts of physical memory. Although SError
> > > and guest panic is sufficient to stop the propagation of corrupted memory,
> > > there is room to recover from an UER in a more graceful manner.
> > >
> > > Proposed Solution
> > > =================
> > >
> > > The idea is, we can replay the SEA to the faulting VCPU. If the memory
> > > error consumption or the fault that cause SEA is not from guest kernel,
> > > the blast radius can be limited to the poison-consuming guest process,
> > > while the VM can keep running.
I like the idea of having a "guest-first"/"host-first" approach for APEI,
letting userspace (likely rasdaemon) decide whether to handle hardware
errors at the guest or at the host. Yet, it sounds wrong to have a flag
called KVM_EXIT_ARM_SEA, as:
1. This is not exclusive to ARM;
2. There are other notification mechanisms that can raise APEI
errors. For instance QEMU code defines:
ACPI_GHES_NOTIFY_POLLED = 0,
ACPI_GHES_NOTIFY_EXTERNAL = 1,
ACPI_GHES_NOTIFY_LOCAL = 2,
ACPI_GHES_NOTIFY_SCI = 3,
ACPI_GHES_NOTIFY_NMI = 4,
ACPI_GHES_NOTIFY_CMCI = 5,
ACPI_GHES_NOTIFY_MCE = 6,
ACPI_GHES_NOTIFY_GPIO = 7,
ACPI_GHES_NOTIFY_SEA = 8,
ACPI_GHES_NOTIFY_SEI = 9,
ACPI_GHES_NOTIFY_GSIV = 10,
ACPI_GHES_NOTIFY_SDEI = 11,
ACPI_GHES_NOTIFY_RESERVED = 12
- even on arm. QEMU currently implements two mechanisms (SEA and GPIO);
- once we implement the same feature on Intel, it will likely use
NMI, MCE and/or SCI.
So, IMO, the best would be to use a more generic name like
KVM_EXIT_APEI or KVM_EXIT_GHES - or maybe even name it the way it really
is meant: KVM_EXIT_ACPI_GUEST_FIRST.
That said, I'd say that we need an implementation in a real userspace
application to be able to test it (rasdaemon being the obvious candidate).
To test it, the best option is to use the new QEMU code (for 10.2), which
allows injecting hardware errors via QMP.
Regards,
Mauro
On Thu, Nov 13, 2025 at 02:54:33PM +0100, Mauro Carvalho Chehab wrote:
> Hi,
>
> On Mon, Nov 10, 2025 at 09:41:33AM -0800, Jiaqi Yan wrote:
> > On Mon, Oct 20, 2025 at 7:46 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Mon, Oct 13, 2025 at 06:59:00PM +0000, Jiaqi Yan wrote:
> > > > Problem
> > > > =======
> > > >
> > > > When host APEI is unable to claim a synchronous external abort (SEA)
> > > > during guest abort, today KVM directly injects an asynchronous SError
> > > > into the VCPU then resumes it. The injected SError usually results in
> > > > unpleasant guest kernel panic.
> > > >
> > > > One of the major situation of guest SEA is when VCPU consumes recoverable
> > > > uncorrected memory error (UER), which is not uncommon at all in modern
> > > > datacenter servers with large amounts of physical memory. Although SError
> > > > and guest panic is sufficient to stop the propagation of corrupted memory,
> > > > there is room to recover from an UER in a more graceful manner.
> > > >
> > > > Proposed Solution
> > > > =================
> > > >
> > > > The idea is, we can replay the SEA to the faulting VCPU. If the memory
> > > > error consumption or the fault that cause SEA is not from guest kernel,
> > > > the blast radius can be limited to the poison-consuming guest process,
> > > > while the VM can keep running.
>
> I like the idea of having a "guest-first"/"host-first" approach for APEI,
> letting userspace (likely rasdaemon) to decide to handle hardware errors
> either at the guest or at the host. Yet, it sounds wrong to have a flag
> called KVM_EXIT_ARM_SEA, as:
>
> 1. This is not exclusive to ARM;
> 2. There are other notification mechanisms that can rise an APEI
> errors. For instance QEMU code defines:
>
> ACPI_GHES_NOTIFY_POLLED = 0,
> ACPI_GHES_NOTIFY_EXTERNAL = 1,
> ACPI_GHES_NOTIFY_LOCAL = 2,
> ACPI_GHES_NOTIFY_SCI = 3,
> ACPI_GHES_NOTIFY_NMI = 4,
> ACPI_GHES_NOTIFY_CMCI = 5,
> ACPI_GHES_NOTIFY_MCE = 6,
> ACPI_GHES_NOTIFY_GPIO = 7,
> ACPI_GHES_NOTIFY_SEA = 8,
> ACPI_GHES_NOTIFY_SEI = 9,
> ACPI_GHES_NOTIFY_GSIV = 10,
> ACPI_GHES_NOTIFY_SDEI = 11,
> ACPI_GHES_NOTIFY_RESERVED = 12
>
> - even on arm. QEMU currently implements two mechanisms (SEA and GPIO);
> - once we implement the same feature on Intel, it will likely use
> NMI, MCE and/or SCI.
>
> So, IMO, the best would be to use a more generic name like
> KVM_EXIT_APEI or KVM_EXIT_GHES - or maybe even name it the way it really
> is meant: KVM_EXIT_ACPI_GUEST_FIRST.

This is not the sort of thing that I'd like to see dressed up as an
arch-generic interface. What Jiaqi is dealing with is the very sorry
state of RAS on arm64, giving userspace the opportunity to decide how an
SEA is handled when a platform's firmware couldn't be bothered to do so.
The SEA is an architecture-specific event so we provide the hardware
context to the VMM to sort things out.

If the APEI driver actually registers to handle the SEA then it will
continue to handle the SEA before ever involving the VMM. I'm not aware
of any system that does this. If you're lucky you'll take an
*asynchronous* vector after to process a CPER and still have to deal
with a 'bare' SEA.

And of course, none of this even matters for the several billion
DT-based hosts out in the wild.

Thanks,
Oliver