[v4] VMM can handle guest SEA via KVM_EXIT_ARM_SEA

[PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA

Posted by Jiaqi Yan 3 months, 3 weeks ago

Problem
=======

When host APEI is unable to claim a synchronous external abort (SEA)
taken during a guest abort, KVM today directly injects an asynchronous
SError into the VCPU and then resumes it. The injected SError usually
results in an unpleasant guest kernel panic.

One major source of guest SEAs is a VCPU consuming a recoverable
uncorrected memory error (UER), which is not uncommon at all in modern
datacenter servers with large amounts of physical memory. Although an
SError and guest panic are sufficient to stop the propagation of corrupted
memory, there is room to recover from a UER in a more graceful manner.

Proposed Solution
=================

The idea is that we can replay the SEA to the faulting VCPU. If the memory
error consumption, or the fault that causes the SEA, is not from the guest
kernel, the blast radius can be limited to the poison-consuming guest
process, while the VM can keep running.

In addition, instead of doing this under the hood without involving
userspace, there are benefits to redirecting the SEA to the VMM:

- VM customers care about the disruptions caused by memory errors, and
  the VMM usually has the responsibility to start the process of notifying
  customers of memory error events in their VMs. For example, some cloud
  providers emit a critical log in their observability UI [1] and
  provide a playbook for customers on how to mitigate disruptions to
  their workloads.

- VMM can protect against future memory error consumption by unmapping the
  poisoned pages from the stage-2 page table with KVM userfault [2], or by
  splitting the memslot that contains the poisoned pages.

- VMM can keep track of SEA events in the VM. When VMM thinks the status
  on the host or the VM is bad enough, e.g. number of distinct SEAs
  exceeds a threshold, it can restart the VM on another healthy host.

- Behavior parity with the x86 architecture. When a machine check exception
  (MCE) is caused by a VCPU, the kernel or KVM sends SIGBUS to userspace to
  let the VMM either recover from the MCE or terminate itself along with
  the VM. The prior RFC proposed implementing SIGBUS on arm64 as well, but
  Marc preferred a KVM exit over a signal [3]. Implementation aside,
  returning the SEA to the VMM is on par with returning the MCE to the VMM.

Once SEA is redirected to VMM, among other actions, VMM is encouraged
to inject external aborts into the faulting VCPU.
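
As an illustration of the SEA-tracking use case listed above, here is a
minimal sketch of the bookkeeping a VMM might keep. Everything in it is
hypothetical and not part of this series: the struct and function names,
the fixed table size, the 4 KiB page granularity, and the policy of
counting distinct poisoned guest pages against a threshold.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SEA_TRACK_MAX 64            /* hypothetical table capacity */
#define SEA_PAGE_MASK (~0xfffULL)   /* assumes 4 KiB guest pages */

/* Hypothetical per-VM bookkeeping kept by the VMM. */
struct sea_tracker {
	uint64_t pages[SEA_TRACK_MAX];  /* distinct poisoned guest pages */
	size_t count;                   /* valid entries in pages[] */
	size_t threshold;               /* act once count exceeds this */
};

/*
 * Record an SEA taken at guest physical address gpa.
 * Returns true when the number of distinct poisoned pages exceeds the
 * threshold (or the table is full), i.e. when the VMM may want to
 * restart the VM on another host.
 */
static bool sea_tracker_record(struct sea_tracker *t, uint64_t gpa)
{
	uint64_t page = gpa & SEA_PAGE_MASK;
	size_t i;

	/* A repeated hit on a known page is not a new distinct SEA. */
	for (i = 0; i < t->count; i++)
		if (t->pages[i] == page)
			return t->count > t->threshold;

	/* No room left to track: conservatively report "unhealthy". */
	if (t->count == SEA_TRACK_MAX)
		return true;

	t->pages[t->count++] = page;
	return t->count > t->threshold;
}
```

A VMM would call sea_tracker_record() each time it handles an SEA exit
with a valid faulting GPA, and start its own migration or restart flow
once the function returns true.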

New UAPIs
=========

This patchset introduces the following userspace-visible changes to empower
the VMM to control what happens on an SEA on guest memory:

- KVM_CAP_ARM_SEA_TO_USER. When an SEA is taken, if userspace has enabled
  this new capability at VM creation and the SEA is not on kernel-allocated
  memory, KVM returns KVM_EXIT_ARM_SEA to userspace instead of injecting
  an SError.

- KVM_EXIT_ARM_SEA. This is the VM exit reason the VMM gets. As many
  details about the SEA as possible are provided in arm_sea, including
  the sanitized ESR value at EL2 and, if available, the faulting guest
  virtual and physical addresses.
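
The routing described above can be summarized with the following sketch.
It is a model of the policy as stated in this cover letter, not the code
in arch/arm64/kvm/mmu.c; the enum, the struct, and the function name are
all made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical outcomes for a guest SEA, per the description above. */
enum sea_action {
	SEA_HANDLED_BY_HOST,    /* host APEI claimed the abort */
	SEA_EXIT_TO_USERSPACE,  /* return KVM_EXIT_ARM_SEA to the VMM */
	SEA_INJECT_SERROR,      /* existing behavior: SError into the VCPU */
};

/* Hypothetical inputs to the decision. */
struct sea_context {
	bool apei_claimed;      /* host APEI claimed the SEA */
	bool cap_enabled;       /* KVM_CAP_ARM_SEA_TO_USER enabled on the VM */
	bool host_owns_memory;  /* SEA hit kernel-allocated memory */
};

static enum sea_action sea_route(const struct sea_context *c)
{
	/* The host gets the first chance to handle the abort. */
	if (c->apei_claimed)
		return SEA_HANDLED_BY_HOST;

	/* Only SEAs on guest memory are forwarded, and only if opted in. */
	if (c->cap_enabled && !c->host_owns_memory)
		return SEA_EXIT_TO_USERSPACE;

	/* Otherwise fall back to injecting an SError. */
	return SEA_INJECT_SERROR;
}
```

Note that a VM which never enables the capability always ends up on the
SError path, so existing userspace sees no behavior change.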

* From v3 [4]:
  - Rebased on commit 3a8660878839 ("Linux 6.18-rc1").
  - In selftest, print a message if GVA or GPA is expected to be valid.

* From v2 [5]:
  - Rebased on "[PATCH] KVM: arm64: nv: Handle SEAs due to VNCR redirection" [6]
    and kvmarm/next commit 7b8346bd9fce6 ("KVM: arm64: Don't attempt vLPI
    mappings when vPE allocation is disabled")
  - Took the host_owns_sea implementation from Oliver [7, 8].
  - Excluded the guest SEA injection patches.
  - Updated selftest.

* From v1 [9]:
  - Rebased on commit 4d62121ce9b5 ("KVM: arm64: vgic-debug: Avoid
    dereferencing NULL ITE pointer").
  - Sanitize ESR_EL2 before reporting it to userspace.
  - Do not do KVM_EXIT_ARM_SEA when SEA is caused by memory allocated to
    stage-2 translation table.

[1] https://cloud.google.com/solutions/sap/docs/manage-host-errors
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com
[3] https://lore.kernel.org/kvm/86pljbqqh0.wl-maz@kernel.org
[4] https://lore.kernel.org/kvmarm/20250731205844.1346839-1-jiaqiyan@google.com
[5] https://lore.kernel.org/kvm/20250604050902.3944054-1-jiaqiyan@google.com
[6] https://lore.kernel.org/kvmarm/20250729182342.3281742-1-oliver.upton@linux.dev
[7] https://lore.kernel.org/kvm/aHFohmTb9qR_JG1E@linux.dev
[8] https://lore.kernel.org/kvm/aHK-DPufhLy5Dtuk@linux.dev
[9] https://lore.kernel.org/kvm/20250505161412.1926643-1-jiaqiyan@google.com

Jiaqi Yan (3):
  KVM: arm64: VM exit to userspace to handle SEA
  KVM: selftests: Test for KVM_EXIT_ARM_SEA
  Documentation: kvm: new UAPI for handling SEA

 Documentation/virt/kvm/api.rst                |  61 ++++
 arch/arm64/include/asm/kvm_host.h             |   2 +
 arch/arm64/kvm/arm.c                          |   5 +
 arch/arm64/kvm/mmu.c                          |  68 +++-
 include/uapi/linux/kvm.h                      |  10 +
 tools/arch/arm64/include/asm/esr.h            |   2 +
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../testing/selftests/kvm/arm64/sea_to_user.c | 331 ++++++++++++++++++
 tools/testing/selftests/kvm/lib/kvm_util.c    |   1 +
 9 files changed, 480 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c

-- 
2.51.0.760.g7b8bcc2412-goog

Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA

Posted by Oliver Upton 2 months, 3 weeks ago

On Mon, 13 Oct 2025 18:59:00 +0000, Jiaqi Yan wrote:
> Problem
> =======
> 
> When host APEI is unable to claim a synchronous external abort (SEA)
> during guest abort, today KVM directly injects an asynchronous SError
> into the VCPU then resumes it. The injected SError usually results in
> unpleasant guest kernel panic.
> 
> [...]

I've gone ahead and done some cleanups, especially around documentation.

Applied to next, thanks!

[1/3] KVM: arm64: VM exit to userspace to handle SEA
      https://git.kernel.org/kvmarm/kvmarm/c/ad9c62bd8946
[2/3] KVM: selftests: Test for KVM_EXIT_ARM_SEA
      https://git.kernel.org/kvmarm/kvmarm/c/feee9ef7ac16
[3/3] Documentation: kvm: new UAPI for handling SEA
      https://git.kernel.org/kvmarm/kvmarm/c/4debb5e8952e

--
Best,
Oliver

Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA

Posted by Jiaqi Yan 2 months, 3 weeks ago

On Thu, Nov 13, 2025 at 1:06 PM Oliver Upton <oupton@kernel.org> wrote:
>
> On Mon, 13 Oct 2025 18:59:00 +0000, Jiaqi Yan wrote:
> > Problem
> > =======
> >
> > When host APEI is unable to claim a synchronous external abort (SEA)
> > during guest abort, today KVM directly injects an asynchronous SError
> > into the VCPU then resumes it. The injected SError usually results in
> > unpleasant guest kernel panic.
> >
> > [...]
>
> I've gone ahead and done some cleanups, especially around documentation.
>
> Applied to next, thanks!

Many thanks, Oliver!

I assume I still need to send out v5 with the typos fixed, comments
addressed, and your cleanups applied? If so, what specific tag/release
do you want me to rebase v5 onto?

>
> [1/3] KVM: arm64: VM exit to userspace to handle SEA
>       https://git.kernel.org/kvmarm/kvmarm/c/ad9c62bd8946
> [2/3] KVM: selftests: Test for KVM_EXIT_ARM_SEA
>       https://git.kernel.org/kvmarm/kvmarm/c/feee9ef7ac16
> [3/3] Documentation: kvm: new UAPI for handling SEA
>       https://git.kernel.org/kvmarm/kvmarm/c/4debb5e8952e
>
> --
> Best,
> Oliver

Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA

Posted by Oliver Upton 2 months, 3 weeks ago

On Thu, Nov 13, 2025 at 02:14:08PM -0800, Jiaqi Yan wrote:
> On Thu, Nov 13, 2025 at 1:06 PM Oliver Upton <oupton@kernel.org> wrote:
> >
> > On Mon, 13 Oct 2025 18:59:00 +0000, Jiaqi Yan wrote:
> > > Problem
> > > =======
> > >
> > > When host APEI is unable to claim a synchronous external abort (SEA)
> > > during guest abort, today KVM directly injects an asynchronous SError
> > > into the VCPU then resumes it. The injected SError usually results in
> > > unpleasant guest kernel panic.
> > >
> > > [...]
> >
> > I've gone ahead and done some cleanups, especially around documentation.
> >
> > Applied to next, thanks!
> 
> Many thanks, Oliver!
> 
> I assume I still need to send out v5 with typo fixed, comments
> addressed, and your cleanups applied? If so, what specific tag/release
> you want me to rebase v5 onto?

No need -- I took care of the issues I spotted when applying, LMK if
anything looks amiss on kvmarm/next.

Thanks,
Oliver

Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA

Posted by Jiaqi Yan 2 months, 3 weeks ago

On Thu, Nov 13, 2025 at 2:34 PM Oliver Upton <oupton@kernel.org> wrote:
>
> On Thu, Nov 13, 2025 at 02:14:08PM -0800, Jiaqi Yan wrote:
> > On Thu, Nov 13, 2025 at 1:06 PM Oliver Upton <oupton@kernel.org> wrote:
> > >
> > > On Mon, 13 Oct 2025 18:59:00 +0000, Jiaqi Yan wrote:
> > > > Problem
> > > > =======
> > > >
> > > > When host APEI is unable to claim a synchronous external abort (SEA)
> > > > during guest abort, today KVM directly injects an asynchronous SError
> > > > into the VCPU then resumes it. The injected SError usually results in
> > > > unpleasant guest kernel panic.
> > > >
> > > > [...]
> > >
> > > I've gone ahead and done some cleanups, especially around documentation.
> > >
> > > Applied to next, thanks!
> >
> > Many thanks, Oliver!
> >
> > I assume I still need to send out v5 with typo fixed, comments
> > addressed, and your cleanups applied? If so, what specific tag/release
> > you want me to rebase v5 onto?
>
> No need -- I took care of the issues I spotted when applying, LMK if
> anything looks amiss on kvmarm/next.

I took a look and everything looks fixed, and thanks for nearly
rewriting the documentation!

>
> Thanks,
> Oliver

Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA

Posted by Jason Gunthorpe 3 months, 2 weeks ago

On Mon, Oct 13, 2025 at 06:59:00PM +0000, Jiaqi Yan wrote:
> Problem
> =======
> 
> When host APEI is unable to claim a synchronous external abort (SEA)
> during guest abort, today KVM directly injects an asynchronous SError
> into the VCPU then resumes it. The injected SError usually results in
> unpleasant guest kernel panic.
> 
> One of the major situation of guest SEA is when VCPU consumes recoverable
> uncorrected memory error (UER), which is not uncommon at all in modern
> datacenter servers with large amounts of physical memory. Although SError
> and guest panic is sufficient to stop the propagation of corrupted memory,
> there is room to recover from an UER in a more graceful manner.
> 
> Proposed Solution
> =================
> 
> The idea is, we can replay the SEA to the faulting VCPU. If the memory
> error consumption or the fault that cause SEA is not from guest kernel,
> the blast radius can be limited to the poison-consuming guest process,
> while the VM can keep running.
> 
> In addition, instead of doing under the hood without involving userspace,
> there are benefits to redirect the SEA to VMM:
> 
> - VM customers care about the disruptions caused by memory errors, and
>   VMM usually has the responsibility to start the process of notifying
>   the customers of memory error events in their VMs. For example some
>   cloud provider emits a critical log in their observability UI [1], and
>   provides a playbook for customers on how to mitigate disruptions to
>   their workloads.
> 
> - VMM can protect future memory error consumption by unmapping the poisoned
>   pages from stage-2 page table with KVM userfault [2], or by splitting the
>   memslot that contains the poisoned pages.
> 
> - VMM can keep track of SEA events in the VM. When VMM thinks the status
>   on the host or the VM is bad enough, e.g. number of distinct SEAs
>   exceeds a threshold, it can restart the VM on another healthy host.
> 
> - Behavior parity with x86 architecture. When machine check exception
>   (MCE) is caused by VCPU, kernel or KVM signals userspace SIGBUS to
>   let VMM either recover from the MCE, or terminate itself with VM.
>   The prior RFC proposes to implement SIGBUS on arm64 as well, but
>   Marc preferred KVM exit over signal [3]. However, implementation
>   aside, returning SEA to VMM is on par with returning MCE to VMM.
> 
> Once SEA is redirected to VMM, among other actions, VMM is encouraged
> to inject external aborts into the faulting VCPU.

I don't know much about the KVM details but this explanation makes
sense to me and we also have use cases for all of what is written
here.

Thanks,
Jason

Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA

Posted by Jiaqi Yan 2 months, 4 weeks ago

On Mon, Oct 20, 2025 at 7:46 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Oct 13, 2025 at 06:59:00PM +0000, Jiaqi Yan wrote:
> > Problem
> > =======
> >
> > When host APEI is unable to claim a synchronous external abort (SEA)
> > during guest abort, today KVM directly injects an asynchronous SError
> > into the VCPU then resumes it. The injected SError usually results in
> > unpleasant guest kernel panic.
> >
> > One of the major situation of guest SEA is when VCPU consumes recoverable
> > uncorrected memory error (UER), which is not uncommon at all in modern
> > datacenter servers with large amounts of physical memory. Although SError
> > and guest panic is sufficient to stop the propagation of corrupted memory,
> > there is room to recover from an UER in a more graceful manner.
> >
> > Proposed Solution
> > =================
> >
> > The idea is, we can replay the SEA to the faulting VCPU. If the memory
> > error consumption or the fault that cause SEA is not from guest kernel,
> > the blast radius can be limited to the poison-consuming guest process,
> > while the VM can keep running.
> >
> > In addition, instead of doing under the hood without involving userspace,
> > there are benefits to redirect the SEA to VMM:
> >
> > - VM customers care about the disruptions caused by memory errors, and
> >   VMM usually has the responsibility to start the process of notifying
> >   the customers of memory error events in their VMs. For example some
> >   cloud provider emits a critical log in their observability UI [1], and
> >   provides a playbook for customers on how to mitigate disruptions to
> >   their workloads.
> >
> > - VMM can protect future memory error consumption by unmapping the poisoned
> >   pages from stage-2 page table with KVM userfault [2], or by splitting the
> >   memslot that contains the poisoned pages.
> >
> > - VMM can keep track of SEA events in the VM. When VMM thinks the status
> >   on the host or the VM is bad enough, e.g. number of distinct SEAs
> >   exceeds a threshold, it can restart the VM on another healthy host.
> >
> > - Behavior parity with x86 architecture. When machine check exception
> >   (MCE) is caused by VCPU, kernel or KVM signals userspace SIGBUS to
> >   let VMM either recover from the MCE, or terminate itself with VM.
> >   The prior RFC proposes to implement SIGBUS on arm64 as well, but
> >   Marc preferred KVM exit over signal [3]. However, implementation
> >   aside, returning SEA to VMM is on par with returning MCE to VMM.
> >
> > Once SEA is redirected to VMM, among other actions, VMM is encouraged
> > to inject external aborts into the faulting VCPU.
>
> I don't know much about the KVM details but this explanation makes
> sense to me and we also have use cases for all of what is written
> here.
>
> Thanks,
> Jason

Thanks for your feedback Jason. And thanks for the comments from Jose,
Randy, and Marc.

Just wondering if there are any concerns or comments on the API and
implementation? If not, I will fix the typos in 1/3 and 3/3 and then send
out v5.

Thanks,
Jiaqi

Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA

Posted by Mauro Carvalho Chehab 2 months, 3 weeks ago

Hi,

On Mon, Nov 10, 2025 at 09:41:33AM -0800, Jiaqi Yan wrote:
> On Mon, Oct 20, 2025 at 7:46 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Mon, Oct 13, 2025 at 06:59:00PM +0000, Jiaqi Yan wrote:
> > > Problem
> > > =======
> > >
> > > When host APEI is unable to claim a synchronous external abort (SEA)
> > > during guest abort, today KVM directly injects an asynchronous SError
> > > into the VCPU then resumes it. The injected SError usually results in
> > > unpleasant guest kernel panic.
> > >
> > > One of the major situation of guest SEA is when VCPU consumes recoverable
> > > uncorrected memory error (UER), which is not uncommon at all in modern
> > > datacenter servers with large amounts of physical memory. Although SError
> > > and guest panic is sufficient to stop the propagation of corrupted memory,
> > > there is room to recover from an UER in a more graceful manner.
> > >
> > > Proposed Solution
> > > =================
> > >
> > > The idea is, we can replay the SEA to the faulting VCPU. If the memory
> > > error consumption or the fault that cause SEA is not from guest kernel,
> > > the blast radius can be limited to the poison-consuming guest process,
> > > while the VM can keep running.

I like the idea of having a "guest-first"/"host-first" approach for APEI,
letting userspace (likely rasdaemon) decide whether to handle hardware
errors in the guest or on the host. Yet it sounds wrong to have a flag
called KVM_EXIT_ARM_SEA, as:

    1. This is not exclusive to ARM;
    2. There are other notification mechanisms that can raise APEI
       errors. For instance, QEMU code defines:

    ACPI_GHES_NOTIFY_POLLED = 0,
    ACPI_GHES_NOTIFY_EXTERNAL = 1,
    ACPI_GHES_NOTIFY_LOCAL = 2,
    ACPI_GHES_NOTIFY_SCI = 3,
    ACPI_GHES_NOTIFY_NMI = 4,
    ACPI_GHES_NOTIFY_CMCI = 5,
    ACPI_GHES_NOTIFY_MCE = 6,
    ACPI_GHES_NOTIFY_GPIO = 7,
    ACPI_GHES_NOTIFY_SEA = 8,
    ACPI_GHES_NOTIFY_SEI = 9,
    ACPI_GHES_NOTIFY_GSIV = 10,
    ACPI_GHES_NOTIFY_SDEI = 11,
    ACPI_GHES_NOTIFY_RESERVED = 12

 - even on arm. QEMU currently implements two mechanisms (SEA and GPIO);
 - once we implement the same feature on Intel, it will likely use
   NMI, MCE and/or SCI.

So, IMO, the best would be to use a more generic name like
KVM_EXIT_APEI or KVM_EXIT_GHES - or maybe even name it the way it really
is meant: KVM_EXIT_ACPI_GUEST_FIRST.

That said, I'd say that we need an implementation in a real userspace
application to be able to test it (rasdaemon being the obvious candidate).

For testing, the best option is to use the new QEMU code (for 10.2) that
allows injecting hardware errors via QMP.

Regards,
Mauro

Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA

Posted by Oliver Upton 2 months, 3 weeks ago

On Thu, Nov 13, 2025 at 02:54:33PM +0100, Mauro Carvalho Chehab wrote:
> Hi,
> 
> On Mon, Nov 10, 2025 at 09:41:33AM -0800, Jiaqi Yan wrote:
> > On Mon, Oct 20, 2025 at 7:46 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Mon, Oct 13, 2025 at 06:59:00PM +0000, Jiaqi Yan wrote:
> > > > Problem
> > > > =======
> > > >
> > > > When host APEI is unable to claim a synchronous external abort (SEA)
> > > > during guest abort, today KVM directly injects an asynchronous SError
> > > > into the VCPU then resumes it. The injected SError usually results in
> > > > unpleasant guest kernel panic.
> > > >
> > > > One of the major situation of guest SEA is when VCPU consumes recoverable
> > > > uncorrected memory error (UER), which is not uncommon at all in modern
> > > > datacenter servers with large amounts of physical memory. Although SError
> > > > and guest panic is sufficient to stop the propagation of corrupted memory,
> > > > there is room to recover from an UER in a more graceful manner.
> > > >
> > > > Proposed Solution
> > > > =================
> > > >
> > > > The idea is, we can replay the SEA to the faulting VCPU. If the memory
> > > > error consumption or the fault that cause SEA is not from guest kernel,
> > > > the blast radius can be limited to the poison-consuming guest process,
> > > > while the VM can keep running.
> 
> I like the idea of having a "guest-first"/"host-first" approach for APEI,
> letting userspace (likely rasdaemon) to decide to handle hardware errors
> either at the guest or at the host. Yet, it sounds wrong to have a flag
> called KVM_EXIT_ARM_SEA, as:
> 
>     1. This is not exclusive to ARM;
>     2. There are other notification mechanisms that can rise an APEI
>        errors. For instance QEMU code defines:
> 
>     ACPI_GHES_NOTIFY_POLLED = 0,
>     ACPI_GHES_NOTIFY_EXTERNAL = 1,
>     ACPI_GHES_NOTIFY_LOCAL = 2,
>     ACPI_GHES_NOTIFY_SCI = 3,
>     ACPI_GHES_NOTIFY_NMI = 4,
>     ACPI_GHES_NOTIFY_CMCI = 5,
>     ACPI_GHES_NOTIFY_MCE = 6,
>     ACPI_GHES_NOTIFY_GPIO = 7,
>     ACPI_GHES_NOTIFY_SEA = 8,
>     ACPI_GHES_NOTIFY_SEI = 9,
>     ACPI_GHES_NOTIFY_GSIV = 10,
>     ACPI_GHES_NOTIFY_SDEI = 11,
>     ACPI_GHES_NOTIFY_RESERVED = 12
> 
>  - even on arm. QEMU currently implements two mechanisms (SEA and GPIO);
>  - once we implement the same feature on Intel, it will likely use
>    NMI, MCE and/or SCI.
> 
> So, IMO, the best would be to use a more generic name like
> KVM_EXIT_APEI or KVM_EXIT_GHES - or maybe even name it the way it really
> is meant: KVM_EXIT_ACPI_GUEST_FIRST.

This is not the sort of thing that I'd like to see dressed up as an
arch-generic interface.

What Jiaqi is dealing with is the very sorry state of RAS on arm64,
giving userspace the opportunity to decide how an SEA is handled when a
platform's firmware couldn't be bothered to do so. The SEA is an
architecture-specific event so we provide the hardware context to
the VMM to sort things out.

If the APEI driver actually registers to handle the SEA then it will
continue to handle the SEA before ever involving the VMM. I'm not
aware of any system that does this. If you're lucky you'll take an
*asynchronous* vector after to process a CPER and still have to deal
with a 'bare' SEA.

And of course, none of this even matters for the several billion
DT-based hosts out in the wild.

Thanks,
Oliver