[RFC PATCH 00/29] KVM: VM planes

Paolo Bonzini posted 29 patches 2 days, 18 hours ago
Documentation/virt/kvm/api.rst                | 245 +++++++--
Documentation/virt/kvm/locking.rst            |   3 +
Documentation/virt/kvm/vcpu-requests.rst      |   7 +
arch/arm64/include/asm/kvm_host.h             |   5 +
arch/arm64/kvm/arm.c                          |   4 +-
arch/arm64/kvm/handle_exit.c                  |   6 +-
arch/arm64/kvm/hyp/nvhe/gen-hyprel.c          |   4 +-
arch/arm64/kvm/mmio.c                         |   4 +-
arch/loongarch/include/asm/kvm_host.h         |   5 +
arch/loongarch/kvm/exit.c                     |   8 +-
arch/loongarch/kvm/vcpu.c                     |   4 +-
arch/mips/include/asm/kvm_host.h              |   5 +
arch/mips/kvm/emulate.c                       |   2 +-
arch/mips/kvm/mips.c                          |  32 +-
arch/mips/kvm/vz.c                            |  18 +-
arch/powerpc/include/asm/kvm_host.h           |   5 +
arch/powerpc/kvm/book3s.c                     |   2 +-
arch/powerpc/kvm/book3s_hv.c                  |  46 +-
arch/powerpc/kvm/book3s_hv_rm_xics.c          |   8 +-
arch/powerpc/kvm/book3s_pr.c                  |  22 +-
arch/powerpc/kvm/book3s_pr_papr.c             |   2 +-
arch/powerpc/kvm/powerpc.c                    |   6 +-
arch/powerpc/kvm/timing.h                     |  28 +-
arch/riscv/include/asm/kvm_host.h             |   5 +
arch/riscv/kvm/vcpu.c                         |   4 +-
arch/riscv/kvm/vcpu_exit.c                    |  10 +-
arch/riscv/kvm/vcpu_insn.c                    |  16 +-
arch/riscv/kvm/vcpu_sbi.c                     |   2 +-
arch/riscv/kvm/vcpu_sbi_hsm.c                 |   2 +-
arch/s390/include/asm/kvm_host.h              |   5 +
arch/s390/kvm/diag.c                          |  18 +-
arch/s390/kvm/intercept.c                     |  20 +-
arch/s390/kvm/interrupt.c                     |  48 +-
arch/s390/kvm/kvm-s390.c                      |  10 +-
arch/s390/kvm/priv.c                          |  60 +--
arch/s390/kvm/sigp.c                          |  50 +-
arch/s390/kvm/vsie.c                          |   2 +-
arch/x86/include/asm/kvm_host.h               |  46 +-
arch/x86/kvm/cpuid.c                          |  57 +-
arch/x86/kvm/cpuid.h                          |   2 +
arch/x86/kvm/debugfs.c                        |   2 +-
arch/x86/kvm/hyperv.c                         |   7 +-
arch/x86/kvm/i8254.c                          |   7 +-
arch/x86/kvm/ioapic.c                         |   4 +-
arch/x86/kvm/irq_comm.c                       |  14 +-
arch/x86/kvm/kvm_cache_regs.h                 |   4 +-
arch/x86/kvm/lapic.c                          | 147 +++--
arch/x86/kvm/mmu/mmu.c                        |  41 +-
arch/x86/kvm/mmu/tdp_mmu.c                    |   2 +-
arch/x86/kvm/svm/sev.c                        |   4 +-
arch/x86/kvm/svm/svm.c                        |  21 +-
arch/x86/kvm/vmx/tdx.c                        |   8 +-
arch/x86/kvm/vmx/vmx.c                        |  20 +-
arch/x86/kvm/x86.c                            | 319 ++++++++---
arch/x86/kvm/xen.c                            |   1 +
include/linux/kvm_host.h                      | 130 +++--
include/linux/kvm_types.h                     |   1 +
include/uapi/linux/kvm.h                      |  28 +-
tools/testing/selftests/kvm/Makefile.kvm      |   2 +
.../testing/selftests/kvm/include/kvm_util.h  |  48 ++
.../selftests/kvm/include/x86/processor.h     |   1 +
tools/testing/selftests/kvm/lib/kvm_util.c    |  65 ++-
.../testing/selftests/kvm/lib/x86/processor.c |  15 +
tools/testing/selftests/kvm/plane_test.c      | 103 ++++
tools/testing/selftests/kvm/x86/plane_test.c  | 270 ++++++++++
virt/kvm/dirty_ring.c                         |   5 +-
virt/kvm/guest_memfd.c                        |   3 +-
virt/kvm/irqchip.c                            |   5 +-
virt/kvm/kvm_main.c                           | 500 ++++++++++++++----
69 files changed, 1991 insertions(+), 614 deletions(-)
create mode 100644 tools/testing/selftests/kvm/plane_test.c
create mode 100644 tools/testing/selftests/kvm/x86/plane_test.c
[RFC PATCH 00/29] KVM: VM planes
Posted by Paolo Bonzini 2 days, 18 hours ago
I guess April 1st is not the best date to send out such a large series
after months of radio silence, but here we are.

AMD VMPLs, Intel TDX partitions, Microsoft Hyper-V VTLs, and ARM CCA planes
are all examples of virtual privilege level concepts that are exclusive to
guests.  In all these specifications the hypervisor hosts multiple
copies of a vCPU's register state (or at least of most of it) and provides
hypercalls or instructions to switch between them.

This is the first draft of the implementation according to the sketch that
was prepared last year between Linux Plumbers and KVM Forum.  The initial
version of the API was posted last October, and the implementation only
needed small changes.

Attempts made in the past, mostly in the context of Hyper-V VTLs and SEV-SNP
VMPLs, fell into two categories:

- use a single vCPU file descriptor, and store multiple copies of the state
  in a single struct kvm_vcpu.  This approach requires a lot of changes to
  provide multiple copies of affected fields, especially MMUs and APICs;
  and complex uAPI extensions to direct existing ioctls to a specific
  privilege level.  While more or less workable for SEV-SNP VMPLs, that
  was only because the copies of the register state were hidden
  in the VMSA (KVM does not manage it); it showed all its problems when
  applied to Hyper-V VTLs.

  The main advantage was that KVM kept the knowledge of the relationship
  between vCPUs that have the same id but belong to different privilege
  levels.  This is important in order to accelerate switches in-kernel.

- use multiple VM and vCPU file descriptors, and handle the switch entirely
  in userspace.  This got gnarly pretty fast for even more reasons than
  the previous case, for example because VMs could no longer share
  memslots, including dirty bitmaps and private/shared attributes (a
  substantial problem for SEV-SNP since VMPLs share their ASID).

  In contrast to the other case, the total lack of kernel-level sharing of
  register state, and the lack of any guarantee that vCPUs do not run in
  parallel, is what makes this approach problematic for both kernel and
  userspace.  In-kernel implementation of privilege level switches ranges
  from complicated to impossible, and userspace needs a lot of complexity
  as well to ensure that a higher-privileged VTL properly interrupts a
  lower-privileged one.

This design sits squarely in the middle: it gives the initial set of
VM and vCPU file descriptors the full set of ioctls + struct kvm_run,
whereas other privilege levels ("planes") instead only support a small
part of the KVM API.  In fact for the vm file descriptor it is only three
ioctls: KVM_CHECK_EXTENSION, KVM_SIGNAL_MSI, KVM_SET_MEMORY_ATTRIBUTES.
For vCPUs it is basically KVM_GET/SET_*.
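
As an illustration of the split described above, here is a minimal sketch of
an allow-list check for the VM file descriptor.  It is a toy model, not code
from the series: the toy_* names and the enum are hypothetical stand-ins for
the real ioctl numbers, and the two "plane 0 only" entries are merely
examples of calls outside the three listed above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for VM ioctls; not the real KVM constants. */
enum toy_vm_ioctl {
    TOY_CHECK_EXTENSION,
    TOY_SIGNAL_MSI,
    TOY_SET_MEMORY_ATTRIBUTES,
    TOY_SET_USER_MEMORY_REGION,   /* memslots: example of a plane-0-only call */
    TOY_GET_DIRTY_LOG,            /* another example of a plane-0-only call */
};

/*
 * Plane 0 (the initial VM file descriptor) accepts everything; any other
 * plane accepts only the three ioctls named in the cover letter.
 */
static inline bool toy_vm_ioctl_allowed(unsigned int plane_id,
                                        enum toy_vm_ioctl cmd)
{
    if (plane_id == 0)
        return true;

    switch (cmd) {
    case TOY_CHECK_EXTENSION:
    case TOY_SIGNAL_MSI:
    case TOY_SET_MEMORY_ATTRIBUTES:
        return true;
    default:
        return false;
    }
}
```

With this model, toy_vm_ioctl_allowed(1, TOY_SIGNAL_MSI) is true while
toy_vm_ioctl_allowed(1, TOY_SET_USER_MEMORY_REGION) is false.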

Most notably, memslots and KVM_RUN are *not* included (the choice of
which plane to run is done via vcpu->run), which solves a lot of
the problems in both of the previous approaches.  Compared to the
multiple-file-descriptors solution, it gets for free the ability to
avoid parallel execution of the same vCPUs in different privilege levels.
Compared to having a single file descriptor, churn is more limited, or
at least can be attacked in small bites.  For example in this series
only per-plane interrupt controllers are switched to use the new struct
kvm_plane in place of struct kvm, and that's more or less enough in
the absence of complex interrupt delivery scenarios.
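
To make the "plane is chosen via vcpu->run" point concrete, here is a toy
model of a single run entry point that picks per-plane state from a field in
the shared run area.  Everything here is an assumption made for illustration:
the sketch_* structs, the field names and the limit of 4 planes are
hypothetical and do not reflect the actual layout of struct kvm_run or
struct kvm_vcpu in the series.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_PLANES 4     /* hypothetical limit for this sketch */

/* Hypothetical per-plane vCPU state, reduced to two fields. */
struct sketch_plane_state {
    uint64_t rip;
    uint64_t entries;           /* how many times this plane was entered */
};

/* Stand-in for the shared run area; userspace writes "plane" before running. */
struct sketch_run {
    unsigned int plane;
};

/* One logical vCPU: one run area, one state copy per plane. */
struct sketch_vcpu {
    struct sketch_run run;
    struct sketch_plane_state state[SKETCH_MAX_PLANES];
};

/*
 * Single entry point.  Because there is only one call per logical vCPU and
 * it selects exactly one plane, two planes of the same vCPU cannot be
 * entered at the same time in this model.
 */
static inline struct sketch_plane_state *sketch_vcpu_run(struct sketch_vcpu *v)
{
    assert(v->run.plane < SKETCH_MAX_PLANES);

    struct sketch_plane_state *s = &v->state[v->run.plane];
    s->entries++;
    return s;
}
```

In this model, switching privilege level from userspace amounts to storing a
different index in run.plane before the next call; no second run-capable
file descriptor is involved.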

Changes to the userspace API are also relatively small; they boil down
to the introduction of a single new kind of file descriptor and almost
entirely fit in common code.  Reviewing these VM-wide and architecture-
independent changes should be the main purpose of this RFC, since 
there are still some things to fix:

- I named some fields "plane" instead of "plane_id" because I expected no
  fields of type struct kvm_plane*, but in retrospect that wasn't a great
  idea.

- online_vcpus counts across all planes but x86 code is still using it to
  deal with TSC synchronization.  Probably I will try and make kvmclock
  synchronization per-plane instead of per-VM.

- we're going to need a struct kvm_vcpu_plane similar to what Roy had in
  https://lore.kernel.org/kvm/cover.1726506534.git.roy.hopkins@suse.com/
  (probably smaller though).  Requests are per-plane for example, and I'm
  pretty sure any simplistic solution would have some corner cases where
  it's wrong; but it's a high churn change and I wanted to avoid that
  for this first posting.

There's a handful of locking TODOs where things should be checked more
carefully, but clearly identifying vCPU data that is not per-plane will
also simplify locking, thanks to having a single vcpu->mutex for the
whole plane.  So I'm not particularly worried about that; the TDX saga
hopefully has taught everyone to move in baby steps towards the intended
direction.

The handling of interrupt priorities is way more complicated than I
anticipated, unfortunately; everything else seems to fall into place
decently well---even taking into account the above incompleteness,
which anyway should not be a blocker for any VTL or VMPL experiments.
But do shout if anything makes you feel like I was too lazy, and/or you
want to puke.

Patches 1-2 are documentation and uAPI definitions.

Patches 3-9 are the common code for VM planes, while patches 10-14
are the common code for vCPU file descriptors on non-default planes.

Patches 15-26 are the x86-specific code, which is organized as follows:

- 15-20: convert APIC code to place its data in the new struct
  kvm_arch_plane instead of struct kvm_arch.

- 21-24: everything else except the new userspace exit, KVM_EXIT_PLANE_EVENT

- 25: KVM_EXIT_PLANE_EVENT, which is used when one plane interrupts another.

- 26: finally make the capability available to userspace

Finally, patches 27-29 are the testcases.  More are possible and planned,
but these are enough to say that, despite the missing bits, what exists
is not _completely_ broken.  I also didn't want to write dozens of tests
before committing to a selftests API.

Available for now at https://git.kernel.org/pub/scm/virt/kvm/kvm.git
branch planes-20250401.  I plan to place it in kvm-coco-queue, for lack
of a better place, as soon as TDX is merged into kvm/next and I test it
with the usual battery of kvm-unit-tests and real world guests.

Thanks,

Paolo

Paolo Bonzini (29):
  Documentation: kvm: introduce "VM plane" concept
  KVM: API definitions for plane userspace exit
  KVM: add plane info to structs
  KVM: introduce struct kvm_arch_plane
  KVM: add plane support to KVM_SIGNAL_MSI
  KVM: move mem_attr_array to kvm_plane
  KVM: do not use online_vcpus to test vCPU validity
  KVM: move vcpu_array to struct kvm_plane
  KVM: implement plane file descriptors ioctl and creation
  KVM: share statistics for same vCPU id on different planes
  KVM: anticipate allocation of dirty ring
  KVM: share dirty ring for same vCPU id on different planes
  KVM: implement vCPU creation for extra planes
  KVM: pass plane to kvm_arch_vcpu_create
  KVM: x86: pass vcpu to kvm_pv_send_ipi()
  KVM: x86: split "if" in __kvm_set_or_clear_apicv_inhibit
  KVM: x86: block creating irqchip if planes are active
  KVM: x86: track APICv inhibits per plane
  KVM: x86: move APIC map to kvm_arch_plane
  KVM: x86: add planes support for interrupt delivery
  KVM: x86: add infrastructure to share FPU across planes
  KVM: x86: implement initial plane support
  KVM: x86: extract kvm_post_set_cpuid
  KVM: x86: initialize CPUID for non-default planes
  KVM: x86: handle interrupt priorities for planes
  KVM: x86: enable up to 16 planes
  selftests: kvm: introduce basic test for VM planes
  selftests: kvm: add plane infrastructure
  selftests: kvm: add x86-specific plane test

-- 
2.49.0
Re: [RFC PATCH 00/29] KVM: VM planes
Posted by Sean Christopherson 2 days, 18 hours ago
On Tue, Apr 01, 2025, Paolo Bonzini wrote:
> I guess April 1st is not the best date to send out such a large series
> after months of radio silence, but here we are.

Heh, you missed an opportunity to spell it "plains" and then spend the entire
cover letter justifying the name :-)