I guess April 1st is not the best date to send out such a large series
after months of radio silence, but here we are.
AMD VMPLs, Intel TDX partitions, Microsoft Hyper-V VTLs, and ARM CCA planes
are all examples of virtual privilege level concepts that are exclusive to
guests. In all these specifications the hypervisor hosts multiple
copies of a vCPU's register state (or at least of most of it) and provides
hypercalls or instructions to switch between them.
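
As a rough mental model of the paragraph above, the following C sketch keeps
one copy of a (tiny, made-up) register file per privilege level and switches
between them. Every name in it (toy_vcpu, toy_regs, toy_switch_level,
TOY_MAX_LEVELS) is hypothetical and is not taken from KVM or from any of the
specifications; real implementations save far more state and perform the
switch in hardware or in the hypervisor.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_LEVELS 4	/* hypothetical limit, for illustration only */

/* A deliberately tiny stand-in for a vCPU's register state. */
struct toy_regs {
	uint64_t rip;
	uint64_t rsp;
};

/* One logical vCPU: a copy of the state per level, one of which is active. */
struct toy_vcpu {
	struct toy_regs regs[TOY_MAX_LEVELS];
	unsigned int cur_level;
};

/* Return the register copy that the vCPU is currently executing with. */
static struct toy_regs *toy_cur_regs(struct toy_vcpu *vcpu)
{
	assert(vcpu->cur_level < TOY_MAX_LEVELS);
	return &vcpu->regs[vcpu->cur_level];
}

/*
 * Model of the "switch" hypercall or instruction: only the index changes,
 * each level's saved registers stay untouched.  Returns 0 on success and
 * -1 if the requested level does not exist.
 */
static int toy_switch_level(struct toy_vcpu *vcpu, unsigned int level)
{
	if (level >= TOY_MAX_LEVELS)
		return -1;
	vcpu->cur_level = level;
	return 0;
}
```

Writing a register at one level and then switching away and back leaves that
level's copy intact, which is the property the switch has to preserve.
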
This is the first draft of the implementation according to the sketch that
was prepared last year between Linux Plumbers and KVM Forum. The initial
version of the API was posted last October, and the implementation only
needed small changes.
Attempts made in the past, mostly in the context of Hyper-V VTLs and SEV-SNP
VMPLs, fell into two categories:
- use a single vCPU file descriptor, and store multiple copies of the state
in a single struct kvm_vcpu. This approach requires a lot of changes to
provide multiple copies of affected fields, especially MMUs and APICs;
and complex uAPI extensions to direct existing ioctls to a specific
privilege level. While more or less workable for SEV-SNP VMPLs, that
was only because the copies of the register state were hidden
in the VMSA (KVM does not manage it); it showed all its problems when
applied to Hyper-V VTLs.
The main advantage was that KVM kept the knowledge of the relationship
between vCPUs that have the same id but belong to different privilege
levels. This is important in order to accelerate switches in-kernel.
- use multiple VM and vCPU file descriptors, and handle the switch entirely
in userspace. This got gnarly pretty fast for even more reasons than
the previous case, for example because VMs could no longer share
memslots, including dirty bitmaps and private/shared attributes (a
substantial problem for SEV-SNP since VMPLs share their ASID).
In contrast to the other case, the total lack of kernel-level sharing of
register state, and the lack of any guarantee that vCPUs do not run in
parallel, is what makes this approach problematic for both kernel and
userspace. In-kernel implementation of privilege level switches ranges
from complicated to impossible, and userspace needs a lot of complexity
as well to ensure that a higher-privileged VTL properly interrupts a
lower-privileged one.
This design sits squarely in the middle: it gives the initial set of
VM and vCPU file descriptors the full set of ioctls + struct kvm_run,
whereas other privilege levels ("planes") instead only support a small
part of the KVM API. In fact for the vm file descriptor it is only three
ioctls: KVM_CHECK_EXTENSION, KVM_SIGNAL_MSI, KVM_SET_MEMORY_ATTRIBUTES.
For vCPUs it is basically KVM_GET/SET_*.
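
The split described above can be pictured as a simple allow-list check. The
sketch below is not KVM code: the toy_* names and the enum values are
hypothetical stand-ins, treating plane 0 as the initial (default) plane is an
assumption made for the example, and only the three VM-level ioctl names are
taken from the text.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_PLANES 16	/* the x86 patches enable up to 16 planes */

/* Hypothetical stand-ins for a few VM-level ioctls; not the real numbers. */
enum toy_vm_ioctl {
	TOY_CHECK_EXTENSION,
	TOY_SIGNAL_MSI,
	TOY_SET_MEMORY_ATTRIBUTES,
	TOY_SET_USER_MEMORY_REGION,	/* memslots: initial VM fd only */
};

/*
 * Assumption: plane 0 is the initial VM file descriptor and accepts
 * everything; other planes accept only the three ioctls named above.
 */
static bool toy_vm_ioctl_allowed(unsigned int plane, enum toy_vm_ioctl cmd)
{
	assert(plane < TOY_MAX_PLANES);

	if (plane == 0)
		return true;

	switch (cmd) {
	case TOY_CHECK_EXTENSION:
	case TOY_SIGNAL_MSI:
	case TOY_SET_MEMORY_ATTRIBUTES:
		return true;
	default:
		return false;
	}
}
```
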
Most notably, memslots and KVM_RUN are *not* included (the choice of
which plane to run is done via vcpu->run), which solves a lot of
the problems in both of the previous approaches. Compared to the
multiple-file-descriptors solution, it gets for free the ability to
avoid parallel execution of the same vCPUs in different privilege levels.
Compared to having a single file descriptor, churn is more limited, or
at least can be attacked in small bites. For example in this series
only per-plane interrupt controllers are switched to use the new struct
kvm_plane in place of struct kvm, and that's more or less enough in
the absence of complex interrupt delivery scenarios.
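
To make the point about parallel execution concrete, here is a toy model in
which a vCPU id has a single run entry point and the plane to enter is just a
field in the shared run structure. The toy_* names, the "plane" field and the
"running" flag are illustrative assumptions, not the uAPI or the locking from
this series.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_PLANES 16

/* Stand-in for the shared struct kvm_run; the field name is an assumption. */
struct toy_run {
	unsigned int plane;	/* which plane the next run should enter */
};

struct toy_plane_vcpu {
	struct toy_run run;
	bool running;		/* stands in for holding vcpu->mutex */
	unsigned int entries[TOY_MAX_PLANES];	/* per-plane entry count */
};

/*
 * The only run entry point for this vCPU id.  Because every plane goes
 * through it, two planes of the same vCPU can never execute concurrently.
 * Returns 0 on success, -1 if already running or if the plane is invalid.
 */
static int toy_vcpu_run(struct toy_plane_vcpu *vcpu)
{
	assert(vcpu);

	if (vcpu->running || vcpu->run.plane >= TOY_MAX_PLANES)
		return -1;

	vcpu->running = true;
	vcpu->entries[vcpu->run.plane]++;	/* "enter the guest" */
	vcpu->running = false;
	return 0;
}
```

With separate VM and vCPU file descriptors per privilege level there is no
such single choke point, so userspace would have to enforce the exclusion
itself.
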
Changes to the userspace API are also relatively small; they boil down
to the introduction of a single new kind of file descriptor and almost
entirely fit in common code. Reviewing these VM-wide and architecture-
independent changes should be the main purpose of this RFC, since
there are still some things to fix:
- I named some fields "plane" instead of "plane_id" because I expected no
fields of type struct kvm_plane*, but in retrospect that wasn't a great
idea.
- online_vcpus counts across all planes but x86 code is still using it to
deal with TSC synchronization. Probably I will try and make kvmclock
synchronization per-plane instead of per-VM.
- we're going to need a struct kvm_vcpu_plane similar to what Roy had in
https://lore.kernel.org/kvm/cover.1726506534.git.roy.hopkins@suse.com/
(probably smaller though). Requests are per-plane for example, and I'm
pretty sure any simplistic solution would have some corner cases where
it's wrong; but it's a high churn change and I wanted to avoid that
for this first posting.
There's a handful of locking TODOs where things should be checked more
carefully, but clearly identifying vCPU data that is not per-plane will
also simplify locking, thanks to having a single vcpu->mutex for the
whole plane. So I'm not particularly worried about that; the TDX saga
hopefully has taught everyone to move in baby steps towards the intended
direction.
The handling of interrupt priorities is way more complicated than I
anticipated, unfortunately; everything else seems to fall into place
decently well---even taking into account the above incompleteness,
which anyway should not be a blocker for any VTL or VMPL experiments.
But do shout if anything makes you feel like I was too lazy, and/or you
want to puke.
Patches 1-2 are documentation and uAPI definitions.
Patches 3-9 are the common code for VM planes, while patches 10-14
are the common code for vCPU file descriptors on non-default planes.
Patches 15-26 are the x86-specific code, which is organized as follows:
- 15-20: convert APIC code to place its data in the new struct
kvm_arch_plane instead of struct kvm_arch.
- 21-24: everything else except the new userspace exit, KVM_EXIT_PLANE_EVENT
- 25: KVM_EXIT_PLANE_EVENT, which is used when one plane interrupts another.
- 26: finally make the capability available to userspace
Patches 27-29 finally are the testcases. More are possible and planned,
but these are enough to say that, despite the missing bits, what
exists is not _completely_ broken. I also didn't want to write dozens of tests
before committing to a selftests API.
Available for now at https://git.kernel.org/pub/scm/virt/kvm/kvm.git
branch planes-20250401. I plan to place it in kvm-coco-queue, for lack
of a better place, as soon as TDX is merged into kvm/next and I test it
with the usual battery of kvm-unit-tests and real world guests.
Thanks,
Paolo
Paolo Bonzini (29):
Documentation: kvm: introduce "VM plane" concept
KVM: API definitions for plane userspace exit
KVM: add plane info to structs
KVM: introduce struct kvm_arch_plane
KVM: add plane support to KVM_SIGNAL_MSI
KVM: move mem_attr_array to kvm_plane
KVM: do not use online_vcpus to test vCPU validity
KVM: move vcpu_array to struct kvm_plane
KVM: implement plane file descriptors ioctl and creation
KVM: share statistics for same vCPU id on different planes
KVM: anticipate allocation of dirty ring
KVM: share dirty ring for same vCPU id on different planes
KVM: implement vCPU creation for extra planes
KVM: pass plane to kvm_arch_vcpu_create
KVM: x86: pass vcpu to kvm_pv_send_ipi()
KVM: x86: split "if" in __kvm_set_or_clear_apicv_inhibit
KVM: x86: block creating irqchip if planes are active
KVM: x86: track APICv inhibits per plane
KVM: x86: move APIC map to kvm_arch_plane
KVM: x86: add planes support for interrupt delivery
KVM: x86: add infrastructure to share FPU across planes
KVM: x86: implement initial plane support
KVM: x86: extract kvm_post_set_cpuid
KVM: x86: initialize CPUID for non-default planes
KVM: x86: handle interrupt priorities for planes
KVM: x86: enable up to 16 planes
selftests: kvm: introduce basic test for VM planes
selftests: kvm: add plane infrastructure
selftests: kvm: add x86-specific plane test
Documentation/virt/kvm/api.rst | 245 +++++++--
Documentation/virt/kvm/locking.rst | 3 +
Documentation/virt/kvm/vcpu-requests.rst | 7 +
arch/arm64/include/asm/kvm_host.h | 5 +
arch/arm64/kvm/arm.c | 4 +-
arch/arm64/kvm/handle_exit.c | 6 +-
arch/arm64/kvm/hyp/nvhe/gen-hyprel.c | 4 +-
arch/arm64/kvm/mmio.c | 4 +-
arch/loongarch/include/asm/kvm_host.h | 5 +
arch/loongarch/kvm/exit.c | 8 +-
arch/loongarch/kvm/vcpu.c | 4 +-
arch/mips/include/asm/kvm_host.h | 5 +
arch/mips/kvm/emulate.c | 2 +-
arch/mips/kvm/mips.c | 32 +-
arch/mips/kvm/vz.c | 18 +-
arch/powerpc/include/asm/kvm_host.h | 5 +
arch/powerpc/kvm/book3s.c | 2 +-
arch/powerpc/kvm/book3s_hv.c | 46 +-
arch/powerpc/kvm/book3s_hv_rm_xics.c | 8 +-
arch/powerpc/kvm/book3s_pr.c | 22 +-
arch/powerpc/kvm/book3s_pr_papr.c | 2 +-
arch/powerpc/kvm/powerpc.c | 6 +-
arch/powerpc/kvm/timing.h | 28 +-
arch/riscv/include/asm/kvm_host.h | 5 +
arch/riscv/kvm/vcpu.c | 4 +-
arch/riscv/kvm/vcpu_exit.c | 10 +-
arch/riscv/kvm/vcpu_insn.c | 16 +-
arch/riscv/kvm/vcpu_sbi.c | 2 +-
arch/riscv/kvm/vcpu_sbi_hsm.c | 2 +-
arch/s390/include/asm/kvm_host.h | 5 +
arch/s390/kvm/diag.c | 18 +-
arch/s390/kvm/intercept.c | 20 +-
arch/s390/kvm/interrupt.c | 48 +-
arch/s390/kvm/kvm-s390.c | 10 +-
arch/s390/kvm/priv.c | 60 +--
arch/s390/kvm/sigp.c | 50 +-
arch/s390/kvm/vsie.c | 2 +-
arch/x86/include/asm/kvm_host.h | 46 +-
arch/x86/kvm/cpuid.c | 57 +-
arch/x86/kvm/cpuid.h | 2 +
arch/x86/kvm/debugfs.c | 2 +-
arch/x86/kvm/hyperv.c | 7 +-
arch/x86/kvm/i8254.c | 7 +-
arch/x86/kvm/ioapic.c | 4 +-
arch/x86/kvm/irq_comm.c | 14 +-
arch/x86/kvm/kvm_cache_regs.h | 4 +-
arch/x86/kvm/lapic.c | 147 +++--
arch/x86/kvm/mmu/mmu.c | 41 +-
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
arch/x86/kvm/svm/sev.c | 4 +-
arch/x86/kvm/svm/svm.c | 21 +-
arch/x86/kvm/vmx/tdx.c | 8 +-
arch/x86/kvm/vmx/vmx.c | 20 +-
arch/x86/kvm/x86.c | 319 ++++++++---
arch/x86/kvm/xen.c | 1 +
include/linux/kvm_host.h | 130 +++--
include/linux/kvm_types.h | 1 +
include/uapi/linux/kvm.h | 28 +-
tools/testing/selftests/kvm/Makefile.kvm | 2 +
.../testing/selftests/kvm/include/kvm_util.h | 48 ++
.../selftests/kvm/include/x86/processor.h | 1 +
tools/testing/selftests/kvm/lib/kvm_util.c | 65 ++-
.../testing/selftests/kvm/lib/x86/processor.c | 15 +
tools/testing/selftests/kvm/plane_test.c | 103 ++++
tools/testing/selftests/kvm/x86/plane_test.c | 270 ++++++++++
virt/kvm/dirty_ring.c | 5 +-
virt/kvm/guest_memfd.c | 3 +-
virt/kvm/irqchip.c | 5 +-
virt/kvm/kvm_main.c | 500 ++++++++++++++----
69 files changed, 1991 insertions(+), 614 deletions(-)
create mode 100644 tools/testing/selftests/kvm/plane_test.c
create mode 100644 tools/testing/selftests/kvm/x86/plane_test.c
--
2.49.0