[RFC PATCH 00/29] KVM: VM planes
Posted by Paolo Bonzini 10 months, 1 week ago
I guess April 1st is not the best date to send out such a large series
after months of radio silence, but here we are.

AMD VMPLs, Intel TDX partitions, Microsoft Hyper-V VTLs, and ARM CCA planes
are all examples of virtual privilege level concepts that are exclusive to
guests.  In all these specifications the hypervisor hosts multiple
copies of a vCPU's register state (or at least of most of it) and provides
hypercalls or instructions to switch between them.

This is the first draft of the implementation according to the sketch that
was prepared last year between Linux Plumbers and KVM Forum.  The initial
version of the API was posted last October, and the implementation only
needed small changes.

Attempts made in the past, mostly in the context of Hyper-V VTLs and SEV-SNP
VMPLs, fell into two categories:

- use a single vCPU file descriptor, and store multiple copies of the state
  in a single struct kvm_vcpu.  This approach requires a lot of changes to
  provide multiple copies of the affected fields, especially MMUs and
  APICs, as well as complex uAPI extensions to direct existing ioctls to
  a specific privilege level.  While this was more or less workable for
  SEV-SNP VMPLs, that was only because the copies of the register state
  were hidden in the VMSA, which KVM does not manage; it showed all its
  problems when applied to Hyper-V VTLs.

  The main advantage was that KVM kept the knowledge of the relationship
  between vCPUs that have the same id but belong to different privilege
  levels.  This is important in order to accelerate switches in-kernel.

- use multiple VM and vCPU file descriptors, and handle the switch entirely
  in userspace.  This got gnarly pretty fast for even more reasons than
  the previous case, for example because VMs could no longer share
  memslots, including dirty bitmaps and private/shared attributes (a
  substantial problem for SEV-SNP since VMPLs share their ASID).

  In contrast to the other case, the total lack of kernel-level sharing of
  register state, and the lack of any guarantee that vCPUs do not run in
  parallel, is what makes this approach problematic for both kernel and
  userspace.  An in-kernel implementation of the privilege level switch
  ranges from complicated to impossible, and userspace also needs a lot
  of complexity to ensure that higher-privileged VTLs properly interrupt
  a lower-privileged one.

This design sits squarely in the middle: it gives the initial set of
VM and vCPU file descriptors the full set of ioctls + struct kvm_run,
whereas other privilege levels ("planes") instead only support a small
part of the KVM API.  In fact, for the VM file descriptor it is only three
ioctls: KVM_CHECK_EXTENSION, KVM_SIGNAL_MSI and KVM_SET_MEMORY_ATTRIBUTES.
For vCPUs it is basically the KVM_GET_*/KVM_SET_* state accessors.

Most notably, memslots and KVM_RUN are *not* included (the choice of
which plane to run is done via vcpu->run), which solves a lot of
the problems in both of the previous approaches.  Compared to the
multiple-file-descriptors solution, it gets for free the ability to
avoid parallel execution of the same vCPUs in different privilege levels.
Compared to having a single file descriptor, the churn is more limited,
or at least it can be attacked in small bites.  For example, in this series
only per-plane interrupt controllers are switched to use the new struct
kvm_plane in place of struct kvm, and that's more or less enough in
the absence of complex interrupt delivery scenarios.
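
To make the shape of the uAPI concrete, here is a rough sketch of the
userspace flow.  This is only an illustration: KVM_CREATE_PLANE (with a
placeholder ioctl number) and the run-structure plane selector are names
assumed from the description above, not necessarily what patches 1-2
actually define.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

#ifndef KVM_CREATE_PLANE
#define KVM_CREATE_PLANE _IO(KVMIO, 0xd0)   /* placeholder, not the real number */
#endif

void plane_sketch(void)
{
        int kvm = open("/dev/kvm", O_RDWR);
        int vm = ioctl(kvm, KVM_CREATE_VM, 0);

        /* Plane-0 vCPU: full ioctl surface and the kvm_run area. */
        int vcpu0 = ioctl(vm, KVM_CREATE_VCPU, 0);
        struct kvm_run *run = mmap(NULL,
                                   ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0),
                                   PROT_READ | PROT_WRITE, MAP_SHARED,
                                   vcpu0, 0);

        /* Assumed ioctl name; the plane id is the argument. */
        int plane1 = ioctl(vm, KVM_CREATE_PLANE, 1);

        /* vCPU 0's state on plane 1; this fd accepts roughly only the
           KVM_GET_* and KVM_SET_* state accessors. */
        int vcpu0_p1 = ioctl(plane1, KVM_CREATE_VCPU, 0);

        /* No KVM_RUN on the plane-1 fds: the plane to run is chosen
           through the plane-0 vCPU's run structure (field name assumed),
           then KVM_RUN is issued as usual. */
        /* run->plane = 1; */
        (void)run; (void)vcpu0_p1;
        ioctl(vcpu0, KVM_RUN, 0);
}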

Changes to the userspace API are also relatively small; they boil down
to the introduction of a single new kind of file descriptor and almost
entirely fit in common code.  Reviewing these VM-wide and architecture-
independent changes should be the main purpose of this RFC, since 
there are still some things to fix:

- I named some fields "plane" instead of "plane_id" because I did not
  expect to have any fields of type struct kvm_plane *; in retrospect
  that wasn't a great idea.

- online_vcpus counts vCPUs across all planes, but x86 code still uses it
  to deal with TSC synchronization.  I will probably try to make kvmclock
  synchronization per-plane instead of per-VM.

- we're going to need a struct kvm_vcpu_plane similar to what Roy had in
  https://lore.kernel.org/kvm/cover.1726506534.git.roy.hopkins@suse.com/
  (probably smaller though).  Requests are per-plane, for example, and I'm
  pretty sure any simplistic solution would have some corner cases where
  it's wrong; but it's a high-churn change and I wanted to avoid that
  for this first posting.

There's a handful of locking TODOs where things should be checked more
carefully, but clearly identifying vCPU data that is not per-plane will
also simplify locking, thanks to having a single vcpu->mutex for the
whole plane.  So I'm not particularly worried about that; the TDX saga
hopefully has taught everyone to move in baby steps towards the intended
direction.

The handling of interrupt priorities is way more complicated than I
anticipated, unfortunately; everything else seems to fall into place
decently well---even taking into account the above incompleteness,
which anyway should not be a blocker for any VTL or VMPL experiments.
But do shout if anything makes you feel like I was too lazy, and/or you
want to puke.

Patches 1-2 are documentation and uAPI definitions.

Patches 3-9 are the common code for VM planes, while patches 10-14
are the common code for vCPU file descriptors on non-default planes.

Patches 15-26 are the x86-specific code, which is organized as follows:

- 15-20: convert APIC code to place its data in the new struct
  kvm_arch_plane instead of struct kvm_arch.

- 21-24: everything else except the new userspace exit, KVM_EXIT_PLANE_EVENT

- 25: KVM_EXIT_PLANE_EVENT, which is used when one plane interrupts another.

- 26: finally make the capability available to userspace

Finally, patches 27-29 are the testcases.  More are possible and planned,
but these are enough to say that, despite the missing bits, what exists
is not _completely_ broken.  I also didn't want to write dozens of tests
before committing to a selftests API.

Available for now at https://git.kernel.org/pub/scm/virt/kvm/kvm.git
branch planes-20250401.  I plan to place it in kvm-coco-queue, for lack
of a better place, as soon as TDX is merged into kvm/next and I have
tested it with the usual battery of kvm-unit-tests and real-world guests.

Thanks,

Paolo

Paolo Bonzini (29):
  Documentation: kvm: introduce "VM plane" concept
  KVM: API definitions for plane userspace exit
  KVM: add plane info to structs
  KVM: introduce struct kvm_arch_plane
  KVM: add plane support to KVM_SIGNAL_MSI
  KVM: move mem_attr_array to kvm_plane
  KVM: do not use online_vcpus to test vCPU validity
  KVM: move vcpu_array to struct kvm_plane
  KVM: implement plane file descriptors ioctl and creation
  KVM: share statistics for same vCPU id on different planes
  KVM: anticipate allocation of dirty ring
  KVM: share dirty ring for same vCPU id on different planes
  KVM: implement vCPU creation for extra planes
  KVM: pass plane to kvm_arch_vcpu_create
  KVM: x86: pass vcpu to kvm_pv_send_ipi()
  KVM: x86: split "if" in __kvm_set_or_clear_apicv_inhibit
  KVM: x86: block creating irqchip if planes are active
  KVM: x86: track APICv inhibits per plane
  KVM: x86: move APIC map to kvm_arch_plane
  KVM: x86: add planes support for interrupt delivery
  KVM: x86: add infrastructure to share FPU across planes
  KVM: x86: implement initial plane support
  KVM: x86: extract kvm_post_set_cpuid
  KVM: x86: initialize CPUID for non-default planes
  KVM: x86: handle interrupt priorities for planes
  KVM: x86: enable up to 16 planes
  selftests: kvm: introduce basic test for VM planes
  selftests: kvm: add plane infrastructure
  selftests: kvm: add x86-specific plane test

 Documentation/virt/kvm/api.rst                | 245 +++++++--
 Documentation/virt/kvm/locking.rst            |   3 +
 Documentation/virt/kvm/vcpu-requests.rst      |   7 +
 arch/arm64/include/asm/kvm_host.h             |   5 +
 arch/arm64/kvm/arm.c                          |   4 +-
 arch/arm64/kvm/handle_exit.c                  |   6 +-
 arch/arm64/kvm/hyp/nvhe/gen-hyprel.c          |   4 +-
 arch/arm64/kvm/mmio.c                         |   4 +-
 arch/loongarch/include/asm/kvm_host.h         |   5 +
 arch/loongarch/kvm/exit.c                     |   8 +-
 arch/loongarch/kvm/vcpu.c                     |   4 +-
 arch/mips/include/asm/kvm_host.h              |   5 +
 arch/mips/kvm/emulate.c                       |   2 +-
 arch/mips/kvm/mips.c                          |  32 +-
 arch/mips/kvm/vz.c                            |  18 +-
 arch/powerpc/include/asm/kvm_host.h           |   5 +
 arch/powerpc/kvm/book3s.c                     |   2 +-
 arch/powerpc/kvm/book3s_hv.c                  |  46 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c          |   8 +-
 arch/powerpc/kvm/book3s_pr.c                  |  22 +-
 arch/powerpc/kvm/book3s_pr_papr.c             |   2 +-
 arch/powerpc/kvm/powerpc.c                    |   6 +-
 arch/powerpc/kvm/timing.h                     |  28 +-
 arch/riscv/include/asm/kvm_host.h             |   5 +
 arch/riscv/kvm/vcpu.c                         |   4 +-
 arch/riscv/kvm/vcpu_exit.c                    |  10 +-
 arch/riscv/kvm/vcpu_insn.c                    |  16 +-
 arch/riscv/kvm/vcpu_sbi.c                     |   2 +-
 arch/riscv/kvm/vcpu_sbi_hsm.c                 |   2 +-
 arch/s390/include/asm/kvm_host.h              |   5 +
 arch/s390/kvm/diag.c                          |  18 +-
 arch/s390/kvm/intercept.c                     |  20 +-
 arch/s390/kvm/interrupt.c                     |  48 +-
 arch/s390/kvm/kvm-s390.c                      |  10 +-
 arch/s390/kvm/priv.c                          |  60 +--
 arch/s390/kvm/sigp.c                          |  50 +-
 arch/s390/kvm/vsie.c                          |   2 +-
 arch/x86/include/asm/kvm_host.h               |  46 +-
 arch/x86/kvm/cpuid.c                          |  57 +-
 arch/x86/kvm/cpuid.h                          |   2 +
 arch/x86/kvm/debugfs.c                        |   2 +-
 arch/x86/kvm/hyperv.c                         |   7 +-
 arch/x86/kvm/i8254.c                          |   7 +-
 arch/x86/kvm/ioapic.c                         |   4 +-
 arch/x86/kvm/irq_comm.c                       |  14 +-
 arch/x86/kvm/kvm_cache_regs.h                 |   4 +-
 arch/x86/kvm/lapic.c                          | 147 +++--
 arch/x86/kvm/mmu/mmu.c                        |  41 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |   2 +-
 arch/x86/kvm/svm/sev.c                        |   4 +-
 arch/x86/kvm/svm/svm.c                        |  21 +-
 arch/x86/kvm/vmx/tdx.c                        |   8 +-
 arch/x86/kvm/vmx/vmx.c                        |  20 +-
 arch/x86/kvm/x86.c                            | 319 ++++++++---
 arch/x86/kvm/xen.c                            |   1 +
 include/linux/kvm_host.h                      | 130 +++--
 include/linux/kvm_types.h                     |   1 +
 include/uapi/linux/kvm.h                      |  28 +-
 tools/testing/selftests/kvm/Makefile.kvm      |   2 +
 .../testing/selftests/kvm/include/kvm_util.h  |  48 ++
 .../selftests/kvm/include/x86/processor.h     |   1 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  65 ++-
 .../testing/selftests/kvm/lib/x86/processor.c |  15 +
 tools/testing/selftests/kvm/plane_test.c      | 103 ++++
 tools/testing/selftests/kvm/x86/plane_test.c  | 270 ++++++++++
 virt/kvm/dirty_ring.c                         |   5 +-
 virt/kvm/guest_memfd.c                        |   3 +-
 virt/kvm/irqchip.c                            |   5 +-
 virt/kvm/kvm_main.c                           | 500 ++++++++++++++----
 69 files changed, 1991 insertions(+), 614 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/plane_test.c
 create mode 100644 tools/testing/selftests/kvm/x86/plane_test.c

-- 
2.49.0
Re: [RFC PATCH 00/29] KVM: VM planes
Posted by Tom Lendacky 8 months ago
On 4/1/25 11:10, Paolo Bonzini wrote:
> I guess April 1st is not the best date to send out such a large series
> after months of radio silence, but here we are.

There were some miscellaneous fixes I had to apply to get the series to
compile and start working properly. I didn't break them out by patch #,
but here they are:

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 21dbc539cbe7..9d078eb001b1 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1316,32 +1316,35 @@ static void kvm_lapic_deliver_interrupt(struct kvm_vcpu *vcpu, struct kvm_lapic
 {
 	struct kvm_vcpu *plane0_vcpu = vcpu->plane0;
 	struct kvm_plane *running_plane;
+	int irr_pending_planes;
 	u16 req_exit_planes;
 
 	kvm_x86_call(deliver_interrupt)(apic, delivery_mode, trig_mode, vector);
 
 	/*
-	 * test_and_set_bit implies a memory barrier, so IRR is written before
+	 * atomic_fetch_or implies a memory barrier, so IRR is written before
 	 * reading irr_pending_planes below...
 	 */
-	if (!test_and_set_bit(vcpu->plane, &plane0_vcpu->arch.irr_pending_planes)) {
-		/*
-		 * ... and also running_plane and req_exit_planes are read after writing
-		 * irr_pending_planes.  Both barriers pair with kvm_arch_vcpu_ioctl_run().
-		 */
-		smp_mb__after_atomic();
+	irr_pending_planes = atomic_fetch_or(BIT(vcpu->plane), &plane0_vcpu->arch.irr_pending_planes);
+	if (irr_pending_planes & BIT(vcpu->plane))
+		return;
 
-		running_plane = READ_ONCE(plane0_vcpu->running_plane);
-		if (!running_plane)
-			return;
+	/*
+	 * ... and also running_plane and req_exit_planes are read after writing
+	 * irr_pending_planes.  Both barriers pair with kvm_arch_vcpu_ioctl_run().
+	 */
+	smp_mb__after_atomic();
 
-		req_exit_planes = READ_ONCE(plane0_vcpu->req_exit_planes);
-		if (!(req_exit_planes & BIT(vcpu->plane)))
-			return;
+	running_plane = READ_ONCE(plane0_vcpu->running_plane);
+	if (!running_plane)
+		return;
 
-		kvm_make_request(KVM_REQ_PLANE_INTERRUPT,
-				 kvm_get_plane_vcpu(running_plane, vcpu->vcpu_id));
-	}
+	req_exit_planes = READ_ONCE(plane0_vcpu->req_exit_planes);
+	if (!(req_exit_planes & BIT(vcpu->plane)))
+		return;
+
+	kvm_make_request(KVM_REQ_PLANE_INTERRUPT,
+			 kvm_get_plane_vcpu(running_plane, vcpu->vcpu_id));
 }
 
 /*
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 9d4492862c11..130d895f1d95 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -458,7 +458,7 @@ static int __sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp,
 	INIT_LIST_HEAD(&sev->mirror_vms);
 	sev->need_init = false;
 
-	kvm_set_apicv_inhibit(kvm->planes[[0], APICV_INHIBIT_REASON_SEV);
+	kvm_set_apicv_inhibit(kvm->planes[0], APICV_INHIBIT_REASON_SEV);
 
 	return 0;
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 917bfe8db101..656b69eabc59 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3252,7 +3252,7 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu)
 	 * All vCPUs which run still run nested, will remain to have their
 	 * AVIC still inhibited due to per-cpu AVIC inhibition.
 	 */
-	kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
+	kvm_clear_apicv_inhibit(vcpu->kvm->planes[vcpu->plane], APICV_INHIBIT_REASON_IRQWIN);
 
 	++vcpu->stat->irq_window_exits;
 	return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 65bc28e82140..704e8f80898f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11742,7 +11742,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		 * the other side will certainly see the cleared bit irr_pending_planes
 		 * and set it, and vice versa.
 		 */
-		clear_bit(plane_id, &plane0_vcpu->arch.irr_pending_planes);
+		atomic_and(~BIT(plane_id), &plane0_vcpu->arch.irr_pending_planes);
 		smp_mb__after_atomic();
 		if (kvm_lapic_find_highest_irr(vcpu))
 			atomic_or(BIT(plane_id), &plane0_vcpu->arch.irr_pending_planes);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3a04fdf0865d..efd45e05fddf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4224,7 +4224,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm_plane *plane, struct kvm_vcpu *pl
 	 * release semantics, which ensures the write is visible to kvm_get_vcpu().
 	 */
 	vcpu->plane = -1;
-	if (plane->plane)
+	if (!plane->plane)
 		vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
 	else
 		vcpu->vcpu_idx = plane0_vcpu->vcpu_idx;
@@ -4249,7 +4249,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm_plane *plane, struct kvm_vcpu *pl
 	if (r < 0)
 		goto kvm_put_xa_erase;
 
-	if (!plane0_vcpu)
+	if (!plane->plane)
 		atomic_inc(&kvm->online_vcpus);
 
 	/*
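
For context, the barrier pairing that the lapic.c and x86.c hunks above
preserve is a Dekker-style handshake between interrupt delivery and
KVM_RUN.  A simplified, self-contained sketch of the pattern (struct and
helper names are invented for illustration; the real code lives in
kvm_lapic_deliver_interrupt() and kvm_arch_vcpu_ioctl_run()):

#include <linux/atomic.h>
#include <linux/bits.h>
#include <linux/compiler.h>

struct p0_state {
	atomic_t irr_pending_planes;	/* bit n: plane n has a pending IRR */
	void *running_plane;		/* non-NULL while the vCPU runs */
};

void kick_running_plane(struct p0_state *p0, int plane);	/* invented */
bool irr_has_pending(struct p0_state *p0);			/* invented */

/* Interrupt-delivery side: publish the IRR bit, then check the runner. */
void delivery_side(struct p0_state *p0, int plane)
{
	/* atomic_fetch_or implies a full barrier, so the IRR bit is
	 * visible before running_plane is read below (the patch keeps
	 * the explicit barrier as well). */
	if (atomic_fetch_or(BIT(plane), &p0->irr_pending_planes) & BIT(plane))
		return;		/* already signalled, nothing to do */
	smp_mb__after_atomic();

	if (READ_ONCE(p0->running_plane))
		kick_running_plane(p0, plane);
}

/* KVM_RUN side: clear the bit, then re-check the IRR.  Either this side
 * observes the pending interrupt, or the delivery side observes the
 * cleared bit and sets it again; the two barriers pair up. */
void run_side(struct p0_state *p0, int plane)
{
	atomic_and(~BIT(plane), &p0->irr_pending_planes);
	smp_mb__after_atomic();

	if (irr_has_pending(p0))
		atomic_or(BIT(plane), &p0->irr_pending_planes);
}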

Thanks,
Tom

Re: [RFC PATCH 00/29] KVM: VM planes
Posted by Sean Christopherson 10 months, 1 week ago
On Tue, Apr 01, 2025, Paolo Bonzini wrote:
> I guess April 1st is not the best date to send out such a large series
> after months of radio silence, but here we are.

Heh, you missed an opportunity to spell it "plains" and then spend the entire
cover letter justifying the name :-)
Re: [RFC PATCH 00/29] KVM: VM planes
Posted by Vaishali Thakkar 6 months ago
Adding Joerg's functional email id.

On 4/1/25 6:10 PM, Paolo Bonzini wrote:
> [...]
> - online_vcpus counts vCPUs across all planes, but x86 code still uses it
>   to deal with TSC synchronization.  I will probably try to make kvmclock
>   synchronization per-plane instead of per-VM.
> 

Hi Paolo,

Is there still a plan to make kvmclock synchronization per-plane instead
of per-VM? Do you plan to handle it as part of this patchset, or do you
think it should be handled separately on top of it?

I'm asking because coconut-svsm needs a monotonic clock source that
adheres to wall-clock time, and we have been exploring several approaches
to achieve this.  One of the ideas is to use kvmclock, provided it can
support a per-plane instance that remains synchronized across planes.

Thanks.

