x86/boot, KVM: Move VMXON/VMXOFF handling from KVM to CPU lifecycle

[RFC PATCH v1 0/5] x86/boot, KVM: Move VMXON/VMXOFF handling from KVM to CPU lifecycle

Posted by Xin Li (Intel) 3 weeks, 2 days ago

There is now broad consensus that TDX should be decoupled from KVM. To
achieve this separation, it is necessary to move VMXON/VMXOFF handling
out of KVM. Sean has also discussed this approach in several TDX patch
series threads, e.g. [1], and has already done a round of refactoring
in KVM [2].

The simplest thing we could think of is to execute VMXON during the CPU
startup phase and VMXOFF during the CPU shutdown phase, even although
this leaves VMX on when it doesn't strictly need to be on.

This RFC series demonstrates the idea and seeks feedback from the KVM
community on its viability.


The benefits of doing VMXON/VMXOFF in the CPU startup/shutdown phase:

  1) Eliminates in-flight VMXON/VMXOFF during CPU hotplug, system reboot,
     or kexec while KVM is loading or unloading.

  2) Removes the “insane dances” for handling unexpected VMXON/VMXOFF
     execution, including the emergency reboot disable virtualization
     mechanism and kvm_rebooting.

  3) Allows KVM and other hypervisors on Linux to omit explicit VMX
     enable/disable logic.


This RFC series follows the direction and does the following:

  1) Move VMXON to the CPU startup phase instead of KVM initialization.

  2) Move VMXOFF to the CPU shutdown phase instead of KVM teardown.

  3) Move VMCLEAR of VMCSs to cpu_disable_virtualization().

  4) Remove the emergency reboot disable virtualization mechanism.

  5) Remove kvm_rebooting.


AMD SVM support is not included, as I do not have access to AMD hardware,
but adding it should be straightforward (currently broken in this RFC).

Note, the first two patches should ideally be merged into a single patch
to avoid breaking functionality in between. However, they are kept
separate in this RFC for clarity and easier review. I will merge them
if this approach proves viable.


[1] https://lore.kernel.org/lkml/ZhawUG0BduPVvVhN@google.com/
[2] https://lore.kernel.org/lkml/20240830043600.127750-1-seanjc@google.com/


Xin Li (Intel) (5):
  x86/boot: Shift VMXON from KVM init to CPU startup phase
  x86/boot: Move VMXOFF from KVM teardown to CPU shutdown phase
  x86/shutdown, KVM: VMX: Move VMCLEAR of VMCSs to
    cpu_disable_virtualization()
  x86/reboot: Remove emergency_reboot_disable_virtualization()
  KVM: Remove kvm_rebooting and its references

 arch/x86/include/asm/kvm_host.h  |   1 -
 arch/x86/include/asm/processor.h |   3 +
 arch/x86/include/asm/reboot.h    |  11 --
 arch/x86/include/asm/vmx.h       |   5 +
 arch/x86/kernel/cpu/common.c     | 162 +++++++++++++++++++++++++++
 arch/x86/kernel/crash.c          |   5 +-
 arch/x86/kernel/process.c        |   3 +
 arch/x86/kernel/reboot.c         |  88 ++-------------
 arch/x86/kernel/smp.c            |   3 +-
 arch/x86/kernel/smpboot.c        |   6 +
 arch/x86/kvm/svm/svm.c           |   8 --
 arch/x86/kvm/svm/vmenter.S       |  42 +++----
 arch/x86/kvm/vmx/main.c          |   1 -
 arch/x86/kvm/vmx/tdx.c           |   4 +-
 arch/x86/kvm/vmx/vmcs.h          |  10 +-
 arch/x86/kvm/vmx/vmenter.S       |   2 -
 arch/x86/kvm/vmx/vmx.c           | 185 ++-----------------------------
 arch/x86/kvm/x86.c               |  18 +--
 arch/x86/power/cpu.c             |  10 +-
 include/linux/kvm_host.h         |   9 --
 virt/kvm/kvm_main.c              |  29 +----
 21 files changed, 230 insertions(+), 375 deletions(-)


base-commit: 76eeb9b8de9880ca38696b2fb56ac45ac0a25c6c
-- 
2.51.0

Re: [RFC PATCH v1 0/5] x86/boot, KVM: Move VMXON/VMXOFF handling from KVM to CPU lifecycle

Posted by Sean Christopherson 3 weeks ago

On Tue, Sep 09, 2025, Xin Li (Intel) wrote:
> There is now broad consensus that TDX should be decoupled from KVM. To
> achieve this separation, it is necessary to move VMXON/VMXOFF handling
> out of KVM. Sean has also discussed this approach in several TDX patch
> series threads, e.g. [1], and has already done a round of refactoring
> in KVM [2].
> 
> The simplest thing we could think of is to execute VMXON during the CPU
> startup phase and VMXOFF during the CPU shutdown phase, even although
> this leaves VMX on when it doesn't strictly need to be on.
> 
> This RFC series demonstrates the idea and seeks feedback from the KVM
> community on its viability.

Sorry, but this is not at all aligned with where I want things to go.  I don't
want to simply move VMXON into the kernel, I want to extract *all* of the system-
wide management code from KVM and into a separate base module.  That is obviously
a much more invasive and difficult series to develop, but it's where we need to
go to truly decouple core virtualization functionality from KVM.

VPID and ASID allocation need to be managed system-wide, otherwise running KVM
alongside another hypervisor-like entity will result in data corruption due to
shared TLB state.

Ditto for user-return MSRs, AMD's MSR_AMD64_TSC_RATIO, and probably a few other
things I'm forgetting.

I also want to keep the code as a module, both to avoid doing VMXON unconditionally,
and for debug/testing purposes (being able to unload and reload is tremendously
valuable on that front).  This one isn't negotiable for me.

And most importantly, all of that needs to be done in a way that is fully
bisectable.  As proposed, this series will break horribly due to enabling VMXON
during early boot without any way to do VMXOFF.

In short, I don't want to half-ass this just so that I can get overwhelmed with
more TDX patches.

Re: [RFC PATCH v1 0/5] x86/boot, KVM: Move VMXON/VMXOFF handling from KVM to CPU lifecycle

Posted by Arjan van de Ven 3 weeks ago

Hi,
> I also want to keep the code as a module, both to avoid doing VMXON unconditionally,

can you expand on what the problem is with having VMXON unconditionally enabled?
A lot of things are much simpler if it's on at cpu up, and turned off only at the
down path (be it offline of kexec).. no refcounting, no locking, etc...
so would be good to understand what the problem would be with having it always on

Re: [RFC PATCH v1 0/5] x86/boot, KVM: Move VMXON/VMXOFF handling from KVM to CPU lifecycle

Posted by Sean Christopherson 2 weeks, 2 days ago

On Thu, Sep 11, 2025, Arjan van de Ven wrote:
> Hi,
> > I also want to keep the code as a module, both to avoid doing VMXON unconditionally,
> 
> can you expand on what the problem is with having VMXON unconditionally enabled?

Unlike say EFER.SVME, VMXON fundamentally changes CPU behavior.  E.g. blocks INIT,
activates VMCS caches (which aren't cleared by VMXOFF on pre-SPR CPUs, and AFAIK
Intel hasn't even publicly committed to that behavior for SPR+), restricts allowed
CR0 and CR4 values, raises questions about ucode patch updates, triggers unique
flows in SMI/RSM, prevents Intel PT from tracing on certain CPUs, and probably a
few other things I'm forgetting.

> A lot of things are much simpler if it's on at cpu up, and turned off only at the
> down path (be it offline of kexec).. no refcounting, no locking, etc...

For Intel.  Unless _all_ vendors and architectures follow suit, KVM will need
the refcounting and locking.  And while it's not anyone's fault, the *vast*
majority of complexity around enabling virtualization in KVM is due to VMX.
I.e. KVM added a bunch of code to deal with the aformentioned side effects of
VMXON, and as a result, all other vendors/architectures have had to deal with
that complexity.

> so would be good to understand what the problem would be with having it always on

Doing VMXON unconditionally is a minor objection.  My primary objection is that
this series does what's easiest for TDX, and leaves behind all of the VMX-induced
technical debt in KVM.

Re: [RFC PATCH v1 0/5] x86/boot, KVM: Move VMXON/VMXOFF handling from KVM to CPU lifecycle

Posted by Xin Li 2 weeks, 1 day ago

On 9/16/2025 10:54 AM, Sean Christopherson wrote:
> On Thu, Sep 11, 2025, Arjan van de Ven wrote:
>> Hi,
>>> I also want to keep the code as a module, both to avoid doing VMXON unconditionally,
>>
>> can you expand on what the problem is with having VMXON unconditionally enabled?
> 
> Unlike say EFER.SVME, VMXON fundamentally changes CPU behavior.  E.g. blocks INIT,
> activates VMCS caches (which aren't cleared by VMXOFF on pre-SPR CPUs, and AFAIK
> Intel hasn't even publicly committed to that behavior for SPR+), restricts allowed
> CR0 and CR4 values, raises questions about ucode patch updates, triggers unique
> flows in SMI/RSM, prevents Intel PT from tracing on certain CPUs, and probably a
> few other things I'm forgetting.


Regarding Intel PT, if VMXON/VMXOFF are moved to CPU startup/shutdown, as
Intel PT is initialized during arch_initcall() stage, entering and leaving
VMX operation no longer happen while Intel PT is _active_, thus
intel_pt_handle_vmx() no longer needs to "handles" VMX state transitions.

Thus, the function's purpose is simplified to signaling Intel pt not to
write to IA32_RTIT_CTL during VMX operation if the processor supports Intel
PT but disallows its use in VMX operation, indicated by IA32_VMX_MISC[14]
being cleared.  Otherwise, it does nothing and leaves pt_ctx.vmx_on as 0.

If the following patch is correct, it's more of a simplification then :)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index e8cf29d2b10c..8325a824700a 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -225,17 +225,6 @@ static int __init pt_pmu_hw_init(void)
  		break;
  	}

-	if (boot_cpu_has(X86_FEATURE_VMX)) {
-		/*
-		 * Intel SDM, 36.5 "Tracing post-VMXON" says that
-		 * "IA32_VMX_MISC[bit 14]" being 1 means PT can trace
-		 * post-VMXON.
-		 */
-		rdmsrq(MSR_IA32_VMX_MISC, reg);
-		if (reg & BIT(14))
-			pt_pmu.vmx = true;
-	}
-
  	for (i = 0; i < PT_CPUID_LEAVES; i++) {
  		cpuid_count(20, i,
  			    &pt_pmu.caps[CPUID_EAX + i*PT_CPUID_REGS_NUM],
@@ -1556,41 +1545,39 @@ void intel_pt_interrupt(void)
  	}
  }

-void intel_pt_handle_vmx(int on)
+/*
+ * VMXON is done in the CPU startup phase, thus pt is initialized later.
+ *
+ * Signal pt to not write IA32_RTIT_CTL while in VMX operation if the
+ * processor supports Intel PT but does not allow it to be used in VMX
+ * operation, i.e. IA32_VMX_MISC[bit 14] is cleared.
+ *
+ * Note: If IA32_VMX_MISC[bit 14] is set, vmx_on in pt_ctx remains 0.
+ */
+void intel_pt_set_vmx(int on)
  {
  	struct pt *pt = this_cpu_ptr(&pt_ctx);
-	struct perf_event *event;
-	unsigned long flags;
+	int cpu = raw_smp_processor_id();
+
+	if (!cpu && cpu_feature_enabled(X86_FEATURE_VMX)) {
+		u64 misc;
+
+		/*
+		 * Intel SDM, 36.5 "Tracing post-VMXON" says that
+		 * "IA32_VMX_MISC[bit 14]" being 1 means PT can trace
+		 * post-VMXON.
+		 */
+		rdmsrq(MSR_IA32_VMX_MISC, misc);
+		if (misc & BIT(14))
+			pt_pmu.vmx = true;
+	}

  	/* PT plays nice with VMX, do nothing */
  	if (pt_pmu.vmx)
  		return;

-	/*
-	 * VMXON will clear RTIT_CTL.TraceEn; we need to make
-	 * sure to not try to set it while VMX is on. Disable
-	 * interrupts to avoid racing with pmu callbacks;
-	 * concurrent PMI should be handled fine.
-	 */
-	local_irq_save(flags);
  	WRITE_ONCE(pt->vmx_on, on);
-
-	/*
-	 * If an AUX transaction is in progress, it will contain
-	 * gap(s), so flag it PARTIAL to inform the user.
-	 */
-	event = pt->handle.event;
-	if (event)
-		perf_aux_output_flag(&pt->handle,
-		                     PERF_AUX_FLAG_PARTIAL);
-
-	/* Turn PTs back on */
-	if (!on && event)
-		wrmsrq(MSR_IA32_RTIT_CTL, event->hw.aux_config);
-
-	local_irq_restore(flags);
  }
-EXPORT_SYMBOL_GPL(intel_pt_handle_vmx);

  /*
   * PMU callbacks
diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index 70d1d94aca7e..9140796e6268 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -659,12 +659,9 @@ static inline void x86_perf_get_lbr(struct x86_pmu_lbr 
*lbr)
  #endif

  #ifdef CONFIG_CPU_SUP_INTEL
- extern void intel_pt_handle_vmx(int on);
+extern void intel_pt_set_vmx(int on);
  #else
-static inline void intel_pt_handle_vmx(int on)
-{
-
-}
+static inline void intel_pt_set_vmx(int on) { }
  #endif

  #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 03b28fa2e91e..9dad23c86152 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2009,7 +2009,7 @@ void cpu_enable_virtualization(void)
  	rdmsrq(MSR_IA32_VMX_BASIC, basic_msr);
  	this_cpu_ptr(&vmxon_vmcs)->hdr.revision_id = 
vmx_basic_vmcs_revision_id(basic_msr);

-	intel_pt_handle_vmx(1);
+	intel_pt_set_vmx(1);

  	cr4_set_bits(X86_CR4_VMXE);

@@ -2023,7 +2023,7 @@ void cpu_enable_virtualization(void)
  fault:
  	pr_err("VMXON faulted on CPU%d\n", cpu);
  	cr4_clear_bits(X86_CR4_VMXE);
-	intel_pt_handle_vmx(0);
+	intel_pt_set_vmx(0);
  }

  /*
@@ -2055,7 +2055,7 @@ void cpu_disable_virtualization(void)

  exit:
  	cr4_clear_bits(X86_CR4_VMXE);
-	intel_pt_handle_vmx(0);
+	intel_pt_set_vmx(0);
  	return;

  fault:

Re: [RFC PATCH v1 0/5] x86/boot, KVM: Move VMXON/VMXOFF handling from KVM to CPU lifecycle

Posted by Sean Christopherson 2 weeks ago

On Wed, Sep 17, 2025, Xin Li wrote:
> On 9/16/2025 10:54 AM, Sean Christopherson wrote:
> > On Thu, Sep 11, 2025, Arjan van de Ven wrote:
> > > Hi,
> > > > I also want to keep the code as a module, both to avoid doing VMXON unconditionally,
> > > 
> > > can you expand on what the problem is with having VMXON unconditionally enabled?
> > 
> > Unlike say EFER.SVME, VMXON fundamentally changes CPU behavior.  E.g. blocks INIT,
> > activates VMCS caches (which aren't cleared by VMXOFF on pre-SPR CPUs, and AFAIK
> > Intel hasn't even publicly committed to that behavior for SPR+), restricts allowed
> > CR0 and CR4 values, raises questions about ucode patch updates, triggers unique
> > flows in SMI/RSM, prevents Intel PT from tracing on certain CPUs, and probably a
> > few other things I'm forgetting.
> 
> Regarding Intel PT, if VMXON/VMXOFF are moved to CPU startup/shutdown, as
> Intel PT is initialized during arch_initcall() stage, entering and leaving
> VMX operation no longer happen while Intel PT is _active_, thus
> intel_pt_handle_vmx() no longer needs to "handles" VMX state transitions.

The issue isn't handling transitions, it's that some CPUs don't support Intel PT
post-VMXON:

  If bit 14 is read as 1, Intel® Processor Trace (Intel PT) can be used in VMX
  operation. If the processor supports Intel PT but does not allow it to be used
  in VMX operation, execution of VMXON clears IA32_RTIT_CTL.TraceEn (see
  “VMXON—Enter VMX Operation” in Chapter 32); any attempt to write IA32_RTIT_CTL
  while in VMX operation (including VMX root operation) causes a general-protection
  exception.

And again, unconditionally doing VMXON is a minor objection in all of this.

Re: [RFC PATCH v1 0/5] x86/boot, KVM: Move VMXON/VMXOFF handling from KVM to CPU lifecycle

Posted by Arjan van de Ven 2 weeks, 1 day ago

On 9/16/2025 10:54 AM, Sean Christopherson wrote:
  what the problem is with having VMXON unconditionally enabled?
> 
> Unlike say EFER.SVME, VMXON fundamentally changes CPU behavior.  E.g. blocks INIT,

blocking INIT is clearly a thing, and both KVM and this patch series deal with that by vmxoff before offline/kexec/etc cases

> activates VMCS caches (which aren't cleared by VMXOFF on pre-SPR CPUs, and AFAIK
> Intel hasn't even publicly committed to that behavior for SPR+),

the VMCS caches aren't great for sure -- which is why the behavior of having vmx on all the time and only
vmxoff at a "fatal to execution" point (offline, kexec, ..) is making life simpler, by not dealing
with this at runtime

 > restricts allowed> CR0 and CR4 values, raises questions about ucode patch updates, triggers unique
> flows in SMI/RSM, prevents Intel PT from tracing on certain CPUs, and probably a
> few other things I'm forgetting.

I went through a similar mental list and my conclusion was a bit different.
The behavior changes are minor at best ..
And yes there are a few things different in microcode -- but the reality is that every day millions of
servers and laptops/etc all run with vmxon (by virtue of running KVM or other virtualization)
all day long, day in day out -- and it is not causing any issues at all.

An argument that any supposed behavior change is unacceptable also implies virtualization
itself would run into that same argument... and a LOT of the world runs virtualized.

Re: [RFC PATCH v1 0/5] x86/boot, KVM: Move VMXON/VMXOFF handling from KVM to CPU lifecycle

Posted by Jim Mattson 2 weeks, 2 days ago

On Tue, Sep 16, 2025 at 10:54 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Sep 11, 2025, Arjan van de Ven wrote:
> > Hi,
> > > I also want to keep the code as a module, both to avoid doing VMXON unconditionally,
> >
> > can you expand on what the problem is with having VMXON unconditionally enabled?
>
> Unlike say EFER.SVME, VMXON fundamentally changes CPU behavior.  E.g. blocks INIT,
> activates VMCS caches (which aren't cleared by VMXOFF on pre-SPR CPUs, and AFAIK
> Intel hasn't even publicly committed to that behavior for SPR+), restricts allowed
> CR0 and CR4 values, raises questions about ucode patch updates, triggers unique
> flows in SMI/RSM, prevents Intel PT from tracing on certain CPUs, and probably a
> few other things I'm forgetting.

Do we leave VMX operation today when applying a late-load microcode patch?

> > A lot of things are much simpler if it's on at cpu up, and turned off only at the
> > down path (be it offline of kexec).. no refcounting, no locking, etc...
>
> For Intel.  Unless _all_ vendors and architectures follow suit, KVM will need
> the refcounting and locking.  And while it's not anyone's fault, the *vast*
> majority of complexity around enabling virtualization in KVM is due to VMX.
> I.e. KVM added a bunch of code to deal with the aformentioned side effects of
> VMXON, and as a result, all other vendors/architectures have had to deal with
> that complexity.
>
> > so would be good to understand what the problem would be with having it always on
>
> Doing VMXON unconditionally is a minor objection.  My primary objection is that
> this series does what's easiest for TDX, and leaves behind all of the VMX-induced
> technical debt in KVM.
>

Re: [RFC PATCH v1 0/5] x86/boot, KVM: Move VMXON/VMXOFF handling from KVM to CPU lifecycle

Posted by Dave Hansen 3 weeks ago

On 9/11/25 07:20, Sean Christopherson wrote:
> VPID and ASID allocation need to be managed system-wide, otherwise
> running KVM alongside another hypervisor-like entity will result in
> data corruption due to shared TLB state.
What other hypervisor-like entities are out there?

The TDX module needs (or will need) VMXON for some things that aren't
strictly for virtualization. But what other entities are out there?

Re: [RFC PATCH v1 0/5] x86/boot, KVM: Move VMXON/VMXOFF handling from KVM to CPU lifecycle

Posted by Sean Christopherson 2 weeks, 2 days ago

On Thu, Sep 11, 2025, Dave Hansen wrote:
> On 9/11/25 07:20, Sean Christopherson wrote:
> > VPID and ASID allocation need to be managed system-wide, otherwise
> > running KVM alongside another hypervisor-like entity will result in
> > data corruption due to shared TLB state.
> What other hypervisor-like entities are out there?

gVisor, VMware Workstation, Virtual Box, and maybe a few more?  Though the three
named are all moving to KVM (Virtual Box may already have full KVM support).
But it's not just existing entities, it's also the fact that lack of common
virtualization infrastructure has definitely deterred others from trying to
upstream non-KVM hypervisors (or hypervisor-like projects).

Now, that's arguably been a good thing in hindsight, e.g. gVisor, VMware, and
Virtual Box wouldn't have switched to KVM had upstream accepted their custom
drivers/hypervisors.  But I like to give us the benefit of the doubt in the sense
that, had someone tried to upstream a KVM-like hypervisor, I think we would have
redirected them into KVM and figured out how to close any gaps, as opposed to
blindly accepting a newfangled hypervisors.

However, no one even so much as proposes new hypervisor-like entities in upstream,
at least not for x86, because the barrier to doing so is extremely high due to KVM
having a stranglehold on all things virtualization.  And even if no one ever lands
another hypervisor-like thing in upstream, I think KVM (and the kernel at-large)
would benefit if patches were at least posted, e.g. would help identify areas of
opportunity and/or flaws in KVM.

> The TDX module needs (or will need) VMXON for some things that aren't
> strictly for virtualization. But what other entities are out there?

The end usage for TDX might not be virtualization focused, but the _kernel's_
usage of VMXON is absolutely for virtualization.  SEAMCALL/SEAMRET are literally
VM-Exit/VM-Enter.