[PATCH 2/3] KVM: nSVM: Do not track EFER.SVME toggling in guest mode

Posted by Yosry Ahmed 1 week, 2 days ago
KVM tracks when EFER.SVME is set and cleared to initialize and tear down
nested state. However, it doesn't differentiate whether EFER.SVME is being
toggled in L1 or L2+. Toggling EFER.SVME in L2+ is inconsequential from
KVM's perspective, as the vCPU is obviously still using nested virtualization.

This causes a problem if L2 sets then clears EFER.SVME without L1
interception, as KVM exits guest mode and tears down nested state while
L2 is running, executing L1 without injecting a proper #VMEXIT.

Technically, it's not a bug as the APM states that an L1 hypervisor
should intercept EFER writes:

	The effect of turning off EFER.SVME while a guest is running is
	undefined; therefore, the VMM should always prevent guests from
	writing EFER.

However, it would be nice if KVM handled it more gracefully.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/svm/svm.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 4575a6a7d6c4e..eaf0f8053fbfb 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -208,6 +208,13 @@ static int svm_set_efer_svme(struct kvm_vcpu *vcpu, u64 old_efer, u64 new_efer)
 	if ((old_efer & EFER_SVME) == (new_efer & EFER_SVME))
 		return 0;
 
+	/*
+	 * An L2 guest setting or clearing EFER_SVME does not change whether or
+	 * not the vCPU can use nested from KVM's perspective.
+	 */
+	if (is_guest_mode(vcpu))
+		return 0;
+
 	if (new_efer & EFER_SVME) {
 		r = svm_allocate_nested(svm);
 		if (r)
-- 
2.53.0.rc1.225.gd81095ad13-goog
Re: [PATCH 2/3] KVM: nSVM: Do not track EFER.SVME toggling in guest mode
Posted by Sean Christopherson 3 days, 14 hours ago
The shortlog is *very* misleading.  The changelog isn't much better.  This isn't
just removing "tracking", it's redefining guest visible behavior and effectively
changing the KVM virtual CPU microarchitecture.

On Fri, Jan 30, 2026, Yosry Ahmed wrote:
> KVM tracks when EFER.SVME is set and cleared to initialize and tear down
> nested state. However, it doesn't differentiate whether EFER.SVME is being
> toggled in L1 or L2+. Toggling EFER.SVME in L2+ is inconsequential from
> KVM's perspective, as the vCPU is obviously still using nested virtualization.
> 
> This causes a problem if L2 sets then clears EFER.SVME without L1
> interception, as KVM exits guest mode and tears down nested state while
> L2 is running, executing L1 without injecting a proper #VMEXIT.
> 
> Technically, it's not a bug as the APM states that an L1 hypervisor
> should intercept EFER writes:
> 
> 	The effect of turning off EFER.SVME while a guest is running is
> 	undefined; therefore, the VMM should always prevent guests from
> 	writing EFER.
> 
> However, it would be nice if KVM handled it more gracefully.

That's not sufficient justification for this change.  Nothing will ever trip this
code unless it's _trying_ to trip this code.  I.e. outside of a bespoke selftest
that is little more than "make sure the kernel doesn't explode", and future
malicious actors, KVM's behavior is largely irrelevant.

> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/kvm/svm/svm.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 4575a6a7d6c4e..eaf0f8053fbfb 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -208,6 +208,13 @@ static int svm_set_efer_svme(struct kvm_vcpu *vcpu, u64 old_efer, u64 new_efer)
>  	if ((old_efer & EFER_SVME) == (new_efer & EFER_SVME))
>  		return 0;
>  
> +	/*
> +	 * An L2 guest setting or clearing EFER_SVME does not change whether or
> +	 * not the vCPU can use nested from KVM's perspective.

This should call out that the architectural behavior is undefined.  "from KVM's
perspective" is an obtuse way of saying "KVM is making up behavior because it
can".  E.g. something like

	/*
	 * Architecturally, clearing EFER.SVME while a guest is running yields
	 * undefined behavior, i.e. KVM has carte blanche to do anything if L1
	 * doesn't intercept writes to EFER.  Simply do nothing, because XYZ.
	 */

> +	 */
> +	if (is_guest_mode(vcpu))

This is fine, because svm_allocate_nested() plays nice with redundant calls, but
this is a rather scary change for something that straight up doesn't happen in
practice.  Any hypervisor that doesn't intercept EFER is broken, period.

E.g. if a future change adds novel code that's guarded by the

	if ((old_efer & EFER_SVME) == (new_efer & EFER_SVME)) 
		return 0;

check, then doing nothing here could result in a guest-exploitable bug.  In other
words, from a kernel safety perspective, "doing nothing" is more dangerous than
forcibly leaving nested mode, because doing nothing deliberately puts KVM in an
inconsistent state.  Given that there's basically zero benefit in practice, I'm
strongly inclined to keep the call to svm_leave_nested().

All that said, I agree that pulling the rug out from under the VM is a terrible
experience.  What if we throw a triple fault at the vCPU so that L1 gets an
immediate SHUTDOWN (not a VM-Exit, a SHUTDOWN of the L1 vCPU), instead of running
random garbage from L2?

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 5f0136dbdde6..ccd73a3be3f9 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -216,6 +216,17 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
 
        if ((old_efer & EFER_SVME) != (efer & EFER_SVME)) {
                if (!(efer & EFER_SVME)) {
+                       /*
+                        * Architecturally, clearing EFER.SVME while a guest is
+                        * running yields undefined behavior, i.e. KVM can do
+                        * literally anything.  Force the vCPU back into L1 as
+                        * that is the safest option for KVM, but synthesize a
+                        * triple fault (for L1!) so that KVM at least doesn't
+                        * run random L2 code in the context of L1.
+                        */
+                       if (is_guest_mode(vcpu))
+                               kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+
                        svm_leave_nested(vcpu);
                        /* #GP intercept is still needed for vmware backdoor */
                        if (!enable_vmware_backdoor)


> +		return 0;
> +
>  	if (new_efer & EFER_SVME) {
>  		r = svm_allocate_nested(svm);
>  		if (r)
> -- 
> 2.53.0.rc1.225.gd81095ad13-goog
>
Re: [PATCH 2/3] KVM: nSVM: Do not track EFER.SVME toggling in guest mode
Posted by Yosry Ahmed 3 days, 12 hours ago
On Wed, Feb 04, 2026 at 01:15:45PM -0800, Sean Christopherson wrote:
> The shortlog is *very* misleading.  The changelog isn't much better.  This isn't
> just removing "tracking", it's redefining guest visible behavior and effectively
> changing the KVM virtual CPU microarchitecture.
> 
> On Fri, Jan 30, 2026, Yosry Ahmed wrote:
> > KVM tracks when EFER.SVME is set and cleared to initialize and tear down
> > nested state. However, it doesn't differentiate whether EFER.SVME is being
> > toggled in L1 or L2+. Toggling EFER.SVME in L2+ is inconsequential from
> > KVM's perspective, as the vCPU is obviously still using nested virtualization.
> > 
> > This causes a problem if L2 sets then clears EFER.SVME without L1
> > interception, as KVM exits guest mode and tears down nested state while
> > L2 is running, executing L1 without injecting a proper #VMEXIT.
> > 
> > Technically, it's not a bug as the APM states that an L1 hypervisor
> > should intercept EFER writes:
> > 
> > 	The effect of turning off EFER.SVME while a guest is running is
> > 	undefined; therefore, the VMM should always prevent guests from
> > 	writing EFER.
> > 
> > However, it would be nice if KVM handled it more gracefully.
> 
> That's not sufficient justification for this change.  Nothing will ever trip this
> code unless it's _trying_ to trip this code.  I.e. outside of a bespoke selftest
> that is little more than "make sure the kernel doesn't explode", and future
> malicious actors, KVM's behavior is largely irrelevant.
> 
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> >  arch/x86/kvm/svm/svm.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 4575a6a7d6c4e..eaf0f8053fbfb 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -208,6 +208,13 @@ static int svm_set_efer_svme(struct kvm_vcpu *vcpu, u64 old_efer, u64 new_efer)
> >  	if ((old_efer & EFER_SVME) == (new_efer & EFER_SVME))
> >  		return 0;
> >  
> > +	/*
> > +	 * An L2 guest setting or clearing EFER_SVME does not change whether or
> > +	 * not the vCPU can use nested from KVM's perspective.
> 
> This should call out that the architectural behavior is undefined.  "from KVM's
> perspective" is an obtuse way of saying "KVM is making up behavior because it
> can".  E.g. something like
> 
> 	/*
> 	 * Architecturally, clearing EFER.SVME while a guest is running yields
> 	 * undefined behavior, i.e. KVM has carte blanche to do anything if L1
> 	 * doesn't intercept writes to EFER.  Simply do nothing, because XYZ.
> 	 */
> 
> > +	 */
> > +	if (is_guest_mode(vcpu))
> 
> This is fine, because svm_allocate_nested() plays nice with redundant calls, but
> this is a rather scary change for something that straight up doesn't happen in
> practice.  Any hypervisor that doesn't intercept EFER is broken, period.
> 
> E.g. if a future change adds novel code that's guarded by the
> 
> 	if ((old_efer & EFER_SVME) == (new_efer & EFER_SVME)) 
> 		return 0;
> 
> check, then doing nothing here could result in a guest-exploitable bug.  In other
> words, from a kernel safety perspective, "doing nothing" is more dangerous than
> forcibly leaving nested mode, because doing nothing deliberately puts KVM in an
> inconsistent state.  Given that there's basically zero benefit in practice, I'm
> strongly inclined to keep the call to svm_leave_nested().
> 
> All that said, I agree that pulling the rug out from under the VM is a terrible
> experience.  What if we throw a triple fault at the vCPU so that L1 gets an
> immediate SHUTDOWN (not a VM-Exit, a SHUTDOWN of the L1 vCPU), instead of running
> random garbage from L2?

I am fine with this too; anything is better than pulling the rug. I will
send a v2 and probably drop patch 1 (unless you prefer that we keep it).

> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 5f0136dbdde6..ccd73a3be3f9 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -216,6 +216,17 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
>  
>         if ((old_efer & EFER_SVME) != (efer & EFER_SVME)) {
>                 if (!(efer & EFER_SVME)) {
> +                       /*
> +                        * Architecturally, clearing EFER.SVME while a guest is
> +                        * running yields undefined behavior, i.e. KVM can do
> +                        * literally anything.  Force the vCPU back into L1 as
> +                        * that is the safest option for KVM, but synthesize a
> +                        * triple fault (for L1!) so that KVM at least doesn't
> +                        * run random L2 code in the context of L1.
> +                        */
> +                       if (is_guest_mode(vcpu))
> +                               kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> +
>                         svm_leave_nested(vcpu);
>                         /* #GP intercept is still needed for vmware backdoor */
>                         if (!enable_vmware_backdoor)
> 
> 
> > +		return 0;
> > +
> >  	if (new_efer & EFER_SVME) {
> >  		r = svm_allocate_nested(svm);
> >  		if (r)
> > -- 
> > 2.53.0.rc1.225.gd81095ad13-goog
> >
Re: [PATCH 2/3] KVM: nSVM: Do not track EFER.SVME toggling in guest mode
Posted by Sean Christopherson 3 days, 9 hours ago
On Wed, Feb 04, 2026, Yosry Ahmed wrote:
> On Wed, Feb 04, 2026 at 01:15:45PM -0800, Sean Christopherson wrote:
> > All that said, I agree that pulling the rug out from under the VM is a terrible
> > experience.  What if we throw a triple fault at the vCPU so that L1 gets an
> > immediate SHUTDOWN (not a VM-Exit, a SHUTDOWN of the L1 vCPU), instead of running
> > random garbage from L2?
> 
> I am fine with this too; anything is better than pulling the rug. I will
> send a v2 and probably drop patch 1 (unless you prefer that we keep it).

Drop it, otherwise we'll probably end up wasting several days bikeshedding the
name.