[PATCH v4 13/26] KVM: nSVM: Unify handling of VMRUN failures with proper cleanup

Posted by Yosry Ahmed 3 weeks, 3 days ago
There are currently two possible causes of VMRUN failures:

1) Consistency check failures. In this case, KVM updates the exit code
   in the mapped VMCB12 and exits early in nested_svm_vmrun(). This
   causes a few problems:

  A) KVM does not clear the GIF if the early consistency checks fail
     (because nested_svm_vmexit() is not called). Nothing architecturally
     requires GIF=0 before a VMRUN; from the APM:

	It is assumed that VMM software cleared GIF some time before
	executing the VMRUN instruction, to ensure an atomic state
	switch.

     So the #VMEXIT synthesized for an early consistency check failure
     could leave the GIF set (see the sketch after this list).

  B) svm_leave_smm() is missing consistency checks on the newly loaded
     guest state, because the checks aren't performed by
     enter_svm_guest_mode().

2) Failure to load L2's CR3 or merge the MSR bitmaps. In this case, a
   fully-fledged #VMEXIT injection is performed as VMCB02 is already
   prepared.
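
For problem 1A, the pre-patch early-exit path (the hunk removed from
nested_svm_vmrun() below) looks roughly like this; the goto bypasses
nested_svm_vmexit(), which is what clears the GIF:

	if (!nested_vmcb_check_save(vcpu, &svm->nested.save) ||
	    !nested_vmcb_check_controls(vcpu, &svm->nested.ctl)) {
		vmcb12->control.exit_code    = SVM_EXIT_ERR;
		vmcb12->control.exit_code_hi = -1u;
		vmcb12->control.exit_info_1  = 0;
		vmcb12->control.exit_info_2  = 0;
		goto out;	/* skips the GIF-clearing #VMEXIT path */
	}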

Arguably, all VMRUN failures should be handled before VMCB02 is
prepared, while still performing proper cleanup (e.g. clearing the
GIF). Move all the
potential failure checks inside enter_svm_guest_mode() before switching
to VMCB02. On failure of any of these checks, nested_svm_vmrun()
synthesizes a minimal #VMEXIT through the new nested_svm_failed_vmrun()
helper.

__nested_svm_vmexit() already performs the necessary cleanup for a
failed VMRUN, including uninitializing the nested MMU and reloading L1's
CR3. This ensures that consistency check failures perform all the
necessary cleanup, while other failures do not do too much. It also
leaves a unified path for handling VMRUN failures.
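
With this patch, all failures funnel through a single error path in
nested_svm_vmrun(), roughly (a simplified sketch of the diff below,
with unrelated error handling elided):

	if (enter_svm_guest_mode(vcpu, vmcb12_gpa, vmcb12, true)) {
		svm->nmi_l1_to_l2 = false;
		svm->soft_int_injected = false;

		/* Writes SVM_EXIT_ERR back into VMCB12 and performs the
		 * common cleanup via __nested_svm_vmexit(). */
		nested_svm_failed_vmrun(svm, vmcb12);
	}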

Cc: stable@vger.kernel.org
Fixes: 52c65a30a5c6 ("KVM: SVM: Check for nested vmrun intercept before emulating vmrun")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/svm/nested.c | 62 +++++++++++++++++++++++----------------
 1 file changed, 36 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 53ae761b50e2..1dfe5800c98c 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -938,22 +938,19 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
 				    vmcb12->control.intercepts[INTERCEPT_WORD4],
 				    vmcb12->control.intercepts[INTERCEPT_WORD5]);
 
-
 	svm->nested.vmcb12_gpa = vmcb12_gpa;
 
 	WARN_ON(svm->vmcb == svm->nested.vmcb02.ptr);
 
 	enter_guest_mode(vcpu);
 
+	if (!nested_vmcb_check_save(vcpu, &svm->nested.save) ||
+	    !nested_vmcb_check_controls(vcpu, &svm->nested.ctl))
+		return -EINVAL;
+
 	if (nested_npt_enabled(svm))
 		nested_svm_init_mmu_context(vcpu);
 
-	nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
-
-	svm_switch_vmcb(svm, &svm->nested.vmcb02);
-	nested_vmcb02_prepare_control(svm, vmcb12->save.rip, vmcb12->save.cs.base);
-	nested_vmcb02_prepare_save(svm, vmcb12);
-
 	ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
 				  nested_npt_enabled(svm), from_vmrun);
 	if (ret)
@@ -965,6 +962,17 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
 			return ret;
 	}
 
+	/*
+	 * Any VMRUN failure needs to happen before this point, such that the
+	 * nested #VMEXIT is injected properly by nested_svm_failed_vmrun().
+	 */
+
+	nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
+
+	svm_switch_vmcb(svm, &svm->nested.vmcb02);
+	nested_vmcb02_prepare_control(svm, vmcb12->save.rip, vmcb12->save.cs.base);
+	nested_vmcb02_prepare_save(svm, vmcb12);
+
 	if (!from_vmrun)
 		kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
 
@@ -983,6 +991,8 @@ static void __nested_svm_vmexit(struct vcpu_svm *svm)
 	struct vmcb *vmcb01 = svm->vmcb01.ptr;
 	struct kvm_vcpu *vcpu = &svm->vcpu;
 
+	WARN_ON_ONCE(is_guest_mode(vcpu));
+
 	svm->nested.vmcb12_gpa = 0;
 	svm->nested.ctl.nested_cr3 = 0;
 
@@ -1006,6 +1016,19 @@ static void __nested_svm_vmexit(struct vcpu_svm *svm)
 		kvm_queue_exception(vcpu, DB_VECTOR);
 }
 
+static void nested_svm_failed_vmrun(struct vcpu_svm *svm, struct vmcb *vmcb12)
+{
+	WARN_ON(svm->vmcb == svm->nested.vmcb02.ptr);
+
+	leave_guest_mode(vcpu);
+
+	vmcb12->control.exit_code = SVM_EXIT_ERR;
+	vmcb12->control.exit_code_hi = -1u;
+	vmcb12->control.exit_info_1 = 0;
+	vmcb12->control.exit_info_2 = 0;
+	__nested_svm_vmexit(svm);
+}
+
 int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -1048,15 +1071,6 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 	nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
 	nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
 
-	if (!nested_vmcb_check_save(vcpu, &svm->nested.save) ||
-	    !nested_vmcb_check_controls(vcpu, &svm->nested.ctl)) {
-		vmcb12->control.exit_code    = SVM_EXIT_ERR;
-		vmcb12->control.exit_code_hi = -1u;
-		vmcb12->control.exit_info_1  = 0;
-		vmcb12->control.exit_info_2  = 0;
-		goto out;
-	}
-
 	/*
 	 * Since vmcb01 is not in use, we can use it to store some of the L1
 	 * state.
@@ -1077,15 +1091,9 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 		svm->nmi_l1_to_l2 = false;
 		svm->soft_int_injected = false;
 
-		svm->vmcb->control.exit_code    = SVM_EXIT_ERR;
-		svm->vmcb->control.exit_code_hi = -1u;
-		svm->vmcb->control.exit_info_1  = 0;
-		svm->vmcb->control.exit_info_2  = 0;
-
-		nested_svm_vmexit(svm);
+		nested_svm_failed_vmrun(svm, vmcb12);
 	}
 
-out:
 	kvm_vcpu_unmap(vcpu, &map);
 
 	return ret;
@@ -1224,6 +1232,8 @@ void nested_svm_vmexit(struct vcpu_svm *svm)
 	if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS))
 		vmcb01->control.erap_ctl |= ERAP_CONTROL_CLEAR_RAP;
 
+	/* VMRUN failures before switching to VMCB02 are handled by nested_svm_failed_vmrun() */
+	WARN_ON_ONCE(svm->vmcb != svm->nested.vmcb02.ptr);
 	svm_switch_vmcb(svm, &svm->vmcb01);
 
 	/*
@@ -1914,9 +1924,6 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 	if (nested_npt_enabled(svm))
 		nested_svm_init_mmu_context(vcpu);
 
-	svm_switch_vmcb(svm, &svm->nested.vmcb02);
-	nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip, svm->vmcb->save.cs.base);
-
 	/*
 	 * While the nested guest CR3 is already checked and set by
 	 * KVM_SET_SREGS, it was set when nested state was yet loaded,
@@ -1928,6 +1935,9 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 	if (ret)
 		goto out_free;
 
+	svm_switch_vmcb(svm, &svm->nested.vmcb02);
+	nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip, svm->vmcb->save.cs.base);
+
 	svm->nested.force_msr_bitmap_recalc = true;
 
 	kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
-- 
2.52.0.457.g6b5491de43-goog
Re: [PATCH v4 13/26] KVM: nSVM: Unify handling of VMRUN failures with proper cleanup
Posted by Sean Christopherson 2 days, 22 hours ago
On Thu, Jan 15, 2026, Yosry Ahmed wrote:
> There are currently two possible causes of VMRUN failures:

Might be worth qualifying this with:

  There are currently two possible causes of VMRUN failures emulated by
  KVM:

Because there are more than two causes in the APM and hardware, they're just not
emulated by KVM.
Re: [PATCH v4 13/26] KVM: nSVM: Unify handling of VMRUN failures with proper cleanup
Posted by Sean Christopherson 2 days, 22 hours ago
On Thu, Jan 15, 2026, Yosry Ahmed wrote:
> @@ -983,6 +991,8 @@ static void __nested_svm_vmexit(struct vcpu_svm *svm)
>  	struct vmcb *vmcb01 = svm->vmcb01.ptr;
>  	struct kvm_vcpu *vcpu = &svm->vcpu;
>  
> +	WARN_ON_ONCE(is_guest_mode(vcpu));
> +
>  	svm->nested.vmcb12_gpa = 0;
>  	svm->nested.ctl.nested_cr3 = 0;
>  
> @@ -1006,6 +1016,19 @@ static void __nested_svm_vmexit(struct vcpu_svm *svm)
>  		kvm_queue_exception(vcpu, DB_VECTOR);
>  }
>  
> +static void nested_svm_failed_vmrun(struct vcpu_svm *svm, struct vmcb *vmcb12)

I don't love the name.  "fail" has very specific meaning in VMX for VMLAUNCH and
VMRESUME, as VM-Fail is not a VM-Exit, e.g. doesn't load host state from the VMCS.

I also don't love that the name doesn't capture that this is synthesizing a #VMEXIT.
Maybe nested_svm_vmrun_error_vmexit()?  I suppose nested_svm_failed_vmrun_vmexit()
isn't too bad either, as that at least addresses my concerns about conflating it
with VMX's VM-Fail.

> +{
> +	WARN_ON(svm->vmcb == svm->nested.vmcb02.ptr);

WARN_ON_ONCE()

> +
> +	leave_guest_mode(vcpu);

Someone didn't test each patch.  "vcpu" doesn't exist until
"KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN".  Just pass in @vcpu and
@vmcb12, i.e. don't pass @svm and then pull @vcpu back out.
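
I.e. a signature along these lines (untested sketch):

	static void nested_svm_failed_vmrun(struct kvm_vcpu *vcpu,
					    struct vmcb *vmcb12)
	{
		struct vcpu_svm *svm = to_svm(vcpu);
		...
	}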

> +	vmcb12->control.exit_code = SVM_EXIT_ERR;
> +	vmcb12->control.exit_code_hi = -1u;
> +	vmcb12->control.exit_info_1 = 0;
> +	vmcb12->control.exit_info_2 = 0;
> +	__nested_svm_vmexit(svm);
> +}

...

> @@ -1224,6 +1232,8 @@ void nested_svm_vmexit(struct vcpu_svm *svm)
>  	if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS))
>  		vmcb01->control.erap_ctl |= ERAP_CONTROL_CLEAR_RAP;
>  
> +	/* VMRUN failures before switching to VMCB02 are handled by nested_svm_failed_vmrun() */

Please don't add comments that just point elsewhere.  They inevitably become
stale, and they don't help the reader understand "why" any of this matters.

E.g. something like

	/*
	 * This helper is intended for use only when KVM is synthesizing a #VMEXIT
	 * after a successful nested VMRUN.  All VMRUN consistency checks must
	 * be performed before loading guest state, and so should use the inner
	 * helper.
	 */
Re: [PATCH v4 13/26] KVM: nSVM: Unify handling of VMRUN failures with proper cleanup
Posted by Yosry Ahmed 2 days, 7 hours ago
On Thu, Feb 05, 2026 at 04:47:59PM -0800, Sean Christopherson wrote:
> On Thu, Jan 15, 2026, Yosry Ahmed wrote:
> > @@ -983,6 +991,8 @@ static void __nested_svm_vmexit(struct vcpu_svm *svm)
> >  	struct vmcb *vmcb01 = svm->vmcb01.ptr;
> >  	struct kvm_vcpu *vcpu = &svm->vcpu;
> >  
> > +	WARN_ON_ONCE(is_guest_mode(vcpu));
> > +
> >  	svm->nested.vmcb12_gpa = 0;
> >  	svm->nested.ctl.nested_cr3 = 0;
> >  
> > @@ -1006,6 +1016,19 @@ static void __nested_svm_vmexit(struct vcpu_svm *svm)
> >  		kvm_queue_exception(vcpu, DB_VECTOR);
> >  }
> >  
> > +static void nested_svm_failed_vmrun(struct vcpu_svm *svm, struct vmcb *vmcb12)
> 
> I don't love the name.  "fail" has very specific meaning in VMX for VMLAUNCH and
> VMRESUME, as VM-Fail is not a VM-Exit, e.g. doesn't load host state from the VMCS.
> 
> I also don't love that the name doesn't capture that this is synthesizing a #VMEXIT.
> Maybe nested_svm_vmrun_error_vmexit()?  I suppose nested_svm_failed_vmrun_vmexit()
> isn't too bad either, as that at least addresses my concerns about conflating it
> with VMX's VM-Fail.

Good point, I think nested_svm_vmrun_error_vmexit() is fine.

> 
> > +{
> > +	WARN_ON(svm->vmcb == svm->nested.vmcb02.ptr);
> 
> WARN_ON_ONCE()
> 
> > +
> > +	leave_guest_mode(vcpu);
> 
> Someone didn't test each patch.  "vcpu" doesn't exist until

I plead the 5th.

> "KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN".  Just pass in @vcpu and
> @vmcb12, i.e. don't pass @svm and then pull @vcpu back out.
> 
> > +	vmcb12->control.exit_code = SVM_EXIT_ERR;
> > +	vmcb12->control.exit_code_hi = -1u;
> > +	vmcb12->control.exit_info_1 = 0;
> > +	vmcb12->control.exit_info_2 = 0;
> > +	__nested_svm_vmexit(svm);
> > +}
> 
> ...
> 
> > @@ -1224,6 +1232,8 @@ void nested_svm_vmexit(struct vcpu_svm *svm)
> >  	if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS))
> >  		vmcb01->control.erap_ctl |= ERAP_CONTROL_CLEAR_RAP;
> >  
> > +	/* VMRUN failures before switching to VMCB02 are handled by nested_svm_failed_vmrun() */
> 
> Please don't add comments that just point elsewhere.  They inevitably become
> stale, and they don't help the reader understand "why" any of this matters.
> 
> E.g. something like
> 
> 	/*
> 	 * This helper is intended for use only when KVM is synthesizing a #VMEXIT
> 	 * after a successful nested VMRUN.  All VMRUN consistency checks must
> 	 * be performed before loading guest state, and so should use the inner
> 	 * helper.
> 	 */ 

Will use that, but I will replace "this helper" and "inner helper" with
nested_svm_vmexit() and __nested_svm_vmexit().
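
I.e. presumably something like (sketch with the names substituted into
your suggested wording):

	/*
	 * nested_svm_vmexit() is intended for use only when KVM is
	 * synthesizing a #VMEXIT after a successful nested VMRUN.  All
	 * VMRUN consistency checks must be performed before loading guest
	 * state, and so should use __nested_svm_vmexit().
	 */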