[PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN
Posted by Yosry Ahmed 2 months, 4 weeks ago
All accesses to the VMCB12 in the guest memory are limited to
nested_svm_vmrun(). However, the VMCB12 remains mapped until the end of
the function execution. Unmapping right after the consistency checks is
possible, but it becomes easy-ish to introduce bugs where 'vmcb12' is
used after being unmapped.

Move all accesses to the VMCB12 into a new helper,
nested_svm_vmrun_read_vmcb12(), that maps the VMCB12,
caches the needed fields, performs consistency checks, and unmaps it.
This limits the scope of the VMCB12 mapping appropriately. It also
slightly simplifies the cleanup path of nested_svm_vmrun().

nested_svm_vmrun_read_vmcb12() returns -1 if the consistency checks
fail, maintaining the current behavior of skipping the instruction and
unmapping the VMCB12 (although in the opposite order).

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
 arch/x86/kvm/svm/nested.c | 59 ++++++++++++++++++++++-----------------
 1 file changed, 34 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index ddcd545ec1c3c..a48668c36a191 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1023,12 +1023,39 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
 	return 0;
 }
 
+static int nested_svm_vmrun_read_vmcb12(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	struct kvm_host_map map;
+	struct vmcb *vmcb12;
+	int ret;
+
+	ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
+	if (ret)
+		return ret;
+
+	vmcb12 = map.hva;
+
+	nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
+	nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
+
+	if (!nested_vmcb_check_save(vcpu) ||
+	    !nested_vmcb_check_controls(vcpu)) {
+		vmcb12->control.exit_code    = SVM_EXIT_ERR;
+		vmcb12->control.exit_code_hi = 0;
+		vmcb12->control.exit_info_1  = 0;
+		vmcb12->control.exit_info_2  = 0;
+		ret = -1;
+	}
+
+	kvm_vcpu_unmap(vcpu, &map);
+	return ret;
+}
+
 int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 	int ret;
-	struct vmcb *vmcb12;
-	struct kvm_host_map map;
 	u64 vmcb12_gpa;
 	struct vmcb *vmcb01 = svm->vmcb01.ptr;
 
@@ -1049,8 +1076,11 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 		return ret;
 	}
 
+	if (WARN_ON_ONCE(!svm->nested.initialized))
+		return -EINVAL;
+
 	vmcb12_gpa = svm->vmcb->save.rax;
-	ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
+	ret = nested_svm_vmrun_read_vmcb12(vcpu, vmcb12_gpa);
 	if (ret == -EINVAL) {
 		kvm_inject_gp(vcpu, 0);
 		return 1;
@@ -1060,23 +1090,6 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 
 	ret = kvm_skip_emulated_instruction(vcpu);
 
-	vmcb12 = map.hva;
-
-	if (WARN_ON_ONCE(!svm->nested.initialized))
-		return -EINVAL;
-
-	nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
-	nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
-
-	if (!nested_vmcb_check_save(vcpu) ||
-	    !nested_vmcb_check_controls(vcpu)) {
-		vmcb12->control.exit_code    = SVM_EXIT_ERR;
-		vmcb12->control.exit_code_hi = 0;
-		vmcb12->control.exit_info_1  = 0;
-		vmcb12->control.exit_info_2  = 0;
-		goto out;
-	}
-
 	/*
 	 * Since vmcb01 is not in use, we can use it to store some of the L1
 	 * state.
@@ -1096,7 +1109,7 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 		goto out_exit_err;
 
 	if (nested_svm_merge_msrpm(vcpu))
-		goto out;
+		return ret;
 
 out_exit_err:
 	svm->nested.nested_run_pending = 0;
@@ -1109,10 +1122,6 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 	svm->vmcb->control.exit_info_2  = 0;
 
 	nested_svm_vmexit(svm);
-
-out:
-	kvm_vcpu_unmap(vcpu, &map);
-
 	return ret;
 }
 
-- 
2.51.2.1041.gc1ab5b90ca-goog
Re: [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN
Posted by Sean Christopherson 2 months ago
On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> All accesses to the VMCB12 in the guest memory are limited to
> nested_svm_vmrun(). However, the VMCB12 remains mapped until the end of
> the function execution. Unmapping right after the consistency checks is
> possible, but it becomes easy-ish to introduce bugs where 'vmcb12' is
> used after being unmapped.
> 
> Move all accesses to the VMCB12 into a new helper,
> nested_svm_vmrun_read_vmcb12(), that maps the VMCB12,
> caches the needed fields, performs consistency checks, and unmaps it.
> This limits the scope of the VMCB12 mapping appropriately. It also
> slightly simplifies the cleanup path of nested_svm_vmrun().
> 
> nested_svm_vmrun_read_vmcb12() returns -1 if the consistency checks
> fail, maintaining the current behavior of skipping the instruction and
> unmapping the VMCB12 (although in the opposite order).
> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
>  arch/x86/kvm/svm/nested.c | 59 ++++++++++++++++++++++-----------------
>  1 file changed, 34 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index ddcd545ec1c3c..a48668c36a191 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -1023,12 +1023,39 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
>  	return 0;
>  }
>  
> +static int nested_svm_vmrun_read_vmcb12(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)

"read_vmcb12"() sounds like a generic helper to read a specific field.  And if
the name is more specific, then I think we can drop the "vmrun" scoping.  To
align with similar functions in VMX and __nested_copy_vmcb_save_to_cache(), how
about nested_svm_copy_vmcb12_to_cache()?

> +{
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +	struct kvm_host_map map;
> +	struct vmcb *vmcb12;
> +	int ret;
> +
> +	ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
> +	if (ret)
> +		return ret;
> +
> +	vmcb12 = map.hva;
> +
> +	nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
> +	nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
> +
> +	if (!nested_vmcb_check_save(vcpu) ||
> +	    !nested_vmcb_check_controls(vcpu)) {
> +		vmcb12->control.exit_code    = SVM_EXIT_ERR;
> +		vmcb12->control.exit_code_hi = 0;
> +		vmcb12->control.exit_info_1  = 0;
> +		vmcb12->control.exit_info_2  = 0;
> +		ret = -1;

I don't love shoving the consistency checks in here.  I get why you did it, but
it's very surprising to see (and/or easy to miss) these consistency checks.  The
caller also ends up quite wonky:

	if (ret == -EINVAL) {
		kvm_inject_gp(vcpu, 0);
		return 1;
	} else if (ret) {
		return kvm_skip_emulated_instruction(vcpu);
	}

	ret = kvm_skip_emulated_instruction(vcpu);

Ha!  And it's buggy.  __kvm_vcpu_map() can return -EFAULT if creating a host
mapping fails.  Eww, and blindly using '-1' as the "failed a consistency check"
is equally gross, as it relies on kvm_vcpu_map() not returning -EPERM in a very
weird way.
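
The collision is purely numeric: EPERM is errno 1 on Linux, so a bare
'-1' is bit-for-bit identical to -EPERM.  A trivial user-space
illustration (assuming only the standard asm-generic errno values):

	#include <errno.h>
	#include <stdio.h>

	int main(void)
	{
		/*
		 * EPERM == 1, so -EPERM == -1: a caller can't tell a
		 * "failed a consistency check" -1 from a hypothetical
		 * -EPERM out of kvm_vcpu_map().
		 */
		printf("%d %d\n", -1, -EPERM);	/* prints "-1 -1" */
		return 0;
	}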

Ugh, and there's also this nastiness in nested_vmcb_check_controls():

	 * Make sure we did not enter guest mode yet, in which case
	 * kvm_read_cr0() could return L2's CR0.
	 */
	WARN_ON_ONCE(is_guest_mode(vcpu));
	return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));

nested_vmcb_check_save() and nested_vmcb_check_controls() really shouldn't exist.
They just make it harder to see what KVM is checking in the "normal" flow.

Aha!  And I'm fairly certain there are at least two pre-existing bugs due to KVM
doing "early" consistency checks in nested_svm_vmrun().

  1. KVM doesn't clear GIF on the early #VMEXIT.  In classic APM fashion, nothing
     _requires_ GIF=0 before VMRUN:

        It is assumed that VMM software cleared GIF some time before executing
        the VMRUN instruction, to ensure an atomic state switch.

     And so VMRUN with GIF=1 that hits an "early" consistency check #VMEXIT would
     incorrectly leave GIF=1 (see the sketch after the list).


  2. svm_leave_smm() is missing consistency checks on the newly loaded guest state,
     because the checks aren't performed by enter_svm_guest_mode().  I don't see
     anything that would prevent vmcb12 from being modified by the guest between
     SMI and RSM.
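
A selftest-flavored sketch of scenario 1 (assuming the usual svm_util.h
helpers and an EFER_SVME definition; illustrative pseudocode, not from
the series):

	/* L1 guest code */
	generic_svm_setup(svm, l2_guest_code, l2_stack_top);
	asm volatile("stgi");			/* GIF=1, never cleared */
	svm->vmcb->save.efer &= ~EFER_SVME;	/* fails an early check  */
	run_guest(svm->vmcb, svm->vmcb_gpa);	/* #VMEXIT(SVM_EXIT_ERR) */
	GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_ERR);
	/* Architecturally GIF should now be 0; KVM's early-exit path
	 * leaves GIF=1. */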

Moving the consistency checks into enter_svm_guest_mode() would solve all three
(four?) problems.  And as a bonus, nested_svm_copy_vmcb12_to_cache() can use
kvm_vcpu_map_readonly().

Compile tested only, but I think we can end up with delta like so:

---
 arch/x86/kvm/svm/nested.c | 67 ++++++++++++---------------------------
 1 file changed, 20 insertions(+), 47 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 7c86987fdaca..8a0df6c535b5 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -372,9 +372,9 @@ static bool nested_svm_check_event_inj(struct kvm_vcpu *vcpu, u32 event_inj)
 	return true;
 }
 
-static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
-					 struct vmcb_ctrl_area_cached *control,
-					 unsigned long l1_cr0)
+static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
+				       struct vmcb_ctrl_area_cached *control,
+				       unsigned long l1_cr0)
 {
 	if (CC(!vmcb12_is_intercept(control, INTERCEPT_VMRUN)))
 		return false;
@@ -408,8 +408,8 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
 }
 
 /* Common checks that apply to both L1 and L2 state.  */
-static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
-				     struct vmcb_save_area_cached *save)
+static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu,
+				   struct vmcb_save_area_cached *save)
 {
 	if (CC(!(save->efer & EFER_SVME)))
 		return false;
@@ -448,27 +448,6 @@ static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
 	return true;
 }
 
-static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_svm *svm = to_svm(vcpu);
-	struct vmcb_save_area_cached *save = &svm->nested.save;
-
-	return __nested_vmcb_check_save(vcpu, save);
-}
-
-static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_svm *svm = to_svm(vcpu);
-	struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl;
-
-	/*
-	 * Make sure we did not enter guest mode yet, in which case
-	 * kvm_read_cr0() could return L2's CR0.
-	 */
-	WARN_ON_ONCE(is_guest_mode(vcpu));
-	return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
-}
-
 static
 void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
 					 struct vmcb_ctrl_area_cached *to,
@@ -1004,6 +983,12 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
 	nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
 
 	svm_switch_vmcb(svm, &svm->nested.vmcb02);
+
+	if (!nested_vmcb_check_save(vcpu, &svm->nested.save) ||
+	    !nested_vmcb_check_controls(vcpu, &svm->nested.ctl,
+					svm->vmcb01.ptr->save.cr0))
+		return -EINVAL;
+
 	nested_vmcb02_prepare_control(svm, save->rip, save->cs.base);
 	nested_vmcb02_prepare_save(svm);
 
@@ -1025,33 +1010,24 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
 	return 0;
 }
 
-static int nested_svm_vmrun_read_vmcb12(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
+static int nested_svm_copy_vmcb12_to_cache(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 	struct kvm_host_map map;
 	struct vmcb *vmcb12;
-	int ret;
+	int r;
 
-	ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
-	if (ret)
-		return ret;
+	r = kvm_vcpu_map_readonly(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
+	if (r)
+		return r;
 
 	vmcb12 = map.hva;
 
 	nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
 	nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
 
-	if (!nested_vmcb_check_save(vcpu) ||
-	    !nested_vmcb_check_controls(vcpu)) {
-		vmcb12->control.exit_code    = SVM_EXIT_ERR;
-		vmcb12->control.exit_code_hi = -1u;
-		vmcb12->control.exit_info_1  = 0;
-		vmcb12->control.exit_info_2  = 0;
-		ret = -1;
-	}
-
 	kvm_vcpu_unmap(vcpu, &map);
-	return ret;
+	return 0;
 }
 
 int nested_svm_vmrun(struct kvm_vcpu *vcpu)
@@ -1082,12 +1058,9 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 		return -EINVAL;
 
 	vmcb12_gpa = svm->vmcb->save.rax;
-	ret = nested_svm_vmrun_read_vmcb12(vcpu, vmcb12_gpa);
-	if (ret == -EINVAL) {
+	if (nested_svm_copy_vmcb12_to_cache(vcpu, vmcb12_gpa)) {
 		kvm_inject_gp(vcpu, 0);
 		return 1;
-	} else if (ret) {
-		return kvm_skip_emulated_instruction(vcpu);
 	}
 
 	ret = kvm_skip_emulated_instruction(vcpu);
@@ -1919,7 +1892,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 	ret = -EINVAL;
 	__nested_copy_vmcb_control_to_cache(vcpu, &ctl_cached, ctl);
 	/* 'save' contains L1 state saved from before VMRUN */
-	if (!__nested_vmcb_check_controls(vcpu, &ctl_cached, save->cr0))
+	if (!nested_vmcb_check_controls(vcpu, &ctl_cached, save->cr0))
 		goto out_free;
 
 	/*
@@ -1938,7 +1911,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 	if (!(save->cr0 & X86_CR0_PG) ||
 	    !(save->cr0 & X86_CR0_PE) ||
 	    (save->rflags & X86_EFLAGS_VM) ||
-	    !__nested_vmcb_check_save(vcpu, &save_cached))
+	    !nested_vmcb_check_save(vcpu, &save_cached))
 		goto out_free;
 
 

base-commit: 01597a665f5dcf8d7cfbedf36f4e6d46d045eb4f
--
Re: [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN
Posted by Yosry Ahmed 1 month, 4 weeks ago
On Tue, Dec 09, 2025 at 08:03:15AM -0800, Sean Christopherson wrote:
> On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> [...]
> Moving the consistency checks into enter_svm_guest_mode() would solve all three
> (four?) problems.  And as a bonus, nested_svm_copy_vmcb12_to_cache() can use
> kvm_vcpu_map_readonly().
> 
> Compile tested only, but I think we can end up with delta like so:
> 
> [...]
> @@ -1004,6 +983,12 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
>  	nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
>  
>  	svm_switch_vmcb(svm, &svm->nested.vmcb02);
> +
> +	if (!nested_vmcb_check_save(vcpu, &svm->nested.save) ||
> +	    !nested_vmcb_check_controls(vcpu, &svm->nested.ctl,
> +					svm->vmcb01.ptr->save.cr0))
> +		return -EINVAL;
> +
>  	nested_vmcb02_prepare_control(svm, save->rip, save->cs.base);
>  	nested_vmcb02_prepare_save(svm);
>  

Unfortunately this doesn't work, it breaks the newly introduced
nested_invalid_cr3_test. The problem is that we bail before we fully
initialize VMCB02, then nested_svm_vmrun() calls nested_svm_vmexit(),
which restores state from VMCB02 to VMCB12.

The test first tries to run L2 with a messed up CR3, which fails but
corrupts VMCB12 due to the above, then the second nested entry is
screwed.

There are two fixes, the easy one is just move the consistency checks
after nested_vmcb02_prepare_control() and nested_vmcb02_prepare_save()
(like the existing failure mode of nested_svm_load_cr3()). This works,
but the code doesn't make a lot of sense because we use VMCB12 to create
VMCB02 and THEN check that VMCB12 is valid.
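
Roughly, (1) would just reorder the checks in your diff, e.g. (sketch
only, reusing the names above):

	svm_switch_vmcb(svm, &svm->nested.vmcb02);
	nested_vmcb02_prepare_control(svm, save->rip, save->cs.base);
	nested_vmcb02_prepare_save(svm);

	/*
	 * Validate VMCB12 only after VMCB02 is fully built, so a
	 * failure can take the existing full #VMEXIT path.
	 */
	if (!nested_vmcb_check_save(vcpu, &svm->nested.save) ||
	    !nested_vmcb_check_controls(vcpu, &svm->nested.ctl,
					svm->vmcb01.ptr->save.cr0))
		return -EINVAL;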

The alternative is unfortunately a lot more involved. We only do a
partial restore or a "fast #VMEXIT" for failed VMRUNs. We'd need to:

1) Move nested_svm_load_cr3() above nested_vmcb02_prepare_control(),
   which needs moving nested_svm_init_mmu_context() out of
   nested_vmcb02_prepare_control() to remain before
   nested_svm_load_cr3().

   This makes sure a failed nested VMRUN always needs a "fast #VMEXIT"

2) Figure out which parts of nested_svm_vmexit() are needed in the
   failed VMRUN case. We need to at least switch the VMCB, propagate the
   error code, and do some cleanups. We can split this out into the
   "fast #VMEXIT" path, and use it for failed VMRUNs.

Let me know which way you prefer.
Re: [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN
Posted by Yosry Ahmed 1 month, 4 weeks ago
On Wed, Dec 10, 2025 at 11:05:44PM +0000, Yosry Ahmed wrote:
> On Tue, Dec 09, 2025 at 08:03:15AM -0800, Sean Christopherson wrote:
> > [...]
> 
> Unfortunately this doesn't work, it breaks the newly introduced
> nested_invalid_cr3_test. The problem is that we bail before we fully
> initialize VMCB02, then nested_svm_vmrun() calls nested_svm_vmexit(),
> which restores state from VMCB02 to VMCB12.
> 
> The test first tries to run L2 with a messed up CR3, which fails but
> corrupts VMCB12 due to the above, then the second nested entry is
> screwed.
> 
> There are two fixes, the easy one is just move the consistency checks
> after nested_vmcb02_prepare_control() and nested_vmcb02_prepare_save()
> (like the existing failure mode of nested_svm_load_cr3()). This works,
> but the code doesn't make a lot of sense because we use VMCB12 to create
> VMCB02 and THEN check that VMCB12 is valid.
> 
> The alternative is unfortunately a lot more involved. We only do a
> partial restore or a "fast #VMEXIT" for failed VMRUNs. We'd need to:
> 
> 1) Move nested_svm_load_cr3() above nested_vmcb02_prepare_control(),
>    which needs moving nested_svm_init_mmu_context() out of
>    nested_vmcb02_prepare_control() to remain before
>    nested_svm_load_cr3().
> 
>    This makes sure a failed nested VMRUN always needs a "fast #VMEXIT"
> 
> 2) Figure out which parts of nested_svm_vmexit() are needed in the
>    failed VMRUN case. We need to at least switch the VMCB, propagate the
>    error code, and do some cleanups. We can split this out into the
>    "fast #VMEXIT" path, and use it for failed VMRUNs.
> 
> Let me know which way you prefer.

I think I prefer (2), the code looks cleaner and I like having a
separate code path for VMRUN failures. Unless there are objections, I
will do that in the next version.
Re: [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN
Posted by Sean Christopherson 1 month, 3 weeks ago
On Thu, Dec 11, 2025, Yosry Ahmed wrote:
> On Wed, Dec 10, 2025 at 11:05:44PM +0000, Yosry Ahmed wrote:
> > On Tue, Dec 09, 2025 at 08:03:15AM -0800, Sean Christopherson wrote:
> > > On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > Unfortunately this doesn't work, it breaks the newly introduced
> > nested_invalid_cr3_test. The problem is that we bail before we fully
> > initialize VMCB02, then nested_svm_vmrun() calls nested_svm_vmexit(),
> > which restores state from VMCB02 to VMCB12.
> > 
> > The test first tries to run L2 with a messed up CR3, which fails but
> > corrupts VMCB12 due to the above, then the second nested entry is
> > screwed.
> > 
> > There are two fixes, the easy one is just move the consistency checks
> > after nested_vmcb02_prepare_control() and nested_vmcb02_prepare_save()
> > (like the existing failure mode of nested_svm_load_cr3()). This works,
> > but the code doesn't make a lot of sense because we use VMCB12 to create
> > VMCB02 and THEN check that VMCB12 is valid.
> > 
> > The alternative is unfortunately a lot more involved. We only do a
> > partial restore or a "fast #VMEXIT" for failed VMRUNs. We'd need to:
> > 
> > 1) Move nested_svm_load_cr3() above nested_vmcb02_prepare_control(),
> >    which needs moving nested_svm_init_mmu_context() out of
> >    nested_vmcb02_prepare_control() to remain before
> >    nested_svm_load_cr3().
> > 
> >    This makes sure a failed nested VMRUN always needs a "fast #VMEXIT"
> > 
> > 2) Figure out which parts of nested_svm_vmexit() are needed in the
> >    failed VMRUN case. We need to at least switch the VMCB, propagate the
> >    error code, and do some cleanups. We can split this out into the
> >    "fast #VMEXIT" path, and use it for failed VMRUNs.
> > 
> > Let me know which way you prefer.
> 
> I think I prefer (2), the code looks cleaner and I like having a
> separate code path for VMRUN failures. Unless there are objections, I
> will do that in the next version.

With the caveat that I haven't seen the code, that has my vote too.  nVMX has a
similar flow, and logically this is equivalent, at least to me.  We can probably
even use similar terminology, e.g. vmrun_fail_vmexit instead of vmentry_fail_vmexit.

vmentry_fail_vmexit:
	vmx_switch_vmcs(vcpu, &vmx->vmcs01);

	if (!from_vmentry)
		return NVMX_VMENTRY_VMEXIT;

	load_vmcs12_host_state(vcpu, vmcs12);
	vmcs12->vm_exit_reason = exit_reason.full;
	if (enable_shadow_vmcs || nested_vmx_is_evmptr12_valid(vmx))
		vmx->nested.need_vmcs12_to_shadow_sync = true;
	return NVMX_VMENTRY_VMEXIT;
Re: [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN
Posted by Yosry Ahmed 2 months ago
On Tue, Dec 09, 2025 at 08:03:15AM -0800, Sean Christopherson wrote:
> On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > All accesses to the VMCB12 in the guest memory are limited to
> > nested_svm_vmrun(). However, the VMCB12 remains mapped until the end of
> > the function execution. Unmapping right after the consistency checks is
> > possible, but it becomes easy-ish to introduce bugs where 'vmcb12' is
> > used after being unmapped.
> > 
> > Move all accesses to the VMCB12 into a new helper,
> > > nested_svm_vmrun_read_vmcb12(), that maps the VMCB12,
> > caches the needed fields, performs consistency checks, and unmaps it.
> > This limits the scope of the VMCB12 mapping appropriately. It also
> > slightly simplifies the cleanup path of nested_svm_vmrun().
> > 
> > nested_svm_vmrun_read_vmcb12() returns -1 if the consistency checks
> > > fail, maintaining the current behavior of skipping the instruction and
> > unmapping the VMCB12 (although in the opposite order).
> > 
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> >  arch/x86/kvm/svm/nested.c | 59 ++++++++++++++++++++++-----------------
> >  1 file changed, 34 insertions(+), 25 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index ddcd545ec1c3c..a48668c36a191 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -1023,12 +1023,39 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> >  	return 0;
> >  }
> >  
> > +static int nested_svm_vmrun_read_vmcb12(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
> 
> "read_vmcb12"() sounds like a generic helper to read a specific field.  And if
> the name is more specific, then I think we can drop the "vmrun" scoping.  To
> > align with similar functions in VMX and __nested_copy_vmcb_save_to_cache(), how
> about nested_svm_copy_vmcb12_to_cache()?

nested_svm_copy_vmcb12_to_cache() sounds good.

> 
> > +{
> > +	struct vcpu_svm *svm = to_svm(vcpu);
> > +	struct kvm_host_map map;
> > +	struct vmcb *vmcb12;
> > +	int ret;
> > +
> > +	ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
> > +	if (ret)
> > +		return ret;
> > +
> > +	vmcb12 = map.hva;
> > +
> > +	nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
> > +	nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
> > +
> > +	if (!nested_vmcb_check_save(vcpu) ||
> > +	    !nested_vmcb_check_controls(vcpu)) {
> > +		vmcb12->control.exit_code    = SVM_EXIT_ERR;
> > +		vmcb12->control.exit_code_hi = 0;
> > +		vmcb12->control.exit_info_1  = 0;
> > +		vmcb12->control.exit_info_2  = 0;
> > +		ret = -1;
> 
> I don't love shoving the consistency checks in here.  I get why you did it, but
> it's very surprising to see (and/or easy to miss) these consistency checks.  The
> caller also ends up quite wonky:
> 
> 	if (ret == -EINVAL) {
> 		kvm_inject_gp(vcpu, 0);
> 		return 1;
> 	} else if (ret) {
> 		return kvm_skip_emulated_instruction(vcpu);
> 	}
> 
> 	ret = kvm_skip_emulated_instruction(vcpu);
> 
> Ha!  And it's buggy.  __kvm_vcpu_map() can return -EFAULT if creating a host
> mapping fails.  Eww, and blindly using '-1' as the "failed a consistency check"
> > is equally gross, as it relies on kvm_vcpu_map() not returning -EPERM in a very
> weird way.

I was trying to maintain the pre-existing behavior as much as possible,
and I think the existing code will handle -EFAULT from kvm_vcpu_map() in
the same way (skip the instruction and return).

I guess I shouldn't have assumed maintaining the existing behavior is
the right thing to do.

It's honestly really hard to detangle the return values of different KVM
functions and what they mean. "return 1" here is not very meaningful,
and the return code from kvm_skip_emulated_instruction() is not
documented, so I don't really know what we're supposed to return here in
what cases. The error codes are usually not interpreted until a few
layers higher up the callstack.

I agree that returning -1 is not great, but in this case the caller (and
the existing code) only cared about differentiating -EINVAL from others,
and I found other KVM functions returning -1 so I thought I shouldn't
overthink the return value. But yeah, you're right, no more -1's :)

Hence, I preferred to leave things as-is as much as possible.

> 
> Ugh, and there's also this nastiness in nested_vmcb_check_controls():
> 
> 	 * Make sure we did not enter guest mode yet, in which case
> 	 * kvm_read_cr0() could return L2's CR0.
> 	 */
> 	WARN_ON_ONCE(is_guest_mode(vcpu));
> 	return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
> 
> nested_vmcb_check_save() and nested_vmcb_check_controls() really shouldn't exist.
> They just make it harder to see what KVM is checking in the "normal" flow.
> 
> Aha!  And I'm fairly certain there are at least two pre-existing bugs due to KVM
> doing "early" consistency checks in nested_svm_vmrun().
> 
>   1. KVM doesn't clear GIF on the early #VMEXIT.  In classic APM fashion, nothing
>      _requires_ GIF=0 before VMRUN:
> 
>         It is assumed that VMM software cleared GIF some time before executing
>         the VMRUN instruction, to ensure an atomic state switch.
> 
>      And so VMRUN with GIF=1 that hits an "early" consistency check #VMEXIT would
>      incorrectly leave GIF=1.
> 
> 
>   2. svm_leave_smm() is missing consistency checks on the newly loaded guest state,
>      because the checks aren't performed by enter_svm_guest_mode().  I don't see
> >      anything that would prevent vmcb12 from being modified by the guest between
>      SMI and RSM.
> 
> Moving the consistency checks into enter_svm_guest_mode() would solve all three
> (four?) problems.  And as a bonus, nested_svm_copy_vmcb12_to_cache() can use
> kvm_vcpu_map_readonly().

Anyway, I will move the consistency checks as you suggested, I agree
that is much better. Thanks!

> [...]
Re: [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN
Posted by Sean Christopherson 2 months ago
On Tue, Dec 09, 2025, Yosry Ahmed wrote:
> On Tue, Dec 09, 2025 at 08:03:15AM -0800, Sean Christopherson wrote:
> > On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > > +	nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
> > > +	nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
> > > +
> > > +	if (!nested_vmcb_check_save(vcpu) ||
> > > +	    !nested_vmcb_check_controls(vcpu)) {
> > > +		vmcb12->control.exit_code    = SVM_EXIT_ERR;
> > > +		vmcb12->control.exit_code_hi = 0;
> > > +		vmcb12->control.exit_info_1  = 0;
> > > +		vmcb12->control.exit_info_2  = 0;
> > > +		ret = -1;
> > 
> > I don't love shoving the consistency checks in here.  I get why you did it, but
> > it's very surprising to see (and/or easy to miss) these consistency checks.  The
> > caller also ends up quite wonky:
> > 
> > 	if (ret == -EINVAL) {
> > 		kvm_inject_gp(vcpu, 0);
> > 		return 1;
> > 	} else if (ret) {
> > 		return kvm_skip_emulated_instruction(vcpu);
> > 	}
> > 
> > 	ret = kvm_skip_emulated_instruction(vcpu);
> > 
> > Ha!  And it's buggy.  __kvm_vcpu_map() can return -EFAULT if creating a host
> > mapping fails.  Eww, and blindly using '-1' as the "failed a consistency check"
> > is equally gross, as it relies on kvm_vcpu_map() not returning -EPERM in a very
> > weird way.
> 
> I was trying to maintain the pre-existing behavior as much as possible,
> and I think the existing code will handle -EFAULT from kvm_vcpu_map() in
> the same way (skip the instruction and return).
> 
> I guess I shouldn't have assumed maintaining the existing behavior is
> the right thing to do.

Maintaining existing behavior is absolutely the right thing to do when moving
code around.  It's just that sometimes touching code uncovers pre-existing issues,
as is the case here.

> It's honestly really hard to detangle the return values of different KVM
> functions and what they mean. "return 1" here is not very meaningful,
> and the return code from kvm_skip_emulated_instruction() is not
> documented, so I don't really know what we're supposed to return here in
> what cases. The error code are usually not interpreted until a few
> layers higher up the callstack.

LOL, welcome to KVM x86.  This has been a complaint since before I started working
on KVM.  We're finally getting traction on that mess, but it's a _huge_ mess to
sort out.

https://lore.kernel.org/all/20251205074537.17072-1-jgross@suse.com