For guests with NRIPS disabled, L1 does not provide NextRIP when running
an L2 with an injected soft interrupt; instead, it advances L2's RIP
before running it. KVM uses L2's current RIP as the NextRIP in vmcb02 to
emulate a CPU without NRIPS.

However, in svm_set_nested_state(), the value used for L2's current RIP
comes from vmcb02, which is just whatever the vCPU had in vmcb02 before
restoring nested state (zero on a freshly created vCPU). Passing the
cached RIP value instead (i.e. kvm_rip_read()) would only fix the issue
if registers are restored before nested state.

Instead, split the logic for setting vmcb02's NextRIP. Handle the
'normal' case of initializing vmcb02's NextRIP using NextRIP from vmcb12
(or KVM_GET_NESTED_STATE's payload) in nested_vmcb02_prepare_control().
Delay the special case of stuffing L2's current RIP into vmcb02's
NextRIP until shortly before the vCPU is run, to make sure the most
up-to-date value of RIP is used regardless of KVM_SET_REGS and
KVM_SET_NESTED_STATE's relative ordering.
Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE")
CC: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
---
arch/x86/kvm/svm/nested.c | 25 ++++++++-----------------
arch/x86/kvm/svm/svm.c | 18 ++++++++++++++++++
2 files changed, 26 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index a82e6f0472ca7..b7c80aeaebab3 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -844,24 +844,15 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err;
/*
- * NextRIP is consumed on VMRUN as the return address pushed on the
- * stack for injected soft exceptions/interrupts. If nrips is exposed
- * to L1, take it verbatim from vmcb12.
- *
- * If nrips is supported in hardware but not exposed to L1, stuff the
- * actual L2 RIP to emulate what a nrips=0 CPU would do (L1 is
- * responsible for advancing RIP prior to injecting the event). This is
- * only the case for the first L2 run after VMRUN. After that (e.g.
- * during save/restore), NextRIP is updated by the CPU and/or KVM, and
- * the value of the L2 RIP from vmcb12 should not be used.
+ * If nrips is exposed to L1, take NextRIP as-is. Otherwise, L1
+ * advances L2's RIP before VMRUN instead of using NextRIP. KVM will
+ * stuff the current RIP as vmcb02's NextRIP before L2 is run. After
+ * the first run of L2 (e.g. after save+restore), NextRIP is updated by
+ * the CPU and/or KVM and should be used regardless of L1's support.
*/
- if (boot_cpu_has(X86_FEATURE_NRIPS)) {
- if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) ||
- !svm->nested.nested_run_pending)
- vmcb02->control.next_rip = svm->nested.ctl.next_rip;
- else
- vmcb02->control.next_rip = vmcb12_rip;
- }
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) ||
+ !svm->nested.nested_run_pending)
+ vmcb02->control.next_rip = svm->nested.ctl.next_rip;
svm->nmi_l1_to_l2 = is_evtinj_nmi(vmcb02->control.event_inj);
if (is_evtinj_soft(vmcb02->control.event_inj)) {
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 8f8bc863e2143..e084b9688f556 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1413,6 +1413,24 @@ static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
sd->bp_spec_reduce_set = true;
msr_set_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_BP_SPEC_REDUCE_BIT);
}
+
+ /*
+ * If nrips is supported in hardware but not exposed to L1, stuff the
+ * actual L2 RIP to emulate what a nrips=0 CPU would do (L1 is
+ * responsible for advancing RIP prior to injecting the event). Once L2
+ * runs after L1 executes VMRUN, NextRIP is updated by the CPU and/or
+ * KVM, and this is no longer needed.
+ *
+ * This is done here (as opposed to when preparing vmcb02) to use the
+ * most up-to-date value of RIP regardless of the order of restoring
+ * registers and nested state in the vCPU save+restore path.
+ */
+ if (is_guest_mode(vcpu) && svm->nested.nested_run_pending) {
+ if (boot_cpu_has(X86_FEATURE_NRIPS) &&
+ !guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
+ svm->vmcb->control.next_rip = kvm_rip_read(vcpu);
+ }
+
svm->guest_state_loaded = true;
}
--
2.53.0.345.g96ddfc5eaa-goog
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 8f8bc863e2143..e084b9688f556 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1413,6 +1413,24 @@ static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> sd->bp_spec_reduce_set = true;
> msr_set_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_BP_SPEC_REDUCE_BIT);
> }
> +
> + /*
> + * If nrips is supported in hardware but not exposed to L1, stuff the
> + * actual L2 RIP to emulate what a nrips=0 CPU would do (L1 is
> + * responsible for advancing RIP prior to injecting the event). Once L2
> + * runs after L1 executes VMRUN, NextRIP is updated by the CPU and/or
> + * KVM, and this is no longer needed.
> + *
> + * This is done here (as opposed to when preparing vmcb02) to use the
> + * most up-to-date value of RIP regardless of the order of restoring
> + * registers and nested state in the vCPU save+restore path.
> + */
> + if (is_guest_mode(vcpu) && svm->nested.nested_run_pending) {
> + if (boot_cpu_has(X86_FEATURE_NRIPS) &&
> + !guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
> + svm->vmcb->control.next_rip = kvm_rip_read(vcpu);
> + }
> +
Doing this in svm_prepare_switch_to_guest() is wrong, or at least
after the svm->guest_state_loaded check. It's possible to emulate the
nested VMRUN without doing a vcpu_put(), which means
svm->guest_state_loaded will remain true and this code will be
skipped.
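
Roughly, the relevant flow (simplified from the upstream function, so take
the exact shape with a grain of salt) is:

  static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
  {
          struct vcpu_svm *svm = to_svm(vcpu);

          if (svm->guest_state_loaded)
                  return;                 /* anything below is skipped */

          /* ... host state save, TSC scaling, the proposed NextRIP fixup ... */

          svm->guest_state_loaded = true;
  }
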
In fact, this breaks the svm_nested_soft_inject_test test. Funny
enough, I was only running it with my repro changes, which papered
over the bug because it forced an exit to userspace after VMRUN due to
single-stepping, so svm->guest_state_loaded got cleared and the code
was executed on the next KVM_RUN, before L2 runs.
I can move it above the svm->guest_state_loaded check, but I think I
will just put it in pre_svm_run() instead.
On Tue, Feb 24, 2026, Yosry Ahmed wrote:
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 8f8bc863e2143..e084b9688f556 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -1413,6 +1413,24 @@ static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> > sd->bp_spec_reduce_set = true;
> > msr_set_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_BP_SPEC_REDUCE_BIT);
> > }
> > +
> > + /*
> > + * If nrips is supported in hardware but not exposed to L1, stuff the
> > + * actual L2 RIP to emulate what a nrips=0 CPU would do (L1 is
> > + * responsible for advancing RIP prior to injecting the event). Once L2
> > + * runs after L1 executes VMRUN, NextRIP is updated by the CPU and/or
> > + * KVM, and this is no longer needed.
> > + *
> > + * This is done here (as opposed to when preparing vmcb02) to use the
> > + * most up-to-date value of RIP regardless of the order of restoring
> > + * registers and nested state in the vCPU save+restore path.
> > + */
> > + if (is_guest_mode(vcpu) && svm->nested.nested_run_pending) {
> > + if (boot_cpu_has(X86_FEATURE_NRIPS) &&
> > + !guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
> > + svm->vmcb->control.next_rip = kvm_rip_read(vcpu);
> > + }
> > +
>
> Doing this in svm_prepare_switch_to_guest() is wrong, or at least
> after the svm->guest_state_loaded check. It's possible to emulate the
> nested VMRUN without doing a vcpu_put(), which means
> svm->guest_state_loaded will remain true and this code will be
> skipped.
>
> In fact, this breaks the svm_nested_soft_inject_test test. Funny
> enough, I was only running it with my repro changes, which papered
> over the bug because it forced an exit to userspace after VMRUN due to
> single-stepping, so svm->guest_state_loaded got cleared and the code
> was executed on the next KVM_RUN, before L2 runs.
>
> I can move it above the svm->guest_state_loaded check, but I think I
> will just put it in pre_svm_run() instead.
I would rather not expand pre_svm_run(), and instead just open code it in
svm_vcpu_run(). pre_svm_run() probably should never have been added, because
it's far from a generic "pre run" API. E.g. if we want to keep the helper around,
it should probably be named something something ASID.
> > Doing this in svm_prepare_switch_to_guest() is wrong, or at least
> > after the svm->guest_state_loaded check. It's possible to emulate the
> > nested VMRUN without doing a vcpu_put(), which means
> > svm->guest_state_loaded will remain true and this code will be
> > skipped.
> >
> > In fact, this breaks the svm_nested_soft_inject_test test. Funny
> > enough, I was only running it with my repro changes, which papered
> > over the bug because it forced an exit to userspace after VMRUN due to
> > single-stepping, so svm->guest_state_loaded got cleared and the code
> > was executed on the next KVM_RUN, before L2 runs.
> >
> > I can move it above the svm->guest_state_loaded check, but I think I
> > will just put it in pre_svm_run() instead.
>
> I would rather not expand pre_svm_run(), and instead just open code it in
> svm_vcpu_run(). pre_svm_run() probably should never have been added, because
> it's far from a generic "pre run" API. E.g. if we want to keep the helper around,
> it should probably be named something something ASID.

I sent a new version before I saw your response.. sorry.

How strongly do you feel about this? :P
On Tue, Feb 24, 2026, Yosry Ahmed wrote:
> > > Doing this in svm_prepare_switch_to_guest() is wrong, or at least
> > > after the svm->guest_state_loaded check. It's possible to emulate the
> > > nested VMRUN without doing a vcpu_put(), which means
> > > svm->guest_state_loaded will remain true and this code will be
> > > skipped.
> > >
> > > In fact, this breaks the svm_nested_soft_inject_test test. Funny
> > > enough, I was only running it with my repro changes, which papered
> > > over the bug because it forced an exit to userspace after VMRUN due to
> > > single-stepping, so svm->guest_state_loaded got cleared and the code
> > > was executed on the next KVM_RUN, before L2 runs.
> > >
> > > I can move it above the svm->guest_state_loaded check, but I think I
> > > will just put it in pre_svm_run() instead.
> >
> > I would rather not expand pre_svm_run(), and instead just open code it in
> > svm_vcpu_run(). pre_svm_run() probably should never have been added, because
> > it's far from a generic "pre run" API. E.g. if we want to keep the helper around,
> > it should probably be named something something ASID.
>
> I sent a new version before I saw your response.. sorry.
>
> How strongly do you feel about this? :P

Strong enough that I'll fix it up when applying, unless it's a sticking point on
your end.

E.g. one thing that I don't love is that the code never runs for SEV guests.
Which is fine, because in practice I can't imagine KVM ever supporting nested
SVM for SEV, but it adds unnecessary cognitive load, because readers need to
reason through why the code only applies to !SEV guests.
On Tue, Feb 24, 2026 at 5:10 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Feb 24, 2026, Yosry Ahmed wrote:
> > > > Doing this in svm_prepare_switch_to_guest() is wrong, or at least
> > > > after the svm->guest_state_loaded check. It's possible to emulate the
> > > > nested VMRUN without doing a vcpu_put(), which means
> > > > svm->guest_state_loaded will remain true and this code will be
> > > > skipped.
> > > >
> > > > In fact, this breaks the svm_nested_soft_inject_test test. Funny
> > > > enough, I was only running it with my repro changes, which papered
> > > > over the bug because it forced an exit to userspace after VMRUN due to
> > > > single-stepping, so svm->guest_state_loaded got cleared and the code
> > > > was executed on the next KVM_RUN, before L2 runs.
> > > >
> > > > I can move it above the svm->guest_state_loaded check, but I think I
> > > > will just put it in pre_svm_run() instead.
> > >
> > > I would rather not expand pre_svm_run(), and instead just open code it in
> > > svm_vcpu_run(). pre_svm_run() probably should never have been added, because
> > > it's far from a generic "pre run" API. E.g. if we want to keep the helper around,
> > > it should probably be named something something ASID.
> >
> > I sent a new version before I saw your response.. sorry.
> >
> > How strongly do you feel about this? :P
>
> Strong enough that I'll fix it up when applying, unless it's a sticking point on
> your end.

It's just that 99% of the time someone is reading svm_vcpu_run(), they
won't care about this code, and it's also cognitive load to filter it
out. We can add a helper for this code (and the soft IRQ inject),
something like svm_fixup_nested_rips() or sth.

We discussed a helper before and you didn't like it, but that was in a
different context (a helper that combined normal and special cases).
WDYT?

> E.g. one thing that I don't love is that the code never runs for SEV guests.
> Which is fine, because in practice I can't imagine KVM ever supporting nested
> SVM for SEV, but it adds unnecessary cognitive load, because readers need to
> reason through why the code only applies to !SEV guests.
On Tue, Feb 24, 2026, Yosry Ahmed wrote:
> On Tue, Feb 24, 2026 at 5:10 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Feb 24, 2026, Yosry Ahmed wrote:
> > > > > Doing this in svm_prepare_switch_to_guest() is wrong, or at least
> > > > > after the svm->guest_state_loaded check. It's possible to emulate the
> > > > > nested VMRUN without doing a vcpu_put(), which means
> > > > > svm->guest_state_loaded will remain true and this code will be
> > > > > skipped.
> > > > >
> > > > > In fact, this breaks the svm_nested_soft_inject_test test. Funny
> > > > > enough, I was only running it with my repro changes, which papered
> > > > > over the bug because it forced an exit to userspace after VMRUN due to
> > > > > single-stepping, so svm->guest_state_loaded got cleared and the code
> > > > > was executed on the next KVM_RUN, before L2 runs.
> > > > >
> > > > > I can move it above the svm->guest_state_loaded check, but I think I
> > > > > will just put it in pre_svm_run() instead.
> > > >
> > > > I would rather not expand pre_svm_run(), and instead just open code it in
> > > > svm_vcpu_run(). pre_svm_run() probably should never have been added, because
> > > > it's far from a generic "pre run" API. E.g. if we want to keep the helper around,
> > > > it should probably be named something something ASID.
> > >
> > > I sent a new version before I saw your response.. sorry.
> > >
> > > How strongly do you feel about this? :P
> >
> > Strong enough that I'll fix it up when applying, unless it's a sticking point on
> > your end.
>
> It's just that 99% of the time someone is reading svm_vcpu_run(), they
> won't care about this code, and it's also cognitive load to filter it
> out. We can add a helper for this code (and the soft IRQ inject),
> something like svm_fixup_nested_rips() or sth.
I don't entirely disagree, but at the same time, why is someone reading svm_vcpu_run()
if they don't want to look at the gory details?
> We discussed a helper before and you didn't like it, but that was in a
> different context (a helper that combined normal and special cases).
> WDYT?
A helper would work. svm_fixup_nested_rips() is good, the only flaw is the CS.base
chunk, but I'm not sure I care enough about 32-bit to reject the name just because
of that :-)
That would make it easier to reduce indentation, e.g.
static void svm_fixup_nested_rips(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	/*
	 * If nrips is supported in hardware but not exposed to L1, stuff the
	 * actual L2 RIP to emulate what a nrips=0 CPU would do (L1 is
	 * responsible for advancing RIP prior to injecting the event). Once L2
	 * runs after L1 executes VMRUN, NextRIP is updated by the CPU and/or
	 * KVM, and this is no longer needed.
	 *
	 * This is done here (as opposed to when preparing vmcb02) to use the
	 * most up-to-date value of RIP regardless of the order of restoring
	 * registers and nested state in the vCPU save+restore path.
	 *
	 * Similarly, initialize svm->soft_int_* fields here to use the most
	 * up-to-date values of RIP and CS base, regardless of restore order.
	 */
	if (!is_guest_mode(vcpu) || !svm->nested.nested_run_pending)
		return;

	if (boot_cpu_has(X86_FEATURE_NRIPS) &&
	    !guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
		svm->vmcb->control.next_rip = kvm_rip_read(vcpu);

	if (svm->soft_int_injected) {
		svm->soft_int_csbase = svm->vmcb->save.cs.base;
		svm->soft_int_old_rip = kvm_rip_read(vcpu);
		if (!guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
			svm->soft_int_next_rip = kvm_rip_read(vcpu);
	}
}
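
The call site would presumably be a one-liner in svm_vcpu_run(), somewhere
before the actual entry (exact placement is a guess, not settled in this
thread):

	/* early in svm_vcpu_run(), before entering the guest */
	svm_fixup_nested_rips(vcpu);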
> > We discussed a helper before and you didn't like it, but that was in a
> > different context (a helper that combined normal and special cases).
> > WDYT?
>
> A helper would work. svm_fixup_nested_rips() is good, the only flaw is the CS.base
> chunk, but I'm not sure I care enough about 32-bit to reject the name just because
> of that :-)
>
> That would make it easier to reduce indentation, e.g.
>
> static void svm_fixup_nested_rips(struct kvm_vcpu *vcpu)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
>
> /*
> * If nrips is supported in hardware but not exposed to L1, stuff the
> * actual L2 RIP to emulate what a nrips=0 CPU would do (L1 is
> * responsible for advancing RIP prior to injecting the event). Once L2
> * runs after L1 executes VMRUN, NextRIP is updated by the CPU and/or
> * KVM, and this is no longer needed.
> *
> * This is done here (as opposed to when preparing vmcb02) to use the
> * most up-to-date value of RIP regardless of the order of restoring
> * registers and nested state in the vCPU save+restore path.
> *
> * Similarly, initialize svm->soft_int_* fields here to use the most
> * up-to-date values of RIP and CS base, regardless of restore order.
> */
> if (!is_guest_mode(vcpu) || !svm->nested.nested_run_pending)
> return;
>
> if (boot_cpu_has(X86_FEATURE_NRIPS) &&
> !guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
> svm->vmcb->control.next_rip = kvm_rip_read(vcpu);
>
> if (svm->soft_int_injected) {
> svm->soft_int_csbase = svm->vmcb->save.cs.base;
> svm->soft_int_old_rip = kvm_rip_read(vcpu);
> if (!guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
> svm->soft_int_next_rip = kvm_rip_read(vcpu);
> }
> }
Looks good, thanks Sean!