arch/x86/kvm/svm/nested.c | 22 +++++++++++----------- arch/x86/kvm/svm/svm.h | 3 +++ 2 files changed, 14 insertions(+), 11 deletions(-)
Hi all,
Following Greg's suggestion to turn the proposed fix into a real patch,
here is a minimal fix for the vmcb12->save.rip TOCTOU race in KVM's
nested SVM implementation.
Background
----------
The CVE-2021-29657 fix introduced nested_copy_vmcb_save_to_cache() to
snapshot vmcb12 fields before validation and use, preventing a racing L1
vCPU from modifying vmcb12 between check and use. However, the save area
cache deliberately excluded rip, rsp, and rax -- only efer, cr0, cr3,
cr4, dr6, and dr7 are snapshotted.
As a result, vmcb12->save.rip is still read three separate times from
the live guest-mapped HVA pointer during a single nested VMRUN:
1) enter_svm_guest_mode() passes vmcb12->save.rip to
nested_vmcb02_prepare_control(), where it is stored in
svm->soft_int_old_rip, svm->soft_int_next_rip, and
vmcb02->control.next_rip
2) nested_vmcb02_prepare_save() calls
kvm_rip_write(vcpu, vmcb12->save.rip), setting the KVM-internal
vCPU register state
3) nested_vmcb02_prepare_save() then does
vmcb02->save.rip = vmcb12->save.rip, setting the hardware VMCB02
save area
Since vmcb12 is mapped via kvm_vcpu_map() as a direct HVA into guest
physical memory with no write protection, a concurrent L1 vCPU can
modify vmcb12->save.rip between these reads, producing a three-way RIP
inconsistency. This is the save-area analog of CVE-2021-29657.
The inconsistency is particularly dangerous when combined with soft
interrupt injection (event_inj with TYPE_SOFT): KVM records
soft_int_old_rip from read #1 but the vCPU state and hardware VMCB
reflect reads #2 and #3 respectively. If interrupt delivery faults,
svm_complete_interrupts() uses the stale soft_int_old_rip to
reconstruct pre-injection state, which no longer matches reality.
I am aware of Yosry Ahmed's larger patch series (v3-v6) that
reworks the entire vmcb12 caching architecture and would subsume
this fix. However, that series is still under review and has not
yet been merged. This patch is a minimal, self-contained fix that
can be applied immediately to close the TOCTOU window on rip, rsp,
and rax.
Fix
---
Add rip, rsp, and rax to struct vmcb_save_area_cached, snapshot them
in __nested_copy_vmcb_save_to_cache(), and replace all direct reads
of vmcb12->save.{rip,rsp,rax} with reads from the cached copy. This
ensures all consumers within a single nested VMRUN see consistent
register values.
Testing
-------
Tested on AMD Ryzen 7 7800X3D with nested virtualization enabled
(kvm_amd nested=1). A userspace race harness demonstrated a 25.6%
hit rate for concurrent modification of the rip field between reads
across 1M iterations, with 3-way splits (all three reads returning
different values) confirmed. With the patch applied, all three
consumption points see the same snapshotted value regardless of
concurrent modification.
The original discussion that led to this patch inadvertently went to
a public list. KVM maintainers were not CC'd on the follow-up; this
submission corrects that.
Seungil Jeon (1):
KVM: nSVM: Snapshot vmcb12 save.rip to prevent TOCTOU race
arch/x86/kvm/svm/nested.c | 22 +++++++++++-----------
arch/x86/kvm/svm/svm.h | 3 +++
2 files changed, 14 insertions(+), 11 deletions(-)
--
2.43.0
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Seungil Jeon <jeon1691951@gmail.com>
Date: Wed, 26 Mar 2026 00:00:00 +0000
Subject: [PATCH] KVM: nSVM: Snapshot vmcb12 save.rip to prevent TOCTOU race
vmcb12->save.rip is read three separate times from guest-writable memory
during nested VMRUN processing, without being snapshotted into the save
area cache. A malicious L1 guest with multiple vCPUs can concurrently
modify vmcb12->save.rip through another vCPU (since the vmcb12 page is
mapped as a direct HVA into guest physical memory with no write
protection), producing a three-way RIP inconsistency across:
1) svm->soft_int_old_rip and svm->soft_int_next_rip (set in
nested_vmcb02_prepare_control() from vmcb12->save.rip)
2) vcpu->arch.regs[VCPU_REGS_RIP] (set via kvm_rip_write() in
nested_vmcb02_prepare_save())
3) vmcb02->save.rip (set directly in nested_vmcb02_prepare_save())
This is the save-area analog of CVE-2021-29657, which exploited the
same TOCTOU pattern on vmcb12->control.intercept to achieve full host
code execution. The CVE-2021-29657 fix introduced
nested_copy_vmcb_save_to_cache() but deliberately excluded rip, rsp,
and rax from the cached fields, leaving the save area vulnerable to
the same class of concurrent-modification attacks.
When combined with soft interrupt injection via event_inj, the three-way
RIP split causes KVM to record inconsistent state for interrupt
re-injection: soft_int_old_rip holds one value (from read #1), the vCPU
register state holds another (from read #2), and the hardware VMCB holds
a third (from read #3). If delivery of the soft interrupt faults (e.g.,
#PF on IDT access), svm_complete_interrupts() uses soft_int_old_rip to
reconstruct the pre-injection state, which no longer matches the actual
vCPU or hardware state. This can trigger WARN_ON assertions in KVM's
interrupt handling, cause instruction emulation at wrong addresses, or
lead to information disclosure through confused emulation.
Fix this by adding rip, rsp, and rax to vmcb_save_area_cached, copying
them in __nested_copy_vmcb_save_to_cache(), and replacing all direct
reads of vmcb12->save.{rip,rsp,rax} with reads from the cached copy.
This ensures that all consumers within a single nested VMRUN see a
consistent snapshot of these registers, closing the TOCTOU window.
The same approach is used for rsp and rax, which suffer from analogous
(though currently two-read rather than three-read) TOCTOU patterns in
nested_vmcb02_prepare_save().
Cc: stable@vger.kernel.org
Reported-by: Seungil Jeon <jeon1691951@gmail.com>
Signed-off-by: Seungil Jeon <jeon1691951@gmail.com>
---
arch/x86/kvm/svm/nested.c | 22 +++++++++++-----------
arch/x86/kvm/svm/svm.h | 3 +++
2 files changed, 14 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -XXX,XX +XXX,XX @@ struct vmcb_save_area_cached {
u64 cr4;
u64 dr6;
u64 dr7;
+ u64 rip;
+ u64 rsp;
+ u64 rax;
};
struct vmcb_ctrl_area_cached {
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index XXXXXXX..XXXXXXX 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -XXX,XX +XXX,XX @@ static void __nested_copy_vmcb_save_to_cache(struct vmcb_save_area_cached *to,
struct vmcb_save_area *from)
{
/*
- * Copy only fields that are validated, as we need them
- * to avoid TOC/TOU races.
+ * Copy fields that are either validated or read multiple times,
+ * as we need a consistent snapshot to avoid TOC/TOU races.
+ * rip/rsp/rax are read multiple times from guest-writable
+ * memory in enter_svm_guest_mode() and must be snapshotted.
*/
to->efer = from->efer;
to->cr0 = from->cr0;
to->cr3 = from->cr3;
to->cr4 = from->cr4;
to->dr6 = from->dr6;
to->dr7 = from->dr7;
+ to->rip = from->rip;
+ to->rsp = from->rsp;
+ to->rax = from->rax;
}
/*
@@ -XXX,XX +XXX,XX @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm,
struct vmcb *vmcb12)
{
+ struct vmcb_save_area_cached *save = &svm->nested.save;
struct kvm_vcpu *vcpu = &svm->vcpu;
/* ... segment register copies from vmcb12 (unchanged) ... */
- kvm_rip_write(vcpu, vmcb12->save.rip);
- vmcb02->save.rip = vmcb12->save.rip;
+ kvm_rip_write(vcpu, save->rip);
+ vmcb02->save.rip = save->rip;
- kvm_rsp_write(vcpu, vmcb12->save.rsp);
- vmcb02->save.rsp = vmcb12->save.rsp;
+ kvm_rsp_write(vcpu, save->rsp);
+ vmcb02->save.rsp = save->rsp;
- kvm_rax_write(vcpu, vmcb12->save.rax);
- vmcb02->save.rax = vmcb12->save.rax;
+ kvm_rax_write(vcpu, save->rax);
+ vmcb02->save.rax = save->rax;
/* ... remainder of function unchanged ... */
}
@@ -XXX,XX +XXX,XX @@ int enter_svm_guest_mode(struct vcpu_svm *svm,
struct vmcb *vmcb12, bool from_vmrun)
{
+ struct vmcb_save_area_cached *save = &svm->nested.save;
struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;
/* ... existing code ... */
- nested_vmcb02_prepare_control(svm, vmcb12->save.rip,
- vmcb12->save.cs.base);
+ nested_vmcb02_prepare_control(svm, save->rip,
+ vmcb12->save.cs.base);
nested_vmcb02_prepare_save(svm, vmcb12);
/* ... remainder of function unchanged ... */
}
--
2.43.0
+Yosry's email used for upstream stuff
On Thu, Mar 26, 2026, 홍길동 wrote:
> Hi all,
>
> Following Greg's suggestion to turn the proposed fix into a real patch,
> here is a minimal fix for the vmcb12->save.rip TOCTOU race in KVM's
> nested SVM implementation.
>
> Background
> ----------
>
> The CVE-2021-29657 fix introduced nested_copy_vmcb_save_to_cache() to
> snapshot vmcb12 fields before validation and use, preventing a racing L1
> vCPU from modifying vmcb12 between check and use. However, the save area
> cache deliberately excluded rip, rsp, and rax -- only efer, cr0, cr3,
> cr4, dr6, and dr7 are snapshotted.
>
> As a result, vmcb12->save.rip is still read three separate times from
> the live guest-mapped HVA pointer during a single nested VMRUN:
>
> 1) enter_svm_guest_mode() passes vmcb12->save.rip to
> nested_vmcb02_prepare_control(), where it is stored in
> svm->soft_int_old_rip, svm->soft_int_next_rip, and
> vmcb02->control.next_rip
>
> 2) nested_vmcb02_prepare_save() calls
> kvm_rip_write(vcpu, vmcb12->save.rip), setting the KVM-internal
> vCPU register state
>
> 3) nested_vmcb02_prepare_save() then does
> vmcb02->save.rip = vmcb12->save.rip, setting the hardware VMCB02
> save areaq
>
> Since vmcb12 is mapped via kvm_vcpu_map() as a direct HVA into guest
> physical memory with no write protection, a concurrent L1 vCPU can
> modify vmcb12->save.rip between these reads, producing a three-way RIP
> inconsistency. This is the save-area analog of CVE-2021-29657.
>
> The inconsistency is particularly dangerous when combined with soft
Eh, I wouldn't call this dangerous per se. Problematic, sure, but AFAICT the
host is never at risk. My official stance is that any panics due to KVM WARNs
when running with panic_on_warn=1 are NOT considered KVM DoS issues.
KVM WARNs are 100% worth fixing, especially if they're guest- and/or user-triggerable,
but the WARNs themselves aren't security/DoS issues, because in my very strong
opinion, allowing use of /dev/kvm with panic_on_warn=1 when the platform owner
cares about host uptime is user/admin error.
> interrupt injection (event_inj with TYPE_SOFT): KVM records
> soft_int_old_rip from read #1 but the vCPU state and hardware VMCB
> reflect reads #2 and #3 respectively. If interrupt delivery faults,
> svm_complete_interrupts() uses the stale soft_int_old_rip to
> reconstruct pre-injection state, which no longer matches reality.
None of this matches reality though, in the sense that the instant L1 mucks with
vmcb12->save.rip while VMRUN is in-flight, all bets are off. I.e. from an
architectural perspective, KVM doesn't need to get anything "right", because
the L1 hypervisor has firmly triggered undefined behavior.
What exactly goes wrong? The changelog mentions a WARN, and at glance, the only
one that seems relevant is this one in kvm_requeue_exception():
WARN_ON_ONCE(kvm_is_exception_pending(vcpu));
> I am aware of Yosry Ahmed's larger patch series (v3-v6) that
> reworks the entire vmcb12 caching architecture and would subsume
> this fix. However, that series is still under review and has not
> yet been merged. This patch is a minimal, self-contained fix that
> can be applied immediately to close the TOCTOU window on rip, rsp,
> and rax.
>
> Fix
> ---
>
> Add rip, rsp, and rax to struct vmcb_save_area_cached, snapshot them
> in __nested_copy_vmcb_save_to_cache(), and replace all direct reads
> of vmcb12->save.{rip,rsp,rax} with reads from the cached copy. This
> ensures all consumers within a single nested VMRUN see consistent
> register values.
What is actually visibliy problematic?
Assuming the worst case scenario is a WARN, then I'm very strongly inclined to
either (a) not apply this patch at all and instead wait for Yosry's full series,
or (b) have Yosry slot in the most minimal fix (e.g. for just RIP) in a stable@
friendly location in his series.
There are many, many nSVM issues that need to be fixed, many of which are functional
problems for well-behaved setups. For me, those are by far the priority. I also
want to fix the a guest-triggerable WARN_ON_ONCE(), but it's not urgent, and not
something I want to spend a lot of effort on with respect to providing an LTS-friendly
commit (though if we can get one cheaply, that'd be great).
> > Add rip, rsp, and rax to struct vmcb_save_area_cached, snapshot them
> > in __nested_copy_vmcb_save_to_cache(), and replace all direct reads
> > of vmcb12->save.{rip,rsp,rax} with reads from the cached copy. This
> > ensures all consumers within a single nested VMRUN see consistent
> > register values.
>
> What is actually visibliy problematic?
>
> Assuming the worst case scenario is a WARN, then I'm very strongly inclined to
> either (a) not apply this patch at all and instead wait for Yosry's full series,
> or (b) have Yosry slot in the most minimal fix (e.g. for just RIP) in a stable@
> friendly location in his series.
>
> There are many, many nSVM issues that need to be fixed, many of which are functional
> problems for well-behaved setups. For me, those are by far the priority. I also
> want to fix the a guest-triggerable WARN_ON_ONCE(), but it's not urgent, and not
> something I want to spend a lot of effort on with respect to providing an LTS-friendly
> commit (though if we can get one cheaply, that'd be great).
I agree with Sean here, it's probably not worth fixing in LTS kernels.
The series has been in kvm-x86/next for a while and I don't think any
of us want to change that, it took a bit of work to get all the nSVM
patches there in good shape to begin with (and there's more pending
patches that depend on current kvm-x86/next).
That being said, I personally do not object to LTS-specific patches
(e.g. like the one attached), if Sean and Paolo think it's worth it. I
don't really have time to do that, but I can help with reviews
(although I will be OOO for the next 2 weeks). As Paolo said, be
careful that some older LTS trees do not even have the cached save
area, so they are broken in a much bigger way.
On Wed, Apr 1, 2026 at 11:49 PM Yosry Ahmed <yosry@kernel.org> wrote: > That being said, I personally do not object to LTS-specific patches > (e.g. like the one attached), if Sean and Paolo think it's worth it. I > don't really have time to do that, but I can help with reviews > (although I will be OOO for the next 2 weeks). As Paolo said, be > careful that some older LTS trees do not even have the cached save > area, so they are broken in a much bigger way. After the merge window, I will reevaluate submitting this patch to stable. But that's all we need to do. Paolo
On Wed, Apr 1, 2026 at 6:57 PM Sean Christopherson <seanjc@google.com> wrote: > Eh, I wouldn't call this dangerous per se. Problematic, sure, but AFAICT the > host is never at risk. My official stance is that any panics due to KVM WARNs > when running with panic_on_warn=1 are NOT considered KVM DoS issues. Yes. More in general panic_on_warn=1 is only a good idea in two cases: 1) because there's a much bigger vulnerability that you know about and can't patch, and you want to somehow mitigate it 2) because you (relatively speaking) don't care about availability, which means that treating WARNs as vulnerabilities is misleading. If someone (like a distro or an Android vendor, I don't know) sets panic_on_warn=1 without having evaluated (2), it's entirely on them. (Please, no bickering about the definition of vulnerability. It's perfectly possible for CNAs to describe configurations where reduced definitions of vulnerability apply, and "the user asked for it" can be one such configuration.) > KVM WARNs are 100% worth fixing, especially if they're guest- and/or user-triggerable, > but the WARNs themselves aren't security/DoS issues, because in my very strong > opinion, allowing use of /dev/kvm with panic_on_warn=1 when the platform owner > cares about host uptime is user/admin error. Yes. > There are many, many nSVM issues that need to be fixed, many of which are functional > problems for well-behaved setups. For me, those are by far the priority. I also > want to fix the a guest-triggerable WARN_ON_ONCE(), but it's not urgent, and not > something I want to spend a lot of effort on with respect to providing an LTS-friendly > commit (though if we can get one cheaply, that'd be great). Plus in the presence of large-scale refactorings and fixes over the years presence of issues is hard to ascertain mechanically. Older LTS versions won't even have struct vmcb_save_area_cached and may incorrectly register as "not having this issue", while the actual situation of nSVM there is much worse than top of tree. And for the same reason backports can be hard to evaluate. Sure WARNs tend to be of the "ok, I was overly conservative and didn't think of these circumstances", which is on the easy side, but there is a reason why KVM requested no stable autoselect. Paolo
© 2016 - 2026 Red Hat, Inc.