On KVM_SET_GUEST_DEBUG, if a #DB or #BP is injected with
KVM_GUESTDBG_INJECT_DB or KVM_GUESTDBG_INJECT_BP, KVM fails with -EBUSY
if there is an existing pending exception. This was introduced in
commit 4f926bf29186 ("KVM: x86: Polish exception injection via
KVM_SET_GUEST_DEBUG") to avoid a warning in kvm_queue_exception(),
presumably to avoid overriding a pending exception.
This added another (arguably nice) property: if there's a pending
exception, KVM_SET_GUEST_DEBUG cannot cause a #DF or triple fault.
However, if an exception is injected, KVM_SET_GUEST_DEBUG will cause
a #DF or triple fault in the guest, as kvm_multiple_exception() combines
them.
Check for both pending and injected exceptions for
KVM_GUESTDBG_INJECT_DB and KVM_GUESTDBG_INJECT_BP, to avoid accidentally
causing a #DF or triple fault.
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
---
arch/x86/kvm/x86.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e39c5faf94230..0c8aacf1fa67f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12543,7 +12543,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
if (dbg->control & (KVM_GUESTDBG_INJECT_DB | KVM_GUESTDBG_INJECT_BP)) {
r = -EBUSY;
- if (kvm_is_exception_pending(vcpu))
+ if (kvm_is_exception_pending(vcpu) || vcpu->arch.exception.injected)
goto out;
if (dbg->control & KVM_GUESTDBG_INJECT_DB)
kvm_queue_exception(vcpu, DB_VECTOR);
--
2.53.0.473.g4a7958ca14-goog
On Fri, Feb 27, 2026, Yosry Ahmed wrote:
> On KVM_SET_GUEST_DEBUG, if a #DB or #BP is injected with
> KVM_GUESTDBG_INJECT_DB or KVM_GUESTDBG_INJECT_BP, KVM fails with -EBUSY
> if there is an existing pending exception. This was introduced in
> commit 4f926bf29186 ("KVM: x86: Polish exception injection via
> KVM_SET_GUEST_DEBUG") to avoid a warning in kvm_queue_exception(),
> presumably to avoid overriding a pending exception.
>
> This added another (arguably nice) property, if there's a pending
> exception, KVM_SET_GUEST_DEBUG cannot cause a #DF or triple fault.
> However, if an exception is injected, KVM_SET_GUEST_DEBUG will cause
> a #DF or triple fault in the guest, as kvm_multiple_exception() combines
> them.
First off, this patch looks good irrespective of nested crud. Disallowing injection
of #DB/#BP while there's already an injected exception aligns with architectural
behavior; KVM needs to finish delivering the exception and thus "complete" the
instruction before queueing a new exception.
As for nested, I _was_ going to say that, assuming the original motivation for
this patch is to avoid triggering a nested VM-Exit while nested_run_pending=1,
this is incomplete. E.g. if the #BP or #DB itself is being intercepted by L1,
then queueing the exception will incorrectly deliver the nested VM-Exit before
nested VM-Enter completes. I.e. I _thought_ we would also need:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db3f393192d9..fdbd272027ed 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12526,6 +12526,9 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
if (vcpu->arch.guest_state_protected)
return -EINVAL;
+ if (vcpu->arch.nested_run_pending)
+ return -EBUSY;
+
vcpu_load(vcpu);
if (dbg->control & (KVM_GUESTDBG_INJECT_DB | KVM_GUESTDBG_INJECT_BP)) {
But that isn't actually the case, because {svm,vmx}_check_nested_events() blocks
exceptions and VM-Exits while nested_run_pending=1. Off-list, I had rejected
modifying nested_vmx_triple_fault() to drop the triple fault like so:
@@ -5191,6 +5191,9 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
static void nested_vmx_triple_fault(struct kvm_vcpu *vcpu)
{
+ if (to_vmx(vcpu)->nested.nested_run_pending)
+ return;
+
kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0);
}
because "That would largely defeat the entire purpose of the WARN. The whole point
is to detect bugs where KVM synthesizes a VM-Exit before completing nested VM-Enter.
It should be impossible for nested_{vmx,svm}_triple_fault() to be called with
nested_run_pending=true."
Given the above, that argument doesn't hold up at first glance, because blocking
the triple fault if nested_run_pending=1 would be consistent with how other "synchronous"
events are handled by nVMX and nSVM. But after staring at this a bit, unless I'm
forgetting something (entirely possible), I actually think {svm,vmx}_check_nested_events()
are wrong, because I stand behind my quoted statement: it should be impossible for
KVM to synthesize a synchronous event before completing nested VM-Enter.
Blocking pending and injected exceptions was added by bfcf83b1444d ("KVM: nVMX:
Fix trying to cancel vmlauch/vmresume"), and unfortunately neither the changelog
nor Lore[1] provides any details as to what exactly prompted the fix. I _suspect_
it was either syzkaller induced, or the manifestation of other bugs in KVM's
exception handling.
Finally getting to the point, in theory, I _think_ KVM should actually WARN on
all cases where it temporarily blocks a synchronous event due to a pending nested
VM-Enter (see below diff for the basic gist). However, that would open the
floodgates to syzkaller, because KVM_SET_VCPU_EVENTS can obviously stuff events,
KVM_X86_SET_MCE can queue a #MC, etc.
I _think_ we might be able to get away with rejecting KVM_SET_VCPU_EVENTS if
nested_run_pending=1, without breaking userspace? Google's VMM is insane and does
KVM_SET_VCPU_EVENTS before KVM_SET_NESTED_STATE, but in real usage, i.e. outside
of selftests, (I hope) no VMM will restore into a "live" vCPU.
So instead of patch 1, I want to try either (a) blocking KVM_SET_VCPU_EVENTS,
KVM_X86_SET_MCE, and KVM_SET_GUEST_DEBUG if nested_run_pending=1, *and* follow-up
with the below WARN-spree, or (b) add a separate flag, e.g. nested_run_in_progress
or so, that is set with nested_run_pending, but cleared on an exit to userspace,
and then WARN on _that_, i.e. so that we can detect KVM bugs (the whole point of
the WARN) and hopefully stop playing this losing game of whack-a-mole with syzkaller.
I think I'm leaning toward (b)? Except for KVM_SET_GUEST_DEBUG, where userspace
is trying to interpose on the guest, restricting ioctls doesn't really add any
value in practice. Yeah, in theory it could _maybe_ prevent userspace from shooting
itself in the foot, but practically speaking, if userspace is restoring state into
a vCPU with nested_run_pending=1, it's either playing on expert mode or is already
completely broken.
My only hesitation with (b) is that KVM wouldn't be entirely consistent, since
vmx_unhandleable_emulation_required() _does_ explicitly reject a "userspace did
something stupid with nested_run_pending=1" case. So from that perspective, part
of me wants to get greedy and try for (a).
[1] https://lore.kernel.org/all/?q=%22KVM:%20nVMX:%20Fix%20trying%20to%20cancel%20vmlauch%22
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index de90b104a0dd..d624e4db704a 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1605,14 +1605,14 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
}
if (vcpu->arch.exception_vmexit.pending) {
- if (block_nested_exceptions)
+ if (WARN_ON_ONCE(block_nested_exceptions))
return -EBUSY;
nested_svm_inject_exception_vmexit(vcpu);
return 0;
}
if (vcpu->arch.exception.pending) {
- if (block_nested_exceptions)
+ if (WARN_ON_ONCE(block_nested_exceptions))
return -EBUSY;
return 0;
}
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 248635da6766..a223c5e86188 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4336,7 +4336,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
*/
if (vcpu->arch.exception_vmexit.pending &&
!vmx_is_low_priority_db_trap(&vcpu->arch.exception_vmexit)) {
- if (block_nested_exceptions)
+ if (WARN_ON_ONCE(block_nested_exceptions))
return -EBUSY;
nested_vmx_inject_exception_vmexit(vcpu);
@@ -4345,13 +4345,13 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
if (vcpu->arch.exception.pending &&
!vmx_is_low_priority_db_trap(&vcpu->arch.exception)) {
- if (block_nested_exceptions)
+ if (WARN_ON_ONCE(block_nested_exceptions))
return -EBUSY;
goto no_vmexit;
}
if (vmx->nested.mtf_pending) {
- if (block_nested_events)
- if (WARN_ON_ONCE(block_nested_events))
return -EBUSY;
nested_vmx_update_pending_dbg(vcpu);
nested_vmx_vmexit(vcpu, EXIT_REASON_MONITOR_TRAP_FLAG, 0, 0);
@@ -4359,7 +4359,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
}
if (vcpu->arch.exception_vmexit.pending) {
- if (block_nested_exceptions)
+ if (WARN_ON_ONCE(block_nested_exceptions))
return -EBUSY;
nested_vmx_inject_exception_vmexit(vcpu);
@@ -4367,7 +4367,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
}
if (vcpu->arch.exception.pending) {
- if (block_nested_exceptions)
+ if (WARN_ON_ONCE(block_nested_exceptions))
return -EBUSY;
goto no_vmexit;
}
On Fri, Feb 27, 2026, Sean Christopherson wrote:
> So instead of patch 1, I want to try either (a) blocking KVM_SET_VCPU_EVENTS,
> KVM_X86_SET_MCE, and KVM_SET_GUEST_DEBUG if nested_run_pending=1, *and* follow-up
> with the below WARN-spree, or (b) add a separate flag, e.g. nested_run_in_progress
> or so, that is set with nested_run_pending, but cleared on an exit to userspace,
> and then WARN on _that_, i.e. so that we can detect KVM bugs (the whole point of
> the WARN) and hopefully stop playing this losing game of whack-a-mole with syzkaller.
>
> I think I'm leaning toward (b)?  Except for KVM_SET_GUEST_DEBUG, where userspace
> is trying to interpose on the guest, restricting ioctls doesn't really add any
> value in practice.  Yeah, in theory it could _maybe_ prevent userspace from shooting
> itself in the foot, but practically speaking, if userspace is restoring state into
> a vCPU with nested_run_pending=1, it's either playing on expert mode or is already
> completely broken.
>
> My only hesitation with (b) is that KVM wouldn't be entirely consistent, since
> vmx_unhandleable_emulation_required() _does_ explicitly reject a "userspace did
> something stupid with nested_run_pending=1" case.  So from that perspective, part
> of me wants to get greedy and try for (a).

On second (fifth?) thought, I don't think (a) is a good idea.  In addition to
potentially breaking userspace, it also risks preventing genuinely useful sequences.
E.g. even if no VMM does so today, it's entirely plausible that a VMM could want
to asynchronously inject an #MC to mimic a broadcast, and that the injection could
collide with a pending nested VM-Enter.

I'll send a separate (maybe RFC?) series for (b) using patch 1 as a starting point.
I want to fiddle around with some ideas, and it'll be faster to sketch things out
in code versus trying to describe things in text.
On Fri, Feb 27, 2026 at 8:34 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Feb 27, 2026, Sean Christopherson wrote:
> > So instead of patch 1, I want to try either (a) blocking KVM_SET_VCPU_EVENTS,
> > KVM_X86_SET_MCE, and KVM_SET_GUEST_DEBUG if nested_run_pending=1, *and* follow-up
> > with the below WARN-spree, or (b) add a separate flag, e.g. nested_run_in_progress
> > or so, that is set with nested_run_pending, but cleared on an exit to userspace,
> > and then WARN on _that_, i.e. so that we can detect KVM bugs (the whole point of
> > the WARN) and hopefully stop playing this losing game of whack-a-mole with syzkaller.

I like the idea of the WARN there, although something in the back of
my mind tells me I went through this code before with an exception in
mind that could be injected with nested_run_pending=1, but I can't
remember it. Maybe it was injected by userspace and all is good.

That being said, I hate nested_run_in_progress. It's too close to
nested_run_pending and I am pretty sure they will be mixed up.

exception_from_userspace's name made me think this is something we
could key off to WARN, but it's meant to morph queued exceptions from
userspace into an "exception_vmexit" if needed. The field name is
generic but its functionality isn't, maybe it should have been called
exception_check_vmexit or something. Anyway..

That gave me an idea though, can we add a field to
kvm_queued_exception to identify the origin of the exception
(userspace vs. KVM)? Then we can key the warning off of that.

We can potentially also avoid adding the field and just plumb the
argument through to kvm_multiple_exception(), and WARN there if
nested_run_pending is set and the origin is not userspace?

> > I think I'm leaning toward (b)?  Except for KVM_SET_GUEST_DEBUG, where userspace
> > is trying to interpose on the guest, restricting ioctls doesn't really add any
> > value in practice.  Yeah, in theory it could _maybe_ prevent userspace from shooting
> > itself in the foot, but practically speaking, if userspace is restoring state into
> > a vCPU with nested_run_pending=1, it's either playing on expert mode or is already
> > completely broken.
> >
> > My only hesitation with (b) is that KVM wouldn't be entirely consistent, since
> > vmx_unhandleable_emulation_required() _does_ explicitly reject a "userspace did
> > something stupid with nested_run_pending=1" case.  So from that perspective, part
> > of me wants to get greedy and try for (a).
>
> On second (fifth?) thought, I don't think (a) is a good idea.  In addition to
> potentially breaking userspace, it also risks preventing genuinely useful sequences.
> E.g. even if no VMM does so today, it's entirely plausible that a VMM could want
> to asynchronously inject an #MC to mimic a broadcast, and that the injection could
> collide with a pending nested VM-Enter.
>
> I'll send a separate (maybe RFC?) series for (b) using patch 1 as a starting point.
> I want to fiddle around with some ideas, and it'll be faster to sketch things out
> in code versus trying to describe things in text.

So you'll apply patch 3 as-is, drop patch 2, and (potentially) take
patch 1 and build another series on top of it?
On Fri, Feb 27, 2026, Yosry Ahmed wrote:
> On Fri, Feb 27, 2026 at 8:34 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Feb 27, 2026, Sean Christopherson wrote:
> > > So instead of patch 1, I want to try either (a) blocking KVM_SET_VCPU_EVENTS,
> > > KVM_X86_SET_MCE, and KVM_SET_GUEST_DEBUG if nested_run_pending=1, *and* follow-up
> > > with the below WARN-spree, or (b) add a separate flag, e.g. nested_run_in_progress
> > > or so, that is set with nested_run_pending, but cleared on an exit to userspace,
> > > and then WARN on _that_, i.e. so that we can detect KVM bugs (the whole point of
> > > the WARN) and hopefully stop playing this losing game of whack-a-mole with syzkaller.
>
> I like the idea of the WARN there, although something in the back of
> my mind tells me I went through this code before with an exception in
> mind that could be injected with nested_run_pending=1, but I can't
> remember it. Maybe it was injected by userspace and all is good.
If there is such a flow, it's likely a bug, i.e. we'd want the WARN. AFAIK,
every single time the WARN has been hit in the last ~2-3 years has been due to
syzkaller.
> That being said, I hate nested_run_in_progress. It's too close to
> nested_run_pending and I am pretty sure they will be mixed up.
Agreed, though the fact that the name is _too_ close means that, aside from the
potential for disaster (minor detail), it's accurate.
One thought is to hide nested_run_in_progress behind a KConfig, so that attempts
to use it for anything but the sanity check(s) would fail the build.  I don't
really want to create yet another KVM_PROVE_xxx though, but unlike KVM_PROVE_MMU,
I think we want this enabled in production.
I'll chew on this a bit...
> exception_from_userspace's name made me think this is something we
> could key off to WARN, but it's meant to morph queued exceptions from
> userspace into an "exception_vmexit" if needed. The field name is
> generic but its functionality isn't, maybe it should have been called
> exception_check_vmexit or something. Anyway..
No?  It's not a "check", it's literally a pending exception that has been morphed
to a VM-Exit.
Hmm, though looking at all that code again, I bet we can dedup a _lot_ of code by
adding kvm_queued_exception.is_vmexit instead of tracking a completely separate
exception. The only potential hiccup I can think of is if there's some wrinkle
with the interaction with already pending/injected exceptions. Pending should be
fine, as the VM-Exit has priority.
Ah, scratch that idea, injected exceptions need to be tracked separately, e.g. see
vmcs12_save_pending_event(). It's correct for vmx_check_nested_events() to
deliver a VM-Exit even if there is an already-injected exception, e.g. if an EPT
Violation in L1's purview triggers when vectoring an injected exception, but in
that case, KVM needs to save the injected exception information into vmc{b,s}12.
> That gave me an idea though, can we add a field to
> kvm_queued_exception to identify the origin of the exception
> (userspace vs. KVM)? Then we can key the warning off of that.
That would incur non-trivial maintenance costs, and it would be tricky to get the
broader protection of the existing WARNing "right". E.g. how would KVM know that
the VM-Exit was originally induced by an exception that was queued by userspace?
> We can potentially also avoid adding the field and just plumb the
> argument through to kvm_multiple_exception(), and WARN there if
> nested_run_pending is set and the origin is not userspace?
Not really, because kvm_vcpu_ioctl_x86_set_vcpu_events() doesn't use
kvm_queue_exception(), it stuffs things directly.
That said, if you want to try and code it up, I say go for it. Worst case scenario
you'll have wasted a bit of time.
> > > I think I'm leaning toward (b)? Except for KVM_SET_GUEST_DEBUG, where userspace
> > > is trying to interpose on the guest, restricting ioctls doesn't really add any
> > > value in practice. Yeah, in theory it could _maybe_ prevent userspace from shooting
> > > itself in the foot, but practically speaking, if userspace is restoring state into
> > > a vCPU with nested_run_pending=1, it's either playing on expert mode or is already
> > > completely broken.
> > >
> > > My only hesitation with (b) is that KVM wouldn't be entirely consistent, since
> > > vmx_unhandleable_emulation_required() _does_ explicitly reject a "userspace did
> > > something stupid with nested_run_pending=1" case. So from that perspective, part
> > > of me wants to get greedy and try for (a).
> >
> > On second (fifth?) thought, I don't think (a) is a good idea. In addition to
> > potentially breaking userspace, it also risks preventing genuinely useful sequences.
> > E.g. even if no VMM does so today, it's entirely plausible that a VMM could want
> > to asynchronously inject an #MC to mimic a broadcast, and that the injection could
> > collide with a pending nested VM-Enter.
> >
> > I'll send a separate (maybe RFC?) series for (b) using patch 1 as a starting point.
> > I want to fiddle around with some ideas, and it'll be faster to sketch things out
> > in code versus trying to describe things in text.
>
> So you'll apply patch 3 as-is, drop patch 2, and (potentially) take
> patch 1 and build another series on top of it?
Yeah, that's where I'm trending.
> > That being said, I hate nested_run_in_progress. It's too close to
> > nested_run_pending and I am pretty sure they will be mixed up.
>
> Agreed, though the fact that name is _too_ close means that, aside from the
> potential for disaster (minor detail), it's accurate.
>
> One thought is to hide nested_run_in_progress beyond a KConfig, so that attempts
> to use it for anything but the sanity check(s) would fail the build. I don't
> really want to create yet another KVM_PROVE_xxx though, but unlike KVM_PROVE_MMU,
> I think we want to this enabled in production.
>
> I'll chew on this a bit...
Maybe (if we go this direction) name it very explicitly
warn_on_nested_exception if it's only intended to be used for the
sanity checks?
>
> > exception_from_userspace's name made me think this is something we
> > could key off to WARN, but it's meant to morph queued exceptions from
> > userspace into an "exception_vmexit" if needed. The field name is
> > generic but its functionality isn't, maybe it should have been called
> > exception_check_vmexit or something. Anyway..
>
> No?  It's not a "check", it's literally a pending exception that has been morphed
> to a VM-Exit.
I meant that the exception_from_userspace flag means "KVM should check
if this exception should be morphed to a VM-Exit". It doesn't mean
that the exception has already morphed or will necessarily morph. So
exception_check_vmexit makes sense to me, if it's set, KVM checks if
we need to morph the exception to a VM-Exit.
> > That gave me an idea though, can we add a field to
> > kvm_queued_exception to identify the origin of the exception
> > (userspace vs. KVM)? Then we can key the warning off of that.
>
> That would incur non-trivial maintenance costs, and it would be tricky to get the
> broader protection of the existing WARNing "right". E.g. how would KVM know that
> the VM-Exit was originally induced by an exception that was queued by userspace?
It should have the info when morphing a pending exception to a
VM-Exit, assuming whoever is queuing the exception is passing it in,
but yeah I see how this can be a burden.
>
> > We can potentially also avoid adding the field and just plumb the
> > argument through to kvm_multiple_exception(), and WARN there if
> > nested_run_pending is set and the origin is not userspace?
>
> Not really, because kvm_vcpu_ioctl_x86_set_vcpu_events() doesn't use
> kvm_queue_exception(), it stuffs things directly.
Right, what I had in mind was that by default exceptions are assumed
to be queued by KVM, so kvm_vcpu_ioctl_x86_set_vcpu_events() doesn't
need to change. Basically, if a code path is queuing an exception from
userspace, it should use a new variant of kvm_queue_exception() (e.g.
kvm_queue_exception_u()). If it's stuffing things directly, nothing to
do. I think the code paths queuing exceptions from userspace should be
limited, so it should be fine to do this.
That being said, this still WARNs on the queuing side, not on the
checking side, so if you think that's not the right thing to do in
general then scratch this too.
>
> That said, if you want to try and code it up, I say go for it. Worst case scenario
> you'll have wasted a bit of time.
I meant something like this (completely untested); this just WARNs if
we try to queue the exception, it doesn't really stop it:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c8aacf1fa67f..6f4148eae08be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -836,11 +836,14 @@ static void kvm_queue_exception_vmexit(struct kvm_vcpu *vcpu, unsigned int vecto
static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
bool has_error, u32 error_code,
- bool has_payload, unsigned long payload)
+ bool has_payload, unsigned long payload,
+ bool from_userspace)
{
u32 prev_nr;
int class1, class2;
+ WARN_ON_ONCE(!from_userspace && vcpu->arch.nested_run_pending);
+
kvm_make_request(KVM_REQ_EVENT, vcpu);
/*
@@ -899,7 +902,7 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
{
- kvm_multiple_exception(vcpu, nr, false, 0, false, 0);
+ kvm_multiple_exception(vcpu, nr, false, 0, false, 0, false);
}
EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_queue_exception);
@@ -907,14 +910,19 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_queue_exception);
void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr,
unsigned long payload)
{
- kvm_multiple_exception(vcpu, nr, false, 0, true, payload);
+ kvm_multiple_exception(vcpu, nr, false, 0, true, payload, false);
}
EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_queue_exception_p);
static void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
u32 error_code, unsigned long payload)
{
- kvm_multiple_exception(vcpu, nr, true, error_code, true, payload);
+ kvm_multiple_exception(vcpu, nr, true, error_code, true, payload, false);
+}
+
+static void kvm_queue_exception_u(struct kvm_vcpu *vcpu, unsigned nr)
+{
+ kvm_multiple_exception(vcpu, nr, false, 0, false, 0, true);
}
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned int nr,
@@ -1015,7 +1023,7 @@ void kvm_inject_nmi(struct kvm_vcpu *vcpu)
void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
{
- kvm_multiple_exception(vcpu, nr, true, error_code, false, 0);
+ kvm_multiple_exception(vcpu, nr, true, error_code, false, 0, false);
}
EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_queue_exception_e);
@@ -5519,7 +5527,7 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
banks[3] = mce->misc;
vcpu->arch.mcg_status = mce->mcg_status;
banks[1] = mce->status;
- kvm_queue_exception(vcpu, MC_VECTOR);
+ kvm_queue_exception_u(vcpu, MC_VECTOR);
} else if (!(banks[1] & MCI_STATUS_VAL)
|| !(banks[1] & MCI_STATUS_UC)) {
if (banks[1] & MCI_STATUS_VAL)
@@ -12546,9 +12554,9 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
if (kvm_is_exception_pending(vcpu) || vcpu->arch.exception.injected)
goto out;
if (dbg->control & KVM_GUESTDBG_INJECT_DB)
- kvm_queue_exception(vcpu, DB_VECTOR);
+ kvm_queue_exception_u(vcpu, DB_VECTOR);
else
- kvm_queue_exception(vcpu, BP_VECTOR);
+ kvm_queue_exception_u(vcpu, BP_VECTOR);
}
/*
On Fri, Feb 27, 2026, Yosry Ahmed wrote:
> > > That being said, I hate nested_run_in_progress. It's too close to
> > > nested_run_pending and I am pretty sure they will be mixed up.
> >
> > Agreed, though the fact that name is _too_ close means that, aside from the
> > potential for disaster (minor detail), it's accurate.
> >
> > One thought is to hide nested_run_in_progress beyond a KConfig, so that attempts
> > to use it for anything but the sanity check(s) would fail the build. I don't
> > really want to create yet another KVM_PROVE_xxx though, but unlike KVM_PROVE_MMU,
> > I think we want to this enabled in production.
> >
> > I'll chew on this a bit...
>
> Maybe (if we go this direction) name it very explicitly
> warn_on_nested_exception if it's only intended to be used for the
> sanity checks?
It's not just about exceptions though. That's the case that has caused a rash
of recent problems, but the rule isn't specific to exceptions, it's very broadly
Thou Shalt Not Cancel VMRUN.
I think that's where there's some disconnect. We can't make the nested_run_pending
warnings go away by adding more sanity checks, and I am dead set against removing
those warnings.
Aha! Idea. What if we turn nested_run_pending into a u8, and use a magic value
of '2' to indicate that userspace gained control of the CPU since nested_run_pending
was set, and then only WARN on nested_run_pending==1? That way we don't have to
come up with a new name, and there's zero chance of nested_run_pending and something
like nested_run_in_progress getting out of sync.
---
arch/x86/include/asm/kvm_host.h | 6 +++++-
arch/x86/kvm/svm/nested.c | 3 ++-
arch/x86/kvm/vmx/nested.c | 4 ++--
arch/x86/kvm/x86.c | 7 +++++++
arch/x86/kvm/x86.h | 10 ++++++++++
5 files changed, 26 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 19b3790e5e99..a8d39b3aff6a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1104,8 +1104,12 @@ struct kvm_vcpu_arch {
* can only occur at instruction boundaries. The only exception is
* VMX's "notify" exits, which exist in large part to break the CPU out
* of infinite ucode loops, but can corrupt vCPU state in the process!
+ *
+ * For all intents and purposes, this is a boolean, but it's tracked as
+ * a u8 so that KVM can detect when userspace may have stuffed vCPU
+ * state and generated an architecturally-impossible VM-Exit.
*/
- bool nested_run_pending;
+ u8 nested_run_pending;
#if IS_ENABLED(CONFIG_HYPERV)
hpa_t hv_root_tdp;
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index c2d4c9c63146..77ff9ead957c 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1138,7 +1138,8 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
/* Exit Guest-Mode */
leave_guest_mode(vcpu);
svm->nested.vmcb12_gpa = 0;
- WARN_ON_ONCE(vcpu->arch.nested_run_pending);
+
+ kvm_warn_on_nested_run_pending(vcpu);
kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 031075467a6d..5659545360dc 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5042,7 +5042,7 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
vmx->nested.mtf_pending = false;
/* trying to cancel vmlaunch/vmresume is a bug */
- WARN_ON_ONCE(vcpu->arch.nested_run_pending);
+ kvm_warn_on_nested_run_pending(vcpu);
#ifdef CONFIG_KVM_HYPERV
if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
@@ -6665,7 +6665,7 @@ bool nested_vmx_reflect_vmexit(struct kvm_vcpu *vcpu)
unsigned long exit_qual;
u32 exit_intr_info;
- WARN_ON_ONCE(vcpu->arch.nested_run_pending);
+ kvm_warn_on_nested_run_pending(vcpu);
/*
* Late nested VM-Fail shares the same flow as nested VM-Exit since KVM
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db3f393192d9..30ff5a755572 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12023,6 +12023,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
if (r <= 0)
goto out;
+ /*
+ * If userspace may have modified vCPU state, mark nested_run_pending
+ * as "untrusted" to avoid triggering false-positive WARNs.
+ */
+ if (vcpu->arch.nested_run_pending == 1)
+ vcpu->arch.nested_run_pending = 2;
+
r = vcpu_run(vcpu);
out:
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 94d4f07aaaa0..d3003c8be961 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -188,6 +188,16 @@ static inline bool kvm_can_set_cpuid_and_feature_msrs(struct kvm_vcpu *vcpu)
return vcpu->arch.last_vmentry_cpu == -1 && !is_guest_mode(vcpu);
}
+/*
+ * WARN if a nested VM-Enter is pending completion, and userspace hasn't gained
+ * control since the nested VM-Enter was initiated (in which case, userspace
+ * may have modified vCPU state to induce an architecturally invalid VM-Exit).
+ */
+static inline void kvm_warn_on_nested_run_pending(struct kvm_vcpu *vcpu)
+{
+ WARN_ON_ONCE(vcpu->arch.nested_run_pending == 1);
+}
+
static inline void kvm_set_mp_state(struct kvm_vcpu *vcpu, int mp_state)
{
vcpu->arch.mp_state = mp_state;
base-commit: a68a4bbc5b9ce5b722473399f05cb05217abaee8
--
On Mon, Mar 2, 2026 at 3:22 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Feb 27, 2026, Yosry Ahmed wrote:
> > > > That being said, I hate nested_run_in_progress. It's too close to
> > > > nested_run_pending and I am pretty sure they will be mixed up.
> > >
> > > Agreed, though the fact that name is _too_ close means that, aside from the
> > > potential for disaster (minor detail), it's accurate.
> > >
> > > One thought is to hide nested_run_in_progress beyond a KConfig, so that attempts
> > > to use it for anything but the sanity check(s) would fail the build.  I don't
> > > really want to create yet another KVM_PROVE_xxx though, but unlike KVM_PROVE_MMU,
> > > I think we want to this enabled in production.
> > >
> > > I'll chew on this a bit...
> >
> > Maybe (if we go this direction) name it very explicitly
> > warn_on_nested_exception if it's only intended to be used for the
> > sanity checks?
>
> It's not just about exceptions though.  That's the case that has caused a rash
> of recent problems, but the rule isn't specific to exceptions, it's very broadly
> Thou Shalt Not Cancel VMRUN.
>
> I think that's where there's some disconnect.  We can't make the nested_run_pending
> warnings go away by adding more sanity checks, and I am dead set against removing
> those warnings.
>
> Aha!  Idea.  What if we turn nested_run_pending into a u8, and use a magic value
> of '2' to indicate that userspace gained control of the CPU since nested_run_pending
> was set, and then only WARN on nested_run_pending==1?  That way we don't have to
> come up with a new name, and there's zero chance of nested_run_pending and something
> like nested_run_in_progress getting out of sync.

Yeah this should work, the only thing I would change is using macros
instead of 1 and 2 for readability.
On Mon, Mar 02, 2026, Yosry Ahmed wrote:
> On Mon, Mar 2, 2026 at 3:22 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Feb 27, 2026, Yosry Ahmed wrote:
> > > > > That being said, I hate nested_run_in_progress. It's too close to
> > > > > nested_run_pending and I am pretty sure they will be mixed up.
> > > >
> > > > Agreed, though the fact that name is _too_ close means that, aside from the
> > > > potential for disaster (minor detail), it's accurate.
> > > >
> > > > One thought is to hide nested_run_in_progress beyond a KConfig, so that attempts
> > > > to use it for anything but the sanity check(s) would fail the build.  I don't
> > > > really want to create yet another KVM_PROVE_xxx though, but unlike KVM_PROVE_MMU,
> > > > I think we want to this enabled in production.
> > > >
> > > > I'll chew on this a bit...
> > >
> > > Maybe (if we go this direction) name it very explicitly
> > > warn_on_nested_exception if it's only intended to be used for the
> > > sanity checks?
> >
> > It's not just about exceptions though.  That's the case that has caused a rash
> > of recent problems, but the rule isn't specific to exceptions, it's very broadly
> > Thou Shalt Not Cancel VMRUN.
> >
> > I think that's where there's some disconnect.  We can't make the nested_run_pending
> > warnings go away by adding more sanity checks, and I am dead set against removing
> > those warnings.
> >
> > Aha!  Idea.  What if we turn nested_run_pending into a u8, and use a magic value
> > of '2' to indicate that userspace gained control of the CPU since nested_run_pending
> > was set, and then only WARN on nested_run_pending==1?  That way we don't have to
> > come up with a new name, and there's zero chance of nested_run_pending and something
> > like nested_run_in_progress getting out of sync.
>
> Yeah this should work, the only thing I would change is using macros
> instead of 1 and 2 for readability.

I was "this" close to using an enum or #define, but I couldn't figure out a clean
solution to this code:

	vcpu->arch.nested_run_pending = !!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING);

as I didn't want to end up with effectively:

	if (true)
		x = 1;
	else
		x = 0;

But thinking more on it, that code is inherently untrusted, so it can be this:

	if (kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING)
		vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING_UNTRUSTED;
	else
		vcpu->arch.nested_run_pending = 0;

which is pretty much the same, but at least is a bit more than a convoluted cast
from a bool to an int.

FWIW, I verified this makes the C reproducer happy.