Fix a semi-theoretical race condition in the reading of pageready_pending
in kvm_arch_async_page_present_queued().

Only trust the value of pageready_pending if the vCPU is about to
enter guest mode (vcpu->mode).

To achieve this, read vcpu->mode using smp_load_acquire(), which is
paired with smp_store_release() in vcpu_enter_guest().

Then, only if vcpu->mode is IN_GUEST_MODE, trust the value of
pageready_pending, because it was written before the mode transition
and its correct value is therefore visible.

Also, if the above check is true (the vCPU is in guest mode and
pageready_pending is set), avoid raising the request on the target vCPU.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
arch/x86/kvm/x86.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9018d56b4b0a..3d45a4cd08a4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 
 void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
 {
-	kvm_make_request(KVM_REQ_APF_READY, vcpu);
-	if (!vcpu->arch.apf.pageready_pending)
+	/* Pairs with smp_store_release in vcpu_enter_guest. */
+	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
+	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
+
+	if (!in_guest_mode || !page_ready_pending) {
+		kvm_make_request(KVM_REQ_APF_READY, vcpu);
 		kvm_vcpu_kick(vcpu);
+	}
 }
 
 bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu)
--
2.49.0
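For context, the ordering the commit message leans on is the standard
acquire/release pairing on vcpu->mode. Below is a minimal sketch of the two
sides, assuming the entry path publishes IN_GUEST_MODE with a store-release,
which is what the patch's "Pairs with smp_store_release in vcpu_enter_guest"
comment refers to; the names come from the patch, the shape is illustrative:

	/* vCPU task, on the entry path (assumed): */
	smp_store_release(&vcpu->mode, IN_GUEST_MODE);

	/* Async #PF worker (kvm_arch_async_page_present_queued()): */
	if (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE) {
		/*
		 * The acquire/release pair guarantees that everything the
		 * vCPU task wrote before publishing IN_GUEST_MODE, including
		 * its last update of pageready_pending, is visible here.
		 */
	}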
On 8/13/25 21:23, Maxim Levitsky wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9018d56b4b0a..3d45a4cd08a4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> 
>  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
>  {
> -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> -	if (!vcpu->arch.apf.pageready_pending)
> +	/* Pairs with smp_store_release in vcpu_enter_guest. */
> +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> +
> +	if (!in_guest_mode || !page_ready_pending) {
> +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
>  		kvm_vcpu_kick(vcpu);
> +	}

Unlike Sean, I think the race exists in the abstract and is not benign,
but there are already enough memory barriers to tame it.  That said,
in_guest_mode is a red herring.

The way I look at it is through the very common wake/sleep (or
kick/check) pattern that has smp_mb() on both sides.  The pair you are
considering consists of this function (the "kick side") and the wrmsr
handler for MSR_KVM_ASYNC_PF_ACK (the "check side").

Let's see how the synchronization between the two sides happens:

- here, you need to check whether to inject the interrupt.  It looks
  like this:

	kvm_make_request(KVM_REQ_APF_READY, vcpu);
	smp_mb();
	if (!READ_ONCE(page_ready_pending))
		kvm_vcpu_kick(vcpu);

- on the other side, after clearing page_ready_pending, there will be a
  check for a wakeup:

	WRITE_ONCE(page_ready_pending, false);
	smp_mb();
	if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
		kvm_check_async_pf_completion(vcpu);

  except that the "if" is not in kvm_set_msr_common(); it will happen
  naturally as part of the first re-entry.

So let's look at the changes you need to make, in order to make the
code look like the above.

- using READ_ONCE/WRITE_ONCE for pageready_pending never hurts

- here in kvm_arch_async_page_present_queued(), a smp_mb__after_atomic()
  (compiler barrier on x86) is missing after kvm_make_request():

	kvm_make_request(KVM_REQ_APF_READY, vcpu);
	/*
	 * Tell vCPU to wake up before checking if they need an
	 * interrupt.  Pairs with any memory barrier between
	 * the clearing of pageready_pending and vCPU entry.
	 */
	smp_mb__after_atomic();
	if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
		kvm_vcpu_kick(vcpu);

- in kvm_set_msr_common(), there are two possibilities.  The easy one
  is to just use smp_store_mb() to clear
  vcpu->arch.apf.pageready_pending.  The other would be a comment like
  this:

	WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
	/*
	 * Ensure they know to wake this vCPU up, before the vCPU
	 * next checks KVM_REQ_APF_READY.  Use an existing memory
	 * barrier between here and the next kvm_request_pending(),
	 * for example in vcpu_run().
	 */
	/* smp_mb(); */

  plus a memory barrier in common code like this:

	diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
	index 706b6fd56d3c..e302c617e4b2 100644
	--- a/arch/x86/kvm/x86.c
	+++ b/arch/x86/kvm/x86.c
	@@ -11236,6 +11236,11 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
	 		if (r <= 0)
	 			break;
	 
	+		/*
	+		 * Provide a memory barrier between handle_exit and the
	+		 * kvm_request_pending() read in vcpu_enter_guest().  It
	+		 * pairs with any barrier after kvm_make_request(), for
	+		 * example in kvm_arch_async_page_present_queued().
	+		 */
	+		smp_mb__before_atomic();
	 		kvm_clear_request(KVM_REQ_UNBLOCK, vcpu);
	 		if (kvm_xen_has_pending_events(vcpu))
	 			kvm_xen_inject_pending_events(vcpu);

The only advantage of this second, more complex approach is that it
shows *why* the race was not happening.  The 50 clock cycles saved on
an MSR write are not worth the extra complication, and on a quick grep
I could not find other cases which rely on the same implicit barriers.

So I'd say use smp_store_mb(), with a comment about the pairing with
kvm_arch_async_page_present_queued(); and write in the commit message
that the race wasn't happening thanks to unrelated memory barriers
between handle_exit and the kvm_request_pending() read in
vcpu_enter_guest().

Thanks,

Paolo
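As an aside for readers following along: smp_store_mb(), the "easy" option
above, folds the store and the full barrier into a single call.  Per
Documentation/memory-barriers.txt it is roughly a WRITE_ONCE() followed by
smp_mb(), so the suggestion sketches out to:

	/* Paolo's "easy" option: */
	smp_store_mb(vcpu->arch.apf.pageready_pending, false);

	/* ...which is approximately equivalent to: */
	WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
	smp_mb();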
On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> On 8/13/25 21:23, Maxim Levitsky wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 9018d56b4b0a..3d45a4cd08a4 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> >  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
> >  {
> > -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > -	if (!vcpu->arch.apf.pageready_pending)
> > +	/* Pairs with smp_store_release in vcpu_enter_guest. */
> > +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> > +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> > +
> > +	if (!in_guest_mode || !page_ready_pending) {
> > +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
> >  		kvm_vcpu_kick(vcpu);
> > +	}
> 
> Unlike Sean, I think the race exists in the abstract and is not benign

How is it not benign?  I never said the race doesn't exist, I said that
consuming a stale vcpu->arch.apf.pageready_pending in
kvm_arch_async_page_present_queued() is benign.
On 9/23/25 20:55, Sean Christopherson wrote:
> On Tue, Sep 23, 2025, Paolo Bonzini wrote:
>> On 8/13/25 21:23, Maxim Levitsky wrote:
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 9018d56b4b0a..3d45a4cd08a4 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>>>  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
>>>  {
>>> -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
>>> -	if (!vcpu->arch.apf.pageready_pending)
>>> +	/* Pairs with smp_store_release in vcpu_enter_guest. */
>>> +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
>>> +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
>>> +
>>> +	if (!in_guest_mode || !page_ready_pending) {
>>> +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
>>>  		kvm_vcpu_kick(vcpu);
>>> +	}
>>
>> Unlike Sean, I think the race exists in the abstract and is not benign
>
> How is it not benign?  I never said the race doesn't exist, I said that
> consuming a stale vcpu->arch.apf.pageready_pending in
> kvm_arch_async_page_present_queued() is benign.

In principle there is a possibility that a KVM_REQ_APF_READY is missed.
Just by the reading of the specs, without a smp_mb__after_atomic() this
is broken:

	kvm_make_request(KVM_REQ_APF_READY, vcpu);
	if (!vcpu->arch.apf.pageready_pending)
		kvm_vcpu_kick(vcpu);

It won't happen because set_bit() is written with asm("memory"),
because x86 set_bit() does prevent reordering at the processor level,
etc.  In other words the race is only avoided by the fact that compiler
reorderings are prevented even in cases that memory-barriers.txt does
not promise.

Paolo
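To make the theoretical reordering concrete, here is a minimal stand-alone C
sketch of the kick/check shape, not KVM code: "request" models the
KVM_REQ_APF_READY bit, "pending" models vcpu->arch.apf.pageready_pending, and
kick()/handle() are hypothetical stand-ins.  With only relaxed accesses,
nothing in the memory model forbids each side's load from being reordered
before its store, which is exactly the missed-wakeup window being discussed:

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	static atomic_bool request;	/* models the KVM_REQ_APF_READY bit */
	static atomic_bool pending;	/* models pageready_pending */

	static void kick(void)   { puts("kick vCPU"); }
	static void handle(void) { puts("handle request"); }

	static void kick_side(void)	/* kvm_arch_async_page_present_queued() */
	{
		atomic_store_explicit(&request, true, memory_order_relaxed);
		/* Missing full barrier: the load below may be reordered
		 * before the store above, skipping the kick based on a
		 * stale value of "pending". */
		if (!atomic_load_explicit(&pending, memory_order_relaxed))
			kick();
	}

	static void check_side(void)	/* ASYNC_PF_ACK handler + re-entry */
	{
		atomic_store_explicit(&pending, false, memory_order_relaxed);
		/* Mirror image of the problem: the request check may be
		 * reordered before "pending" is cleared. */
		if (atomic_load_explicit(&request, memory_order_relaxed))
			handle();
	}

	int main(void)
	{
		/* The race needs the two sides on different CPUs; calling
		 * them sequentially here only demonstrates the shape. */
		kick_side();
		check_side();
		return 0;
	}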
On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> On 9/23/25 20:55, Sean Christopherson wrote:
> > On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> > > On 8/13/25 21:23, Maxim Levitsky wrote:
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 9018d56b4b0a..3d45a4cd08a4 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> > > >  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
> > > >  {
> > > > -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > > -	if (!vcpu->arch.apf.pageready_pending)
> > > > +	/* Pairs with smp_store_release in vcpu_enter_guest. */
> > > > +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> > > > +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> > > > +
> > > > +	if (!in_guest_mode || !page_ready_pending) {
> > > > +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > >  		kvm_vcpu_kick(vcpu);
> > > > +	}
> > >
> > > Unlike Sean, I think the race exists in the abstract and is not benign
> >
> > How is it not benign?  I never said the race doesn't exist, I said that
> > consuming a stale vcpu->arch.apf.pageready_pending in
> > kvm_arch_async_page_present_queued() is benign.
>
> In principle there is a possibility that a KVM_REQ_APF_READY is missed.

I think you mean a kick (wakeup or IPI) is missed, not that the
APF_READY itself is missed.  I.e. KVM_REQ_APF_READY will never be lost,
KVM just might enter the guest or schedule out the vCPU with the flag
set.

All in all, I think we're in violent agreement.  I agree that
kvm_vcpu_kick() could be missed (theoretically), but I'm saying that
missing the kick would be benign due to a myriad of other barriers and
checks, i.e. that the vCPU is guaranteed to see KVM_REQ_APF_READY
anyways.

E.g. my suggestion earlier regarding OUTSIDE_GUEST_MODE was to rely on
the smp_mb__after_srcu_read_{,un}lock() barriers in vcpu_enter_guest()
to ensure KVM_REQ_APF_READY would be observed before trying VM-Enter,
and that if KVM might be in the process of emulating HLT (blocking),
that either KVM_REQ_APF_READY is visible to the vCPU or that
kvm_arch_async_page_present() wakes the vCPU.

Oh, hilarious, async_pf_execute() also does an unconditional
__kvm_vcpu_wake_up().

Huh.  But isn't that a real bug?  KVM doesn't consider
KVM_REQ_APF_READY to be a wake event, so isn't this an actual race?

  vCPU                              async #PF
  kvm_check_async_pf_completion()
  pageready_pending = false

  VM-Enter
  HLT
  VM-Exit
                                    kvm_make_request(KVM_REQ_APF_READY, vcpu)
                                    kvm_vcpu_kick(vcpu)    // nop as the vCPU isn't blocking, yet
                                    __kvm_vcpu_wake_up()   // nop for the same reason
  vcpu_block()

  <hang>

On x86, the "page ready" IRQ is only injected from vCPU context, so
AFAICT nothing is guaranteed to wake the vCPU in the above sequence.

> broken:
> 
> 	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> 	if (!vcpu->arch.apf.pageready_pending)
> 		kvm_vcpu_kick(vcpu);
> 
> It won't happen because set_bit() is written with asm("memory"),
> because x86 set_bit() does prevent reordering at the processor level,
> etc.
> 
> In other words the race is only avoided by the fact that compiler
> reorderings are prevented even in cases that memory-barriers.txt does
> not promise.
On 9/23/25 17:58, Paolo Bonzini wrote:
> - on the other side, after clearing page_ready_pending, there will be
>   a check for a wakeup:
> 
> 	WRITE_ONCE(page_ready_pending, false);
> 	smp_mb();
> 	if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
> 		kvm_check_async_pf_completion(vcpu);
> 
> except that the "if" is not in kvm_set_msr_common(); it will happen
> naturally as part of the first re-entry.

Important thing that I forgot to mention: the above only covers the
race case.  There's also the case where KVM_REQ_APF_READY has been
cleared already, and for that one the call to
kvm_check_async_pf_completion() is *also* needed in
kvm_set_msr_common().

Paolo

> So let's look at the changes you need to make, in order to make the
> code look like the above.
> 
> - using READ_ONCE/WRITE_ONCE for pageready_pending never hurts
> 
> - here in kvm_arch_async_page_present_queued(), a smp_mb__after_atomic()
>   (compiler barrier on x86) is missing after kvm_make_request():
> 
> 	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> 	/*
> 	 * Tell vCPU to wake up before checking if they need an
> 	 * interrupt.  Pairs with any memory barrier between
> 	 * the clearing of pageready_pending and vCPU entry.
> 	 */
> 	smp_mb__after_atomic();
> 	if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
> 		kvm_vcpu_kick(vcpu);
> 
> - in kvm_set_msr_common(), there are two possibilities.
>   The easy one is to just use smp_store_mb() to clear
>   vcpu->arch.apf.pageready_pending.  The other would be a comment
>   like this:
> 
> 	WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
> 	/*
> 	 * Ensure they know to wake this vCPU up, before the vCPU
> 	 * next checks KVM_REQ_APF_READY.  Use an existing memory
> 	 * barrier between here and the next kvm_request_pending(),
> 	 * for example in vcpu_run().
> 	 */
> 	/* smp_mb(); */
> 
> plus a memory barrier in common code like this:
> 
> 	diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> 	index 706b6fd56d3c..e302c617e4b2 100644
> 	--- a/arch/x86/kvm/x86.c
> 	+++ b/arch/x86/kvm/x86.c
> 	@@ -11236,6 +11236,11 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
> 	 		if (r <= 0)
> 	 			break;
> 	 
> 	+		/*
> 	+		 * Provide a memory barrier between handle_exit and the
> 	+		 * kvm_request_pending() read in vcpu_enter_guest().  It
> 	+		 * pairs with any barrier after kvm_make_request(), for
> 	+		 * example in kvm_arch_async_page_present_queued().
> 	+		 */
> 	+		smp_mb__before_atomic();
> 	 		kvm_clear_request(KVM_REQ_UNBLOCK, vcpu);
> 	 		if (kvm_xen_has_pending_events(vcpu))
> 	 			kvm_xen_inject_pending_events(vcpu);
> 
> The only advantage of this second, more complex approach is that
> it shows *why* the race was not happening.  The 50 clock cycles
> saved on an MSR write are not worth the extra complication, and
> on a quick grep I could not find other cases which rely on the same
> implicit barriers.  So I'd say use smp_store_mb(), with a comment
> about the pairing with kvm_arch_async_page_present_queued(); and write
> in the commit message that the race wasn't happening thanks to unrelated
> memory barriers between handle_exit and the kvm_request_pending()
> read in vcpu_enter_guest().
> 
> Thanks,
> 
> Paolo
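Putting the two halves of the suggestion together, the MSR_KVM_ASYNC_PF_ACK
handler would presumably end up looking something like the sketch below.
This is only an illustration of the direction being discussed, built on the
handler shape visible in the diff in the next message; it is not a tested
patch:

	case MSR_KVM_ASYNC_PF_ACK:
		if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF_INT))
			return 1;
		if (data & 0x1) {
			/*
			 * Pairs with the barrier after kvm_make_request()
			 * in kvm_arch_async_page_present_queued().
			 */
			smp_store_mb(vcpu->arch.apf.pageready_pending, false);

			/*
			 * Needed even when KVM_REQ_APF_READY was already
			 * consumed before the guest acked the token.
			 */
			kvm_check_async_pf_completion(vcpu);
		}
		break;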
On Wed, Aug 13, 2025, Maxim Levitsky wrote:
> Fix a semi-theoretical race condition in the reading of
> pageready_pending in kvm_arch_async_page_present_queued().

This needs to explain what can actually go wrong if the race is "hit".

After staring at all of this for far, far too long, I'm 99.9% confident
the race is benign.

If the worker "incorrectly" sees pageready_pending as %false, then the
result is simply a spurious kick, and that spurious kick is all but
guaranteed to be a nop since if kvm_arch_async_page_present() is
setting the flag, then (a) the vCPU isn't blocking and (b) isn't
IN_GUEST_MODE and thus won't be IPI'd.

If the worker incorrectly sees pageready_pending as %true, then the
vCPU has *just* written MSR_KVM_ASYNC_PF_ACK, and is guaranteed to
observe and process KVM_REQ_APF_READY before re-entering the guest, and
the sole purpose of the kick is to ensure the request is processed.

> Only trust the value of pageready_pending if the vCPU is about to
> enter guest mode (vcpu->mode).

This is misleading, e.g. IN_GUEST_MODE can be true if the vCPU just
*exited*.  All IN_GUEST_MODE says is that the vCPU task is somewhere in
KVM's inner run loop.

> To achieve this, read vcpu->mode using smp_load_acquire(), which is
> paired with smp_store_release() in vcpu_enter_guest().
> 
> Then, only if vcpu->mode is IN_GUEST_MODE, trust the value of
> pageready_pending, because it was written before the mode transition
> and its correct value is therefore visible.
> 
> Also, if the above check is true (the vCPU is in guest mode and
> pageready_pending is set), avoid raising the request on the target
> vCPU.

Why?  At worst, a dangling KVM_REQ_APF_READY will cause KVM to bail
from the fastpath when it's not strictly necessary to do so.  On the
other hand, a missing request could hang the guest.  So I don't see any
reason to try and be super precise when setting KVM_REQ_APF_READY.

> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/x86.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9018d56b4b0a..3d45a4cd08a4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> 
>  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
>  {
> -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> -	if (!vcpu->arch.apf.pageready_pending)
> +	/* Pairs with smp_store_release in vcpu_enter_guest. */
> +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);

In terms of arch.apf.pageready_pending being modified, it's not
IN_GUEST_MODE that's special, it's OUTSIDE_GUEST_MODE that's special,
because that's the only time the task that holds vcpu->mutex can clear
pageready_pending.

> +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);

This should be paired with WRITE_ONCE() on the vCPU.

> +
> +	if (!in_guest_mode || !page_ready_pending) {
> +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
>  		kvm_vcpu_kick(vcpu);
> +	}

Given that the race is guaranteed to be benign (assuming my analysis is
correct), I definitely think there should be a comment here explaining
that pageready_pending is "technically unstable".  Otherwise, it takes
a lot of staring to understand what this code is actually doing.

I also think it makes sense to do the bare minimum for
OUTSIDE_GUEST_MODE, which is to wake the vCPU.  Because calling
kvm_vcpu_kick() when the vCPU is known to not be IN_GUEST_MODE is
weird.

For the code+comment, how about this?
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bdf7ef0b535..d721fab3418d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4000,7 +4000,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF_INT))
 			return 1;
 		if (data & 0x1) {
-			vcpu->arch.apf.pageready_pending = false;
+			WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
 			kvm_check_async_pf_completion(vcpu);
 		}
 		break;
@@ -13457,7 +13457,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 	if ((work->wakeup_all || work->notpresent_injected) &&
 	    kvm_pv_async_pf_enabled(vcpu) &&
 	    !apf_put_user_ready(vcpu, work->arch.token)) {
-		vcpu->arch.apf.pageready_pending = true;
+		WRITE_ONCE(vcpu->arch.apf.pageready_pending, true);
 		kvm_apic_set_irq(vcpu, &irq, NULL);
 	}
 
@@ -13468,7 +13468,20 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
 {
 	kvm_make_request(KVM_REQ_APF_READY, vcpu);
-	if (!vcpu->arch.apf.pageready_pending)
+
+	/*
+	 * Don't kick the vCPU if it has an outstanding "page ready" event as
+	 * KVM won't be able to deliver the next "page ready" token until the
+	 * outstanding one is handled.  Ignore pageready_pending if the vCPU is
+	 * outside "guest mode", i.e. if KVM might be sending "page ready" or
+	 * servicing a MSR_KVM_ASYNC_PF_ACK write, as the flag is technically
+	 * unstable.  However, in that case, there's obviously no need to kick
+	 * the vCPU out of the guest, so just ensure the vCPU is awakened if
+	 * it's blocking.
+	 */
+	if (smp_load_acquire(&vcpu->mode) == OUTSIDE_GUEST_MODE)
+		kvm_vcpu_wake_up(vcpu);
+	else if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
 		kvm_vcpu_kick(vcpu);
 }