Fix a semi-theoretical race condition in the reading of pageready_pending
in kvm_arch_async_page_present_queued().

Only trust the value of pageready_pending if the vCPU is about to enter
guest mode (per vcpu->mode).

To achieve this, read vcpu->mode using smp_load_acquire(), which pairs
with the smp_store_release() in vcpu_enter_guest().

Then, only if vcpu->mode is IN_GUEST_MODE, trust the value of
pageready_pending, because it was written before the mode change and
therefore its correct value is visible.

Also, when that check succeeds, avoid raising the request on the target
vCPU.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
arch/x86/kvm/x86.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9018d56b4b0a..3d45a4cd08a4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
{
- kvm_make_request(KVM_REQ_APF_READY, vcpu);
- if (!vcpu->arch.apf.pageready_pending)
+ /* Pairs with smp_store_release in vcpu_enter_guest. */
+ bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
+ bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
+
+ if (!in_guest_mode || !page_ready_pending) {
+ kvm_make_request(KVM_REQ_APF_READY, vcpu);
kvm_vcpu_kick(vcpu);
+ }
}
bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu)
--
2.49.0
On 8/13/25 21:23, Maxim Levitsky wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9018d56b4b0a..3d45a4cd08a4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>
> void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
> {
> - kvm_make_request(KVM_REQ_APF_READY, vcpu);
> - if (!vcpu->arch.apf.pageready_pending)
> + /* Pairs with smp_store_release in vcpu_enter_guest. */
> + bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> + bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> +
> + if (!in_guest_mode || !page_ready_pending) {
> + kvm_make_request(KVM_REQ_APF_READY, vcpu);
> kvm_vcpu_kick(vcpu);
> + }
Unlike Sean, I think the race exists in abstract and is not benign, but
there are already enough memory barriers to tame it.
That said, in_guest_mode is a red herring. The way I look at it is
through the very common wake/sleep (or kick/check) pattern that has
smp_mb() on both sides.
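In its most abstract form this is the store-buffering (Dekker) litmus
test; the sketch below uses hypothetical variables a, b, r1 and r2, not
actual KVM state:

	/* CPU 0 (kick side) */
	WRITE_ONCE(a, 1);
	smp_mb();
	r1 = READ_ONCE(b);

	/* CPU 1 (check side) */
	WRITE_ONCE(b, 1);
	smp_mb();
	r2 = READ_ONCE(a);

With smp_mb() on both sides, "r1 == 0 && r2 == 0" is forbidden, i.e. at
least one side is guaranteed to observe the other's store; drop either
barrier and both sides can miss each other, which is exactly a lost
wakeup.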
The pair you are considering consists of this function (the "kick
side"), and the wrmsr handler for MSR_KVM_ASYNC_PF_ACK (the "check
side"). Let's see how the synchronization between the two sides
happens:
- here, you need to check whether to inject the interrupt. It looks
like this:
kvm_make_request(KVM_REQ_APF_READY, vcpu);
smp_mb();
if (!READ_ONCE(page_ready_pending))
kvm_vcpu_kick(vcpu);
- on the other side, after clearing page_ready_pending, there will be a
check for a wakeup:
WRITE_ONCE(page_ready_pending, false);
smp_mb();
if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
kvm_check_async_pf_completion(vcpu)
except that the "if" is not in kvm_set_msr_common(); it will happen
naturally as part of the first re-entry.
So let's look at the changes you need to make, in order to make the code
look like the above.
- using READ_ONCE/WRITE_ONCE for pageready_pending never hurts
- here in kvm_arch_async_page_present_queued(), a smp_mb__after_atomic()
(compiler barrier on x86) is missing after kvm_make_request():
kvm_make_request(KVM_REQ_APF_READY, vcpu);
/*
* Tell vCPU to wake up before checking if they need an
* interrupt. Pairs with any memory barrier between
* the clearing of pageready_pending and vCPU entry.
*/
smp_mb__after_atomic();
if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
kvm_vcpu_kick(vcpu);
- in kvm_set_msr_common(), there are two possibilities.
The easy one is to just use smp_store_mb() to clear
vcpu->arch.apf.pageready_pending. The other would be a comment
like this:
WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
/*
* Ensure they know to wake this vCPU up, before the vCPU
* next checks KVM_REQ_APF_READY. Use an existing memory
* barrier between here and the next kvm_request_pending(),
* for example in vcpu_run().
*/
/* smp_mb(); */
plus a memory barrier in common code like this:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 706b6fd56d3c..e302c617e4b2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11236,6 +11236,11 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
if (r <= 0)
break;
+ /*
+ * Provide a memory barrier between handle_exit and the
+ * kvm_request_pending() read in vcpu_enter_guest(). It
+ * pairs with any barrier after kvm_make_request(), for
+ * example in kvm_arch_async_page_present_queued().
+ */
+ smp_mb__before_atomic();
kvm_clear_request(KVM_REQ_UNBLOCK, vcpu);
if (kvm_xen_has_pending_events(vcpu))
kvm_xen_inject_pending_events(vcpu);
The only advantage of this second, more complex approach is that
it shows *why* the race was not happening. The 50 clock cycles
saved on an MSR write are not worth the extra complication, and
on a quick grep I could not find other cases which rely on the same
implicit barriers. So I'd say use smp_store_mb(), with a comment
about the pairing with kvm_arch_async_page_present_queued(); and write
in the commit message that the race wasn't happening thanks to unrelated
memory barriers between handle_exit and the kvm_request_pending()
read in vcpu_enter_guest.
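For concreteness, the smp_store_mb() variant of the MSR_KVM_ASYNC_PF_ACK
handler would look roughly like this (a sketch against the current code,
not a tested patch):

	case MSR_KVM_ASYNC_PF_ACK:
		if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF_INT))
			return 1;
		if (data & 0x1) {
			/*
			 * Clear pageready_pending with a full barrier;
			 * pairs with the smp_mb__after_atomic() in
			 * kvm_arch_async_page_present_queued().
			 */
			smp_store_mb(vcpu->arch.apf.pageready_pending, false);
			kvm_check_async_pf_completion(vcpu);
		}
		break;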
Thanks,
Paolo
On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> On 8/13/25 21:23, Maxim Levitsky wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 9018d56b4b0a..3d45a4cd08a4 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> > void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
> > {
> > - kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > - if (!vcpu->arch.apf.pageready_pending)
> > + /* Pairs with smp_store_release in vcpu_enter_guest. */
> > + bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> > + bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> > +
> > + if (!in_guest_mode || !page_ready_pending) {
> > + kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > kvm_vcpu_kick(vcpu);
> > + }
>
> Unlike Sean, I think the race exists in abstract and is not benign
How is it not benign? I never said the race doesn't exist, I said that consuming
a stale vcpu->arch.apf.pageready_pending in kvm_arch_async_page_present_queued()
is benign.
On 9/23/25 20:55, Sean Christopherson wrote:
> On Tue, Sep 23, 2025, Paolo Bonzini wrote:
>> On 8/13/25 21:23, Maxim Levitsky wrote:
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 9018d56b4b0a..3d45a4cd08a4 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>>> void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
>>> {
>>> - kvm_make_request(KVM_REQ_APF_READY, vcpu);
>>> - if (!vcpu->arch.apf.pageready_pending)
>>> + /* Pairs with smp_store_release in vcpu_enter_guest. */
>>> + bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
>>> + bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
>>> +
>>> + if (!in_guest_mode || !page_ready_pending) {
>>> + kvm_make_request(KVM_REQ_APF_READY, vcpu);
>>> kvm_vcpu_kick(vcpu);
>>> + }
>>
>> Unlike Sean, I think the race exists in abstract and is not benign
>
> How is it not benign? I never said the race doesn't exist, I said that consuming
> a stale vcpu->arch.apf.pageready_pending in kvm_arch_async_page_present_queued()
> is benign.
In principle there is a possibility that a KVM_REQ_APF_READY is missed.
Just by reading the specs, without an smp_mb__after_atomic() this
is broken:
kvm_make_request(KVM_REQ_APF_READY, vcpu);
if (!vcpu->arch.apf.pageready_pending)
kvm_vcpu_kick(vcpu);
It won't happen because set_bit() is written with an asm "memory" clobber,
and because x86's set_bit() prevents reordering at the processor level, etc.
In other words the race is only avoided by the fact that compiler
reorderings are prevented even in cases that memory-barriers.txt does
not promise.
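Concretely, nothing in the abstract model forbids hoisting the flag read
above the set_bit() hidden inside kvm_make_request(), i.e. executing the
snippet as if it were written like this (a sketch of the *reordered*
code, not a proposed change):

	/* Load hoisted above the set_bit() by compiler or CPU. */
	bool pending = READ_ONCE(vcpu->arch.apf.pageready_pending);

	kvm_make_request(KVM_REQ_APF_READY, vcpu);
	if (!pending)
		kvm_vcpu_kick(vcpu);

If the other side clears the flag and checks requests in the window
between the hoisted load and the set_bit() becoming visible, the kick
is skipped and the request goes unnoticed.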
Paolo
On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> On 9/23/25 20:55, Sean Christopherson wrote:
> > On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> > > On 8/13/25 21:23, Maxim Levitsky wrote:
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 9018d56b4b0a..3d45a4cd08a4 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> > > > void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
> > > > {
> > > > - kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > > - if (!vcpu->arch.apf.pageready_pending)
> > > > + /* Pairs with smp_store_release in vcpu_enter_guest. */
> > > > + bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> > > > + bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> > > > +
> > > > + if (!in_guest_mode || !page_ready_pending) {
> > > > + kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > > kvm_vcpu_kick(vcpu);
> > > > + }
> > >
> > > Unlike Sean, I think the race exists in abstract and is not benign
> >
> > How is it not benign? I never said the race doesn't exist, I said that consuming
> > a stale vcpu->arch.apf.pageready_pending in kvm_arch_async_page_present_queued()
> > is benign.
>
> In principle there is a possibility that a KVM_REQ_APF_READY is missed.
I think you mean a kick (wakeup or IPI) is missed, not that the APF_READY itself
is missed. I.e. KVM_REQ_APF_READY will never be lost, KVM just might enter the
guest or schedule out the vCPU with the flag set.
All in all, I think we're in violent agreement. I agree that kvm_vcpu_kick()
could be missed (theoretically), but I'm saying that missing the kick would be
benign due to a myriad of other barriers and checks, i.e. that the vCPU is
guaranteed to see KVM_REQ_APF_READY anyways.
E.g. my suggestion earlier regarding OUTSIDE_GUEST_MODE was to rely on the
smp_mb__after_srcu_read_{,un}lock() barriers in vcpu_enter_guest() to ensure
KVM_REQ_APF_READY would be observed before trying VM-Enter, and that if KVM might
be in the process of emulating HLT (blocking), that either KVM_REQ_APF_READY is
visible to the vCPU or that kvm_arch_async_page_present() wakes the vCPU. Oh,
hilarious, async_pf_execute() also does an unconditional __kvm_vcpu_wake_up().
Huh. But isn't that a real bug? KVM doesn't consider KVM_REQ_APF_READY to be a
wake event, so isn't this an actual race?
  vCPU                                  async #PF
  kvm_check_async_pf_completion()
  pageready_pending = false
  VM-Enter
  HLT
  VM-Exit
                                        kvm_make_request(KVM_REQ_APF_READY, vcpu)
                                        kvm_vcpu_kick(vcpu)    // nop as the vCPU isn't blocking, yet
                                        __kvm_vcpu_wake_up()   // nop for the same reason
  vcpu_block()
  <hang>
On x86, the "page ready" IRQ is only injected from vCPU context, so AFAICT nothing
is guaranteed to wake the vCPU in the above sequence.
> broken:
>
> kvm_make_request(KVM_REQ_APF_READY, vcpu);
> if (!vcpu->arch.apf.pageready_pending)
> kvm_vcpu_kick(vcpu);
>
> It won't happen because set_bit() is written with an asm "memory" clobber,
> and because x86's set_bit() prevents reordering at the processor level, etc.
>
> In other words the race is only avoided by the fact that compiler
> reorderings are prevented even in cases that memory-barriers.txt does not
> promise.
On Tue, Sep 23, 2025, Sean Christopherson wrote:
> On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> > On 9/23/25 20:55, Sean Christopherson wrote:
> > > On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> > > > On 8/13/25 21:23, Maxim Levitsky wrote:
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index 9018d56b4b0a..3d45a4cd08a4 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> > > > > void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
> > > > > {
> > > > > - kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > > > - if (!vcpu->arch.apf.pageready_pending)
> > > > > + /* Pairs with smp_store_release in vcpu_enter_guest. */
> > > > > + bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> > > > > + bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> > > > > +
> > > > > + if (!in_guest_mode || !page_ready_pending) {
> > > > > + kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > > > kvm_vcpu_kick(vcpu);
> > > > > + }
> > > >
> > > > Unlike Sean, I think the race exists in abstract and is not benign
> > >
> > > How is it not benign? I never said the race doesn't exist, I said that consuming
> > > a stale vcpu->arch.apf.pageready_pending in kvm_arch_async_page_present_queued()
> > > is benign.
> >
> > In principle there is a possibility that a KVM_REQ_APF_READY is missed.
>
> I think you mean a kick (wakeup or IPI) is missed, not that the APF_READY itself
> is missed. I.e. KVM_REQ_APF_READY will never be lost, KVM just might enter the
> guest or schedule out the vCPU with the flag set.
>
> All in all, I think we're in violent agreement. I agree that kvm_vcpu_kick()
> could be missed (theoretically), but I'm saying that missing the kick would be
> benign due to a myriad of other barriers and checks, i.e. that the vCPU is
> guaranteed to see KVM_REQ_APF_READY anyways.
>
> E.g. my suggestion earlier regarding OUTSIDE_GUEST_MODE was to rely on the
> smp_mb__after_srcu_read_{,un}lock() barriers in vcpu_enter_guest() to ensure
> KVM_REQ_APF_READY would be observed before trying VM-Enter, and that if KVM might
> be in the process of emulating HLT (blocking), that either KVM_REQ_APF_READY is
> visible to the vCPU or that kvm_arch_async_page_present() wakes the vCPU. Oh,
> hilarious, async_pf_execute() also does an unconditional __kvm_vcpu_wake_up().
>
> Huh. But isn't that a real bug? KVM doesn't consider KVM_REQ_APF_READY to be a
> wake event, so isn't this an actual race?
>
>   vCPU                                  async #PF
>   kvm_check_async_pf_completion()
>   pageready_pending = false
>   VM-Enter
>   HLT
>   VM-Exit
>                                         kvm_make_request(KVM_REQ_APF_READY, vcpu)
>                                         kvm_vcpu_kick(vcpu)    // nop as the vCPU isn't blocking, yet
>                                         __kvm_vcpu_wake_up()   // nop for the same reason
>   vcpu_block()
>   <hang>
>
> On x86, the "page ready" IRQ is only injected from vCPU context, so AFAICT nothing
> is guaranteed to wake the vCPU in the above sequence.
Gah, KVM checks async_pf.done instead of the request. So I don't think there's
a bug, just weird code.
bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
{
if (!list_empty_careful(&vcpu->async_pf.done)) <===
return true;
On Mon, 2025-10-27 at 08:00 -0700, Sean Christopherson wrote:
> On Tue, Sep 23, 2025, Sean Christopherson wrote:
> > On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> > > On 9/23/25 20:55, Sean Christopherson wrote:
> > > > On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> > > > > On 8/13/25 21:23, Maxim Levitsky wrote:
> > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > index 9018d56b4b0a..3d45a4cd08a4 100644
> > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> > > > > > void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
> > > > > > {
> > > > > > - kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > > > > - if (!vcpu->arch.apf.pageready_pending)
> > > > > > + /* Pairs with smp_store_release in vcpu_enter_guest. */
> > > > > > + bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> > > > > > + bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> > > > > > +
> > > > > > + if (!in_guest_mode || !page_ready_pending) {
> > > > > > + kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > > > > kvm_vcpu_kick(vcpu);
> > > > > > + }
> > > > >
> > > > > Unlike Sean, I think the race exists in abstract and is not benign
> > > >
> > > > How is it not benign? I never said the race doesn't exist, I said that consuming
> > > > a stale vcpu->arch.apf.pageready_pending in kvm_arch_async_page_present_queued()
> > > > is benign.
> > >
> > > In principle there is a possibility that a KVM_REQ_APF_READY is missed.
> >
> > I think you mean a kick (wakeup or IPI) is missed, not that the APF_READY itself
> > is missed. I.e. KVM_REQ_APF_READY will never be lost, KVM just might enter the
> > guest or schedule out the vCPU with the flag set.
> >
> > All in all, I think we're in violent agreement. I agree that kvm_vcpu_kick()
> > could be missed (theoretically), but I'm saying that missing the kick would be
> > benign due to a myriad of other barriers and checks, i.e. that the vCPU is
> > guaranteed to see KVM_REQ_APF_READY anyways.
> >
> > E.g. my suggestion earlier regarding OUTSIDE_GUEST_MODE was to rely on the
> > smp_mb__after_srcu_read_{,un}lock() barriers in vcpu_enter_guest() to ensure
> > KVM_REQ_APF_READY would be observed before trying VM-Enter, and that if KVM might
> > be in the process of emulating HLT (blocking), that either KVM_REQ_APF_READY is
> > visible to the vCPU or that kvm_arch_async_page_present() wakes the vCPU. Oh,
> > hilarious, async_pf_execute() also does an unconditional __kvm_vcpu_wake_up().
> >
> > Huh. But isn't that a real bug? KVM doesn't consider KVM_REQ_APF_READY to be a
> > wake event, so isn't this an actual race?
> >
> >   vCPU                                  async #PF
> >   kvm_check_async_pf_completion()
> >   pageready_pending = false
> >   VM-Enter
> >   HLT
> >   VM-Exit
> >                                         kvm_make_request(KVM_REQ_APF_READY, vcpu)
> >                                         kvm_vcpu_kick(vcpu)    // nop as the vCPU isn't blocking, yet
> >                                         __kvm_vcpu_wake_up()   // nop for the same reason
> >   vcpu_block()
> >   <hang>
> >
> > On x86, the "page ready" IRQ is only injected from vCPU context, so AFAICT nothing
> > is guaranteed to wake the vCPU in the above sequence.
>
> Gah, KVM checks async_pf.done instead of the request. So I don't think there's
> a bug, just weird code.
Hi!
Note that I posted a v2 of this patch series. Do I need to drop this patch,
or is it better to keep it (the patch should still be correct, though I
think it may be overkill)?

What do you think?

Can we have patch 3 of v2 merged, as it fixes a real issue that actually
causes random and hard-to-debug failures?
Best regards,
Maxim Levitsky
>
> bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
> {
> if (!list_empty_careful(&vcpu->async_pf.done)) <===
> return true;
>
On Thu, Oct 30, 2025, mlevitsk@redhat.com wrote:
> On Mon, 2025-10-27 at 08:00 -0700, Sean Christopherson wrote:
> > On Tue, Sep 23, 2025, Sean Christopherson wrote:
> > > On x86, the "page ready" IRQ is only injected from vCPU context, so AFAICT
> > > nothing is guaranteed to wake the vCPU in the above sequence.
> >
> > Gah, KVM checks async_pf.done instead of the request.  So I don't think there's
> > a bug, just weird code.
>
> Hi!
>
> Note that I posted a v2 of this patch series.

I got 'em, and looked at them in depth (which is how I figured out the above
weirdness with async_pf.done).  They're sitting in my "for_next" folder, I just
haven't spent any time on applying+testing upstream patches this week (I expect
to get to your series tomorrow, or early next week).

> Do I need to drop this patch, or is it better to keep it (the patch should
> still be correct, though I think it may be overkill)?

It's probably overkill, but there's no real downside, so I'm inclined to apply
the v2 version (and am planning on doing so).
On 9/23/25 17:58, Paolo Bonzini wrote:
> - on the other side, after clearing page_ready_pending, there will be a
>   check for a wakeup:
>
>     WRITE_ONCE(page_ready_pending, false);
>     smp_mb();
>     if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
>             kvm_check_async_pf_completion(vcpu)
>
> except that the "if" is not in kvm_set_msr_common(); it will happen
> naturally as part of the first re-entry.

Important thing that I forgot to mention: the above only covers the race
case.  There's also the case where KVM_REQ_APF_READY has been cleared
already, and for that one the call to kvm_check_async_pf_completion() is
*also* needed in kvm_set_msr_common().

Paolo
On Wed, Aug 13, 2025, Maxim Levitsky wrote:
> Fix a semi-theoretical race condition in the reading of pageready_pending
> in kvm_arch_async_page_present_queued().
This needs to explain what can actually go wrong if the race is "hit". After
staring at all of this for far, far too long, I'm 99.9% confident the race is
benign.
If the worker "incorrectly" sees pageready_pending as %false, then the result
is simply a spurious kick, and that spurious kick is all but guaranteed to be a
nop since if kvm_arch_async_page_present() is setting the flag, then (a) the
vCPU isn't blocking and (b) isn't IN_GUEST_MODE and thus won't be IPI'd.
If the worker incorrectly sees pageready_pending as %true, then the vCPU has
*just* written MSR_KVM_ASYNC_PF_ACK, and is guaranteed to observe and process
KVM_REQ_APF_READY before re-entering the guest, and the sole purpose of the kick
is to ensure the request is processed.
> Only trust the value of pageready_pending if the vCPU is about to enter
> guest mode (per vcpu->mode).
This is misleading, e.g. IN_GUEST_MODE can be true if the vCPU just *exited*.
All IN_GUEST_MODE says is that the vCPU task is somewhere in KVM's inner run loop.
> To achieve this, read vcpu->mode using smp_load_acquire(), which pairs
> with the smp_store_release() in vcpu_enter_guest().
>
> Then, only if vcpu->mode is IN_GUEST_MODE, trust the value of
> pageready_pending, because it was written before the mode change and
> therefore its correct value is visible.
>
> Also, when that check succeeds, avoid raising the request on the target
> vCPU.
Why? At worst, a dangling KVM_REQ_APF_READY will cause KVM to bail from the
fastpath when it's not strictly necessary to do so. On the other hand, a missing
request could hang the guest. So I don't see any reason to try and be super
precise when setting KVM_REQ_APF_READY.
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
> arch/x86/kvm/x86.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9018d56b4b0a..3d45a4cd08a4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>
> void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
> {
> - kvm_make_request(KVM_REQ_APF_READY, vcpu);
> - if (!vcpu->arch.apf.pageready_pending)
> + /* Pairs with smp_store_release in vcpu_enter_guest. */
> + bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
In terms of arch.apf.pageready_pending being modified, it's not IN_GUEST_MODE
that's special, it's OUTSIDE_GUEST_MODE that's special, because that's the only
time the task that holds vcpu->mutex can clear pageready_pending.
> + bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
This should be paired with WRITE_ONCE() on the vCPU.
> +
> + if (!in_guest_mode || !page_ready_pending) {
> + kvm_make_request(KVM_REQ_APF_READY, vcpu);
> kvm_vcpu_kick(vcpu);
> + }
Given that the race is guaranteed to be benign (assuming my analysis is correct),
I definitely think there should be a comment here explaining that pageready_pending
is "technically unstable". Otherwise, it takes a lot of staring to understand
what this code is actually doing.
I also think it makes sense to do the bare minimum for OUTSIDE_GUEST_MODE, which
is to wake the vCPU. Because calling kvm_vcpu_kick() when the vCPU is known to
not be IN_GUEST_MODE is weird.
For the code+comment, how about this?
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bdf7ef0b535..d721fab3418d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4000,7 +4000,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF_INT))
return 1;
if (data & 0x1) {
- vcpu->arch.apf.pageready_pending = false;
+ WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
kvm_check_async_pf_completion(vcpu);
}
break;
@@ -13457,7 +13457,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
if ((work->wakeup_all || work->notpresent_injected) &&
kvm_pv_async_pf_enabled(vcpu) &&
!apf_put_user_ready(vcpu, work->arch.token)) {
- vcpu->arch.apf.pageready_pending = true;
+ WRITE_ONCE(vcpu->arch.apf.pageready_pending, true);
kvm_apic_set_irq(vcpu, &irq, NULL);
}
@@ -13468,7 +13468,20 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
{
kvm_make_request(KVM_REQ_APF_READY, vcpu);
- if (!vcpu->arch.apf.pageready_pending)
+
+ /*
+ * Don't kick the vCPU if it has an outstanding "page ready" event as
+ * KVM won't be able to deliver the next "page ready" token until the
+ * outstanding one is handled. Ignore pageready_pending if the vCPU is
+ * outside "guest mode", i.e. if KVM might be sending "page ready" or
+ * servicing a MSR_KVM_ASYNC_PF_ACK write, as the flag is technically
+ * unstable. However, in that case, there's obviously no need to kick
+ * the vCPU out of the guest, so just ensure the vCPU is awakened if
+ * it's blocking.
+ */
+ if (smp_load_acquire(&vcpu->mode) == OUTSIDE_GUEST_MODE)
+ kvm_vcpu_wake_up(vcpu);
+ else if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
kvm_vcpu_kick(vcpu);
}