Fix a semi-theoretical race condition in the reading of pageready_pending
in kvm_arch_async_page_present_queued().

Only trust the value of pageready_pending if the vCPU is about to
enter guest mode (vcpu->mode).

To achieve this, read vcpu->mode using smp_load_acquire(), which is
paired with smp_store_release() in vcpu_enter_guest().

Then, only if vcpu->mode is IN_GUEST_MODE, trust the value of
pageready_pending, because it was written before the mode transition
and its correct value is therefore visible.

Also, if the above check is true (the vCPU is in guest mode and
pageready_pending is set), avoid raising the request on the target vCPU.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
arch/x86/kvm/x86.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9018d56b4b0a..3d45a4cd08a4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 
 void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
 {
-	kvm_make_request(KVM_REQ_APF_READY, vcpu);
-	if (!vcpu->arch.apf.pageready_pending)
+	/* Pairs with smp_store_release in vcpu_enter_guest. */
+	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
+	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
+
+	if (!in_guest_mode || !page_ready_pending) {
+		kvm_make_request(KVM_REQ_APF_READY, vcpu);
 		kvm_vcpu_kick(vcpu);
+	}
 }
 
 bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu)
--
2.49.0
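For context, the ordering the commit message leans on is the standard
acquire/release pairing on vcpu->mode. Below is a minimal sketch of the two
sides, assuming the entry path publishes IN_GUEST_MODE with a store-release,
which is what the patch's "Pairs with smp_store_release in vcpu_enter_guest"
comment refers to; the names come from the patch, the shape is illustrative:

	/* vCPU task, on the entry path (assumed): */
	smp_store_release(&vcpu->mode, IN_GUEST_MODE);

	/* Async #PF worker (kvm_arch_async_page_present_queued()): */
	if (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE) {
		/*
		 * The acquire/release pair guarantees that everything the
		 * vCPU task wrote before publishing IN_GUEST_MODE, including
		 * its last update of pageready_pending, is visible here.
		 */
	}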
On 8/13/25 21:23, Maxim Levitsky wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9018d56b4b0a..3d45a4cd08a4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> 
>  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
>  {
> -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> -	if (!vcpu->arch.apf.pageready_pending)
> +	/* Pairs with smp_store_release in vcpu_enter_guest. */
> +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> +
> +	if (!in_guest_mode || !page_ready_pending) {
> +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
>  		kvm_vcpu_kick(vcpu);
> +	}

Unlike Sean, I think the race exists in the abstract and is not benign,
but there are already enough memory barriers to tame it.  That said,
in_guest_mode is a red herring.

The way I look at it is through the very common wake/sleep (or
kick/check) pattern that has smp_mb() on both sides.  The pair you are
considering consists of this function (the "kick side") and the wrmsr
handler for MSR_KVM_ASYNC_PF_ACK (the "check side").

Let's see how the synchronization between the two sides happens:

- here, you need to check whether to inject the interrupt.  It looks
  like this:

	kvm_make_request(KVM_REQ_APF_READY, vcpu);
	smp_mb();
	if (!READ_ONCE(page_ready_pending))
		kvm_vcpu_kick(vcpu);

- on the other side, after clearing page_ready_pending, there will be a
  check for a wakeup:

	WRITE_ONCE(page_ready_pending, false);
	smp_mb();
	if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
		kvm_check_async_pf_completion(vcpu);

  except that the "if" is not in kvm_set_msr_common(); it will happen
  naturally as part of the first re-entry.

So let's look at the changes you need to make, in order to make the
code look like the above.

- using READ_ONCE/WRITE_ONCE for pageready_pending never hurts

- here in kvm_arch_async_page_present_queued(), a smp_mb__after_atomic()
  (compiler barrier on x86) is missing after kvm_make_request():

	kvm_make_request(KVM_REQ_APF_READY, vcpu);
	/*
	 * Tell vCPU to wake up before checking if they need an
	 * interrupt.  Pairs with any memory barrier between
	 * the clearing of pageready_pending and vCPU entry.
	 */
	smp_mb__after_atomic();
	if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
		kvm_vcpu_kick(vcpu);

- in kvm_set_msr_common(), there are two possibilities.  The easy one
  is to just use smp_store_mb() to clear
  vcpu->arch.apf.pageready_pending.  The other would be a comment like
  this:

	WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
	/*
	 * Ensure they know to wake this vCPU up, before the vCPU
	 * next checks KVM_REQ_APF_READY.  Use an existing memory
	 * barrier between here and the next kvm_request_pending(),
	 * for example in vcpu_run().
	 */
	/* smp_mb(); */

  plus a memory barrier in common code like this:

	diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
	index 706b6fd56d3c..e302c617e4b2 100644
	--- a/arch/x86/kvm/x86.c
	+++ b/arch/x86/kvm/x86.c
	@@ -11236,6 +11236,11 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
	 		if (r <= 0)
	 			break;
	 
	+		/*
	+		 * Provide a memory barrier between handle_exit and the
	+		 * kvm_request_pending() read in vcpu_enter_guest().  It
	+		 * pairs with any barrier after kvm_make_request(), for
	+		 * example in kvm_arch_async_page_present_queued().
	+		 */
	+		smp_mb__before_atomic();
	 		kvm_clear_request(KVM_REQ_UNBLOCK, vcpu);
	 		if (kvm_xen_has_pending_events(vcpu))
	 			kvm_xen_inject_pending_events(vcpu);

The only advantage of this second, more complex approach is that it
shows *why* the race was not happening.  The 50 clock cycles saved on
an MSR write are not worth the extra complication, and on a quick grep
I could not find other cases which rely on the same implicit barriers.

So I'd say use smp_store_mb(), with a comment about the pairing with
kvm_arch_async_page_present_queued(); and write in the commit message
that the race wasn't happening thanks to unrelated memory barriers
between handle_exit and the kvm_request_pending() read in
vcpu_enter_guest().

Thanks,

Paolo
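As an aside for readers following along: smp_store_mb(), the "easy" option
above, folds the store and the full barrier into a single call.  Per
Documentation/memory-barriers.txt it is roughly a WRITE_ONCE() followed by
smp_mb(), so the suggestion sketches out to:

	/* Paolo's "easy" option: */
	smp_store_mb(vcpu->arch.apf.pageready_pending, false);

	/* ...which is approximately equivalent to: */
	WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
	smp_mb();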
On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> On 8/13/25 21:23, Maxim Levitsky wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 9018d56b4b0a..3d45a4cd08a4 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> >  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
> >  {
> > -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > -	if (!vcpu->arch.apf.pageready_pending)
> > +	/* Pairs with smp_store_release in vcpu_enter_guest. */
> > +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> > +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> > +
> > +	if (!in_guest_mode || !page_ready_pending) {
> > +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
> >  		kvm_vcpu_kick(vcpu);
> > +	}
> 
> Unlike Sean, I think the race exists in the abstract and is not benign

How is it not benign?  I never said the race doesn't exist, I said that
consuming a stale vcpu->arch.apf.pageready_pending in
kvm_arch_async_page_present_queued() is benign.
On 9/23/25 20:55, Sean Christopherson wrote:
> On Tue, Sep 23, 2025, Paolo Bonzini wrote:
>> On 8/13/25 21:23, Maxim Levitsky wrote:
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 9018d56b4b0a..3d45a4cd08a4 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>>>  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
>>>  {
>>> -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
>>> -	if (!vcpu->arch.apf.pageready_pending)
>>> +	/* Pairs with smp_store_release in vcpu_enter_guest. */
>>> +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
>>> +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
>>> +
>>> +	if (!in_guest_mode || !page_ready_pending) {
>>> +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
>>>  		kvm_vcpu_kick(vcpu);
>>> +	}
>>
>> Unlike Sean, I think the race exists in the abstract and is not benign
>
> How is it not benign?  I never said the race doesn't exist, I said that
> consuming a stale vcpu->arch.apf.pageready_pending in
> kvm_arch_async_page_present_queued() is benign.

In principle there is a possibility that a KVM_REQ_APF_READY is missed.
Just by the reading of the specs, without a smp_mb__after_atomic() this
is broken:

	kvm_make_request(KVM_REQ_APF_READY, vcpu);
	if (!vcpu->arch.apf.pageready_pending)
		kvm_vcpu_kick(vcpu);

It won't happen because set_bit() is written with asm("memory"),
because x86 set_bit() does prevent reordering at the processor level,
etc.  In other words the race is only avoided by the fact that compiler
reorderings are prevented even in cases that memory-barriers.txt does
not promise.

Paolo
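To make the theoretical reordering concrete, here is a minimal stand-alone C
sketch of the kick/check shape, not KVM code: "request" models the
KVM_REQ_APF_READY bit, "pending" models vcpu->arch.apf.pageready_pending, and
kick()/handle() are hypothetical stand-ins.  With only relaxed accesses,
nothing in the memory model forbids each side's load from being reordered
before its store, which is exactly the missed-wakeup window being discussed:

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	static atomic_bool request;	/* models the KVM_REQ_APF_READY bit */
	static atomic_bool pending;	/* models pageready_pending */

	static void kick(void)   { puts("kick vCPU"); }
	static void handle(void) { puts("handle request"); }

	static void kick_side(void)	/* kvm_arch_async_page_present_queued() */
	{
		atomic_store_explicit(&request, true, memory_order_relaxed);
		/* Missing full barrier: the load below may be reordered
		 * before the store above, skipping the kick based on a
		 * stale value of "pending". */
		if (!atomic_load_explicit(&pending, memory_order_relaxed))
			kick();
	}

	static void check_side(void)	/* ASYNC_PF_ACK handler + re-entry */
	{
		atomic_store_explicit(&pending, false, memory_order_relaxed);
		/* Mirror image of the problem: the request check may be
		 * reordered before "pending" is cleared. */
		if (atomic_load_explicit(&request, memory_order_relaxed))
			handle();
	}

	int main(void)
	{
		/* The race needs the two sides on different CPUs; calling
		 * them sequentially here only demonstrates the shape. */
		kick_side();
		check_side();
		return 0;
	}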
On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> On 9/23/25 20:55, Sean Christopherson wrote:
> > On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> > > On 8/13/25 21:23, Maxim Levitsky wrote:
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 9018d56b4b0a..3d45a4cd08a4 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> > > >  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
> > > >  {
> > > > -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > > -	if (!vcpu->arch.apf.pageready_pending)
> > > > +	/* Pairs with smp_store_release in vcpu_enter_guest. */
> > > > +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> > > > +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> > > > +
> > > > +	if (!in_guest_mode || !page_ready_pending) {
> > > > +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > >  		kvm_vcpu_kick(vcpu);
> > > > +	}
> > >
> > > Unlike Sean, I think the race exists in the abstract and is not benign
> >
> > How is it not benign?  I never said the race doesn't exist, I said that
> > consuming a stale vcpu->arch.apf.pageready_pending in
> > kvm_arch_async_page_present_queued() is benign.
>
> In principle there is a possibility that a KVM_REQ_APF_READY is missed.

I think you mean a kick (wakeup or IPI) is missed, not that the
APF_READY itself is missed.  I.e. KVM_REQ_APF_READY will never be lost,
KVM just might enter the guest or schedule out the vCPU with the flag
set.

All in all, I think we're in violent agreement.  I agree that
kvm_vcpu_kick() could be missed (theoretically), but I'm saying that
missing the kick would be benign due to a myriad of other barriers and
checks, i.e. that the vCPU is guaranteed to see KVM_REQ_APF_READY
anyways.

E.g. my suggestion earlier regarding OUTSIDE_GUEST_MODE was to rely on
the smp_mb__after_srcu_read_{,un}lock() barriers in vcpu_enter_guest()
to ensure KVM_REQ_APF_READY would be observed before trying VM-Enter,
and that if KVM might be in the process of emulating HLT (blocking),
that either KVM_REQ_APF_READY is visible to the vCPU or that
kvm_arch_async_page_present() wakes the vCPU.

Oh, hilarious, async_pf_execute() also does an unconditional
__kvm_vcpu_wake_up().

Huh.  But isn't that a real bug?  KVM doesn't consider
KVM_REQ_APF_READY to be a wake event, so isn't this an actual race?

  vCPU                              async #PF
  kvm_check_async_pf_completion()
  pageready_pending = false

  VM-Enter
  HLT
  VM-Exit
                                    kvm_make_request(KVM_REQ_APF_READY, vcpu)
                                    kvm_vcpu_kick(vcpu)    // nop as the vCPU isn't blocking, yet
                                    __kvm_vcpu_wake_up()   // nop for the same reason
  vcpu_block()

  <hang>

On x86, the "page ready" IRQ is only injected from vCPU context, so
AFAICT nothing is guaranteed to wake the vCPU in the above sequence.

> broken:
> 
> 	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> 	if (!vcpu->arch.apf.pageready_pending)
> 		kvm_vcpu_kick(vcpu);
> 
> It won't happen because set_bit() is written with asm("memory"),
> because x86 set_bit() does prevent reordering at the processor level,
> etc.
> 
> In other words the race is only avoided by the fact that compiler
> reorderings are prevented even in cases that memory-barriers.txt does
> not promise.
On 9/23/25 17:58, Paolo Bonzini wrote:
> - on the other side, after clearing page_ready_pending, there will be
>   a check for a wakeup:
> 
> 	WRITE_ONCE(page_ready_pending, false);
> 	smp_mb();
> 	if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
> 		kvm_check_async_pf_completion(vcpu);
> 
> except that the "if" is not in kvm_set_msr_common(); it will happen
> naturally as part of the first re-entry.

Important thing that I forgot to mention: the above only covers the
race case.  There's also the case where KVM_REQ_APF_READY has been
cleared already, and for that one the call to
kvm_check_async_pf_completion() is *also* needed in
kvm_set_msr_common().

Paolo

> So let's look at the changes you need to make, in order to make the
> code look like the above.
> 
> - using READ_ONCE/WRITE_ONCE for pageready_pending never hurts
> 
> - here in kvm_arch_async_page_present_queued(), a smp_mb__after_atomic()
>   (compiler barrier on x86) is missing after kvm_make_request():
> 
> 	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> 	/*
> 	 * Tell vCPU to wake up before checking if they need an
> 	 * interrupt.  Pairs with any memory barrier between
> 	 * the clearing of pageready_pending and vCPU entry.
> 	 */
> 	smp_mb__after_atomic();
> 	if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
> 		kvm_vcpu_kick(vcpu);
> 
> - in kvm_set_msr_common(), there are two possibilities.
>   The easy one is to just use smp_store_mb() to clear
>   vcpu->arch.apf.pageready_pending.  The other would be a comment
>   like this:
> 
> 	WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
> 	/*
> 	 * Ensure they know to wake this vCPU up, before the vCPU
> 	 * next checks KVM_REQ_APF_READY.  Use an existing memory
> 	 * barrier between here and the next kvm_request_pending(),
> 	 * for example in vcpu_run().
> 	 */
> 	/* smp_mb(); */
> 
> plus a memory barrier in common code like this:
> 
> 	diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> 	index 706b6fd56d3c..e302c617e4b2 100644
> 	--- a/arch/x86/kvm/x86.c
> 	+++ b/arch/x86/kvm/x86.c
> 	@@ -11236,6 +11236,11 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
> 	 		if (r <= 0)
> 	 			break;
> 	 
> 	+		/*
> 	+		 * Provide a memory barrier between handle_exit and the
> 	+		 * kvm_request_pending() read in vcpu_enter_guest().  It
> 	+		 * pairs with any barrier after kvm_make_request(), for
> 	+		 * example in kvm_arch_async_page_present_queued().
> 	+		 */
> 	+		smp_mb__before_atomic();
> 	 		kvm_clear_request(KVM_REQ_UNBLOCK, vcpu);
> 	 		if (kvm_xen_has_pending_events(vcpu))
> 	 			kvm_xen_inject_pending_events(vcpu);
> 
> The only advantage of this second, more complex approach is that
> it shows *why* the race was not happening.  The 50 clock cycles
> saved on an MSR write are not worth the extra complication, and
> on a quick grep I could not find other cases which rely on the same
> implicit barriers.  So I'd say use smp_store_mb(), with a comment
> about the pairing with kvm_arch_async_page_present_queued(); and write
> in the commit message that the race wasn't happening thanks to unrelated
> memory barriers between handle_exit and the kvm_request_pending()
> read in vcpu_enter_guest().
> 
> Thanks,
> 
> Paolo
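Putting the two halves of the suggestion together, the MSR_KVM_ASYNC_PF_ACK
handler would presumably end up looking something like the sketch below.
This is only an illustration of the direction being discussed, built on the
handler shape visible in the diff in the next message; it is not a tested
patch:

	case MSR_KVM_ASYNC_PF_ACK:
		if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF_INT))
			return 1;
		if (data & 0x1) {
			/*
			 * Pairs with the barrier after kvm_make_request()
			 * in kvm_arch_async_page_present_queued().
			 */
			smp_store_mb(vcpu->arch.apf.pageready_pending, false);

			/*
			 * Needed even when KVM_REQ_APF_READY was already
			 * consumed before the guest acked the token.
			 */
			kvm_check_async_pf_completion(vcpu);
		}
		break;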
On Wed, Aug 13, 2025, Maxim Levitsky wrote:
> Fix a semi-theoretical race condition in the reading of
> pageready_pending in kvm_arch_async_page_present_queued().

This needs to explain what can actually go wrong if the race is "hit".

After staring at all of this for far, far too long, I'm 99.9% confident
the race is benign.

If the worker "incorrectly" sees pageready_pending as %false, then the
result is simply a spurious kick, and that spurious kick is all but
guaranteed to be a nop since if kvm_arch_async_page_present() is
setting the flag, then (a) the vCPU isn't blocking and (b) isn't
IN_GUEST_MODE and thus won't be IPI'd.

If the worker incorrectly sees pageready_pending as %true, then the
vCPU has *just* written MSR_KVM_ASYNC_PF_ACK, and is guaranteed to
observe and process KVM_REQ_APF_READY before re-entering the guest, and
the sole purpose of the kick is to ensure the request is processed.

> Only trust the value of pageready_pending if the vCPU is about to
> enter guest mode (vcpu->mode).

This is misleading, e.g. IN_GUEST_MODE can be true if the vCPU just
*exited*.  All IN_GUEST_MODE says is that the vCPU task is somewhere in
KVM's inner run loop.

> To achieve this, read vcpu->mode using smp_load_acquire(), which is
> paired with smp_store_release() in vcpu_enter_guest().
> 
> Then, only if vcpu->mode is IN_GUEST_MODE, trust the value of
> pageready_pending, because it was written before the mode transition
> and its correct value is therefore visible.
> 
> Also, if the above check is true (the vCPU is in guest mode and
> pageready_pending is set), avoid raising the request on the target
> vCPU.

Why?  At worst, a dangling KVM_REQ_APF_READY will cause KVM to bail
from the fastpath when it's not strictly necessary to do so.  On the
other hand, a missing request could hang the guest.  So I don't see any
reason to try and be super precise when setting KVM_REQ_APF_READY.

> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/x86.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9018d56b4b0a..3d45a4cd08a4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> 
>  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
>  {
> -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> -	if (!vcpu->arch.apf.pageready_pending)
> +	/* Pairs with smp_store_release in vcpu_enter_guest. */
> +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);

In terms of arch.apf.pageready_pending being modified, it's not
IN_GUEST_MODE that's special, it's OUTSIDE_GUEST_MODE that's special,
because that's the only time the task that holds vcpu->mutex can clear
pageready_pending.

> +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);

This should be paired with WRITE_ONCE() on the vCPU.

> +
> +	if (!in_guest_mode || !page_ready_pending) {
> +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
>  		kvm_vcpu_kick(vcpu);
> +	}

Given that the race is guaranteed to be benign (assuming my analysis is
correct), I definitely think there should be a comment here explaining
that pageready_pending is "technically unstable".  Otherwise, it takes
a lot of staring to understand what this code is actually doing.

I also think it makes sense to do the bare minimum for
OUTSIDE_GUEST_MODE, which is to wake the vCPU.  Because calling
kvm_vcpu_kick() when the vCPU is known to not be IN_GUEST_MODE is
weird.

For the code+comment, how about this?
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bdf7ef0b535..d721fab3418d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4000,7 +4000,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF_INT))
 			return 1;
 		if (data & 0x1) {
-			vcpu->arch.apf.pageready_pending = false;
+			WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
 			kvm_check_async_pf_completion(vcpu);
 		}
 		break;
@@ -13457,7 +13457,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 	if ((work->wakeup_all || work->notpresent_injected) &&
 	    kvm_pv_async_pf_enabled(vcpu) &&
 	    !apf_put_user_ready(vcpu, work->arch.token)) {
-		vcpu->arch.apf.pageready_pending = true;
+		WRITE_ONCE(vcpu->arch.apf.pageready_pending, true);
 		kvm_apic_set_irq(vcpu, &irq, NULL);
 	}
 
@@ -13468,7 +13468,20 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
 {
 	kvm_make_request(KVM_REQ_APF_READY, vcpu);
-	if (!vcpu->arch.apf.pageready_pending)
+
+	/*
+	 * Don't kick the vCPU if it has an outstanding "page ready" event as
+	 * KVM won't be able to deliver the next "page ready" token until the
+	 * outstanding one is handled.  Ignore pageready_pending if the vCPU is
+	 * outside "guest mode", i.e. if KVM might be sending "page ready" or
+	 * servicing a MSR_KVM_ASYNC_PF_ACK write, as the flag is technically
+	 * unstable.  However, in that case, there's obviously no need to kick
+	 * the vCPU out of the guest, so just ensure the vCPU is awakened if
+	 * it's blocking.
+	 */
+	if (smp_load_acquire(&vcpu->mode) == OUTSIDE_GUEST_MODE)
+		kvm_vcpu_wake_up(vcpu);
+	else if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
 		kvm_vcpu_kick(vcpu);
 }