Retry page faults without acquiring mmu_lock if the resolved hva is covered
by an active invalidation. Contending for mmu_lock is especially
problematic on preemptible kernels as the mmu_notifier invalidation task
will yield mmu_lock (see rwlock_needbreak()), delay the in-progress
invalidation, and ultimately increase the latency of resolving the page
fault. And in the worst case scenario, yielding will be accompanied by a
remote TLB flush, e.g. if the invalidation covers a large range of memory
and vCPUs are accessing addresses that were already zapped.

Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
iterators to perform more work before yielding, but that wouldn't solve
the lock contention and would negatively affect scenarios where a vCPU is
trying to fault in an address that is NOT covered by the in-progress
invalidation.

Reported-by: Yan Zhao <yan.y.zhao@intel.com>
Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1a5a1e7d1eb7..8e2e07ed1a1b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4334,6 +4334,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	if (unlikely(!fault->slot))
 		return kvm_handle_noslot_fault(vcpu, fault, access);
 
+	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
+		return RET_PF_RETRY;
+
 	return RET_PF_CONTINUE;
 }
 
--
2.42.0.rc2.253.gd59a3bf2b4-goog
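
For reference, the mmu_invalidate_retry_hva() call added above is a lockless
check against the bookkeeping that KVM's mmu_notifier hooks maintain while an
invalidation is in flight. The sketch below is only an illustration of that
idea, not the in-tree implementation: the helper name is hypothetical and the
field usage is an approximation of the state kept by
kvm_mmu_notifier_invalidate_range_start()/end().

#include <linux/kvm_host.h>

/*
 * Illustrative sketch (not the in-tree code): retry the fault without taking
 * mmu_lock if an in-progress invalidation covers the hva, or if the
 * invalidation sequence count has changed since the fault snapshotted it
 * into fault->mmu_seq.
 */
static inline bool fault_hva_is_being_invalidated(struct kvm *kvm,
						  unsigned long mmu_seq,
						  unsigned long hva)
{
	/* Non-zero while one or more invalidate_range_start() calls are active. */
	if (kvm->mmu_invalidate_in_progress &&
	    hva >= kvm->mmu_invalidate_range_start &&
	    hva < kvm->mmu_invalidate_range_end)
		return true;

	/* Bumped when an invalidation completes; a mismatch means the snapshot is stale. */
	return kvm->mmu_invalidate_seq != mmu_seq;
}

Reading this state without mmu_lock is inherently racy, but the race is
benign: a spurious "retry" only costs another pass through the fault handler,
and anything missed here is still caught by is_page_fault_stale() under
mmu_lock, as noted in the review below.
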
On Thu, 2023-08-24 at 19:07 -0700, Sean Christopherson wrote:
> Retry page faults without acquiring mmu_lock if the resolved hva is covered
> by an active invalidation. Contending for mmu_lock is especially
> problematic on preemptible kernels as the mmu_notifier invalidation task
> will yield mmu_lock (see rwlock_needbreak()), delay the in-progress
> invalidation, and ultimately increase the latency of resolving the page
> fault. And in the worst case scenario, yielding will be accompanied by a
> remote TLB flush, e.g. if the invalidation covers a large range of memory
> and vCPUs are accessing addresses that were already zapped.
> 
> Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
> iterators to perform more work before yielding, but that wouldn't solve
> the lock contention and would negatively affect scenarios where a vCPU is
> trying to fault in an address that is NOT covered by the in-progress
> invalidation.
> 
> Reported-by: Yan Zhao <yan.y.zhao@intel.com>
> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Acked-by: Kai Huang <kai.huang@intel.com>

Nit below ...

> ---
>  arch/x86/kvm/mmu/mmu.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1a5a1e7d1eb7..8e2e07ed1a1b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4334,6 +4334,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	if (unlikely(!fault->slot))
>  		return kvm_handle_noslot_fault(vcpu, fault, access);
> 
> +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
> +		return RET_PF_RETRY;
> +

... Perhaps a comment saying this is to avoid unnecessary MMU lock contention
would be nice.  Otherwise we have is_page_fault_stale() called later within the
MMU lock.  I suppose people only tend to use git blamer when they cannot find
answer in the code :-)

>  	return RET_PF_CONTINUE;
>  }
> 

Btw, currently fault->mmu_seq is set in kvm_faultin_pfn(), which happens after
fast_page_fault().  Conceptually, should we move this to even before
fast_page_fault() because I assume the range zapping should also apply to the
cases that fast_page_fault() handles?

On Wed, Sep 06, 2023, Kai Huang wrote:
> On Thu, 2023-08-24 at 19:07 -0700, Sean Christopherson wrote:
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 1a5a1e7d1eb7..8e2e07ed1a1b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4334,6 +4334,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >  	if (unlikely(!fault->slot))
> >  		return kvm_handle_noslot_fault(vcpu, fault, access);
> > 
> > +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
> > +		return RET_PF_RETRY;
> > +
> 
> ... Perhaps a comment saying this is to avoid unnecessary MMU lock contention
> would be nice.  Otherwise we have is_page_fault_stale() called later within the
> MMU lock.  I suppose people only tend to use git blamer when they cannot find
> answer in the code :-)

Agreed, will add.

> >  	return RET_PF_CONTINUE;
> >  }
> > 
> 
> Btw, currently fault->mmu_seq is set in kvm_faultin_pfn(), which happens after
> fast_page_fault().  Conceptually, should we move this to even before
> fast_page_fault() because I assume the range zapping should also apply to the
> cases that fast_page_fault() handles?

Nope, fast_page_fault() doesn't need to "manually" detect invalidated SPTEs because
it only modifies shadow-present SPTEs and does so with an atomic CMPXCHG.  If a
SPTE is zapped by an mmu_notifier event (or anything else), the CMPXCHG will fail
and fast_page_fault() will see the !PRESENT SPTE on the next retry and bail.

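To make the fast-path argument above concrete, here is a stripped-down
illustration of the compare-and-exchange pattern being described. The helper
name is hypothetical and this is not the actual fast_page_fault() code; it
only shows why a concurrent zap cannot be missed on that path.

#include <linux/types.h>
#include <linux/atomic.h>

/*
 * Illustrative sketch: the fast path upgrades a SPTE only via an atomic
 * compare-and-exchange against the value it previously read.  If an
 * mmu_notifier invalidation zapped the SPTE in the meantime, *sptep no
 * longer equals old_spte, the exchange fails, and the caller re-reads the
 * (now !PRESENT) SPTE and bails to the slow path.
 */
static bool fast_fixup_spte_sketch(u64 *sptep, u64 old_spte, u64 new_spte)
{
	return cmpxchg64(sptep, old_spte, new_spte) == old_spte;
}

This is why only the slow path, which installs a SPTE built from a freshly
resolved pfn, needs the explicit mmu_seq/hva staleness check.
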
On Thu, 2023-09-07 at 07:45 -0700, Sean Christopherson wrote:
> On Wed, Sep 06, 2023, Kai Huang wrote:
> > On Thu, 2023-08-24 at 19:07 -0700, Sean Christopherson wrote:
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 1a5a1e7d1eb7..8e2e07ed1a1b 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4334,6 +4334,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > >  	if (unlikely(!fault->slot))
> > >  		return kvm_handle_noslot_fault(vcpu, fault, access);
> > > 
> > > +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
> > > +		return RET_PF_RETRY;
> > > +
> > 
> > ... Perhaps a comment saying this is to avoid unnecessary MMU lock contention
> > would be nice.  Otherwise we have is_page_fault_stale() called later within the
> > MMU lock.  I suppose people only tend to use git blamer when they cannot find
> > answer in the code :-)
> 
> Agreed, will add.
> 
> > >  	return RET_PF_CONTINUE;
> > >  }
> > > 
> > 
> > Btw, currently fault->mmu_seq is set in kvm_faultin_pfn(), which happens after
> > fast_page_fault().  Conceptually, should we move this to even before
> > fast_page_fault() because I assume the range zapping should also apply to the
> > cases that fast_page_fault() handles?
> 
> Nope, fast_page_fault() doesn't need to "manually" detect invalidated SPTEs because
> it only modifies shadow-present SPTEs and does so with an atomic CMPXCHG.  If a
> SPTE is zapped by an mmu_notifier event (or anything else), the CMPXCHG will fail
> and fast_page_fault() will see the !PRESENT SPTE on the next retry and bail.

Ah yes.  Thanks.
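
As a follow-up to the "Agreed, will add" above, one possible shape of the
check with the comment Kai asked for is sketched below. The wording is only a
guess for illustration, not necessarily what was ultimately merged:

	/*
	 * Check for a relevant in-flight mmu_notifier invalidation *before*
	 * acquiring mmu_lock, so the vCPU can retry the fault without
	 * contending the lock against the invalidation.  Races are fine:
	 * is_page_fault_stale() re-checks under mmu_lock before the SPTE is
	 * installed.
	 */
	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
		return RET_PF_RETRY;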