Allow checking mmu_invalidate_retry_hva() without holding mmu_lock, even
though mmu_lock must be held to guarantee correctness, i.e. to avoid
false negatives. Dropping the requirement that mmu_lock be held will
allow pre-checking for retry before acquiring mmu_lock, e.g. to avoid
contending mmu_lock when the guest is accessing a range that is being
invalidated by the host.

Contending mmu_lock can have severe negative side effects for x86's TDP
MMU when running on preemptible kernels, as KVM will yield from the
zapping task (holds mmu_lock for write) when there is lock contention,
and yielding after any SPTEs have been zapped requires a VM-scoped TLB
flush.

Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that calling
mmu_invalidate_retry_hva() in a loop won't put KVM into an infinite loop,
e.g. due to caching the in-progress flag and never seeing it go to '0'.

Force a load of mmu_invalidate_seq as well, even though it isn't strictly
necessary to avoid an infinite loop, as doing so improves the probability
that KVM will detect an invalidation that already completed before
acquiring mmu_lock and bailing anyways.

Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
may generate a load into a register instead of doing a direct comparison
(MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
is a few bytes of code and maaaaybe a cycle or three.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
include/linux/kvm_host.h | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7418e881c21c..7314138ba5f4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1962,18 +1962,29 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
                                            unsigned long mmu_seq,
                                            unsigned long hva)
 {
-        lockdep_assert_held(&kvm->mmu_lock);
         /*
          * If mmu_invalidate_in_progress is non-zero, then the range maintained
          * by kvm_mmu_notifier_invalidate_range_start contains all addresses
          * that might be being invalidated. Note that it may include some false
          * positives, due to shortcuts when handing concurrent invalidations.
+         *
+         * Note the lack of memory barriers! The caller *must* hold mmu_lock
+         * to avoid false negatives! Holding mmu_lock is not mandatory though,
+         * e.g. to allow pre-checking for an in-progress invalidation to
+         * avoid contending mmu_lock. Ensure that the in-progress flag and
+         * sequence counter are always read from memory, so that checking for
+         * retry in a loop won't result in an infinite retry loop. Don't force
+         * loads for start+end, as the key to avoiding infinite retry loops
+         * is observing the 1=>0 transition of in-progress, i.e. getting false
+         * negatives (if mmu_lock isn't held) due to stale start+end values is
+         * acceptable.
          */
-        if (unlikely(kvm->mmu_invalidate_in_progress) &&
+        if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
             hva >= kvm->mmu_invalidate_range_start &&
             hva < kvm->mmu_invalidate_range_end)
                 return 1;
-        if (kvm->mmu_invalidate_seq != mmu_seq)
+
+        if (READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq)
                 return 1;
         return 0;
 }
--
2.42.0.rc2.253.gd59a3bf2b4-goog

On Thu, 2023-08-24 at 19:07 -0700, Sean Christopherson wrote:
> Allow checking mmu_invalidate_retry_hva() without holding mmu_lock, even
> though mmu_lock must be held to guarantee correctness, i.e. to avoid
> false negatives. Dropping the requirement that mmu_lock be held will
> allow pre-checking for retry before acquiring mmu_lock, e.g. to avoid
> contending mmu_lock when the guest is accessing a range that is being
> invalidated by the host.
>
> Contending mmu_lock can have severe negative side effects for x86's TDP
> MMU when running on preemptible kernels, as KVM will yield from the
> zapping task (holds mmu_lock for write) when there is lock contention,
> and yielding after any SPTEs have been zapped requires a VM-scoped TLB
> flush.
>
> Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that calling
> mmu_invalidate_retry_hva() in a loop won't put KVM into an infinite loop,
> e.g. due to caching the in-progress flag and never seeing it go to '0'.
>
> Force a load of mmu_invalidate_seq as well, even though it isn't strictly
> necessary to avoid an infinite loop, as doing so improves the probability
> that KVM will detect an invalidation that already completed before
> acquiring mmu_lock and bailing anyways.

Without the READ_ONCE() on mmu_invalidate_seq, with patch 2 applied and
mmu_invalidate_retry_hva() inlined, IIUC kvm_faultin_pfn() can look like
this:

fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq; <-- (1)
smp_rmb();

...
READ_ONCE(vcpu->kvm->mmu_invalidate_in_progress);
...

if (vcpu->kvm->mmu_invalidate_seq != fault->mmu_seq) <-- (2)
...

Perhaps a stupid question :-) Could the compiler believe that both
vcpu->kvm->mmu_invalidate_seq and fault->mmu_seq never change, and thus
eliminate the code at (1) and (2)? Or are all the barriers in between enough
to prevent the compiler from doing such a stupid thing?
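
Purely to illustrate the worry (a hypothetical transformation, not something
I have actually seen a compiler do): without a volatile access or a barrier
in between, the compiler would in principle be free to turn

        fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;         <-- (1)
        ...
        if (vcpu->kvm->mmu_invalidate_seq != fault->mmu_seq)    <-- (2)
                goto retry;

into

        fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;         <-- (1)
        ...
        /* (2) elided: compiler assumes neither value changed in between */

if it could prove that nothing on that path writes mmu_invalidate_seq.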
>
> Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
> may generate a load into a register instead of doing a direct comparison
> (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
> is a few bytes of code and maaaaybe a cycle or three.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> include/linux/kvm_host.h | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7418e881c21c..7314138ba5f4 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1962,18 +1962,29 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>                                             unsigned long mmu_seq,
>                                             unsigned long hva)
>  {
> -        lockdep_assert_held(&kvm->mmu_lock);
>          /*
>           * If mmu_invalidate_in_progress is non-zero, then the range maintained
>           * by kvm_mmu_notifier_invalidate_range_start contains all addresses
>           * that might be being invalidated. Note that it may include some false
>           * positives, due to shortcuts when handing concurrent invalidations.
> +         *
> +         * Note the lack of memory barriers! The caller *must* hold mmu_lock
> +         * to avoid false negatives! Holding mmu_lock is not mandatory though,
> +         * e.g. to allow pre-checking for an in-progress invalidation to
> +         * avoid contending mmu_lock. Ensure that the in-progress flag and
> +         * sequence counter are always read from memory, so that checking for
> +         * retry in a loop won't result in an infinite retry loop. Don't force
> +         * loads for start+end, as the key to avoiding infinite retry loops
> +         * is observing the 1=>0 transition of in-progress, i.e. getting false
> +         * negatives (if mmu_lock isn't held) due to stale start+end values is
> +         * acceptable.
>           */
> -        if (unlikely(kvm->mmu_invalidate_in_progress) &&
> +        if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
>              hva >= kvm->mmu_invalidate_range_start &&
>              hva < kvm->mmu_invalidate_range_end)
>                  return 1;
> -        if (kvm->mmu_invalidate_seq != mmu_seq)
> +
> +        if (READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq)
>                  return 1;
>          return 0;
>  }

I am not sure how mmu_invalidate_retry_hva() could end up being called in a
loop, so all of this looks theoretical, but the extra cost should be nearly
nothing, as you said.

Acked-by: Kai Huang <kai.huang@intel.com>

(Or perhaps patches 1 and 2 can just be merged into one patch.)

On Thu, Sep 07, 2023, Kai Huang wrote:
> On Thu, 2023-08-24 at 19:07 -0700, Sean Christopherson wrote:
> > Allow checking mmu_invalidate_retry_hva() without holding mmu_lock, even
> > though mmu_lock must be held to guarantee correctness, i.e. to avoid
> > false negatives. Dropping the requirement that mmu_lock be held will
> > allow pre-checking for retry before acquiring mmu_lock, e.g. to avoid
> > contending mmu_lock when the guest is accessing a range that is being
> > invalidated by the host.
> >
> > Contending mmu_lock can have severe negative side effects for x86's TDP
> > MMU when running on preemptible kernels, as KVM will yield from the
> > zapping task (holds mmu_lock for write) when there is lock contention,
> > and yielding after any SPTEs have been zapped requires a VM-scoped TLB
> > flush.
> >
> > Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that calling
> > mmu_invalidate_retry_hva() in a loop won't put KVM into an infinite loop,
> > e.g. due to caching the in-progress flag and never seeing it go to '0'.
> >
> > Force a load of mmu_invalidate_seq as well, even though it isn't strictly
> > necessary to avoid an infinite loop, as doing so improves the probability
> > that KVM will detect an invalidation that already completed before
> > acquiring mmu_lock and bailing anyways.
>
> Without the READ_ONCE() on mmu_invalidate_seq, with patch 2 applied and
> mmu_invalidate_retry_hva() inlined, IIUC kvm_faultin_pfn() can look like
> this:
>
> fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq; <-- (1)
> smp_rmb();
>
> ...
> READ_ONCE(vcpu->kvm->mmu_invalidate_in_progress);
> ...
>
> if (vcpu->kvm->mmu_invalidate_seq != fault->mmu_seq) <-- (2)
> ...
>
> Perhaps a stupid question :-) Could the compiler believe that both
> vcpu->kvm->mmu_invalidate_seq and fault->mmu_seq never change, and thus
> eliminate the code at (1) and (2)? Or are all the barriers in between enough
> to prevent the compiler from doing such a stupid thing?

Practically speaking, no, there's far too much going on in __kvm_faultin_pfn().
But, KVM _could_ do the freshness check before __kvm_faultin_pfn() since KVM
has the memslot and thus the host virtual address at that point. I highly doubt
we'll ever do that, but it's possible. At that point, there'd be no spinlocks
or other barriers to ensure the load+check wouldn't get elided. That's still
extremely theoretical though.
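
E.g. a purely hypothetical pre-check (not what this series does, and glossing
over how fault->hva would need to be filled in before the PFN lookup) could
look something like:

        if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
                return RET_PF_RETRY;

        return __kvm_faultin_pfn(vcpu, fault);

in which case there would be nothing between the initial load of
mmu_invalidate_seq and the recheck to force the compiler's hand.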

(1) can't be eliminated because acquiring mmu_lock provides enough barriers to
prevent the compiler from omitting the load, i.e. the compiler can't omit the
comparison that is done inside the critical section. (2) can theoretically be
optimized away by the compiler (when called before acquiring mmu_lock), though
it's extremely unlikely since there's sooo much going on between the load and
the check.
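
For completeness, and glossing over the kernel's type-checking wrappers in
rwonce.h, READ_ONCE() boils down to a volatile access, which is what obligates
the compiler to emit a fresh load every time the macro is evaluated:

        /* illustrative sketch, not the exact kernel definition */
        #define READ_ONCE_SKETCH(x)     (*(const volatile typeof(x) *)&(x))

i.e. with the READ_ONCE() added by this patch, the check at (2) can't be
elided even in the fully inlined, no-barriers case.
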
> > Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
> > may generate a load into a register instead of doing a direct comparison
> > (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
> > is a few bytes of code and maaaaybe a cycle or three.
...
> > -        if (kvm->mmu_invalidate_seq != mmu_seq)
> > +
> > +        if (READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq)
> >                  return 1;
> >          return 0;
> >  }
>
> I am not sure how mmu_invalidate_retry_hva() could end up being called in a
> loop, so all of this looks theoretical, but the extra cost should be nearly
> nothing, as you said.

It's not currently called in a loop, but it wouldn't be wrong for KVM to do
something like the below instead of fully re-entering the guest, in which case
mmu_invalidate_retry_hva() would be called in a loop, just a big loop :-)

And if KVM checked for freshness before resolving the PFN, it'd be possible to
read and check the sequence in the loop without any barriers that would
guarantee a reload (again, very, very theoretically).

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e340098d034..c7617991e290 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5725,11 +5725,13 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
         }
 
         if (r == RET_PF_INVALID) {
-                r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
-                                          lower_32_bits(error_code), false,
-                                          &emulation_type);
-                if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
-                        return -EIO;
+                do {
+                        r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
+                                                  lower_32_bits(error_code),
+                                                  false, &emulation_type);
+                        if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
+                                return -EIO;
+                } while (r == RET_PF_RETRY);
         }
 
         if (r < 0)