[PATCH v9 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()

James Houghton posted 11 patches 10 months, 1 week ago
Walk the TDP MMU in an RCU read-side critical section without holding
mmu_lock when harvesting and potentially updating age information on
sptes. This requires a way to do RCU-safe walking of the tdp_mmu_roots;
do this with a new macro. The PTE modifications are now always done
atomically.

spte_has_volatile_bits() no longer checks the Accessed bit at all. The bit
can now be set and cleared without taking the mmu_lock, but dropping
Accessed bit updates is already tolerated: the TLB is not invalidated
after clearing the Accessed bit, so accesses can be missed anyway.

If the cmpxchg for marking the spte for access tracking fails, leave the
spte as is and report it as young: if the spte is being actively modified,
it is most likely young anyway.
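
For illustration, a condensed sketch of the resulting TDP MMU aging flow
(simplified from the diff below; the early return for the test-only case
is omitted):

	guard(rcu)();	/* no mmu_lock; RCU protects roots and page tables */
	for_each_tdp_mmu_root_rcu(kvm, root, range->slot->as_id, types) {
		tdp_root_for_each_leaf_pte(iter, kvm, root, range->start, range->end) {
			if (!is_accessed_spte(iter.old_spte))
				continue;
			if (spte_ad_enabled(iter.old_spte))
				/* Atomically clear the Accessed bit. */
				tdp_mmu_clear_spte_bits_atomic(iter.sptep,
							       shadow_accessed_mask);
			else
				/*
				 * Mark the spte for access tracking; a failed
				 * cmpxchg is fine, the spte is reported young.
				 */
				__tdp_mmu_set_spte_atomic(kvm, &iter,
					mark_spte_for_access_track(iter.old_spte));
		}
	}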

Harvesting age information from the shadow MMU is still done while
holding the MMU write lock.

Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/Kconfig            |  1 +
 arch/x86/kvm/mmu/mmu.c          | 10 +++++++--
 arch/x86/kvm/mmu/spte.c         | 10 +++++++--
 arch/x86/kvm/mmu/tdp_iter.h     |  9 +++++----
 arch/x86/kvm/mmu/tdp_mmu.c      | 36 +++++++++++++++++++++++----------
 6 files changed, 48 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f378cd43241c..0e44fc1cec0d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1479,6 +1479,7 @@ struct kvm_arch {
 	 * tdp_mmu_page set.
 	 *
 	 * For reads, this list is protected by:
+	 *	RCU alone or
 	 *	the MMU lock in read mode + RCU or
 	 *	the MMU lock in write mode
 	 *
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ea2c4f21c1ca..f0a60e59c884 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -22,6 +22,7 @@ config KVM_X86
 	select KVM_COMMON
 	select KVM_GENERIC_MMU_NOTIFIER
 	select KVM_ELIDE_TLB_FLUSH_IF_YOUNG
+	select KVM_MMU_NOTIFIER_AGING_LOCKLESS
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_PFNCACHE
 	select HAVE_KVM_DIRTY_RING_TSO
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a45ae60e84ab..7779b49f386d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1592,8 +1592,11 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
 		young = kvm_rmap_age_gfn_range(kvm, range, false);
+		write_unlock(&kvm->mmu_lock);
+	}
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
@@ -1605,8 +1608,11 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
 		young = kvm_rmap_age_gfn_range(kvm, range, true);
+		write_unlock(&kvm->mmu_lock);
+	}
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 22551e2f1d00..e984b440c0f0 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -142,8 +142,14 @@ bool spte_has_volatile_bits(u64 spte)
 		return true;
 
 	if (spte_ad_enabled(spte)) {
-		if (!(spte & shadow_accessed_mask) ||
-		    (is_writable_pte(spte) && !(spte & shadow_dirty_mask)))
+		/*
+		 * Do not check the Accessed bit. It can be set (by the CPU)
+		 * and cleared (by kvm_tdp_mmu_age_spte()) without holding
+		 * the mmu_lock, but when clearing the Accessed bit, we do
+		 * not invalidate the TLB, so we can already miss Accessed bit
+		 * updates.
+		 */
+		if (is_writable_pte(spte) && !(spte & shadow_dirty_mask))
 			return true;
 	}
 
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 9135b035fa40..05e9d678aac9 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -39,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 }
 
 /*
- * SPTEs must be modified atomically if they are shadow-present, leaf
- * SPTEs, and have volatile bits, i.e. has bits that can be set outside
- * of mmu_lock.  The Writable bit can be set by KVM's fast page fault
- * handler, and Accessed and Dirty bits can be set by the CPU.
+ * SPTEs must be modified atomically if they have bits that can be set outside
+ * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
+ * Writable bit can be set by KVM's fast page fault handler, the Accessed and
+ * Dirty bits can be set by the CPU, and the Accessed and W/R/X bits can be
+ * cleared by age_gfn_range().
  *
  * Note, non-leaf SPTEs do have Accessed bits and those bits are
  * technically volatile, but KVM doesn't consume the Accessed bit of
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 046b6ba31197..c9778c3e6ecd 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -193,6 +193,19 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		     !tdp_mmu_root_match((_root), (_types)))) {			\
 		} else
 
+/*
+ * Iterate over all TDP MMU roots in an RCU read-side critical section.
+ * It is safe to iterate over the SPTEs under the root, but their values will
+ * be unstable, so all writes must be atomic. As this routine is meant to be
+ * used without holding the mmu_lock at all, any bits that are flipped must
+ * be reflected in kvm_tdp_mmu_spte_need_atomic_write().
+ */
+#define for_each_tdp_mmu_root_rcu(_kvm, _root, _as_id, _types)			\
+	list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)		\
+		if ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) ||	\
+		    !tdp_mmu_root_match((_root), (_types))) {			\
+		} else
+
 #define for_each_valid_tdp_mmu_root(_kvm, _root, _as_id)		\
 	__for_each_tdp_mmu_root(_kvm, _root, _as_id, KVM_VALID_ROOTS)
 
@@ -1332,21 +1345,22 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
  * from the clear_young() or clear_flush_young() notifier, which uses the
  * return value to determine if the page has been accessed.
  */
-static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
+static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
 {
 	u64 new_spte;
 
 	if (spte_ad_enabled(iter->old_spte)) {
-		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
-							 iter->old_spte,
-							 shadow_accessed_mask,
-							 iter->level);
+		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
+								shadow_accessed_mask);
 		new_spte = iter->old_spte & ~shadow_accessed_mask;
 	} else {
 		new_spte = mark_spte_for_access_track(iter->old_spte);
-		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
-							iter->old_spte, new_spte,
-							iter->level);
+		/*
+		 * It is safe for the following cmpxchg to fail. Leave the
+		 * Accessed bit set, as the spte is most likely young anyway.
+		 */
+		if (__tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
+			return;
 	}
 
 	trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
@@ -1371,9 +1385,9 @@ static bool __kvm_tdp_mmu_age_gfn_range(struct kvm *kvm,
 	 * valid roots!
 	 */
 	WARN_ON(types & ~KVM_VALID_ROOTS);
-	__for_each_tdp_mmu_root(kvm, root, range->slot->as_id, types) {
-		guard(rcu)();
 
+	guard(rcu)();
+	for_each_tdp_mmu_root_rcu(kvm, root, range->slot->as_id, types) {
 		tdp_root_for_each_leaf_pte(iter, kvm, root, range->start, range->end) {
 			if (!is_accessed_spte(iter.old_spte))
 				continue;
@@ -1382,7 +1396,7 @@ static bool __kvm_tdp_mmu_age_gfn_range(struct kvm *kvm,
 				return true;
 
 			ret = true;
-			kvm_tdp_mmu_age_spte(&iter);
+			kvm_tdp_mmu_age_spte(kvm, &iter);
 		}
 	}
 
-- 
2.48.1.362.g079036d154-goog
Re: [PATCH v9 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()
Posted by Sean Christopherson 10 months ago
On Tue, Feb 04, 2025, James Houghton wrote:
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 22551e2f1d00..e984b440c0f0 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -142,8 +142,14 @@ bool spte_has_volatile_bits(u64 spte)
>  		return true;
>  
>  	if (spte_ad_enabled(spte)) {
> -		if (!(spte & shadow_accessed_mask) ||
> -		    (is_writable_pte(spte) && !(spte & shadow_dirty_mask)))
> +		/*
> +		 * Do not check the Accessed bit. It can be set (by the CPU)
> +		 * and cleared (by kvm_tdp_mmu_age_spte()) without holding

When possible, don't reference functions by name in comments.  There are situations
where it's unavoidable, e.g. when calling out memory barrier pairs, but for cases
like this, it inevitably leads to stale comments.

> +		 * the mmu_lock, but when clearing the Accessed bit, we do
> +		 * not invalidate the TLB, so we can already miss Accessed bit

No "we" in comments or changelog.

> +		 * updates.
> +		 */
> +		if (is_writable_pte(spte) && !(spte & shadow_dirty_mask))
>  			return true;

This 100% belongs in a separate preparatory patch.  And if it's moved to a separate
patch, then the rename can/should happen at the same time.

And with the Accessed check gone, and looking forward to the below change, I think
this is the perfect opportunity to streamline the final check to:

	return spte_ad_enabled(spte) && is_writable_pte(spte) &&
	       !(spte & shadow_dirty_mask);

No need to send another version, I'll move things around when applying.

Also, as discussed off-list I'm 99% certain that with the lockless aging, KVM
must atomically update A/D-disabled SPTEs, as the SPTE can be access-tracked and
restored outside of mmu_lock.  E.g. a task that holds mmu_lock and is clearing
the writable bit needs to update it atomically to avoid its change from being
lost.
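
To make the race concrete, a minimal sketch (illustrative only, assuming a
plain u64 *sptep rather than KVM's actual SPTE accessors):

	/* Non-atomic RMW under mmu_lock can clobber a lockless aging update: */
	u64 old = READ_ONCE(*sptep);
	/* ... lockless aging cmpxchg's *sptep to an access-tracked value ... */
	WRITE_ONCE(*sptep, old & ~PT_WRITABLE_MASK);	/* aging's change is lost */

	/* An atomic update observes the concurrent change and retries: */
	u64 cur = READ_ONCE(*sptep);
	while (!try_cmpxchg64(sptep, &cur, cur & ~PT_WRITABLE_MASK))
		;	/* cur was refreshed by try_cmpxchg64(), just retry */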

I'll slot this in:

--
From: Sean Christopherson <seanjc@google.com>
Date: Wed, 12 Feb 2025 12:58:39 -0800
Subject: [PATCH 03/10] KVM: x86/mmu: Always update A/D-disabled SPTEs
 atomically

In anticipation of aging SPTEs outside of mmu_lock, force A/D-disabled
SPTEs to be updated atomically, as aging A/D-disabled SPTEs will mark them
for access-tracking outside of mmu_lock.  Coupled with restoring access-
tracked SPTEs in the fast page fault handler, the end result is that
A/D-disabled SPTEs will be volatile at all times.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/spte.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 9663ba867178..0f9f47b4ab0e 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -141,8 +141,11 @@ bool spte_needs_atomic_update(u64 spte)
 	if (!is_writable_pte(spte) && is_mmu_writable_spte(spte))
 		return true;
 
-	/* Access-tracked SPTEs can be restored by KVM's fast page fault handler. */
-	if (is_access_track_spte(spte))
+	/*
+	 * A/D-disabled SPTEs can be access-tracked by aging, and access-tracked
+	 * SPTEs can be restored by KVM's fast page fault handler.
+	 */
+	if (!spte_ad_enabled(spte))
 		return true;
 
 	/*
@@ -151,8 +154,7 @@ bool spte_needs_atomic_update(u64 spte)
 	 * invalidate TLBs when aging SPTEs, and so it's safe to clobber the
 	 * Accessed bit (and rare in practice).
 	 */
-	return spte_ad_enabled(spte) && is_writable_pte(spte) &&
-	       !(spte & shadow_dirty_mask);
+	return is_writable_pte(spte) && !(spte & shadow_dirty_mask);
 }
 
 bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
--
Re: [PATCH v9 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()
Posted by James Houghton 10 months ago
On Wed, Feb 12, 2025 at 2:07 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Feb 04, 2025, James Houghton wrote:
> > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > index 22551e2f1d00..e984b440c0f0 100644
> > --- a/arch/x86/kvm/mmu/spte.c
> > +++ b/arch/x86/kvm/mmu/spte.c
> > @@ -142,8 +142,14 @@ bool spte_has_volatile_bits(u64 spte)
> >               return true;
> >
> >       if (spte_ad_enabled(spte)) {
> > -             if (!(spte & shadow_accessed_mask) ||
> > -                 (is_writable_pte(spte) && !(spte & shadow_dirty_mask)))
> > +             /*
> > +              * Do not check the Accessed bit. It can be set (by the CPU)
> > +              * and cleared (by kvm_tdp_mmu_age_spte()) without holding
>
> When possible, don't reference functions by name in comments.  There are situations
> where it's unavoidable, e.g. when calling out memory barrier pairs, but for cases
> like this, it inevitably leads to stale comments.

Good point. Thanks.

>
> > +              * the mmu_lock, but when clearing the Accessed bit, we do
> > +              * not invalidate the TLB, so we can already miss Accessed bit
>
> No "we" in comments or changelog.

Ah my mistake.

> > +              * updates.
> > +              */
> > +             if (is_writable_pte(spte) && !(spte & shadow_dirty_mask))
> >                       return true;
>
> This 100% belongs in a separate preparatory patch.  And if it's moved to a separate
> patch, then the rename can/should happen at the same time.
>
> And with the Accessed check gone, and looking forward to the below change, I think
> this is the perfect opportunity to streamline the final check to:
>
>         return spte_ad_enabled(spte) && is_writable_pte(spte) &&
>                !(spte & shadow_dirty_mask);

LGTM.

> No need to send another version, I'll move things around when applying.

Thanks!

> Also, as discussed off-list I'm 99% certain that with the lockless aging, KVM
> must atomically update A/D-disabled SPTEs, as the SPTE can be access-tracked and
> restored outside of mmu_lock.  E.g. a task that holds mmu_lock and is clearing
> the writable bit needs to update it atomically to avoid its change from being
> lost.

Yeah you're right. This logic applies to A/D-enabled SPTEs too, just
that we choose to tolerate failures to update the Accessed bit. But in
the case of A/D-disabled SPTEs, we can't do that. So this makes sense.
Thanks!

> I'll slot this in:
>
> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Wed, 12 Feb 2025 12:58:39 -0800
> Subject: [PATCH 03/10] KVM: x86/mmu: Always update A/D-disabled SPTEs
>  atomically
>
> In anticipation of aging SPTEs outside of mmu_lock, force A/D-disabled
> SPTEs to be updated atomically, as aging A/D-disabled SPTEs will mark them
> for access-tracking outside of mmu_lock.  Coupled with restoring access-
> tracked SPTEs in the fast page fault handler, the end result is that
> A/D-disabled SPTEs will be volatile at all times.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: James Houghton <jthoughton@google.com>

> ---
>  arch/x86/kvm/mmu/spte.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 9663ba867178..0f9f47b4ab0e 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -141,8 +141,11 @@ bool spte_needs_atomic_update(u64 spte)
>         if (!is_writable_pte(spte) && is_mmu_writable_spte(spte))
>                 return true;
>
> -       /* Access-tracked SPTEs can be restored by KVM's fast page fault handler. */
> -       if (is_access_track_spte(spte))
> +       /*
> +        * A/D-disabled SPTEs can be access-tracked by aging, and access-tracked
> +        * SPTEs can be restored by KVM's fast page fault handler.
> +        */
> +       if (!spte_ad_enabled(spte))
>                 return true;
>
>         /*
> @@ -151,8 +154,7 @@ bool spte_needs_atomic_update(u64 spte)
>          * invalidate TLBs when aging SPTEs, and so it's safe to clobber the
>          * Accessed bit (and rare in practice).
>          */
> -       return spte_ad_enabled(spte) && is_writable_pte(spte) &&
> -              !(spte & shadow_dirty_mask);
> +       return is_writable_pte(spte) && !(spte & shadow_dirty_mask);
>  }
>
>  bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> --