From: Sean Christopherson <seanjc@google.com>
Centralize the updates to present external PTEs to the
handle_changed_spte() function.
When setting a PTE to present in the mirror page tables, the update needs
to propagate to the external page tables (in TDX parlance the S-EPT).
Today this is handled by special mirror page table branching in
__tdp_mmu_set_spte_atomic(), which is the only place where present PTEs
are set for TDX.
This keeps things running, but is a bit hacked on. The hook for setting
present leaf PTEs is added only where TDX happens to need it. For
example, TDX does not support any of the operations that use the
non-atomic variant, tdp_mmu_set_spte(), to set present PTEs. Since the hook
is missing there, it is very hard to understand the code through a non-TDX
lens. If the reader doesn't know the TDX specifics, it could look like the
external update is missing.
In addition to being confusing, it also litters the TDP MMU with
"external" update callbacks. This is especially unfortunate because there
is already a central place to react to TDP updates, handle_changed_spte().
Begin the process of moving towards a model where all mirror page table
updates are forwarded to TDX code where the TDX specific logic can live
with a more proper separation of concerns. Do this by teaching
handle_changed_spte() how to return error codes, such that it can
propagate the failures that may come from TDX external page table updates.
Atomic mirror page table updates need to be done in a special way to
prevent concurrent updates to the mirror page table while the external
page table is updated. The mirror page table is set to the frozen PTE
value while the external version is updates. This frozen PTE dance is
currently done in __tdp_mmu_set_spte_atomic(). Hoist it up a level so that
the external update in handle_changed_spte() can be done while the PTE is
frozen.
Link: https://lore.kernel.org/lkml/aYYn0nf2cayYu8e7@google.com/
Not-yet-Signed-off-by: Sean Christopherson <seanjc@google.com>
[Based on a diff by Sean Christopherson]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
arch/x86/kvm/mmu/tdp_mmu.c | 150 ++++++++++++++++++++++---------------
1 file changed, 88 insertions(+), 62 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 338957bc5109..db16e81b9701 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -320,9 +320,9 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
}
}
-static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
- u64 old_spte, u64 new_spte, int level,
- bool shared);
+static void handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
+ gfn_t gfn, u64 old_spte, u64 new_spte,
+ int level, bool shared);
static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
@@ -471,8 +471,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
FROZEN_SPTE, level);
}
- handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
- old_spte, FROZEN_SPTE, level, shared);
+ handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
if (is_mirror_sp(sp)) {
KVM_BUG_ON(shared, kvm);
@@ -508,22 +507,15 @@ static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
return NULL;
}
-static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
- gfn_t gfn, u64 *old_spte,
- u64 new_spte, int level)
+static int __must_check set_external_spte_present(struct kvm *kvm,
+ gfn_t gfn, u64 old_spte,
+ u64 new_spte, int level)
{
bool is_present = is_shadow_present_pte(new_spte);
bool is_leaf = is_present && is_last_spte(new_spte, level);
int ret = 0;
lockdep_assert_held(&kvm->mmu_lock);
- /*
- * We need to lock out other updates to the SPTE until the external
- * page table has been modified. Use FROZEN_SPTE similar to
- * the zapping case.
- */
- if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
- return -EBUSY;
/*
* Use different call to either set up middle level
@@ -537,17 +529,13 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
KVM_BUG_ON(!external_spt, kvm);
ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
}
- if (ret)
- __kvm_tdp_mmu_write_spte(sptep, *old_spte);
- else
- __kvm_tdp_mmu_write_spte(sptep, new_spte);
return ret;
}
/**
- * handle_changed_spte - handle bookkeeping associated with an SPTE change
+ * __handle_changed_spte - handle bookkeeping associated with an SPTE change
* @kvm: kvm instance
- * @as_id: the address space of the paging structure the SPTE was a part of
+ * @sp: the page table in which the SPTE resides
* @gfn: the base GFN that was mapped by the SPTE
* @old_spte: The value of the SPTE before the change
* @new_spte: The value of the SPTE after the change
@@ -560,15 +548,17 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
* dirty logging updates are handled in common code, not here (see make_spte()
* and fast_pf_fix_direct_spte()).
*/
-static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
- u64 old_spte, u64 new_spte, int level,
- bool shared)
+static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
+ gfn_t gfn, u64 old_spte, u64 new_spte,
+ int level, bool shared)
{
bool was_present = is_shadow_present_pte(old_spte);
bool is_present = is_shadow_present_pte(new_spte);
bool was_leaf = was_present && is_last_spte(old_spte, level);
bool is_leaf = is_present && is_last_spte(new_spte, level);
bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+ int as_id = kvm_mmu_page_as_id(sp);
+ int r;
WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
WARN_ON_ONCE(level < PG_LEVEL_4K);
@@ -598,9 +588,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
}
if (old_spte == new_spte)
- return;
-
- trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
+ return 0;
if (is_leaf)
check_spte_writable_invariants(new_spte);
@@ -627,21 +615,41 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
"a temporary frozen SPTE.\n"
"as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
as_id, gfn, old_spte, new_spte, level);
- return;
+ return 0;
}
- if (is_leaf != was_leaf)
- kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
-
/*
* Recursively handle child PTs if the change removed a subtree from
* the paging structure. Note the WARN on the PFN changing without the
* SPTE being converted to a hugepage (leaf) or being zapped. Shadow
* pages are kernel allocations and should never be migrated.
+ *
+ * When modifying leaf entries in mirrored page tables, propagate all
+ * changes to the external SPTE.
*/
if (was_present && !was_leaf &&
- (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
+ (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
+ } else if (is_mirror_sp(sp) && is_present) {
+ r = set_external_spte_present(kvm, gfn, old_spte, new_spte, level);
+ if (r)
+ return r;
+ }
+
+ trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
+
+ if (is_leaf != was_leaf)
+ kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
+
+ return 0;
+}
+
+static void handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
+ gfn_t gfn, u64 old_spte, u64 new_spte,
+ int level, bool shared)
+{
+ KVM_BUG_ON(__handle_changed_spte(kvm, sp, gfn, old_spte, new_spte,
+ level, shared), kvm);
}
static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
@@ -657,32 +665,14 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
/*
- * FROZEN_SPTE is a temporary state and should never be set via higher
- * level helpers.
+ * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
+ * does not hold the mmu_lock. On failure, i.e. if a different logical
+ * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
+ * the current value, so the caller operates on fresh data, e.g. if it
+ * retries tdp_mmu_set_spte_atomic()
*/
- KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
-
- if (is_mirror_sptep(iter->sptep)) {
- int ret;
-
- ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
- &iter->old_spte, new_spte, iter->level);
- if (ret)
- return ret;
- } else {
- u64 *sptep = rcu_dereference(iter->sptep);
-
- /*
- * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs
- * and does not hold the mmu_lock. On failure, i.e. if a
- * different logical CPU modified the SPTE, try_cmpxchg64()
- * updates iter->old_spte with the current value, so the caller
- * operates on fresh data, e.g. if it retries
- * tdp_mmu_set_spte_atomic()
- */
- if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
- return -EBUSY;
- }
+ if (!try_cmpxchg64(rcu_dereference(iter->sptep), &iter->old_spte, new_spte))
+ return -EBUSY;
return 0;
}
@@ -708,18 +698,49 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
struct tdp_iter *iter,
u64 new_spte)
{
+ struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
int ret;
lockdep_assert_held_read(&kvm->mmu_lock);
- ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
+ /* KVM should never freeze SPTEs using higher level APIs. */
+ KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
+
+ /*
+ * Temporarily freeze the SPTE until the external PTE operation has
+ * completed (unless the new SPTE itself will be frozen), e.g. so that
+ * concurrent faults don't attempt to install a child PTE in the
+ * external page table before the parent PTE has been written, or try
+ * to re-install a page table before the old one was removed.
+ */
+ if (is_mirror_sptep(iter->sptep))
+ ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
+ else
+ ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
if (ret)
return ret;
- handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
- new_spte, iter->level, true);
+ ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
+ new_spte, iter->level, true);
- return 0;
+ /*
+ * Unfreeze the mirror SPTE. If updating the external SPTE failed,
+ * restore the old SPTE so that the SPTE isn't frozen in perpetuity,
+ * otherwise set the mirror SPTE to the new desired value.
+ */
+ if (is_mirror_sptep(iter->sptep)) {
+ if (ret)
+ __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
+ else
+ __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
+ } else {
+ /*
+ * Bug the VM if handling the change failed, as failure is only
+ * allowed if KVM couldn't update the external SPTE.
+ */
+ KVM_BUG_ON(ret, kvm);
+ }
+ return ret;
}
/*
@@ -738,6 +759,8 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
u64 old_spte, u64 new_spte, gfn_t gfn, int level)
{
+ struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(sptep));
+
lockdep_assert_held_write(&kvm->mmu_lock);
/*
@@ -751,7 +774,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
- handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
+ handle_changed_spte(kvm, sp, gfn, old_spte, new_spte, level, false);
/*
* Users that do non-atomic setting of PTEs don't operate on mirror
@@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
{
u64 new_spte;
+ if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
+ return;
+
if (spte_ad_enabled(iter->old_spte)) {
iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
shadow_accessed_mask);
--
2.53.0
Below are responses to issues reported by sashiko. [1]
[1] https://sashiko.dev/#/patchset/20260327201421.2824383-1-rick.p.edgecombe%40intel.com
> @@ -627,21 +615,41 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> "a temporary frozen SPTE.\n"
> "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
> as_id, gfn, old_spte, new_spte, level);
> - return;
> + return 0;
> }
>
> - if (is_leaf != was_leaf)
> - kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
> -
> /*
> * Recursively handle child PTs if the change removed a subtree from
> * the paging structure. Note the WARN on the PFN changing without the
> * SPTE being converted to a hugepage (leaf) or being zapped. Shadow
> * pages are kernel allocations and should never be migrated.
> + *
> + * When modifying leaf entries in mirrored page tables, propagate all
> + * changes to the external SPTE.
> */
> if (was_present && !was_leaf &&
> - (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> + (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
> handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> + } else if (is_mirror_sp(sp) && is_present) {
> + r = set_external_spte_present(kvm, gfn, old_spte, new_spte, level);
> + if (r)
> + return r;
> + }
Issue #1:
"If a present non-leaf SPTE is replaced with a present leaf SPTE, such as
during a hugepage collapse, the first branch executes and calls
handle_removed_pt(). Because of the mutually exclusive "else if",
set_external_spte_present() is bypassed for the new leaf SPTE. Could this
leave the external S-EPT out of sync, with the parent entry pointing to a
freed child page table?
"
Response:
This is the page merging scenario. The mirror root does not allow huge pages yet,
so maybe nothing is required for now.
After the basic TDX huge page series, page merging is also disabled [4].
See more in the responses to issues #2 and #3.
Issue #2:
"Additionally, if an SPTE is atomically zapped, is_present evaluates to false.
Because tdp_mmu_set_spte_atomic() lacks a fallback remove_external_spte()
call like the non-atomic tdp_mmu_set_spte() has, does this mean the external
S-EPT mapping remains fully present while KVM marks it as non-present?
"
Response:
Up to this patch, this seems to be a valid concern. But atomic zap on the mirror
root isn't allowed yet (though the warning on such a case is dropped in patch 5 [2]).
Centralizing atomic zap in __handle_changed_spte() will be done in patch 15 [3].
Maybe move patch 5 to after patch 15?
[2] https://lore.kernel.org/kvm/20260327201421.2824383-6-rick.p.edgecombe@intel.com/
[3] https://lore.kernel.org/kvm/20260327201421.2824383-16-rick.p.edgecombe@intel.com/
[4] https://lore.kernel.org/all/20260106102024.25023-1-yan.y.zhao@intel.com/
Thinking more about centralizing TDX hooks, could we be more aggressive? I.e.,
let TDX just have a single hook, set_external_spte(), for propagating changes
from the mirror page table to the S-EPT?
(The change below is based on a code base with TDX huge page support.)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 13910d9af03a..d9404aeae5f3 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -95,7 +95,6 @@ KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
KVM_X86_OP_OPTIONAL_RET0(set_external_spte)
-KVM_X86_OP_OPTIONAL(reclaim_external_spt)
KVM_X86_OP_OPTIONAL_RET0(topup_external_cache)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 76f119a6412b..b722cce0b994 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1857,11 +1857,6 @@ struct kvm_x86_ops {
int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, u64 old_spte,
u64 new_spte, enum pg_level level);
- /* Update external page tables for page table about to be freed. */
- void (*reclaim_external_spt)(struct kvm *kvm, gfn_t gfn,
- struct kvm_mmu_page *sp);
-
-
int (*topup_external_cache)(struct kvm *kvm, struct kvm_vcpu *vcpu, int min_nr_spts);
bool (*has_wbinvd_exit)(void);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2f6bfc875de1..9b4fa5b31f23 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -461,9 +461,6 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
}
- if (is_mirror_sp(sp))
- kvm_x86_call(reclaim_external_spt)(kvm, base_gfn, sp);
-
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
@@ -563,9 +560,17 @@ static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
* changes to the external SPTE.
*/
if (was_present && !was_leaf &&
- (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
+ (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
- } else if (is_mirror_sp(sp)) {
+
+ if (is_mirror_sp(sp)) {
+ /*
+ * Can also propagate changes to remove external pt. Since this
+ * occurs after the call_rcu() in handle_removed_pt(), the RCU
+ * read lock must be held.
+ */
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu read lock held");
+
r = kvm_x86_call(set_external_spte)(kvm, gfn, old_spte,
new_spte, level);
if (r)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 32d0c206ade6..cc03b9c7838c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1965,9 +1965,13 @@ static int tdx_sept_map_leaf_spte(struct kvm *kvm, gfn_t gfn, enum pg_level leve
return ret;
}
-static void tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
+static int tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
struct kvm_mmu_page *sp)
{
+ int ret;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
/*
* KVM doesn't (yet) zap page table pages in mirror page table while
* TD is active, though guest pages mapped in mirror page table could be
@@ -1981,8 +1985,10 @@ static void tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
* the page to prevent the kernel from accessing the encrypted page.
*/
if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
- tdx_reclaim_page(virt_to_page(sp->external_spt)))
+ tdx_reclaim_page(virt_to_page(sp->external_spt))) {
+ ret = -EIO;
goto out;
+ }
/*
* Immediately free the S-EPT page as the TDX subsystem doesn't support
@@ -1990,8 +1996,10 @@ static void tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
* TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding readers.
*/
free_page((unsigned long)sp->external_spt);
+ ret = 0;
out:
sp->external_spt = NULL;
+ return ret;
}
static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
@@ -2045,10 +2053,17 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
u64 new_spte, enum pg_level level)
{
- if (is_shadow_present_pte(old_spte) && is_shadow_present_pte(new_spte))
- return tdx_sept_split_private_spte(kvm, gfn, old_spte, new_spte, level);
- else if (is_shadow_present_pte(old_spte))
- return tdx_sept_remove_private_spte(kvm, gfn, level, old_spte);
+ if (is_shadow_present_pte(old_spte)) {
+ if (is_shadow_present_pte(new_spte)) {
+ if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
+ return tdx_sept_split_private_spte(kvm, gfn, old_spte, new_spte, level);
+ } else {
+ if (is_last_spte(old_spte, level))
+ return tdx_sept_remove_private_spte(kvm, gfn, level, old_spte);
+ else
+ return tdx_sept_reclaim_private_spt(kvm, gfn, spte_to_child_sp(old_spte));
+ }
+ }
if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
return -EIO;
@@ -3915,7 +3930,6 @@ void __init tdx_hardware_setup(void)
vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size, sizeof(struct kvm_tdx));
vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
- vt_x86_ops.reclaim_external_spt = tdx_sept_reclaim_private_spt;
vt_x86_ops.gmem_convert = tdx_gmem_convert;
> @@ -708,18 +698,49 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
> struct tdp_iter *iter,
> u64 new_spte)
> {
> + struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
> int ret;
>
> lockdep_assert_held_read(&kvm->mmu_lock);
>
> - ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> + /* KVM should never freeze SPTEs using higher level APIs. */
> + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> +
> + /*
> + * Temporarily freeze the SPTE until the external PTE operation has
> + * completed (unless the new SPTE itself will be frozen), e.g. so that
> + * concurrent faults don't attempt to install a child PTE in the
> + * external page table before the parent PTE has been written, or try
> + * to re-install a page table before the old one was removed.
> + */
> + if (is_mirror_sptep(iter->sptep))
> + ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> + else
> + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> if (ret)
> return ret;
>
> - handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> - new_spte, iter->level, true);
> + ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> + new_spte, iter->level, true);
>
> - return 0;
> + /*
> + * Unfreeze the mirror SPTE. If updating the external SPTE failed,
> + * restore the old SPTE so that the SPTE isn't frozen in perpetuity,
> + * otherwise set the mirror SPTE to the new desired value.
> + */
> + if (is_mirror_sptep(iter->sptep)) {
> + if (ret)
> + __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
> + else
> + __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
Issue 3:
"
If the mutually exclusive branches in __handle_changed_spte() are updated
so both can run, there seems to be a risk during rollback. The function
handle_removed_pt() unlinks the child page table and schedules it for RCU
destruction. If set_external_spte_present() then fails, the rollback
code here restores iter->old_spte. Since iter->old_spte points to the child
page table that was just scheduled for freeing, can this cause a
use-after-free when the old SPTE is restored?
"
Response:
As in the response to issue #1, changes from non-leaf present to leaf present
in the mirror page table are not valid. We would need special handling in the
TDX MMU to support such a scenario in the future, e.g., TDX requires that all
child entries be either all mapped or all unmapped before merging.
Maybe the TDX huge page series could skip handle_removed_pt() for the page
merging case first, if more robustness is preferred to guard against unexpected
page merging.
Issue 4:
"
Furthermore, tdp_mmu_set_spte_atomic() now returns errors from
__handle_changed_spte(). Previously, it only returned an error (like -EBUSY)
if the try_cmpxchg64() failed, which natively updated iter->old_spte to the
new memory value.
If try_cmpxchg64() succeeds but __handle_changed_spte() fails, iter->old_spte
is not updated. Callers like clear_dirty_gfn_range() loop on failure assuming
the old value was refreshed:
clear_dirty_gfn_range() {
...
retry:
...
if (tdp_mmu_set_spte_atomic(kvm, &iter, iter.old_spte & ~dbit))
goto retry;
...
}
Will this result in an infinite loop, as the caller retries the exact same
atomic operation with the unchanged iter.old_spte?
"
Response:
Though the only expected error from __handle_changed_spte() is also -EBUSY, an
infinite loop is possible if updating the external page table constantly fails
due to contention.
However, it's currently benign since callers like clear_dirty_gfn_range() can't
operate on the mirror page table.
> @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
> {
> u64 new_spte;
>
> + if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> + return;
> +
> if (spte_ad_enabled(iter->old_spte)) {
> iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> shadow_accessed_mask);
AFAICT this chunk isn't related to "Centralize updates to present external
PTEs"? Should it be in a separate patch?
On Tue, 2026-03-31 at 10:09 +0000, Huang, Kai wrote:
> > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm,
> > struct tdp_iter *iter)
> > {
> > u64 new_spte;
> >
> > + if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > + return;
> > +
> > if (spte_ad_enabled(iter->old_spte)) {
> > 		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> > 								shadow_accessed_mask);
>
> AFAICT this chunk isn't related to "Centralize updates to present external
> PTEs"? Should it be in a separate patch?
It was because we would lose warning coverage in this patch for the aging case.
The diff was from Sean, but my take is that this just maintains coverage that
already existed.
On Fri, Mar 27, 2026 at 01:14:11PM -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Centralize the updates to present external PTEs to the
> handle_changed_spte() function.
>
> When setting a PTE to present in the mirror page tables, the update needs
> to propagate to the external page tables (in TDX parlance the S-EPT).
> Today this is handled by special mirror page tables branching in
> __tdp_mmu_set_spte_atomic(), which is the only place where present PTEs
> are set for TDX.
>
> This keeps things running, but is a bit hacked on. The hook for setting
> present leaf PTEs are added only where TDX happens to need them. For
> example, TDX does not support any of the operations that use the
> non-atomic variant, tdp_mmu_set_spte() to set present PTEs. Since the hook
> is missing there, it is very hard to understand the code from a non-TDX
> lens. If the reader doesn’t know the TDX specifics it could look like the
> external update is missing.
>
> In addition to being confusing, it also litters the TDP MMU with
> "external" update callbacks. This is especially unfortunate because there
> is already a central place to react to TDP updates, handle_changed_spte().
>
> Begin the process of moving towards a model where all mirror page table
> updates are forwarded to TDX code where the TDX specific logic can live
> with a more proper separation of concerns. Do this by teaching
> handle_changed_spte() how to return error codes, such that it can
> propagate the failures that may come from TDX external page table updates.
>
> Atomic mirror page table updates need to be done in a special way to
> prevent concurrent updates to the mirror page table while the external
> page table is updated. The mirror page table is set to the frozen PTE
> value while the external version is updates. This frozen PTE dance is
"is updates" --> "is updating"?
> currently done in __tdp_mmu_set_spte_atomic(). Hoist it up a level so that
> the external update in handle_changed_spte() can be done while the PTE is
> frozen.
>
> Link: https://lore.kernel.org/lkml/aYYn0nf2cayYu8e7@google.com/
> Not-yet-Signed-off-by: Sean Christopherson <seanjc@google.com>
> [Based on a diff by Sean Chrisopherson]
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 150 ++++++++++++++++++++++---------------
> 1 file changed, 88 insertions(+), 62 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 338957bc5109..db16e81b9701 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -320,9 +320,9 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
> }
> }
>
> -static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> - u64 old_spte, u64 new_spte, int level,
> - bool shared);
> +static void handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
> + gfn_t gfn, u64 old_spte, u64 new_spte,
> + int level, bool shared);
>
> static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> @@ -471,8 +471,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
> FROZEN_SPTE, level);
> }
> - handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> - old_spte, FROZEN_SPTE, level, shared);
> + handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
>
> if (is_mirror_sp(sp)) {
> KVM_BUG_ON(shared, kvm);
> @@ -508,22 +507,15 @@ static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
> return NULL;
> }
>
> -static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
> - gfn_t gfn, u64 *old_spte,
> - u64 new_spte, int level)
> +static int __must_check set_external_spte_present(struct kvm *kvm,
> + gfn_t gfn, u64 old_spte,
> + u64 new_spte, int level)
> {
> bool is_present = is_shadow_present_pte(new_spte);
> bool is_leaf = is_present && is_last_spte(new_spte, level);
> int ret = 0;
>
> lockdep_assert_held(&kvm->mmu_lock);
> - /*
> - * We need to lock out other updates to the SPTE until the external
> - * page table has been modified. Use FROZEN_SPTE similar to
> - * the zapping case.
> - */
> - if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
> - return -EBUSY;
>
> /*
> * Use different call to either set up middle level
> @@ -537,17 +529,13 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> KVM_BUG_ON(!external_spt, kvm);
> ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
> }
> - if (ret)
> - __kvm_tdp_mmu_write_spte(sptep, *old_spte);
> - else
> - __kvm_tdp_mmu_write_spte(sptep, new_spte);
> return ret;
> }
>
> /**
> - * handle_changed_spte - handle bookkeeping associated with an SPTE change
> + * __handle_changed_spte - handle bookkeeping associated with an SPTE change
> * @kvm: kvm instance
> - * @as_id: the address space of the paging structure the SPTE was a part of
> + * @sp: the page table in which the SPTE resides
> * @gfn: the base GFN that was mapped by the SPTE
> * @old_spte: The value of the SPTE before the change
> * @new_spte: The value of the SPTE after the change
> @@ -560,15 +548,17 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> * dirty logging updates are handled in common code, not here (see make_spte()
> * and fast_pf_fix_direct_spte()).
> */
> -static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> - u64 old_spte, u64 new_spte, int level,
> - bool shared)
> +static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
> + gfn_t gfn, u64 old_spte, u64 new_spte,
> + int level, bool shared)
> {
> bool was_present = is_shadow_present_pte(old_spte);
> bool is_present = is_shadow_present_pte(new_spte);
> bool was_leaf = was_present && is_last_spte(old_spte, level);
> bool is_leaf = is_present && is_last_spte(new_spte, level);
> bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
> + int as_id = kvm_mmu_page_as_id(sp);
> + int r;
>
> WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
> WARN_ON_ONCE(level < PG_LEVEL_4K);
> @@ -598,9 +588,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> }
>
> if (old_spte == new_spte)
> - return;
> -
> - trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
> + return 0;
>
> if (is_leaf)
> check_spte_writable_invariants(new_spte);
> @@ -627,21 +615,41 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> "a temporary frozen SPTE.\n"
> "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
> as_id, gfn, old_spte, new_spte, level);
> - return;
> + return 0;
> }
>
> - if (is_leaf != was_leaf)
> - kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
> -
> /*
> * Recursively handle child PTs if the change removed a subtree from
> * the paging structure. Note the WARN on the PFN changing without the
> * SPTE being converted to a hugepage (leaf) or being zapped. Shadow
> * pages are kernel allocations and should never be migrated.
> + *
> + * When modifying leaf entries in mirrored page tables, propagate all
> + * changes to the external SPTE.
> */
> if (was_present && !was_leaf &&
> - (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> + (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
> handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> + } else if (is_mirror_sp(sp) && is_present) {
> + r = set_external_spte_present(kvm, gfn, old_spte, new_spte, level);
> + if (r)
> + return r;
> + }
> +
> + trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
> +
> + if (is_leaf != was_leaf)
> + kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
> +
> + return 0;
> +}
> +
> +static void handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
> + gfn_t gfn, u64 old_spte, u64 new_spte,
> + int level, bool shared)
> +{
> + KVM_BUG_ON(__handle_changed_spte(kvm, sp, gfn, old_spte, new_spte,
> + level, shared), kvm);
> }
>
> static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> @@ -657,32 +665,14 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
>
> /*
> - * FROZEN_SPTE is a temporary state and should never be set via higher
> - * level helpers.
> + * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
> + * does not hold the mmu_lock. On failure, i.e. if a different logical
> + * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
> + * the current value, so the caller operates on fresh data, e.g. if it
> + * retries tdp_mmu_set_spte_atomic()
> */
> - KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> -
> - if (is_mirror_sptep(iter->sptep)) {
> - int ret;
> -
> - ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
> - &iter->old_spte, new_spte, iter->level);
> - if (ret)
> - return ret;
> - } else {
> - u64 *sptep = rcu_dereference(iter->sptep);
> -
> - /*
> - * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs
> - * and does not hold the mmu_lock. On failure, i.e. if a
> - * different logical CPU modified the SPTE, try_cmpxchg64()
> - * updates iter->old_spte with the current value, so the caller
> - * operates on fresh data, e.g. if it retries
> - * tdp_mmu_set_spte_atomic()
> - */
> - if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
> - return -EBUSY;
> - }
> + if (!try_cmpxchg64(rcu_dereference(iter->sptep), &iter->old_spte, new_spte))
> + return -EBUSY;
>
> return 0;
> }
> @@ -708,18 +698,49 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
> struct tdp_iter *iter,
> u64 new_spte)
> {
> + struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
> int ret;
>
> lockdep_assert_held_read(&kvm->mmu_lock);
>
> - ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> + /* KVM should never freeze SPTEs using higher level APIs. */
"higher level API" is ambiguous. e.g. kvm_tdp_mmu_write_spte_atomic() allows
new_spte to be FROZEN_SPTE.
What about just "callers of tdp_mmu_set_spte_atomic() should not freeze SPTEs
directly"?
> + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> +
> + /*
> + * Temporarily freeze the SPTE until the external PTE operation has
> + * completed (unless the new SPTE itself will be frozen), e.g. so that
> + * concurrent faults don't attempt to install a child PTE in the
> + * external page table before the parent PTE has been written, or try
> + * to re-install a page table before the old one was removed.
> + */
> + if (is_mirror_sptep(iter->sptep))
> + ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> + else
> + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> if (ret)
> return ret;
>
> - handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> - new_spte, iter->level, true);
> + ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> + new_spte, iter->level, true);
What about adding a comment for the tricky part for the mirror page table:
while new_spte is set to FROZEN_SPTE in the above __tdp_mmu_set_spte_atomic()
for freezing the mirror page table, the original new_spte from the caller of
tdp_mmu_set_spte_atomic() is passed to __handle_changed_spte() in order to
properly update statistics and propagate to the external page table.
> - return 0;
> + /*
> + * Unfreeze the mirror SPTE. If updating the external SPTE failed,
> + * restore the old SPTE so that the SPTE isn't frozen in perpetuity,
> + * otherwise set the mirror SPTE to the new desired value.
> + */
> + if (is_mirror_sptep(iter->sptep)) {
> + if (ret)
> + __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
> + else
> + __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> + } else {
> + /*
> + * Bug the VM if handling the change failed, as failure is only
> + * allowed if KVM couldn't update the external SPTE.
> + */
> + KVM_BUG_ON(ret, kvm);
> + }
> + return ret;
> }
>
> /*
> @@ -738,6 +759,8 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
> static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> u64 old_spte, u64 new_spte, gfn_t gfn, int level)
> {
> + struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(sptep));
> +
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> /*
> @@ -751,7 +774,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
>
> old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
>
> - handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> + handle_changed_spte(kvm, sp, gfn, old_spte, new_spte, level, false);
>
> /*
> * Users that do non-atomic setting of PTEs don't operate on mirror
> @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
> {
> u64 new_spte;
>
> + if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> + return;
> +
Add a comment for why mirror page table is not expected here?
And do we need a similar WARN_ON_ONCE() in kvm_tdp_mmu_clear_dirty_pt_masked()
or clear_dirty_pt_masked()?
> if (spte_ad_enabled(iter->old_spte)) {
> iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> shadow_accessed_mask);
> --
> 2.53.0
>
On Mon, 2026-03-30 at 14:14 +0800, Yan Zhao wrote:
> > u64 new_spte)
> > {
> > + struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
> > int ret;
> >
> > lockdep_assert_held_read(&kvm->mmu_lock);
> >
> > - ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > + /* KVM should never freeze SPTEs using higher level APIs. */
> "higher level API" is ambiguous. e.g. kvm_tdp_mmu_write_spte_atomic() allows
> new_spte to be FROZEN_SPTE.
Yea you are right. It felt too fuzzy but I couldn't think of a better word.
>
> What about just "callers of tdp_mmu_set_spte_atomic() should not freeze SPTEs
> directly"?
Sure.
>
> > + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > +
> > + /*
> > + * Temporarily freeze the SPTE until the external PTE operation has
> > + * completed (unless the new SPTE itself will be frozen), e.g. so
> > that
> > + * concurrent faults don't attempt to install a child PTE in the
> > + * external page table before the parent PTE has been written, or
> > try
> > + * to re-install a page table before the old one was removed.
> > + */
> > + if (is_mirror_sptep(iter->sptep))
> > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> > + else
> > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > if (ret)
> > return ret;
> >
> > - handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > - new_spte, iter->level, true);
> > + ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > + new_spte, iter->level, true);
>
> What about adding a comment for the tricky part for the mirror page table:
> while new_spte is set to FROZEN_SPTE in the above __tdp_mmu_set_spte_atomic()
You meant it sets iter->sptep I think.
> for freezing the mirror page table, the original new_spte from the caller of
> tdp_mmu_set_spte_atomic() is passed to __handle_changed_spte() in order to
> properly update statistics and propagate to the external page table.
new_spte was already passed in. What changed? You mean that
__tdp_mmu_set_spte_atomic() sets iter->sptep and doesn't update new_spte? If so
I'm not sure how it affects the TDP MMU.
>
> > - return 0;
> > + /*
> > + * Unfreeze the mirror SPTE. If updating the external SPTE failed,
> > + * restore the old SPTE so that the SPTE isn't frozen in
> > perpetuity,
> > + * otherwise set the mirror SPTE to the new desired value.
> > + */
> > + if (is_mirror_sptep(iter->sptep)) {
> > + if (ret)
> > + __kvm_tdp_mmu_write_spte(iter->sptep, iter-
> > >old_spte);
> > + else
> > + __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> > + } else {
> > + /*
> > + * Bug the VM if handling the change failed, as failure is
> > only
> > + * allowed if KVM couldn't update the external SPTE.
> > + */
> > + KVM_BUG_ON(ret, kvm);
> > + }
> > + return ret;
> > }
> >
> > /*
> > @@ -738,6 +759,8 @@ static inline int __must_check
> > tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > u64 old_spte, u64 new_spte, gfn_t gfn, int
> > level)
> > {
> > + struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(sptep));
> > +
> > lockdep_assert_held_write(&kvm->mmu_lock);
> >
> > /*
> > @@ -751,7 +774,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> > tdp_ptep_t sptep,
> >
> > old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte,
> > level);
> >
> > - handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> > false);
> > + handle_changed_spte(kvm, sp, gfn, old_spte, new_spte, level,
> > false);
> >
> > /*
> > * Users that do non-atomic setting of PTEs don't operate on mirror
> > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm,
> > struct tdp_iter *iter)
> > {
> > u64 new_spte;
> >
> > + if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > + return;
> > +
> Add a comment for why mirror page table is not expected here?
Ehh, maybe. Thinking about what to put... The warning is kind of cheating a
little bit on the idea of the patch: to forward all changes through limited ops
in a central place, such that we don't have TDX specifics encoded in core MMU.
Trying to forward this through properly would result in more burden to the TDP
MMU, so that's not the right answer either.
"Mirror TDP doesn't support PTE aging" is a pretty obvious comment. I'm fine
just leaving it without comment, but I can add something like that. Or do you
have another suggestion?
>
> And do we need a similar WARN_ON_ONCE() in kvm_tdp_mmu_clear_dirty_pt_masked()
> or clear_dirty_pt_masked()?
Nothing changes for those in this patch though? For the kvm_tdp_mmu_age_spte()
case, warning coverage is removed in this patch.
>
> > if (spte_ad_enabled(iter->old_spte)) {
> > > iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> > >                                                 shadow_accessed_mask);
> > --
> > 2.53.0
On Thu, Apr 02, 2026 at 07:45:54AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2026-03-30 at 14:14 +0800, Yan Zhao wrote:
> > > + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > > +
> > > + /*
> > > + * Temporarily freeze the SPTE until the external PTE operation has
> > > + * completed (unless the new SPTE itself will be frozen), e.g. so
> > > that
> > > + * concurrent faults don't attempt to install a child PTE in the
> > > + * external page table before the parent PTE has been written, or
> > > try
> > > + * to re-install a page table before the old one was removed.
> > > + */
> > > + if (is_mirror_sptep(iter->sptep))
> > > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> > > + else
> > > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > > if (ret)
> > > return ret;
> > >
> > > - handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > > - new_spte, iter->level, true);
> > > + ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > > + new_spte, iter->level, true);
> >
> > What about adding a comment for the tricky part for the mirror page table:
> > while new_spte is set to FROZEN_SPTE in the above __tdp_mmu_set_spte_atomic()
>
> You meant it sets iter->sptep I think.
>
> > for freezing the mirror page table, the original new_spte from the caller of
> > tdp_mmu_set_spte_atomic() is passed to __handle_changed_spte() in order to
> > properly update statistics and propagate to the external page table.
>
> new_spte was already passed in. What changed? You mean that
> __tdp_mmu_set_spte_atomic() sets iter->sptep and doesn't update new_spte? If so
> I'm not sure how it affects the TDP MMU.
For mirror page table, a successful path in tdp_mmu_set_spte_atomic() looks
like this:
tdp_mmu_set_spte_atomic() {
        __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);   ==> sets mirror to FROZEN_SPTE
        __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
                              new_spte, iter->level, true);  ==> sets S-EPT to new_spte
        __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);     ==> sets mirror to new_spte
}
So, the tricky part is that S-EPT is updated to new_spte ahead of mirror EPT.
> > > - return 0;
> > > + /*
> > > + * Unfreeze the mirror SPTE. If updating the external SPTE failed,
> > > + * restore the old SPTE so that the SPTE isn't frozen in
> > > perpetuity,
> > > + * otherwise set the mirror SPTE to the new desired value.
> > > + */
> > > + if (is_mirror_sptep(iter->sptep)) {
> > > + if (ret)
> > > + __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
> > > + else
> > > + __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> > > + } else {
> > > + /*
> > > + * Bug the VM if handling the change failed, as failure is
> > > only
> > > + * allowed if KVM couldn't update the external SPTE.
> > > + */
> > > + KVM_BUG_ON(ret, kvm);
> > > + }
> > > + return ret;
> > > }
> > >
> > > /*
> > > @@ -738,6 +759,8 @@ static inline int __must_check
> > > tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > > static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > > u64 old_spte, u64 new_spte, gfn_t gfn, int
> > > level)
> > > {
> > > + struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(sptep));
> > > +
> > > lockdep_assert_held_write(&kvm->mmu_lock);
> > >
> > > /*
> > > @@ -751,7 +774,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> > > tdp_ptep_t sptep,
> > >
> > > old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte,
> > > level);
> > >
> > > - handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> > > false);
> > > + handle_changed_spte(kvm, sp, gfn, old_spte, new_spte, level,
> > > false);
> > >
> > > /*
> > > * Users that do non-atomic setting of PTEs don't operate on mirror
> > > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm,
> > > struct tdp_iter *iter)
> > > {
> > > u64 new_spte;
> > >
> > > + if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > > + return;
> > > +
> > Add a comment for why mirror page table is not expected here?
>
> Ehh, maybe. Thinking about what to put... The warning is kind of cheating a
> little bit on the idea of the patch: to forward all changes through limited ops
> in a central place, such that we don't have TDX specifics encoded in core MMU.
> Trying to forward this through properly would result in more burden to the TDP
> MMU, so that's not the right answer either.
>
> "Mirror TDP doesn't support PTE aging" is a pretty obvious comment. I'm fine
> just leaving it without comment, but I can add something like that. Or do you
> have another suggestion?
What about "External TDP doesn't support clearing PTE A/D bit"?
I'm ok with leaving without comment if you think it's too obvious.
> > And do we need a similar WARN_ON_ONCE() in kvm_tdp_mmu_clear_dirty_pt_masked()
> > or clear_dirty_pt_masked()?
>
> Nothing changes for those in this patch though? For the kvm_tdp_mmu_age_spte()
> case, warning coverage is removed in this patch.
In kvm_tdp_mmu_age_spte() and clear_dirty_pt_masked() SPTEs are updated via
__tdp_mmu_set_spte_atomic() or tdp_mmu_clear_spte_bits(), which don't propagate
changes to the external page table.
So, I think it's better to warn if a mirror root is accidentally passed in to
those functions.