[PATCH v3 07/24] KVM: x86/tdp_mmu: Introduce split_external_spte() under write mmu_lock

Yan Zhao posted 24 patches 1 month ago
[PATCH v3 07/24] KVM: x86/tdp_mmu: Introduce split_external_spte() under write mmu_lock
Posted by Yan Zhao 1 month ago
Introduce kvm_x86_ops.split_external_spte() and wrap it in a helper
function split_external_spte(). Invoke the helper function
split_external_spte() in tdp_mmu_set_spte() to propagate splitting
transitions from the mirror page table to the external page table under
write mmu_lock.

Introduce a new valid transition case for splitting and document all valid
transitions of the mirror page table under write mmu_lock in
tdp_mmu_set_spte().

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Rename split_external_spt() to split_external_spte().

- Pass in param "old_mirror_spte" to hook kvm_x86_ops.set_external_spte().
  This is in aligned with the parameter change in hook
  kvm_x86_ops.set_external_spte() in Sean's cleanup series, and also allows
  future DPAMT patches to acquire guest private PFN from the old mirror
  spte.

- Rename param "external_spt" to "new_external_spt" in hook
  kvm_x86_ops.set_external_spte() to indicate this is a new page table page
  for the external page table.

- Drop declaration of get_external_spt() by moving split_external_spte()
  after get_external_spt() but before set_external_spte_present() and
  tdp_mmu_set_spte(). (Kai)

- split_external_spte --> split_external_spte() (Kai)

RFC v2:
- Removed the KVM_BUG_ON() in split_external_spt(). (Rick)
- Add a comment for the KVM_BUG_ON() in tdp_mmu_set_spte(). (Rick)
- Use kvm_x86_call() instead of static_call(). (Binbin)

RFC v1:
- Split patch.
- Dropped invoking hook zap_private_spte and kvm_flush_remote_tlbs() in KVM
  MMU core.
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  4 ++++
 arch/x86/kvm/mmu/tdp_mmu.c         | 29 +++++++++++++++++++++++++----
 3 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 58c5c9b082ca..84fa8689b45c 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -98,6 +98,7 @@ KVM_X86_OP_OPTIONAL(link_external_spt)
 KVM_X86_OP_OPTIONAL(set_external_spte)
 KVM_X86_OP_OPTIONAL(free_external_spt)
 KVM_X86_OP_OPTIONAL(remove_external_spte)
+KVM_X86_OP_OPTIONAL(split_external_spte)
 KVM_X86_OP_OPTIONAL(alloc_external_fault_cache)
 KVM_X86_OP_OPTIONAL(topup_external_fault_cache)
 KVM_X86_OP_OPTIONAL(free_external_fault_cache)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7818da148a8c..56089d6b9b51 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1848,6 +1848,10 @@ struct kvm_x86_ops {
 	void (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 				     u64 mirror_spte);
 
+	/* Split a huge mapping into smaller mappings in external page table */
+	int (*split_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				   u64 old_mirror_spte, void *new_external_spt);
+
 	/* Allocation a pages from the external page cache. */
 	void *(*alloc_external_fault_cache)(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index dfa56554f9e0..977914b2627f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -508,6 +508,19 @@ static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
 	return NULL;
 }
 
+static int split_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
+			       u64 new_spte, int level)
+{
+	void *new_external_spt = get_external_spt(gfn, new_spte, level);
+	int ret;
+
+	KVM_BUG_ON(!new_external_spt, kvm);
+
+	ret = kvm_x86_call(split_external_spte)(kvm, gfn, level, old_spte,
+						new_external_spt);
+	return ret;
+}
+
 static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
 						 gfn_t gfn, u64 old_spte,
 						 u64 new_spte, int level)
@@ -758,12 +771,20 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
 
 	/*
-	 * Users that do non-atomic setting of PTEs don't operate on mirror
-	 * roots, so don't handle it and bug the VM if it's seen.
+	 * Propagate changes of SPTE to the external page table under write
+	 * mmu_lock.
+	 * Current valid transitions:
+	 * - present leaf to !present.
+	 * - present non-leaf to !present.
+	 * - present leaf to present non-leaf (splitting)
 	 */
 	if (is_mirror_sptep(sptep)) {
-		KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
-		remove_external_spte(kvm, gfn, old_spte, level);
+		if (!is_shadow_present_pte(new_spte))
+			remove_external_spte(kvm, gfn, old_spte, level);
+		else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
+			split_external_spte(kvm, gfn, old_spte, new_spte, level);
+		else
+			KVM_BUG_ON(1, kvm);
 	}
 
 	return old_spte;
-- 
2.43.2
Re: [PATCH v3 07/24] KVM: x86/tdp_mmu: Introduce split_external_spte() under write mmu_lock
Posted by Sean Christopherson 1 week, 4 days ago
On Tue, Jan 06, 2026, Yan Zhao wrote:
> Introduce a new valid transition case for splitting and document all valid
> transitions of the mirror page table under write mmu_lock in
> tdp_mmu_set_spte().

...

> ---
>  arch/x86/include/asm/kvm-x86-ops.h |  1 +
>  arch/x86/include/asm/kvm_host.h    |  4 ++++
>  arch/x86/kvm/mmu/tdp_mmu.c         | 29 +++++++++++++++++++++++++----
>  3 files changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 58c5c9b082ca..84fa8689b45c 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -98,6 +98,7 @@ KVM_X86_OP_OPTIONAL(link_external_spt)
>  KVM_X86_OP_OPTIONAL(set_external_spte)
>  KVM_X86_OP_OPTIONAL(free_external_spt)
>  KVM_X86_OP_OPTIONAL(remove_external_spte)
> +KVM_X86_OP_OPTIONAL(split_external_spte)

This is all going in the wrong direction.  Sprinking S-EPT callbacks all over the
TDP MMU leaks *more* TDX details into the MMU, and for all intents and purposes
does nothing in terms of encapsulating MMU details in the MMU.  E.g. the TDX code
has sanity checks all over the place to ensure the "right" API is called.

The bajillion callbacks also make this code extremely difficult to follow and
review.  It requires knowing exactly which TDP MMU paths are used for what
operations, and what paths are (allegedly) unreachable for mirror roots.  Adding
hooks at specific points is also brittle, because an unexpected update/change is
more likely to go unnoticed, at least until the system explodes.

There are really only two novel paths: atomic versus non-atomic writes.  An atomic
set_spte() can fail, and also needs special handling so that the entire operation
is atomic from KVM's point of view.

There's another outlier, removal of a non-leaf S-EPT page, that I think is worth
keeping separate, because I don't see a sane way of containing the TDX-Module's
ordering requirements to the TDX code.  Specifically, the TDX-Module requires that
leaf S-EPT entries be removed before the parent page table can be removed, whereas
KVM prefers to prune the page table and _then_ reap its children.

We _could_ funnel that case into the non-atomic update, but it would either require
propagating the non-leaf removal to TDX after the call_rcu():

	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);

which is all kinds of gross, or it would require moving the call_rcu() invocation,
which obviously bleeds TDX details into the TDP MMU.  So I think it's work keeping
a dedicated hook for that case, but literally everything else can funnel into a
single callback, invoked from two locations: handle_changed_spte() and
__tdp_mmu_set_spte_atomic().

Then the TDX code is (quite simply, IMO):

static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
				     u64 new_spte, enum pg_level level)
{
	if (is_shadow_present_pte(old_spte) && is_shadow_present_pte(new_spte))
		return tdx_sept_split_private_spte(kvm, gfn, old_spte, new_spte, level);
	else if (is_shadow_present_pte(old_spte))
		return tdx_sept_remove_private_spte(kvm, gfn, old_spte, level);

	if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
		return -EIO;

	if (!is_last_spte(new_spte, level))
		return tdx_sept_link_private_spt(kvm, gfn, new_spte, level);

	return tdx_sept_map_leaf_spte(kvm, gfn, new_spte, level);
}