[PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting

Posted by Yan Zhao 1 month ago
Introduce a per-VM external cache for splitting the external page table by
adding KVM x86 ops for the cache "topup", "free", and "need topup" operations.

Invoke the KVM x86 ops for "topup" and "need topup" on the per-VM external
split cache when splitting the mirror root in
tdp_mmu_split_huge_pages_root(), where there's no per-vCPU context.

Invoke the KVM x86 op for "free" to destroy the per-VM external split cache
when KVM frees memory caches.

This per-VM external split cache is only used when per-vCPU context is not
available. Use the per-vCPU external fault cache in the fault path
when per-vCPU context is available.

The per-VM external split cache is protected under both kvm->mmu_lock and a
cache lock inside vendor implementations to ensure that there are enough
pages in the cache for one split:

- Dequeuing from the per-VM external split cache is done in
  kvm_x86_ops.split_external_spte() under mmu_lock.

- Yield the traversal in tdp_mmu_split_huge_pages_root() after topup of
  the per-VM cache, so that need_topup() is checked again after
  re-acquiring the mmu_lock.

- Vendor implementations of the per-VM external split cache provide a
  cache lock to protect the enqueue/dequeue of pages into/from the cache.

Here's the sequence showing how enough pages in the cache are guaranteed.

a. with write mmu_lock:

   1. write_lock(&kvm->mmu_lock)
      kvm_x86_ops.need_topup()

   2. write_unlock(&kvm->mmu_lock)
      kvm_x86_ops.topup() --> in vendor:
      {
        allocate pages
        get cache lock
        enqueue pages in cache
        put cache lock
      }

   3. write_lock(&kvm->mmu_lock)
      kvm_x86_ops.need_topup() (goto 2 if topup is necessary)  (*)

      kvm_x86_ops.split_external_spte() --> in vendor:
      {
         get cache lock
         dequeue pages in cache
         put cache lock
      }
      write_unlock(&kvm->mmu_lock)

b. with read mmu_lock,

   1. read_lock(&kvm->mmu_lock)
      kvm_x86_ops.need_topup()

   2. read_unlock(&kvm->mmu_lock)
      kvm_x86_ops.topup() --> in vendor:
      {
        allocate pages
        get cache lock
        enqueue pages in cache
        put cache lock
      }

   3. read_lock(&kvm->mmu_lock)
      kvm_x86_ops.need_topup() (goto 2 if topup is necessary)

      kvm_x86_ops.split_external_spte() --> in vendor:
      {
         get cache lock
         kvm_x86_ops.need_topup() (return retry if topup is necessary) (**)
         dequeue pages in cache
         put cache lock
      }

      read_unlock(&kvm->mmu_lock)

Due to (*) and (**) in step 3, enough pages for the split are guaranteed.
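
For illustration only, the vendor side could look roughly like the following
(names and structure are illustrative, not the actual TDX implementation):

struct external_split_cache {
	spinlock_t lock;
	struct list_head pages;
	int npages;
};

/* kvm_x86_ops.topup(): called with mmu_lock dropped. */
static int split_cache_topup(struct external_split_cache *cache, int min)
{
	LIST_HEAD(new_pages);
	struct page *page;
	int n = 0;

	/* Allocate outside both mmu_lock and the cache lock. */
	while (n < min) {
		page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
		if (!page)
			break;
		list_add(&page->lru, &new_pages);
		n++;
	}

	/* Enqueue whatever was allocated; any surplus simply stays cached. */
	spin_lock(&cache->lock);
	list_splice(&new_pages, &cache->pages);
	cache->npages += n;
	spin_unlock(&cache->lock);

	return n < min ? -ENOMEM : 0;
}

/* Dequeue under mmu_lock; (**) re-checks the count under the cache lock. */
static struct page *split_cache_dequeue(struct external_split_cache *cache,
					int min)
{
	struct page *page = NULL;

	spin_lock(&cache->lock);
	if (cache->npages >= min) {
		page = list_first_entry(&cache->pages, struct page, lru);
		list_del(&page->lru);
		cache->npages--;
	}
	spin_unlock(&cache->lock);

	/* NULL => not enough pages left, caller returns -EAGAIN to retry. */
	return page;
}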

Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Introduce x86 ops to manage the cache.
---
 arch/x86/include/asm/kvm-x86-ops.h |  3 ++
 arch/x86/include/asm/kvm_host.h    | 17 +++++++
 arch/x86/kvm/mmu/mmu.c             |  2 +
 arch/x86/kvm/mmu/tdp_mmu.c         | 71 +++++++++++++++++++++++++++++-
 4 files changed, 91 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 84fa8689b45c..307edc51ad8d 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -102,6 +102,9 @@ KVM_X86_OP_OPTIONAL(split_external_spte)
 KVM_X86_OP_OPTIONAL(alloc_external_fault_cache)
 KVM_X86_OP_OPTIONAL(topup_external_fault_cache)
 KVM_X86_OP_OPTIONAL(free_external_fault_cache)
+KVM_X86_OP_OPTIONAL(topup_external_per_vm_split_cache)
+KVM_X86_OP_OPTIONAL(free_external_per_vm_split_cache)
+KVM_X86_OP_OPTIONAL(need_topup_external_per_vm_split_cache)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 315ffb23e9d8..6122801f334b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1862,6 +1862,23 @@ struct kvm_x86_ops {
 	/* Free in external page fault cache. */
 	void (*free_external_fault_cache)(struct kvm_vcpu *vcpu);
 
+	/*
+	 * Top up extra pages needed in the per-VM cache for splitting external
+	 * page table.
+	 */
+	int (*topup_external_per_vm_split_cache)(struct kvm *kvm,
+						 enum pg_level level);
+
+	/* Free the per-VM cache for splitting external page table. */
+	void (*free_external_per_vm_split_cache)(struct kvm *kvm);
+
+	/*
+	 * Check if it's necessary to top up the per-VM cache for splitting
+	 * external page table.
+	 */
+	bool (*need_topup_external_per_vm_split_cache)(struct kvm *kvm,
+						       enum pg_level level);
+
 	bool (*has_wbinvd_exit)(void);
 
 	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 35a6e37bfc68..3d568512201d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6924,6 +6924,8 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
 	kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
 	kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
 	kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+	if (kvm_has_mirrored_tdp(kvm))
+		kvm_x86_call(free_external_per_vm_split_cache)(kvm);
 }
 
 void kvm_mmu_uninit_vm(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b984027343b7..b45d3da683f2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1606,6 +1606,55 @@ static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
 		 (iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
 }
 
+/*
+ * Check the per-VM external split cache under write mmu_lock or read mmu_lock
+ * in tdp_mmu_split_huge_pages_root().
+ *
+ * When need_topup_external_split_cache() returns false, the mmu_lock is held
+ * throughout the execution from
+ * (a) need_topup_external_split_cache() through
+ * (b) the cache dequeuing (in tdx_sept_split_private_spte() called by
+ *     tdp_mmu_split_huge_page()).
+ *
+ * - When mmu_lock is held for write, the per-VM external split cache is
+ *   exclusively accessed by a single user. Therefore, the result returned from
+ *   need_topup_external_split_cache() is accurate.
+ *
+ * - When mmu_lock is held for read, the per-VM external split cache can be
+ *   shared among multiple users. Cache dequeuing in
+ *   tdx_sept_split_private_spte() thus needs to re-check the cache page count
+ *   after acquiring its internal split cache lock and return an error for
+ *   retry if the cache page count is insufficient.
+ */
+static bool need_topup_external_split_cache(struct kvm *kvm, int level)
+{
+	return kvm_x86_call(need_topup_external_per_vm_split_cache)(kvm, level);
+}
+
+static int topup_external_split_cache(struct kvm *kvm, int level, bool shared)
+{
+	int r;
+
+	rcu_read_unlock();
+
+	if (shared)
+		read_unlock(&kvm->mmu_lock);
+	else
+		write_unlock(&kvm->mmu_lock);
+
+	r = kvm_x86_call(topup_external_per_vm_split_cache)(kvm, level);
+
+	if (shared)
+		read_lock(&kvm->mmu_lock);
+	else
+		write_lock(&kvm->mmu_lock);
+
+	if (!r)
+		rcu_read_lock();
+
+	return r;
+}
+
 static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 					 struct kvm_mmu_page *root,
 					 gfn_t start, gfn_t end,
@@ -1614,6 +1663,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 {
 	struct kvm_mmu_page *sp = NULL;
 	struct tdp_iter iter;
+	int r = 0;
 
 	rcu_read_lock();
 
@@ -1672,6 +1722,21 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 			continue;
 		}
 
+		if (is_mirror_sp(root) &&
+		    need_topup_external_split_cache(kvm, iter.level)) {
+			r = topup_external_split_cache(kvm, iter.level, shared);
+
+			if (r) {
+				trace_kvm_mmu_split_huge_page(iter.gfn,
+							      iter.old_spte,
+							      iter.level, r);
+				goto out;
+			}
+
+			iter.yielded = true;
+			continue;
+		}
+
 		tdp_mmu_init_child_sp(sp, &iter);
 
 		if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
@@ -1682,15 +1747,17 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 
 	rcu_read_unlock();
 
+out:
 	/*
 	 * It's possible to exit the loop having never used the last sp if, for
 	 * example, a vCPU doing HugePage NX splitting wins the race and
-	 * installs its own sp in place of the last sp we tried to split.
+	 * installs its own sp in place of the last sp we tried to split, or if
+	 * topup_external_split_cache() fails.
 	 */
 	if (sp)
 		tdp_mmu_free_sp(sp);
 
-	return 0;
+	return r;
 }
 
 
-- 
2.43.2
Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
Posted by Huang, Kai 2 weeks, 3 days ago
On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
> Introduce per-VM external cache for splitting the external page table by
> adding KVM x86 ops for cache "topup", "free", "need topup" operations.
> 
> Invoke the KVM x86 ops for "topup", "need topup" for the per-VM external
> split cache when splitting the mirror root in
> tdp_mmu_split_huge_pages_root() where there's no per-vCPU context.
> 
> Invoke the KVM x86 op for "free" to destroy the per-VM external split cache
> when KVM frees memory caches.
> 
> This per-VM external split cache is only used when per-vCPU context is not
> available. Use the per-vCPU external fault cache in the fault path
> when per-vCPU context is available.
> 
> The per-VM external split cache is protected under both kvm->mmu_lock and a
> cache lock inside vendor implementations to ensure that there're enough
> pages in cache for one split:
> 
> - Dequeuing of the per-VM external split cache is in
>   kvm_x86_ops.split_external_spte() under mmu_lock.
> 
> - Yield the traversal in tdp_mmu_split_huge_pages_root() after topup of
>   the per-VM cache, so that need_topup() is checked again after
>   re-acquiring the mmu_lock.
> 
> - Vendor implementations of the per-VM external split cache provide a
>   cache lock to protect the enqueue/dequeue of pages into/from the cache.
> 
> Here's the sequence to show how enough pages in cache is guaranteed.
> 
> a. with write mmu_lock:
> 
>    1. write_lock(&kvm->mmu_lock)
>       kvm_x86_ops.need_topup()
> 
>    2. write_unlock(&kvm->mmu_lock)
>       kvm_x86_ops.topup() --> in vendor:
>       {
>         allocate pages
>         get cache lock
>         enqueue pages in cache
>         put cache lock
>       }
> 
>    3. write_lock(&kvm->mmu_lock)
>       kvm_x86_ops.need_topup() (goto 2 if topup is necessary)  (*)
> 
>       kvm_x86_ops.split_external_spte() --> in vendor:
>       {
>          get cache lock
>          dequeue pages in cache
>          put cache lock
>       }
>       write_unlock(&kvm->mmu_lock)
> 
> b. with read mmu_lock,
> 
>    1. read_lock(&kvm->mmu_lock)
>       kvm_x86_ops.need_topup()
> 
>    2. read_unlock(&kvm->mmu_lock)
>       kvm_x86_ops.topup() --> in vendor:
>       {
>         allocate pages
>         get cache lock
>         enqueue pages in cache
>         put cache lock
>       }
> 
>    3. read_lock(&kvm->mmu_lock)
>       kvm_x86_ops.need_topup() (goto 2 if topup is necessary)
> 
>       kvm_x86_ops.split_external_spte() --> in vendor:
>       {
>          get cache lock
>          kvm_x86_ops.need_topup() (return retry if topup is necessary) (**)
>          dequeue pages in cache
>          put cache lock
>       }
> 
>       read_unlock(&kvm->mmu_lock)
> 
> Due to (*) and (**) in step 3, enough pages for split is guaranteed.

It feels like an enormous pain to make sure there are enough objects in the
cache, _especially_ under MMU read lock -- you need an additional cache
lock and need to call need_topup() twice for that, and the caller needs to
handle -EAGAIN.

That being said, I _think_ this is also the reason that
tdp_mmu_alloc_sp_for_split() chose to just use normal memory allocation
for allocating sp and sp->spt but not use a per-VM cache of KVM's
kvm_mmu_memory_cache.

I have been thinking whether we can simplify the solution, not only just
for avoiding this complicated memory cache topup-then-consume mechanism
under MMU read lock, but also for avoiding kinda duplicated code about how
to calculate how many DPAMT pages needed to topup etc between your next
patch and similar code in DPAMT series for the per-vCPU cache.

IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
and the mapped 2M range when splitting.

- For S-EPT pages, they are _ALWAYS_ 4K, so we can actually use
tdx_alloc_page() directly which also handles DPAMT pages internally.

Here in tdp_mmu_alloc_sp_for_split():

	sp->external_spt = tdx_alloc_page();

For the fault path we need to use the normal 'kvm_mmu_memory_cache' but
that's per-vCPU cache which doesn't have the pain of per-VM cache.  As I
mentioned in v3, I believe we can also hook to use tdx_alloc_page() if we
add two new obj_alloc()/free() callbacks to 'kvm_mmu_memory_cache':

https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/

So we can get rid of the per-VM DPAMT cache for S-EPT pages.

- For DPAMT pages for the TDX guest private memory, I think we can also
get rid of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry the
needed DPAMT pages:

--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -111,6 +111,7 @@ struct kvm_mmu_page {
                 * Passed to TDX module, not accessed by KVM.
                 */
                void *external_spt;
+               void *leaf_level_private;
        };

Then we can define a structure which contains DPAMT pages for a given 2M
range:

	struct tdx_dmapt_metadata {
		struct page *page1;
		struct page *page2;
	};

Then when we allocate sp->external_spt, we can also allocate it for
leaf_level_private via a kvm_x86_ops call when the 'sp' is actually the
last level page table.

In this case, I think we can get rid of the per-VM DPAMT cache?
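
E.g., roughly (tdx_alloc_dpamt_pair() and tdx_free_page() are hypothetical
names here, just to illustrate the idea):

	/* Sketch only, not an actual patch. */
	static int tdx_alloc_external_for_split(struct kvm_mmu_page *sp, int level)
	{
		/* S-EPT page: tdx_alloc_page() installs its own DPAMT pair. */
		sp->external_spt = tdx_alloc_page();
		if (!sp->external_spt)
			return -ENOMEM;

		/* Last-level table: carry the DPAMT pair for the mapped 2M range. */
		if (level == PG_LEVEL_4K) {
			sp->leaf_level_private = tdx_alloc_dpamt_pair();
			if (!sp->leaf_level_private) {
				tdx_free_page(sp->external_spt);
				sp->external_spt = NULL;
				return -ENOMEM;
			}
		}

		return 0;
	}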

For the fault path, similarly, I believe we can use a per-vCPU cache for
'struct tdx_dpamt_metadata' if we utilize the two new obj_alloc()/free()
hooks.

The cost is the new 'leaf_level_private' takes an additional 8 bytes for non-
TDX guests even though it is never used, but if what I said above is feasible,
maybe it's worth the cost.

But it's completely possible that I missed something.  Any thoughts?


Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
Posted by Sean Christopherson 2 weeks, 2 days ago
On Wed, Jan 21, 2026, Kai Huang wrote:
> On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
> I have been thinking whether we can simplify the solution, not only just
> for avoiding this complicated memory cache topup-then-consume mechanism
> under MMU read lock, but also for avoiding kinda duplicated code about how
> to calculate how many DPAMT pages needed to topup etc between your next
> patch and similar code in DPAMT series for the per-vCPU cache.
> 
> IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
> and the mapped 2M range when splitting.
> 
> - For S-EPT pages, they are _ALWAYS_ 4K, so we can actually use
> tdx_alloc_page() directly which also handles DPAMT pages internally.
> 
> Here in tdp_mmmu_alloc_sp_for_split():
> 
> 	sp->external_spt = tdx_alloc_page();
> 
> For the fault path we need to use the normal 'kvm_mmu_memory_cache' but
> that's per-vCPU cache which doesn't have the pain of per-VM cache.  As I
> mentioned in v3, I believe we can also hook to use tdx_alloc_page() if we
> add two new obj_alloc()/free() callback to 'kvm_mmu_memory_cache':
> 
> https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/
> 
> So we can get rid of the per-VM DPAMT cache for S-EPT pages.
> 
> - For DPAMT pages for the TDX guest private memory, I think we can also
> get rid of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry the
> needed DPAMT pages:
> 
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -111,6 +111,7 @@ struct kvm_mmu_page {
>                  * Passed to TDX module, not accessed by KVM.
>                  */
>                 void *external_spt;
> +               void *leaf_level_private;
>         };

There's no need to put this in with external_spt, we could throw it in a new union
with unsync_child_bitmap (TDP MMU can't have unsync children).  IIRC, the main
reason I've never suggested unionizing unsync_child_bitmap is that overloading
the bitmap would risk corruption if KVM ever marked a TDP MMU page as unsync, but
that's easy enough to guard against:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3d568512201d..d6c6768c1f50 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1917,9 +1917,10 @@ static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
 
 static void mark_unsync(u64 *spte)
 {
-       struct kvm_mmu_page *sp;
+       struct kvm_mmu_page *sp = sptep_to_sp(spte);
 
-       sp = sptep_to_sp(spte);
+       if (WARN_ON_ONCE(is_tdp_mmu_page(sp)))
+               return;
        if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
                return;
        if (sp->unsync_children++)


I might send a patch to do that even if we don't overload the bitmap, as a
hardening measure.

> Then we can define a structure which contains DPAMT pages for a given 2M
> range:
> 
> 	struct tdx_dmapt_metadata {
> 		struct page *page1;
> 		struct page *page2;
> 	};
> 
> Then when we allocate sp->external_spt, we can also allocate it for
> leaf_level_private via kvm_x86_ops call when we the 'sp' is actually the
> last level page table.
> 
> In this case, I think we can get rid of the per-VM DPAMT cache?
> 
> For the fault path, similarly, I believe we can use a per-vCPU cache for
> 'struct tdx_dpamt_memtadata' if we utilize the two new obj_alloc()/free()
> hooks.
> 
> The cost is the new 'leaf_level_private' takes additional 8-bytes for non-
> TDX guests even they are never used, but if what I said above is feasible,
> maybe it's worth the cost.
> 
> But it's completely possible that I missed something.  Any thoughts?

I *LOVE* the core idea (seriously, this made my week), though I think we should
take it a step further and _immediately_ do DPAMT maintenance on allocation.
I.e. do tdx_pamt_get() via tdx_alloc_control_page() when KVM tops up the S-EPT
SP cache instead of waiting until KVM links the SP.  Then KVM doesn't need to
track PAMT pages except for memory that is mapped into a guest, and we end up
with better symmetry and more consistency throughout TDX.  E.g. all pages that
KVM allocates and gifts to the TDX-Module will be allocated and freed via the same
TDX APIs.
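
E.g., very roughly (tdx_pamt_get()/tdx_pamt_put() signatures assumed here, not
final):

	/* Sketch: pair the DPAMT maintenance with allocation, mirror it on free. */
	struct page *tdx_alloc_control_page(void)
	{
		struct page *page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);

		if (!page)
			return NULL;

		/* Install the 4K DPAMT backing before the page is ever used. */
		if (tdx_pamt_get(page)) {
			__free_page(page);
			return NULL;
		}
		return page;
	}

	void tdx_free_control_page(struct page *page)
	{
		if (!page)
			return;

		tdx_pamt_put(page);
		__free_page(page);
	}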

Absolute worst case scenario, KVM allocates 40 (KVM's SP cache capacity) PAMT
entries per-vCPU that end up being free without ever being gifted to the TDX-Module.
But I doubt that will be a problem in practice, because odds are good the adjacent
pages/pfns will already have been consumed, i.e. the "speculative" allocation is
really just bumping the refcount.  And _if_ it's a problem, e.g. results in too
many wasted DPAMT entries, then it's one we can solve in KVM by tuning the cache
capacity to less aggressively allocate DPAMT entries.

I'll send compile-tested v4 for the DPAMT series later today (I think I can get
it out today), as I have other non-trivial feedback that I've accumulated while
going through the patches.
Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
Posted by Yan Zhao 2 weeks, 2 days ago
On Wed, Jan 21, 2026 at 09:30:28AM -0800, Sean Christopherson wrote:
> On Wed, Jan 21, 2026, Kai Huang wrote:
> > On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
> > I have been thinking whether we can simplify the solution, not only just
> > for avoiding this complicated memory cache topup-then-consume mechanism
> > under MMU read lock, but also for avoiding kinda duplicated code about how
> > to calculate how many DPAMT pages needed to topup etc between your next
> > patch and similar code in DPAMT series for the per-vCPU cache.
> > 
> > IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
> > and the mapped 2M range when splitting.
> > 
> > - For S-EPT pages, they are _ALWAYS_ 4K, so we can actually use
> > tdx_alloc_page() directly which also handles DPAMT pages internally.
> > 
> > Here in tdp_mmmu_alloc_sp_for_split():
> > 
> > 	sp->external_spt = tdx_alloc_page();
> > 
> > For the fault path we need to use the normal 'kvm_mmu_memory_cache' but
> > that's per-vCPU cache which doesn't have the pain of per-VM cache.  As I
> > mentioned in v3, I believe we can also hook to use tdx_alloc_page() if we
> > add two new obj_alloc()/free() callback to 'kvm_mmu_memory_cache':
> > 
> > https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/
> > 
> > So we can get rid of the per-VM DPAMT cache for S-EPT pages.
> > 
> > - For DPAMT pages for the TDX guest private memory, I think we can also
> > get rid of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry the
> > needed DPAMT pages:
> > 
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -111,6 +111,7 @@ struct kvm_mmu_page {
> >                  * Passed to TDX module, not accessed by KVM.
> >                  */
> >                 void *external_spt;
> > +               void *leaf_level_private;
> >         };
> 
> There's no need to put this in with external_spt, we could throw it in a new union
> with unsync_child_bitmap (TDP MMU can't have unsync children).  IIRC, the main
> reason I've never suggested unionizing unsync_child_bitmap is that overloading
> the bitmap would risk corruption if KVM ever marked a TDP MMU page as unsync, but
> that's easy enough to guard against:
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3d568512201d..d6c6768c1f50 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1917,9 +1917,10 @@ static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
>  
>  static void mark_unsync(u64 *spte)
>  {
> -       struct kvm_mmu_page *sp;
> +       struct kvm_mmu_page *sp = sptep_to_sp(spte);
>  
> -       sp = sptep_to_sp(spte);
> +       if (WARN_ON_ONCE(is_tdp_mmu_page(sp)))
> +               return;
>         if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
>                 return;
>         if (sp->unsync_children++)
> 
> 
> I might send a patch to do that even if we don't overload the bitmap, as a
> hardening measure.
> 
> > Then we can define a structure which contains DPAMT pages for a given 2M
> > range:
> > 
> > 	struct tdx_dmapt_metadata {
> > 		struct page *page1;
> > 		struct page *page2;
> > 	};

Note: we need 4 pages to split a 2MB range, 2 for the new S-EPT page, 2 for the
2MB guest memory range.


> > Then when we allocate sp->external_spt, we can also allocate it for
> > leaf_level_private via kvm_x86_ops call when we the 'sp' is actually the
> > last level page table.
> > 
> > In this case, I think we can get rid of the per-VM DPAMT cache?
> > 
> > For the fault path, similarly, I believe we can use a per-vCPU cache for
> > 'struct tdx_dpamt_memtadata' if we utilize the two new obj_alloc()/free()
> > hooks.
> > 
> > The cost is the new 'leaf_level_private' takes additional 8-bytes for non-
> > TDX guests even they are never used, but if what I said above is feasible,
> > maybe it's worth the cost.
> > 
> > But it's completely possible that I missed something.  Any thoughts?
> 
> I *LOVE* the core idea (seriously, this made my week), though I think we should
Me too!

> take it a step further and _immediately_ do DPAMT maintenance on allocation.
> I.e. do tdx_pamt_get() via tdx_alloc_control_page() when KVM tops up the S-EPT
> SP cache instead of waiting until KVM links the SP.  Then KVM doesn't need to
> track PAMT pages except for memory that is mapped into a guest, and we end up
> with better symmetry and more consistency throughout TDX.  E.g. all pages that
> KVM allocates and gifts to the TDX-Module will allocated and freed via the same
> TDX APIs.
Not sure if I understand this paragraph correctly.

I'm wondering if it can help us get rid of asymmetry. e.g.
When KVM wants to split a 2MB page, it allocates a sp for level 4K, which
contains 2 PAMT pages for the new S-EPT page.
During split, the 2 PAMT pages are installed successfully. However, the
splitting fails due to DEMOTE failure. Then, it looks like KVM needs to
uninstall and free the 2 PAMT pages for the new S-EPT page, right?

However, some other threads may have consumed the 2 PAMT pages for an adjacent
4KB page within the same 2MB range of the new S-EPT page.
So, KVM still can't free the 2 PAMT pages allocated from it.

Will check your patches for better understanding.

> Absolute worst case scenario, KVM allocates 40 (KVM's SP cache capacity) PAMT
> entries per-vCPU that end up being free without ever being gifted to the TDX-Module.
> But I doubt that will be a problem in practice, because odds are good the adjacent
> pages/pfns will already have been consumed, i.e. the "speculative" allocation is
> really just bumping the refcount.  And _if_ it's a problem, e.g. results in too
> many wasted DPAMT entries, then it's one we can solve in KVM by tuning the cache
> capacity to less aggresively allocate DPAMT entries.
> 
> I'll send compile-tested v4 for the DPAMT series later today (I think I can get
> it out today), as I have other non-trival feedback that I've accumulated when
> going through the patches.
Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
Posted by Huang, Kai 2 weeks, 2 days ago
> > 
> > > Then we can define a structure which contains DPAMT pages for a given 2M
> > > range:
> > > 
> > > 	struct tdx_dmapt_metadata {
> > > 		struct page *page1;
> > > 		struct page *page2;
> > > 	};
> 
> Note: we need 4 pages to split a 2MB range, 2 for the new S-EPT page, 2 for the
> 2MB guest memory range.

In this proposal the pair for S-EPT is already handled by tdx_alloc_page()
(or tdx_alloc_control_page()):

  sp->external_spt = tdx_alloc_page();
Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
Posted by Yan Zhao 2 weeks, 2 days ago
On Thu, Jan 22, 2026 at 03:30:16PM +0800, Huang, Kai wrote:
> > > 
> > > > Then we can define a structure which contains DPAMT pages for a given 2M
> > > > range:
> > > > 
> > > > 	struct tdx_dmapt_metadata {
> > > > 		struct page *page1;
> > > > 		struct page *page2;
> > > > 	};
> > 
> > Note: we need 4 pages to split a 2MB range, 2 for the new S-EPT page, 2 for the
> > 2MB guest memory range.
> 
> In this proposal the pair for S-EPT is already handled by tdx_alloc_page()
> (or tdx_alloc_control_page()):
> 
>   sp->external_spt = tdx_alloc_page();
Oh, ok.

So, in the fault path, sp->external_spt and sp->leaf_level_private are from
fault cache.

In the non-vCPU split path, sp->external_spt is from tdx_alloc_page(),
sp->leaf_level_private is from 2 get_zeroed_page() calls (or the count is from
a kvm_x86 hook?)
Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
Posted by Huang, Kai 2 weeks, 1 day ago
On Thu, 2026-01-22 at 15:49 +0800, Zhao, Yan Y wrote:
> On Thu, Jan 22, 2026 at 03:30:16PM +0800, Huang, Kai wrote:
> > > > 
> > > > > Then we can define a structure which contains DPAMT pages for a given 2M
> > > > > range:
> > > > > 
> > > > > 	struct tdx_dmapt_metadata {
> > > > > 		struct page *page1;
> > > > > 		struct page *page2;
> > > > > 	};
> > > 
> > > Note: we need 4 pages to split a 2MB range, 2 for the new S-EPT page, 2 for the
> > > 2MB guest memory range.
> > 
> > In this proposal the pair for S-EPT is already handled by tdx_alloc_page()
> > (or tdx_alloc_control_page()):
> > 
> >   sp->external_spt = tdx_alloc_page();
> Oh, ok.
> 
> So, in the fault path, sp->external_spt and sp->leaf_level_private are from
> fault cache.
> 
> In the non-vCPU split path, sp->external_spt is from tdx_alloc_page(),
> sp->leaf_level_private is from 2 get_zeroed_page() (or the count is from an
> x86_kvm hook ?)

The idea is we can add two new hooks (e.g., obj_alloc()/obj_free()) to
'kvm_mmu_memory_cache' so that we can just hook tdx_alloc_page() for
vcpu->arch.mmu_external_spte_cache(), in which way KVM will also call
tdx_alloc_page() when topping up the memory cache for the S-EPT.

Ditto for DPAMT pair for the actual TDX guest private memory:

We can provide a helper to allocate a structure which contains a pair of
DPAMT pages.  For the split path we call that helper directly for
sp->leaf_level_private; for the fault path we can have another per-vCPU
'kvm_mmu_memory_cache' for the DPAMT pair with obj_alloc() being hooked to
that helper so that KVM will also call that helper when topping up the
DPAMT pair cache.

At least this is my idea.
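
Roughly, the allocation side could look like this (sketch only, the
obj_alloc()/obj_free() members don't exist today):

	static void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
						gfp_t gfp_flags)
	{
		/* New: let the owner override how objects are allocated. */
		if (mc->obj_alloc)
			return mc->obj_alloc(gfp_flags);

		if (mc->kmem_cache)
			return kmem_cache_zalloc(mc->kmem_cache, gfp_flags);

		return (void *)__get_free_page(gfp_flags);
	}

TDX would then set obj_alloc() on the external SPT cache to a wrapper around
tdx_alloc_page(), and on the DPAMT-pair cache to the helper above, while the
generic topup code stays unchanged.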
Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
Posted by Huang, Kai 2 weeks, 2 days ago
On Wed, 2026-01-21 at 09:30 -0800, Sean Christopherson wrote:
> On Wed, Jan 21, 2026, Kai Huang wrote:
> > On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
> > I have been thinking whether we can simplify the solution, not only just
> > for avoiding this complicated memory cache topup-then-consume mechanism
> > under MMU read lock, but also for avoiding kinda duplicated code about how
> > to calculate how many DPAMT pages needed to topup etc between your next
> > patch and similar code in DPAMT series for the per-vCPU cache.
> > 
> > IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
> > and the mapped 2M range when splitting.
> > 
> > - For S-EPT pages, they are _ALWAYS_ 4K, so we can actually use
> > tdx_alloc_page() directly which also handles DPAMT pages internally.
> > 
> > Here in tdp_mmmu_alloc_sp_for_split():
> > 
> > 	sp->external_spt = tdx_alloc_page();
> > 
> > For the fault path we need to use the normal 'kvm_mmu_memory_cache' but
> > that's per-vCPU cache which doesn't have the pain of per-VM cache.  As I
> > mentioned in v3, I believe we can also hook to use tdx_alloc_page() if we
> > add two new obj_alloc()/free() callback to 'kvm_mmu_memory_cache':
> > 
> > https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/
> > 
> > So we can get rid of the per-VM DPAMT cache for S-EPT pages.
> > 
> > - For DPAMT pages for the TDX guest private memory, I think we can also
> > get rid of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry the
> > needed DPAMT pages:
> > 
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -111,6 +111,7 @@ struct kvm_mmu_page {
> >                  * Passed to TDX module, not accessed by KVM.
> >                  */
> >                 void *external_spt;
> > +               void *leaf_level_private;
> >         };
> 
> There's no need to put this in with external_spt, we could throw it in a new union
> with unsync_child_bitmap (TDP MMU can't have unsync children).
> 

Agreed.

> IIRC, the main
> reason I've never suggested unionizing unsync_child_bitmap is that overloading
> the bitmap would risk corruption if KVM ever marked a TDP MMU page as unsync, but
> that's easy enough to guard against:
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3d568512201d..d6c6768c1f50 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1917,9 +1917,10 @@ static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
>  
>  static void mark_unsync(u64 *spte)
>  {
> -       struct kvm_mmu_page *sp;
> +       struct kvm_mmu_page *sp = sptep_to_sp(spte);
>  
> -       sp = sptep_to_sp(spte);
> +       if (WARN_ON_ONCE(is_tdp_mmu_page(sp)))
> +               return;
>         if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
>                 return;
>         if (sp->unsync_children++)
> 

LGTM.

> 
> I might send a patch to do that even if we don't overload the bitmap, as a
> hardening measure.
> 
> > Then we can define a structure which contains DPAMT pages for a given 2M
> > range:
> > 
> > 	struct tdx_dmapt_metadata {
> > 		struct page *page1;
> > 		struct page *page2;
> > 	};
> > 
> > Then when we allocate sp->external_spt, we can also allocate it for
> > leaf_level_private via kvm_x86_ops call when we the 'sp' is actually the
> > last level page table.
> > 
> > In this case, I think we can get rid of the per-VM DPAMT cache?
> > 
> > For the fault path, similarly, I believe we can use a per-vCPU cache for
> > 'struct tdx_dpamt_memtadata' if we utilize the two new obj_alloc()/free()
> > hooks.
> > 
> > The cost is the new 'leaf_level_private' takes additional 8-bytes for non-
> > TDX guests even they are never used, but if what I said above is feasible,
> > maybe it's worth the cost.
> > 
> > But it's completely possible that I missed something.  Any thoughts?
> 
> I *LOVE* the core idea (seriously, this made my week), though I think we should
> take it a step further and _immediately_ do DPAMT maintenance on allocation.
> I.e. do tdx_pamt_get() via tdx_alloc_control_page() when KVM tops up the S-EPT
> SP cache instead of waiting until KVM links the SP.  Then KVM doesn't need to
> track PAMT pages except for memory that is mapped into a guest, and we end up
> with better symmetry and more consistency throughout TDX.  E.g. all pages that
> KVM allocates and gifts to the TDX-Module will allocated and freed via the same
> TDX APIs.

Agreed.

Nit: do you still want to use the name tdx_alloc_control_page() instead of
tdx_alloc_page() if it is also used for S-EPT pages?  Kinda not sure but
both work for me.
Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
Posted by Edgecombe, Rick P 2 weeks, 2 days ago
On Wed, 2026-01-21 at 09:30 -0800, Sean Christopherson wrote:
> I *LOVE* the core idea (seriously, this made my week), though I think we should
> take it a step further and _immediately_ do DPAMT maintenance on allocation.
> I.e. do tdx_pamt_get() via tdx_alloc_control_page() when KVM tops up the S-EPT
> SP cache instead of waiting until KVM links the SP.  Then KVM doesn't need to
> track PAMT pages except for memory that is mapped into a guest, and we end up
> with better symmetry and more consistency throughout TDX.  E.g. all pages that
> KVM allocates and gifts to the TDX-Module will allocated and freed via the same
> TDX APIs.
> 
> Absolute worst case scenario, KVM allocates 40 (KVM's SP cache capacity) PAMT
> entries per-vCPU that end up being free without ever being gifted to the TDX-Module.
> But I doubt that will be a problem in practice, because odds are good the adjacent
> pages/pfns will already have been consumed, i.e. the "speculative" allocation is
> really just bumping the refcount.  And _if_ it's a problem, e.g. results in too
> many wasted DPAMT entries, then it's one we can solve in KVM by tuning the cache
> capacity to less aggresively allocate DPAMT entries.

It doesn't sound like much impact. Especially given we earlier considered
installing DPAMT for all TDX capable memory to try to simplify things.

> 
> I'll send compile-tested v4 for the DPAMT series later today (I think I can get
> it out today), as I have other non-trival feedback that I've accumulated when
> going through the patches.

Interesting idea. I have a local branch with the rest of the feedback and a few
other tweaks. Anything I can do to help?