Move mmu_external_spt_cache behind x86 ops.
In the mirror/external MMU concept, the KVM MMU manages a non-active EPT
tree for private memory (the mirror). The actual active EPT tree that
protects the private memory is held inside the TDX module. Whenever the
mirror EPT is changed, KVM needs to call out into one of a set of x86 ops
that implement the various update operations with TDX-specific SEAMCALLs
and other tricks. These implementations operate on the TDX S-EPT (the external).
In reality these external operations are designed narrowly with respect to
TDX particulars. On the surface, the TDX-specific things that happen to
fulfill these update operations are mostly hidden from the MMU, but there
is one particular area of interest where some details leak through.
The S-EPT needs pages to use for its page tables. Like all the rest, these
page tables need to be allocated before taking the mmu lock. So the
KVM MMU pre-allocates pages for TDX to use for the S-EPT in the same place
where it pre-allocates the other page tables. It’s not too bad and fits
nicely with the others.
However, Dynamic PAMT will need even more pages for the same operations.
Further, these pages will need to be handed to the arch/x86 side, which uses
them for DPAMT updates; that is hard to do with the existing KVM-based cache.
The details living in core MMU code start to add up.
So in preparation for making it more complicated, move the external page
table cache into TDX code by putting it behind some x86 ops. Have one op for
topping up and one for allocation. Don't go so far as to try to hide the
existence of external page tables completely from the generic MMU, as they
are currently stored in their mirror struct kvm_mmu_page and it’s quite
handy.
To plumb the memory cache operations through tdx.c, export some of
the functions temporarily. These exports will be removed in future changes.
Acked-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
v4:
- Add Kiryl ack
- Log typo (Binbin)
- Add pages arg to topup_external_fault_cache() (Yan)
- After more consideration, create free_external_fault_cache()
as suggested by Yan
v3:
- New patch
---
arch/x86/include/asm/kvm-x86-ops.h | 3 +++
arch/x86/include/asm/kvm_host.h | 14 +++++++++-----
arch/x86/kvm/mmu/mmu.c | 6 +++---
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
arch/x86/kvm/vmx/tdx.c | 24 ++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 2 ++
virt/kvm/kvm_main.c | 3 +++
7 files changed, 45 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index de709fb5bd76..58c5c9b082ca 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -98,6 +98,9 @@ KVM_X86_OP_OPTIONAL(link_external_spt)
KVM_X86_OP_OPTIONAL(set_external_spte)
KVM_X86_OP_OPTIONAL(free_external_spt)
KVM_X86_OP_OPTIONAL(remove_external_spte)
+KVM_X86_OP_OPTIONAL(alloc_external_fault_cache)
+KVM_X86_OP_OPTIONAL(topup_external_fault_cache)
+KVM_X86_OP_OPTIONAL(free_external_fault_cache)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8d8cc059fed6..dde94b84610c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -855,11 +855,6 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_shadow_page_cache;
struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
- /*
- * This cache is to allocate external page table. E.g. private EPT used
- * by the TDX module.
- */
- struct kvm_mmu_memory_cache mmu_external_spt_cache;
/*
* QEMU userspace and the guest each have their own FPU state.
@@ -1856,6 +1851,15 @@ struct kvm_x86_ops {
void (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
u64 mirror_spte);
+ /* Allocate a page from the external fault cache. */
+ void *(*alloc_external_fault_cache)(struct kvm_vcpu *vcpu);
+
+ /* Top up extra pages needed for faulting in external page tables. */
+ int (*topup_external_fault_cache)(struct kvm_vcpu *vcpu, unsigned int cnt);
+
+ /* Free the external page fault cache. */
+ void (*free_external_fault_cache)(struct kvm_vcpu *vcpu);
+
bool (*has_wbinvd_exit)(void);
u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3cfabfbdd843..3b1b91fd37dd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -601,8 +601,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
if (r)
return r;
if (kvm_has_mirrored_tdp(vcpu->kvm)) {
- r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_external_spt_cache,
- PT64_ROOT_MAX_LEVEL);
+ r = kvm_x86_call(topup_external_fault_cache)(vcpu, PT64_ROOT_MAX_LEVEL);
if (r)
return r;
}
@@ -625,8 +624,9 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
- kvm_mmu_free_memory_cache(&vcpu->arch.mmu_external_spt_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
+ if (kvm_has_mirrored_tdp(vcpu->kvm))
+ kvm_x86_call(free_external_fault_cache)(vcpu);
}
static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 73cdcbccc89e..12234ee468ce 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -165,7 +165,7 @@ static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_
* Therefore, KVM does not need to initialize or access external_spt.
* KVM only interacts with sp->spt for private EPT operations.
*/
- sp->external_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
+ sp->external_spt = kvm_x86_call(alloc_external_fault_cache)(vcpu);
}
static inline gfn_t kvm_gfn_root_bits(const struct kvm *kvm, const struct kvm_mmu_page *root)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b6d7f4b5f40f..260bb0e6eb44 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1642,6 +1642,27 @@ static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
return 0;
}
+static void *tdx_alloc_external_fault_cache(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ return kvm_mmu_memory_cache_alloc(&tdx->mmu_external_spt_cache);
+}
+
+static int tdx_topup_external_fault_cache(struct kvm_vcpu *vcpu, unsigned int cnt)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ return kvm_mmu_topup_memory_cache(&tdx->mmu_external_spt_cache, cnt);
+}
+
+static void tdx_free_external_fault_cache(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ kvm_mmu_free_memory_cache(&tdx->mmu_external_spt_cache);
+}
+
static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)
{
@@ -3602,4 +3623,7 @@ void __init tdx_hardware_setup(void)
vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
+ vt_x86_ops.alloc_external_fault_cache = tdx_alloc_external_fault_cache;
+ vt_x86_ops.topup_external_fault_cache = tdx_topup_external_fault_cache;
+ vt_x86_ops.free_external_fault_cache = tdx_free_external_fault_cache;
}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index ce2720a028ad..1eefa1b0df5e 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -73,6 +73,8 @@ struct vcpu_tdx {
u64 map_gpa_next;
u64 map_gpa_end;
+
+ struct kvm_mmu_memory_cache mmu_external_spt_cache;
};
void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9eca084bdcbe..cff24b950baa 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -404,6 +404,7 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
{
return __kvm_mmu_topup_memory_cache(mc, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE, min);
}
+EXPORT_SYMBOL_GPL(kvm_mmu_topup_memory_cache);
int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
{
@@ -424,6 +425,7 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
mc->objects = NULL;
mc->capacity = 0;
}
+EXPORT_SYMBOL_GPL(kvm_mmu_free_memory_cache);
void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
{
@@ -436,6 +438,7 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
BUG_ON(!p);
return p;
}
+EXPORT_SYMBOL_GPL(kvm_mmu_memory_cache_alloc);
#endif
static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
--
2.51.2
On Thu, Nov 20, 2025, Rick Edgecombe wrote:
> Move mmu_external_spt_cache behind x86 ops.
>
> [...]
NAK. I kinda sorta get why you did this? But the pages KVM uses for page tables
are KVM's, not to be mixed with PAMT pages.
Eww. Definitely a hard "no". In tdp_mmu_alloc_sp_for_split(), the allocation
comes from KVM:
if (mirror) {
sp->external_spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
if (!sp->external_spt) {
free_page((unsigned long)sp->spt);
kmem_cache_free(mmu_page_header_cache, sp);
return NULL;
}
}
But then in kvm_tdp_mmu_map(), via kvm_mmu_alloc_external_spt(), the allocation
comes from get_tdx_prealloc_page()
static void *tdx_alloc_external_fault_cache(struct kvm_vcpu *vcpu)
{
struct page *page = get_tdx_prealloc_page(&to_tdx(vcpu)->prealloc);
if (WARN_ON_ONCE(!page))
return (void *)__get_free_page(GFP_ATOMIC | __GFP_ACCOUNT);
return page_address(page);
}
But then regardless of where the page came from, KVM frees it. Seriously.
static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
{
free_page((unsigned long)sp->external_spt); <=====
free_page((unsigned long)sp->spt);
kmem_cache_free(mmu_page_header_cache, sp);
}
Oh, and the hugepage series also fumbles its topup (why there's yet another
topup API, I have no idea).
static int tdx_topup_vm_split_cache(struct kvm *kvm, enum pg_level level)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
struct tdx_prealloc *prealloc = &kvm_tdx->prealloc_split_cache;
int cnt = tdx_min_split_cache_sz(kvm, level);
while (READ_ONCE(prealloc->cnt) < cnt) {
struct page *page = alloc_page(GFP_KERNEL); <==== GFP_KERNEL_ACCOUNT
if (!page)
return -ENOMEM;
spin_lock(&kvm_tdx->prealloc_split_cache_lock);
list_add(&page->lru, &prealloc->page_list);
prealloc->cnt++;
spin_unlock(&kvm_tdx->prealloc_split_cache_lock);
}
return 0;
}
Sean, really appreciate you taking a look despite being overbooked.
On Fri, 2026-01-16 at 16:53 -0800, Sean Christopherson wrote:
> NAK. I kinda sorta get why you did this? But the pages KVM uses for page tables
> are KVM's, not to be mixed with PAMT pages.
>
> Eww. Definitely a hard "no". In tdp_mmu_alloc_sp_for_split(), the allocation
> comes from KVM:
>
> if (mirror) {
> sp->external_spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
> if (!sp->external_spt) {
> free_page((unsigned long)sp->spt);
> kmem_cache_free(mmu_page_header_cache, sp);
> return NULL;
> }
> }
Ah, this is from the TDX huge pages series. There is a bit of fallout from TDX
/coco's eternal nemesis: stacks of code all being co-designed at once.
Dave has been directing us recently to focus on only the needs of the current
series. Now that we can test at each incremental step we don't have the same
problems as before. But of course there is still desire for updated TDX huge
pages, etc to help with development of all the other WIP stuff.
For this design aspect of how the topup caches work for DPAMT, he asked
specifically for the DPAMT patches to *not* consider how TDX huge pages will use
them.
Now the TDX huge pages cover letter asked you to look at some aspects of that,
and traditionally the KVM side has preferred to look at how the code is all going to
work together. The presentation of this was a bit rushed and confused, but
looking forward, how do you want to do this?
After the 130 patches ordeal, I'm a bit amenable to Dave's view. What do you
think?
>
> But then in kvm_tdp_mmu_map(), via kvm_mmu_alloc_external_spt(), the allocation
> comes from get_tdx_prealloc_page()
>
> static void *tdx_alloc_external_fault_cache(struct kvm_vcpu *vcpu)
> {
> struct page *page = get_tdx_prealloc_page(&to_tdx(vcpu)->prealloc);
>
> if (WARN_ON_ONCE(!page))
> return (void *)__get_free_page(GFP_ATOMIC | __GFP_ACCOUNT);
>
> return page_address(page);
> }
>
> But then regardles of where the page came from, KVM frees it. Seriously.
>
> static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
> {
> free_page((unsigned long)sp->external_spt); <=====
Ah! Ok, we might have a good option here. Kai had suggested that instead of pre-
faulting all the pages that we might need for DPAMT backing, we pre-install the
DPAMT backing for all pages in the external page table cache at top-up time. It
has a complication for TDX huge pages, which needs to decide late if a page is
huge and needs DPAMT backing or not, but based on Dave's direction above I gave
it a try. A downside for pure DPAMT was that it needed to handle the freeing
specially because it needed to potentially reclaim the DPAMT backing. This
required an x86 op for freeing these pages, and was a bit more code on the KVM
side in general.
But if the asymmetry is a problem either way, maybe I'll give it a try for v5.
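To illustrate the shape of that idea, something like the below (a rough sketch
only, not code from any posted series; tdx_pamt_get()/tdx_pamt_put() are
assumed helper names, and re-topup/partial-failure bookkeeping is glossed over):

static int tdx_topup_external_fault_cache(struct kvm_vcpu *vcpu, unsigned int cnt)
{
        struct kvm_mmu_memory_cache *cache = &to_tdx(vcpu)->mmu_external_spt_cache;
        int r, i;

        r = kvm_mmu_topup_memory_cache(cache, cnt);
        if (r)
                return r;

        /* Pre-install DPAMT backing for every pre-allocated S-EPT page. */
        for (i = 0; i < cache->nobjs; i++) {
                r = tdx_pamt_get(virt_to_page(cache->objects[i]));  /* assumed helper */
                if (r)
                        return r;
        }

        return 0;
}

static void tdx_free_external_fault_cache(struct kvm_vcpu *vcpu)
{
        struct kvm_mmu_memory_cache *cache = &to_tdx(vcpu)->mmu_external_spt_cache;
        int i;

        /* Reclaim the DPAMT backing before handing the pages back. */
        for (i = 0; i < cache->nobjs; i++)
                tdx_pamt_put(virt_to_page(cache->objects[i]));  /* assumed helper */

        kvm_mmu_free_memory_cache(cache);
}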
> free_page((unsigned long)sp->spt);
> kmem_cache_free(mmu_page_header_cache, sp);
> }
>
> Oh, and the hugepage series also fumbles its topup (why there's yet another
> topup API, I have no idea).
>
> static int tdx_topup_vm_split_cache(struct kvm *kvm, enum pg_level level)
> {
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> struct tdx_prealloc *prealloc = &kvm_tdx->prealloc_split_cache;
> int cnt = tdx_min_split_cache_sz(kvm, level);
>
> while (READ_ONCE(prealloc->cnt) < cnt) {
> struct page *page = alloc_page(GFP_KERNEL); <==== GFP_KERNEL_ACCOUNT
>
> if (!page)
> return -ENOMEM;
>
> spin_lock(&kvm_tdx->prealloc_split_cache_lock);
> list_add(&page->lru, &prealloc->page_list);
> prealloc->cnt++;
> spin_unlock(&kvm_tdx->prealloc_split_cache_lock);
> }
>
> return 0;
> }
On Tue, Jan 20, 2026, Rick P Edgecombe wrote:
> Sean, really appreciate you taking a look despite being overbooked.
>
> On Fri, 2026-01-16 at 16:53 -0800, Sean Christopherson wrote:
> > [...]
>
> Ah, this is from the TDX huge pages series. There is a bit of fallout from TDX
> /coco's eternal nemesis: stacks of code all being co-designed at once.
>
> Dave has been directing us recently to focus on only the needs of the current
> series. Now that we can test at each incremental step we don't have the same
> problems as before. But of course there is still desire for updated TDX huge
> pages, etc to help with development of all the other WIP stuff.
>
> For this design aspect of how the topup caches work for DPAMT, he asked
> specifically for the DPAMT patches to *not* consider how TDX huge pages will use
> them.
>
> Now the TDX huge pages coverletter asked you to look at some aspects of that,
> and traditionally KVM side has preferred to look at how the code is all going to
> work together. The presentation of this was a bit rushed and confused, but
> looking forward, how do you want to do this?
>
> After the 130 patches ordeal, I'm a bit amenable to Dave's view. What do you
> think?
IMO, it's largely irrelevant for this discussion. Bluntly, the code proposed
here is simply bad. S-EPT hugepage support just makes it worse.
The core issue is that the ownership of the pre-allocation cache is split across
KVM and the TDX subsystem (and within KVM, between tdx.c and the MMU), which makes
it extremely difficult to understand who is responsible for what, which in turn
leads to brittle code, and sets the hugepage series up to fail, e.g. by unnecessarily
mixing S-EPT page allocation with PAMT maintenance.
That aside, I generally agree with Dave. The only caveat I'll throw in is that
I do think we need to _at least_ consider how things will likely play out when
all is said and done, otherwise we'll probably paint ourselves into a corner.
E.g. we don't need to know exactly how S-EPT hugepage support will interact with
DPAMT, but IMO we do need to be aware that KVM will need to demote pages outside
of vCPU context, and thus will need to pre-allocate pages for PAMT without having
a loaded/running vCPU. That knowledge doesn't require active support in the
DPAMT series, but it most definitely influences design decisions.
On Wed, 2026-01-21 at 14:12 -0800, Sean Christopherson wrote:
> IMO, it's largely irrelevant for this discussion. Bluntly, the code proposed
> here is simply bad.
>
FWIW, I wasn't ecstatic about it, but was starting to doubt we could find a
better solution after a string of failed POCs. Will take this feedback under
consideration for the future.
> S-EPT hugepage support just makes it worse.
>
> The core issue is that the ownership of the pre-allocation cache is split across
> KVM and the TDX subsystem (and within KVM, between tdx.c and the MMU), which makes
> it extremely difficult to understand who is responsible for what, which in turn
> leads to brittle code, and sets the hugepage series up to fail, e.g. by unnecessarily
> mixing S-EPT page allocation with PAMT maintenance.
...or require more extensive changes with relevant huge page related
justification, right? We are not defining uABI I think.
>
> That aside, I generally agree with Dave.
>
Ok.
> The only caveat I'll throw in is that
> I do think we need to _at least_ consider how things will likely play out when
> all is said and done, otherwise we'll probably paint ourselves into a corner.
Well this is the tricky part then.
> E.g. we don't need to know exactly how S-EPT hugepage support will interact with
> DPAMT, but IMO we do need to be aware that KVM will need to demote pages outside
> of vCPU context, and thus will need to pre-allocate pages for PAMT without having
> a loaded/running vCPU. That knowledge doesn't require active support in the
> DPAMT series, but it most definitely influences design decisions.
I see what you are saying here though. It depends on how you look at it a bit.
On Fri, Jan 16, 2026 at 04:53:57PM -0800, Sean Christopherson wrote:
> On Thu, Nov 20, 2025, Rick Edgecombe wrote:
> > Move mmu_external_spt_cache behind x86 ops.
> >
> > [...]
>
> NAK. I kinda sorta get why you did this? But the pages KVM uses for page tables
> are KVM's, not to be mixed with PAMT pages.
>
> Eww. Definitely a hard "no". In tdp_mmu_alloc_sp_for_split(), the allocation
> comes from KVM:
>
> if (mirror) {
> sp->external_spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
> if (!sp->external_spt) {
> free_page((unsigned long)sp->spt);
> kmem_cache_free(mmu_page_header_cache, sp);
> return NULL;
> }
> }
>
> But then in kvm_tdp_mmu_map(), via kvm_mmu_alloc_external_spt(), the allocation
> comes from get_tdx_prealloc_page()
>
> static void *tdx_alloc_external_fault_cache(struct kvm_vcpu *vcpu)
> {
> struct page *page = get_tdx_prealloc_page(&to_tdx(vcpu)->prealloc);
>
> if (WARN_ON_ONCE(!page))
> return (void *)__get_free_page(GFP_ATOMIC | __GFP_ACCOUNT);
>
> return page_address(page);
> }
>
> But then regardles of where the page came from, KVM frees it. Seriously.
>
> static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
> {
> free_page((unsigned long)sp->external_spt); <=====
> free_page((unsigned long)sp->spt);
> kmem_cache_free(mmu_page_header_cache, sp);
> }
IMHO, it's by design. I don't see a problem with KVM freeing the sp->external_spt,
regardless of whether it's from:
(1) KVM's mmu cache,
(2) tdp_mmu_alloc_sp_for_split(), or
(3) tdx_alloc_external_fault_cache().
Please correct me if I missed anything.
None of (1)-(3) keeps the pages on a list after KVM obtains the pages and maps
them into SPTEs.
So, with SPTEs as the pages' sole consumer, it's perfectly fine for KVM to free
the pages when freeing SPTEs. No?
Also, in the current upstream code, after tdp_mmu_split_huge_pages_root() is
invoked for dirty tracking, some sp->spt are allocated from
tdp_mmu_alloc_sp_for_split(), while others are from kvm_mmu_memory_cache_alloc().
However, tdp_mmu_free_sp() can still free them without any problem.
> Oh, and the hugepage series also fumbles its topup (why there's yet another
> topup API, I have no idea).
Another topup API was introduced because, in the hugepage usage, the split occurs
in a non-vCPU context. So, the "kvm_tdx->prealloc_split_cache" is per-VM, since
hugepage splitting cannot reuse tdx_topup_external_fault_cache().
(Please also see the patch log of
"[PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting" [1]).
[1] https://lore.kernel.org/all/20260106102331.25244-1-yan.y.zhao@intel.com.
> static int tdx_topup_vm_split_cache(struct kvm *kvm, enum pg_level level)
> {
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> struct tdx_prealloc *prealloc = &kvm_tdx->prealloc_split_cache;
> int cnt = tdx_min_split_cache_sz(kvm, level);
>
> while (READ_ONCE(prealloc->cnt) < cnt) {
> struct page *page = alloc_page(GFP_KERNEL); <==== GFP_KERNEL_ACCOUNT
Thanks for catching this. Will use GFP_KERNEL_ACCOUNT instead.
> if (!page)
> return -ENOMEM;
>
> spin_lock(&kvm_tdx->prealloc_split_cache_lock);
I didn't have tdx_topup_vm_split_cache() reuse topup_tdx_prealloc_page() (as
used by tdx_topup_external_fault_cache()). This is due to the need to add the
spinlock kvm_tdx->prealloc_split_cache_lock to protect page enqueuing and
dequeuing. (I've pasted the explanation of the per-VM external cache for
splitting and its lock protections from [2] below as an "Appendix" for your
convenience).
If we can have the split for huge pages not reuse
tdp_mmu_split_huge_pages_root() (i.e., create a specific interface for
kvm_split_cross_boundary_leafs() instead of reusing
tdp_mmu_split_huge_pages_root(), as I asked about in [3]), we can preallocate
enough pages and hold mmu_lock without releasing it for memory allocation. Then,
this spinlock kvm_tdx->prealloc_split_cache_lock would not be needed.
[2] https://lore.kernel.org/all/20260106101646.24809-1-yan.y.zhao@intel.com
[3] https://lore.kernel.org/all/aW2Iwpuwoyod8eQc@yzhao56-desk.sh.intel.com
> list_add(&page->lru, &prealloc->page_list);
> prealloc->cnt++;
> spin_unlock(&kvm_tdx->prealloc_split_cache_lock);
> }
>
> return 0;
> }
Appendix:
Explanation of the per-VM external cache for splitting huge pages from the
cover letter of huge pages v3 [2]:
"
9. DPAMT
Currently, DPAMT's involvement with TDX huge page is limited to page
splitting.
As shown in the following call stack, DPAMT pages used by splitting are
pre-allocated and queued in the per-VM external split cache. They are
dequeued and consumed in tdx_sept_split_private_spte().
kvm_split_cross_boundary_leafs
kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
tdp_mmu_split_huge_pages_root
(*) 1) tdp_mmu_alloc_sp_for_split()
+-----2.1) need_topup_external_split_cache(): check if enough pages in
| the external split cache. Go to 3 if pages are enough.
| +--2.2) topup_external_split_cache(): preallocate/enqueue pages in
| | the external split cache.
| | 3) tdp_mmu_split_huge_page
| | tdp_mmu_link_sp
| | tdp_mmu_iter_set_spte
| |(**) tdp_mmu_set_spte
| | split_external_spte
| | kvm_x86_call(split_external_spte)
| | tdx_sept_split_private_spte
| | 3.1) BLOCK, TRACK
+--+-------------------3.2) Dequeue PAMT pages from the external split
| | cache for the new sept page
| | 3.3) PAMT_ADD for the new sept page
+--+-------------------3.4) Dequeue PAMT pages from the external split
cache for the 2MB guest private memory.
3.5) DEMOTE.
3.6) Update PAMT refcount of the 2MB guest private
memory.
(*) The write mmu_lock is held across the checking of enough pages in
cache in step 2.1 and the page dequeuing in steps 3.2 and 3.4, so
it's ensured that dequeuing has enough pages in cache.
(**) A spinlock prealloc_split_cache_lock is used inside the TDX's cache
implementation to protect page enqueuing in step 2.2 and page
dequeuing in steps 3.2 and 3.4.
"
On Mon, 2026-01-19 at 10:31 +0800, Yan Zhao wrote:
> On Fri, Jan 16, 2026 at 04:53:57PM -0800, Sean Christopherson wrote:
> > On Thu, Nov 20, 2025, Rick Edgecombe wrote:
> > > Move mmu_external_spt_cache behind x86 ops.
> > >
> > > [...]
> >
> > [...]
> IMHO, it's by design. I don't see a problem with KVM freeing the sp->external_spt,
> regardless of whether it's from:
> (1) KVM's mmu cache,
> (2) tdp_mmu_alloc_sp_for_split(), or
> (3) tdx_alloc_external_fault_cache().
> Please correct me if I missed anything.
>
> None of (1)-(3) keeps the pages in list after KVM obtains the pages and maps
> them into SPTEs.
>
> So, with SPTEs as the pages' sole consumer, it's perfectly fine for KVM to free
> the pages when freeing SPTEs. No?
>
> Also, in the current upstream code, after tdp_mmu_split_huge_pages_root() is
> invoked for dirty tracking, some sp->spt are allocated from
> tdp_mmu_alloc_sp_for_split(), while others are from kvm_mmu_memory_cache_alloc().
> However, tdp_mmu_free_sp() can still free them without any problem.
>
>
Well I think it's for consistency, and IMHO you can even argue this is a
bug in the current code, because IIUC there's indeed one issue in the
current code.
When sp->spt is allocated via per-vCPU mmu_shadow_page_cache, it is
actually initialized to SHADOW_NONPRESENT_VALUE:
vcpu->arch.mmu_shadow_page_cache.init_value =
SHADOW_NONPRESENT_VALUE;
So the way sp->spt is allocated in tdp_mmu_alloc_sp_for_split() is
actually broken IMHO because entries in sp->spt are never initialized.
Fortunately tdp_mmu_alloc_sp_for_split() isn't reachable for TDX guests,
so we are lucky so far.
A per-VM cache requires more code to handle, but to me I still think we
should just use the same way to allocate stuff when possible, and that
includes sp->external_spt.
On Tue, Jan 20, 2026 at 04:42:37PM +0800, Huang, Kai wrote:
> On Mon, 2026-01-19 at 10:31 +0800, Yan Zhao wrote:
> > On Fri, Jan 16, 2026 at 04:53:57PM -0800, Sean Christopherson wrote:
> > > On Thu, Nov 20, 2025, Rick Edgecombe wrote:
> > > > Move mmu_external_spt_cache behind x86 ops.
> > > >
> > > > [...]
> > >
> > > [...]
> > IMHO, it's by design. I don't see a problem with KVM freeing the sp->external_spt,
> > regardless of whether it's from:
> > (1) KVM's mmu cache,
> > (2) tdp_mmu_alloc_sp_for_split(), or
> > (3) tdx_alloc_external_fault_cache().
> > Please correct me if I missed anything.
> >
> > None of (1)-(3) keeps the pages in list after KVM obtains the pages and maps
> > them into SPTEs.
> >
> > So, with SPTEs as the pages' sole consumer, it's perfectly fine for KVM to free
> > the pages when freeing SPTEs. No?
> >
> > Also, in the current upstream code, after tdp_mmu_split_huge_pages_root() is
> > invoked for dirty tracking, some sp->spt are allocated from
> > tdp_mmu_alloc_sp_for_split(), while others are from kvm_mmu_memory_cache_alloc().
> > However, tdp_mmu_free_sp() can still free them without any problem.
> >
> >
>
> Well I think it's for consistency, and IMHO you can even argue this is a
> bug in the current code, because IIUC there's indeed one issue in the
> current code.
>
> When sp->spt is allocated via per-vCPU mmu_shadow_page_cache, it is
> actually initialized to SHADOW_NONPRESENT_VALUE:
>
> vcpu->arch.mmu_shadow_page_cache.init_value =
> SHADOW_NONPRESENT_VALUE;
>
> So the way sp->spt is allocated in tdp_mmu_alloc_sp_for_split() is
> actually broken IMHO because entries in sp->spt is never initialized.
The sp->spt allocated in tdp_mmu_alloc_sp_for_split() is initialized in
tdp_mmu_split_huge_page()...
> Fortunately tdp_mmu_alloc_sp_for_split() isn't reachable for TDX guests,
> so we are lucky so far.
>
> A per-VM cache requires more code to handle, but to me I still think we
> should just use the same way to allocate staff when possible, and that
> includes spt->external_spt.
On Tue, 2026-01-20 at 17:18 +0800, Yan Zhao wrote:
> > When sp->spt is allocated via per-vCPU mmu_shadow_page_cache, it is
> > actually initialized to SHADOW_NONPRESENT_VALUE:
> >
> >         vcpu->arch.mmu_shadow_page_cache.init_value =
> >                 SHADOW_NONPRESENT_VALUE;
> >
> > So the way sp->spt is allocated in tdp_mmu_alloc_sp_for_split() is
> > actually broken IMHO because entries in sp->spt is never initialized.
> The sp->spt allocated in tdp_mmu_alloc_sp_for_split() is initialized in
> tdp_mmu_split_huge_page()...
Oh right, we already have a huge SPTE to copy from in this case, so no problem
here, but seems the inconsistency is still there to me.