[RFC PATCH v5 19/45] KVM: Allow owner of kvm_mmu_memory_cache to provide a custom page allocator

Posted by Sean Christopherson 1 week, 3 days ago
Extend "struct kvm_mmu_memory_cache" to support a custom page allocator
so that x86's TDX can update per-page metadata on allocation and free().

Name the allocator page_get() to align with __get_free_page(), e.g. to
communicate that it returns an "unsigned long", not a "struct page", and
to avoid collisions with macros, e.g. with alloc_page.

Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_types.h | 2 ++
 virt/kvm/kvm_main.c       | 7 ++++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index a568d8e6f4e8..87fa9deffdb7 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -112,6 +112,8 @@ struct kvm_mmu_memory_cache {
 	gfp_t gfp_custom;
 	u64 init_value;
 	struct kmem_cache *kmem_cache;
+	unsigned long (*page_get)(gfp_t gfp);
+	void (*page_free)(unsigned long addr);
 	int capacity;
 	int nobjs;
 	void **objects;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 571cf0d6ec01..7015edce5bd8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -356,7 +356,10 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
 	if (mc->kmem_cache)
 		return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
 
-	page = (void *)__get_free_page(gfp_flags);
+	if (mc->page_get)
+		page = (void *)mc->page_get(gfp_flags);
+	else
+		page = (void *)__get_free_page(gfp_flags);
 	if (page && mc->init_value)
 		memset64(page, mc->init_value, PAGE_SIZE / sizeof(u64));
 	return page;
@@ -416,6 +419,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
 	while (mc->nobjs) {
 		if (mc->kmem_cache)
 			kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
+		else if (mc->page_free)
+			mc->page_free((unsigned long)mc->objects[--mc->nobjs]);
 		else
 			free_page((unsigned long)mc->objects[--mc->nobjs]);
 	}
-- 
2.53.0.rc1.217.geba53bf80e-goog
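
As a rough illustration of the intended usage, a cache owner (e.g. TDX) would point the new hooks at its own allocator.  The function names below are placeholders, not code from this series:

	/*
	 * Illustrative sketch only.  tdx_pamt_prepare_page() and
	 * tdx_pamt_release_page() stand in for whatever per-page metadata
	 * update TDX needs; they are not functions from this series.  The
	 * callbacks mirror the __get_free_page()/free_page() contract,
	 * i.e. they traffic in "unsigned long" addresses, not "struct page".
	 */
	static unsigned long tdx_alloc_sept_page(gfp_t gfp)
	{
		unsigned long page = __get_free_page(gfp);

		if (page && tdx_pamt_prepare_page(virt_to_page((void *)page))) {
			free_page(page);
			return 0;
		}
		return page;
	}

	static void tdx_free_sept_page(unsigned long page)
	{
		tdx_pamt_release_page(virt_to_page((void *)page));
		free_page(page);
	}

	static void tdx_init_sept_cache(struct kvm_mmu_memory_cache *mc)
	{
		mc->page_get = tdx_alloc_sept_page;
		mc->page_free = tdx_free_sept_page;
	}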
Re: [RFC PATCH v5 19/45] KVM: Allow owner of kvm_mmu_memory_cache to provide a custom page allocator
Posted by Huang, Kai 5 days, 7 hours ago
On Wed, 2026-01-28 at 17:14 -0800, Sean Christopherson wrote:
> Extend "struct kvm_mmu_memory_cache" to support a custom page allocator
> so that x86's TDX can update per-page metadata on allocation and free().
> 
> Name the allocator page_get() to align with __get_free_page(), e.g. to
> communicate that it returns an "unsigned long", not a "struct page", and
> to avoid collisions with macros, e.g. with alloc_page.
> 
> Suggested-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

I thought it could be more generic for allocating an object, but not just a
page.

E.g., I thought we might be able to use it to allocate a structure which has
"pair of DPAMT pages" so it could be assigned to 'struct kvm_mmu_page'.  But
it seems you abandoned this idea.  May I ask why?  Just want to understand
the reasoning here.

Anyway:

Reviewed-by: Kai Huang <kai.huang@intel.com>

> ---
>  include/linux/kvm_types.h | 2 ++
>  virt/kvm/kvm_main.c       | 7 ++++++-
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index a568d8e6f4e8..87fa9deffdb7 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -112,6 +112,8 @@ struct kvm_mmu_memory_cache {
>  	gfp_t gfp_custom;
>  	u64 init_value;
>  	struct kmem_cache *kmem_cache;
> +	unsigned long (*page_get)(gfp_t gfp);
> +	void (*page_free)(unsigned long addr);
>  	int capacity;
>  	int nobjs;
>  	void **objects;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 571cf0d6ec01..7015edce5bd8 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -356,7 +356,10 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
>  	if (mc->kmem_cache)
>  		return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
>  
> -	page = (void *)__get_free_page(gfp_flags);
> +	if (mc->page_get)
> +		page = (void *)mc->page_get(gfp_flags);
> +	else
> +		page = (void *)__get_free_page(gfp_flags);
>  	if (page && mc->init_value)
>  		memset64(page, mc->init_value, PAGE_SIZE / sizeof(u64));
>  	return page;
> @@ -416,6 +419,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
>  	while (mc->nobjs) {
>  		if (mc->kmem_cache)
>  			kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
> +		else if (mc->page_free)
> +			mc->page_free((unsigned long)mc->objects[--mc->nobjs]);
>  		else
>  			free_page((unsigned long)mc->objects[--mc->nobjs]);
>  	}
> -- 
> 2.53.0.rc1.217.geba53bf80e-goog
Re: [RFC PATCH v5 19/45] KVM: Allow owner of kvm_mmu_memory_cache to provide a custom page allocator
Posted by Sean Christopherson 4 days, 21 hours ago
On Tue, Feb 03, 2026, Kai Huang wrote:
> On Wed, 2026-01-28 at 17:14 -0800, Sean Christopherson wrote:
> > Extend "struct kvm_mmu_memory_cache" to support a custom page allocator
> > so that x86's TDX can update per-page metadata on allocation and free().
> > 
> > Name the allocator page_get() to align with __get_free_page(), e.g. to
> > communicate that it returns an "unsigned long", not a "struct page", and
> > to avoid collisions with macros, e.g. with alloc_page.
> > 
> > Suggested-by: Kai Huang <kai.huang@intel.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> I thought it could be more generic for allocating an object, but not just a
> page.
> 
> E.g., I thought we might be able to use it to allocate a structure which has
> "pair of DPAMT pages" so it could be assigned to 'struct kvm_mmu_page'.  But
> it seems you abandoned this idea.  May I ask why?  Just want to understand
> the reasoning here.

Because that requires more complexity and there's no known use case, and I don't
see an obvious way for a use case to come along.  All of the motivations for a
custom allocation scheme that I can think of apply only to full pages, or fit
nicely in a kmem_cache.

Specifically, the "cache" logic is already bifurcated between "kmem_cache" and
"page" usage.  Further splitting the "page" case doesn't require modifications to
the "kmem_cache" case, whereas providing a fully generic solution would require
additional changes, e.g. to handle this code:

	page = (void *)__get_free_page(gfp_flags);
	if (page && mc->init_value)
		memset64(page, mc->init_value, PAGE_SIZE / sizeof(u64));

It certainly wouldn't be much complexity, but this code is already a bit awkward,
so I don't think it makes sense to add support for something that will probably
never be used.
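
For comparison, a fully generic hook would also have to take over the init_value handling, roughly as sketched below; obj_alloc is a hypothetical callback, not something proposed in this thread:

	/*
	 * Hypothetical generic variant: with an opaque object, the cache can
	 * no longer apply init_value itself, so every obj_alloc()
	 * implementation would have to handle initialization on its own.
	 */
	if (mc->obj_alloc)
		return mc->obj_alloc(mc, gfp_flags);

	page = (void *)__get_free_page(gfp_flags);
	if (page && mc->init_value)
		memset64(page, mc->init_value, PAGE_SIZE / sizeof(u64));
	return page;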
Re: [RFC PATCH v5 19/45] KVM: Allow owner of kvm_mmu_memory_cache to provide a custom page allocator
Posted by Huang, Kai 4 days, 20 hours ago
On Tue, 2026-02-03 at 12:12 -0800, Sean Christopherson wrote:
> On Tue, Feb 03, 2026, Kai Huang wrote:
> > On Wed, 2026-01-28 at 17:14 -0800, Sean Christopherson wrote:
> > > Extend "struct kvm_mmu_memory_cache" to support a custom page allocator
> > > so that x86's TDX can update per-page metadata on allocation and free().
> > > 
> > > Name the allocator page_get() to align with __get_free_page(), e.g. to
> > > communicate that it returns an "unsigned long", not a "struct page", and
> > > to avoid collisions with macros, e.g. with alloc_page.
> > > 
> > > Suggested-by: Kai Huang <kai.huang@intel.com>
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > 
> > I thought it could be more generic for allocating an object, but not just a
> > page.
> > 
> > E.g., I thought we might be able to use it to allocate a structure which has
> > "pair of DPAMT pages" so it could be assigned to 'struct kvm_mmu_page'.  But
> > it seems you abandoned this idea.  May I ask why?  Just want to understand
> > the reasoning here.
> 
> Because that requires more complexity and there's no known use case, and I don't
> see an obvious way for a use case to come along.  All of the motivations for a
> custom allocation scheme that I can think of apply only to full pages, or fit
> nicely in a kmem_cache.
> 
> Specifically, the "cache" logic is already bifurcated between "kmem_cache" and
> "page" usage.  Further splitting the "page" case doesn't require modifications to
> the "kmem_cache" case, whereas providing a fully generic solution would require
> additional changes, e.g. to handle this code:
> 
> 	page = (void *)__get_free_page(gfp_flags);
> 	if (page && mc->init_value)
> 		memset64(page, mc->init_value, PAGE_SIZE / sizeof(u64));
> 
> It certainly wouldn't be much complexity, but this code is already a bit awkward,
> so I don't think it makes sense to add support for something that will probably
> never be used.

For this particular piece of code, we can add a helper for allocating normal
page table pages, get rid of mc->init_value completely and hook
mc->page_get() to that helper.

A bonus is we can then call that helper in all places when KVM needs to
allocate a page for normal page table instead of just calling
get_zeroed_page() directly, e.g., like the one in
tdp_mmu_alloc_sp_for_split(), so that we can have a consistent way for
allocating normal page table pages.
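
Concretely, the helper Kai describes might look roughly like this; the name is hypothetical, and SHADOW_NONPRESENT_VALUE is x86's existing non-present SPTE encoding:

	/*
	 * Hypothetical helper: allocate a page-table page pre-initialized
	 * with the non-present SPTE value, making mc->init_value redundant
	 * for callers that hook mc->page_get to this.
	 */
	static unsigned long kvm_mmu_alloc_pt_page(gfp_t gfp)
	{
		u64 *spt = (u64 *)__get_free_page(gfp);

		if (spt)
			memset64(spt, SHADOW_NONPRESENT_VALUE,
				 PAGE_SIZE / sizeof(u64));
		return (unsigned long)spt;
	}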
Re: [RFC PATCH v5 19/45] KVM: Allow owner of kvm_mmu_memory_cache to provide a custom page allocator
Posted by Sean Christopherson 4 days, 15 hours ago
On Tue, Feb 03, 2026, Kai Huang wrote:
> On Tue, 2026-02-03 at 12:12 -0800, Sean Christopherson wrote:
> > On Tue, Feb 03, 2026, Kai Huang wrote:
> > > On Wed, 2026-01-28 at 17:14 -0800, Sean Christopherson wrote:
> > > > Extend "struct kvm_mmu_memory_cache" to support a custom page allocator
> > > > so that x86's TDX can update per-page metadata on allocation and free().
> > > > 
> > > > Name the allocator page_get() to align with __get_free_page(), e.g. to
> > > > communicate that it returns an "unsigned long", not a "struct page", and
> > > > to avoid collisions with macros, e.g. with alloc_page.
> > > > 
> > > > Suggested-by: Kai Huang <kai.huang@intel.com>
> > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > 
> > > I thought it could be more generic for allocating an object, but not just a
> > > page.
> > > 
> > > E.g., I thought we might be able to use it to allocate a structure which has
> > > "pair of DPAMT pages" so it could be assigned to 'struct kvm_mmu_page'.  But
> > > it seems you abandoned this idea.  May I ask why?  Just want to understand
> > > the reasoning here.
> > 
> > Because that requires more complexity and there's no known use case, and I don't
> > see an obvious way for a use case to come along.  All of the motivations for a
> > custom allocation scheme that I can think of apply only to full pages, or fit
> > nicely in a kmem_cache.
> > 
> > Specifically, the "cache" logic is already bifurcated between "kmem_cache" and
> > "page" usage.  Further splitting the "page" case doesn't require modifications to
> > the "kmem_cache" case, whereas providing a fully generic solution would require
> > additional changes, e.g. to handle this code:
> > 
> > 	page = (void *)__get_free_page(gfp_flags);
> > 	if (page && mc->init_value)
> > 		memset64(page, mc->init_value, PAGE_SIZE / sizeof(u64));
> > 
> > It certainly wouldn't be much complexity, but this code is already a bit awkward,
> > so I don't think it makes sense to add support for something that will probably
> > never be used.
> 
> For this particular piece of code, we can add a helper for allocating normal
> page table pages, get rid of mc->init_value completely and hook mc->page_get()
> to that helper.

Hmm, I like the idea, but I don't think it would be a net positive.  In practice,
x86's "normal" page tables stop being normal, because KVM now initializes all
SPTEs with BIT(63)=1 on x86-64.  And that would also incur an extra RETPOLINE on
all those allocations.

> A bonus is we can then call that helper in all places when KVM needs to
> allocate a page for normal page table instead of just calling
> get_zeroed_page() directly, e.g., like the one in
> tdp_mmu_alloc_sp_for_split(),

Huh.  Actually, that's a bug, but not the one you probably expect.  At a glance,
it looks like KVM is incorrectly zeroing the page instead of initializing it with
SHADOW_NONPRESENT_VALUE.  But it's actually a "performance" bug, because KVM
doesn't actually need to pre-initialize the page: either the page will never be
used, or every SPTE will be initialized as a child SPTE.

So that one _should_ be different, e.g. should be:

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a32192c35099..36afd67601fc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1456,7 +1456,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
        if (!sp)
                return NULL;
 
-       sp->spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+       sp->spt = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
        if (!sp->spt)
                goto err_spt;
 
> so that we can have a consistent way for allocating normal page table pages.
Re: [RFC PATCH v5 19/45] KVM: Allow owner of kvm_mmu_memory_cache to provide a custom page allocator
Posted by Huang, Kai 4 days, 11 hours ago
On Tue, 2026-02-03 at 18:16 -0800, Sean Christopherson wrote:
> On Tue, Feb 03, 2026, Kai Huang wrote:
> > On Tue, 2026-02-03 at 12:12 -0800, Sean Christopherson wrote:
> > > On Tue, Feb 03, 2026, Kai Huang wrote:
> > > > On Wed, 2026-01-28 at 17:14 -0800, Sean Christopherson wrote:
> > > > > Extend "struct kvm_mmu_memory_cache" to support a custom page allocator
> > > > > so that x86's TDX can update per-page metadata on allocation and free().
> > > > > 
> > > > > Name the allocator page_get() to align with __get_free_page(), e.g. to
> > > > > communicate that it returns an "unsigned long", not a "struct page", and
> > > > > to avoid collisions with macros, e.g. with alloc_page.
> > > > > 
> > > > > Suggested-by: Kai Huang <kai.huang@intel.com>
> > > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > > 
> > > > I thought it could be more generic for allocating an object, but not just a
> > > > page.
> > > > 
> > > > E.g., I thought we might be able to use it to allocate a structure which has
> > > > "pair of DPAMT pages" so it could be assigned to 'struct kvm_mmu_page'.  But
> > > > it seems you abandoned this idea.  May I ask why?  Just want to understand
> > > > the reasoning here.
> > > 
> > > Because that requires more complexity and there's no known use case, and I don't
> > > see an obvious way for a use case to come along.  All of the motivations for a
> > > custom allocation scheme that I can think of apply only to full pages, or fit
> > > nicely in a kmem_cache.
> > > 
> > > Specifically, the "cache" logic is already bifurcated between "kmem_cache" and
> > > "page" usage.  Further splitting the "page" case doesn't require modifications to
> > > the "kmem_cache" case, whereas providing a fully generic solution would require
> > > additional changes, e.g. to handle this code:
> > > 
> > > 	page = (void *)__get_free_page(gfp_flags);
> > > 	if (page && mc->init_value)
> > > 		memset64(page, mc->init_value, PAGE_SIZE / sizeof(u64));
> > > 
> > > It certainly wouldn't be much complexity, but this code is already a bit awkward,
> > > so I don't think it makes sense to add support for something that will probably
> > > never be used.
> > 
> > For this particular piece of code, we can add a helper for allocating normal
> > page table pages, get rid of mc->init_value completely and hook mc->page_get()
> > to that helper.
> 
> Hmm, I like the idea, but I don't think it would be a net positive.  In practice,
> x86's "normal" page tables stop being normal, because KVM now initializes all
> SPTEs with BIT(63)=1 on x86-64.  And that would also incur an extra RETPOLINE on
> all those allocations.

No argument on this.  People hate indirect calls I guess. :-)

> 
> > A bonus is we can then call that helper in all places when KVM needs to
> > allocate a page for normal page table instead of just calling
> > get_zeroed_page() directly, e.g., like the one in
> > tdp_mmu_alloc_sp_for_split(),
> 
> Huh.  Actually, that's a bug, but not the one you probably expect.  At a glance,
> it looks like KVM is incorrectly zeroing the page instead of initializing it with
> SHADOW_NONPRESENT_VALUE.  But it's actually a "performance" bug, because KVM
> doesn't actually need to pre-initialize the page: either the page will never be
> used, or every SPTE will be initialized as a child SPTE.
> 
> So that one _should_ be different, e.g. should be:
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index a32192c35099..36afd67601fc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1456,7 +1456,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
>         if (!sp)
>                 return NULL;
>  
> -       sp->spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
> +       sp->spt = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
>         if (!sp->spt)
>                 goto err_spt;
> 

If we look at it from a "performance" perspective, yeah, indeed, though we're
probably not gonna see any measurable performance difference.

But no more arguments.  I just think it would be less error-prone if we had a
consistent way of allocating the same object (no matter what it is), but it's
just a theoretical thing.
Re: [RFC PATCH v5 19/45] KVM: Allow owner of kvm_mmu_memory_cache to provide a custom page allocator
Posted by Edgecombe, Rick P 4 days, 21 hours ago
On Tue, 2026-02-03 at 12:12 -0800, Sean Christopherson wrote:
> > E.g., I thought we might be able to use it to allocate a structure which has
> > "pair of DPAMT pages" so it could be assigned to 'struct kvm_mmu_page'.  But
> > it seems you abandoned this idea.  May I ask why?  Just want to understand
> > the reasoning here.
> 
> Because that requires more complexity and there's no known use case, and I
> don't see an obvious way for a use case to come along.  All of the
> motivations for a custom allocation scheme that I can think of apply only to
> full pages, or fit nicely in a kmem_cache.
> 
> Specifically, the "cache" logic is already bifurcated between "kmem_cache" and
> "page" usage.  Further splitting the "page" case doesn't require modifications
> to the "kmem_cache" case, whereas providing a fully generic solution would
> require additional changes, e.g. to handle this code:
> 
> 	page = (void *)__get_free_page(gfp_flags);
> 	if (page && mc->init_value)
> 		memset64(page, mc->init_value, PAGE_SIZE / sizeof(u64));
> 
> It certainly wouldn't be much complexity, but this code is already a bit
> awkward, so I don't think it makes sense to add support for something that
> will probably never be used.

The thing that the design needlessly works around is that we can rely on the
fact that there are only two DPAMT pages per 2MB range. We don't need the
dynamic page count allocations.

This means we don't need to pass around the list of pages that lets arch/x86
take as many pages as it needs. We can maybe just pass in a struct like Kai was
suggesting to the get/put helpers. So I was in the process of trying to morph
this series in that direction to get rid of the complexity resulting from the
dynamic assumption. 

This was what I had done in response to v4 discussions, so now I'm retrofitting
it into this new ops scheme. Care to warn me off of this before I have
something to show?
Re: [RFC PATCH v5 19/45] KVM: Allow owner of kvm_mmu_memory_cache to provide a custom page allocator
Posted by Sean Christopherson 4 days, 20 hours ago
On Tue, Feb 03, 2026, Rick P Edgecombe wrote:
> On Tue, 2026-02-03 at 12:12 -0800, Sean Christopherson wrote:
> > > E.g., I thought we might be able to use it to allocate a structure which has
> > > "pair of DPAMT pages" so it could be assigned to 'struct kvm_mmu_page'.  But
> > > it seems you abandoned this idea.  May I ask why?  Just want to understand
> > > the reasoning here.
> > 
> > Because that requires more complexity and there's no known use case, and I
> > don't see an obvious way for a use case to come along.  All of the
> > motivations for a custom allocation scheme that I can think of apply only to
> > full pages, or fit nicely in a kmem_cache.
> > 
> > Specifically, the "cache" logic is already bifurcated between "kmem_cache" and
> > "page" usage.  Further splitting the "page" case doesn't require modifications
> > to the "kmem_cache" case, whereas providing a fully generic solution would
> > require additional changes, e.g. to handle this code:
> > 
> > 	page = (void *)__get_free_page(gfp_flags);
> > 	if (page && mc->init_value)
> > 		memset64(page, mc->init_value, PAGE_SIZE / sizeof(u64));
> > 
> > It certainly wouldn't be much complexity, but this code is already a bit
> > awkward, so I don't think it makes sense to add support for something that
> > will probably never be used.
> 
> The thing that the design needlessly works around is that we can rely on the
> fact that there are only two DPAMT pages per 2MB range. We don't need the
> dynamic page count allocations.
> 
> This means we don't need to pass around the list of pages that lets arch/x86
> take as many pages as it needs. We can maybe just pass in a struct like Kai was
> suggesting to the get/put helpers. So I was in the process of trying to morph
> this series in that direction to get rid of the complexity resulting from the
> dynamic assumption. 
> 
> This was what I had done in response to v4 discussions, so now I'm retrofitting
> it into this new ops scheme. Care to warn me off of this before I have
> something to show?

That's largely orthogonal to this change.  This change is about preparing the
DPAMT when an S-EPT page is allocated versus when it is installed.  The fact that
DPAMT requires at most two pages versus a more dynamic maximum is irrelevant.

The caches aren't about dynamic sizes (though they play nicely with them), they're
about:

  (a) not having to deal with allocating under spinlock
  (b) not having to free memory that goes unused (for a single page fault)
  (c) batching allocations for performance reasons (with the caveat that I doubt
      anyone has measured the performance impact in many, many years).

None of those talking points change at all if KVM needs to provide 2 pages versus
N pages.  The max number of pages needed for page tables is pretty much the same
thing as DPAMT, just with a higher max (4/5 vs. 2).  In both cases, the allocated
pages may or may not be consumed for any given fault.
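
In code, (a) and (b) boil down to KVM's existing top-up/consume pattern, roughly as follows (simplified; the cache field and the minimum count are only illustrative):

	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
	u64 *spt;
	int r;

	/*
	 * Top up the cache before taking mmu_lock: these allocations may
	 * sleep.  PT64_ROOT_MAX_LEVEL is used here purely as an example of
	 * a "max objects per fault" bound.
	 */
	r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
	if (r)
		return r;

	write_lock(&vcpu->kvm->mmu_lock);
	/* Consume under the lock, where sleeping isn't allowed. */
	spt = kvm_mmu_memory_cache_alloc(mc);
	write_unlock(&vcpu->kvm->mmu_lock);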

For the leaf pages (including the hugepage splitting cases), which don't utilize
KVM's kvm_mmu_memory_cache, I wouldn't expect the KVM details to change all that
much.  In fact, they shouldn't change at all, because tracking 2 pages versus N
pages in "struct tdx_pamt_cache" is a detail that is 100% buried in the TDX subsystem
(which was pretty much the entire goal of my design).

Though maybe I'm misunderstanding what you have in mind?