[PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages

Posted by Yan Zhao 1 month ago
Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.

The SEAMCALL TDH_MEM_PAGE_AUG currently supports adding physical memory to
the S-EPT up to 2MB in size.

While keeping the "level" parameter in the tdh_mem_page_aug() wrapper to
allow callers to specify the physical memory size, introduce the parameters
"folio" and "start_idx" to specify the physical memory starting from the
page at "start_idx" within the "folio". The specified physical memory must
be fully contained within a single folio.

Invoke tdx_clflush_page() for each 4KB segment of the physical memory being
added. tdx_clflush_page() performs CLFLUSH operations conservatively to
prevent dirty cache lines from writing back later and corrupting TD memory.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- nth_page() --> folio_page(). (Kai, Dave)
- Rebased on top of DPAMT v4.

RFC v2:
- Refine patch log. (Rick)
- Removed the level checking. (Kirill, Chao Gao)
- Use "folio", and "start_idx" rather than "page".
- Return TDX_OPERAND_INVALID if the specified physical memory is not
  contained within a single folio.
- Use PTE_SHIFT to replace the 9 in "1 << (level * 9)" (Kirill)
- Use C99-style definition of variables inside a loop. (Nikolay Borisov)

RFC v1:
- Rebased to new tdh_mem_page_aug() with "struct page *" as param.
- Check folio, folio_page_idx.
---
 arch/x86/include/asm/tdx.h  |  3 ++-
 arch/x86/kvm/vmx/tdx.c      |  5 +++--
 arch/x86/virt/vmx/tdx/tdx.c | 13 ++++++++++---
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8c0c548f9735..f92850789193 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -235,7 +235,8 @@ u64 tdh_mng_addcx(struct tdx_td *td, struct page *tdcs_page);
 u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, struct page *page, struct page *source, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mem_sept_add(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page);
-u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2);
+u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio,
+		     unsigned long start_idx, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mem_range_block(struct tdx_td *td, u64 gpa, int level, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mng_key_config(struct tdx_td *td);
 u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 98ff84bc83f2..2f03c51515b9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1679,12 +1679,13 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct page *page = pfn_to_page(pfn);
+	struct folio *folio = page_folio(page);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 entry, level_state;
 	u64 err;
 
-	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
-
+	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, folio,
+			       folio_page_idx(folio, page), &entry, &level_state);
 	if (unlikely(IS_TDX_OPERAND_BUSY(err)))
 		return -EBUSY;
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b0b33f606c11..41ce18619ffc 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1743,16 +1743,23 @@ u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page)
 }
 EXPORT_SYMBOL_GPL(tdh_vp_addcx);
 
-u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2)
+u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio,
+		     unsigned long start_idx, u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
 		.rcx = gpa | level,
 		.rdx = tdx_tdr_pa(td),
-		.r8 = page_to_phys(page),
+		.r8 = page_to_phys(folio_page(folio, start_idx)),
 	};
+	unsigned long npages = 1 << (level * PTE_SHIFT);
 	u64 ret;
 
-	tdx_clflush_page(page);
+	if (start_idx + npages > folio_nr_pages(folio))
+		return TDX_OPERAND_INVALID;
+
+	for (int i = 0; i < npages; i++)
+		tdx_clflush_page(folio_page(folio, start_idx + i));
+
 	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
 
 	*ext_err1 = args.rcx;
-- 
2.43.2
Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
Posted by Dave Hansen 1 month ago
On 1/6/26 02:18, Yan Zhao wrote:
> Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.
> 
> The SEAMCALL TDH_MEM_PAGE_AUG currently supports adding physical memory to
> the S-EPT up to 2MB in size.
> 
> While keeping the "level" parameter in the tdh_mem_page_aug() wrapper to
> allow callers to specify the physical memory size, introduce the parameters
> "folio" and "start_idx" to specify the physical memory starting from the
> page at "start_idx" within the "folio". The specified physical memory must
> be fully contained within a single folio.
> 
> Invoke tdx_clflush_page() for each 4KB segment of the physical memory being
> added. tdx_clflush_page() performs CLFLUSH operations conservatively to
> prevent dirty cache lines from writing back later and corrupting TD memory.

This changelog is heavy on the "what" and weak on the "why". It's not
telling me what I need to know.

...
> +	struct folio *folio = page_folio(page);
>  	gpa_t gpa = gfn_to_gpa(gfn);
>  	u64 entry, level_state;
>  	u64 err;
>  
> -	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
> -
> +	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, folio,
> +			       folio_page_idx(folio, page), &entry, &level_state);
>  	if (unlikely(IS_TDX_OPERAND_BUSY(err)))
>  		return -EBUSY;

For example, 'folio' is able to be trivially derived from page. Yet,
this removes the 'page' argument and replaces it with 'folio' _and_
another value which can be derived from 'page'.

This looks superficially like an illogical change. *Why* was this done?

> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index b0b33f606c11..41ce18619ffc 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1743,16 +1743,23 @@ u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page)
>  }
>  EXPORT_SYMBOL_GPL(tdh_vp_addcx);
>  
> -u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2)
> +u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio,
> +		     unsigned long start_idx, u64 *ext_err1, u64 *ext_err2)
>  {
>  	struct tdx_module_args args = {
>  		.rcx = gpa | level,
>  		.rdx = tdx_tdr_pa(td),
> -		.r8 = page_to_phys(page),
> +		.r8 = page_to_phys(folio_page(folio, start_idx)),
>  	};
> +	unsigned long npages = 1 << (level * PTE_SHIFT);
>  	u64 ret;

This 'npages' calculation is not obviously correct. It's not clear what
"level" is or what values it should have.

This is precisely the kind of place to deploy a helper that explains
what is going on.

> -	tdx_clflush_page(page);
> +	if (start_idx + npages > folio_nr_pages(folio))
> +		return TDX_OPERAND_INVALID;

Why is this necessary? Would it be a bug if this happens?

> +	for (int i = 0; i < npages; i++)
> +		tdx_clflush_page(folio_page(folio, start_idx + i));

All of the page<->folio conversions are kinda hurting my brain. I think
we need to decide what the canonical type for these things is in TDX, do
the conversion once, and stick with it.
Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
Posted by Yan Zhao 1 month ago
Hi Dave
,

Thanks for the review!

On Tue, Jan 06, 2026 at 01:08:00PM -0800, Dave Hansen wrote:
> On 1/6/26 02:18, Yan Zhao wrote:
> > Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.
> > 
> > The SEAMCALL TDH_MEM_PAGE_AUG currently supports adding physical memory to
> > the S-EPT up to 2MB in size.
> > 
> > While keeping the "level" parameter in the tdh_mem_page_aug() wrapper to
> > allow callers to specify the physical memory size, introduce the parameters
> > "folio" and "start_idx" to specify the physical memory starting from the
> > page at "start_idx" within the "folio". The specified physical memory must
> > be fully contained within a single folio.
> > 
> > Invoke tdx_clflush_page() for each 4KB segment of the physical memory being
> > added. tdx_clflush_page() performs CLFLUSH operations conservatively to
> > prevent dirty cache lines from writing back later and corrupting TD memory.
> 
> This changelog is heavy on the "what" and weak on the "why". It's not
> telling me what I need to know.
Indeed. I missed that. I'll keep it in mind. Thanks!

> > +	struct folio *folio = page_folio(page);
> >  	gpa_t gpa = gfn_to_gpa(gfn);
> >  	u64 entry, level_state;
> >  	u64 err;
> >  
> > -	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
> > -
> > +	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, folio,
> > +			       folio_page_idx(folio, page), &entry, &level_state);
> >  	if (unlikely(IS_TDX_OPERAND_BUSY(err)))
> >  		return -EBUSY;
> 
> For example, 'folio' is able to be trivially derived from page. Yet,
> this removes the 'page' argument and replaces it with 'folio' _and_
> another value which can be derived from 'page'.
> 
> This looks superficially like an illogical change. *Why* was this done?
Sorry for missing the "why".

I think we can alternatively derive "folio" and "start_idx" from "page" inside
the wrapper tdh_mem_page_aug() for huge pages.

However, my understanding is that it's better for functions expecting huge pages
to explicitly receive "folio" instead of "page". This way, people can tell from
a function's declaration what the function expects. Is this understanding
correct?

Passing "start_idx" along with "folio" is due to the requirement of mapping only
a sub-range of a huge folio. e.g., we allow creating a 2MB mapping starting from
the nth idx of a 1GB folio.

On the other hand, if we instead pass "page" to tdh_mem_page_aug() for huge
pages and have tdh_mem_page_aug() internally convert it to "folio" and
"start_idx", it makes me wonder if we could have previously just passed "pfn" to
tdh_mem_page_aug() and had tdh_mem_page_aug() convert it to "page".

> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index b0b33f606c11..41ce18619ffc 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1743,16 +1743,23 @@ u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page)
> >  }
> >  EXPORT_SYMBOL_GPL(tdh_vp_addcx);
> >  
> > -u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2)
> > +u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio,
> > +		     unsigned long start_idx, u64 *ext_err1, u64 *ext_err2)
> >  {
> >  	struct tdx_module_args args = {
> >  		.rcx = gpa | level,
> >  		.rdx = tdx_tdr_pa(td),
> > -		.r8 = page_to_phys(page),
> > +		.r8 = page_to_phys(folio_page(folio, start_idx)),
> >  	};
> > +	unsigned long npages = 1 << (level * PTE_SHIFT);
> >  	u64 ret;
> 
> This 'npages' calculation is not obviously correct. It's not clear what
> "level" is or what values it should have.
> 
> This is precisely the kind of place to deploy a helper that explains
> what is going on.
Will do. Thanks for pointing it out!

> > -	tdx_clflush_page(page);
> > +	if (start_idx + npages > folio_nr_pages(folio))
> > +		return TDX_OPERAND_INVALID;
> 
> Why is this necessary? Would it be a bug if this happens?
This sanity check is due to the requirement in KVM that mapping size should be
no larger than the backend folio size, which ensures the mapping pages are
physically contiguous with homogeneous page attributes. (See the discussion
about "EPT mapping size and folio size" in thread [1]).

Failure of the sanity check could only be due to bugs in the caller (KVM). I
didn't convert the sanity check to an assertion because there's already a
TDX_BUG_ON_2() on error following the invocation of tdh_mem_page_aug() in KVM.

Also, there's no alignment checking because SEAMCALL TDH_MEM_PAGE_AUG() would
fail with a misaligned base PFN.

[1] https://lore.kernel.org/all/aV2A39fXgzuM4Toa@google.com/

> > +	for (int i = 0; i < npages; i++)
> > +		tdx_clflush_page(folio_page(folio, start_idx + i));
> 
> All of the page<->folio conversions are kinda hurting my brain. I think
> we need to decide what the canonical type for these things is in TDX, do
> the conversion once, and stick with it.
Got it!

Since passing in base "page" or base "pfn" may still require the
wrappers/helpers to internally convert them to "folio" for sanity checks, could
we decide that "folio" and "start_idx" are the canonical params for functions
expecting huge pages? Or do you prefer KVM to do the sanity check by itself?
Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
Posted by Dave Hansen 1 month ago
On 1/7/26 01:12, Yan Zhao wrote:
...
> However, my understanding is that it's better for functions expecting huge pages
> to explicitly receive "folio" instead of "page". This way, people can tell from
> a function's declaration what the function expects. Is this understanding
> correct?

In a perfect world, maybe.

But, in practice, a 'struct page' can still represent huge pages and
*does* represent huge pages all over the kernel. There's no need to cram
a folio in here just because a huge page is involved.

> Passing "start_idx" along with "folio" is due to the requirement of mapping only
> a sub-range of a huge folio. e.g., we allow creating a 2MB mapping starting from
> the nth idx of a 1GB folio.
> 
> On the other hand, if we instead pass "page" to tdh_mem_page_aug() for huge
> pages and have tdh_mem_page_aug() internally convert it to "folio" and
> "start_idx", it makes me wonder if we could have previously just passed "pfn" to
> tdh_mem_page_aug() and had tdh_mem_page_aug() convert it to "page".

As a general pattern, I discourage folks from using pfns and physical
addresses when passing around references to physical memory. They have
zero type safety.

It's also not just about type safety. A 'struct page' also *means*
something. It means that the kernel is, on some level, aware of and
managing that memory. It's not MMIO. It doesn't represent the physical
address of the APIC page. It's not SGX memory. It doesn't have a
Shared/Private bit.

All of those properties are important and they're *GONE* if you use a
pfn. It's even worse if you use a raw physical address.

Please don't go back to raw integers (pfns or paddrs).

>>> -	tdx_clflush_page(page);
>>> +	if (start_idx + npages > folio_nr_pages(folio))
>>> +		return TDX_OPERAND_INVALID;
>>
>> Why is this necessary? Would it be a bug if this happens?
> This sanity check is due to the requirement in KVM that mapping size should be
> no larger than the backend folio size, which ensures the mapping pages are
> physically contiguous with homogeneous page attributes. (See the discussion
> about "EPT mapping size and folio size" in thread [1]).
> 
> Failure of the sanity check could only be due to bugs in the caller (KVM). I
> didn't convert the sanity check to an assertion because there's already a
> TDX_BUG_ON_2() on error following the invocation of tdh_mem_page_aug() in KVM.

We generally don't protect against bugs in callers. Otherwise, we'd have
a trillion NULL checks in every function in the kernel.

The only reason to add caller sanity checks is to make things easier to
debug, and those almost always include some kind of spew:
WARN_ON_ONCE(), pr_warn(), etc...

>>> +	for (int i = 0; i < npages; i++)
>>> +		tdx_clflush_page(folio_page(folio, start_idx + i));
>>
>> All of the page<->folio conversions are kinda hurting my brain. I think
>> we need to decide what the canonical type for these things is in TDX, do
>> the conversion once, and stick with it.
> Got it!
> 
> Since passing in base "page" or base "pfn" may still require the
> wrappers/helpers to internally convert them to "folio" for sanity checks, could
> we decide that "folio" and "start_idx" are the canonical params for functions
> expecting huge pages? Or do you prefer KVM to do the sanity check by itself?

I'm not convinced the sanity check is a good idea in the first place. It
just adds complexity.
Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
Posted by Yan Zhao 1 month ago
On Wed, Jan 07, 2026 at 08:39:55AM -0800, Dave Hansen wrote:
> On 1/7/26 01:12, Yan Zhao wrote:
> ...
> > However, my understanding is that it's better for functions expecting huge pages
> > to explicitly receive "folio" instead of "page". This way, people can tell from
> > a function's declaration what the function expects. Is this understanding
> > correct?
> 
> In a perfect world, maybe.
> 
> But, in practice, a 'struct page' can still represent huge pages and
> *does* represent huge pages all over the kernel. There's no need to cram
> a folio in here just because a huge page is involved.
Ok. I can modify the param "struct page *page" to "struct page *base_page", 
explaining that it may belong to a huge folio but is not necessarily the
head page of the folio.

> > Passing "start_idx" along with "folio" is due to the requirement of mapping only
> > a sub-range of a huge folio. e.g., we allow creating a 2MB mapping starting from
> > the nth idx of a 1GB folio.
> > 
> > On the other hand, if we instead pass "page" to tdh_mem_page_aug() for huge
> > pages and have tdh_mem_page_aug() internally convert it to "folio" and
> > "start_idx", it makes me wonder if we could have previously just passed "pfn" to
> > tdh_mem_page_aug() and had tdh_mem_page_aug() convert it to "page".
> 
> As a general pattern, I discourage folks from using pfns and physical
> addresses when passing around references to physical memory. They have
> zero type safety.
> 
> It's also not just about type safety. A 'struct page' also *means*
> something. It means that the kernel is, on some level, aware of and
> managing that memory. It's not MMIO. It doesn't represent the physical
> address of the APIC page. It's not SGX memory. It doesn't have a
> Shared/Private bit.
> 
> All of those properties are important and they're *GONE* if you use a
> pfn. It's even worse if you use a raw physical address.
> 
> Please don't go back to raw integers (pfns or paddrs).
I understood and fully accept it.

I previously wondered if we could allow KVM to pass in pfn and let the SEAMCALL
wrapper do the pfn_to_page() conversion.
But it was just out of curiosity. I actually prefer "struct page" too.


> >>> -	tdx_clflush_page(page);
> >>> +	if (start_idx + npages > folio_nr_pages(folio))
> >>> +		return TDX_OPERAND_INVALID;
> >>
> >> Why is this necessary? Would it be a bug if this happens?
> > This sanity check is due to the requirement in KVM that mapping size should be
> > no larger than the backend folio size, which ensures the mapping pages are
> > physically contiguous with homogeneous page attributes. (See the discussion
> > about "EPT mapping size and folio size" in thread [1]).
> > 
> > Failure of the sanity check could only be due to bugs in the caller (KVM). I
> > didn't convert the sanity check to an assertion because there's already a
> > TDX_BUG_ON_2() on error following the invocation of tdh_mem_page_aug() in KVM.
> 
> We generally don't protect against bugs in callers. Otherwise, we'd have
> a trillion NULL checks in every function in the kernel.
> 
> The only reason to add caller sanity checks is to make things easier to
> debug, and those almost always include some kind of spew:
> WARN_ON_ONCE(), pr_warn(), etc...

Would it be better if I used WARN_ON_ONCE(), like this:


u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *base_page,
                     u64 *ext_err1, u64 *ext_err2)
{
        unsigned long npages = tdx_sept_level_to_npages(level);
        struct tdx_module_args args = {
                .rcx = gpa | level,
                .rdx = tdx_tdr_pa(td),
                .r8 = page_to_phys(base_page),
        };
        u64 ret;

        WARN_ON_ONCE(page_folio(base_page) != page_folio(base_page + npages - 1));

        for (int i = 0; i < npages; i++)
                tdx_clflush_page(base_page + i);

        ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);

        *ext_err1 = args.rcx;
        *ext_err2 = args.rdx;

        return ret;
}

The WARN_ON_ONCE() serves 2 purposes:
1. Loudly warn of subtle KVM bugs.
2. Ensure "page_to_pfn(base_page + i) == (page_to_pfn(base_page) + i)".

If you don't like using "base_page + i" (as the discussion in v2 [1]), we can
invoke folio_page() for the ith page instead.

[1] https://lore.kernel.org/all/01731a9a0346b08577fad75ae560c650145c7f39.camel@intel.com/

> >>> +	for (int i = 0; i < npages; i++)
> >>> +		tdx_clflush_page(folio_page(folio, start_idx + i));
> >>
> >> All of the page<->folio conversions are kinda hurting my brain. I think
> >> we need to decide what the canonical type for these things is in TDX, do
> >> the conversion once, and stick with it.
> > Got it!
> > 
> > Since passing in base "page" or base "pfn" may still require the
> > wrappers/helpers to internally convert them to "folio" for sanity checks, could
> > we decide that "folio" and "start_idx" are the canonical params for functions
> > expecting huge pages? Or do you prefer KVM to do the sanity check by itself?
> 
> I'm not convinced the sanity check is a good idea in the first place. It
> just adds complexity.
I'm worried about subtle bugs introduced by careless coding that might be
silently ignored otherwise, like the one in thread [2].

[2] https://lore.kernel.org/kvm/aV2A39fXgzuM4Toa@google.com/
Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
Posted by Ackerley Tng 1 month ago
Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, Jan 07, 2026 at 08:39:55AM -0800, Dave Hansen wrote:
>> On 1/7/26 01:12, Yan Zhao wrote:
>> ...
>> > However, my understanding is that it's better for functions expecting huge pages
>> > to explicitly receive "folio" instead of "page". This way, people can tell from
>> > a function's declaration what the function expects. Is this understanding
>> > correct?
>>
>> In a perfect world, maybe.
>>
>> But, in practice, a 'struct page' can still represent huge pages and
>> *does* represent huge pages all over the kernel. There's no need to cram
>> a folio in here just because a huge page is involved.
> Ok. I can modify the param "struct page *page" to "struct page *base_page",
> explaining that it may belong to a huge folio but is not necessarily the
> head page of the folio.
>
>> > Passing "start_idx" along with "folio" is due to the requirement of mapping only
>> > a sub-range of a huge folio. e.g., we allow creating a 2MB mapping starting from
>> > the nth idx of a 1GB folio.
>> >
>> > On the other hand, if we instead pass "page" to tdh_mem_page_aug() for huge
>> > pages and have tdh_mem_page_aug() internally convert it to "folio" and
>> > "start_idx", it makes me wonder if we could have previously just passed "pfn" to
>> > tdh_mem_page_aug() and had tdh_mem_page_aug() convert it to "page".
>>
>> As a general pattern, I discourage folks from using pfns and physical
>> addresses when passing around references to physical memory. They have
>> zero type safety.
>>
>> It's also not just about type safety. A 'struct page' also *means*
>> something. It means that the kernel is, on some level, aware of and
>> managing that memory. It's not MMIO. It doesn't represent the physical
>> address of the APIC page. It's not SGX memory. It doesn't have a
>> Shared/Private bit.
>>
>> All of those properties are important and they're *GONE* if you use a
>> pfn. It's even worse if you use a raw physical address.
>>
>> Please don't go back to raw integers (pfns or paddrs).
> I understood and fully accept it.
>
> I previously wondered if we could allow KVM to pass in pfn and let the SEAMCALL
> wrapper do the pfn_to_page() conversion.
> But it was just out of curiosity. I actually prefer "struct page" too.
>
>
>> >>> -	tdx_clflush_page(page);
>> >>> +	if (start_idx + npages > folio_nr_pages(folio))
>> >>> +		return TDX_OPERAND_INVALID;
>> >>
>> >> Why is this necessary? Would it be a bug if this happens?
>> > This sanity check is due to the requirement in KVM that mapping size should be
>> > no larger than the backend folio size, which ensures the mapping pages are
>> > physically contiguous with homogeneous page attributes. (See the discussion
>> > about "EPT mapping size and folio size" in thread [1]).
>> >
>> > Failure of the sanity check could only be due to bugs in the caller (KVM). I
>> > didn't convert the sanity check to an assertion because there's already a
>> > TDX_BUG_ON_2() on error following the invocation of tdh_mem_page_aug() in KVM.
>>
>> We generally don't protect against bugs in callers. Otherwise, we'd have
>> a trillion NULL checks in every function in the kernel.
>>
>> The only reason to add caller sanity checks is to make things easier to
>> debug, and those almost always include some kind of spew:
>> WARN_ON_ONCE(), pr_warn(), etc...
>
> Would it be better if I use WARN_ON_ONCE()? like this:
>
> u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *base_page,
>                      u64 *ext_err1, u64 *ext_err2)
> {
>         unsigned long npages = tdx_sept_level_to_npages(level);
>         struct tdx_module_args args = {
>                 .rcx = gpa | level,
>                 .rdx = tdx_tdr_pa(td),
>                 .r8 = page_to_phys(base_page),
>         };
>         u64 ret;
>
>         WARN_ON_ONCE(page_folio(base_page) != page_folio(base_page + npages - 1));

This WARNs if the first and last folios are not the same folio, which
still assumes something about how pages are grouped into folios. I feel
that this is still stretching TDX code over to make assumptions about
how the kernel manages memory metadata, which is more than TDX actually
cares about.

>
>         for (int i = 0; i < npages; i++)
>                 tdx_clflush_page(base_page + i);
>
>         ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
>
>         *ext_err1 = args.rcx;
>         *ext_err2 = args.rdx;
>
>         return ret;
> }
>
> The WARN_ON_ONCE() serves 2 purposes:
> 1. Loudly warn of subtle KVM bugs.
> 2. Ensure "page_to_pfn(base_page + i) == (page_to_pfn(base_page) + i)".
>

I disagree with checking within TDX code, but if you would still like to
check, 2. that you suggested is less dependent on the concept of how the
kernel groups pages in folios, how about:

  WARN_ON_ONCE(page_to_pfn(base_page + npages - 1) !=
               page_to_pfn(base_page) + npages - 1);

The full contiguity check will scan every page, but I think this doesn't
take too many CPU cycles, and would probably catch what you're looking
to catch in most cases.

I still don't think TDX code should check. The caller should check or
know the right thing to do.

> If you don't like using "base_page + i" (as the discussion in v2 [1]), we can
> invoke folio_page() for the ith page instead.
>
> [1] https://lore.kernel.org/all/01731a9a0346b08577fad75ae560c650145c7f39.camel@intel.com/
>
>> >>> +	for (int i = 0; i < npages; i++)
>> >>> +		tdx_clflush_page(folio_page(folio, start_idx + i));
>> >>
>> >> All of the page<->folio conversions are kinda hurting my brain. I think
>> >> we need to decide what the canonical type for these things is in TDX, do
>> >> the conversion once, and stick with it.
>> > Got it!
>> >
>> > Since passing in base "page" or base "pfn" may still require the
>> > wrappers/helpers to internally convert them to "folio" for sanity checks, could
>> > we decide that "folio" and "start_idx" are the canonical params for functions
>> > expecting huge pages? Or do you prefer KVM to do the sanity check by itself?
>>
>> I'm not convinced the sanity check is a good idea in the first place. It
>> just adds complexity.
> I'm worried about subtle bugs introduced by careless coding that might be
> silently ignored otherwise, like the one in thread [2].
>
> [2] https://lore.kernel.org/kvm/aV2A39fXgzuM4Toa@google.com/
Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
Posted by Yan Zhao 3 weeks, 6 days ago
On Fri, Jan 09, 2026 at 10:29:47AM -0800, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Wed, Jan 07, 2026 at 08:39:55AM -0800, Dave Hansen wrote:
> >> On 1/7/26 01:12, Yan Zhao wrote:
> >> ...
> >> > However, my understanding is that it's better for functions expecting huge pages
> >> > to explicitly receive "folio" instead of "page". This way, people can tell from
> >> > a function's declaration what the function expects. Is this understanding
> >> > correct?
> >>
> >> In a perfect world, maybe.
> >>
> >> But, in practice, a 'struct page' can still represent huge pages and
> >> *does* represent huge pages all over the kernel. There's no need to cram
> >> a folio in here just because a huge page is involved.
> > Ok. I can modify the param "struct page *page" to "struct page *base_page",
> > explaining that it may belong to a huge folio but is not necessarily the
> > head page of the folio.
> >
> >> > Passing "start_idx" along with "folio" is due to the requirement of mapping only
> >> > a sub-range of a huge folio. e.g., we allow creating a 2MB mapping starting from
> >> > the nth idx of a 1GB folio.
> >> >
> >> > On the other hand, if we instead pass "page" to tdh_mem_page_aug() for huge
> >> > pages and have tdh_mem_page_aug() internally convert it to "folio" and
> >> > "start_idx", it makes me wonder if we could have previously just passed "pfn" to
> >> > tdh_mem_page_aug() and had tdh_mem_page_aug() convert it to "page".
> >>
> >> As a general pattern, I discourage folks from using pfns and physical
> >> addresses when passing around references to physical memory. They have
> >> zero type safety.
> >>
> >> It's also not just about type safety. A 'struct page' also *means*
> >> something. It means that the kernel is, on some level, aware of and
> >> managing that memory. It's not MMIO. It doesn't represent the physical
> >> address of the APIC page. It's not SGX memory. It doesn't have a
> >> Shared/Private bit.
> >>
> >> All of those properties are important and they're *GONE* if you use a
> >> pfn. It's even worse if you use a raw physical address.
> >>
> >> Please don't go back to raw integers (pfns or paddrs).
> > I understood and fully accept it.
> >
> > I previously wondered if we could allow KVM to pass in pfn and let the SEAMCALL
> > wrapper do the pfn_to_page() conversion.
> > But it was just out of curiosity. I actually prefer "struct page" too.
> >
> >
> >> >>> -	tdx_clflush_page(page);
> >> >>> +	if (start_idx + npages > folio_nr_pages(folio))
> >> >>> +		return TDX_OPERAND_INVALID;
> >> >>
> >> >> Why is this necessary? Would it be a bug if this happens?
> >> > This sanity check is due to the requirement in KVM that mapping size should be
> >> > no larger than the backend folio size, which ensures the mapping pages are
> >> > physically contiguous with homogeneous page attributes. (See the discussion
> >> > about "EPT mapping size and folio size" in thread [1]).
> >> >
> >> > Failure of the sanity check could only be due to bugs in the caller (KVM). I
> >> > didn't convert the sanity check to an assertion because there's already a
> >> > TDX_BUG_ON_2() on error following the invocation of tdh_mem_page_aug() in KVM.
> >>
> >> We generally don't protect against bugs in callers. Otherwise, we'd have
> >> a trillion NULL checks in every function in the kernel.
> >>
> >> The only reason to add caller sanity checks is to make things easier to
> >> debug, and those almost always include some kind of spew:
> >> WARN_ON_ONCE(), pr_warn(), etc...
> >
> > Would it be better if I use WARN_ON_ONCE()? like this:
> >
> > u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *base_page,
> >                      u64 *ext_err1, u64 *ext_err2)
> > {
> >         unsigned long npages = tdx_sept_level_to_npages(level);
> >         struct tdx_module_args args = {
> >                 .rcx = gpa | level,
> >                 .rdx = tdx_tdr_pa(td),
> >                 .r8 = page_to_phys(base_page),
> >         };
> >         u64 ret;
> >
> >         WARN_ON_ONCE(page_folio(base_page) != page_folio(base_page + npages - 1));
> 
> This WARNs if the first and last pages are not in the same folio, which
If the first and last pages belong to the same folio, the entire range
should be fully covered by that single folio, no?

Maybe the original checking as below is better :)

struct folio *folio = page_folio(base_page);
WARN_ON_ONCE(folio_page_idx(folio, base_page) + npages > folio_nr_pages(folio));

See more in the next comment below.

> still assumes something about how pages are grouped into folios. I feel
> that this is still stretching TDX code over to make assumptions about
> how the kernel manages memory metadata, which is more than TDX actually
> cares about.
> 
> >
> >         for (int i = 0; i < npages; i++)
> >                 tdx_clflush_page(base_page + i);
> >
> >         ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
> >
> >         *ext_err1 = args.rcx;
> >         *ext_err2 = args.rdx;
> >
> >         return ret;
> > }
> >
> > The WARN_ON_ONCE() serves 2 purposes:
> > 1. Loudly warn of subtle KVM bugs.
> > 2. Ensure "page_to_pfn(base_page + i) == (page_to_pfn(base_page) + i)".
> >
> 
> I disagree with checking within TDX code, but if you would still like to
> check, 2. that you suggested is less dependent on the concept of how the
> kernel groups pages in folios, how about:
> 
>   WARN_ON_ONCE(page_to_pfn(base_page + npages - 1) !=
>                page_to_pfn(base_page) + npages - 1);
> 
> The full contiguity check will scan every page, but I think this doesn't
> take too many CPU cycles, and would probably catch what you're looking
> to catch in most cases.
As Dave said, "struct page" serves to guard against MMIO.

e.g., with the memory layout below, checking the contiguity of every PFN is
still not enough.

PFN 0x1000: Normal RAM
PFN 0x1001: MMIO
PFN 0x1002: Normal RAM

Also, is it even safe to reference struct page for PFN 0x1001 (e.g. with
SPARSEMEM without SPARSEMEM_VMEMMAP)?

Leveraging folio makes it safe and simpler.
Since KVM also relies on folio size to determine mapping size, TDX doesn't
introduce extra limitations.
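To make this concrete (a toy model, not kernel code -- the folio_id[] array and
both helpers below are invented purely for illustration): tag each pfn with the
folio it belongs to, with the MMIO pfn belonging to none. A folio-membership
walk rejects exactly the layout above, while raw pfn arithmetic is oblivious
to what backs each pfn:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the layout above: pfn 0x1000 and 0x1002 are RAM in
 * different folios (ids 0 and 1); pfn 0x1001 is MMIO and belongs to
 * no folio (-1). Not the kernel's real representation.
 */
static const int folio_id[] = { 0, -1, 1 };	/* idx 0 => pfn 0x1000 */

/* Raw pfn numbers are consecutive by construction -- can't see MMIO. */
static bool pfns_contiguous(unsigned long base_pfn, unsigned long npages)
{
	(void)base_pfn;
	(void)npages;
	return true;
}

/* Folio membership: every pfn in the range must share one real folio. */
static bool range_in_one_folio(unsigned long idx, unsigned long npages)
{
	if (folio_id[idx] < 0)
		return false;
	for (unsigned long i = 1; i < npages; i++)
		if (folio_id[idx + i] != folio_id[idx])
			return false;
	return true;
}
```

The folio check refuses the 3-page range spanning the MMIO hole even though
the pfns are numerically contiguous.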

> I still don't think TDX code should check. The caller should check or
> know the right thing to do.
Hmm. I don't think the backend folio should be split before it's unmapped
(refer to __folio_split()). Or at least we need to split the S-EPT before
performing the backend folio split (see *).

However, the new gmem does make this happen.
So, I think a warning is necessary to aid in debugging subtle bugs.

[*] https://lore.kernel.org/kvm/aWRQ2xyc9coA6aCg@yzhao56-desk.sh.intel.com/
> > If you don't like using "base_page + i" (as the discussion in v2 [1]), we can
> > invoke folio_page() for the ith page instead.
> >
> > [1] https://lore.kernel.org/all/01731a9a0346b08577fad75ae560c650145c7f39.camel@intel.com/
> >
> >> >>> +	for (int i = 0; i < npages; i++)
> >> >>> +		tdx_clflush_page(folio_page(folio, start_idx + i));
> >> >>
> >> >> All of the page<->folio conversions are kinda hurting my brain. I think
> >> >> we need to decide what the canonical type for these things is in TDX, do
> >> >> the conversion once, and stick with it.
> >> > Got it!
> >> >
> >> > Since passing in base "page" or base "pfn" may still require the
> >> > wrappers/helpers to internally convert them to "folio" for sanity checks, could
> >> > we decide that "folio" and "start_idx" are the canonical params for functions
> >> > expecting huge pages? Or do you prefer KVM to do the sanity check by itself?
> >>
> >> I'm not convinced the sanity check is a good idea in the first place. It
> >> just adds complexity.
> > I'm worried about subtle bugs introduced by careless coding that might be
> > silently ignored otherwise, like the one in thread [2].
> >
> > [2] https://lore.kernel.org/kvm/aV2A39fXgzuM4Toa@google.com/
Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
Posted by Vishal Annapurve 3 weeks, 5 days ago
On Sun, Jan 11, 2026 at 6:44 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> > > The WARN_ON_ONCE() serves 2 purposes:
> > > 1. Loudly warn of subtle KVM bugs.
> > > 2. Ensure "page_to_pfn(base_page + i) == (page_to_pfn(base_page) + i)".
> > >
> >
> > I disagree with checking within TDX code, but if you would still like to
> > check, 2. that you suggested is less dependent on the concept of how the
> > kernel groups pages in folios, how about:
> >
> >   WARN_ON_ONCE(page_to_pfn(base_page + npages - 1) !=
> >                page_to_pfn(base_page) + npages - 1);
> >
> > The full contiguity check will scan every page, but I think this doesn't
> > take too many CPU cycles, and would probably catch what you're looking
> > to catch in most cases.
> As Dave said, "struct page" serves to guard against MMIO.
>
> e.g., with the memory layout below, checking the contiguity of every PFN is
> still not enough.
>
> PFN 0x1000: Normal RAM
> PFN 0x1001: MMIO
> PFN 0x1002: Normal RAM
>

I don't see how guest_memfd memory can be interspersed with MMIO regions.

Is this in reference to the future extension to add private MMIO
ranges? I think this discussion belongs in the context of TDX connect
feature patches. I assume shared/private MMIO assignment to the guests
will happen via completely different paths. And I would assume EPT
entries will have information about whether the mapped ranges are MMIO
or normal memory.

i.e., anything mapped as normal memory in SEPT entries as a huge range
should be safe to operate on without needing to cross-check sanity in
the KVM TDX stack. If a huge range has MMIO/normal RAM ranges mixed up,
then that is a much bigger problem.

> Also, is it even safe to reference struct page for PFN 0x1001 (e.g. with
> SPARSEMEM without SPARSEMEM_VMEMMAP)?
>
> Leveraging folio makes it safe and simpler.
> Since KVM also relies on folio size to determine mapping size, TDX doesn't
> introduce extra limitations.
>
Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
Posted by Yan Zhao 3 weeks, 4 days ago
On Tue, Jan 13, 2026 at 08:50:30AM -0800, Vishal Annapurve wrote:
> On Sun, Jan 11, 2026 at 6:44 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > > > The WARN_ON_ONCE() serves 2 purposes:
> > > > 1. Loudly warn of subtle KVM bugs.
> > > > 2. Ensure "page_to_pfn(base_page + i) == (page_to_pfn(base_page) + i)".
> > > >
> > >
> > > I disagree with checking within TDX code, but if you would still like to
> > > check, 2. that you suggested is less dependent on the concept of how the
> > > kernel groups pages in folios, how about:
> > >
> > >   WARN_ON_ONCE(page_to_pfn(base_page + npages - 1) !=
> > >                page_to_pfn(base_page) + npages - 1);
> > >
> > > The full contiguity check will scan every page, but I think this doesn't
> > > take too many CPU cycles, and would probably catch what you're looking
> > > to catch in most cases.
> > As Dave said, "struct page" serves to guard against MMIO.
> >
> > e.g., with the memory layout below, checking the contiguity of every PFN is
> > still not enough.
> >
> > PFN 0x1000: Normal RAM
> > PFN 0x1001: MMIO
> > PFN 0x1002: Normal RAM
> >
> 
> I don't see how guest_memfd memory can be interspersed with MMIO regions.
It's about API design.

When KVM invokes tdh_phymem_page_wbinvd_hkid(), passing "struct page *base_page"
and "unsigned long npages", a WARN_ON_ONCE() in tdh_phymem_page_wbinvd_hkid()
ensuring those pages belong to a single folio can effectively guarantee that
they are physically contiguous and do not contain MMIO.

Similar to "VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio)" in
__folio_split().

Otherwise, why not just pass "pfn + npages" to tdh_phymem_page_wbinvd_hkid()?
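Concretely, the containment check I have in mind for such wrappers looks like
the sketch below (a toy struct and made-up helper name, since the kernel's
folio helpers can't be run standalone):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy folio: one physically contiguous, kernel-managed page range. */
struct toy_folio {
	unsigned long nr_pages;
};

/*
 * Mirrors the kernel-side check
 *   folio_page_idx(folio, page) + npages <= folio_nr_pages(folio):
 * if the range starting at start_idx fits inside one folio, the pages
 * are physically contiguous and none of them can be MMIO.
 */
static bool range_fits_folio(const struct toy_folio *folio,
			     unsigned long start_idx, unsigned long npages)
{
	return start_idx + npages <= folio->nr_pages;
}
```

e.g., a 2MB range at offset 256 of a 2MB folio overruns the folio and would
trip the WARN.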

> Is this in reference to the future extension to add private MMIO
> ranges? I think this discussion belongs in the context of TDX connect
> feature patches. I assume shared/private MMIO assignment to the guests
> will happen via completely different paths. And I would assume EPT
> entries will have information about whether the mapped ranges are MMIO
> or normal memory.
> 
> i.e. Anything mapped as normal memory in SEPT entries as a huge range
> should be safe to operate on without needing to cross-check sanity in
> the KVM TDX stack. If a huge range has MMIO/normal RAM ranges mixed up,
> then that is a much bigger problem.
> 
> > Also, is it even safe to reference struct page for PFN 0x1001 (e.g. with
> > SPARSEMEM without SPARSEMEM_VMEMMAP)?
> >
> > Leveraging folio makes it safe and simpler.
> > Since KVM also relies on folio size to determine mapping size, TDX doesn't
> > introduce extra limitations.
> >
Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
Posted by Ackerley Tng 1 month ago
Dave Hansen <dave.hansen@intel.com> writes:

> On 1/7/26 01:12, Yan Zhao wrote:
> ...
>> However, my understanding is that it's better for functions expecting huge pages
>> to explicitly receive "folio" instead of "page". This way, people can tell from
>> a function's declaration what the function expects. Is this understanding
>> correct?
>
> In a perfect world, maybe.
>
> But, in practice, a 'struct page' can still represent huge pages and
> *does* represent huge pages all over the kernel. There's no need to cram
> a folio in here just because a huge page is involved.
>
>> Passing "start_idx" along with "folio" is due to the requirement of mapping only
>> a sub-range of a huge folio. e.g., we allow creating a 2MB mapping starting from
>> the nth idx of a 1GB folio.
>>
>> On the other hand, if we instead pass "page" to tdh_mem_page_aug() for huge
>> pages and have tdh_mem_page_aug() internally convert it to "folio" and
>> "start_idx", it makes me wonder if we could have previously just passed "pfn" to
>> tdh_mem_page_aug() and had tdh_mem_page_aug() convert it to "page".
>
> As a general pattern, I discourage folks from using pfns and physical
> addresses when passing around references to physical memory. They have
> zero type safety.
>
> It's also not just about type safety. A 'struct page' also *means*
> something. It means that the kernel is, on some level, aware of and
> managing that memory. It's not MMIO. It doesn't represent the physical
> address of the APIC page. It's not SGX memory. It doesn't have a
> Shared/Private bit.
>

I agree that the use of struct pages is better than the use of struct
folios. I think the use of folios unnecessarily couples low-level TDX
code to memory metadata (pages and folios) in the kernel.

> All of those properties are important and they're *GONE* if you use a
> pfn. It's even worse if you use a raw physical address.
>

We were thinking through what it would take to have TDs use VM_PFNMAP
memory, where the memory may not actually have associated struct
pages. Without further work, having struct pages in the TDX interface
would kind of lock out those sources of memory. Is TDX open to using
non-kernel managed memory?

> Please don't go back to raw integers (pfns or paddrs).
>

I guess what we're all looking for is a type representing regular memory
(to exclude MMIO/APIC pages/SGX/etc) but that isn't limited to memory the
kernel manages.

Perhaps the best we have now is still pfn/paddrs + nr_pages, and having
the callers of TDX functions handle/ensure the checking required to
exclude unsupported types of memory.

For type safety, would phyrs help? [1] Perhaps starting with pfn/paddrs
+ nr_pages would allow transitioning to phyrs later. Using pages would
be okay for now, but I would rather not use folios.

[1] https://lore.kernel.org/all/YdyKWeU0HTv8m7wD@casper.infradead.org/

>>>> -	tdx_clflush_page(page);
>>>> +	if (start_idx + npages > folio_nr_pages(folio))
>>>> +		return TDX_OPERAND_INVALID;
>>>
>>> Why is this necessary? Would it be a bug if this happens?
>> This sanity check is due to the requirement in KVM that mapping size should be
>> no larger than the backend folio size, which ensures the mapping pages are
>> physically contiguous with homogeneous page attributes. (See the discussion
>> about "EPT mapping size and folio size" in thread [1]).
>>
>> Failure of the sanity check could only be due to bugs in the caller (KVM). I
>> didn't convert the sanity check to an assertion because there's already a
>> TDX_BUG_ON_2() on error following the invocation of tdh_mem_page_aug() in KVM.
>
> We generally don't protect against bugs in callers. Otherwise, we'd have
> a trillion NULL checks in every function in the kernel.
>
> The only reason to add caller sanity checks is to make things easier to
> debug, and those almost always include some kind of spew:
> WARN_ON_ONCE(), pr_warn(), etc...
>
>>>> +	for (int i = 0; i < npages; i++)
>>>> +		tdx_clflush_page(folio_page(folio, start_idx + i));
>>>
>>> All of the page<->folio conversions are kinda hurting my brain. I think
>>> we need to decide what the canonical type for these things is in TDX, do
>>> the conversion once, and stick with it.
>> Got it!
>>
>> Since passing in base "page" or base "pfn" may still require the
>> wrappers/helpers to internally convert them to "folio" for sanity checks, could
>> we decide that "folio" and "start_idx" are the canonical params for functions
>> expecting huge pages? Or do you prefer KVM to do the sanity check by itself?
>
> I'm not convinced the sanity check is a good idea in the first place. It
> just adds complexity.
Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
Posted by Dave Hansen 1 month ago
On 1/8/26 11:05, Ackerley Tng wrote:
...
>> All of those properties are important and they're *GONE* if you use a
>> pfn. It's even worse if you use a raw physical address.
> 
> We were thinking through what it would take to have TDs use VM_PFNMAP
> memory, where the memory may not actually have associated struct
> pages. Without further work, having struct pages in the TDX interface
> would kind of lock out those sources of memory. Is TDX open to using
> non-kernel managed memory?

I was afraid someone was going to bring that up. I'm not open to such a
beast today. I'd certainly look at the patches, but it would be a hard
sell and it would need an awfully strong justification.

> For type safety, would phyrs help? [1] Perhaps starting with pfn/paddrs
> + nr_pages would allow transitioning to phyrs later. Using pages would
> be okay for now, but I would rather not use folios.

I don't have any first-hand experience with phyrs. It seems interesting,
but it might be unwieldy to use in practice, kinda like how the proposed
code got messy when folios got thrown in.
Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
Posted by Vishal Annapurve 1 month ago
On Thu, Jan 8, 2026 at 11:24 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/8/26 11:05, Ackerley Tng wrote:
> ...
> >> All of those properties are important and they're *GONE* if you use a
> >> pfn. It's even worse if you use a raw physical address.
> >
> > We were thinking through what it would take to have TDs use VM_PFNMAP
> > memory, where the memory may not actually have associated struct
> > pages. Without further work, having struct pages in the TDX interface
> > would kind of lock out those sources of memory. Is TDX open to using
> > non-kernel managed memory?
>
> I was afraid someone was going to bring that up. I'm not open to such a
> beast today. I'd certainly look at the patches, but it would be a hard
> sell and it would need an awfully strong justification.

Yeah, I will punt this discussion to later when we have something
working on the guest_memfd side. I expect that discussion will carry a
strong justification, backed by all the complexity in guest_memfd.

>
> > For type safety, would phyrs help? [1] Perhaps starting with pfn/paddrs
> > + nr_pages would allow transitioning to phyrs later. Using pages would
> > be okay for now, but I would rather not use folios.
>
> I don't have any first-hand experience with phyrs. It seems interesting,
> but might be unwieldy to use in practice, kinda how the proposed code
> got messy when folios got thrown in.