From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Add helpers to use when allocating or preparing pages that need DPAMT
backing. Make them handle races internally for the case of multiple
callers trying operate on the same 2MB range simultaneously.
While the TDX initialization code in arch/x86 uses pages with 2MB
alignment, KVM will need to hand 4KB pages for it to use. Under DPAMT,
these pages will need DPAMT backing 4KB backing.
Add tdx_alloc_page() and tdx_free_page() to handle both page allocation
and DPAMT installation. Make them behave like normal alloc/free functions
where allocation can fail in the case of no memory, but free (with any
necessary DPAMT release) always succeeds. Do this so they can support the
existing TDX flows that require cleanups to succeed. Also create
tdx_pamt_put()/tdx_pamt_get() to handle installing DPAMT 4KB backing for
pages that are already allocated (such as external page tables, or S-EPT
pages).
Allocate the pages as GFP_KERNEL_ACCOUNT based on that the allocations
will be easily user triggerable.
Since the source of these pages is the page allocator, multiple TDs could
each get 4KB pages that are covered by the same 2MB range. When this
happens only one page pair needs to be installed to cover the 2MB range.
Similarly, when one page is freed, the DPAMT backing cannot be freed until
all TDX pages in the range are no longer in use. Have the helpers manage
these races internally.
So the requirements are that:
1. Free path cannot fail (i.e. no TDX module BUSY errors).
2. Allocation paths need to handle finding that DPAMT backing is already
installed, and only return an error in the case of no memory, not in the
case of losing races with other’s trying to operate on the same DPAMT
range.
3. Free paths cannot fail, and also need to clean up the DPAMT backing
when the last page in the 2MB range is no longer needed by TDX.
Previous changes allocated refcounts to be used to track how many 4KB
pages are in use by TDX for each 2MB region. So update those inside the
helpers and use them to decide when to actually install the DPAMT backing
pages.
tdx_pamt_put() needs to guarantee the DPAMT is installed before returning
so that racing threads don’t tell the TDX module to operate on the page
before it’s installed. Take a lock while adjusting the refcount and doing
the actual TDH.PHYMEM.PAMT.ADD/REMOVE to make sure these happen
atomically. The lock is heavyweight, but will be optimized in future
changes. Just do the simple solution before any complex improvements.
TDH.PHYMEM.PAMT.ADD/REMOVE take exclusive locks at the granularity each
2MB range. A simultaneous attempt to operate on the same 2MB region would
result in a BUSY error code returned from the SEAMCALL. Since the
invocation of SEAMCALLs are behind a lock, this won’t conflict.
Besides the contention between TDH.PHYMEM.PAMT.ADD/REMOVE, many other
SEAMCALLs take the same 2MB granularity locks as shared. This means any
attempt to operate on the page by the TDX module while simultaneously
doing PAMT.ADD/REMOVE will result in a BUSY error. This should not happen,
as the PAMT pages always has to be installed before giving the pages to
the TDX module anyway.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
[Add feedback, update log]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
v4:
- Update tdx_find_pamt_refcount() calls to pass PFN and rely on
internal PMD bucket calculations. Based on changes in previous patch.
- Pull calculation TDX DPAMT 2MB range arg into helper.
- Fix alloc_pamt_array() doesn't zero array on allocation failure (Yan)
- Move "prealloc" comment to future patch. (Kai)
- Use union for dpamt page array. (Dave)
- Use sizeof(*args_array) everywhere instead of sizeof(u64) in some
places. (Dave)
- Fix refcount inc/dec cases. (Xiaoyao)
- Rearrange error handling in tdx_pamt_get()/tdx_pamt_put() to remove
some indented lines.
- Make alloc_pamt_array() use GFP_KERNEL_ACCOUNT like the pre-fault
path does later.
v3:
- Fix hard to follow iteration over struct members.
- Simplify the code by removing the intermediate lists of pages.
- Clear PAMT pages before freeing. (Adrian)
- Rename tdx_nr_pamt_pages(). (Dave)
- Add comments some comments, but thought the simpler code needed
less. So not as much as seem to be requested. (Dave)
- Fix asymmetry in which level of the add/remove helpers global lock is
held.
- Split out optimization.
- Write log.
- Flatten call hierarchies and adjust errors accordingly.
---
arch/x86/include/asm/shared/tdx.h | 7 +
arch/x86/include/asm/tdx.h | 8 +-
arch/x86/virt/vmx/tdx/tdx.c | 258 ++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 2 +
4 files changed, 274 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 6a1646fc2b2f..cc2f251cb791 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -145,6 +145,13 @@ struct tdx_module_args {
u64 rsi;
};
+struct tdx_module_array_args {
+ union {
+ struct tdx_module_args args;
+ u64 args_array[sizeof(struct tdx_module_args) / sizeof(u64)];
+ };
+};
+
/* Used to communicate with the TDX module */
u64 __tdcall(u64 fn, struct tdx_module_args *args);
u64 __tdcall_ret(u64 fn, struct tdx_module_args *args);
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index cf51ccd16194..914213123d94 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -135,11 +135,17 @@ static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
return false; /* To be enabled when kernel is ready */
}
+void tdx_quirk_reset_page(struct page *page);
+
int tdx_guest_keyid_alloc(void);
u32 tdx_get_nr_guest_keyids(void);
void tdx_guest_keyid_free(unsigned int keyid);
-void tdx_quirk_reset_page(struct page *page);
+int tdx_pamt_get(struct page *page);
+void tdx_pamt_put(struct page *page);
+
+struct page *tdx_alloc_page(void);
+void tdx_free_page(struct page *page);
struct tdx_td {
/* TD root structure: */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index edf9182ed86d..745b308785d6 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -2009,6 +2009,264 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
}
EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
+/* Number PAMT pages to be provided to TDX module per 2M region of PA */
+static int tdx_dpamt_entry_pages(void)
+{
+ if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
+ return 0;
+
+ return tdx_sysinfo.tdmr.pamt_4k_entry_size * PTRS_PER_PTE / PAGE_SIZE;
+}
+
+/*
+ * The TDX spec treats the registers like an array, as they are ordered
+ * in the struct. The array size is limited by the number or registers,
+ * so define the max size it could be for worst case allocations and sanity
+ * checking.
+ */
+#define MAX_TDX_ARG_SIZE(reg) (sizeof(struct tdx_module_args) - \
+ offsetof(struct tdx_module_args, reg))
+#define TDX_ARG_INDEX(reg) (offsetof(struct tdx_module_args, reg) / \
+ sizeof(u64))
+
+/*
+ * Treat struct the registers like an array that starts at RDX, per
+ * TDX spec. Do some sanitychecks, and return an indexable type.
+ */
+static u64 *dpamt_args_array_ptr(struct tdx_module_array_args *args)
+{
+ WARN_ON_ONCE(tdx_dpamt_entry_pages() > MAX_TDX_ARG_SIZE(rdx));
+
+ return &args->args_array[TDX_ARG_INDEX(rdx)];
+}
+
+static int alloc_pamt_array(u64 *pa_array)
+{
+ struct page *page;
+ int i;
+
+ for (i = 0; i < tdx_dpamt_entry_pages(); i++) {
+ page = alloc_page(GFP_KERNEL_ACCOUNT);
+ if (!page)
+ goto err;
+ pa_array[i] = page_to_phys(page);
+ }
+
+ return 0;
+err:
+ /*
+ * Zero the rest of the array to help with
+ * freeing in error paths.
+ */
+ for (; i < tdx_dpamt_entry_pages(); i++)
+ pa_array[i] = 0;
+ return -ENOMEM;
+}
+
+static void free_pamt_array(u64 *pa_array)
+{
+ for (int i = 0; i < tdx_dpamt_entry_pages(); i++) {
+ if (!pa_array[i])
+ break;
+
+ /*
+ * Reset pages unconditionally to cover cases
+ * where they were passed to the TDX module.
+ */
+ tdx_quirk_reset_paddr(pa_array[i], PAGE_SIZE);
+
+ __free_page(phys_to_page(pa_array[i]));
+ }
+}
+
+/*
+ * Calculate the arg needed for operating on the DPAMT backing for
+ * a given 4KB page.
+ */
+static u64 pamt_2mb_arg(struct page *page)
+{
+ unsigned long hpa_2mb = ALIGN_DOWN(page_to_phys(page), PMD_SIZE);
+
+ return hpa_2mb | TDX_PS_2M;
+}
+
+/*
+ * Add PAMT backing for the given page. Return's negative error code
+ * for kernel side error conditions (-ENOMEM) and 1 for TDX Module
+ * error. In the case of TDX module error, the return code is stored
+ * in tdx_err.
+ */
+static u64 tdh_phymem_pamt_add(struct page *page, u64 *pamt_pa_array)
+{
+ struct tdx_module_array_args args = {
+ .args.rcx = pamt_2mb_arg(page)
+ };
+ u64 *dpamt_arg_array = dpamt_args_array_ptr(&args);
+
+ /* Copy PAMT page PA's into the struct per the TDX ABI */
+ memcpy(dpamt_arg_array, pamt_pa_array,
+ tdx_dpamt_entry_pages() * sizeof(*dpamt_arg_array));
+
+ return seamcall(TDH_PHYMEM_PAMT_ADD, &args.args);
+}
+
+/* Remove PAMT backing for the given page. */
+static u64 tdh_phymem_pamt_remove(struct page *page, u64 *pamt_pa_array)
+{
+ struct tdx_module_array_args args = {
+ .args.rcx = pamt_2mb_arg(page),
+ };
+ u64 *args_array = dpamt_args_array_ptr(&args);
+ u64 ret;
+
+ ret = seamcall_ret(TDH_PHYMEM_PAMT_REMOVE, &args.args);
+ if (ret)
+ return ret;
+
+ /* Copy PAMT page PA's out of the struct per the TDX ABI */
+ memcpy(pamt_pa_array, args_array,
+ tdx_dpamt_entry_pages() * sizeof(*args_array));
+
+ return ret;
+}
+
+/* Serializes adding/removing PAMT memory */
+static DEFINE_SPINLOCK(pamt_lock);
+
+/* Bump PAMT refcount for the given page and allocate PAMT memory if needed */
+int tdx_pamt_get(struct page *page)
+{
+ u64 pamt_pa_array[MAX_TDX_ARG_SIZE(rdx)];
+ atomic_t *pamt_refcount;
+ u64 tdx_status;
+ int ret;
+
+ if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
+ return 0;
+
+ ret = alloc_pamt_array(pamt_pa_array);
+ if (ret)
+ goto out_free;
+
+ pamt_refcount = tdx_find_pamt_refcount(page_to_pfn(page));
+
+ scoped_guard(spinlock, &pamt_lock) {
+ /*
+ * If the pamt page is already added (i.e. refcount >= 1),
+ * then just increment the refcount.
+ */
+ if (atomic_read(pamt_refcount)) {
+ atomic_inc(pamt_refcount);
+ goto out_free;
+ }
+
+ /* Try to add the pamt page and take the refcount 0->1. */
+
+ tdx_status = tdh_phymem_pamt_add(page, pamt_pa_array);
+ if (!IS_TDX_SUCCESS(tdx_status)) {
+ pr_err("TDH_PHYMEM_PAMT_ADD failed: %#llx\n", tdx_status);
+ goto out_free;
+ }
+
+ atomic_inc(pamt_refcount);
+ }
+
+ return ret;
+out_free:
+ /*
+ * pamt_pa_array is populated or zeroed up to tdx_dpamt_entry_pages()
+ * above. free_pamt_array() can handle either case.
+ */
+ free_pamt_array(pamt_pa_array);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_pamt_get);
+
+/*
+ * Drop PAMT refcount for the given page and free PAMT memory if it is no
+ * longer needed.
+ */
+void tdx_pamt_put(struct page *page)
+{
+ u64 pamt_pa_array[MAX_TDX_ARG_SIZE(rdx)];
+ atomic_t *pamt_refcount;
+ u64 tdx_status;
+
+ if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
+ return;
+
+ pamt_refcount = tdx_find_pamt_refcount(page_to_pfn(page));
+
+ scoped_guard(spinlock, &pamt_lock) {
+ /*
+ * If the there are more than 1 references on the pamt page,
+ * don't remove it yet. Just decrement the refcount.
+ */
+ if (atomic_read(pamt_refcount) > 1) {
+ atomic_dec(pamt_refcount);
+ return;
+ }
+
+ /* Try to remove the pamt page and take the refcount 1->0. */
+
+ tdx_status = tdh_phymem_pamt_remove(page, pamt_pa_array);
+ if (!IS_TDX_SUCCESS(tdx_status)) {
+ pr_err("TDH_PHYMEM_PAMT_REMOVE failed: %#llx\n", tdx_status);
+
+ /*
+ * Don't free pamt_pa_array as it could hold garbage
+ * when tdh_phymem_pamt_remove() fails.
+ */
+ return;
+ }
+
+ atomic_dec(pamt_refcount);
+ }
+
+ /*
+ * pamt_pa_array is populated up to tdx_dpamt_entry_pages() by the TDX
+ * module with pages, or remains zero inited. free_pamt_array() can
+ * handle either case. Just pass it unconditionally.
+ */
+ free_pamt_array(pamt_pa_array);
+}
+EXPORT_SYMBOL_GPL(tdx_pamt_put);
+
+/*
+ * Return a page that can be used as TDX private memory
+ * and obtain TDX protections.
+ */
+struct page *tdx_alloc_page(void)
+{
+ struct page *page;
+
+ page = alloc_page(GFP_KERNEL);
+ if (!page)
+ return NULL;
+
+ if (tdx_pamt_get(page)) {
+ __free_page(page);
+ return NULL;
+ }
+
+ return page;
+}
+EXPORT_SYMBOL_GPL(tdx_alloc_page);
+
+/*
+ * Free a page that was used as TDX private memory. After this,
+ * the page is no longer protected by TDX.
+ */
+void tdx_free_page(struct page *page)
+{
+ if (!page)
+ return;
+
+ tdx_pamt_put(page);
+ __free_page(page);
+}
+EXPORT_SYMBOL_GPL(tdx_free_page);
+
#ifdef CONFIG_KEXEC_CORE
void tdx_cpu_flush_cache_for_kexec(void)
{
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 82bb82be8567..46c4214b79fb 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -46,6 +46,8 @@
#define TDH_PHYMEM_PAGE_WBINVD 41
#define TDH_VP_WR 43
#define TDH_SYS_CONFIG 45
+#define TDH_PHYMEM_PAMT_ADD 58
+#define TDH_PHYMEM_PAMT_REMOVE 59
/*
* SEAMCALL leaf:
--
2.51.2
On Thu, Nov 20, 2025, Rick Edgecombe wrote:
> +/*
> + * Return a page that can be used as TDX private memory
> + * and obtain TDX protections.
Wrap at ~80.
This comment is also misleading, arguably wrong. Because from KVM's perspective,
these APIs are _never_ used to back TDX private memory. They are used only for
control pages, which yeah, I suppose might be encrypted with the guest's private
key, but most readers will interpret "used as TDX private memory" to mean that
these are _the_ source of pages for guest private memory.
> + */
> +struct page *tdx_alloc_page(void)
And in a similar vein, given terminology in other places, maybe call these
tdx_{alloc,free}_control_page()?
> +{
> + struct page *page;
> +
> + page = alloc_page(GFP_KERNEL);
GFP_KERNEL_ACCOUNT, all of these allocations are tied to a VM.
> + if (!page)
> + return NULL;
> +
> + if (tdx_pamt_get(page)) {
> + __free_page(page);
> + return NULL;
> + }
> +
> + return page;
> +}
> +EXPORT_SYMBOL_GPL(tdx_alloc_page);
Note, these can all now be EXPORT_SYMBOL_FOR_KVM.
On 1/16/26 15:17, Sean Christopherson wrote:
>> +struct page *tdx_alloc_page(void)
> And in a similar vein, given terminology in other places, maybe call these
> tdx_{alloc,free}_control_page()?
Ack on the naming, especially if they're never used as normal guest
memory and are *only* for TDX module metadata that the guest never
actually sees.
On Fri, 2026-01-16 at 15:17 -0800, Sean Christopherson wrote:
> On Thu, Nov 20, 2025, Rick Edgecombe wrote:
> > +/*
> > + * Return a page that can be used as TDX private memory
> > + * and obtain TDX protections.
>
> Wrap at ~80.
>
> This comment is also misleading, arguably wrong. Because from KVM's perspective,
> these APIs are _never_ used to back TDX private memory. They are used only for
> control pages, which yeah, I suppose might be encrypted with the guest's private
> key, but most readers will interpret "used as TDX private memory" to mean that
> these are _the_ source of pages for guest private memory.
Maybe just drop "private" and call it TDX memory?
>
> > + */
> > +struct page *tdx_alloc_page(void)
>
> And in a similar vein, given terminology in other places, maybe call these
> tdx_{alloc,free}_control_page()?
True, that is the only use today, but also there is nothing control page
specific about the functions themselves. I'm ok changing it, but I'm not sure it
helps that much.
>
> > +{
> > + struct page *page;
> > +
> > + page = alloc_page(GFP_KERNEL);
>
> GFP_KERNEL_ACCOUNT, all of these allocations are tied to a VM.
Yes, thanks.
>
> > + if (!page)
> > + return NULL;
> > +
> > + if (tdx_pamt_get(page)) {
> > + __free_page(page);
> > + return NULL;
> > + }
> > +
> > + return page;
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_alloc_page);
>
> Note, these can all now be EXPORT_SYMBOL_FOR_KVM.
Sure.
On Thu, Nov 20, 2025 at 04:51:16PM -0800, Rick Edgecombe wrote:
> +static int alloc_pamt_array(u64 *pa_array)
> +{
> + struct page *page;
> + int i;
> +
> + for (i = 0; i < tdx_dpamt_entry_pages(); i++) {
> + page = alloc_page(GFP_KERNEL_ACCOUNT);
> + if (!page)
> + goto err;
> + pa_array[i] = page_to_phys(page);
> + }
> +
> + return 0;
> +err:
On failure to allocate the 2nd page, instead of zeroing out the rest of the
array and having the caller invoke free_pamt_array() to free the 1st page,
could we free the 1st page here directly?
Upon alloc_pamt_array() error, the pages shouldn't have been passed to the TDX
module, so there's no need to invoke tdx_quirk_reset_paddr() as in
free_pamt_array(), right?
It's also somewhat strange to have the caller to invoke free_pamt_array() after
alloc_pamt_array() fails.
> + /*
> + * Zero the rest of the array to help with
> + * freeing in error paths.
> + */
> + for (; i < tdx_dpamt_entry_pages(); i++)
> + pa_array[i] = 0;
> + return -ENOMEM;
> +}
> +
> +static void free_pamt_array(u64 *pa_array)
> +{
> + for (int i = 0; i < tdx_dpamt_entry_pages(); i++) {
> + if (!pa_array[i])
> + break;
> +
> + /*
> + * Reset pages unconditionally to cover cases
> + * where they were passed to the TDX module.
> + */
> + tdx_quirk_reset_paddr(pa_array[i], PAGE_SIZE);
> +
> + __free_page(phys_to_page(pa_array[i]));
> + }
> +}
...
> +/* Bump PAMT refcount for the given page and allocate PAMT memory if needed */
> +int tdx_pamt_get(struct page *page)
> +{
> + u64 pamt_pa_array[MAX_TDX_ARG_SIZE(rdx)];
> + atomic_t *pamt_refcount;
> + u64 tdx_status;
> + int ret;
> +
> + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> + return 0;
> +
> + ret = alloc_pamt_array(pamt_pa_array);
> + if (ret)
> + goto out_free;
> +
> + pamt_refcount = tdx_find_pamt_refcount(page_to_pfn(page));
> +
> + scoped_guard(spinlock, &pamt_lock) {
> + /*
> + * If the pamt page is already added (i.e. refcount >= 1),
> + * then just increment the refcount.
> + */
> + if (atomic_read(pamt_refcount)) {
> + atomic_inc(pamt_refcount);
> + goto out_free;
> + }
> +
> + /* Try to add the pamt page and take the refcount 0->1. */
> +
> + tdx_status = tdh_phymem_pamt_add(page, pamt_pa_array);
> + if (!IS_TDX_SUCCESS(tdx_status)) {
> + pr_err("TDH_PHYMEM_PAMT_ADD failed: %#llx\n", tdx_status);
> + goto out_free;
> + }
> +
> + atomic_inc(pamt_refcount);
> + }
> +
> + return ret;
> +out_free:
> + /*
> + * pamt_pa_array is populated or zeroed up to tdx_dpamt_entry_pages()
> + * above. free_pamt_array() can handle either case.
> + */
> + free_pamt_array(pamt_pa_array);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdx_pamt_get);
On Mon, 2025-12-08 at 17:15 +0800, Yan Zhao wrote: > On failure to allocate the 2nd page, instead of zeroing out the rest of the > array and having the caller invoke free_pamt_array() to free the 1st page, > could we free the 1st page here directly? > > Upon alloc_pamt_array() error, the pages shouldn't have been passed to the TDX > module, so there's no need to invoke tdx_quirk_reset_paddr() as in > free_pamt_array(), right? We don't want to optimize the error path. So I think less code is better than making it slightly faster in an ultra rare case. > > It's also somewhat strange to have the caller to invoke free_pamt_array() after > alloc_pamt_array() fails. Yea, I agree this code is a bit odd. Based on the discussion with Nikolay and Dave, this can be less variable. It should get simplified as part of that.
On 21.11.25 г. 2:51 ч., Rick Edgecombe wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Add helpers to use when allocating or preparing pages that need DPAMT
> backing. Make them handle races internally for the case of multiple
> callers trying operate on the same 2MB range simultaneously.
>
> While the TDX initialization code in arch/x86 uses pages with 2MB
> alignment, KVM will need to hand 4KB pages for it to use. Under DPAMT,
> these pages will need DPAMT backing 4KB backing.
>
> Add tdx_alloc_page() and tdx_free_page() to handle both page allocation
> and DPAMT installation. Make them behave like normal alloc/free functions
> where allocation can fail in the case of no memory, but free (with any
> necessary DPAMT release) always succeeds. Do this so they can support the
> existing TDX flows that require cleanups to succeed. Also create
> tdx_pamt_put()/tdx_pamt_get() to handle installing DPAMT 4KB backing for
> pages that are already allocated (such as external page tables, or S-EPT
> pages).
>
> Allocate the pages as GFP_KERNEL_ACCOUNT based on that the allocations
> will be easily user triggerable.
>
> Since the source of these pages is the page allocator, multiple TDs could
> each get 4KB pages that are covered by the same 2MB range. When this
> happens only one page pair needs to be installed to cover the 2MB range.
> Similarly, when one page is freed, the DPAMT backing cannot be freed until
> all TDX pages in the range are no longer in use. Have the helpers manage
> these races internally.
>
> So the requirements are that:
>
> 1. Free path cannot fail (i.e. no TDX module BUSY errors).
> 2. Allocation paths need to handle finding that DPAMT backing is already
> installed, and only return an error in the case of no memory, not in the
> case of losing races with other’s trying to operate on the same DPAMT
> range.
> 3. Free paths cannot fail, and also need to clean up the DPAMT backing
> when the last page in the 2MB range is no longer needed by TDX.
>
> Previous changes allocated refcounts to be used to track how many 4KB
> pages are in use by TDX for each 2MB region. So update those inside the
> helpers and use them to decide when to actually install the DPAMT backing
> pages.
>
> tdx_pamt_put() needs to guarantee the DPAMT is installed before returning
> so that racing threads don’t tell the TDX module to operate on the page
> before it’s installed. Take a lock while adjusting the refcount and doing
> the actual TDH.PHYMEM.PAMT.ADD/REMOVE to make sure these happen
> atomically. The lock is heavyweight, but will be optimized in future
> changes. Just do the simple solution before any complex improvements.
>
> TDH.PHYMEM.PAMT.ADD/REMOVE take exclusive locks at the granularity each
> 2MB range. A simultaneous attempt to operate on the same 2MB region would
> result in a BUSY error code returned from the SEAMCALL. Since the
> invocation of SEAMCALLs are behind a lock, this won’t conflict.
>
> Besides the contention between TDH.PHYMEM.PAMT.ADD/REMOVE, many other
> SEAMCALLs take the same 2MB granularity locks as shared. This means any
> attempt to operate on the page by the TDX module while simultaneously
> doing PAMT.ADD/REMOVE will result in a BUSY error. This should not happen,
> as the PAMT pages always has to be installed before giving the pages to
> the TDX module anyway.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> [Add feedback, update log]
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> v4:
> - Update tdx_find_pamt_refcount() calls to pass PFN and rely on
> internal PMD bucket calculations. Based on changes in previous patch.
> - Pull calculation TDX DPAMT 2MB range arg into helper.
> - Fix alloc_pamt_array() doesn't zero array on allocation failure (Yan)
> - Move "prealloc" comment to future patch. (Kai)
> - Use union for dpamt page array. (Dave)
> - Use sizeof(*args_array) everywhere instead of sizeof(u64) in some
> places. (Dave)
> - Fix refcount inc/dec cases. (Xiaoyao)
> - Rearrange error handling in tdx_pamt_get()/tdx_pamt_put() to remove
> some indented lines.
> - Make alloc_pamt_array() use GFP_KERNEL_ACCOUNT like the pre-fault
> path does later.
>
> v3:
> - Fix hard to follow iteration over struct members.
> - Simplify the code by removing the intermediate lists of pages.
> - Clear PAMT pages before freeing. (Adrian)
> - Rename tdx_nr_pamt_pages(). (Dave)
> - Add comments some comments, but thought the simpler code needed
> less. So not as much as seem to be requested. (Dave)
> - Fix asymmetry in which level of the add/remove helpers global lock is
> held.
> - Split out optimization.
> - Write log.
> - Flatten call hierarchies and adjust errors accordingly.
> ---
> arch/x86/include/asm/shared/tdx.h | 7 +
> arch/x86/include/asm/tdx.h | 8 +-
> arch/x86/virt/vmx/tdx/tdx.c | 258 ++++++++++++++++++++++++++++++
> arch/x86/virt/vmx/tdx/tdx.h | 2 +
> 4 files changed, 274 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
> index 6a1646fc2b2f..cc2f251cb791 100644
> --- a/arch/x86/include/asm/shared/tdx.h
> +++ b/arch/x86/include/asm/shared/tdx.h
> @@ -145,6 +145,13 @@ struct tdx_module_args {
> u64 rsi;
> };
>
> +struct tdx_module_array_args {
> + union {
> + struct tdx_module_args args;
> + u64 args_array[sizeof(struct tdx_module_args) / sizeof(u64)];
> + };
> +};
> +
> /* Used to communicate with the TDX module */
> u64 __tdcall(u64 fn, struct tdx_module_args *args);
> u64 __tdcall_ret(u64 fn, struct tdx_module_args *args);
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index cf51ccd16194..914213123d94 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -135,11 +135,17 @@ static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
> return false; /* To be enabled when kernel is ready */
> }
>
> +void tdx_quirk_reset_page(struct page *page);
> +
> int tdx_guest_keyid_alloc(void);
> u32 tdx_get_nr_guest_keyids(void);
> void tdx_guest_keyid_free(unsigned int keyid);
>
> -void tdx_quirk_reset_page(struct page *page);
> +int tdx_pamt_get(struct page *page);
> +void tdx_pamt_put(struct page *page);
> +
> +struct page *tdx_alloc_page(void);
> +void tdx_free_page(struct page *page);
>
> struct tdx_td {
> /* TD root structure: */
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index edf9182ed86d..745b308785d6 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -2009,6 +2009,264 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> }
> EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
>
> +/* Number PAMT pages to be provided to TDX module per 2M region of PA */
> +static int tdx_dpamt_entry_pages(void)
> +{
> + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> + return 0;
> +
> + return tdx_sysinfo.tdmr.pamt_4k_entry_size * PTRS_PER_PTE / PAGE_SIZE;
> +}
Isn't this guaranteed to return 2 always as per the ABI? Can't the
allocation of the 2 pages be moved closer to where it's used - in
tdh_phymem_pamt_add which will simplify things a bit?
<snip>
On Thu, 2025-11-27 at 18:11 +0200, Nikolay Borisov wrote:
> > +/* Number PAMT pages to be provided to TDX module per 2M region of PA */
> > +static int tdx_dpamt_entry_pages(void)
> > +{
> > + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> > + return 0;
> > +
> > + return tdx_sysinfo.tdmr.pamt_4k_entry_size * PTRS_PER_PTE /
> > PAGE_SIZE;
> > +}
>
> Isn't this guaranteed to return 2 always as per the ABI? Can't the
> allocation of the 2 pages be moved closer to where it's used - in
> tdh_phymem_pamt_add which will simplify things a bit?
Yea, it could be simpler if it was always guaranteed to be 2 pages. But it was
my understanding that it would not be a fixed size. Can you point to what docs
makes you think that?
Another option would be to ask TDX folks to make it fixed, and then require an
opt-in for it to be expanded later if needed. I would have to check on them on
the reasoning for it being dynamic sized. I'm not sure if it is *that*
complicated at this point though. Once there is more than one, the loops becomes
tempting. And if we loop over 2 we could easily loop over n.
On 2.12.25 г. 0:39 ч., Edgecombe, Rick P wrote:
> On Thu, 2025-11-27 at 18:11 +0200, Nikolay Borisov wrote:
>>> +/* Number PAMT pages to be provided to TDX module per 2M region of PA */
>>> +static int tdx_dpamt_entry_pages(void)
>>> +{
>>> + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
>>> + return 0;
>>> +
>>> + return tdx_sysinfo.tdmr.pamt_4k_entry_size * PTRS_PER_PTE /
>>> PAGE_SIZE;
>>> +}
>>
>> Isn't this guaranteed to return 2 always as per the ABI? Can't the
>> allocation of the 2 pages be moved closer to where it's used - in
>> tdh_phymem_pamt_add which will simplify things a bit?
>
> Yea, it could be simpler if it was always guaranteed to be 2 pages. But it was
> my understanding that it would not be a fixed size. Can you point to what docs
> makes you think that?
Looking at the PHYMEM.PAMT.ADD ABI spec the pages being added are always
put into pair in rdx/r8. So e.g. looking into tdh_phymem_pamt_add rcx is
set to a 2mb page, and subsequently we have the memcpy which simply sets
the rdx/r8 input argument registers, no ? Or am I misunderstanding the
code?
>
> Another option would be to ask TDX folks to make it fixed, and then require an
> opt-in for it to be expanded later if needed. I would have to check on them on
> the reasoning for it being dynamic sized. I'm not sure if it is *that*
> complicated at this point though. Once there is more than one, the loops becomes
> tempting. And if we loop over 2 we could easily loop over n.
On Tue, 2025-12-02 at 09:38 +0200, Nikolay Borisov wrote: > > Yea, it could be simpler if it was always guaranteed to be 2 pages. But it > > was > > my understanding that it would not be a fixed size. Can you point to what > > docs > > makes you think that? > > Looking at the PHYMEM.PAMT.ADD ABI spec the pages being added are always > put into pair in rdx/r8. So e.g. looking into tdh_phymem_pamt_add rcx is > set to a 2mb page, and subsequently we have the memcpy which simply sets > the rdx/r8 input argument registers, no ? Or am I misunderstanding the > code? Hmm, you are totally right. The docs specify the size of the 4k entries, but doesn't specify that Dynamic PAMT is supposed to provide larger sizes in the other registers. A reasonable reading could assume 2 pages always, and the usage of the other registers seems like an assumption. Kirill, any history here? Otherwise, we could turn tdx_dpamt_entry_pages() into a define and shrink the stack allocated buffers. I'll see how much removing the loops helps.
On Tue, Dec 02, 2025 at 08:02:38PM +0000, Edgecombe, Rick P wrote: > On Tue, 2025-12-02 at 09:38 +0200, Nikolay Borisov wrote: > > > Yea, it could be simpler if it was always guaranteed to be 2 pages. But it > > > was > > > my understanding that it would not be a fixed size. Can you point to what > > > docs > > > makes you think that? > > > > Looking at the PHYMEM.PAMT.ADD ABI spec the pages being added are always > > put into pair in rdx/r8. So e.g. looking into tdh_phymem_pamt_add rcx is > > set to a 2mb page, and subsequently we have the memcpy which simply sets > > the rdx/r8 input argument registers, no ? Or am I misunderstanding the > > code? > > Hmm, you are totally right. The docs specify the size of the 4k entries, but > doesn't specify that Dynamic PAMT is supposed to provide larger sizes in the > other registers. A reasonable reading could assume 2 pages always, and the usage > of the other registers seems like an assumption. > > Kirill, any history here? There was a plan to future prove DPAMT by allowing PAMT descriptor to grow in the future. The concrete approach was not settled last time I checked. This code was my attempt to accommodate it. I don't know if it fits the current plan. -- Kiryl Shutsemau / Kirill A. Shutemov
On 3.12.25 г. 15:46 ч., Kiryl Shutsemau wrote: > On Tue, Dec 02, 2025 at 08:02:38PM +0000, Edgecombe, Rick P wrote: >> On Tue, 2025-12-02 at 09:38 +0200, Nikolay Borisov wrote: >>>> Yea, it could be simpler if it was always guaranteed to be 2 pages. But it >>>> was >>>> my understanding that it would not be a fixed size. Can you point to what >>>> docs >>>> makes you think that? >>> >>> Looking at the PHYMEM.PAMT.ADD ABI spec the pages being added are always >>> put into pair in rdx/r8. So e.g. looking into tdh_phymem_pamt_add rcx is >>> set to a 2mb page, and subsequently we have the memcpy which simply sets >>> the rdx/r8 input argument registers, no ? Or am I misunderstanding the >>> code? >> >> Hmm, you are totally right. The docs specify the size of the 4k entries, but >> doesn't specify that Dynamic PAMT is supposed to provide larger sizes in the >> other registers. A reasonable reading could assume 2 pages always, and the usage >> of the other registers seems like an assumption. >> >> Kirill, any history here? > > There was a plan to future prove DPAMT by allowing PAMT descriptor to > grow in the future. The concrete approach was not settled last time I > checked. This code was my attempt to accommodate it. I don't know if it > fits the current plan. > Considering this, I'd opt for the simplest possible approach that works _now_. If in the future there are changes to the ABI let's introduce them incrementally when their time comes.
On 12/3/25 05:48, Nikolay Borisov wrote: >> There was a plan to future prove DPAMT by allowing PAMT descriptor to >> grow in the future. The concrete approach was not settled last time I >> checked. This code was my attempt to accommodate it. I don't know if it >> fits the current plan. > > Considering this, I'd opt for the simplest possible approach that works > _now_. If in the future there are changes to the ABI let's introduce > them incrementally when their time comes. It's fixed at a "page pair" in the ABI documentation, no? If Intel wants anything else, it should have documented that as the ABI. So, as far as the code goes, it's a "page pair" now and forever. Linux does not need to go out of its way to make it inflexible, but there's no need to add complexity now for some future that may never come. I agree with Nikolay: KISS.
On Wed, 2025-12-03 at 07:41 -0800, Dave Hansen wrote: > It's fixed at a "page pair" in the ABI documentation, no? > > If Intel wants anything else, it should have documented that as the ABI. > So, as far as the code goes, it's a "page pair" now and forever. Linux > does not need to go out of its way to make it inflexible, but there's no > need to add complexity now for some future that may never come. > > I agree with Nikolay: KISS. Thanks Dave. Yes, let's stick to the spec. I'm going to try to pull the loops out too because we can get rid of the union array thing too. Thanks for the catch Nikolay.
On 12/3/25 10:15, Edgecombe, Rick P wrote: > On Wed, 2025-12-03 at 07:41 -0800, Dave Hansen wrote: >> It's fixed at a "page pair" in the ABI documentation, no? >> >> If Intel wants anything else, it should have documented that as the ABI. >> So, as far as the code goes, it's a "page pair" now and forever. Linux >> does not need to go out of its way to make it inflexible, but there's no >> need to add complexity now for some future that may never come. >> >> I agree with Nikolay: KISS. > Thanks Dave. Yes, let's stick to the spec. I'm going to try to pull the loops > out too because we can get rid of the union array thing too. Also, I honestly don't see the problem with just allocating an order-1 page for this. Yeah, the TDX modules doesn't need physically contiguous pages, but it's easier for _us_ to lug them around if they are physically contiguous. Plus, if you permanently allocate 2 order-0 pages, you are _probably_ going to permanently destroy 2 potential future 2MB pages. The order-1 allocation will only destroy 1.
On Wed, 2025-12-03 at 10:21 -0800, Dave Hansen wrote:
> > Thanks Dave. Yes, let's stick to the spec. I'm going to try to pull the
> > loops
> > out too because we can get rid of the union array thing too.
>
> Also, I honestly don't see the problem with just allocating an order-1
> page for this. Yeah, the TDX modules doesn't need physically contiguous
> pages, but it's easier for _us_ to lug them around if they are
> physically contiguous.
We have two spin locks to contend with for these allocations. One is the global
spin lock on the arch/x86 side. In this case, the the pages don't have to be
passed far, like:
tdx_pamt_get(some_page, NULL)
page1 = alloc()
page2 = alloc()
scoped_guard(spinlock, &pamt_lock) {
tdh_phymem_pamt_add(.., page1, page2)
/* Pack into struct */
seamcall()
}
I think it's not too bad?
Then there is the KVM MMU spin lock during the fault path. This lock happens way
up the call chain. It goes something like:
topup_tdx_pages_cache() /* Add order-0 pages for S-EPT page tables and dpamt */
spin_lock()
... many calls ...
order_0_s_ept_page table = alloc_from_order_0_cache();
tdx_sept_link_private_spt(order_0_s_ept_page)
tdx_pamt_get(order_0_s_ept_page, order_0_cache)
/* alloc two pages from order_0_cache for dpamt */
tdx_sept_set_private_spte(guest_page)
tdx_pamt_get(guest_page, order_0_cache)
/* alloc two pages from order_0_cache for dpamt*/
spin_unlock()
So if we decide to pass a single order-1 page into tdx_pamt_get() instead of
order_0_cache, we can stop passing the cache between KVM and arch/x86, but we
then need two cache's instead of one. One for order-0 S-EPT page tables and one
for order-1 DPAMT page pairs.
Also, if we have to allocate the order-1 page in each caller, it simplifies the
arch/x86 code, but duplicates the allocation in the KVM callers (only 2 today
though).
So I'm suspicious it's not going to be a big win, but I'll give it a try.
>
> Plus, if you permanently allocate 2 order-0 pages, you are _probably_
> going to permanently destroy 2 potential future 2MB pages. The order-1
> allocation will only destroy 1.
Doesn't the buddy allocator try to avoid splitting larger blocks? I guess you
mean in the worst case, but the DPAMT should also not be allocated forever
either. So I think it's only at the intersection of two worst cases? Worth it?
On 12/3/25 11:59, Edgecombe, Rick P wrote:
> On Wed, 2025-12-03 at 10:21 -0800, Dave Hansen wrote:
>>> Thanks Dave. Yes, let's stick to the spec. I'm going to try to pull the
>>> loops
>>> out too because we can get rid of the union array thing too.
>>
>> Also, I honestly don't see the problem with just allocating an order-1
>> page for this. Yeah, the TDX modules doesn't need physically contiguous
>> pages, but it's easier for _us_ to lug them around if they are
>> physically contiguous.
>
> We have two spin locks to contend with for these allocations. One is the global
> spin lock on the arch/x86 side. In this case, the the pages don't have to be
> passed far, like:
>
> tdx_pamt_get(some_page, NULL)
> page1 = alloc()
> page2 = alloc()
>
> scoped_guard(spinlock, &pamt_lock) {
> tdh_phymem_pamt_add(.., page1, page2)
> /* Pack into struct */
> seamcall()
> }
>
> I think it's not too bad?
No, that's not bad.
The thing I thought was annoying was in the past when there were a bunch
of functions with two explicit page pointers plumbed in to them.
> So if we decide to pass a single order-1 page into tdx_pamt_get() instead of
> order_0_cache, we can stop passing the cache between KVM and arch/x86, but we
> then need two cache's instead of one. One for order-0 S-EPT page tables and one
> for order-1 DPAMT page pairs.
>
> Also, if we have to allocate the order-1 page in each caller, it simplifies the
> arch/x86 code, but duplicates the allocation in the KVM callers (only 2 today
> though).
>
> So I'm suspicious it's not going to be a big win, but I'll give it a try.
Yeah, the value of doing order-1 is super low if it means managing a
second cache.
>> Plus, if you permanently allocate 2 order-0 pages, you are _probably_
>> going to permanently destroy 2 potential future 2MB pages. The order-1
>> allocation will only destroy 1.
>
> Doesn't the buddy allocator try to avoid splitting larger blocks? I guess you
> mean in the worst case, but the DPAMT should also not be allocated forever
> either. So I think it's only at the intersection of two worst cases? Worth it?
It's not splitting them in this case. They're *already* split:
# cat /proc/buddyinfo
...
Node 0, zone Normal 32903 33566 ...
See, there are already ~33,000 4k pages sitting there. Those will be
consumed first on any 4k allocation. So, yeah, it'll avoid splitting an
8k block to get 4k pages normally.
BTW, I *DO* expect the DPAMT pages to be mostly allocated forever. Maybe
I'm just a pessimist, but you can't get them back for compaction or
reclaim, so they're basically untouchable. Sure, if you kill all the TDX
guests you get them back, but that's a very different kind of kernel
memory from stuff that's truly reclaimable under pressure.
On Wed, 2025-12-03 at 12:13 -0800, Dave Hansen wrote: > BTW, I *DO* expect the DPAMT pages to be mostly allocated forever. Maybe > I'm just a pessimist, but you can't get them back for compaction or > reclaim, so they're basically untouchable. Sure, if you kill all the TDX > guests you get them back, but that's a very different kind of kernel > memory from stuff that's truly reclaimable under pressure. If we want to improve it (later) we could topup via a single higher order allocation. So like if we need 15 pages we could round up to an order-4, then hand them out individually. But I suspect someone would want to see some proof before that kind of tweak. So I take it you are ok to leave the general approach, but worth trying to remove the dynamic sizing.
On 12/3/25 13:39, Edgecombe, Rick P wrote: > So I take it you are ok to leave the general approach, but worth trying to > remove the dynamic sizing. Yeah, that sounds fine.
On 21.11.25 г. 2:51 ч., Rick Edgecombe wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Add helpers to use when allocating or preparing pages that need DPAMT
> backing. Make them handle races internally for the case of multiple
> callers trying operate on the same 2MB range simultaneously.
>
> While the TDX initialization code in arch/x86 uses pages with 2MB
> alignment, KVM will need to hand 4KB pages for it to use. Under DPAMT,
> these pages will need DPAMT backing 4KB backing.
That paragraph is rather hard to parse. KVM will need to hand 4k pages
to whom? The tdx init code? Also the last sentence with the 2 "backing"
words is hard to parse. Does it say that the 4k pages that KVM need to
pass must be backed by DPAMT pages i.e like a chicken and egg problem?
>
> Add tdx_alloc_page() and tdx_free_page() to handle both page allocation
> and DPAMT installation. Make them behave like normal alloc/free functions
> where allocation can fail in the case of no memory, but free (with any
> necessary DPAMT release) always succeeds. Do this so they can support the
> existing TDX flows that require cleanups to succeed. Also create
> tdx_pamt_put()/tdx_pamt_get() to handle installing DPAMT 4KB backing for
> pages that are already allocated (such as external page tables, or S-EPT
> pages).
>
<snip>
> +
> +/* Serializes adding/removing PAMT memory */
> +static DEFINE_SPINLOCK(pamt_lock);
> +
> +/* Bump PAMT refcount for the given page and allocate PAMT memory if needed */
> +int tdx_pamt_get(struct page *page)
> +{
> + u64 pamt_pa_array[MAX_TDX_ARG_SIZE(rdx)];
> + atomic_t *pamt_refcount;
> + u64 tdx_status;
> + int ret;
> +
> + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> + return 0;
> +
> + ret = alloc_pamt_array(pamt_pa_array);
> + if (ret)
> + goto out_free;
> +
> + pamt_refcount = tdx_find_pamt_refcount(page_to_pfn(page));
> +
> + scoped_guard(spinlock, &pamt_lock) {
> + /*
> + * If the pamt page is already added (i.e. refcount >= 1),
> + * then just increment the refcount.
> + */
> + if (atomic_read(pamt_refcount)) {
> + atomic_inc(pamt_refcount);
> + goto out_free;
> + }
Replace this pair of read/inc with a single call to atomic_inc_not_zero()
> +
> + /* Try to add the pamt page and take the refcount 0->1. */
> +
> + tdx_status = tdh_phymem_pamt_add(page, pamt_pa_array);
> + if (!IS_TDX_SUCCESS(tdx_status)) {
> + pr_err("TDH_PHYMEM_PAMT_ADD failed: %#llx\n", tdx_status);
> + goto out_free;
> + }
> +
> + atomic_inc(pamt_refcount);
> + }
> +
> + return ret;
> +out_free:
> + /*
> + * pamt_pa_array is populated or zeroed up to tdx_dpamt_entry_pages()
> + * above. free_pamt_array() can handle either case.
> + */
> + free_pamt_array(pamt_pa_array);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdx_pamt_get);
> +
> +/*
> + * Drop PAMT refcount for the given page and free PAMT memory if it is no
> + * longer needed.
> + */
> +void tdx_pamt_put(struct page *page)
> +{
> + u64 pamt_pa_array[MAX_TDX_ARG_SIZE(rdx)];
> + atomic_t *pamt_refcount;
> + u64 tdx_status;
> +
> + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> + return;
> +
> + pamt_refcount = tdx_find_pamt_refcount(page_to_pfn(page));
> +
> + scoped_guard(spinlock, &pamt_lock) {
> + /*
> + * If the there are more than 1 references on the pamt page,
> + * don't remove it yet. Just decrement the refcount.
> + */
> + if (atomic_read(pamt_refcount) > 1) {
> + atomic_dec(pamt_refcount);
> + return;
> + }
nit: Could be replaced with : atomic_add_unless(pamt_refcount, -1, 1);
Probably it would have been better to simply use atomic64_dec_and_test
and if it returns true do the phymem_pamt_remove, but I suspect you
can't do it because in case it fails you don't want to decrement the
last refcount, though that could be remedied by an extra atomic_int in
the failure path. I guess it might be worth simplifying since the extra
inc will only be needed in exceptional cases (we don't expect failure ot
be the usual path) and freeing is not a fast path.
> +
> + /* Try to remove the pamt page and take the refcount 1->0. */
> +
> + tdx_status = tdh_phymem_pamt_remove(page, pamt_pa_array);
> + if (!IS_TDX_SUCCESS(tdx_status)) {
> + pr_err("TDH_PHYMEM_PAMT_REMOVE failed: %#llx\n", tdx_status);
> +
> + /*
> + * Don't free pamt_pa_array as it could hold garbage
> + * when tdh_phymem_pamt_remove() fails.
> + */
> + return;
> + }
> +
> + atomic_dec(pamt_refcount);
> + }
> +
> + /*
> + * pamt_pa_array is populated up to tdx_dpamt_entry_pages() by the TDX
> + * module with pages, or remains zero inited. free_pamt_array() can
> + * handle either case. Just pass it unconditionally.
> + */
> + free_pamt_array(pamt_pa_array);
> +}
> +EXPORT_SYMBOL_GPL(tdx_pamt_put);
<snip>
On Thu, 2025-11-27 at 14:29 +0200, Nikolay Borisov wrote:
> > While the TDX initialization code in arch/x86 uses pages with 2MB
> > alignment, KVM will need to hand 4KB pages for it to use. Under DPAMT,
> > these pages will need DPAMT backing 4KB backing.
>
> That paragraph is rather hard to parse. KVM will need to hand 4k pages
> to whom? The tdx init code? Also the last sentence with the 2 "backing"
> words is hard to parse. Does it say that the 4k pages that KVM need to
> pass must be backed by DPAMT pages i.e like a chicken and egg problem?
You're right it's confusing and also a slightly misleading typo, maybe this is
an improvement?
While the TDX module initialization code in arch/x86 only hands pages to the
TDX module with 2MB alignment, KVM will need to hand pages to the TDX module
at 4KB granularity. Under DPAMT, such pages will require performing the
TDH.PHYMEM.PAMT.ADD SEAMCALL to give a page pair the the TDX module to use
for PAMT tracking at the 4KB page size page.
Note: There is no chicken or egg problem, TDH.PHYMEM.PAMT.ADD handles it.
>
> >
> > Add tdx_alloc_page() and tdx_free_page() to handle both page allocation
> > and DPAMT installation. Make them behave like normal alloc/free functions
> > where allocation can fail in the case of no memory, but free (with any
> > necessary DPAMT release) always succeeds. Do this so they can support the
> > existing TDX flows that require cleanups to succeed. Also create
> > tdx_pamt_put()/tdx_pamt_get() to handle installing DPAMT 4KB backing for
> > pages that are already allocated (such as external page tables, or S-EPT
> > pages).
> >
>
> <snip>
>
>
> > +
> > +/* Serializes adding/removing PAMT memory */
> > +static DEFINE_SPINLOCK(pamt_lock);
> > +
> > +/* Bump PAMT refcount for the given page and allocate PAMT memory if needed
> > */
> > +int tdx_pamt_get(struct page *page)
> > +{
> > + u64 pamt_pa_array[MAX_TDX_ARG_SIZE(rdx)];
> > + atomic_t *pamt_refcount;
> > + u64 tdx_status;
> > + int ret;
> > +
> > + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> > + return 0;
> > +
> > + ret = alloc_pamt_array(pamt_pa_array);
> > + if (ret)
> > + goto out_free;
> > +
> > + pamt_refcount = tdx_find_pamt_refcount(page_to_pfn(page));
> > +
> > + scoped_guard(spinlock, &pamt_lock) {
> > + /*
> > + * If the pamt page is already added (i.e. refcount >= 1),
> > + * then just increment the refcount.
> > + */
> > + if (atomic_read(pamt_refcount)) {
> > + atomic_inc(pamt_refcount);
> > + goto out_free;
> > + }
>
> Replace this pair of read/inc with a single call to atomic_inc_not_zero()
From the thread with Binbin on this patch, the atomics aren't really needed
until the optimization patch. So was thinking to actually use a simpler non-
atomic operations.
>
> > +
> > + /* Try to add the pamt page and take the refcount 0->1. */
> > +
> > + tdx_status = tdh_phymem_pamt_add(page, pamt_pa_array);
> > + if (!IS_TDX_SUCCESS(tdx_status)) {
> > + pr_err("TDH_PHYMEM_PAMT_ADD failed: %#llx\n",
> > tdx_status);
> > + goto out_free;
> > + }
> > +
> > + atomic_inc(pamt_refcount);
> > + }
> > +
> > + return ret;
> > +out_free:
> > + /*
> > + * pamt_pa_array is populated or zeroed up to
> > tdx_dpamt_entry_pages()
> > + * above. free_pamt_array() can handle either case.
> > + */
> > + free_pamt_array(pamt_pa_array);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_pamt_get);
> > +
> > +/*
> > + * Drop PAMT refcount for the given page and free PAMT memory if it is no
> > + * longer needed.
> > + */
> > +void tdx_pamt_put(struct page *page)
> > +{
> > + u64 pamt_pa_array[MAX_TDX_ARG_SIZE(rdx)];
> > + atomic_t *pamt_refcount;
> > + u64 tdx_status;
> > +
> > + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> > + return;
> > +
> > + pamt_refcount = tdx_find_pamt_refcount(page_to_pfn(page));
> > +
> > + scoped_guard(spinlock, &pamt_lock) {
> > + /*
> > + * If the there are more than 1 references on the pamt
> > page,
> > + * don't remove it yet. Just decrement the refcount.
> > + */
> > + if (atomic_read(pamt_refcount) > 1) {
> > + atomic_dec(pamt_refcount);
> > + return;
> > + }
>
> nit: Could be replaced with : atomic_add_unless(pamt_refcount, -1, 1);
>
> Probably it would have been better to simply use atomic64_dec_and_test
> and if it returns true do the phymem_pamt_remove, but I suspect you
> can't do it because in case it fails you don't want to decrement the
> last refcount, though that could be remedied by an extra atomic_int in
> the failure path. I guess it might be worth simplifying since the extra
> inc will only be needed in exceptional cases (we don't expect failure ot
> be the usual path) and freeing is not a fast path.
The goal of this patch is to be as simple and obviously correct as possible.
Then the next patch "x86/virt/tdx: Optimize tdx_alloc/free_page() helpers"
should have the optimized versions. Do you have any similar suggestions on the
code after the next patch is applied?
On Thu, 2025-11-20 at 16:51 -0800, Rick Edgecombe wrote: > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h > index cf51ccd16194..914213123d94 100644 > --- a/arch/x86/include/asm/tdx.h > +++ b/arch/x86/include/asm/tdx.h > @@ -135,11 +135,17 @@ static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo) > return false; /* To be enabled when kernel is ready */ > } > > +void tdx_quirk_reset_page(struct page *page); > + > int tdx_guest_keyid_alloc(void); > u32 tdx_get_nr_guest_keyids(void); > void tdx_guest_keyid_free(unsigned int keyid); > > -void tdx_quirk_reset_page(struct page *page); I don't think it's mandatory to move the declaration of tdx_quirk_reset_page()?
On Wed, 2025-11-26 at 01:21 +0000, Huang, Kai wrote: > > } > > > > +void tdx_quirk_reset_page(struct page *page); > > + > > int tdx_guest_keyid_alloc(void); > > u32 tdx_get_nr_guest_keyids(void); > > void tdx_guest_keyid_free(unsigned int keyid); > > > > -void tdx_quirk_reset_page(struct page *page); > > I don't think it's mandatory to move the declaration of > tdx_quirk_reset_page()? Sure, I'll fix it.
On 11/21/2025 8:51 AM, Rick Edgecombe wrote:
[...]
>
> +/* Number PAMT pages to be provided to TDX module per 2M region of PA */
^ ^
of 2MB
> +static int tdx_dpamt_entry_pages(void)
> +{
> + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> + return 0;
> +
> + return tdx_sysinfo.tdmr.pamt_4k_entry_size * PTRS_PER_PTE / PAGE_SIZE;
> +}
> +
> +/*
> + * The TDX spec treats the registers like an array, as they are ordered
> + * in the struct. The array size is limited by the number or registers,
> + * so define the max size it could be for worst case allocations and sanity
> + * checking.
> + */
> +#define MAX_TDX_ARG_SIZE(reg) (sizeof(struct tdx_module_args) - \
> + offsetof(struct tdx_module_args, reg))
This should be the maximum number of registers could be used to pass the
addresses of the pages (or other information), it needs to be divided by sizeof(u64).
Also, "SIZE" in the name could be confusing.
> +#define TDX_ARG_INDEX(reg) (offsetof(struct tdx_module_args, reg) / \
> + sizeof(u64))
> +
> +/*
> + * Treat struct the registers like an array that starts at RDX, per
> + * TDX spec. Do some sanitychecks, and return an indexable type.
sanitychecks -> sanity checks
> + */
[...]
> +/* Serializes adding/removing PAMT memory */
> +static DEFINE_SPINLOCK(pamt_lock);
> +
> +/* Bump PAMT refcount for the given page and allocate PAMT memory if needed */
> +int tdx_pamt_get(struct page *page)
> +{
> + u64 pamt_pa_array[MAX_TDX_ARG_SIZE(rdx)];
> + atomic_t *pamt_refcount;
> + u64 tdx_status;
> + int ret;
> +
> + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> + return 0;
> +
> + ret = alloc_pamt_array(pamt_pa_array);
> + if (ret)
> + goto out_free;
> +
> + pamt_refcount = tdx_find_pamt_refcount(page_to_pfn(page));
> +
> + scoped_guard(spinlock, &pamt_lock) {
> + /*
> + * If the pamt page is already added (i.e. refcount >= 1),
> + * then just increment the refcount.
> + */
> + if (atomic_read(pamt_refcount)) {
> + atomic_inc(pamt_refcount);
So far, all atomic operations are inside the spinlock.
May be better to add some info in the change log that atomic is needed due to
the optimization in the later patch?
> + goto out_free;
> + }
> +
> + /* Try to add the pamt page and take the refcount 0->1. */
> +
> + tdx_status = tdh_phymem_pamt_add(page, pamt_pa_array);
> + if (!IS_TDX_SUCCESS(tdx_status)) {
> + pr_err("TDH_PHYMEM_PAMT_ADD failed: %#llx\n", tdx_status);
Can use pr_tdx_error().
Aslo, so for in this patch, when this SEAMCALL failed, does it indicate a bug?
> + goto out_free;
> + }
> +
> + atomic_inc(pamt_refcount);
> + }
> +
> + return ret;
Maybe just return 0 here since all error paths must be directed to out_free.
> +out_free:
> + /*
> + * pamt_pa_array is populated or zeroed up to tdx_dpamt_entry_pages()
> + * above. free_pamt_array() can handle either case.
> + */
> + free_pamt_array(pamt_pa_array);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdx_pamt_get);
> +
>
[...]
On Tue, 2025-11-25 at 16:09 +0800, Binbin Wu wrote:
>
> On 11/21/2025 8:51 AM, Rick Edgecombe wrote:
> [...]
> >
> > +/* Number PAMT pages to be provided to TDX module per 2M region of PA */
> ^ ^
> of 2MB
Sounds good.
> > +static int tdx_dpamt_entry_pages(void)
> > +{
> > + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> > + return 0;
> > +
> > + return tdx_sysinfo.tdmr.pamt_4k_entry_size * PTRS_PER_PTE / PAGE_SIZE;
> > +}
> > +
> > +/*
> > + * The TDX spec treats the registers like an array, as they are ordered
> > + * in the struct. The array size is limited by the number or registers,
> > + * so define the max size it could be for worst case allocations and sanity
> > + * checking.
> > + */
> > +#define MAX_TDX_ARG_SIZE(reg) (sizeof(struct tdx_module_args) - \
> > + offsetof(struct tdx_module_args, reg))
>
> This should be the maximum number of registers could be used to pass the
> addresses of the pages (or other information), it needs to be divided by sizeof(u64).
Oops, right.
>
> Also, "SIZE" in the name could be confusing.
Yes LENGTH is probably better.
>
> > +#define TDX_ARG_INDEX(reg) (offsetof(struct tdx_module_args, reg) / \
> > + sizeof(u64))
> > +
> > +/*
> > + * Treat struct the registers like an array that starts at RDX, per
> > + * TDX spec. Do some sanitychecks, and return an indexable type.
> sanitychecks -> sanity checks
>
> > + */
> [...]
> > +/* Serializes adding/removing PAMT memory */
> > +static DEFINE_SPINLOCK(pamt_lock);
> > +
> > +/* Bump PAMT refcount for the given page and allocate PAMT memory if needed */
> > +int tdx_pamt_get(struct page *page)
> > +{
> > + u64 pamt_pa_array[MAX_TDX_ARG_SIZE(rdx)];
> > + atomic_t *pamt_refcount;
> > + u64 tdx_status;
> > + int ret;
> > +
> > + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> > + return 0;
> > +
> > + ret = alloc_pamt_array(pamt_pa_array);
> > + if (ret)
> > + goto out_free;
> > +
> > + pamt_refcount = tdx_find_pamt_refcount(page_to_pfn(page));
> > +
> > + scoped_guard(spinlock, &pamt_lock) {
> > + /*
> > + * If the pamt page is already added (i.e. refcount >= 1),
> > + * then just increment the refcount.
> > + */
> > + if (atomic_read(pamt_refcount)) {
> > + atomic_inc(pamt_refcount);
>
> So far, all atomic operations are inside the spinlock.
> May be better to add some info in the change log that atomic is needed due to
> the optimization in the later patch?
Yea, I really debated this. Changing to atomics is more churn, but doing part of
the optimizations early is weird too. I think I might go with the more churn
approach.
>
> > + goto out_free;
> > + }
> > +
> > + /* Try to add the pamt page and take the refcount 0->1. */
> > +
> > + tdx_status = tdh_phymem_pamt_add(page, pamt_pa_array);
> > + if (!IS_TDX_SUCCESS(tdx_status)) {
> > + pr_err("TDH_PHYMEM_PAMT_ADD failed: %#llx\n", tdx_status);
>
> Can use pr_tdx_error().
Not in arch/x86, plus it became TDX_BUG_ON() in the cleanup series.
>
> Aslo, so for in this patch, when this SEAMCALL failed, does it indicate a bug?
Yes. This should set ret to an error value too. Yes SEAMCALL failure indicates a
bug, but it should be harmless to the host. tdx_pamt_get() returns a failure so
the calling code can handle it. The tdx_pamt_put() is a little weirder. We can
have the option to keep the refcount elevated if remove fails. This means DPAMT
stays addded for this 2MB range. I think it is ok for 4KB guest memory. It
basically means a DPAMT page will stay in place forever in the case of a bug,
which is the current non-dynamic PAMT situation. So I'm not sure if a full WARN
is needed.
But I'm just realizing that the page it is backing could reform it's way into a
2MB page. Then, after the TDX huge pages series, it could end up getting mapped
as 2MB private memory. In which case the PAMT.REMOVE needs to succeed before
adding the page as 2MB. Hmm, need to check TDX huge page series, because I think
this is not handled. So when TDX huge pages comes, a WARN might be more
appropriate.
We might need to make tdx_pamt_put() return an error for huge pages, and treat
failure like the other TDX remove failures, or maybe just a WARN. I'd lean
towards just the WARN. Do you have any better ideas?
>
> > + goto out_free;
> > + }
> > +
> > + atomic_inc(pamt_refcount);
> > + }
> > +
> > + return ret;
>
> Maybe just return 0 here since all error paths must be directed to out_free.
Yes.
>
> > +out_free:
> > + /*
> > + * pamt_pa_array is populated or zeroed up to tdx_dpamt_entry_pages()
> > + * above. free_pamt_array() can handle either case.
> > + */
> > + free_pamt_array(pamt_pa_array);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_pamt_get);
> > +
> >
> [...]
Gah, forgot to hit "send" on this.
On Wed, Nov 26, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-11-25 at 16:09 +0800, Binbin Wu wrote:
> > On 11/21/2025 8:51 AM, Rick Edgecombe wrote:
> > > +/*
> > > + * The TDX spec treats the registers like an array, as they are ordered
> > > + * in the struct. The array size is limited by the number or registers,
> > > + * so define the max size it could be for worst case allocations and sanity
> > > + * checking.
> > > + */
> > > +#define MAX_TDX_ARG_SIZE(reg) (sizeof(struct tdx_module_args) - \
> > > + offsetof(struct tdx_module_args, reg))
> >
> > This should be the maximum number of registers could be used to pass the
> > addresses of the pages (or other information), it needs to be divided by sizeof(u64).
>
> Oops, right.
>
> >
> > Also, "SIZE" in the name could be confusing.
>
> Yes LENGTH is probably better.
>
> >
> > > +#define TDX_ARG_INDEX(reg) (offsetof(struct tdx_module_args, reg) / \
> > > + sizeof(u64))
Honestly, the entire scheme is a mess. Four days of staring at this and I finally
undertand what the code is doing. The whole "struct tdx_module_array_args" union
is completely unnecessary, the resulting args.args crud is ugly, having a pile of
duplicate accessors is brittle, the code obfuscates a simple concept, and the end
result doesn't provide any actual protection since the kernel will happily overflow
the buffer after the WARN.
It's also relying on the developer to correctly copy+paste the same register in
multiple locations: ~5 depending on how you want to count.
static u64 *dpamt_args_array_ptr_r12(struct tdx_module_array_args *args)
#1
{
WARN_ON_ONCE(tdx_dpamt_entry_pages() > MAX_TDX_ARGS(r12));
#2
return &args->args_array[TDX_ARG_INDEX(r12)];
#3
u64 guest_memory_pamt_page[MAX_TDX_ARGS(r12)];
#4
u64 *args_array = dpamt_args_array_ptr_r12(&args);
#5
After all of that boilerplate, the caller _still_ has to do the actual memcpy(),
and for me at least, all of the above makes it _harder_ to understand what the code
is doing.
Drop the struct+union overlay and just provide a helper with wrappers to copy
to/from a tdx_module_args structure. It's far from bulletproof, but it at least
avoids an immediate buffer overflow, and defers to the kernel owner with respect
to handling uninitialized stack data.
/*
* For SEAMCALLs that pass a bundle of pages, the TDX spec treats the registers
* like an array, as they are ordered in the struct. The effective array size
* is (obviously) limited by the number or registers, relative to the starting
* register. Fill the register array at a given starting register, with sanity
* checks to avoid overflowing the args structure.
*/
static void dpamt_copy_regs_array(struct tdx_module_args *args, void *reg,
u64 *pamt_pa_array, bool copy_to_regs)
{
int size = tdx_dpamt_entry_pages() * sizeof(*pamt_pa_array);
if (WARN_ON_ONCE(reg + size > (void *)args) + sizeof(*args))
return;
/* Copy PAMT page PA's to/from the struct per the TDX ABI. */
if (copy_to_regs)
memcpy(reg, pamt_pa_array, size);
else
memcpy(pamt_pa_array, reg, size);
}
#define dpamt_copy_from_regs(dst, args, reg) \
dpamt_copy_regs_array(args, &(args)->reg, dst, false)
#define dpamt_copy_to_regs(args, reg, src) \
dpamt_copy_regs_array(args, &(args)->reg, src, true)
As far as the on-stack allocations go, why bother being precise? Except for
paranoid setups which explicitly initialize the stack, "allocating" ~48 unused
bytes is literally free. Not to mention the cost relative to the latency of a
SEAMCALL is in the noise.
/*
* When declaring PAMT arrays on the stack, use the maximum theoretical number
* of entries that can be squeezed into a SEAMCALL, as stack allocations are
* practically free, i.e. any wasted space is a non-issue.
*/
#define MAX_NR_DPAMT_ARGS (sizeof(struct tdx_module_args) / sizeof(u64))
With that, callers don't have to regurgitate the same register multiple times,
and we don't need a new wrapper for every variation of SEAMCALL. E.g.
u64 pamt_pa_array[MAX_NR_DPAMT_ARGS];
...
bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) && level == PG_LEVEL_2M;
u64 pamt_pa_array[MAX_NR_DPAMT_ARGS];
struct tdx_module_args args = {
.rcx = gpa | pg_level_to_tdx_sept_level(level),
.rdx = tdx_tdr_pa(td),
.r8 = page_to_phys(new_sp),
};
u64 ret;
if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
return TDX_SW_ERROR;
if (dpamt) {
if (alloc_pamt_array(pamt_pa_array, pamt_cache))
return TDX_SW_ERROR;
dpamt_copy_to_regs(&args, r12, pamt_pa_array);
}
Which to me is easier to read and much more intuitive than:
u64 guest_memory_pamt_page[MAX_TDX_ARGS(r12)];
struct tdx_module_array_args args = {
.args.rcx = gpa | pg_level_to_tdx_sept_level(level),
.args.rdx = tdx_tdr_pa(td),
.args.r8 = PFN_PHYS(page_to_pfn(new_sp)),
};
struct tdx_module_array_args retry_args;
int i = 0;
u64 ret;
if (dpamt) {
u64 *args_array = dpamt_args_array_ptr_r12(&args);
if (alloc_pamt_array(guest_memory_pamt_page, pamt_cache))
return TDX_SW_ERROR;
/*
* Copy PAMT page PAs of the guest memory into the struct per the
* TDX ABI
*/
memcpy(args_array, guest_memory_pamt_page,
tdx_dpamt_entry_pages() * sizeof(*args_array));
}
On Wed, 2026-01-28 at 17:19 -0800, Sean Christopherson wrote:
> Honestly, the entire scheme is a mess. Four days of staring at this
> and I finally undertand what the code is doing. The whole "struct
> tdx_module_array_args" union is completely unnecessary, the resulting
> args.args crud is ugly, having a pile of duplicate accessors is
> brittle, the code obfuscates a simple concept, and the end result
> doesn't provide any actual protection since the kernel will happily
> overflow the buffer after the WARN.
The original sin for this, as was spotted by Nikilay in v3, is actually
that it turns out that the whole variable length thing was intended to
give the TDX module flexibility *if* it wanted to increase it in the
future. As in it's not required today. Worse, whether it would actually
grow in the specific way the code assumes is not covered in the spec.
Apparently it was based on some past internal discussions. So the
agreement on v3 was to just support the fixed two page size in the
spec.
Here was the end of that thread:
https://lore.kernel.org/kvm/da3701ea-08ea-45c9-94a8-355205a45f8e@intel.com/
This simplifies the whole thing, no union, no worse case allocations,
etc. I'm just getting back and going through mail so will check out
your full solution. (Thanks!) But from the below I think the fixed
array size code will be better still.
>
> It's also relying on the developer to correctly copy+paste the same
> register in multiple locations: ~5 depending on how you want to
> count.
>
> static u64 *dpamt_args_array_ptr_r12(struct tdx_module_array_args
> *args)
> #1
> {
> WARN_ON_ONCE(tdx_dpamt_entry_pages() > MAX_TDX_ARGS(r12));
> #2
>
> return &args->args_array[TDX_ARG_INDEX(r12)];
> #3
>
>
> u64 guest_memory_pamt_page[MAX_TDX_ARGS(r12)];
> #4
>
>
> u64 *args_array = dpamt_args_array_ptr_r12(&args);
> #5
Yea it could probably use another DEFINE or two to make it less error
prone. Vanilla DPAMT has 4 instances of rdx.
>
> After all of that boilerplate, the caller _still_ has to do the
> actual memcpy(), and for me at least, all of the above makes it
> _harder_ to understand what the code is doing.
>
> Drop the struct+union overlay and just provide a helper with wrappers
> to copy to/from a tdx_module_args structure. It's far from
> bulletproof, but it at least avoids an immediate buffer overflow, and
> defers to the kernel owner with respect to handling uninitialized
> stack data.
>
> /*
> * For SEAMCALLs that pass a bundle of pages, the TDX spec treats the
> registers
> * like an array, as they are ordered in the struct. The effective
> array size
> * is (obviously) limited by the number or registers, relative to the
> starting
> * register. Fill the register array at a given starting register,
> with sanity
> * checks to avoid overflowing the args structure.
> */
> static void dpamt_copy_regs_array(struct tdx_module_args *args, void
> *reg,
> u64 *pamt_pa_array, bool
> copy_to_regs)
> {
> int size = tdx_dpamt_entry_pages() * sizeof(*pamt_pa_array);
>
> if (WARN_ON_ONCE(reg + size > (void *)args) + sizeof(*args))
> return;
>
> /* Copy PAMT page PA's to/from the struct per the TDX ABI.
> */
> if (copy_to_regs)
> memcpy(reg, pamt_pa_array, size);
> else
> memcpy(pamt_pa_array, reg, size);
> }
>
> #define dpamt_copy_from_regs(dst, args, reg) \
> dpamt_copy_regs_array(args, &(args)->reg, dst, false)
>
> #define dpamt_copy_to_regs(args, reg, src) \
> dpamt_copy_regs_array(args, &(args)->reg, src, true)
>
> As far as the on-stack allocations go, why bother being precise?
> Except for paranoid setups which explicitly initialize the stack,
> "allocating" ~48 unused bytes is literally free. Not to mention the
> cost relative to the latency of a SEAMCALL is in the noise.
>
> /*
> * When declaring PAMT arrays on the stack, use the maximum
> theoretical number
> * of entries that can be squeezed into a SEAMCALL, as stack
> allocations are
> * practically free, i.e. any wasted space is a non-issue.
> */
> #define MAX_NR_DPAMT_ARGS (sizeof(struct tdx_module_args) /
> sizeof(u64))
>
>
> With that, callers don't have to regurgitate the same register
> multiple times, and we don't need a new wrapper for every variation
> of SEAMCALL.
> E.g.
>
>
> u64 pamt_pa_array[MAX_NR_DPAMT_ARGS];
>
> ...
>
> bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) &&
> level == PG_LEVEL_2M;
> u64 pamt_pa_array[MAX_NR_DPAMT_ARGS];
> struct tdx_module_args args = {
> .rcx = gpa | pg_level_to_tdx_sept_level(level),
> .rdx = tdx_tdr_pa(td),
> .r8 = page_to_phys(new_sp),
> };
> u64 ret;
>
> if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
> return TDX_SW_ERROR;
>
> if (dpamt) {
> if (alloc_pamt_array(pamt_pa_array, pamt_cache))
> return TDX_SW_ERROR;
>
> dpamt_copy_to_regs(&args, r12, pamt_pa_array);
> }
>
> Which to me is easier to read and much more intuitive than:
>
>
> u64 guest_memory_pamt_page[MAX_TDX_ARGS(r12)];
> struct tdx_module_array_args args = {
> .args.rcx = gpa | pg_level_to_tdx_sept_level(level),
> .args.rdx = tdx_tdr_pa(td),
> .args.r8 = PFN_PHYS(page_to_pfn(new_sp)),
> };
> struct tdx_module_array_args retry_args;
> int i = 0;
> u64 ret;
>
> if (dpamt) {
> u64 *args_array = dpamt_args_array_ptr_r12(&args);
>
> if (alloc_pamt_array(guest_memory_pamt_page,
> pamt_cache))
> return TDX_SW_ERROR;
>
> /*
> * Copy PAMT page PAs of the guest memory into the
> struct per the
> * TDX ABI
> */
> memcpy(args_array, guest_memory_pamt_page,
> tdx_dpamt_entry_pages() *
> sizeof(*args_array));
> }
What you have here is close to what I had done when I first took this
series. But it ran afoul of FORTIFY_SOUCE and required some horrible
casting to trick it. I wonder if this code will hit that issue too.
Dave didn't like the solution and suggested the union actually:
https://lore.kernel.org/kvm/355ad607-52ed-42cc-9a48-63aaa49f4c68@intel.com/#t
I'm aware of your tendency to dislike union based solutions. But since
this was purely contained to tip, I went with Dave's preference.
But I think it's all moot because the fixed size-2 solution doesn't
need union or array copying. They can be just normal tdx_module_args
args.
On Thu, Jan 29, 2026, Rick P Edgecombe wrote: > On Wed, 2026-01-28 at 17:19 -0800, Sean Christopherson wrote: > > Honestly, the entire scheme is a mess. Four days of staring at this > > and I finally undertand what the code is doing. The whole "struct > > tdx_module_array_args" union is completely unnecessary, the resulting > > args.args crud is ugly, having a pile of duplicate accessors is > > brittle, the code obfuscates a simple concept, and the end result > > doesn't provide any actual protection since the kernel will happily > > overflow the buffer after the WARN. > > The original sin for this, as was spotted by Nikilay in v3, is actually > that it turns out that the whole variable length thing was intended to > give the TDX module flexibility *if* it wanted to increase it in the > future. As in it's not required today. Worse, whether it would actually > grow in the specific way the code assumes is not covered in the spec. > Apparently it was based on some past internal discussions. So the > agreement on v3 was to just support the fixed two page size in the > spec. Heh, I was _this_ close to suggesting we add a compile-time assert on the incoming size of the array (and effective regs array), with a constant max size supported by the kernel. It wouldn't eliminate the array shenanigans, but it would let us make them more or less bombproof. > Yea it could probably use another DEFINE or two to make it less error > prone. Vanilla DPAMT has 4 instances of rdx. For me, it's not just a syntax problem. It's the approach of getting a pointer to the middle of structure and then doing a memcpy() at a later point in time. More below. > What you have here is close to what I had done when I first took this > series. But it ran afoul of FORTIFY_SOUCE and required some horrible > casting to trick it. I wonder if this code will hit that issue too. AFAICT, FORTIFY_SOURCE doesn't complain. > Dave didn't like the solution and suggested the union actually: > https://lore.kernel.org/kvm/355ad607-52ed-42cc-9a48-63aaa49f4c68@intel.com/#t What you proposed is fundamentally quite different than what I'm proposing. I'm not complaining about the union, I'm complaining about providing a helper to grab a pointer to the middle of a struct and then open coding memcpy() calls using that pointer. I find that _extremely_ difficult to grok, because it does a poor job of capturing the intent (copy these values to this sequence of registers). The decouple pointer+memcpy() approach also bleeds gory details about how PAMT pages are passed in multiple args throughout all SEAMCALL APIs that have such args. I want the APIs to not have to care about the underlying mechanics of how PAMT pages are copied from an array to registers, e.g. so that readers of the code can focus on the semantics of the SEAMCALL and the pieces of logic that are unique to each SEAMCALL. In other words, I see the necessity for a union as being a symptom of the flawed approach. > I'm aware of your tendency to dislike union based solutions. But since > this was purely contained to tip, I went with Dave's preference.
On Thu, 2026-01-29 at 11:09 -0800, Sean Christopherson wrote: > What you proposed is fundamentally quite different than what I'm > proposing. I'm not complaining about the union, I'm complaining > about providing a helper to grab a pointer to the middle of a struct > and then open coding memcpy() calls using that pointer. I find that > _extremely_ difficult to grok, because it does a poor job of > capturing the intent (copy these values to this sequence of > registers). > > The decouple pointer+memcpy() approach also bleeds gory details about > how PAMT pages are passed in multiple args throughout all SEAMCALL > APIs that have such args. I want the APIs to not have to care about > the underlying mechanics of how PAMT pages are copied from an array > to registers, e.g. so that readers of the code can focus on the > semantics of the SEAMCALL and the pieces of logic that are > unique to each SEAMCALL. > > In other words, I see the necessity for a union as being a symptom of > the flawed approach. Ah ok, fair enough.
© 2016 - 2026 Red Hat, Inc.