[PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()

Yan Zhao posted 24 patches 1 month ago
[PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
Posted by Yan Zhao 1 month ago
From: Xiaoyao Li <xiaoyao.li@intel.com>

Introduce SEAMCALL wrapper tdh_mem_page_demote() to invoke
TDH_MEM_PAGE_DEMOTE, which splits a 2MB or a 1GB mapping in S-EPT into
512 4KB or 2MB mappings respectively.

SEAMCALL TDH_MEM_PAGE_DEMOTE walks the S-EPT to locate the huge mapping to
split and add a new S-EPT page to hold the 512 smaller mappings.

Parameters "gpa" and "level" specify the huge mapping to split, and
parameter "new_sept_page" specifies the 4KB page to be added as the S-EPT
page. Invoke tdx_clflush_page() before adding the new S-EPT page
conservatively to prevent dirty cache lines from writing back later and
corrupting TD memory.

tdh_mem_page_demote() may fail, e.g., due to S-EPT walk error. Callers must
check function return value and can retrieve the extended error info from
the output parameters "ext_err1", and "ext_err2".

The TDX module has many internal locks. To avoid staying in SEAM mode for
too long, SEAMCALLs return a BUSY error code to the kernel instead of
spinning on the locks. Depending on the specific SEAMCALL, the caller may
need to handle this error in specific ways (e.g., retry). Therefore, return
the SEAMCALL error code directly to the caller without attempting to handle
it in the core kernel.

Enable tdh_mem_page_demote() only on TDX modules that support feature
TDX_FEATURES0.ENHANCE_DEMOTE_INTERRUPTIBILITY, which does not return error
TDX_INTERRUPTED_RESTARTABLE on basic TDX (i.e., without TD partition) [2].

This is because error TDX_INTERRUPTED_RESTARTABLE is difficult to handle.
The TDX module provides no guaranteed maximum retry count to ensure forward
progress of the demotion. Interrupt storms could then result in a DoS if
host simply retries endlessly for TDX_INTERRUPTED_RESTARTABLE. Disabling
interrupts before invoking the SEAMCALL also doesn't work because NMIs can
also trigger TDX_INTERRUPTED_RESTARTABLE. Therefore, the tradeoff for basic
TDX is to disable the TDX_INTERRUPTED_RESTARTABLE error given the
reasonable execution time for demotion. [1]

Link: https://lore.kernel.org/kvm/99f5585d759328db973403be0713f68e492b492a.camel@intel.com [1]
Link: https://lore.kernel.org/all/fbf04b09f13bc2ce004ac97ee9c1f2c965f44fdf.camel@intel.com [2]
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Use a var name that clearly tell that the page is used as a page table
  page. (Binbin).
- Check if TDX module supports feature ENHANCE_DEMOTE_INTERRUPTIBILITY.
  (Kai).

RFC v2:
- Refine the patch log (Rick).
- Do not handle TDX_INTERRUPTED_RESTARTABLE as the new TDX modules in
  planning do not check interrupts for basic TDX.

RFC v1:
- Rebased and split patch. Updated patch log.
---
 arch/x86/include/asm/tdx.h  |  8 ++++++++
 arch/x86/virt/vmx/tdx/tdx.c | 24 ++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 3 files changed, 33 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f92850789193..d1891e099d42 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -15,6 +15,7 @@
 /* Bit definitions of TDX_FEATURES0 metadata field */
 #define TDX_FEATURES0_NO_RBP_MOD		BIT_ULL(18)
 #define TDX_FEATURES0_DYNAMIC_PAMT		BIT_ULL(36)
+#define TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY	BIT_ULL(51)
 
 #ifndef __ASSEMBLER__
 
@@ -140,6 +141,11 @@ static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
 	return sysinfo->features.tdx_features0 & TDX_FEATURES0_DYNAMIC_PAMT;
 }
 
+static inline bool tdx_supports_demote_nointerrupt(const struct tdx_sys_info *sysinfo)
+{
+	return sysinfo->features.tdx_features0 & TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY;
+}
+
 void tdx_quirk_reset_page(struct page *page);
 
 int tdx_guest_keyid_alloc(void);
@@ -242,6 +248,8 @@ u64 tdh_mng_key_config(struct tdx_td *td);
 u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
 u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
 u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
+			u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_finalize(struct tdx_td *td);
 u64 tdh_vp_flush(struct tdx_vp *vp);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 41ce18619ffc..c3f4457816c8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1837,6 +1837,30 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
 }
 EXPORT_SYMBOL_GPL(tdh_mng_rd);
 
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
+			u64 *ext_err1, u64 *ext_err2)
+{
+	struct tdx_module_args args = {
+		.rcx = gpa | level,
+		.rdx = tdx_tdr_pa(td),
+		.r8 = page_to_phys(new_sept_page),
+	};
+	u64 ret;
+
+	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
+		return TDX_SW_ERROR;
+
+	/* Flush the new S-EPT page to be added */
+	tdx_clflush_page(new_sept_page);
+	ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
+
+	*ext_err1 = args.rcx;
+	*ext_err2 = args.rdx;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mem_page_demote);
+
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 096c78a1d438..a6c0fa53ece9 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -24,6 +24,7 @@
 #define TDH_MNG_KEY_CONFIG		8
 #define TDH_MNG_CREATE			9
 #define TDH_MNG_RD			11
+#define TDH_MEM_PAGE_DEMOTE		15
 #define TDH_MR_EXTEND			16
 #define TDH_MR_FINALIZE			17
 #define TDH_VP_FLUSH			18
-- 
2.43.2
Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
Posted by Sean Christopherson 1 week, 3 days ago
On Tue, Jan 06, 2026, Yan Zhao wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
> 
> Introduce SEAMCALL wrapper tdh_mem_page_demote() to invoke
> TDH_MEM_PAGE_DEMOTE, which splits a 2MB or a 1GB mapping in S-EPT into
> 512 4KB or 2MB mappings respectively.
> 
> SEAMCALL TDH_MEM_PAGE_DEMOTE walks the S-EPT to locate the huge mapping to
> split and add a new S-EPT page to hold the 512 smaller mappings.
> 
> Parameters "gpa" and "level" specify the huge mapping to split, and
> parameter "new_sept_page" specifies the 4KB page to be added as the S-EPT
> page. Invoke tdx_clflush_page() before adding the new S-EPT page
> conservatively to prevent dirty cache lines from writing back later and
> corrupting TD memory.
> 
> tdh_mem_page_demote() may fail, e.g., due to S-EPT walk error. Callers must
> check function return value and can retrieve the extended error info from
> the output parameters "ext_err1", and "ext_err2".
> 
> The TDX module has many internal locks. To avoid staying in SEAM mode for
> too long, SEAMCALLs return a BUSY error code to the kernel instead of
> spinning on the locks. Depending on the specific SEAMCALL, the caller may
> need to handle this error in specific ways (e.g., retry). Therefore, return
> the SEAMCALL error code directly to the caller without attempting to handle
> it in the core kernel.
> 
> Enable tdh_mem_page_demote() only on TDX modules that support feature
> TDX_FEATURES0.ENHANCE_DEMOTE_INTERRUPTIBILITY, which does not return error
> TDX_INTERRUPTED_RESTARTABLE on basic TDX (i.e., without TD partition) [2].
> 
> This is because error TDX_INTERRUPTED_RESTARTABLE is difficult to handle.
> The TDX module provides no guaranteed maximum retry count to ensure forward
> progress of the demotion. Interrupt storms could then result in a DoS if
> host simply retries endlessly for TDX_INTERRUPTED_RESTARTABLE. Disabling
> interrupts before invoking the SEAMCALL also doesn't work because NMIs can
> also trigger TDX_INTERRUPTED_RESTARTABLE. Therefore, the tradeoff for basic
> TDX is to disable the TDX_INTERRUPTED_RESTARTABLE error given the
> reasonable execution time for demotion. [1]
> 
> Link: https://lore.kernel.org/kvm/99f5585d759328db973403be0713f68e492b492a.camel@intel.com [1]
> Link: https://lore.kernel.org/all/fbf04b09f13bc2ce004ac97ee9c1f2c965f44fdf.camel@intel.com [2]
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---

This is ridiculous.  The DEMOTE API is spread across three patches:

  Add SEAMCALL wrapper tdh_mem_page_demote()
  Add/Remove DPAMT pages for guest private memory to demote
  Pass guest memory's PFN info to demote for updating pamt_refcount

with significant changes between the "add wrapper" and when the API is actually
usable.  Even worse, it's wired up in KVM before it's finalized, and so those
changes mean touching KVM code.

And to top things off, "Add/Remove DPAMT pages for guest private memory to demote"
includes a non-trivial refactoring and code movement of MAX_TDX_ARG_SIZE() and
dpamt_args_array_ptr().

This borderline unreviewable.  It took me literally more than a day to wrap my
head around what actually needs to happen for DEMOTE, what patch was doing what,
at what point in the series DEMOTE actually became usable, etc.

I get that y'all are juggling multiple intertwined series, but spraying them all
at upstream without any apparent rhyme or reason does not work.  Figure out
priorities, pick an ordering, and make it happen.

In the end, this fits nicely into *one* patch, with significantly fewer lines
changed overall.

E.g.

---
 arch/x86/include/asm/tdx.h  |  9 ++++++
 arch/x86/virt/vmx/tdx/tdx.c | 58 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 3 files changed, 68 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 50feea01b066..483441de7fe0 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -15,6 +15,7 @@
 /* Bit definitions of TDX_FEATURES0 metadata field */
 #define TDX_FEATURES0_NO_RBP_MOD		BIT_ULL(18)
 #define TDX_FEATURES0_DYNAMIC_PAMT		BIT_ULL(36)
+#define TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY	BIT_ULL(51)
 
 #ifndef __ASSEMBLER__
 
@@ -140,6 +141,11 @@ static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
 	return sysinfo->features.tdx_features0 & TDX_FEATURES0_DYNAMIC_PAMT;
 }
 
+static inline bool tdx_supports_demote_nointerrupt(const struct tdx_sys_info *sysinfo)
+{
+	return sysinfo->features.tdx_features0 & TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY;
+}
+
 /* Simple structure for pre-allocating Dynamic PAMT pages outside of locks. */
 struct tdx_pamt_cache {
 	struct list_head page_list;
@@ -240,6 +246,9 @@ u64 tdh_mng_key_config(struct tdx_td *td);
 u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
 u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
 u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, enum pg_level level, u64 pfn,
+			struct page *new_sp, struct tdx_pamt_cache *pamt_cache,
+			u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_finalize(struct tdx_td *td);
 u64 tdh_vp_flush(struct tdx_vp *vp);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 6a2871e83761..97016b3e26b8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1874,6 +1874,64 @@ static u64 *dpamt_args_array_ptr_r12(struct tdx_module_array_args *args)
 	return &args->args_array[TDX_ARG_INDEX(r12)];
 }
 
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, enum pg_level level, u64 pfn,
+			struct page *new_sp, struct tdx_pamt_cache *pamt_cache,
+			u64 *ext_err1, u64 *ext_err2)
+{
+	bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) && level == PG_LEVEL_2M;
+	u64 guest_memory_pamt_page[MAX_TDX_ARGS(r12)];
+	struct tdx_module_array_args args = {
+		.args.rcx = gpa | pg_level_to_tdx_sept_level(level),
+		.args.rdx = tdx_tdr_pa(td),
+		.args.r8 = page_to_phys(new_sp),
+	};
+	u64 ret;
+
+	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
+		return TDX_SW_ERROR;
+
+	if (dpamt) {
+		u64 *args_array = dpamt_args_array_ptr_r12(&args);
+
+		if (alloc_pamt_array(guest_memory_pamt_page, pamt_cache))
+			return TDX_SW_ERROR;
+
+		/*
+		 * Copy PAMT page PAs of the guest memory into the struct per the
+		 * TDX ABI
+		 */
+		memcpy(args_array, guest_memory_pamt_page,
+		       tdx_dpamt_entry_pages() * sizeof(*args_array));
+	}
+
+	/* Flush the new S-EPT page to be added */
+	tdx_clflush_page(new_sp);
+
+	ret = seamcall_saved_ret(TDH_MEM_PAGE_DEMOTE, &args.args);
+
+	*ext_err1 = args.args.rcx;
+	*ext_err2 = args.args.rdx;
+
+	if (dpamt) {
+		if (ret) {
+			free_pamt_array(guest_memory_pamt_page);
+		} else {
+			/*
+			 * Set the PAMT refcount for the guest private memory,
+			 * i.e. for the hugepage that was just demoted to 512
+			 * smaller pages.
+			 */
+			atomic_t *pamt_refcount;
+
+			pamt_refcount = tdx_find_pamt_refcount(pfn);
+			WARN_ON_ONCE(atomic_cmpxchg_release(pamt_refcount, 0,
+							    PTRS_PER_PMD));
+		}
+	}
+	return ret;
+}
+EXPORT_SYMBOL_FOR_KVM(tdh_mem_page_demote);
+
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 096c78a1d438..a6c0fa53ece9 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -24,6 +24,7 @@
 #define TDH_MNG_KEY_CONFIG		8
 #define TDH_MNG_CREATE			9
 #define TDH_MNG_RD			11
+#define TDH_MEM_PAGE_DEMOTE		15
 #define TDH_MR_EXTEND			16
 #define TDH_MR_FINALIZE			17
 #define TDH_VP_FLUSH			18

base-commit: 0f969bc3e7a9aa441122ad51bc2ff220a200b88e
--
Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
Posted by Huang, Kai 3 weeks, 2 days ago
On Tue, 2026-01-06 at 18:18 +0800, Yan Zhao wrote:
>  /* Bit definitions of TDX_FEATURES0 metadata field */
>  #define TDX_FEATURES0_NO_RBP_MOD		BIT_ULL(18)
>  #define TDX_FEATURES0_DYNAMIC_PAMT		BIT_ULL(36)
> +#define TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY	BIT_ULL(51)

Nit: the spec uses "ENHANCED" but not "ENHANCE", so perhaps change to
TDX_FEATURES0_ENHANCED_DEMOTE_INTERRUPTIBILITY ?
Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
Posted by Yan Zhao 2 weeks, 6 days ago
On Fri, Jan 16, 2026 at 07:22:20PM +0800, Huang, Kai wrote:
> On Tue, 2026-01-06 at 18:18 +0800, Yan Zhao wrote:
> >  /* Bit definitions of TDX_FEATURES0 metadata field */
> >  #define TDX_FEATURES0_NO_RBP_MOD		BIT_ULL(18)
> >  #define TDX_FEATURES0_DYNAMIC_PAMT		BIT_ULL(36)
> > +#define TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY	BIT_ULL(51)
> 
> Nit: the spec uses "ENHANCED" but not "ENHANCE", so perhaps change to
> TDX_FEATURES0_ENHANCED_DEMOTE_INTERRUPTIBILITY ?
Good catch. Will update the name. Thanks!
Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
Posted by Huang, Kai 3 weeks, 2 days ago
> 
> Enable tdh_mem_page_demote() only on TDX modules that support feature
> TDX_FEATURES0.ENHANCE_DEMOTE_INTERRUPTIBILITY, which does not return error
> TDX_INTERRUPTED_RESTARTABLE on basic TDX (i.e., without TD partition) [2].
> 
> This is because error TDX_INTERRUPTED_RESTARTABLE is difficult to handle.
> The TDX module provides no guaranteed maximum retry count to ensure forward
> progress of the demotion. Interrupt storms could then result in a DoS if
> host simply retries endlessly for TDX_INTERRUPTED_RESTARTABLE. Disabling
> interrupts before invoking the SEAMCALL also doesn't work because NMIs can
> also trigger TDX_INTERRUPTED_RESTARTABLE. Therefore, the tradeoff for basic
> TDX is to disable the TDX_INTERRUPTED_RESTARTABLE error given the
> reasonable execution time for demotion. [1]
> 

[...]

> v3:
> - Use a var name that clearly tell that the page is used as a page table
>   page. (Binbin).
> - Check if TDX module supports feature ENHANCE_DEMOTE_INTERRUPTIBILITY.
>   (Kai).
> 
[...]

> +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
> +			u64 *ext_err1, u64 *ext_err2)
> +{
> +	struct tdx_module_args args = {
> +		.rcx = gpa | level,
> +		.rdx = tdx_tdr_pa(td),
> +		.r8 = page_to_phys(new_sept_page),
> +	};
> +	u64 ret;
> +
> +	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
> +		return TDX_SW_ERROR;
> 

For the record, while I replied my suggestion [*] to this patch in v2, it
was basically because the discussion was already in that patch -- I didn't
mean to do this check inside tdh_mem_page_demote(), but do this check in
KVM page fault patch and return 4K as maximum mapping level.

The precise words were:

  So if the decision is to not use 2M page when TDH_MEM_PAGE_DEMOTE can 
  return TDX_INTERRUPTED_RESTARTABLE, maybe we can just check this 
  enumeration in fault handler and always make mapping level as 4K?

Looking at this series, this is eventually done in your last patch.  But I
don't quite understand what's the additional value of doing such check and
return TDX_SW_ERROR in this SEAMCALL wrapper.

Currently in this series, it doesn't matter whether this wrapper returns
TDX_SW_ERROR or the real TDX_INTERRUPTED_RESTARTABLE -- KVM terminates the
TD anyway (see your patch 8) because this is unexpected as checked in your
last patch.

IMHO we should get rid of this check in this low level wrapper.

[*]:
https://lore.kernel.org/all/fbf04b09f13bc2ce004ac97ee9c1f2c965f44fdf.camel@intel.com/#t
Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
Posted by Yan Zhao 3 weeks, 2 days ago
Hi Kai,
Thanks for reviewing!

On Fri, Jan 16, 2026 at 09:00:29AM +0800, Huang, Kai wrote:
> 
> > 
> > Enable tdh_mem_page_demote() only on TDX modules that support feature
> > TDX_FEATURES0.ENHANCE_DEMOTE_INTERRUPTIBILITY, which does not return error
> > TDX_INTERRUPTED_RESTARTABLE on basic TDX (i.e., without TD partition) [2].
> > 
> > This is because error TDX_INTERRUPTED_RESTARTABLE is difficult to handle.
> > The TDX module provides no guaranteed maximum retry count to ensure forward
> > progress of the demotion. Interrupt storms could then result in a DoS if
> > host simply retries endlessly for TDX_INTERRUPTED_RESTARTABLE. Disabling
> > interrupts before invoking the SEAMCALL also doesn't work because NMIs can
> > also trigger TDX_INTERRUPTED_RESTARTABLE. Therefore, the tradeoff for basic
> > TDX is to disable the TDX_INTERRUPTED_RESTARTABLE error given the
> > reasonable execution time for demotion. [1]
> > 
> 
> [...]
> 
> > v3:
> > - Use a var name that clearly tell that the page is used as a page table
> >   page. (Binbin).
> > - Check if TDX module supports feature ENHANCE_DEMOTE_INTERRUPTIBILITY.
> >   (Kai).
> > 
> [...]
> 
> > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
> > +			u64 *ext_err1, u64 *ext_err2)
> > +{
> > +	struct tdx_module_args args = {
> > +		.rcx = gpa | level,
> > +		.rdx = tdx_tdr_pa(td),
> > +		.r8 = page_to_phys(new_sept_page),
> > +	};
> > +	u64 ret;
> > +
> > +	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
> > +		return TDX_SW_ERROR;
> > 
> 
> For the record, while I replied my suggestion [*] to this patch in v2, it
> was basically because the discussion was already in that patch -- I didn't
> mean to do this check inside tdh_mem_page_demote(), but do this check in
> KVM page fault patch and return 4K as maximum mapping level.
> 
> The precise words were:
> 
>   So if the decision is to not use 2M page when TDH_MEM_PAGE_DEMOTE can 
>   return TDX_INTERRUPTED_RESTARTABLE, maybe we can just check this 
>   enumeration in fault handler and always make mapping level as 4K?
Right. I followed it in the last patch (patch 24).

> Looking at this series, this is eventually done in your last patch.  But I
> don't quite understand what's the additional value of doing such check and
> return TDX_SW_ERROR in this SEAMCALL wrapper.
> 
> Currently in this series, it doesn't matter whether this wrapper returns
> TDX_SW_ERROR or the real TDX_INTERRUPTED_RESTARTABLE -- KVM terminates the
> TD anyway (see your patch 8) because this is unexpected as checked in your
> last patch.
> 
> IMHO we should get rid of this check in this low level wrapper.
You are right, the wrapper shouldn't hit this error after the last patch.

However, I found it's better to introduce the feature bit
TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY and the helper
tdx_supports_demote_nointerrupt() together with the demote SEAMCALL wrapper.
This way, people can understand how the TDX_INTERRUPTED_RESTARTABLE error is
handled for this SEAMCALL. Invoking the helper in this patch also gives the
helper a user :)

What do you think about changing it to a WARN_ON_ONCE()? i.e.,
WARN_ON_ONCE(!tdx_supports_demote_nointerrupt(&tdx_sysinfo));


> [*]:
> https://lore.kernel.org/all/fbf04b09f13bc2ce004ac97ee9c1f2c965f44fdf.camel@intel.com/#t
Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
Posted by Huang, Kai 3 weeks, 2 days ago
On Fri, 2026-01-16 at 16:35 +0800, Yan Zhao wrote:
> Hi Kai,
> Thanks for reviewing!
> 
> On Fri, Jan 16, 2026 at 09:00:29AM +0800, Huang, Kai wrote:
> > 
> > > 
> > > Enable tdh_mem_page_demote() only on TDX modules that support feature
> > > TDX_FEATURES0.ENHANCE_DEMOTE_INTERRUPTIBILITY, which does not return error
> > > TDX_INTERRUPTED_RESTARTABLE on basic TDX (i.e., without TD partition) [2].
> > > 
> > > This is because error TDX_INTERRUPTED_RESTARTABLE is difficult to handle.
> > > The TDX module provides no guaranteed maximum retry count to ensure forward
> > > progress of the demotion. Interrupt storms could then result in a DoS if
> > > host simply retries endlessly for TDX_INTERRUPTED_RESTARTABLE. Disabling
> > > interrupts before invoking the SEAMCALL also doesn't work because NMIs can
> > > also trigger TDX_INTERRUPTED_RESTARTABLE. Therefore, the tradeoff for basic
> > > TDX is to disable the TDX_INTERRUPTED_RESTARTABLE error given the
> > > reasonable execution time for demotion. [1]
> > > 
> > 
> > [...]
> > 
> > > v3:
> > > - Use a var name that clearly tell that the page is used as a page table
> > >   page. (Binbin).
> > > - Check if TDX module supports feature ENHANCE_DEMOTE_INTERRUPTIBILITY.
> > >   (Kai).
> > > 
> > [...]
> > 
> > > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
> > > +			u64 *ext_err1, u64 *ext_err2)
> > > +{
> > > +	struct tdx_module_args args = {
> > > +		.rcx = gpa | level,
> > > +		.rdx = tdx_tdr_pa(td),
> > > +		.r8 = page_to_phys(new_sept_page),
> > > +	};
> > > +	u64 ret;
> > > +
> > > +	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
> > > +		return TDX_SW_ERROR;
> > > 
> > 
> > For the record, while I replied my suggestion [*] to this patch in v2, it
> > was basically because the discussion was already in that patch -- I didn't
> > mean to do this check inside tdh_mem_page_demote(), but do this check in
> > KVM page fault patch and return 4K as maximum mapping level.
> > 
> > The precise words were:
> > 
> >   So if the decision is to not use 2M page when TDH_MEM_PAGE_DEMOTE can 
> >   return TDX_INTERRUPTED_RESTARTABLE, maybe we can just check this 
> >   enumeration in fault handler and always make mapping level as 4K?
> Right. I followed it in the last patch (patch 24).
> 
> > Looking at this series, this is eventually done in your last patch.  But I
> > don't quite understand what's the additional value of doing such check and
> > return TDX_SW_ERROR in this SEAMCALL wrapper.
> > 
> > Currently in this series, it doesn't matter whether this wrapper returns
> > TDX_SW_ERROR or the real TDX_INTERRUPTED_RESTARTABLE -- KVM terminates the
> > TD anyway (see your patch 8) because this is unexpected as checked in your
> > last patch.
> > 
> > IMHO we should get rid of this check in this low level wrapper.
> You are right, the wrapper shouldn't hit this error after the last patch.
> 
> However, I found it's better to introduce the feature bit
> TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY and the helper
> tdx_supports_demote_nointerrupt() together with the demote SEAMCALL wrapper.
> This way, people can understand how the TDX_INTERRUPTED_RESTARTABLE error is
> handled for this SEAMCALL. 
> 

So the "handling" here is basically making DEMOTE SEAMCALL unavailable
when DEMOTE is interruptible at low SEAMCALL wrapper level.

I guess you can argue this has some value since it tells users "don't even
try to call me when I am interruptible because I am not available".  

However, IMHO this also implies the benefit is mostly for the case where
the user wants to use this wrapper to tell whether DEMOTE is available. 
E.g.,

	err = tdh_mem_page_demote(...);
	if (err == TDX_SW_ERROR)
		enable_tdx_hugepage = false;

But in this series you are using tdx_supports_demote_nointerrupt() for
this purpose, which is better IMHO.

So maybe there's a *theoretical* value to have the check here, but I don't
see any *real* value.

But I don't have strong opinion either -- I guess I just don't like making
these low level SEAMCALL wrappers more complicated than what the SEAMCALL
does -- and it's up to you to decide. :-)

> 
> What do you think about changing it to a WARN_ON_ONCE()? i.e.,
> WARN_ON_ONCE(!tdx_supports_demote_nointerrupt(&tdx_sysinfo));

What's your intention?

W/o the WARN(), the caller _can_ call this wrapper (i.e., not a kernel
bug) but it always get a SW-defined error.  Again, maybe it has value for
the case where the caller wants to use this to tell whether DEMOTE is
available.

With the WARN(), it's a kernel bug to call the wrapper, and the caller
needs to use other way (i.e., tdx_supports_demote_nointerrupt()) to tell
whether DEMOTE is available.

So if you want the check, probably WARN() is a better idea since I suppose
we always want users to use tdx_supports_demote_nointerrupt() to know
whether DEMOTE can be done, and the WARN() is just to catch bug.
Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
Posted by Yan Zhao 2 weeks, 6 days ago
On Fri, Jan 16, 2026 at 07:10:05PM +0800, Huang, Kai wrote:
> On Fri, 2026-01-16 at 16:35 +0800, Yan Zhao wrote:
> > Hi Kai,
> > Thanks for reviewing!
> > 
> > On Fri, Jan 16, 2026 at 09:00:29AM +0800, Huang, Kai wrote:
> > > 
> > > > 
> > > > Enable tdh_mem_page_demote() only on TDX modules that support feature
> > > > TDX_FEATURES0.ENHANCE_DEMOTE_INTERRUPTIBILITY, which does not return error
> > > > TDX_INTERRUPTED_RESTARTABLE on basic TDX (i.e., without TD partition) [2].
> > > > 
> > > > This is because error TDX_INTERRUPTED_RESTARTABLE is difficult to handle.
> > > > The TDX module provides no guaranteed maximum retry count to ensure forward
> > > > progress of the demotion. Interrupt storms could then result in a DoS if
> > > > host simply retries endlessly for TDX_INTERRUPTED_RESTARTABLE. Disabling
> > > > interrupts before invoking the SEAMCALL also doesn't work because NMIs can
> > > > also trigger TDX_INTERRUPTED_RESTARTABLE. Therefore, the tradeoff for basic
> > > > TDX is to disable the TDX_INTERRUPTED_RESTARTABLE error given the
> > > > reasonable execution time for demotion. [1]
> > > > 
> > > 
> > > [...]
> > > 
> > > > v3:
> > > > - Use a var name that clearly tell that the page is used as a page table
> > > >   page. (Binbin).
> > > > - Check if TDX module supports feature ENHANCE_DEMOTE_INTERRUPTIBILITY.
> > > >   (Kai).
> > > > 
> > > [...]
> > > 
> > > > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
> > > > +			u64 *ext_err1, u64 *ext_err2)
> > > > +{
> > > > +	struct tdx_module_args args = {
> > > > +		.rcx = gpa | level,
> > > > +		.rdx = tdx_tdr_pa(td),
> > > > +		.r8 = page_to_phys(new_sept_page),
> > > > +	};
> > > > +	u64 ret;
> > > > +
> > > > +	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
> > > > +		return TDX_SW_ERROR;
> > > > 
> > > 
> > > For the record, while I replied my suggestion [*] to this patch in v2, it
> > > was basically because the discussion was already in that patch -- I didn't
> > > mean to do this check inside tdh_mem_page_demote(), but do this check in
> > > KVM page fault patch and return 4K as maximum mapping level.
> > > 
> > > The precise words were:
> > > 
> > >   So if the decision is to not use 2M page when TDH_MEM_PAGE_DEMOTE can 
> > >   return TDX_INTERRUPTED_RESTARTABLE, maybe we can just check this 
> > >   enumeration in fault handler and always make mapping level as 4K?
> > Right. I followed it in the last patch (patch 24).
> > 
> > > Looking at this series, this is eventually done in your last patch.  But I
> > > don't quite understand what's the additional value of doing such check and
> > > return TDX_SW_ERROR in this SEAMCALL wrapper.
> > > 
> > > Currently in this series, it doesn't matter whether this wrapper returns
> > > TDX_SW_ERROR or the real TDX_INTERRUPTED_RESTARTABLE -- KVM terminates the
> > > TD anyway (see your patch 8) because this is unexpected as checked in your
> > > last patch.
> > > 
> > > IMHO we should get rid of this check in this low level wrapper.
> > You are right, the wrapper shouldn't hit this error after the last patch.
> > 
> > However, I found it's better to introduce the feature bit
> > TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY and the helper
> > tdx_supports_demote_nointerrupt() together with the demote SEAMCALL wrapper.
> > This way, people can understand how the TDX_INTERRUPTED_RESTARTABLE error is
> > handled for this SEAMCALL. 
> > 
> 
> So the "handling" here is basically making DEMOTE SEAMCALL unavailable
> when DEMOTE is interruptible at low SEAMCALL wrapper level.
> 
> I guess you can argue this has some value since it tells users "don't even
> try to call me when I am interruptible because I am not available".  
Right. The caller can understand the API usage by examining the code
implementation.

> However, IMHO this also implies the benefit is mostly for the case where
> the user wants to use this wrapper to tell whether DEMOTE is available. 
> E.g.,
> 
> 	err = tdh_mem_page_demote(...);
> 	if (err == TDX_SW_ERROR)
> 		enable_tdx_hugepage = false;
This use case is not valid.
When the caller invokes tdh_mem_page_demote(), it means huge pages have already
been enabled, so turning huge pages off on error from splitting huge pages is
self-contradictory.

> But in this series you are using tdx_supports_demote_nointerrupt() for
> this purpose, which is better IMHO.
> 
> So maybe there's a *theoretical* value to have the check here, but I don't
> see any *real* value.
> 
> But I don't have strong opinion either -- I guess I just don't like making
> these low level SEAMCALL wrappers more complicated than what the SEAMCALL
> does -- and it's up to you to decide. :-)
Thanks. I added the checking in the SEAMCALL wrapper for two reasons:
- Let the callers know what the wrapper is expected to work under. So, the
  caller (e.g., KVM) can turn off huge pages upon detecting an incompatible TDX
  module. And forgetting to turn off huge pages would yield at least a WARNING.

- Give tdx_supports_demote_nointerrupt() a user in this patch which introduces
  the helper.

So, I'll keep the check unless someone has a strong opinion :)

> > What do you think about changing it to a WARN_ON_ONCE()? i.e.,
> > WARN_ON_ONCE(!tdx_supports_demote_nointerrupt(&tdx_sysinfo));
> 
> What's your intention?
Hmm, either TDX_SW_ERROR or WARN_ON_ONCE() is fine with me.

I've asked about it in [1]. Let's wait for the maintainers' reply.

[1] https://lore.kernel.org/all/aW3G6yZuvclYABzP@yzhao56-desk.sh.intel.com

> W/o the WARN(), the caller _can_ call this wrapper (i.e., not a kernel
> bug) but it always get a SW-defined error.  Again, maybe it has value for
> the case where the caller wants to use this to tell whether DEMOTE is
> available.
> 
> With the WARN(), it's a kernel bug to call the wrapper, and the caller
> needs to use other way (i.e., tdx_supports_demote_nointerrupt()) to tell
> whether DEMOTE is available.
> 
> So if you want the check, probably WARN() is a better idea since I suppose
> we always want users to use tdx_supports_demote_nointerrupt() to know
> whether DEMOTE can be done, and the WARN() is just to catch bug.
Agreed.
Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
Posted by Huang, Kai 3 weeks, 2 days ago
On Fri, 2026-01-16 at 11:10 +0000, Huang, Kai wrote:
> W/o the WARN(), the caller _can_ call this wrapper (i.e., not a kernel
> bug) but it always get a SW-defined error.  Again, maybe it has value for
> the case where the caller wants to use this to tell whether DEMOTE is
> available.
> 
> With the WARN(), it's a kernel bug to call the wrapper, and the caller
> needs to use other way (i.e., tdx_supports_demote_nointerrupt()) to tell
> whether DEMOTE is available.
> 
> So if you want the check, probably WARN() is a better idea since I suppose
> we always want users to use tdx_supports_demote_nointerrupt() to know
> whether DEMOTE can be done, and the WARN() is just to catch bug.

Forgot to say, the name tdx_supports_demote_nointerrupt() somehow only
tells the TDX module *supports* non-interruptible DEMOTE, it doesn't tell
whether TDX module has *enabled* that.

So while we know for this DEMOTE case, there's no need to *enable* this
feature (i.e., DEMOTE is non-interruptible when this feature is reported
as *supported*), from kernel's point of view, is it better to just use a
clearer name?

E.g., tdx_huge_page_demote_uninterruptible()?

A bonus is the name contains "huge_page" so it's super clear what's the
demote about.
Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
Posted by Yan Zhao 2 weeks, 6 days ago
On Fri, Jan 16, 2026 at 07:22:33PM +0800, Huang, Kai wrote:
> On Fri, 2026-01-16 at 11:10 +0000, Huang, Kai wrote:
> > W/o the WARN(), the caller _can_ call this wrapper (i.e., not a kernel
> > bug) but it always get a SW-defined error.  Again, maybe it has value for
> > the case where the caller wants to use this to tell whether DEMOTE is
> > available.
> > 
> > With the WARN(), it's a kernel bug to call the wrapper, and the caller
> > needs to use other way (i.e., tdx_supports_demote_nointerrupt()) to tell
> > whether DEMOTE is available.
> > 
> > So if you want the check, probably WARN() is a better idea since I suppose
> > we always want users to use tdx_supports_demote_nointerrupt() to know
> > whether DEMOTE can be done, and the WARN() is just to catch bug.
> 
> Forgot to say, the name tdx_supports_demote_nointerrupt() somehow only
> tells the TDX module *supports* non-interruptible DEMOTE, it doesn't tell
> whether TDX module has *enabled* that.
> 
> So while we know for this DEMOTE case, there's no need to *enable* this
> feature (i.e., DEMOTE is non-interruptible when this feature is reported
> as *supported*), from kernel's point of view, is it better to just use a
> clearer name?
> 
> E.g., tdx_huge_page_demote_uninterruptible()?
> 
> A bonus is the name contains "huge_page" so it's super clear what's the
> demote about.
LGTM. Thanks!