[RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE

Posted by Yan Zhao 7 months, 3 weeks ago
Allow TDX's .private_max_mapping_level hook to return 2MB after the TD is
RUNNABLE, enabling KVM to map TDX private pages at the 2MB level. Remove
TODOs and adjust KVM_BUG_ON()s accordingly.

Note: Instead of placing this patch at the tail of the series, it's
positioned here to show the code changes for basic mapping of private huge
pages (i.e., transitioning from non-present to present).

However, since this patch also allows KVM to trigger the merging of small
entries into a huge leaf entry or the splitting of a huge leaf entry into
small entries, errors are expected if any of these operations are triggered
due to the current lack of splitting/merging support.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e23dce59fc72..6b3a8f3e6c9c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1561,10 +1561,6 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct page *page = pfn_to_page(pfn);
 
-	/* TODO: handle large pages. */
-	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
-		return -EINVAL;
-
 	/*
 	 * Because guest_memfd doesn't support page migration with
 	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
@@ -1612,8 +1608,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 err, entry, level_state;
 
-	/* TODO: handle large pages. */
-	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+	if (KVM_BUG_ON(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K, kvm))
 		return -EINVAL;
 
 	if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
@@ -1714,8 +1709,8 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 	gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
 	u64 err, entry, level_state;
 
-	/* For now large page isn't supported yet. */
-	WARN_ON_ONCE(level != PG_LEVEL_4K);
+	/* Before TD runnable, large page is not supported */
+	WARN_ON_ONCE(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K);
 
 	err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
 
@@ -1817,6 +1812,9 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	struct page *page = pfn_to_page(pfn);
 	int ret;
 
+	WARN_ON_ONCE(folio_page_idx(page_folio(page), page) + KVM_PAGES_PER_HPAGE(level) >
+		     folio_nr_pages(page_folio(page)));
+
 	/*
 	 * HKID is released after all private pages have been removed, and set
 	 * before any might be populated. Warn if zapping is attempted when
@@ -3265,7 +3263,7 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
 	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
 		return PG_LEVEL_4K;
 
-	return PG_LEVEL_4K;
+	return PG_LEVEL_2M;
 }
 
 static int tdx_online_cpu(unsigned int cpu)
-- 
2.43.2
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 7 months ago
On Thu, 2025-04-24 at 11:06 +0800, Yan Zhao wrote:
> Allow TDX's .private_max_mapping_level hook to return 2MB after the TD is
> RUNNABLE, enabling KVM to map TDX private pages at the 2MB level. Remove
> TODOs and adjust KVM_BUG_ON()s accordingly.
> 
> Note: Instead of placing this patch at the tail of the series, it's
> positioned here to show the code changes for basic mapping of private huge
> pages (i.e., transitioning from non-present to present).
> 
> However, since this patch also allows KVM to trigger the merging of small
> entries into a huge leaf entry or the splitting of a huge leaf entry into
> small entries, errors are expected if any of these operations are triggered
> due to the current lack of splitting/merging support.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index e23dce59fc72..6b3a8f3e6c9c 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1561,10 +1561,6 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>  	struct page *page = pfn_to_page(pfn);
>  
> -	/* TODO: handle large pages. */
> -	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> -		return -EINVAL;
> -
>  	/*
>  	 * Because guest_memfd doesn't support page migration with
>  	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> @@ -1612,8 +1608,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>  	gpa_t gpa = gfn_to_gpa(gfn);
>  	u64 err, entry, level_state;
>  
> -	/* TODO: handle large pages. */
> -	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> +	if (KVM_BUG_ON(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K, kvm))

It's not clear why some of these warnings are here and some are in patch 4.

>  		return -EINVAL;
>  
>  	if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
> @@ -1714,8 +1709,8 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  	gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
>  	u64 err, entry, level_state;
>  
> -	/* For now large page isn't supported yet. */
> -	WARN_ON_ONCE(level != PG_LEVEL_4K);
> +	/* Before TD runnable, large page is not supported */
> +	WARN_ON_ONCE(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K);
>  
>  	err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
>  
> @@ -1817,6 +1812,9 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>  	struct page *page = pfn_to_page(pfn);
>  	int ret;
>  
> +	WARN_ON_ONCE(folio_page_idx(page_folio(page), page) + KVM_PAGES_PER_HPAGE(level) >
> +		     folio_nr_pages(page_folio(page)));
> +
>  	/*
>  	 * HKID is released after all private pages have been removed, and set
>  	 * before any might be populated. Warn if zapping is attempted when
> @@ -3265,7 +3263,7 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
>  	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
>  		return PG_LEVEL_4K;
>  
> -	return PG_LEVEL_4K;
> +	return PG_LEVEL_2M;

Maybe combine this with patch 4, or split them into sensible categories.

>  }
>  
>  static int tdx_online_cpu(unsigned int cpu)

Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 7 months ago
On Wed, May 14, 2025 at 04:10:10AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:06 +0800, Yan Zhao wrote:
> > Allow TDX's .private_max_mapping_level hook to return 2MB after the TD is
> > RUNNABLE, enabling KVM to map TDX private pages at the 2MB level. Remove
> > TODOs and adjust KVM_BUG_ON()s accordingly.
> > 
> > Note: Instead of placing this patch at the tail of the series, it's
> > positioned here to show the code changes for basic mapping of private huge
> > pages (i.e., transitioning from non-present to present).
> > 
> > However, since this patch also allows KVM to trigger the merging of small
> > entries into a huge leaf entry or the splitting of a huge leaf entry into
> > small entries, errors are expected if any of these operations are triggered
> > due to the current lack of splitting/merging support.
> > 
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/kvm/vmx/tdx.c | 16 +++++++---------
> >  1 file changed, 7 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index e23dce59fc72..6b3a8f3e6c9c 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1561,10 +1561,6 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> >  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> >  	struct page *page = pfn_to_page(pfn);
> >  
> > -	/* TODO: handle large pages. */
> > -	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > -		return -EINVAL;
> > -
> >  	/*
> >  	 * Because guest_memfd doesn't support page migration with
> >  	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> > @@ -1612,8 +1608,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> >  	gpa_t gpa = gfn_to_gpa(gfn);
> >  	u64 err, entry, level_state;
> >  
> > -	/* TODO: handle large pages. */
> > -	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > +	if (KVM_BUG_ON(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K, kvm))
> 
> It's not clear why some of these warnings are here and some are in patch 4.
Patch 4 contains only changes for the !TD_STATE_RUNNABLE stage, while this
patch allows huge pages after TD_STATE_RUNNABLE.
So, this patch relaxes the condition that triggers the BUG_ON, i.e.,
before this patch, always bug on level > 4K;
after this patch, only bug on level > 4K before the TD is runnable.

> >  		return -EINVAL;
> >  
> >  	if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
> > @@ -1714,8 +1709,8 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> >  	gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
> >  	u64 err, entry, level_state;
> >  
> > -	/* For now large page isn't supported yet. */
> > -	WARN_ON_ONCE(level != PG_LEVEL_4K);
> > +	/* Before TD runnable, large page is not supported */
> > +	WARN_ON_ONCE(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K);
> >  
> >  	err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
> >  
> > @@ -1817,6 +1812,9 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> >  	struct page *page = pfn_to_page(pfn);
> >  	int ret;
> >  
> > +	WARN_ON_ONCE(folio_page_idx(page_folio(page), page) + KVM_PAGES_PER_HPAGE(level) >
> > +		     folio_nr_pages(page_folio(page)));
> > +
> >  	/*
> >  	 * HKID is released after all private pages have been removed, and set
> >  	 * before any might be populated. Warn if zapping is attempted when
> > @@ -3265,7 +3263,7 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> >  	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
> >  		return PG_LEVEL_4K;
> >  
> > -	return PG_LEVEL_4K;
> > +	return PG_LEVEL_2M;
> 
> Maybe combine this with patch 4, or split them into sensible categories.
Sorry for the confusion.

As explained in the patch msg, the change to return PG_LEVEL_2M actually needs
to be placed at the end of the series, after the patches for page
splitting/merging.

As an initial RFC, it's placed earlier to show the changes to enable basic TDX
huge pages (without splitting/merging).

> >  }
> >  
> >  static int tdx_online_cpu(unsigned int cpu)
>
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Huang, Kai 7 months ago
On Tue, 2025-05-13 at 20:10 +0000, Edgecombe, Rick P wrote:
> > @@ -3265,7 +3263,7 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> >   	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
> >   		return PG_LEVEL_4K;
> >   
> > -	return PG_LEVEL_4K;
> > +	return PG_LEVEL_2M;
> 
> Maybe combine this with patch 4, or split them into sensible categories.

How about merge with patch 12

  [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's 
  ACCEPT level

instead?

Per patch 12, the fault due to TDH.MEM.PAGE.ACCEPT contains fault level info, so
KVM should just return that.  But it seems we are still returning PG_LEVEL_2M if
no such info is provided (IIUC):

int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, 
				       gfn_t gfn)
 {
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
 	if (unlikely(to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE))
 		return PG_LEVEL_4K;
 
+	if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
+		return tdx->violation_request_level;
+
 	return PG_LEVEL_2M;
 }

So why not returning PG_LEVEL_4K at the end?

I am asking because below text mentioned in the coverletter:

    A rare case that could lead to splitting in the fault path is when a TD
    is configured to receive #VE and accesses memory before the ACCEPT
    operation. By the time a vCPU accesses a private GFN, due to the lack
    of any guest preferred level, KVM could create a mapping at 2MB level.
    If the TD then only performs the ACCEPT operation at 4KB level,
    splitting in the fault path will be triggered. However, this is not
    regarded as a typical use case, as usually TD always accepts pages in
    the order from 1GB->2MB->4KB. The worst outcome to ignore the resulting
    splitting request is an endless EPT violation. This would not happen
    for a Linux guest, which does not expect any #VE.

Changing to return PG_LEVEL_4K should avoid this problem.  It doesn't hurt
normal cases either, since guest will always do ACCEPT (which contains the
accepting level) before accessing the memory.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 7 months ago
On Fri, May 16, 2025 at 09:35:37AM +0800, Huang, Kai wrote:
> On Tue, 2025-05-13 at 20:10 +0000, Edgecombe, Rick P wrote:
> > > @@ -3265,7 +3263,7 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> > >   	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
> > >   		return PG_LEVEL_4K;
> > >   
> > > -	return PG_LEVEL_4K;
> > > +	return PG_LEVEL_2M;
> > 
> > Maybe combine this with patch 4, or split them into sensible categories.
> 
> How about merge with patch 12
> 
>   [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's 
>   ACCEPT level
> 
> instead?
> 
> Per patch 12, the fault due to TDH.MEM.PAGE.ACCEPT contains fault level info, so
> KVM should just return that.  But seems we are still returning PG_LEVEL_2M if no
> such info is provided (IIUC):
Yes, without such info (tdx->violation_request_level), we always return
PG_LEVEL_2M.


> int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, 
> 				       gfn_t gfn)
>  {
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
>  	if (unlikely(to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE))
>  		return PG_LEVEL_4K;
>  
> +	if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
> +		return tdx->violation_request_level;
> +
>  	return PG_LEVEL_2M;
>  }
> 
> So why not returning PG_LEVEL_4K at the end?
>
> I am asking because below text mentioned in the coverletter:
> 
>     A rare case that could lead to splitting in the fault path is when a TD
>     is configured to receive #VE and accesses memory before the ACCEPT
>     operation. By the time a vCPU accesses a private GFN, due to the lack
>     of any guest preferred level, KVM could create a mapping at 2MB level.
>     If the TD then only performs the ACCEPT operation at 4KB level,
>     splitting in the fault path will be triggered. However, this is not
>     regarded as a typical use case, as usually TD always accepts pages in
>     the order from 1GB->2MB->4KB. The worst outcome to ignore the resulting
>     splitting request is an endless EPT violation. This would not happen
>     for a Linux guest, which does not expect any #VE.
> 
> Changing to return PG_LEVEL_4K should avoid this problem.  It doesn't hurt
For TDs that expect #VE, the guest accesses private memory before accepting it.
In that case, when KVM receives the EPT violation, there's no expected level
from the TDX module. Returning PG_LEVEL_4K at the end basically disables huge
pages for those TDs.

Besides, according to Kirill [1], the order from 1GB->2MB->4KB is only the case
for linux guests.

[1] https://lore.kernel.org/all/6vdj4mfxlyvypn743klxq5twda66tkugwzljdt275rug2gmwwl@zdziylxpre6y/#t

> normal cases either, since guest will always do ACCEPT (which contains the
> accepting level) before accessing the memory.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Huang, Kai 7 months ago
On Fri, 2025-05-16 at 17:43 +0800, Zhao, Yan Y wrote:
> On Fri, May 16, 2025 at 09:35:37AM +0800, Huang, Kai wrote:
> > On Tue, 2025-05-13 at 20:10 +0000, Edgecombe, Rick P wrote:
> > > > @@ -3265,7 +3263,7 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> > > >   	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
> > > >   		return PG_LEVEL_4K;
> > > >   
> > > > -	return PG_LEVEL_4K;
> > > > +	return PG_LEVEL_2M;
> > > 
> > > Maybe combine this with patch 4, or split them into sensible categories.
> > 
> > How about merge with patch 12
> > 
> >   [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's 
> >   ACCEPT level
> > 
> > instead?
> > 
> > Per patch 12, the fault due to TDH.MEM.PAGE.ACCEPT contains fault level info, so
> > KVM should just return that.  But seems we are still returning PG_LEVEL_2M if no
> > such info is provided (IIUC):
> Yes, if without such info (tdx->violation_request_level), we always return
> PG_LEVEL_2M.
> 
> 
> > int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, 
> > 				       gfn_t gfn)
> >  {
> > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +
> >  	if (unlikely(to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE))
> >  		return PG_LEVEL_4K;
> >  
> > +	if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
> > +		return tdx->violation_request_level;
> > +
> >  	return PG_LEVEL_2M;
> >  }
> > 
> > So why not returning PG_LEVEL_4K at the end?
> > 
> > I am asking because below text mentioned in the coverletter:
> > 
> >     A rare case that could lead to splitting in the fault path is when a TD
> >     is configured to receive #VE and accesses memory before the ACCEPT
> >     operation. By the time a vCPU accesses a private GFN, due to the lack
> >     of any guest preferred level, KVM could create a mapping at 2MB level.
> >     If the TD then only performs the ACCEPT operation at 4KB level,
> >     splitting in the fault path will be triggered. However, this is not
> >     regarded as a typical use case, as usually TD always accepts pages in
> >     the order from 1GB->2MB->4KB. The worst outcome to ignore the resulting
> >     splitting request is an endless EPT violation. This would not happen
> >     for a Linux guest, which does not expect any #VE.
> > 
> > Changing to return PG_LEVEL_4K should avoid this problem.  It doesn't hurt
> For TDs expect #VE, guests access private memory before accept it.
> In that case, upon KVM receives EPT violation, there's no expected level from
> the TDX module. Returning PG_LEVEL_4K at the end basically disables huge pages
> for those TDs.

Just want to make sure I understand correctly:

Linux TDs always ACCEPT memory first before touching that memory, therefore KVM
should always be able to get the accept level for Linux TDs.

In other words, returning PG_LEVEL_4K doesn't impact establishing large page
mapping for Linux TDs.

However, other TDs may choose to touch memory first to receive #VE and then
accept that memory.  Returning PG_LEVEL_2M allows those TDs to use large page
mappings in SEPT.  Otherwise, returning PG_LEVEL_4K essentially disables large
page for them (since we don't support PROMOTE for now?).

But in the above text you mentioned that, if we do so, because we choose to
ignore the splitting request on read, returning 2M could result in an *endless*
EPT violation.

So to me it seems you choose a design that could bring performance gain for
certain non-Linux TDs when they follow a certain behaviour but otherwise could
result in endless EPT violation in KVM.

I am not sure how this is OK?  Or perhaps I have a misunderstanding?

> 
> Besides, according to Kirill [1], the order from 1GB->2MB->4KB is only the case
> for linux guests.
> 
> [1] https://lore.kernel.org/all/6vdj4mfxlyvypn743klxq5twda66tkugwzljdt275rug2gmwwl@zdziylxpre6y/#t

I am not sure how this is related?

On the contrary, if other non-Linux TDs don't follow the 1G->2M->4K accept
order, e.g., they always accept at 4K, there could be an *endless EPT
violation*, if I understand your words correctly.

Isn't this yet another reason we should choose to return PG_LEVEL_4K instead of
2M if no accept level is provided in the fault?

> 
> > normal cases either, since guest will always do ACCEPT (which contains the
> > accepting level) before accessing the memory.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 7 months ago
On Sat, May 17, 2025 at 06:35:57AM +0800, Huang, Kai wrote:
> On Fri, 2025-05-16 at 17:43 +0800, Zhao, Yan Y wrote:
> > On Fri, May 16, 2025 at 09:35:37AM +0800, Huang, Kai wrote:
> > > On Tue, 2025-05-13 at 20:10 +0000, Edgecombe, Rick P wrote:
> > > > > @@ -3265,7 +3263,7 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> > > > >   	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
> > > > >   		return PG_LEVEL_4K;
> > > > >   
> > > > > -	return PG_LEVEL_4K;
> > > > > +	return PG_LEVEL_2M;
> > > > 
> > > > Maybe combine this with patch 4, or split them into sensible categories.
> > > 
> > > How about merge with patch 12
> > > 
> > >   [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's 
> > >   ACCEPT level
> > > 
> > > instead?
> > > 
> > > Per patch 12, the fault due to TDH.MEM.PAGE.ACCPT contains fault level info, so
> > > KVM should just return that.  But seems we are still returning PG_LEVEL_2M if no
> > > such info is provided (IIUC):
> > Yes, if without such info (tdx->violation_request_level), we always return
> > PG_LEVEL_2M.
> > 
> > 
> > > int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, 
> > > 				       gfn_t gfn)
> > >  {
> > > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > > +
> > >  	if (unlikely(to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE))
> > >  		return PG_LEVEL_4K;
> > >  
> > > +	if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
> > > +		return tdx->violation_request_level;
> > > +
> > >  	return PG_LEVEL_2M;
> > >  }
> > > 
> > > So why not returning PG_LEVEL_4K at the end?
> > > 
> > > I am asking because below text mentioned in the coverletter:
> > > 
> > >     A rare case that could lead to splitting in the fault path is when a TD
> > >     is configured to receive #VE and accesses memory before the ACCEPT
> > >     operation. By the time a vCPU accesses a private GFN, due to the lack
> > >     of any guest preferred level, KVM could create a mapping at 2MB level.
> > >     If the TD then only performs the ACCEPT operation at 4KB level,
> > >     splitting in the fault path will be triggered. However, this is not
> > >     regarded as a typical use case, as usually TD always accepts pages in
> > >     the order from 1GB->2MB->4KB. The worst outcome to ignore the resulting
> > >     splitting request is an endless EPT violation. This would not happen
> > >     for a Linux guest, which does not expect any #VE.
> > > 
> > > Changing to return PG_LEVEL_4K should avoid this problem.  It doesn't hurt
> > For TDs expect #VE, guests access private memory before accept it.
> > In that case, upon KVM receives EPT violation, there's no expected level from
> > the TDX module. Returning PG_LEVEL_4K at the end basically disables huge pages
> > for those TDs.
> 
> Just want to make sure I understand correctly:
> 
> Linux TDs always ACCEPT memory first before touching that memory, therefore KVM
> should always be able to get the accept level for Linux TDs.
> 
> In other words, returning PG_LEVEL_4K doesn't impact establishing large page
> mapping for Linux TDs.
>
> However, other TDs may choose to touch memory first to receive #VE and then
> accept that memory.  Returning PG_LEVEL_2M allows those TDs to use large page
> mappings in SEPT.  Otherwise, returning PG_LEVEL_4K essentially disables large
> page for them (since we don't support PROMOTE for now?).
Not only because we don't support PROMOTE.

After KVM maps at 4KB level, if the guest accepts at 2MB level, it would get
a TDACCEPT_SIZE_MISMATCH error.

The case of "KVM maps at 4KB, guest accepts at 2MB" is different from
"KVM maps at 2MB, guest accepts at 4KB".

For the former, the guest can't trigger an endless EPT violation: just consider
when the guest wants to accept at 2MB while KVM can't meet its request. If that
could trigger an endless EPT violation, a Linux guest would already trigger one
with today's basic TDX.
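
To illustrate the accept order under discussion, here is a rough guest-side
sketch. tdx_accept_page() stands in for the guest's TDG.MEM.PAGE.ACCEPT tdcall
and the error constant is only illustrative; this shows the fallback ordering,
not the real guest code.

static bool guest_accept_gpa(u64 gpa)
{
        /* Try the largest size first, falling back on a size mismatch. */
        u64 err = tdx_accept_page(gpa, PG_LEVEL_1G);

        if (err == TDX_PAGE_SIZE_MISMATCH)
                err = tdx_accept_page(gpa, PG_LEVEL_2M);
        if (err == TDX_PAGE_SIZE_MISMATCH)
                err = tdx_accept_page(gpa, PG_LEVEL_4K);

        /*
         * With this order, "KVM maps at 4KB, guest accepts at 2MB" resolves
         * itself via fallback; only "KVM maps at 2MB, guest accepts at 4KB"
         * requires KVM to split the mapping.
         */
        return err == 0;
}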

> But in the above text you mentioned that, if doing so, because we choose to
> ignore splitting request on read, returning 2M could result in *endless* EPT
> violation.
I don't get what you mean.
What's the relationship between splitting and "returning 2M could result in
*endless* EPT" ?

> So to me it seems you choose a design that could bring performance gain for
> certain non-Linux TDs when they follow a certain behaviour but otherwise could
> result in endless EPT violation in KVM.
I also don't understand here.
Which design could result in an endless EPT violation?

> I am not sure how is this OK?  Or probably I have misunderstanding?

> > 
> > Besides, according to Kirill [1], the order from 1GB->2MB->4KB is only the case
> > for linux guests.
> > 
> > [1] https://lore.kernel.org/all/6vdj4mfxlyvypn743klxq5twda66tkugwzljdt275rug2gmwwl@zdziylxpre6y/#t
> 
> I am not sure how is this related?
> 
> On the opposite, if other non-Linux TDs don't follow 1G->2M->4K accept order,
> e.g., they always accept 4K, there could be *endless EPT violation* if I
> understand your words correctly.
> 
> Isn't this yet-another reason we should choose to return PG_LEVEL_4K instead of
> 2M if no accept level is provided in the fault?
As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
A TD's accept operations at sizes > 4KB will get TDACCEPT_SIZE_MISMATCH.

> 
> > 
> > > normal cases either, since guest will always do ACCEPT (which contains the
> > > accepting level) before accessing the memory.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Huang, Kai 6 months, 4 weeks ago
On Mon, 2025-05-19 at 16:32 +0800, Zhao, Yan Y wrote:
> > But in the above text you mentioned that, if doing so, because we choose to
> > ignore splitting request on read, returning 2M could result in *endless* EPT
> > violation.
> I don't get what you mean.
> What's the relationship between splitting and "returning 2M could result in
> *endless* EPT" ?
> 
> > So to me it seems you choose a design that could bring performance gain for
> > certain non-Linux TDs when they follow a certain behaviour but otherwise could
> > result in endless EPT violation in KVM.
> Also don't understand here.
> Which design could result in endless EPT violation?

[Sorry somehow I didn't see your replies yesterday in my mailbox.]

You mentioned below in your coverletter:

    (b) with shared kvm->mmu_lock, triggered by fault.

    ....

    This series simply ignores the splitting request in the fault path to
    avoid unnecessary bounces between levels. The vCPU that performs ACCEPT
    at a lower level would finally figure out the page has been accepted
    at a higher level by another vCPU.

    ... The worst outcome to ignore the resulting
    splitting request is an endless EPT violation. This would not happen
    for a Linux guest, which does not expect any #VE.

So to me, IIUC, this means:

 - this series chooses to ignore the splitting request on read ..
 - the worst outcome of ignoring the resulting splitting request is an endless
   EPT violation..

And this happens exactly in below case:

 1) Guest touches a 4K page
 2) KVM AUGs 2M page
 3) Guest re-accesses that 4K page, and receives #VE
 4) Guest ACCEPTs that 4K page, this triggers EPT violation

IIUC, you choose to ignore splitting the large page in step 4) (am I right???).
Then if the guest always ACCEPTs pages at 4K level, KVM will have an *endless
EPT violation*.

So, is this the "worst outcome to ignore the resulting splitting request" that
you mentioned in your changelog?

If it is, then why is it OK?

It is OK *ONLY* when "guest always ACCEPTs 4K page" is a buggy behaviour of the
guest itself (which KVM is not responsible for).  I.e., the guest is always
supposed to find the page size that KVM has AUGed upon receiving the #VE (does
the #VE contain such information?) and then do ACCEPT at that page level.

Otherwise, if it's a legal behaviour for the guest to always ACCEPT at 4K level,
then I don't think it's OK to have endless EPT violation in KVM.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 6 months, 4 weeks ago
On Wed, May 21, 2025 at 07:34:52AM +0800, Huang, Kai wrote:
> On Mon, 2025-05-19 at 16:32 +0800, Zhao, Yan Y wrote:
> > > But in the above text you mentioned that, if doing so, because we choose to
> > > ignore splitting request on read, returning 2M could result in *endless* EPT
> > > violation.
> > I don't get what you mean.
> > What's the relationship between splitting and "returning 2M could result in
> > *endless* EPT" ?
> > 
> > > So to me it seems you choose a design that could bring performance gain for
> > > certain non-Linux TDs when they follow a certain behaviour but otherwise could
> > > result in endless EPT violation in KVM.
> > Also don't understand here.
> > Which design could result in endless EPT violation?
> 
> [Sorry somehow I didn't see your replies yesterday in my mailbox.]
> 
> You mentioned below in your coverletter:
> 
>     (b) with shared kvm->mmu_lock, triggered by fault.
> 
>     ....
> 
>     This series simply ignores the splitting request in the fault path to
>     avoid unnecessary bounces between levels. The vCPU that performs ACCEPT
>     at a lower level would finally figure out the page has been accepted
>     at a higher level by another vCPU.
> 
>     ... The worst outcome to ignore the resulting
>     splitting request is an endless EPT violation. This would not happen
>     for a Linux guest, which does not expect any #VE.
> 
> So to me, IIUC, this means:
> 
>  - this series chooses to ignore the splitting request on read ..
>  - the worst outcome of ignoring the resulting splitting request is an endless
>    EPT violation..
> 
> And this happens exactly in below case:
> 
>  1) Guest touches a 4K page
>  2) KVM AUGs 2M page
>  3) Guest re-accesses that 4K page, and receives #VE
>  4) Guest ACCEPTs that 4K page, this triggers EPT violation
> 
> IIUC, you choose to ignore splitting large page in step 4) (am I right???). 
> Then if guest always ACCEPTs page at 4K level, then KVM will have *endless EPT
> violation*.
> 
> So, is this the "worst outcome to ignore the resulting splitting request" that
> you mentioned in your changelog?
> 
> If it is, then why is it OK?
Initially I assumed the guest should always accept in the sequence
"1G->2M->4K", as a Linux guest does.

If that were true, we could simply ignore the splitting request in the fault
(shared) path, because it would be the guest that failed to follow the
convention.

However, Kirill and you are right: the guest can accept at 4K.

Given that, the "worst outcome to ignore the resulting splitting request" is not
OK. 

> It is OK *ONLY* when "guest always ACCEPTs 4K page" is a buggy behaviour of the
> guest itself (which KVM is not responsible for).  I.e., the guest is always
> supposed to find the page size that KVM has AUGed upon receiving the #VE (does
> the #VE contain such information?) and then do ACCEPT at that page level.
> 
> Otherwise, if it's a legal behaviour for the guest to always ACCEPT at 4K level,
> then I don't think it's OK to have endless EPT violation in KVM.
We can avoid the endless EPT violation by allowing splitting in the fault
path, though that involves introducing several locks in the TDX code. I had
a POC for that, but we felt that it's better to keep the initial support
simple.

So, if we all agree not to support huge pages for non-Linux TDs as an initial
step, your proposal is a good way to keep the splitting code simple.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 7 months ago
On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K accept
> > order,
> > e.g., they always accept 4K, there could be *endless EPT violation* if I
> > understand your words correctly.
> > 
> > Isn't this yet-another reason we should choose to return PG_LEVEL_4K instead
> > of
> > 2M if no accept level is provided in the fault?
> As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.

TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
docs say the VMM needs to demote *if* the mapping is large and the accept size
is small. But if we map at 4k size for non-accept EPT violations, we won't hit
this case. I also wonder what is preventing the TDX module from handling a 2MB
accept size at 4k mappings. It could be changed maybe.

But I think Kai's question was: why are we complicating the code for the case of
non-Linux TDs that also use #VE for accept? It isn't needed for functionality,
and there aren't any known TDs like that which are expected to use KVM today.
(err, except the MMU stress test). So in another form the question is: should we
optimize KVM for a case we don't even know if anyone will use? The answer seems
obviously no to me.

I think this connects the question of whether we can pass the necessary info
into fault via synthetic error code. Consider this new design:

 - tdx_gmem_private_max_mapping_level() simply returns 4k for prefetch and pre-
runnable, otherwise returns 2MB
 - if fault has accept info 2MB size, pass 2MB size into fault. Otherwise pass
4k (i.e. VMs that are relying on #VE to do the accept won't get huge pages
*yet*).

What goes wrong? Seems simpler and no more stuffing fault info on the vcpu.
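
A minimal sketch of the error-code idea, assuming an invented synthetic bit
(PFERR_ACCEPT_2M_MASK is not an existing PFERR_* flag) and abbreviated
plumbing:

/* Hypothetical synthetic error-code bit, for illustration only. */
#define PFERR_ACCEPT_2M_MASK    BIT_ULL(48)

/*
 * Fold the ACCEPT level carried by the EPT violation (if any) into the
 * synthetic error code instead of stashing it on the vCPU.
 */
static u64 tdx_fault_error_code(u64 error_code, bool has_accept_info,
                                int accept_level)
{
        if (has_accept_info && accept_level == PG_LEVEL_2M)
                error_code |= PFERR_ACCEPT_2M_MASK;

        /*
         * No accept info (e.g. a pending-#VE TD touching memory before
         * ACCEPT): leave the bit clear, so the fault maps at 4K for now.
         */
        return error_code;
}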

Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 6 months, 4 weeks ago
On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K accept
> > > order,
> > > e.g., they always accept 4K, there could be *endless EPT violation* if I
> > > understand your words correctly.
> > > 
> > > Isn't this yet-another reason we should choose to return PG_LEVEL_4K instead
> > > of
> > > 2M if no accept level is provided in the fault?
> > As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> > TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
> 
> TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
> docs say the VMM needs to demote *if* the mapping is large and the accept size
> is small. But if we map at 4k size for non-accept EPT violations, we won't hit
> this case. I also wonder what is preventing the TDX module from handling a 2MB
> accept size at 4k mappings. It could be changed maybe.
> 
> But I think Kai's question was: why are we complicating the code for the case of
> non-Linux TDs that also use #VE for accept? It's not necessary to be functional,
> and there aren't any known TDs like that which are expected to use KVM today.
> (err, except the MMU stress test). So in another form the question is: should we
> optimize KVM for a case we don't even know if anyone will use? The answer seems
> obviously no to me.
So, you want to disallow huge pages for non-Linux TDs, then we have no need
to support splitting in the fault path, right?

I'm OK if we don't care non-Linux TDs for now.
This can simplify the splitting code and we can add the support when there's a
need.

> I think this connects the question of whether we can pass the necessary info
> into fault via synthetic error code. Consider this new design:
> 
>  - tdx_gmem_private_max_mapping_level() simply returns 4k for prefetch and pre-
> runnable, otherwise returns 2MB
Why do prefetch and pre-runnable faults go through the first path, while

>  - if fault has accept info 2MB size, pass 2MB size into fault. Otherwise pass
> 4k (i.e. VMs that are relying on #VE to do the accept won't get huge pages
> *yet*).
other faults go through the second path?
 
> What goes wrong? Seems simpler and no more stuffing fault info on the vcpu.
I tried to avoid the double paths.
IMHO, it's confusing to specify max_level from two paths.

The fault info in vcpu_tdx isn't a real problem as it's per-vCPU.
An existing example in KVM is vcpu->arch.mmio_gfn.

We don't need something like the vcpu->arch.mmio_gen because
tdx->violation_gfn_* and tdx->violation_request_level are reset in each
tdx_handle_ept_violation().
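
For reference, a sketch of the reset being described (the field names follow
the patch-12 snippet quoted earlier; the sentinel values are assumptions):

/* Called at the start of tdx_handle_ept_violation() in this scheme. */
static void tdx_reset_violation_hint(struct vcpu_tdx *tdx)
{
        /*
         * Invalidate the per-fault hint so a stale ACCEPT level can never
         * apply to an unrelated fault; this is why no mmio_gen-style
         * generation counter is needed.
         */
        tdx->violation_gfn_start = -1;
        tdx->violation_gfn_end = -1;
        tdx->violation_request_level = PG_LEVEL_4K;
}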


BTW, dug into some history:

In v18 of the TDX basic series,
enforcing 4KB for pre-runnable faults was done by passing PG_LEVEL_4K to
kvm_mmu_map_tdp_page().
https://lore.kernel.org/all/1a64f798b550dad9e096603e8dae3b6e8fb2fbd5.1705965635.git.isaku.yamahata@intel.com/
https://lore.kernel.org/all/97bb1f2996d8a7b828cd9e3309380d1a86ca681b.1705965635.git.isaku.yamahata@intel.com/

For the other faults, it's done by altering max_level in kvm_mmu_do_page_fault(),
and Paolo asked to use the tdx_gmem_private_max_mapping_level() path.
https://lore.kernel.org/all/CABgObfbu1-Ok607uYdo4DzwZf8ZGVQnvHU+y9_M1Zae55K5xwQ@mail.gmail.com/

For the patch "KVM: x86/mmu: Allow per-VM override of the TDP max page level",
it was initially acked by Paolo in v2, and Sean's reply is at
https://lore.kernel.org/all/YO3%2FgvK9A3tgYfT6@google.com .
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 6 months, 4 weeks ago
On Tue, 2025-05-20 at 17:34 +0800, Yan Zhao wrote:
> So, you want to disallow huge pages for non-Linux TDs, then we have no need
> to support splitting in the fault path, right?
> 
> I'm OK if we don't care non-Linux TDs for now.
> This can simplify the splitting code and we can add the support when there's a
> need.

We do need to care about non-Linux TDs functioning, but we don't need to
optimize for them at this point. We need to optimize for things that happen
often. Pending-#VE using TDs are rare, and don't need to have huge pages in
order to work.

Yesterday Kirill and I were chatting offline about the newly defined
TDG.MEM.PAGE.RELEASE. It is kind of like an unaccept, so another possibility is:
1. Guest accepts at 2MB
2. Guest releases at 2MB (no notice to VMM)
3. Guest accepts at 4k, EPT violation with expectation to demote

In that case, KVM won't know to expect it, and that it needs to preemptively map
things at 4k.

For full coverage of the issue, can we discuss a little bit about what demote in
the fault path would look like? The current zapping operation that is involved
depends on mmu write lock. And I remember you had a POC that added essentially a
hidden exclusive lock in TDX code as a substitute. But unlike the other callers,
the fault path demote case could actually handle failure. So if we just returned
busy and didn't try to force the retry, we would just run the risk of
interfering with TDX module sept lock? Is that the only issue with a design that
would allow failure of demote in the fault path?

Let's keep in mind that we could ask for TDX module changes to enable this path.
I think we could probably get away with ignoring TDG.MEM.PAGE.RELEASE if we had
a plan to fix it up with TDX module changes. And if the ultimate root cause of
the complication is avoiding zero-step (sept lock), we should fix that instead
of designing around it further.

> 
> > I think this connects the question of whether we can pass the necessary info
> > into fault via synthetic error code. Consider this new design:
> > 
> >   - tdx_gmem_private_max_mapping_level() simply returns 4k for prefetch and pre-
> > runnable, otherwise returns 2MB
> Why prefetch and pre-runnable faults go the first path, while

Because these are either passed into private_max_mapping_level(), or not
associated with the fault (runnable state).

> 
> >   - if fault has accept info 2MB size, pass 2MB size into fault. Otherwise pass
> > 4k (i.e. VMs that are relying on #VE to do the accept won't get huge pages
> > *yet*).
> other faults go the second path?

This info is related to the specific fault.

>  
> > What goes wrong? Seems simpler and no more stuffing fault info on the vcpu.
> I tried to avoid the double paths.
> IMHO, it's confusing to specify max_level from two paths.
> 
> The fault info in vcpu_tdx isn't a real problem as it's per-vCPU.
> An existing example in KVM is vcpu->arch.mmio_gfn.

mmio_gfn isn't info about the fault though, it's info about the gfn being mmio.
So not fault scoped.

> 
> We don't need something like the vcpu->arch.mmio_gen because
> tdx->violation_gfn_* and tdx->violation_request_level are reset in each
> tdx_handle_ept_violation().
> 
> 
> BTW, dug into some history:
> 
> In v18 of TDX basic series,
> enforcing 4KB for pre-runnable faults were done by passing PG_LEVEL_4K to
> kvm_mmu_map_tdp_page().
> https://lore.kernel.org/all/1a64f798b550dad9e096603e8dae3b6e8fb2fbd5.1705965635.git.isaku.yamahata@intel.com/
> https://lore.kernel.org/all/97bb1f2996d8a7b828cd9e3309380d1a86ca681b.1705965635.git.isaku.yamahata@intel.com/
> 
> For the other faults, it's done by altering max_level in kvm_mmu_do_page_fault(),
> and Paolo asked to use the tdx_gmem_private_max_mapping_level() path.
> https://lore.kernel.org/all/CABgObfbu1-Ok607uYdo4DzwZf8ZGVQnvHU+y9_M1Zae55K5xwQ@mail.gmail.com/
> 
> For the patch "KVM: x86/mmu: Allow per-VM override of the TDP max page level",
> it's initially acked by Paolo in v2, and Sean's reply is at
> https://lore.kernel.org/all/YO3%2FgvK9A3tgYfT6@google.com .

The SNP case is not checking fault info, it's closer to the other cases. I don't
see that any of that conversation applies to this case. Can you clarify?

On the subject of the whether to pass accept level into the fault, or stuff it
on the vcpu, I'm still in the camp that it is better to pass it in the error
code. If you disagree, let's see if we can flag down Sean and Paolo to weigh in.



Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 6 months, 3 weeks ago
On Wed, May 21, 2025 at 11:40:15PM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-05-20 at 17:34 +0800, Yan Zhao wrote:
> > So, you want to disallow huge pages for non-Linux TDs, then we have no need
> > to support splitting in the fault path, right?
> > 
> > I'm OK if we don't care non-Linux TDs for now.
> > This can simplify the splitting code and we can add the support when there's a
> > need.
> 
> We do need to care about non-Linux TDs functioning, but we don't need to
> optimize for them at this point. We need to optimize for things that happen
> often. Pending-#VE using TDs are rare, and don't need to have huge pages in
> order to work.
> 
> Yesterday Kirill and I were chatting offline about the newly defined
> TDG.MEM.PAGE.RELEASE. It is kind of like an unaccept, so another possibility is:
> 1. Guest accepts at 2MB
> 2. Guest releases at 2MB (no notice to VMM)
> 3. Guest accepts at 4k, EPT violation with expectation to demote
> 
> In that case, KVM won't know to expect it, and that it needs to preemptively map
> things at 4k.
> 
> For full coverage of the issue, can we discuss a little bit about what demote in
> the fault path would look like?
For demote in the fault path, it will take mmu read lock.

So, the flow in the fault path is
1. zap with mmu read lock.
   ret = tdx_sept_zap_private_spte(kvm, gfn, level, page, true);
   if (ret <= 0)
       return ret;
2. track with mmu read lock
   ret = tdx_track(kvm, true);
   if (ret)
       return ret;
3. demote with mmu read lock
   ret = tdx_spte_demote_private_spte(kvm, gfn, level, page, true);
   if (ret)
       goto err;
4. return success or unzap as error fallback.
   tdx_sept_unzap_private_spte(kvm, gfn, level);

Steps 1-3 will return -EBUSY on a busy error (which will not happen very often,
as we will introduce kvm_tdx->sept_lock. I can post the full lock analysis if
necessary).
Step 4 will be ensured to succeed.

Here's the detailed code for step 1, 3 and 4.

static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
                                     enum pg_level level, struct page *page,
                                     bool mmu_lock_shared)
{
        int tdx_level = pg_level_to_tdx_sept_level(level);
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
        u64 err, entry, level_state;

        /* Before TD runnable, large page is not supported */
        WARN_ON_ONCE(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K);

        if (mmu_lock_shared)
                lockdep_assert_held_read(&kvm->mmu_lock);
        else
                lockdep_assert_held_write(&kvm->mmu_lock);

        write_lock(&kvm_tdx->sept_lock);
        err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
        write_unlock(&kvm_tdx->sept_lock);

        if (unlikely(tdx_operand_busy(err))) {
                if (mmu_lock_shared)
                        return -EBUSY;

                /* After no vCPUs enter, the second retry is expected to succeed */
                write_lock(&kvm_tdx->sept_lock);
                tdx_no_vcpus_enter_start(kvm);
                err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
                tdx_no_vcpus_enter_stop(kvm);
                write_unlock(&kvm_tdx->sept_lock);
        }

        if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
            !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
                atomic64_dec(&kvm_tdx->nr_premapped);
                return 0;
        }

        if (KVM_BUG_ON(err, kvm)) {
                pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
                return -EIO;
        }
        return 1;
}

static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
                                        enum pg_level level, struct page *page,
                                        bool mmu_lock_shared)
{
        int tdx_level = pg_level_to_tdx_sept_level(level);
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        gpa_t gpa = gfn_to_gpa(gfn);
        u64 err, entry, level_state;

        do {
                read_lock(&kvm_tdx->sept_lock);
                err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
                                          &entry, &level_state);
                read_unlock(&kvm_tdx->sept_lock);
        } while (err == TDX_INTERRUPTED_RESTARTABLE);

        if (unlikely(tdx_operand_busy(err))) {
                unsigned long flags;

                if (mmu_lock_shared)
                        return -EBUSY;

                tdx_no_vcpus_enter_start(kvm);
                read_lock(&kvm_tdx->sept_lock);

                local_irq_save(flags);
                err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
                                          &entry, &level_state);
                local_irq_restore(flags);
                read_unlock(&kvm_tdx->sept_lock);
                tdx_no_vcpus_enter_stop(kvm);
        }

        if (KVM_BUG_ON(err, kvm)) {
                pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
                return -EIO;
        }
        return 0;
}

static void tdx_sept_unzap_private_spte(struct kvm *kvm, gfn_t gfn,
                                     enum pg_level level)
{
        int tdx_level = pg_level_to_tdx_sept_level(level);
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
        u64 err, entry, level_state;

        write_lock(&kvm_tdx->sept_lock);
        err = tdh_mem_range_unblock(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
        write_unlock(&kvm_tdx->sept_lock);

        if (unlikely(tdx_operand_busy(err))) {
                write_lock(&kvm_tdx->sept_lock);
                tdx_no_vcpus_enter_start(kvm);
                err = tdh_mem_range_unblock(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
                tdx_no_vcpus_enter_stop(kvm);
                write_unlock(&kvm_tdx->sept_lock);
        }

        if (KVM_BUG_ON(err, kvm)) {
                pr_tdx_error_2(TDH_MEM_RANGE_UNBLOCK, err, entry, level_state);
        }
}

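The listing above covers steps 1, 3 and 4. For completeness, a rough sketch of
step 2 under the same assumptions (the busy handling and the use of
kvm_tdx->track_lock are guesses, not posted code):

static int tdx_track(struct kvm *kvm, bool mmu_lock_shared)
{
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        u64 err;

        spin_lock(&kvm_tdx->track_lock);
        err = tdh_mem_track(&kvm_tdx->td);
        spin_unlock(&kvm_tdx->track_lock);

        if (unlikely(tdx_operand_busy(err))) {
                if (mmu_lock_shared)
                        return -EBUSY;

                /* With no vCPUs entering, the retry is expected to succeed. */
                tdx_no_vcpus_enter_start(kvm);
                err = tdh_mem_track(&kvm_tdx->td);
                tdx_no_vcpus_enter_stop(kvm);
        }

        if (KVM_BUG_ON(err, kvm)) {
                pr_tdx_error(TDH_MEM_TRACK, err);
                return -EIO;
        }

        /* Kick vCPUs out of the guest so the TLB tracking takes effect. */
        kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
        return 0;
}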

> The current zapping operation that is involved
> depends on mmu write lock. And I remember you had a POC that added essentially a
> hidden exclusive lock in TDX code as a substitute. But unlike the other callers,
Right. The kvm_tdx->sept_lock is introduced as an rwlock. The write lock is
held only for a very short period, around tdh_mem_sept_remove(),
tdh_mem_range_block(), and tdh_mem_range_unblock().

The read/write status of the kvm_tdx->sept_lock corresponds to that in the TDX
module.

  Resources          SHARED  users              EXCLUSIVE users 
-----------------------------------------------------------------------
 secure_ept_lock   tdh_mem_sept_add            tdh_vp_enter
                   tdh_mem_page_aug            tdh_mem_sept_remove
                   tdh_mem_page_remove         tdh_mem_range_block
                   tdh_mem_page_promote        tdh_mem_range_unblock
                   tdh_mem_page_demote

> the fault path demote case could actually handle failure. So if we just returned
> busy and didn't try to force the retry, we would just run the risk of
> interfering with TDX module sept lock? Is that the only issue with a design that
> would allows failure of demote in the fault path?
The concern with supporting split in the fault path is mainly to avoid
unnecessary splits, e.g., when two vCPUs try to accept at different levels.

Besides that, we need to introduce 3 locks inside TDX:
rwlock_t sept_lock, spinlock_t no_vcpu_enter_lock, spinlock_t track_lock.

To ensure the success of unzap (to restore the state), kicking vCPUs out in the
fault path is required, which is not ideal. But with the introduced lock and the
proposed TDX module change to tdg_mem_page_accept() (as in the next comment),
the chance of invoking unzap is very low.

> Let's keep in mind that we could ask for TDX module changes to enable this path.
We may need a TDX module change to make tdg_mem_page_accept() not take a lock
on a non-ACCEPTable entry, to avoid contention with the guest and the potential
TDX_HOST_PRIORITY_BUSY_TIMEOUT error.

> I think we could probably get away with ignoring TDG.MEM.PAGE.RELEASE if we had
> a plan to fix it up with TDX module changes. And if the ultimate root cause of
> the complication is avoiding zero-step (sept lock), we should fix that instead
> of design around it further.
Ok.

> > > I think this connects the question of whether we can pass the necessary info
> > > into fault via synthetic error code. Consider this new design:
> > > 
> > >   - tdx_gmem_private_max_mapping_level() simply returns 4k for prefetch and pre-
> > > runnable, otherwise returns 2MB
> > Why prefetch and pre-runnable faults go the first path, while
> 
> Because these are either passed into private_max_mapping_level(), or not
> associated with the fault (runnable state).
> 
> > 
> > >   - if fault has accept info 2MB size, pass 2MB size into fault. Otherwise pass
> > > 4k (i.e. VMs that are relying on #VE to do the accept won't get huge pages
> > > *yet*).
> > other faults go the second path?
> 
> This info is related to the specific fault.
> 
> >  
> > > What goes wrong? Seems simpler and no more stuffing fault info on the vcpu.
> > I tried to avoid the double paths.
> > IMHO, it's confusing to specify max_level from two paths.
> > 
> > The fault info in vcpu_tdx isn't a real problem as it's per-vCPU.
> > An existing example in KVM is vcpu->arch.mmio_gfn.
> 
> mmio_gfn isn't info about the fault though, it's info about the gfn being mmio.
> So not fault scoped.
> 
> > 
> > We don't need something like the vcpu->arch.mmio_gen because
> > tdx->violation_gfn_* and tdx->violation_request_level are reset in each
> > tdx_handle_ept_violation().
> > 
> > 
> > BTW, dug into some history:
> > 
> > In v18 of TDX basic series,
> > enforcing 4KB for pre-runnable faults were done by passing PG_LEVEL_4K to
> > kvm_mmu_map_tdp_page().
> > https://lore.kernel.org/all/1a64f798b550dad9e096603e8dae3b6e8fb2fbd5.1705965635.git.isaku.yamahata@intel.com/
> > https://lore.kernel.org/all/97bb1f2996d8a7b828cd9e3309380d1a86ca681b.1705965635.git.isaku.yamahata@intel.com/
> > 
> > For the other faults, it's done by altering max_level in kvm_mmu_do_page_fault(),
> > and Paolo asked to use the tdx_gmem_private_max_mapping_level() path.
> > https://lore.kernel.org/all/CABgObfbu1-Ok607uYdo4DzwZf8ZGVQnvHU+y9_M1Zae55K5xwQ@mail.gmail.com/
> > 
> > For the patch "KVM: x86/mmu: Allow per-VM override of the TDP max page level",
> > it's initially acked by Paolo in v2, and Sean's reply is at
> > https://lore.kernel.org/all/YO3%2FgvK9A3tgYfT6@google.com .
> 
> The SNP case is not checking fault info, it's closer to the other cases. I don't
> see that any of that conversation applies to this case. Can you clarify?
My concern with stuffing the error_code to pass in the fault max_level is that,
if it were a good path, the TDX basic enabling code would have been implemented
that way, by always passing in 4KB.

Why did Sean say
"
Looks like SNP needs a dynamic check, i.e. a kvm_x86_ops hook, to handle an edge
case in the RMP.  That's probably the better route given that this is a short-term
hack (hopefully :-D).
"
instead of suggesting TDX enable the error code path earlier and hardcode the
level to 4KB?

> On the subject of the whether to pass accept level into the fault, or stuff it
> on the vcpu, I'm still in the camp that it is better to pass it in the error
> code. If you disagree, let's see if we can flag down Sean and Paolo to weigh in.
Ok.

To document for further discussions with Sean and Paolo:

- Passing in max_level in tdx_gmem_private_max_mapping_level()
  Cons:
  a) needs to stuff info in the vcpu to get accept level info.

  Pros:
  a) a uniform approach as to SEV.
  b) dynamic. Can get more fault info, e.g. is_prefetch, gfn, pfn.
  c) can get an increased/decreased level for a given gfn, similarly to how
     the accept level is obtained
  d) flexibility for TDX to implement advanced features. e.g.
     1. determine an accept level after certain negotiation with guest
     2. pre-fetch memory


- To pass in max_level in error_code
  Cons:
  a) still need tdx_gmem_private_max_mapping_level() to get dynamic info.
  b) still need info stuffed on the vcpu under certain conditions. e.g.
     when promotion fails with TDX_EPT_INVALID_PROMOTE_CONDITIONS, we can skip
     the local retry by reducing the max_level.
  c) only effective in the EPT violation path.
  Pros:
     currently easy to pass in accept level info.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 6 months, 3 weeks ago
On Thu, 2025-05-22 at 11:52 +0800, Yan Zhao wrote:
> On Wed, May 21, 2025 at 11:40:15PM +0800, Edgecombe, Rick P wrote:
> > On Tue, 2025-05-20 at 17:34 +0800, Yan Zhao wrote:
> > > So, you want to disallow huge pages for non-Linux TDs, then we have no need
> > > to support splitting in the fault path, right?
> > > 
> > > I'm OK if we don't care non-Linux TDs for now.
> > > This can simplify the splitting code and we can add the support when there's a
> > > need.
> > 
> > We do need to care about non-Linux TDs functioning, but we don't need to
> > optimize for them at this point. We need to optimize for things that happen
> > often. Pending-#VE using TDs are rare, and don't need to have huge pages in
> > order to work.
> > 
> > Yesterday Kirill and I were chatting offline about the newly defined
> > TDG.MEM.PAGE.RELEASE. It is kind of like an unaccept, so another possibility is:
> > 1. Guest accepts at 2MB
> > 2. Guest releases at 2MB (no notice to VMM)
> > 3. Guest accepts at 4k, EPT violation with expectation to demote
> > 
> > In that case, KVM won't know to expect it, and that it needs to preemptively map
> > things at 4k.
> > 
> > For full coverage of the issue, can we discuss a little bit about what demote in
> > the fault path would look like?
> For demote in the fault path, it will take mmu read lock.
> 
> So, the flow in the fault path is
> 1. zap with mmu read lock.
>    ret = tdx_sept_zap_private_spte(kvm, gfn, level, page, true);
>    if (ret <= 0)
>        return ret;
> 2. track with mmu read lock
>    ret = tdx_track(kvm, true);
>    if (ret)
>        return ret;
> 3. demote with mmu read lock
>    ret = tdx_spte_demote_private_spte(kvm, gfn, level, page, true);
>    if (ret)
>        goto err;
> 4. return success or unzap as error fallback.
>    tdx_sept_unzap_private_spte(kvm, gfn, level);
> 
> Steps 1-3 will return -EBUSY on busy error (which will not be very often as we
> will introduce kvm_tdx->sept_lock. I can post the full lock analysis if
> necessary).

It's true that it would not be taken very often. It's not a performance
issue, but I think we should not add a lock if we can at all avoid it. It
creates a special case for TDX in the TDP MMU. People would then have to keep
in mind that two mmu read lock threads could still contend.

[snip]
> 
> 
> > The current zapping operation that is involved
> > depends on mmu write lock. And I remember you had a POC that added essentially a
> > hidden exclusive lock in TDX code as a substitute. But unlike the other callers,
> Right, The kvm_tdx->sept_lock is introduced as a rw lock. The write lock is held
> in a very short period, around tdh_mem_sept_remove(), tdh_mem_range_block(),
> tdh_mem_range_unblock().
> 
> The read/write status of the kvm_tdx->sept_lock corresponds to that in the TDX
> module.
> 
>   Resources          SHARED  users              EXCLUSIVE users 
> -----------------------------------------------------------------------
>  secure_ept_lock   tdh_mem_sept_add            tdh_vp_enter
>                    tdh_mem_page_aug            tdh_mem_sept_remove
>                    tdh_mem_page_remove         tdh_mem_range_block
>                    tdh_mem_page_promote        tdh_mem_range_unblock
>                    tdh_mem_page_demote
> 
> > the fault path demote case could actually handle failure. So if we just returned
> > busy and didn't try to force the retry, we would just run the risk of
> > interfering with TDX module sept lock? Is that the only issue with a design that
> > would allows failure of demote in the fault path?
> The concern to support split in the fault path is mainly to avoid unnecesssary
> split, e.g., when two vCPUs try to accept at different levels.

We are just talking about keeping rare TDs functional here, right? Two cases
are:
 - TDs using PAGE.RELEASE
 - TDs using pending #VEs and accepting memory in strange patterns

Not maintaining huge pages there seems totally acceptable. How I look at this
whole thing is that it's just an optimization, not a feature. Every aspect has a
complexity/performance tradeoff that we need to make a sensible decision on.
Maintaining huge page mappings in every possible case is not the goal.

> 
> Besides that we need to introduce 3 locks inside TDX:
> rwlock_t sept_lock, spinlock_t no_vcpu_enter_lock, spinlock_t track_lock.

Huh?

> 
> To ensure the success of unzap (to restore the state), kicking of vCPUs in the
> fault path is required, which is not ideal. But with the introduced lock and the
> proposed TDX modules's change to tdg_mem_page_accept() (as in the next comment),
> the chance to invoke unzap is very low.

Yes, it's probably not safe to expect the exact same demote call chain again.
The fault path could maybe learn to recover from the blocked state?

> 
> > Let's keep in mind that we could ask for TDX module changes to enable this path.
> We may need TDX module's change to let tdg_mem_page_accept() not to take lock on
> an non-ACCEPTable entry to avoid contention with guest and the potential error
> TDX_HOST_PRIORITY_BUSY_TIMEOUT.

Part of that is already in the works (accepting not-present entries). It seems
reasonable. But also, what about having the TDX module do the full demote
operation internally? The track part obviously happens outside of the TDX
module, but maybe the whole thing could be simplified.

> 
> > I think we could probably get away with ignoring TDG.MEM.PAGE.RELEASE if we had
> > a plan to fix it up with TDX module changes. And if the ultimate root cause of
> > the complication is avoiding zero-step (sept lock), we should fix that instead
> > of design around it further.
> Ok.
> 
> > > 

I'll respond to the error code half of this mail separately.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 6 months, 3 weeks ago
On Sat, May 24, 2025 at 07:40:25AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-05-22 at 11:52 +0800, Yan Zhao wrote:
> > On Wed, May 21, 2025 at 11:40:15PM +0800, Edgecombe, Rick P wrote:
> > > On Tue, 2025-05-20 at 17:34 +0800, Yan Zhao wrote:
> > > > So, you want to disallow huge pages for non-Linux TDs, then we have no need
> > > > to support splitting in the fault path, right?
> > > > 
> > > > I'm OK if we don't care non-Linux TDs for now.
> > > > This can simplify the splitting code and we can add the support when there's a
> > > > need.
> > > 
> > > We do need to care about non-Linux TDs functioning, but we don't need to
> > > optimize for them at this point. We need to optimize for things that happen
> > > often. Pending-#VE using TDs are rare, and don't need to have huge pages in
> > > order to work.
> > > 
> > > Yesterday Kirill and I were chatting offline about the newly defined
> > > TDG.MEM.PAGE.RELEASE. It is kind of like an unaccept, so another possibility is:
> > > 1. Guest accepts at 2MB
> > > 2. Guest releases at 2MB (no notice to VMM)
> > > 3. Guest accepts at 4k, EPT violation with expectation to demote
> > > 
> > > In that case, KVM won't know to expect it, and that it needs to preemptively map
> > > things at 4k.
> > > 
> > > For full coverage of the issue, can we discuss a little bit about what demote in
> > > the fault path would look like?
> > For demote in the fault path, it will take mmu read lock.
> > 
> > So, the flow in the fault path is
> > 1. zap with mmu read lock.
> >    ret = tdx_sept_zap_private_spte(kvm, gfn, level, page, true);
> >    if (ret <= 0)
> >        return ret;
> > 2. track with mmu read lock
> >    ret = tdx_track(kvm, true);
> >    if (ret)
> >        return ret;
> > 3. demote with mmu read lock
> >    ret = tdx_spte_demote_private_spte(kvm, gfn, level, page, true);
> >    if (ret)
> >        goto err;
> > 4. return success or unzap as error fallback.
> >    tdx_sept_unzap_private_spte(kvm, gfn, level);
> > 
> > Steps 1-3 will return -EBUSY on busy error (which will not be very often as we
> > will introduce kvm_tdx->sept_lock. I can post the full lock analysis if
> > necessary).
> 
> That is true that it would not be taken very often. It's not a performance
> issue, but I think we should not add a lock if we can at all avoid it. It
> creates a special case for TDX for the TDP MMU. People would have to then keep
> in mind that two mmu read lock threads could still still contend.
Hmm, without the kvm_tdx->sept_lock, we can return a retry if a busy error is
returned from tdh_mem_range_block(). However, we need to ensure the success of
tdh_mem_range_unblock() before completing the split.
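
Something like the below is what I mean -- just a sketch. It reuses the
SEAMCALL wrapper and error helpers from the base series (tdx_operand_busy(),
pr_tdx_error_2()); the -EBUSY plumbing back to the fault handler is an
assumption.

        /*
         * Sketch only: fail the fault-path block with -EBUSY on contention
         * instead of serializing behind a TDX-private sept_lock.
         */
        static int tdx_sept_block_for_split(struct kvm *kvm, gfn_t gfn,
                                            enum pg_level level)
        {
                struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
                gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
                u64 err, entry, level_state;

                err = tdh_mem_range_block(&kvm_tdx->td, gpa,
                                          pg_level_to_tdx_sept_level(level),
                                          &entry, &level_state);
                if (tdx_operand_busy(err))
                        return -EBUSY;  /* let the vCPU retry the fault */

                if (KVM_BUG_ON(err, kvm)) {
                        pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
                        return -EIO;
                }
                return 0;
        }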

Besides, we need the kvm_tdx->track_lock to serialize tdh_mem_track() and the
kicking of vCPUs. In the base series, we take kvm->mmu_lock for write to
achieve this.

BTW: it looks like Kirill's DPAMT series will introduce a pamt_lock [1].
[1] https://lore.kernel.org/all/20250502130828.4071412-6-kirill.shutemov@linux.intel.com/

> > > The current zapping operation that is involved
> > > depends on mmu write lock. And I remember you had a POC that added essentially a
> > > hidden exclusive lock in TDX code as a substitute. But unlike the other callers,
> > Right, The kvm_tdx->sept_lock is introduced as a rw lock. The write lock is held
> > in a very short period, around tdh_mem_sept_remove(), tdh_mem_range_block(),
> > tdh_mem_range_unblock().
> > 
> > The read/write status of the kvm_tdx->sept_lock corresponds to that in the TDX
> > module.
> > 
> >   Resources          SHARED  users              EXCLUSIVE users 
> > -----------------------------------------------------------------------
> >  secure_ept_lock   tdh_mem_sept_add            tdh_vp_enter
> >                    tdh_mem_page_aug            tdh_mem_sept_remove
> >                    tdh_mem_page_remove         tdh_mem_range_block
> >                    tdh_mem_page_promote        tdh_mem_range_unblock
> >                    tdh_mem_page_demote
> > 
> > > the fault path demote case could actually handle failure. So if we just returned
> > > busy and didn't try to force the retry, we would just run the risk of
> > > interfering with TDX module sept lock? Is that the only issue with a design that
> > > would allows failure of demote in the fault path?
> > The concern to support split in the fault path is mainly to avoid unnecesssary
> > split, e.g., when two vCPUs try to accept at different levels.
> 
> We are just talking about keeping rare TDs functional here, right? Two cases
> are:
>  - TDs using PAGE.RELEASE
This is for future Linux TDs, right?

>  - TDs using pending #VEs and accepting memory in strange patterns
> 
> Not maintaining huge pages there seems totally acceptable. How I look at this
> whole thing is that it just an optimization, not a feature. Every aspect has a
> complexity/performance tradeoff that we need to make a sensible decision on.
> Maintaining huge page mappings in every possible case is not the goal.
So, can I interpret your preference as follows?
For now,
- Do not support huge pages for non-Linux TDs.
- Do not support page splitting in the fault path.

> > 
> > Besides that we need to introduce 3 locks inside TDX:
> > rwlock_t sept_lock, spinlock_t no_vcpu_enter_lock, spinlock_t track_lock.
> 
> Huh?
In the base series, the no_vcpu_enter_lock and track_lock are avoided by
holding kvm->mmu_lock for write.

> 
> > 
> > To ensure the success of unzap (to restore the state), kicking of vCPUs in the
> > fault path is required, which is not ideal. But with the introduced lock and the
> > proposed TDX modules's change to tdg_mem_page_accept() (as in the next comment),
> > the chance to invoke unzap is very low.
> 
> Yes, it's probably not safe to expect the exact same demote call chain again.
> The fault path could maybe learn to recover from the blocked state?
Do you mean you want to introduce a blocked state in the mirror page table?
I don't like that because of its complexity.

Could we instead ask for tdh_mem_page_demote() not to use
tdh_mem_range_block() and tdh_mem_range_unblock()? It looks like that is
required for TDX Connect anyway.

If so, tdh_mem_range_{un}block()/tdh_mem_track() can be avoided in the fault
path.

> > 
> > > Let's keep in mind that we could ask for TDX module changes to enable this path.
> > We may need TDX module's change to let tdg_mem_page_accept() not to take lock on
> > an non-ACCEPTable entry to avoid contention with guest and the potential error
> > TDX_HOST_PRIORITY_BUSY_TIMEOUT.
> 
> Part of that is already in the works (accepting not-present entries). It seems
> reasonable. But also, what about looking at having the TDX module do the full
> demote operation internally. The track part obviously happens outside of the TDX
> module, but maybe the whole thing could be simplified.
> 
> > 
> > > I think we could probably get away with ignoring TDG.MEM.PAGE.RELEASE if we had
> > > a plan to fix it up with TDX module changes. And if the ultimate root cause of
> > > the complication is avoiding zero-step (sept lock), we should fix that instead
> > > of design around it further.
> > Ok.
> > 
> > > > 
> 
> I'll respond to the error code half of this mail separately.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Huang, Kai 6 months, 4 weeks ago
On Tue, 2025-05-20 at 17:34 +0800, Zhao, Yan Y wrote:
> On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
> > On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > > > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K accept
> > > > order,
> > > > e.g., they always accept 4K, there could be *endless EPT violation* if I
> > > > understand your words correctly.
> > > > 
> > > > Isn't this yet-another reason we should choose to return PG_LEVEL_4K instead
> > > > of
> > > > 2M if no accept level is provided in the fault?
> > > As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> > > TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
> > 
> > TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
> > docs say the VMM needs to demote *if* the mapping is large and the accept size
> > is small. But if we map at 4k size for non-accept EPT violations, we won't hit
> > this case. I also wonder what is preventing the TDX module from handling a 2MB
> > accept size at 4k mappings. It could be changed maybe.
> > 
> > But I think Kai's question was: why are we complicating the code for the case of
> > non-Linux TDs that also use #VE for accept? It's not necessary to be functional,
> > and there aren't any known TDs like that which are expected to use KVM today.
> > (err, except the MMU stress test). So in another form the question is: should we
> > optimize KVM for a case we don't even know if anyone will use? The answer seems
> > obviously no to me.
> So, you want to disallow huge pages for non-Linux TDs, then we have no need
> to support splitting in the fault path, right?
> 
> I'm OK if we don't care non-Linux TDs for now.
> This can simplify the splitting code and we can add the support when there's a
> need.

For the record, I am not saying we don't care about non-Linux TDs.  I am
worried about the *endless* EPT violation in your words below:

    ... The worst outcome to ignore the resulting
    splitting request is an endless EPT violation.  This would not happen
    for a Linux guest, which does not expect any #VE.

And the point is, it's not OK if a *legal* guest behaviour can trigger this.

 
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Sean Christopherson 6 months, 1 week ago
On Tue, May 20, 2025, Kai Huang wrote:
> On Tue, 2025-05-20 at 17:34 +0800, Zhao, Yan Y wrote:
> > On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
> > > On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > > > > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K
> > > > > accept order, e.g., they always accept 4K, there could be *endless
> > > > > EPT violation* if I understand your words correctly.
> > > > > 
> > > > > Isn't this yet-another reason we should choose to return PG_LEVEL_4K
> > > > > instead of 2M if no accept level is provided in the fault?
> > > > As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> > > > TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
> > > 
> > > TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
> > > docs say the VMM needs to demote *if* the mapping is large and the accept size
> > > is small.

No thanks, fix the spec and the TDX Module.  Punting an error to the VMM is
inconsistent, convoluted, and inefficient.

Per "Table 8.2: TDG.MEM.PAGE.ACCEPT SEPT Walk Cases":

  S-EPT state         ACCEPT vs. Mapping Size         Behavior
  Leaf SEPT_PRESENT   Smaller                         TDACCEPT_SIZE_MISMATCH
  Leaf !SEPT_PRESENT  Smaller                         EPT Violation <=========================|
  Leaf DONT_CARE      Same                            Success                                 | => THESE TWO SHOULD MATCH!!!
  !Leaf SEPT_FREE     Larger                          EPT Violation, BECAUSE THERE'S NO PAGE  |
  !Leaf SEPT_FREE     Larger                          TDACCEPT_SIZE_MISMATCH <================|


If ACCEPT is "too small", an EPT violation occurs.  But if ACCEPT is "too big",
a TDACCEPT_SIZE_MISMATCH error occurs.  That's asinine.

The only reason that comes to mind for punting the "too small" case to the VMM
is to try and keep the guest alive if the VMM is mapping more memory than has
been enumerated to the guest.  E.g. if the guest suspects the VMM is malicious
or buggy.  IMO, that's a terrible reason to push this much complexity into the
host.  It also risks godawful boot times, e.g. if the guest kernel is buggy and
accepts everything at 4KiB granularity.

The TDX Module should return TDACCEPT_SIZE_MISMATCH and force the guest to take
action, not force the hypervisor to limp along in a degraded state.  If the guest
doesn't want to ACCEPT at a larger granularity, e.g. because it doesn't think the
entire 2MiB/1GiB region is available, then the guest can either log a warning and
"poison" the page(s), or terminate and refuse to boot.

If for some reason the guest _can't_ ACCEPT at larger granularity, i.e. if the
guest _knows_ that 2MiB or 1GiB is available/usable but refuses to ACCEPT at the
appropriate granularity, then IMO that's firmly a guest bug.

If there's a *legitimate* use case where the guest wants to ACCEPT a subset of
memory, then there should be an explicit TDCALL to request that the unwanted
regions of memory be unmapped.  Smushing everything into implicit behavior has
obviously created a giant mess.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Xiaoyao Li 6 months ago
On 6/11/2025 10:42 PM, Sean Christopherson wrote:
> On Tue, May 20, 2025, Kai Huang wrote:
>> On Tue, 2025-05-20 at 17:34 +0800, Zhao, Yan Y wrote:
>>> On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
>>>> On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
>>>>>> On the opposite, if other non-Linux TDs don't follow 1G->2M->4K
>>>>>> accept order, e.g., they always accept 4K, there could be *endless
>>>>>> EPT violation* if I understand your words correctly.
>>>>>>
>>>>>> Isn't this yet-another reason we should choose to return PG_LEVEL_4K
>>>>>> instead of 2M if no accept level is provided in the fault?
>>>>> As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
>>>>> TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
>>>>
>>>> TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
>>>> docs say the VMM needs to demote *if* the mapping is large and the accept size
>>>> is small.
> 
> No thanks, fix the spec and the TDX Module.  Punting an error to the VMM is
> inconsistent, convoluted, and inefficient.
> 
> Per "Table 8.2: TDG.MEM.PAGE.ACCEPT SEPT Walk Cases":
> 
>    S-EPT state         ACCEPT vs. Mapping Size         Behavior
>    Leaf SEPT_PRESENT   Smaller                         TDACCEPT_SIZE_MISMATCH
>    Leaf !SEPT_PRESENT  Smaller                         EPT Violation <=========================|
>    Leaf DONT_CARE      Same                            Success                                 | => THESE TWO SHOULD MATCH!!!
>    !Leaf SEPT_FREE     Larger                          EPT Violation, BECAUSE THERE'S NO PAGE  |
>    !Leaf SEPT_FREE     Larger                          TDACCEPT_SIZE_MISMATCH <================|
> 
> 
> If ACCEPT is "too small", an EPT violation occurs.  But if ACCEPT is "too big",
> a TDACCEPT_SIZE_MISMATCH error occurs.  That's asinine.
> 
> The only reason that comes to mind for punting the "too small" case to the VMM
> is to try and keep the guest alive if the VMM is mapping more memory than has
> been enumerated to the guest.  E.g. if the guest suspects the VMM is malicious
> or buggy.  IMO, that's a terrible reason to push this much complexity into the
> host.  It also risks godawful boot times, e.g. if the guest kernel is buggy and
> accepts everything at 4KiB granularity.
> 
> The TDX Module should return TDACCEPT_SIZE_MISMATCH and force the guest to take
> action, not force the hypervisor to limp along in a degraded state.  If the guest
> doesn't want to ACCEPT at a larger granularity, e.g. because it doesn't think the
> entire 2MiB/1GiB region is available, then the guest can either log a warning and
> "poison" the page(s), or terminate and refuse to boot.
> 
> If for some reason the guest _can't_ ACCEPT at larger granularity, i.e. if the
> guest _knows_ that 2MiB or 1GiB is available/usable but refuses to ACCEPT at the
> appropriate granularity, then IMO that's firmly a guest bug.

It might just be that the guest doesn't want to accept at a larger level,
rather than that it can't. See the use case below.

> If there's a *legitimate* use case where the guest wants to ACCEPT a subset of
> memory, then there should be an explicit TDCALL to request that the unwanted
> regions of memory be unmapped.  Smushing everything into implicit behavior has
> obvioulsy created a giant mess.

Isn't an ACCEPT with a specific level explicit? Note that ACCEPT is not
only for the case where the VMM has already mapped the page and the guest
only needs to accept it to make it available; it also works for the case
where the guest requests the VMM to map the page for a gpa (at a specific
level) and then accepts it.

Even for the former case, it is understandable to behave differently in
the "too small" and "too big" cases. If the requested accept level is
"too small", the VMM can handle it by demoting the page to satisfy the
guest. But when the level is "too big", usually the VMM cannot map the page
at a higher level, so an EPT violation cannot help. I admit that this leads
to the requirement that the VMM should always try to map the page at the
highest available level if the EPT violation is not caused by an ACCEPT
that contains a desired mapping level.

As for the scenario, the one I can think of is: the guest constantly
converts a 4K-sized page between private and shared, for testing purposes.
The guest knows that accepting the gpa at a higher level takes more time,
and that converting it to shared triggers a DEMOTE and costs more time. So
for better performance, the guest just calls ACCEPT with a 4KB page.
However, the VMM returns PAGE_SIZE_MISMATCH and forces the guest to accept
a bigger size. What a stupid VMM.

Anyway, I'm just expressing how I understand the current design, and I
think it's reasonable. I don't object to the idea of returning
ACCEPT_SIZE_MISMATCH for the "too small" case, but it needs to be guest
opt-in, i.e., let the guest itself choose the behavior.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 6 months ago
On Fri, Jun 13, 2025 at 10:41:21AM +0800, Xiaoyao Li wrote:
> On 6/11/2025 10:42 PM, Sean Christopherson wrote:
> > On Tue, May 20, 2025, Kai Huang wrote:
> > > On Tue, 2025-05-20 at 17:34 +0800, Zhao, Yan Y wrote:
> > > > On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
> > > > > On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > > > > > > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K
> > > > > > > accept order, e.g., they always accept 4K, there could be *endless
> > > > > > > EPT violation* if I understand your words correctly.
> > > > > > > 
> > > > > > > Isn't this yet-another reason we should choose to return PG_LEVEL_4K
> > > > > > > instead of 2M if no accept level is provided in the fault?
> > > > > > As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> > > > > > TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
> > > > > 
> > > > > TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
> > > > > docs say the VMM needs to demote *if* the mapping is large and the accept size
> > > > > is small.
> > 
> > No thanks, fix the spec and the TDX Module.  Punting an error to the VMM is
> > inconsistent, convoluted, and inefficient.
> > 
> > Per "Table 8.2: TDG.MEM.PAGE.ACCEPT SEPT Walk Cases":
> > 
> >    S-EPT state         ACCEPT vs. Mapping Size         Behavior
> >    Leaf SEPT_PRESENT   Smaller                         TDACCEPT_SIZE_MISMATCH
> >    Leaf !SEPT_PRESENT  Smaller                         EPT Violation <=========================|
> >    Leaf DONT_CARE      Same                            Success                                 | => THESE TWO SHOULD MATCH!!!
> >    !Leaf SEPT_FREE     Larger                          EPT Violation, BECAUSE THERE'S NO PAGE  |
> >    !Leaf SEPT_FREE     Larger                          TDACCEPT_SIZE_MISMATCH <================|
> > 
> > 
> > If ACCEPT is "too small", an EPT violation occurs.  But if ACCEPT is "too big",
> > a TDACCEPT_SIZE_MISMATCH error occurs.  That's asinine.
> > 
> > The only reason that comes to mind for punting the "too small" case to the VMM
> > is to try and keep the guest alive if the VMM is mapping more memory than has
> > been enumerated to the guest.  E.g. if the guest suspects the VMM is malicious
> > or buggy.  IMO, that's a terrible reason to push this much complexity into the
> > host.  It also risks godawful boot times, e.g. if the guest kernel is buggy and
> > accepts everything at 4KiB granularity.
> > 
> > The TDX Module should return TDACCEPT_SIZE_MISMATCH and force the guest to take
> > action, not force the hypervisor to limp along in a degraded state.  If the guest
> > doesn't want to ACCEPT at a larger granularity, e.g. because it doesn't think the
> > entire 2MiB/1GiB region is available, then the guest can either log a warning and
> > "poison" the page(s), or terminate and refuse to boot.
> > 
> > If for some reason the guest _can't_ ACCEPT at larger granularity, i.e. if the
> > guest _knows_ that 2MiB or 1GiB is available/usable but refuses to ACCEPT at the
> > appropriate granularity, then IMO that's firmly a guest bug.
> 
> It might just be guest doesn't want to accept a larger level instead of
> can't. Use case see below.
> 
> > If there's a *legitimate* use case where the guest wants to ACCEPT a subset of
> > memory, then there should be an explicit TDCALL to request that the unwanted
> > regions of memory be unmapped.  Smushing everything into implicit behavior has
> > obvioulsy created a giant mess.
> 
> Isn't the ACCEPT with a specific level explicit? Note that ACCEPT is not
> only for the case that VMM has already mapped page and guest only needs to
> accept it to make it available, it also works for the case that guest
> requests VMM to map the page for a gpa (at specific level) then guest
> accepts it.
> 
> Even for the former case, it is understandable for behaving differently for
> the "too small" and "too big" case. If the requested accept level is "too
> small", VMM can handle it by demoting the page to satisfy guest. But when
> the level is "too big", usually the VMM cannot map the page at a higher
> level so that ept violation cannot help. I admit that it leads to the
> requirement that VMM should always try to map the page at the highest
> available level, if the EPT violation is not caused by ACCEPT which contains
> a desired mapping level.
> 
> As for the scenario, the one I can think of is, guest is trying to convert a
> 4K sized page between private and shared constantly, for testing purpose.
> Guest knows that if accepting the gpa at higher level, it takes more time.
> And when convert it to shared, it triggers DEMOTE and more time. So for
> better performance, guest just calls ACCEPT with 4KB page. However, VMM
Hmm, the first ACCEPT at the 4KB level already triggers a DEMOTE.
So, I don't see how ACCEPT at 4KB helps performance.

Suppose VMM has mapped a page at 4MB,

         Scenario 1                           Effort
  (1) Guest ACCEPT at 2MB                   ACCEPT 2MB         
  (2) converts a 4KB page to shared         DEMOTE
  (3) convert it back to private            ACCEPT 4KB


         Scenario 2                           Effort
  (1) Guest ACCEPT at 4MB                   DEMOTE, ACCEPT 4MB         
  (2) converts a 4KB page to shared
  (3) convert it back to private            ACCEPT 4KB


In step (3) of "Scenario 1", VMM will not map the page at 2MB according to the
current implementation because PROMOTION requires a uniform ACCEPT status
across all 512 4KB pages to succeed.

> returns PAGE_SIZE_MATCH and enforces guest to accept a bigger size. what a
> stupid VMM.
I agree with Sean that if the guest doesn't want to accept at a bigger size
for certain reasons (e.g. it thinks it's unsafe or considers it an attack),
invoking an explicit TDVMCALL may be a better approach.

> Anyway, I'm just expressing how I understand the current design and I think
> it's reasonable. And I don't object the idea to return ACCEPT_SIZE_MISMATCH
> for "too small" case, but it's needs to be guest opt-in, i.e., let guest
> itself chooses the behavior.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 6 months ago
On Fri, Jun 13, 2025 at 11:29:39AM +0800, Yan Zhao wrote:
> On Fri, Jun 13, 2025 at 10:41:21AM +0800, Xiaoyao Li wrote:
> > On 6/11/2025 10:42 PM, Sean Christopherson wrote:
> > > On Tue, May 20, 2025, Kai Huang wrote:
> > > > On Tue, 2025-05-20 at 17:34 +0800, Zhao, Yan Y wrote:
> > > > > On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
> > > > > > On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > > > > > > > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K
> > > > > > > > accept order, e.g., they always accept 4K, there could be *endless
> > > > > > > > EPT violation* if I understand your words correctly.
> > > > > > > > 
> > > > > > > > Isn't this yet-another reason we should choose to return PG_LEVEL_4K
> > > > > > > > instead of 2M if no accept level is provided in the fault?
> > > > > > > As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> > > > > > > TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
> > > > > > 
> > > > > > TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
> > > > > > docs say the VMM needs to demote *if* the mapping is large and the accept size
> > > > > > is small.
> > > 
> > > No thanks, fix the spec and the TDX Module.  Punting an error to the VMM is
> > > inconsistent, convoluted, and inefficient.
> > > 
> > > Per "Table 8.2: TDG.MEM.PAGE.ACCEPT SEPT Walk Cases":
> > > 
> > >    S-EPT state         ACCEPT vs. Mapping Size         Behavior
> > >    Leaf SEPT_PRESENT   Smaller                         TDACCEPT_SIZE_MISMATCH
> > >    Leaf !SEPT_PRESENT  Smaller                         EPT Violation <=========================|
> > >    Leaf DONT_CARE      Same                            Success                                 | => THESE TWO SHOULD MATCH!!!
> > >    !Leaf SEPT_FREE     Larger                          EPT Violation, BECAUSE THERE'S NO PAGE  |
> > >    !Leaf SEPT_FREE     Larger                          TDACCEPT_SIZE_MISMATCH <================|
> > > 
> > > 
> > > If ACCEPT is "too small", an EPT violation occurs.  But if ACCEPT is "too big",
> > > a TDACCEPT_SIZE_MISMATCH error occurs.  That's asinine.
> > > 
> > > The only reason that comes to mind for punting the "too small" case to the VMM
> > > is to try and keep the guest alive if the VMM is mapping more memory than has
> > > been enumerated to the guest.  E.g. if the guest suspects the VMM is malicious
> > > or buggy.  IMO, that's a terrible reason to push this much complexity into the
> > > host.  It also risks godawful boot times, e.g. if the guest kernel is buggy and
> > > accepts everything at 4KiB granularity.
> > > 
> > > The TDX Module should return TDACCEPT_SIZE_MISMATCH and force the guest to take
> > > action, not force the hypervisor to limp along in a degraded state.  If the guest
> > > doesn't want to ACCEPT at a larger granularity, e.g. because it doesn't think the
> > > entire 2MiB/1GiB region is available, then the guest can either log a warning and
> > > "poison" the page(s), or terminate and refuse to boot.
> > > 
> > > If for some reason the guest _can't_ ACCEPT at larger granularity, i.e. if the
> > > guest _knows_ that 2MiB or 1GiB is available/usable but refuses to ACCEPT at the
> > > appropriate granularity, then IMO that's firmly a guest bug.
> > 
> > It might just be guest doesn't want to accept a larger level instead of
> > can't. Use case see below.
> > 
> > > If there's a *legitimate* use case where the guest wants to ACCEPT a subset of
> > > memory, then there should be an explicit TDCALL to request that the unwanted
> > > regions of memory be unmapped.  Smushing everything into implicit behavior has
> > > obvioulsy created a giant mess.
> > 
> > Isn't the ACCEPT with a specific level explicit? Note that ACCEPT is not
> > only for the case that VMM has already mapped page and guest only needs to
> > accept it to make it available, it also works for the case that guest
> > requests VMM to map the page for a gpa (at specific level) then guest
> > accepts it.
To avoid confusion, here's the full new design:

1. When an EPT violation carries ACCEPT level info
   (This occurs when the TD performs ACCEPT before it accesses memory),
   KVM maps the page at a map level <= the specified level.
   The guest's ACCEPT will succeed, or will return PAGE_SIZE_MISMATCH if the
   map level < the specified level.

2. When an EPT violation does not carry ACCEPT level info
   (This occurs when the TD accesses memory before invoking ACCEPT),

   1) if the TD is configured to always accept VMM's map level,
      KVM allows mapping at 2MB.
      The TD's later 4KB ACCEPT will return PAGE_SIZE_MISMATCH.
      The TD can either retry with a 2MB ACCEPT or explicitly invoke a
      TDVMCALL for demotion.
   2) if the TD is not configured to always accept VMM's map level,
      KVM always maps at 4KB.
      The TD's 2MB ACCEPT will return PAGE_SIZE_MISMATCH.

Please let me know if anything does not look right.
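
As a sketch of the two cases above (the accept-level parsing helper and the
always-accept flag are illustrative names, not code from this series):

        static u8 tdx_fault_max_level(struct kvm_vcpu *vcpu, u64 ext_exit_qual)
        {
                /* Hypothetical: extract the ACCEPT level, 0 if none supplied. */
                int accept_level = tdx_accept_level(ext_exit_qual);

                /* Case 1: an ACCEPT-triggered violation carries a level. */
                if (accept_level)
                        return min_t(int, accept_level, PG_LEVEL_2M);

                /* Case 2: plain memory access, no level info. */
                if (to_kvm_tdx(vcpu->kvm)->always_accept)  /* hypothetical flag */
                        return PG_LEVEL_2M;

                return PG_LEVEL_4K;
        }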

> > Even for the former case, it is understandable for behaving differently for
> > the "too small" and "too big" case. If the requested accept level is "too
> > small", VMM can handle it by demoting the page to satisfy guest. But when
> > the level is "too big", usually the VMM cannot map the page at a higher
> > level so that ept violation cannot help. I admit that it leads to the
> > requirement that VMM should always try to map the page at the highest
> > available level, if the EPT violation is not caused by ACCEPT which contains
> > a desired mapping level.
> > 
> > As for the scenario, the one I can think of is, guest is trying to convert a
> > 4K sized page between private and shared constantly, for testing purpose.
> > Guest knows that if accepting the gpa at higher level, it takes more time.
> > And when convert it to shared, it triggers DEMOTE and more time. So for
> > better performance, guest just calls ACCEPT with 4KB page. However, VMM
> Hmm, ACCEPT at 4KB level at the first time triggers DEMOTE already.
> So, I don't see how ACCEPT at 4KB helps performance.
Hmm, I sent that too fast previously. Some corrections below:

> Support VMM has mapped a page at 4MB,
Suppose VMM has mapped a page at 2MB when an EPT violation (triggered by TD
memory access instead of by TD ACCEPT) does not carry ACCEPT level info,

> 
>          Scenario 1                           Effort
>   (1) Guest ACCEPT at 2MB                   ACCEPT 2MB         
>   (2) converts a 4KB page to shared         DEMOTE
>   (3) convert it back to private            ACCEPT 4KB
> 
> 
>          Scenario 2                           Effort
>   (1) Guest ACCEPT at 4MB                   DEMOTE, ACCEPT 4MB         
    (1) Guest ACCEPT at 4KB                   DEMOTE, ACCEPT 4KB
>   (2) converts a 4KB page to shared
>   (3) convert it back to private            ACCEPT 4KB
> 
> 
> In step (3) of "Scenario 1", VMM will not map the page at 2MB according to the
> current implementation because PROMOTION requires uniform ACCEPT status across
> all 512 4KB pages to be succeed.
> 
> > returns PAGE_SIZE_MATCH and enforces guest to accept a bigger size. what a
> > stupid VMM.
> I agree with Sean that if guest doesn't want to accept at a bigger size for
> certain reasons (e.g. it thinks it's unsafe or consider it as an attack),
> invoking an explicit TDVMCALL may be a better approach.
> 
> > Anyway, I'm just expressing how I understand the current design and I think
> > it's reasonable. And I don't object the idea to return ACCEPT_SIZE_MISMATCH
> > for "too small" case, but it's needs to be guest opt-in, i.e., let guest
> > itself chooses the behavior.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Xiaoyao Li 6 months ago
On 6/13/2025 1:35 PM, Yan Zhao wrote:
> To avoid confusion, here's the full new design:
> 
> 1.when an EPT violation carries an ACCEPT level info
>    (This occurs when TD performs ACCEPT before it accesses memory),
>    KVM maps the page at map level <= the specified level.
>    Guest's ACCEPT will succeed or return PAGE_SIZE_MATCH if map level < the
>    specified level.
> 
> 2.when an EPT violation does not carry ACCEPT level info
>    (This occurs when TD accesses memory before invoking ACCEPT),
> 
>    1) if the TD is configured to always accept VMM's map level,
>       KVM allows to map at 2MB.
>       TD's later 4KB ACCEPT will return PAGE_SIZE_MATCH.
>       TD can either retry with 2MB ACCEPT or explictly invoke a TDVMCALL for
>       demotion.
>    2) if the TD is not configured to always accept VMM's map level,
>       KVM always maps at 4KB.

Is this the decision derived from the discussion of this series, to keep
the design simple and avoid demotion on ACCEPT?

It looks like KVM's own design preference: if the TD doesn't opt in to the
proposed new feature "always accept VMM's map level", the only way it can
get a page mapped as a hugepage in the EPT is to always try to accept the
page before first access, starting from the biggest page size.

I'm OK with it.

>       TD's 2MB ACCEPT will return PAGE_SIZE_MATCH.
> 
> Please let me know if anything does not look right.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 6 months ago
On Wed, 2025-06-11 at 07:42 -0700, Sean Christopherson wrote:
> If there's a *legitimate* use case where the guest wants to ACCEPT a subset of
> memory, then there should be an explicit TDCALL to request that the unwanted
> regions of memory be unmapped.  Smushing everything into implicit behavior has
> obvioulsy created a giant mess.

Hi, still digging into whether there is any possible use.

I think this may need a guest opt-in, so the guest can say it can handle
errors for both smaller and larger page size mismatches. So it may not matter
whether there is a rare usage or not. If KVM finds the guest opts in (how to
do that is TBD), it can start mapping at the host level. If KVM doesn't see
the opt-in, the guest gets 4k pages.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Sean Christopherson 6 months ago
On Thu, Jun 12, 2025, Rick P Edgecombe wrote:
> On Wed, 2025-06-11 at 07:42 -0700, Sean Christopherson wrote:
> > If there's a *legitimate* use case where the guest wants to ACCEPT a subset of
> > memory, then there should be an explicit TDCALL to request that the unwanted
> > regions of memory be unmapped.  Smushing everything into implicit behavior has
> > obvioulsy created a giant mess.
> 
> Hi, still digging on if there is any possible use.
> 
> I think this may need a guest opt-in, so the guest can say it can handle errors
> for both smaller and larger page size matches. So it may not matter if there is
> a rare usage or not. If KVM finds the guest opts-in (how to do that TBD), it can
> start mapping at the host level. 

Hmm, clever.  That should work; requiring an updated guest kernel to get optimal
performance doesn't seem too onerous.

> If KVM doesn't see the opt-in, the guest gets 4k pages.

As in, KVM doesn't even try to use hugepage mappings?  If so, this idea probably
gets my vote.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 6 months ago
On Thu, 2025-06-12 at 17:19 -0700, Sean Christopherson wrote:
> > I think this may need a guest opt-in, so the guest can say it can handle
> > errors
> > for both smaller and larger page size matches. So it may not matter if there
> > is
> > a rare usage or not. If KVM finds the guest opts-in (how to do that TBD), it
> > can
> > start mapping at the host level. 
> 
> Hmm, clever.  That should work; requiring an updated guest kernel to get
> optimal
> performance doesn't seem too onerous.
> 
> > If KVM doesn't see the opt-in, the guest gets 4k pages.
> 
> As in, KVM doesn't even try to use hugepage mappings?  If so, this idea
> probably
> gets my vote.

Maybe an "I can handle it" accept size bit that comes in the exit qualification?

Yan, do you see any problems with that? Like if a guest passed it in some accept
and not others? Thinking about the new "unaccept" SEAMCALL...
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Sean Christopherson 6 months ago
On Fri, Jun 13, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-06-12 at 17:19 -0700, Sean Christopherson wrote:
> > > I think this may need a guest opt-in, so the guest can say it can handle
> > > errors for both smaller and larger page size matches. So it may not
> > > matter if there is a rare usage or not. If KVM finds the guest opts-in
> > > (how to do that TBD), it can start mapping at the host level. 
> > 
> > Hmm, clever.  That should work; requiring an updated guest kernel to get
> > optimal performance doesn't seem too onerous.
> > 
> > > If KVM doesn't see the opt-in, the guest gets 4k pages.
> > 
> > As in, KVM doesn't even try to use hugepage mappings?  If so, this idea
> > probably gets my vote.
> 
> Maybe an "I can handle it" accept size bit that comes in the exit qualification?

Eww, no.  Having to react on _every_ EPT violation would be annoying, and trying
to debug issues where the guest is mixing options would probably be a nightmare.

I was thinking of something along the lines of an init-time or boot-time opt-in.

> Yan, do you see any problems with that? Like if a guest passed it in some accept
> and not others? Thinking about the new "unaccept" SEAMCALL...
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 6 months ago
On Thu, 2025-06-12 at 17:44 -0700, Sean Christopherson wrote:
> > Maybe an "I can handle it" accept size bit that comes in the exit
> > qualification?
> 
> Eww, no.  Having to react on _every_ EPT violation would be annoying, and
> trying
> to debug issues where the guest is mixing options would probably be a
> nightmare.
> 
> I was thinking of something along the lines of an init-time or boot-time opt-
> in.

Fair.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 6 months ago
On Fri, Jun 13, 2025 at 08:47:28AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-06-12 at 17:44 -0700, Sean Christopherson wrote:
> > > Maybe an "I can handle it" accept size bit that comes in the exit
> > > qualification?

Dynamically turning on "I can handle it" would still require supporting
demotion in the fault path, because there may be existing huge pages when an
EPT violation without "I can handle it" arrives.

> > Eww, no.  Having to react on _every_ EPT violation would be annoying, and
> > trying
> > to debug issues where the guest is mixing options would probably be a
> > nightmare.
> > 
> > I was thinking of something along the lines of an init-time or boot-time opt-
> > in.
> 
> Fair.

Agreed.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 6 months ago
On Fri, 2025-06-13 at 09:32 +0800, Yan Zhao wrote:
> > > Eww, no.  Having to react on _every_ EPT violation would be annoying, and
> > > trying
> > > to debug issues where the guest is mixing options would probably be a
> > > nightmare.
> > > 
> > > I was thinking of something along the lines of an init-time or boot-time
> > > opt-
> > > in.
> > 
> > Fair.
> 
> Agreed.

Arg, I just realized a one-way opt-in will have a theoretical gap. If the
guest kexec's, the new kernel will need to match the opt-in.

A full solution could allow a later opt-out that is handled by the VMM by
shattering all page tables. But it starts to get too complex, I think.
Especially since Linux guests already try to accept in the order
1GB->2MB->4k, in practice we are already worrying about correctness rather
than functional issues. Maybe we just ignore it.

Otherwise, we currently have the following requirements, I think:
1. One-way guest opt-in to the new TDG.MEM.PAGE.ACCEPT behavior
2. Some notification to KVM that the guest has opted in.
3. After opt-in, TDG.MEM.PAGE.ACCEPT will return TDX_PAGE_SIZE_MISMATCH if
the mapping is too small or too big

Thinking about how we would like the notification... Maybe we could have the
actual behavior controlled by the host, with some GHCI-like communication
such as a TDVMCALL. The TDVMCALL (or similar) could be handled within KVM.
Basically, just call the host side opt-in.

The reason to have it host controllable is that, as above, the new behavior
should be fine for a normal Linux guest. A host user controlled opt-in could
be useful for anyone that wants to run huge pages with old guest kernels. A
kvm module param, maybe; see the sketch below.

If this sounds good, I'll get the TDX module's input and come back with a
specific spec.
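
The host knob could be as simple as a module parameter -- a sketch, with a
made-up name; how it feeds into the TD's configuration is left out:

        /* Sketch: host-side opt-in to the new TDG.MEM.PAGE.ACCEPT behavior. */
        static bool __read_mostly tdx_strict_accept_size;
        module_param(tdx_strict_accept_size, bool, 0444);
        MODULE_PARM_DESC(tdx_strict_accept_size,
                         "Opt TDs in to TDX_PAGE_SIZE_MISMATCH for too-small ACCEPTs");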
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Sean Christopherson 6 months ago
On Fri, Jun 13, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-06-13 at 09:32 +0800, Yan Zhao wrote:
> > > > Eww, no.  Having to react on _every_ EPT violation would be annoying,
> > > > and trying to debug issues where the guest is mixing options would
> > > > probably be a nightmare.
> > > > 
> > > > I was thinking of something along the lines of an init-time or
> > > > boot-time opt- in.
> > > 
> > > Fair.
> > 
> > Agreed.
> 
> Arg, I just realized a one-way opt-in will have a theoretical gap. If the guest
> kexec's, the new kernel will need to match the opt-in.

All the more reason to make this a property of the VM that is passed via
"struct td_params".  I.e. put the onus on the owner of the VM to ensure their
kernel(s) have been updated accordingly.

I understand that this could be painful, but honestly _all_ of TDX and SNP is
painful for the guest.  E.g. I don't think it's any worse than the security
issues with TDX (and SNP) guests using kvmclock (which I'd love some reviews on,
btw).

https://lore.kernel.org/all/20250227021855.3257188-35-seanjc@google.com
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 6 months ago
On Fri, 2025-06-13 at 15:19 -0700, Sean Christopherson wrote:
> > Arg, I just realized a one-way opt-in will have a theoretical gap. If the
> > guest
> > kexec's, the new kernel will need to match the opt-in.
> 
> All the more reason to make this a property of the VM that is passed via
> "struct td_params".  I.e. put the onus on the owner of the VM to ensure their
> kernel(s) have been updated accordingly.

Hmm, it gives me pause. At minimum it should have an enumeration to the guest.

> 
> I understand that this could be painful, but honestly _all_ of TDX and SNP is
> painful for the guest.  E.g. I don't think it's any worse than the security
> issues with TDX (and SNP) guests using kvmclock (which I'd love some reviews
> on,
> btw).
> 
> https://lore.kernel.org/all/20250227021855.3257188-35-seanjc@google.com

Oh, nice. I hadn't seen this. Agree that a comprehensive guest setup is quite
manual. But here we are playing with guest ABI. In practice, yes it's similar to
passing yet another arg to get a good TD.

We can start with a prototype of the host side arg and see how it turns out.
I realized we need to verify edk2 as well.

Thanks Sean.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 6 months ago
On Sat, Jun 14, 2025 at 07:33:48AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2025-06-13 at 15:19 -0700, Sean Christopherson wrote:
> > > Arg, I just realized a one-way opt-in will have a theoretical gap. If the
> > > guest
> > > kexec's, the new kernel will need to match the opt-in.
> > 
> > All the more reason to make this a property of the VM that is passed via
> > "struct td_params".  I.e. put the onus on the owner of the VM to ensure their
> > kernel(s) have been updated accordingly.
> 
> Hmm, it gives me pause. At minimum it should have an enumeration to the guest.
> 
> > 
> > I understand that this could be painful, but honestly _all_ of TDX and SNP is
> > painful for the guest.  E.g. I don't think it's any worse than the security
> > issues with TDX (and SNP) guests using kvmclock (which I'd love some reviews
> > on,
> > btw).
> > 
> > https://lore.kernel.org/all/20250227021855.3257188-35-seanjc@google.com
> 
> Oh, nice. I hadn't seen this. Agree that a comprehensive guest setup is quite
> manual. But here we are playing with guest ABI. In practice, yes it's similar to
> passing yet another arg to get a good TD.
Could we introduce a TD attr TDX_ATTR_SEPT_EXPLICIT_DEMOTION?

It could be something similar to TDX_ATTR_SEPT_VE_DISABLE, except that we
wouldn't provide a dynamic toggle like TDCS_CONFIG_FLEXIBLE_PENDING_VE, which
allows the guest to turn SEPT_VE_DISABLE on/off.
(See disable_sept_ve() in ./arch/x86/coco/tdx/tdx.c.)

So, if userspace configures a TD with TDX_ATTR_SEPT_EXPLICIT_DEMOTION, KVM
first checks whether SEPT_EXPLICIT_DEMOTION is supported.
The guest can also check the attribute to decide whether to continue or shut
down. (If it does not check SEPT_EXPLICIT_DEMOTION, e.g., if we don't want to
update EDK2, the guest must accept memory before accessing it.)

- if the TD is configured with SEPT_EXPLICIT_DEMOTION, KVM allows mapping at
  2MB when there's no level info in an EPT violation. The guest must accept
  memory before accessing it; if it wants to accept only a part of the host's
  mapping, it needs to explicitly invoke a TDVMCALL to request that KVM
  perform page demotion.

- if the TD is configured without SEPT_EXPLICIT_DEMOTION, KVM always maps at
  4KB when there's no level info in an EPT violation.

- regardless of whether SEPT_EXPLICIT_DEMOTION is configured, if there's
  level info in an EPT violation, KVM honors it as the max_level info but
  ignores any demotion request in the fault path. (A sketch follows below.)
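
A minimal sketch of the attribute plumbing; the bit position is a placeholder
and the helpers are made-up names (kvm_tdx->attributes does exist):

        #define TDX_ATTR_SEPT_EXPLICIT_DEMOTION  BIT_ULL(28)  /* placeholder */

        static bool tdx_explicit_demotion(struct kvm *kvm)
        {
                return to_kvm_tdx(kvm)->attributes & TDX_ATTR_SEPT_EXPLICIT_DEMOTION;
        }

        /* Max level for an EPT violation that carries no ACCEPT level info. */
        static u8 tdx_no_level_info_max_level(struct kvm *kvm)
        {
                return tdx_explicit_demotion(kvm) ? PG_LEVEL_2M : PG_LEVEL_4K;
        }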

> We can start with a prototype the host side arg and see how it turns out. I
> realized we need to verify edk2 as well.
Current EDK2 should always accept pages before actual memory access.
So, I think it should be fine.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 6 months ago
On Mon, 2025-06-16 at 11:14 +0800, Yan Zhao wrote:
> > Oh, nice. I hadn't seen this. Agree that a comprehensive guest setup is
> > quite
> > manual. But here we are playing with guest ABI. In practice, yes it's
> > similar to
> > passing yet another arg to get a good TD.
> Could we introduce a TD attr TDX_ATTR_SEPT_EXPLICIT_DEMOTION?
> 
> It can be something similar to TDX_ATTR_SEPT_VE_DISABLE except that we don't
> provide a dynamical way as the TDCS_CONFIG_FLEXIBLE_PENDING_VE to allow guest
> to
> turn on/off SEPT_VE_DISABLE.
> (See the disable_sept_ve() in ./arch/x86/coco/tdx/tdx.c).
> 
> So, if userspace configures a TD with TDX_ATTR_SEPT_EXPLICIT_DEMOTION, KVM
> first
> checks if SEPT_EXPLICIT_DEMOTION is supported.
> The guest can also check if it would like to support SEPT_EXPLICIT_DEMOTION to
> determine to continue or shut down. (If it does not check
> SEPT_EXPLICIT_DEMOTION,
> e.g., if we don't want to update EDK2, the guest must accept memory before
> memory accessing).
> 
> - if TD is configured with SEPT_EXPLICIT_DEMOTION, KVM allows to map at 2MB
> when
>   there's no level info in an EPT violation. The guest must accept memory
> before
>   accessing memory or if it wants to accept only a partial of host's mapping,
> it
>   needs to explicitly invoke a TDVMCALL to request KVM to perform page
> demotion.
> 
> - if TD is configured without SEPT_EXPLICIT_DEMOTION, KVM always maps at 4KB
>   when there's no level info in an EPT violation.
> 
> - No matter SEPT_EXPLICIT_DEMOTION is configured or not, if there's a level
> info
>   in an EPT violation, while KVM honors the level info as the max_level info,
>   KVM ignores the demotion request in the fault path.

I think this is what Sean was suggesting. We are going to need a qemu command
line opt-in too.

> 
> > We can start with a prototype the host side arg and see how it turns out. I
> > realized we need to verify edk2 as well.
> Current EDK2 should always accept pages before actual memory access.
> So, I think it should be fine.

It's not just that; it needs to handle the accept page size being lower than
the mapping size. I went and looked, and it is accepting at 4k size in places.
Hopefully that is just handling accepting a whole range that is not 2MB
aligned, but I think we need to verify this more.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 6 months ago
On Tue, Jun 17, 2025 at 06:49:00AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2025-06-16 at 11:14 +0800, Yan Zhao wrote:
> > > Oh, nice. I hadn't seen this. Agree that a comprehensive guest setup is
> > > quite
> > > manual. But here we are playing with guest ABI. In practice, yes it's
> > > similar to
> > > passing yet another arg to get a good TD.
> > Could we introduce a TD attr TDX_ATTR_SEPT_EXPLICIT_DEMOTION?
> > 
> > It can be something similar to TDX_ATTR_SEPT_VE_DISABLE except that we don't
> > provide a dynamical way as the TDCS_CONFIG_FLEXIBLE_PENDING_VE to allow guest
> > to
> > turn on/off SEPT_VE_DISABLE.
> > (See the disable_sept_ve() in ./arch/x86/coco/tdx/tdx.c).
> > 
> > So, if userspace configures a TD with TDX_ATTR_SEPT_EXPLICIT_DEMOTION, KVM
> > first
> > checks if SEPT_EXPLICIT_DEMOTION is supported.
> > The guest can also check if it would like to support SEPT_EXPLICIT_DEMOTION to
> > determine to continue or shut down. (If it does not check
> > SEPT_EXPLICIT_DEMOTION,
> > e.g., if we don't want to update EDK2, the guest must accept memory before
> > memory accessing).
> > 
> > - if TD is configured with SEPT_EXPLICIT_DEMOTION, KVM allows to map at 2MB
> > when
> >   there's no level info in an EPT violation. The guest must accept memory
> > before
> >   accessing memory or if it wants to accept only a partial of host's mapping,
> > it
> >   needs to explicitly invoke a TDVMCALL to request KVM to perform page
> > demotion.
> > 
> > - if TD is configured without SEPT_EXPLICIT_DEMOTION, KVM always maps at 4KB
> >   when there's no level info in an EPT violation.
> > 
> > - No matter SEPT_EXPLICIT_DEMOTION is configured or not, if there's a level
> > info
> >   in an EPT violation, while KVM honors the level info as the max_level info,
> >   KVM ignores the demotion request in the fault path.
> 
> I think this is what Sean was suggesting. We are going to need a qemu command
> line opt-in too.
> 
> > 
> > > We can start with a prototype the host side arg and see how it turns out. I
> > > realized we need to verify edk2 as well.
> > Current EDK2 should always accept pages before actual memory access.
> > So, I think it should be fine.
> 
> It's not just that, it needs to handle the the accept page size being lower than
> the mapping size. I went and looked and it is accepting at 4k size in places. It
As it accepts pages before memory access, the "accept page size being lower
than the mapping size" case can't happen.

> hopefully is just handling accepting a whole range that is not 2MB aligned. But
> I think we need to verify this more.
Ok.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 6 months ago
On Tue, 2025-06-17 at 08:52 +0800, Yan Zhao wrote:
> > hopefully is just handling accepting a whole range that is not 2MB aligned.
> > But
> > I think we need to verify this more.
> Ok.

In a Linux guest, if a memory region is not 2MB aligned, the guest will
accept the ends at 4k size. If a memory region is identical to a memslot
range, this will be fine: KVM will map the ends at 4k because it won't let
huge pages span a memslot. But if several memory regions are not 2MB aligned
and are covered by one large memslot, the accept will fail on the 4k ends
under this proposal. I don't know if this is a common configuration, but
covering it in the TDX guest may not be trivial.

So I think this will only work if guests can reasonably "merge" all of the
adjacent accepts, or if we declare a bunch of memory/memslot layouts illegal.

Kirill, how difficult would it be for a TDX Linux guest to merge all 2MB-adjacent
accepts?
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Shutemov, Kirill 6 months ago
On Wed, Jun 18, 2025 at 04:22:59AM +0300, Edgecombe, Rick P wrote:
> On Tue, 2025-06-17 at 08:52 +0800, Yan Zhao wrote:
> > > hopefully is just handling accepting a whole range that is not 2MB aligned.
> > > But
> > > I think we need to verify this more.
> > Ok.
> 
> In Linux guest if a memory region is not 2MB aligned the guest will accept the
> ends at 4k size. If a memory region is identical to a memslot range this will be
> fine. KVM will map the ends at 4k because it won't let huge pages span a
> memslot. But if several memory regions are not 2MB aligned and are covered by
> one large memslot, the accept will fail on the 4k ends under this proposal. I
> don't know if this is a common configuration, but to cover it in the TDX guest
> may not be trivial.
> 
> So I think this will only work if guests can reasonably "merge" all of the
> adjacent accepts. Or if we declare a bunch of memory/memslot layouts illegal.
> 
> Kirill, how difficult would it be for TDX Linux guest to merge all 2MB adjacent
> accepts?

Hm. What do you mean by merging?

Kernel only accepts at 4k during early boot -- in the EFI stub. The bitmap we
use to track unaccepted memory tracks the status at 2M granularity and
all later accept requests will be issued on 2M pages with fallback to 4k.
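
(A rough sketch of that flow, with tdx_accept_page() standing in for the real
accept primitive; names are illustrative only:)

/*
 * Post-EFI-stub accepts are issued per 2M unit tracked by the
 * unaccepted-memory bitmap, with a 4k fallback if the 2M accept fails.
 */
static void accept_2m_unit(unsigned long phys)
{
	if (tdx_accept_page(phys, PG_LEVEL_2M))
		return;

	for (unsigned long p = phys; p < phys + SZ_2M; p += SZ_4K)
		tdx_accept_page(p, PG_LEVEL_4K);
}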

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Sean Christopherson 5 months, 4 weeks ago
On Wed, Jun 18, 2025, Kirill Shutemov wrote:
> On Wed, Jun 18, 2025 at 04:22:59AM +0300, Edgecombe, Rick P wrote:
> > On Tue, 2025-06-17 at 08:52 +0800, Yan Zhao wrote:
> > > > hopefully is just handling accepting a whole range that is not 2MB aligned.
> > > > But
> > > > I think we need to verify this more.
> > > Ok.
> > 
> > In Linux guest if a memory region is not 2MB aligned the guest will accept the

What is a "memory region" in this context?  An e820 region?  Something else?

> > ends at 4k size. If a memory region is identical to a memslot range this will be
> > fine. KVM will map the ends at 4k because it won't let huge pages span a
> > memslot. But if several memory regions are not 2MB aligned and are covered by
> > one large memslot, the accept will fail on the 4k ends under this proposal. I
> > don't know if this is a common configuration, but to cover it in the TDX guest
> > may not be trivial.
> > 
> > So I think this will only work if guests can reasonably "merge" all of the
> > adjacent accepts. Or if we declare a bunch of memory/memslot layouts illegal.
> > 
> > Kirill, how difficult would it be for TDX Linux guest to merge all 2MB adjacent
> > accepts?
> 
> Hm. What do you mean by merging?
> 
> Kernel only accepts at 4k during early boot -- in the EFI stub. The bitmap we
> use to track unaccepted memory tracks the status at 2M granularity and
> all later accept requests will be issued on 2M pages with fallback to 4k.
> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Kirill Shutemov 5 months, 4 weeks ago
On Fri, Jun 20, 2025 at 09:32:45AM -0700, Sean Christopherson wrote:
> On Wed, Jun 18, 2025, Kirill Shutemov wrote:
> > On Wed, Jun 18, 2025 at 04:22:59AM +0300, Edgecombe, Rick P wrote:
> > > On Tue, 2025-06-17 at 08:52 +0800, Yan Zhao wrote:
> > > > > hopefully is just handling accepting a whole range that is not 2MB aligned.
> > > > > But
> > > > > I think we need to verify this more.
> > > > Ok.
> > > 
> > > In Linux guest if a memory region is not 2MB aligned the guest will accept the
> 
> What is a "memory region" in this context?  An e820 region?  Something else?

EFI memory map entry.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Sean Christopherson 5 months, 4 weeks ago
On Fri, Jun 20, 2025, Kirill Shutemov wrote:
> On Fri, Jun 20, 2025 at 09:32:45AM -0700, Sean Christopherson wrote:
> > On Wed, Jun 18, 2025, Kirill Shutemov wrote:
> > > On Wed, Jun 18, 2025 at 04:22:59AM +0300, Edgecombe, Rick P wrote:
> > > > On Tue, 2025-06-17 at 08:52 +0800, Yan Zhao wrote:
> > > > > > hopefully is just handling accepting a whole range that is not 2MB aligned.
> > > > > > But
> > > > > > I think we need to verify this more.
> > > > > Ok.
> > > > 
> > > > In Linux guest if a memory region is not 2MB aligned the guest will accept the
> > 
> > What is a "memory region" in this context?  An e820 region?  Something else?
> 
> EFI memory map entry.

I forget, for TDX, is the EFI map built by guest firmware or by the VMM?
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Kirill Shutemov 5 months, 4 weeks ago
On Fri, Jun 20, 2025 at 11:40:31AM -0700, Sean Christopherson wrote:
> On Fri, Jun 20, 2025, Kirill Shutemov wrote:
> > On Fri, Jun 20, 2025 at 09:32:45AM -0700, Sean Christopherson wrote:
> > > On Wed, Jun 18, 2025, Kirill Shutemov wrote:
> > > > On Wed, Jun 18, 2025 at 04:22:59AM +0300, Edgecombe, Rick P wrote:
> > > > > On Tue, 2025-06-17 at 08:52 +0800, Yan Zhao wrote:
> > > > > > > hopefully is just handling accepting a whole range that is not 2MB aligned.
> > > > > > > But
> > > > > > > I think we need to verify this more.
> > > > > > Ok.
> > > > > 
> > > > > In Linux guest if a memory region is not 2MB aligned the guest will accept the
> > > 
> > > What is a "memory region" in this context?  An e820 region?  Something else?
> > 
> > EFI memory map entry.
> 
> I forget, for TDX, is the EFI map built by guest firmware or by the VMM?

Guest BIOS.

The BIOS would accept some memory on its own (typically the first 4G) and
leave the rest to be accepted by the OS. EFI boot services can also accept
memory on OS request (e.g. on memory allocation), updating the map.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 6 months ago
On Tue, Jun 17, 2025 at 08:52:49AM +0800, Yan Zhao wrote:
> On Tue, Jun 17, 2025 at 06:49:00AM +0800, Edgecombe, Rick P wrote:
> > On Mon, 2025-06-16 at 11:14 +0800, Yan Zhao wrote:
> > > > Oh, nice. I hadn't seen this. Agree that a comprehensive guest setup is
> > > > quite
> > > > manual. But here we are playing with guest ABI. In practice, yes it's
> > > > similar to
> > > > passing yet another arg to get a good TD.
> > > Could we introduce a TD attr TDX_ATTR_SEPT_EXPLICIT_DEMOTION?
> > > 
> > > It could be something similar to TDX_ATTR_SEPT_VE_DISABLE, except that we
> > > don't provide a dynamic way, like TDCS_CONFIG_FLEXIBLE_PENDING_VE, for the
> > > guest to turn SEPT_VE_DISABLE on/off.
> > > (See disable_sept_ve() in ./arch/x86/coco/tdx/tdx.c).
> > > 
> > > So, if userspace configures a TD with TDX_ATTR_SEPT_EXPLICIT_DEMOTION, KVM
> > > first checks if SEPT_EXPLICIT_DEMOTION is supported.
> > > The guest can also check whether it wants to support SEPT_EXPLICIT_DEMOTION
> > > and decide to continue or shut down. (If it does not check
> > > SEPT_EXPLICIT_DEMOTION, e.g., if we don't want to update EDK2, the guest
> > > must accept memory before accessing it).
> > > 
> > > - if the TD is configured with SEPT_EXPLICIT_DEMOTION, KVM allows mapping
> > >   at 2MB when there's no level info in an EPT violation. The guest must
> > >   accept memory before accessing it, or, if it wants to accept only part
> > >   of the host's mapping, it needs to explicitly invoke a TDVMCALL to
> > >   request that KVM perform page demotion.
> > > 
> > > - if the TD is configured without SEPT_EXPLICIT_DEMOTION, KVM always maps
> > >   at 4KB when there's no level info in an EPT violation.
> > > 
> > > - whether or not SEPT_EXPLICIT_DEMOTION is configured, if there is level
> > >   info in an EPT violation, KVM honors it as the max_level info but
> > >   ignores any demotion request in the fault path.
Hi Sean,
Could you please confirm if this matches what you think?
i.e.,

  when an EPT violation carries an ACCEPT level info
  KVM maps the page at map level <= the specified level.
  (If KVM finds a shadow-present leaf SPTE, it will not try to merge/split it.)
  Guest's ACCEPT will succeed, or return PAGE_SIZE_MISMATCH if map level < the
  specified level.

This can keep Linux guests (with SEPT_VE_DISABLE being true) more efficient.
So, for a Linux guest that only wants to accept at 4KB, the flow is
1. guest ACCEPT 4KB
2. KVM maps it at 4KB
3. ACCEPT 4KB returns success

As the ACCEPT comes before KVM actually maps anything, we can avoid the complex
flow:
1. guest ACCEPT 4KB
2. KVM maps it at 2MB
3. ACCEPT 4KB returns PAGE_SIZE_MISMATCH.
4.(a) guest ACCEPT 2MB or
4.(b) guest triggers TDVMCALL to demote
5. KVM demotes the 2MB mapping
6. guest ACCEPT at 4KB
7. ACCEPT 4KB returns success 
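
(Schematically, honoring the ACCEPT level amounts to clamping the max mapping
level in the fault path; the violation_* fields below are from the draft code
discussed later in this thread, simplified here:)

static int tdx_fault_max_level(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	struct vcpu_tdx *tdx = to_tdx(vcpu);

	/* Honor the level the guest passed to TDG.MEM.PAGE.ACCEPT, if any. */
	if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
		return tdx->violation_request_level;

	return PG_LEVEL_2M;
}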

For non-Linux guests (with SEPT_VE_DISABLE being false), I totally agree with
your suggestions!

Thanks
Yan

> > I think this is what Sean was suggesting. We are going to need a qemu command
> > line opt-in too.
> > 
> > > 
> > > > We can start with a prototype of the host-side arg and see how it turns
> > > > out. I realized we need to verify edk2 as well.
> > > Current EDK2 should always accept pages before actual memory access.
> > > So, I think it should be fine.
> > 
> > It's not just that, it needs to handle the accept page size being lower than
> > the mapping size. I went and looked and it is accepting at 4k size in places. It
> As it accepts pages before memory access, the "accept page size being lower than
> the mapping size" can't happen.
> 
> > hopefully is just handling accepting a whole range that is not 2MB aligned. But
> > I think we need to verify this more.
> Ok.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Sean Christopherson 5 months, 4 weeks ago
On Wed, Jun 18, 2025, Yan Zhao wrote:
>   when an EPT violation carries an ACCEPT level info
>   KVM maps the page at map level <= the specified level.

No.  I want KVM to map at the maximal level KVM supports, irrespective of what
the guest's ACCEPT level says.  I.e. I want KVM to be able to completely ignore
the ACCEPT level.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 5 months, 3 weeks ago
On Fri, 2025-06-20 at 09:31 -0700, Sean Christopherson wrote:
> > On Wed, Jun 18, 2025, Yan Zhao wrote:
> > > >    when an EPT violation carries an ACCEPT level info
> > > >    KVM maps the page at map level <= the specified level.
> > 
> > No.  I want KVM to map at the maximal level KVM supports, irrespective of what
> > the guest's ACCEPT level says.  I.e. I want KVM to be able to completely ignore
> > the ACCEPT level.

This is what I was thinking, but I'm starting to think it might not be a good
idea.

The PAGE_SIZE_MISMATCH error code asymmetry is indeed weird. But "accepted" is
in some important ways a type of permission that is controllable by both the
guest and host. To change the ABI and guests such that the permission is still
controlled by the host and guest, but the allowed granularity is only
controllable by the host, feels wrong in a couple ways.

First, it turns host mapping details into guest ABI that could break guests that
rely on it. Second, it bets that there will never be a need for guests to set
the accept state on a specific smaller granularity. Otherwise, this path would 
just be a temporary shortcut and not about components imposing things that are
none of their business.

Instead I think the two impositions that matter here are:
1. TDX requires size to be passed through the generic fault handler somehow.
2. TDX demote is hard to make work under mmu read lock (already working on this
one)

Sean, were the two options for (1) really that bad? Or how do you think about
changing directions in general and we can try to find some other options?

On the subject of alternates to (1). I wonder if the ugly part is that both of
the options sort of break the KVM model where the TDP is not the real backing
state. TDG.MEM.PAGE.ACCEPT is kind of two things, changing the "permission" of
the memory *and* the mapping of it. TDX module asks, map this at this page size
so that I can map it at the right permission. KVM would rather learn the
permission from the backing GPA info (memslots, etc.) and then map it at its
correct page size. Like what happens with kvm_lpage_info->disallow_lpage.

Maybe we could have EPT violations that contain 4k accept sizes first update the
attribute for the GFN to be accepted or not, like have tdx.c call out to set
kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
like that. Maybe set an "accepted" attribute, or something. Not sure if it could
be done without the mmu write lock... But it might fit KVM better?
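
(For context, a compressed sketch of the existing disallow_lpage pattern being
referenced; it mirrors the KVM_LPAGE_MIXED_FLAG handling in
arch/x86/kvm/mmu/mmu.c:)

/*
 * The low bits of disallow_lpage are a count, the high bits are flags;
 * any nonzero value at a given level keeps KVM from installing a huge
 * page there.
 */
#define KVM_LPAGE_MIXED_FLAG	BIT(31)

static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
				int level)
{
	return lpage_info_slot(gfn, slot, level)->disallow_lpage &
	       KVM_LPAGE_MIXED_FLAG;
}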
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 5 months, 3 weeks ago
On Tue, Jun 24, 2025 at 05:44:17AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2025-06-20 at 09:31 -0700, Sean Christopherson wrote:
> > > On Wed, Jun 18, 2025, Yan Zhao wrote:
> > > > >    when an EPT violation carries an ACCEPT level info
> > > > >    KVM maps the page at map level <= the specified level.
> > > 
> > > No.  I want KVM to map at the maximal level KVM supports, irrespective of what
> > > the guest's ACCEPT level says.  I.e. I want KVM to be able to completely ignore
> > > the ACCEPT level.
> 
> This is what I was thinking, but I'm starting to think it might not be a good
> idea.
> 
> The PAGE_SIZE_MISMATCH error code asymmetry is indeed weird. But "accepted" is
> in some important ways a type of permission that is controllable by both the
> guest and host. To change the ABI and guests such that the permission is still
> controlled by the host and guest, but the allowed granularity is only
> controllable by the host, feels wrong in a couple ways.
> 
> First, it turns host mapping details into guest ABI that could break guests that
> rely on it. Second, it bets that there will never be a need for guests to set
> the accept state on a specific smaller granularity. Otherwise, this path would 
> just be a temporary shortcut and not about components imposing things that are
> none of their business.
> 
> Instead I think the two impositions that matter here are:
> 1. TDX requires size to be passed through the generic fault handler somehow.
> 2. TDX demote is hard to make work under mmu read lock (already working on this
> one)
> 
> Sean, were the two options for (1) really that bad? Or how do you think about
> changing directions in general and we can try to find some other options?
> 
> On the subject of alternates to (1). I wonder if the ugly part is that both of
> the options sort of break the KVM model where the TDP is not the real backing
> state. TDG.MEM.PAGE.ACCEPT is kind of two things, changing the "permission" of
> the memory *and* the mapping of it. TDX module asks, map this at this page size
> so that I can map it at the right permission. KVM would rather learn the
> permission from the backing GPA info (memslots, etc.) and then map it at its
> correct page size. Like what happens with kvm_lpage_info->disallow_lpage.
Could we provide the info via the private_max_mapping_level hook (i.e. via
tdx_gmem_private_max_mapping_level())?

Or what about introducing a vendor hook in __kvm_mmu_max_mapping_level() for a
private fault?

> Maybe we could have EPT violations that contain 4k accept sizes first update the
> attribute for the GFN to be accepted or not, like have tdx.c call out to set
> kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
Something like kvm_lpage_info->disallow_lpage would disallow later page
promotion, though we don't support it right now.

> like that. Maybe set an "accepted" attribute, or something. Not sure if it could be
Setting "accepted" attribute in the EPT violation handler?
It's a little odd, as the accept operation is not yet completed.

> done without the mmu write lock... But it might fit KVM better?
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 5 months, 3 weeks ago
On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> Could we provide the info via the private_max_mapping_level hook (i.e. via
> tdx_gmem_private_max_mapping_level())?

This is one of the previous two methods discussed. Can you elaborate on what you
are trying to say?

> 
> Or what about introducing a vendor hook in __kvm_mmu_max_mapping_level() for a
> private fault?
> 
> > Maybe we could have EPT violations that contain 4k accept sizes first update the
> > attribute for the GFN to be accepted or not, like have tdx.c call out to set
> > kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
> Something like kvm_lpage_info->disallow_lpage would disallow later page
> promotion, though we don't support it right now.

Well I was originally thinking it would not set kvm_lpage_info->disallow_lpage
directly, but rely on the logic that checks for mixed attributes. But more
below...

> 
> > like that. Maybe set an "accepted" attribute, or something. Not sure if it could be
> Setting "accepted" attribute in the EPT violation handler?
> It's a little odd, as the accept operation is not yet completed.

I guess the question in both of these comments is: what is the life cycle. Guest
could call TDG.MEM.PAGE.RELEASE to unaccept it as well. Oh, geez. It looks like
TDG.MEM.PAGE.RELEASE will give the same size hints in the EPT violation. So an
accept attribute is not going to work, at least without TDX module changes.


Actually, the problem we have doesn't fit the mixed attributes behavior. If many
vCPUs accept a 2MB region at 4k page size, the entire 2MB range could be non-
mixed and then individual accepts would fail.


So instead there could be a KVM_LPAGE_GUEST_INHIBIT that doesn't get cleared
based on mixed attributes. It would be one way. It would need to get set by
something like kvm_write_track_add_gfn() that lives in tdx.c and is called
before going into the fault handler on 4k accept size. It would have to take mmu
write lock I think, which would kill scalability in the 4k accept case (but not
the normal 2MB one). But as long as mmu_write lock is held, demote will be no
problem, which the operation would also need to do.

I think it actually makes KVM's behavior easier to understand. We don't need to
worry about races between multiple accept sizes and things like that. It also
leaves the core MMU code mostly untouched. Performance/scalability wise it only
punishes the rare case.

For leaving the option open to promote the GFNs in the future, a GHCI interface
or similar could be defined for the guest to say "I don't care about page size
anymore for this gfn". So it won't close it off forever.
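
(A condensed sketch of the idea; the helper names match the draft Yan posts
later in this thread. On a 4k ACCEPT size, mark the 2MB range guest-inhibited
and split any existing huge leaf under the mmu write lock, then let the fault
handler map at 4k:)

static int tdx_handle_4k_accept(struct kvm *kvm, struct kvm_memory_slot *slot,
				gfn_t gfn)
{
	struct kvm_gfn_range range = {
		.start = gfn,
		.end = gfn + 1,
		.slot = slot,
		.attr_filter = KVM_FILTER_PRIVATE,
	};
	int ret = 0;

	if (hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M))
		return 0;	/* already inhibited; fault in at 4k */

	scoped_guard(write_lock, &kvm->mmu_lock) {
		ret = kvm_split_boundary_leafs(kvm, &range);
		if (!ret)
			hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
	}

	return ret;
}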

> 
> > done without the mmu write lock... But it might fit KVM better?

Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Vishal Annapurve 5 months, 3 weeks ago
On Tue, Jun 24, 2025 at 11:36 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
> ...
> For leaving the option open to promote the GFNs in the future, a GHCI interface
> or similar could be defined for the guest to say "I don't care about page size
> anymore for this gfn". So it won't close it off forever.
>

I think it's in the host's interest to get the pages mapped at large
page granularity whenever possible. Even if the guest doesn't buy in to
the "future" GHCI interface, there should be some ABI between TDX
module and host VMM to allow promotion probably as soon as all the
ranges within a hugepage get accepted but are still mapped at 4K
granularity.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 5 months, 3 weeks ago
On Wed, 2025-06-25 at 06:47 -0700, Vishal Annapurve wrote:
> On Tue, Jun 24, 2025 at 11:36 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > ...
> > For leaving the option open to promote the GFNs in the future, a GHCI interface
> > or similar could be defined for the guest to say "I don't care about page size
> > anymore for this gfn". So it won't close it off forever.
> > 
> 
> I think it's in the host's interest to get the pages mapped at large
> page granularity whenever possible. Even if the guest doesn't buy in to
> the "future" GHCI interface, there should be some ABI between TDX
> module and host VMM to allow promotion probably as soon as all the
> ranges within a hugepage get accepted but are still mapped at 4K
> granularity.

In the 4k accept case, the guest is kind of requesting a specific host page
size. I agree it's not good to let the guest influence the host's resource
usage. But this already happens with private/shared conversions.

As for future promotion opportunities, I think that part needs a re-think. I
don't think cost/benefit is really there today. If we had a simpler solution (we
discussed some TDX module changes offline), then it changes the calculus. But we
shouldn't focus too much on the ideal TDX implementation. Getting the ideal case
upstream is far, far away. In the meantime we should focus on the simplest
things with the most benefit. In the end I'd expect an iterative, evolving
implementation to be faster to upstream than thinking through how it works with
every idea. The exception is thinking through a sane ABI ahead of time.

I don't think we necessarily need a GHCI interface to expose control of host
page sizes to the guest, but I think it might help with determinism. I meant it
sort of as an escape hatch. Like if we find some nasty races that prevent
optimizations for promotion, we could have an option to have the guest help by
making the ABI around page sizes more formal.

Side topic on page promotion, I'm wondering if the biggest bang-for-the-buck
promotion opportunity will be the memory that gets added via PAGE.ADD at TD
startup time. Which is a narrow specific case that may be easier to attack.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 5 months, 3 weeks ago
On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > Could we provide the info via the private_max_mapping_level hook (i.e. via
> > tdx_gmem_private_max_mapping_level())?
> 
> This is one of the previous two methods discussed. Can you elaborate on what you
> are trying to say?
I don't get why we can't use the existing tdx_gmem_private_max_mapping_level()
to convey the max_level info at which a vendor hopes a GFN to be mapped.

Before TDX huge pages, tdx_gmem_private_max_mapping_level() always returns 4KB;
after TDX huge pages, it returns
- 4KB during the TD build stage
- at TD runtime: 4KB or 2MB

Why does KVM need to care how the vendor determines this max_level?
I think a vendor should have its freedom to decide based on software limitation,
guest's wishes, hardware bugs or whatever.

> > Or what about introducing a vendor hook in __kvm_mmu_max_mapping_level() for a
> > private fault?
> > 
> > > Maybe we could have EPT violations that contain 4k accept sizes first update the
> > > attribute for the GFN to be accepted or not, like have tdx.c call out to set
> > > kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
> > Something like kvm_lpage_info->disallow_lpage would disallow later page
> > promotion, though we don't support it right now.
> 
> Well I was originally thinking it would not set kvm_lpage_info->disallow_lpage
> directly, but rely on the logic that checks for mixed attributes. But more
> below...
> 
> > 
> > > like that. Maybe set an "accepted" attribute, or something. Not sure if it could be
> > Setting "accepted" attribute in the EPT violation handler?
> > It's a little odd, as the accept operation is not yet completed.
> 
> I guess the question in both of these comments is: what is the life cycle. Guest
> could call TDG.MEM.PAGE.RELEASE to unaccept it as well. Oh, geez. It looks like
> TDG.MEM.PAGE.RELEASE will give the same size hints in the EPT violation. So an
> accept attribute is not going to work, at least without TDX module changes.
> 
> 
> Actually, the problem we have doesn't fit the mixed attributes behavior. If many
> vCPUs accept a 2MB region at 4k page size, the entire 2MB range could be non-
> mixed and then individual accepts would fail.
> 
> 
> So instead there could be a KVM_LPAGE_GUEST_INHIBIT that doesn't get cleared
Set KVM_LPAGE_GUEST_INHIBIT via a TDVMCALL ?

Or just set the KVM_LPAGE_GUEST_INHIBIT when an EPT violation contains 4KB
level info?

I guess it's the latter one as it can avoid modification to both EDK2 and Linux
guest.  I observed ~2710 instances of "guest accepts at 4KB when KVM can map at
2MB" during the boot-up of a TD with 4GB memory.

But does it mean TDX needs to hold write mmu_lock in the EPT violation handler
and set KVM_LPAGE_GUEST_INHIBIT on finding a violation carries 4KB level info?

> based on mixed attributes. It would be one way. It would need to get set by
> something like kvm_write_track_add_gfn() that lives in tdx.c and is called
> before going into the fault handler on 4k accept size. It would have to take mmu
> write lock I think, which would kill scalability in the 4k accept case (but not
> the normal 2MB one). But as long as mmu_write lock is held, demote will be no
> problem, which the operation would also need to do.
> 
> I think it actually makes KVM's behavior easier to understand. We don't need to
> worry about races between multiple accept sizes and things like that. It also
> leaves the core MMU code mostly untouched. Performance/scalability wise it only
> punishes the rare case.
Write down my understanding to check if it's correct:

- when a TD is NOT configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL, KVM
  always maps at 4KB

- When a TD is configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL,

(a)
1. guest accepts at 4KB
2. TDX sets KVM_LPAGE_GUEST_INHIBIT and tries splitting (with write mmu_lock)
3. KVM maps at 4KB (with read mmu_lock)
4. guest's 4KB accept succeeds.

(b)
1. guest accepts at 2MB.
2. KVM maps at 4KB for some reason.
3. guest's accept 2MB fails with TDACCEPT_SIZE_MISMATCH.
4. guest accepts at 4KB
5. guest's 4KB accept succeeds.

> For leaving the option open to promote the GFNs in the future, a GHCI interface
> or similar could be defined for the guest to say "I don't care about page size
> anymore for this gfn". So it won't close it off forever.
ok.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 5 months, 3 weeks ago
On Wed, 2025-06-25 at 17:28 +0800, Yan Zhao wrote:
> On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> > On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > > Could we provide the info via the private_max_mapping_level hook (i.e. via
> > > tdx_gmem_private_max_mapping_level())?
> > 
> > This is one of the previous two methods discussed. Can you elaborate on what you
> > are trying to say?
> I don't get why we can't use the existing tdx_gmem_private_max_mapping_level()
> to convey the max_level info at which a vendor hopes a GFN to be mapped.
> 
> Before TDX huge pages, tdx_gmem_private_max_mapping_level() always returns 4KB;
> after TDX huge pages, it returns
> - 4KB during the TD build stage
> - at TD runtime: 4KB or 2MB
> 
> Why does KVM need to care how the vendor determines this max_level?
> I think a vendor should have its freedom to decide based on software limitation,
> guest's wishes, hardware bugs or whatever.

I don't see that big of a difference between "KVM" and "vendor". TDX code is KVM
code. Just because it's in tdx.c doesn't mean it's ok for it to be hard to trace
the logic.

I'm not sure what Sean's objection was to that approach, or if he objected to
just the weird SIZE_MISMATCH behavior of TDX module. I think you already know
why I don't prefer it:
 - Requiring demote in the fault handler. This requires an additional write lock
inside the mmu read lock, or TDX module changes. Although now I wonder if the
interrupt error code related problems will get worse with this solution. The
solution is currently not settled.
 - Requiring passing args on the vCPU struct, which as you point out will work
functionally today only because the prefault stuff will avoid seeing it. But
it's fragile
 - The page size behavior is a bit implicit

I'm coming back to this draft after PUCK. Sean shared his thoughts there. I'll
try to summarize. He didn't like how the page size requirements were passed
through the fault handler in a "transient" way. That "transient" property covers
both of the two options for passing the size info through the fault handler that
we were debating. He also didn't like how TDX arch requires KVM to map at a
specific host size around accept. Michael Roth brought up that SNP has the same
requirement, but it can do the zap and refault approach.

We then discussed this lpage_info idea. He was in favor of it, but not, I'd say,
overly enthusiastic. In a "least worst option" kind of way.

I think the biggest downside is the MMU write lock. Our goal for this series is
to help performance, not to get huge page sizes. So if we do this idea, we can't
fully wave our hands that any optimization is premature. It *is* an
optimization. We need to either convince ourselves that the overall benefit is
still there, or have a plan to adapt the guest to avoid 4k accepts. Which we
were previously discussing of requiring anyway.

But I much prefer the deterministic behavior of this approach from a
maintainability perspective.

> 
> > > Or what about introducing a vendor hook in __kvm_mmu_max_mapping_level() for a
> > > private fault?
> > > 
> > > > Maybe we could have EPT violations that contain 4k accept sizes first update the
> > > > attribute for the GFN to be accepted or not, like have tdx.c call out to set
> > > > kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
> > > Something like kvm_lpage_info->disallow_lpage would disallow later page
> > > promotion, though we don't support it right now.
> > 
> > Well I was originally thinking it would not set kvm_lpage_info->disallow_lpage
> > directly, but rely on the logic that checks for mixed attributes. But more
> > below...
> > 
> > > 
> > > > like that. Maybe set an "accepted" attribute, or something. Not sure if it could be
> > > Setting "accepted" attribute in the EPT violation handler?
> > > It's a little odd, as the accept operation is not yet completed.
> > 
> > I guess the question in both of these comments is: what is the life cycle. Guest
> > could call TDG.MEM.PAGE.RELEASE to unaccept it as well. Oh, geez. It looks like
> > TDG.MEM.PAGE.RELEASE will give the same size hints in the EPT violation. So an
> > accept attribute is not going to work, at least without TDX module changes.
> > 
> > 
> > Actually, the problem we have doesn't fit the mixed attributes behavior. If many
> > vCPUs accept a 2MB region at 4k page size, the entire 2MB range could be non-
> > mixed and then individual accepts would fail.
> > 
> > 
> > So instead there could be a KVM_LPAGE_GUEST_INHIBIT that doesn't get cleared
> Set KVM_LPAGE_GUEST_INHIBIT via a TDVMCALL ?
> 
> Or just set the KVM_LPAGE_GUEST_INHIBIT when an EPT violation contains 4KB
> level info?

Yes, that's the idea. 2MB accepts can behave like normal.

> 
> I guess it's the latter one as it can avoid modification to both EDK2 and Linux
> guest.  I observed ~2710 instances of "guest accepts at 4KB when KVM can map at
> 2MB" during the boot-up of a TD with 4GB memory.

Oh, wow that is more than I expected. Did you notice how many vCPUs they were
spread across? What memory size did you use? What was your guest memory
configuration?

> 
> But does it mean TDX needs to hold write mmu_lock in the EPT violation handler
> and set KVM_LPAGE_GUEST_INHIBIT on finding a violation carries 4KB level info?

I think so. I didn't check the reason, but the other similar code took it. Maybe
not? If we don't need to take mmu write lock, then this idea seems like a clear
winner to me.

> 
> > based on mixed attributes. It would be one way. It would need to get set by
> > something like kvm_write_track_add_gfn() that lives in tdx.c and is called
> > before going into the fault handler on 4k accept size. It would have to take mmu
> > write lock I think, which would kill scalability in the 4k accept case (but not
> > the normal 2MB one). But as long as mmu_write lock is held, demote will be no
> > problem, which the operation would also need to do.
> > 
> > I think it actually makes KVM's behavior easier to understand. We don't need to
> > worry about races between multiple accept sizes and things like that. It also
> > leaves the core MMU code mostly untouched. Performance/scalability wise it only
> > punishes the rare case.
> Write down my understanding to check if it's correct:

Will respond to this part on your later mail with corrections.



Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 5 months, 3 weeks ago
On Wed, Jun 25, 2025 at 10:47:47PM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-06-25 at 17:28 +0800, Yan Zhao wrote:
> > On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> > > On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > > > Could we provide the info via the private_max_mapping_level hook (i.e. via
> > > > tdx_gmem_private_max_mapping_level())?
> > > 
> > > This is one of the previous two methods discussed. Can you elaborate on what you
> > > are trying to say?
> > I don't get why we can't use the existing tdx_gmem_private_max_mapping_level()
> > to convey the max_level info at which a vendor hopes a GFN to be mapped.
> > 
> > Before TDX huge pages, tdx_gmem_private_max_mapping_level() always returns 4KB;
> > after TDX huge pages, it returns
> > - 4KB during the TD build stage
> > - at TD runtime: 4KB or 2MB
> > 
> > Why does KVM need to care how the vendor determines this max_level?
> > I think a vendor should have its freedom to decide based on software limitation,
> > guest's wishes, hardware bugs or whatever.
> 
> I don't see that big of a difference between "KVM" and "vendor". TDX code is KVM
> code. Just because it's in tdx.c doesn't mean it's ok for it to be hard to trace
> the logic.
> 
> I'm not sure what Sean's objection was to that approach, or if he objected to
> just the weird SIZE_MISMATCH behavior of TDX module. I think you already know
> why I don't prefer it:
>  - Requiring demote in the fault handler. This requires an additional write lock
> inside the mmu read lock, or TDX module changes. Although now I wonder if the
> interrupt error code related problems will get worse with this solution. The
> solution is currently not settled.
>  - Requiring passing args on the vCPU struct, which as you point out will work
> functionally today only because the prefault stuff will avoid seeing it. But
> it's fragile
>  - The page size behavior is a bit implicit
Hmm, strictly speaking, none of the three are the fault of
tdx_gmem_private_max_mapping_level().

With tdx_gmem_private_max_mapping_level() to pass in the level, we can track
KVM_LPAGE_GUEST_INHIBIT within tdx.c without changing lpage_info.
tdx.c then has the freedom to change KVM_LPAGE_GUEST_INHIBIT to some more
flexible scheme in the future while keeping the KVM MMU core intact.

But with lpage_info, it looks like we can save some memory.
The downside is that we may need to update TDX MMU core in case of future
changes.

> I'm coming back to this draft after PUCK. Sean shared his thoughts there. I'll
> try to summarize. He didn't like how the page size requirements were passed
> through the fault handler in a "transient" way. That "transient" property covers
> both of the two options for passing the size info through the fault handler that
> we were debating. He also didn't like how TDX arch requires KVM to map at a
> specific host size around accept. Michael Roth brought up that SNP has the same
> requirement, but it can do the zap and refault approach.
> 
> We then discussed this lpage_info idea. He was in favor of it, but not, I'd say,
> overly enthusiastic. In a "least worst option" kind of way.
> 
> I think the biggest downside is the MMU write lock. Our goal for this series is
> to help performance, not to get huge page sizes. So if we do this idea, we can't
> fully wave our hands that any optimization is premature. It *is* an
> optimization. We need to either convince ourselves that the overall benefit is
> still there, or have a plan to adapt the guest to avoid 4k accepts. Which we
> were previously discussing of requiring anyway.
> 
> But I much prefer the deterministic behavior of this approach from a
> maintainability perspective.
> 
> > 
> > > > Or what about introducing a vendor hook in __kvm_mmu_max_mapping_level() for a
> > > > private fault?
> > > > 
> > > > > Maybe we could have EPT violations that contain 4k accept sizes first update the
> > > > > attribute for the GFN to be accepted or not, like have tdx.c call out to set
> > > > > kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
> > > > Something like kvm_lpage_info->disallow_lpage would disallow later page
> > > > promotion, though we don't support it right now.
> > > 
> > > Well I was originally thinking it would not set kvm_lpage_info->disallow_lpage
> > > directly, but rely on the logic that checks for mixed attributes. But more
> > > below...
> > > 
> > > > 
> > > > > like that. Maybe set an "accepted" attribute, or something. Not sure if it could be
> > > > Setting "accepted" attribute in the EPT violation handler?
> > > > It's a little odd, as the accept operation is not yet completed.
> > > 
> > > I guess the question in both of these comments is: what is the life cycle. Guest
> > > could call TDG.MEM.PAGE.RELEASE to unaccept it as well. Oh, geez. It looks like
> > > TDG.MEM.PAGE.RELEASE will give the same size hints in the EPT violation. So an
> > > accept attribute is not going to work, at least without TDX module changes.
> > > 
> > > 
> > > Actually, the problem we have doesn't fit the mixed attributes behavior. If many
> > > vCPUs accept a 2MB region at 4k page size, the entire 2MB range could be non-
> > > mixed and then individual accepts would fail.
> > > 
> > > 
> > > So instead there could be a KVM_LPAGE_GUEST_INHIBIT that doesn't get cleared
> > Set KVM_LPAGE_GUEST_INHIBIT via a TDVMCALL ?
> > 
> > Or just set the KVM_LPAGE_GUEST_INHIBIT when an EPT violation contains 4KB
> > level info?
> 
> Yes, that's the idea. 2MB accepts can behave like normal.
> 
> > 
> > I guess it's the latter one as it can avoid modification to both EDK2 and Linux
> > guest.  I observed ~2710 instances of "guest accepts at 4KB when KVM can map at
> > 2MB" during the boot-up of a TD with 4GB memory.
> 
> Oh, wow that is more than I expected. Did you notice how many vCPUs they were
> spread across? What memory size did you use? What was your guest memory
> configuration?
The guest memory is 4GB, 8 vCPUs.
The memory slots layout is:
slot 1: base gfn=0, npages=0x80000
slot 2: base gfn=0x100000, npages=0x80000
slot 3: base gfn=0xffc00, npages=0x400

The GFN spread for the ~2710 instances is:
GFNs 0x806-0x9ff (1 time for each of 506 pages)
GFNs 0x7e800-0x7e9ff (1 time for each of 512 pages)
GFN: 0x7d3ff~0x7e7fe (repeated private-to-shared, and shared-to-private are
                      conducted on this range), with the top 3 among them being:
     0x7d9da (476 times)
     0x7d9d9 (156 times)
     0x7d9d7 (974 times)

All those instances are from vCPU 0, when the guest is in EDK2 and during early
kernel boot.

Based on my observation, the count of these instances does not scale with guest
memory. In other words, the count remains roughly the same even when the guest
memory is increased to 8GB.

> > But does it mean TDX needs to hold write mmu_lock in the EPT violation handler
> > and set KVM_LPAGE_GUEST_INHIBIT on finding a violation carries 4KB level info?
> 
> I think so. I didn't check the reason, but the other similar code took it. Maybe
> not? If we don't need to take mmu write lock, then this idea seems like a clear
> winner to me.
Hmm, setting KVM_LPAGE_GUEST_INHIBIT needs to be followed by an attempt to split.
So, if we don't want to support splitting under read mmu_lock, we need to take
write mmu_lock.

I drafted a change as below (will refine some parts of it later).
The average count to take write mmu_lock is 11 during VM boot.

There's no significant difference in the count of 2M mappings
during guest kernel booting to login; on average:
before this patch: 1144 2M mappings
after this patch:  1143 2M mappings.

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index f999c15d8d3e..d4e98728f600 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -322,4 +322,8 @@ static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
 {
        return gfn & kvm_gfn_direct_bits(kvm);
 }
+
+void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
+bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
+
 #endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f0afee2e283a..28c511d8b372 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -721,6 +721,8 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
  */
 #define KVM_LPAGE_MIXED_FLAG   BIT(31)

+#define KVM_LPAGE_GUEST_INHIBIT_FLAG   BIT(30)
+
 static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
                                            gfn_t gfn, int count)
 {
@@ -732,7 +734,8 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,

                old = linfo->disallow_lpage;
                linfo->disallow_lpage += count;
-               WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
+               WARN_ON_ONCE((old ^ linfo->disallow_lpage) &
+                             (KVM_LPAGE_MIXED_FLAG | KVM_LPAGE_GUEST_INHIBIT_FLAG));
        }
 }

@@ -1653,13 +1656,15 @@ int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range)
        bool ret = 0;

        lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
-                           lockdep_is_held(&kvm->slots_lock));
+                           lockdep_is_held(&kvm->slots_lock) ||
+                           srcu_read_lock_held(&kvm->srcu));

        if (tdp_mmu_enabled)
                ret = kvm_tdp_mmu_gfn_range_split_boundary(kvm, range);

        return ret;
 }
+EXPORT_SYMBOL_GPL(kvm_split_boundary_leafs);

 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
@@ -7734,6 +7739,18 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
                vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
 }

+bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+       return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_GUEST_INHIBIT_FLAG;
+}
+EXPORT_SYMBOL_GPL(hugepage_test_guest_inhibit);
+
+void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+       lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_GUEST_INHIBIT_FLAG;
+}
+EXPORT_SYMBOL_GPL(hugepage_set_guest_inhibit);
+
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
                                int level)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 244fd22683db..4028423cf595 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1852,28 +1852,8 @@ int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
        if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
                return -EINVAL;

-       /*
-        * Split request with mmu_lock held for reading can only occur when one
-        * vCPU accepts at 2MB level while another vCPU accepts at 4KB level.
-        * Ignore this 4KB mapping request by setting violation_request_level to
-        * 2MB and returning -EBUSY for retry. Then the next fault at 2MB level
-        * would be a spurious fault. The vCPU accepting at 2MB will accept the
-        * whole 2MB range.
-        */
-       if (mmu_lock_shared) {
-               struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-               struct vcpu_tdx *tdx = to_tdx(vcpu);
-
-               if (KVM_BUG_ON(!vcpu, kvm))
-                       return -EOPNOTSUPP;
-
-               /* Request to map as 2MB leaf for the whole 2MB range */
-               tdx->violation_gfn_start = gfn_round_for_level(gfn, level);
-               tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
-               tdx->violation_request_level = level;
-
-               return -EBUSY;
-       }
+       if (mmu_lock_shared)
+               return -EOPNOTSUPP;

        ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
        if (ret <= 0)
@@ -1937,28 +1917,51 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
        return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
 }

-static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
+static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
 {
        struct vcpu_tdx *tdx = to_tdx(vcpu);
+       struct kvm *kvm = vcpu->kvm;
+       gfn_t gfn = gpa_to_gfn(gpa);
+       struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
        int level = -1;
+       u64 eeq_info;

-       u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
+       if (!slot)
+               return 0;

-       u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
-                       TDX_EXT_EXIT_QUAL_INFO_SHIFT;
+       if ((tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK) !=
+           TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
+               return 0;

-       if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
-               level = (eeq_info & GENMASK(2, 0)) + 1;
+       eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
+                   TDX_EXT_EXIT_QUAL_INFO_SHIFT;

-               tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
-               tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
-               tdx->violation_request_level = level;
-       } else {
-               tdx->violation_gfn_start = -1;
-               tdx->violation_gfn_end = -1;
-               tdx->violation_request_level = -1;
+       level = (eeq_info & GENMASK(2, 0)) + 1;
+
+       if (level == PG_LEVEL_4K) {
+              if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
+                       struct kvm_gfn_range gfn_range = {
+                               .start = gfn,
+                               .end = gfn + 1,
+                               .slot = slot,
+                               .may_block = true,
+                               .attr_filter = KVM_FILTER_PRIVATE,
+                       };
+
+                       scoped_guard(write_lock, &kvm->mmu_lock) {
+                               int ret;
+
+                               ret = kvm_split_boundary_leafs(kvm, &gfn_range);
+
+                               if (ret)
+                                       return ret;
+
+                               hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
+                       }
+              }
        }
+
+       return 0;
 }

 static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
@@ -1987,7 +1990,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
                 */
                exit_qual = EPT_VIOLATION_ACC_WRITE;

-               tdx_get_accept_level(vcpu, gpa);
+               if (tdx_check_accept_level(vcpu, gpa))
+                       return RET_PF_RETRY;

                /* Only private GPA triggers zero-step mitigation */
                local_retry = true;
@@ -3022,9 +3026,6 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)

        vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;

-       tdx->violation_gfn_start = -1;
-       tdx->violation_gfn_end = -1;
-       tdx->violation_request_level = -1;
        return 0;

 free_tdcx:
@@ -3373,14 +3374,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
                                       gfn_t gfn, bool prefetch)
 {
-       struct vcpu_tdx *tdx = to_tdx(vcpu);
-
-       if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE) || prefetch))
+       if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE)))
                return PG_LEVEL_4K;

-       if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
-               return tdx->violation_request_level;
-
        return PG_LEVEL_2M;
 }

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index acd18a01f63d..3a3077666ee6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2610,6 +2610,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn

        return NULL;
 }
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);

 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
 {
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 5 months, 2 weeks ago
On Thu, 2025-06-26 at 16:53 +0800, Yan Zhao wrote:
> On Wed, Jun 25, 2025 at 10:47:47PM +0800, Edgecombe, Rick P wrote:
> > On Wed, 2025-06-25 at 17:28 +0800, Yan Zhao wrote:
> > > On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> > > > On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > > > 
> > > I guess it's the latter one as it can avoid modification to both EDK2 and Linux
> > > guest.  I observed ~2710 instances of "guest accepts at 4KB when KVM can map at
> > > 2MB" during the boot-up of a TD with 4GB memory.
> > 
> > Oh, wow that is more than I expected. Did you notice how many vCPUs they were
> > spread across? What memory size did you use? What was your guest memory
> > configuration?
> The guest memory is 4GB, 8 vCPUs.
> The memory slots layout is:
> slot 1: base gfn=0, npages=0x80000
> slot 2: base gfn=0x100000, npages=0x80000
> slot 3: base gfn=0xffc00, npages=0x400
> 
> The GFN spread for the ~2710 instances is:
> GFNs 0x806-0x9ff (1 time for each of 506 pages)
> GFNs 0x7e800-0x7e9ff (1 time for each of 512 pages)
> GFN: 0x7d3ff~0x7e7fe (repeated private-to-shared, and shared-to-private are
>                       conducted on this range), with the top 3 among them being:
>      0x7d9da (476 times)
>      0x7d9d9 (156 times)
>      0x7d9d7 (974 times)
> 
> All those instances are from vCPU 0, when the guest is in EDK2 and during early
> kernel boot.
> 
> Based on my observation, the count of these instances does not scale with guest
> memory. In other words, the count remains roughly the same even when the guest
> memory is increased to 8GB.

So the impact would be negligible. The mmu write lock would not see much, if
any, contention.

> 
> > > But does it mean TDX needs to hold write mmu_lock in the EPT violation handler
> > > and set KVM_LPAGE_GUEST_INHIBIT on finding a violation carries 4KB level info?
> > 
> > I think so. I didn't check the reason, but the other similar code took it. Maybe
> > not? If we don't need to take mmu write lock, then this idea seems like a clear
> > winner to me.
> Hmm, setting KVM_LPAGE_GUEST_INHIBIT needs to be followed by an attempt to split.
> So, if we don't want to support splitting under read mmu_lock, we need to take
> write mmu_lock.
> 
> I drafted a change as below (will refine some parts of it later).
> The average count to take write mmu_lock is 11 during VM boot.
> 
> There's no significant difference in the count of 2M mappings
> during guest kernel booting to login; on average:
> before this patch: 1144 2M mappings
> after this patch:  1143 2M mappings.

Oh, hmm. Well, it's not a strong argument against.

> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index f999c15d8d3e..d4e98728f600 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -322,4 +322,8 @@ static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
>  {
>         return gfn & kvm_gfn_direct_bits(kvm);
>  }
> +
> +void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
> +bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
> +
>  #endif
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f0afee2e283a..28c511d8b372 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -721,6 +721,8 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
>   */
>  #define KVM_LPAGE_MIXED_FLAG   BIT(31)
> 
> +#define KVM_LPAGE_GUEST_INHIBIT_FLAG   BIT(30)
> +
>  static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
>                                             gfn_t gfn, int count)
>  {
> @@ -732,7 +734,8 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> 
>                 old = linfo->disallow_lpage;
>                 linfo->disallow_lpage += count;
> -               WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
> +               WARN_ON_ONCE((old ^ linfo->disallow_lpage) &
> +                             (KVM_LPAGE_MIXED_FLAG | KVM_LPAGE_GUEST_INHIBIT_FLAG));
>         }
>  }
> 
> @@ -1653,13 +1656,15 @@ int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range)
>         bool ret = 0;
> 
>         lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> -                           lockdep_is_held(&kvm->slots_lock));
> +                           lockdep_is_held(&kvm->slots_lock) ||
> +                           srcu_read_lock_held(&kvm->srcu));
> 
>         if (tdp_mmu_enabled)
>                 ret = kvm_tdp_mmu_gfn_range_split_boundary(kvm, range);
> 
>         return ret;
>  }
> +EXPORT_SYMBOL_GPL(kvm_split_boundary_leafs);
> 
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
> @@ -7734,6 +7739,18 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>                 vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
>  }
> 
> +bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> +{
> +       return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_GUEST_INHIBIT_FLAG;
> +}
> +EXPORT_SYMBOL_GPL(hugepage_test_guest_inhibit);
> +
> +void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> +{
> +       lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_GUEST_INHIBIT_FLAG;
> +}
> +EXPORT_SYMBOL_GPL(hugepage_set_guest_inhibit);
> +
>  #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
>  static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
>                                 int level)
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 244fd22683db..4028423cf595 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1852,28 +1852,8 @@ int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
>         if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
>                 return -EINVAL;
> 
> -       /*
> -        * Split request with mmu_lock held for reading can only occur when one
> -        * vCPU accepts at 2MB level while another vCPU accepts at 4KB level.
> -        * Ignore this 4KB mapping request by setting violation_request_level to
> -        * 2MB and returning -EBUSY for retry. Then the next fault at 2MB level
> -        * would be a spurious fault. The vCPU accepting at 2MB will accept the
> -        * whole 2MB range.
> -        */
> -       if (mmu_lock_shared) {
> -               struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> -               struct vcpu_tdx *tdx = to_tdx(vcpu);
> -
> -               if (KVM_BUG_ON(!vcpu, kvm))
> -                       return -EOPNOTSUPP;
> -
> -               /* Request to map as 2MB leaf for the whole 2MB range */
> -               tdx->violation_gfn_start = gfn_round_for_level(gfn, level);
> -               tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> -               tdx->violation_request_level = level;
> -
> -               return -EBUSY;
> -       }
> +       if (mmu_lock_shared)
> +               return -EOPNOTSUPP;
> 
>         ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
>         if (ret <= 0)
> @@ -1937,28 +1917,51 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
>         return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
>  }
> 
> -static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
>  {
>         struct vcpu_tdx *tdx = to_tdx(vcpu);
> +       struct kvm *kvm = vcpu->kvm;
> +       gfn_t gfn = gpa_to_gfn(gpa);
> +       struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>         int level = -1;
> +       u64 eeq_info;
> 
> -       u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> +       if (!slot)
> +               return 0;
> 
> -       u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> -                       TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> +       if ((tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK) !=
> +           TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> +               return 0;
> 
> -       if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
> -               level = (eeq_info & GENMASK(2, 0)) + 1;
> +       eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> +                   TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> 
> -               tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
> -               tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> -               tdx->violation_request_level = level;
> -       } else {
> -               tdx->violation_gfn_start = -1;
> -               tdx->violation_gfn_end = -1;
> -               tdx->violation_request_level = -1;
> +       level = (eeq_info & GENMASK(2, 0)) + 1;
> +
> +       if (level == PG_LEVEL_4K) {
> +              if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
> +                       struct kvm_gfn_range gfn_range = {
> +                               .start = gfn,
> +                               .end = gfn + 1,
> +                               .slot = slot,
> +                               .may_block = true,
> +                               .attr_filter = KVM_FILTER_PRIVATE,
> +                       };
> +
> +                       scoped_guard(write_lock, &kvm->mmu_lock) {
> +                               int ret;
> +
> +                               ret = kvm_split_boundary_leafs(kvm, &gfn_range);
> +
> +                               if (ret)
> +                                       return ret;
> +
> +                               hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);


Can you explain what you found regarding the write lock need? For most accept
cases, we could fault in the PTEs on the read lock. And in the future we could
have a demote that could work under the read lock, as we talked. So
kvm_split_boundary_leafs() could often be unneeded, or could work under the read
lock when needed.

What is the problem in hugepage_set_guest_inhibit() that requires the write
lock?

But in any case, it seems like we have *a* solution here. It doesn't seem like
there are any big downsides. Should we close it?

> +                       }
> +              }
>         }
> +
> +       return 0;
>  }
> 
>  static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> @@ -1987,7 +1990,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>                  */
>                 exit_qual = EPT_VIOLATION_ACC_WRITE;
> 
> -               tdx_get_accept_level(vcpu, gpa);
> +               if (tdx_check_accept_level(vcpu, gpa))
> +                       return RET_PF_RETRY;
> 
>                 /* Only private GPA triggers zero-step mitigation */
>                 local_retry = true;
> @@ -3022,9 +3026,6 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> 
>         vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> 
> -       tdx->violation_gfn_start = -1;
> -       tdx->violation_gfn_end = -1;
> -       tdx->violation_request_level = -1;
>         return 0;
> 
>  free_tdcx:
> @@ -3373,14 +3374,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
>  int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>                                        gfn_t gfn, bool prefetch)
>  {
> -       struct vcpu_tdx *tdx = to_tdx(vcpu);
> -
> -       if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE) || prefetch))
> +       if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE)))
>                 return PG_LEVEL_4K;
> 
> -       if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
> -               return tdx->violation_request_level;
> -
>         return PG_LEVEL_2M;
>  }
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index acd18a01f63d..3a3077666ee6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2610,6 +2610,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
> 
>         return NULL;
>  }
> +EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
> 
>  bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
>  {

Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 5 months, 2 weeks ago
On Tue, Jul 01, 2025 at 08:42:33AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-06-26 at 16:53 +0800, Yan Zhao wrote:
> > On Wed, Jun 25, 2025 at 10:47:47PM +0800, Edgecombe, Rick P wrote:
> > > On Wed, 2025-06-25 at 17:28 +0800, Yan Zhao wrote:
> > > > On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> > > > > On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > > > > 
> > > > I guess it's the latter one as it can avoid modification to both EDK2 and Linux
> > > > guest.  I observed ~2710 instances of "guest accepts at 4KB when KVM can map at
> > > > 2MB" during the boot-up of a TD with 4GB memory.
> > > 
> > > Oh, wow that is more than I expected. Did you notice how many vCPUs they were
> > > spread across? What memory size did you use? What was your guest memory
> > > configuration?
> > The guest memory is 4GB, 8 vCPUs.
> > The memory slots layout is:
> > slot 1: base gfn=0, npages=0x80000
> > slot 2: base gfn=0x100000, npages=0x80000
> > slot 3: base gfn=0xffc00, npages=0x400
> > 
> > The GFN spread for the ~2710 instances is:
> > GFNs 0x806-0x9ff (1 time for each of 506 pages)
> > GFNs 0x7e800-0x7e9ff (1 time for each of 512 pages)
> > GFN: 0x7d3ff~0x7e7fe (repeated private-to-shared, and shared-to-private are
> >                       conducted on this range), with the top 3 among them being:
> >      0x7d9da (476 times)
> >      0x7d9d9 (156 times)
> >      0x7d9d7 (974 times)
> > 
> > All those instances are from vCPU 0, when the guest is in EDK2 and during early
> > kernel boot.
> > 
> > Based on my observation, the count of these instances does not scale with guest
> > memory. In other words, the count remains roughly the same even when the guest
> > memory is increased to 8GB.
> 
> So the impact would be negligible. The mmu write lock would not see much, if
> any, contention.
> 
> > 
> > > > But does it mean TDX needs to hold write mmu_lock in the EPT violation handler
> > > > and set KVM_LPAGE_GUEST_INHIBIT upon finding a violation that carries 4KB level info?
> > > 
> > > I think so. I didn't check the reason, but the other similar code took it. Maybe
> > > not? If we don't need to take mmu write lock, then this idea seems like a clear
> > > winner to me.
> > Hmm, setting KVM_LPAGE_GUEST_INHIBIT needs to be followed by trying to split.
> > So, if we don't want to support splitting under read mmu_lock, we need to take
> > write mmu_lock.
> > 
> > I drafted a change as below (will refine some parts of it later).
> > The average count to take write mmu_lock is 11 during VM boot.
> > 
> > There's no significant difference in the count of 2M mappings
> > during guest kernel booting to login, on average:
> > before this patch: 1144 2M mappings 
> > after this patch:  1143 2M mappings.
> 
> Oh, hmm. Well, it's not a strong argument against.
> 
> > 
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index f999c15d8d3e..d4e98728f600 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -322,4 +322,8 @@ static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
> >  {
> >         return gfn & kvm_gfn_direct_bits(kvm);
> >  }
> > +
> > +void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
> > +bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
> > +
> >  #endif
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index f0afee2e283a..28c511d8b372 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -721,6 +721,8 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
> >   */
> >  #define KVM_LPAGE_MIXED_FLAG   BIT(31)
> > 
> > +#define KVM_LPAGE_GUEST_INHIBIT_FLAG   BIT(30)
> > +
> >  static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> >                                             gfn_t gfn, int count)
> >  {
> > @@ -732,7 +734,8 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> > 
> >                 old = linfo->disallow_lpage;
> >                 linfo->disallow_lpage += count;
> > -               WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
> > +               WARN_ON_ONCE((old ^ linfo->disallow_lpage) &
> > +                             (KVM_LPAGE_MIXED_FLAG | KVM_LPAGE_GUEST_INHIBIT_FLAG));
> >         }
> >  }
> > 
> > @@ -1653,13 +1656,15 @@ int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range)
> >         bool ret = 0;
> > 
> >         lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> > -                           lockdep_is_held(&kvm->slots_lock));
> > +                           lockdep_is_held(&kvm->slots_lock) ||
> > +                           srcu_read_lock_held(&kvm->srcu));
> > 
> >         if (tdp_mmu_enabled)
> >                 ret = kvm_tdp_mmu_gfn_range_split_boundary(kvm, range);
> > 
> >         return ret;
> >  }
> > +EXPORT_SYMBOL_GPL(kvm_split_boundary_leafs);
> > 
> >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> >  {
> > @@ -7734,6 +7739,18 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> >                 vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
> >  }
> > 
> > +bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> > +{
> > +       return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_GUEST_INHIBIT_FLAG;
> > +}
> > +EXPORT_SYMBOL_GPL(hugepage_test_guest_inhibit);
> > +
> > +void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> > +{
> > +       lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_GUEST_INHIBIT_FLAG;
> > +}
> > +EXPORT_SYMBOL_GPL(hugepage_set_guest_inhibit);
> > +
> >  #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> >  static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
> >                                 int level)
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 244fd22683db..4028423cf595 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1852,28 +1852,8 @@ int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> >         if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
> >                 return -EINVAL;
> > 
> > -       /*
> > -        * Split request with mmu_lock held for reading can only occur when one
> > -        * vCPU accepts at 2MB level while another vCPU accepts at 4KB level.
> > -        * Ignore this 4KB mapping request by setting violation_request_level to
> > -        * 2MB and returning -EBUSY for retry. Then the next fault at 2MB level
> > -        * would be a spurious fault. The vCPU accepting at 2MB will accept the
> > -        * whole 2MB range.
> > -        */
> > -       if (mmu_lock_shared) {
> > -               struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> > -               struct vcpu_tdx *tdx = to_tdx(vcpu);
> > -
> > -               if (KVM_BUG_ON(!vcpu, kvm))
> > -                       return -EOPNOTSUPP;
> > -
> > -               /* Request to map as 2MB leaf for the whole 2MB range */
> > -               tdx->violation_gfn_start = gfn_round_for_level(gfn, level);
> > -               tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> > -               tdx->violation_request_level = level;
> > -
> > -               return -EBUSY;
> > -       }
> > +       if (mmu_lock_shared)
> > +               return -EOPNOTSUPP;
> > 
> >         ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> >         if (ret <= 0)
> > @@ -1937,28 +1917,51 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> >         return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> >  }
> > 
> > -static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> > +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> >  {
> >         struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +       struct kvm *kvm = vcpu->kvm;
> > +       gfn_t gfn = gpa_to_gfn(gpa);
> > +       struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> >         int level = -1;
> > +       u64 eeq_info;
> > 
> > -       u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> > +       if (!slot)
> > +               return 0;
> > 
> > -       u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> > -                       TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> > +       if ((tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK) !=
> > +           TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> > +               return 0;
> > 
> > -       if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
> > -               level = (eeq_info & GENMASK(2, 0)) + 1;
> > +       eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> > +                   TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> > 
> > -               tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
> > -               tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> > -               tdx->violation_request_level = level;
> > -       } else {
> > -               tdx->violation_gfn_start = -1;
> > -               tdx->violation_gfn_end = -1;
> > -               tdx->violation_request_level = -1;
> > +       level = (eeq_info & GENMASK(2, 0)) + 1;
> > +
> > +       if (level == PG_LEVEL_4K) {
> > +              if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
> > +                       struct kvm_gfn_range gfn_range = {
> > +                               .start = gfn,
> > +                               .end = gfn + 1,
> > +                               .slot = slot,
> > +                               .may_block = true,
> > +                               .attr_filter = KVM_FILTER_PRIVATE,
> > +                       };
> > +
> > +                       scoped_guard(write_lock, &kvm->mmu_lock) {
> > +                               int ret;
> > +
> > +                               ret = kvm_split_boundary_leafs(kvm, &gfn_range);
> > +
> > +                               if (ret)
> > +                                       return ret;
> > +
> > +                               hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
> 
> 
> Can you explain what you found regarding the write lock need?
Here, the write lock protects 2 steps:
(1) update lpage_info.
(2) try splitting if there's any existing 2MB mapping.

The write mmu_lock is needed because lpage_info is read under read mmu_lock in
kvm_tdp_mmu_map().

kvm_tdp_mmu_map
  kvm_mmu_hugepage_adjust
    kvm_lpage_info_max_mapping_level

If we update lpage_info with the read mmu_lock held, the other vCPUs may map at
a stale 2MB level even after lpage_info is updated by hugepage_set_guest_inhibit().

Therefore, we must perform splitting under the write mmu_lock to ensure there
are no 2MB mappings after hugepage_set_guest_inhibit().

Otherwise, during the later mapping in __vmx_handle_ept_violation(), splitting
in the fault path could be triggered when the KVM MMU finds the goal level is
4KB while an existing 2MB mapping is present.
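
To illustrate (a hypothetical interleaving, if lpage_info were updated under
the read mmu_lock):

/*
 *   vCPU A (read mmu_lock)           vCPU B (read mmu_lock)
 *                                    kvm_lpage_info_max_mapping_level()
 *                                      -> GUEST_INHIBIT not set, 2MB allowed
 *   hugepage_set_guest_inhibit()
 *   split the existing 2MB leaf
 *                                    installs a fresh 2MB mapping <- stale level
 *
 * Holding the write mmu_lock across the set + split excludes vCPU B's fault-in
 * for the whole window.
 */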


> For most accept
> cases, we could fault in the PTEs on the read lock. And in the future we could

The actual mapping at 4KB level is still done with the read mmu_lock held, in
__vmx_handle_ept_violation().

> have a demote that could work under the read lock, as we talked. So
> kvm_split_boundary_leafs() could often be unneeded, or could work under the read
> lock when needed.
Could we leave the "demote under read lock" as a future optimization? 


> What is the problem in hugepage_set_guest_inhibit() that requires the write
> lock?
As above: to avoid the other vCPUs reading a stale mapping level, and to avoid
splitting under the read mmu_lock.

As guest_inhibit is set one-way, we could test it using
hugepage_test_guest_inhibit() without holding the lock. The chance of holding
the write mmu_lock for hugepage_set_guest_inhibit() is then greatly reduced
(11 times during VM boot in my testing).
 
> But in any case, it seems like we have *a* solution here. It doesn't seem like
> there are any big downsides. Should we close it?
I think it's good, as long as Sean doesn't disagree :)


> > +                       }
> > +              }
> >         }
> > +
> > +       return 0;
> >  }
> > 
> >  static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > @@ -1987,7 +1990,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> >                  */
> >                 exit_qual = EPT_VIOLATION_ACC_WRITE;
> > 
> > -               tdx_get_accept_level(vcpu, gpa);
> > +               if (tdx_check_accept_level(vcpu, gpa))
> > +                       return RET_PF_RETRY;
> > 
> >                 /* Only private GPA triggers zero-step mitigation */
> >                 local_retry = true;
> > @@ -3022,9 +3026,6 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> > 
> >         vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> > 
> > -       tdx->violation_gfn_start = -1;
> > -       tdx->violation_gfn_end = -1;
> > -       tdx->violation_request_level = -1;
> >         return 0;
> > 
> >  free_tdcx:
> > @@ -3373,14 +3374,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> >  int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
> >                                        gfn_t gfn, bool prefetch)
> >  {
> > -       struct vcpu_tdx *tdx = to_tdx(vcpu);
> > -
> > -       if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE) || prefetch))
> > +       if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE)))
> >                 return PG_LEVEL_4K;
> > 
> > -       if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
> > -               return tdx->violation_request_level;
> > -
> >         return PG_LEVEL_2M;
> >  }
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index acd18a01f63d..3a3077666ee6 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2610,6 +2610,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
> > 
> >         return NULL;
> >  }
> > +EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
> > 
> >  bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
> >  {
> 
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 5 months, 2 weeks ago
On Tue, 2025-07-01 at 10:41 +0800, Yan Zhao wrote:
> > Can you explain what you found regarding the write lock need?
> Here, the write lock protects 2 steps:
> (1) update lpage_info.
> (2) try splitting if there's any existing 2MB mapping.
> 
> The write mmu_lock is needed because lpage_info is read under read mmu_lock in
> kvm_tdp_mmu_map().
> 
> kvm_tdp_mmu_map
>   kvm_mmu_hugepage_adjust
>     kvm_lpage_info_max_mapping_level
> 
> If we update the lpage_info with read mmu_lock, the other vCPUs may map at a
> stale 2MB level even after lpage_info is updated by
> hugepage_set_guest_inhibit().
> 
> Therefore, we must perform splitting under the write mmu_lock to ensure there
> are no 2MB mappings after hugepage_set_guest_inhibit().
> 
> Otherwise, during later mapping in __vmx_handle_ept_violation(), splitting at
> fault path could be triggered as KVM MMU finds the goal level is 4KB while an
> existing 2MB mapping is present.

Could it be:
1. mmu read lock
2. update lpage_info
3. mmu write lock upgrade
4. demote
5. mmu unlock

Then (3) could be skipped given the ability to demote under the read lock?
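
Something like the below? (Just a sketch; kvm->mmu_lock is an rwlock_t with no
true upgrade primitive, so (3) would really be a drop-and-retake, and the
huge-leaf check is a made-up placeholder.)

        read_lock(&kvm->mmu_lock);                              /* (1) */
        hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);     /* (2) */
        if (huge_leaf_mapped) {                                 /* placeholder */
                read_unlock(&kvm->mmu_lock);
                write_lock(&kvm->mmu_lock);                     /* (3) "upgrade" */
                kvm_split_boundary_leafs(kvm, &gfn_range);      /* (4) demote */
                write_unlock(&kvm->mmu_lock);                   /* (5) */
        } else {
                read_unlock(&kvm->mmu_lock);                    /* (5) */
        }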

I noticed that the other lpage_info updaters took the mmu write lock, and I
wasn't sure why. We shouldn't take a lock that we don't actually need just as a
safety margin or to copy other code.

> 
> 
> > For most accept
> > cases, we could fault in the PTE's on the read lock. And in the future we
> > could
> 
> The actual mapping at 4KB level is still with read mmu_lock in
> __vmx_handle_ept_violation().
> 
> > have a demote that could work under read lock, as we talked. So
> > kvm_split_boundary_leafs() often or could be unneeded or work under read
> > lock
> > when needed.
> Could we leave the "demote under read lock" as a future optimization? 

We could add it to the list. If we have a TDX module that supports demote with a
single SEAMCALL then we don't have the rollback problem. The optimization could
utilize that. That said, we should focus on the optimizations that make the
biggest difference to real TDs. Your data suggests this might not be the case
today. 

> 
> 
> > What is the problem in hugepage_set_guest_inhibit() that requires the write
> > lock?
> As above, to avoid the other vCPUs reading stale mapping level and splitting
> under read mmu_lock.

We need mmu write lock for demote, but as long as the order is:
1. set lpage_info
2. demote if needed
3. go to fault handler

Then (3) should have what it needs even if another fault races (1).

> 
> As guest_inhibit is set one-way, we could test it using
> hugepage_test_guest_inhibit() without holding the lock. The chance to hold
> write
> mmu_lock for hugepage_set_guest_inhibit() is then greatly reduced.
> (in my testing, 11 during VM boot).
>  
> > But in any case, it seems like we have *a* solution here. It doesn't seem
> > like
> > there are any big downsides. Should we close it?
> I think it's good, as long as Sean doesn't disagree :)

He seemed onboard. Let's close it. We can even discuss lpage_info update locking
on v2.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 5 months, 2 weeks ago
On Tue, Jul 01, 2025 at 11:36:22PM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-07-01 at 10:41 +0800, Yan Zhao wrote:
> > > Can you explain what you found regarding the write lock need?
> > Here, the write lock protects 2 steps:
> > (1) update lpage_info.
> > (2) try splitting if there's any existing 2MB mapping.
> > 
> > The write mmu_lock is needed because lpage_info is read under read mmu_lock in
> > kvm_tdp_mmu_map().
> > 
> > kvm_tdp_mmu_map
> >   kvm_mmu_hugepage_adjust
> >     kvm_lpage_info_max_mapping_level
> > 
> > If we update the lpage_info with read mmu_lock, the other vCPUs may map at a
> > stale 2MB level even after lpage_info is updated by
> > hugepage_set_guest_inhibit().
> > 
> > Therefore, we must perform splitting under the write mmu_lock to ensure there
> > are no 2MB mappings after hugepage_set_guest_inhibit().
> > 
> > Otherwise, during later mapping in __vmx_handle_ept_violation(), splitting at
> > fault path could be triggered as KVM MMU finds the goal level is 4KB while an
> > existing 2MB mapping is present.
> 
> It could be?
> 1. mmu read lock
> 2. update lpage_info
> 3. mmu write lock upgrade
> 4. demote
> 5. mmu unlock
> 
> Then (3) could be skipped in the case of ability to demote under read lock?
> 
> I noticed that the other lpage_info updaters took mmu write lock, and I wasn't
> sure why. We shouldn't take a lock that we don't actually need just for safety
> margin or to copy other code.
Using the write mmu_lock is for a reason.

In the 3 steps, 
1. set lpage_info
2. demote if needed
3. go to fault handler

Step 2 requires holding the write mmu_lock before invoking kvm_split_boundary_leafs().
The write mmu_lock may also be dropped and re-acquired in
kvm_split_boundary_leafs(), for purposes like memory allocation.

If step 1 is done with the read mmu_lock, the other vCPUs could still fault in
at 2MB level after the demote in step 2.
Luckily, the current TDX module doesn't support promotion.
But we can avoid wading into this complex situation by holding the write
mmu_lock in step 1.

> > > For most accept
> > > cases, we could fault in the PTE's on the read lock. And in the future we
> > > could
> > 
> > The actual mapping at 4KB level is still with read mmu_lock in
> > __vmx_handle_ept_violation().
> > 
> > > have a demote that could work under read lock, as we talked. So
> > > kvm_split_boundary_leafs() often or could be unneeded or work under read
> > > lock
> > > when needed.
> > Could we leave the "demote under read lock" as a future optimization? 
> 
> We could add it to the list. If we have a TDX module that supports demote with a
> single SEAMCALL then we don't have the rollback problem. The optimization could
> utilize that. That said, we should focus on the optimizations that make the
> biggest difference to real TDs. Your data suggests this might not be the case
> today. 
Ok. 
 
> > > What is the problem in hugepage_set_guest_inhibit() that requires the write
> > > lock?
> > As above, to avoid the other vCPUs reading stale mapping level and splitting
> > under read mmu_lock.
> 
> We need mmu write lock for demote, but as long as the order is:
> 1. set lpage_info
> 2. demote if needed
> 3. go to fault handler
> 
> Then (3) should have what it needs even if another fault races (1).
See the above comment for why we need to hold the write mmu_lock for step 1.

Besides, as we need the write mmu_lock anyway for step 2 (i.e. hold the write
mmu_lock before walking the SPTEs to check if there's any existing mapping), I
don't see any performance impact from holding the write mmu_lock for step 1.


> > As guest_inhibit is set one-way, we could test it using
> > hugepage_test_guest_inhibit() without holding the lock. The chance to hold
> > write
> > mmu_lock for hugepage_set_guest_inhibit() is then greatly reduced.
> > (in my testing, 11 during VM boot).
> >  
> > > But in any case, it seems like we have *a* solution here. It doesn't seem
> > > like
> > > there are any big downsides. Should we close it?
> > I think it's good, as long as Sean doesn't disagree :)
> 
> He seemed onboard. Let's close it. We can even discuss lpage_info update locking
> on v2.
Ok. I'll use write mmu_lock for updating lpage_info in v2 first.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 5 months, 2 weeks ago
On Wed, 2025-07-02 at 08:12 +0800, Yan Zhao wrote:
> > Then (3) could be skipped in the case of ability to demote under read lock?
> > 
> > I noticed that the other lpage_info updaters took mmu write lock, and I
> > wasn't
> > sure why. We shouldn't take a lock that we don't actually need just for
> > safety
> > margin or to copy other code.
> Using the write mmu_lock is for a reason.
> 
> In the 3 steps, 
> 1. set lpage_info
> 2. demote if needed
> 3. go to fault handler
> 
> Step 2 requires holding write mmu_lock before invoking
> kvm_split_boundary_leafs().
> The write mmu_lock may also be dropped and re-acquired in
> kvm_split_boundary_leafs(), for purposes like memory allocation.
> 
> If step 1 is done with the read mmu_lock, the other vCPUs could still fault in
> at 2MB level after the demote in step 2.
> Luckily, the current TDX module doesn't support promotion.
> But we can avoid wading into this complex situation by holding the write
> mmu_lock in step 1.

I don't think because some code might race in the future is a good reason to
take the write lock.

> 
> > > > For most accept
> > > > cases, we could fault in the PTE's on the read lock. And in the future
> > > > we
> > > > could
> > > 
> > > The actual mapping at 4KB level is still with read mmu_lock in
> > > __vmx_handle_ept_violation().
> > > 
> > > > have a demote that could work under read lock, as we talked. So
> > > > kvm_split_boundary_leafs() often or could be unneeded or work under read
> > > > lock
> > > > when needed.
> > > Could we leave the "demote under read lock" as a future optimization? 
> > 
> > We could add it to the list. If we have a TDX module that supports demote
> > with a
> > single SEAMCALL then we don't have the rollback problem. The optimization
> > could
> > utilize that. That said, we should focus on the optimizations that make the
> > biggest difference to real TDs. Your data suggests this might not be the
> > case
> > today. 
> Ok. 
>  
> > > > What is the problem in hugepage_set_guest_inhibit() that requires the
> > > > write
> > > > lock?
> > > As above, to avoid the other vCPUs reading stale mapping level and
> > > splitting
> > > under read mmu_lock.
> > 
> > We need mmu write lock for demote, but as long as the order is:
> > 1. set lpage_info
> > 2. demote if needed
> > 3. go to fault handler
> > 
> > Then (3) should have what it needs even if another fault races (1).
> See the above comment for why we need to hold write mmu_lock for 1.
> 
> Besides, as we need write mmu_lock anyway for 2 (i.e. hold write mmu_lock
> before
> walking the SPTEs to check if there's any existing mapping), I don't see any
> performance impact by holding write mmu_lock for 1.

It's a maintainability problem too. Someday someone may want to remove it and
scratch their head over what race they are missing.

> 
> 
> > > As guest_inhibit is set one-way, we could test it using
> > > hugepage_test_guest_inhibit() without holding the lock. The chance to hold
> > > write
> > > mmu_lock for hugepage_set_guest_inhibit() is then greatly reduced.
> > > (in my testing, 11 during VM boot).
> > >  
> > > > But in any case, it seems like we have *a* solution here. It doesn't
> > > > seem
> > > > like
> > > > there are any big downsides. Should we close it?
> > > I think it's good, as long as Sean doesn't disagree :)
> > 
> > He seemed onboard. Let's close it. We can even discuss lpage_info update
> > locking
> > on v2.
> Ok. I'll use write mmu_lock for updating lpage_info in v2 first.

Specifically, why do the other lpage_info updating functions take mmu write
lock. Are you sure there is no other reason?

Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 5 months, 2 weeks ago
On Wed, Jul 02, 2025 at 08:18:48AM +0800, Edgecombe, Rick P wrote:
> > > We need mmu write lock for demote, but as long as the order is:
> > > 1. set lpage_info
> > > 2. demote if needed
> > > 3. go to fault handler
> > > 
> > > Then (3) should have what it needs even if another fault races (1).
For now I implemented the sequence as

1. check lpage_info, if 2MB is already disabled for a GFN, goto 3.
2. if 2MB is not disabled,
   2.1 acquire write mmu_lock
   2.2 split the GFN mapping and kvm_flush_remote_tlbs() if split is performed
   2.3 update lpage_info to disable 2MB for the GFN
   2.4 release write mmu_lock
3. fault handler for the GFN

Note: the write mmu_lock is held across 2.2 (successfully splitting a huge GFN
entry) and 2.3, so it guarantees that there's no 2MB mapping for the GFN
after 2.3.

Step 1 helps reduce the count of write mmu_lock acquisitions from 17626 to 11.
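
In code, that sequence maps onto the draft patch above roughly as below (just a
sketch; gfn_range is constructed as in tdx_check_accept_level()):

        if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {     /* step 1, lock-free */
                scoped_guard(write_lock, &kvm->mmu_lock) {               /* 2.1 and 2.4 */
                        ret = kvm_split_boundary_leafs(kvm, &gfn_range); /* 2.2 */
                        if (ret)
                                return ret;
                        hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M); /* 2.3 */
                }
        }
        /* step 3: fall through to the fault handler for the GFN */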
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 5 months, 2 weeks ago
On Wed, Jul 02, 2025 at 08:18:48AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-07-02 at 08:12 +0800, Yan Zhao wrote:
> > > Then (3) could be skipped in the case of ability to demote under read lock?
> > > 
> > > I noticed that the other lpage_info updaters took mmu write lock, and I
> > > wasn't
> > > sure why. We shouldn't take a lock that we don't actually need just for
> > > safety
> > > margin or to copy other code.
> > Use write mmu_lock is of reason.
> > 
> > In the 3 steps, 
> > 1. set lpage_info
> > 2. demote if needed
> > 3. go to fault handler
> > 
> > Step 2 requires holding write mmu_lock before invoking
> > kvm_split_boundary_leafs().
> > The write mmu_lock is also possible to get dropped and re-acquired in
> > kvm_split_boundary_leafs() for purpose like memory allocation.
> > 
> > If 1 is with read mmu_lock, the other vCPUs is still possible to fault in at
> > 2MB
> > level after the demote in step 2.
> > Luckily, current TDX doesn't support promotion now.
> > But we can avoid wading into this complex situation by holding write mmu_lock
> > in 1.
> 
> I don't think because some code might race in the future is a good reason to
> take the write lock.

I still prefer to hold write mmu_lock right now.

Otherwise, we at least need to convert disallow_lpage to an atomic variable and
update it in an atomic way, e.g. via cmpxchg.

struct kvm_lpage_info {
        int disallow_lpage;
};
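
e.g. a purely hypothetical sketch of such a lock-free variant (not what I plan
to post) would be:

struct kvm_lpage_info {
        atomic_t disallow_lpage;
};

/* one-way, lock-free set of the inhibit flag via cmpxchg */
static void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot,
                                       gfn_t gfn, int level)
{
        atomic_t *v = &lpage_info_slot(gfn, slot, level)->disallow_lpage;
        int old = atomic_read(v);

        while (atomic_cmpxchg(v, old, old | KVM_LPAGE_GUEST_INHIBIT_FLAG) != old)
                old = atomic_read(v);
}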


> > > > > For most accept
> > > > > cases, we could fault in the PTE's on the read lock. And in the future
> > > > > we
> > > > > could
> > > > 
> > > > The actual mapping at 4KB level is still with read mmu_lock in
> > > > __vmx_handle_ept_violation().
> > > > 
> > > > > have a demote that could work under read lock, as we talked. So
> > > > > kvm_split_boundary_leafs() often or could be unneeded or work under read
> > > > > lock
> > > > > when needed.
> > > > Could we leave the "demote under read lock" as a future optimization? 
> > > 
> > > We could add it to the list. If we have a TDX module that supports demote
> > > with a
> > > single SEAMCALL then we don't have the rollback problem. The optimization
> > > could
> > > utilize that. That said, we should focus on the optimizations that make the
> > > biggest difference to real TDs. Your data suggests this might not be the
> > > case
> > > today. 
> > Ok. 
> >  
> > > > > What is the problem in hugepage_set_guest_inhibit() that requires the
> > > > > write
> > > > > lock?
> > > > As above, to avoid the other vCPUs reading stale mapping level and
> > > > splitting
> > > > under read mmu_lock.
> > > 
> > > We need mmu write lock for demote, but as long as the order is:
> > > 1. set lpage_info
> > > 2. demote if needed
> > > 3. go to fault handler
> > > 
> > > Then (3) should have what it needs even if another fault races (1).
> > See the above comment for why we need to hold write mmu_lock for 1.
> > 
> > Besides, as we need write mmu_lock anyway for 2 (i.e. hold write mmu_lock
> > before
> > walking the SPTEs to check if there's any existing mapping), I don't see any
> > performance impact by holding write mmu_lock for 1.
> 
> It's maintainability problem too. Someday someone may want to remove it and
> scratch their head for what race they are missing.
I don't get why holding the write mmu_lock will cause a maintainability problem.
In contrast, if we want to use the read mmu_lock in the future, we need to
carefully check whether there's any potential risk.

> > > > As guest_inhibit is set one-way, we could test it using
> > > > hugepage_test_guest_inhibit() without holding the lock. The chance to hold
> > > > write
> > > > mmu_lock for hugepage_set_guest_inhibit() is then greatly reduced.
> > > > (in my testing, 11 during VM boot).
> > > >  
> > > > > But in any case, it seems like we have *a* solution here. It doesn't
> > > > > seem
> > > > > like
> > > > > there are any big downsides. Should we close it?
> > > > I think it's good, as long as Sean doesn't disagree :)
> > > 
> > > He seemed onboard. Let's close it. We can even discuss lpage_info update
> > > locking
> > > on v2.
> > Ok. I'll use write mmu_lock for updating lpage_info in v2 first.
> 
> Specifically, why do the other lpage_info updating functions take mmu write
> lock. Are you sure there is no other reason?
1. The read mmu_lock can't prevent the other vCPUs from reading stale lpage_info.
2. The shadow MMU code in KVM only holds the write mmu_lock, so it updates
   lpage_info with the write mmu_lock.
3. lpage_info is not updated atomically. If two vCPUs update lpage_info
   concurrently, it may hold an invalid value.
4. lpage_info is not updated in performance-critical paths, so there's no need
   to make it work under the read mmu_lock.
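
E.g. for 3., update_gfn_disallow_lpage_count() does a plain read-modify-write,
so two concurrent updaters could lose an update:

/*
 *   linfo->disallow_lpage += count;   // not atomic
 *
 *   CPU0: reads disallow_lpage (= n)
 *   CPU1: reads disallow_lpage (= n)
 *   CPU0: writes n + count
 *   CPU1: writes n + count            <- one update is lost
 */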
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 5 months, 2 weeks ago
On Wed, 2025-07-02 at 09:07 +0800, Yan Zhao wrote:
> > I don't think because some code might race in the future is a good reason to
> > take the write lock.
> 
> I still prefer to hold write mmu_lock right now.
> 
> Otherwise, we at least need to convert disallow_lpage to atomic variable and
> updating it via an atomic way, e.g. cmpxchg. 
> 
> struct kvm_lpage_info {
>         int disallow_lpage;
> };

This seems like a valid reason. I wanted to make sure there was some reason
besides "it feels safer".
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 5 months, 3 weeks ago
On Wed, Jun 25, 2025 at 05:28:22PM +0800, Yan Zhao wrote:
> Write down my understanding to check if it's correct:
> 
> - when a TD is NOT configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL, KVM
>   always maps at 4KB
> 
> - When a TD is configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL,
Sorry, those two conditions are stale. They aren't needed any more.
So it's always
 
 (a)
 1. guest accepts at 4KB
 2. TDX sets KVM_LPAGE_GUEST_INHIBIT and tries splitting (with write mmu_lock).
 3. KVM maps at 4KB (with read mmu_lock)
 4. guest's 4KB accept succeeds.
 
 (b)
 1. guest accepts at 2MB.
 2. KVM maps at 4KB due to a certain reason.
 3. guest's accept 2MB fails with TDACCEPT_SIZE_MISMATCH.
 4. guest accepts at 4KB
 5. guest's 4KB accept succeeds.
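
(For reference, the guest side of (b) steps 1-5 amounts to something like the
below; tdx_accept_page() is a made-up helper name here.)

        unsigned long i;

        /* try to accept at 2MB first; fall back to 4KB on size mismatch */
        if (tdx_accept_page(gpa, PG_LEVEL_2M) == TDACCEPT_SIZE_MISMATCH) {
                for (i = 0; i < PTRS_PER_PMD; i++)
                        tdx_accept_page(gpa + i * PAGE_SIZE, PG_LEVEL_4K);
        }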
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 5 months, 3 weeks ago
On Wed, 2025-06-25 at 17:36 +0800, Yan Zhao wrote:
> On Wed, Jun 25, 2025 at 05:28:22PM +0800, Yan Zhao wrote:
> > Write down my understanding to check if it's correct:
> > 
> > - when a TD is NOT configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL, KVM
> >    always maps at 4KB
> > 
> > - When a TD is configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL,
> Sorry, the two conditions are stale ones. No need any more.
> So it's always
>  
>  (a)
>  1. guest accepts at 4KB
>  2. TDX sets KVM_LPAGE_GUEST_INHIBIT and tries splitting (with write mmu_lock).
>  3. KVM maps at 4KB (with read mmu_lock)
>  4. guest's 4KB accept succeeds.

Yea.

>  
>  (b)
>  1. guest accepts at 2MB.
>  2. KVM maps at 4KB due to a certain reason.

I don't follow this part. You mean because it spans a memslot, or something else?
Basically that KVM won't guarantee the page size at exactly the accept size? I
think this is ok and good. The ABI can be that KVM will guarantee the S-EPT
mapping size <= the accept size.

>  3. guest's accept 2MB fails with TDACCEPT_SIZE_MISMATCH.
>  4. guest accepts at 4KB
>  5. guest's 4KB accept succeeds.
>  
In this option the accept behavior doesn't need to change, but the
TDACCEPT_SIZE_MISMATCH in step 3 is still a little weird. The TDX module could
accept at 4K mapping size. But this is an issue for the guest to deal with, not
KVM.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Yan Zhao 5 months, 3 weeks ago
On Wed, Jun 25, 2025 at 10:48:00PM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-06-25 at 17:36 +0800, Yan Zhao wrote:
> > On Wed, Jun 25, 2025 at 05:28:22PM +0800, Yan Zhao wrote:
> > > Write down my understanding to check if it's correct:
> > > 
> > > - when a TD is NOT configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL, KVM
> > >    always maps at 4KB
> > > 
> > > - When a TD is configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL,
> > Sorry, the two conditions are stale ones. No need any more.
> > So it's always
> >  
> >  (a)
> >  1. guest accepts at 4KB
> >  2. TDX sets KVM_LPAGE_GUEST_INHIBIT and try splitting.(with write mmu_lock)
> >  3. KVM maps at 4KB (with read mmu_lock)
> >  4. guest's 4KB accept succeeds.
> 
> Yea.
> 
> >  
> >  (b)
> >  1. guest accepts at 2MB.
> >  2. KVM maps at 4KB due to a certain reason.
> 
> I don't follow this part. You mean because it spans a memslot or other?
Sorry for the confusion. (b) is the same as the current behavior.
I listed (b) just to contrast it with (a)...

KVM may map at 4KB due to adjacent shared GFNs, spanning a memslot, or because
the TDX code doesn't support huge pages at all...

> Basically that KVM won't guarantee the page size at exactly the accept size? I
> think this is ok and good. The ABI can be that KVM will guarantee the S-EPT
> mapping size <= the accept size.
Right.

> >  3. guest's accept 2MB fails with TDACCEPT_SIZE_MISMATCH.
> >  4. guest accepts at 4KB
> >  5. guest's 4KB accept succeeds.
> >  
> In this option accept behavior doesn't need to change, but the
> TDACCEPT_SIZE_MISMATCH in step 3 still is a little weird. TDX module could
> accept at 4k mapping size. But this is an issue for the guest to deal with, not
> KVM.
With the current TDX module, TDACCEPT_SIZE_MISMATCH is returned in step 3.
Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
Posted by Edgecombe, Rick P 7 months ago
On Fri, 2025-05-16 at 22:35 +0000, Huang, Kai wrote:
> > For TDs that expect #VE, guests access private memory before accepting it.
> > In that case, when KVM receives an EPT violation, there's no expected level
> > from the TDX module. Returning PG_LEVEL_4K at the end basically disables
> > huge pages for those TDs.
> 
> Just want to make sure I understand correctly:
> 
> Linux TDs always ACCEPT memory first before touching that memory, therefore
> KVM
> should always be able to get the accept level for Linux TDs.
> 
> In other words, returning PG_LEVEL_4K doesn't impact establishing large page
> mapping for Linux TDs.
> 
> However, other TDs may choose to touch memory first to receive #VE and then
> accept that memory.  Returning PG_LEVEL_2M allows those TDs to use large page
> mappings in SEPT.  Otherwise, returning PG_LEVEL_4K essentially disables large
> pages for them (since we don't support PROMOTE for now?).
> 
> But in the above text you mentioned that, if we do so, because we choose to
> ignore the splitting request on the read lock, returning 2M could result in
> *endless* EPT violations.
> 
> So to me it seems you chose a design that could bring a performance gain for
> certain non-Linux TDs when they follow a certain behaviour, but otherwise could
> result in endless EPT violations in KVM.
> 
> I am not sure how is this OK?  Or probably I have misunderstanding?

Good point. And if we just pass 4K level when the EPT violation doesn't have the
accept size, and force prefetch to 4K too, like this does, then what needs
fault-path demotion? Guest double-accept bugs?