Fold tdx_mem_page_record_premap_cnt() into tdx_sept_set_private_spte():
providing a one-off helper for effectively three lines of code is at best
a wash, and splitting the code makes the comment for smp_rmb() _extremely_
confusing, as the comment talks about reading kvm->arch.pre_fault_allowed
before kvm_tdx->state while the immediately visible code does the exact
opposite.
Opportunistically rewrite the comments to more explicitly explain who is
checking what, as well as _why_ the ordering matters.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/tdx.c | 49 ++++++++++++++++++------------------------
1 file changed, 21 insertions(+), 28 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b7559ea1e353..e4b70c0dbda3 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1608,29 +1608,6 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
return 0;
}
-/*
- * KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to map guest pages; the
- * callback tdx_gmem_post_populate() then maps pages into private memory.
- * through the a seamcall TDH.MEM.PAGE.ADD(). The SEAMCALL also requires the
- * private EPT structures for the page to have been built before, which is
- * done via kvm_tdp_map_page(). nr_premapped counts the number of pages that
- * were added to the EPT structures but not added with TDH.MEM.PAGE.ADD().
- * The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there
- * are no half-initialized shared EPT pages.
- */
-static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
- enum pg_level level, kvm_pfn_t pfn)
-{
- struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-
- if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
- return -EIO;
-
- /* nr_premapped will be decreased when tdh_mem_page_add() is called. */
- atomic64_inc(&kvm_tdx->nr_premapped);
- return 0;
-}
-
static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)
{
@@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
return -EIO;
/*
- * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
- * barrier in tdx_td_finalize().
+ * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory()
+ * before kvm_tdx->state. Userspace must not be allowed to pre-fault
+ * arbitrary memory until the initial memory image is finalized. Pairs
+ * with the smp_wmb() in tdx_td_finalize().
*/
smp_rmb();
- if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
- return tdx_mem_page_aug(kvm, gfn, level, pfn);
- return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
+ /*
+ * If the TD isn't finalized/runnable, then userspace is initializing
+ * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of
+ * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
+ * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks
+ * the counter to ensure all mapped pages have been added to the image,
+ * to prevent running the TD with uninitialized memory.
+ */
+ if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
+ if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
+ return -EIO;
+
+ atomic64_inc(&kvm_tdx->nr_premapped);
+ return 0;
+ }
+
+ return tdx_mem_page_aug(kvm, gfn, level, pfn);
}
static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
--
2.51.0.268.g9569e192d0-goog
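
For reference, a minimal userspace model of the publish/consume pairing the
new comment describes. This is illustrative C11 code, not the kernel
implementation; td_state and pre_fault_allowed simply mirror the kernel
fields, and the two kernel readers (kvm_arch_vcpu_pre_fault_memory() and
tdx_sept_set_private_spte()) are collapsed into one function. The point is
that the finalize side makes RUNNABLE visible before it allows pre-faulting,
so a reader that observes pre_fault_allowed as true is guaranteed to also
observe TD_STATE_RUNNABLE.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum td_state { TD_STATE_INITIALIZING, TD_STATE_RUNNABLE };

static _Atomic int td_state = TD_STATE_INITIALIZING;
static atomic_bool pre_fault_allowed;

/* Models tdx_td_finalize(): publish RUNNABLE before allowing pre-faults. */
static void finalize_td(void)
{
	atomic_store_explicit(&td_state, TD_STATE_RUNNABLE, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);		/* smp_wmb() */
	atomic_store_explicit(&pre_fault_allowed, true, memory_order_relaxed);
}

/* Read side, collapsed into one function: the flag first, then the state. */
static bool fault_path_consistent(void)
{
	bool allowed = atomic_load_explicit(&pre_fault_allowed, memory_order_relaxed);
	bool runnable;

	atomic_thread_fence(memory_order_acquire);		/* smp_rmb() */
	runnable = atomic_load_explicit(&td_state, memory_order_relaxed) == TD_STATE_RUNNABLE;

	/* If the flag was observed as set, the state write must be visible too. */
	return !allowed || runnable;
}

int main(void)
{
	finalize_td();
	printf("consistent: %d\n", fault_path_consistent());
	return 0;
}
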
On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: > Fold tdx_mem_page_record_premap_cnt() into tdx_sept_set_private_spte() as > providing a one-off helper for effectively three lines of code is at best a > wash, and splitting the code makes the comment for smp_rmb() _extremely_ > confusing as the comment talks about reading kvm->arch.pre_fault_allowed > before kvm_tdx->state, but the immediately visible code does the exact > opposite. > > Opportunistically rewrite the comments to more explicitly explain who is > checking what, as well as _why_ the ordering matters. > > No functional change intended. > > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/kvm/vmx/tdx.c | 49 ++++++++++++++++++------------------------ > 1 file changed, 21 insertions(+), 28 deletions(-) > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index b7559ea1e353..e4b70c0dbda3 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -1608,29 +1608,6 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn, > return 0; > } > > -/* > - * KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to map guest pages; the > - * callback tdx_gmem_post_populate() then maps pages into private memory. > - * through the a seamcall TDH.MEM.PAGE.ADD(). The SEAMCALL also requires the > - * private EPT structures for the page to have been built before, which is > - * done via kvm_tdp_map_page(). nr_premapped counts the number of pages that > - * were added to the EPT structures but not added with TDH.MEM.PAGE.ADD(). > - * The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there > - * are no half-initialized shared EPT pages. > - */ > -static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn, > - enum pg_level level, kvm_pfn_t pfn) > -{ > - struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > - > - if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) > - return -EIO; > - > - /* nr_premapped will be decreased when tdh_mem_page_add() is called. */ > - atomic64_inc(&kvm_tdx->nr_premapped); > - return 0; > -} > - > static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > enum pg_level level, kvm_pfn_t pfn) > { > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > return -EIO; > > /* > - * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching > - * barrier in tdx_td_finalize(). > + * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory() > + * before kvm_tdx->state. Userspace must not be allowed to pre-fault > + * arbitrary memory until the initial memory image is finalized. Pairs > + * with the smp_wmb() in tdx_td_finalize(). > */ > smp_rmb(); > - if (likely(kvm_tdx->state == TD_STATE_RUNNABLE)) > - return tdx_mem_page_aug(kvm, gfn, level, pfn); > > - return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn); > + /* > + * If the TD isn't finalized/runnable, then userspace is initializing > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > + * the counter to ensure all mapped pages have been added to the image, > + * to prevent running the TD with uninitialized memory. To prevent the mismatch between mirror EPT and the S-EPT? e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. 
The count of nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and TDH.MEM.PAGE.REMOVE. As a result, the TD will still run with uninitialized memory. > + */ > + if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) { > + if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) > + return -EIO; > + > + atomic64_inc(&kvm_tdx->nr_premapped); > + return 0; > + } > + > + return tdx_mem_page_aug(kvm, gfn, level, pfn); > } > > static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, > -- > 2.51.0.268.g9569e192d0-goog >
Yan Zhao wrote: > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: [snip] > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > return -EIO; > > > > /* > > - * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching > > - * barrier in tdx_td_finalize(). > > + * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory() > > + * before kvm_tdx->state. Userspace must not be allowed to pre-fault > > + * arbitrary memory until the initial memory image is finalized. Pairs > > + * with the smp_wmb() in tdx_td_finalize(). > > */ > > smp_rmb(); > > - if (likely(kvm_tdx->state == TD_STATE_RUNNABLE)) > > - return tdx_mem_page_aug(kvm, gfn, level, pfn); > > > > - return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn); > > + /* > > + * If the TD isn't finalized/runnable, then userspace is initializing > > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > > + * the counter to ensure all mapped pages have been added to the image, > > + * to prevent running the TD with uninitialized memory. > To prevent the mismatch between mirror EPT and the S-EPT? > > e.g., Before KVM_TDX_FINALIZE_VM, > if userspace performs a zap after the TDH.MEM.PAGE.ADD, the page will be removed > from the S-EPT. The count of nr_premapped will not change after the successful > TDH.MEM.RANGE.BLOCK and TDH.MEM.PAGE.REMOVE. > > As a result, the TD will still run with uninitialized memory. I'm wondering if we are trying to over-architect this. Should we even allow KVM_TDX_FINALIZE_VM to race with KVM_TDX_INIT_MEM_REGION? What is the use case for that? It seems a basic sanity check/KVM_BUG_ON would suffice to tell the user; Don't start adding memory dynamically until yall have finalized the VM. Ira
On Wed, Aug 27, 2025, Yan Zhao wrote: > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > return -EIO; > > > > /* > > - * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching > > - * barrier in tdx_td_finalize(). > > + * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory() > > + * before kvm_tdx->state. Userspace must not be allowed to pre-fault > > + * arbitrary memory until the initial memory image is finalized. Pairs > > + * with the smp_wmb() in tdx_td_finalize(). > > */ > > smp_rmb(); > > - if (likely(kvm_tdx->state == TD_STATE_RUNNABLE)) > > - return tdx_mem_page_aug(kvm, gfn, level, pfn); > > > > - return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn); > > + /* > > + * If the TD isn't finalized/runnable, then userspace is initializing > > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > > + * the counter to ensure all mapped pages have been added to the image, > > + * to prevent running the TD with uninitialized memory. > To prevent the mismatch between mirror EPT and the S-EPT? No? Because KVM bumps the count when installing the S-EPT and decrements it on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues? > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and > TDH.MEM.PAGE.REMOVE. Eww. It would be nice to close that hole, but I suppose it's futile, e.g. the underlying problem is unexpectedly removing pages from the initial, whether the VMM is doing stupid things before vs. after FINALIZE doesn't really matter. > As a result, the TD will still run with uninitialized memory. No? Because BLOCK+REMOVE means there are no valid S-EPT mappings. There's a "hole" that the guest might not expect, but that hole will trigger an EPT violation and only get "filled" if the guest explicitly accepts an AUG'd page. Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing nice with tdh_mem_page_add() failure necessitates both the tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize() that all pending pages have been consumed. What reasonable use case is there for gracefully handling tdh_mem_page_add() failure? If there is a need to handle failure, I gotta imagine it's only for the -EBUSY case. And if it's only for -EBUSY, why can't that be handled by retrying in tdx_vcpu_init_mem_region()? If tdx_vcpu_init_mem_region() guarantees that all pages mapped into the S-EPT are ADDed, then it can assert that there are no pending pages when it completes (even if it "fails"), and similarly tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being non-zero. > > + */ > > + if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) { > > + if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) > > + return -EIO; > > + > > + atomic64_inc(&kvm_tdx->nr_premapped); > > + return 0; > > + } > > + > > + return tdx_mem_page_aug(kvm, gfn, level, pfn); > > } > > > > static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, > > -- > > 2.51.0.268.g9569e192d0-goog > >
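
Purely as a sketch of the retry idea floated above, and not the current
kernel behavior (which simply fails the ioctl with -EBUSY), a bounded
retry that distinguishes a busy resource from a fatal error could look
roughly like the following; fake_page_add() is a made-up stand-in for the
SEAMCALL wrapper.

#include <errno.h>
#include <stdio.h>

/* Stand-in for the SEAMCALL wrapper; reports busy a couple of times, then 0. */
static int fake_page_add(void)
{
	static int busy_left = 2;

	return busy_left-- > 0 ? -EBUSY : 0;
}

static int add_with_retry(int max_retries)
{
	for (int i = 0; i <= max_retries; i++) {
		int ret = fake_page_add();

		if (ret != -EBUSY)
			return ret;	/* success, or a fatal error */
	}
	return -EBUSY;			/* still contended, give up */
}

int main(void)
{
	printf("add_with_retry() = %d\n", add_with_retry(5));
	return 0;
}
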
Sean Christopherson wrote: > On Wed, Aug 27, 2025, Yan Zhao wrote: > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: > > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > > return -EIO; [snip] > > > /* > > > - * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching > > > - * barrier in tdx_td_finalize(). > > > + * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory() > > > + * before kvm_tdx->state. Userspace must not be allowed to pre-fault > > > + * arbitrary memory until the initial memory image is finalized. Pairs > > > + * with the smp_wmb() in tdx_td_finalize(). > > > */ > > > smp_rmb(); > > > - if (likely(kvm_tdx->state == TD_STATE_RUNNABLE)) > > > - return tdx_mem_page_aug(kvm, gfn, level, pfn); > > > > > > - return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn); > > > + /* > > > + * If the TD isn't finalized/runnable, then userspace is initializing > > > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > > > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > > > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > > > + * the counter to ensure all mapped pages have been added to the image, > > > + * to prevent running the TD with uninitialized memory. > > To prevent the mismatch between mirror EPT and the S-EPT? > > No? Because KVM bumps the count when installing the S-EPT and decrements it > on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues? > > > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the > > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of > > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and > > TDH.MEM.PAGE.REMOVE. > > Eww. It would be nice to close that hole, but I suppose it's futile, e.g. the > underlying problem is unexpectedly removing pages from the initial, whether the > VMM is doing stupid things before vs. after FINALIZE doesn't really matter. > > > As a result, the TD will still run with uninitialized memory. > > No? Because BLOCK+REMOVE means there are no valid S-EPT mappings. There's a > "hole" that the guest might not expect, but that hole will trigger an EPT > violation and only get "filled" if the guest explicitly accepts an AUG'd page. > > Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing > nice with tdh_mem_page_add() failure necessitates both the > tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize() > that all pending pages have been consumed. > > What reasonable use case is there for gracefully handling tdh_mem_page_add() failure? > > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY > case. And if it's only for -EBUSY, why can't that be handled by retrying in > tdx_vcpu_init_mem_region()? If tdx_vcpu_init_mem_region() guarantees that all > pages mapped into the S-EPT are ADDed, then it can assert that there are no > pending pages when it completes (even if it "fails"), and similarly > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being > non-zero. Ah just reading this... yea I'm wondering the same thing. Ira [snip]
On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote: > On Wed, Aug 27, 2025, Yan Zhao wrote: > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: > > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > > return -EIO; > > > > > > /* > > > - * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching > > > - * barrier in tdx_td_finalize(). > > > + * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory() > > > + * before kvm_tdx->state. Userspace must not be allowed to pre-fault > > > + * arbitrary memory until the initial memory image is finalized. Pairs > > > + * with the smp_wmb() in tdx_td_finalize(). > > > */ > > > smp_rmb(); > > > - if (likely(kvm_tdx->state == TD_STATE_RUNNABLE)) > > > - return tdx_mem_page_aug(kvm, gfn, level, pfn); > > > > > > - return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn); > > > + /* > > > + * If the TD isn't finalized/runnable, then userspace is initializing > > > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > > > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > > > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > > > + * the counter to ensure all mapped pages have been added to the image, > > > + * to prevent running the TD with uninitialized memory. > > To prevent the mismatch between mirror EPT and the S-EPT? > > No? Because KVM bumps the count when installing the S-EPT and decrements it > on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues? Hmm, I think there must be some misunderstanding. Before userspace invokes KVM_TDX_FINALIZE_VM, ======= 1. the normal path (userspace invokes KVM_TDX_INIT_MEM_REGION). (1) KVM holds slot_lock and filemap lock. (2) KVM invokes kvm_tdp_map_page() (or kvm_tdp_mmu_map_private_pfn() in patch 2). KVM increases nr_premapped in tdx_sept_set_private_spte() to indicate that there's a page mapped in M-EPT, while it's not yet installed in S-EPT. (3) KVM invokes TDH.MEM.PAGE.ADD and decreases nr_premapped, indicating the page has been mapped in S-EPT too. As the name of nr_premapped indicates, the count means a page is pre-mapped in the M-EPT, before its real mapping in the S-EPT. If ADD fails in step (3), nr_premapped will not be decreased. With mere the normal path, nr_premapped should return back to 0 after all KVM_TDX_INIT_MEM_REGIONs. 2. Expected zap paths (e.g. If userspace does something strange, such as removing a slot after KVM_TDX_INIT_MEM_REGION) Those zap paths could be triggered by 1) userspace performs a page attribute conversion 2) userspace invokes gmem punch hole 3) userspace removes a slot As all those paths either hold a slot_lock or a filemap lock, they can't contend with tdx_vcpu_init_mem_region() (tdx_vcpu_init_mem_region holds both slot_lock and internally filemap lock). Consequently, those zaps must occur a) before kvm_tdp_map_page() or b) after TDH.MEM.PAGE.ADD. For a), tdx_sept_zap_private_spte() won't not be invoked as the page is not mapped in M-EPT either; For b), tdx_sept_zap_private_spte() should succeed, as the BLOCK and REMOVE SEAMCALLs are following the ADD. nr_premapped is therere unchanged, since it does not change the consistency between M-EPT and S-EPT. 3. Unexpected zaps (such as kvm_zap_gfn_range()). Those zaps are currently just paranoid ones. Not found in any existing paths yet. 
i.e., We want to detect any future code or any missed code piecies, which invokes kvm_zap_gfn_range() (or maybe zaps under read mmu_lock). As those zaps do not necessarily hold slot_lock or filemap lock, they may ocurr after installing M-EPT and before installing S-EPT. As a result, the BLOCK fails and tdx_is_sept_zap_err_due_to_premap() returns true. Decreasing nr_premapped here to indicate the count of pages mapped in M-EPT but not in S-EPT decreases. TDH.MEM.PAGE.ADD after this zap can still succeed. If this occurs, the page will be mapped in S-EPT only. As KVM also decreases nr_premapped after a successful TDH.MEM.PAGE.ADD, the nr_premapped will be <0 in the end. So, we will be able to detect those unexpected zaps. When userspace invokes KVM_TDX_FINALIZE_VM, ======= The nr_premapped must be 0 before tdx_td_finalize() succeeds. The nr_premapped could be 0 if (1) userspace invokes KVM_TDX_INIT_MEM_REGIONs as in a normal way. (2) userspace never triggers any KVM_TDX_INIT_MEM_REGION. (3) userspace triggers KVM_TDX_INIT_MEM_REGION but zaps all initial memory regions. For (2)and(3), KVM_TDX_FINALIZE_VM can still succeed. So, TD can still run with uninitialized memory. > > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the > > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of > > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and > > TDH.MEM.PAGE.REMOVE. > > Eww. It would be nice to close that hole, but I suppose it's futile, e.g. the > underlying problem is unexpectedly removing pages from the initial, whether the > VMM is doing stupid things before vs. after FINALIZE doesn't really matter. Are you referring to the above "case 2 Expected zap paths"? It's equal to that userspace never triggers any KVM_TDX_INIT_MEM_REGION. We can't force userspace must invoke KVM_TDX_INIT_MEM_REGION after all. I don't think there's a hole from the guest point of view. See below. > > As a result, the TD will still run with uninitialized memory. > > No? Because BLOCK+REMOVE means there are no valid S-EPT mappings. There's a > "hole" that the guest might not expect, but that hole will trigger an EPT > violation and only get "filled" if the guest explicitly accepts an AUG'd page. If TD runs with unintialized memory, - for the linux guest, it will cause TD to access unaccepted memory and get killed by KVM; - for non-linux guest which configured with #VE, the guest will see #VE and be informed of that the page must be accepted before access. Though the guest should not be able to run without any initial code, but there's not any security problem. > Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing We don't. It returns -EBUSY or -EIO immediately. > nice with tdh_mem_page_add() failure necessitates both the > tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize() > that all pending pages have been consumed. tdx_is_sept_zap_err_due_to_premap() detects the error of BLOCK, which is caused by executing BLOCK before ADD. > What reasonable use case is there for gracefully handling tdh_mem_page_add() failure? If tdh_mem_page_add() fails, the KVM_TDX_INIT_MEM_REGION just fails. > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY > case. And if it's only for -EBUSY, why can't that be handled by retrying in > tdx_vcpu_init_mem_region()? 
If tdx_vcpu_init_mem_region() guarantees that all I analyzed the contention status of tdh_mem_sept_add() at https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com. As the userspace is expected to execute KVM_TDX_INIT_MEM_REGION in only one vCPU, returning -EBUSY instead of retrying looks safer and easier. > pages mapped into the S-EPT are ADDed, then it can assert that there are no > pending pages when it completes (even if it "fails"), and similarly > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being > non-zero. tdx_td_finalize() now just returns -EINVAL in case of nr_premapped being !0. KVM_BUG_ON/WARN_ON should be also ok. > > > + */ > > > + if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) { > > > + if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) > > > + return -EIO; > > > + > > > + atomic64_inc(&kvm_tdx->nr_premapped); > > > + return 0; > > > + } > > > + > > > + return tdx_mem_page_aug(kvm, gfn, level, pfn); > > > } > > > > > > static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, > > > -- > > > 2.51.0.268.g9569e192d0-goog > > >
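
To make the accounting above concrete, here is a rough single-threaded
userspace model of the nr_premapped lifecycle on the normal path. The helper
names are made up for illustration; the real counter is an atomic in struct
kvm_tdx, and the zap edge cases described above are omitted. The count goes
up when the mirror EPT entry is installed, comes back down only after a
successful TDH.MEM.PAGE.ADD, and KVM_TDX_FINALIZE_VM refuses to proceed
while it is non-zero.

#include <stdbool.h>
#include <stdio.h>

static long nr_premapped;

/* Step (2): page installed in the mirror EPT before the TD is finalized. */
static void map_in_mirror_ept(void)
{
	nr_premapped++;
}

/* Step (3): page actually added to the S-EPT; only then drop the count. */
static bool page_add(bool seamcall_succeeds)
{
	if (!seamcall_succeeds)
		return false;	/* ADD failed: the count stays elevated */
	nr_premapped--;
	return true;
}

/* KVM_TDX_FINALIZE_VM: refuse to run the TD while pages are still pending. */
static bool td_finalize(void)
{
	return nr_premapped == 0;
}

int main(void)
{
	map_in_mirror_ept();
	page_add(true);
	printf("finalize after clean ADD:  %s\n", td_finalize() ? "ok" : "rejected");

	map_in_mirror_ept();
	page_add(false);
	printf("finalize after failed ADD: %s\n", td_finalize() ? "ok" : "rejected");
	return 0;
}
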
On Thu, Aug 28, 2025, Yan Zhao wrote: > On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote: > > On Wed, Aug 27, 2025, Yan Zhao wrote: > > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: > > > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > > > + /* > > > > + * If the TD isn't finalized/runnable, then userspace is initializing > > > > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > > > > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > > > > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > > > > + * the counter to ensure all mapped pages have been added to the image, > > > > + * to prevent running the TD with uninitialized memory. > > > To prevent the mismatch between mirror EPT and the S-EPT? > > > > No? Because KVM bumps the count when installing the S-EPT and decrements it > > on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues? > Hmm, I think there must be some misunderstanding. Yeah, I forgot that AUG and ADD create the leaf S-EPT entries. > Before userspace invokes KVM_TDX_FINALIZE_VM, > ======= > 1. the normal path (userspace invokes KVM_TDX_INIT_MEM_REGION). > (1) KVM holds slot_lock and filemap lock. > (2) KVM invokes kvm_tdp_map_page() (or kvm_tdp_mmu_map_private_pfn() in > patch 2). > KVM increases nr_premapped in tdx_sept_set_private_spte() to indicate > that there's a page mapped in M-EPT, while it's not yet installed in > S-EPT. > (3) KVM invokes TDH.MEM.PAGE.ADD and decreases nr_premapped, indicating the > page has been mapped in S-EPT too. > > As the name of nr_premapped indicates, the count means a page is pre-mapped > in the M-EPT, before its real mapping in the S-EPT. > If ADD fails in step (3), nr_premapped will not be decreased. > > With mere the normal path, nr_premapped should return back to 0 after all > KVM_TDX_INIT_MEM_REGIONs. > > > 2. Expected zap paths (e.g. If userspace does something strange, such as > removing a slot after KVM_TDX_INIT_MEM_REGION) > Those zap paths could be triggered by > 1) userspace performs a page attribute conversion > 2) userspace invokes gmem punch hole > 3) userspace removes a slot > As all those paths either hold a slot_lock or a filemap lock, they can't > contend with tdx_vcpu_init_mem_region() (tdx_vcpu_init_mem_region holds both > slot_lock and internally filemap lock). > Consequently, those zaps must occur > a) before kvm_tdp_map_page() or > b) after TDH.MEM.PAGE.ADD. > For a), tdx_sept_zap_private_spte() won't not be invoked as the page is not > mapped in M-EPT either; > For b), tdx_sept_zap_private_spte() should succeed, as the BLOCK and REMOVE > SEAMCALLs are following the ADD. > nr_premapped is therere unchanged, since it does not change the consistency > between M-EPT and S-EPT. > > 3. Unexpected zaps (such as kvm_zap_gfn_range()). Side topic related to kvm_zap_gfn_range(), the KVM_BUG_ON() in vt_refresh_apicv_exec_ctrl() is flawed. If kvm_recalculate_apic_map() fails to allocate an optimized map, KVM will mark APICv as inhibited, i.e. the associated WARN_ON_ONCE() is effectively user-triggerable. Easiest thing would be to mark the vCPU as dead (though we obviously need "KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests" for that to be robust). 
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c index dbab1c15b0cd..1c0b43ff9544 100644 --- a/arch/x86/kvm/vmx/main.c +++ b/arch/x86/kvm/vmx/main.c @@ -719,7 +719,8 @@ static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu) static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) { if (is_td_vcpu(vcpu)) { - KVM_BUG_ON(!kvm_vcpu_apicv_active(vcpu), vcpu->kvm); + if (!kvm_vcpu_apicv_active(vcpu)) + kvm_make_request(KVM_REQ_VM_DEAD, vcpu); return; } > Those zaps are currently just paranoid ones. Not found in any existing paths > yet. i.e., > We want to detect any future code or any missed code piecies, which invokes > kvm_zap_gfn_range() (or maybe zaps under read mmu_lock). > > As those zaps do not necessarily hold slot_lock or filemap lock, they may > ocurr after installing M-EPT and before installing S-EPT. > As a result, the BLOCK fails and tdx_is_sept_zap_err_due_to_premap() returns > true. > Decreasing nr_premapped here to indicate the count of pages mapped in M-EPT > but not in S-EPT decreases. > > TDH.MEM.PAGE.ADD after this zap can still succeed. If this occurs, the page > will be mapped in S-EPT only. As KVM also decreases nr_premapped after a > successful TDH.MEM.PAGE.ADD, the nr_premapped will be <0 in the end. > So, we will be able to detect those unexpected zaps. > > > When userspace invokes KVM_TDX_FINALIZE_VM, > ======= > The nr_premapped must be 0 before tdx_td_finalize() succeeds. > > The nr_premapped could be 0 if > (1) userspace invokes KVM_TDX_INIT_MEM_REGIONs as in a normal way. > (2) userspace never triggers any KVM_TDX_INIT_MEM_REGION. > (3) userspace triggers KVM_TDX_INIT_MEM_REGION but zaps all initial memory > regions. > > For (2)and(3), KVM_TDX_FINALIZE_VM can still succeed. Ya, we're in agreement on what can happen. I think all of the confusion was due to me forgetting that TDH.MEM.PAGE.ADD is what actually installes the leaf S-EPT entry. > So, TD can still run with uninitialized memory. No, the TD can never run with truly uninitialized memory. By "uninitialized", I mean memory that the guest can access and which has not been written to. Again, my confusion was due to thinking a page was already mapped into the guest, but awaiting TDH.MEM.PAGE.ADD to > > Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing > We don't. It returns -EBUSY or -EIO immediately. But that _is_ tolerating failure, in the sense that KVM doesn't prevent further actions on the VM. Tolerating failure is fine in general, but in this case it leaves the MMU is left in a half-baked state. > > nice with tdh_mem_page_add() failure necessitates both the > > tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize() > > that all pending pages have been consumed. > > tdx_is_sept_zap_err_due_to_premap() detects the error of BLOCK, which is caused > by executing BLOCK before ADD. We need to make this situation impossible. > > What reasonable use case is there for gracefully handling tdh_mem_page_add() failure? > If tdh_mem_page_add() fails, the KVM_TDX_INIT_MEM_REGION just fails. > > > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY > > case. And if it's only for -EBUSY, why can't that be handled by retrying in > > tdx_vcpu_init_mem_region()? If tdx_vcpu_init_mem_region() guarantees that all > I analyzed the contention status of tdh_mem_sept_add() at > https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com. 
> > As the userspace is expected to execute KVM_TDX_INIT_MEM_REGION in only one > vCPU, returning -EBUSY instead of retrying looks safer and easier. > > > pages mapped into the S-EPT are ADDed, then it can assert that there are no > > pending pages when it completes (even if it "fails"), and similarly > > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being > > non-zero. > tdx_td_finalize() now just returns -EINVAL in case of nr_premapped being !0. > KVM_BUG_ON/WARN_ON should be also ok. Ok, so I vaguely recall that I may have pushed back on using a scratch field in "struct kvm_tdx" for temporary data (or maybe it was abusing vcpus[0] that I disliked?), but what we ended up with is far worse. For all intents and purposes, nr_premapped _is_ a temporary scratch field, but with access rules that are all but impossible to understand, e.g. there's practically zero chance anyone could suss out complications with "Unexpected zaps", and the existence of that super subtle edge case necessitates using an atomic because KVM can't strictly guarantee that access to the field is mutually exclusive. And that also means it's inherently racy, e.g. if a zap happens while tdx_td_finalize() is checking nr_premapped, what happens? The real killer is that deferring TDH.MEM.PAGE.ADD and TDH.MR_EXTEND until after the map completes and mmu_lock is dropped means that failure at that point leaves the TDP MMU in an inconsistent state, where the M-EPT has a present page but the S-EPT does not. Eww. Note, in no way am I trying to blame anyone; quite the opposite, you've done an admirable job to get all of this landed. And I apologize if any of my past feedback led y'all down this road. I suspect my prefaulting idea really screwed things up; sorry :-( Back to the code, unless I'm missing yet another complication, I think the obvious fix to all of this is to pass the source page and metadata flags via a scratch field in "struct kvm_tdx", and then do PAGE.ADD and MR.EXTEND in tdx_sept_set_private_spte(). Then there is no need to keep track of pending pages, because the M-EPT and S-EPT are always consistent. E.g. if PAGE.ADD fails with -EBUSY, then KVM will naturally revert the M-EPT entry from FROZEN to !PRESENT. It also allows KVM to KVM_BUG_ON() MR.EXTEND failure, because it should be impossible for the S-EPT entry to be modified between PAGE.ADD and MR.EXTEND. Diff on top below for feedback on the idea. A proper series for this would simply replace several of the patches, e.g. asserting that slots_lock is held on tdx_is_sept_zap_err_due_to_premap() is wrong. 
--- arch/x86/kvm/vmx/tdx.c | 157 +++++++++++++++++------------------------ arch/x86/kvm/vmx/tdx.h | 11 +-- 2 files changed, 70 insertions(+), 98 deletions(-) diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index f9ac590e8ff0..5d981a061442 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -1586,6 +1586,56 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level) td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa); } + +struct kvm_tdx_page_add { + struct page *src; + unsigned long flags; +}; + +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level, + kvm_pfn_t pfn) +{ + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); + u64 err, entry, level_state; + gpa_t gpa = gfn_to_gpa(gfn); + int i; + + lockdep_assert_held(&kvm->slots_lock); + + if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) || + KVM_BUG_ON(!kvm_tdx->page_add, kvm)) + return -EIO; + + err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), + kvm_tdx->page_add->src, &entry, &level_state); + if (unlikely(tdx_operand_busy(err))) + return -EBUSY; + + if (KVM_BUG_ON(err, kvm)) { + pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state); + return -EIO; + } + + if (!(kvm_tdx->page_add->flags & KVM_TDX_MEASURE_MEMORY_REGION)) + return 0; + + /* + * Extend the measurement while holding mmu_lock to ensure MR.EXTEND + * can't fail, e.g. due to the S-EPT entry being zapped after PAGE.ADD. + * Note! If extended the measurement fails, bug the VM, but do NOT + * return an error, as mapping the page in the S-EPT succeeded and so + * needs to be tracked in KVM's mirror page tables. + */ + for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { + err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state); + if (KVM_BUG_ON(err, kvm)) { + pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state); + break; + } + } + return 0; +} + static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn, enum pg_level level, kvm_pfn_t pfn) { @@ -1627,21 +1677,11 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, /* * If the TD isn't finalized/runnable, then userspace is initializing - * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of - * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD - * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks - * the counter to ensure all mapped pages have been added to the image, - * to prevent running the TD with uninitialized memory. + * the VM image via KVM_TDX_INIT_MEM_REGION. Add the page to the TD, + * and optionally extend the measurement with the page contents. */ - if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) { - lockdep_assert_held(&kvm->slots_lock); - - if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) - return -EIO; - - kvm_tdx->nr_pending_tdh_mem_page_adds++; - return 0; - } + if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) + return tdx_mem_page_add(kvm, gfn, level, pfn); return tdx_mem_page_aug(kvm, gfn, level, pfn); } @@ -1716,39 +1756,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn, return 0; } -/* - * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is - * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called - * successfully. - * - * Since tdh_mem_sept_add() must have been invoked successfully before a - * non-leaf entry present in the mirrored page table, the SEPT ZAP related - * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. 
They should instead - * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the - * SEPT. - * - * Further check if the returned entry from SEPT walking is with RWX permissions - * to filter out anything unexpected. - * - * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from - * level_state returned from a SEAMCALL error is the same as that passed into - * the SEAMCALL. - */ -static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err, - u64 entry, int level) -{ - if (!err || kvm_tdx->state == TD_STATE_RUNNABLE) - return false; - - if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX)) - return false; - - if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK))) - return false; - - return true; -} - static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level, struct page *page) { @@ -1768,15 +1775,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state); tdx_no_vcpus_enter_stop(kvm); } - if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) { - lockdep_assert_held(&kvm->slots_lock); - - if (KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm)) - return -EIO; - - return 0; - } - if (KVM_BUG_ON(err, kvm)) { pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state); return -EIO; @@ -2842,12 +2840,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd) if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE) return -EINVAL; - /* - * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue - * TDH.MEM.PAGE.ADD(). - */ - if (kvm_tdx->nr_pending_tdh_mem_page_adds) - return -EINVAL; cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td); if (tdx_operand_busy(cmd->hw_error)) @@ -3131,50 +3123,29 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, { struct tdx_gmem_post_populate_arg *arg = _arg; struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); - u64 err, entry, level_state; - gpa_t gpa = gfn_to_gpa(gfn); - struct page *src_page; - int ret, i; + struct kvm_tdx_page_add page_add = { + .flags = arg->flags, + }; + int ret; - lockdep_assert_held(&kvm->slots_lock); + if (KVM_BUG_ON(kvm_tdx->page_add, kvm)) + return -EIO; /* * Get the source page if it has been faulted in. Return failure if the * source page has been swapped out or unmapped in primary memory. */ - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); + ret = get_user_pages_fast((unsigned long)src, 1, 0, &page_add.src); if (ret < 0) return ret; if (ret != 1) return -ENOMEM; + kvm_tdx->page_add = &page_add; ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); - if (ret < 0) - goto out; + kvm_tdx->page_add = NULL; - ret = 0; - err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), - src_page, &entry, &level_state); - if (err) { - ret = unlikely(tdx_operand_busy(err)) ? 
-EBUSY : -EIO; - goto out; - } - - KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm); - - if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) { - for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { - err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, - &level_state); - if (err) { - ret = -EIO; - break; - } - } - } - -out: - put_page(src_page); + put_page(page_add.src); return ret; } diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h index 45d86f9fa41c..39e0c3bcc866 100644 --- a/arch/x86/kvm/vmx/tdx.h +++ b/arch/x86/kvm/vmx/tdx.h @@ -21,6 +21,8 @@ enum kvm_tdx_state { TD_STATE_RUNNABLE, }; +struct kvm_tdx_add_page; + struct kvm_tdx { struct kvm kvm; @@ -37,12 +39,11 @@ struct kvm_tdx { struct tdx_td td; /* - * The number of pages that KVM_TDX_INIT_MEM_REGION has mapped into the - * S-EPT, but not yet initialized via TDH.MEM.PAGE_ADD. Used to sanity - * check adding pages to the image, and to ensure that all pages have - * been initialized before finalizing the TD. + * Scratch structure used to pass the source page and metadata flags to + * tdx_mem_page_add. Protected by slots_lock, and non-NULL only when + * mapping a private pfn via tdx_gmem_post_populate(). */ - unsigned long nr_pending_tdh_mem_page_adds; + struct kvm_tdx_page_add *page_add; /* * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do base-commit: 7c7a3893b102bdeb4826f7140280b7b16081b385 --
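
A condensed userspace restatement of the control flow in the diff above,
with the kernel details stubbed out. The types and helpers here are
illustrative only; in the real code kvm->slots_lock is held across the whole
sequence, which is what makes a bare scratch pointer in struct kvm_tdx safe.
tdx_gmem_post_populate() publishes the source page and flags, the map call
faults into tdx_sept_set_private_spte(), which consumes them in
tdx_mem_page_add(), and the pointer is cleared before returning.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct page_add_args {
	const void *src;		/* source page to copy into the TD */
	unsigned long flags;		/* e.g. "measure this region" */
};

struct td {
	struct page_add_args *page_add;	/* non-NULL only inside post_populate() */
	bool runnable;
};

/* Models tdx_mem_page_add(): consumes the scratch args set by the caller. */
static int mem_page_add(struct td *td, unsigned long gfn)
{
	if (!td->page_add)
		return -1;	/* would be KVM_BUG_ON() in the real code */
	printf("TDH.MEM.PAGE.ADD gfn %#lx, flags %#lx\n", gfn, td->page_add->flags);
	return 0;
}

/* Models the fault callback reached from the map call below. */
static int set_private_spte(struct td *td, unsigned long gfn)
{
	return td->runnable ? 0 /* AUG path */ : mem_page_add(td, gfn);
}

/* Models tdx_gmem_post_populate(): publish args, map, then clear them. */
static int post_populate(struct td *td, unsigned long gfn, const void *src)
{
	struct page_add_args args = { .src = src, .flags = 0x1 };
	int ret;

	td->page_add = &args;
	ret = set_private_spte(td, gfn);	/* stands in for the map + fault */
	td->page_add = NULL;
	return ret;
}

int main(void)
{
	struct td td = { 0 };
	char src[] = "initial image";

	return post_populate(&td, 0x1000, src);
}
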
On Thu, Aug 28, 2025 at 10:00:28AM -0700, Sean Christopherson wrote: > On Thu, Aug 28, 2025, Yan Zhao wrote: > > On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote: > > > On Wed, Aug 27, 2025, Yan Zhao wrote: > > > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: > > > > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > > > > + /* > > > > > + * If the TD isn't finalized/runnable, then userspace is initializing > > > > > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > > > > > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > > > > > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > > > > > + * the counter to ensure all mapped pages have been added to the image, > > > > > + * to prevent running the TD with uninitialized memory. > > > > To prevent the mismatch between mirror EPT and the S-EPT? > > > > > > No? Because KVM bumps the count when installing the S-EPT and decrements it > > > on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues? > > Hmm, I think there must be some misunderstanding. > > Yeah, I forgot that AUG and ADD create the leaf S-EPT entries. > > > Before userspace invokes KVM_TDX_FINALIZE_VM, > > ======= > > 1. the normal path (userspace invokes KVM_TDX_INIT_MEM_REGION). > > (1) KVM holds slot_lock and filemap lock. > > (2) KVM invokes kvm_tdp_map_page() (or kvm_tdp_mmu_map_private_pfn() in > > patch 2). > > KVM increases nr_premapped in tdx_sept_set_private_spte() to indicate > > that there's a page mapped in M-EPT, while it's not yet installed in > > S-EPT. > > (3) KVM invokes TDH.MEM.PAGE.ADD and decreases nr_premapped, indicating the > > page has been mapped in S-EPT too. > > > > As the name of nr_premapped indicates, the count means a page is pre-mapped > > in the M-EPT, before its real mapping in the S-EPT. > > If ADD fails in step (3), nr_premapped will not be decreased. > > > > With mere the normal path, nr_premapped should return back to 0 after all > > KVM_TDX_INIT_MEM_REGIONs. > > > > > > 2. Expected zap paths (e.g. If userspace does something strange, such as > > removing a slot after KVM_TDX_INIT_MEM_REGION) > > Those zap paths could be triggered by > > 1) userspace performs a page attribute conversion > > 2) userspace invokes gmem punch hole > > 3) userspace removes a slot > > As all those paths either hold a slot_lock or a filemap lock, they can't > > contend with tdx_vcpu_init_mem_region() (tdx_vcpu_init_mem_region holds both > > slot_lock and internally filemap lock). > > Consequently, those zaps must occur > > a) before kvm_tdp_map_page() or > > b) after TDH.MEM.PAGE.ADD. > > For a), tdx_sept_zap_private_spte() won't not be invoked as the page is not > > mapped in M-EPT either; > > For b), tdx_sept_zap_private_spte() should succeed, as the BLOCK and REMOVE > > SEAMCALLs are following the ADD. > > nr_premapped is therere unchanged, since it does not change the consistency > > between M-EPT and S-EPT. > > > > 3. Unexpected zaps (such as kvm_zap_gfn_range()). > > Side topic related to kvm_zap_gfn_range(), the KVM_BUG_ON() in vt_refresh_apicv_exec_ctrl() > is flawed. If kvm_recalculate_apic_map() fails to allocate an optimized map, KVM > will mark APICv as inhibited, i.e. the associated WARN_ON_ONCE() is effectively > user-triggerable. 
> > Easiest thing would be to mark the vCPU as dead (though we obviously need > "KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests" for that to be robust). > > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c > index dbab1c15b0cd..1c0b43ff9544 100644 > --- a/arch/x86/kvm/vmx/main.c > +++ b/arch/x86/kvm/vmx/main.c > @@ -719,7 +719,8 @@ static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu) > static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) > { > if (is_td_vcpu(vcpu)) { > - KVM_BUG_ON(!kvm_vcpu_apicv_active(vcpu), vcpu->kvm); > + if (!kvm_vcpu_apicv_active(vcpu)) > + kvm_make_request(KVM_REQ_VM_DEAD, vcpu); > return; > } > > > Those zaps are currently just paranoid ones. Not found in any existing paths > > yet. i.e., > > We want to detect any future code or any missed code piecies, which invokes > > kvm_zap_gfn_range() (or maybe zaps under read mmu_lock). > > > > As those zaps do not necessarily hold slot_lock or filemap lock, they may > > ocurr after installing M-EPT and before installing S-EPT. > > As a result, the BLOCK fails and tdx_is_sept_zap_err_due_to_premap() returns > > true. > > Decreasing nr_premapped here to indicate the count of pages mapped in M-EPT > > but not in S-EPT decreases. > > > > TDH.MEM.PAGE.ADD after this zap can still succeed. If this occurs, the page > > will be mapped in S-EPT only. As KVM also decreases nr_premapped after a > > successful TDH.MEM.PAGE.ADD, the nr_premapped will be <0 in the end. > > So, we will be able to detect those unexpected zaps. > > > > > > When userspace invokes KVM_TDX_FINALIZE_VM, > > ======= > > The nr_premapped must be 0 before tdx_td_finalize() succeeds. > > > > The nr_premapped could be 0 if > > (1) userspace invokes KVM_TDX_INIT_MEM_REGIONs as in a normal way. > > (2) userspace never triggers any KVM_TDX_INIT_MEM_REGION. > > (3) userspace triggers KVM_TDX_INIT_MEM_REGION but zaps all initial memory > > regions. > > > > For (2)and(3), KVM_TDX_FINALIZE_VM can still succeed. > > Ya, we're in agreement on what can happen. I think all of the confusion was due > to me forgetting that TDH.MEM.PAGE.ADD is what actually installes the leaf S-EPT > entry. > > > So, TD can still run with uninitialized memory. > > No, the TD can never run with truly uninitialized memory. By "uninitialized", I > mean memory that the guest can access and which has not been written to. Again, > my confusion was due to thinking a page was already mapped into the guest, but > awaiting TDH.MEM.PAGE.ADD to > > > > Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing > > We don't. It returns -EBUSY or -EIO immediately. > > But that _is_ tolerating failure, in the sense that KVM doesn't prevent further > actions on the VM. Tolerating failure is fine in general, but in this case it > leaves the MMU is left in a half-baked state. > > > > nice with tdh_mem_page_add() failure necessitates both the > > > tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize() > > > that all pending pages have been consumed. > > > > tdx_is_sept_zap_err_due_to_premap() detects the error of BLOCK, which is caused > > by executing BLOCK before ADD. > > We need to make this situation impossible. Currently this situation should be impossible already. If there're still missing ones, we can fix it (as you did above). But this tdx_is_sept_zap_err_due_to_premap() check is just to detect if anything is still missing. Or maybe directly KVM_BUG_ON() on that is also ok. 
> > > What reasonable use case is there for gracefully handling tdh_mem_page_add() failure? > > If tdh_mem_page_add() fails, the KVM_TDX_INIT_MEM_REGION just fails. > > > > > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY > > > case. And if it's only for -EBUSY, why can't that be handled by retrying in > > > tdx_vcpu_init_mem_region()? If tdx_vcpu_init_mem_region() guarantees that all > > I analyzed the contention status of tdh_mem_sept_add() at > > https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com. > > > > As the userspace is expected to execute KVM_TDX_INIT_MEM_REGION in only one > > vCPU, returning -EBUSY instead of retrying looks safer and easier. > > > > > pages mapped into the S-EPT are ADDed, then it can assert that there are no > > > pending pages when it completes (even if it "fails"), and similarly > > > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being > > > non-zero. > > tdx_td_finalize() now just returns -EINVAL in case of nr_premapped being !0. > > KVM_BUG_ON/WARN_ON should be also ok. > > Ok, so I vaguely recall that I may have pushed back on using a scratch field in > "struct kvm_tdx" for temporary data (or maybe it was abusing vcpus[0] that I > disliked?), but what we ended up with is far worse. > > For all intents and purposes, nr_premapped _is_ a temporary scratch field, but > with access rules that are all but impossible to understand, e.g. there's practically > zero chance anyone could suss out complications with "Unexpected zaps", and the > existence of that super subtle edge case necessitates using an atomic because > KVM can't strictly guarantee that access to the field is mutually exclusive. And > that also means it's inherently racy, e.g. if a zap happens while tdx_td_finalize() > is checking nr_premapped, what happens? The tdx_td_finalize() takes slots_lock and you asserted those unexpected zaps at https://lore.kernel.org/all/20250827000522.4022426-11-seanjc@google.com. Expected zaps can't occur during tdx_td_finalize checking nr_premapped either. > The real killer is that deferring TDH.MEM.PAGE.ADD and TDH.MR_EXTEND until after > the map completes and mmu_lock is dropped means that failure at that point leaves > the TDP MMU in an inconsistent state, where the M-EPT has a present page but the > S-EPT does not. Eww. Eww... That's why there's nr_premapped. And it's suggested by you, though you called it "Crazy idea"... https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com/ > Note, in no way am I trying to blame anyone; quite the opposite, you've done an > admirable job to get all of this landed. And I apologize if any of my past > feedback led y'all down this road. I suspect my prefaulting idea really screwed > things up; sorry :-( It's ok :) > Back to the code, unless I'm missing yet another complication, I think the obvious > fix to all of this is to pass the source page and metadata flags via a scratch > field in "struct kvm_tdx", and then do PAGE.ADD and MR.EXTEND in > tdx_sept_set_private_spte(). Then there is no need to keep track of pending > pages, because the M-EPT and S-EPT are always consistent. E.g. if PAGE.ADD fails > with -EBUSY, then KVM will naturally revert the M-EPT entry from FROZEN to !PRESENT. > It also allows KVM to KVM_BUG_ON() MR.EXTEND failure, because it should be impossible > for the S-EPT entry to be modified between PAGE.ADD and MR.EXTEND. > > Diff on top below for feedback on the idea. 
A proper series for this would simply > replace several of the patches, e.g. asserting that slots_lock is held on > tdx_is_sept_zap_err_due_to_premap() is wrong. Looks it's similar to the implementation in v19? https://lore.kernel.org/all/bbac4998cfb34da496646491038b03f501964cbd.1708933498.git.isaku.yamahata@intel.com/ > --- > arch/x86/kvm/vmx/tdx.c | 157 +++++++++++++++++------------------------ > arch/x86/kvm/vmx/tdx.h | 11 +-- > 2 files changed, 70 insertions(+), 98 deletions(-) > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index f9ac590e8ff0..5d981a061442 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -1586,6 +1586,56 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level) > td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa); > } > > + > +struct kvm_tdx_page_add { > + struct page *src; > + unsigned long flags; > +}; > + > +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level, > + kvm_pfn_t pfn) > +{ > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > + u64 err, entry, level_state; > + gpa_t gpa = gfn_to_gpa(gfn); > + int i; > + > + lockdep_assert_held(&kvm->slots_lock); > + > + if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) || > + KVM_BUG_ON(!kvm_tdx->page_add, kvm)) > + return -EIO; > + > + err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), > + kvm_tdx->page_add->src, &entry, &level_state); > + if (unlikely(tdx_operand_busy(err))) > + return -EBUSY; > + > + if (KVM_BUG_ON(err, kvm)) { > + pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state); > + return -EIO; > + } > + > + if (!(kvm_tdx->page_add->flags & KVM_TDX_MEASURE_MEMORY_REGION)) > + return 0; > + > + /* > + * Extend the measurement while holding mmu_lock to ensure MR.EXTEND > + * can't fail, e.g. due to the S-EPT entry being zapped after PAGE.ADD. > + * Note! If extended the measurement fails, bug the VM, but do NOT > + * return an error, as mapping the page in the S-EPT succeeded and so > + * needs to be tracked in KVM's mirror page tables. > + */ > + for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { > + err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state); > + if (KVM_BUG_ON(err, kvm)) { > + pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state); > + break; > + } > + } > + return 0; > +} > + > static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn, > enum pg_level level, kvm_pfn_t pfn) > { > @@ -1627,21 +1677,11 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > /* > * If the TD isn't finalized/runnable, then userspace is initializing > - * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > - * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > - * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > - * the counter to ensure all mapped pages have been added to the image, > - * to prevent running the TD with uninitialized memory. > + * the VM image via KVM_TDX_INIT_MEM_REGION. Add the page to the TD, > + * and optionally extend the measurement with the page contents. 
> */ > - if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) { > - lockdep_assert_held(&kvm->slots_lock); > - > - if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) > - return -EIO; > - > - kvm_tdx->nr_pending_tdh_mem_page_adds++; > - return 0; > - } > + if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) > + return tdx_mem_page_add(kvm, gfn, level, pfn); > > return tdx_mem_page_aug(kvm, gfn, level, pfn); > } > @@ -1716,39 +1756,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn, > return 0; > } > > -/* > - * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is > - * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called > - * successfully. > - * > - * Since tdh_mem_sept_add() must have been invoked successfully before a > - * non-leaf entry present in the mirrored page table, the SEPT ZAP related > - * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead > - * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the > - * SEPT. > - * > - * Further check if the returned entry from SEPT walking is with RWX permissions > - * to filter out anything unexpected. > - * > - * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from > - * level_state returned from a SEAMCALL error is the same as that passed into > - * the SEAMCALL. > - */ > -static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err, > - u64 entry, int level) > -{ > - if (!err || kvm_tdx->state == TD_STATE_RUNNABLE) > - return false; > - > - if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX)) > - return false; > - > - if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK))) > - return false; > - > - return true; > -} > - > static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, > enum pg_level level, struct page *page) > { > @@ -1768,15 +1775,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, > err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state); > tdx_no_vcpus_enter_stop(kvm); > } > - if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) { > - lockdep_assert_held(&kvm->slots_lock); > - > - if (KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm)) > - return -EIO; > - > - return 0; > - } > - > if (KVM_BUG_ON(err, kvm)) { > pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state); > return -EIO; > @@ -2842,12 +2840,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd) > > if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE) > return -EINVAL; > - /* > - * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue > - * TDH.MEM.PAGE.ADD(). > - */ > - if (kvm_tdx->nr_pending_tdh_mem_page_adds) > - return -EINVAL; > > cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td); > if (tdx_operand_busy(cmd->hw_error)) > @@ -3131,50 +3123,29 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > { > struct tdx_gmem_post_populate_arg *arg = _arg; > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > - u64 err, entry, level_state; > - gpa_t gpa = gfn_to_gpa(gfn); > - struct page *src_page; > - int ret, i; > + struct kvm_tdx_page_add page_add = { > + .flags = arg->flags, > + }; > + int ret; > > - lockdep_assert_held(&kvm->slots_lock); > + if (KVM_BUG_ON(kvm_tdx->page_add, kvm)) > + return -EIO; > > /* > * Get the source page if it has been faulted in. Return failure if the > * source page has been swapped out or unmapped in primary memory. 
> */ > - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); > + ret = get_user_pages_fast((unsigned long)src, 1, 0, &page_add.src); > if (ret < 0) > return ret; > if (ret != 1) > return -ENOMEM; > > + kvm_tdx->page_add = &page_add; > ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); > - if (ret < 0) > - goto out; > + kvm_tdx->page_add = NULL; > > - ret = 0; > - err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), > - src_page, &entry, &level_state); > - if (err) { > - ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO; > - goto out; > - } > - > - KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm); > - > - if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) { > - for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { > - err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, > - &level_state); > - if (err) { > - ret = -EIO; > - break; > - } > - } > - } > - > -out: > - put_page(src_page); > + put_page(page_add.src); > return ret; > } > > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h > index 45d86f9fa41c..39e0c3bcc866 100644 > --- a/arch/x86/kvm/vmx/tdx.h > +++ b/arch/x86/kvm/vmx/tdx.h > @@ -21,6 +21,8 @@ enum kvm_tdx_state { > TD_STATE_RUNNABLE, > }; > > +struct kvm_tdx_add_page; > + > struct kvm_tdx { > struct kvm kvm; > > @@ -37,12 +39,11 @@ struct kvm_tdx { > struct tdx_td td; > > /* > - * The number of pages that KVM_TDX_INIT_MEM_REGION has mapped into the > - * S-EPT, but not yet initialized via TDH.MEM.PAGE_ADD. Used to sanity > - * check adding pages to the image, and to ensure that all pages have > - * been initialized before finalizing the TD. > + * Scratch structure used to pass the source page and metadata flags to > + * tdx_mem_page_add. Protected by slots_lock, and non-NULL only when > + * mapping a private pfn via tdx_gmem_post_populate(). > */ > - unsigned long nr_pending_tdh_mem_page_adds; > + struct kvm_tdx_page_add *page_add; > > /* > * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do > > base-commit: 7c7a3893b102bdeb4826f7140280b7b16081b385 > --
On Fri, Aug 29, 2025 at 10:31:29AM +0800, Yan Zhao wrote: > On Thu, Aug 28, 2025 at 10:00:28AM -0700, Sean Christopherson wrote: > > On Thu, Aug 28, 2025, Yan Zhao wrote: > > > On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote: > > > > On Wed, Aug 27, 2025, Yan Zhao wrote: > > > > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: ... > > > > Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing > > > We don't. It returns -EBUSY or -EIO immediately. > > > > But that _is_ tolerating failure, in the sense that KVM doesn't prevent further > > actions on the VM. Tolerating failure is fine in general, but in this case it > > leaves the MMU is left in a half-baked state. Yes, but nr_premapped will not be decreased on tdh_mem_page_add() failure. So we rely on nr_premapped to disallow the TD from running eventually.
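For context on the gating Yan describes: a failed TDH.MEM.PAGE.ADD leaves the counter elevated (nr_premapped here, nr_pending_tdh_mem_page_adds in the diff below), so the pre-cleanup check in tdx_td_finalize() refuses to finalize, and therefore to run, the TD. A minimal excerpt of that guard, taken from the code being removed later in this thread:

	/*
	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
	 * TDH.MEM.PAGE.ADD(); refuse to finalize (and thus run) the TD until
	 * every pre-mapped page has actually been added to the image.
	 */
	if (kvm_tdx->nr_pending_tdh_mem_page_adds)
		return -EINVAL;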
On Thu, 2025-08-28 at 10:00 -0700, Sean Christopherson wrote: > On Thu, Aug 28, 2025, Yan Zhao wrote: > > On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote: > > > On Wed, Aug 27, 2025, Yan Zhao wrote: > > > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: > > > > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > > > > + /* > > > > > + * If the TD isn't finalized/runnable, then userspace is initializing > > > > > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > > > > > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > > > > > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > > > > > + * the counter to ensure all mapped pages have been added to the image, > > > > > + * to prevent running the TD with uninitialized memory. > > > > To prevent the mismatch between mirror EPT and the S-EPT? > > > > > > No? Because KVM bumps the count when installing the S-EPT and decrements it > > > on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues? > > Hmm, I think there must be some misunderstanding. > > Yeah, I forgot that AUG and ADD create the leaf S-EPT entries. > > > Before userspace invokes KVM_TDX_FINALIZE_VM, > > ======= > > 1. the normal path (userspace invokes KVM_TDX_INIT_MEM_REGION). > > (1) KVM holds slot_lock and filemap lock. > > (2) KVM invokes kvm_tdp_map_page() (or kvm_tdp_mmu_map_private_pfn() in > > patch 2). > > KVM increases nr_premapped in tdx_sept_set_private_spte() to indicate > > that there's a page mapped in M-EPT, while it's not yet installed in > > S-EPT. > > (3) KVM invokes TDH.MEM.PAGE.ADD and decreases nr_premapped, indicating the > > page has been mapped in S-EPT too. > > > > As the name of nr_premapped indicates, the count means a page is pre-mapped > > in the M-EPT, before its real mapping in the S-EPT. > > If ADD fails in step (3), nr_premapped will not be decreased. > > > > With mere the normal path, nr_premapped should return back to 0 after all > > KVM_TDX_INIT_MEM_REGIONs. > > > > > > 2. Expected zap paths (e.g. If userspace does something strange, such as > > removing a slot after KVM_TDX_INIT_MEM_REGION) > > Those zap paths could be triggered by > > 1) userspace performs a page attribute conversion > > 2) userspace invokes gmem punch hole > > 3) userspace removes a slot > > As all those paths either hold a slot_lock or a filemap lock, they can't > > contend with tdx_vcpu_init_mem_region() (tdx_vcpu_init_mem_region holds both > > slot_lock and internally filemap lock). > > Consequently, those zaps must occur > > a) before kvm_tdp_map_page() or > > b) after TDH.MEM.PAGE.ADD. > > For a), tdx_sept_zap_private_spte() won't not be invoked as the page is not > > mapped in M-EPT either; > > For b), tdx_sept_zap_private_spte() should succeed, as the BLOCK and REMOVE > > SEAMCALLs are following the ADD. > > nr_premapped is therere unchanged, since it does not change the consistency > > between M-EPT and S-EPT. > > > > 3. Unexpected zaps (such as kvm_zap_gfn_range()). > > Side topic related to kvm_zap_gfn_range(), the KVM_BUG_ON() in vt_refresh_apicv_exec_ctrl() > is flawed. If kvm_recalculate_apic_map() fails to allocate an optimized map, KVM > will mark APICv as inhibited, i.e. the associated WARN_ON_ONCE() is effectively > user-triggerable. > > Easiest thing would be to mark the vCPU as dead (though we obviously need > "KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests" for that to be robust). 
> > > I'm going need to look up the related apic discussions from the base series and circle back. [snip] > > tdx_td_finalize() now just returns -EINVAL in case of nr_premapped being !0. > > KVM_BUG_ON/WARN_ON should be also ok. > > Ok, so I vaguely recall that I may have pushed back on using a scratch field in > "struct kvm_tdx" for temporary data (or maybe it was abusing vcpus[0] that I > disliked?), but what we ended up with is far worse. I think it was also that the tdh_mr_extend() loop was too heavyweight for the fault path. But that was before we got to the kick+lock stuff. > > For all intents and purposes, nr_premapped _is_ a temporary scratch field, but > with access rules that are all but impossible to understand, e.g. there's practically > zero chance anyone could suss out complications with "Unexpected zaps", and the > existence of that super subtle edge case necessitates using an atomic because > KVM can't strictly guarantee that access to the field is mutually exclusive. And > that also means it's inherently racy, e.g. if a zap happens while tdx_td_finalize() > is checking nr_premapped, what happens? > > The real killer is that deferring TDH.MEM.PAGE.ADD and TDH.MR_EXTEND until after > the map completes and mmu_lock is dropped means that failure at that point leaves > the TDP MMU in an inconsistent state, where the M-EPT has a present page but the > S-EPT does not. Eww. > > Note, in no way am I trying to blame anyone; quite the opposite, you've done an > admirable job to get all of this landed. And I apologize if any of my past > feedback led y'all down this road. I suspect my prefaulting idea really screwed > things up; sorry :-( It's unfortunate we didn't have the gmem mmap() support then. Otherwise we could have just encrypted it in place. Otherwise, I'm really glad to see these cleanups/scrutiny. I kind of got the impression that you wanted to see less TDX churn for a bit. The fact is, the TDX base support still needs more work like this. > > Back to the code, unless I'm missing yet another complication, I think the obvious > fix to all of this is to pass the source page and metadata flags via a scratch > field in "struct kvm_tdx", and then do PAGE.ADD and MR.EXTEND in > tdx_sept_set_private_spte(). Then there is no need to keep track of pending > pages, because the M-EPT and S-EPT are always consistent. E.g. if PAGE.ADD fails > with -EBUSY, then KVM will naturally revert the M-EPT entry from FROZEN to !PRESENT. > It also allows KVM to KVM_BUG_ON() MR.EXTEND failure, because it should be impossible > for the S-EPT entry to be modified between PAGE.ADD and MR.EXTEND. > > Diff on top below for feedback on the idea. A proper series for this would simply > replace several of the patches, e.g. asserting that slots_lock is held on > tdx_is_sept_zap_err_due_to_premap() is wrong. This works. The "stuffing data on the vCPU" part is a little ugly as it was before, but the other solutions were worse. Especially nr_premap was problematic for a number of reasons that keeps growing. 
So, big Acked-by: Rick Edgecombe <rick.p.edgecombe@intel.com> > > --- > arch/x86/kvm/vmx/tdx.c | 157 +++++++++++++++++------------------------ > arch/x86/kvm/vmx/tdx.h | 11 +-- > 2 files changed, 70 insertions(+), 98 deletions(-) > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index f9ac590e8ff0..5d981a061442 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -1586,6 +1586,56 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level) > td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa); > } > > + > +struct kvm_tdx_page_add { > + struct page *src; > + unsigned long flags; > +}; > + > +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level, > + kvm_pfn_t pfn) > +{ > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > + u64 err, entry, level_state; > + gpa_t gpa = gfn_to_gpa(gfn); > + int i; > + > + lockdep_assert_held(&kvm->slots_lock); > + > + if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) || > + KVM_BUG_ON(!kvm_tdx->page_add, kvm)) > + return -EIO; > + > + err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), > + kvm_tdx->page_add->src, &entry, &level_state); > + if (unlikely(tdx_operand_busy(err))) > + return -EBUSY; > + > + if (KVM_BUG_ON(err, kvm)) { > + pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state); > + return -EIO; > + } > + > + if (!(kvm_tdx->page_add->flags & KVM_TDX_MEASURE_MEMORY_REGION)) > + return 0; > + > + /* > + * Extend the measurement while holding mmu_lock to ensure MR.EXTEND > + * can't fail, e.g. due to the S-EPT entry being zapped after PAGE.ADD. > + * Note! If extended the measurement fails, bug the VM, but do NOT > + * return an error, as mapping the page in the S-EPT succeeded and so > + * needs to be tracked in KVM's mirror page tables. > + */ > + for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { > + err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state); > + if (KVM_BUG_ON(err, kvm)) { > + pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state); > + break; > + } > + } > + return 0; > +} > + > static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn, > enum pg_level level, kvm_pfn_t pfn) > { > @@ -1627,21 +1677,11 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > /* > * If the TD isn't finalized/runnable, then userspace is initializing > - * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > - * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > - * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > - * the counter to ensure all mapped pages have been added to the image, > - * to prevent running the TD with uninitialized memory. > + * the VM image via KVM_TDX_INIT_MEM_REGION. Add the page to the TD, > + * and optionally extend the measurement with the page contents. > */ > - if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) { > - lockdep_assert_held(&kvm->slots_lock); > - > - if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) > - return -EIO; > - > - kvm_tdx->nr_pending_tdh_mem_page_adds++; > - return 0; > - } > + if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) > + return tdx_mem_page_add(kvm, gfn, level, pfn); > > return tdx_mem_page_aug(kvm, gfn, level, pfn); > } > @@ -1716,39 +1756,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn, > return 0; > } > > -/* > - * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is > - * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called > - * successfully. 
> - * > - * Since tdh_mem_sept_add() must have been invoked successfully before a > - * non-leaf entry present in the mirrored page table, the SEPT ZAP related > - * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead > - * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the > - * SEPT. > - * > - * Further check if the returned entry from SEPT walking is with RWX permissions > - * to filter out anything unexpected. > - * > - * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from > - * level_state returned from a SEAMCALL error is the same as that passed into > - * the SEAMCALL. > - */ > -static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err, > - u64 entry, int level) > -{ > - if (!err || kvm_tdx->state == TD_STATE_RUNNABLE) > - return false; > - > - if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX)) > - return false; > - > - if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK))) > - return false; > - > - return true; > -} > - > static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, > enum pg_level level, struct page *page) > { > @@ -1768,15 +1775,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, > err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state); > tdx_no_vcpus_enter_stop(kvm); > } > - if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) { > - lockdep_assert_held(&kvm->slots_lock); > - > - if (KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm)) > - return -EIO; > - > - return 0; > - } > - > if (KVM_BUG_ON(err, kvm)) { > pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state); > return -EIO; > @@ -2842,12 +2840,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd) > > if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE) > return -EINVAL; > - /* > - * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue > - * TDH.MEM.PAGE.ADD(). > - */ > - if (kvm_tdx->nr_pending_tdh_mem_page_adds) > - return -EINVAL; > > cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td); > if (tdx_operand_busy(cmd->hw_error)) > @@ -3131,50 +3123,29 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > { > struct tdx_gmem_post_populate_arg *arg = _arg; > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > - u64 err, entry, level_state; > - gpa_t gpa = gfn_to_gpa(gfn); > - struct page *src_page; > - int ret, i; > + struct kvm_tdx_page_add page_add = { > + .flags = arg->flags, > + }; > + int ret; > > - lockdep_assert_held(&kvm->slots_lock); > + if (KVM_BUG_ON(kvm_tdx->page_add, kvm)) > + return -EIO; > > /* > * Get the source page if it has been faulted in. Return failure if the > * source page has been swapped out or unmapped in primary memory. > */ > - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); > + ret = get_user_pages_fast((unsigned long)src, 1, 0, &page_add.src); > if (ret < 0) > return ret; > if (ret != 1) > return -ENOMEM; > > + kvm_tdx->page_add = &page_add; > ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); > - if (ret < 0) > - goto out; > + kvm_tdx->page_add = NULL; > > - ret = 0; > - err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), > - src_page, &entry, &level_state); > - if (err) { > - ret = unlikely(tdx_operand_busy(err)) ? 
-EBUSY : -EIO; > - goto out; > - } > - > - KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm); > - > - if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) { > - for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { > - err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, > - &level_state); > - if (err) { > - ret = -EIO; > - break; > - } > - } > - } > - > -out: > - put_page(src_page); > + put_page(page_add.src); > return ret; > } > > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h > index 45d86f9fa41c..39e0c3bcc866 100644 > --- a/arch/x86/kvm/vmx/tdx.h > +++ b/arch/x86/kvm/vmx/tdx.h > @@ -21,6 +21,8 @@ enum kvm_tdx_state { > TD_STATE_RUNNABLE, > }; > > +struct kvm_tdx_add_page; > + > struct kvm_tdx { > struct kvm kvm; > > @@ -37,12 +39,11 @@ struct kvm_tdx { > struct tdx_td td; > > /* > - * The number of pages that KVM_TDX_INIT_MEM_REGION has mapped into the > - * S-EPT, but not yet initialized via TDH.MEM.PAGE_ADD. Used to sanity > - * check adding pages to the image, and to ensure that all pages have > - * been initialized before finalizing the TD. > + * Scratch structure used to pass the source page and metadata flags to > + * tdx_mem_page_add. Protected by slots_lock, and non-NULL only when > + * mapping a private pfn via tdx_gmem_post_populate(). > */ > - unsigned long nr_pending_tdh_mem_page_adds; > + struct kvm_tdx_page_add *page_add; > > /* > * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do > > base-commit: 7c7a3893b102bdeb4826f7140280b7b16081b385 > --
On 8/29/2025 2:52 AM, Edgecombe, Rick P wrote: > On Thu, 2025-08-28 at 10:00 -0700, Sean Christopherson wrote: >> On Thu, Aug 28, 2025, Yan Zhao wrote: [...] >>> >>> 3. Unexpected zaps (such as kvm_zap_gfn_range()). >> Side topic related to kvm_zap_gfn_range(), the KVM_BUG_ON() in vt_refresh_apicv_exec_ctrl() >> is flawed. If kvm_recalculate_apic_map() fails to allocate an optimized map, KVM >> will mark APICv as inhibited, i.e. the associated WARN_ON_ONCE() is effectively >> user-triggerable. >> >> Easiest thing would be to mark the vCPU as dead (though we obviously need >> "KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests" for that to be robust). >> >> >> > I'm going need to look up the related apic discussions from the base series and > circle back. There was an analysis about the inhibit reasons for TDX. https://lore.kernel.org/lkml/e3a2e8fa-b496-4010-9a8c-bfeb131bc43b@linux.intel.com/ As Sean mentioned, if kvm_recalculate_apic_map() fails to allocate the memory for the optimized map, it will trigger the KVM_BUG_ON() in vt_refresh_apicv_exec_ctrl(). And a kvzalloc() failure should not be treated as a KVM bug. As for being user-triggerable, the kvzalloc() failure path can be triggered by KVM_CREATE_VCPU and KVM_TDX_INIT_VCPU for a TD. After KVM_TDX_INIT_VCPU, the mapping is not allowed to be changed. Sean's suggested code change looks good to me.
On Thu, Aug 28, 2025, Rick P Edgecombe wrote: > It's unfortunate we didn't have the gmem mmap() support then. Otherwise we could > have just encrypted it in place. Yeah, hindsight is definitely 20/20 on that front. Though I suspect that we'd never have landed anything if we tried to go straight to support mmap(). > Otherwise, I'm really glad to see these cleanups/scrutiny. I kind of got the > impression that you wanted to see less TDX churn for a bit. Heh, believe me, I'm not exactly ecstatic to dive into this. But, I don't want to just ignore it, because some of these quirks/warts are already getting in the way of new development, and if I/we delay such clean ups, the pain is only going to get worse (and the total cost will be much higher). Fatigue is a bit of a problem for me, but the biggest issue really is just lack of cycles (the quick feedback and testing y'all are providing helps tremendously on that front). And lack of cycles should be mitigated to some extent as I (re)familiarize myself with the code; I recognized most of the concepts, but there are definitely a few places where I'm completely lost, and that makes reviewing things like dynamic PAMT and hugepage support excruciatingly slow. > The fact is, the TDX base support still needs more work like this. IMO, the most important things to address are cases where the design choices we made turned out to be suboptimal, i.e. where the behavior itself is problematic. Code cleanups are definitely welcome too, but my priority is polishing the core design.
On Thu, Aug 28, 2025, Rick P Edgecombe wrote: > On Thu, 2025-08-28 at 10:00 -0700, Sean Christopherson wrote: > > > tdx_td_finalize() now just returns -EINVAL in case of nr_premapped being !0. > > > KVM_BUG_ON/WARN_ON should be also ok. > > > > Ok, so I vaguely recall that I may have pushed back on using a scratch field in > > "struct kvm_tdx" for temporary data (or maybe it was abusing vcpus[0] that I > > disliked?), but what we ended up with is far worse. > > I think it was also that the tdh_mr_extend() loop was too heavyweight for the > fault path. But that was before we got to the kick+lock stuff. Me confused. This is pre-boot, not the normal fault path, i.e. blocking other operations is not a concern. If tdh_mr_extend() is too heavy for a non-preemptible section, then the current code is also broken in the sense that there are no cond_resched() calls. The vast majority of TDX hosts will be using non-preemptible kernels, so without an explicit cond_resched(), there's no practical difference between extending the measurement under mmu_lock versus outside of mmu_lock. _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should still do tdh_mem_page_add() under mmu_lock.
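To make the cond_resched() point concrete, here is a minimal sketch (not from any posted patch) of the measurement loop quoted elsewhere in this thread as it could look if run outside of mmu_lock, where sleeping is allowed; the cond_resched() placement is purely illustrative:

	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
		if (err)
			return -EIO;

		/*
		 * Hypothetical: yield between chunks.  This is only legal
		 * outside of mmu_lock (a non-sleepable lock); under mmu_lock
		 * the loop has to run to completion, which is the point about
		 * non-preemptible kernels above.
		 */
		cond_resched();
	}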
On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote: > Me confused. This is pre-boot, not the normal fault path, i.e. blocking other > operations is not a concern. Just was my recollection of the discussion. I found it: https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/ > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current > code is also broken in the sense that there are no cond_resched() calls. The > vast majority of TDX hosts will be using non-preemptible kernels, so without an > explicit cond_resched(), there's no practical difference between extending the > measurement under mmu_lock versus outside of mmu_lock. > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should > still do tdh_mem_page_add() under mmu_lock. I just did a quick test and we should be on the order of <1 ms per page for the full loop. I can try to get some more formal test data if it matters. But that doesn't sound too horrible? tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be inside it. But maybe a better reason is that we could better handle errors outside the fault. (i.e. no 5 line comment about why not to return an error in tdx_mem_page_add() due to code in another file). I wonder if Yan can give an analysis of any zapping races if we do that.
On Fri, Aug 29, 2025 at 05:33:48AM +0800, Edgecombe, Rick P wrote: > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote: > > Me confused. This is pre-boot, not the normal fault path, i.e. blocking other > > operations is not a concern. > > Just was my recollection of the discussion. I found it: > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/ > > > > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current > > code is also broken in the sense that there are no cond_resched() calls. The > > vast majority of TDX hosts will be using non-preemptible kernels, so without an > > explicit cond_resched(), there's no practical difference between extending the > > measurement under mmu_lock versus outside of mmu_lock. > > > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should > > still do tdh_mem_page_add() under mmu_lock. > > I just did a quick test and we should be on the order of <1 ms per page for the > full loop. I can try to get some more formal test data if it matters. But that > doesn't sound too horrible? > > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be > inside it. But maybe a better reason is that we could better handle errors > outside the fault. (i.e. no 5 line comment about why not to return an error in > tdx_mem_page_add() due to code in another file). > > I wonder if Yan can give an analysis of any zapping races if we do that. I actually proposed to have write mmu_lock around tdh_mem_page_add() and tdh_mr_extend(), as in https://lore.kernel.org/kvm/Ztfn5gh5888PmEIe@yzhao56-desk.sh.intel.com. I don't see any reason why tdh_mr_extend() can't be done inside mmu_lock in the pre-boot stage. But the previous conclusion was that with slots_lock and filemap invalidation lock, it's ok to invoke tdh_mem_page_add() and tdh_mr_extend() without any mmu_lock. The nr_premapped can also detect the unexpected zap.
Edgecombe, Rick P wrote: > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote: > > Me confused. This is pre-boot, not the normal fault path, i.e. blocking other > > operations is not a concern. > > Just was my recollection of the discussion. I found it: > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/ > > > > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current > > code is also broken in the sense that there are no cond_resched() calls. The > > vast majority of TDX hosts will be using non-preemptible kernels, so without an > > explicit cond_resched(), there's no practical difference between extending the > > measurement under mmu_lock versus outside of mmu_lock. > > > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should > > still do tdh_mem_page_add() under mmu_lock. > > I just did a quick test and we should be on the order of <1 ms per page for the > full loop. I can try to get some more formal test data if it matters. But that > doesn't sound too horrible? > > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be > inside it. I'm probably not following this conversation, so stupid question: It doesn't need to be in the lock because user space should not be setting up memory and extending the measurement in an asynchronous way. Is that correct? > But maybe a better reason is that we could better handle errors > outside the fault. (i.e. no 5 line comment about why not to return an error in > tdx_mem_page_add() due to code in another file). > > I wonder if Yan can give an analysis of any zapping races if we do that. When you say analysis, you mean detecting user space did something wrong and failing gracefully? Is that correct? Ira
On Thu, Aug 28, 2025, Ira Weiny wrote: > Edgecombe, Rick P wrote: > > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote: > > > Me confused. This is pre-boot, not the normal fault path, i.e. blocking other > > > operations is not a concern. > > > > Just was my recollection of the discussion. I found it: > > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/ > > > > > > > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current > > > code is also broken in the sense that there are no cond_resched() calls. The > > > vast majority of TDX hosts will be using non-preemptible kernels, so without an > > > explicit cond_resched(), there's no practical difference between extending the > > > measurement under mmu_lock versus outside of mmu_lock. > > > > > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should > > > still do tdh_mem_page_add() under mmu_lock. > > > > I just did a quick test and we should be on the order of <1 ms per page for the > > full loop. I can try to get some more formal test data if it matters. But that > > doesn't sound too horrible? > > > > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be > > inside it. > > I'm probably not following this conversation, so stupid question: It > doesn't need to be in the lock because user space should not be setting up > memory and extending the measurement in an asynchronous way. Is that > correct? No, from userspace's perspective ADD+MEASURE is fully serialized. ADD "needs" to be under mmu_lock to guarantee consistency between the mirror EPT and the "real" S-EPT entries. E.g. if ADD is done after the fact, then KVM can end up with a PRESENT M-EPT entry but a corresponding S-EPT entry that is !PRESENT. That causes a pile of problems because it breaks KVM's fundamental assumption that M-EPT and S-EPT entries are updated in lock-step. TDH_MR_EXTEND doesn't have the same consistency issue. If it fails, the only thing that's left in a bad state is the measurement. That's obviously not ideal either, but we can handle that by forcefully terminating the VM, without opening up KVM to edge cases that would otherwise be impossible. > > But maybe a better reason is that we could better handle errors > > outside the fault. (i.e. no 5 line comment about why not to return an error in > > tdx_mem_page_add() due to code in another file). > > > > I wonder if Yan can give an analysis of any zapping races if we do that. > > When you say analysis, you mean detecting user space did something wrong > and failing gracefully? Is that correct? More specifically, whether or not KVM can WARN without the WARN being user triggerable. Kernel policy is that WARNs must not be triggerable absent kernel, hardware, or firmware bugs. What we're trying to figure out is if there's a flow that can be triggered by userspace (misbehaving or not) that would trip a WARN even if KVM is operating as expected. I'm pretty sure the answer is "no". Oh, and WARNing here is desirable, because it improves the chances of detecting a fatal-to-the-VM bug, e.g. in KVM and/or in the TDX-Module.
Sean Christopherson wrote: > On Thu, Aug 28, 2025, Ira Weiny wrote: > > Edgecombe, Rick P wrote: > > > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote: > > > > Me confused. This is pre-boot, not the normal fault path, i.e. blocking other > > > > operations is not a concern. > > > > > > Just was my recollection of the discussion. I found it: > > > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/ > > > > > > > > > > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current > > > > code is also broken in the sense that there are no cond_resched() calls. The > > > > vast majority of TDX hosts will be using non-preemptible kernels, so without an > > > > explicit cond_resched(), there's no practical difference between extending the > > > > measurement under mmu_lock versus outside of mmu_lock. > > > > > > > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should > > > > still do tdh_mem_page_add() under mmu_lock. > > > > > > I just did a quick test and we should be on the order of <1 ms per page for the > > > full loop. I can try to get some more formal test data if it matters. But that > > > doesn't sound too horrible? > > > > > > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be > > > inside it. > > > > I'm probably not following this conversation, so stupid question: It > > doesn't need to be in the lock because user space should not be setting up > > memory and extending the measurement in an asynchronous way. Is that > > correct? > > No, from userspace's perspective ADD+MEASURE is fully serialized. ADD "needs" > to be under mmu_lock to guarantee consistency between the mirror EPT and the > "real" S-EPT entries. E.g. if ADD is done after the fact, then KVM can end up > with a PRESENT M-EPT entry but a corresponding S-EPT entry that is !PRESENT. > That causes a pile of problems because it breaks KVM's fundamental assumption > that M-EPT and S-EPT entries updated in lock-step. Ok yes, I think I worded my query incorrectly but this makes things clear. Thanks! > > TDH_MR_EXTEND doesn't have the same same consistency issue. If it fails, the > only thing that's left in a bad state is the measurement. That's obviously not > ideal either, but we can handle that by forcefully terminating the VM, without > opening up KVM to edge cases that would otherwise be impossible. > > > > But maybe a better reason is that we could better handle errors > > > outside the fault. (i.e. no 5 line comment about why not to return an error in > > > tdx_mem_page_add() due to code in another file). > > > > > > I wonder if Yan can give an analysis of any zapping races if we do that. > > > > When you say analysis, you mean detecting user space did something wrong > > and failing gracefully? Is that correct? > > More specifically, whether or not KVM can WARN without the WARN being user > triggerable. Kernel policy is that WARNs must not be triggerable absent kernel, > hardware, or firmware bugs. What we're trying to figure out is if there's a > flow that can be triggered by userspace (misbehving or not) that would trip a > WARN even if KVM is operating as expected. I'm pretty sure the answer is "no". > > Oh, and WARNing here is desirable, because it improves the chances of detecting > a fatal-to-the-VM bug, e.g. in KVM and/or in the TDX-Module. OK... In other areas of the kernel if the user misbehaves it is reasonable to fail an operation. 
I would think that being fatal to the VM would be fine if QEMU did not properly synchronize ADD, measurement, and finalize, for example. Am I wrong in that assumption? Ira
On Thu, Aug 28, 2025, Rick P Edgecombe wrote: > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote: > > Me confused. This is pre-boot, not the normal fault path, i.e. blocking other > > operations is not a concern. > > Just was my recollection of the discussion. I found it: > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/ Ugh, another case where an honest question gets interpreted as "do it this way". :-( > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current > > code is also broken in the sense that there are no cond_resched() calls. The > > vast majority of TDX hosts will be using non-preemptible kernels, so without an > > explicit cond_resched(), there's no practical difference between extending the > > measurement under mmu_lock versus outside of mmu_lock. > > > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should > > still do tdh_mem_page_add() under mmu_lock. > > I just did a quick test and we should be on the order of <1 ms per page for the > full loop. I can try to get some more formal test data if it matters. But that > doesn't sound too horrible? 1ms is totally reasonable. I wouldn't bother with any more testing. > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be > inside it. Agreed, and it would eliminate the need for a "flags" argument. But keeping it in the mmu_lock critical section means KVM can WARN on failures. If it's moved out, then zapping S-EPT entries could induce failure, and I don't think it's worth going through the effort to ensure it's impossible to trigger S-EPT removal. Note, removing S-EPT entries during initialization of the image isn't something I want to officially support, rather it's an endless stream of whack-a-mole due to obscure edge cases. Hmm, actually, maybe I take that back. slots_lock prevents memslot updates, filemap_invalidate_lock() prevents guest_memfd updates, and mmu_notifier events shouldn't ever hit S-EPT. I was worried about kvm_zap_gfn_range(), but the call from sev.c is obviously mutually exclusive, TDX disallows KVM_X86_QUIRK_IGNORE_GUEST_PAT so same goes for kvm_noncoherent_dma_assignment_start_or_stop, and while I'm 99% certain there's a way to trip __kvm_set_or_clear_apicv_inhibit(), the APIC page has its own non-guest_memfd memslot and so can't be used for the initial image, which means that too is mutually exclusive. So yeah, let's give it a shot. Worst case scenario we're wrong and TDH_MR_EXTEND errors can be triggered by userspace. > But maybe a better reason is that we could better handle errors > outside the fault. (i.e. no 5 line comment about why not to return an error in > tdx_mem_page_add() due to code in another file). > > I wonder if Yan can give an analysis of any zapping races if we do that. As above, I think we're good?
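A rough sketch of the direction converged on above, assembled only from code already quoted in this thread (the exact error handling and placement are illustrative, not a posted patch): TDH.MEM.PAGE.ADD stays under mmu_lock via tdx_sept_set_private_spte(), while the measurement loop moves to the tail of tdx_gmem_post_populate() after the map succeeds, so the scratch structure only needs to carry the source page:

	kvm_tdx->page_add = &page_add;
	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
	kvm_tdx->page_add = NULL;
	if (ret < 0)
		goto out;

	/*
	 * Illustrative: extend the measurement outside of mmu_lock.  Per the
	 * analysis above (slots_lock and filemap_invalidate_lock held, and
	 * mmu_notifier events never hit the S-EPT), nothing should be able to
	 * zap the just-added S-EPT entry here, so a failure is treated as a
	 * KVM and/or TDX-Module bug.
	 */
	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
			err = tdh_mr_extend(&kvm_tdx->td, gfn_to_gpa(gfn) + i,
					    &entry, &level_state);
			if (KVM_BUG_ON(err, kvm)) {
				ret = -EIO;
				break;
			}
		}
	}
out:
	put_page(page_add.src);
	return ret;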
On Thu, Aug 28, 2025 at 02:57:39PM -0700, Sean Christopherson wrote: > On Thu, Aug 28, 2025, Rick P Edgecombe wrote: > > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote: > > > Me confused. This is pre-boot, not the normal fault path, i.e. blocking other > > > operations is not a concern. > > > > Just was my recollection of the discussion. I found it: > > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/ > > Ugh, another case where an honest question gets interpreted as "do it this way". :-( > > > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current > > > code is also broken in the sense that there are no cond_resched() calls. The > > > vast majority of TDX hosts will be using non-preemptible kernels, so without an > > > explicit cond_resched(), there's no practical difference between extending the > > > measurement under mmu_lock versus outside of mmu_lock. > > > > > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should > > > still do tdh_mem_page_add() under mmu_lock. > > > > I just did a quick test and we should be on the order of <1 ms per page for the > > full loop. I can try to get some more formal test data if it matters. But that > > doesn't sound too horrible? > > 1ms is totally reasonable. I wouldn't bother with any more testing. > > > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be > > inside it. > > Agreed, and it would eliminate the need for a "flags" argument. But keeping it > in the mmu_lock critical section means KVM can WARN on failures. If it's moved > out, then zapping S-EPT entries could induce failure, and I don't think it's > worth going through the effort to ensure it's impossible to trigger S-EPT removal. > > Note, temoving S-EPT entries during initialization of the image isn't something > I want to official support, rather it's an endless stream of whack-a-mole due to > obsurce edge cases > > Hmm, actually, maybe I take that back. slots_lock prevents memslot updates, > filemap_invalidate_lock() prevents guest_memfd updates, and mmu_notifier events > shouldn't ever hit S-EPT. I was worried about kvm_zap_gfn_range(), but the call > from sev.c is obviously mutually exclusive, TDX disallows KVM_X86_QUIRK_IGNORE_GUEST_PAT > so same goes for kvm_noncoherent_dma_assignment_start_or_stop, and while I'm 99% > certain there's a way to trip __kvm_set_or_clear_apicv_inhibit(), the APIC page > has its own non-guest_memfd memslot and so can't be used for the initial image, > which means that too is mutually exclusive. > > So yeah, let's give it a shot. Worst case scenario we're wrong and TDH_MR_EXTEND > errors can be triggered by userspace. > > > But maybe a better reason is that we could better handle errors > > outside the fault. (i.e. no 5 line comment about why not to return an error in > > tdx_mem_page_add() due to code in another file). > > > > I wonder if Yan can give an analysis of any zapping races if we do that. > > As above, I think we're good? I think so.
On Thu, 2025-08-28 at 14:57 -0700, Sean Christopherson wrote: > Agreed, and it would eliminate the need for a "flags" argument. But keeping it > in the mmu_lock critical section means KVM can WARN on failures. If it's moved > out, then zapping S-EPT entries could induce failure, and I don't think it's > worth going through the effort to ensure it's impossible to trigger S-EPT removal. > > Note, temoving S-EPT entries during initialization of the image isn't something > I want to official support, rather it's an endless stream of whack-a-mole due to > obsurce edge cases > > Hmm, actually, maybe I take that back. slots_lock prevents memslot updates, > filemap_invalidate_lock() prevents guest_memfd updates, and mmu_notifier events > shouldn't ever hit S-EPT. I was worried about kvm_zap_gfn_range(), but the call > from sev.c is obviously mutually exclusive, TDX disallows KVM_X86_QUIRK_IGNORE_GUEST_PAT > so same goes for kvm_noncoherent_dma_assignment_start_or_stop, and > Yea, in the other thread Yan was suggesting the same thing from the KVM side: https://lore.kernel.org/all/aK%2Fsdr2OQqYv9DBZ@yzhao56-desk.sh.intel.com/ But was concerned about "Unexpected zaps" (kvm_zap_gfn_range()). I think maybe we could think about KVM_BUG_ON() in the case of mirror EPT to cover it from another angle. IIRC we discussed this at some point. I was wondering about TDH.MR.EXTEND error conditions. Coming back now, I'm not sure what I was thinking. > while I'm 99% certain there's a way to trip __kvm_set_or_clear_apicv_inhibit(), > the APIC page has its own non-guest_memfd memslot and so can't be used for the > initial image, which means that too is mutually exclusive. Hmm, well maybe KVM_BUG_ON() for kvm_zap_gfn_range() only if this gets addressed. > > So yeah, let's give it a shot. Worst case scenario we're wrong and TDH_MR_EXTEND > errors can be triggered by userspace. > > > But maybe a better reason is that we could better handle errors > > outside the fault. (i.e. no 5 line comment about why not to return an error in > > tdx_mem_page_add() due to code in another file). > > > > I wonder if Yan can give an analysis of any zapping races if we do that. > > As above, I think we're good? Works for me.
On Wed, 2025-08-27 at 12:08 -0700, Sean Christopherson wrote: > > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the > > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of > > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and > > TDH.MEM.PAGE.REMOVE. > > Eww. It would be nice to close that hole, but I suppose it's futile, e.g. the > underlying problem is unexpectedly removing pages from the initial, whether > the > VMM is doing stupid things before vs. after FINALIZE doesn't really matter. > > > As a result, the TD will still run with uninitialized memory. > > No? Because BLOCK+REMOVE means there are no valid S-EPT mappings. There's a > "hole" that the guest might not expect, but that hole will trigger an EPT > violation and only get "filled" if the guest explicitly accepts an AUG'd page. Ah, I just responded on another patch. I wonder if we can get rid of the premap cnt. > > Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing > nice with tdh_mem_page_add() failure necessitates both the > tdx_is_sept_zap_err_due_to_premap() craziness and the check in > tdx_td_finalize() > that all pending pages have been consumed. Reasons that tdh_mem_page_add() could get BUSY: 1. If two vCPU's tried to tdh_mem_page_add() the same gpa at the same time they could contend the SEPT entry lock 2. If one vCPU tries to tdh_mem_page_add() while the other zaps (i.e. tdh_mem_range_block()). I guess since we don't hold MMU lock while we tdh_mem_page_add(), 2 is a possibility. > > What reasonable use case is there for gracefully handling tdh_mem_page_add() > failure? > > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY > case. And if it's only for -EBUSY, why can't that be handled by retrying in > tdx_vcpu_init_mem_region()? If tdx_vcpu_init_mem_region() guarantees that all > pages mapped into the S-EPT are ADDed, then it can assert that there are no > pending pages when it completes (even if it "fails"), and similarly > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being > non-zero. Maybe we could take mmu write lock for the retry of tdh_mem_page_add(). Or maybe even for a single call of it, until someone wants to parallelize the operation.
On Thu, Aug 28, 2025 at 11:13:11AM +0800, Edgecombe, Rick P wrote: > On Wed, 2025-08-27 at 12:08 -0700, Sean Christopherson wrote: > > > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the > > > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of > > > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and > > > TDH.MEM.PAGE.REMOVE. > > > > Eww. It would be nice to close that hole, but I suppose it's futile, e.g. the > > underlying problem is unexpectedly removing pages from the initial, whether > > the > > VMM is doing stupid things before vs. after FINALIZE doesn't really matter. > > > > > As a result, the TD will still run with uninitialized memory. > > > > No? Because BLOCK+REMOVE means there are no valid S-EPT mappings. There's a > > "hole" that the guest might not expect, but that hole will trigger an EPT > > violation and only get "filled" if the guest explicitly accepts an AUG'd page. > > Ah, I just responded on another patch. I wonder if we can get rid of the premap > cnt. I think keeping it is safer. See my explanation at [1]. [1] https://lore.kernel.org/all/aK%2Fsdr2OQqYv9DBZ@yzhao56-desk.sh.intel.com. > > > > Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing > > nice with tdh_mem_page_add() failure necessitates both the > > tdx_is_sept_zap_err_due_to_premap() craziness and the check in > > tdx_td_finalize() > > that all pending pages have been consumed. > > Reasons that tdh_mem_page_add() could get BUSY: > 1. If two vCPU's tried to tdh_mem_page_add() the same gpa at the same time they > could contend the SEPT entry lock > 2. If one vCPU tries to tdh_mem_page_add() while the other zaps (i.e. > tdh_mem_range_block()). Hmm, two tdh_mem_page_add()s can't contend as they are protected by both slot_lock and filemap lock. With regard to the contention to tdh_mem_range_block(), please check my analysis at the above [1]. tdh_mem_page_add() could get BUSY though, when a misbehaved userspace invokes KVM_TDX_INIT_MEM_REGION on one vCPU while initializing another vCPU. Please check more details at [2]. [2] https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com/ > I guess since we don't hold MMU lock while we tdh_mem_page_add(), 2 is a > possibility. 2 is possible only for paranoid zaps. See "case 3. Unexpected zaps" in [1]. > > What reasonable use case is there for gracefully handling tdh_mem_page_add() > > failure? > > > > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY > > case. And if it's only for -EBUSY, why can't that be handled by retrying in > > tdx_vcpu_init_mem_region()? If tdx_vcpu_init_mem_region() guarantees that all > > pages mapped into the S-EPT are ADDed, then it can assert that there are no > > pending pages when it completes (even if it "fails"), and similarly > > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being > > non-zero. > > Maybe we could take mmu write lock for the retry of tdh_mem_page_add(). Or maybe > even for a single call of it, until someone wants to parallelize the operation. Hmm. I prefer returning -BUSY directly as invoking KVM_TDX_INIT_MEM_REGION before finishing initializing all vCPUs are uncommon.
On Thu, 2025-08-28 at 13:56 +0800, Yan Zhao wrote: > > Reasons that tdh_mem_page_add() could get BUSY: > > 1. If two vCPU's tried to tdh_mem_page_add() the same gpa at the same time > > they > > could contend the SEPT entry lock > > 2. If one vCPU tries to tdh_mem_page_add() while the other zaps (i.e. > > tdh_mem_range_block()). > Hmm, two tdh_mem_page_add()s can't contend as they are protected by both > slot_lock and filemap lock. > > With regard to the contention to tdh_mem_range_block(), please check my > analysis at the above [1]. The analysis missed the tdh_mem_page_add() failure path > > tdh_mem_page_add() could get BUSY though, when a misbehaved userspace invokes > KVM_TDX_INIT_MEM_REGION on one vCPU while initializing another vCPU. > > Please check more details at [2]. > > [2] https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com/ Ah, the TDR lock. I actually referred to an older version of your locking analysis that didn't have that one. But this means the premap count could get out of sync for that reason too. > > > > I guess since we don't hold MMU lock while we tdh_mem_page_add(), 2 is a > > possibility. > 2 is possible only for paranoid zaps. > See "case 3. Unexpected zaps" in [1]. Sean's lockdep assert handles half of those cases. Maybe we could also re- consider a KVM_BUG_ON() in the invalid zap paths again if it comes to it. > > > > > What reasonable use case is there for gracefully handling > > > tdh_mem_page_add() > > > failure? > > > > > > If there is a need to handle failure, I gotta imagine it's only for the - > > > EBUSY > > > case. And if it's only for -EBUSY, why can't that be handled by retrying > > > in > > > tdx_vcpu_init_mem_region()? If tdx_vcpu_init_mem_region() guarantees that > > > all > > > pages mapped into the S-EPT are ADDed, then it can assert that there are > > > no > > > pending pages when it completes (even if it "fails"), and similarly > > > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being > > > non-zero. > > > > Maybe we could take mmu write lock for the retry of tdh_mem_page_add(). Or > > maybe > > even for a single call of it, until someone wants to parallelize the > > operation. > Hmm. I prefer returning -BUSY directly as invoking KVM_TDX_INIT_MEM_REGION > before finishing initializing all vCPUs are uncommon. I was looking guaranteeing its success when Sean posted his suggestion to return to the original pattern. I'm in favor of that direction. If you agree we can call this moot.