Fold tdx_mem_page_record_premap_cnt() into tdx_sept_set_private_spte():
providing a one-off helper for effectively three lines of code is at best
a wash, and splitting the code makes the comment for smp_rmb() _extremely_
confusing, as the comment talks about reading kvm->arch.pre_fault_allowed
before kvm_tdx->state while the immediately visible code does the exact
opposite.
Opportunistically rewrite the comments to more explicitly explain who is
checking what, as well as _why_ the ordering matters.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/tdx.c | 49 ++++++++++++++++++------------------------
1 file changed, 21 insertions(+), 28 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b7559ea1e353..e4b70c0dbda3 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1608,29 +1608,6 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
return 0;
}
-/*
- * KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to map guest pages; the
- * callback tdx_gmem_post_populate() then maps pages into private memory.
- * through the a seamcall TDH.MEM.PAGE.ADD(). The SEAMCALL also requires the
- * private EPT structures for the page to have been built before, which is
- * done via kvm_tdp_map_page(). nr_premapped counts the number of pages that
- * were added to the EPT structures but not added with TDH.MEM.PAGE.ADD().
- * The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there
- * are no half-initialized shared EPT pages.
- */
-static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
- enum pg_level level, kvm_pfn_t pfn)
-{
- struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-
- if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
- return -EIO;
-
- /* nr_premapped will be decreased when tdh_mem_page_add() is called. */
- atomic64_inc(&kvm_tdx->nr_premapped);
- return 0;
-}
-
static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)
{
@@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
return -EIO;
/*
- * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
- * barrier in tdx_td_finalize().
+ * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory()
+ * before kvm_tdx->state. Userspace must not be allowed to pre-fault
+ * arbitrary memory until the initial memory image is finalized. Pairs
+ * with the smp_wmb() in tdx_td_finalize().
*/
smp_rmb();
- if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
- return tdx_mem_page_aug(kvm, gfn, level, pfn);
- return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
+ /*
+ * If the TD isn't finalized/runnable, then userspace is initializing
+ * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of
+ * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
+ * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks
+ * the counter to ensure all mapped pages have been added to the image,
+ * to prevent running the TD with uninitialized memory.
+ */
+ if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
+ if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
+ return -EIO;
+
+ atomic64_inc(&kvm_tdx->nr_premapped);
+ return 0;
+ }
+
+ return tdx_mem_page_aug(kvm, gfn, level, pfn);
}
static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
--
2.51.0.268.g9569e192d0-goog
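
For reference, a minimal userspace model of the publish/consume pairing the
new comment describes. This is illustrative C11 code, not the kernel
implementation; td_state and pre_fault_allowed simply mirror the kernel
fields, and the two kernel readers (kvm_arch_vcpu_pre_fault_memory() and
tdx_sept_set_private_spte()) are collapsed into one function. The point is
that the finalize side makes RUNNABLE visible before it allows pre-faulting,
so a reader that observes pre_fault_allowed as true is guaranteed to also
observe TD_STATE_RUNNABLE.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum td_state { TD_STATE_INITIALIZING, TD_STATE_RUNNABLE };

static _Atomic int td_state = TD_STATE_INITIALIZING;
static atomic_bool pre_fault_allowed;

/* Models tdx_td_finalize(): publish RUNNABLE before allowing pre-faults. */
static void finalize_td(void)
{
	atomic_store_explicit(&td_state, TD_STATE_RUNNABLE, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);		/* smp_wmb() */
	atomic_store_explicit(&pre_fault_allowed, true, memory_order_relaxed);
}

/* Read side, collapsed into one function: the flag first, then the state. */
static bool fault_path_consistent(void)
{
	bool allowed = atomic_load_explicit(&pre_fault_allowed, memory_order_relaxed);
	bool runnable;

	atomic_thread_fence(memory_order_acquire);		/* smp_rmb() */
	runnable = atomic_load_explicit(&td_state, memory_order_relaxed) == TD_STATE_RUNNABLE;

	/* If the flag was observed as set, the state write must be visible too. */
	return !allowed || runnable;
}

int main(void)
{
	finalize_td();
	printf("consistent: %d\n", fault_path_consistent());
	return 0;
}
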
On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: > Fold tdx_mem_page_record_premap_cnt() into tdx_sept_set_private_spte() as > providing a one-off helper for effectively three lines of code is at best a > wash, and splitting the code makes the comment for smp_rmb() _extremely_ > confusing as the comment talks about reading kvm->arch.pre_fault_allowed > before kvm_tdx->state, but the immediately visible code does the exact > opposite. > > Opportunistically rewrite the comments to more explicitly explain who is > checking what, as well as _why_ the ordering matters. > > No functional change intended. > > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/kvm/vmx/tdx.c | 49 ++++++++++++++++++------------------------ > 1 file changed, 21 insertions(+), 28 deletions(-) > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index b7559ea1e353..e4b70c0dbda3 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -1608,29 +1608,6 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn, > return 0; > } > > -/* > - * KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to map guest pages; the > - * callback tdx_gmem_post_populate() then maps pages into private memory. > - * through the a seamcall TDH.MEM.PAGE.ADD(). The SEAMCALL also requires the > - * private EPT structures for the page to have been built before, which is > - * done via kvm_tdp_map_page(). nr_premapped counts the number of pages that > - * were added to the EPT structures but not added with TDH.MEM.PAGE.ADD(). > - * The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there > - * are no half-initialized shared EPT pages. > - */ > -static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn, > - enum pg_level level, kvm_pfn_t pfn) > -{ > - struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > - > - if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) > - return -EIO; > - > - /* nr_premapped will be decreased when tdh_mem_page_add() is called. */ > - atomic64_inc(&kvm_tdx->nr_premapped); > - return 0; > -} > - > static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > enum pg_level level, kvm_pfn_t pfn) > { > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > return -EIO; > > /* > - * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching > - * barrier in tdx_td_finalize(). > + * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory() > + * before kvm_tdx->state. Userspace must not be allowed to pre-fault > + * arbitrary memory until the initial memory image is finalized. Pairs > + * with the smp_wmb() in tdx_td_finalize(). > */ > smp_rmb(); > - if (likely(kvm_tdx->state == TD_STATE_RUNNABLE)) > - return tdx_mem_page_aug(kvm, gfn, level, pfn); > > - return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn); > + /* > + * If the TD isn't finalized/runnable, then userspace is initializing > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > + * the counter to ensure all mapped pages have been added to the image, > + * to prevent running the TD with uninitialized memory. To prevent the mismatch between mirror EPT and the S-EPT? e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. 
The count of nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and TDH.MEM.PAGE.REMOVE. As a result, the TD will still run with uninitialized memory. > + */ > + if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) { > + if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) > + return -EIO; > + > + atomic64_inc(&kvm_tdx->nr_premapped); > + return 0; > + } > + > + return tdx_mem_page_aug(kvm, gfn, level, pfn); > } > > static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, > -- > 2.51.0.268.g9569e192d0-goog >
Yan Zhao wrote: > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: [snip] > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > return -EIO; > > > > /* > > - * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching > > - * barrier in tdx_td_finalize(). > > + * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory() > > + * before kvm_tdx->state. Userspace must not be allowed to pre-fault > > + * arbitrary memory until the initial memory image is finalized. Pairs > > + * with the smp_wmb() in tdx_td_finalize(). > > */ > > smp_rmb(); > > - if (likely(kvm_tdx->state == TD_STATE_RUNNABLE)) > > - return tdx_mem_page_aug(kvm, gfn, level, pfn); > > > > - return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn); > > + /* > > + * If the TD isn't finalized/runnable, then userspace is initializing > > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > > + * the counter to ensure all mapped pages have been added to the image, > > + * to prevent running the TD with uninitialized memory. > To prevent the mismatch between mirror EPT and the S-EPT? > > e.g., Before KVM_TDX_FINALIZE_VM, > if userspace performs a zap after the TDH.MEM.PAGE.ADD, the page will be removed > from the S-EPT. The count of nr_premapped will not change after the successful > TDH.MEM.RANGE.BLOCK and TDH.MEM.PAGE.REMOVE. > > As a result, the TD will still run with uninitialized memory. I'm wondering if we are trying to over-architect this. Should we even allow KVM_TDX_FINALIZE_VM to race with KVM_TDX_INIT_MEM_REGION? What is the use case for that? It seems a basic sanity check/KVM_BUG_ON would suffice to tell the user; Don't start adding memory dynamically until yall have finalized the VM. Ira
On Wed, Aug 27, 2025, Yan Zhao wrote: > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > return -EIO; > > > > /* > > - * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching > > - * barrier in tdx_td_finalize(). > > + * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory() > > + * before kvm_tdx->state. Userspace must not be allowed to pre-fault > > + * arbitrary memory until the initial memory image is finalized. Pairs > > + * with the smp_wmb() in tdx_td_finalize(). > > */ > > smp_rmb(); > > - if (likely(kvm_tdx->state == TD_STATE_RUNNABLE)) > > - return tdx_mem_page_aug(kvm, gfn, level, pfn); > > > > - return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn); > > + /* > > + * If the TD isn't finalized/runnable, then userspace is initializing > > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > > + * the counter to ensure all mapped pages have been added to the image, > > + * to prevent running the TD with uninitialized memory. > To prevent the mismatch between mirror EPT and the S-EPT? No? Because KVM bumps the count when installing the S-EPT and decrements it on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues? > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and > TDH.MEM.PAGE.REMOVE. Eww. It would be nice to close that hole, but I suppose it's futile, e.g. the underlying problem is unexpectedly removing pages from the initial, whether the VMM is doing stupid things before vs. after FINALIZE doesn't really matter. > As a result, the TD will still run with uninitialized memory. No? Because BLOCK+REMOVE means there are no valid S-EPT mappings. There's a "hole" that the guest might not expect, but that hole will trigger an EPT violation and only get "filled" if the guest explicitly accepts an AUG'd page. Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing nice with tdh_mem_page_add() failure necessitates both the tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize() that all pending pages have been consumed. What reasonable use case is there for gracefully handling tdh_mem_page_add() failure? If there is a need to handle failure, I gotta imagine it's only for the -EBUSY case. And if it's only for -EBUSY, why can't that be handled by retrying in tdx_vcpu_init_mem_region()? If tdx_vcpu_init_mem_region() guarantees that all pages mapped into the S-EPT are ADDed, then it can assert that there are no pending pages when it completes (even if it "fails"), and similarly tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being non-zero. > > + */ > > + if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) { > > + if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) > > + return -EIO; > > + > > + atomic64_inc(&kvm_tdx->nr_premapped); > > + return 0; > > + } > > + > > + return tdx_mem_page_aug(kvm, gfn, level, pfn); > > } > > > > static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, > > -- > > 2.51.0.268.g9569e192d0-goog > >
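
Purely as a sketch of the retry idea floated above, and not the current
kernel behavior (which simply fails the ioctl with -EBUSY), a bounded
retry that distinguishes a busy resource from a fatal error could look
roughly like the following; fake_page_add() is a made-up stand-in for the
SEAMCALL wrapper.

#include <errno.h>
#include <stdio.h>

/* Stand-in for the SEAMCALL wrapper; reports busy a couple of times, then 0. */
static int fake_page_add(void)
{
	static int busy_left = 2;

	return busy_left-- > 0 ? -EBUSY : 0;
}

static int add_with_retry(int max_retries)
{
	for (int i = 0; i <= max_retries; i++) {
		int ret = fake_page_add();

		if (ret != -EBUSY)
			return ret;	/* success, or a fatal error */
	}
	return -EBUSY;			/* still contended, give up */
}

int main(void)
{
	printf("add_with_retry() = %d\n", add_with_retry(5));
	return 0;
}
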
Sean Christopherson wrote: > On Wed, Aug 27, 2025, Yan Zhao wrote: > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: > > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > > return -EIO; [snip] > > > /* > > > - * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching > > > - * barrier in tdx_td_finalize(). > > > + * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory() > > > + * before kvm_tdx->state. Userspace must not be allowed to pre-fault > > > + * arbitrary memory until the initial memory image is finalized. Pairs > > > + * with the smp_wmb() in tdx_td_finalize(). > > > */ > > > smp_rmb(); > > > - if (likely(kvm_tdx->state == TD_STATE_RUNNABLE)) > > > - return tdx_mem_page_aug(kvm, gfn, level, pfn); > > > > > > - return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn); > > > + /* > > > + * If the TD isn't finalized/runnable, then userspace is initializing > > > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > > > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > > > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > > > + * the counter to ensure all mapped pages have been added to the image, > > > + * to prevent running the TD with uninitialized memory. > > To prevent the mismatch between mirror EPT and the S-EPT? > > No? Because KVM bumps the count when installing the S-EPT and decrements it > on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues? > > > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the > > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of > > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and > > TDH.MEM.PAGE.REMOVE. > > Eww. It would be nice to close that hole, but I suppose it's futile, e.g. the > underlying problem is unexpectedly removing pages from the initial, whether the > VMM is doing stupid things before vs. after FINALIZE doesn't really matter. > > > As a result, the TD will still run with uninitialized memory. > > No? Because BLOCK+REMOVE means there are no valid S-EPT mappings. There's a > "hole" that the guest might not expect, but that hole will trigger an EPT > violation and only get "filled" if the guest explicitly accepts an AUG'd page. > > Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing > nice with tdh_mem_page_add() failure necessitates both the > tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize() > that all pending pages have been consumed. > > What reasonable use case is there for gracefully handling tdh_mem_page_add() failure? > > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY > case. And if it's only for -EBUSY, why can't that be handled by retrying in > tdx_vcpu_init_mem_region()? If tdx_vcpu_init_mem_region() guarantees that all > pages mapped into the S-EPT are ADDed, then it can assert that there are no > pending pages when it completes (even if it "fails"), and similarly > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being > non-zero. Ah just reading this... yea I'm wondering the same thing. Ira [snip]
On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote: > On Wed, Aug 27, 2025, Yan Zhao wrote: > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: > > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > > return -EIO; > > > > > > /* > > > - * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching > > > - * barrier in tdx_td_finalize(). > > > + * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory() > > > + * before kvm_tdx->state. Userspace must not be allowed to pre-fault > > > + * arbitrary memory until the initial memory image is finalized. Pairs > > > + * with the smp_wmb() in tdx_td_finalize(). > > > */ > > > smp_rmb(); > > > - if (likely(kvm_tdx->state == TD_STATE_RUNNABLE)) > > > - return tdx_mem_page_aug(kvm, gfn, level, pfn); > > > > > > - return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn); > > > + /* > > > + * If the TD isn't finalized/runnable, then userspace is initializing > > > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > > > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > > > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > > > + * the counter to ensure all mapped pages have been added to the image, > > > + * to prevent running the TD with uninitialized memory. > > To prevent the mismatch between mirror EPT and the S-EPT? > > No? Because KVM bumps the count when installing the S-EPT and decrements it > on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues? Hmm, I think there must be some misunderstanding. Before userspace invokes KVM_TDX_FINALIZE_VM, ======= 1. the normal path (userspace invokes KVM_TDX_INIT_MEM_REGION). (1) KVM holds slot_lock and filemap lock. (2) KVM invokes kvm_tdp_map_page() (or kvm_tdp_mmu_map_private_pfn() in patch 2). KVM increases nr_premapped in tdx_sept_set_private_spte() to indicate that there's a page mapped in M-EPT, while it's not yet installed in S-EPT. (3) KVM invokes TDH.MEM.PAGE.ADD and decreases nr_premapped, indicating the page has been mapped in S-EPT too. As the name of nr_premapped indicates, the count means a page is pre-mapped in the M-EPT, before its real mapping in the S-EPT. If ADD fails in step (3), nr_premapped will not be decreased. With mere the normal path, nr_premapped should return back to 0 after all KVM_TDX_INIT_MEM_REGIONs. 2. Expected zap paths (e.g. If userspace does something strange, such as removing a slot after KVM_TDX_INIT_MEM_REGION) Those zap paths could be triggered by 1) userspace performs a page attribute conversion 2) userspace invokes gmem punch hole 3) userspace removes a slot As all those paths either hold a slot_lock or a filemap lock, they can't contend with tdx_vcpu_init_mem_region() (tdx_vcpu_init_mem_region holds both slot_lock and internally filemap lock). Consequently, those zaps must occur a) before kvm_tdp_map_page() or b) after TDH.MEM.PAGE.ADD. For a), tdx_sept_zap_private_spte() won't not be invoked as the page is not mapped in M-EPT either; For b), tdx_sept_zap_private_spte() should succeed, as the BLOCK and REMOVE SEAMCALLs are following the ADD. nr_premapped is therere unchanged, since it does not change the consistency between M-EPT and S-EPT. 3. Unexpected zaps (such as kvm_zap_gfn_range()). Those zaps are currently just paranoid ones. Not found in any existing paths yet. 
i.e., We want to detect any future code or any missed code piecies, which invokes kvm_zap_gfn_range() (or maybe zaps under read mmu_lock). As those zaps do not necessarily hold slot_lock or filemap lock, they may ocurr after installing M-EPT and before installing S-EPT. As a result, the BLOCK fails and tdx_is_sept_zap_err_due_to_premap() returns true. Decreasing nr_premapped here to indicate the count of pages mapped in M-EPT but not in S-EPT decreases. TDH.MEM.PAGE.ADD after this zap can still succeed. If this occurs, the page will be mapped in S-EPT only. As KVM also decreases nr_premapped after a successful TDH.MEM.PAGE.ADD, the nr_premapped will be <0 in the end. So, we will be able to detect those unexpected zaps. When userspace invokes KVM_TDX_FINALIZE_VM, ======= The nr_premapped must be 0 before tdx_td_finalize() succeeds. The nr_premapped could be 0 if (1) userspace invokes KVM_TDX_INIT_MEM_REGIONs as in a normal way. (2) userspace never triggers any KVM_TDX_INIT_MEM_REGION. (3) userspace triggers KVM_TDX_INIT_MEM_REGION but zaps all initial memory regions. For (2)and(3), KVM_TDX_FINALIZE_VM can still succeed. So, TD can still run with uninitialized memory. > > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the > > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of > > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and > > TDH.MEM.PAGE.REMOVE. > > Eww. It would be nice to close that hole, but I suppose it's futile, e.g. the > underlying problem is unexpectedly removing pages from the initial, whether the > VMM is doing stupid things before vs. after FINALIZE doesn't really matter. Are you referring to the above "case 2 Expected zap paths"? It's equal to that userspace never triggers any KVM_TDX_INIT_MEM_REGION. We can't force userspace must invoke KVM_TDX_INIT_MEM_REGION after all. I don't think there's a hole from the guest point of view. See below. > > As a result, the TD will still run with uninitialized memory. > > No? Because BLOCK+REMOVE means there are no valid S-EPT mappings. There's a > "hole" that the guest might not expect, but that hole will trigger an EPT > violation and only get "filled" if the guest explicitly accepts an AUG'd page. If TD runs with unintialized memory, - for the linux guest, it will cause TD to access unaccepted memory and get killed by KVM; - for non-linux guest which configured with #VE, the guest will see #VE and be informed of that the page must be accepted before access. Though the guest should not be able to run without any initial code, but there's not any security problem. > Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing We don't. It returns -EBUSY or -EIO immediately. > nice with tdh_mem_page_add() failure necessitates both the > tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize() > that all pending pages have been consumed. tdx_is_sept_zap_err_due_to_premap() detects the error of BLOCK, which is caused by executing BLOCK before ADD. > What reasonable use case is there for gracefully handling tdh_mem_page_add() failure? If tdh_mem_page_add() fails, the KVM_TDX_INIT_MEM_REGION just fails. > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY > case. And if it's only for -EBUSY, why can't that be handled by retrying in > tdx_vcpu_init_mem_region()? 
If tdx_vcpu_init_mem_region() guarantees that all I analyzed the contention status of tdh_mem_sept_add() at https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com. As the userspace is expected to execute KVM_TDX_INIT_MEM_REGION in only one vCPU, returning -EBUSY instead of retrying looks safer and easier. > pages mapped into the S-EPT are ADDed, then it can assert that there are no > pending pages when it completes (even if it "fails"), and similarly > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being > non-zero. tdx_td_finalize() now just returns -EINVAL in case of nr_premapped being !0. KVM_BUG_ON/WARN_ON should be also ok. > > > + */ > > > + if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) { > > > + if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) > > > + return -EIO; > > > + > > > + atomic64_inc(&kvm_tdx->nr_premapped); > > > + return 0; > > > + } > > > + > > > + return tdx_mem_page_aug(kvm, gfn, level, pfn); > > > } > > > > > > static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, > > > -- > > > 2.51.0.268.g9569e192d0-goog > > >
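
To make the accounting above concrete, here is a rough single-threaded
userspace model of the nr_premapped lifecycle on the normal path. The helper
names are made up for illustration; the real counter is an atomic in struct
kvm_tdx, and the zap edge cases described above are omitted. The count goes
up when the mirror EPT entry is installed, comes back down only after a
successful TDH.MEM.PAGE.ADD, and KVM_TDX_FINALIZE_VM refuses to proceed
while it is non-zero.

#include <stdbool.h>
#include <stdio.h>

static long nr_premapped;

/* Step (2): page installed in the mirror EPT before the TD is finalized. */
static void map_in_mirror_ept(void)
{
	nr_premapped++;
}

/* Step (3): page actually added to the S-EPT; only then drop the count. */
static bool page_add(bool seamcall_succeeds)
{
	if (!seamcall_succeeds)
		return false;	/* ADD failed: the count stays elevated */
	nr_premapped--;
	return true;
}

/* KVM_TDX_FINALIZE_VM: refuse to run the TD while pages are still pending. */
static bool td_finalize(void)
{
	return nr_premapped == 0;
}

int main(void)
{
	map_in_mirror_ept();
	page_add(true);
	printf("finalize after clean ADD:  %s\n", td_finalize() ? "ok" : "rejected");

	map_in_mirror_ept();
	page_add(false);
	printf("finalize after failed ADD: %s\n", td_finalize() ? "ok" : "rejected");
	return 0;
}
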
On Thu, Aug 28, 2025, Yan Zhao wrote: > On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote: > > On Wed, Aug 27, 2025, Yan Zhao wrote: > > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: > > > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > > > + /* > > > > + * If the TD isn't finalized/runnable, then userspace is initializing > > > > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > > > > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > > > > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > > > > + * the counter to ensure all mapped pages have been added to the image, > > > > + * to prevent running the TD with uninitialized memory. > > > To prevent the mismatch between mirror EPT and the S-EPT? > > > > No? Because KVM bumps the count when installing the S-EPT and decrements it > > on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues? > Hmm, I think there must be some misunderstanding. Yeah, I forgot that AUG and ADD create the leaf S-EPT entries. > Before userspace invokes KVM_TDX_FINALIZE_VM, > ======= > 1. the normal path (userspace invokes KVM_TDX_INIT_MEM_REGION). > (1) KVM holds slot_lock and filemap lock. > (2) KVM invokes kvm_tdp_map_page() (or kvm_tdp_mmu_map_private_pfn() in > patch 2). > KVM increases nr_premapped in tdx_sept_set_private_spte() to indicate > that there's a page mapped in M-EPT, while it's not yet installed in > S-EPT. > (3) KVM invokes TDH.MEM.PAGE.ADD and decreases nr_premapped, indicating the > page has been mapped in S-EPT too. > > As the name of nr_premapped indicates, the count means a page is pre-mapped > in the M-EPT, before its real mapping in the S-EPT. > If ADD fails in step (3), nr_premapped will not be decreased. > > With mere the normal path, nr_premapped should return back to 0 after all > KVM_TDX_INIT_MEM_REGIONs. > > > 2. Expected zap paths (e.g. If userspace does something strange, such as > removing a slot after KVM_TDX_INIT_MEM_REGION) > Those zap paths could be triggered by > 1) userspace performs a page attribute conversion > 2) userspace invokes gmem punch hole > 3) userspace removes a slot > As all those paths either hold a slot_lock or a filemap lock, they can't > contend with tdx_vcpu_init_mem_region() (tdx_vcpu_init_mem_region holds both > slot_lock and internally filemap lock). > Consequently, those zaps must occur > a) before kvm_tdp_map_page() or > b) after TDH.MEM.PAGE.ADD. > For a), tdx_sept_zap_private_spte() won't not be invoked as the page is not > mapped in M-EPT either; > For b), tdx_sept_zap_private_spte() should succeed, as the BLOCK and REMOVE > SEAMCALLs are following the ADD. > nr_premapped is therere unchanged, since it does not change the consistency > between M-EPT and S-EPT. > > 3. Unexpected zaps (such as kvm_zap_gfn_range()). Side topic related to kvm_zap_gfn_range(), the KVM_BUG_ON() in vt_refresh_apicv_exec_ctrl() is flawed. If kvm_recalculate_apic_map() fails to allocate an optimized map, KVM will mark APICv as inhibited, i.e. the associated WARN_ON_ONCE() is effectively user-triggerable. Easiest thing would be to mark the vCPU as dead (though we obviously need "KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests" for that to be robust). 
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c index dbab1c15b0cd..1c0b43ff9544 100644 --- a/arch/x86/kvm/vmx/main.c +++ b/arch/x86/kvm/vmx/main.c @@ -719,7 +719,8 @@ static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu) static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) { if (is_td_vcpu(vcpu)) { - KVM_BUG_ON(!kvm_vcpu_apicv_active(vcpu), vcpu->kvm); + if (!kvm_vcpu_apicv_active(vcpu)) + kvm_make_request(KVM_REQ_VM_DEAD, vcpu); return; } > Those zaps are currently just paranoid ones. Not found in any existing paths > yet. i.e., > We want to detect any future code or any missed code piecies, which invokes > kvm_zap_gfn_range() (or maybe zaps under read mmu_lock). > > As those zaps do not necessarily hold slot_lock or filemap lock, they may > ocurr after installing M-EPT and before installing S-EPT. > As a result, the BLOCK fails and tdx_is_sept_zap_err_due_to_premap() returns > true. > Decreasing nr_premapped here to indicate the count of pages mapped in M-EPT > but not in S-EPT decreases. > > TDH.MEM.PAGE.ADD after this zap can still succeed. If this occurs, the page > will be mapped in S-EPT only. As KVM also decreases nr_premapped after a > successful TDH.MEM.PAGE.ADD, the nr_premapped will be <0 in the end. > So, we will be able to detect those unexpected zaps. > > > When userspace invokes KVM_TDX_FINALIZE_VM, > ======= > The nr_premapped must be 0 before tdx_td_finalize() succeeds. > > The nr_premapped could be 0 if > (1) userspace invokes KVM_TDX_INIT_MEM_REGIONs as in a normal way. > (2) userspace never triggers any KVM_TDX_INIT_MEM_REGION. > (3) userspace triggers KVM_TDX_INIT_MEM_REGION but zaps all initial memory > regions. > > For (2)and(3), KVM_TDX_FINALIZE_VM can still succeed. Ya, we're in agreement on what can happen. I think all of the confusion was due to me forgetting that TDH.MEM.PAGE.ADD is what actually installes the leaf S-EPT entry. > So, TD can still run with uninitialized memory. No, the TD can never run with truly uninitialized memory. By "uninitialized", I mean memory that the guest can access and which has not been written to. Again, my confusion was due to thinking a page was already mapped into the guest, but awaiting TDH.MEM.PAGE.ADD to > > Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing > We don't. It returns -EBUSY or -EIO immediately. But that _is_ tolerating failure, in the sense that KVM doesn't prevent further actions on the VM. Tolerating failure is fine in general, but in this case it leaves the MMU is left in a half-baked state. > > nice with tdh_mem_page_add() failure necessitates both the > > tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize() > > that all pending pages have been consumed. > > tdx_is_sept_zap_err_due_to_premap() detects the error of BLOCK, which is caused > by executing BLOCK before ADD. We need to make this situation impossible. > > What reasonable use case is there for gracefully handling tdh_mem_page_add() failure? > If tdh_mem_page_add() fails, the KVM_TDX_INIT_MEM_REGION just fails. > > > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY > > case. And if it's only for -EBUSY, why can't that be handled by retrying in > > tdx_vcpu_init_mem_region()? If tdx_vcpu_init_mem_region() guarantees that all > I analyzed the contention status of tdh_mem_sept_add() at > https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com. 
> > As the userspace is expected to execute KVM_TDX_INIT_MEM_REGION in only one > vCPU, returning -EBUSY instead of retrying looks safer and easier. > > > pages mapped into the S-EPT are ADDed, then it can assert that there are no > > pending pages when it completes (even if it "fails"), and similarly > > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being > > non-zero. > tdx_td_finalize() now just returns -EINVAL in case of nr_premapped being !0. > KVM_BUG_ON/WARN_ON should be also ok. Ok, so I vaguely recall that I may have pushed back on using a scratch field in "struct kvm_tdx" for temporary data (or maybe it was abusing vcpus[0] that I disliked?), but what we ended up with is far worse. For all intents and purposes, nr_premapped _is_ a temporary scratch field, but with access rules that are all but impossible to understand, e.g. there's practically zero chance anyone could suss out complications with "Unexpected zaps", and the existence of that super subtle edge case necessitates using an atomic because KVM can't strictly guarantee that access to the field is mutually exclusive. And that also means it's inherently racy, e.g. if a zap happens while tdx_td_finalize() is checking nr_premapped, what happens? The real killer is that deferring TDH.MEM.PAGE.ADD and TDH.MR_EXTEND until after the map completes and mmu_lock is dropped means that failure at that point leaves the TDP MMU in an inconsistent state, where the M-EPT has a present page but the S-EPT does not. Eww. Note, in no way am I trying to blame anyone; quite the opposite, you've done an admirable job to get all of this landed. And I apologize if any of my past feedback led y'all down this road. I suspect my prefaulting idea really screwed things up; sorry :-( Back to the code, unless I'm missing yet another complication, I think the obvious fix to all of this is to pass the source page and metadata flags via a scratch field in "struct kvm_tdx", and then do PAGE.ADD and MR.EXTEND in tdx_sept_set_private_spte(). Then there is no need to keep track of pending pages, because the M-EPT and S-EPT are always consistent. E.g. if PAGE.ADD fails with -EBUSY, then KVM will naturally revert the M-EPT entry from FROZEN to !PRESENT. It also allows KVM to KVM_BUG_ON() MR.EXTEND failure, because it should be impossible for the S-EPT entry to be modified between PAGE.ADD and MR.EXTEND. Diff on top below for feedback on the idea. A proper series for this would simply replace several of the patches, e.g. asserting that slots_lock is held on tdx_is_sept_zap_err_due_to_premap() is wrong. 
--- arch/x86/kvm/vmx/tdx.c | 157 +++++++++++++++++------------------------ arch/x86/kvm/vmx/tdx.h | 11 +-- 2 files changed, 70 insertions(+), 98 deletions(-) diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index f9ac590e8ff0..5d981a061442 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -1586,6 +1586,56 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level) td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa); } + +struct kvm_tdx_page_add { + struct page *src; + unsigned long flags; +}; + +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level, + kvm_pfn_t pfn) +{ + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); + u64 err, entry, level_state; + gpa_t gpa = gfn_to_gpa(gfn); + int i; + + lockdep_assert_held(&kvm->slots_lock); + + if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) || + KVM_BUG_ON(!kvm_tdx->page_add, kvm)) + return -EIO; + + err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), + kvm_tdx->page_add->src, &entry, &level_state); + if (unlikely(tdx_operand_busy(err))) + return -EBUSY; + + if (KVM_BUG_ON(err, kvm)) { + pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state); + return -EIO; + } + + if (!(kvm_tdx->page_add->flags & KVM_TDX_MEASURE_MEMORY_REGION)) + return 0; + + /* + * Extend the measurement while holding mmu_lock to ensure MR.EXTEND + * can't fail, e.g. due to the S-EPT entry being zapped after PAGE.ADD. + * Note! If extended the measurement fails, bug the VM, but do NOT + * return an error, as mapping the page in the S-EPT succeeded and so + * needs to be tracked in KVM's mirror page tables. + */ + for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { + err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state); + if (KVM_BUG_ON(err, kvm)) { + pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state); + break; + } + } + return 0; +} + static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn, enum pg_level level, kvm_pfn_t pfn) { @@ -1627,21 +1677,11 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, /* * If the TD isn't finalized/runnable, then userspace is initializing - * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of - * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD - * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks - * the counter to ensure all mapped pages have been added to the image, - * to prevent running the TD with uninitialized memory. + * the VM image via KVM_TDX_INIT_MEM_REGION. Add the page to the TD, + * and optionally extend the measurement with the page contents. */ - if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) { - lockdep_assert_held(&kvm->slots_lock); - - if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) - return -EIO; - - kvm_tdx->nr_pending_tdh_mem_page_adds++; - return 0; - } + if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) + return tdx_mem_page_add(kvm, gfn, level, pfn); return tdx_mem_page_aug(kvm, gfn, level, pfn); } @@ -1716,39 +1756,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn, return 0; } -/* - * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is - * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called - * successfully. - * - * Since tdh_mem_sept_add() must have been invoked successfully before a - * non-leaf entry present in the mirrored page table, the SEPT ZAP related - * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. 
They should instead - * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the - * SEPT. - * - * Further check if the returned entry from SEPT walking is with RWX permissions - * to filter out anything unexpected. - * - * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from - * level_state returned from a SEAMCALL error is the same as that passed into - * the SEAMCALL. - */ -static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err, - u64 entry, int level) -{ - if (!err || kvm_tdx->state == TD_STATE_RUNNABLE) - return false; - - if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX)) - return false; - - if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK))) - return false; - - return true; -} - static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level, struct page *page) { @@ -1768,15 +1775,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state); tdx_no_vcpus_enter_stop(kvm); } - if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) { - lockdep_assert_held(&kvm->slots_lock); - - if (KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm)) - return -EIO; - - return 0; - } - if (KVM_BUG_ON(err, kvm)) { pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state); return -EIO; @@ -2842,12 +2840,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd) if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE) return -EINVAL; - /* - * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue - * TDH.MEM.PAGE.ADD(). - */ - if (kvm_tdx->nr_pending_tdh_mem_page_adds) - return -EINVAL; cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td); if (tdx_operand_busy(cmd->hw_error)) @@ -3131,50 +3123,29 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, { struct tdx_gmem_post_populate_arg *arg = _arg; struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); - u64 err, entry, level_state; - gpa_t gpa = gfn_to_gpa(gfn); - struct page *src_page; - int ret, i; + struct kvm_tdx_page_add page_add = { + .flags = arg->flags, + }; + int ret; - lockdep_assert_held(&kvm->slots_lock); + if (KVM_BUG_ON(kvm_tdx->page_add, kvm)) + return -EIO; /* * Get the source page if it has been faulted in. Return failure if the * source page has been swapped out or unmapped in primary memory. */ - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); + ret = get_user_pages_fast((unsigned long)src, 1, 0, &page_add.src); if (ret < 0) return ret; if (ret != 1) return -ENOMEM; + kvm_tdx->page_add = &page_add; ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); - if (ret < 0) - goto out; + kvm_tdx->page_add = NULL; - ret = 0; - err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), - src_page, &entry, &level_state); - if (err) { - ret = unlikely(tdx_operand_busy(err)) ? 
-EBUSY : -EIO; - goto out; - } - - KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm); - - if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) { - for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { - err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, - &level_state); - if (err) { - ret = -EIO; - break; - } - } - } - -out: - put_page(src_page); + put_page(page_add.src); return ret; } diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h index 45d86f9fa41c..39e0c3bcc866 100644 --- a/arch/x86/kvm/vmx/tdx.h +++ b/arch/x86/kvm/vmx/tdx.h @@ -21,6 +21,8 @@ enum kvm_tdx_state { TD_STATE_RUNNABLE, }; +struct kvm_tdx_add_page; + struct kvm_tdx { struct kvm kvm; @@ -37,12 +39,11 @@ struct kvm_tdx { struct tdx_td td; /* - * The number of pages that KVM_TDX_INIT_MEM_REGION has mapped into the - * S-EPT, but not yet initialized via TDH.MEM.PAGE_ADD. Used to sanity - * check adding pages to the image, and to ensure that all pages have - * been initialized before finalizing the TD. + * Scratch structure used to pass the source page and metadata flags to + * tdx_mem_page_add. Protected by slots_lock, and non-NULL only when + * mapping a private pfn via tdx_gmem_post_populate(). */ - unsigned long nr_pending_tdh_mem_page_adds; + struct kvm_tdx_page_add *page_add; /* * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do base-commit: 7c7a3893b102bdeb4826f7140280b7b16081b385 --
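
A condensed userspace restatement of the control flow in the diff above,
with the kernel details stubbed out. The types and helpers here are
illustrative only; in the real code kvm->slots_lock is held across the whole
sequence, which is what makes a bare scratch pointer in struct kvm_tdx safe.
tdx_gmem_post_populate() publishes the source page and flags, the map call
faults into tdx_sept_set_private_spte(), which consumes them in
tdx_mem_page_add(), and the pointer is cleared before returning.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct page_add_args {
	const void *src;		/* source page to copy into the TD */
	unsigned long flags;		/* e.g. "measure this region" */
};

struct td {
	struct page_add_args *page_add;	/* non-NULL only inside post_populate() */
	bool runnable;
};

/* Models tdx_mem_page_add(): consumes the scratch args set by the caller. */
static int mem_page_add(struct td *td, unsigned long gfn)
{
	if (!td->page_add)
		return -1;	/* would be KVM_BUG_ON() in the real code */
	printf("TDH.MEM.PAGE.ADD gfn %#lx, flags %#lx\n", gfn, td->page_add->flags);
	return 0;
}

/* Models the fault callback reached from the map call below. */
static int set_private_spte(struct td *td, unsigned long gfn)
{
	return td->runnable ? 0 /* AUG path */ : mem_page_add(td, gfn);
}

/* Models tdx_gmem_post_populate(): publish args, map, then clear them. */
static int post_populate(struct td *td, unsigned long gfn, const void *src)
{
	struct page_add_args args = { .src = src, .flags = 0x1 };
	int ret;

	td->page_add = &args;
	ret = set_private_spte(td, gfn);	/* stands in for the map + fault */
	td->page_add = NULL;
	return ret;
}

int main(void)
{
	struct td td = { 0 };
	char src[] = "initial image";

	return post_populate(&td, 0x1000, src);
}
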
On Thu, Aug 28, 2025 at 10:00:28AM -0700, Sean Christopherson wrote: > On Thu, Aug 28, 2025, Yan Zhao wrote: > > On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote: > > > On Wed, Aug 27, 2025, Yan Zhao wrote: > > > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: > > > > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > > > > + /* > > > > > + * If the TD isn't finalized/runnable, then userspace is initializing > > > > > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > > > > > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > > > > > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > > > > > + * the counter to ensure all mapped pages have been added to the image, > > > > > + * to prevent running the TD with uninitialized memory. > > > > To prevent the mismatch between mirror EPT and the S-EPT? > > > > > > No? Because KVM bumps the count when installing the S-EPT and decrements it > > > on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues? > > Hmm, I think there must be some misunderstanding. > > Yeah, I forgot that AUG and ADD create the leaf S-EPT entries. > > > Before userspace invokes KVM_TDX_FINALIZE_VM, > > ======= > > 1. the normal path (userspace invokes KVM_TDX_INIT_MEM_REGION). > > (1) KVM holds slot_lock and filemap lock. > > (2) KVM invokes kvm_tdp_map_page() (or kvm_tdp_mmu_map_private_pfn() in > > patch 2). > > KVM increases nr_premapped in tdx_sept_set_private_spte() to indicate > > that there's a page mapped in M-EPT, while it's not yet installed in > > S-EPT. > > (3) KVM invokes TDH.MEM.PAGE.ADD and decreases nr_premapped, indicating the > > page has been mapped in S-EPT too. > > > > As the name of nr_premapped indicates, the count means a page is pre-mapped > > in the M-EPT, before its real mapping in the S-EPT. > > If ADD fails in step (3), nr_premapped will not be decreased. > > > > With mere the normal path, nr_premapped should return back to 0 after all > > KVM_TDX_INIT_MEM_REGIONs. > > > > > > 2. Expected zap paths (e.g. If userspace does something strange, such as > > removing a slot after KVM_TDX_INIT_MEM_REGION) > > Those zap paths could be triggered by > > 1) userspace performs a page attribute conversion > > 2) userspace invokes gmem punch hole > > 3) userspace removes a slot > > As all those paths either hold a slot_lock or a filemap lock, they can't > > contend with tdx_vcpu_init_mem_region() (tdx_vcpu_init_mem_region holds both > > slot_lock and internally filemap lock). > > Consequently, those zaps must occur > > a) before kvm_tdp_map_page() or > > b) after TDH.MEM.PAGE.ADD. > > For a), tdx_sept_zap_private_spte() won't not be invoked as the page is not > > mapped in M-EPT either; > > For b), tdx_sept_zap_private_spte() should succeed, as the BLOCK and REMOVE > > SEAMCALLs are following the ADD. > > nr_premapped is therere unchanged, since it does not change the consistency > > between M-EPT and S-EPT. > > > > 3. Unexpected zaps (such as kvm_zap_gfn_range()). > > Side topic related to kvm_zap_gfn_range(), the KVM_BUG_ON() in vt_refresh_apicv_exec_ctrl() > is flawed. If kvm_recalculate_apic_map() fails to allocate an optimized map, KVM > will mark APICv as inhibited, i.e. the associated WARN_ON_ONCE() is effectively > user-triggerable. 
> > Easiest thing would be to mark the vCPU as dead (though we obviously need > "KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests" for that to be robust). > > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c > index dbab1c15b0cd..1c0b43ff9544 100644 > --- a/arch/x86/kvm/vmx/main.c > +++ b/arch/x86/kvm/vmx/main.c > @@ -719,7 +719,8 @@ static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu) > static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) > { > if (is_td_vcpu(vcpu)) { > - KVM_BUG_ON(!kvm_vcpu_apicv_active(vcpu), vcpu->kvm); > + if (!kvm_vcpu_apicv_active(vcpu)) > + kvm_make_request(KVM_REQ_VM_DEAD, vcpu); > return; > } > > > Those zaps are currently just paranoid ones. Not found in any existing paths > > yet. i.e., > > We want to detect any future code or any missed code piecies, which invokes > > kvm_zap_gfn_range() (or maybe zaps under read mmu_lock). > > > > As those zaps do not necessarily hold slot_lock or filemap lock, they may > > ocurr after installing M-EPT and before installing S-EPT. > > As a result, the BLOCK fails and tdx_is_sept_zap_err_due_to_premap() returns > > true. > > Decreasing nr_premapped here to indicate the count of pages mapped in M-EPT > > but not in S-EPT decreases. > > > > TDH.MEM.PAGE.ADD after this zap can still succeed. If this occurs, the page > > will be mapped in S-EPT only. As KVM also decreases nr_premapped after a > > successful TDH.MEM.PAGE.ADD, the nr_premapped will be <0 in the end. > > So, we will be able to detect those unexpected zaps. > > > > > > When userspace invokes KVM_TDX_FINALIZE_VM, > > ======= > > The nr_premapped must be 0 before tdx_td_finalize() succeeds. > > > > The nr_premapped could be 0 if > > (1) userspace invokes KVM_TDX_INIT_MEM_REGIONs as in a normal way. > > (2) userspace never triggers any KVM_TDX_INIT_MEM_REGION. > > (3) userspace triggers KVM_TDX_INIT_MEM_REGION but zaps all initial memory > > regions. > > > > For (2)and(3), KVM_TDX_FINALIZE_VM can still succeed. > > Ya, we're in agreement on what can happen. I think all of the confusion was due > to me forgetting that TDH.MEM.PAGE.ADD is what actually installes the leaf S-EPT > entry. > > > So, TD can still run with uninitialized memory. > > No, the TD can never run with truly uninitialized memory. By "uninitialized", I > mean memory that the guest can access and which has not been written to. Again, > my confusion was due to thinking a page was already mapped into the guest, but > awaiting TDH.MEM.PAGE.ADD to > > > > Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing > > We don't. It returns -EBUSY or -EIO immediately. > > But that _is_ tolerating failure, in the sense that KVM doesn't prevent further > actions on the VM. Tolerating failure is fine in general, but in this case it > leaves the MMU is left in a half-baked state. > > > > nice with tdh_mem_page_add() failure necessitates both the > > > tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize() > > > that all pending pages have been consumed. > > > > tdx_is_sept_zap_err_due_to_premap() detects the error of BLOCK, which is caused > > by executing BLOCK before ADD. > > We need to make this situation impossible. Currently this situation should be impossible already. If there're still missing ones, we can fix it (as you did above). But this tdx_is_sept_zap_err_due_to_premap() check is just to detect if anything is still missing. Or maybe directly KVM_BUG_ON() on that is also ok. 
> > > What reasonable use case is there for gracefully handling tdh_mem_page_add() failure? > > If tdh_mem_page_add() fails, the KVM_TDX_INIT_MEM_REGION just fails. > > > > > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY > > > case. And if it's only for -EBUSY, why can't that be handled by retrying in > > > tdx_vcpu_init_mem_region()? If tdx_vcpu_init_mem_region() guarantees that all > > I analyzed the contention status of tdh_mem_sept_add() at > > https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com. > > > > As the userspace is expected to execute KVM_TDX_INIT_MEM_REGION in only one > > vCPU, returning -EBUSY instead of retrying looks safer and easier. > > > > > pages mapped into the S-EPT are ADDed, then it can assert that there are no > > > pending pages when it completes (even if it "fails"), and similarly > > > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being > > > non-zero. > > tdx_td_finalize() now just returns -EINVAL in case of nr_premapped being !0. > > KVM_BUG_ON/WARN_ON should be also ok. > > Ok, so I vaguely recall that I may have pushed back on using a scratch field in > "struct kvm_tdx" for temporary data (or maybe it was abusing vcpus[0] that I > disliked?), but what we ended up with is far worse. > > For all intents and purposes, nr_premapped _is_ a temporary scratch field, but > with access rules that are all but impossible to understand, e.g. there's practically > zero chance anyone could suss out complications with "Unexpected zaps", and the > existence of that super subtle edge case necessitates using an atomic because > KVM can't strictly guarantee that access to the field is mutually exclusive. And > that also means it's inherently racy, e.g. if a zap happens while tdx_td_finalize() > is checking nr_premapped, what happens? The tdx_td_finalize() takes slots_lock and you asserted those unexpected zaps at https://lore.kernel.org/all/20250827000522.4022426-11-seanjc@google.com. Expected zaps can't occur during tdx_td_finalize checking nr_premapped either. > The real killer is that deferring TDH.MEM.PAGE.ADD and TDH.MR_EXTEND until after > the map completes and mmu_lock is dropped means that failure at that point leaves > the TDP MMU in an inconsistent state, where the M-EPT has a present page but the > S-EPT does not. Eww. Eww... That's why there's nr_premapped. And it's suggested by you, though you called it "Crazy idea"... https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com/ > Note, in no way am I trying to blame anyone; quite the opposite, you've done an > admirable job to get all of this landed. And I apologize if any of my past > feedback led y'all down this road. I suspect my prefaulting idea really screwed > things up; sorry :-( It's ok :) > Back to the code, unless I'm missing yet another complication, I think the obvious > fix to all of this is to pass the source page and metadata flags via a scratch > field in "struct kvm_tdx", and then do PAGE.ADD and MR.EXTEND in > tdx_sept_set_private_spte(). Then there is no need to keep track of pending > pages, because the M-EPT and S-EPT are always consistent. E.g. if PAGE.ADD fails > with -EBUSY, then KVM will naturally revert the M-EPT entry from FROZEN to !PRESENT. > It also allows KVM to KVM_BUG_ON() MR.EXTEND failure, because it should be impossible > for the S-EPT entry to be modified between PAGE.ADD and MR.EXTEND. > > Diff on top below for feedback on the idea. 
A proper series for this would simply > replace several of the patches, e.g. asserting that slots_lock is held on > tdx_is_sept_zap_err_due_to_premap() is wrong. Looks it's similar to the implementation in v19? https://lore.kernel.org/all/bbac4998cfb34da496646491038b03f501964cbd.1708933498.git.isaku.yamahata@intel.com/ > --- > arch/x86/kvm/vmx/tdx.c | 157 +++++++++++++++++------------------------ > arch/x86/kvm/vmx/tdx.h | 11 +-- > 2 files changed, 70 insertions(+), 98 deletions(-) > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index f9ac590e8ff0..5d981a061442 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -1586,6 +1586,56 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level) > td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa); > } > > + > +struct kvm_tdx_page_add { > + struct page *src; > + unsigned long flags; > +}; > + > +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level, > + kvm_pfn_t pfn) > +{ > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > + u64 err, entry, level_state; > + gpa_t gpa = gfn_to_gpa(gfn); > + int i; > + > + lockdep_assert_held(&kvm->slots_lock); > + > + if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) || > + KVM_BUG_ON(!kvm_tdx->page_add, kvm)) > + return -EIO; > + > + err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), > + kvm_tdx->page_add->src, &entry, &level_state); > + if (unlikely(tdx_operand_busy(err))) > + return -EBUSY; > + > + if (KVM_BUG_ON(err, kvm)) { > + pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state); > + return -EIO; > + } > + > + if (!(kvm_tdx->page_add->flags & KVM_TDX_MEASURE_MEMORY_REGION)) > + return 0; > + > + /* > + * Extend the measurement while holding mmu_lock to ensure MR.EXTEND > + * can't fail, e.g. due to the S-EPT entry being zapped after PAGE.ADD. > + * Note! If extended the measurement fails, bug the VM, but do NOT > + * return an error, as mapping the page in the S-EPT succeeded and so > + * needs to be tracked in KVM's mirror page tables. > + */ > + for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { > + err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state); > + if (KVM_BUG_ON(err, kvm)) { > + pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state); > + break; > + } > + } > + return 0; > +} > + > static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn, > enum pg_level level, kvm_pfn_t pfn) > { > @@ -1627,21 +1677,11 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > /* > * If the TD isn't finalized/runnable, then userspace is initializing > - * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > - * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > - * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > - * the counter to ensure all mapped pages have been added to the image, > - * to prevent running the TD with uninitialized memory. > + * the VM image via KVM_TDX_INIT_MEM_REGION. Add the page to the TD, > + * and optionally extend the measurement with the page contents. 
> */ > - if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) { > - lockdep_assert_held(&kvm->slots_lock); > - > - if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) > - return -EIO; > - > - kvm_tdx->nr_pending_tdh_mem_page_adds++; > - return 0; > - } > + if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) > + return tdx_mem_page_add(kvm, gfn, level, pfn); > > return tdx_mem_page_aug(kvm, gfn, level, pfn); > } > @@ -1716,39 +1756,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn, > return 0; > } > > -/* > - * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is > - * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called > - * successfully. > - * > - * Since tdh_mem_sept_add() must have been invoked successfully before a > - * non-leaf entry present in the mirrored page table, the SEPT ZAP related > - * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead > - * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the > - * SEPT. > - * > - * Further check if the returned entry from SEPT walking is with RWX permissions > - * to filter out anything unexpected. > - * > - * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from > - * level_state returned from a SEAMCALL error is the same as that passed into > - * the SEAMCALL. > - */ > -static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err, > - u64 entry, int level) > -{ > - if (!err || kvm_tdx->state == TD_STATE_RUNNABLE) > - return false; > - > - if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX)) > - return false; > - > - if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK))) > - return false; > - > - return true; > -} > - > static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, > enum pg_level level, struct page *page) > { > @@ -1768,15 +1775,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, > err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state); > tdx_no_vcpus_enter_stop(kvm); > } > - if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) { > - lockdep_assert_held(&kvm->slots_lock); > - > - if (KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm)) > - return -EIO; > - > - return 0; > - } > - > if (KVM_BUG_ON(err, kvm)) { > pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state); > return -EIO; > @@ -2842,12 +2840,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd) > > if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE) > return -EINVAL; > - /* > - * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue > - * TDH.MEM.PAGE.ADD(). > - */ > - if (kvm_tdx->nr_pending_tdh_mem_page_adds) > - return -EINVAL; > > cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td); > if (tdx_operand_busy(cmd->hw_error)) > @@ -3131,50 +3123,29 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > { > struct tdx_gmem_post_populate_arg *arg = _arg; > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > - u64 err, entry, level_state; > - gpa_t gpa = gfn_to_gpa(gfn); > - struct page *src_page; > - int ret, i; > + struct kvm_tdx_page_add page_add = { > + .flags = arg->flags, > + }; > + int ret; > > - lockdep_assert_held(&kvm->slots_lock); > + if (KVM_BUG_ON(kvm_tdx->page_add, kvm)) > + return -EIO; > > /* > * Get the source page if it has been faulted in. Return failure if the > * source page has been swapped out or unmapped in primary memory. 
> */ > - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); > + ret = get_user_pages_fast((unsigned long)src, 1, 0, &page_add.src); > if (ret < 0) > return ret; > if (ret != 1) > return -ENOMEM; > > + kvm_tdx->page_add = &page_add; > ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); > - if (ret < 0) > - goto out; > + kvm_tdx->page_add = NULL; > > - ret = 0; > - err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), > - src_page, &entry, &level_state); > - if (err) { > - ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO; > - goto out; > - } > - > - KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm); > - > - if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) { > - for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { > - err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, > - &level_state); > - if (err) { > - ret = -EIO; > - break; > - } > - } > - } > - > -out: > - put_page(src_page); > + put_page(page_add.src); > return ret; > } > > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h > index 45d86f9fa41c..39e0c3bcc866 100644 > --- a/arch/x86/kvm/vmx/tdx.h > +++ b/arch/x86/kvm/vmx/tdx.h > @@ -21,6 +21,8 @@ enum kvm_tdx_state { > TD_STATE_RUNNABLE, > }; > > +struct kvm_tdx_add_page; > + > struct kvm_tdx { > struct kvm kvm; > > @@ -37,12 +39,11 @@ struct kvm_tdx { > struct tdx_td td; > > /* > - * The number of pages that KVM_TDX_INIT_MEM_REGION has mapped into the > - * S-EPT, but not yet initialized via TDH.MEM.PAGE_ADD. Used to sanity > - * check adding pages to the image, and to ensure that all pages have > - * been initialized before finalizing the TD. > + * Scratch structure used to pass the source page and metadata flags to > + * tdx_mem_page_add. Protected by slots_lock, and non-NULL only when > + * mapping a private pfn via tdx_gmem_post_populate(). > */ > - unsigned long nr_pending_tdh_mem_page_adds; > + struct kvm_tdx_page_add *page_add; > > /* > * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do > > base-commit: 7c7a3893b102bdeb4826f7140280b7b16081b385 > --
On Fri, Aug 29, 2025 at 10:31:29AM +0800, Yan Zhao wrote: > On Thu, Aug 28, 2025 at 10:00:28AM -0700, Sean Christopherson wrote: > > On Thu, Aug 28, 2025, Yan Zhao wrote: > > > On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote: > > > > On Wed, Aug 27, 2025, Yan Zhao wrote: > > > > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: ... > > > > Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing > > > We don't. It returns -EBUSY or -EIO immediately. > > > > But that _is_ tolerating failure, in the sense that KVM doesn't prevent further > > actions on the VM. Tolerating failure is fine in general, but in this case it > > leaves the MMU is left in a half-baked state. Yes, but nr_premapped will not be decreased on tdh_mem_page_add() failure. So we rely on nr_premapped to disallow the TD from running eventually.
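For context on the gating Yan describes: a failed TDH.MEM.PAGE.ADD leaves the counter elevated (nr_premapped here, nr_pending_tdh_mem_page_adds in the diff below), so the pre-cleanup check in tdx_td_finalize() refuses to finalize, and therefore to run, the TD. A minimal excerpt of that guard, taken from the code being removed later in this thread:

	/*
	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
	 * TDH.MEM.PAGE.ADD(); refuse to finalize (and thus run) the TD until
	 * every pre-mapped page has actually been added to the image.
	 */
	if (kvm_tdx->nr_pending_tdh_mem_page_adds)
		return -EINVAL;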
On Thu, 2025-08-28 at 10:00 -0700, Sean Christopherson wrote: > On Thu, Aug 28, 2025, Yan Zhao wrote: > > On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote: > > > On Wed, Aug 27, 2025, Yan Zhao wrote: > > > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote: > > > > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > > > > + /* > > > > > + * If the TD isn't finalized/runnable, then userspace is initializing > > > > > + * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > > > > > + * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > > > > > + * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > > > > > + * the counter to ensure all mapped pages have been added to the image, > > > > > + * to prevent running the TD with uninitialized memory. > > > > To prevent the mismatch between mirror EPT and the S-EPT? > > > > > > No? Because KVM bumps the count when installing the S-EPT and decrements it > > > on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues? > > Hmm, I think there must be some misunderstanding. > > Yeah, I forgot that AUG and ADD create the leaf S-EPT entries. > > > Before userspace invokes KVM_TDX_FINALIZE_VM, > > ======= > > 1. the normal path (userspace invokes KVM_TDX_INIT_MEM_REGION). > > (1) KVM holds slot_lock and filemap lock. > > (2) KVM invokes kvm_tdp_map_page() (or kvm_tdp_mmu_map_private_pfn() in > > patch 2). > > KVM increases nr_premapped in tdx_sept_set_private_spte() to indicate > > that there's a page mapped in M-EPT, while it's not yet installed in > > S-EPT. > > (3) KVM invokes TDH.MEM.PAGE.ADD and decreases nr_premapped, indicating the > > page has been mapped in S-EPT too. > > > > As the name of nr_premapped indicates, the count means a page is pre-mapped > > in the M-EPT, before its real mapping in the S-EPT. > > If ADD fails in step (3), nr_premapped will not be decreased. > > > > With mere the normal path, nr_premapped should return back to 0 after all > > KVM_TDX_INIT_MEM_REGIONs. > > > > > > 2. Expected zap paths (e.g. If userspace does something strange, such as > > removing a slot after KVM_TDX_INIT_MEM_REGION) > > Those zap paths could be triggered by > > 1) userspace performs a page attribute conversion > > 2) userspace invokes gmem punch hole > > 3) userspace removes a slot > > As all those paths either hold a slot_lock or a filemap lock, they can't > > contend with tdx_vcpu_init_mem_region() (tdx_vcpu_init_mem_region holds both > > slot_lock and internally filemap lock). > > Consequently, those zaps must occur > > a) before kvm_tdp_map_page() or > > b) after TDH.MEM.PAGE.ADD. > > For a), tdx_sept_zap_private_spte() won't not be invoked as the page is not > > mapped in M-EPT either; > > For b), tdx_sept_zap_private_spte() should succeed, as the BLOCK and REMOVE > > SEAMCALLs are following the ADD. > > nr_premapped is therere unchanged, since it does not change the consistency > > between M-EPT and S-EPT. > > > > 3. Unexpected zaps (such as kvm_zap_gfn_range()). > > Side topic related to kvm_zap_gfn_range(), the KVM_BUG_ON() in vt_refresh_apicv_exec_ctrl() > is flawed. If kvm_recalculate_apic_map() fails to allocate an optimized map, KVM > will mark APICv as inhibited, i.e. the associated WARN_ON_ONCE() is effectively > user-triggerable. > > Easiest thing would be to mark the vCPU as dead (though we obviously need > "KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests" for that to be robust). 
> > > I'm going need to look up the related apic discussions from the base series and circle back. [snip] > > tdx_td_finalize() now just returns -EINVAL in case of nr_premapped being !0. > > KVM_BUG_ON/WARN_ON should be also ok. > > Ok, so I vaguely recall that I may have pushed back on using a scratch field in > "struct kvm_tdx" for temporary data (or maybe it was abusing vcpus[0] that I > disliked?), but what we ended up with is far worse. I think it was also that the tdh_mr_extend() loop was too heavyweight for the fault path. But that was before we got to the kick+lock stuff. > > For all intents and purposes, nr_premapped _is_ a temporary scratch field, but > with access rules that are all but impossible to understand, e.g. there's practically > zero chance anyone could suss out complications with "Unexpected zaps", and the > existence of that super subtle edge case necessitates using an atomic because > KVM can't strictly guarantee that access to the field is mutually exclusive. And > that also means it's inherently racy, e.g. if a zap happens while tdx_td_finalize() > is checking nr_premapped, what happens? > > The real killer is that deferring TDH.MEM.PAGE.ADD and TDH.MR_EXTEND until after > the map completes and mmu_lock is dropped means that failure at that point leaves > the TDP MMU in an inconsistent state, where the M-EPT has a present page but the > S-EPT does not. Eww. > > Note, in no way am I trying to blame anyone; quite the opposite, you've done an > admirable job to get all of this landed. And I apologize if any of my past > feedback led y'all down this road. I suspect my prefaulting idea really screwed > things up; sorry :-( It's unfortunate we didn't have the gmem mmap() support then. Otherwise we could have just encrypted it in place. Otherwise, I'm really glad to see these cleanups/scrutiny. I kind of got the impression that you wanted to see less TDX churn for a bit. The fact is, the TDX base support still needs more work like this. > > Back to the code, unless I'm missing yet another complication, I think the obvious > fix to all of this is to pass the source page and metadata flags via a scratch > field in "struct kvm_tdx", and then do PAGE.ADD and MR.EXTEND in > tdx_sept_set_private_spte(). Then there is no need to keep track of pending > pages, because the M-EPT and S-EPT are always consistent. E.g. if PAGE.ADD fails > with -EBUSY, then KVM will naturally revert the M-EPT entry from FROZEN to !PRESENT. > It also allows KVM to KVM_BUG_ON() MR.EXTEND failure, because it should be impossible > for the S-EPT entry to be modified between PAGE.ADD and MR.EXTEND. > > Diff on top below for feedback on the idea. A proper series for this would simply > replace several of the patches, e.g. asserting that slots_lock is held on > tdx_is_sept_zap_err_due_to_premap() is wrong. This works. The "stuffing data on the vCPU" part is a little ugly as it was before, but the other solutions were worse. Especially nr_premap was problematic for a number of reasons that keeps growing. 
So, big Acked-by: Rick Edgecombe <rick.p.edgecombe@intel.com> > > --- > arch/x86/kvm/vmx/tdx.c | 157 +++++++++++++++++------------------------ > arch/x86/kvm/vmx/tdx.h | 11 +-- > 2 files changed, 70 insertions(+), 98 deletions(-) > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index f9ac590e8ff0..5d981a061442 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -1586,6 +1586,56 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level) > td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa); > } > > + > +struct kvm_tdx_page_add { > + struct page *src; > + unsigned long flags; > +}; > + > +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level, > + kvm_pfn_t pfn) > +{ > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > + u64 err, entry, level_state; > + gpa_t gpa = gfn_to_gpa(gfn); > + int i; > + > + lockdep_assert_held(&kvm->slots_lock); > + > + if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) || > + KVM_BUG_ON(!kvm_tdx->page_add, kvm)) > + return -EIO; > + > + err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), > + kvm_tdx->page_add->src, &entry, &level_state); > + if (unlikely(tdx_operand_busy(err))) > + return -EBUSY; > + > + if (KVM_BUG_ON(err, kvm)) { > + pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state); > + return -EIO; > + } > + > + if (!(kvm_tdx->page_add->flags & KVM_TDX_MEASURE_MEMORY_REGION)) > + return 0; > + > + /* > + * Extend the measurement while holding mmu_lock to ensure MR.EXTEND > + * can't fail, e.g. due to the S-EPT entry being zapped after PAGE.ADD. > + * Note! If extended the measurement fails, bug the VM, but do NOT > + * return an error, as mapping the page in the S-EPT succeeded and so > + * needs to be tracked in KVM's mirror page tables. > + */ > + for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { > + err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state); > + if (KVM_BUG_ON(err, kvm)) { > + pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state); > + break; > + } > + } > + return 0; > +} > + > static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn, > enum pg_level level, kvm_pfn_t pfn) > { > @@ -1627,21 +1677,11 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > > /* > * If the TD isn't finalized/runnable, then userspace is initializing > - * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of > - * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD > - * requires a pre-existing S-EPT mapping). KVM_TDX_FINALIZE_VM checks > - * the counter to ensure all mapped pages have been added to the image, > - * to prevent running the TD with uninitialized memory. > + * the VM image via KVM_TDX_INIT_MEM_REGION. Add the page to the TD, > + * and optionally extend the measurement with the page contents. > */ > - if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) { > - lockdep_assert_held(&kvm->slots_lock); > - > - if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) > - return -EIO; > - > - kvm_tdx->nr_pending_tdh_mem_page_adds++; > - return 0; > - } > + if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) > + return tdx_mem_page_add(kvm, gfn, level, pfn); > > return tdx_mem_page_aug(kvm, gfn, level, pfn); > } > @@ -1716,39 +1756,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn, > return 0; > } > > -/* > - * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is > - * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called > - * successfully. 
> - * > - * Since tdh_mem_sept_add() must have been invoked successfully before a > - * non-leaf entry present in the mirrored page table, the SEPT ZAP related > - * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead > - * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the > - * SEPT. > - * > - * Further check if the returned entry from SEPT walking is with RWX permissions > - * to filter out anything unexpected. > - * > - * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from > - * level_state returned from a SEAMCALL error is the same as that passed into > - * the SEAMCALL. > - */ > -static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err, > - u64 entry, int level) > -{ > - if (!err || kvm_tdx->state == TD_STATE_RUNNABLE) > - return false; > - > - if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX)) > - return false; > - > - if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK))) > - return false; > - > - return true; > -} > - > static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, > enum pg_level level, struct page *page) > { > @@ -1768,15 +1775,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, > err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state); > tdx_no_vcpus_enter_stop(kvm); > } > - if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) { > - lockdep_assert_held(&kvm->slots_lock); > - > - if (KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm)) > - return -EIO; > - > - return 0; > - } > - > if (KVM_BUG_ON(err, kvm)) { > pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state); > return -EIO; > @@ -2842,12 +2840,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd) > > if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE) > return -EINVAL; > - /* > - * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue > - * TDH.MEM.PAGE.ADD(). > - */ > - if (kvm_tdx->nr_pending_tdh_mem_page_adds) > - return -EINVAL; > > cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td); > if (tdx_operand_busy(cmd->hw_error)) > @@ -3131,50 +3123,29 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > { > struct tdx_gmem_post_populate_arg *arg = _arg; > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > - u64 err, entry, level_state; > - gpa_t gpa = gfn_to_gpa(gfn); > - struct page *src_page; > - int ret, i; > + struct kvm_tdx_page_add page_add = { > + .flags = arg->flags, > + }; > + int ret; > > - lockdep_assert_held(&kvm->slots_lock); > + if (KVM_BUG_ON(kvm_tdx->page_add, kvm)) > + return -EIO; > > /* > * Get the source page if it has been faulted in. Return failure if the > * source page has been swapped out or unmapped in primary memory. > */ > - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); > + ret = get_user_pages_fast((unsigned long)src, 1, 0, &page_add.src); > if (ret < 0) > return ret; > if (ret != 1) > return -ENOMEM; > > + kvm_tdx->page_add = &page_add; > ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); > - if (ret < 0) > - goto out; > + kvm_tdx->page_add = NULL; > > - ret = 0; > - err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), > - src_page, &entry, &level_state); > - if (err) { > - ret = unlikely(tdx_operand_busy(err)) ? 
-EBUSY : -EIO; > - goto out; > - } > - > - KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm); > - > - if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) { > - for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { > - err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, > - &level_state); > - if (err) { > - ret = -EIO; > - break; > - } > - } > - } > - > -out: > - put_page(src_page); > + put_page(page_add.src); > return ret; > } > > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h > index 45d86f9fa41c..39e0c3bcc866 100644 > --- a/arch/x86/kvm/vmx/tdx.h > +++ b/arch/x86/kvm/vmx/tdx.h > @@ -21,6 +21,8 @@ enum kvm_tdx_state { > TD_STATE_RUNNABLE, > }; > > +struct kvm_tdx_add_page; > + > struct kvm_tdx { > struct kvm kvm; > > @@ -37,12 +39,11 @@ struct kvm_tdx { > struct tdx_td td; > > /* > - * The number of pages that KVM_TDX_INIT_MEM_REGION has mapped into the > - * S-EPT, but not yet initialized via TDH.MEM.PAGE_ADD. Used to sanity > - * check adding pages to the image, and to ensure that all pages have > - * been initialized before finalizing the TD. > + * Scratch structure used to pass the source page and metadata flags to > + * tdx_mem_page_add. Protected by slots_lock, and non-NULL only when > + * mapping a private pfn via tdx_gmem_post_populate(). > */ > - unsigned long nr_pending_tdh_mem_page_adds; > + struct kvm_tdx_page_add *page_add; > > /* > * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do > > base-commit: 7c7a3893b102bdeb4826f7140280b7b16081b385 > --
On 8/29/2025 2:52 AM, Edgecombe, Rick P wrote: > On Thu, 2025-08-28 at 10:00 -0700, Sean Christopherson wrote: >> On Thu, Aug 28, 2025, Yan Zhao wrote: [...] >>> >>> 3. Unexpected zaps (such as kvm_zap_gfn_range()). >> Side topic related to kvm_zap_gfn_range(), the KVM_BUG_ON() in vt_refresh_apicv_exec_ctrl() >> is flawed. If kvm_recalculate_apic_map() fails to allocate an optimized map, KVM >> will mark APICv as inhibited, i.e. the associated WARN_ON_ONCE() is effectively >> user-triggerable. >> >> Easiest thing would be to mark the vCPU as dead (though we obviously need >> "KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests" for that to be robust). >> >> >> > I'm going need to look up the related apic discussions from the base series and > circle back. There was an analysis about the inhibit reasons for TDX. https://lore.kernel.org/lkml/e3a2e8fa-b496-4010-9a8c-bfeb131bc43b@linux.intel.com/ As Sean mentioned, if kvm_recalculate_apic_map() fails to allocate the memory for the optimized map, it will trigger the KVM_BUG_ON() in vt_refresh_apicv_exec_ctrl(). And a kvzalloc() failure should not be treated as a KVM bug. As for being user-triggerable, the kvzalloc() failure path can be triggered by KVM_CREATE_VCPU and KVM_TDX_INIT_VCPU for a TD. After KVM_TDX_INIT_VCPU, the mapping is not allowed to be changed. Sean's suggested code change looks good to me.
On Thu, Aug 28, 2025, Rick P Edgecombe wrote: > It's unfortunate we didn't have the gmem mmap() support then. Otherwise we could > have just encrypted it in place. Yeah, hindsight is definitely 20/20 on that front. Though I suspect that we'd never have landed anything if we tried to go straight to support mmap(). > Otherwise, I'm really glad to see these cleanups/scrutiny. I kind of got the > impression that you wanted to see less TDX churn for a bit. Heh, believe me, I'm not exactly ecstatic to dive into this. But, I don't want to just ignore it, because some of these quirks/warts are already getting in the way of new development, and if I/we delay such clean ups, the pain is only going to get worse (and the total cost will be much higher). Fatigue is a bit of a problem for me, but the biggest issue really is just lack of cycles (the quick feedback and testing y'all are providing helps tremendously on that front). And lack of cycles should be mitigated to some extent as I (re)familiarize myself with the code; I recognized most of the concepts, but there are definitely a few places where I'm completely lost, and that makes reviewing things like dynamic PAMT and hugepage support excruciatingly slow. > The fact is, the TDX base support still needs more work like this. IMO, the most important things to address are cases where the design choices we made turned out to be suboptimal, i.e. where the behavior itself is problematic. Code cleanups are definitely welcome too, but my priority is polishing the core design.
On Thu, Aug 28, 2025, Rick P Edgecombe wrote: > On Thu, 2025-08-28 at 10:00 -0700, Sean Christopherson wrote: > > > tdx_td_finalize() now just returns -EINVAL in case of nr_premapped being !0. > > > KVM_BUG_ON/WARN_ON should be also ok. > > > > Ok, so I vaguely recall that I may have pushed back on using a scratch field in > > "struct kvm_tdx" for temporary data (or maybe it was abusing vcpus[0] that I > > disliked?), but what we ended up with is far worse. > > I think it was also that the tdh_mr_extend() loop was too heavyweight for the > fault path. But that was before we got to the kick+lock stuff. Me confused. This is pre-boot, not the normal fault path, i.e. blocking other operations is not a concern. If tdh_mr_extend() is too heavy for a non-preemptible section, then the current code is also broken in the sense that there are no cond_resched() calls. The vast majority of TDX hosts will be using non-preemptible kernels, so without an explicit cond_resched(), there's no practical difference between extending the measurement under mmu_lock versus outside of mmu_lock. _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should still do tdh_mem_page_add() under mmu_lock.
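To make the cond_resched() point concrete, here is a minimal sketch (not from any posted patch) of the measurement loop quoted elsewhere in this thread as it could look if run outside of mmu_lock, where sleeping is allowed; the cond_resched() placement is purely illustrative:

	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
		if (err)
			return -EIO;

		/*
		 * Hypothetical: yield between chunks.  This is only legal
		 * outside of mmu_lock (a non-sleepable lock); under mmu_lock
		 * the loop has to run to completion, which is the point about
		 * non-preemptible kernels above.
		 */
		cond_resched();
	}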
On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote: > Me confused. This is pre-boot, not the normal fault path, i.e. blocking other > operations is not a concern. Just was my recollection of the discussion. I found it: https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/ > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current > code is also broken in the sense that there are no cond_resched() calls. The > vast majority of TDX hosts will be using non-preemptible kernels, so without an > explicit cond_resched(), there's no practical difference between extending the > measurement under mmu_lock versus outside of mmu_lock. > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should > still do tdh_mem_page_add() under mmu_lock. I just did a quick test and we should be on the order of <1 ms per page for the full loop. I can try to get some more formal test data if it matters. But that doesn't sound too horrible? tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be inside it. But maybe a better reason is that we could better handle errors outside the fault. (i.e. no 5 line comment about why not to return an error in tdx_mem_page_add() due to code in another file). I wonder if Yan can give an analysis of any zapping races if we do that.
On Fri, Aug 29, 2025 at 05:33:48AM +0800, Edgecombe, Rick P wrote: > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote: > > Me confused. This is pre-boot, not the normal fault path, i.e. blocking other > > operations is not a concern. > > Just was my recollection of the discussion. I found it: > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/ > > > > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current > > code is also broken in the sense that there are no cond_resched() calls. The > > vast majority of TDX hosts will be using non-preemptible kernels, so without an > > explicit cond_resched(), there's no practical difference between extending the > > measurement under mmu_lock versus outside of mmu_lock. > > > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should > > still do tdh_mem_page_add() under mmu_lock. > > I just did a quick test and we should be on the order of <1 ms per page for the > full loop. I can try to get some more formal test data if it matters. But that > doesn't sound too horrible? > > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be > inside it. But maybe a better reason is that we could better handle errors > outside the fault. (i.e. no 5 line comment about why not to return an error in > tdx_mem_page_add() due to code in another file). > > I wonder if Yan can give an analysis of any zapping races if we do that. I actually proposed to have write mmu_lock around tdh_mem_page_add() and tdh_mr_extend(), as in https://lore.kernel.org/kvm/Ztfn5gh5888PmEIe@yzhao56-desk.sh.intel.com. I don't see any reason why tdh_mr_extend() can't be done inside mmu_lock in the pre-boot stage. But the previous conclusion was that with slots_lock and filemap invalidation lock, it's ok to invoke tdh_mem_page_add() and tdh_mr_extend() without any mmu_lock. The nr_premapped can also detect the unexpected zap.
Edgecombe, Rick P wrote: > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote: > > Me confused. This is pre-boot, not the normal fault path, i.e. blocking other > > operations is not a concern. > > Just was my recollection of the discussion. I found it: > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/ > > > > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current > > code is also broken in the sense that there are no cond_resched() calls. The > > vast majority of TDX hosts will be using non-preemptible kernels, so without an > > explicit cond_resched(), there's no practical difference between extending the > > measurement under mmu_lock versus outside of mmu_lock. > > > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should > > still do tdh_mem_page_add() under mmu_lock. > > I just did a quick test and we should be on the order of <1 ms per page for the > full loop. I can try to get some more formal test data if it matters. But that > doesn't sound too horrible? > > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be > inside it. I'm probably not following this conversation, so stupid question: It doesn't need to be in the lock because user space should not be setting up memory and extending the measurement in an asynchronous way. Is that correct? > But maybe a better reason is that we could better handle errors > outside the fault. (i.e. no 5 line comment about why not to return an error in > tdx_mem_page_add() due to code in another file). > > I wonder if Yan can give an analysis of any zapping races if we do that. When you say analysis, you mean detecting user space did something wrong and failing gracefully? Is that correct? Ira
On Thu, Aug 28, 2025, Ira Weiny wrote: > Edgecombe, Rick P wrote: > > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote: > > > Me confused. This is pre-boot, not the normal fault path, i.e. blocking other > > > operations is not a concern. > > > > Just was my recollection of the discussion. I found it: > > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/ > > > > > > > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current > > > code is also broken in the sense that there are no cond_resched() calls. The > > > vast majority of TDX hosts will be using non-preemptible kernels, so without an > > > explicit cond_resched(), there's no practical difference between extending the > > > measurement under mmu_lock versus outside of mmu_lock. > > > > > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should > > > still do tdh_mem_page_add() under mmu_lock. > > > > I just did a quick test and we should be on the order of <1 ms per page for the > > full loop. I can try to get some more formal test data if it matters. But that > > doesn't sound too horrible? > > > > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be > > inside it. > > I'm probably not following this conversation, so stupid question: It > doesn't need to be in the lock because user space should not be setting up > memory and extending the measurement in an asynchronous way. Is that > correct? No, from userspace's perspective ADD+MEASURE is fully serialized. ADD "needs" to be under mmu_lock to guarantee consistency between the mirror EPT and the "real" S-EPT entries. E.g. if ADD is done after the fact, then KVM can end up with a PRESENT M-EPT entry but a corresponding S-EPT entry that is !PRESENT. That causes a pile of problems because it breaks KVM's fundamental assumption that M-EPT and S-EPT entries are updated in lock-step. TDH_MR_EXTEND doesn't have the same consistency issue. If it fails, the only thing that's left in a bad state is the measurement. That's obviously not ideal either, but we can handle that by forcefully terminating the VM, without opening up KVM to edge cases that would otherwise be impossible. > > But maybe a better reason is that we could better handle errors > > outside the fault. (i.e. no 5 line comment about why not to return an error in > > tdx_mem_page_add() due to code in another file). > > > > I wonder if Yan can give an analysis of any zapping races if we do that. > > When you say analysis, you mean detecting user space did something wrong > and failing gracefully? Is that correct? More specifically, whether or not KVM can WARN without the WARN being user triggerable. Kernel policy is that WARNs must not be triggerable absent kernel, hardware, or firmware bugs. What we're trying to figure out is if there's a flow that can be triggered by userspace (misbehaving or not) that would trip a WARN even if KVM is operating as expected. I'm pretty sure the answer is "no". Oh, and WARNing here is desirable, because it improves the chances of detecting a fatal-to-the-VM bug, e.g. in KVM and/or in the TDX-Module.
Sean Christopherson wrote: > On Thu, Aug 28, 2025, Ira Weiny wrote: > > Edgecombe, Rick P wrote: > > > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote: > > > > Me confused. This is pre-boot, not the normal fault path, i.e. blocking other > > > > operations is not a concern. > > > > > > Just was my recollection of the discussion. I found it: > > > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/ > > > > > > > > > > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current > > > > code is also broken in the sense that there are no cond_resched() calls. The > > > > vast majority of TDX hosts will be using non-preemptible kernels, so without an > > > > explicit cond_resched(), there's no practical difference between extending the > > > > measurement under mmu_lock versus outside of mmu_lock. > > > > > > > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should > > > > still do tdh_mem_page_add() under mmu_lock. > > > > > > I just did a quick test and we should be on the order of <1 ms per page for the > > > full loop. I can try to get some more formal test data if it matters. But that > > > doesn't sound too horrible? > > > > > > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be > > > inside it. > > > > I'm probably not following this conversation, so stupid question: It > > doesn't need to be in the lock because user space should not be setting up > > memory and extending the measurement in an asynchronous way. Is that > > correct? > > No, from userspace's perspective ADD+MEASURE is fully serialized. ADD "needs" > to be under mmu_lock to guarantee consistency between the mirror EPT and the > "real" S-EPT entries. E.g. if ADD is done after the fact, then KVM can end up > with a PRESENT M-EPT entry but a corresponding S-EPT entry that is !PRESENT. > That causes a pile of problems because it breaks KVM's fundamental assumption > that M-EPT and S-EPT entries updated in lock-step. Ok yes, I think I worded my query incorrectly but this makes things clear. Thanks! > > TDH_MR_EXTEND doesn't have the same same consistency issue. If it fails, the > only thing that's left in a bad state is the measurement. That's obviously not > ideal either, but we can handle that by forcefully terminating the VM, without > opening up KVM to edge cases that would otherwise be impossible. > > > > But maybe a better reason is that we could better handle errors > > > outside the fault. (i.e. no 5 line comment about why not to return an error in > > > tdx_mem_page_add() due to code in another file). > > > > > > I wonder if Yan can give an analysis of any zapping races if we do that. > > > > When you say analysis, you mean detecting user space did something wrong > > and failing gracefully? Is that correct? > > More specifically, whether or not KVM can WARN without the WARN being user > triggerable. Kernel policy is that WARNs must not be triggerable absent kernel, > hardware, or firmware bugs. What we're trying to figure out is if there's a > flow that can be triggered by userspace (misbehving or not) that would trip a > WARN even if KVM is operating as expected. I'm pretty sure the answer is "no". > > Oh, and WARNing here is desirable, because it improves the chances of detecting > a fatal-to-the-VM bug, e.g. in KVM and/or in the TDX-Module. OK... In other areas of the kernel if the user misbehaves it is reasonable to fail an operation. 
I would think that being fatal to the VM would be fine if QEMU did not properly synchronize ADD, measurement, and finalize, for example. Am I wrong in that assumption? Ira
On Thu, Aug 28, 2025, Rick P Edgecombe wrote: > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote: > > Me confused. This is pre-boot, not the normal fault path, i.e. blocking other > > operations is not a concern. > > Just was my recollection of the discussion. I found it: > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/ Ugh, another case where an honest question gets interpreted as "do it this way". :-( > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current > > code is also broken in the sense that there are no cond_resched() calls. The > > vast majority of TDX hosts will be using non-preemptible kernels, so without an > > explicit cond_resched(), there's no practical difference between extending the > > measurement under mmu_lock versus outside of mmu_lock. > > > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should > > still do tdh_mem_page_add() under mmu_lock. > > I just did a quick test and we should be on the order of <1 ms per page for the > full loop. I can try to get some more formal test data if it matters. But that > doesn't sound too horrible? 1ms is totally reasonable. I wouldn't bother with any more testing. > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be > inside it. Agreed, and it would eliminate the need for a "flags" argument. But keeping it in the mmu_lock critical section means KVM can WARN on failures. If it's moved out, then zapping S-EPT entries could induce failure, and I don't think it's worth going through the effort to ensure it's impossible to trigger S-EPT removal. Note, removing S-EPT entries during initialization of the image isn't something I want to officially support, rather it's an endless stream of whack-a-mole due to obscure edge cases. Hmm, actually, maybe I take that back. slots_lock prevents memslot updates, filemap_invalidate_lock() prevents guest_memfd updates, and mmu_notifier events shouldn't ever hit S-EPT. I was worried about kvm_zap_gfn_range(), but the call from sev.c is obviously mutually exclusive, TDX disallows KVM_X86_QUIRK_IGNORE_GUEST_PAT so same goes for kvm_noncoherent_dma_assignment_start_or_stop, and while I'm 99% certain there's a way to trip __kvm_set_or_clear_apicv_inhibit(), the APIC page has its own non-guest_memfd memslot and so can't be used for the initial image, which means that too is mutually exclusive. So yeah, let's give it a shot. Worst case scenario we're wrong and TDH_MR_EXTEND errors can be triggered by userspace. > But maybe a better reason is that we could better handle errors > outside the fault. (i.e. no 5 line comment about why not to return an error in > tdx_mem_page_add() due to code in another file). > > I wonder if Yan can give an analysis of any zapping races if we do that. As above, I think we're good?
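A rough sketch of the direction converged on above, assembled only from code already quoted in this thread (the exact error handling and placement are illustrative, not a posted patch): TDH.MEM.PAGE.ADD stays under mmu_lock via tdx_sept_set_private_spte(), while the measurement loop moves to the tail of tdx_gmem_post_populate() after the map succeeds, so the scratch structure only needs to carry the source page:

	kvm_tdx->page_add = &page_add;
	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
	kvm_tdx->page_add = NULL;
	if (ret < 0)
		goto out;

	/*
	 * Illustrative: extend the measurement outside of mmu_lock.  Per the
	 * analysis above (slots_lock and filemap_invalidate_lock held, and
	 * mmu_notifier events never hit the S-EPT), nothing should be able to
	 * zap the just-added S-EPT entry here, so a failure is treated as a
	 * KVM and/or TDX-Module bug.
	 */
	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
			err = tdh_mr_extend(&kvm_tdx->td, gfn_to_gpa(gfn) + i,
					    &entry, &level_state);
			if (KVM_BUG_ON(err, kvm)) {
				ret = -EIO;
				break;
			}
		}
	}
out:
	put_page(page_add.src);
	return ret;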
On Thu, Aug 28, 2025 at 02:57:39PM -0700, Sean Christopherson wrote: > On Thu, Aug 28, 2025, Rick P Edgecombe wrote: > > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote: > > > Me confused. This is pre-boot, not the normal fault path, i.e. blocking other > > > operations is not a concern. > > > > Just was my recollection of the discussion. I found it: > > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/ > > Ugh, another case where an honest question gets interpreted as "do it this way". :-( > > > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current > > > code is also broken in the sense that there are no cond_resched() calls. The > > > vast majority of TDX hosts will be using non-preemptible kernels, so without an > > > explicit cond_resched(), there's no practical difference between extending the > > > measurement under mmu_lock versus outside of mmu_lock. > > > > > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should > > > still do tdh_mem_page_add() under mmu_lock. > > > > I just did a quick test and we should be on the order of <1 ms per page for the > > full loop. I can try to get some more formal test data if it matters. But that > > doesn't sound too horrible? > > 1ms is totally reasonable. I wouldn't bother with any more testing. > > > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be > > inside it. > > Agreed, and it would eliminate the need for a "flags" argument. But keeping it > in the mmu_lock critical section means KVM can WARN on failures. If it's moved > out, then zapping S-EPT entries could induce failure, and I don't think it's > worth going through the effort to ensure it's impossible to trigger S-EPT removal. > > Note, temoving S-EPT entries during initialization of the image isn't something > I want to official support, rather it's an endless stream of whack-a-mole due to > obsurce edge cases > > Hmm, actually, maybe I take that back. slots_lock prevents memslot updates, > filemap_invalidate_lock() prevents guest_memfd updates, and mmu_notifier events > shouldn't ever hit S-EPT. I was worried about kvm_zap_gfn_range(), but the call > from sev.c is obviously mutually exclusive, TDX disallows KVM_X86_QUIRK_IGNORE_GUEST_PAT > so same goes for kvm_noncoherent_dma_assignment_start_or_stop, and while I'm 99% > certain there's a way to trip __kvm_set_or_clear_apicv_inhibit(), the APIC page > has its own non-guest_memfd memslot and so can't be used for the initial image, > which means that too is mutually exclusive. > > So yeah, let's give it a shot. Worst case scenario we're wrong and TDH_MR_EXTEND > errors can be triggered by userspace. > > > But maybe a better reason is that we could better handle errors > > outside the fault. (i.e. no 5 line comment about why not to return an error in > > tdx_mem_page_add() due to code in another file). > > > > I wonder if Yan can give an analysis of any zapping races if we do that. > > As above, I think we're good? I think so.
On Thu, 2025-08-28 at 14:57 -0700, Sean Christopherson wrote: > Agreed, and it would eliminate the need for a "flags" argument. But keeping it > in the mmu_lock critical section means KVM can WARN on failures. If it's moved > out, then zapping S-EPT entries could induce failure, and I don't think it's > worth going through the effort to ensure it's impossible to trigger S-EPT removal. > > Note, temoving S-EPT entries during initialization of the image isn't something > I want to official support, rather it's an endless stream of whack-a-mole due to > obsurce edge cases > > Hmm, actually, maybe I take that back. slots_lock prevents memslot updates, > filemap_invalidate_lock() prevents guest_memfd updates, and mmu_notifier events > shouldn't ever hit S-EPT. I was worried about kvm_zap_gfn_range(), but the call > from sev.c is obviously mutually exclusive, TDX disallows KVM_X86_QUIRK_IGNORE_GUEST_PAT > so same goes for kvm_noncoherent_dma_assignment_start_or_stop, and > Yea, in the other thread Yan was suggesting the same thing from the KVM side: https://lore.kernel.org/all/aK%2Fsdr2OQqYv9DBZ@yzhao56-desk.sh.intel.com/ But was concerned about "Unexpected zaps" (kvm_zap_gfn_range()). I think maybe we could think about KVM_BUG_ON() in the case of mirror EPT to cover it from another angle. IIRC we discussed this at some point. I was wondering about TDH.MR.EXTEND error conditions. Coming back now, I'm not sure what I was thinking. > while I'm 99% certain there's a way to trip __kvm_set_or_clear_apicv_inhibit(), > the APIC page has its own non-guest_memfd memslot and so can't be used for the > initial image, which means that too is mutually exclusive. Hmm, well maybe KVM_BUG_ON() for kvm_zap_gfn_range() only if this gets addressed. > > So yeah, let's give it a shot. Worst case scenario we're wrong and TDH_MR_EXTEND > errors can be triggered by userspace. > > > But maybe a better reason is that we could better handle errors > > outside the fault. (i.e. no 5 line comment about why not to return an error in > > tdx_mem_page_add() due to code in another file). > > > > I wonder if Yan can give an analysis of any zapping races if we do that. > > As above, I think we're good? Works for me.
On Wed, 2025-08-27 at 12:08 -0700, Sean Christopherson wrote: > > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the > > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of > > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and > > TDH.MEM.PAGE.REMOVE. > > Eww. It would be nice to close that hole, but I suppose it's futile, e.g. the > underlying problem is unexpectedly removing pages from the initial, whether > the > VMM is doing stupid things before vs. after FINALIZE doesn't really matter. > > > As a result, the TD will still run with uninitialized memory. > > No? Because BLOCK+REMOVE means there are no valid S-EPT mappings. There's a > "hole" that the guest might not expect, but that hole will trigger an EPT > violation and only get "filled" if the guest explicitly accepts an AUG'd page. Ah, I just responded on another patch. I wonder if we can get rid of the premap cnt. > > Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing > nice with tdh_mem_page_add() failure necessitates both the > tdx_is_sept_zap_err_due_to_premap() craziness and the check in > tdx_td_finalize() > that all pending pages have been consumed. Reasons that tdh_mem_page_add() could get BUSY: 1. If two vCPU's tried to tdh_mem_page_add() the same gpa at the same time they could contend the SEPT entry lock 2. If one vCPU tries to tdh_mem_page_add() while the other zaps (i.e. tdh_mem_range_block()). I guess since we don't hold MMU lock while we tdh_mem_page_add(), 2 is a possibility. > > What reasonable use case is there for gracefully handling tdh_mem_page_add() > failure? > > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY > case. And if it's only for -EBUSY, why can't that be handled by retrying in > tdx_vcpu_init_mem_region()? If tdx_vcpu_init_mem_region() guarantees that all > pages mapped into the S-EPT are ADDed, then it can assert that there are no > pending pages when it completes (even if it "fails"), and similarly > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being > non-zero. Maybe we could take mmu write lock for the retry of tdh_mem_page_add(). Or maybe even for a single call of it, until someone wants to parallelize the operation.
On Thu, Aug 28, 2025 at 11:13:11AM +0800, Edgecombe, Rick P wrote: > On Wed, 2025-08-27 at 12:08 -0700, Sean Christopherson wrote: > > > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the > > > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of > > > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and > > > TDH.MEM.PAGE.REMOVE. > > > > Eww. It would be nice to close that hole, but I suppose it's futile, e.g. the > > underlying problem is unexpectedly removing pages from the initial, whether > > the > > VMM is doing stupid things before vs. after FINALIZE doesn't really matter. > > > > > As a result, the TD will still run with uninitialized memory. > > > > No? Because BLOCK+REMOVE means there are no valid S-EPT mappings. There's a > > "hole" that the guest might not expect, but that hole will trigger an EPT > > violation and only get "filled" if the guest explicitly accepts an AUG'd page. > > Ah, I just responded on another patch. I wonder if we can get rid of the premap > cnt. I think keeping it is safer. See my explanation at [1]. [1] https://lore.kernel.org/all/aK%2Fsdr2OQqYv9DBZ@yzhao56-desk.sh.intel.com. > > > > Side topic, why does KVM tolerate tdh_mem_page_add() failure? IIUC, playing > > nice with tdh_mem_page_add() failure necessitates both the > > tdx_is_sept_zap_err_due_to_premap() craziness and the check in > > tdx_td_finalize() > > that all pending pages have been consumed. > > Reasons that tdh_mem_page_add() could get BUSY: > 1. If two vCPU's tried to tdh_mem_page_add() the same gpa at the same time they > could contend the SEPT entry lock > 2. If one vCPU tries to tdh_mem_page_add() while the other zaps (i.e. > tdh_mem_range_block()). Hmm, two tdh_mem_page_add()s can't contend as they are protected by both slot_lock and filemap lock. With regard to the contention to tdh_mem_range_block(), please check my analysis at the above [1]. tdh_mem_page_add() could get BUSY though, when a misbehaved userspace invokes KVM_TDX_INIT_MEM_REGION on one vCPU while initializing another vCPU. Please check more details at [2]. [2] https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com/ > I guess since we don't hold MMU lock while we tdh_mem_page_add(), 2 is a > possibility. 2 is possible only for paranoid zaps. See "case 3. Unexpected zaps" in [1]. > > What reasonable use case is there for gracefully handling tdh_mem_page_add() > > failure? > > > > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY > > case. And if it's only for -EBUSY, why can't that be handled by retrying in > > tdx_vcpu_init_mem_region()? If tdx_vcpu_init_mem_region() guarantees that all > > pages mapped into the S-EPT are ADDed, then it can assert that there are no > > pending pages when it completes (even if it "fails"), and similarly > > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being > > non-zero. > > Maybe we could take mmu write lock for the retry of tdh_mem_page_add(). Or maybe > even for a single call of it, until someone wants to parallelize the operation. Hmm. I prefer returning -BUSY directly as invoking KVM_TDX_INIT_MEM_REGION before finishing initializing all vCPUs are uncommon.
On Thu, 2025-08-28 at 13:56 +0800, Yan Zhao wrote: > > Reasons that tdh_mem_page_add() could get BUSY: > > 1. If two vCPU's tried to tdh_mem_page_add() the same gpa at the same time > > they > > could contend the SEPT entry lock > > 2. If one vCPU tries to tdh_mem_page_add() while the other zaps (i.e. > > tdh_mem_range_block()). > Hmm, two tdh_mem_page_add()s can't contend as they are protected by both > slot_lock and filemap lock. > > With regard to the contention to tdh_mem_range_block(), please check my > analysis at the above [1]. The analysis missed the tdh_mem_page_add() failure path > > tdh_mem_page_add() could get BUSY though, when a misbehaved userspace invokes > KVM_TDX_INIT_MEM_REGION on one vCPU while initializing another vCPU. > > Please check more details at [2]. > > [2] https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com/ Ah, the TDR lock. I actually referred to an older version of your locking analysis that didn't have that one. But this means the premap count could get out of sync for that reason too. > > > > I guess since we don't hold MMU lock while we tdh_mem_page_add(), 2 is a > > possibility. > 2 is possible only for paranoid zaps. > See "case 3. Unexpected zaps" in [1]. Sean's lockdep assert handles half of those cases. Maybe we could also re- consider a KVM_BUG_ON() in the invalid zap paths again if it comes to it. > > > > > What reasonable use case is there for gracefully handling > > > tdh_mem_page_add() > > > failure? > > > > > > If there is a need to handle failure, I gotta imagine it's only for the - > > > EBUSY > > > case. And if it's only for -EBUSY, why can't that be handled by retrying > > > in > > > tdx_vcpu_init_mem_region()? If tdx_vcpu_init_mem_region() guarantees that > > > all > > > pages mapped into the S-EPT are ADDed, then it can assert that there are > > > no > > > pending pages when it completes (even if it "fails"), and similarly > > > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being > > > non-zero. > > > > Maybe we could take mmu write lock for the retry of tdh_mem_page_add(). Or > > maybe > > even for a single call of it, until someone wants to parallelize the > > operation. > Hmm. I prefer returning -BUSY directly as invoking KVM_TDX_INIT_MEM_REGION > before finishing initializing all vCPUs are uncommon. I was looking guaranteeing its success when Sean posted his suggestion to return to the original pattern. I'm in favor of that direction. If you agree we can call this moot.