[v3] KVM: TDX huge page support for private memory

[PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root

Posted by Yan Zhao 1 month ago

From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>

Disallow page merging (huge page adjustment) for the mirror root by
utilizing disallowed_hugepage_adjust().

Make the mirror root check asymmetric with NX huge pages and avoid littering
the generic MMU code:

Invoke disallowed_hugepage_adjust() in kvm_tdp_mmu_map() when necessary,
specifically when KVM has mirrored TDP or the NX huge page workaround is
enabled.

Check and reduce the goal_level of a fault internally in
disallowed_hugepage_adjust() when the fault is for a mirror root and
there's a shadow present non-leaf entry at the original goal_level.

Signed-off-by: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Check is_mirror_sp() in disallowed_hugepage_adjust() instead of passing
  in an is_mirror arg. (Rick)
- Check kvm_has_mirrored_tdp() in kvm_tdp_mmu_map() to determine whether
  to invoke disallowed_hugepage_adjust(). (Rick)

RFC v1:
- new patch
---
 arch/x86/kvm/mmu/mmu.c     | 3 ++-
 arch/x86/kvm/mmu/tdp_mmu.c | 4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d2c49d92d25d..b4f2e3ced716 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3418,7 +3418,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 	    cur_level == fault->goal_level &&
 	    is_shadow_present_pte(spte) &&
 	    !is_large_pte(spte) &&
-	    spte_to_child_sp(spte)->nx_huge_page_disallowed) {
+	    ((spte_to_child_sp(spte)->nx_huge_page_disallowed) ||
+	     is_mirror_sp(spte_to_child_sp(spte)))) {
 		/*
 		 * A small SPTE exists for this pfn, but FNAME(fetch),
 		 * direct_map(), or kvm_tdp_mmu_map() would like to create a
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9c26038f6b77..dfa56554f9e0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1267,6 +1267,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
 	int ret = RET_PF_RETRY;
+	bool hugepage_adjust_disallowed = fault->nx_huge_page_workaround_enabled ||
+					  kvm_has_mirrored_tdp(kvm);
 
 	KVM_MMU_WARN_ON(!root || root->role.invalid);
 
@@ -1279,7 +1281,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
 		int r;
 
-		if (fault->nx_huge_page_workaround_enabled)
+		if (hugepage_adjust_disallowed)
 			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
 
 		/*
-- 
2.43.2

Re: [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root

Posted by Sean Christopherson 3 weeks, 1 day ago

On Tue, Jan 06, 2026, Yan Zhao wrote:
> From: Rick P Edgecombe <rick.p.edgecombe@intel.com>
> 
> Disallow page merging (huge page adjustment) for the mirror root by
> utilizing disallowed_hugepage_adjust().

Why?  What is this actually doing?  The below explains "how" but I'm baffled as
to the purpose.  I'm guessing there are hints in the surrounding patches, but I
haven't read them in depth, and shouldn't need to in order to understand the
primary reason behind a change.

Re: [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root

Posted by Yan Zhao 3 weeks, 1 day ago

Hi Sean,
Thanks for the review!

On Thu, Jan 15, 2026 at 02:49:59PM -0800, Sean Christopherson wrote:
> On Tue, Jan 06, 2026, Yan Zhao wrote:
> > From: Rick P Edgecombe <rick.p.edgecombe@intel.com>
> > 
> > Disallow page merging (huge page adjustment) for the mirror root by
> > utilizing disallowed_hugepage_adjust().
> 
> Why?  What is this actually doing?  The below explains "how" but I'm baffled as
> to the purpose.  I'm guessing there are hints in the surrounding patches, but I
> haven't read them in depth, and shouldn't need to in order to understand the
> primary reason behind a change.
Sorry for missing the background. I will explain the "why" in the patch log in
the next version.

The reason for introducing this patch is to disallow page merging for TDX. I
explained the reasons to disallow page merging in the cover letter:

"
7. Page merging (page promotion)

   Promotion is disallowed, because:

   - The current TDX module requires all 4KB leafs to be either all PENDING
     or all ACCEPTED before a successful promotion to 2MB. This requirement
     prevents successful page merging after partially converting a 2MB
     range from private to shared and then back to private, which is the
     primary scenario necessitating page promotion.

   - tdh_mem_page_promote() depends on tdh_mem_range_block() in the current
     TDX module. Consequently, handling BUSY errors is complex, as page
     merging typically occurs in the fault path under shared mmu_lock.

   - Limited amount of initial private memory (typically ~4MB) means the
     need for page merging during TD build time is minimal.
"
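
As a rough illustration of the first bullet, here is a toy model of the
"all PENDING or all ACCEPTED" precondition. The enum, the helper name and
the 512-entry constant are assumptions made for this sketch; it does not
reflect the TDX module's actual state encoding or API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PTES_PER_2M 512	/* number of 4KB leafs covered by one 2MB mapping */

/* Hypothetical per-leaf state; not the TDX module's real encoding. */
enum leaf_state { LEAF_PENDING, LEAF_ACCEPTED };

/*
 * Return true if every 4KB leaf is in the same state, i.e. the range is
 * either all PENDING or all ACCEPTED and promotion could succeed.
 */
static bool leafs_uniform(const enum leaf_state *leafs, size_t n)
{
	assert(n > 0);

	for (size_t i = 1; i < n; i++) {
		if (leafs[i] != leafs[0])
			return false;
	}
	return true;
}
```

A 2MB range where a single 4KB page was converted to shared and back would
have one PENDING leaf among ACCEPTED ones, so the check above fails for it.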

Without this patch, page promotion may be triggered in the following scenario:

1. guest_memfd allocates a 2MB folio for GPA X, so the max mapping level is 2MB.
2. KVM maps GPA X at 4KB level during TD build time.
3. Guest converts GPA X to shared, zapping the 4KB leaf private mapping while
   keeping the 2MB non-leaf private mapping.
4. Guest converts GPA X to private and accepts it at 2MB level.
5. KVM maps GPA X at 2MB level, triggering page merging.

However, we don't support page merging yet. Specifically for the above
scenario, the purpose is to avoid handling the error from
tdh_mem_page_promote(), a SEAMCALL that currently needs to be preceded by
tdh_mem_range_block(). To handle the promotion error (e.g., due to busy) under
read mmu_lock, we may need to introduce several spinlocks and guarantees from
the guest to ensure the success of tdh_mem_range_unblock() to restore the S-EPT
status.

Therefore, we introduced this patch for simplicity, and because the promotion
scenario is not common.
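
To make the effect of the patch concrete, below is a simplified,
self-contained model of the check in disallowed_hugepage_adjust(). The
toy_* types and fields are hypothetical stand-ins for KVM's SPTE helpers,
and the sketch omits the pfn adjustment the real function performs; it only
shows how goal_level drops when a present non-leaf entry for a mirror root
sits at the original goal level.

```c
#include <assert.h>
#include <stdbool.h>

enum { LEVEL_4K = 1, LEVEL_2M = 2 };

/* Toy SPTE with hypothetical fields; not KVM's real bit layout. */
struct toy_spte {
	bool present;		/* shadow-present */
	bool leaf;		/* maps a page directly (huge at LEVEL_2M) */
	bool child_mirror;	/* child page table belongs to a mirror root */
};

struct toy_fault {
	int goal_level;		/* level KVM would like to map at */
};

/*
 * If a present page table (non-leaf) entry for a mirror root exists at the
 * goal level, map one level lower instead of replacing the page table with
 * a huge mapping (promotion).
 */
static void toy_hugepage_adjust(struct toy_fault *fault,
				const struct toy_spte *spte, int cur_level)
{
	assert(cur_level >= LEVEL_4K);

	if (cur_level > LEVEL_4K &&
	    cur_level == fault->goal_level &&
	    spte->present && !spte->leaf && spte->child_mirror)
		fault->goal_level--;
}
```

In the scenario above, the last step finds the 2MB non-leaf private mapping
still present, so the fault would be mapped at 4KB and no promotion is
attempted.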

Re: [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root

Posted by Sean Christopherson 1 week, 4 days ago

On Fri, Jan 16, 2026, Yan Zhao wrote:
> Hi Sean,
> Thanks for the review!
> 
> On Thu, Jan 15, 2026 at 02:49:59PM -0800, Sean Christopherson wrote:
> > On Tue, Jan 06, 2026, Yan Zhao wrote:
> > > From: Rick P Edgecombe <rick.p.edgecombe@intel.com>
> > > 
> > > Disallow page merging (huge page adjustment) for the mirror root by
> > > utilizing disallowed_hugepage_adjust().
> > 
> > Why?  What is this actually doing?  The below explains "how" but I'm baffled as
> > to the purpose.  I'm guessing there are hints in the surrounding patches, but I
> > haven't read them in depth, and shouldn't need to in order to understand the
> > primary reason behind a change.
> Sorry for missing the background. I will explain the "why" in the patch log in
> the next version.
> 
> The reason for introducing this patch is to disallow page merging for TDX. I
> explained the reasons to disallow page merging in the cover letter:
> 
> "
> 7. Page merging (page promotion)
> 
>    Promotion is disallowed, because:
> 
>    - The current TDX module requires all 4KB leafs to be either all PENDING
>      or all ACCEPTED before a successful promotion to 2MB. This requirement
>      prevents successful page merging after partially converting a 2MB
>      range from private to shared and then back to private, which is the
>      primary scenario necessitating page promotion.
> 
>    - tdh_mem_page_promote() depends on tdh_mem_range_block() in the current
>      TDX module. Consequently, handling BUSY errors is complex, as page
>      merging typically occurs in the fault path under shared mmu_lock.
> 
>    - Limited amount of initial private memory (typically ~4MB) means the
>      need for page merging during TD build time is minimal.
> "

> However, we currently don't support page merging yet. Specifically for the above
> scenario, the purpose is to avoid handling the error from
> tdh_mem_page_promote(), which SEAMCALL currently needs to be preceded by
> tdh_mem_range_block(). To handle the promotion error (e.g., due to busy) under
> read mmu_lock, we may need to introduce several spinlocks and guarantees from
> the guest to ensure the success of tdh_mem_range_unblock() to restore the S-EPT
> status. 
> 
> Therefore, we introduced this patch for simplicity, and because the promotion
> scenario is not common.

Say that in the changelog!  Describing the "how" in detail is completely unnecessary,
or at least it should be.  Because I strongly disagree with Rick's opinion from
the RFC that kvm_tdp_mmu_map() should check kvm_has_mirrored_tdp()[*].

 : I think part of the thing that is bugging me is that
 : nx_huge_page_workaround_enabled is not conceptually about whether the specific
 : fault/level needs to disallow huge page adjustments, it's whether it needs to
 : check if it does. Then disallowed_hugepage_adjust() does the actual specific
 : checking. But for the mirror logic the check is the same for both. It's
 : asymmetric with NX huge pages, and just sort of jammed in. It would be easier to
 : follow if the kvm_tdp_mmu_map() conditional checked whether mirror TDP was
 : "active", rather than the mirror role.

[*] http://lore.kernel.org/all/eea0bf7925c3b9c16573be8e144ddcc77b54cc92.camel@intel.com

If the changelog explains _why_, and the code is actually commented, then calling
into disallowed_hugepage_adjust() for all faults in a VM with mirrored roots is
nonsensical, because the code won't match the comment.
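
As a rough sketch of the distinction (toy types with hypothetical names, not
KVM's real structures), compare a VM-wide gate with a per-root gate. The
sketch assumes a TDX VM has both a direct root for shared faults and a
mirror root for private faults:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; the field and function names are assumptions. */
struct toy_root {
	bool is_mirror;
};

struct toy_vm {
	struct toy_root direct_root;	/* used for shared faults */
	struct toy_root mirror_root;	/* used for private faults */
	bool has_mirrored_tdp;
};

/* VM-wide gate: true for every fault in a VM that has a mirror root. */
static bool gate_vm_wide(const struct toy_vm *vm, const struct toy_root *root,
			 bool nx_workaround)
{
	(void)root;	/* the root being faulted on is ignored */
	return nx_workaround || vm->has_mirrored_tdp;
}

/* Per-root gate: true only when the fault targets the mirror root. */
static bool gate_per_root(const struct toy_root *root, bool nx_workaround)
{
	assert(root);
	return nx_workaround || root->is_mirror;
}
```

With the NX workaround disabled, a shared fault on the direct root passes
the VM-wide gate but not the per-root gate, which is the mismatch between
code and comment described above.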

From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Date: Tue, 22 Apr 2025 10:21:12 +0800
Subject: [PATCH] KVM: x86/mmu: Prevent hugepage promotion for mirror roots in
 fault path

Disallow hugepage promotion in the TDP MMU for mirror roots as KVM doesn't
currently support promoting S-EPT entries due to the complexity incurred
by the TDX-Module's rules for hugepage promotion.

 - The current TDX-Module requires all 4KB leafs to be either all PENDING
   or all ACCEPTED before a successful promotion to 2MB. This requirement
   prevents successful page merging after partially converting a 2MB
   range from private to shared and then back to private, which is the
   primary scenario necessitating page promotion.

 - The TDX-Module effectively requires a break-before-make sequence (to
   satisfy its TLB flushing rules), i.e. creates a window of time where a
   different vCPU can encounter faults on a SPTE that KVM is trying to
   promote to a hugepage.  To avoid unexpected BUSY errors, KVM would need
   to FREEZE the non-leaf SPTE before replacing it with a huge SPTE.

Disable hugepage promotion for all map() operations, as supporting page
promotion when building the initial image is still non-trivial, and the
vast majority of images are ~4MB or less, i.e. the benefit of creating
hugepages during TD build time is minimal.

Signed-off-by: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
[sean: check root, add comment, rewrite changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c     |  3 ++-
 arch/x86/kvm/mmu/tdp_mmu.c | 12 +++++++++++-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4ecbf216d96f..45650f70eeab 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3419,7 +3419,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 	    cur_level == fault->goal_level &&
 	    is_shadow_present_pte(spte) &&
 	    !is_large_pte(spte) &&
-	    spte_to_child_sp(spte)->nx_huge_page_disallowed) {
+	    ((spte_to_child_sp(spte)->nx_huge_page_disallowed) ||
+	     is_mirror_sp(spte_to_child_sp(spte)))) {
 		/*
 		 * A small SPTE exists for this pfn, but FNAME(fetch),
 		 * direct_map(), or kvm_tdp_mmu_map() would like to create a
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 321dbde77d3f..0fe3be41594f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1232,7 +1232,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
 		int r;
 
-		if (fault->nx_huge_page_workaround_enabled)
+		/*
+		 * Don't replace a page table (non-leaf) SPTE with a huge SPTE
+		 * (a.k.a. hugepage promotion) if the NX hugepage workaround is
+		 * enabled, as doing so will cause significant thrashing if one
+		 * or more leaf SPTEs needs to be executable.
+		 *
+		 * Disallow hugepage promotion for mirror roots as KVM doesn't
+		 * (yet) support promoting S-EPT entries while holding mmu_lock
+		 * for read (due to complexity induced by the TDX-Module APIs).
+		 */
+		if (fault->nx_huge_page_workaround_enabled || is_mirror_sp(root))
 			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
 
 		/*

base-commit: 914ea33c797e95e5fa7a0803e44b621a9e70a90f
--

Re: [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root

Posted by Yan Zhao 1 week, 4 days ago

On Mon, Jan 26, 2026 at 08:08:31AM -0800, Sean Christopherson wrote:
> On Fri, Jan 16, 2026, Yan Zhao wrote:
> > Hi Sean,
> > Thanks for the review!
> > 
> > On Thu, Jan 15, 2026 at 02:49:59PM -0800, Sean Christopherson wrote:
> > > On Tue, Jan 06, 2026, Yan Zhao wrote:
> > > > From: Rick P Edgecombe <rick.p.edgecombe@intel.com>
> > > > 
> > > > Disallow page merging (huge page adjustment) for the mirror root by
> > > > utilizing disallowed_hugepage_adjust().
> > > 
> > > Why?  What is this actually doing?  The below explains "how" but I'm baffled as
> > > to the purpose.  I'm guessing there are hints in the surrounding patches, but I
> > > haven't read them in depth, and shouldn't need to in order to understand the
> > > primary reason behind a change.
> > Sorry for missing the background. I will explain the "why" in the patch log in
> > the next version.
> > 
> > The reason for introducing this patch is to disallow page merging for TDX. I
> > explained the reasons to disallow page merging in the cover letter:
> > 
> > "
> > 7. Page merging (page promotion)
> > 
> >    Promotion is disallowed, because:
> > 
> >    - The current TDX module requires all 4KB leafs to be either all PENDING
> >      or all ACCEPTED before a successful promotion to 2MB. This requirement
> >      prevents successful page merging after partially converting a 2MB
> >      range from private to shared and then back to private, which is the
> >      primary scenario necessitating page promotion.
> > 
> >    - tdh_mem_page_promote() depends on tdh_mem_range_block() in the current
> >      TDX module. Consequently, handling BUSY errors is complex, as page
> >      merging typically occurs in the fault path under shared mmu_lock.
> > 
> >    - Limited amount of initial private memory (typically ~4MB) means the
> >      need for page merging during TD build time is minimal.
> > "
> 
> > However, we currently don't support page merging yet. Specifically for the above
> > scenario, the purpose is to avoid handling the error from
> > tdh_mem_page_promote(), which SEAMCALL currently needs to be preceded by
> > tdh_mem_range_block(). To handle the promotion error (e.g., due to busy) under
> > read mmu_lock, we may need to introduce several spinlocks and guarantees from
> > the guest to ensure the success of tdh_mem_range_unblock() to restore the S-EPT
> > status. 
> > 
> > Therefore, we introduced this patch for simplicity, and because the promotion
> > scenario is not common.
> 
> Say that in the changelog!  Describing the "how" in detail is completely unnecessary,
I'll keep it in mind in the future!

> or at least it should be.  Because I strongly disagree with Rick's opinion from
> the RFC that kvm_tdp_mmu_map() should check kvm_has_mirrored_tdp()[*].
> 
>  : I think part of the thing that is bugging me is that
>  : nx_huge_page_workaround_enabled is not conceptually about whether the specific
>  : fault/level needs to disallow huge page adjustments, it's whether it needs to
>  : check if it does. Then disallowed_hugepage_adjust() does the actual specific
>  : checking. But for the mirror logic the check is the same for both. It's
>  : asymmetric with NX huge pages, and just sort of jammed in. It would be easier to
>  : follow if the kvm_tdp_mmu_map() conditional checked whether mirror TDP was
>  : "active", rather than the mirror role.
> 
> [*] http://lore.kernel.org/all/eea0bf7925c3b9c16573be8e144ddcc77b54cc92.camel@intel.com
> 
> If the changelog explains _why_, and the code is actually commented, then calling
> into disallowed_hugepage_adjust() for all faults in a VM with mirrored roots is
> nonsensical, because the code won't match the comment.

Thanks a lot! It looks good to me. 

> From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
> Date: Tue, 22 Apr 2025 10:21:12 +0800
> Subject: [PATCH] KVM: x86/mmu: Prevent hugepage promotion for mirror roots in
>  fault path
> 
> Disallow hugepage promotion in the TDP MMU for mirror roots as KVM doesn't
> currently support promoting S-EPT entries due to the complexity incurred
> by the TDX-Module's rules for hugepage promotion.
> 
>  - The current TDX-Module requires all 4KB leafs to be either all PENDING
>    or all ACCEPTED before a successful promotion to 2MB. This requirement
>    prevents successful page merging after partially converting a 2MB
>    range from private to shared and then back to private, which is the
>    primary scenario necessitating page promotion.
> 
>  - The TDX-Module effectively requires a break-before-make sequence (to
>    satisfy its TLB flushing rules), i.e. creates a window of time where a
>    different vCPU can encounter faults on a SPTE that KVM is trying to
>    promote to a hugepage.  To avoid unexpected BUSY errors, KVM would need
>    to FREEZE the non-leaf SPTE before replacing it with a huge SPTE.
> 
> Disable hugepage promotion for all map() operations, as supporting page
> promotion when building the initial image is still non-trivial, and the
> vast majority of images are ~4MB or less, i.e. the benefit of creating
> hugepages during TD build time is minimal.
> 
> Signed-off-by: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> [sean: check root, add comment, rewrite changelog]
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     |  3 ++-
>  arch/x86/kvm/mmu/tdp_mmu.c | 12 +++++++++++-
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4ecbf216d96f..45650f70eeab 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3419,7 +3419,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
>  	    cur_level == fault->goal_level &&
>  	    is_shadow_present_pte(spte) &&
>  	    !is_large_pte(spte) &&
> -	    spte_to_child_sp(spte)->nx_huge_page_disallowed) {
> +	    ((spte_to_child_sp(spte)->nx_huge_page_disallowed) ||
> +	     is_mirror_sp(spte_to_child_sp(spte)))) {
>  		/*
>  		 * A small SPTE exists for this pfn, but FNAME(fetch),
>  		 * direct_map(), or kvm_tdp_mmu_map() would like to create a
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 321dbde77d3f..0fe3be41594f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1232,7 +1232,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
>  		int r;
>  
> -		if (fault->nx_huge_page_workaround_enabled)
> +		/*
> +		 * Don't replace a page table (non-leaf) SPTE with a huge SPTE
> +		 * (a.k.a. hugepage promotion) if the NX hugepage workaround is
> +		 * enabled, as doing so will cause significant thrashing if one
> +		 * or more leaf SPTEs needs to be executable.
> +		 *
> +		 * Disallow hugepage promotion for mirror roots as KVM doesn't
> +		 * (yet) support promoting S-EPT entries while holding mmu_lock
> +		 * for read (due to complexity induced by the TDX-Module APIs).
> +		 */
> +		if (fault->nx_huge_page_workaround_enabled || is_mirror_sp(root))
A small nit:
Here, we check is_mirror_sp(root).
However, not far from here, in kvm_tdp_mmu_map(), we have another check of
is_mirror_sp(), which should get the same result since sp->role.is_mirror is
inherited from its parent.

               if (is_mirror_sp(sp))
                       kvm_mmu_alloc_external_spt(vcpu, sp);

So, do you think we can save the is_mirror status in a local variable?
Like this:

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b524b44733b8..c54befec3042 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1300,6 +1300,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
        struct kvm_mmu_page *root = tdp_mmu_get_root_for_fault(vcpu, fault);
+       bool is_mirror = root && is_mirror_sp(root);
        struct kvm *kvm = vcpu->kvm;
        struct tdp_iter iter;
        struct kvm_mmu_page *sp;
@@ -1316,7 +1317,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
        for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
                int r;

-               if (fault->nx_huge_page_workaround_enabled)
+               /*
+                * Don't replace a page table (non-leaf) SPTE with a huge SPTE
+                * (a.k.a. hugepage promotion) if the NX hugepage workaround is
+                * enabled, as doing so will cause significant thrashing if one
+                * or more leaf SPTEs needs to be executable.
+                *
+                * Disallow hugepage promotion for mirror roots as KVM doesn't
+                * (yet) support promoting S-EPT entries while holding mmu_lock
+                * for read (due to complexity induced by the TDX-Module APIs).
+                */
+               if (fault->nx_huge_page_workaround_enabled || is_mirror)
                        disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);

                /*
@@ -1340,7 +1351,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
                 */
                sp = tdp_mmu_alloc_sp(vcpu);
                tdp_mmu_init_child_sp(sp, &iter);
-               if (is_mirror_sp(sp))
+               if (is_mirror)
                        kvm_mmu_alloc_external_spt(vcpu, sp);

Re: [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root

Posted by Sean Christopherson 1 week, 2 days ago

On Tue, Jan 27, 2026, Yan Zhao wrote:
> On Mon, Jan 26, 2026 at 08:08:31AM -0800, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 321dbde77d3f..0fe3be41594f 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1232,7 +1232,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  	for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
> >  		int r;
> >  
> > -		if (fault->nx_huge_page_workaround_enabled)
> > +		/*
> > +		 * Don't replace a page table (non-leaf) SPTE with a huge SPTE
> > +		 * (a.k.a. hugepage promotion) if the NX hugepage workaround is
> > +		 * enabled, as doing so will cause significant thrashing if one
> > +		 * or more leaf SPTEs needs to be executable.
> > +		 *
> > +		 * Disallow hugepage promotion for mirror roots as KVM doesn't
> > +		 * (yet) support promoting S-EPT entries while holding mmu_lock
> > +		 * for read (due to complexity induced by the TDX-Module APIs).
> > +		 */
> > +		if (fault->nx_huge_page_workaround_enabled || is_mirror_sp(root))
> A small nit:
> Here, we check is_mirror_sp(root).
> However, not far from here, in kvm_tdp_mmu_map(), we have another check of
> is_mirror_sp(), which should get the same result since sp->role.is_mirror is
> inherited from its parent.
> 
>                if (is_mirror_sp(sp))
>                        kvm_mmu_alloc_external_spt(vcpu, sp);
> 
> So, do you think we can save the is_mirror status in a local variable?

Eh, I vote "no".  From a performance perspective, it's basically meaningless.
The check is a single uop to test a flag that is all but guaranteed to be
cache-hot, and any halfway decent CPU will be able to predict the branch.

From a code perspective, I'd rather have the explicit is_mirror_sp(root) check,
as opposed to having to go look at the origins of is_mirror.

> Like this:
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b524b44733b8..c54befec3042 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1300,6 +1300,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>         struct kvm_mmu_page *root = tdp_mmu_get_root_for_fault(vcpu, fault);
> +       bool is_mirror = root && is_mirror_sp(root);
>         struct kvm *kvm = vcpu->kvm;
>         struct tdp_iter iter;
>         struct kvm_mmu_page *sp;
> @@ -1316,7 +1317,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>         for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
>                 int r;
> 
> -               if (fault->nx_huge_page_workaround_enabled)
> +               /*
> +                * Don't replace a page table (non-leaf) SPTE with a huge SPTE
> +                * (a.k.a. hugepage promotion) if the NX hugepage workaround is
> +                * enabled, as doing so will cause significant thrashing if one
> +                * or more leaf SPTEs needs to be executable.
> +                *
> +                * Disallow hugepage promotion for mirror roots as KVM doesn't
> +                * (yet) support promoting S-EPT entries while holding mmu_lock
> +                * for read (due to complexity induced by the TDX-Module APIs).
> +                */
> +               if (fault->nx_huge_page_workaround_enabled || is_mirror)
>                         disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
> 
>                 /*
> @@ -1340,7 +1351,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                  */
>                 sp = tdp_mmu_alloc_sp(vcpu);
>                 tdp_mmu_init_child_sp(sp, &iter);
> -               if (is_mirror_sp(sp))
> +               if (is_mirror)
>                         kvm_mmu_alloc_external_spt(vcpu, sp);
>