Introduce kvm_split_cross_boundary_leafs() to split huge leaf entries that
cross the boundary of a specified range.
Splitting huge leaf entries that cross the boundary is essential before
zapping a specified range in the mirror root. It ensures that the
subsequent zap operation does not affect any GFNs outside the specified
range, which is crucial for the mirror root because pages in the private
page table require the guest's ACCEPT operation after being faulted back.
While the core of kvm_split_cross_boundary_leafs() leverages the main logic
of tdp_mmu_split_huge_pages_root(), the former only splits huge leaf
entries whose mapping ranges cross the boundary of the specified range.
When splitting is necessary, kvm->mmu_lock may be temporarily released for
memory allocation, so returning -ENOMEM is possible.
Since tdp_mmu_split_huge_pages_root() was originally invoked by dirty page
tracking related functions that flush the TLB unconditionally at the end,
it doesn't flush the TLB before it temporarily releases mmu_lock.
Do not enhance tdp_mmu_split_huge_pages_root() to return a split or flush
status to kvm_split_cross_boundary_leafs(). Such a status could be
inaccurate when multiple threads are trying to split the same memory range
concurrently: e.g., if kvm_split_cross_boundary_leafs() returns split/flush
as false, it doesn't mean there are no splits in the specified range, since
splits could have occurred in other threads while mmu_lock was temporarily
released.
Therefore, callers of kvm_split_cross_boundary_leafs() need to determine
how/when to flush TLB according to the use cases:
- If the split is triggered in a fault path for TDX, the hardware shouldn't
have cached the old huge translation. Therefore, no need to flush TLB.
- If the split is triggered by zaps in guest_memfd punch hole or page
  conversion, the TLB flush can be delayed until after the zaps (see the
  sketch after this list).
- If the use case relies on pure split status (e.g., splitting for PML),
flush TLB unconditionally. (Just hypothetical. No such use case currently
exists for kvm_split_cross_boundary_leafs()).
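As an illustration of the second case above, a conversion/zap path could use
the API roughly as in the sketch below. The function name and surrounding
logic are made up purely for illustration (they are not part of this patch)
and assume mmu_lock is held for write:

static int example_zap_private_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
	int r;

	/* Ensure no huge leaf straddles range->start or range->end. */
	r = kvm_split_cross_boundary_leafs(kvm, range, false);
	if (r)
		return r;

	/* Zap only the requested GFNs, then do a single TLB flush. */
	if (kvm_unmap_gfn_range(kvm, range))
		kvm_flush_remote_tlbs(kvm);

	return 0;
}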
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- s/only_cross_bounday/only_cross_boundary. (Kai)
- Do not return flush status and have the callers to determine how/when to
flush TLB.
- Always pass "flush" as false to tdp_mmu_iter_cond_resched(). (Kai)
- Added a default implementation for kvm_split_cross_boundary_leafs() for
  non-x86 platforms.
- Removed middle level function tdp_mmu_split_cross_boundary_leafs().
- Use EXPORT_SYMBOL_FOR_KVM_INTERNAL().
RFC v2:
- Rename the API to kvm_split_cross_boundary_leafs().
- Make the API to be usable for direct roots or under shared mmu_lock.
- Leverage the main logic from tdp_mmu_split_huge_pages_root(). (Rick)
RFC v1:
- Split patch.
- introduced API kvm_split_boundary_leafs(), refined the logic and
simplified the code.
---
arch/x86/kvm/mmu/mmu.c | 34 ++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 42 ++++++++++++++++++++++++++++++++++++--
arch/x86/kvm/mmu/tdp_mmu.h | 3 +++
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 7 +++++++
5 files changed, 86 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b4f2e3ced716..f40af7ac75b3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1644,6 +1644,40 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
start, end - 1, can_yield, true, flush);
}
+/*
+ * Split large leafs crossing the boundary of the specified range.
+ * Only support TDP MMU. Do nothing if !tdp_mmu_enabled.
+ *
+ * This API does not flush TLB. Callers need to determine how/when to flush TLB
+ * according to their use cases, e.g.,
+ * - No need to flush TLB. e.g., if it's in a fault path or TLB flush has been
+ * ensured.
+ * - Delay the TLB flush until after zaps if the split is invoked for precise
+ * zapping.
+ * - Unconditionally flush TLB if a use case relies on pure split status (e.g.,
+ * splitting for PML).
+ *
+ * Return value: 0 : success; <0: failure
+ */
+int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
+ bool shared)
+{
+	int ret = 0;
+
+ lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
+ lockdep_is_held(&kvm->slots_lock) ||
+ srcu_read_lock_held(&kvm->srcu));
+
+ if (!range->may_block)
+ return -EOPNOTSUPP;
+
+ if (tdp_mmu_enabled)
+ ret = kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(kvm, range,
+ shared);
+ return ret;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_split_cross_boundary_leafs);
+
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool flush = false;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 074209d91ec3..b984027343b7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1600,10 +1600,17 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
return ret;
}
+static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
+{
+ return !(iter->gfn >= start &&
+ (iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
+}
+
static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
struct kvm_mmu_page *root,
gfn_t start, gfn_t end,
- int target_level, bool shared)
+ int target_level, bool shared,
+ bool only_cross_boundary)
{
struct kvm_mmu_page *sp = NULL;
struct tdp_iter iter;
@@ -1615,6 +1622,10 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
* level into one lower level. For example, if we encounter a 1GB page
* we split it into 512 2MB pages.
*
+ * When only_cross_boundary is true, just split huge pages above the
+ * target level into one lower level if the huge pages cross the start
+ * or end boundary.
+ *
* Since the TDP iterator uses a pre-order traversal, we are guaranteed
* to visit an SPTE before ever visiting its children, which means we
* will correctly recursively split huge pages that are more than one
@@ -1629,6 +1640,10 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
continue;
+ if (only_cross_boundary &&
+ !iter_cross_boundary(&iter, start, end))
+ continue;
+
if (!sp) {
rcu_read_unlock();
@@ -1692,12 +1707,35 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
kvm_lockdep_assert_mmu_lock_held(kvm, shared);
for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) {
- r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level, shared);
+ r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level,
+ shared, false);
+ if (r) {
+ kvm_tdp_mmu_put_root(kvm, root);
+ break;
+ }
+ }
+}
+
+int kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ bool shared)
+{
+ enum kvm_tdp_mmu_root_types types;
+ struct kvm_mmu_page *root;
+ int r = 0;
+
+ kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+ types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter);
+
+ __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types) {
+ r = tdp_mmu_split_huge_pages_root(kvm, root, range->start, range->end,
+ PG_LEVEL_4K, shared, true);
if (r) {
kvm_tdp_mmu_put_root(kvm, root);
break;
}
}
+ return r;
}
static bool tdp_mmu_need_write_protect(struct kvm *kvm, struct kvm_mmu_page *sp)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index bd62977c9199..c20b1416e4b2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -70,6 +70,9 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
enum kvm_tdp_mmu_root_types root_types);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
+int kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ bool shared);
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8144d27e6c12..e563bb22c481 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -275,6 +275,8 @@ struct kvm_gfn_range {
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
+ bool shared);
#endif
enum {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1d7ab2324d10..feeef7747099 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -910,6 +910,13 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
}
+int __weak kvm_split_cross_boundary_leafs(struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ bool shared)
+{
+ return 0;
+}
+
#else /* !CONFIG_KVM_GENERIC_MMU_NOTIFIER */
static int kvm_init_mmu_notifier(struct kvm *kvm)
--
2.43.2
On Tue, 2026-01-06 at 18:21 +0800, Yan Zhao wrote:
> @@ -1692,12 +1707,35 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
>
> kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) {
> - r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level, shared);
> + r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level,
> + shared, false);
> + if (r) {
> + kvm_tdp_mmu_put_root(kvm, root);
> + break;
> + }
> + }
> +}
> +
> +int kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(struct kvm *kvm,
> + struct kvm_gfn_range *range,
> + bool shared)
> +{
> + enum kvm_tdp_mmu_root_types types;
> + struct kvm_mmu_page *root;
> + int r = 0;
> +
> + kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> + types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter);
> +
> + __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types) {
> + r = tdp_mmu_split_huge_pages_root(kvm, root, range->start, range->end,
> + PG_LEVEL_4K, shared, true);
> if (r) {
> kvm_tdp_mmu_put_root(kvm, root);
> break;
> }
> }
> + return r;
> }
>
It seems the two functions -- kvm_tdp_mmu_try_split_huge_pages() and
kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() -- are almost
identical. Would it be better to introduce a helper and make the two just
thin wrappers around it?
E.g.,
static int __kvm_tdp_mmu_split_huge_pages(struct kvm *kvm,
struct kvm_gfn_range *range,
int target_level,
bool shared,
bool cross_boundary_only)
{
...
}
And by using this helper, I found the name of the two wrapper functions
are not ideal:
kvm_tdp_mmu_try_split_huge_pages() is only for log dirty, and it should
not be reachable for TD (VM with mirrored PT). But currently it uses
KVM_VALID_ROOTS for root filter thus mirrored PT is also included. I
think it's better to rename it, e.g., at least with "log_dirty" in the
name so it's more clear this function is only for dealing log dirty (at
least currently). We can also add a WARN() if it's called for VM with
mirrored PT but it's a different topic.
kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() doesn't have
"huge_pages", which isn't consistent with the other. And it is a bit
long. If we don't have "gfn_range" in __kvm_tdp_mmu_split_huge_pages(),
then I think we can remove "gfn_range" from
kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() too to make it shorter.
So how about:
Rename kvm_tdp_mmu_try_split_huge_pages() to
kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
kvm_tdp_mmu_split_huge_pages_cross_boundary()
?
E.g.,:
int kvm_tdp_mmu_split_huge_pages_log_dirty(struct kvm *kvm,
const struct kvm_memory_slot *slot,
gfn_t start, gfn_t end,
int target_level, bool shared)
{
struct kvm_gfn_range range = {
.slot = slot,
.start = start,
.end = end,
.attr_filter = 0, /* doesn't matter */
.may_block = true,
};
if (WARN_ON_ONCE(kvm_has_mirrored_tdp(kvm)))
return -EINVAL;
return __kvm_tdp_mmu_split_huge_pages(kvm, &range, target_level,
shared, false);
}
int kvm_tdp_mmu_split_huge_pages_cross_boundary(struct kvm *kvm,
struct kvm_gfn_range *range,
int target_level,
bool shared)
{
return __kvm_tdp_mmu_split_huge_pages(kvm, range, target_level,
shared, true);
}
Anything I missed?
And one more minor thing:
With that, I think you can move range->may_block check from
kvm_split_cross_boundary_leafs() to the __kvm_tdp_mmu_split_huge_pages()
common helper:
if (!range->may_block)
return -EOPNOTSUPP;
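For reference, one possible shape of that common helper, reusing the body of
kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() from this patch (untested
sketch, names as proposed above):

static int __kvm_tdp_mmu_split_huge_pages(struct kvm *kvm,
					  struct kvm_gfn_range *range,
					  int target_level, bool shared,
					  bool cross_boundary_only)
{
	enum kvm_tdp_mmu_root_types types;
	struct kvm_mmu_page *root;
	int r = 0;

	if (!range->may_block)
		return -EOPNOTSUPP;

	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
	types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter);

	__for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types) {
		r = tdp_mmu_split_huge_pages_root(kvm, root, range->start,
						  range->end, target_level,
						  shared, cross_boundary_only);
		if (r) {
			kvm_tdp_mmu_put_root(kvm, root);
			break;
		}
	}
	return r;
}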
On Thu, Jan 15, 2026, Kai Huang wrote:
> static int __kvm_tdp_mmu_split_huge_pages(struct kvm *kvm,
> struct kvm_gfn_range *range,
> int target_level,
> bool shared,
> bool cross_boundary_only)
> {
> ...
> }
>
> And by using this helper, I found the name of the two wrapper functions
> are not ideal:
>
> kvm_tdp_mmu_try_split_huge_pages() is only for log dirty, and it should
> not be reachable for TD (VM with mirrored PT). But currently it uses
> KVM_VALID_ROOTS for root filter thus mirrored PT is also included. I
> think it's better to rename it, e.g., at least with "log_dirty" in the
> name so it's more clear this function is only for dealing log dirty (at
> least currently). We can also add a WARN() if it's called for VM with
> mirrored PT but it's a different topic.
>
> kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() doesn't have
> "huge_pages", which isn't consistent with the other. And it is a bit
> long. If we don't have "gfn_range" in __kvm_tdp_mmu_split_huge_pages(),
> then I think we can remove "gfn_range" from
> kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() too to make it shorter.
>
> So how about:
>
> Rename kvm_tdp_mmu_try_split_huge_pages() to
> kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> kvm_tdp_mmu_split_huge_pages_cross_boundary()
>
> ?
I find the "cross_boundary" terminology extremely confusing. I also dislike
the concept itself, in the sense that it shoves a weird, specific concept into
the guts of the TDP MMU.
The other wart is that it's inefficient when punching a large hole. E.g. say
there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
userspace punches a 12TiB hole. Walking all ~12TiB just to _maybe_ split the head
and tail pages is asinine.
And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
_only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
For the EPT violation case, the guest is accepting a page. Just split to the
guest's accepted level, I don't see any reason to make things more complicated
than that.
And then for the PUNCH_HOLE case, do the math to determine which, if any, head
and tail pages need to be split, and use the existing APIs to make that happen.
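As a rough illustration of that head/tail math (example_split_head_and_tail()
is a made-up name, and kvm_tdp_mmu_try_split_huge_pages() merely stands in for
whichever existing split API the final code uses; note this splits everything
inside the two 1GiB-sized windows, not just the leafs that actually straddle
the boundaries):

static void example_split_head_and_tail(struct kvm *kvm,
					const struct kvm_memory_slot *slot,
					gfn_t start, gfn_t end, bool shared)
{
	gfn_t gran = KVM_PAGES_PER_HPAGE(PG_LEVEL_1G);
	gfn_t head_end = min(end, ALIGN(start, gran));
	gfn_t tail_start = max(start, ALIGN_DOWN(end, gran));

	/* A 1GiB-aligned boundary can't be crossed by any huge leaf. */
	if (!IS_ALIGNED(start, gran))
		kvm_tdp_mmu_try_split_huge_pages(kvm, slot, start, head_end,
						 PG_LEVEL_4K, shared);

	/* Skip the tail window if the head window already covered it. */
	if (!IS_ALIGNED(end, gran) && tail_start >= head_end)
		kvm_tdp_mmu_try_split_huge_pages(kvm, slot, tail_start, end,
						 PG_LEVEL_4K, shared);
}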
On Fri, Jan 16, 2026 at 3:39 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 15, 2026, Kai Huang wrote:
> > static int __kvm_tdp_mmu_split_huge_pages(struct kvm *kvm,
> > struct kvm_gfn_range *range,
> > int target_level,
> > bool shared,
> > bool cross_boundary_only)
> > {
> > ...
> > }
> >
> > And by using this helper, I found the name of the two wrapper functions
> > are not ideal:
> >
> > kvm_tdp_mmu_try_split_huge_pages() is only for log dirty, and it should
> > not be reachable for TD (VM with mirrored PT). But currently it uses
> > KVM_VALID_ROOTS for root filter thus mirrored PT is also included. I
> > think it's better to rename it, e.g., at least with "log_dirty" in the
> > name so it's more clear this function is only for dealing log dirty (at
> > least currently). We can also add a WARN() if it's called for VM with
> > mirrored PT but it's a different topic.
> >
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() doesn't have
> > "huge_pages", which isn't consistent with the other. And it is a bit
> > long. If we don't have "gfn_range" in __kvm_tdp_mmu_split_huge_pages(),
> > then I think we can remove "gfn_range" from
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() too to make it shorter.
> >
> > So how about:
> >
> > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> >
> > ?
>
> I find the "cross_boundary" termininology extremely confusing. I also dislike
> the concept itself, in the sense that it shoves a weird, specific concept into
> the guts of the TDP MMU.
>
> The other wart is that it's inefficient when punching a large hole. E.g. say
> there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> userpace punches a 12TiB hole. Walking all ~12TiB just to _maybe_ split the head
> and tail pages is asinine.
>
> And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
>
> For the EPT violation case, the guest is accepting a page. Just split to the
> guest's accepted level, I don't see any reason to make things more complicated
> than that.
>
> And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> and tail pages need to be split, and use the existing APIs to make that happen.
Just a note: Through guest_memfd upstream syncs, we agreed that
guest_memfd will only allow the punch_hole operation for huge page
size-aligned ranges for hugetlb and thp backing. i.e. the PUNCH_HOLE
operation doesn't need to split any EPT mappings for foreseeable
future.
On Tue, Jan 20, 2026, Vishal Annapurve wrote:
> On Fri, Jan 16, 2026 at 3:39 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Jan 15, 2026, Kai Huang wrote:
> > > static int __kvm_tdp_mmu_split_huge_pages(struct kvm *kvm,
> > > struct kvm_gfn_range *range,
> > > int target_level,
> > > bool shared,
> > > bool cross_boundary_only)
> > > {
> > > ...
> > > }
> > >
> > > And by using this helper, I found the name of the two wrapper functions
> > > are not ideal:
> > >
> > > kvm_tdp_mmu_try_split_huge_pages() is only for log dirty, and it should
> > > not be reachable for TD (VM with mirrored PT). But currently it uses
> > > KVM_VALID_ROOTS for root filter thus mirrored PT is also included. I
> > > think it's better to rename it, e.g., at least with "log_dirty" in the
> > > name so it's more clear this function is only for dealing log dirty (at
> > > least currently). We can also add a WARN() if it's called for VM with
> > > mirrored PT but it's a different topic.
> > >
> > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() doesn't have
> > > "huge_pages", which isn't consistent with the other. And it is a bit
> > > long. If we don't have "gfn_range" in __kvm_tdp_mmu_split_huge_pages(),
> > > then I think we can remove "gfn_range" from
> > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() too to make it shorter.
> > >
> > > So how about:
> > >
> > > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> > >
> > > ?
> >
> > I find the "cross_boundary" termininology extremely confusing. I also dislike
> > the concept itself, in the sense that it shoves a weird, specific concept into
> > the guts of the TDP MMU.
> >
> > The other wart is that it's inefficient when punching a large hole. E.g. say
> > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > userpace punches a 12TiB hole. Walking all ~12TiB just to _maybe_ split the head
> > and tail pages is asinine.
> >
> > And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> > _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> > code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
> >
> > For the EPT violation case, the guest is accepting a page. Just split to the
> > guest's accepted level, I don't see any reason to make things more complicated
> > than that.
> >
> > And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> > and tail pages need to be split, and use the existing APIs to make that happen.
>
> Just a note: Through guest_memfd upstream syncs, we agreed that
> guest_memfd will only allow the punch_hole operation for huge page
> size-aligned ranges for hugetlb and thp backing. i.e. the PUNCH_HOLE
> operation doesn't need to split any EPT mappings for foreseeable
> future.
Oh! Right, forgot about that. It's the conversion path that we need to sort out,
not PUNCH_HOLE. Thanks for the reminder!
On Tue, Jan 20, 2026 at 10:02:41AM -0800, Sean Christopherson wrote:
> On Tue, Jan 20, 2026, Vishal Annapurve wrote:
> > On Fri, Jan 16, 2026 at 3:39 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, Jan 15, 2026, Kai Huang wrote:
> > > > static int __kvm_tdp_mmu_split_huge_pages(struct kvm *kvm,
> > > > struct kvm_gfn_range *range,
> > > > int target_level,
> > > > bool shared,
> > > > bool cross_boundary_only)
> > > > {
> > > > ...
> > > > }
> > > >
> > > > And by using this helper, I found the name of the two wrapper functions
> > > > are not ideal:
> > > >
> > > > kvm_tdp_mmu_try_split_huge_pages() is only for log dirty, and it should
> > > > not be reachable for TD (VM with mirrored PT). But currently it uses
> > > > KVM_VALID_ROOTS for root filter thus mirrored PT is also included. I
> > > > think it's better to rename it, e.g., at least with "log_dirty" in the
> > > > name so it's more clear this function is only for dealing log dirty (at
> > > > least currently). We can also add a WARN() if it's called for VM with
> > > > mirrored PT but it's a different topic.
> > > >
> > > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() doesn't have
> > > > "huge_pages", which isn't consistent with the other. And it is a bit
> > > > long. If we don't have "gfn_range" in __kvm_tdp_mmu_split_huge_pages(),
> > > > then I think we can remove "gfn_range" from
> > > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() too to make it shorter.
> > > >
> > > > So how about:
> > > >
> > > > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > > > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > > > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> > > >
> > > > ?
> > >
> > > I find the "cross_boundary" termininology extremely confusing. I also dislike
> > > the concept itself, in the sense that it shoves a weird, specific concept into
> > > the guts of the TDP MMU.
> > >
> > > The other wart is that it's inefficient when punching a large hole. E.g. say
> > > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > > userpace punches a 12TiB hole. Walking all ~12TiB just to _maybe_ split the head
> > > and tail pages is asinine.
> > >
> > > And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> > > _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> > > code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
> > >
> > > For the EPT violation case, the guest is accepting a page. Just split to the
> > > guest's accepted level, I don't see any reason to make things more complicated
> > > than that.
> > >
> > > And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> > > and tail pages need to be split, and use the existing APIs to make that happen.
> >
> > Just a note: Through guest_memfd upstream syncs, we agreed that
> > guest_memfd will only allow the punch_hole operation for huge page
> > size-aligned ranges for hugetlb and thp backing. i.e. the PUNCH_HOLE
> > operation doesn't need to split any EPT mappings for foreseeable
> > future.
>
> Oh! Right, forgot about that. It's the conversion path that we need to sort out,
> not PUNCH_HOLE. Thanks for the reminder!
Hmm, I see.
However, do you think it's better to leave the splitting logic in PUNCH_HOLE as
well? e.g., guest_memfd may want to map several folios in a mapping in the
future, i.e., after *max_order > folio_order(folio);
On Thu, Jan 22, 2026, Yan Zhao wrote:
> On Tue, Jan 20, 2026 at 10:02:41AM -0800, Sean Christopherson wrote:
> > On Tue, Jan 20, 2026, Vishal Annapurve wrote:
> > > On Fri, Jan 16, 2026 at 3:39 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> > > > and tail pages need to be split, and use the existing APIs to make that happen.
> > >
> > > Just a note: Through guest_memfd upstream syncs, we agreed that
> > > guest_memfd will only allow the punch_hole operation for huge page
> > > size-aligned ranges for hugetlb and thp backing. i.e. the PUNCH_HOLE
> > > operation doesn't need to split any EPT mappings for foreseeable
> > > future.
> >
> > Oh! Right, forgot about that. It's the conversion path that we need to sort out,
> > not PUNCH_HOLE. Thanks for the reminder!
> Hmm, I see.
> However, do you think it's better to leave the splitting logic in PUNCH_HOLE as
> well? e.g., guest_memfd may want to map several folios in a mapping in the
> future, i.e., after *max_order > folio_order(folio);

No, not at this time. That is a _very_ big "if". Coordinating and tracking
contiguous chunks of memory at a larger granularity than the underlying HugeTLB
page size would require significant complexity, I don't see us ever doing that.
On Fri, Jan 16, 2026 at 03:39:13PM -0800, Sean Christopherson wrote:
> On Thu, Jan 15, 2026, Kai Huang wrote:
> > static int __kvm_tdp_mmu_split_huge_pages(struct kvm *kvm,
> > struct kvm_gfn_range *range,
> > int target_level,
> > bool shared,
> > bool cross_boundary_only)
> > {
> > ...
> > }
> >
> > And by using this helper, I found the name of the two wrapper functions
> > are not ideal:
> >
> > kvm_tdp_mmu_try_split_huge_pages() is only for log dirty, and it should
> > not be reachable for TD (VM with mirrored PT). But currently it uses
> > KVM_VALID_ROOTS for root filter thus mirrored PT is also included. I
> > think it's better to rename it, e.g., at least with "log_dirty" in the
> > name so it's more clear this function is only for dealing log dirty (at
> > least currently). We can also add a WARN() if it's called for VM with
> > mirrored PT but it's a different topic.
> >
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() doesn't have
> > "huge_pages", which isn't consistent with the other. And it is a bit
> > long. If we don't have "gfn_range" in __kvm_tdp_mmu_split_huge_pages(),
> > then I think we can remove "gfn_range" from
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() too to make it shorter.
> >
> > So how about:
> >
> > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> >
> > ?
>
> I find the "cross_boundary" termininology extremely confusing. I also dislike
> the concept itself, in the sense that it shoves a weird, specific concept into
> the guts of the TDP MMU.
> The other wart is that it's inefficient when punching a large hole. E.g. say
> there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> userpace punches a 12TiB hole. Walking all ~12TiB just to _maybe_ split the head
> and tail pages is asinine.
That's a reasonable concern. I actually thought about it.
My consideration was as follows:
Currently, we don't have such large areas. Usually, the conversion ranges are
less than 1GB. Though the initial conversion which converts all memory from
private to shared may be wide, there are usually no mappings at that stage. So,
the traversal should be very fast (since the traversal doesn't even need to go
down to the 2MB/1GB level).
If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
very large range at runtime, it can optimize by invoking the API twice:
once for range [start, ALIGN(start, 1GB)), and
once for range [ALIGN_DOWN(end, 1GB), end).
I can also implement this optimization within kvm_split_cross_boundary_leafs()
by checking the range size if you think that would be better.
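In code, that two-call variant could look like the sketch below (the wrapper
name is made up; kvm_split_cross_boundary_leafs() simply gets two sub-ranges,
each no larger than 1GB, that contain the two boundaries):

static int example_split_conversion_boundaries(struct kvm *kvm,
					       struct kvm_gfn_range *range,
					       bool shared)
{
	gfn_t gran = KVM_PAGES_PER_HPAGE(PG_LEVEL_1G);
	struct kvm_gfn_range head = *range, tail = *range;
	int r;

	/* Window containing the start boundary: [start, ALIGN(start, 1GB)) */
	head.end = min(range->end, ALIGN(range->start, gran));
	/* Window containing the end boundary: [ALIGN_DOWN(end, 1GB), end) */
	tail.start = max(range->start, ALIGN_DOWN(range->end, gran));

	r = kvm_split_cross_boundary_leafs(kvm, &head, shared);
	if (r)
		return r;

	/* If the two windows overlap, the first call already covered both. */
	if (tail.start >= head.end)
		r = kvm_split_cross_boundary_leafs(kvm, &tail, shared);

	return r;
}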
> And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
Sorry for the confusion about the usage of tdx_honor_guest_accept_level(). I
should add a better comment.
There are 4 use cases for the API kvm_split_cross_boundary_leafs():
1. PUNCH_HOLE
2. KVM_SET_MEMORY_ATTRIBUTES2, which invokes kvm_gmem_set_attributes() for
private-to-shared conversions
3. tdx_honor_guest_accept_level()
4. kvm_gmem_error_folio()
Use cases 1-3 are already in the current code. Use case 4 is per our discussion,
and will be implemented in the next version (because guest_memfd may split
folios without first splitting S-EPT).
The 4 use cases can be divided into two categories:
1. Category 1: use cases 1, 2, 4
We must ensure GFN start - 1 and GFN start are not mapped in a single
mapping. However, for GFN start or GFN start - 1 specifically, we don't care
about their actual mapping levels, which means they are free to be mapped at
2MB or 1GB. The same applies to GFN end - 1 and GFN end.
--|------------------|-----------
  ^                  ^
  start              end - 1
2. Category 2: use case 3
It cares about the mapping level of the GFN, i.e., it must not be mapped
above a certain level.
-----|-------
     ^
     GFN
So, to unify the two categories, I have tdx_honor_guest_accept_level() check
the range of [level-aligned GFN, level-aligned GFN + level size). e.g.,
If the accept level is 2MB, only 1GB mapping is possible to be outside the
range and needs splitting.
-----|-------------|---
     ^             ^
     |             |
     level-aligned level-aligned
     GFN           GFN + level size - 1
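In code, that check in tdx_honor_guest_accept_level() boils down to something
like the sketch below (a standalone rendering of the idea, not the actual TDX
code; the attr_filter value in particular is an assumption):

static int example_split_for_accept_level(struct kvm *kvm,
					  struct kvm_memory_slot *slot,
					  gfn_t gfn, int level)
{
	gfn_t base = gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1);
	struct kvm_gfn_range range = {
		.slot = slot,
		.start = base,
		.end = base + KVM_PAGES_PER_HPAGE(level),
		.attr_filter = KVM_FILTER_PRIVATE,
		.may_block = true,
	};

	/*
	 * Any mapping larger than "level" that covers gfn necessarily crosses
	 * base or base + KVM_PAGES_PER_HPAGE(level), so only such mappings are
	 * split; a mapping at or below "level" is left untouched.
	 */
	return kvm_split_cross_boundary_leafs(kvm, &range, true);
}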
> For the EPT violation case, the guest is accepting a page. Just split to the
> guest's accepted level, I don't see any reason to make things more complicated
> than that.
This use case could reuse the kvm_mmu_try_split_huge_pages() API, except that we
need a return value.
> And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> and tail pages need to be split, and use the existing APIs to make that happen.
This use case cannot reuse kvm_mmu_try_split_huge_pages() without modification.
Or which existing APIs are you referring to?
The cross_boundary information is still useful?
BTW: Currently, kvm_split_cross_boundary_leafs() internally reuses
tdp_mmu_split_huge_pages_root() (as shown below).
kvm_split_cross_boundary_leafs
kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
tdp_mmu_split_huge_pages_root
However, tdp_mmu_split_huge_pages_root() is originally used to split huge
mappings in a wide range, so it temporarily releases mmu_lock for memory
allocation for sp, since it can't predict how many pages to pre-allocate in the
KVM mmu cache.
For kvm_split_cross_boundary_leafs(), we can actually predict the max number of
pages to pre-allocate. If we don't reuse tdp_mmu_split_huge_pages_root(), we can
allocate sp, sp->spt, sp->external_spt and DPAMT pages from the KVM mmu cache
without releasing mmu_lock and invoking tdp_mmu_alloc_sp_for_split(). Do you
think this approach is better?
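For what it's worth, the bound is small: with a 4KB target, each of the two
boundaries needs at most one split per level above 4KB, per root (TDX external
page tables and DPAMT pages would add their own per-split cost on top). A
back-of-the-envelope helper, purely illustrative:

static int example_max_split_sps_per_root(void)
{
	/* 2 boundaries, each at most one 1GB->2MB plus one 2MB->4KB split. */
	return 2 * (PG_LEVEL_1G - PG_LEVEL_4K);
}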
On Mon, Jan 19, 2026, Yan Zhao wrote:
> On Fri, Jan 16, 2026 at 03:39:13PM -0800, Sean Christopherson wrote:
> > On Thu, Jan 15, 2026, Kai Huang wrote:
> > > So how about:
> > >
> > > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> > >
> > > ?
> >
> > I find the "cross_boundary" termininology extremely confusing. I also dislike
> > the concept itself, in the sense that it shoves a weird, specific concept into
> > the guts of the TDP MMU.
> > The other wart is that it's inefficient when punching a large hole. E.g. say
> > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > userpace punches a 12TiB hole. Walking all ~12TiB just to _maybe_ split the head
> > and tail pages is asinine.
> That's a reasonable concern. I actually thought about it.
> My consideration was as follows:
> Currently, we don't have such large areas. Usually, the conversion ranges are
> less than 1GB.
Nothing guarantees that behavior.
> Though the initial conversion which converts all memory from private to
> shared may be wide, there are usually no mappings at that stage. So, the
> traversal should be very fast (since the traversal doesn't even need to go
> down to the 2MB/1GB level).
>
> If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> very large range at runtime, it can optimize by invoking the API twice:
> once for range [start, ALIGN(start, 1GB)), and
> once for range [ALIGN_DOWN(end, 1GB), end).
>
> I can also implement this optimization within kvm_split_cross_boundary_leafs()
> by checking the range size if you think that would be better.
>
> > And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> > _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> > code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
> Sorry for the confusion about the usage of tdx_honor_guest_accept_level(). I
> should add a better comment.
>
> There are 4 use cases for the API kvm_split_cross_boundary_leafs():
> 1. PUNCH_HOLE
> 2. KVM_SET_MEMORY_ATTRIBUTES2, which invokes kvm_gmem_set_attributes() for
> private-to-shared conversions
> 3. tdx_honor_guest_accept_level()
> 4. kvm_gmem_error_folio()
>
> Use cases 1-3 are already in the current code. Use case 4 is per our discussion,
> and will be implemented in the next version (because guest_memfd may split
> folios without first splitting S-EPT).
>
> The 4 use cases can be divided into two categories:
>
> 1. Category 1: use cases 1, 2, 4
> We must ensure GFN start - 1 and GFN start are not mapped in a single
> mapping. However, for GFN start or GFN start - 1 specifically, we don't care
> about their actual mapping levels, which means they are free to be mapped at
> 2MB or 1GB. The same applies to GFN end - 1 and GFN end.
>
> --|------------------|-----------
> ^ ^
> start end - 1
>
> 2. Category 2: use case 3
> It cares about the mapping level of the GFN, i.e., it must not be mapped
> above a certain level.
>
> -----|-------
> ^
> GFN
>
> So, to unify the two categories, I have tdx_honor_guest_accept_level() check
> the range of [level-aligned GFN, level-aligned GFN + level size). e.g.,
> If the accept level is 2MB, only 1GB mapping is possible to be outside the
> range and needs splitting.
But that overlooks the fact that Category 2 already fits the existing "category"
that is supported by the TDP MMU. I.e. Category 1 is (somewhat) new and novel,
Category 2 is not.
> -----|-------------|---
> ^ ^
> | |
> level-aligned level-aligned
> GFN GFN + level size - 1
>
>
> > For the EPT violation case, the guest is accepting a page. Just split to the
> > guest's accepted level, I don't see any reason to make things more complicated
> > than that.
> This use case could reuse the kvm_mmu_try_split_huge_pages() API, except that we
> need a return value.
Just expose tdp_mmu_split_huge_pages_root(), the fault path only _needs_ to split
the current root, and in fact shouldn't even try to split other roots (ignoring
that no other relevant roots exist).
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9c26038f6b77..7d924da75106 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1555,10 +1555,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
return ret;
}
-static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
- struct kvm_mmu_page *root,
- gfn_t start, gfn_t end,
- int target_level, bool shared)
+int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end, int target_level,
+ bool shared)
{
struct kvm_mmu_page *sp = NULL;
struct tdp_iter iter;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index bd62977c9199..ea9a509608fb 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -93,6 +93,9 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
int min_level);
+int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end, int target_level,
+ bool shared);
void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
const struct kvm_memory_slot *slot,
gfn_t start, gfn_t end,
> > And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> > and tail pages need to be split, and use the existing APIs to make that happen.
> This use case cannot reuse kvm_mmu_try_split_huge_pages() without modification.
Modifying existing code is a non-issue, and you're already modifying TDP MMU
functions, so I don't see that as a reason for choosing X instead of Y.
> Or which existing APIs are you referring to?
See above.
> The cross_boundary information is still useful?
>
> BTW: Currently, kvm_split_cross_boundary_leafs() internally reuses
> tdp_mmu_split_huge_pages_root() (as shown below).
>
> kvm_split_cross_boundary_leafs
> kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
> tdp_mmu_split_huge_pages_root
>
> However, tdp_mmu_split_huge_pages_root() is originally used to split huge
> mappings in a wide range, so it temporarily releases mmu_lock for memory
> allocation for sp, since it can't predict how many pages to pre-allocate in the
> KVM mmu cache.
>
> For kvm_split_cross_boundary_leafs(), we can actually predict the max number of
> pages to pre-allocate. If we don't reuse tdp_mmu_split_huge_pages_root(), we can
> allocate sp, sp->spt, sp->external_spt and DPAMT pages from the KVM mmu cache
> without releasing mmu_lock and invoking tdp_mmu_alloc_sp_for_split().
That's completely orthogonal to the "only need to maybe split head and tail pages".
E.g. kvm_tdp_mmu_try_split_huge_pages() can also predict the _max_ number of pages
to pre-allocate, it's just not worth adding a kvm_mmu_memory_cache for that use
case because that path can drop mmu_lock at will, unlike the full page fault path.
I.e. the complexity doesn't justify the benefits, especially since the max number
of pages is so large.
AFAICT, the only pre-allocation that is _necessary_ is for the dynamic PAMT,
because the allocation is done outside of KVM's control. But that's a solvable
problem, the tricky part is protecting the PAMT cache for PUNCH_HOLE, but that
too is solvable, e.g. by adding a per-VM mutex that's taken by kvm_gmem_punch_hole()
to handle the PUNCH_HOLE case, and then using the per-vCPU cache when splitting
for a mismatched accept.
On Tue, Jan 20, 2026 at 09:51:06AM -0800, Sean Christopherson wrote:
> On Mon, Jan 19, 2026, Yan Zhao wrote:
> > On Fri, Jan 16, 2026 at 03:39:13PM -0800, Sean Christopherson wrote:
> > > On Thu, Jan 15, 2026, Kai Huang wrote:
> > > > So how about:
> > > >
> > > > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > > > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > > > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> > > >
> > > > ?
> > >
> > > I find the "cross_boundary" termininology extremely confusing. I also dislike
> > > the concept itself, in the sense that it shoves a weird, specific concept into
> > > the guts of the TDP MMU.
> > > The other wart is that it's inefficient when punching a large hole. E.g. say
> > > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > > userpace punches a 12TiB hole. Walking all ~12TiB just to _maybe_ split the head
> > > and tail pages is asinine.
> > That's a reasonable concern. I actually thought about it.
> > My consideration was as follows:
> > Currently, we don't have such large areas. Usually, the conversion ranges are
> > less than 1GB.
>
> Nothing guarantees that behavior.
>
> > Though the initial conversion which converts all memory from private to
> > shared may be wide, there are usually no mappings at that stage. So, the
> > traversal should be very fast (since the traversal doesn't even need to go
> > down to the 2MB/1GB level).
> >
> > If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> > very large range at runtime, it can optimize by invoking the API twice:
> > once for range [start, ALIGN(start, 1GB)), and
> > once for range [ALIGN_DOWN(end, 1GB), end).
> >
> > I can also implement this optimization within kvm_split_cross_boundary_leafs()
> > by checking the range size if you think that would be better.
> >
> > > And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> > > _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> > > code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
> > Sorry for the confusion about the usage of tdx_honor_guest_accept_level(). I
> > should add a better comment.
> >
> > There are 4 use cases for the API kvm_split_cross_boundary_leafs():
> > 1. PUNCH_HOLE
> > 2. KVM_SET_MEMORY_ATTRIBUTES2, which invokes kvm_gmem_set_attributes() for
> > private-to-shared conversions
> > 3. tdx_honor_guest_accept_level()
> > 4. kvm_gmem_error_folio()
> >
> > Use cases 1-3 are already in the current code. Use case 4 is per our discussion,
> > and will be implemented in the next version (because guest_memfd may split
> > folios without first splitting S-EPT).
> >
> > The 4 use cases can be divided into two categories:
> >
> > 1. Category 1: use cases 1, 2, 4
> > We must ensure GFN start - 1 and GFN start are not mapped in a single
> > mapping. However, for GFN start or GFN start - 1 specifically, we don't care
> > about their actual mapping levels, which means they are free to be mapped at
> > 2MB or 1GB. The same applies to GFN end - 1 and GFN end.
> >
> > --|------------------|-----------
> > ^ ^
> > start end - 1
> >
> > 2. Category 2: use case 3
> > It cares about the mapping level of the GFN, i.e., it must not be mapped
> > above a certain level.
> >
> > -----|-------
> > ^
> > GFN
> >
> > So, to unify the two categories, I have tdx_honor_guest_accept_level() check
> > the range of [level-aligned GFN, level-aligned GFN + level size). e.g.,
> > If the accept level is 2MB, only 1GB mapping is possible to be outside the
> > range and needs splitting.
>
> But that overlooks the fact that Category 2 already fits the existing "category"
> that is supported by the TDP MMU. I.e. Category 1 is (somewhat) new and novel,
> Category 2 is not.
>
> > -----|-------------|---
> > ^ ^
> > | |
> > level-aligned level-aligned
> > GFN GFN + level size - 1
> >
> >
> > > For the EPT violation case, the guest is accepting a page. Just split to the
> > > guest's accepted level, I don't see any reason to make things more complicated
> > > than that.
> > This use case could reuse the kvm_mmu_try_split_huge_pages() API, except that we
> > need a return value.
>
> Just expose tdp_mmu_split_huge_pages_root(), the fault path only _needs_ to split
> the current root, and in fact shouldn't even try to split other roots (ignoring
> that no other relevant roots exist).
Ok.
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 9c26038f6b77..7d924da75106 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1555,10 +1555,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> return ret;
> }
>
> -static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> - struct kvm_mmu_page *root,
> - gfn_t start, gfn_t end,
> - int target_level, bool shared)
> +int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
> + gfn_t start, gfn_t end, int target_level,
> + bool shared)
> {
> struct kvm_mmu_page *sp = NULL;
> struct tdp_iter iter;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index bd62977c9199..ea9a509608fb 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -93,6 +93,9 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
> struct kvm_memory_slot *slot, gfn_t gfn,
> int min_level);
>
> +int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
> + gfn_t start, gfn_t end, int target_level,
> + bool shared);
> void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
> const struct kvm_memory_slot *slot,
> gfn_t start, gfn_t end,
>
> > > And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> > > and tail pages need to be split, and use the existing APIs to make that happen.
> > This use case cannot reuse kvm_mmu_try_split_huge_pages() without modification.
>
> Modifying existing code is a non-issue, and you're already modifying TDP MMU
> functions, so I don't see that as a reason for choosing X instead of Y.
>
> > Or which existing APIs are you referring to?
>
> See above.
Ok. Do you like the idea of introducing only_cross_boundary (or something with a
different name) to tdp_mmu_split_huge_pages_root()?
If not, could I expose a helper to help with the range calculation?
> > The cross_boundary information is still useful?
> >
> > BTW: Currently, kvm_split_cross_boundary_leafs() internally reuses
> > tdp_mmu_split_huge_pages_root() (as shown below).
> >
> > kvm_split_cross_boundary_leafs
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
> > tdp_mmu_split_huge_pages_root
> >
> > However, tdp_mmu_split_huge_pages_root() is originally used to split huge
> > mappings in a wide range, so it temporarily releases mmu_lock for memory
> > allocation for sp, since it can't predict how many pages to pre-allocate in the
> > KVM mmu cache.
> >
> > For kvm_split_cross_boundary_leafs(), we can actually predict the max number of
> > pages to pre-allocate. If we don't reuse tdp_mmu_split_huge_pages_root(), we can
> > allocate sp, sp->spt, sp->external_spt and DPAMT pages from the KVM mmu cache
> > without releasing mmu_lock and invoking tdp_mmu_alloc_sp_for_split().
>
> That's completely orthogonal to the "only need to maybe split head and tail pages".
> E.g. kvm_tdp_mmu_try_split_huge_pages() can also predict the _max_ number of pages
> to pre-allocate, it's just not worth adding a kvm_mmu_memory_cache for that use
> case because that path can drop mmu_lock at will, unlike the full page fault path.
> I.e. the complexity doesn't justify the benefits, especially since the max number
> of pages is so large.
Right, it's technically feasible, but not practical.
To split a huge range, e.g. 16GB, down to 4KB, the _max_ number of pages to
pre-allocate is too large: it's 16*512=8192 pages even without TDX.
> AFAICT, the only pre-allocation that is _necessary_ is for the dynamic PAMT,
Yes, patch 20 in this series just pre-allocates DPAMT pages for splitting.
See:
static int tdx_min_split_cache_sz(struct kvm *kvm, int level)
{
KVM_BUG_ON(level != PG_LEVEL_2M, kvm);
if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
return 0;
return tdx_dpamt_entry_pages() * 2;
}
> because the allocation is done outside of KVM's control. But that's a solvable
> problem, the tricky part is protecting the PAMT cache for PUNCH_HOLE, but that
> too is solvable, e.g. by adding a per-VM mutex that's taken by kvm_gmem_punch_hole()
I don't get why only the PUNCH_HOLE case needs to be protected.
It's not guaranteed that KVM_SET_MEMORY_ATTRIBUTES2 ioctls are in vCPU contexts.
BTW: the split_external_spte() hook is invoked under mmu_lock, so I used a
spinlock kvm_tdx->prealloc_split_cache_lock to protect cache enqueuing
and dequeuing.
> to handle the PUNCH_HOLE case, and then using the per-vCPU cache when splitting
> for a mismatched accept.
Yes, I planned to use per-vCPU cache in the future. e.g., when splitting under
shared mmu_lock to honor guest accept level.
On Mon, 2026-01-19 at 09:28 +0800, Zhao, Yan Y wrote:
> > I find the "cross_boundary" termininology extremely confusing. I also dislike
> > the concept itself, in the sense that it shoves a weird, specific concept into
> > the guts of the TDP MMU.
> > The other wart is that it's inefficient when punching a large hole. E.g. say
> > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > userpace punches a 12TiB hole. Walking all ~12TiB just to _maybe_ split the head
> > and tail pages is asinine.
> That's a reasonable concern. I actually thought about it.
> My consideration was as follows:
> Currently, we don't have such large areas. Usually, the conversion ranges are
> less than 1GB. Though the initial conversion which converts all memory from
> private to shared may be wide, there are usually no mappings at that stage. So,
> the traversal should be very fast (since the traversal doesn't even need to go
> down to the 2MB/1GB level).
>
> If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> very large range at runtime, it can optimize by invoking the API twice:
> once for range [start, ALIGN(start, 1GB)), and
> once for range [ALIGN_DOWN(end, 1GB), end).
>
> I can also implement this optimization within kvm_split_cross_boundary_leafs()
> by checking the range size if you think that would be better.

I am not sure why do we even need kvm_split_cross_boundary_leafs(), if you
want to do optimization.

I think I've raised this in v2, and asked why not just letting the caller
to figure out the ranges to split for a given range (see at the end of
[*]), because the "cross boundary" can only happen at the beginning and
end of the given range, if possible.

[*]:
https://lore.kernel.org/all/35fd7d70475d5743a3c45bc5b8118403036e439b.camel@intel.com/
On Mon, 2026-01-19 at 08:35 +0000, Huang, Kai wrote:
> On Mon, 2026-01-19 at 09:28 +0800, Zhao, Yan Y wrote:
> > > I find the "cross_boundary" termininology extremely confusing. I also dislike
> > > the concept itself, in the sense that it shoves a weird, specific concept into
> > > the guts of the TDP MMU.
> > > The other wart is that it's inefficient when punching a large hole. E.g. say
> > > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > > userpace punches a 12TiB hole. Walking all ~12TiB just to _maybe_ split the head
> > > and tail pages is asinine.
> > That's a reasonable concern. I actually thought about it.
> > My consideration was as follows:
> > Currently, we don't have such large areas. Usually, the conversion ranges are
> > less than 1GB. Though the initial conversion which converts all memory from
> > private to shared may be wide, there are usually no mappings at that stage. So,
> > the traversal should be very fast (since the traversal doesn't even need to go
> > down to the 2MB/1GB level).
> >
> > If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> > very large range at runtime, it can optimize by invoking the API twice:
> > once for range [start, ALIGN(start, 1GB)), and
> > once for range [ALIGN_DOWN(end, 1GB), end).
> >
> > I can also implement this optimization within kvm_split_cross_boundary_leafs()
> > by checking the range size if you think that would be better.
>
> I am not sure why do we even need kvm_split_cross_boundary_leafs(), if you
> want to do optimization.
>
> I think I've raised this in v2, and asked why not just letting the caller
> to figure out the ranges to split for a given range (see at the end of
> [*]), because the "cross boundary" can only happen at the beginning and
> end of the given range, if possible.
>
> [*]:
> https://lore.kernel.org/all/35fd7d70475d5743a3c45bc5b8118403036e439b.camel@intel.com/

Hmm.. thinking again, if you have multiple places needing to do this, then
kvm_split_cross_boundary_leafs() may serve as a helper to calculate the
ranges to split.
On Mon, Jan 19, 2026 at 04:49:58PM +0800, Huang, Kai wrote:
> On Mon, 2026-01-19 at 08:35 +0000, Huang, Kai wrote:
> > On Mon, 2026-01-19 at 09:28 +0800, Zhao, Yan Y wrote:
> > > > I find the "cross_boundary" termininology extremely confusing. I also dislike
> > > > the concept itself, in the sense that it shoves a weird, specific concept into
> > > > the guts of the TDP MMU.
> > > > The other wart is that it's inefficient when punching a large hole. E.g. say
> > > > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > > > userpace punches a 12TiB hole. Walking all ~12TiB just to _maybe_ split the head
> > > > and tail pages is asinine.
> > > That's a reasonable concern. I actually thought about it.
> > > My consideration was as follows:
> > > Currently, we don't have such large areas. Usually, the conversion ranges are
> > > less than 1GB. Though the initial conversion which converts all memory from
> > > private to shared may be wide, there are usually no mappings at that stage. So,
> > > the traversal should be very fast (since the traversal doesn't even need to go
> > > down to the 2MB/1GB level).
> > >
> > > If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> > > very large range at runtime, it can optimize by invoking the API twice:
> > > once for range [start, ALIGN(start, 1GB)), and
> > > once for range [ALIGN_DOWN(end, 1GB), end).
> > >
> > > I can also implement this optimization within kvm_split_cross_boundary_leafs()
> > > by checking the range size if you think that would be better.
> >
> > I am not sure why do we even need kvm_split_cross_boundary_leafs(), if you
> > want to do optimization.
> >
> > I think I've raised this in v2, and asked why not just letting the caller
> > to figure out the ranges to split for a given range (see at the end of
> > [*]), because the "cross boundary" can only happen at the beginning and
> > end of the given range, if possible.
Hmm, the caller can only figure out when splitting is NOT necessary, e.g., if
start is 1GB-aligned, then there's no need to split for start. However, if start
is not 1GB/2MB-aligned, the caller has no idea if there's a 2MB mapping covering
start - 1 and start.
(for non-TDX cases, if start is not 1GB-aligned and is just 2MB-aligned,
invoking tdp_mmu_split_huge_pages_root() is still necessary because there may
exist a 1GB mapping covering start -1 and start).

In my reply to [*], I didn't want to do the calculation because I didn't see
much overhead from always invoking tdp_mmu_split_huge_pages_root().
But the scenario Sean pointed out is different. When both start and end are not
2MB-aligned, if [start, end) covers a huge range, we can still pre-calculate to
reduce the iterations in tdp_mmu_split_huge_pages_root().

Opportunistically, optimization to skip splits for 1GB-aligned start or end is
possible :)

> > [*]:
> > https://lore.kernel.org/all/35fd7d70475d5743a3c45bc5b8118403036e439b.camel@intel.com/
>
> Hmm.. thinking again, if you have multiple places needing to do this, then
> kvm_split_cross_boundary_leafs() may serve as a helper to calculate the
> ranges to split.
Yes.
On Mon, 2026-01-19 at 18:11 +0800, Yan Zhao wrote:
> On Mon, Jan 19, 2026 at 04:49:58PM +0800, Huang, Kai wrote:
> > On Mon, 2026-01-19 at 08:35 +0000, Huang, Kai wrote:
> > > On Mon, 2026-01-19 at 09:28 +0800, Zhao, Yan Y wrote:
> > >
> > > [...]
> > >
> > > I think I've raised this in v2, and asked why not just let the caller
> > > figure out the ranges to split for a given range (see at the end of
> > > [*]), because the "cross boundary" can only happen at the beginning and
> > > end of the given range, if possible.
>
> Hmm, the caller can only figure out when splitting is NOT necessary, e.g., if
> start is 1GB-aligned, then there's no need to split for start. However, if start
> is not 1GB/2MB-aligned, the caller has no idea whether there's a 2MB mapping
> covering start - 1 and start.

Why does the caller need to know?

Let's only talk about 'start' for simplicity:

- If start is 1G-aligned, then no split is needed.

- If start is not 1G-aligned but 2M-aligned, you split the range
  [ALIGN_DOWN(start, 1G), ALIGN(start, 1G)) to 2M level.

- If start is 4K-aligned only, you first split
  [ALIGN_DOWN(start, 1G), ALIGN(start, 1G)) to 2M level,
  then you split
  [ALIGN_DOWN(start, 2M), ALIGN(start, 2M)) to 4K level.

Similar handling applies to 'end'. An additional point: if a to-be-split
range calculated from 'start' overlaps one calculated from 'end', the
split is only needed once.

Wouldn't this work?

> (For non-TDX cases, if start is not 1GB-aligned and is just 2MB-aligned,
> invoking tdp_mmu_split_huge_pages_root() is still necessary because there may
> exist a 1GB mapping covering start - 1 and start.)
>
> In my reply to [*], I didn't want to do the calculation because I didn't see
> much overhead from always invoking tdp_mmu_split_huge_pages_root().
> But the scenario Sean pointed out is different. When both start and end are not
> 2MB-aligned, and [start, end) covers a huge range, we can still pre-calculate to
> reduce the iterations in tdp_mmu_split_huge_pages_root().

I don't see much difference. Maybe I am missing something.

> Opportunistically, an optimization to skip splits for a 1GB-aligned start or end
> is possible :)

If this makes the code easier to review/maintain then sure.

As long as the solution is easy to review (i.e., not too complicated to
understand/maintain) then I am fine with whatever Sean/you prefer.

However, the 'cross_boundary_only' thing was indeed a bit odd to me when I
first saw this :-)
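As a rough illustration of the alignment-based handling of 'start' described
above (the 'end' side and the overlap check are omitted), the sketch below uses
split_range_to_level() as a stand-in for whatever primitive would actually split
leaves down to a target level; that name is hypothetical, not an existing KVM
function, and the rest is kernel-style pseudocode built on real macros
(ALIGN, ALIGN_DOWN, IS_ALIGNED, KVM_PAGES_PER_HPAGE).

/* Hypothetical sketch of the 'start'-side calculation suggested above. */
static int split_for_start(struct kvm *kvm, gfn_t start)
{
        gfn_t gfns_1g = KVM_PAGES_PER_HPAGE(PG_LEVEL_1G);
        gfn_t gfns_2m = KVM_PAGES_PER_HPAGE(PG_LEVEL_2M);
        int r;

        /* A 1G-aligned start cannot be covered by a leaf that crosses it. */
        if (IS_ALIGNED(start, gfns_1g))
                return 0;

        /* Split a potential 1GB leaf covering start down to 2MB. */
        r = split_range_to_level(kvm, ALIGN_DOWN(start, gfns_1g),
                                 ALIGN(start, gfns_1g), PG_LEVEL_2M);
        if (r)
                return r;

        /* If start is only 4KB-aligned, also split the covering 2MB leaf to 4KB. */
        if (!IS_ALIGNED(start, gfns_2m))
                r = split_range_to_level(kvm, ALIGN_DOWN(start, gfns_2m),
                                         ALIGN(start, gfns_2m), PG_LEVEL_4K);
        return r;
}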
On Mon, Jan 19, 2026 at 06:40:50PM +0800, Huang, Kai wrote:
> On Mon, 2026-01-19 at 18:11 +0800, Yan Zhao wrote:
> > On Mon, Jan 19, 2026 at 04:49:58PM +0800, Huang, Kai wrote:
> >
> > [...]
> >
> > Hmm, the caller can only figure out when splitting is NOT necessary, e.g., if
> > start is 1GB-aligned, then there's no need to split for start. However, if start
> > is not 1GB/2MB-aligned, the caller has no idea whether there's a 2MB mapping
> > covering start - 1 and start.
>
> Why does the caller need to know?
>
> Let's only talk about 'start' for simplicity:
>
> - If start is 1G-aligned, then no split is needed.
>
> - If start is not 1G-aligned but 2M-aligned, you split the range
>   [ALIGN_DOWN(start, 1G), ALIGN(start, 1G)) to 2M level.
>
> - If start is 4K-aligned only, you first split
>   [ALIGN_DOWN(start, 1G), ALIGN(start, 1G)) to 2M level,
>   then you split
>   [ALIGN_DOWN(start, 2M), ALIGN(start, 2M)) to 4K level.
>
> Similar handling applies to 'end'. An additional point: if a to-be-split
> range calculated from 'start' overlaps one calculated from 'end', the
> split is only needed once.
>
> Wouldn't this work?

It can work. But I don't think the calculations are necessary if the length
of [start, end) is less than 1GB or 2MB.

e.g., if both start and end are just 4KB-aligned, with a length of 8KB, the
current implementation can invoke a single tdp_mmu_split_huge_pages_root() to
split a 1GB mapping to 4KB directly. Why bother splitting twice for start or
end?

> > (For non-TDX cases, if start is not 1GB-aligned and is just 2MB-aligned,
> > invoking tdp_mmu_split_huge_pages_root() is still necessary because there may
> > exist a 1GB mapping covering start - 1 and start.)
> >
> > In my reply to [*], I didn't want to do the calculation because I didn't see
> > much overhead from always invoking tdp_mmu_split_huge_pages_root().
> > But the scenario Sean pointed out is different. When both start and end are not
> > 2MB-aligned, and [start, end) covers a huge range, we can still pre-calculate to
> > reduce the iterations in tdp_mmu_split_huge_pages_root().
>
> I don't see much difference. Maybe I am missing something.

The difference is the length of the range.
For lengths < 1GB, always invoking tdp_mmu_split_huge_pages_root() without any
calculation is simpler and more efficient.

> > Opportunistically, an optimization to skip splits for a 1GB-aligned start or end
> > is possible :)
>
> If this makes the code easier to review/maintain then sure.
>
> As long as the solution is easy to review (i.e., not too complicated to
> understand/maintain) then I am fine with whatever Sean/you prefer.
>
> However, the 'cross_boundary_only' thing was indeed a bit odd to me when I
> first saw this :-)
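Putting the two viewpoints together, the decision discussed above boils down to
a simple length check. The sketch below is illustrative only; as before, the
gfn-based signature of kvm_split_cross_boundary_leafs() and the helper name
split_for_conversion() are assumptions, not taken from the posted patch.

/* Hypothetical sketch: choose between a single walk and head/tail walks. */
static int split_for_conversion(struct kvm *kvm, gfn_t start, gfn_t end)
{
        gfn_t gfns_1g = KVM_PAGES_PER_HPAGE(PG_LEVEL_1G);
        int r;

        /* For ranges shorter than 1GB, a single walk of [start, end) is cheap. */
        if (end - start < gfns_1g)
                return kvm_split_cross_boundary_leafs(kvm, start, end);

        /* Larger ranges: only the head and tail can contain crossing leaves. */
        r = kvm_split_cross_boundary_leafs(kvm, start, ALIGN(start, gfns_1g));
        if (r)
                return r;
        return kvm_split_cross_boundary_leafs(kvm, ALIGN_DOWN(end, gfns_1g), end);
}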
On Mon, Jan 19, 2026 at 07:06:01PM +0800, Yan Zhao wrote:
> On Mon, Jan 19, 2026 at 06:40:50PM +0800, Huang, Kai wrote:
> >
> > [...]
> >
> > Similar handling applies to 'end'. An additional point: if a to-be-split
> > range calculated from 'start' overlaps one calculated from 'end', the
> > split is only needed once.
> >
> > Wouldn't this work?
>
> It can work. But I don't think the calculations are necessary if the length
> of [start, end) is less than 1GB or 2MB.
>
> e.g., if both start and end are just 4KB-aligned, with a length of 8KB, the
> current implementation can invoke a single tdp_mmu_split_huge_pages_root() to
> split a 1GB mapping to 4KB directly. Why bother splitting twice for start or
> end?

I think I get your point now.
It's a good idea if introducing only_cross_boundary is undesirable.

So, the remaining question (as I asked at the bottom of [1]) is whether we could
create a specific function for this split use case, rather than reusing
tdp_mmu_split_huge_pages_root(), which allocates pages outside of mmu_lock.
This way, we don't need to introduce a spinlock to protect the page enqueuing/
dequeuing of the per-VM external cache (see prealloc_split_cache_lock in patch
20 [2]).

Then we would disallow mirror_root for tdp_mmu_split_huge_pages_root(), which is
currently called for dirty page tracking in upstream code. Would this be
acceptable for TDX migration?

[1] https://lore.kernel.org/all/aW2Iwpuwoyod8eQc@yzhao56-desk.sh.intel.com/
[2] https://lore.kernel.org/all/20260106102345.25261-1-yan.y.zhao@intel.com/

[...]
On Mon, Jan 19, 2026, Yan Zhao wrote:
> On Mon, Jan 19, 2026 at 07:06:01PM +0800, Yan Zhao wrote:
> > On Mon, Jan 19, 2026 at 06:40:50PM +0800, Huang, Kai wrote:
> > > Similar handling applies to 'end'. An additional point: if a to-be-split
> > > range calculated from 'start' overlaps one calculated from 'end', the
> > > split is only needed once.
> > >
> > > Wouldn't this work?
> >
> > It can work. But I don't think the calculations are necessary if the length
> > of [start, end) is less than 1GB or 2MB.
> >
> > e.g., if both start and end are just 4KB-aligned, with a length of 8KB, the
> > current implementation can invoke a single tdp_mmu_split_huge_pages_root() to
> > split a 1GB mapping to 4KB directly. Why bother splitting twice for start or
> > end?
>
> I think I get your point now.
> It's a good idea if introducing only_cross_boundary is undesirable.
>
> So, the remaining question (as I asked at the bottom of [1]) is whether we could
> create a specific function for this split use case, rather than reusing
> tdp_mmu_split_huge_pages_root(), which allocates pages outside of mmu_lock.

Belatedly, yes. What I want to avoid is modifying core MMU functionality to add
edge-case handling for TDX. Inevitably, TDX will require invasive changes, but
in this case they're completely unjustified.

FWIW, if __for_each_tdp_mmu_root_yield_safe() were visible outside of tdp_mmu.c,
all of the x86 code guarded by CONFIG_HAVE_KVM_ARCH_GMEM_CONVERT[*] could live
in tdx.c.

Hmm, actually, looking at that again, it's totally doable to bury the majority
of the logic in tdx.c; the TDP MMU just needs to expose an API to split
hugepages in mirror roots. Which is effectively what
tdx_handle_mismatched_accept() needs as well, since there can only be one
mirror root in practice.

Oof, and kvm_tdp_mmu_split_huge_pages() used by tdx_handle_mismatched_accept()
is wrong; it operates on the "normal" root, not the mirror root. Let me respond
to those patches.

[*] https://lore.kernel.org/all/20260129011517.3545883-45-seanjc@google.com

> This way, we don't need to introduce a spinlock to protect the page enqueuing/
> dequeuing of the per-VM external cache (see prealloc_split_cache_lock in patch
> 20 [2]).
>
> Then we would disallow mirror_root for tdp_mmu_split_huge_pages_root(), which is
> currently called for dirty page tracking in upstream code. Would this be
> acceptable for TDX migration?

Honestly, I have no idea. That's so far in the future.

> [1] https://lore.kernel.org/all/aW2Iwpuwoyod8eQc@yzhao56-desk.sh.intel.com/
> [2] https://lore.kernel.org/all/20260106102345.25261-1-yan.y.zhao@intel.com/