[RFC PATCH 18/21] KVM: x86: Split huge boundary leafs before private to shared conversion

Posted by Yan Zhao 7 months, 3 weeks ago
Before converting a GFN range from private to shared, it is necessary to
zap the mirror page table. When huge pages are supported and the GFN range
intersects with a huge leaf, split the huge leaf to prevent zapping GFNs
outside the conversion range.
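For example (illustrative numbers only), if a 2MB leaf maps GFNs [0x800, 0xa00)
and only [0x900, 0x980) is converted to shared, splitting the leaf down to 4KB
first ensures the zap touches only [0x900, 0x980) and leaves the rest of the
2MB region mapped private.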

Invoke kvm_split_boundary_leafs() in kvm_arch_pre_set_memory_attributes()
to split the huge boundary leafs before calling kvm_unmap_gfn_range() to
zap the GFN range that will be converted to shared.

Unlike kvm_unmap_gfn_range(), which cannot fail, kvm_split_boundary_leafs()
may fail due to an out-of-memory error during splitting.
Update kvm_handle_gfn_range() to propagate the splitting error back to
kvm_vm_set_mem_attributes(), which will subsequently fail the ioctl
KVM_SET_MEMORY_ATTRIBUTES.

The downside of this approach is that although kvm_split_boundary_leafs()
is invoked before kvm_unmap_gfn_range() for each GFN range, the entire
conversion range may consist of several GFN ranges. If an out-of-memory
error occurs during the splitting of a GFN range, some previous GFN ranges
may have been successfully split and zapped, even though their page
attributes remain unchanged due to the splitting failure. This may not be a
big problem as the user can retry the ioctl to split and zap the full
range.
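
A minimal userspace sketch of such a retry (the helper name and retry policy
are illustrative assumptions, not part of this series):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Retry KVM_SET_MEMORY_ATTRIBUTES if the splitting step fails with -ENOMEM. */
static int set_shared_with_retry(int vm_fd, __u64 gpa, __u64 size)
{
	struct kvm_memory_attributes attr = {
		.address = gpa,
		.size = size,
		.attributes = 0,	/* clear KVM_MEMORY_ATTRIBUTE_PRIVATE */
	};
	int tries;

	for (tries = 0; tries < 3; tries++) {
		if (!ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr))
			return 0;
		if (errno != ENOMEM)
			break;
		/* Already-split/zapped sub-ranges are simply walked again. */
	}
	return -errno;
}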

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
 virt/kvm/kvm_main.c    | 13 +++++++++----
 2 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ba993445a00e..1a34e43bd349 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7694,6 +7694,9 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 				       struct kvm_gfn_range *range)
 {
+	bool flush = false;
+	int ret;
+
 	/*
 	 * Zap SPTEs even if the slot can't be mapped PRIVATE.  KVM x86 only
 	 * supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM
@@ -7706,7 +7709,7 @@ int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 	 * a hugepage can be used for affected ranges.
 	 */
 	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
-		return false;
+		return 0;
 
 	/* Unmap the old attribute page. */
 	if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)
@@ -7714,7 +7717,13 @@ int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 	else
 		range->attr_filter = KVM_FILTER_PRIVATE;
 
-	return kvm_unmap_gfn_range(kvm, range);
+	ret = kvm_split_boundary_leafs(kvm, range);
+	if (ret < 0)
+		return ret;
+	flush |= ret;
+
+	flush |= kvm_unmap_gfn_range(kvm, range);
+	return flush;
 }
 
 static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
@@ -7769,7 +7778,7 @@ int kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 	 * SHARED may now allow hugepages.
 	 */
 	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
-		return false;
+		return 0;
 
 	/*
 	 * The sequence matters here: upper levels consume the result of lower
@@ -7816,7 +7825,7 @@ int kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 				hugepage_set_mixed(slot, gfn, level);
 		}
 	}
-	return false;
+	return 0;
 }
 
 void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 72bd98c100cf..6d9b82890f15 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2408,8 +2408,8 @@ bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	return true;
 }
 
-static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
-						 struct kvm_mmu_notifier_range *range)
+static __always_inline int kvm_handle_gfn_range(struct kvm *kvm,
+						struct kvm_mmu_notifier_range *range)
 {
 	struct kvm_gfn_range gfn_range;
 	struct kvm_memory_slot *slot;
@@ -2462,6 +2462,8 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
 
 	if (found_memslot)
 		KVM_MMU_UNLOCK(kvm);
+
+	return ret < 0 ? ret : 0;
 }
 
 static int kvm_pre_set_memory_attributes(struct kvm *kvm,
@@ -2526,7 +2528,9 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 			goto out_unlock;
 	}
 
-	kvm_handle_gfn_range(kvm, &pre_set_range);
+	r = kvm_handle_gfn_range(kvm, &pre_set_range);
+	if (r)
+		goto out_unlock;
 
 	for (i = start; i < end; i++) {
 		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
@@ -2534,7 +2538,8 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		KVM_BUG_ON(r, kvm);
 	}
 
-	kvm_handle_gfn_range(kvm, &post_set_range);
+	r = kvm_handle_gfn_range(kvm, &post_set_range);
+	KVM_BUG_ON(r, kvm);
 
 out_unlock:
 	mutex_unlock(&kvm->slots_lock);
-- 
2.43.2
Re: [RFC PATCH 18/21] KVM: x86: Split huge boundary leafs before private to shared conversion
Posted by Edgecombe, Rick P 7 months, 1 week ago
On Thu, 2025-04-24 at 11:08 +0800, Yan Zhao wrote:
> Before converting a GFN range from private to shared, it is necessary to
> zap the mirror page table. When huge pages are supported and the GFN range
> intersects with a huge leaf, split the huge leaf to prevent zapping GFNs
> outside the conversion range.

FALLOC_FL_PUNCH_HOLE demotion failure doesn't look like it is addressed in this
series. I noticed that mmu notifier failures are allowed to be handled by
blocking until success is possible, in most cases. KVM just doesn't need to
because it can't fail. We could think about doing retries for
FALLOC_FL_PUNCH_HOLE, while checking for signals. Or adding an ENOMEM error code
to fallocate.
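
(Very roughly, and with a made-up helper name standing in for whatever demotes
the huge folio, such a retry might look like:)

	do {
		r = kvm_gmem_split_boundary(gmem, start, end);	/* placeholder name */
		if (r != -ENOMEM)
			break;
		if (fatal_signal_pending(current)) {
			r = -EINTR;
			break;
		}
		cond_resched();
	} while (1);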
Re: [RFC PATCH 18/21] KVM: x86: Split huge boundary leafs before private to shared conversion
Posted by Yan Zhao 7 months, 1 week ago
On Sat, May 10, 2025 at 07:34:39AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:08 +0800, Yan Zhao wrote:
> > Before converting a GFN range from private to shared, it is necessary to
> > zap the mirror page table. When huge pages are supported and the GFN range
> > intersects with a huge leaf, split the huge leaf to prevent zapping GFNs
> > outside the conversion range.
> 
> FALLOC_FL_PUNCH_HOLE demotion failure doesn't look like it is addressed in this
Hmm, FALLOC_FL_PUNCH_HOLE demotion failure is handled in patch 19.

> series. I noticed that mmu notifier failures are allowed to be handled by
> blocking until success is possible, in most cases. KVM just doesn't need to
> because it can't fail. We could think about doing retries for
> FALLOC_FL_PUNCH_HOLE, while checking for signals. Or adding a ENOMEM error code
> to fallocate.
In patch 19, FALLOC_FL_PUNCH_HOLE could return -ENOMEM.

Returning -ENOMEM may be inevitable as we can't endlessly retry. So for
simplicity, there's no retry in this series.


Besides that, do you think we need to conduct splitting before any unmap is
invoked?

As in the patch log:
"
The downside of this approach is that although kvm_split_boundary_leafs()
is invoked before kvm_unmap_gfn_range() for each GFN range, the entire
conversion range may consist of several GFN ranges. If an out-of-memory
error occurs during the splitting of a GFN range, some previous GFN ranges
may have been successfully split and zapped, even though their page
attributes remain unchanged due to the splitting failure. This may not be a
big problem as the user can retry the ioctl to split and zap the full
range.
"
Re: [RFC PATCH 18/21] KVM: x86: Split huge boundary leafs before private to shared conversion
Posted by Edgecombe, Rick P 7 months, 1 week ago
On Mon, 2025-05-12 at 10:25 +0800, Yan Zhao wrote:
> > FALLOC_FL_PUNCH_HOLE demotion failure doesn't look like it is addressed in
> > this
> Hmm, FALLOC_FL_PUNCH_HOLE demotion failure is handled in patch 19.

Oh, right you are.

> 
> > series. I noticed that mmu notifier failures are allowed to be handled by
> > blocking until success is possible, in most cases. KVM just doesn't need to
> > because it can't fail. We could think about doing retries for
> > FALLOC_FL_PUNCH_HOLE, while checking for signals. Or adding a ENOMEM error
> > code
> > to fallocate.
> In patch 19, FALLOC_FL_PUNCH_HOLE could return -ENOMEM.

Yes. It is not in the man pages, but looking this morning I see other fallocate
handlers are already returning it.

> 
> Returning -ENOMEM may be inevitable as we can't endlessly retry. So for
> simplicity, there's no retry in this series.

Ok, seems good.

> 
> 
> Besides that, do you think whether we need to conduct splittings before any
> unmap is invoked?
> 
> As in the patch log:
> "
> The downside of this approach is that although kvm_split_boundary_leafs()
> is invoked before kvm_unmap_gfn_range() for each GFN range, the entire
> conversion range may consist of several GFN ranges. If an out-of-memory
> error occurs during the splitting of a GFN range, some previous GFN ranges
> may have been successfully split and zapped, even though their page
> attributes remain unchanged due to the splitting failure. This may not be a
> big problem as the user can retry the ioctl to split and zap the full
> range.
> "

If we ended up plumbing the zapping errors all the way through the MMU, it
probably would be simpler to do it during the unmap. Of course callers would
have to be aware the range may be half unmapped on error. I think the way you
have it is nice for not churning the MMU though.

But for the case of having to retry the split and walking the mirror EPT range
again, it's a rare case and not worth optimizing for. Let's not consider it
much.