Improve TDX shutdown performance by adding a more efficient shutdown
operation at the cost of adding separate branches for the TDX MMU
operations for normal runtime and shutdown. This more efficient method was
previously used in earlier versions of the TDX patches, but was removed to
simplify the initial upstreaming. This is an RFC, and still needs a proper
upstream commit log. It is intended to be an eventual follow up to base
support.
== Background ==
TDX has 2 methods for the host to reclaim guest private memory, depending
on whether the TD (TDX VM) is in a runnable state or not. These are
called, respectively:
1. Dynamic Page Removal
2. Reclaiming TD Pages in TD_TEARDOWN State
Dynamic Page Removal is much slower. Reclaiming a 4K page in TD_TEARDOWN
state can be 5 times faster, although that is just one part of TD shutdown.
== Relevant TDX Architecture ==
Dynamic Page Removal is slow because it has to potentially deal with a
running TD, and so involves a number of steps:
Block further address translation
Exit each VCPU
Clear Secure EPT entry
Flush/write-back/invalidate relevant caches
Reclaiming TD Pages in TD_TEARDOWN State is fast because the shutdown
procedure (refer tdx_mmu_release_hkid()) has already performed the relevant
flushing. For details, see TDX Module Base Spec October 2024 sections:
7.5. TD Teardown Sequence
5.6.3. TD Keys Reclamation, TLB and Cache Flush
Essentially all that remains then is to take each page away from the
TDX Module and return it to the kernel.
== Problem ==
Currently, Dynamic Page Removal is being used when the TD is being
shutdown for the sake of having simpler initial code.
This happens when guest_memfds are closed, refer kvm_gmem_release().
guest_memfds hold a reference to struct kvm, so that VM destruction cannot
happen until after they are released, refer kvm_gmem_release().
Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
reclaim time. For example:
VCPUs Size (GB) Before (secs) After (secs)
4 18 72 24
32 107 517 134
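As a rough sanity check on those numbers: 107 GB goes from 517 s to 134 s,
i.e. from about 4.8 s/GB to about 1.3 s/GB, roughly a 3.9x reduction in
total shutdown time. The per-page reclaim itself is ~5x faster, but total
shutdown also includes work that this change does not affect.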
Note, the V19 patch set:
https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@intel.com/
did not have this issue because the HKID was released early, something that
Sean effectively NAK'ed:
"No, the right answer is to not release the HKID until the VM is
destroyed."
https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@google.com/
That was taken on board in the "TDX MMU Part 2" patch set. Refer
"Moving of tdx_mmu_release_hkid()" in:
https://lore.kernel.org/kvm/20240904030751.117579-1-rick.p.edgecombe@intel.com/
== Options ==
1. Start TD teardown earlier so that when pages are removed,
they can be reclaimed faster.
2. Defer page removal until after TD teardown has started.
3. A combination of 1 and 2.
Option 1 is problematic because it means putting the TD into a non-runnable
state while it is potentially still active. Also, as mentioned above, Sean
effectively NAK'ed it.
Option 2 is possible because the lifetime of guest memory pages is separate
from guest_memfd (struct kvm_gmem) lifetime.
A reference is taken to pages when they are faulted in, refer
kvm_gmem_get_pfn(). That reference is not released until the pages are
removed from the mirror SEPT, refer tdx_unpin().
Option 3 is not needed because TD teardown begins during VM destruction
before pages are reclaimed. TD_TEARDOWN state is entered by
tdx_mmu_release_hkid(), whereas pages are reclaimed by tdp_mmu_zap_root(),
as follows:
kvm_arch_destroy_vm()
...
vt_vm_pre_destroy()
tdx_mmu_release_hkid()
...
kvm_mmu_uninit_vm()
kvm_mmu_uninit_tdp_mmu()
kvm_tdp_mmu_invalidate_roots()
kvm_tdp_mmu_zap_invalidated_roots()
tdp_mmu_zap_root()
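With this RFC, once the HKID has been released, each page removal reduces to
reclaiming the page from the TDX Module. Simplified from the tdx.c hunk below
(illustration only; the 4K-level check and error handling are trimmed):

	if (!is_hkid_assigned(to_kvm_tdx(kvm))) {
		/* TD_TEARDOWN: caches already flushed, just reclaim the page */
		ret = tdx_reclaim_page(page);
		if (!ret)
			tdx_unpin(kvm, page);
		return ret;
	}
	/* TD still runnable: fall back to Dynamic Page Removal */
	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);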
== Proof of Concept for option 2 ==
Assume user space never needs to close a guest_memfd except as part of VM
shutdown.
Add a callback from kvm_gmem_release() to decide whether to defer removal.
For TDX, record the inode (up to a max. of 64 inodes) and pin it.
Amend the release of guest_memfds to skip removing pages from the MMU
in that case.
Amend TDX private memory page removal to detect TD_TEARDOWN state, and
reclaim the page accordingly.
For TDX, finally unpin any pinned inodes.
This hopefully illustrates what needs to be done, but guidance is sought
for the best way to do it.
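For orientation, the defer decision is plumbed from guest_memfd down to TDX
through the following call chain (refer the hunks below):

	kvm_gmem_release()
	  kvm_gmem_defer_removal()
	    kvm_arch_gmem_defer_removal()          [x86]
	      kvm_x86_call(gmem_defer_removal)
	        vt_gmem_defer_removal()
	          tdx_gmem_defer_removal()         pin the inode, skip zapping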
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 3 ++
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/vmx/main.c | 12 +++++++-
arch/x86/kvm/vmx/tdx.c | 47 +++++++++++++++++++++++++-----
arch/x86/kvm/vmx/tdx.h | 14 +++++++++
arch/x86/kvm/vmx/x86_ops.h | 2 ++
arch/x86/kvm/x86.c | 7 +++++
include/linux/kvm_host.h | 5 ++++
virt/kvm/Kconfig | 4 +++
virt/kvm/guest_memfd.c | 26 ++++++++++++-----
11 files changed, 107 insertions(+), 15 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 79406bf07a1c..e4728f1fe646 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -148,6 +148,7 @@ KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
KVM_X86_OP_OPTIONAL_RET0(private_max_mapping_level)
KVM_X86_OP_OPTIONAL(gmem_invalidate)
+KVM_X86_OP_OPTIONAL_RET0(gmem_defer_removal)
#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9b9dde476f3c..d1afb4e1c2ee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1661,6 +1661,8 @@ static inline u16 kvm_lapic_irq_dest_mode(bool dest_mode_logical)
return dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
}
+struct inode;
+
struct kvm_x86_ops {
const char *name;
@@ -1888,6 +1890,7 @@ struct kvm_x86_ops {
int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
int (*private_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn);
+ int (*gmem_defer_removal)(struct kvm *kvm, struct inode *inode);
};
struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 0d445a317f61..32c4b9922e7b 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -96,6 +96,7 @@ config KVM_INTEL
depends on KVM && IA32_FEAT_CTL
select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST
select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
+ select HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL if INTEL_TDX_HOST
help
Provides support for KVM on processors equipped with Intel's VT
extensions, a.k.a. Virtual Machine Extensions (VMX).
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 94d5d907d37b..b835006e1282 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -888,6 +888,14 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
return 0;
}
+static int vt_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
+{
+ if (is_td(kvm))
+ return tdx_gmem_defer_removal(kvm, inode);
+
+ return 0;
+}
+
#define VMX_REQUIRED_APICV_INHIBITS \
(BIT(APICV_INHIBIT_REASON_DISABLED) | \
BIT(APICV_INHIBIT_REASON_ABSENT) | \
@@ -1046,7 +1054,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.mem_enc_ioctl = vt_mem_enc_ioctl,
.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
- .private_max_mapping_level = vt_gmem_private_max_mapping_level
+ .private_max_mapping_level = vt_gmem_private_max_mapping_level,
+
+ .gmem_defer_removal = vt_gmem_defer_removal,
};
struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d9eb20516c71..51bbb44ac1bd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,6 +5,7 @@
#include <asm/fpu/xcr.h>
#include <linux/misc_cgroup.h>
#include <linux/mmu_context.h>
+#include <linux/fs.h>
#include <asm/tdx.h>
#include "capabilities.h"
#include "mmu.h"
@@ -594,10 +595,20 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
kvm_tdx->td.tdr_page = NULL;
}
+static void tdx_unpin_inodes(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+ for (int i = 0; i < kvm_tdx->nr_gmem_inodes; i++)
+ iput(kvm_tdx->gmem_inodes[i]);
+}
+
void tdx_vm_destroy(struct kvm *kvm)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ tdx_unpin_inodes(kvm);
+
tdx_reclaim_td_control_pages(kvm);
kvm_tdx->state = TD_STATE_UNINITIALIZED;
@@ -1778,19 +1789,28 @@ int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
return tdx_reclaim_page(virt_to_page(private_spt));
}
+static int tdx_sept_teardown_private_spte(struct kvm *kvm, enum pg_level level, struct page *page)
+{
+ int ret;
+
+ if (level != PG_LEVEL_4K)
+ return -EINVAL;
+
+ ret = tdx_reclaim_page(page);
+ if (!ret)
+ tdx_unpin(kvm, page);
+
+ return ret;
+}
+
int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)
{
struct page *page = pfn_to_page(pfn);
int ret;
- /*
- * HKID is released after all private pages have been removed, and set
- * before any might be populated. Warn if zapping is attempted when
- * there can't be anything populated in the private EPT.
- */
- if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
- return -EINVAL;
+ if (!is_hkid_assigned(to_kvm_tdx(kvm)))
+ return tdx_sept_teardown_private_spte(kvm, level, pfn_to_page(pfn));
ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
if (ret <= 0)
@@ -3221,6 +3241,19 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
return PG_LEVEL_4K;
}
+int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+ if (kvm_tdx->nr_gmem_inodes >= TDX_MAX_GMEM_INODES)
+ return 0;
+
+ kvm_tdx->gmem_inodes[kvm_tdx->nr_gmem_inodes++] = inode;
+ ihold(inode);
+
+ return 1;
+}
+
static int tdx_online_cpu(unsigned int cpu)
{
unsigned long flags;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 6b3bebebabfa..fb5c4face131 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -20,6 +20,10 @@ enum kvm_tdx_state {
TD_STATE_RUNNABLE,
};
+struct inode;
+
+#define TDX_MAX_GMEM_INODES 64
+
struct kvm_tdx {
struct kvm kvm;
@@ -43,6 +47,16 @@ struct kvm_tdx {
* Set/unset is protected with kvm->mmu_lock.
*/
bool wait_for_sept_zap;
+
+ /*
+ * For pages that will not be removed until TD shutdown, the associated
+ * guest_memfd inode is pinned. Allow for a fixed number of pinned
+ * inodes. If there are more, then when the guest_memfd is closed,
+ * their pages will be removed safely but inefficiently prior to
+ * shutdown.
+ */
+ struct inode *gmem_inodes[TDX_MAX_GMEM_INODES];
+ int nr_gmem_inodes;
};
/* TDX module vCPU states */
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 4704bed033b1..4ee123289d85 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -164,6 +164,7 @@ void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
+int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode);
#else
static inline void tdx_disable_virtualization_cpu(void) {}
static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
@@ -229,6 +230,7 @@ static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
static inline void tdx_flush_tlb_all(struct kvm_vcpu *vcpu) {}
static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
static inline int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn) { return 0; }
+static inline int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode) { return 0; }
#endif
#endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 03db366e794a..96ebf0303223 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13659,6 +13659,13 @@ void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
}
#endif
+#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
+bool kvm_arch_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
+{
+ return kvm_x86_call(gmem_defer_removal)(kvm, inode);
+}
+#endif
+
int kvm_spec_ctrl_test_value(u64 value)
{
/*
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d72ba0a4ca0e..f5c8b1923c24 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2534,6 +2534,11 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
#endif
+#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
+struct inode;
+bool kvm_arch_gmem_defer_removal(struct kvm *kvm, struct inode *inode);
+#endif
+
#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
/**
* kvm_gmem_populate() - Populate/prepare a GPA range with guest data
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..589111505ad0 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
config HAVE_KVM_ARCH_GMEM_INVALIDATE
bool
depends on KVM_PRIVATE_MEM
+
+config HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
+ bool
+ depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index b2aa6bf24d3a..cd485f45fdaf 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -248,6 +248,15 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
return ret;
}
+inline bool kvm_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
+{
+#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
+ return kvm_arch_gmem_defer_removal(kvm, inode);
+#else
+ return false;
+#endif
+}
+
static int kvm_gmem_release(struct inode *inode, struct file *file)
{
struct kvm_gmem *gmem = file->private_data;
@@ -275,13 +284,16 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
xa_for_each(&gmem->bindings, index, slot)
WRITE_ONCE(slot->gmem.file, NULL);
- /*
- * All in-flight operations are gone and new bindings can be created.
- * Zap all SPTEs pointed at by this file. Do not free the backing
- * memory, as its lifetime is associated with the inode, not the file.
- */
- kvm_gmem_invalidate_begin(gmem, 0, -1ul);
- kvm_gmem_invalidate_end(gmem, 0, -1ul);
+ if (!kvm_gmem_defer_removal(kvm, inode)) {
+ /*
+ * All in-flight operations are gone and new bindings can be
+ * created. Zap all SPTEs pointed at by this file. Do not free
+ * the backing memory, as its lifetime is associated with the
+ * inode, not the file.
+ */
+ kvm_gmem_invalidate_begin(gmem, 0, -1ul);
+ kvm_gmem_invalidate_end(gmem, 0, -1ul);
+ }
list_del(&gmem->entry);
--
2.43.0
On Thu, Mar 13, 2025, Adrian Hunter wrote:
> Improve TDX shutdown performance by adding a more efficient shutdown
> operation at the cost of adding separate branches for the TDX MMU
> operations for normal runtime and shutdown. This more efficient method was
> previously used in earlier versions of the TDX patches, but was removed to
> simplify the initial upstreaming. This is an RFC, and still needs a proper
> upstream commit log. It is intended to be an eventual follow up to base
> support.
...
> == Options ==
>
> 1. Start TD teardown earlier so that when pages are removed,
> they can be reclaimed faster.
> 2. Defer page removal until after TD teardown has started.
> 3. A combination of 1 and 2.
>
> Option 1 is problematic because it means putting the TD into a non-runnable
> state while it is potentially still active. Also, as mentioned above, Sean
> effectively NAK'ed it.
Option 2 is just as gross, arguably even worse. I NAK'd a flavor of option 1,
not the base concept of initiating teardown before all references to the VM are
put.
AFAICT, nothing outright prevents adding a TDX sub-ioctl to terminate the VM.
The locking is a bit heinous, but I would prefer heavy locking to deferring
reclaim and pinning inodes.
Oh FFS. This is also an opportunity to cleanup RISC-V's insidious copy-paste of
ARM. Because extracting (un)lock_all_vcpus() to common code would have been sooo
hard. *sigh*
Very roughly, something like the below (*completely* untested).
An alternative to taking mmu_lock would be to lock all bound guest_memfds, but I
think I prefer taking mmu_lock as it's easier to reason about the safety of freeing
the HKID. Note, the truncation phase of a PUNCH_HOLE could still run in parallel,
but that's a-ok. The only part of PUNCH_HOLE that needs to be blocked is the call
to kvm_mmu_unmap_gfn_range().
---
arch/x86/kvm/vmx/tdx.c | 61 ++++++++++++++++++++++++++++++------------
1 file changed, 44 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 87f188021cbd..6fb595c272ab 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -472,7 +472,7 @@ static void smp_func_do_phymem_cache_wb(void *unused)
pr_tdx_error(TDH_PHYMEM_CACHE_WB, err);
}
-void tdx_mmu_release_hkid(struct kvm *kvm)
+static void __tdx_release_hkid(struct kvm *kvm, bool terminate)
{
bool packages_allocated, targets_allocated;
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -485,10 +485,11 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
if (!is_hkid_assigned(kvm_tdx))
return;
+ if (KVM_BUG_ON(refcount_read(&kvm->users_count) && !terminate))
+ return;
+
packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
- cpus_read_lock();
-
kvm_for_each_vcpu(j, vcpu, kvm)
tdx_flush_vp_on_cpu(vcpu);
@@ -500,12 +501,8 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
*/
mutex_lock(&tdx_lock);
- /*
- * Releasing HKID is in vm_destroy().
- * After the above flushing vps, there should be no more vCPU
- * associations, as all vCPU fds have been released at this stage.
- */
err = tdh_mng_vpflushdone(&kvm_tdx->td);
+ /* Uh, what's going on here? */
if (err == TDX_FLUSHVP_NOT_DONE)
goto out;
if (KVM_BUG_ON(err, kvm)) {
@@ -515,6 +512,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
goto out;
}
+ write_lock(&kvm->mmu_lock);
for_each_online_cpu(i) {
if (packages_allocated &&
cpumask_test_and_set_cpu(topology_physical_package_id(i),
@@ -539,14 +537,21 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
} else {
tdx_hkid_free(kvm_tdx);
}
-
+ write_unlock(&kvm->mmu_lock);
out:
mutex_unlock(&tdx_lock);
- cpus_read_unlock();
free_cpumask_var(targets);
free_cpumask_var(packages);
}
+void tdx_mmu_release_hkid(struct kvm *kvm)
+{
+ cpus_read_lock();
+ __tdx_release_hkid(kvm, false);
+ cpus_read_unlock();
+}
+
+
static void tdx_reclaim_td_control_pages(struct kvm *kvm)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -1789,13 +1794,10 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
struct page *page = pfn_to_page(pfn);
int ret;
- /*
- * HKID is released after all private pages have been removed, and set
- * before any might be populated. Warn if zapping is attempted when
- * there can't be anything populated in the private EPT.
- */
- if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
- return -EINVAL;
+ if (!is_hkid_assigned(to_kvm_tdx(kvm)), kvm) {
+ WARN_ON_ONCE(!kvm->vm_dead);
+ return tdx_reclaim_page(pfn_to_page(pfn));
+ }
ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
if (ret <= 0)
@@ -2790,6 +2792,28 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
return 0;
}
+static int tdx_td_terminate(struct kvm *kvm)
+{
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ int bkt;
+
+ cpus_read_lock();
+ guard(mutex)(&kvm->lock);
+
+ r = kvm_lock_all_vcpus();
+ if (r)
+ goto out;
+
+ kvm_vm_dead(kvm);
+ kvm_unlock_all_vcpus();
+
+ __tdx_release_hkid(kvm);
+out:
+ cpus_read_unlock();
+ return r;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -2805,6 +2829,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
if (tdx_cmd.hw_error)
return -EINVAL;
+ if (tdx_cmd.id == KVM_TDX_TERMINATE_VM)
+ return tdx_td_terminate(kvm);
+
mutex_lock(&kvm->lock);
switch (tdx_cmd.id) {
base-commit: 2156c3c7d60c5be9c0d9ab1fedccffe3c55a2ca0
--
On 27/03/25 17:54, Sean Christopherson wrote:
> On Thu, Mar 13, 2025, Adrian Hunter wrote:
>> Improve TDX shutdown performance by adding a more efficient shutdown
>> operation at the cost of adding separate branches for the TDX MMU
>> operations for normal runtime and shutdown. This more efficient method was
>> previously used in earlier versions of the TDX patches, but was removed to
>> simplify the initial upstreaming. This is an RFC, and still needs a proper
>> upstream commit log. It is intended to be an eventual follow up to base
>> support.
>
> ...
>
>> == Options ==
>>
>> 1. Start TD teardown earlier so that when pages are removed,
>> they can be reclaimed faster.
>> 2. Defer page removal until after TD teardown has started.
>> 3. A combination of 1 and 2.
>>
>> Option 1 is problematic because it means putting the TD into a non-runnable
>> state while it is potentially still active. Also, as mentioned above, Sean
>> effectively NAK'ed it.
>
> Option 2 is just as gross, arguably even worse. I NAK'd a flavor of option 1,
> not the base concept of initiating teardown before all references to the VM are
> put.
>
> AFAICT, nothing outright prevents adding a TDX sub-ioctl to terminate the VM.
> The locking is a bit heinous, but I would prefer heavy locking to deferring
> reclaim and pinning inodes.
>
> Oh FFS. This is also an opportunity to cleanup RISC-V's insidious copy-paste of
> ARM. Because extracting (un)lock_all_vcpus() to common code would have been sooo
> hard. *sigh*
>
> Very roughly, something like the below (*completely* untested).
Thank you! I will give it a test.
>
> An alternative to taking mmu_lock would be to lock all bound guest_memfds, but I
> think I prefer taking mmu_lock as it's easier to reason about the safety of freeing
> the HKID. Note, the truncation phase of a PUNCH_HOLE could still run in parallel,
> but that's a-ok. The only part of PUNCH_HOLE that needs to be blocked is the call
> to kvm_mmu_unmap_gfn_range().
>
> ---
> arch/x86/kvm/vmx/tdx.c | 61 ++++++++++++++++++++++++++++++------------
> 1 file changed, 44 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 87f188021cbd..6fb595c272ab 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -472,7 +472,7 @@ static void smp_func_do_phymem_cache_wb(void *unused)
> pr_tdx_error(TDH_PHYMEM_CACHE_WB, err);
> }
>
> -void tdx_mmu_release_hkid(struct kvm *kvm)
> +static void __tdx_release_hkid(struct kvm *kvm, bool terminate)
> {
> bool packages_allocated, targets_allocated;
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> @@ -485,10 +485,11 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> if (!is_hkid_assigned(kvm_tdx))
> return;
>
> + if (KVM_BUG_ON(refcount_read(&kvm->users_count) && !terminate))
> + return;
> +
> packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
> - cpus_read_lock();
> -
> kvm_for_each_vcpu(j, vcpu, kvm)
> tdx_flush_vp_on_cpu(vcpu);
>
> @@ -500,12 +501,8 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> */
> mutex_lock(&tdx_lock);
>
> - /*
> - * Releasing HKID is in vm_destroy().
> - * After the above flushing vps, there should be no more vCPU
> - * associations, as all vCPU fds have been released at this stage.
> - */
> err = tdh_mng_vpflushdone(&kvm_tdx->td);
> + /* Uh, what's going on here? */
> if (err == TDX_FLUSHVP_NOT_DONE)
> goto out;
Looks like it should not be needed.
First introduced here:
[PATCH v16 071/116] KVM: TDX: handle vcpu migration over logical processor
https://lore.kernel.org/af6cb5e6abae5a19d6d2eeb452d29a96233e5a44.1697471314.git.isaku.yamahata@intel.com
Explained a bit
Re: [PATCH v17 071/116] KVM: TDX: handle vcpu migration over logical processor
https://lore.kernel.org/20231117080804.GF1277973@ls.amr.corp.intel.com
And explained a bit more:
Re: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical processor
https://lore.kernel.org/all/20240415224828.GS3039520@ls.amr.corp.intel.com/
> if (KVM_BUG_ON(err, kvm)) {
> @@ -515,6 +512,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> goto out;
> }
>
> + write_lock(&kvm->mmu_lock);
> for_each_online_cpu(i) {
> if (packages_allocated &&
> cpumask_test_and_set_cpu(topology_physical_package_id(i),
> @@ -539,14 +537,21 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> } else {
> tdx_hkid_free(kvm_tdx);
> }
> -
> + write_unlock(&kvm->mmu_lock);
> out:
> mutex_unlock(&tdx_lock);
> - cpus_read_unlock();
> free_cpumask_var(targets);
> free_cpumask_var(packages);
> }
>
> +void tdx_mmu_release_hkid(struct kvm *kvm)
> +{
> + cpus_read_lock();
> + __tdx_release_hkid(kvm, false);
> + cpus_read_unlock();
> +}
> +
> +
> static void tdx_reclaim_td_control_pages(struct kvm *kvm)
> {
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> @@ -1789,13 +1794,10 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> struct page *page = pfn_to_page(pfn);
> int ret;
>
> - /*
> - * HKID is released after all private pages have been removed, and set
> - * before any might be populated. Warn if zapping is attempted when
> - * there can't be anything populated in the private EPT.
> - */
> - if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
> - return -EINVAL;
> + if (!is_hkid_assigned(to_kvm_tdx(kvm)), kvm) {
> + WARN_ON_ONCE(!kvm->vm_dead);
> + return tdx_reclaim_page(pfn_to_page(pfn));
> + }
>
> ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> if (ret <= 0)
> @@ -2790,6 +2792,28 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> return 0;
> }
>
> +static int tdx_td_terminate(struct kvm *kvm)
> +{
> + struct kvm_memory_slot *slot;
> + struct kvm_memslots *slots;
> + int bkt;
> +
> + cpus_read_lock();
> + guard(mutex)(&kvm->lock);
> +
> + r = kvm_lock_all_vcpus();
> + if (r)
> + goto out;
> +
> + kvm_vm_dead(kvm);
> + kvm_unlock_all_vcpus();
> +
> + __tdx_release_hkid(kvm);
> +out:
> + cpus_read_unlock();
> + return r;
> +}
> +
> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_tdx_cmd tdx_cmd;
> @@ -2805,6 +2829,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> if (tdx_cmd.hw_error)
> return -EINVAL;
>
> + if (tdx_cmd.id == KVM_TDX_TERMINATE_VM)
> + return tdx_td_terminate(kvm);
> +
> mutex_lock(&kvm->lock);
>
> switch (tdx_cmd.id) {
>
> base-commit: 2156c3c7d60c5be9c0d9ab1fedccffe3c55a2ca0
On Thu, Mar 13, 2025 at 11:17 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
> ...
> == Problem ==
>
> Currently, Dynamic Page Removal is being used when the TD is being
> shutdown for the sake of having simpler initial code.
>
> This happens when guest_memfds are closed, refer kvm_gmem_release().
> guest_memfds hold a reference to struct kvm, so that VM destruction cannot
> happen until after they are released, refer kvm_gmem_release().
>
> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
> reclaim time. For example:
>
> VCPUs Size (GB) Before (secs) After (secs)
> 4 18 72 24
> 32 107 517 134
If the time for reclaim grows linearly with memory size, then this is
a significantly high value for TD cleanup (~21 minutes for a 1TB VM).
>
> Note, the V19 patch set:
>
> https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@intel.com/
>
> did not have this issue because the HKID was released early, something that
> Sean effectively NAK'ed:
>
> "No, the right answer is to not release the HKID until the VM is
> destroyed."
>
> https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@google.com/
IIUC, Sean is suggesting to treat S-EPT page removal and page reclaim
separately. Through his proposal:
1) If userspace drops last reference on gmem inode before/after
dropping the VM reference
-> slow S-EPT removal and slow page reclaim
2) If memslots are removed before closing the gmem and dropping the VM reference
-> slow S-EPT page removal and no page reclaim until the gmem is around.
Reclaim should ideally happen when the host wants to use that memory
i.e. for following scenarios:
1) Truncation of private guest_memfd ranges
2) Conversion of private guest_memfd ranges to shared when supporting
in-place conversion (Could be deferred to the faulting in as shared as
well).
Would it be possible for you to provide the split of the time spent in
slow S-EPT page removal vs page reclaim?
It might be worth exploring the possibility of parallelizing or giving
userspace the flexibility to parallelize both these operations to
bring the cleanup time down (to be comparable with non-confidential VM
cleanup time for example).
On 27/03/25 10:14, Vishal Annapurve wrote:
> On Thu, Mar 13, 2025 at 11:17 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>> ...
>> == Problem ==
>>
>> Currently, Dynamic Page Removal is being used when the TD is being
>> shutdown for the sake of having simpler initial code.
>>
>> This happens when guest_memfds are closed, refer kvm_gmem_release().
>> guest_memfds hold a reference to struct kvm, so that VM destruction cannot
>> happen until after they are released, refer kvm_gmem_release().
>>
>> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
>> reclaim time. For example:
>>
>> VCPUs Size (GB) Before (secs) After (secs)
>> 4 18 72 24
>> 32 107 517 134
>
> If the time for reclaim grows linearly with memory size, then this is
> a significantly high value for TD cleanup (~21 minutes for a 1TB VM).
>
>>
>> Note, the V19 patch set:
>>
>> https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@intel.com/
>>
>> did not have this issue because the HKID was released early, something that
>> Sean effectively NAK'ed:
>>
>> "No, the right answer is to not release the HKID until the VM is
>> destroyed."
>>
>> https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@google.com/
>
> IIUC, Sean is suggesting to treat S-EPT page removal and page reclaim
> separately. Through his proposal:

Thanks for looking at this!

It seems I am using the term "reclaim" wrongly. Sorry! I am talking about
taking private memory away from the guest, not what happens to it
subsequently.

When the TDX VM is in "Runnable" state, taking private memory away is slow
(slow S-EPT removal).

When the TDX VM is in "Teardown" state, taking private memory away is
faster (a TDX SEAMCALL named TDH.PHYMEM.PAGE.RECLAIM, which is where I
picked up the term "reclaim").

Once guest memory is removed from S-EPT, further action is not needed to
reclaim it. It belongs to KVM at that point.

guest_memfd memory can be added directly to S-EPT. No intermediate state
or step is used.

Any guest_memfd memory not given to the MMU (S-EPT) can be freed directly
if userspace/KVM wants to. Again there is no intermediate state or
(reclaim) step.

> 1) If userspace drops last reference on gmem inode before/after
> dropping the VM reference
> -> slow S-EPT removal and slow page reclaim

Currently slow S-EPT removal happens when the file is released.

> 2) If memslots are removed before closing the gmem and dropping the VM reference
> -> slow S-EPT page removal and no page reclaim until the gmem is around.
>
> Reclaim should ideally happen when the host wants to use that memory
> i.e. for following scenarios:
> 1) Truncation of private guest_memfd ranges
> 2) Conversion of private guest_memfd ranges to shared when supporting
> in-place conversion (Could be deferred to the faulting in as shared as
> well).
>
> Would it be possible for you to provide the split of the time spent in
> slow S-EPT page removal vs page reclaim?

Based on what I wrote above, all the time is spent removing pages from
S-EPT. Greater than 99% of shutdown time is kvm_gmem_release().

>
> It might be worth exploring the possibility of parallelizing or giving
> userspace the flexibility to parallelize both these operations to
> bring the cleanup time down (to be comparable with non-confidential VM
> cleanup time for example).
On Thu, Mar 13, 2025 at 11:17 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> Improve TDX shutdown performance by adding a more efficient shutdown
> operation at the cost of adding separate branches for the TDX MMU
> operations for normal runtime and shutdown. This more efficient method was
> previously used in earlier versions of the TDX patches, but was removed to
> simplify the initial upstreaming. This is an RFC, and still needs a proper
> upstream commit log. It is intended to be an eventual follow up to base
> support.
>
...
>
> Option 2 is possible because the lifetime of guest memory pages is separate
> from guest_memfd (struct kvm_gmem) lifetime.
>
> A reference is taken to pages when they are faulted in, refer
> kvm_gmem_get_pfn(). That reference is not released until the pages are
> removed from the mirror SEPT, refer tdx_unpin().

Note that this logic to pin guest memory pages should go away to support
in-place conversion for huge pages [1], though you can still pin inodes
IIUC.

[1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
On 20/03/25 02:42, Vishal Annapurve wrote:
> On Thu, Mar 13, 2025 at 11:17 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>
>> Improve TDX shutdown performance by adding a more efficient shutdown
>> operation at the cost of adding separate branches for the TDX MMU
>> operations for normal runtime and shutdown. This more efficient method was
>> previously used in earlier versions of the TDX patches, but was removed to
>> simplify the initial upstreaming. This is an RFC, and still needs a proper
>> upstream commit log. It is intended to be an eventual follow up to base
>> support.
>>
>> == Background ==
>>
>> TDX has 2 methods for the host to reclaim guest private memory, depending
>> on whether the TD (TDX VM) is in a runnable state or not. These are
>> called, respectively:
>> 1. Dynamic Page Removal
>> 2. Reclaiming TD Pages in TD_TEARDOWN State
>>
>> Dynamic Page Removal is much slower. Reclaiming a 4K page in TD_TEARDOWN
>> state can be 5 times faster, although that is just one part of TD shutdown.
>>
>> == Relevant TDX Architecture ==
>>
>> Dynamic Page Removal is slow because it has to potentially deal with a
>> running TD, and so involves a number of steps:
>> Block further address translation
>> Exit each VCPU
>> Clear Secure EPT entry
>> Flush/write-back/invalidate relevant caches
>>
>> Reclaiming TD Pages in TD_TEARDOWN State is fast because the shutdown
>> procedure (refer tdx_mmu_release_hkid()) has already performed the relevant
>> flushing. For details, see TDX Module Base Spec October 2024 sections:
>>
>> 7.5. TD Teardown Sequence
>> 5.6.3. TD Keys Reclamation, TLB and Cache Flush
>>
>> Essentially all that remains then is to take each page away from the
>> TDX Module and return it to the kernel.
>>
>> == Problem ==
>>
>> Currently, Dynamic Page Removal is being used when the TD is being
>> shutdown for the sake of having simpler initial code.
>>
>> This happens when guest_memfds are closed, refer kvm_gmem_release().
>> guest_memfds hold a reference to struct kvm, so that VM destruction cannot
>> happen until after they are released, refer kvm_gmem_release().
>>
>> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
>> reclaim time. For example:
>>
>> VCPUs Size (GB) Before (secs) After (secs)
>> 4 18 72 24
>> 32 107 517 134
>>
>> Note, the V19 patch set:
>>
>> https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@intel.com/
>>
>> did not have this issue because the HKID was released early, something that
>> Sean effectively NAK'ed:
>>
>> "No, the right answer is to not release the HKID until the VM is
>> destroyed."
>>
>> https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@google.com/
>>
>> That was taken on board in the "TDX MMU Part 2" patch set. Refer
>> "Moving of tdx_mmu_release_hkid()" in:
>>
>> https://lore.kernel.org/kvm/20240904030751.117579-1-rick.p.edgecombe@intel.com/
>>
>> == Options ==
>>
>> 1. Start TD teardown earlier so that when pages are removed,
>> they can be reclaimed faster.
>> 2. Defer page removal until after TD teardown has started.
>> 3. A combination of 1 and 2.
>>
>> Option 1 is problematic because it means putting the TD into a non-runnable
>> state while it is potentially still active. Also, as mentioned above, Sean
>> effectively NAK'ed it.
>>
>> Option 2 is possible because the lifetime of guest memory pages is separate
>> from guest_memfd (struct kvm_gmem) lifetime.
>>
>> A reference is taken to pages when they are faulted in, refer
>> kvm_gmem_get_pfn(). That reference is not released until the pages are
>> removed from the mirror SEPT, refer tdx_unpin().
>
> Note that this logic to pin guest memory pages should go away to
> support in-place conversion for huge pages [1], though you can still
> pin inodes IIUC.
>
> [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
Then the virt code should handle the pinning since it becomes
essential.
inode->i_mapping->i_private_list can be used because we only need to
pin inodes that have no gmem references. Although it is using it as
a list entry and elsewhere it is used as a list head.
So perhaps like this:
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 0d445a317f61..90e7354d6f40 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -96,6 +96,7 @@ config KVM_INTEL
depends on KVM && IA32_FEAT_CTL
select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST
select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
+ select HAVE_KVM_GMEM_DEFER_REMOVAL if INTEL_TDX_HOST
help
Provides support for KVM on processors equipped with Intel's VT
extensions, a.k.a. Virtual Machine Extensions (VMX).
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1f354b3dbc28..acb022291aec 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,6 +5,7 @@
#include <asm/fpu/xcr.h>
#include <linux/misc_cgroup.h>
#include <linux/mmu_context.h>
+#include <linux/fs.h>
#include <asm/tdx.h>
#include "capabilities.h"
#include "mmu.h"
@@ -626,6 +627,7 @@ int tdx_vm_init(struct kvm *kvm)
kvm->arch.has_protected_state = true;
kvm->arch.has_private_mem = true;
kvm->arch.disabled_quirks |= KVM_X86_QUIRK_IGNORE_GUEST_PAT;
+ kvm->gmem_defer_removal = true;
/*
* Because guest TD is protected, VMM can't parse the instruction in TD.
@@ -1839,19 +1841,28 @@ int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
return tdx_reclaim_page(virt_to_page(private_spt));
}
+static int tdx_sept_teardown_private_spte(struct kvm *kvm, enum pg_level level, struct page *page)
+{
+ int ret;
+
+ if (level != PG_LEVEL_4K)
+ return -EINVAL;
+
+ ret = tdx_reclaim_page(page);
+ if (!ret)
+ tdx_unpin(kvm, page);
+
+ return ret;
+}
+
int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)
{
struct page *page = pfn_to_page(pfn);
int ret;
- /*
- * HKID is released after all private pages have been removed, and set
- * before any might be populated. Warn if zapping is attempted when
- * there can't be anything populated in the private EPT.
- */
- if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
- return -EINVAL;
+ if (!is_hkid_assigned(to_kvm_tdx(kvm)))
+ return tdx_sept_teardown_private_spte(kvm, level, pfn_to_page(pfn));
ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
if (ret <= 0)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ed1968f6f841..2630b88b0983 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -860,6 +860,10 @@ struct kvm {
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
/* Protected by slots_locks (for writes) and RCU (for reads) */
struct xarray mem_attr_array;
+#endif
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+ struct list_head gmem_pinned_inodes;
+ bool gmem_defer_removal;
#endif
char stats_id[KVM_STATS_NAME_SIZE];
};
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..f56a7e89116d 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
config HAVE_KVM_ARCH_GMEM_INVALIDATE
bool
depends on KVM_PRIVATE_MEM
+
+config HAVE_KVM_GMEM_DEFER_REMOVAL
+ bool
+ depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index b2aa6bf24d3a..a673da75a499 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -248,6 +248,36 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
return ret;
}
+static bool kvm_gmem_defer_removal(struct kvm *kvm, struct kvm_gmem *gmem,
+ struct inode *inode)
+{
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+ if (kvm->gmem_defer_removal) {
+ list_del(&gmem->entry);
+ if (list_empty(&inode->i_mapping->i_private_list)) {
+ list_add_tail(&inode->i_mapping->i_private_list,
+ &kvm->gmem_pinned_inodes);
+ ihold(inode);
+ }
+ return true;
+ }
+#endif
+ return false;
+}
+
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+void kvm_gmem_unpin_inodes(struct kvm *kvm)
+{
+ struct address_space *mapping, *n;
+
+ list_for_each_entry_safe(mapping, n, &kvm->gmem_pinned_inodes,
+ i_private_list) {
+ list_del_init(&mapping->i_private_list);
+ iput(mapping->host);
+ }
+}
+#endif
+
static int kvm_gmem_release(struct inode *inode, struct file *file)
{
struct kvm_gmem *gmem = file->private_data;
@@ -275,15 +305,18 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
xa_for_each(&gmem->bindings, index, slot)
WRITE_ONCE(slot->gmem.file, NULL);
- /*
- * All in-flight operations are gone and new bindings can be created.
- * Zap all SPTEs pointed at by this file. Do not free the backing
- * memory, as its lifetime is associated with the inode, not the file.
- */
- kvm_gmem_invalidate_begin(gmem, 0, -1ul);
- kvm_gmem_invalidate_end(gmem, 0, -1ul);
+ if (!kvm_gmem_defer_removal(kvm, gmem, inode)) {
+ /*
+ * All in-flight operations are gone and new bindings can be
+ * created. Zap all SPTEs pointed at by this file. Do not free
+ * the backing memory, as its lifetime is associated with the
+ * inode, not the file.
+ */
+ kvm_gmem_invalidate_begin(gmem, 0, -1ul);
+ kvm_gmem_invalidate_end(gmem, 0, -1ul);
- list_del(&gmem->entry);
+ list_del(&gmem->entry);
+ }
filemap_invalidate_unlock(inode->i_mapping);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 08f237bf4107..5c15828b86fb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1195,6 +1195,10 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
preempt_notifier_inc();
kvm_init_pm_notifier(kvm);
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+ INIT_LIST_HEAD(&kvm->gmem_pinned_inodes);
+#endif
+
return kvm;
out_err_no_debugfs:
@@ -1299,6 +1303,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
cleanup_srcu_struct(&kvm->srcu);
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
xa_destroy(&kvm->mem_attr_array);
+#endif
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+ kvm_gmem_unpin_inodes(kvm);
#endif
kvm_arch_free_vm(kvm);
preempt_notifier_dec();
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index acef3f5c582a..59562a355488 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -73,6 +73,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args);
int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
unsigned int fd, loff_t offset);
void kvm_gmem_unbind(struct kvm_memory_slot *slot);
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+void kvm_gmem_unpin_inodes(struct kvm *kvm);
+#endif // HAVE_KVM_GMEM_DEFER_REMOVAL
#else
static inline void kvm_gmem_init(struct module *module)
{
On Thu, Mar 13, 2025 at 08:16:29PM +0200, Adrian Hunter wrote:
> @@ -3221,6 +3241,19 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> return PG_LEVEL_4K;
> }
>
> +int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> + if (kvm_tdx->nr_gmem_inodes >= TDX_MAX_GMEM_INODES)
> + return 0;
We have graceful way to handle this, but should we pr_warn_once() or
something if we ever hit this limit?
Hm. It is also a bit odd that we need to wait until removal to add a link
to guest_memfd inode from struct kvm/kvm_tdx. Can we do it right away in
__kvm_gmem_create()?
Do I read correctly that inode->i_mapping->i_private_list only ever has
single entry of the gmem? Seems wasteful.
Maybe move it to i_private (I don't see flags being used anywhere) and
re-use the list_head to link all inodes of the struct kvm?
No need in the gmem_inodes array.
>
--
Kiryl Shutsemau / Kirill A. Shutemov
On 17/03/25 10:13, Kirill A. Shutemov wrote:
> On Thu, Mar 13, 2025 at 08:16:29PM +0200, Adrian Hunter wrote:
>> @@ -3221,6 +3241,19 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
>> return PG_LEVEL_4K;
>> }
>>
>> +int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
>> +{
>> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>> +
>> + if (kvm_tdx->nr_gmem_inodes >= TDX_MAX_GMEM_INODES)
>> + return 0;
>
> We have graceful way to handle this, but should we pr_warn_once() or
> something if we ever hit this limit?
>
> Hm. It is also a bit odd that we need to wait until removal to add a link
> to guest_memfd inode from struct kvm/kvm_tdx. Can we do it right away in
> __kvm_gmem_create()?
Sure.
The thing is, the inode is currently private within virt/kvm/guest_memfd.c
so there needs to be a way to make it accessible to arch code. Either a
callback passes it, or it is put on struct kvm in some way.
>
> Do I read correctly that inode->i_mapping->i_private_list only ever has
> single entry of the gmem? Seems wasteful.
Yes, it is presently used for only 1 gmem.
>
> Maybe move it to i_private (I don't see flags being used anywhere) and
> re-use the list_head to link all inodes of the struct kvm?
>
> No need in the gmem_inodes array.
There is also inode->i_mapping->i_private_data which is unused.
On Tue, Mar 18, 2025 at 8:42 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 17/03/25 10:13, Kirill A. Shutemov wrote:
> > On Thu, Mar 13, 2025 at 08:16:29PM +0200, Adrian Hunter wrote:
> >> @@ -3221,6 +3241,19 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> >> return PG_LEVEL_4K;
> >> }
> >>
> >> +int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
> >> +{
> >> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> >> +
> >> + if (kvm_tdx->nr_gmem_inodes >= TDX_MAX_GMEM_INODES)
> >> + return 0;
> >
> > We have graceful way to handle this, but should we pr_warn_once() or
> > something if we ever hit this limit?
> >
> > Hm. It is also a bit odd that we need to wait until removal to add a link
> > to guest_memfd inode from struct kvm/kvm_tdx. Can we do it right away in
> > __kvm_gmem_create()?
>
> Sure.
>
> The thing is, the inode is currently private within virt/kvm/guest_memfd.c
> so there needs to be a way to make it accessible to arch code. Either a
> callback passes it, or it is put on struct kvm in some way.
>
> >
> > Do I read correctly that inode->i_mapping->i_private_list only ever has
> > single entry of the gmem? Seems wasteful.
>
> Yes, it is presently used for only 1 gmem.
Intrahost migration support[1] will make use of this list to track
additional linked files. We are planning to float a next revision
soon.
[1] https://lore.kernel.org/lkml/ZN%2F81KNAWofRCaQK@google.com/t/
>
> >
> > Maybe move it to i_private (I don't see flags being used anywhere) and
> > re-use the list_head to link all inodes of the struct kvm?
> >
> > No need in the gmem_inodes array.
>
> There is also inode->i_mapping->i_private_data which is unused.
>
On Thu, Mar 13, 2025 at 7:16 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> Improve TDX shutdown performance by adding a more efficient shutdown
> operation at the cost of adding separate branches for the TDX MMU
> operations for normal runtime and shutdown. This more efficient method was
> previously used in earlier versions of the TDX patches, but was removed to
> simplify the initial upstreaming. This is an RFC, and still needs a proper
> upstream commit log. It is intended to be an eventual follow up to base
> support.
In the latest code the HKID is released in kvm_arch_pre_destroy_vm().
That is before kvm_free_memslot() calls kvm_gmem_unbind(), which
results in fput() and hence kvm_gmem_release().
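I.e., roughly the following ordering (a sketch of the above, not a verified
call trace):

	kvm_arch_pre_destroy_vm()
	  ...
	    tdx_mmu_release_hkid()    <- HKID released, TD enters TD_TEARDOWN
	...
	kvm_free_memslot()
	  kvm_gmem_unbind()
	    fput()
	      kvm_gmem_release()      <- private pages removed here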
So, as long as userspace doesn't remove the memslots and close the
guestmemfds, shouldn't the TD_TEARDOWN method be usable?
Paolo
> == Background ==
>
> TDX has 2 methods for the host to reclaim guest private memory, depending
> on whether the TD (TDX VM) is in a runnable state or not. These are
> called, respectively:
> 1. Dynamic Page Removal
> 2. Reclaiming TD Pages in TD_TEARDOWN State
>
> Dynamic Page Removal is much slower. Reclaiming a 4K page in TD_TEARDOWN
> state can be 5 times faster, although that is just one part of TD shutdown.
>
> == Relevant TDX Architecture ==
>
> Dynamic Page Removal is slow because it has to potentially deal with a
> running TD, and so involves a number of steps:
> Block further address translation
> Exit each VCPU
> Clear Secure EPT entry
> Flush/write-back/invalidate relevant caches
>
> Reclaiming TD Pages in TD_TEARDOWN State is fast because the shutdown
> procedure (refer tdx_mmu_release_hkid()) has already performed the relevant
> flushing. For details, see TDX Module Base Spec October 2024 sections:
>
> 7.5. TD Teardown Sequence
> 5.6.3. TD Keys Reclamation, TLB and Cache Flush
>
> Essentially all that remains then is to take each page away from the
> TDX Module and return it to the kernel.
>
> == Problem ==
>
> Currently, Dynamic Page Removal is being used when the TD is being
> shutdown for the sake of having simpler initial code.
>
> This happens when guest_memfds are closed, refer kvm_gmem_release().
> guest_memfds hold a reference to struct kvm, so that VM destruction cannot
> happen until after they are released, refer kvm_gmem_release().
>
> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
> reclaim time. For example:
>
> VCPUs Size (GB) Before (secs) After (secs)
> 4 18 72 24
> 32 107 517 134
>
> Note, the V19 patch set:
>
> https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@intel.com/
>
> did not have this issue because the HKID was released early, something that
> Sean effectively NAK'ed:
>
> "No, the right answer is to not release the HKID until the VM is
> destroyed."
>
> https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@google.com/
>
> That was taken on board in the "TDX MMU Part 2" patch set. Refer
> "Moving of tdx_mmu_release_hkid()" in:
>
> https://lore.kernel.org/kvm/20240904030751.117579-1-rick.p.edgecombe@intel.com/
>
> == Options ==
>
> 1. Start TD teardown earlier so that when pages are removed,
> they can be reclaimed faster.
> 2. Defer page removal until after TD teardown has started.
> 3. A combination of 1 and 2.
>
> Option 1 is problematic because it means putting the TD into a non-runnable
> state while it is potentially still active. Also, as mentioned above, Sean
> effectively NAK'ed it.
>
> Option 2 is possible because the lifetime of guest memory pages is separate
> from guest_memfd (struct kvm_gmem) lifetime.
>
> A reference is taken to pages when they are faulted in, refer
> kvm_gmem_get_pfn(). That reference is not released until the pages are
> removed from the mirror SEPT, refer tdx_unpin().
>
> Option 3 is not needed because TD teardown begins during VM destruction
> before pages are reclaimed. TD_TEARDOWN state is entered by
> tdx_mmu_release_hkid(), whereas pages are reclaimed by tdp_mmu_zap_root(),
> as follows:
>
> kvm_arch_destroy_vm()
> ...
> vt_vm_pre_destroy()
> tdx_mmu_release_hkid()
> ...
> kvm_mmu_uninit_vm()
> kvm_mmu_uninit_tdp_mmu()
> kvm_tdp_mmu_invalidate_roots()
> kvm_tdp_mmu_zap_invalidated_roots()
> tdp_mmu_zap_root()
>
> == Proof of Concept for option 2 ==
>
> Assume user space never needs to close a guest_memfd except as part of VM
> shutdown.
>
> Add a callback from kvm_gmem_release() to decide whether to defer removal.
> For TDX, record the inode (up to a max. of 64 inodes) and pin it.
>
> Amend the release of guest_memfds to skip removing pages from the MMU
> in that case.
>
> Amend TDX private memory page removal to detect TD_TEARDOWN state, and
> reclaim the page accordingly.
>
> For TDX, finally unpin any pinned inodes.
>
> This hopefully illustrates what needs to be done, but guidance is sought
> for the best way to do it.
>
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 3 ++
> arch/x86/kvm/Kconfig | 1 +
> arch/x86/kvm/vmx/main.c | 12 +++++++-
> arch/x86/kvm/vmx/tdx.c | 47 +++++++++++++++++++++++++-----
> arch/x86/kvm/vmx/tdx.h | 14 +++++++++
> arch/x86/kvm/vmx/x86_ops.h | 2 ++
> arch/x86/kvm/x86.c | 7 +++++
> include/linux/kvm_host.h | 5 ++++
> virt/kvm/Kconfig | 4 +++
> virt/kvm/guest_memfd.c | 26 ++++++++++++-----
> 11 files changed, 107 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 79406bf07a1c..e4728f1fe646 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -148,6 +148,7 @@ KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
> KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
> KVM_X86_OP_OPTIONAL_RET0(private_max_mapping_level)
> KVM_X86_OP_OPTIONAL(gmem_invalidate)
> +KVM_X86_OP_OPTIONAL_RET0(gmem_defer_removal)
>
> #undef KVM_X86_OP
> #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9b9dde476f3c..d1afb4e1c2ee 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1661,6 +1661,8 @@ static inline u16 kvm_lapic_irq_dest_mode(bool dest_mode_logical)
> return dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
> }
>
> +struct inode;
> +
> struct kvm_x86_ops {
> const char *name;
>
> @@ -1888,6 +1890,7 @@ struct kvm_x86_ops {
> int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
> int (*private_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn);
> + int (*gmem_defer_removal)(struct kvm *kvm, struct inode *inode);
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 0d445a317f61..32c4b9922e7b 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -96,6 +96,7 @@ config KVM_INTEL
> depends on KVM && IA32_FEAT_CTL
> select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST
> select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
> + select HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL if INTEL_TDX_HOST
> help
> Provides support for KVM on processors equipped with Intel's VT
> extensions, a.k.a. Virtual Machine Extensions (VMX).
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 94d5d907d37b..b835006e1282 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -888,6 +888,14 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> return 0;
> }
>
> +static int vt_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
> +{
> + if (is_td(kvm))
> + return tdx_gmem_defer_removal(kvm, inode);
> +
> + return 0;
> +}
> +
> #define VMX_REQUIRED_APICV_INHIBITS \
> (BIT(APICV_INHIBIT_REASON_DISABLED) | \
> BIT(APICV_INHIBIT_REASON_ABSENT) | \
> @@ -1046,7 +1054,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .mem_enc_ioctl = vt_mem_enc_ioctl,
> .vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
>
> - .private_max_mapping_level = vt_gmem_private_max_mapping_level
> + .private_max_mapping_level = vt_gmem_private_max_mapping_level,
> +
> + .gmem_defer_removal = vt_gmem_defer_removal,
> };
>
> struct kvm_x86_init_ops vt_init_ops __initdata = {
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index d9eb20516c71..51bbb44ac1bd 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -5,6 +5,7 @@
> #include <asm/fpu/xcr.h>
> #include <linux/misc_cgroup.h>
> #include <linux/mmu_context.h>
> +#include <linux/fs.h>
> #include <asm/tdx.h>
> #include "capabilities.h"
> #include "mmu.h"
> @@ -594,10 +595,20 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
> kvm_tdx->td.tdr_page = NULL;
> }
>
> +static void tdx_unpin_inodes(struct kvm *kvm)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> + for (int i = 0; i < kvm_tdx->nr_gmem_inodes; i++)
> + iput(kvm_tdx->gmem_inodes[i]);
> +}
> +
> void tdx_vm_destroy(struct kvm *kvm)
> {
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>
> + tdx_unpin_inodes(kvm);
> +
> tdx_reclaim_td_control_pages(kvm);
>
> kvm_tdx->state = TD_STATE_UNINITIALIZED;
> @@ -1778,19 +1789,28 @@ int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
> return tdx_reclaim_page(virt_to_page(private_spt));
> }
>
> +static int tdx_sept_teardown_private_spte(struct kvm *kvm, enum pg_level level, struct page *page)
> +{
> + int ret;
> +
> + if (level != PG_LEVEL_4K)
> + return -EINVAL;
> +
> + ret = tdx_reclaim_page(page);
> + if (!ret)
> + tdx_unpin(kvm, page);
> +
> + return ret;
> +}
> +
> int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> enum pg_level level, kvm_pfn_t pfn)
> {
> struct page *page = pfn_to_page(pfn);
> int ret;
>
> - /*
> - * HKID is released after all private pages have been removed, and set
> - * before any might be populated. Warn if zapping is attempted when
> - * there can't be anything populated in the private EPT.
> - */
> - if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
> - return -EINVAL;
> + if (!is_hkid_assigned(to_kvm_tdx(kvm)))
> + return tdx_sept_teardown_private_spte(kvm, level, pfn_to_page(pfn));
>
> ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> if (ret <= 0)
> @@ -3221,6 +3241,19 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> return PG_LEVEL_4K;
> }
>
> +int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> + if (kvm_tdx->nr_gmem_inodes >= TDX_MAX_GMEM_INODES)
> + return 0;
> +
> + kvm_tdx->gmem_inodes[kvm_tdx->nr_gmem_inodes++] = inode;
> + ihold(inode);
> +
> + return 1;
> +}
> +
> static int tdx_online_cpu(unsigned int cpu)
> {
> unsigned long flags;
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 6b3bebebabfa..fb5c4face131 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -20,6 +20,10 @@ enum kvm_tdx_state {
> TD_STATE_RUNNABLE,
> };
>
> +struct inode;
> +
> +#define TDX_MAX_GMEM_INODES 64
> +
> struct kvm_tdx {
> struct kvm kvm;
>
> @@ -43,6 +47,16 @@ struct kvm_tdx {
> * Set/unset is protected with kvm->mmu_lock.
> */
> bool wait_for_sept_zap;
> +
> + /*
> + * For pages that will not be removed until TD shutdown, the associated
> + * guest_memfd inode is pinned. Allow for a fixed number of pinned
> + * inodes. If there are more, then when the guest_memfd is closed,
> + * their pages will be removed safely but inefficiently prior to
> + * shutdown.
> + */
> + struct inode *gmem_inodes[TDX_MAX_GMEM_INODES];
> + int nr_gmem_inodes;
> };
>
> /* TDX module vCPU states */
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 4704bed033b1..4ee123289d85 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -164,6 +164,7 @@ void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
> void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
> void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
> int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
> +int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode);
> #else
> static inline void tdx_disable_virtualization_cpu(void) {}
> static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
> @@ -229,6 +230,7 @@ static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
> static inline void tdx_flush_tlb_all(struct kvm_vcpu *vcpu) {}
> static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
> static inline int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn) { return 0; }
> +static inline int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode) { return 0; }
> #endif
>
> #endif /* __KVM_X86_VMX_X86_OPS_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 03db366e794a..96ebf0303223 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13659,6 +13659,13 @@ void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> }
> #endif
>
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
> +bool kvm_arch_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
> +{
> + return kvm_x86_call(gmem_defer_removal)(kvm, inode);
> +}
> +#endif
> +
> int kvm_spec_ctrl_test_value(u64 value)
> {
> /*
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d72ba0a4ca0e..f5c8b1923c24 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2534,6 +2534,11 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
> int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
> #endif
>
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
> +struct inode;
> +bool kvm_arch_gmem_defer_removal(struct kvm *kvm, struct inode *inode);
> +#endif
> +
> #ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> /**
> * kvm_gmem_populate() - Populate/prepare a GPA range with guest data
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 54e959e7d68f..589111505ad0 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
> config HAVE_KVM_ARCH_GMEM_INVALIDATE
> bool
> depends on KVM_PRIVATE_MEM
> +
> +config HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
> + bool
> + depends on KVM_PRIVATE_MEM
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index b2aa6bf24d3a..cd485f45fdaf 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -248,6 +248,15 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
> return ret;
> }
>
> +inline bool kvm_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
> +{
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
> + return kvm_arch_gmem_defer_removal(kvm, inode);
> +#else
> + return false;
> +#endif
> +}
> +
> static int kvm_gmem_release(struct inode *inode, struct file *file)
> {
> struct kvm_gmem *gmem = file->private_data;
> @@ -275,13 +284,16 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
> xa_for_each(&gmem->bindings, index, slot)
> WRITE_ONCE(slot->gmem.file, NULL);
>
> - /*
> - * All in-flight operations are gone and new bindings can be created.
> - * Zap all SPTEs pointed at by this file. Do not free the backing
> - * memory, as its lifetime is associated with the inode, not the file.
> - */
> - kvm_gmem_invalidate_begin(gmem, 0, -1ul);
> - kvm_gmem_invalidate_end(gmem, 0, -1ul);
> + if (!kvm_gmem_defer_removal(kvm, inode)) {
> + /*
> + * All in-flight operations are gone and new bindings can be
> + * created. Zap all SPTEs pointed at by this file. Do not free
> + * the backing memory, as its lifetime is associated with the
> + * inode, not the file.
> + */
> + kvm_gmem_invalidate_begin(gmem, 0, -1ul);
> + kvm_gmem_invalidate_end(gmem, 0, -1ul);
> + }
>
> list_del(&gmem->entry);
>
> --
> 2.43.0
>
On 13/03/25 20:39, Paolo Bonzini wrote:
> On Thu, Mar 13, 2025 at 7:16 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>> Improve TDX shutdown performance by adding a more efficient shutdown
>> operation at the cost of adding separate branches for the TDX MMU
>> operations for normal runtime and shutdown. This more efficient method was
>> previously used in earlier versions of the TDX patches, but was removed to
>> simplify the initial upstreaming. This is an RFC, and still needs a proper
>> upstream commit log. It is intended to be an eventual follow up to base
>> support.
>
> In the latest code the HKID is released in kvm_arch_pre_destroy_vm().

I am looking at kvm-coco-queue

> That is before kvm_free_memslot() calls kvm_gmem_unbind(), which
> results in fput() and hence kvm_gmem_release().
>
> So, as long as userspace doesn't remove the memslots and close the
> guestmemfds, shouldn't the TD_TEARDOWN method be usable?

kvm_arch_pre_destroy_vm() is called from kvm_destroy_vm() which won't
happen before kvm_gmem_release() calls kvm_put_kvm().
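
For reference, the ordering being described is roughly the following
(a sketch based on the call paths discussed above for kvm-coco-queue;
the intermediate callers are inferred, not checked against the exact
code):

    kvm_gmem_release()                        <- guest_memfd file release
        ...
        kvm_put_kvm()                         <- drops the reference gmem holds on struct kvm
            kvm_destroy_vm()                  <- runs only once the last reference is gone
                kvm_arch_pre_destroy_vm()
                    vt_vm_pre_destroy()
                        tdx_mmu_release_hkid()  <- HKID released, TD_TEARDOWN entered

So, under this ordering, pages zapped from kvm_gmem_release() are still
removed while the HKID is assigned, i.e. via Dynamic Page Removal rather
than the faster TD_TEARDOWN reclaim.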