[PATCH 5/5] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM

Sean Christopherson posted 5 patches 2 months, 1 week ago
Posted by Sean Christopherson 2 months, 1 week ago
Add sub-ioctl KVM_TDX_TERMINATE_VM to release the HKID prior to shutdown,
which enables more efficient reclaim of private memory.

Private memory is removed from MMU/TDP when guest_memfds are closed.  If
the HKID has not been released, the TDX VM is still in the RUNNABLE state,
and so pages must be removed using the "Dynamic Page Removal" procedure (refer
to the TDX Module Base spec) which involves a number of steps:
	Block further address translation
	Exit each VCPU
	Clear Secure EPT entry
	Flush/write-back/invalidate relevant caches

However, when the HKID is released, the TDX VM moves to TD_TEARDOWN state,
where all TDX VM pages are effectively unmapped, so pages can be reclaimed
directly.

Reclaiming TD pages in the TD_TEARDOWN state was seen to decrease the total
reclaim time.  For example:

  VCPUs    Size (GB)    Before (secs)    After (secs)
      4           18               72              24
     32          107              517             134
     64          400             5539             467

Add kvm_tdx_capabilities.supported_caps along with KVM_TDX_CAP_TERMINATE_VM
to advertise support to userspace.  Use a new field in kvm_tdx_capabilities
instead of adding yet another generic KVM_CAP to avoid bleeding TDX details
into common code (and #ifdefs), and so that userspace can query TDX
capabilities in one shot.  Enumerating capabilities as a mask of bits does
limit supported_caps to 64 capabilities, but in the unlikely event KVM
needs to enumerate more than 64 TDX capabilities, there are another 249
u64 entries reserved for future expansion.

To preserve the KVM_BUG_ON() sanity check that deals with HKID assignment,
track if a TD is terminated and assert that, when an S-EPT entry is
removed, either the TD has an assigned HKID or the TD was explicitly
terminated.

Link: https://lore.kernel.org/r/Z-V0qyTn2bXdrPF7@google.com
Link: https://lore.kernel.org/r/aAL4dT1pWG5dDDeo@google.com
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Vishal Annapurve <vannapurve@google.com>
Tested-by: Vishal Annapurve <vannapurve@google.com>
Tested-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/x86/intel-tdx.rst | 22 +++++++++++++-
 arch/x86/include/uapi/asm/kvm.h          |  7 ++++-
 arch/x86/kvm/vmx/tdx.c                   | 37 +++++++++++++++++++-----
 arch/x86/kvm/vmx/tdx.h                   |  1 +
 4 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
index 5efac62c92c7..bcfa97e0c9e7 100644
--- a/Documentation/virt/kvm/x86/intel-tdx.rst
+++ b/Documentation/virt/kvm/x86/intel-tdx.rst
@@ -38,6 +38,7 @@ ioctl with TDX specific sub-ioctl() commands.
           KVM_TDX_INIT_MEM_REGION,
           KVM_TDX_FINALIZE_VM,
           KVM_TDX_GET_CPUID,
+          KVM_TDX_TERMINATE_VM,
 
           KVM_TDX_CMD_NR_MAX,
   };
@@ -92,7 +93,10 @@ to be configured to the TDX guest.
         __u64 kernel_tdvmcallinfo_1_r12;
         __u64 user_tdvmcallinfo_1_r12;
 
-        __u64 reserved[250];
+        /* Misc capabilities enumerated via the KVM_TDX_CAP_* namespace. */
+        __u64 supported_caps;
+
+        __u64 reserved[249];
 
         /* Configurable CPUID bits for userspace */
         struct kvm_cpuid2 cpuid;
@@ -227,6 +231,22 @@ struct kvm_cpuid2.
 	  __u32 padding[3];
   };
 
+KVM_TDX_TERMINATE_VM
+--------------------
+:Capability: KVM_TDX_CAP_TERMINATE_VM
+:Type: vm ioctl
+:Returns: 0 on success, <0 on error
+
+Release Host Key ID (HKID) to allow more efficient reclaim of private memory.
+After this, the TD is no longer in a runnable state.
+
+Using KVM_TDX_TERMINATE_VM is optional.
+
+- id: KVM_TDX_TERMINATE_VM
+- flags: must be 0
+- data: must be 0
+- hw_error: must be 0
+
 KVM TDX creation flow
 =====================
 In addition to the standard KVM flow, new TDX ioctls need to be called.  The
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 0f15d683817d..e019111e2150 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -940,6 +940,7 @@ enum kvm_tdx_cmd_id {
 	KVM_TDX_INIT_MEM_REGION,
 	KVM_TDX_FINALIZE_VM,
 	KVM_TDX_GET_CPUID,
+	KVM_TDX_TERMINATE_VM,
 
 	KVM_TDX_CMD_NR_MAX,
 };
@@ -962,6 +963,8 @@ struct kvm_tdx_cmd {
 	__u64 hw_error;
 };
 
+#define KVM_TDX_CAP_TERMINATE_VM       _BITULL(0)
+
 struct kvm_tdx_capabilities {
 	__u64 supported_attrs;
 	__u64 supported_xfam;
@@ -971,7 +974,9 @@ struct kvm_tdx_capabilities {
 	__u64 kernel_tdvmcallinfo_1_r12;
 	__u64 user_tdvmcallinfo_1_r12;
 
-	__u64 reserved[250];
+	__u64 supported_caps;
+
+	__u64 reserved[249];
 
 	/* Configurable CPUID bits for userspace */
 	struct kvm_cpuid2 cpuid;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c2ef03f39c32..ae059daf1a20 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -188,6 +188,8 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
 	if (!caps->supported_xfam)
 		return -EIO;
 
+	caps->supported_caps = KVM_TDX_CAP_TERMINATE_VM;
+
 	caps->cpuid.nent = td_conf->num_cpuid_config;
 
 	caps->user_tdvmcallinfo_1_r11 =
@@ -520,6 +522,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 		goto out;
 	}
 
+	write_lock(&kvm->mmu_lock);
 	for_each_online_cpu(i) {
 		if (packages_allocated &&
 		    cpumask_test_and_set_cpu(topology_physical_package_id(i),
@@ -544,7 +547,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 	} else {
 		tdx_hkid_free(kvm_tdx);
 	}
-
+	write_unlock(&kvm->mmu_lock);
 out:
 	mutex_unlock(&tdx_lock);
 	cpus_read_unlock();
@@ -1884,13 +1887,13 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	struct page *page = pfn_to_page(pfn);
 	int ret;
 
-	/*
-	 * HKID is released after all private pages have been removed, and set
-	 * before any might be populated. Warn if zapping is attempted when
-	 * there can't be anything populated in the private EPT.
-	 */
-	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
-		return -EINVAL;
+	if (!is_hkid_assigned(to_kvm_tdx(kvm))) {
+		KVM_BUG_ON(!to_kvm_tdx(kvm)->vm_terminated, kvm);
+		ret = tdx_reclaim_page(page);
+		if (!ret)
+			tdx_unpin(kvm, page);
+		return ret;
+	}
 
 	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
 	if (ret <= 0)
@@ -2884,6 +2887,21 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	return 0;
 }
 
+static int tdx_terminate_vm(struct kvm *kvm)
+{
+	if (kvm_trylock_all_vcpus(kvm))
+		return -EBUSY;
+
+	kvm_vm_dead(kvm);
+	to_kvm_tdx(kvm)->vm_terminated = true;
+
+	kvm_unlock_all_vcpus(kvm);
+
+	tdx_mmu_release_hkid(kvm);
+
+	return 0;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -2911,6 +2929,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_TDX_FINALIZE_VM:
 		r = tdx_td_finalize(kvm, &tdx_cmd);
 		break;
+	case KVM_TDX_TERMINATE_VM:
+		r = tdx_terminate_vm(kvm);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index ca39a9391db1..0abe70aa1644 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -45,6 +45,7 @@ struct kvm_tdx {
 	 * Set/unset is protected with kvm->mmu_lock.
 	 */
 	bool wait_for_sept_zap;
+	bool vm_terminated;
 };
 
 /* TDX module vCPU states */
-- 
2.50.1.552.g942d659e1b-goog
Re: [PATCH 5/5] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
Posted by Adrian Hunter 2 months ago
On 29/07/2025 22:33, Sean Christopherson wrote:
> +static int tdx_terminate_vm(struct kvm *kvm)
> +{
> +	if (kvm_trylock_all_vcpus(kvm))
> +		return -EBUSY;
> +
> +	kvm_vm_dead(kvm);
> +	to_kvm_tdx(kvm)->vm_terminated = true;
> +
> +	kvm_unlock_all_vcpus(kvm);
> +
> +	tdx_mmu_release_hkid(kvm);
> +
> +	return 0;
> +}

As I mentioned when removing vm_dead first came up, I think
we need more checks.  I spent some time going through the code
and came up with what is below:

First, we need to avoid TDX VCPU sub-IOCTLs racing with
tdx_mmu_release_hkid().  But having any TDX sub-IOCTL run after
KVM_TDX_TERMINATE_VM raises questions about what might happen, so
it is much simpler to understand if that is not possible.
There are 3 options:

1. Require that KVM_TDX_TERMINATE_VM is valid only if
kvm_tdx->state == TD_STATE_RUNNABLE.  Since currently all
the TDX sub-IOCTLs are for initialization, that would block
the opportunity for any to run after KVM_TDX_TERMINATE_VM.

2. Check vm_terminated in tdx_vm_ioctl() and tdx_vcpu_ioctl()

3. Test KVM_REQ_VM_DEAD in tdx_vm_ioctl() and tdx_vcpu_ioctl()

[ Note: we cannot check is_hkid_assigned() because that is racy ]

Secondly, I suggest we avoid SEAMCALLs that will fail and
result in KVM_BUG_ON() if HKID has been released.

There are 2 groups of those: MMU-related and TDVPS_ACCESSORS.

For the MMU-related, the following 2 functions should return
an error immediately if vm_terminated:

	tdx_sept_link_private_spt()
	tdx_sept_set_private_spte()

For that not to be racy, extra synchronization is needed so that
vm_terminated can be reliably checked while holding the mmu lock,
i.e.:

static int tdx_terminate_vm(struct kvm *kvm)
{
	if (kvm_trylock_all_vcpus(kvm))
		return -EBUSY;

	kvm_vm_dead(kvm);
+
+       write_lock(&kvm->mmu_lock);
	to_kvm_tdx(kvm)->vm_terminated = true;
+       write_unlock(&kvm->mmu_lock);

	kvm_unlock_all_vcpus(kvm);

	tdx_mmu_release_hkid(kvm);

	return 0;
}

Finally, there are 2 TDVPS_ACCESSORS that need avoiding:

	tdx_load_mmu_pgd()
		skip td_vmcs_write64() if vm_terminated

	tdx_protected_apic_has_interrupt()
		skip td_state_non_arch_read64() if vm_terminated
Re: [PATCH 5/5] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
Posted by Sean Christopherson 2 months ago
+Chao

On Fri, Aug 01, 2025, Adrian Hunter wrote:
> On 29/07/2025 22:33, Sean Christopherson wrote:
> > +static int tdx_terminate_vm(struct kvm *kvm)
> > +{
> > +	if (kvm_trylock_all_vcpus(kvm))
> > +		return -EBUSY;
> > +
> > +	kvm_vm_dead(kvm);
> > +	to_kvm_tdx(kvm)->vm_terminated = true;
> > +
> > +	kvm_unlock_all_vcpus(kvm);
> > +
> > +	tdx_mmu_release_hkid(kvm);
> > +
> > +	return 0;
> > +}
> 
> As I think I mentioned when removing vm_dead first came up,
> I think we need more checks.  I spent some time going through
> the code and came up with what is below:
> 
> First, we need to avoid TDX VCPU sub-IOCTLs from racing with
> tdx_mmu_release_hkid().  But having any TDX sub-IOCTL run after
> KVM_TDX_TERMINATE_VM raises questions of what might happen, so
> it is much simpler to understand, if that is not possible.
> There are 3 options:
> 
> 1. Require that KVM_TDX_TERMINATE_VM is valid only if
> kvm_tdx->state == TD_STATE_RUNNABLE.  Since currently all
> the TDX sub-IOCTLs are for initialization, that would block
> the opportunity for any to run after KVM_TDX_TERMINATE_VM.
> 
> 2. Check vm_terminated in tdx_vm_ioctl() and tdx_vcpu_ioctl()
> 
> 3. Test KVM_REQ_VM_DEAD in tdx_vm_ioctl() and tdx_vcpu_ioctl()
> 
> [ Note cannot check is_hkid_assigned() because that is racy ]
> 
> Secondly, I suggest we avoid SEAMCALLs that will fail and
> result in KVM_BUG_ON() if HKID has been released.
> 
> There are 2 groups of those: MMU-related and TDVPS_ACCESSORS.
> 
> For the MMU-related, the following 2 functions should return
> an error immediately if vm_terminated:
> 
> 	tdx_sept_link_private_spt()
> 	tdx_sept_set_private_spte()
> 
> For that not be racy, extra synchronization is needed so that
> vm_terminated can be reliably checked when holding mmu lock
> i.e.
> 
> static int tdx_terminate_vm(struct kvm *kvm)
> {
> 	if (kvm_trylock_all_vcpus(kvm))
> 		return -EBUSY;
> 
> 	kvm_vm_dead(kvm);
> +
> +       write_lock(&kvm->mmu_lock);
> 	to_kvm_tdx(kvm)->vm_terminated = true;
> +       write_unlock(&kvm->mmu_lock);
> 
> 	kvm_unlock_all_vcpus(kvm);
> 
> 	tdx_mmu_release_hkid(kvm);
> 
> 	return 0;
> }
> 
> Finally, there are 2 TDVPS_ACCESSORS that need avoiding:
> 
> 	tdx_load_mmu_pgd()
> 		skip td_vmcs_write64() if vm_terminated
> 
> 	tdx_protected_apic_has_interrupt()
> 		skip td_state_non_arch_read64() if vm_terminated

Oof.  And as Chao pointed out[*], removing the vm_dead check would allow creating
and running vCPUs in a dead VM, which is most definitely not desirable.  Squashing
the vCPU creation case is easy enough if we keep vm_dead but still generally allow
ioctls, and it's probably worth doing that no matter what (to plug the hole where
pending vCPU creations could succeed):

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d477a7fda0ae..941d2c32b7dc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4207,6 +4207,11 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
 
        mutex_lock(&kvm->lock);
 
+       if (kvm->vm_dead) {
+               r = -EIO;
+               goto unlock_vcpu_destroy;
+       }
+
        if (kvm_get_vcpu_by_id(kvm, id)) {
                r = -EEXIST;
                goto unlock_vcpu_destroy;

And then to ensure vCPUs can't do anything, check KVM_REQ_VM_DEAD after acquiring
vcpu->mutex.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6c07dd423458..883077eee4ce 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4433,6 +4433,12 @@ static long kvm_vcpu_ioctl(struct file *filp,
 
        if (mutex_lock_killable(&vcpu->mutex))
                return -EINTR;
+
+       if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) {
+               r = -EIO;
+               goto out;
+       }
+
        switch (ioctl) {
        case KVM_RUN: {
                struct pid *oldpid;


That should address all TDVPS paths (I hope), and I _think_ would address all
MMU-related paths as well?  E.g. prefault requires a vCPU.

Disallowing (most) vCPU ioctls but not all VM ioctls on vm_dead isn't great ABI
(understatement), but I think we need/want the above changes even if we keep the
general vm_dead restriction.  And given the extremely ad hoc behavior of taking
kvm->lock for VM ioctls, trying to enforce vm_dead for "all" VM ioctls seems like
a fool's errand.

So I'm leaning toward keeping "KVM: Reject ioctls only if the VM is bugged, not
simply marked dead" (with a different shortlog+changelog), but keeping vm_dead
(and not introducing kvm_tdx.vm_terminated).

Thoughts?

[*] https://lore.kernel.org/all/aIlzeT+yFG2Tvb3%2F@intel.com
Re: [PATCH 5/5] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
Posted by Adrian Hunter 2 months ago
On 01/08/2025 19:44, Sean Christopherson wrote:
> +Chao
> 
> On Fri, Aug 01, 2025, Adrian Hunter wrote:
>> On 29/07/2025 22:33, Sean Christopherson wrote:
>>> +static int tdx_terminate_vm(struct kvm *kvm)
>>> +{
>>> +	if (kvm_trylock_all_vcpus(kvm))
>>> +		return -EBUSY;
>>> +
>>> +	kvm_vm_dead(kvm);
>>> +	to_kvm_tdx(kvm)->vm_terminated = true;
>>> +
>>> +	kvm_unlock_all_vcpus(kvm);
>>> +
>>> +	tdx_mmu_release_hkid(kvm);
>>> +
>>> +	return 0;
>>> +}
>>
>> As I think I mentioned when removing vm_dead first came up,
>> I think we need more checks.  I spent some time going through
>> the code and came up with what is below:
>>
>> First, we need to avoid TDX VCPU sub-IOCTLs from racing with
>> tdx_mmu_release_hkid().  But having any TDX sub-IOCTL run after
>> KVM_TDX_TERMINATE_VM raises questions of what might happen, so
>> it is much simpler to understand, if that is not possible.
>> There are 3 options:
>>
>> 1. Require that KVM_TDX_TERMINATE_VM is valid only if
>> kvm_tdx->state == TD_STATE_RUNNABLE.  Since currently all
>> the TDX sub-IOCTLs are for initialization, that would block
>> the opportunity for any to run after KVM_TDX_TERMINATE_VM.
>>
>> 2. Check vm_terminated in tdx_vm_ioctl() and tdx_vcpu_ioctl()
>>
>> 3. Test KVM_REQ_VM_DEAD in tdx_vm_ioctl() and tdx_vcpu_ioctl()
>>
>> [ Note cannot check is_hkid_assigned() because that is racy ]
>>
>> Secondly, I suggest we avoid SEAMCALLs that will fail and
>> result in KVM_BUG_ON() if HKID has been released.
>>
>> There are 2 groups of those: MMU-related and TDVPS_ACCESSORS.
>>
>> For the MMU-related, the following 2 functions should return
>> an error immediately if vm_terminated:
>>
>> 	tdx_sept_link_private_spt()
>> 	tdx_sept_set_private_spte()
>>
>> For that not be racy, extra synchronization is needed so that
>> vm_terminated can be reliably checked when holding mmu lock
>> i.e.
>>
>> static int tdx_terminate_vm(struct kvm *kvm)
>> {
>> 	if (kvm_trylock_all_vcpus(kvm))
>> 		return -EBUSY;
>>
>> 	kvm_vm_dead(kvm);
>> +
>> +       write_lock(&kvm->mmu_lock);
>> 	to_kvm_tdx(kvm)->vm_terminated = true;
>> +       write_unlock(&kvm->mmu_lock);
>>
>> 	kvm_unlock_all_vcpus(kvm);
>>
>> 	tdx_mmu_release_hkid(kvm);
>>
>> 	return 0;
>> }
>>
>> Finally, there are 2 TDVPS_ACCESSORS that need avoiding:
>>
>> 	tdx_load_mmu_pgd()
>> 		skip td_vmcs_write64() if vm_terminated
>>
>> 	tdx_protected_apic_has_interrupt()
>> 		skip td_state_non_arch_read64() if vm_terminated
> 
> Oof.  And as Chao pointed out[*], removing the vm_dead check would allow creating
> and running vCPUs in a dead VM, which is most definitely not desirable.  Squashing
> the vCPU creation case is easy enough if we keep vm_dead but still generally allow
> ioctls, and it's probably worth doing that no matter what (to plug the hole where
> pending vCPU creations could succeed):
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d477a7fda0ae..941d2c32b7dc 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4207,6 +4207,11 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
>  
>         mutex_lock(&kvm->lock);
>  
> +       if (kvm->vm_dead) {
> +               r = -EIO;
> +               goto unlock_vcpu_destroy;
> +       }
> +
>         if (kvm_get_vcpu_by_id(kvm, id)) {
>                 r = -EEXIST;
>                 goto unlock_vcpu_destroy;
> 
> And then to ensure vCPUs can't do anything, check KVM_REQ_VM_DEAD after acquiring
> vcpu->mutex.
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6c07dd423458..883077eee4ce 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4433,6 +4433,12 @@ static long kvm_vcpu_ioctl(struct file *filp,
>  
>         if (mutex_lock_killable(&vcpu->mutex))
>                 return -EINTR;
> +
> +       if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) {
> +               r = -EIO;
> +               goto out;
> +       }
> +
>         switch (ioctl) {
>         case KVM_RUN: {
>                 struct pid *oldpid;
> 
> 
> That should address all TDVPS paths (I hope), and I _think_ would address all
> MMU-related paths as well?  E.g. prefault requires a vCPU.
> 
> Disallowing (most) vCPU ioctls but not all VM ioctls on vm_dead isn't great ABI
> (understatement), but I think we need/want the above changes even if we keep the
> general vm_dead restriction.  And given the extremely ad hoc behavior of taking
> kvm->lock for VM ioctls, trying to enforce vm_dead for "all" VM ioctls seems like
> a fool's errand.
> 
> So I'm leaning toward keeping "KVM: Reject ioctls only if the VM is bugged, not
> simply marked dead" (with a different shortlog+changelog), but keeping vm_dead
> (and not introducing kvm_tdx.vm_terminated).
> 
> Thoughts?

That covers the cases I listed, so it is fine by me.
Re: [PATCH 5/5] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
Posted by Chao Gao 2 months ago
>Oof.  And as Chao pointed out[*], removing the vm_dead check would allow creating
>and running vCPUs in a dead VM, which is most definitely not desirable.  Squashing
>the vCPU creation case is easy enough if we keep vm_dead but still generally allow
>ioctls, and it's probably worth doing that no matter what (to plug the hole where
>pending vCPU creations could succeed):
>
>diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>index d477a7fda0ae..941d2c32b7dc 100644
>--- a/virt/kvm/kvm_main.c
>+++ b/virt/kvm/kvm_main.c
>@@ -4207,6 +4207,11 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
> 
>        mutex_lock(&kvm->lock);
> 
>+       if (kvm->vm_dead) {
>+               r = -EIO;
>+               goto unlock_vcpu_destroy;
>+       }
>+

yes. this addresses my concern.

>        if (kvm_get_vcpu_by_id(kvm, id)) {
>                r = -EEXIST;
>                goto unlock_vcpu_destroy;
>
>And then to ensure vCPUs can't do anything, check KVM_REQ_VM_DEAD after acquiring
>vcpu->mutex.
>
>diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>index 6c07dd423458..883077eee4ce 100644
>--- a/virt/kvm/kvm_main.c
>+++ b/virt/kvm/kvm_main.c
>@@ -4433,6 +4433,12 @@ static long kvm_vcpu_ioctl(struct file *filp,
> 
>        if (mutex_lock_killable(&vcpu->mutex))
>                return -EINTR;
>+
>+       if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) {
>+               r = -EIO;
>+               goto out;
>+       }
>+
>        switch (ioctl) {
>        case KVM_RUN: {
>                struct pid *oldpid;
>
>
>That should address all TDVPS paths (I hope), and I _think_ would address all
>MMU-related paths as well?  E.g. prefault requires a vCPU.
>
>Disallowing (most) vCPU ioctls but not all VM ioctls on vm_dead isn't great ABI
>(understatement), but I think we need/want the above changes even if we keep the
>general vm_dead restriction.  And given the extremely ad hoc behavior of taking
>kvm->lock for VM ioctls, trying to enforce vm_dead for "all" VM ioctls seems like
>a fool's errand.
>
>So I'm leaning toward keeping "KVM: Reject ioctls only if the VM is bugged, not
>simply marked dead" (with a different shortlog+changelog), but keeping vm_dead
>(and not introducing kvm_tdx.vm_terminated).

Sounds good to me.

With kvm_tdx.vm_terminated removed, we should consider adding a comment above
the is_hkid_assigned() check in tdx_sept_remove_private_spte() to clarify that
!is_hkid_assigned() indicates the guest has been terminated, allowing private
pages to be reclaimed directly without zapping.