WARN and terminate the VM if TDH_MR_EXTEND fails, as extending the
measurement should fail if and only if there is a KVM bug, or if the S-EPT
mapping is invalid, and it should be impossible for the S-EPT mappings to
be removed between kvm_tdp_mmu_map_private_pfn() and tdh_mr_extend().

Holding slots_lock prevents zaps due to memslot updates,
filemap_invalidate_lock() prevents zaps due to guest_memfd PUNCH_HOLE,
and all usage of kvm_zap_gfn_range() is mutually exclusive with S-EPT
entries that can be used for the initial image. The call from sev.c is
obviously mutually exclusive, TDX disallows KVM_X86_QUIRK_IGNORE_GUEST_PAT
so the same goes for kvm_noncoherent_dma_assignment_start_or_stop(), and while
__kvm_set_or_clear_apicv_inhibit() can likely be tripped while building the
image, the APIC page has its own non-guest_memfd memslot and so can't be
used for the initial image, which means that too is mutually exclusive.

Opportunistically switch to "goto" to jump around the measurement code,
partly to make it clear that KVM needs to bail entirely if extending the
measurement fails, partly in anticipation of reworking how and when
TDH_MEM_PAGE_ADD is done.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/tdx.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 06dd2861eba7..bc92e87a1dbb 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3145,14 +3145,22 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 
 	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
 
-	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
-		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
-			err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
-					    &level_state);
-			if (err) {
-				ret = -EIO;
-				break;
-			}
+	if (!(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
+		goto out;
+
+	/*
+	 * Note, MR.EXTEND can fail if the S-EPT mapping is somehow removed
+	 * between mapping the pfn and now, but slots_lock prevents memslot
+	 * updates, filemap_invalidate_lock() prevents guest_memfd updates,
+	 * mmu_notifier events can't reach S-EPT entries, and KVM's internal
+	 * zapping flows are mutually exclusive with S-EPT mappings.
+	 */
+	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
+		if (KVM_BUG_ON(err, kvm)) {
+			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
+			ret = -EIO;
+			goto out;
 		}
 	}
--
2.51.0.318.gd7df087d1a-goog
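
For context on the flow discussed below: userspace drives KVM_TDX_INIT_MEM_REGION
through the KVM_MEMORY_ENCRYPT_OP vCPU ioctl. A rough sketch of the userspace
side, based on the upstream uapi headers (vcpu_fd, src_buf, and the GPA value
are illustrative assumptions):

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	/* vcpu_fd and src_buf are assumed to be set up elsewhere. */
	static int td_add_measured_page(int vcpu_fd, void *src_buf, __u64 gpa)
	{
		struct kvm_tdx_init_mem_region region = {
			.source_addr = (__u64)src_buf,	/* page-aligned source data */
			.gpa = gpa,			/* private GPA to populate */
			.nr_pages = 1,
		};
		struct kvm_tdx_cmd cmd = {
			.id = KVM_TDX_INIT_MEM_REGION,
			/* Also extend the TD measurement with the page contents. */
			.flags = KVM_TDX_MEASURE_MEMORY_REGION,
			.data = (__u64)&region,
		};

		return ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
	}

It is this measurement step, running on one vCPU's ioctl path, that the replies
below show contending with SEAMCALLs issued on behalf of other vCPUs.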
On Thu, Aug 28, 2025 at 05:06:12PM -0700, Sean Christopherson wrote:
> WARN and terminate the VM if TDH_MR_EXTEND fails, as extending the
> measurement should fail if and only if there is a KVM bug, or if the S-EPT
> mapping is invalid, and it should be impossible for the S-EPT mappings to
> be removed between kvm_tdp_mmu_map_private_pfn() and tdh_mr_extend().
>
[...]
>
> +	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> +		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
> +		if (KVM_BUG_ON(err, kvm)) {

I suspect tdh_mr_extend() running on one vCPU may contend with
tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()/
tdh_mng_rd()/tdh_vp_flush() on other vCPUs, if userspace invokes ioctl
KVM_TDX_INIT_MEM_REGION on one vCPU while initializing other vCPUs.

It's similar to the analysis of contention of tdh_mem_page_add() [1], as
both tdh_mr_extend() and tdh_mem_page_add() acquire an exclusive lock on
resource TDR.

I'll try to write a test to verify it and come back to you.

[1] https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com/

> +			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
> +			ret = -EIO;
> +			goto out;
On Fri, 2025-08-29 at 16:18 +0800, Yan Zhao wrote:
> > +	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> > +		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
> > +		if (KVM_BUG_ON(err, kvm)) {
> I suspect tdh_mr_extend() running on one vCPU may contend with
> tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()/
> tdh_mng_rd()/tdh_vp_flush() on other vCPUs, if userspace invokes ioctl
> KVM_TDX_INIT_MEM_REGION on one vCPU while initializing other vCPUs.
>
> It's similar to the analysis of contention of tdh_mem_page_add() [1], as
> both tdh_mr_extend() and tdh_mem_page_add() acquire an exclusive lock on
> resource TDR.
>
> I'll try to write a test to verify it and come back to you.

I'm seeing the same thing in the TDX module. It could fail because of
contention controllable from userspace. So the KVM_BUG_ON() is not
appropriate.

Today though if tdh_mr_extend() fails because of contention then the TD is
essentially dead anyway. Trying to redo KVM_TDX_INIT_MEM_REGION will fail.
The M-EPT fault could be spurious but the second tdh_mem_page_add() would
return an error and never get back to the tdh_mr_extend().

The version in this patch can't recover for a different reason. That is
kvm_tdp_mmu_map_private_pfn() doesn't handle spurious faults, so I'd say
just drop the KVM_BUG_ON(), and try to handle the contention in a separate
effort.

I guess the two approaches could be to make KVM_TDX_INIT_MEM_REGION more
robust, or prevent the contention. For the latter case:
tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()
...I think we could just take slots_lock during KVM_TDX_INIT_VCPU and
KVM_TDX_GET_CPUID.

For tdh_vp_flush() the vcpu_load() in kvm_arch_vcpu_ioctl() could be hard
to handle.

So I'd think maybe to look towards making KVM_TDX_INIT_MEM_REGION more
robust, which would mean the eventual solution wouldn't have ABI concerns
by later blocking things that used to be allowed.

Maybe having kvm_tdp_mmu_map_private_pfn() return success for spurious
faults is enough. But this is all for a case that userspace isn't expected
to actually hit, so seems like something that could be kicked down the
road easily.
On Fri, Aug 29, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-08-29 at 16:18 +0800, Yan Zhao wrote:
> > I suspect tdh_mr_extend() running on one vCPU may contend with
> > tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()/
> > tdh_mng_rd()/tdh_vp_flush() on other vCPUs, if userspace invokes ioctl
> > KVM_TDX_INIT_MEM_REGION on one vCPU while initializing other vCPUs.
> >
> > It's similar to the analysis of contention of tdh_mem_page_add() [1], as
> > both tdh_mr_extend() and tdh_mem_page_add() acquire an exclusive lock on
> > resource TDR.
> >
> > I'll try to write a test to verify it and come back to you.
>
> I'm seeing the same thing in the TDX module. It could fail because of contention
> controllable from userspace. So the KVM_BUG_ON() is not appropriate.
>
> Today though if tdh_mr_extend() fails because of contention then the TD is
> essentially dead anyway. Trying to redo KVM_TDX_INIT_MEM_REGION will fail. The
> M-EPT fault could be spurious but the second tdh_mem_page_add() would return an
> error and never get back to the tdh_mr_extend().
>
> The version in this patch can't recover for a different reason. That is
> kvm_tdp_mmu_map_private_pfn() doesn't handle spurious faults, so I'd say just
> drop the KVM_BUG_ON(), and try to handle the contention in a separate effort.
>
> I guess the two approaches could be to make KVM_TDX_INIT_MEM_REGION more robust,

This. First and foremost, KVM's ordering and locking rules need to be explicit
(ideally documented, but at the very least apparent in the code), *especially*
when the locking (or lack thereof) impacts userspace. Even if effectively relying
on the TDX-module to provide ordering "works", it's all but impossible to follow.

And it doesn't truly work, as everything in the TDX-Module is a trylock, and that
in turn prevents KVM from asserting success. Sometimes KVM has a better option
than to rely on hardware to detect failure, but it really should be a last resort,
because not being able to expect success makes debugging no fun. Even worse, it
bleeds hard-to-document, specific ordering requirements into userspace, e.g. in
this case, it sounds like userspace can't do _anything_ on vCPUs while doing
KVM_TDX_INIT_MEM_REGION. Which might not be a burden for userspace, but oof is
it nasty from an ABI perspective.

> or prevent the contention. For the latter case:
> tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()
> ...I think we could just take slots_lock during KVM_TDX_INIT_VCPU and
> KVM_TDX_GET_CPUID.
>
> For tdh_vp_flush() the vcpu_load() in kvm_arch_vcpu_ioctl() could be hard to
> handle.
>
> So I'd think maybe to look towards making KVM_TDX_INIT_MEM_REGION more robust,
> which would mean the eventual solution wouldn't have ABI concerns by later
> blocking things that used to be allowed.
>
> Maybe having kvm_tdp_mmu_map_private_pfn() return success for spurious faults is
> enough. But this is all for a case that userspace isn't expected to actually
> hit, so seems like something that could be kicked down the road easily.

You're trying to be too "nice", just smack 'em with a big hammer. For all intents
and purposes, the paths in question are fully serialized, there's no reason to try
and allow anything remotely interesting to happen.

Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to prevent
kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from interfering.

Doing that for a vCPU ioctl is a bit awkward, but not awful. E.g. we can abuse
kvm_arch_vcpu_async_ioctl(). In hindsight, a more clever approach would have
been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU fd. Oh
well.

Anyways, I think we need to avoid the "synchronous" ioctl path anyways, because
taking kvm->slots_lock inside vcpu->mutex is gross. AFAICT it's not actively
problematic today, but it feels like a deadlock waiting to happen.

The other oddity I see is the handling of kvm_tdx->state. I don't see how this
check in tdx_vcpu_create() is safe:

	if (kvm_tdx->state != TD_STATE_INITIALIZED)
		return -EIO;

kvm_arch_vcpu_create() runs without any locks held, and so TDX effectively has
the same bug that SEV intra-host migration had, where an in-flight vCPU creation
could race with a VM-wide state transition (see commit ecf371f8b02d ("KVM: SVM:
Reject SEV{-ES} intra host migration if vCPU creation is in-flight")). To fix
that, kvm->lock needs to be taken and KVM needs to verify there's no in-flight
vCPU creation, e.g. so that a vCPU doesn't pop up and contend a TDX-Module lock.

We can even define a fancy new CLASS to handle the lock+check => unlock logic
with guard()-like syntax:

	CLASS(tdx_vm_state_guard, guard)(kvm);
	if (IS_ERR(guard))
		return PTR_ERR(guard);

IIUC, with all of those locks, KVM can KVM_BUG_ON() both TDH_MEM_PAGE_ADD and
TDH_MR_EXTEND, with no exceptions given for -EBUSY. Attached patches are very
lightly tested as usual and need to be chunked up, but seem to do what I want.
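
For readers unfamiliar with the kernel's <linux/cleanup.h> machinery, here is a
minimal sketch of how such a CLASS could be defined. The helper names and the
use of the kvm_lock_all_vcpus() helper from recent kernels are assumptions for
illustration, not the patches Sean attached:

	/* Hypothetical sketch; not the code attached to this thread. */
	static struct kvm *tdx_acquire_vm_state_locks(struct kvm *kvm)
	{
		mutex_lock(&kvm->lock);

		/*
		 * Reject in-flight vCPU creation, which runs without any locks
		 * held and thus isn't covered by the vCPU mutexes below.
		 */
		if (kvm->created_vcpus != atomic_read(&kvm->online_vcpus))
			goto out_kvm;

		mutex_lock(&kvm->slots_lock);
		if (kvm_lock_all_vcpus(kvm))
			goto out_slots;

		return kvm;

	out_slots:
		mutex_unlock(&kvm->slots_lock);
	out_kvm:
		mutex_unlock(&kvm->lock);
		return ERR_PTR(-EBUSY);
	}

	static void tdx_release_vm_state_locks(struct kvm *kvm)
	{
		kvm_unlock_all_vcpus(kvm);
		mutex_unlock(&kvm->slots_lock);
		mutex_unlock(&kvm->lock);
	}

	DEFINE_CLASS(tdx_vm_state_guard, struct kvm *,
		     if (!IS_ERR(_T)) tdx_release_vm_state_locks(_T),
		     tdx_acquire_vm_state_locks(kvm), struct kvm *kvm);

With that, CLASS(tdx_vm_state_guard, guard)(kvm) acquires everything up front
and releases it automatically when "guard" goes out of scope.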
On Fri, Aug 29, 2025 at 01:11:35PM -0700, Sean Christopherson wrote:
> On Fri, Aug 29, 2025, Rick P Edgecombe wrote:
> > On Fri, 2025-08-29 at 16:18 +0800, Yan Zhao wrote:
> > > I suspect tdh_mr_extend() running on one vCPU may contend with
> > > tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()/
> > > tdh_mng_rd()/tdh_vp_flush() on other vCPUs, if userspace invokes ioctl
> > > KVM_TDX_INIT_MEM_REGION on one vCPU while initializing other vCPUs.
> > >
> > > It's similar to the analysis of contention of tdh_mem_page_add() [1], as
> > > both tdh_mr_extend() and tdh_mem_page_add() acquire an exclusive lock on
> > > resource TDR.
> > >
> > > I'll try to write a test to verify it and come back to you.

I've written a selftest and proved the contention between tdh_mr_extend()
and tdh_vp_create(). The KVM_BUG_ON() after tdh_mr_extend() is no longer
hittable with Sean's two newly provided fixes.

But during writing another concurrency test, I found some sad news:

SEAMCALL TDH_VP_INIT requires holding an exclusive lock on resource TDR when
its leaf_opcode.version > 0. So, when I use v1 (which is the current value in
upstream, for x2apic?) to test executing ioctl KVM_TDX_INIT_VCPU on different
vCPUs concurrently, the TDX_BUG_ON() following tdh_vp_init() will print error
"SEAMCALL TDH_VP_INIT failed: 0x8000020000000080".

If I switch to using the v0 version of TDH_VP_INIT, the contention is gone.

Note: this acquiring of an exclusive lock was not previously present in the
public repo https://github.com/intel/tdx-module.git, branch tdx_1.5.
(The branch has been force-updated to the new implementation now.)

[...]

> You're trying to be too "nice", just smack 'em with a big hammer. For all intents
> and purposes, the paths in question are fully serialized, there's no reason to try
> and allow anything remotely interesting to happen.

This big hammer looks good to me :)

> Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to prevent
> kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from interfering.

Nit: we need not worry about kvm_mmu_zap_all_fast(), since it only zaps
!mirror roots. The slots_lock should be for slots deletion.

> Doing that for a vCPU ioctl is a bit awkward, but not awful. E.g. we can abuse
> kvm_arch_vcpu_async_ioctl(). In hindsight, a more clever approach would have
> been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU fd. Oh
> well.
>
> Anyways, I think we need to avoid the "synchronous" ioctl path anyways, because
> taking kvm->slots_lock inside vcpu->mutex is gross. AFAICT it's not actively
> problematic today, but it feels like a deadlock waiting to happen.

Note: It looks like kvm_inhibit_apic_access_page() also takes kvm->slots_lock
inside vcpu->mutex.

> The other oddity I see is the handling of kvm_tdx->state. I don't see how this
> check in tdx_vcpu_create() is safe:
>
>	if (kvm_tdx->state != TD_STATE_INITIALIZED)
>		return -EIO;

Right, if tdh_vp_create() contends with tdh_mr_finalize(), KVM_BUG_ON() will
be triggered.

I previously overlooked the KVM_BUG_ON() after tdh_vp_create(), thinking that
it's ok to have it return an error once tdh_vp_create() is invoked after
tdh_mr_finalize().

...

> int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> {
>	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> @@ -3146,19 +3211,14 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
>	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
>		return -EINVAL;
>
> -	if (copy_from_user(&cmd, argp, sizeof(cmd)))
> -		return -EFAULT;
> -
> -	if (cmd.hw_error)
> -		return -EINVAL;
> +	ret = tdx_get_cmd(argp, &cmd);
> +	if (ret)
> +		return ret;
>
>	switch (cmd.id) {
>	case KVM_TDX_INIT_VCPU:
>		ret = tdx_vcpu_init(vcpu, &cmd);
>		break;

So, do we need to move KVM_TDX_INIT_VCPU to tdx_vcpu_async_ioctl() as well?

> -	case KVM_TDX_INIT_MEM_REGION:
> -		ret = tdx_vcpu_init_mem_region(vcpu, &cmd);
> -		break;
>	case KVM_TDX_GET_CPUID:
>		ret = tdx_vcpu_get_cpuid(vcpu, &cmd);
On Tue, Sep 02, 2025, Yan Zhao wrote:
> But during writing another concurrency test, I found some sad news:
>
> SEAMCALL TDH_VP_INIT requires holding an exclusive lock on resource TDR when
> its leaf_opcode.version > 0. So, when I use v1 (which is the current value in
> upstream, for x2apic?) to test executing ioctl KVM_TDX_INIT_VCPU on different
> vCPUs concurrently, the TDX_BUG_ON() following tdh_vp_init() will print error
> "SEAMCALL TDH_VP_INIT failed: 0x8000020000000080".
>
> If I switch to using the v0 version of TDH_VP_INIT, the contention is gone.

Uh, so that's exactly the type of breaking ABI change that isn't acceptable. If
it's really truly necessary, then we can probably handle the change in KVM since
TDX is so new, but generally speaking such changes simply must not happen.

> Note: this acquiring of an exclusive lock was not previously present in the
> public repo https://github.com/intel/tdx-module.git, branch tdx_1.5.
> (The branch has been force-updated to the new implementation now.)

Lovely.

> > Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to prevent
> > kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from interfering.
> Nit: we need not worry about kvm_mmu_zap_all_fast(), since it only zaps
> !mirror roots. The slots_lock should be for slots deletion.

Oof, I missed that. We should have required nx_huge_pages=never for tdx=1.
Probably too late for that now though :-/

> > Doing that for a vCPU ioctl is a bit awkward, but not awful. E.g. we can abuse
> > kvm_arch_vcpu_async_ioctl(). In hindsight, a more clever approach would have
> > been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU fd. Oh
> > well.
> >
> > Anyways, I think we need to avoid the "synchronous" ioctl path anyways, because
> > taking kvm->slots_lock inside vcpu->mutex is gross. AFAICT it's not actively
> > problematic today, but it feels like a deadlock waiting to happen.
> Note: It looks like kvm_inhibit_apic_access_page() also takes kvm->slots_lock
> inside vcpu->mutex.

Yikes. As does kvm_alloc_apic_access_page(), which is likely why I thought it
was ok to take slots_lock. But while kvm_alloc_apic_access_page() appears to be
called with vCPU scope, it's actually called from VM scope during vCPU creation.

I'll chew on this, though if someone has any ideas...

> So, do we need to move KVM_TDX_INIT_VCPU to tdx_vcpu_async_ioctl() as well?

If it's _just_ INIT_VCPU that can race (assuming the VM-scoped state transitions
take all vcpu->mutex locks, as proposed), then a dedicated mutex (spinlock?) would
suffice, and probably would be preferable. If INIT_VCPU needs to take kvm->lock
to protect against other races, then I guess the big hammer approach could work?
On Tue, 2025-09-02 at 10:04 -0700, Sean Christopherson wrote:
> On Tue, Sep 02, 2025, Yan Zhao wrote:
> > But during writing another concurrency test, I found some sad news:
> >
> > SEAMCALL TDH_VP_INIT requires holding an exclusive lock on resource TDR when
> > its leaf_opcode.version > 0. So, when I use v1 (which is the current value in
> > upstream, for x2apic?) to test executing ioctl KVM_TDX_INIT_VCPU on different
> > vCPUs concurrently, the TDX_BUG_ON() following tdh_vp_init() will print error
> > "SEAMCALL TDH_VP_INIT failed: 0x8000020000000080".
> >
> > If I switch to using the v0 version of TDH_VP_INIT, the contention is gone.
>
> Uh, so that's exactly the type of breaking ABI change that isn't acceptable. If
> it's really truly necessary, then we can probably handle the change in KVM since
> TDX is so new, but generally speaking such changes simply must not happen.
>
> > Note: this acquiring of an exclusive lock was not previously present in the
> > public repo https://github.com/intel/tdx-module.git, branch tdx_1.5.
> > (The branch has been force-updated to the new implementation now.)
>
> Lovely.

Hmm, this is exactly the kind of TDX module change we were just discussing
reporting as a bug. Not clear on the timing of the change as far as the landing
upstream. We could investigate whether we could fix it in the TDX module. This
probably falls into the category of not actually regressing any userspace. But
it does trigger a kernel warning, so warrants a fix, hmm.

[...]

> > So, do we need to move KVM_TDX_INIT_VCPU to tdx_vcpu_async_ioctl() as well?
>
> If it's _just_ INIT_VCPU that can race (assuming the VM-scoped state transitions
> take all vcpu->mutex locks, as proposed), then a dedicated mutex (spinlock?) would
> suffice, and probably would be preferable. If INIT_VCPU needs to take kvm->lock
> to protect against other races, then I guess the big hammer approach could work?

A duplicate TDR lock inside KVM or maybe even the arch/x86 side would make the
reasoning easier to follow. Like, you don't need to remember "we take
slots_lock/kvm_lock because of TDR lock", it's just 1:1. I hate the idea of
adding more locks, and have argued against it in the past. But are we just
fooling ourselves though? There are already more locks.

Another reason to duplicate (some) locks is that it gives the scheduler more
hints as far as waking up waiters, etc. The TDX module needs these locks to
protect itself, so those are required. But when we just do retry loops (or let
userspace do this), then we lose out on all of the locking goodness in the
kernel.

Anyway, just a strawman. I don't have any great ideas.
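
To make the strawman concrete, here is a hypothetical sketch of what a KVM-side
mirror of the TDR resource lock could look like; the tdr_lock field and the
wrapper are illustrative assumptions, not code posted in this thread:

	/* Hypothetical field, mirroring the TDX-Module's exclusive TDR lock. */
	struct kvm_tdx {
		/* ... existing members ... */
		struct mutex tdr_lock;
	};

	/*
	 * Serialize TDR-exclusive SEAMCALLs (TDH.MR.EXTEND, TDH.VP.CREATE,
	 * TDH.VP.INIT, ...) so contenders sleep on a kernel mutex instead of
	 * eating TDX_OPERAND_BUSY and retrying.
	 */
	static u64 tdx_mr_extend_locked(struct kvm_tdx *kvm_tdx, gpa_t gpa,
					u64 *entry, u64 *level_state)
	{
		guard(mutex)(&kvm_tdx->tdr_lock);

		return tdh_mr_extend(&kvm_tdx->td, gpa, entry, level_state);
	}

The 1:1 mapping Rick describes would mean one such kernel-side lock per
TDX-Module lock that userspace-reachable paths can contend.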
On Wed, Sep 03, 2025 at 08:18:10AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-09-02 at 10:04 -0700, Sean Christopherson wrote:
> > On Tue, Sep 02, 2025, Yan Zhao wrote:
[...]
> > If it's _just_ INIT_VCPU that can race (assuming the VM-scoped state transitions
> > take all vcpu->mutex locks, as proposed), then a dedicated mutex (spinlock?) would
> > suffice, and probably would be preferable. If INIT_VCPU needs to take kvm->lock
> > to protect against other races, then I guess the big hammer approach could work?

We need the big hammer approach as INIT_VCPU may race with vcpu_load() in
other vCPU ioctls.

> A duplicate TDR lock inside KVM or maybe even the arch/x86 side would make the
> reasoning easier to follow. Like, you don't need to remember "we take
> slots_lock/kvm_lock because of TDR lock", it's just 1:1. I hate the idea of
> adding more locks, and have argued against it in the past. But are we just
> fooling ourselves though? There are already more locks.
>
> Another reason to duplicate (some) locks is that it gives the scheduler more
> hints as far as waking up waiters, etc. The TDX module needs these locks to
> protect itself, so those are required. But when we just do retry loops (or let
> userspace do this), then we lose out on all of the locking goodness in the
> kernel.
>
> Anyway, just a strawman. I don't have any great ideas.

Do you think the following fix is good? It moves INIT_VCPU to
tdx_vcpu_async_ioctl() and uses the big hammer to make it impossible for it
to contend with other SEAMCALLs, just as for tdh_mr_extend(). It passed my
local concurrency test.

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 99381c8b4108..8a6f2feaab41 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3047,16 +3047,22 @@ static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 
 static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 {
-	u64 apic_base;
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	u64 apic_base;
 	int ret;
 
+	CLASS(tdx_vm_state_guard, guard)(vcpu->kvm);
+	if (IS_ERR(guard))
+		return PTR_ERR(guard);
+
 	if (cmd->flags)
 		return -EINVAL;
 
-	if (tdx->state != VCPU_TD_STATE_UNINITIALIZED)
+	if (!is_hkid_assigned(kvm_tdx) || tdx->state != VCPU_TD_STATE_UNINITIALIZED)
 		return -EINVAL;
 
+	vcpu_load(vcpu);
 	/*
 	 * TDX requires X2APIC, userspace is responsible for configuring guest
 	 * CPUID accordingly.
@@ -3075,6 +3081,7 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 		td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);
 
 	tdx->state = VCPU_TD_STATE_INITIALIZED;
+	vcpu_put(vcpu);
 
 	return 0;
 }
@@ -3228,10 +3235,18 @@ int tdx_vcpu_async_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	if (r)
 		return r;
 
-	if (cmd.id != KVM_TDX_INIT_MEM_REGION)
-		return -ENOIOCTLCMD;
-
-	return tdx_vcpu_init_mem_region(vcpu, &cmd);
+	switch (cmd.id) {
+	case KVM_TDX_INIT_MEM_REGION:
+		r = tdx_vcpu_init_mem_region(vcpu, &cmd);
+		break;
+	case KVM_TDX_INIT_VCPU:
+		r = tdx_vcpu_init(vcpu, &cmd);
+		break;
+	default:
+		r = -ENOIOCTLCMD;
+		break;
+	}
+	return r;
 }
 
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
@@ -3248,9 +3263,6 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 		return ret;
 
 	switch (cmd.id) {
-	case KVM_TDX_INIT_VCPU:
-		ret = tdx_vcpu_init(vcpu, &cmd);
-		break;
 	case KVM_TDX_GET_CPUID:
 		ret = tdx_vcpu_get_cpuid(vcpu, &cmd);
 		break;

Besides, to unblock testing the above code, I fixed a bug related to
vcpu_load() in the current TDX upstream code. The fixup patch is attached
below. Sean, please let me know if you want to include it in this series.
(It still lacks a Fixes tag as I haven't found out which commit is the best
fit.)

From 0d1ba6d60315e34bdb0e54acceb6e8dd0fbdb262 Mon Sep 17 00:00:00 2001
From: Yan Zhao <yan.y.zhao@intel.com>
Date: Tue, 2 Sep 2025 18:31:27 -0700
Subject: [PATCH 1/2] KVM: TDX: Fix list_add corruption during vcpu_load()

During vCPU creation, a vCPU may be destroyed immediately after
kvm_arch_vcpu_create() (e.g., due to a vCPU id conflict). However, the
vcpu_load() inside kvm_arch_vcpu_create() may have associated the vCPU to
a pCPU via "list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu))"
before invoking tdx_vcpu_free().

Though there's no need to invoke tdh_vp_flush() on the vCPU, failing to
dissociate the vCPU from the pCPU (i.e., "list_del(&to_tdx(vcpu)->cpu_list)")
will cause list corruption of the per-pCPU list associated_tdvcpus.

Then, a later list_add() during vcpu_load() would detect list corruption
and print a calltrace as shown below.

Dissociate a vCPU from its associated pCPU in tdx_vcpu_free() for the vCPUs
destroyed immediately after creation, which must be in
VCPU_TD_STATE_UNINITIALIZED state.

  kernel BUG at lib/list_debug.c:29!
  Oops: invalid opcode: 0000 [#2] SMP NOPTI
  RIP: 0010:__list_add_valid_or_report+0x82/0xd0

  Call Trace:
   <TASK>
   tdx_vcpu_load+0xa8/0x120
   vt_vcpu_load+0x25/0x30
   kvm_arch_vcpu_load+0x81/0x300
   vcpu_load+0x55/0x90
   kvm_arch_vcpu_create+0x24f/0x330
   kvm_vm_ioctl_create_vcpu+0x1b1/0x53
   ? trace_lock_release+0x6d/0xb0
   kvm_vm_ioctl+0xc2/0xa60
   ? tty_ldisc_deref+0x16/0x20
   ? debug_smp_processor_id+0x17/0x20
   ? __fget_files+0xc2/0x1b0
   ? debug_smp_processor_id+0x17/0x20
   ? rcu_is_watching+0x13/0x70
   ? __fget_files+0xc2/0x1b0
   ? trace_lock_release+0x6d/0xb0
   ? lock_release+0x14/0xd0
   ? __fget_files+0xcc/0x1b0
   __x64_sys_ioctl+0x9a/0xf0
   ? rcu_is_watching+0x13/0x70
   x64_sys_call+0x10ee/0x20d0
   do_syscall_64+0xc3/0x470
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e99d07611393..99381c8b4108 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -837,19 +837,51 @@ void tdx_vcpu_put(struct kvm_vcpu *vcpu)
 	tdx_prepare_switch_to_host(vcpu);
 }
 
+/*
+ * Life cycles for a TD and a vCPU:
+ * 1. KVM_CREATE_VM ioctl.
+ *    TD state is TD_STATE_UNINITIALIZED.
+ *    hkid is not assigned at this stage.
+ * 2. KVM_TDX_INIT_VM ioctl.
+ *    TD transitions to TD_STATE_INITIALIZED.
+ *    hkid is assigned after this stage.
+ * 3. KVM_CREATE_VCPU ioctl. (only when TD is TD_STATE_INITIALIZED).
+ *    3.1 tdx_vcpu_create() transitions vCPU state to VCPU_TD_STATE_UNINITIALIZED.
+ *    3.2 vcpu_load() and vcpu_put() in kvm_arch_vcpu_create().
+ *    3.3 (conditional) if any error encountered after kvm_arch_vcpu_create()
+ *        kvm_arch_vcpu_destroy() --> tdx_vcpu_free().
+ * 4. KVM_TDX_INIT_VCPU ioctl.
+ *    tdx_vcpu_init() transitions vCPU state to VCPU_TD_STATE_INITIALIZED.
+ *    vCPU control structures are allocated at this stage.
+ * 5. kvm_destroy_vm().
+ *    5.1 tdx_mmu_release_hkid(): (1) tdh_vp_flush(), disassociates all vCPUs.
+ *                                (2) puts hkid to !assigned state.
+ *    5.2 kvm_destroy_vcpus() --> tdx_vcpu_free():
+ *        transitions vCPU to VCPU_TD_STATE_UNINITIALIZED state.
+ *    5.3 tdx_vm_destroy()
+ *        transitions TD to TD_STATE_UNINITIALIZED state.
+ *
+ * tdx_vcpu_free() can be invoked only at 3.3 or 5.2.
+ * - If at 3.3, hkid is still assigned, but the vCPU must be in
+ *   VCPU_TD_STATE_UNINITIALIZED state.
+ * - If at 5.2, hkid must be !assigned and all vCPUs must be in
+ *   VCPU_TD_STATE_INITIALIZED state and have been dissociated.
+ */
 void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
 	int i;
 
+	if (vcpu->cpu != -1) {
+		KVM_BUG_ON(tdx->state == VCPU_TD_STATE_INITIALIZED, vcpu->kvm);
+		tdx_disassociate_vp(vcpu);
+		return;
+	}
+
 	/*
 	 * It is not possible to reclaim pages while hkid is assigned. It might
-	 * be assigned if:
-	 * 1. the TD VM is being destroyed but freeing hkid failed, in which
-	 *    case the pages are leaked
-	 * 2. TD VCPU creation failed and this on the error path, in which case
-	 *    there is nothing to do anyway
+	 * be assigned if the TD VM is being destroyed but freeing hkid failed,
+	 * in which case the pages are leaked.
 	 */
 	if (is_hkid_assigned(kvm_tdx))
 		return;
-- 
2.43.0
> From 0d1ba6d60315e34bdb0e54acceb6e8dd0fbdb262 Mon Sep 17 00:00:00 2001
> From: Yan Zhao <yan.y.zhao@intel.com>
> Date: Tue, 2 Sep 2025 18:31:27 -0700
> Subject: [PATCH 1/2] KVM: TDX: Fix list_add corruption during vcpu_load()
>
[...]
>
> Fixes: d789fa6efac9 ("KVM: TDX: Handle vCPU dissociation")
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
[...]
>  void tdx_vcpu_free(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>  	struct vcpu_tdx *tdx = to_tdx(vcpu);
>  	int i;
>
> +	if (vcpu->cpu != -1) {
> +		KVM_BUG_ON(tdx->state == VCPU_TD_STATE_INITIALIZED, vcpu->kvm);
> +		tdx_disassociate_vp(vcpu);

Sorry, I should use "tdx_flush_vp_on_cpu(vcpu);" here to ensure the
list_del() is running on vcpu->cpu with local irq disabled.

> +		return;
> +	}
>
>  	/*
>  	 * It is not possible to reclaim pages while hkid is assigned. It might
> -	 * be assigned if:
> -	 * 1. the TD VM is being destroyed but freeing hkid failed, in which
> -	 *    case the pages are leaked
> -	 * 2. TD VCPU creation failed and this on the error path, in which case
> -	 *    there is nothing to do anyway
> +	 * be assigned if the TD VM is being destroyed but freeing hkid failed,
> +	 * in which case the pages are leaked.
>  	 */
>  	if (is_hkid_assigned(kvm_tdx))
>  		return;
> --
> 2.43.0
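
Folding that self-correction in, the early-return hunk would read as follows
(a sketch of the amended hunk for clarity, not a posted v2):

	if (vcpu->cpu != -1) {
		KVM_BUG_ON(tdx->state == VCPU_TD_STATE_INITIALIZED, vcpu->kvm);
		/*
		 * Flush and dissociate via an IPI to the associated pCPU, so
		 * that list_del() runs on vcpu->cpu with local IRQs disabled.
		 */
		tdx_flush_vp_on_cpu(vcpu);
		return;
	}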
On Fri, 2025-08-29 at 13:11 -0700, Sean Christopherson wrote:
> > I guess the two approaches could be to make KVM_TDX_INIT_MEM_REGION more
> > robust,
>
> This. First and foremost, KVM's ordering and locking rules need to be
> explicit (ideally documented, but at the very least apparent in the code),
> *especially* when the locking (or lack thereof) impacts userspace. Even if
> effectively relying on the TDX-module to provide ordering "works", it's all
> but impossible to follow.
>
> And it doesn't truly work, as everything in the TDX-Module is a trylock, and
> that in turn prevents KVM from asserting success. Sometimes KVM has a better
> option than to rely on hardware to detect failure, but it really should be a
> last resort, because not being able to expect success makes debugging no fun.
> Even worse, it bleeds hard-to-document, specific ordering requirements into
> userspace, e.g. in this case, it sounds like userspace can't do _anything_ on
> vCPUs while doing KVM_TDX_INIT_MEM_REGION. Which might not be a burden for
> userspace, but oof is it nasty from an ABI perspective.

I could see that. I didn't think of the below.

> > or prevent the contention. For the latter case:
> > tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()
> > ...I think we could just take slots_lock during KVM_TDX_INIT_VCPU and
> > KVM_TDX_GET_CPUID.
> >
> > For tdh_vp_flush() the vcpu_load() in kvm_arch_vcpu_ioctl() could be hard to
> > handle.
> >
> > So I'd think maybe to look towards making KVM_TDX_INIT_MEM_REGION more
> > robust, which would mean the eventual solution wouldn't have ABI concerns by
> > later blocking things that used to be allowed.
> >
> > Maybe having kvm_tdp_mmu_map_private_pfn() return success for spurious
> > faults is enough. But this is all for a case that userspace isn't expected
> > to actually hit, so seems like something that could be kicked down the road
> > easily.
>
> You're trying to be too "nice", just smack 'em with a big hammer. For all
> intents and purposes, the paths in question are fully serialized, there's no
> reason to try and allow anything remotely interesting to happen.
>
> Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to
> prevent kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from
> interfering.
>
> Doing that for a vCPU ioctl is a bit awkward, but not awful. E.g. we can
> abuse kvm_arch_vcpu_async_ioctl(). In hindsight, a more clever approach would
> have been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU
> fd. Oh well.

Yea.

> Anyways, I think we need to avoid the "synchronous" ioctl path anyways,
> because taking kvm->slots_lock inside vcpu->mutex is gross. AFAICT it's not
> actively problematic today, but it feels like a deadlock waiting to happen.
>
> The other oddity I see is the handling of kvm_tdx->state. I don't see how
> this check in tdx_vcpu_create() is safe:
>
>	if (kvm_tdx->state != TD_STATE_INITIALIZED)
>		return -EIO;
>
> kvm_arch_vcpu_create() runs without any locks held, and so TDX effectively
> has the same bug that SEV intra-host migration had, where an in-flight vCPU
> creation could race with a VM-wide state transition (see commit ecf371f8b02d
> ("KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is
> in-flight")). To fix that, kvm->lock needs to be taken and KVM needs to
> verify there's no in-flight vCPU creation, e.g. so that a vCPU doesn't pop
> up and contend a TDX-Module lock.
>
> We can even define a fancy new CLASS to handle the lock+check => unlock logic
> with guard()-like syntax:
>
>	CLASS(tdx_vm_state_guard, guard)(kvm);
>	if (IS_ERR(guard))
>		return PTR_ERR(guard);
>
> IIUC, with all of those locks, KVM can KVM_BUG_ON() both TDH_MEM_PAGE_ADD and
> TDH_MR_EXTEND, with no exceptions given for -EBUSY. Attached patches are very
> lightly tested as usual and need to be chunked up, but seem to do what I want.

Ok, the direction seems clear. The patch has an issue, need to debug.
On Fri, 2025-08-29 at 15:39 -0700, Rick Edgecombe wrote:
[...]
> Ok, the direction seems clear. The patch has an issue, need to debug.

Just this:

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c595d9cb6dcd..e99d07611393 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2809,7 +2809,7 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 
 static int tdx_get_cmd(void __user *argp, struct kvm_tdx_cmd *cmd)
 {
-	if (copy_from_user(cmd, argp, sizeof(cmd)))
+	if (copy_from_user(cmd, argp, sizeof(*cmd)))
 		return -EFAULT;
 
 	if (cmd->hw_error)
On Fri, Aug 29, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-08-29 at 15:39 -0700, Rick Edgecombe wrote:
> > Ok, the direction seems clear. The patch has an issue, need to debug.
>
> Just this:
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index c595d9cb6dcd..e99d07611393 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2809,7 +2809,7 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>
>  static int tdx_get_cmd(void __user *argp, struct kvm_tdx_cmd *cmd)
>  {
> -	if (copy_from_user(cmd, argp, sizeof(cmd)))
> +	if (copy_from_user(cmd, argp, sizeof(*cmd)))

LOL, it's always some mundane detail!

>  		return -EFAULT;
>
>  	if (cmd->hw_error)