From: Yan Zhao <yan.y.zhao@intel.com>
To: pbonzini@redhat.com, seanjc@google.com
Cc: reinette.chatre@intel.com, rick.p.edgecombe@intel.com,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Yan Zhao
Subject: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
Date: Mon, 19 May 2025 10:37:37 +0800
Message-ID: <20250519023737.30360-1-yan.y.zhao@intel.com>
In-Reply-To: <20250519023613.30329-1-yan.y.zhao@intel.com>
References: <20250519023613.30329-1-yan.y.zhao@intel.com>

Introduce a new return value RET_PF_RETRY_INVALID_SLOT to inform callers
of kvm_mmu_do_page_fault() that a fault retry is due to an invalid
memslot. This helps prevent deadlocks when a memslot is removed while
GPAs in that memslot are being pre-faulted, or while TDX locally retries
faults on private pages.

Take pre-faulting as an example. During the KVM_PRE_FAULT_MEMORY ioctl,
kvm->srcu is held across the pre-faulting of the entire range. On x86,
kvm_arch_vcpu_pre_fault_memory() invokes kvm_tdp_map_page(), which
retries kvm_mmu_do_page_fault() as long as the return value is
RET_PF_RETRY.

If a memslot is deleted while KVM_PRE_FAULT_MEMORY is in flight, then
once kvm_invalidate_memslot() marks the slot as invalid and makes it
visible via rcu_assign_pointer() in kvm_swap_active_memslots(),
kvm_mmu_do_page_fault() encounters the invalid slot and returns
RET_PF_RETRY. kvm_tdp_map_page() then retries without releasing the SRCU
lock, while synchronize_srcu_expedited() in kvm_swap_active_memslots()
blocks waiting for kvm_vcpu_pre_fault_memory() to release that lock,
leading to a deadlock.

  "slot deleting" thread                  "prefault" thread
  -----------------------------           ----------------------
                                          (A) srcu_read_lock();
  invalid_slot->flags |= KVM_MEMSLOT_INVALID;
  rcu_assign_pointer();
                                          (B) kvm_tdp_map_page();
                                              do {
                                                  r = kvm_mmu_do_page_fault();
  (C) synchronize_srcu_expedited();
                                              } while (r == RET_PF_RETRY);
                                          (D) srcu_read_unlock();

As shown in the diagram, (C) waits for (D). However, (B) keeps finding an
invalid slot before (C) completes, so (B) retries forever and (D) is
never reached.

The local retry code in TDX's EPT violation handler has a similar
problem: a deadlock can occur when faulting a private GFN in a slot that
is concurrently being removed.

To resolve the deadlock, introduce a new return value
RET_PF_RETRY_INVALID_SLOT and make kvm_mmu_do_page_fault() return it
instead of RET_PF_RETRY when it encounters an invalid memslot. This
prevents endless retries in kvm_tdp_map_page() and
tdx_handle_ept_violation(), allowing the SRCU lock to be released and
slot removal to proceed.

Since all callers of kvm_tdp_map_page(), i.e.
kvm_arch_vcpu_pre_fault_memory() and tdx_gmem_post_populate(), are on the
pre-fault path, treat RET_PF_RETRY_INVALID_SLOT the same as
RET_PF_EMULATE and return -ENOENT from kvm_tdp_map_page(), so that
userspace is made aware of the slot removal.

Returning RET_PF_RETRY_INVALID_SLOT from kvm_mmu_do_page_fault() does not
affect kvm_mmu_page_fault() or kvm_arch_async_page_ready(): their callers
either only check whether the return value is > 0 (to re-enter the vCPU
and retry) or ignore the return value entirely.

Reported-by: Reinette Chatre
Signed-off-by: Yan Zhao
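
To make the failure mode concrete, here is a minimal standalone sketch,
not kernel code: every name below (SKETCH_PF_*, fake_page_fault(),
map_page()) is an illustrative placeholder for the retry loop described
above, where the caller spins on "retry" without ever dropping its SRCU
read lock.

/*
 * Standalone sketch, not kernel code; all names are placeholders.
 * It models why a dedicated "invalid slot" return value is needed to
 * break the caller's retry loop.
 */
#include <stdio.h>

enum {
	SKETCH_PF_RETRY,		/* transient condition, try again */
	SKETCH_PF_RETRY_INVALID_SLOT,	/* slot is being removed, give up */
	SKETCH_PF_FIXED,		/* mapping installed */
};

/* Pretend fault handler: the slot goes invalid after two attempts. */
static int fake_page_fault(int attempt)
{
	return attempt < 2 ? SKETCH_PF_RETRY : SKETCH_PF_RETRY_INVALID_SLOT;
}

static int map_page(void)
{
	int r, attempt = 0;

	do {
		/*
		 * If an invalid slot only ever produced SKETCH_PF_RETRY,
		 * this loop would never exit and the SRCU read side would
		 * never be released for synchronize_srcu_expedited().
		 */
		r = fake_page_fault(attempt++);
	} while (r == SKETCH_PF_RETRY);

	return r == SKETCH_PF_RETRY_INVALID_SLOT ? -1 /* think -ENOENT */ : 0;
}

int main(void)
{
	printf("map_page() = %d\n", map_page());
	return 0;
}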
---
 arch/x86/kvm/mmu/mmu.c          | 3 ++-
 arch/x86/kvm/mmu/mmu_internal.h | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cbc84c6abc2e..3331e1e1aa69 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4599,7 +4599,7 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.
 	 */
 	if (slot->flags & KVM_MEMSLOT_INVALID)
-		return RET_PF_RETRY;
+		return RET_PF_RETRY_INVALID_SLOT;
 
 	if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
 		/*
@@ -4879,6 +4879,7 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 		return 0;
 
 	case RET_PF_EMULATE:
+	case RET_PF_RETRY_INVALID_SLOT:
 		return -ENOENT;
 
 	case RET_PF_RETRY:
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index db8f33e4de62..1aa14a32225e 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -311,6 +311,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
  * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
+ * RET_PF_RETRY_INVALID_SLOT: Let CPU fault again on the address due to slot
+ *			      with flag KVM_MEMSLOT_INVALID.
  *
  * Any names added to this enum should be exported to userspace for use in
  * tracepoints via TRACE_DEFINE_ENUM() in mmutrace.h
@@ -326,6 +328,7 @@ enum {
 	RET_PF_INVALID,
 	RET_PF_FIXED,
 	RET_PF_SPURIOUS,
+	RET_PF_RETRY_INVALID_SLOT,
 };
 
 /*
-- 
2.43.2
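
As noted above, with this change userspace sees -ENOENT from
KVM_PRE_FAULT_MEMORY when the target slot goes away. Below is a rough
userspace-side sketch of consuming that result; it is not part of this
series, prefault_range(), the vcpu_fd parameter, and the error policy are
assumptions, and it needs a uapi header new enough to define
KVM_PRE_FAULT_MEMORY.

#include <errno.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Retry the ioctl the same way the selftest below does, then treat
 * ENOENT as "slot gone" rather than as a fatal error.
 */
static int prefault_range(int vcpu_fd, __u64 gpa, __u64 size)
{
	struct kvm_pre_fault_memory range = {
		.gpa = gpa,
		.size = size,
	};
	int ret, err;

	do {
		ret = ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range);
		err = errno;
	} while (ret >= 0 ? range.size : err == EINTR);

	if (ret < 0 && err == ENOENT) {
		/* The memslot was removed (or never existed): caller policy. */
		fprintf(stderr, "prefault: no slot behind GPA 0x%llx\n",
			(unsigned long long)gpa);
		return -ENOENT;
	}
	return ret < 0 ? -err : 0;
}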

From: Yan Zhao <yan.y.zhao@intel.com>
To: pbonzini@redhat.com, seanjc@google.com
Cc: reinette.chatre@intel.com, rick.p.edgecombe@intel.com,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Yan Zhao
Subject: [PATCH 2/2] KVM: selftests: Test prefault memory with concurrent memslot removal
Date: Mon, 19 May 2025 10:38:15 +0800
Message-ID: <20250519023815.30384-1-yan.y.zhao@intel.com>
In-Reply-To: <20250519023613.30329-1-yan.y.zhao@intel.com>
References: <20250519023613.30329-1-yan.y.zhao@intel.com>

Add a new test case to pre_fault_memory_test that runs
vm_mem_region_delete() concurrently with the KVM_PRE_FAULT_MEMORY ioctl.
Both of them should complete without hanging.

Signed-off-by: Yan Zhao
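
The interesting part of the test is the two-flag handshake that lines up
the prefault ioctl with the slot deletion. Here is a minimal standalone
model of that handshake, not the selftest itself: C11 atomics stand in
for the selftest's READ_ONCE()/WRITE_ONCE() helpers, and the printf
calls stand in for the ioctl and the memslot deletion.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_bool delete_thread_ready;	/* worker is up and spinning */
static atomic_bool prefault_ready;	/* main is about to issue the ioctl */

static void *remove_slot_worker(void *arg)
{
	atomic_store(&delete_thread_ready, true);
	while (!atomic_load(&prefault_ready))
		;	/* spin until the prefault side is ready to race */
	puts("worker: deleting the memslot now");
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, remove_slot_worker, NULL);
	while (!atomic_load(&delete_thread_ready))
		;	/* don't start the ioctl before the worker is up */
	atomic_store(&prefault_ready, true);
	puts("main: issuing KVM_PRE_FAULT_MEMORY here");
	pthread_join(t, NULL);
	return 0;
}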
---
 .../selftests/kvm/pre_fault_memory_test.c | 82 +++++++++++++++----
 1 file changed, 67 insertions(+), 15 deletions(-)

diff --git a/tools/testing/selftests/kvm/pre_fault_memory_test.c b/tools/testing/selftests/kvm/pre_fault_memory_test.c
index 0350a8896a2f..c82dfc033a7b 100644
--- a/tools/testing/selftests/kvm/pre_fault_memory_test.c
+++ b/tools/testing/selftests/kvm/pre_fault_memory_test.c
@@ -10,12 +10,16 @@
 #include <test_util.h>
 #include <kvm_util.h>
 #include <processor.h>
+#include <pthread.h>
 
 /* Arbitrarily chosen values */
 #define TEST_SIZE		(SZ_2M + PAGE_SIZE)
 #define TEST_NPAGES		(TEST_SIZE / PAGE_SIZE)
 #define TEST_SLOT		10
 
+static bool prefault_ready;
+static bool delete_thread_ready;
+
 static void guest_code(uint64_t base_gpa)
 {
 	volatile uint64_t val __used;
@@ -30,16 +34,41 @@ static void guest_code(uint64_t base_gpa)
 	GUEST_DONE();
 }
 
-static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 gpa, u64 size,
-			     u64 left)
+static void *remove_slot_worker(void *data)
+{
+	struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
+
+	WRITE_ONCE(delete_thread_ready, true);
+
+	while (!READ_ONCE(prefault_ready))
+		cpu_relax();
+
+	vm_mem_region_delete(vcpu->vm, TEST_SLOT);
+
+	WRITE_ONCE(delete_thread_ready, false);
+	return NULL;
+}
+
+static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 base_gpa, u64 offset,
+			     u64 size, u64 left, bool private, bool remove_slot)
 {
 	struct kvm_pre_fault_memory range = {
-		.gpa = gpa,
+		.gpa = base_gpa + offset,
 		.size = size,
 		.flags = 0,
 	};
 	u64 prev;
 	int ret, save_errno;
+	pthread_t remove_thread;
+
+	if (remove_slot) {
+		pthread_create(&remove_thread, NULL, remove_slot_worker, vcpu);
+
+		while (!READ_ONCE(delete_thread_ready))
+			cpu_relax();
+
+		WRITE_ONCE(prefault_ready, true);
+	}
 
 	do {
 		prev = range.size;
@@ -51,16 +80,35 @@ static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 gpa, u64 size,
 			    ret < 0 ? "failure" : "success");
 	} while (ret >= 0 ? range.size : save_errno == EINTR);
 
-	TEST_ASSERT(range.size == left,
-		    "Completed with %lld bytes left, expected %" PRId64,
-		    range.size, left);
-
-	if (left == 0)
-		__TEST_ASSERT_VM_VCPU_IOCTL(!ret, "KVM_PRE_FAULT_MEMORY", ret, vcpu->vm);
-	else
-		/* No memory slot causes RET_PF_EMULATE. it results in -ENOENT. */
-		__TEST_ASSERT_VM_VCPU_IOCTL(ret && save_errno == ENOENT,
+	if (remove_slot) {
+		/*
+		 * ENOENT is expected if slot removal is performed earlier or
+		 * during KVM_PRE_FAULT_MEMORY;
+		 * On rare condition, ret could be 0 if KVM_PRE_FAULT_MEMORY
+		 * completes earlier than slot removal.
+		 */
+		__TEST_ASSERT_VM_VCPU_IOCTL((ret && save_errno == ENOENT) || !ret,
 					    "KVM_PRE_FAULT_MEMORY", ret, vcpu->vm);
+
+		pthread_join(remove_thread, NULL);
+		WRITE_ONCE(prefault_ready, false);
+
+		vm_userspace_mem_region_add(vcpu->vm, VM_MEM_SRC_ANONYMOUS,
+					    base_gpa, TEST_SLOT, TEST_NPAGES,
+					    private ? KVM_MEM_GUEST_MEMFD : 0);
+	} else {
+		TEST_ASSERT(range.size == left,
+			    "Completed with %lld bytes left, expected %" PRId64,
+			    range.size, left);
+
+		if (left == 0)
+			__TEST_ASSERT_VM_VCPU_IOCTL(!ret, "KVM_PRE_FAULT_MEMORY",
+						    ret, vcpu->vm);
+		else
+			/* No memory slot causes RET_PF_EMULATE. it results in -ENOENT. */
+			__TEST_ASSERT_VM_VCPU_IOCTL(ret && save_errno == ENOENT,
+						    "KVM_PRE_FAULT_MEMORY", ret, vcpu->vm);
+	}
 }
 
 static void __test_pre_fault_memory(unsigned long vm_type, bool private)
@@ -97,9 +145,13 @@ static void __test_pre_fault_memory(unsigned long vm_type, bool private)
 
 	if (private)
 		vm_mem_set_private(vm, guest_test_phys_mem, TEST_SIZE);
-	pre_fault_memory(vcpu, guest_test_phys_mem, SZ_2M, 0);
-	pre_fault_memory(vcpu, guest_test_phys_mem + SZ_2M, PAGE_SIZE * 2, PAGE_SIZE);
-	pre_fault_memory(vcpu, guest_test_phys_mem + TEST_SIZE, PAGE_SIZE, PAGE_SIZE);
+
+	pre_fault_memory(vcpu, guest_test_phys_mem, 0, SZ_2M, 0, private, true);
+	pre_fault_memory(vcpu, guest_test_phys_mem, 0, SZ_2M, 0, private, false);
+	pre_fault_memory(vcpu, guest_test_phys_mem, SZ_2M, PAGE_SIZE * 2, PAGE_SIZE,
+			 private, false);
+	pre_fault_memory(vcpu, guest_test_phys_mem, TEST_SIZE, PAGE_SIZE, PAGE_SIZE,
+			 private, false);
 
 	vcpu_args_set(vcpu, 1, guest_test_virt_mem);
 	vcpu_run(vcpu);
-- 
2.43.2