From: Yan Zhao <yan.y.zhao@intel.com>
To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, rick.p.edgecombe@intel.com,
	kai.huang@intel.com, adrian.hunter@intel.com, reinette.chatre@intel.com,
	xiaoyao.li@intel.com, tony.lindgren@intel.com, binbin.wu@linux.intel.com,
	dmatlack@google.com, isaku.yamahata@intel.com, ira.weiny@intel.com,
	michael.roth@amd.com, vannapurve@google.com, david@redhat.com,
	ackerleytng@google.com, tabba@google.com, chao.p.peng@intel.com,
	Yan Zhao <yan.y.zhao@intel.com>
Subject: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
Date: Thu, 3 Jul 2025 14:26:41 +0800
Message-ID: <20250703062641.3247-1-yan.y.zhao@intel.com>
Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
to open code populating the initial memory region into the mirror page
table and adding the region to the S-EPT.

Background
===
Sean initially suggested that TDX populate the initial memory region in a
4-step way [1]. Paolo refactored guest_memfd and introduced the
kvm_gmem_populate() interface [2] to help TDX populate the initial memory
region.

tdx_vcpu_init_mem_region
  guard(mutex)(&kvm->slots_lock)
  kvm_gmem_populate
    filemap_invalidate_lock(file->f_mapping)
    __kvm_gmem_get_pfn        // 1. get private PFN
    post_populate             // tdx_gmem_post_populate
      get_user_pages_fast     // 2. get source page
      kvm_tdp_map_page        // 3. map private PFN to mirror root
      tdh_mem_page_add        // 4. add private PFN to S-EPT and copy
                              //    source page to it

kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its filemap
invalidate lock also helps ensure the private PFN remains valid when
tdh_mem_page_add() is invoked in TDX's post_populate hook.

Though TDX does not need the folio preparation code, kvm_gmem_populate()
helps share common code between SEV-SNP and TDX.

Problem
===
(1) In Michael's series "KVM: gmem: 2MB THP support and preparedness
tracking changes" [4], kvm_gmem_get_pfn() was modified to rely on the
filemap invalidation lock for protecting its preparedness tracking.
Similarly, the in-place conversion version of the guest_memfd series by
Ackerley also requires kvm_gmem_get_pfn() to acquire the filemap
invalidation lock [5].

kvm_gmem_get_pfn
  filemap_invalidate_lock_shared(file_inode(file)->i_mapping);

However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which
is in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on
the filemap invalidation lock.

(2) Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
resulting in the following lock sequence in tdx_vcpu_init_mem_region():

  filemap invalidation lock --> mm->mmap_lock

However, in future code, the shared filemap invalidation lock will be held
in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:

  mm->mmap_lock --> filemap invalidation lock

This creates an AB-BA deadlock issue.

These two issues are still present in Michael Roth's code [7], [8].

Proposal
===
To prevent the deadlock and the AB-BA issue, this patch enables TDX to
populate the initial memory region independently of kvm_gmem_populate().
The revised sequence in tdx_vcpu_init_mem_region() is as follows:

tdx_vcpu_init_mem_region
  guard(mutex)(&kvm->slots_lock)
  tdx_init_mem_populate
    get_user_pages_fast         // 1. get source page
    kvm_tdp_map_page            // 2. map private PFN to mirror root
    read_lock(&kvm->mmu_lock);
    kvm_tdp_mmu_gpa_is_mapped   // 3. check if the GPA is mapped in the
                                //    mirror root and return the mapped
                                //    private PFN
    tdh_mem_page_add            // 4. add private PFN to S-EPT and copy
                                //    source page to it
    read_unlock(&kvm->mmu_lock);

The original step 1 "get private PFN" is now integrated into the new step
3 "check if the GPA is mapped in the mirror root and return the mapped
private PFN". With the protection of slots_lock, the read mmu_lock ensures
that the private PFN added by tdh_mem_page_add() is the same one mapped in
the mirror page table. Additionally, before the TD state becomes
TD_STATE_RUNNABLE, the only permitted map level is 4KB, preventing any
potential merging or splitting in the mirror root under the read mmu_lock.
So, this approach should work for TDX. It still follows the spirit of
Sean's suggestion [1], where mapping the private PFN into the mirror root
and adding it to the S-EPT with initial content from the source page are
executed in separate steps.
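The userspace ABI is not touched by this rework: KVM_TDX_INIT_MEM_REGION
is still driven the same way by the VMM. For context only, a minimal
userspace sketch follows; it is not part of this patch and assumes the
struct kvm_tdx_cmd / struct kvm_tdx_init_mem_region layouts from the
current upstream arch/x86/include/uapi/asm/kvm.h, so double-check against
your headers.

  #include <stdbool.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Illustrative helper; error handling trimmed. */
  static int tdx_add_initial_memory(int vcpu_fd, void *src, __u64 gpa,
                                    __u64 nr_pages, bool measure)
  {
          /* Host virtual source buffer, guest physical destination. */
          struct kvm_tdx_init_mem_region region = {
                  .source_addr = (__u64)(unsigned long)src,
                  .gpa = gpa,
                  .nr_pages = nr_pages,
          };
          struct kvm_tdx_cmd cmd = {
                  .id = KVM_TDX_INIT_MEM_REGION,
                  .flags = measure ? KVM_TDX_MEASURE_MEMORY_REGION : 0,
                  .data = (__u64)(unsigned long)&region,
          };

          /* Issued on the vCPU fd, before KVM_TDX_FINALIZE_VM. */
          return ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
  }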
Discussions
===
The introduction of kvm_gmem_populate() was intended to make it usable by
both TDX and SEV-SNP [3], which is why Paolo provided the vendor hook
post_populate for both.

a) TDX keeps using kvm_gmem_populate().
   Pros:
   - keeps the status quo
   - shares common code between SEV-SNP and TDX, though TDX does not need
     to prepare folios.
   Cons:
   - we need to explore solutions to the locking issues, e.g. the proposal
     at [11].
   - the PFN is faulted in twice for each GFN: once in __kvm_gmem_get_pfn()
     and again in kvm_gmem_get_pfn().

b) Michael suggested introducing some variant of
   kvm_tdp_map_page()/kvm_mmu_do_page_fault() to avoid invoking
   kvm_gmem_get_pfn() in the kvm_gmem_populate() path [10].
   Pros:
   - TDX can still invoke kvm_gmem_populate() and share common code
     between SEV-SNP and TDX.
   Cons:
   - only TDX needs this variant.
   - it can't fix the second AB-BA lock issue.

c) The change in this patch.
   Pros:
   - greater flexibility; simplifies the implementation for both SEV-SNP
     and TDX.
   Cons:
   - undermines the purpose of sharing common code; kvm_gmem_populate()
     may only be usable by SEV-SNP in the future.

Link: https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com [1]
Link: https://lore.kernel.org/lkml/20240404185034.3184582-10-pbonzini@redhat.com [2]
Link: https://lore.kernel.org/lkml/20240404185034.3184582-1-pbonzini@redhat.com [3]
Link: https://lore.kernel.org/lkml/20241212063635.712877-4-michael.roth@amd.com [4]
Link: https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.1747264138.git.ackerleytng@google.com [5]
Link: https://lore.kernel.org/all/20250430165655.605595-9-tabba@google.com [6]
Link: https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2 [7]
Link: https://lore.kernel.org/kvm/20250613005400.3694904-1-michael.roth@amd.com [8]
Link: https://lore.kernel.org/lkml/20250613151939.z5ztzrtibr6xatql@amd.com [9]
Link: https://lore.kernel.org/lkml/20250613180418.bo4vqveigxsq2ouu@amd.com [10]
Link: https://lore.kernel.org/lkml/aErK25Oo5VJna40z@yzhao56-desk.sh.intel.com [11]
Suggested-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Acked-by: Vishal Annapurve <vannapurve@google.com>
Tested-by: Vishal Annapurve <vannapurve@google.com>
---
This is an RFC patch. Will split it later.
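Not part of the patch, purely for reference: problem (2) above is the
classic AB-BA pattern. A self-contained userspace sketch of it, with two
pthread mutexes standing in for the filemap invalidation lock and
mm->mmap_lock (names and paths are illustrative only, not the kernel code):

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  /* lock_a ~ filemap invalidation lock, lock_b ~ mm->mmap_lock */
  static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

  /* Old init-mem path: takes A, then may take B via get_user_pages_fast(). */
  static void *populate_path(void *arg)
  {
          pthread_mutex_lock(&lock_a);
          usleep(1000);                 /* widen the race window */
          pthread_mutex_lock(&lock_b);
          pthread_mutex_unlock(&lock_b);
          pthread_mutex_unlock(&lock_a);
          return NULL;
  }

  /* Future shared-fault path: entered with B held, then takes A. */
  static void *fault_shared_path(void *arg)
  {
          pthread_mutex_lock(&lock_b);
          usleep(1000);
          pthread_mutex_lock(&lock_a);
          pthread_mutex_unlock(&lock_a);
          pthread_mutex_unlock(&lock_b);
          return NULL;
  }

  int main(void)
  {
          pthread_t t1, t2;

          /*
           * A->B in one thread and B->A in the other can deadlock.
           * Dropping kvm_gmem_populate() from the TDX init path removes
           * the A->B ordering entirely.
           */
          pthread_create(&t1, NULL, populate_path, NULL);
          pthread_create(&t2, NULL, fault_shared_path, NULL);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          printf("no deadlock on this run\n");
          return 0;
  }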
---
 arch/x86/kvm/mmu.h         |  3 +-
 arch/x86/kvm/mmu/tdp_mmu.c |  6 ++-
 arch/x86/kvm/vmx/tdx.c     | 96 ++++++++++++++------------------------
 3 files changed, 42 insertions(+), 63 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b4b6860ab971..b122255c7d4e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -257,7 +257,8 @@ extern bool tdp_mmu_enabled;
 #define tdp_mmu_enabled false
 #endif
 
-bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
+bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, gpa_t gpa, int level,
+			       kvm_pfn_t *pfn);
 int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f3d7229b2c1..bb95c95f6531 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1934,7 +1934,8 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 	return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, root);
 }
 
-bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
+bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, gpa_t gpa, int level,
+			       kvm_pfn_t *pfn)
 {
 	struct kvm *kvm = vcpu->kvm;
 	bool is_direct = kvm_is_addr_direct(kvm, gpa);
@@ -1947,10 +1948,11 @@ bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
 	rcu_read_lock();
 	leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, root_to_sp(root));
 	rcu_read_unlock();
-	if (leaf < 0)
+	if (leaf < 0 || leaf != level)
 		return false;
 
 	spte = sptes[leaf];
+	*pfn = spte_to_pfn(spte);
 	return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
 }
 EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..f3c2bb3554c3 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1521,9 +1521,9 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 }
 
 /*
- * KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to map guest pages; the
- * callback tdx_gmem_post_populate() then maps pages into private memory.
- * through the a seamcall TDH.MEM.PAGE.ADD(). The SEAMCALL also requires the
+ * KVM_TDX_INIT_MEM_REGION calls tdx_init_mem_populate() to first map guest
+ * pages into mirror page table and then maps pages into private memory through
+ * the a SEAMCALL TDH.MEM.PAGE.ADD(). The SEAMCALL also requires the
  * private EPT structures for the page to have been built before, which is
  * done via kvm_tdp_map_page(). nr_premapped counts the number of pages that
  * were added to the EPT structures but not added with TDH.MEM.PAGE.ADD().
@@ -3047,23 +3047,17 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	WARN_ON_ONCE(init_event);
 }
 
-struct tdx_gmem_post_populate_arg {
-	struct kvm_vcpu *vcpu;
-	__u32 flags;
-};
-
-static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
-				  void __user *src, int order, void *_arg)
+static int tdx_init_mem_populate(struct kvm_vcpu *vcpu, gpa_t gpa,
+				 void __user *src, __u32 flags)
 {
 	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
-	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-	struct tdx_gmem_post_populate_arg *arg = _arg;
-	struct kvm_vcpu *vcpu = arg->vcpu;
-	gpa_t gpa = gfn_to_gpa(gfn);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct kvm *kvm = vcpu->kvm;
 	u8 level = PG_LEVEL_4K;
 	struct page *src_page;
 	int ret, i;
 	u64 err, entry, level_state;
+	kvm_pfn_t pfn;
 
 	/*
 	 * Get the source page if it has been faulted in. Return failure if the
@@ -3079,38 +3073,33 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	if (ret < 0)
 		goto out;
 
-	/*
-	 * The private mem cannot be zapped after kvm_tdp_map_page()
-	 * because all paths are covered by slots_lock and the
-	 * filemap invalidate lock. Check that they are indeed enough.
-	 */
-	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
-		scoped_guard(read_lock, &kvm->mmu_lock) {
-			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) {
-				ret = -EIO;
-				goto out;
-			}
-		}
-	}
+	KVM_BUG_ON(level != PG_LEVEL_4K, kvm);
 
-	ret = 0;
-	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
-			       src_page, &entry, &level_state);
-	if (err) {
-		ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
-		goto out;
-	}
+	scoped_guard(read_lock, &kvm->mmu_lock) {
+		if (!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa, level, &pfn)) {
+			ret = -EIO;
+			goto out;
+		}
 
-	if (!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
-		atomic64_dec(&kvm_tdx->nr_premapped);
+		ret = 0;
+		err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
+				       src_page, &entry, &level_state);
+		if (err) {
+			ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
+			goto out;
+		}
 
-	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
-		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
-			err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
-					    &level_state);
-			if (err) {
-				ret = -EIO;
-				break;
+		if (!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
+			atomic64_dec(&kvm_tdx->nr_premapped);
+
+		if (flags & KVM_TDX_MEASURE_MEMORY_REGION) {
+			for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+				err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
						    &level_state);
+				if (err) {
+					ret = -EIO;
+					break;
+				}
 			}
 		}
 	}
@@ -3126,8 +3115,6 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct kvm_tdx_init_mem_region region;
-	struct tdx_gmem_post_populate_arg arg;
-	long gmem_ret;
 	int ret;
 
 	if (tdx->state != VCPU_TD_STATE_INITIALIZED)
@@ -3160,22 +3147,11 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
 			break;
 		}
 
-		arg = (struct tdx_gmem_post_populate_arg) {
-			.vcpu = vcpu,
-			.flags = cmd->flags,
-		};
-		gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
-					     u64_to_user_ptr(region.source_addr),
-					     1, tdx_gmem_post_populate, &arg);
-		if (gmem_ret < 0) {
-			ret = gmem_ret;
-			break;
-		}
-
-		if (gmem_ret != 1) {
-			ret = -EIO;
+		ret = tdx_init_mem_populate(vcpu, region.gpa,
+					    u64_to_user_ptr(region.source_addr),
+					    cmd->flags);
+		if (ret)
 			break;
-		}
 
 		region.source_addr += PAGE_SIZE;
 		region.gpa += PAGE_SIZE;
-- 
2.43.2