From: Yan Zhao <yan.y.zhao@intel.com>
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org,
    rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org,
    tabba@google.com, ackerleytng@google.com, michael.roth@amd.com,
    david@kernel.org, vannapurve@google.com, sagis@google.com,
    vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com,
    pgonda@google.com, fan.du@intel.com, jun.miao@intel.com,
    francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com,
    isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com,
    binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com,
    yan.y.zhao@intel.com
Subject: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
Date: Tue, 6 Jan 2026 18:18:26 +0800
Message-ID: <20260106101826.24870-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>
Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.

The SEAMCALL TDH_MEM_PAGE_AUG currently supports adding physical memory
to the S-EPT up to 2MB in size.

While keeping the "level" parameter in the tdh_mem_page_aug() wrapper to
allow callers to specify the physical memory size, introduce the
parameters "folio" and "start_idx" to specify the physical memory
starting from the page at "start_idx" within the "folio". The specified
physical memory must be fully contained within a single folio.

Invoke tdx_clflush_page() for each 4KB segment of the physical memory
being added. tdx_clflush_page() performs CLFLUSH operations
conservatively to prevent dirty cache lines from being written back
later and corrupting TD memory.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- nth_page() --> folio_page(). (Kai, Dave)
- Rebased on top of DPAMT v4.

RFC v2:
- Refine patch log. (Rick)
- Removed the level checking. (Kirill, Chao Gao)
- Use "folio" and "start_idx" rather than "page".
- Return TDX_OPERAND_INVALID if the specified physical memory is not
  contained within a single folio.
- Use PTE_SHIFT to replace the 9 in "1 << (level * 9)". (Kirill)
- Use C99-style definition of variables inside a loop. (Nikolay Borisov)

RFC v1:
- Rebased to new tdh_mem_page_aug() with "struct page *" as param.
- Check folio, folio_page_idx.
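Note (illustration only, not part of this patch): a minimal, runnable
sketch of the size math the reworked wrapper relies on. PTE_SHIFT is 9
(512 entries per x86 page-table level), so each S-EPT level multiplies
the 4KB-page count by 512.

#include <stdio.h>

#define PTE_SHIFT 9	/* 512 entries per x86 page-table level */

int main(void)
{
	/* npages = 1 << (level * PTE_SHIFT), as in tdh_mem_page_aug() */
	for (int level = 0; level <= 2; level++) {
		unsigned long npages = 1UL << (level * PTE_SHIFT);

		/* level 0: 4KB, level 1: 2MB, level 2: 1GB */
		printf("level %d -> %lu x 4KB pages (%lu KB)\n",
		       level, npages, npages * 4);
	}
	return 0;
}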
---
 arch/x86/include/asm/tdx.h  |  3 ++-
 arch/x86/kvm/vmx/tdx.c      |  5 +++--
 arch/x86/virt/vmx/tdx/tdx.c | 13 ++++++++++---
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8c0c548f9735..f92850789193 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -235,7 +235,8 @@ u64 tdh_mng_addcx(struct tdx_td *td, struct page *tdcs_page);
 u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, struct page *page, struct page *source, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mem_sept_add(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page);
-u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2);
+u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio,
+		     unsigned long start_idx, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mem_range_block(struct tdx_td *td, u64 gpa, int level, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mng_key_config(struct tdx_td *td);
 u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 98ff84bc83f2..2f03c51515b9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1679,12 +1679,13 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct page *page = pfn_to_page(pfn);
+	struct folio *folio = page_folio(page);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 entry, level_state;
 	u64 err;
 
-	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
-
+	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, folio,
+			       folio_page_idx(folio, page), &entry, &level_state);
 	if (unlikely(IS_TDX_OPERAND_BUSY(err)))
 		return -EBUSY;
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b0b33f606c11..41ce18619ffc 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1743,16 +1743,23 @@ u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page)
 }
 EXPORT_SYMBOL_GPL(tdh_vp_addcx);
 
-u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2)
+u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio,
+		     unsigned long start_idx, u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
 		.rcx = gpa | level,
 		.rdx = tdx_tdr_pa(td),
-		.r8 = page_to_phys(page),
+		.r8 = page_to_phys(folio_page(folio, start_idx)),
 	};
+	unsigned long npages = 1 << (level * PTE_SHIFT);
 	u64 ret;
 
-	tdx_clflush_page(page);
+	if (start_idx + npages > folio_nr_pages(folio))
+		return TDX_OPERAND_INVALID;
+
+	for (int i = 0; i < npages; i++)
+		tdx_clflush_page(folio_page(folio, start_idx + i));
+
 	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
 
 	*ext_err1 = args.rcx;
-- 
2.43.2

From: Yan Zhao <yan.y.zhao@intel.com>
To: pbonzini@redhat.com, seanjc@google.com
Subject: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
Date: Tue, 6 Jan 2026 18:18:49 +0800
Message-ID: <20260106101849.24889-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>
From: Xiaoyao Li <xiaoyao.li@intel.com>

Introduce SEAMCALL wrapper tdh_mem_page_demote() to invoke
TDH_MEM_PAGE_DEMOTE, which splits a 2MB or a 1GB mapping in the S-EPT
into 512 4KB or 2MB mappings, respectively.

SEAMCALL TDH_MEM_PAGE_DEMOTE walks the S-EPT to locate the huge mapping
to split and adds a new S-EPT page to hold the 512 smaller mappings.
Parameters "gpa" and "level" specify the huge mapping to split, and
parameter "new_sept_page" specifies the 4KB page to be added as the
S-EPT page. Invoke tdx_clflush_page() before adding the new S-EPT page,
conservatively, to prevent dirty cache lines from being written back
later and corrupting TD memory.

tdh_mem_page_demote() may fail, e.g., due to an S-EPT walk error.
Callers must check the function's return value and can retrieve the
extended error info from the output parameters "ext_err1" and
"ext_err2".

The TDX module has many internal locks. To avoid staying in SEAM mode
for too long, SEAMCALLs return a BUSY error code to the kernel instead
of spinning on the locks. Depending on the specific SEAMCALL, the caller
may need to handle this error in specific ways (e.g., retry). Therefore,
return the SEAMCALL error code directly to the caller without attempting
to handle it in the core kernel.

Enable tdh_mem_page_demote() only on TDX modules that support the
feature TDX_FEATURES0.ENHANCE_DEMOTE_INTERRUPTIBILITY, which does not
return error TDX_INTERRUPTED_RESTARTABLE on basic TDX (i.e., without TD
partitioning) [2]. This is because error TDX_INTERRUPTED_RESTARTABLE is
difficult to handle. The TDX module provides no guaranteed maximum retry
count to ensure forward progress of the demotion. Interrupt storms could
then result in a DoS if the host simply retries endlessly on
TDX_INTERRUPTED_RESTARTABLE. Disabling interrupts before invoking the
SEAMCALL also doesn't work because NMIs can also trigger
TDX_INTERRUPTED_RESTARTABLE. Therefore, the tradeoff for basic TDX is to
disable the TDX_INTERRUPTED_RESTARTABLE error, given the reasonable
execution time for demotion [1].

Link: https://lore.kernel.org/kvm/99f5585d759328db973403be0713f68e492b492a.camel@intel.com [1]
Link: https://lore.kernel.org/all/fbf04b09f13bc2ce004ac97ee9c1f2c965f44fdf.camel@intel.com [2]
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Use a var name that clearly tells that the page is used as a page
  table page. (Binbin)
- Check if the TDX module supports the feature
  ENHANCE_DEMOTE_INTERRUPTIBILITY. (Kai)

RFC v2:
- Refine the patch log. (Rick)
- Do not handle TDX_INTERRUPTED_RESTARTABLE as the new TDX modules in
  planning do not check interrupts for basic TDX.

RFC v1:
- Rebased and split patch. Updated patch log.
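Note (hypothetical caller sketch, not part of this patch): one way a
caller might cope with the BUSY behavior described above, assuming the
IS_TDX_OPERAND_BUSY() predicate already used in KVM's TDX code. A later
patch in this series instead kicks off vCPUs so that the second attempt
is guaranteed to succeed.

static u64 demote_with_one_retry(struct tdx_td *td, u64 gpa, int level,
				 struct page *new_sept_page,
				 u64 *ext_err1, u64 *ext_err2)
{
	u64 err = tdh_mem_page_demote(td, gpa, level, new_sept_page,
				      ext_err1, ext_err2);

	/* Retry once if the TDX module bailed out due to lock contention. */
	if (IS_TDX_OPERAND_BUSY(err))
		err = tdh_mem_page_demote(td, gpa, level, new_sept_page,
					  ext_err1, ext_err2);
	return err;
}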
---
 arch/x86/include/asm/tdx.h  |  8 ++++++++
 arch/x86/virt/vmx/tdx/tdx.c | 24 ++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 3 files changed, 33 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f92850789193..d1891e099d42 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -15,6 +15,7 @@
 /* Bit definitions of TDX_FEATURES0 metadata field */
 #define TDX_FEATURES0_NO_RBP_MOD BIT_ULL(18)
 #define TDX_FEATURES0_DYNAMIC_PAMT BIT_ULL(36)
+#define TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY BIT_ULL(51)
 
 #ifndef __ASSEMBLER__
 
@@ -140,6 +141,11 @@ static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
 	return sysinfo->features.tdx_features0 & TDX_FEATURES0_DYNAMIC_PAMT;
 }
 
+static inline bool tdx_supports_demote_nointerrupt(const struct tdx_sys_info *sysinfo)
+{
+	return sysinfo->features.tdx_features0 & TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY;
+}
+
 void tdx_quirk_reset_page(struct page *page);
 
 int tdx_guest_keyid_alloc(void);
@@ -242,6 +248,8 @@ u64 tdh_mng_key_config(struct tdx_td *td);
 u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
 u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
 u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
+			u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_finalize(struct tdx_td *td);
 u64 tdh_vp_flush(struct tdx_vp *vp);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 41ce18619ffc..c3f4457816c8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1837,6 +1837,30 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
 }
 EXPORT_SYMBOL_GPL(tdh_mng_rd);
 
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
+			u64 *ext_err1, u64 *ext_err2)
+{
+	struct tdx_module_args args = {
+		.rcx = gpa | level,
+		.rdx = tdx_tdr_pa(td),
+		.r8 = page_to_phys(new_sept_page),
+	};
+	u64 ret;
+
+	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
+		return TDX_SW_ERROR;
+
+	/* Flush the new S-EPT page to be added */
+	tdx_clflush_page(new_sept_page);
+	ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
+
+	*ext_err1 = args.rcx;
+	*ext_err2 = args.rdx;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mem_page_demote);
+
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 096c78a1d438..a6c0fa53ece9 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -24,6 +24,7 @@
 #define TDH_MNG_KEY_CONFIG 8
 #define TDH_MNG_CREATE 9
 #define TDH_MNG_RD 11
+#define TDH_MEM_PAGE_DEMOTE 15
 #define TDH_MR_EXTEND 16
 #define TDH_MR_FINALIZE 17
 #define TDH_VP_FLUSH 18
-- 
2.43.2

From: Yan Zhao <yan.y.zhao@intel.com>
To: pbonzini@redhat.com, seanjc@google.com
Subject: [PATCH v3 03/24] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
Date: Tue, 6 Jan 2026 18:19:29 +0800
Message-ID: <20260106101929.24937-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>
After removing a TD's private page, the TDX module does not write back
and invalidate cache lines associated with the page and its keyID (i.e.,
the TD's guest keyID). The SEAMCALL wrapper tdh_phymem_page_wbinvd_hkid()
enables the caller to provide the TD's guest keyID and physical memory
address to invoke the SEAMCALL TDH_PHYMEM_PAGE_WBINVD to perform cache
line invalidation.

Enhance the SEAMCALL wrapper tdh_phymem_page_wbinvd_hkid() to support
cache line invalidation for huge pages by introducing the parameters
"folio", "start_idx", and "npages". These parameters specify the
physical memory starting from the page at "start_idx" within a "folio"
and spanning "npages" contiguous PFNs. Return TDX_OPERAND_INVALID if the
specified memory is not entirely contained within a single folio.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- nth_page() --> folio_page(). (Kai, Dave)
- Rebased on top of Sean's cleanup series.

RFC v2:
- Enhance tdh_phymem_page_wbinvd_hkid() to invalidate multiple pages
  directly, rather than looping within KVM, following Dave's suggestion:
  "Don't wrap the wrappers." (Rick)

RFC v1:
- Split patch.
- Added a helper tdx_wbinvd_page() in TDX, which accepts param
  "struct page *".
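Note (illustration only, not part of this patch): a caller-side sketch
of how the (folio, start_idx, npages) triple is derived for a 2MB
private page, assuming pfn and hkid come from the surrounding KVM
context as in tdx_sept_remove_private_spte() below.

	struct page *page = pfn_to_page(pfn);
	struct folio *folio = page_folio(page);
	unsigned long start_idx = folio_page_idx(folio, page);
	unsigned long npages = KVM_PAGES_PER_HPAGE(PG_LEVEL_2M);	/* 512 */
	u64 err;

	/* One TDH_PHYMEM_PAGE_WBINVD per 4KB page, stopping on error. */
	err = tdh_phymem_page_wbinvd_hkid(hkid, folio, start_idx, npages);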
---
 arch/x86/include/asm/tdx.h  |  4 ++--
 arch/x86/kvm/vmx/tdx.c      |  5 ++++-
 arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++++----
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d1891e099d42..7f72fd07f4e5 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -264,8 +264,8 @@ u64 tdh_mem_track(struct tdx_td *tdr);
 u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_phymem_cache_wb(bool resume);
 u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
-u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
-
+u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
+				unsigned long start_idx, unsigned long npages);
 void tdx_meminfo(struct seq_file *m);
 #else
 static inline void tdx_init(void) { }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2f03c51515b9..b369f90dbafa 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1857,6 +1857,7 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	struct page *page = pfn_to_page(spte_to_pfn(mirror_spte));
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct folio *folio = page_folio(page);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 err, entry, level_state;
 
@@ -1895,7 +1896,9 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_REMOVE, entry, level_state, kvm))
 		return;
 
-	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
+	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, folio,
+					  folio_page_idx(folio, page),
+					  KVM_PAGES_PER_HPAGE(level));
 	if (TDX_BUG_ON(err, TDH_PHYMEM_PAGE_WBINVD, kvm))
 		return;
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index c3f4457816c8..b57e00c71384 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -2046,13 +2046,24 @@ u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
 }
 EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
 
-u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
+u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
+				unsigned long start_idx, unsigned long npages)
 {
-	struct tdx_module_args args = {};
+	u64 err;
 
-	args.rcx = mk_keyed_paddr(hkid, page);
+	if (start_idx + npages > folio_nr_pages(folio))
+		return TDX_OPERAND_INVALID;
 
-	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
+	for (unsigned long i = 0; i < npages; i++) {
+		struct page *p = folio_page(folio, start_idx + i);
+		struct tdx_module_args args = {};
+
+		args.rcx = mk_keyed_paddr(hkid, p);
+		err = seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
+		if (err)
+			break;
+	}
+	return err;
 }
 EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
 
-- 
2.43.2

From: Yan Zhao <yan.y.zhao@intel.com>
To: pbonzini@redhat.com, seanjc@google.com
Subject: [PATCH v3 04/24] x86/tdx: Introduce tdx_quirk_reset_folio() to reset private huge pages
Date: Tue, 6 Jan 2026 18:19:54 +0800
Message-ID: <20260106101955.24967-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>
After removing or reclaiming a guest private page or a control page from
a TD, zero the physical page using movdir64b() to enable the kernel to
reuse the pages. This is needed for systems with the X86_BUG_TDX_PW_MCE
erratum.

Introduce the function tdx_quirk_reset_folio() to invoke
tdx_quirk_reset_paddr() to convert pages in a huge folio from private
back to normal. The pages start from the page at "start_idx" within a
"folio", spanning "npages" contiguous PFNs.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v3:
- Rebased to Sean's cleanup series.
- tdx_clear_folio() --> tdx_quirk_reset_folio().

RFC v2:
- Add tdx_clear_folio().
- Drop inner loop _tdx_clear_page() and move __mb() outside of the loop.
  (Rick)
- Use C99-style definition of variables inside a for loop.
- Note: [1] also changes tdx_clear_page(). RFC v2 is not based on [1]
  now.
  [1] https://lore.kernel.org/all/20250724130354.79392-2-adrian.hunter@intel.com

RFC v1:
- Split out, let tdx_clear_page() accept level.
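Note (hypothetical caller sketch, not part of this patch): resetting an
entire 2MB private range after its WBINVD completes, mirroring the use
in tdx_sept_remove_private_spte() below; "page" and "level" are assumed
from the surrounding context.

	struct folio *folio = page_folio(page);
	unsigned long start_idx = folio_page_idx(folio, page);

	/* 512 pages x 4KB = 2MB of MOVDIR64B zeroing at PG_LEVEL_2M */
	tdx_quirk_reset_folio(folio, start_idx, KVM_PAGES_PER_HPAGE(level));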
---
 arch/x86/include/asm/tdx.h  |  2 ++
 arch/x86/kvm/vmx/tdx.c      |  3 ++-
 arch/x86/virt/vmx/tdx/tdx.c | 11 +++++++++++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 7f72fd07f4e5..669dd6d99821 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -147,6 +147,8 @@ static inline bool tdx_supports_demote_nointerrupt(const struct tdx_sys_info *sy
 }
 
 void tdx_quirk_reset_page(struct page *page);
+void tdx_quirk_reset_folio(struct folio *folio, unsigned long start_idx,
+			   unsigned long npages);
 
 int tdx_guest_keyid_alloc(void);
 u32 tdx_get_nr_guest_keyids(void);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b369f90dbafa..5b499593edff 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1902,7 +1902,8 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (TDX_BUG_ON(err, TDH_PHYMEM_PAGE_WBINVD, kvm))
 		return;
 
-	tdx_quirk_reset_page(page);
+	tdx_quirk_reset_folio(folio, folio_page_idx(folio, page),
+			      KVM_PAGES_PER_HPAGE(level));
 	tdx_pamt_put(page);
 }
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b57e00c71384..20708f56b1a0 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -800,6 +800,17 @@ static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
 	mb();
 }
 
+void tdx_quirk_reset_folio(struct folio *folio, unsigned long start_idx,
+			   unsigned long npages)
+{
+	if (WARN_ON_ONCE(start_idx + npages > folio_nr_pages(folio)))
+		return;
+
+	tdx_quirk_reset_paddr(page_to_phys(folio_page(folio, start_idx)),
+			      npages << PAGE_SHIFT);
+}
+EXPORT_SYMBOL_GPL(tdx_quirk_reset_folio);
+
 void tdx_quirk_reset_page(struct page *page)
 {
 	tdx_quirk_reset_paddr(page_to_phys(page), PAGE_SIZE);
-- 
2.43.2

From: Yan Zhao <yan.y.zhao@intel.com>
To: pbonzini@redhat.com, seanjc@google.com
Subject: [PATCH v3 05/24] x86/virt/tdx: Enhance tdh_phymem_page_reclaim() to support huge pages
Date: Tue, 6 Jan 2026 18:20:09 +0800
Message-ID: <20260106102009.25006-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>
Enhance the SEAMCALL wrapper tdh_phymem_page_reclaim() to support huge
pages by introducing new parameters: "folio", "start_idx", and "npages".
These parameters specify the physical memory to be reclaimed, starting
from the page at "start_idx" within a folio and spanning "npages"
contiguous PFNs. The specified memory must be entirely contained within
a single folio. Return TDX_SW_ERROR if the size of the reclaimed memory
does not match the specified size.

On the KVM side, introduce tdx_reclaim_folio() to invoke
tdh_phymem_page_reclaim() for reclaiming huge guest private pages. The
"reset" parameter in tdx_reclaim_folio() specifies whether
tdx_quirk_reset_folio() should be subsequently invoked within
tdx_reclaim_folio().

To facilitate reclaiming of 4KB pages, keep the function
tdx_reclaim_page() and make it a helper for reclaiming normal TDX
control pages, and introduce a new helper tdx_reclaim_page_noreset() for
reclaiming the TDR page.

Opportunistically, rename rcx, rdx, r8 to tdx_pt, tdx_owner, tdx_size in
tdx_reclaim_folio() to improve readability.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Rebased to Sean's cleanup series. Dropped invoking tdx_reclaim_folio()
  in tdx_sept_remove_private_spte() because no reclaiming is required in
  that path. However, keep introducing tdx_reclaim_folio() as it will be
  needed when the patches removing guest private memory after releasing
  the HKID are merged.
- tdx_reclaim_page_noclear() --> tdx_reclaim_page_noreset(), and invoke
  tdx_quirk_reset_folio() instead in tdx_reclaim_folio() due to rebase.
- Check for a mismatch between the requested size and the reclaimed
  size, and return TDX_SW_ERROR only after a successful
  TDH_PHYMEM_PAGE_RECLAIM. (Binbin)

RFC v2:
- Introduce new params "folio", "start_idx" and "npages" to wrapper
  tdh_phymem_page_reclaim().
- Move the checking of the returned size from KVM to x86/virt and return
  an error.
- Rename tdx_reclaim_page() to tdx_reclaim_folio().
- Add two helper functions tdx_reclaim_page() and
  tdx_reclaim_page_noclear() to facilitate the reclaiming of 4KB pages.

RFC v1:
- Rebased and split patch.
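Note (illustration only, not part of this patch): the post-RECLAIM size
check introduced below can be read as follows. The TDX module reports
the reclaimed size class in *tdx_size (0 = 4KB, 1 = 2MB, 2 = 1GB), so
with PTE_SHIFT == 9 the implied page count must match the request.

	/* 1 << (0*9) = 1, 1 << (1*9) = 512, 1 << (2*9) = 262144 pages */
	if (!ret && npages != (1UL << (*tdx_size * PTE_SHIFT)))
		return TDX_SW_ERROR;	/* reclaimed size != requested size */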
---
 arch/x86/include/asm/tdx.h  |  3 ++-
 arch/x86/kvm/vmx/tdx.c      | 27 +++++++++++++++++----------
 arch/x86/virt/vmx/tdx/tdx.c | 12 ++++++++++--
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 669dd6d99821..abe484045132 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -261,7 +261,8 @@ u64 tdh_mng_init(struct tdx_td *td, u64 td_params, u64 *extended_err);
 u64 tdh_vp_init(struct tdx_vp *vp, u64 initial_rcx, u32 x2apicid);
 u64 tdh_vp_rd(struct tdx_vp *vp, u64 field, u64 *data);
 u64 tdh_vp_wr(struct tdx_vp *vp, u64 field, u64 data, u64 mask);
-u64 tdh_phymem_page_reclaim(struct page *page, u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size);
+u64 tdh_phymem_page_reclaim(struct folio *folio, unsigned long start_idx, unsigned long npages,
+			    u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size);
 u64 tdh_mem_track(struct tdx_td *tdr);
 u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_phymem_cache_wb(bool resume);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 5b499593edff..405afd2a56b7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -318,33 +318,40 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
 })
 
 /* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
-static int __tdx_reclaim_page(struct page *page)
+static int tdx_reclaim_folio(struct folio *folio, unsigned long start_idx,
+			     unsigned long npages, bool reset)
 {
-	u64 err, rcx, rdx, r8;
+	u64 err, tdx_pt, tdx_owner, tdx_size;
 
-	err = tdh_phymem_page_reclaim(page, &rcx, &rdx, &r8);
+	err = tdh_phymem_page_reclaim(folio, start_idx, npages, &tdx_pt,
+				      &tdx_owner, &tdx_size);
 
 	/*
 	 * No need to check for TDX_OPERAND_BUSY; all TD pages are freed
 	 * before the HKID is released and control pages have also been
 	 * released at this point, so there is no possibility of contention.
 	 */
-	if (TDX_BUG_ON_3(err, TDH_PHYMEM_PAGE_RECLAIM, rcx, rdx, r8, NULL))
+	if (TDX_BUG_ON_3(err, TDH_PHYMEM_PAGE_RECLAIM, tdx_pt, tdx_owner, tdx_size, NULL))
 		return -EIO;
 
+	if (reset)
+		tdx_quirk_reset_folio(folio, start_idx, npages);
 	return 0;
 }
 
 static int tdx_reclaim_page(struct page *page)
 {
-	int r;
+	struct folio *folio = page_folio(page);
 
-	r = __tdx_reclaim_page(page);
-	if (!r)
-		tdx_quirk_reset_page(page);
-	return r;
+	return tdx_reclaim_folio(folio, folio_page_idx(folio, page), 1, true);
 }
 
+static int tdx_reclaim_page_noreset(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	return tdx_reclaim_folio(folio, folio_page_idx(folio, page), 1, false);
+}
 
 /*
  * Reclaim the TD control page(s) which are crypto-protected by TDX guest's
@@ -583,7 +590,7 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
 	if (!kvm_tdx->td.tdr_page)
 		return;
 
-	if (__tdx_reclaim_page(kvm_tdx->td.tdr_page))
+	if (tdx_reclaim_page_noreset(kvm_tdx->td.tdr_page))
 		return;
 
 	/*
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 20708f56b1a0..c12665389b67 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1993,19 +1993,27 @@ EXPORT_SYMBOL_GPL(tdh_vp_init);
  * So despite the names, they must be interpted specially as described by the spec. Return
  * them only for error reporting purposes.
  */
-u64 tdh_phymem_page_reclaim(struct page *page, u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size)
+u64 tdh_phymem_page_reclaim(struct folio *folio, unsigned long start_idx,
+			    unsigned long npages, u64 *tdx_pt, u64 *tdx_owner,
+			    u64 *tdx_size)
 {
 	struct tdx_module_args args = {
-		.rcx = page_to_phys(page),
+		.rcx = page_to_phys(folio_page(folio, start_idx)),
 	};
 	u64 ret;
 
+	if (start_idx + npages > folio_nr_pages(folio))
+		return TDX_OPERAND_INVALID;
+
 	ret = seamcall_ret(TDH_PHYMEM_PAGE_RECLAIM, &args);
 
 	*tdx_pt = args.rcx;
 	*tdx_owner = args.rdx;
 	*tdx_size = args.r8;
 
+	if (!ret && npages != (1 << (*tdx_size) * PTE_SHIFT))
+		return TDX_SW_ERROR;
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(tdh_phymem_page_reclaim);
-- 
2.43.2

From: Yan Zhao <yan.y.zhao@intel.com>
To: pbonzini@redhat.com, seanjc@google.com
Subject: [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
Date: Tue, 6 Jan 2026 18:20:24 +0800
Message-ID: <20260106102024.25023-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>
From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>

Disallow page merging (huge page adjustment) for the mirror root by
utilizing disallowed_hugepage_adjust().

Make the mirror root check asymmetric with NX huge pages so as not to
litter the generic MMU code: invoke disallowed_hugepage_adjust() in
kvm_tdp_mmu_map() when necessary, specifically when KVM has mirrored TDP
or the NX huge page workaround is enabled. Check and reduce the
goal_level of a fault internally in disallowed_hugepage_adjust() when
the fault is for a mirror root and there's a shadow-present non-leaf
entry at the original goal_level.

Signed-off-by: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Check is_mirror_sp() in disallowed_hugepage_adjust() instead of
  passing in an is_mirror arg. (Rick)
- Check kvm_has_mirrored_tdp() in kvm_tdp_mmu_map() to determine whether
  to invoke disallowed_hugepage_adjust(). (Rick)

RFC v1:
- New patch.
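Note (illustration only, condensed from the diff below for readability):
after this patch, a huge mapping is refused when a smaller shadow-present
non-leaf entry already exists and either the NX workaround forbids
merging or the entry belongs to a mirror (TDX S-EPT) root.

	/* Condensed view of the check in disallowed_hugepage_adjust() */
	bool disallow_huge = cur_level == fault->goal_level &&
			     is_shadow_present_pte(spte) &&
			     !is_large_pte(spte) &&
			     (spte_to_child_sp(spte)->nx_huge_page_disallowed ||
			      is_mirror_sp(spte_to_child_sp(spte)));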
---
 arch/x86/kvm/mmu/mmu.c     | 3 ++-
 arch/x86/kvm/mmu/tdp_mmu.c | 4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d2c49d92d25d..b4f2e3ced716 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3418,7 +3418,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 	    cur_level == fault->goal_level &&
 	    is_shadow_present_pte(spte) &&
 	    !is_large_pte(spte) &&
-	    spte_to_child_sp(spte)->nx_huge_page_disallowed) {
+	    ((spte_to_child_sp(spte)->nx_huge_page_disallowed) ||
+	     is_mirror_sp(spte_to_child_sp(spte)))) {
 		/*
 		 * A small SPTE exists for this pfn, but FNAME(fetch),
 		 * direct_map(), or kvm_tdp_mmu_map() would like to create a
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9c26038f6b77..dfa56554f9e0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1267,6 +1267,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
 	int ret = RET_PF_RETRY;
+	bool hugepage_adjust_disallowed = fault->nx_huge_page_workaround_enabled ||
+					  kvm_has_mirrored_tdp(kvm);
 
 	KVM_MMU_WARN_ON(!root || root->role.invalid);
 
@@ -1279,7 +1281,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
 		int r;
 
-		if (fault->nx_huge_page_workaround_enabled)
+		if (hugepage_adjust_disallowed)
 			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
 
 		/*
-- 
2.43.2

From: Yan Zhao <yan.y.zhao@intel.com>
To: pbonzini@redhat.com, seanjc@google.com
Subject: [PATCH v3 07/24] KVM: x86/tdp_mmu: Introduce split_external_spte() under write mmu_lock
Date: Tue, 6 Jan 2026 18:20:40 +0800
Message-ID: <20260106102040.25041-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>
Introduce kvm_x86_ops.split_external_spte() and wrap it in a helper
function split_external_spte(). Invoke the helper function
split_external_spte() in tdp_mmu_set_spte() to propagate splitting
transitions from the mirror page table to the external page table under
write mmu_lock.

Introduce a new valid transition case for splitting and document all
valid transitions of the mirror page table under write mmu_lock in
tdp_mmu_set_spte().

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Rename split_external_spt() to split_external_spte().
- Pass in param "old_mirror_spte" to hook
  kvm_x86_ops.split_external_spte(). This aligns with the parameter
  change in hook kvm_x86_ops.set_external_spte() in Sean's cleanup
  series, and also allows future DPAMT patches to acquire the guest
  private PFN from the old mirror SPTE.
- Rename param "external_spt" to "new_external_spt" in hook
  kvm_x86_ops.split_external_spte() to indicate this is a new page table
  page for the external page table.
- Drop the declaration of get_external_spt() by moving
  split_external_spte() after get_external_spt() but before
  set_external_spte_present() and tdp_mmu_set_spte(). (Kai)
- split_external_spte --> split_external_spte(). (Kai)

RFC v2:
- Removed the KVM_BUG_ON() in split_external_spt(). (Rick)
- Add a comment for the KVM_BUG_ON() in tdp_mmu_set_spte(). (Rick)
- Use kvm_x86_call() instead of static_call(). (Binbin)

RFC v1:
- Split patch.
- Dropped invoking hook zap_private_spte and kvm_flush_remote_tlbs() in
  KVM MMU core.
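Note (illustration only, condensed from the diff below for readability):
the resulting mirror-SPTE dispatch in tdp_mmu_set_spte() handles exactly
two legal transitions under write mmu_lock and bugs the VM otherwise.

	if (is_mirror_sptep(sptep)) {
		if (!is_shadow_present_pte(new_spte))
			/* present (leaf or non-leaf) -> !present: unmap */
			remove_external_spte(kvm, gfn, old_spte, level);
		else if (is_last_spte(old_spte, level) &&
			 !is_last_spte(new_spte, level))
			/* present leaf -> present non-leaf: huge page split */
			split_external_spte(kvm, gfn, old_spte, new_spte, level);
		else
			/* any other transition is a KVM bug */
			KVM_BUG_ON(1, kvm);
	}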
- Rename param "external_spt" to "new_external_spt" in hook kvm_x86_ops.set_external_spte() to indicate this is a new page table page for the external page table. - Drop declaration of get_external_spt() by moving split_external_spte() after get_external_spt() but before set_external_spte_present() and tdp_mmu_set_spte(). (Kai) - split_external_spte --> split_external_spte() (Kai) RFC v2: - Removed the KVM_BUG_ON() in split_external_spt(). (Rick) - Add a comment for the KVM_BUG_ON() in tdp_mmu_set_spte(). (Rick) - Use kvm_x86_call() instead of static_call(). (Binbin) RFC v1: - Split patch. - Dropped invoking hook zap_private_spte and kvm_flush_remote_tlbs() in KVM MMU core. --- arch/x86/include/asm/kvm-x86-ops.h | 1 + arch/x86/include/asm/kvm_host.h | 4 ++++ arch/x86/kvm/mmu/tdp_mmu.c | 29 +++++++++++++++++++++++++---- 3 files changed, 30 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-= x86-ops.h index 58c5c9b082ca..84fa8689b45c 100644 --- a/arch/x86/include/asm/kvm-x86-ops.h +++ b/arch/x86/include/asm/kvm-x86-ops.h @@ -98,6 +98,7 @@ KVM_X86_OP_OPTIONAL(link_external_spt) KVM_X86_OP_OPTIONAL(set_external_spte) KVM_X86_OP_OPTIONAL(free_external_spt) KVM_X86_OP_OPTIONAL(remove_external_spte) +KVM_X86_OP_OPTIONAL(split_external_spte) KVM_X86_OP_OPTIONAL(alloc_external_fault_cache) KVM_X86_OP_OPTIONAL(topup_external_fault_cache) KVM_X86_OP_OPTIONAL(free_external_fault_cache) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_hos= t.h index 7818da148a8c..56089d6b9b51 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1848,6 +1848,10 @@ struct kvm_x86_ops { void (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level le= vel, u64 mirror_spte); =20 + /* Split a huge mapping into smaller mappings in external page table */ + int (*split_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level leve= l, + u64 old_mirror_spte, void *new_external_spt); + /* Allocation a pages from the external page cache. */ void *(*alloc_external_fault_cache)(struct kvm_vcpu *vcpu); =20 diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index dfa56554f9e0..977914b2627f 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -508,6 +508,19 @@ static void *get_external_spt(gfn_t gfn, u64 new_spte,= int level) return NULL; } =20 +static int split_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte, + u64 new_spte, int level) +{ + void *new_external_spt =3D get_external_spt(gfn, new_spte, level); + int ret; + + KVM_BUG_ON(!new_external_spt, kvm); + + ret =3D kvm_x86_call(split_external_spte)(kvm, gfn, level, old_spte, + new_external_spt); + return ret; +} + static int __must_check set_external_spte_present(struct kvm *kvm, tdp_pte= p_t sptep, gfn_t gfn, u64 old_spte, u64 new_spte, int level) @@ -758,12 +771,20 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_i= d, tdp_ptep_t sptep, handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false); =20 /* - * Users that do non-atomic setting of PTEs don't operate on mirror - * roots, so don't handle it and bug the VM if it's seen. + * Propagate changes of SPTE to the external page table under write + * mmu_lock. + * Current valid transitions: + * - present leaf to !present. + * - present non-leaf to !present. 
From nobody Sat Feb 7 09:58:55 2026
From: Yan Zhao
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v3 08/24] KVM: TDX: Enable huge page splitting under write mmu_lock
Date: Tue, 6 Jan 2026 18:20:55 +0800
Message-ID: <20260106102055.25058-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>

Implement kvm_x86_ops.split_external_spte() under TDX to enable huge page
splitting under write mmu_lock.

Invoke tdh_mem_range_block(), tdh_mem_track(), the kick-off of vCPUs, and
tdh_mem_page_demote() in sequence. All operations are performed under
kvm->mmu_lock held for writing, similar to those in page removal.

Even with kvm->mmu_lock held for writing, tdh_mem_page_demote() may still
contend with tdh_vp_enter() and potentially with the guest's S-EPT entry
operations. Therefore, kick off all other vCPUs and prevent tdh_vp_enter()
from being called on them to ensure that the second attempt succeeds.

Use KVM_BUG_ON() for any other unexpected errors.

Signed-off-by: Xiaoyao Li
Signed-off-by: Isaku Yamahata
Signed-off-by: Yan Zhao
---
v3:
- Rebased on top of Sean's cleanup series.
- Call out that UNBLOCK is not required after DEMOTE. (Kai)
- tdx_sept_split_private_spt() --> tdx_sept_split_private_spte().

RFC v2:
- Split out the code to handle the error TDX_INTERRUPTED_RESTARTABLE.
- Rebased to 6.16.0-rc6 (the way of defining TDX hooks changed).

RFC v1:
- Split patch for exclusive mmu_lock only.
- Invoke tdx_sept_zap_private_spte() and tdx_track() for splitting.
- Handled busy error of tdh_mem_page_demote() by kicking off vCPUs.
---
 arch/x86/kvm/vmx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 405afd2a56b7..b41793402769 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1914,6 +1914,45 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	tdx_pamt_put(page);
 }
 
+/*
+ * Split a 2MB huge mapping.
+ *
+ * Invoke "BLOCK + TRACK + kick off vCPUs (inside tdx_track())" since DEMOTE
+ * does not yet support the NON-BLOCKING-RESIZE feature. No UNBLOCK is
+ * needed after a successful DEMOTE.
+ *
+ * Under write mmu_lock, kick off all vCPUs (inside tdh_do_no_vcpus()) to ensure
+ * DEMOTE will succeed on the second invocation if the first invocation returns
+ * BUSY.
+ */
+static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				       u64 old_mirror_spte, void *new_private_spt)
+{
+	struct page *new_sept_page = virt_to_page(new_private_spt);
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	u64 err, entry, level_state;
+
+	if (KVM_BUG_ON(kvm_tdx->state != TD_STATE_RUNNABLE ||
+		       level != PG_LEVEL_2M, kvm))
+		return -EIO;
+
+	err = tdh_do_no_vcpus(tdh_mem_range_block, kvm, &kvm_tdx->td, gpa,
+			      tdx_level, &entry, &level_state);
+	if (TDX_BUG_ON_2(err, TDH_MEM_RANGE_BLOCK, entry, level_state, kvm))
+		return -EIO;
+
+	tdx_track(kvm);
+
+	err = tdh_do_no_vcpus(tdh_mem_page_demote, kvm, &kvm_tdx->td, gpa,
+			      tdx_level, new_sept_page, &entry, &level_state);
+	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_DEMOTE, entry, level_state, kvm))
+		return -EIO;
+
+	return 0;
+}
+
 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 			   int trig_mode, int vector)
 {
@@ -3672,6 +3711,7 @@ void __init tdx_hardware_setup(void)
 	vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
 	vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
 	vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
+	vt_x86_ops.split_external_spte = tdx_sept_split_private_spte;
 	vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
 	vt_x86_ops.alloc_external_fault_cache = tdx_alloc_external_fault_cache;
 	vt_x86_ops.topup_external_fault_cache = tdx_topup_external_fault_cache;
-- 
2.43.2
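The "succeed on the second attempt" contract above is easier to see in isolation. Below is a rough, hypothetical model of the retry behavior the patch relies on from tdh_do_no_vcpus(); it is a standalone sketch, not the kernel implementation, and fake_demote(), kick_all_vcpus(), and the BUSY constant are stand-ins.

#include <stdint.h>
#include <stdio.h>

#define OPERAND_BUSY 0x8000020000000000ull	/* illustrative value only */

static int vcpus_kicked;

static void kick_all_vcpus(void) { vcpus_kicked = 1; }

/* Stand-in SEAMCALL: busy until vCPUs can no longer contend. */
static uint64_t fake_demote(void *arg)
{
	(void)arg;
	return vcpus_kicked ? 0 : OPERAND_BUSY;
}

/* Rough model of the retry contract described in the commit message. */
static uint64_t do_no_vcpus(uint64_t (*seamcall)(void *), void *arg)
{
	uint64_t err = seamcall(arg);

	if (err == OPERAND_BUSY) {
		kick_all_vcpus();	/* no tdh_vp_enter() can contend now */
		err = seamcall(arg);	/* second attempt is expected to succeed */
	}
	return err;
}

int main(void)
{
	printf("err=%#llx\n", (unsigned long long)do_no_vcpus(fake_demote, NULL));
	return 0;
}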
From nobody Sat Feb 7 09:58:55 2026
From: Yan Zhao
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v3 09/24] KVM: x86: Reject splitting huge pages under shared mmu_lock in TDX
Date: Tue, 6 Jan 2026 18:21:08 +0800
Message-ID: <20260106102108.25074-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>

Allow propagating SPTE splitting changes from the mirror page table to the
external page table in the fault path under shared mmu_lock, while rejecting
such splitting requests in TDX's implementation of
kvm_x86_ops.split_external_spte().

Allow tdp_mmu_split_huge_page() to be invoked for the mirror page table in
the fault path by removing the KVM_BUG_ON() immediately before it.

set_external_spte_present() is invoked in the fault path under shared
mmu_lock to propagate transitions from the mirror page table to the external
page table when the target SPTE is present. Add "splitting" as a valid
transition case in set_external_spte_present() and invoke the helper
split_external_spte() to perform the propagation.

Pass the shared mmu_lock information to kvm_x86_ops.split_external_spte() and
reject the splitting request in TDX's implementation when under shared
mmu_lock. This is because TDX requires different handling for splitting under
shared versus exclusive mmu_lock: under shared mmu_lock, TDX cannot kick off
all vCPUs to avoid BUSY errors from DEMOTE.

Since the current TDX module (i.e., without the NON-BLOCKING-RESIZE feature)
requires BLOCK/TRACK/kicking off vCPUs to be invoked before each DEMOTE, if a
BUSY error occurs from DEMOTE, TDX must call UNBLOCK before returning the
error to the KVM MMU core to roll back the old SPTE and retry. However,
UNBLOCK itself may also fail due to contention.

Rejecting splitting of private huge pages under shared mmu_lock in TDX,
rather than using KVM_BUG_ON() in the KVM MMU core, allows splitting under
shared mmu_lock once the TDX module supports the NON-BLOCKING-RESIZE feature,
keeping the KVM MMU core framework stable across TDX module implementation
changes.

Signed-off-by: Yan Zhao
---
v3:
- Rebased on top of Sean's cleanup series.
- split_external_spte --> kvm_x86_ops.split_external_spte(). (Kai)

RFC v2:
- WARN_ON_ONCE() and return an error in tdx_sept_split_private_spt() if it's
  invoked under shared mmu_lock (rather than increasing the next fault's
  max_level in the current vCPU via tdx->violation_gfn_start/end and
  tdx->violation_request_level).
- TODO: Perform the real implementation of demote under shared mmu_lock when
  a new version of the TDX module supporting non-blocking demote is
  available.

RFC v1:
- New patch.
---
 arch/x86/include/asm/kvm_host.h |  3 +-
 arch/x86/kvm/mmu/tdp_mmu.c      | 51 +++++++++++++++++++++------------
 arch/x86/kvm/vmx/tdx.c          |  9 +++++-
 3 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 56089d6b9b51..315ffb23e9d8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1850,7 +1850,8 @@ struct kvm_x86_ops {
 
 	/* Split a huge mapping into smaller mappings in external page table */
 	int (*split_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-				   u64 old_mirror_spte, void *new_external_spt);
+				   u64 old_mirror_spte, void *new_external_spt,
+				   bool mmu_lock_shared);
 
 	/* Allocation a pages from the external page cache. */
 	void *(*alloc_external_fault_cache)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 977914b2627f..9b45ffb8585f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -509,7 +509,7 @@ static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
 }
 
 static int split_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
-			       u64 new_spte, int level)
+			       u64 new_spte, int level, bool shared)
 {
 	void *new_external_spt = get_external_spt(gfn, new_spte, level);
 	int ret;
@@ -517,7 +517,7 @@ static int split_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
 	KVM_BUG_ON(!new_external_spt, kvm);
 
 	ret = kvm_x86_call(split_external_spte)(kvm, gfn, level, old_spte,
-						new_external_spt);
+						new_external_spt, shared);
 	return ret;
 }
 
@@ -527,10 +527,20 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
 {
 	bool was_present = is_shadow_present_pte(old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
+	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	int ret = 0;
 
-	KVM_BUG_ON(was_present, kvm);
+	/*
+	 * The caller __tdp_mmu_set_spte_atomic() has ensured new_spte must be
+	 * present.
+	 *
+	 * Current valid transitions:
+	 * - leaf to non-leaf (demote)
+	 * - !present to present leaf
+	 * - !present to present non-leaf
+	 */
+	KVM_BUG_ON(!(!was_present || (was_leaf && !is_leaf)), kvm);
 
 	lockdep_assert_held(&kvm->mmu_lock);
 	/*
@@ -541,18 +551,24 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
 	if (!try_cmpxchg64(rcu_dereference(sptep), &old_spte, FROZEN_SPTE))
 		return -EBUSY;
 
-	/*
-	 * Use different call to either set up middle level
-	 * external page table, or leaf.
-	 */
-	if (is_leaf) {
-		ret = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte);
-	} else {
-		void *external_spt = get_external_spt(gfn, new_spte, level);
+	if (!was_present) {
+		/*
+		 * Use different call to either set up middle level external
+		 * page table, or leaf.
+		 */
+		if (is_leaf) {
+			ret = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte);
+		} else {
+			void *external_spt = get_external_spt(gfn, new_spte, level);
 
-		KVM_BUG_ON(!external_spt, kvm);
-		ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
+			KVM_BUG_ON(!external_spt, kvm);
+			ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
+		}
+	} else if (was_leaf && !is_leaf) {
+		/* splitting */
+		ret = split_external_spte(kvm, gfn, old_spte, new_spte, level, true);
 	}
+
 	if (ret)
 		__kvm_tdp_mmu_write_spte(sptep, old_spte);
 	else
@@ -782,7 +798,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 		if (!is_shadow_present_pte(new_spte))
 			remove_external_spte(kvm, gfn, old_spte, level);
 		else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
-			split_external_spte(kvm, gfn, old_spte, new_spte, level);
+			split_external_spte(kvm, gfn, old_spte, new_spte, level, false);
 		else
 			KVM_BUG_ON(1, kvm);
 	}
@@ -1331,13 +1347,10 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 	sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
 
-	if (is_shadow_present_pte(iter.old_spte)) {
-		/* Don't support large page for mirrored roots (TDX) */
-		KVM_BUG_ON(is_mirror_sptep(iter.sptep), vcpu->kvm);
+	if (is_shadow_present_pte(iter.old_spte))
 		r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
-	} else {
+	else
 		r = tdp_mmu_link_sp(kvm, &iter, sp, true);
-	}
 
 	/*
	 * Force the guest to retry if installing an upper level SPTE
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b41793402769..1e29722abb36 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1926,7 +1926,8 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
  * BUSY.
  */
 static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-				       u64 old_mirror_spte, void *new_private_spt)
+				       u64 old_mirror_spte, void *new_private_spt,
+				       bool mmu_lock_shared)
 {
 	struct page *new_sept_page = virt_to_page(new_private_spt);
 	int tdx_level = pg_level_to_tdx_sept_level(level);
@@ -1938,6 +1939,12 @@ static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level
 		       level != PG_LEVEL_2M, kvm))
 		return -EIO;
 
+	if (WARN_ON_ONCE(mmu_lock_shared)) {
+		pr_warn_once("Splitting of GFN %llx level %d under shared lock occurs when KVM does not support it yet\n",
+			     gfn, level);
+		return -EOPNOTSUPP;
+	}
+
 	err = tdh_do_no_vcpus(tdh_mem_range_block, kvm, &kvm_tdx->td, gpa,
 			      tdx_level, &entry, &level_state);
 	if (TDX_BUG_ON_2(err, TDH_MEM_RANGE_BLOCK, entry, level_state, kvm))
-- 
2.43.2
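The design choice here (reject at the vendor layer, let the core roll back) can be sketched independently of KVM. The following is a condensed, hypothetical model under the assumption that the core retries the fault after an error; vendor_split(), core_set_spte(), and the capability flag are invented names, not the actual KVM API.

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool vendor_supports_shared_split;	/* NON-BLOCKING-RESIZE absent */

static int vendor_split(bool shared)
{
	if (shared && !vendor_supports_shared_split)
		return -EOPNOTSUPP;	/* reject; don't bug the VM in the core */
	return 0;
}

static int core_set_spte(uint64_t *sptep, uint64_t old, uint64_t new, bool shared)
{
	*sptep = new;			/* tentatively install the split */
	int ret = vendor_split(shared);
	if (ret)
		*sptep = old;		/* roll back; the fault will be retried */
	return ret;
}

int main(void)
{
	uint64_t spte = 0x1000;
	int ret = core_set_spte(&spte, 0x1000, 0x2000, true);

	printf("ret=%d spte=%#llx\n", ret, (unsigned long long)spte);
	return 0;
}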
*/ static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg= _level level, - u64 old_mirror_spte, void *new_private_spt) + u64 old_mirror_spte, void *new_private_spt, + bool mmu_lock_shared) { struct page *new_sept_page =3D virt_to_page(new_private_spt); int tdx_level =3D pg_level_to_tdx_sept_level(level); @@ -1938,6 +1939,12 @@ static int tdx_sept_split_private_spte(struct kvm *k= vm, gfn_t gfn, enum pg_level level !=3D PG_LEVEL_2M, kvm)) return -EIO; =20 + if (WARN_ON_ONCE(mmu_lock_shared)) { + pr_warn_once("Splitting of GFN %llx level %d under shared lock occurs wh= en KVM does not support it yet\n", + gfn, level); + return -EOPNOTSUPP; + } + err =3D tdh_do_no_vcpus(tdh_mem_range_block, kvm, &kvm_tdx->td, gpa, tdx_level, &entry, &level_state); if (TDX_BUG_ON_2(err, TDH_MEM_RANGE_BLOCK, entry, level_state, kvm)) --=20 2.43.2 From nobody Sat Feb 7 09:58:55 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.15]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9119432571F; Tue, 6 Jan 2026 10:23:26 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.15 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1767695008; cv=none; b=DQpZ7nsAznTEe/ds8vwgGbRkGc3rxZwNXlPVURhw4iwn40FD8Qp+IsEw469iyADgb/Gj+dBEJ5l3FpGfnwt0AoIE90awITuGK3nZ0Dj7tzwoIbQErcOg5st8FRVTAFhRyUCGesSnbdKh3rFuBzAnZo9gPJiHBTdX3uxAaRKEj2w= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1767695008; c=relaxed/simple; bh=AsoC2HYHqOgGjBb063lZ0sV1MtNnqE9jiBqgR85GKho=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=TPzVFc9GppFwVTJ87Qkb6mYtH3ZmEUI9l3toTxUoAtxfVExcXd13HMCVOsRAhFja2SKyzHa1ap7fjGoS++Scabznw7cih6YeT/OnDcBb+tIkV7MdokHvdQnwGzsIOY3lGFs+zInafB0MkUEAGvmdc0oybtcCcxie63ToxOMXiTs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=bKayXe+X; arc=none smtp.client-ip=198.175.65.15 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="bKayXe+X" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1767695006; x=1799231006; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=AsoC2HYHqOgGjBb063lZ0sV1MtNnqE9jiBqgR85GKho=; b=bKayXe+XM4o+Hakl5jbDXTYAESPqO5EmjQzgIPCG0xfNFf0I1TUOuzP4 0i7yO4qE1NTYfglkKhPtZS4XZh5+krsA+qmGsFa6aDNz9lLqKsdFcy1iS Oo+S9Nvm/86kbQSUsDPY00vd35tx0/CF2v4lvrlBfxemR2F7n/hXp/+pM YWbSXZxcEXkpsaI6V54mMPuL2f2f4254tMN+lYDOEs3YAkCVArjhpeAvV Gwjh4JB+YXYZmIabpidKdKYkwNG1KtLgGvjO0F3u7wMzrEa6vaHoAv6k8 bnnCQHvic86zZ9VKWeSN6BtpyqE4w+HglTjYAb8oH84Wq5Q5x45q8AnWX g==; X-CSE-ConnectionGUID: ggixy/U3Qq2gPL0FzIS5Bg== X-CSE-MsgGUID: V1SMhHdJTkCxc3o/lZjEbg== X-IronPort-AV: E=McAfee;i="6800,10657,11662"; a="72689584" X-IronPort-AV: E=Sophos;i="6.21,204,1763452800"; d="scan'208";a="72689584" Received: from orviesa010.jf.intel.com ([10.64.159.150]) by orvoesa107.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jan 2026 02:23:26 -0800 X-CSE-ConnectionGUID: 
MHiVDWaFQASAzOZ61sRtjw== X-CSE-MsgGUID: K0q06bQJTji3abxnd2Q8wg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.21,204,1763452800"; d="scan'208";a="201847492" Received: from yzhao56-desk.sh.intel.com ([10.239.47.19]) by orviesa010-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jan 2026 02:23:20 -0800 From: Yan Zhao To: pbonzini@redhat.com, seanjc@google.com Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com Subject: [PATCH v3 10/24] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting Date: Tue, 6 Jan 2026 18:21:22 +0800 Message-ID: <20260106102122.25091-1-yan.y.zhao@intel.com> X-Mailer: git-send-email 2.43.2 In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com> References: <20260106101646.24809-1-yan.y.zhao@intel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Isaku Yamahata Enhance tdp_mmu_alloc_sp_for_split() to allocate a page table page for the external page table for splitting the mirror page table. Signed-off-by: Isaku Yamahata Co-developed-by: Yan Zhao Signed-off-by: Yan Zhao --- v3: - Removed unnecessary declaration of tdp_mmu_alloc_sp_for_split(). (Kai) - Fixed a typo in the patch log. (Kai) RFC v2: - NO change. RFC v1: - Rebased and simplified the code. 
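The allocation change follows the usual "allocate a pair, unwind on partial failure" idiom. A minimal standalone sketch of that idiom, with plain calloc()/free() standing in for the kernel page and slab allocators and invented type names:

#include <stdlib.h>

struct sp_like {
	void *spt;
	void *external_spt;
};

static struct sp_like *alloc_sp_for_split(int mirror)
{
	struct sp_like *sp = calloc(1, sizeof(*sp));
	if (!sp)
		return NULL;

	sp->spt = calloc(1, 4096);
	if (!sp->spt) {
		free(sp);
		return NULL;
	}

	if (mirror) {
		sp->external_spt = calloc(1, 4096);
		if (!sp->external_spt) {
			free(sp->spt);	/* unwind the earlier allocation */
			free(sp);
			return NULL;
		}
	}
	return sp;
}

int main(void)
{
	struct sp_like *sp = alloc_sp_for_split(1);

	if (sp) {
		free(sp->external_spt);
		free(sp->spt);
		free(sp);
	}
	return 0;
}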
From nobody Sat Feb 7 09:58:55 2026
From: Yan Zhao
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
Date: Tue, 6 Jan 2026 18:21:36 +0800
Message-ID: <20260106102136.25108-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>

Introduce kvm_split_cross_boundary_leafs() to split huge leaf entries that
cross the boundary of a specified range.

Splitting huge leaf entries that cross the boundary is essential before
zapping a specified range in the mirror root. It ensures that the subsequent
zap operation does not affect any GFNs outside the specified range, which is
crucial for the mirror root, as the private page table requires the guest's
ACCEPT operation after faulting back.

While the core of kvm_split_cross_boundary_leafs() leverages the main logic
of tdp_mmu_split_huge_pages_root(), the former only splits huge leaf entries
whose mapping ranges cross the specified range's boundary.

When splitting is necessary, kvm->mmu_lock may be temporarily released for
memory allocation, so returning -ENOMEM is possible. Since
tdp_mmu_split_huge_pages_root() is originally invoked by dirty-page-tracking
related functions that flush the TLB unconditionally at the end,
tdp_mmu_split_huge_pages_root() doesn't flush the TLB before temporarily
releasing mmu_lock.

Do not enhance tdp_mmu_split_huge_pages_root() to return split or flush
status for kvm_split_cross_boundary_leafs(), because the status could be
inaccurate when multiple threads try to split the same memory range
concurrently: if kvm_split_cross_boundary_leafs() returns split/flush as
false, that doesn't mean there were no splits in the specified range, since
splits could have occurred in other threads during the temporary release of
mmu_lock.

Therefore, callers of kvm_split_cross_boundary_leafs() need to determine
how/when to flush the TLB according to the use case:
- If the split is triggered in a fault path for TDX, the hardware shouldn't
  have cached the old huge translation, so there is no need to flush the TLB.
- If the split is triggered by zaps in guest_memfd punch hole or page
  conversion, the TLB flush can be delayed until after the zaps.
- If the use case relies on pure split status (e.g., splitting for PML),
  flush the TLB unconditionally. (Just hypothetical; no such use case
  currently exists for kvm_split_cross_boundary_leafs().)

Signed-off-by: Xiaoyao Li
Signed-off-by: Isaku Yamahata
Signed-off-by: Yan Zhao
---
v3:
- s/only_cross_bounday/only_cross_boundary. (Kai)
- Do not return flush status and have the callers determine how/when to
  flush the TLB.
- Always pass "flush" as false to tdp_mmu_iter_cond_resched(). (Kai)
- Added a default implementation of kvm_split_cross_boundary_leafs() for
  non-x86 platforms.
- Removed the middle-level function tdp_mmu_split_cross_boundary_leafs().
- Use EXPORT_SYMBOL_FOR_KVM_INTERNAL().

RFC v2:
- Rename the API to kvm_split_cross_boundary_leafs().
- Make the API usable for direct roots or under shared mmu_lock.
- Leverage the main logic from tdp_mmu_split_huge_pages_root(). (Rick)

RFC v1:
- Split patch.
- Introduced API kvm_split_boundary_leafs(), refined the logic and
  simplified the code.
---
 arch/x86/kvm/mmu/mmu.c     | 34 ++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c | 42 ++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/tdp_mmu.h |  3 +++
 include/linux/kvm_host.h   |  2 ++
 virt/kvm/kvm_main.c        |  7 +++++++
 5 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b4f2e3ced716..f40af7ac75b3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1644,6 +1644,40 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
 				     start, end - 1, can_yield, true, flush);
 }
 
+/*
+ * Split large leafs crossing the boundary of the specified range.
+ * Only support TDP MMU. Do nothing if !tdp_mmu_enabled.
+ *
+ * This API does not flush TLB. Callers need to determine how/when to flush TLB
+ * according to their use cases, e.g.,
+ * - No need to flush TLB, e.g., if it's in a fault path or TLB flush has been
+ *   ensured.
+ * - Delay the TLB flush until after zaps if the split is invoked for precise
+ *   zapping.
+ * - Unconditionally flush TLB if a use case relies on pure split status (e.g.,
+ *   splitting for PML).
+ *
+ * Return value: 0: success; <0: failure
+ */
+int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
+				   bool shared)
+{
+	int ret = 0;
+
+	lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
+			    lockdep_is_held(&kvm->slots_lock) ||
+			    srcu_read_lock_held(&kvm->srcu));
+
+	if (!range->may_block)
+		return -EOPNOTSUPP;
+
+	if (tdp_mmu_enabled)
+		ret = kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(kvm, range,
+								       shared);
+	return ret;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_split_cross_boundary_leafs);
+
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool flush = false;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 074209d91ec3..b984027343b7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1600,10 +1600,17 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	return ret;
 }
 
+static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
+{
+	return !(iter->gfn >= start &&
+		 (iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
+}
+
 static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 					 struct kvm_mmu_page *root,
 					 gfn_t start, gfn_t end,
-					 int target_level, bool shared)
+					 int target_level, bool shared,
+					 bool only_cross_boundary)
 {
 	struct kvm_mmu_page *sp = NULL;
 	struct tdp_iter iter;
@@ -1615,6 +1622,10 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 	 * level into one lower level. For example, if we encounter a 1GB page
 	 * we split it into 512 2MB pages.
 	 *
+	 * When only_cross_boundary is true, just split huge pages above the
+	 * target level into one lower level if the huge pages cross the start
+	 * or end boundary.
+	 *
 	 * Since the TDP iterator uses a pre-order traversal, we are guaranteed
 	 * to visit an SPTE before ever visiting its children, which means we
 	 * will correctly recursively split huge pages that are more than one
@@ -1629,6 +1640,10 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 		if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
 			continue;
 
+		if (only_cross_boundary &&
+		    !iter_cross_boundary(&iter, start, end))
+			continue;
+
 		if (!sp) {
 			rcu_read_unlock();
 
@@ -1692,12 +1707,35 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
 
 	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
 	for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) {
-		r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level, shared);
+		r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level,
+						  shared, false);
+		if (r) {
+			kvm_tdp_mmu_put_root(kvm, root);
+			break;
+		}
+	}
+}
+
+int kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(struct kvm *kvm,
+						     struct kvm_gfn_range *range,
+						     bool shared)
+{
+	enum kvm_tdp_mmu_root_types types;
+	struct kvm_mmu_page *root;
+	int r = 0;
+
+	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+	types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter);
+
+	__for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types) {
+		r = tdp_mmu_split_huge_pages_root(kvm, root, range->start, range->end,
+						  PG_LEVEL_4K, shared, true);
 		if (r) {
 			kvm_tdp_mmu_put_root(kvm, root);
 			break;
 		}
 	}
+	return r;
 }
 
 static bool tdp_mmu_need_write_protect(struct kvm *kvm, struct kvm_mmu_page *sp)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index bd62977c9199..c20b1416e4b2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -70,6 +70,9 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
 				  enum kvm_tdp_mmu_root_types root_types);
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
+int kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(struct kvm *kvm,
+						     struct kvm_gfn_range *range,
+						     bool shared);
 
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8144d27e6c12..e563bb22c481 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -275,6 +275,8 @@ struct kvm_gfn_range {
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
+				   bool shared);
 #endif
 
 enum {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1d7ab2324d10..feeef7747099 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -910,6 +910,13 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 	return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
 }
 
+int __weak kvm_split_cross_boundary_leafs(struct kvm *kvm,
+					  struct kvm_gfn_range *range,
+					  bool shared)
+{
+	return 0;
+}
+
 #else /* !CONFIG_KVM_GENERIC_MMU_NOTIFIER */
 
 static int kvm_init_mmu_notifier(struct kvm *kvm)
-- 
2.43.2
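The cross-boundary test in iter_cross_boundary() is pure interval arithmetic: a leaf is split only if its GFN span is not fully contained in [start, end). A standalone sketch with concrete numbers (cross_boundary() mirrors the predicate; everything else is illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t gfn_t;

static bool cross_boundary(gfn_t leaf_start, gfn_t leaf_pages, gfn_t start, gfn_t end)
{
	return !(leaf_start >= start && leaf_start + leaf_pages <= end);
}

int main(void)
{
	/* A 2MB leaf covers 512 4KB GFNs: [0x400, 0x600). */
	printf("%d\n", cross_boundary(0x400, 512, 0x500, 0x800)); /* 1: crosses start */
	printf("%d\n", cross_boundary(0x400, 512, 0x400, 0x600)); /* 0: fully inside, skipped */
	printf("%d\n", cross_boundary(0x400, 512, 0x000, 0x500)); /* 1: crosses end */
	return 0;
}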
From nobody Sat Feb 7 09:58:55 2026
From: Yan Zhao
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v3 12/24] KVM: x86: Introduce hugepage_set_guest_inhibit()
Date: Tue, 6 Jan 2026 18:21:51 +0800
Message-ID: <20260106102151.25125-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>

TDX requires guests to accept S-EPT mappings created by the host KVM. Due to
the current implementation of the TDX module, if a guest accepts a GFN at a
lower level after KVM maps it at a higher level, the TDX module will emulate
an EPT violation VMExit to KVM instead of returning a size mismatch error to
the guest. If KVM fails to perform page splitting in the VMExit handler, the
guest's accept operation will be triggered again upon re-entering the guest,
causing a repeated EPT violation VMExit.

To facilitate passing the guest's accept level information to the KVM MMU
core, and to prevent repeated mapping of a GFN at different levels due to
different accept levels specified by different vCPUs, introduce the interface
hugepage_set_guest_inhibit(). This interface records, across vCPUs, that
mapping at a certain level is inhibited by the guest.

The KVM_LPAGE_GUEST_INHIBIT_FLAG bit is currently modified in one direction
(set), so no unset interface is provided.

Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com/ [1]
Suggested-by: Rick Edgecombe
Suggested-by: Sean Christopherson
Signed-off-by: Yan Zhao
---
v3:
- Use EXPORT_SYMBOL_FOR_KVM_INTERNAL().

RFC v2:
- New in RFC v2.
---
 arch/x86/kvm/mmu.h     |  3 +++
 arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 830f46145692..f97bedff5c4c 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -322,4 +322,7 @@ static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
 {
 	return gfn & kvm_gfn_direct_bits(kvm);
 }
+
+void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
+bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
 #endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f40af7ac75b3..029f2f272ffc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -714,12 +714,14 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
 }
 
 /*
- * The most significant bit in disallow_lpage tracks whether or not memory
- * attributes are mixed, i.e. not identical for all gfns at the current level.
+ * The 2 most significant bits in disallow_lpage track whether or not memory
+ * attributes are mixed, i.e. not identical for all gfns at the current level,
+ * or whether or not the guest inhibits the current level of hugepage at the gfn.
  * The lower order bits are used to refcount other cases where a hugepage is
  * disallowed, e.g. if KVM has shadow a page table at the gfn.
  */
 #define KVM_LPAGE_MIXED_FLAG		BIT(31)
+#define KVM_LPAGE_GUEST_INHIBIT_FLAG	BIT(30)
 
 static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
 					    gfn_t gfn, int count)
@@ -732,7 +734,8 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
 
 		old = linfo->disallow_lpage;
 		linfo->disallow_lpage += count;
-		WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
+		WARN_ON_ONCE((old ^ linfo->disallow_lpage) &
+			     (KVM_LPAGE_MIXED_FLAG | KVM_LPAGE_GUEST_INHIBIT_FLAG));
 	}
 }
 
@@ -1644,6 +1647,18 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
 				     start, end - 1, can_yield, true, flush);
 }
 
+bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+	return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_GUEST_INHIBIT_FLAG;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(hugepage_test_guest_inhibit);
+
+void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+	lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_GUEST_INHIBIT_FLAG;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(hugepage_set_guest_inhibit);
+
 /*
  * Split large leafs crossing the boundary of the specified range.
  * Only support TDP MMU. Do nothing if !tdp_mmu_enabled.
-- 
2.43.2
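The disallow_lpage packing described above (two flag bits plus a refcount in the low bits) and the WARN condition guarding it can be modeled in a few lines. A standalone sketch with invented constant names mirroring KVM_LPAGE_MIXED_FLAG and KVM_LPAGE_GUEST_INHIBIT_FLAG:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define LPAGE_MIXED_FLAG         (1u << 31)
#define LPAGE_GUEST_INHIBIT_FLAG (1u << 30)

static uint32_t adjust_count(uint32_t disallow_lpage, int count)
{
	uint32_t old = disallow_lpage;

	disallow_lpage += count;
	/* Refcount arithmetic must never carry into the flag bits. */
	assert(!((old ^ disallow_lpage) &
		 (LPAGE_MIXED_FLAG | LPAGE_GUEST_INHIBIT_FLAG)));
	return disallow_lpage;
}

int main(void)
{
	uint32_t v = LPAGE_GUEST_INHIBIT_FLAG;	/* inhibit set, refcount 0 */

	v = adjust_count(v, 1);			/* refcount 1, flags preserved */
	v = adjust_count(v, -1);		/* refcount 0, flags preserved */
	printf("%#x\n", (unsigned)v);
	return 0;
}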
From nobody Sat Feb 7 09:58:55 2026
From: Yan Zhao
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v3 13/24] KVM: TDX: Honor the guest's accept level contained in an EPT violation
Date: Tue, 6 Jan 2026 18:22:07 +0800
Message-ID: <20260106102207.25143-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>

TDX requires guests to accept S-EPT mappings created by the host KVM. Due to
the current implementation of the TDX module, if a guest accepts a GFN at a
lower level after KVM maps it at a higher level, the TDX module will emulate
an EPT violation VMExit to KVM instead of returning a size mismatch error to
the guest. If KVM fails to perform page splitting in the EPT violation
handler, the guest's ACCEPT operation will be triggered again upon
re-entering the guest, causing a repeated EPT violation VMExit.

The TDX module thus has the EPT violation VMExit carry the guest's accept
level if it's caused by the guest's ACCEPT operation.

Honor the guest's accept level if an EPT violation VMExit contains it:
(1) Set the guest inhibit bit in the lpage info to prevent the KVM MMU core
    from mapping at a higher level than the guest's accept level.
(2) Split any existing mapping higher than the guest's accept level.

Use write mmu_lock to protect (1) and (2) for now. When a TDX module with the
NON-BLOCKING-RESIZE feature is available, splitting can be performed under
shared mmu_lock, as there is no need to worry about the failure of UNBLOCK
after the failure of DEMOTE. Then both (1) and (2) can be done under shared
mmu_lock.

As an optimization, call hugepage_test_guest_inhibit() without holding the
mmu_lock to reduce the frequency of acquiring the write mmu_lock. The write
mmu_lock is thus only acquired if the guest inhibit bit is not already set.
This is safe because the guest inhibit bit is set in a one-way manner, while
the splitting under the write mmu_lock is performed before setting the guest
inhibit bit.

Note: EPT violation VMExits without the guest's accept level are not caused
by the guest's ACCEPT operation; they are instead caused by the guest
accessing memory before accepting it. Since KVM can't obtain guest accept
level info from such EPT violation VMExits (the ACCEPT operation hasn't
occurred yet), KVM may still map at a higher level than the guest's later
accept level. So, the typical guest/KVM interaction flows are:

- If the guest accesses private memory without first accepting it (like
  non-Linux guests):
  1. Guest accesses a private memory address.
  2. KVM finds it can map the GFN at 2MB, so it AUGs at 2MB.
  3. Guest accepts the GFN at 4KB.
  4. KVM receives an EPT violation with eeq_type of ACCEPT + 4KB level.
  5. KVM splits the 2MB mapping.
  6. Guest accepts successfully and accesses the page.

- If the guest first accepts private memory before accessing it (like Linux
  guests):
  1. Guest accepts private memory at 4KB.
  2. KVM receives an EPT violation with eeq_type of ACCEPT + 4KB level.
  3. KVM AUGs at 4KB.
  4. Guest accepts successfully and accesses the page.

Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com
Suggested-by: Rick Edgecombe
Suggested-by: Sean Christopherson
Signed-off-by: Yan Zhao
---
v3:
- tdx_check_accept_level() --> tdx_honor_guest_accept_level(). (Binbin)
- Added patch log and code comments to better describe the flows for EPT
  violations with and without an accept level. (Kai)
- Added a comment to describe why kvm_flush_remote_tlbs() is not needed
  after kvm_split_cross_boundary_leafs(). (Kai)
- Return ret to userspace on error of tdx_honor_guest_accept_level(). (Kai)

RFC v2:
- Change tdx_get_accept_level() to tdx_check_accept_level().
- Invoke kvm_split_cross_boundary_leafs() and hugepage_set_guest_inhibit()
  to change the KVM mapping level in a global way according to the guest
  accept level. (Rick, Sean)

RFC v1:
- Introduce tdx_get_accept_level() to get the guest accept level.
- Use tdx->violation_request_level and tdx->violation_gfn* to pass the guest
  accept level to tdx_gmem_private_max_mapping_level() to determine the KVM
  mapping level.
---
 arch/x86/kvm/vmx/tdx.c      | 77 +++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx_arch.h |  3 ++
 2 files changed, 80 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1e29722abb36..712aaa3d45b7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1983,6 +1983,79 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
 	return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
 }
 
+/*
+ * An EPT violation can be either due to the guest's ACCEPT operation or
+ * due to the guest's access of memory before the guest accepts the
+ * memory.
+ *
+ * Type TDX_EXT_EXIT_QUAL_TYPE_ACCEPT in the extended exit qualification
+ * identifies the former case, which must also contain a valid guest
+ * accept level.
+ *
+ * For the former case, honor the guest's accept level by setting the guest
+ * inhibit bit on levels above the guest accept level and splitting the
+ * existing mapping for the faulting GFN if it is mapped at a higher level
+ * than the guest accept level.
+ *
+ * Do nothing if the EPT violation is due to the latter case. KVM will map the
+ * GFN without considering the guest's accept level (unless the guest inhibit
+ * bit is already set).
+ */
+static inline int tdx_honor_guest_accept_level(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	struct kvm *kvm = vcpu->kvm;
+	u64 eeq_type, eeq_info;
+	int level = -1;
+
+	if (!slot)
+		return 0;
+
+	eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
+	if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
+		return 0;
+
+	eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
+		   TDX_EXT_EXIT_QUAL_INFO_SHIFT;
+
+	level = (eeq_info & GENMASK(2, 0)) + 1;
+
+	if (level == PG_LEVEL_4K || level == PG_LEVEL_2M) {
+		if (!hugepage_test_guest_inhibit(slot, gfn, level + 1)) {
+			gfn_t base_gfn = gfn_round_for_level(gfn, level);
+			struct kvm_gfn_range gfn_range = {
+				.start = base_gfn,
+				.end = base_gfn + KVM_PAGES_PER_HPAGE(level),
+				.slot = slot,
+				.may_block = true,
+				.attr_filter = KVM_FILTER_PRIVATE,
+			};
+
+			scoped_guard(write_lock, &kvm->mmu_lock) {
+				int ret;
+
+				/*
+				 * No kvm_flush_remote_tlbs() is required after
+				 * the split for S-EPT, because the
+				 * "BLOCK + TRACK + kick off vCPUs" sequence in
+				 * tdx_sept_split_private_spte() has guaranteed
+				 * the TLB flush. The hardware also doesn't
+				 * cache stale huge mappings in the fault path.
+				 */
+				ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range,
+								     false);
+				if (ret)
+					return ret;
+
+				hugepage_set_guest_inhibit(slot, gfn, level + 1);
+				if (level == PG_LEVEL_4K)
+					hugepage_set_guest_inhibit(slot, gfn, level + 2);
+			}
+		}
+	}
+	return 0;
+}
+
 static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qual;
@@ -2008,6 +2081,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 		 */
 		exit_qual = EPT_VIOLATION_ACC_WRITE;
 
+		ret = tdx_honor_guest_accept_level(vcpu, gpa_to_gfn(gpa));
+		if (ret)
+			return ret;
+
 		/* Only private GPA triggers zero-step mitigation */
 		local_retry = true;
 	} else {
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index a30e880849e3..af006a73ee05 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -82,7 +82,10 @@ struct tdx_cpuid_value {
 #define TDX_TD_ATTR_PERFMON		BIT_ULL(63)
 
 #define TDX_EXT_EXIT_QUAL_TYPE_MASK	GENMASK(3, 0)
+#define TDX_EXT_EXIT_QUAL_TYPE_ACCEPT	1
 #define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION	6
+#define TDX_EXT_EXIT_QUAL_INFO_MASK	GENMASK(63, 32)
+#define TDX_EXT_EXIT_QUAL_INFO_SHIFT	32
 /*
  * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
  */
-- 
2.43.2
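The decoding of the extended exit qualification in tdx_honor_guest_accept_level() reduces to a few mask-and-shift steps: the low 4 bits carry the type, bits 63:32 the info, and the accept level is (info & 7) + 1. A standalone sketch with a made-up sample value (the mask and shift constants mirror the tdx_arch.h additions above):

#include <stdint.h>
#include <stdio.h>

#define EEQ_TYPE_MASK   0xfull
#define EEQ_TYPE_ACCEPT 1
#define EEQ_INFO_SHIFT  32

int main(void)
{
	/* Hypothetical ext_exit_qualification: type ACCEPT, info 0 (4KB). */
	uint64_t eeq = ((uint64_t)0 << EEQ_INFO_SHIFT) | EEQ_TYPE_ACCEPT;
	uint64_t type = eeq & EEQ_TYPE_MASK;

	if (type == EEQ_TYPE_ACCEPT) {
		int level = (int)((eeq >> EEQ_INFO_SHIFT) & 0x7) + 1;

		printf("guest accept level = %d (1 => 4KB, 2 => 2MB)\n", level);
	}
	return 0;
}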
The hardware also doesn't + * cache stale huge mappings in the fault path. + */ + ret =3D kvm_split_cross_boundary_leafs(kvm, &gfn_range, + false); + if (ret) + return ret; + + hugepage_set_guest_inhibit(slot, gfn, level + 1); + if (level =3D=3D PG_LEVEL_4K) + hugepage_set_guest_inhibit(slot, gfn, level + 2); + } + } + } + return 0; +} + static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu) { unsigned long exit_qual; @@ -2008,6 +2081,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu = *vcpu) */ exit_qual =3D EPT_VIOLATION_ACC_WRITE; =20 + ret =3D tdx_honor_guest_accept_level(vcpu, gpa_to_gfn(gpa)); + if (ret) + return ret; + /* Only private GPA triggers zero-step mitigation */ local_retry =3D true; } else { diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h index a30e880849e3..af006a73ee05 100644 --- a/arch/x86/kvm/vmx/tdx_arch.h +++ b/arch/x86/kvm/vmx/tdx_arch.h @@ -82,7 +82,10 @@ struct tdx_cpuid_value { #define TDX_TD_ATTR_PERFMON BIT_ULL(63) =20 #define TDX_EXT_EXIT_QUAL_TYPE_MASK GENMASK(3, 0) +#define TDX_EXT_EXIT_QUAL_TYPE_ACCEPT 1 #define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION 6 +#define TDX_EXT_EXIT_QUAL_INFO_MASK GENMASK(63, 32) +#define TDX_EXT_EXIT_QUAL_INFO_SHIFT 32 /* * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is= 1024B. */ --=20 2.43.2 From nobody Sat Feb 7 09:58:55 2026 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8A7CC330D58; Tue, 6 Jan 2026 10:24:26 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.10 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1767695068; cv=none; b=teDiJs68bh5oghQaveXZNH+YY+o3gLeAt+IcG8MGyXDezO5CZasZz8cdSAOb1Gosa+Syvbjml3z+l4pcI0vCee1lKD/mWlBUhReAwAH5z9efq242HuV7k+ItTJl01TYylYVh5wA89bLhvxgwE02I3FiHu+CywSZof6W8we8regM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1767695068; c=relaxed/simple; bh=ffK1vQpvyzKmlUzz2zxoihkeD9nVSu7GIDcTNxPg6ZM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=nAHKvustFRbaL8I9p4wmhGHZTjSXJyNn4FxneBVZ2ulEw2lkgZcymxTYocaSf1gWxZyVv6EhEb0wH2fjjsSH64dF/1z3TG8e3NX08qZWW3KT2c/6x9dnayvVFrjGQxtbmBaEbqGI73PQ47yrcc5sPA/B1pDg+b61bBJS4LulHKo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=kauFI1zN; arc=none smtp.client-ip=192.198.163.10 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="kauFI1zN" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1767695065; x=1799231065; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ffK1vQpvyzKmlUzz2zxoihkeD9nVSu7GIDcTNxPg6ZM=; b=kauFI1zNy553EM7fgpVzVUwd5nzJw/faJZIBs9lji7LKrWgQfje0TrBG +iOxsAMCBtUsKJmILH4bTl46m2HouIEN8CBo+98aCSrORd6XSO048lQrL m81fjugU62C5vu6yKlgi9ym/FrAq2A1mVMufS6GrAUEfgI3W+J6XZsQSY 80OMkEvVCqbzIgsvbHbtHlr7gi7BwhhTskIGnfjPJVjcdjVsLq4gX5OE9 
IIScZ2UpvG8X3B3tHf38QXaQwGGtbDYlTOleQK/vQyfUYS+56zPpRJBZK SwxY95pHJyN8uaW+JzSNG+5d0tOoR3sbfxjDYWCisBjIleHhhSkWXe0/t w==; X-CSE-ConnectionGUID: vEFYgxLjSNGz8nDdPRziUA== X-CSE-MsgGUID: 9i6Dut6LQmW8CQxLaUvLIA== X-IronPort-AV: E=McAfee;i="6800,10657,11662"; a="80427348" X-IronPort-AV: E=Sophos;i="6.21,204,1763452800"; d="scan'208";a="80427348" Received: from orviesa008.jf.intel.com ([10.64.159.148]) by fmvoesa104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jan 2026 02:24:24 -0800 X-CSE-ConnectionGUID: qZCS4Z+ISYiXm6jdpoxKCg== X-CSE-MsgGUID: qw4VeN+7TxSY9JjDNSm1xg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.21,204,1763452800"; d="scan'208";a="202681573" Received: from yzhao56-desk.sh.intel.com ([10.239.47.19]) by orviesa008-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jan 2026 02:24:19 -0800 From: Yan Zhao To: pbonzini@redhat.com, seanjc@google.com Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com Subject: [PATCH v3 14/24] KVM: Change the return type of gfn_handler_t() from bool to int Date: Tue, 6 Jan 2026 18:22:22 +0800 Message-ID: <20260106102222.25160-1-yan.y.zhao@intel.com> X-Mailer: git-send-email 2.43.2 In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com> References: <20260106101646.24809-1-yan.y.zhao@intel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Modify the return type of gfn_handler_t() from bool to int. A negative return value indicates failure, while a return value of 1 signifies success with a flush required, and 0 denotes success without a flush required. This adjustment prepares for a later change that will enable kvm_pre_set_memory_attributes() to fail. No functional changes expected. Signed-off-by: Yan Zhao --- v3: - Rebased. RFC v2: - No change RFC v1: - New patch. 
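[Editor's note] A toy model of the reworked contract, with hypothetical handler and caller names (a sketch, not kernel code): errors propagate upward, while positive results accumulate into a "flush needed" flag, exactly mirroring how the callers in the diff below treat the new int return.

    #include <stdio.h>

    /* < 0 -> failure; 1 -> success, flush required; 0 -> success, no flush. */
    typedef int (*gfn_handler_t)(void *kvm, int range_id);

    static int handle_ranges(gfn_handler_t handler, void *kvm, int nranges)
    {
            int flush = 0;

            for (int i = 0; i < nranges; i++) {
                    int ret = handler(kvm, i);

                    if (ret < 0)
                            return ret;   /* propagate the failure upward */
                    flush |= ret;         /* accumulate "flush needed" */
            }

            if (flush)
                    printf("flush TLBs\n"); /* stand-in for kvm_flush_remote_tlbs() */
            return 0;
    }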
---
 arch/arm64/kvm/mmu.c             |  8 ++++----
 arch/loongarch/kvm/mmu.c         |  8 ++++----
 arch/mips/kvm/mmu.c              |  6 +++---
 arch/powerpc/kvm/book3s.c        |  4 ++--
 arch/powerpc/kvm/e500_mmu_host.c |  8 ++++----
 arch/riscv/kvm/mmu.c             | 12 ++++++------
 arch/x86/kvm/mmu/mmu.c           | 20 ++++++++++----------
 include/linux/kvm_host.h         | 12 ++++++------
 virt/kvm/kvm_main.c              | 24 ++++++++++++++++--------
 9 files changed, 55 insertions(+), 47 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 5ab0cfa08343..c39d3ef577f8 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2221,12 +2221,12 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return false;
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	u64 size = (range->end - range->start) << PAGE_SHIFT;
 
 	if (!kvm->arch.mmu.pgt)
-		return false;
+		return 0;
 
 	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
							       range->start << PAGE_SHIFT,
@@ -2237,12 +2237,12 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	 */
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	u64 size = (range->end - range->start) << PAGE_SHIFT;
 
 	if (!kvm->arch.mmu.pgt)
-		return false;
+		return 0;
 
 	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
							       range->start << PAGE_SHIFT,
diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
index a7fa458e3360..06fa060878c9 100644
--- a/arch/loongarch/kvm/mmu.c
+++ b/arch/loongarch/kvm/mmu.c
@@ -511,7 +511,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 				range->end << PAGE_SHIFT, &ctx);
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	kvm_ptw_ctx ctx;
 
@@ -523,15 +523,15 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 				range->end << PAGE_SHIFT, &ctx);
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	gpa_t gpa = range->start << PAGE_SHIFT;
 	kvm_pte_t *ptep = kvm_populate_gpa(kvm, NULL, gpa, 0);
 
 	if (ptep && kvm_pte_present(NULL, ptep) && kvm_pte_young(*ptep))
-		return true;
+		return 1;
 
-	return false;
+	return 0;
 }
 
 /*
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index d2c3b6b41f18..c26cc89c8e98 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -444,18 +444,18 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return true;
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	return kvm_mips_mkold_gpa_pt(kvm, range->start, range->end);
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	gpa_t gpa = range->start << PAGE_SHIFT;
 	pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
 
 	if (!gpa_pte)
-		return false;
+		return 0;
 	return pte_young(*gpa_pte);
 }
 
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index d79c5d1098c0..9bf6e1cf64f1 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -886,12 +886,12 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return kvm->arch.kvm_ops->unmap_gfn_range(kvm, range);
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	return kvm->arch.kvm_ops->age_gfn(kvm, range);
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
 }
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 06caf8bbbe2b..dd5411ee242e 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -697,16 +697,16 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return kvm_e500_mmu_unmap_gfn(kvm, range);
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* XXX could be more clever ;) */
-	return false;
+	return 0;
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* XXX could be more clever ;) */
-	return false;
+	return 0;
 }
 
 /*****************************************/
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 4ab06697bfc0..aa163d2ef7d5 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -259,7 +259,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return false;
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	pte_t *ptep;
 	u32 ptep_level = 0;
@@ -267,7 +267,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	struct kvm_gstage gstage;
 
 	if (!kvm->arch.pgd)
-		return false;
+		return 0;
 
 	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
 
@@ -277,12 +277,12 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	gstage.pgd = kvm->arch.pgd;
 	if (!kvm_riscv_gstage_get_leaf(&gstage, range->start << PAGE_SHIFT,
				       &ptep, &ptep_level))
-		return false;
+		return 0;
 
 	return ptep_test_and_clear_young(NULL, 0, ptep);
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	pte_t *ptep;
 	u32 ptep_level = 0;
@@ -290,7 +290,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	struct kvm_gstage gstage;
 
 	if (!kvm->arch.pgd)
-		return false;
+		return 0;
 
 	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
 
@@ -300,7 +300,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	gstage.pgd = kvm->arch.pgd;
 	if (!kvm_riscv_gstage_get_leaf(&gstage, range->start << PAGE_SHIFT,
				       &ptep, &ptep_level))
-		return false;
+		return 0;
 
 	return pte_young(ptep_get(ptep));
 }
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 029f2f272ffc..1b180279aacd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1810,7 +1810,7 @@ static bool kvm_may_have_shadow_mmu_sptes(struct kvm *kvm)
 	return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
@@ -1823,7 +1823,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	return young;
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
@@ -7962,8 +7962,8 @@ static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
 	lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_MIXED_FLAG;
 }
 
-bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
-					struct kvm_gfn_range *range)
+int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+				       struct kvm_gfn_range *range)
 {
 	struct kvm_memory_slot *slot = range->slot;
 	int level;
@@ -7980,10 +7980,10 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 	 * a hugepage can be used for affected ranges.
 	 */
 	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
-		return false;
+		return 0;
 
 	if (WARN_ON_ONCE(range->end <= range->start))
-		return false;
+		return 0;
 
 	/*
	 * If the head and tail pages of the range currently allow a hugepage,
@@ -8042,8 +8042,8 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
 	return true;
 }
 
-bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
-					 struct kvm_gfn_range *range)
+int kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+					struct kvm_gfn_range *range)
 {
 	unsigned long attrs = range->arg.attributes;
 	struct kvm_memory_slot *slot = range->slot;
@@ -8059,7 +8059,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 	 * SHARED may now allow hugepages.
 	 */
 	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
-		return false;
+		return 0;
 
 	/*
	 * The sequence matters here: upper levels consume the result of lower
@@ -8106,7 +8106,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 			hugepage_set_mixed(slot, gfn, level);
 		}
 	}
-	return false;
+	return 0;
 }
 
 void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e563bb22c481..6f3d29db0505 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -273,8 +273,8 @@ struct kvm_gfn_range {
 	bool lockless;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
				   bool shared);
 #endif
@@ -734,10 +734,10 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
 extern bool vm_memory_attributes;
 bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
					 unsigned long mask, unsigned long attrs);
-bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+				       struct kvm_gfn_range *range);
+int kvm_arch_post_set_memory_attributes(struct kvm *kvm,
					struct kvm_gfn_range *range);
-bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
-					 struct kvm_gfn_range *range);
 #else
 #define vm_memory_attributes false
 #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
@@ -1568,7 +1568,7 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void kvm_mmu_invalidate_begin(struct kvm *kvm);
 void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
 void kvm_mmu_invalidate_end(struct kvm *kvm);
-bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 
 long kvm_arch_dev_ioctl(struct file *filp, unsigned int ioctl,
			unsigned long arg);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index feeef7747099..471f798dba2d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -517,7 +517,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 	return container_of(mn, struct kvm, mmu_notifier);
 }
 
-typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+typedef int (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm);
 
@@ -601,6 +601,7 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 	kvm_for_each_memslot_in_hva_range(node, slots,
					  range->start, range->end - 1) {
 		unsigned long hva_start, hva_end;
+		int ret;
 
 		slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
 		hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
@@ -641,7 +642,9 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
				goto mmu_unlock;
			}
		}
-		r.ret |= range->handler(kvm, &gfn_range);
+		ret = range->handler(kvm, &gfn_range);
+		WARN_ON_ONCE(ret < 0);
+		r.ret |= ret;
	}
 }
 
@@ -727,7 +730,7 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
	}
 }
 
-bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
	return kvm_unmap_gfn_range(kvm, range);
@@ -2507,7 +2510,8 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
	struct kvm_memslots *slots;
	struct kvm_memslot_iter iter;
	bool found_memslot = false;
-	bool ret = false;
+	bool flush = false;
+	int ret = 0;
	int i;
 
	gfn_range.arg = range->arg;
@@ -2540,19 +2544,23 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
			range->on_lock(kvm);
		}
 
-		ret |= range->handler(kvm, &gfn_range);
+		ret = range->handler(kvm, &gfn_range);
+		if (ret < 0)
+			goto err;
+		flush |= ret;
		}
	}
 
-	if (range->flush_on_ret && ret)
+err:
+	if (range->flush_on_ret && flush)
		kvm_flush_remote_tlbs(kvm);
 
	if (found_memslot)
		KVM_MMU_UNLOCK(kvm);
 }
 
-static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
-					  struct kvm_gfn_range *range)
+static int kvm_pre_set_memory_attributes(struct kvm *kvm,
+					 struct kvm_gfn_range *range)
 {
	/*
	 * Unconditionally add the range to the invalidation set, regardless of
-- 
2.43.2

From nobody Sat Feb 7 09:58:55 2026
From: Yan Zhao
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v3 15/24] KVM: x86: Split cross-boundary mirror leafs for KVM_SET_MEMORY_ATTRIBUTES
Date: Tue, 6 Jan 2026 18:22:36 +0800
Message-ID: <20260106102236.25177-1-yan.y.zhao@intel.com>
X-Mailer: git-send-email 2.43.2
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>

In TDX, private page tables require precise zapping because faulting back the zapped mappings necessitates the guest's re-acceptance. Therefore, before performing a zap for the private-to-shared conversion, rather than zapping a huge leaf entry that crosses the boundary of the GFN range to be zapped, split the leaf entry to ensure GFNs outside the conversion range are not affected.
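[Editor's note] To visualize what "crosses the boundary" means here, a small hypothetical helper (an illustration, not the patch's code; the patch's actual predicate lives in iter_cross_boundary() introduced later in this series):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t gfn_t;

    /* 4KB = 1, 2MB = 2, 1GB = 3; GFNs covered by one leaf at @level (x86). */
    #define PAGES_PER_LEVEL(level)  ((gfn_t)1 << (((level) - 1) * 9))

    /*
     * A huge leaf covering GFNs [leaf_start, leaf_end) "crosses the boundary"
     * of the conversion range [start, end) when it overlaps the range but
     * also covers GFNs on the far side of an endpoint. Such a leaf must be
     * split before the zap so the outside GFNs keep their mappings.
     */
    static bool leaf_crosses_boundary(gfn_t gfn, int level, gfn_t start, gfn_t end)
    {
            gfn_t leaf_start = gfn & ~(PAGES_PER_LEVEL(level) - 1);
            gfn_t leaf_end = leaf_start + PAGES_PER_LEVEL(level);

            return (leaf_start < start && start < leaf_end) ||
                   (leaf_start < end && end < leaf_end);
    }

For example, a 2MB leaf covering GFNs [0, 512) with a conversion range of [256, 512) crosses the start boundary, so it is split into 4KB leafs before GFNs [256, 512) are zapped, leaving [0, 256) mapped.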
Invoke kvm_split_cross_boundary_leafs() in kvm_arch_pre_set_memory_attributes() to split the huge leafs that cross the GFN range boundary before calling kvm_unmap_gfn_range() to zap the GFN range that will be converted to shared. Only update the flush status if zaps are performed.

Unlike kvm_unmap_gfn_range(), which cannot fail, kvm_split_cross_boundary_leafs() may fail due to memory allocation for splitting. Update kvm_handle_gfn_range() to propagate the error back to kvm_vm_set_mem_attributes(), which can then fail the KVM_SET_MEMORY_ATTRIBUTES ioctl.

The downside of the current implementation: although kvm_split_cross_boundary_leafs() is invoked before kvm_unmap_gfn_range() for each GFN range, the entire conversion range may consist of several GFN ranges. If an out-of-memory error occurs while splitting one GFN range, some previous GFN ranges may already have been split and zapped, even though their page attributes remain unchanged due to the splitting failure. If necessary, a follow-up patch can divide the single invocation of "kvm_handle_gfn_range(kvm, &pre_set_range)" into two, e.g.,
  kvm_handle_gfn_range(kvm, &pre_set_range_prepare_and_split)
  kvm_handle_gfn_range(kvm, &pre_set_range_unmap)

Signed-off-by: Yan Zhao
---
v3:
- Do not return the flush status from kvm_split_cross_boundary_leafs(), so the TLB is flushed only if zaps are performed.

RFC v2:
- Update kvm_split_boundary_leafs() to kvm_split_cross_boundary_leafs() and invoke it only for private-to-shared conversion.

RFC v1:
- New patch.
---
 arch/x86/kvm/mmu/mmu.c | 10 ++++++++--
 virt/kvm/kvm_main.c    | 13 +++++++++----
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1b180279aacd..35a6e37bfc68 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -8015,10 +8015,16 @@ int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 	}
 
 	/* Unmap the old attribute page. */
-	if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)
+	if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
 		range->attr_filter = KVM_FILTER_SHARED;
-	else
+	} else {
+		int ret;
+
 		range->attr_filter = KVM_FILTER_PRIVATE;
+		ret = kvm_split_cross_boundary_leafs(kvm, range, false);
+		if (ret)
+			return ret;
+	}
 
 	return kvm_unmap_gfn_range(kvm, range);
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 471f798dba2d..f3b0d7f8dcfd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2502,8 +2502,8 @@ bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	return true;
 }
 
-static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
-						 struct kvm_mmu_notifier_range *range)
+static __always_inline int kvm_handle_gfn_range(struct kvm *kvm,
+						struct kvm_mmu_notifier_range *range)
 {
 	struct kvm_gfn_range gfn_range;
 	struct kvm_memory_slot *slot;
@@ -2557,6 +2557,8 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
 
 	if (found_memslot)
 		KVM_MMU_UNLOCK(kvm);
+
+	return ret < 0 ? ret : 0;
 }
 
 static int kvm_pre_set_memory_attributes(struct kvm *kvm,
@@ -2625,7 +2627,9 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		cond_resched();
 	}
 
-	kvm_handle_gfn_range(kvm, &pre_set_range);
+	r = kvm_handle_gfn_range(kvm, &pre_set_range);
+	if (r)
+		goto out_unlock;
 
 	for (i = start; i < end; i++) {
 		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
@@ -2634,7 +2638,8 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		cond_resched();
 	}
 
-	kvm_handle_gfn_range(kvm, &post_set_range);
+	r = kvm_handle_gfn_range(kvm, &post_set_range);
+	KVM_BUG_ON(r, kvm);
 
 out_unlock:
 	mutex_unlock(&kvm->slots_lock);
-- 
2.43.2

From nobody Sat Feb 7 09:58:55 2026
From: Yan Zhao
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v3 16/24] KVM: guest_memfd: Split for punch hole and private-to-shared conversion
Date: Tue, 6 Jan 2026 18:22:50 +0800
Message-ID: <20260106102250.25194-1-yan.y.zhao@intel.com>
X-Mailer: git-send-email 2.43.2
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>

In TDX, private page tables require precise zapping because faulting back the zapped mappings necessitates guest re-acceptance. Therefore, before performing a zap for hole punching and private-to-shared conversions, huge leafs that cross the boundary of the zapped GFN range in the mirror page table must be split.

Splitting may fail, usually due to running out of memory. If this happens, hole punching and private-to-shared conversion should bail out early and return an error to userspace.

Splitting is not necessary for zapping shared mappings or for zapping in kvm_gmem_release()/kvm_gmem_error_folio(). The penalty of zapping more shared mappings than necessary is minimal. All mappings are zapped in kvm_gmem_release(), and kvm_gmem_error_folio() zaps the entire folio range; KVM's basic assumption is that a huge mapping must have a single backend folio.

Signed-off-by: Yan Zhao
---
v3:
- Rebased to [2].
- Do not flush the TLB for kvm_split_cross_boundary_leafs(), i.e., only flush the TLB if zaps are performed.
[2] https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring-12-08-25

RFC v2:
- Rebased to [1]. As the changes in this patch are gmem specific, they may need to be updated if the implementation in [1] changes.
- Update kvm_split_boundary_leafs() to kvm_split_cross_boundary_leafs() and invoke it before kvm_gmem_punch_hole() and private-to-shared conversion.
[1] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/

RFC v1:
- New patch.
---
 virt/kvm/guest_memfd.c | 67 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 03613b791728..8e7fbed57a20 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -486,6 +486,55 @@ static int merge_truncate_range(struct inode *inode, pgoff_t start,
 	return ret;
 }
 
+static int __kvm_gmem_split_private(struct gmem_file *f, pgoff_t start, pgoff_t end)
+{
+	enum kvm_gfn_range_filter attr_filter = KVM_FILTER_PRIVATE;
+
+	bool locked = false;
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = f->kvm;
+	unsigned long index;
+	int ret = 0;
+
+	xa_for_each_range(&f->bindings, index, slot, start, end - 1) {
+		pgoff_t pgoff = slot->gmem.pgoff;
+		struct kvm_gfn_range gfn_range = {
+			.start = slot->base_gfn + max(pgoff, start) - pgoff,
+			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
+			.slot = slot,
+			.may_block = true,
+			.attr_filter = attr_filter,
+		};
+
+		if (!locked) {
+			KVM_MMU_LOCK(kvm);
+			locked = true;
+		}
+
+		ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
+		if (ret)
+			break;
+	}
+
+	if (locked)
+		KVM_MMU_UNLOCK(kvm);
+
+	return ret;
+}
+
+static int kvm_gmem_split_private(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+	struct gmem_file *f;
+	int r = 0;
+
+	kvm_gmem_for_each_file(f, inode->i_mapping) {
+		r = __kvm_gmem_split_private(f, start, end);
+		if (r)
+			break;
+	}
+	return r;
+}
+
 static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 {
 	pgoff_t start = offset >> PAGE_SHIFT;
@@ -499,6 +548,13 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	filemap_invalidate_lock(inode->i_mapping);
 
 	kvm_gmem_invalidate_begin(inode, start, end);
+
+	ret = kvm_gmem_split_private(inode, start, end);
+	if (ret) {
+		kvm_gmem_invalidate_end(inode, start, end);
+		filemap_invalidate_unlock(inode->i_mapping);
+		return ret;
+	}
 	kvm_gmem_zap(inode, start, end);
 
 	ret = merge_truncate_range(inode, start, len >> PAGE_SHIFT, true);
@@ -907,6 +963,17 @@ static int kvm_gmem_convert(struct inode *inode, pgoff_t start,
 	invalidate_start = kvm_gmem_compute_invalidate_start(inode, start);
 	invalidate_end = kvm_gmem_compute_invalidate_end(inode, end);
 	kvm_gmem_invalidate_begin(inode, invalidate_start, invalidate_end);
+
+	if (!to_private) {
+		r = kvm_gmem_split_private(inode, start, end);
+		if (r) {
+			*err_index = start;
+			mas_destroy(&mas);
+			kvm_gmem_invalidate_end(inode, invalidate_start, invalidate_end);
+			return r;
+		}
+	}
+
 	kvm_gmem_zap(inode, start, end);
 	kvm_gmem_invalidate_end(inode, invalidate_start, invalidate_end);
 
-- 
2.43.2

From nobody Sat Feb 7 09:58:55 2026
From: Yan Zhao
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v3 17/24] KVM: TDX: Get/Put DPAMT page pair only when mapping size is 4KB
Date: Tue, 6 Jan 2026 18:23:04 +0800
Message-ID: <20260106102304.25211-1-yan.y.zhao@intel.com>
X-Mailer: git-send-email 2.43.2
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>

From: "Kirill A. Shutemov"

Invoke tdx_pamt_{get/put}() to add/remove the Dynamic PAMT page pair for guest private memory only when the S-EPT mapping size is 4KB. When the mapping size is greater than 4KB, static PAMT pages are used, so there is no need to install/uninstall extra PAMT pages dynamically.

Signed-off-by: Kirill A. Shutemov
[Yan: Move level checking to callers of tdx_pamt_{get/put}()]
Signed-off-by: Yan Zhao
---
v3:
- New patch.

Checking for 4KB level was previously done inside tdx_pamt_{get/put}() in DPAMT v2 [1]. Move the checking to the callers of tdx_pamt_{get/put}() in KVM to avoid introducing an extra "level" parameter to tdx_pamt_{get/put}(). This is also because the callers that could have level > 4KB are limited in KVM, i.e., only inside tdx_sept_{set/remove}_private_spte().

[1] https://lore.kernel.org/all/20250609191340.2051741-5-kirill.shutemov@linux.intel.com
---
 arch/x86/kvm/vmx/tdx.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 712aaa3d45b7..c1dc1aaae49d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1722,9 +1722,11 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	WARN_ON_ONCE(!is_shadow_present_pte(mirror_spte) ||
		     (mirror_spte & VMX_EPT_RWX_MASK) != VMX_EPT_RWX_MASK);
 
-	ret = tdx_pamt_get(page, &tdx->prealloc);
-	if (ret)
-		return ret;
+	if (level == PG_LEVEL_4K) {
+		ret = tdx_pamt_get(page, &tdx->prealloc);
+		if (ret)
+			return ret;
+	}
 
 	/*
	 * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory()
@@ -1743,7 +1745,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	else
		ret = tdx_mem_page_add(kvm, gfn, level, pfn);
 
-	if (ret)
+	if (ret && level == PG_LEVEL_4K)
		tdx_pamt_put(page);
 
	return ret;
@@ -1911,7 +1913,9 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 
	tdx_quirk_reset_folio(folio, folio_page_idx(folio, page),
			      KVM_PAGES_PER_HPAGE(level));
-	tdx_pamt_put(page);
+
+	if (level == PG_LEVEL_4K)
+		tdx_pamt_put(page);
 }
 
 /*
-- 
2.43.2

From nobody Sat Feb 7 09:58:55 2026
From: Yan Zhao
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v3 18/24] x86/virt/tdx: Add loud warning when tdx_pamt_put() fails
Date: Tue, 6 Jan 2026 18:23:18 +0800
Message-ID: <20260106102318.25227-1-yan.y.zhao@intel.com>
X-Mailer: git-send-email 2.43.2
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>

tdx_pamt_put() does not return an error to its caller when the SEAMCALL TDH_PHYMEM_PAMT_REMOVE fails. Although pamt_refcount for the failed 2MB physical range is increased (so the DPAMT pages stay added), the failure means the 2MB physical range can only ever be mapped at 4KB level: a later SEAMCALL TDH_MEM_PAGE_AUG on that range at 2MB level will fail forever.

Since tdx_pamt_put() only fails when there are bugs in the host kernel or in the TDX module, simply add a loud warning to aid debugging after such an error occurs.
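[Editor's note] For readers unfamiliar with the distinction, a minimal sketch of why WARN_ONCE() is "louder" than pr_err(): it taints the kernel and dumps a stack trace (and trips panic_on_warn where enabled), but fires only on the first occurrence. The function below is hypothetical; only the two reporting calls correspond to the before/after of this patch.

    /* tdx_status here is a hypothetical SEAMCALL return value. */
    static void report_pamt_remove_failure(u64 tdx_status)
    {
            /* Before: a log line only; easy to miss, no backtrace. */
            pr_err("TDH_PHYMEM_PAMT_REMOVE failed: %#llx\n", tdx_status);

            /* After: taints the kernel and dumps a backtrace, once. */
            WARN_ONCE(1, "TDH_PHYMEM_PAMT_REMOVE failed: %#llx\n", tdx_status);
    }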
Link: https://lore.kernel.org/all/67d55b24ef1a80af615c3672e8436e0ac32e8efa.= camel@intel.com Suggested-by: Rick Edgecombe Signed-off-by: Yan Zhao --- v3: - new patch --- arch/x86/virt/vmx/tdx/tdx.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index c12665389b67..76963c563906 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -2348,8 +2348,7 @@ void tdx_pamt_put(struct page *page) */ atomic_inc(pamt_refcount); =20 - pr_err("TDH_PHYMEM_PAMT_REMOVE failed: %#llx\n", tdx_status); - + WARN_ONCE(1, "TDH_PHYMEM_PAMT_REMOVE failed: %#llx\n", tdx_status); /* * Don't free pamt_pa_array as it could hold garbage * when tdh_phymem_pamt_remove() fails. --=20 2.43.2 From nobody Sat Feb 7 09:58:55 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.15]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 51830335097; Tue, 6 Jan 2026 10:25:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.15 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1767695138; cv=none; b=tB40wTP8ZmWWaDf3aU9of4wuaTRn3laX2A3AiMN0bni3En/rkP+7Gtavf4hqE3/eyYpSFeGZ+fyFVGT6WJvfYXVgTLzSW86lHkAKnrKSkwp+p6c2RVpI2ds4e+u0JxJ+Ah3J5le3ZESCQZjaAZ/9a6CIkHDn37wgRauNxXfDZUw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1767695138; c=relaxed/simple; bh=VJy5UQSBk0MUjLZqt0wrUpLU7GZHYX/hSwGX2hN+5eg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=BoNbK8NAFu3iUkmLVfWP0ohycX3M5179MFY50Fo9qSebthdj7kjGPnOuTeaE8xlNL44QFpbc7MG8yE7HqRexFLRySNeNeF8AR7a6QJ5Zt2C5t9ISLVK3qhwd+W+nsC8oSRyWxbJnlMSw3ZaUpyQOa+Dsa5lDmKRvOfipZx9MFuw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=D+3pmAEB; arc=none smtp.client-ip=198.175.65.15 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="D+3pmAEB" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1767695136; x=1799231136; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=VJy5UQSBk0MUjLZqt0wrUpLU7GZHYX/hSwGX2hN+5eg=; b=D+3pmAEBOGCp9XV2vfg0SHUrr3/dp6mPjgm1VzpL+WtpZxAG+hteSiNb w5ww0Vz/qpfu3DnbAex0ZcUPM2AditmlG26L9R/YdYurWS9chO+B9af5c 9DRhJA2390jPD1v7QRw9wxgRR9/SKuTdujF3Q2XgDI0u1N/y2zqHRKlzH HuhtXnzCGX5ARsowSZtIJF+t1qxF74Rws89JUQ94DDQT2TH0XlSEHHFcH a7QNW3rm/CPClU4I7Hs88PMjJUphL41R5lR3uENXRsk4zscwjKZq7MMzw Ll4slcNu3rgRBFcJhnQfuNOEUqMZ+FbQ5ABojRojash/QYIe9CeAV/WNS w==; X-CSE-ConnectionGUID: 9NiaDqZoSlOtQ6MSCezPKg== X-CSE-MsgGUID: Ep/w2U9kTB2XHeR0sGasCA== X-IronPort-AV: E=McAfee;i="6800,10657,11662"; a="72689736" X-IronPort-AV: E=Sophos;i="6.21,204,1763452800"; d="scan'208";a="72689736" Received: from orviesa007.jf.intel.com ([10.64.159.147]) by orvoesa107.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jan 2026 02:25:34 -0800 X-CSE-ConnectionGUID: 98YaJTufTiKZm/pvRdVkzQ== X-CSE-MsgGUID: Hqu89t5FTZa55nCdoVfZ7Q== X-ExtLoop1: 1 X-IronPort-AV: 
From: Yan Zhao
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
Date: Tue, 6 Jan 2026 18:23:31 +0800
Message-ID: <20260106102331.25244-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>

Introduce a per-VM external cache for splitting the external page table by
adding KVM x86 ops for cache "topup", "free", and "need topup" operations.

Invoke the KVM x86 ops for "topup" and "need topup" on the per-VM external
split cache when splitting the mirror root in
tdp_mmu_split_huge_pages_root(), where there's no per-vCPU context. Invoke
the KVM x86 op for "free" to destroy the per-VM external split cache when
KVM frees memory caches.

This per-VM external split cache is only used when per-vCPU context is not
available. Use the per-vCPU external fault cache in the fault path when
per-vCPU context is available.

The per-VM external split cache is protected under both kvm->mmu_lock and
a cache lock inside vendor implementations to ensure that there are enough
pages in the cache for one split:

- Dequeuing from the per-VM external split cache is done in
  kvm_x86_ops.split_external_spte() under mmu_lock.
- The traversal in tdp_mmu_split_huge_pages_root() yields after topping up
  the per-VM cache, so that need_topup() is checked again after
  re-acquiring the mmu_lock.
- Vendor implementations of the per-VM external split cache provide a
  cache lock to protect the enqueuing/dequeuing of pages into/from the
  cache.

Here's the sequence showing how enough pages in the cache are guaranteed.

a. with write mmu_lock:
   1. write_lock(&kvm->mmu_lock)
      kvm_x86_ops.need_topup()
   2. write_unlock(&kvm->mmu_lock)
      kvm_x86_ops.topup() --> in vendor: {
          allocate pages
          get cache lock
          enqueue pages in cache
          put cache lock
      }
   3. write_lock(&kvm->mmu_lock)
      kvm_x86_ops.need_topup() (goto 2 if topup is necessary) (*)
      kvm_x86_ops.split_external_spte() --> in vendor: {
          get cache lock
          dequeue pages in cache
          put cache lock
      }
      write_unlock(&kvm->mmu_lock)

b. with read mmu_lock:
   1. read_lock(&kvm->mmu_lock)
      kvm_x86_ops.need_topup()
   2. read_unlock(&kvm->mmu_lock)
      kvm_x86_ops.topup() --> in vendor: {
          allocate pages
          get cache lock
          enqueue pages in cache
          put cache lock
      }
   3. read_lock(&kvm->mmu_lock)
      kvm_x86_ops.need_topup() (goto 2 if topup is necessary)
      kvm_x86_ops.split_external_spte() --> in vendor: {
          get cache lock
          kvm_x86_ops.need_topup() (return retry if topup is necessary) (**)
          dequeue pages in cache
          put cache lock
      }
      read_unlock(&kvm->mmu_lock)

Due to (*) and (**) in step 3, enough pages for the split are guaranteed.

Co-developed-by: Kirill A. Shutemov
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Yan Zhao
---
v3:
- Introduce x86 ops to manage the cache.
---
 arch/x86/include/asm/kvm-x86-ops.h |  3 ++
 arch/x86/include/asm/kvm_host.h    | 17 +++++++
 arch/x86/kvm/mmu/mmu.c             |  2 +
 arch/x86/kvm/mmu/tdp_mmu.c         | 71 +++++++++++++++++++++++++++++-
 4 files changed, 91 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 84fa8689b45c..307edc51ad8d 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -102,6 +102,9 @@ KVM_X86_OP_OPTIONAL(split_external_spte)
 KVM_X86_OP_OPTIONAL(alloc_external_fault_cache)
 KVM_X86_OP_OPTIONAL(topup_external_fault_cache)
 KVM_X86_OP_OPTIONAL(free_external_fault_cache)
+KVM_X86_OP_OPTIONAL(topup_external_per_vm_split_cache)
+KVM_X86_OP_OPTIONAL(free_external_per_vm_split_cache)
+KVM_X86_OP_OPTIONAL(need_topup_external_per_vm_split_cache)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 315ffb23e9d8..6122801f334b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1862,6 +1862,23 @@ struct kvm_x86_ops {
 	/* Free in external page fault cache. */
 	void (*free_external_fault_cache)(struct kvm_vcpu *vcpu);
 
+	/*
+	 * Top up extra pages needed in the per-VM cache for splitting external
+	 * page table.
+	 */
+	int (*topup_external_per_vm_split_cache)(struct kvm *kvm,
+						 enum pg_level level);
+
+	/* Free the per-VM cache for splitting external page table. */
+	void (*free_external_per_vm_split_cache)(struct kvm *kvm);
+
+	/*
+	 * Check if it's necessary to top up the per-VM cache for splitting
+	 * external page table.
+	 */
+	bool (*need_topup_external_per_vm_split_cache)(struct kvm *kvm,
+						       enum pg_level level);
+
 	bool (*has_wbinvd_exit)(void);
 
 	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 35a6e37bfc68..3d568512201d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6924,6 +6924,8 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
 	kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
 	kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
 	kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+	if (kvm_has_mirrored_tdp(kvm))
+		kvm_x86_call(free_external_per_vm_split_cache)(kvm);
 }
 
 void kvm_mmu_uninit_vm(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b984027343b7..b45d3da683f2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1606,6 +1606,55 @@ static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
 		(iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
 }
 
+/*
+ * Check the per-VM external split cache under write mmu_lock or read
+ * mmu_lock in tdp_mmu_split_huge_pages_root().
+ *
+ * When need_topup_external_split_cache() returns false, the mmu_lock is
+ * held throughout the execution of both
+ * (a) need_topup_external_split_cache(), and
+ * (b) the cache dequeuing (in tdx_sept_split_private_spte() called by
+ *     tdp_mmu_split_huge_page()).
+ *
+ * - When mmu_lock is held for write, the per-VM external split cache is
+ *   exclusively accessed by a single user. Therefore, the result returned
+ *   from need_topup_external_split_cache() is accurate.
+ *
+ * - When mmu_lock is held for read, the per-VM external split cache can be
+ *   shared among multiple users. Cache dequeuing in
+ *   tdx_sept_split_private_spte() thus needs to re-check the cache page
+ *   count after acquiring its internal split cache lock, and return an
+ *   error for retry if the cache page count is not sufficient.
+ */
+static bool need_topup_external_split_cache(struct kvm *kvm, int level)
+{
+	return kvm_x86_call(need_topup_external_per_vm_split_cache)(kvm, level);
+}
+
+static int topup_external_split_cache(struct kvm *kvm, int level, bool shared)
+{
+	int r;
+
+	rcu_read_unlock();
+
+	if (shared)
+		read_unlock(&kvm->mmu_lock);
+	else
+		write_unlock(&kvm->mmu_lock);
+
+	r = kvm_x86_call(topup_external_per_vm_split_cache)(kvm, level);
+
+	if (shared)
+		read_lock(&kvm->mmu_lock);
+	else
+		write_lock(&kvm->mmu_lock);
+
+	if (!r)
+		rcu_read_lock();
+
+	return r;
+}
+
 static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 					 struct kvm_mmu_page *root,
 					 gfn_t start, gfn_t end,
@@ -1614,6 +1663,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 {
 	struct kvm_mmu_page *sp = NULL;
 	struct tdp_iter iter;
+	int r = 0;
 
 	rcu_read_lock();
 
@@ -1672,6 +1722,21 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 			continue;
 		}
 
+		if (is_mirror_sp(root) &&
+		    need_topup_external_split_cache(kvm, iter.level)) {
+			r = topup_external_split_cache(kvm, iter.level, shared);
+
+			if (r) {
+				trace_kvm_mmu_split_huge_page(iter.gfn,
+							      iter.old_spte,
+							      iter.level, r);
+				goto out;
+			}
+
+			iter.yielded = true;
+			continue;
+		}
+
 		tdp_mmu_init_child_sp(sp, &iter);
 
 		if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
@@ -1682,15 +1747,17 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 
 	rcu_read_unlock();
 
+out:
	/*
	 * It's possible to exit the loop having never used the last sp if, for
	 * example, a vCPU doing HugePage NX splitting wins the race and
-	 * installs its own sp in place of the last sp we tried to split.
+	 * installs its own sp in place of the last sp we tried to split, or
+	 * topup_external_split_cache() failed.
	 */
	if (sp)
		tdp_mmu_free_sp(sp);
 
-	return 0;
+	return r;
 }
 
 
-- 
2.43.2
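To make the two lock orders above concrete, here is a minimal user-space
sketch of the protocol, modeling mmu_lock as a pthread rwlock and the
vendor cache as a counter behind a mutex. All names (split_cache, topup,
dequeue_for_split, MIN_PAGES) are illustrative stand-ins, not the kernel
symbols:

	#include <pthread.h>
	#include <stdbool.h>
	#include <stdio.h>

	#define MIN_PAGES 4			/* e.g. two DPAMT page pairs */

	static pthread_rwlock_t mmu_lock = PTHREAD_RWLOCK_INITIALIZER;

	struct split_cache {
		pthread_mutex_t lock;		/* the vendor-side cache lock */
		int cnt;			/* pages currently enqueued */
	};

	static bool need_topup(struct split_cache *c)
	{
		return c->cnt < MIN_PAGES;
	}

	/* Called with mmu_lock dropped: allocation may sleep. */
	static void topup(struct split_cache *c)
	{
		while (c->cnt < MIN_PAGES) {
			/* "allocate a page", then enqueue under the cache lock */
			pthread_mutex_lock(&c->lock);
			c->cnt++;
			pthread_mutex_unlock(&c->lock);
		}
	}

	/* Models the dequeue in split_external_spte(); the (**) re-check. */
	static bool dequeue_for_split(struct split_cache *c, bool shared)
	{
		bool ok = true;

		pthread_mutex_lock(&c->lock);
		if (shared && need_topup(c))
			ok = false;		/* another reader drained the cache */
		else
			c->cnt -= MIN_PAGES;	/* consume the pages for one split */
		pthread_mutex_unlock(&c->lock);
		return ok;
	}

	static void split_one(struct split_cache *c, bool shared)
	{
		for (;;) {
			if (shared)
				pthread_rwlock_rdlock(&mmu_lock);
			else
				pthread_rwlock_wrlock(&mmu_lock);

			/* The (*) check: re-evaluated each time the lock is retaken. */
			if (!need_topup(c) && dequeue_for_split(c, shared)) {
				/* ... the actual split would happen here ... */
				pthread_rwlock_unlock(&mmu_lock);
				return;
			}

			pthread_rwlock_unlock(&mmu_lock);
			topup(c);		/* step 2: top up without mmu_lock */
		}
	}

	int main(void)
	{
		struct split_cache c = { .lock = PTHREAD_MUTEX_INITIALIZER, .cnt = 0 };

		split_one(&c, true);
		printf("pages left in cache: %d\n", c.cnt);
		return 0;
	}

Under write mmu_lock the dequeue can never fail, because no other user can
touch the cache between the (*) check and the dequeue; under read mmu_lock
the (**) re-check inside the cache lock turns a lost race into a retry.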
From nobody Sat Feb 7 09:58:55 2026
From: Yan Zhao
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v3 20/24] KVM: TDX: Implement per-VM external cache for splitting in TDX
Date: Tue, 6 Jan 2026 18:23:45 +0800
Message-ID: <20260106102345.25261-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>

Implement the KVM x86 ops for the per-VM external cache for splitting the
external page table in TDX.

Since the per-VM external cache for splitting the external page table is
intended to be invoked outside of vCPU threads, i.e., when the per-vCPU
external_fault_cache is not available, introduce a spinlock
prealloc_split_cache_lock in TDX to protect page enqueue/dequeue
operations on the per-VM external split cache.

Cache topup in tdx_topup_vm_split_cache() performs page enqueuing under
prealloc_split_cache_lock. Cache dequeuing will be implemented in
tdx_sept_split_private_spte() in later patches, which will also hold
prealloc_split_cache_lock.

Checking the need for topup in tdx_need_topup_vm_split_cache() does not
hold prealloc_split_cache_lock internally. When
tdx_need_topup_vm_split_cache() is invoked under write mmu_lock, there's
no need to further acquire prealloc_split_cache_lock; when
tdx_need_topup_vm_split_cache() is invoked under read mmu_lock, the check
needs to be repeated after acquiring prealloc_split_cache_lock for cache
dequeuing.

Cache free does not hold prealloc_split_cache_lock because it's intended
to be called when there's no contention.

Signed-off-by: Yan Zhao
---
v3:
- New patch, corresponding to DPAMT v4.
---
 arch/x86/kvm/vmx/tdx.c | 61 ++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h |  5 ++++
 2 files changed, 66 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c1dc1aaae49d..40cca273d480 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -671,6 +671,9 @@ int tdx_vm_init(struct kvm *kvm)
 
 	kvm_tdx->state = TD_STATE_UNINITIALIZED;
 
+	INIT_LIST_HEAD(&kvm_tdx->prealloc_split_cache.page_list);
+	spin_lock_init(&kvm_tdx->prealloc_split_cache_lock);
+
 	return 0;
 }
 
@@ -1680,6 +1683,61 @@ static void tdx_free_external_fault_cache(struct kvm_vcpu *vcpu)
 		__free_page(page);
 }
 
+/*
+ * Need to prepare at least 2 pairs of PAMT pages (i.e., 4 PAMT pages) for
+ * splitting a S-EPT PG_LEVEL_2M mapping when Dynamic PAMT is enabled:
+ * - 1 pair for the new 4KB S-EPT page for splitting, which may be dequeued in
+ *   tdx_sept_split_private_spte() when there are no installed PAMT pages for
+ *   the 2MB physical range of the S-EPT page.
+ * - 1 pair for demoting guest private memory from 2MB to 4KB, which will be
+ *   dequeued in tdh_mem_page_demote().
+ */
+static int tdx_min_split_cache_sz(struct kvm *kvm, int level)
+{
+	KVM_BUG_ON(level != PG_LEVEL_2M, kvm);
+
+	if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
+		return 0;
+
+	return tdx_dpamt_entry_pages() * 2;
+}
+
+static int tdx_topup_vm_split_cache(struct kvm *kvm, enum pg_level level)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct tdx_prealloc *prealloc = &kvm_tdx->prealloc_split_cache;
+	int cnt = tdx_min_split_cache_sz(kvm, level);
+
+	while (READ_ONCE(prealloc->cnt) < cnt) {
+		struct page *page = alloc_page(GFP_KERNEL);
+
+		if (!page)
+			return -ENOMEM;
+
+		spin_lock(&kvm_tdx->prealloc_split_cache_lock);
+		list_add(&page->lru, &prealloc->page_list);
+		prealloc->cnt++;
+		spin_unlock(&kvm_tdx->prealloc_split_cache_lock);
+	}
+
+	return 0;
+}
+
+static bool tdx_need_topup_vm_split_cache(struct kvm *kvm, enum pg_level level)
+{
+	struct tdx_prealloc *prealloc = &to_kvm_tdx(kvm)->prealloc_split_cache;
+
+	return prealloc->cnt < tdx_min_split_cache_sz(kvm, level);
+}
+
+static void tdx_free_vm_split_cache(struct kvm *kvm)
+{
+	struct page *page;
+
+	while ((page = get_tdx_prealloc_page(&to_kvm_tdx(kvm)->prealloc_split_cache)))
+		__free_page(page);
+}
+
 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 			    enum pg_level level, kvm_pfn_t pfn)
 {
@@ -3804,4 +3862,7 @@ void __init tdx_hardware_setup(void)
 	vt_x86_ops.alloc_external_fault_cache = tdx_alloc_external_fault_cache;
 	vt_x86_ops.topup_external_fault_cache = tdx_topup_external_fault_cache;
 	vt_x86_ops.free_external_fault_cache = tdx_free_external_fault_cache;
+	vt_x86_ops.topup_external_per_vm_split_cache = tdx_topup_vm_split_cache;
+	vt_x86_ops.need_topup_external_per_vm_split_cache = tdx_need_topup_vm_split_cache;
+	vt_x86_ops.free_external_per_vm_split_cache = tdx_free_vm_split_cache;
 }
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 43dd295b7fd6..034e3ddfb679 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -48,6 +48,11 @@ struct kvm_tdx {
 	 * Set/unset is protected with kvm->mmu_lock.
 	 */
 	bool wait_for_sept_zap;
+
+	/* The per-VM cache for splitting S-EPT */
+	struct tdx_prealloc prealloc_split_cache;
+	/* Protect page enqueuing/dequeuing in prealloc_split_cache */
+	spinlock_t prealloc_split_cache_lock;
 };
 
 /* TDX module vCPU states */
-- 
2.43.2
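To make the sizing in tdx_min_split_cache_sz() concrete, here is a minimal
sketch of the worst-case accounting. It assumes tdx_dpamt_entry_pages()
returns 2, i.e., a Dynamic PAMT page pair is two pages; that value is an
assumption for illustration only, the real one comes from the TDX module:

	#include <stdio.h>

	#define DPAMT_ENTRY_PAGES 2	/* assumed: one PAMT page pair = 2 pages */

	/*
	 * Worst case for one 2MB split: one pair for the new 4KB S-EPT page's
	 * own PAMT backing, plus one pair consumed by TDH.MEM.PAGE.DEMOTE for
	 * the demoted 2MB guest range.
	 */
	static int min_split_cache_pages(void)
	{
		return DPAMT_ENTRY_PAGES * 2;
	}

	int main(void)
	{
		printf("min pages per split: %d\n", min_split_cache_pages());
		return 0;
	}

With the assumed pair size this prints 4, matching the "2 pairs of PAMT
pages (i.e., 4 PAMT pages)" figure in the comment above.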
From nobody Sat Feb 7 09:58:55 2026
From: Yan Zhao
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v3 21/24] KVM: TDX: Add/Remove DPAMT pages for the new S-EPT page for splitting
Date: Tue, 6 Jan 2026 18:23:59 +0800
Message-ID: <20260106102359.25278-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>

From: "Kirill A. Shutemov"

Splitting a huge S-EPT mapping requires the host to provide a new 4KB
page, which will be added as the S-EPT page to hold the smaller mappings
after the split.

Install a Dynamic PAMT page pair for the new S-EPT page before passing the
S-EPT page to tdh_mem_page_demote(); uninstall and free the Dynamic PAMT
page pair when tdh_mem_page_demote() fails.

When Dynamic PAMT is enabled and there's no installed pair for the 2MB
physical range containing the new S-EPT page, tdx_pamt_get() dequeues a
pair of preallocated pages from the per-VM prealloc_split_cache and
installs them as the Dynamic PAMT page pair. Hold
prealloc_split_cache_lock when dequeuing from the per-VM
prealloc_split_cache.

After tdh_mem_page_demote() fails, tdx_pamt_put() uninstalls and frees the
Dynamic PAMT page pair for the new S-EPT page if Dynamic PAMT is enabled
and the new S-EPT page is the last page in the 2MB physical range
requiring the Dynamic PAMT page pair.

Signed-off-by: Kirill A. Shutemov
Co-developed-by: Yan Zhao
Signed-off-by: Yan Zhao
---
v3:
- Split out as a new patch.
- Add KVM_BUG_ON() after tdx_pamt_get() fails. (Vishal)
---
 arch/x86/kvm/vmx/tdx.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 40cca273d480..ec47bd799274 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1996,6 +1996,7 @@ static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 err, entry, level_state;
+	int ret;
 
 	if (KVM_BUG_ON(kvm_tdx->state != TD_STATE_RUNNABLE ||
 		       level != PG_LEVEL_2M, kvm))
@@ -2014,10 +2015,18 @@ static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level
 
 	tdx_track(kvm);
 
+	spin_lock(&kvm_tdx->prealloc_split_cache_lock);
+	ret = tdx_pamt_get(new_sept_page, &kvm_tdx->prealloc_split_cache);
+	spin_unlock(&kvm_tdx->prealloc_split_cache_lock);
+	if (KVM_BUG_ON(ret, kvm))
+		return -EIO;
+
 	err = tdh_do_no_vcpus(tdh_mem_page_demote, kvm, &kvm_tdx->td, gpa,
 			      tdx_level, new_sept_page, &entry, &level_state);
-	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_DEMOTE, entry, level_state, kvm))
+	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_DEMOTE, entry, level_state, kvm)) {
+		tdx_pamt_put(new_sept_page);
 		return -EIO;
+	}
 
 	return 0;
 }
-- 
2.43.2
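The tdx_pamt_get()/tdx_pamt_put() pairing above is easiest to follow with
the refcount semantics spelled out. A user-space model of the behavior the
log describes (one refcount per 2MB range: the pair is dequeued on the
0 -> 1 transition and freed again on the 1 -> 0 transition); pamt_get()
and pamt_put() are simplified stand-ins, not the actual TDX helpers:

	#include <stdbool.h>
	#include <stdio.h>

	#define PAIR_PAGES 2

	static int cache_pages = 4;	/* preallocated per-VM split cache */
	static int pamt_refcount;	/* users of this 2MB range's PAMT pair */

	static int pamt_get(void)
	{
		if (pamt_refcount == 0) {
			if (cache_pages < PAIR_PAGES)
				return -1;		/* cache underflow */
			cache_pages -= PAIR_PAGES;	/* install the pair */
		}
		pamt_refcount++;
		return 0;
	}

	static void pamt_put(void)
	{
		if (--pamt_refcount == 0)
			cache_pages += PAIR_PAGES;	/* last user: free the pair */
	}

	int main(void)
	{
		bool demote_failed = true;	/* pretend TDH.MEM.PAGE.DEMOTE failed */

		if (pamt_get())
			return 1;
		if (demote_failed)
			pamt_put();	/* roll back: the 2MB mapping is intact */

		printf("refcount=%d cache=%d\n", pamt_refcount, cache_pages);
		return 0;
	}

On the failure path the put rolls the range back to its prior state, which
is why the pair is freed only when the failing S-EPT page was its last
user.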
From nobody Sat Feb 7 09:58:55 2026
From: Yan Zhao
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v3 22/24] x86/tdx: Add/Remove DPAMT pages for guest private memory to demote
Date: Tue, 6 Jan 2026 18:24:13 +0800
Message-ID: <20260106102413.25294-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>

From: "Kirill A. Shutemov"

When Dynamic PAMT is enabled and a 2MB mapping is split into 512 4KB
mappings, SEAMCALL TDH.MEM.PAGE.DEMOTE takes the Dynamic PAMT page pair in
registers R12 and R13. The Dynamic PAMT page pair is used to store
physical memory metadata for the 2MB guest private memory after its S-EPT
mapping has been successfully split to 4KB.

Pass prealloc_split_cache (the per-VM split cache) to the SEAMCALL wrapper
tdh_mem_page_demote() for dequeuing Dynamic PAMT pages from the cache.
Protect the cache dequeuing in KVM with prealloc_split_cache_lock.

Inside the wrapper tdh_mem_page_demote(), dequeue the Dynamic PAMT pages
into the guest_memory_pamt_page array and copy the page addresses to R12
and R13. Invoke SEAMCALL TDH_MEM_PAGE_DEMOTE using seamcall_saved_ret() to
handle registers above R11. Free the Dynamic PAMT pages if SEAMCALL
TDH_MEM_PAGE_DEMOTE fails, since the guest private memory is then still
mapped at the 2MB level.

Opportunistically, rename dpamt_args_array_ptr() to
dpamt_args_array_ptr_rdx() for tdh_phymem_pamt_{add,remove} and invoke
dpamt_args_array_ptr_r12() in tdh_mem_page_demote() for populating
registers starting from R12.

Signed-off-by: Kirill A. Shutemov
Co-developed-by: Yan Zhao
Signed-off-by: Yan Zhao
---
v3:
- Split out as a new patch.
- Get pages from the preallocated cache, corresponding to DPAMT v4.
---
 arch/x86/include/asm/tdx.h  |  1 +
 arch/x86/kvm/vmx/tdx.c      |  5 ++-
 arch/x86/virt/vmx/tdx/tdx.c | 76 ++++++++++++++++++++++++++-----------
 3 files changed, 59 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index abe484045132..5fc7498392fd 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -251,6 +251,7 @@ u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
 u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
 u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
 u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
+			struct tdx_prealloc *prealloc,
 			u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_finalize(struct tdx_td *td);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ec47bd799274..a11ff02a4f30 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2021,8 +2021,11 @@ static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level
 	if (KVM_BUG_ON(ret, kvm))
 		return -EIO;
 
+	spin_lock(&kvm_tdx->prealloc_split_cache_lock);
 	err = tdh_do_no_vcpus(tdh_mem_page_demote, kvm, &kvm_tdx->td, gpa,
-			      tdx_level, new_sept_page, &entry, &level_state);
+			      tdx_level, new_sept_page,
+			      &kvm_tdx->prealloc_split_cache, &entry, &level_state);
+	spin_unlock(&kvm_tdx->prealloc_split_cache_lock);
 	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_DEMOTE, entry, level_state, kvm)) {
 		tdx_pamt_put(new_sept_page);
 		return -EIO;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 76963c563906..9917e4e7705f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1848,25 +1848,69 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
 }
 EXPORT_SYMBOL_GPL(tdh_mng_rd);
 
+static int alloc_pamt_array(u64 *pa_array, struct tdx_prealloc *prealloc);
+static void free_pamt_array(u64 *pa_array);
+
+/*
+ * The TDX spec treats the registers like an array, as they are ordered
+ * in the struct. The array size is limited by the number of registers,
+ * so define the max size it could be for worst case allocations and sanity
+ * checking.
+ */
+#define MAX_TDX_ARG_SIZE(reg)	((sizeof(struct tdx_module_args) - \
+				  offsetof(struct tdx_module_args, reg)) / sizeof(u64))
+#define TDX_ARG_INDEX(reg)	(offsetof(struct tdx_module_args, reg) / \
+				 sizeof(u64))
+
+/*
+ * Treat the struct's registers like an array that starts at R12, per the
+ * TDX spec. Do some sanity checks, and return an indexable type.
+ */
+static u64 *dpamt_args_array_ptr_r12(struct tdx_module_array_args *args)
+{
+	WARN_ON_ONCE(tdx_dpamt_entry_pages() > MAX_TDX_ARG_SIZE(r12));
+
+	return &args->args_array[TDX_ARG_INDEX(r12)];
+}
+
 u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
+			struct tdx_prealloc *prealloc,
 			u64 *ext_err1, u64 *ext_err2)
 {
-	struct tdx_module_args args = {
-		.rcx = gpa | level,
-		.rdx = tdx_tdr_pa(td),
-		.r8 = page_to_phys(new_sept_page),
+	bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) && level == TDX_PS_2M;
+	u64 guest_memory_pamt_page[MAX_TDX_ARG_SIZE(r12)];
+	struct tdx_module_array_args args = {
+		.args.rcx = gpa | level,
+		.args.rdx = tdx_tdr_pa(td),
+		.args.r8 = page_to_phys(new_sept_page),
 	};
 	u64 ret;
 
 	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
 		return TDX_SW_ERROR;
 
+	if (dpamt) {
+		u64 *args_array = dpamt_args_array_ptr_r12(&args);
+
+		if (alloc_pamt_array(guest_memory_pamt_page, prealloc))
+			return TDX_SW_ERROR;
+
+		/*
+		 * Copy the PAMT page PAs of the guest memory into the struct,
+		 * per the TDX ABI.
+		 */
+		memcpy(args_array, guest_memory_pamt_page,
+		       tdx_dpamt_entry_pages() * sizeof(*args_array));
+	}
+
 	/* Flush the new S-EPT page to be added */
 	tdx_clflush_page(new_sept_page);
-	ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
 
-	*ext_err1 = args.rcx;
-	*ext_err2 = args.rdx;
+	ret = seamcall_saved_ret(TDH_MEM_PAGE_DEMOTE, &args.args);
+
+	*ext_err1 = args.args.rcx;
+	*ext_err2 = args.args.rdx;
+
+	if (dpamt && ret)
+		free_pamt_array(guest_memory_pamt_page);
 
 	return ret;
 }
@@ -2104,23 +2148,11 @@ static struct page *alloc_dpamt_page(struct tdx_prealloc *prealloc)
 	return alloc_page(GFP_KERNEL_ACCOUNT);
 }
 
-
-/*
- * The TDX spec treats the registers like an array, as they are ordered
- * in the struct. The array size is limited by the number or registers,
- * so define the max size it could be for worst case allocations and sanity
- * checking.
- */
-#define MAX_TDX_ARG_SIZE(reg) (sizeof(struct tdx_module_args) - \
-			offsetof(struct tdx_module_args, reg))
-#define TDX_ARG_INDEX(reg) (offsetof(struct tdx_module_args, reg) / \
-			sizeof(u64))
-
 /*
  * Treat struct the registers like an array that starts at RDX, per
  * TDX spec. Do some sanitychecks, and return an indexable type.
  */
-static u64 *dpamt_args_array_ptr(struct tdx_module_array_args *args)
+static u64 *dpamt_args_array_ptr_rdx(struct tdx_module_array_args *args)
 {
 	WARN_ON_ONCE(tdx_dpamt_entry_pages() > MAX_TDX_ARG_SIZE(rdx));
 
@@ -2188,7 +2220,7 @@ static u64 tdh_phymem_pamt_add(struct page *page, u64 *pamt_pa_array)
 	struct tdx_module_array_args args = {
 		.args.rcx = pamt_2mb_arg(page)
 	};
-	u64 *dpamt_arg_array = dpamt_args_array_ptr(&args);
+	u64 *dpamt_arg_array = dpamt_args_array_ptr_rdx(&args);
 
 	/* Copy PAMT page PA's into the struct per the TDX ABI */
 	memcpy(dpamt_arg_array, pamt_pa_array,
@@ -2216,7 +2248,7 @@ static u64 tdh_phymem_pamt_remove(struct page *page, u64 *pamt_pa_array)
 	struct tdx_module_array_args args = {
 		.args.rcx = pamt_2mb_arg(page),
 	};
-	u64 *args_array = dpamt_args_array_ptr(&args);
+	u64 *args_array = dpamt_args_array_ptr_rdx(&args);
 	u64 ret;
 
 	ret = seamcall_ret(TDH_PHYMEM_PAMT_REMOVE, &args.args);
-- 
2.43.2
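The register-array trick used by dpamt_args_array_ptr_r12() above can be
shown in isolation. A user-space sketch, with struct regs standing in for
struct tdx_module_args (the real struct has more registers and kernel
types):

	#include <stddef.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	/* Stand-in for struct tdx_module_args: u64 "registers" in ABI order. */
	struct regs {
		uint64_t rcx, rdx, r8, r9, r10, r11, r12, r13, r14, r15;
	};

	/* How many u64 slots remain from 'reg' to the end of the struct. */
	#define MAX_ARG_SIZE(reg)	((sizeof(struct regs) - \
					  offsetof(struct regs, reg)) / sizeof(uint64_t))
	/* Index of 'reg' when the struct is viewed as a u64 array. */
	#define ARG_INDEX(reg)		(offsetof(struct regs, reg) / sizeof(uint64_t))

	int main(void)
	{
		struct regs args = { .rcx = 0x200000 | 2 };	/* gpa | level */
		uint64_t pamt_pa[2] = { 0x11000, 0x12000 };	/* fake page PAs */
		uint64_t *arr = (uint64_t *)&args;

		/*
		 * Copy the PAMT page PAs into R12/R13, using the "registers are
		 * an ordered array" view; this mirrors the memcpy in the patch.
		 */
		memcpy(&arr[ARG_INDEX(r12)], pamt_pa, sizeof(pamt_pa));

		printf("r12=%#jx r13=%#jx, slots from r12: %zu\n",
		       (uintmax_t)args.r12, (uintmax_t)args.r13,
		       (size_t)MAX_ARG_SIZE(r12));
		return 0;
	}

This also shows why the old MAX_TDX_ARG_SIZE() was wrong: without the
division by sizeof(u64) it returned a byte count, not a slot count, which
the moved definition in the patch corrects.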
From nobody Sat Feb 7 09:58:55 2026
From: Yan Zhao
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v3 23/24] x86/tdx: Pass guest memory's PFN info to demote for updating pamt_refcount
Date: Tue, 6 Jan 2026 18:24:26 +0800
Message-ID: <20260106102426.25311-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>

From: "Kirill A. Shutemov"

Pass the guest memory's PFN info to tdh_mem_page_demote() by adding
parameters "guest_folio" and "guest_start_idx" to tdh_mem_page_demote().

The guest memory's PFN info is not required by SEAMCALL
TDH_MEM_PAGE_DEMOTE itself. Instead, it's used by the host kernel to track
the pamt_refcount for the 2MB range containing the guest private memory.

After the S-EPT mapping is successfully split, set the pamt_refcount for
the 2MB range containing the guest private memory to 512, after ensuring
its original value is 0. Warn loudly if setting the refcount fails, as
that indicates a kernel bug.

Check in tdh_mem_page_demote() that the guest memory's base PFN is 2MB
aligned and that all the guest memory is contained in a single folio, to
guard against kernel bugs.

Signed-off-by: Kirill A. Shutemov
Co-developed-by: Yan Zhao
Signed-off-by: Yan Zhao
---
v3:
- Split out as a new patch.
- Added parameters "guest_folio" and "guest_start_idx" to pass the guest
  memory PFN info.
- Use atomic_cmpxchg_release() to set guest_pamt_refcount.
- No need to add a param "pfn_for_gfn" to kvm_x86_ops.split_external_spt(),
  as the PFN info is already contained in the param "old_mirror_spte" of
  kvm_x86_ops.split_external_spte().
---
 arch/x86/include/asm/tdx.h  |  6 +++---
 arch/x86/kvm/vmx/tdx.c      |  9 ++++++---
 arch/x86/virt/vmx/tdx/tdx.c | 30 +++++++++++++++++++++++++-----
 3 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 5fc7498392fd..f536782da157 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -250,9 +250,9 @@ u64 tdh_mng_key_config(struct tdx_td *td);
 u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
 u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
 u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
-u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
-			struct tdx_prealloc *prealloc,
-			u64 *ext_err1, u64 *ext_err2);
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct folio *guest_folio,
+			unsigned long guest_start_idx, struct page *new_sept_page,
+			struct tdx_prealloc *prealloc, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_finalize(struct tdx_td *td);
 u64 tdh_vp_flush(struct tdx_vp *vp);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a11ff02a4f30..0054a9de867c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1991,7 +1991,9 @@ static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level
 					u64 old_mirror_spte, void *new_private_spt,
 					bool mmu_lock_shared)
 {
+	struct page *guest_page = pfn_to_page(spte_to_pfn(old_mirror_spte));
 	struct page *new_sept_page = virt_to_page(new_private_spt);
+	struct folio *guest_folio = page_folio(guest_page);
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	gpa_t gpa = gfn_to_gpa(gfn);
@@ -2022,9 +2024,10 @@ static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level
 		return -EIO;
 
 	spin_lock(&kvm_tdx->prealloc_split_cache_lock);
-	err = tdh_do_no_vcpus(tdh_mem_page_demote, kvm, &kvm_tdx->td, gpa,
-			      tdx_level, new_sept_page,
-			      &kvm_tdx->prealloc_split_cache, &entry, &level_state);
+	err = tdh_do_no_vcpus(tdh_mem_page_demote, kvm, &kvm_tdx->td, gpa, tdx_level,
+			      guest_folio, folio_page_idx(guest_folio, guest_page),
+			      new_sept_page, &kvm_tdx->prealloc_split_cache,
+			      &entry, &level_state);
 	spin_unlock(&kvm_tdx->prealloc_split_cache_lock);
 	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_DEMOTE, entry, level_state, kvm)) {
 		tdx_pamt_put(new_sept_page);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 9917e4e7705f..d036d9b5c87a 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1871,9 +1871,9 @@ static u64 *dpamt_args_array_ptr_r12(struct tdx_module_array_args *args)
 	return &args->args_array[TDX_ARG_INDEX(r12)];
 }
 
-u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
-			struct tdx_prealloc *prealloc,
-			u64 *ext_err1, u64 *ext_err2)
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct folio *guest_folio,
+			unsigned long guest_start_idx, struct page *new_sept_page,
+			struct tdx_prealloc *prealloc, u64 *ext_err1, u64 *ext_err2)
 {
 	bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) && level == TDX_PS_2M;
 	u64 guest_memory_pamt_page[MAX_TDX_ARG_SIZE(r12)];
@@ -1882,6 +1882,8 @@ u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_
 		.args.rdx = tdx_tdr_pa(td),
 		.args.r8 = page_to_phys(new_sept_page),
 	};
+	/* base pfn for guest private memory */
+	unsigned long guest_base_pfn;
 	u64 ret;
 
 	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
@@ -1889,6 +1891,15 @@ u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_
 
 	if (dpamt) {
 		u64 *args_array = dpamt_args_array_ptr_r12(&args);
+		unsigned long npages = 1 << (level * PTE_SHIFT);
+		struct page *guest_page;
+
+		guest_page = folio_page(guest_folio, guest_start_idx);
+		guest_base_pfn = page_to_pfn(guest_page);
+
+		if (guest_start_idx + npages > folio_nr_pages(guest_folio) ||
+		    !IS_ALIGNED(guest_base_pfn, npages))
+			return TDX_OPERAND_INVALID;
 
 		if (alloc_pamt_array(guest_memory_pamt_page, prealloc))
 			return TDX_SW_ERROR;
@@ -1909,9 +1920,18 @@ u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_
 	*ext_err1 = args.args.rcx;
 	*ext_err2 = args.args.rdx;
 
-	if (dpamt && ret)
-		free_pamt_array(guest_memory_pamt_page);
+	if (dpamt) {
+		if (ret) {
+			free_pamt_array(guest_memory_pamt_page);
+		} else {
+			/* PAMT refcount for guest private memory */
+			atomic_t *pamt_refcount;
 
+			pamt_refcount = tdx_find_pamt_refcount(guest_base_pfn);
+			WARN_ON_ONCE(atomic_cmpxchg_release(pamt_refcount, 0,
+							    PTRS_PER_PMD));
+		}
+	}
 	return ret;
 }
 EXPORT_SYMBOL_GPL(tdh_mem_page_demote);
-- 
2.43.2
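The containment/alignment check and the 0 -> PTRS_PER_PMD refcount
hand-off above can be modeled in user-space C with stdatomic. Here
pamt_refcount is a plain stand-in for the entry tdx_find_pamt_refcount()
would return, and PAGES_PER_2M stands in for 1 << (level * PTE_SHIFT) at
the 2MB level:

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	#define PAGES_PER_2M	512

	static atomic_int pamt_refcount;	/* per-2MB-range refcount */

	static bool demote_checks_ok(uint64_t base_pfn, uint64_t start_idx,
				     uint64_t folio_nr_pages)
	{
		/* The whole 2MB range must sit inside one folio and be aligned. */
		return start_idx + PAGES_PER_2M <= folio_nr_pages &&
		       (base_pfn & (PAGES_PER_2M - 1)) == 0;
	}

	static void publish_refcount_after_split(void)
	{
		int expected = 0;

		/*
		 * 0 -> 512 in a single release operation: the 512 new 4KB
		 * mappings become visible only after the PAMT pair backing
		 * them does. Any old value other than 0 means a bug.
		 */
		if (!atomic_compare_exchange_strong_explicit(&pamt_refcount,
							     &expected,
							     PAGES_PER_2M,
							     memory_order_release,
							     memory_order_relaxed))
			fprintf(stderr, "bug: refcount was %d, not 0\n", expected);
	}

	int main(void)
	{
		if (demote_checks_ok(0x200, 0, 512))
			publish_refcount_after_split();
		printf("refcount = %d\n", atomic_load(&pamt_refcount));
		return 0;
	}

The release ordering is the point of using atomic_cmpxchg_release() in the
patch: the refcount becomes nonzero only after the demote has completed.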
From nobody Sat Feb 7 09:58:55 2026
From: Yan Zhao
To: pbonzini@redhat.com, seanjc@google.com
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, ackerleytng@google.com, michael.roth@amd.com, david@kernel.org, vannapurve@google.com, sagis@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, nik.borisov@suse.com, pgonda@google.com, fan.du@intel.com, jun.miao@intel.com, francescolavra.fl@gmail.com, jgross@suse.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, kai.huang@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, chao.gao@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v3 24/24] KVM: TDX: Turn on PG_LEVEL_2M
Date: Tue, 6 Jan 2026 18:24:40 +0800
Message-ID: <20260106102440.25328-1-yan.y.zhao@intel.com>
In-Reply-To: <20260106101646.24809-1-yan.y.zhao@intel.com>
References: <20260106101646.24809-1-yan.y.zhao@intel.com>

Turn on PG_LEVEL_2M in tdx_gmem_max_mapping_level() when TDX huge page is
enabled and the TD is RUNNABLE.

Introduce a module parameter named "tdx_huge_page" for kvm-intel.ko to
enable/disable TDX huge page. Turn TDX huge page off if the TDX module
does not support TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY.

Force the page size to 4KB during TD build time to simplify the code
design, since:
- tdh_mem_page_add() only adds private pages at 4KB.
- The amount of initial memory pages is usually limited (e.g. ~4MB in a
  typical Linux TD).

Update the warnings and KVM_BUG_ON() info to match the conditions under
which 2MB mappings are permitted.

Signed-off-by: Xiaoyao Li
Signed-off-by: Isaku Yamahata
Signed-off-by: Yan Zhao
---
v3:
- Introduce the module param enable_tdx_huge_page to toggle TDX huge page
  support.
- Disable TDX huge page if the TDX module does not support
  TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY. (Kai)
- Explain why 2M is not allowed before the TD is RUNNABLE in the patch
  log. (Kai)
- Add a comment to explain the relationship between returning PG_LEVEL_2M
  and the guest accept level. (Kai)
- Dropped some KVM_BUG_ON()s due to rebasing. Updated KVM_BUG_ON()s on
  mapping levels to take enable_tdx_huge_page into account.
RFC v2:
- Merged RFC v1's patch 4 (forcing PG_LEVEL_4K before TD runnable) with
  patch 9 (allowing PG_LEVEL_2M after TD runnable).
---
 arch/x86/kvm/vmx/tdx.c | 45 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0054a9de867c..8149e89b5549 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -54,6 +54,8 @@
 
 bool enable_tdx __ro_after_init;
 module_param_named(tdx, enable_tdx, bool, 0444);
+static bool __read_mostly enable_tdx_huge_page = true;
+module_param_named(tdx_huge_page, enable_tdx_huge_page, bool, 0444);
 
 #define TDX_SHARED_BIT_PWL_5	gpa_to_gfn(BIT_ULL(51))
 #define TDX_SHARED_BIT_PWL_4	gpa_to_gfn(BIT_ULL(47))
@@ -1773,8 +1775,12 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (KVM_BUG_ON(!vcpu, kvm))
 		return -EINVAL;
 
-	/* TODO: handle large pages. */
-	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+	/*
+	 * Large page is not supported before the TD is runnable or when TDX
+	 * huge page is not enabled.
+	 */
+	if (KVM_BUG_ON(((!enable_tdx_huge_page || kvm_tdx->state != TD_STATE_RUNNABLE) &&
+			level != PG_LEVEL_4K), kvm))
 		return -EIO;
 
 	WARN_ON_ONCE(!is_shadow_present_pte(mirror_spte) ||
@@ -1937,9 +1943,12 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	 */
 	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
 		return;
-
-	/* TODO: handle large pages. */
-	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+	/*
+	 * Large page is not supported before the TD is runnable or when TDX
+	 * huge page is not enabled.
+	 */
+	if (KVM_BUG_ON(((!enable_tdx_huge_page || kvm_tdx->state != TD_STATE_RUNNABLE) &&
+			level != PG_LEVEL_4K), kvm))
 		return;
 
 	err = tdh_do_no_vcpus(tdh_mem_range_block, kvm, &kvm_tdx->td, gpa,
@@ -3556,12 +3565,34 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	return ret;
 }
 
+/*
+ * For private pages:
+ *
+ * Force KVM to map at 4KB level when !enable_tdx_huge_page (e.g., due to an
+ * incompatible TDX module) or before the TD state is RUNNABLE.
+ *
+ * Always allow KVM to map at 2MB level in other cases, though KVM may still
+ * map the page at 4KB (i.e., passing in PG_LEVEL_4K to AUG) due to
+ * (1) the backend folio is 4KB,
+ * (2) disallow_lpage restrictions:
+ *     - mixed private/shared pages in the 2MB range
+ *     - level misalignment due to slot base_gfn, slot size, and ugfn
+ *     - guest_inhibit bit set due to guest's 4KB accept level
+ * (3) page merging is disallowed (e.g., when part of a 2MB range has been
+ *     mapped at 4KB level during TD build time).
+ */
 int tdx_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, bool is_private)
 {
 	if (!is_private)
 		return 0;
 
-	return PG_LEVEL_4K;
+	if (!enable_tdx_huge_page)
+		return PG_LEVEL_4K;
+
+	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
+		return PG_LEVEL_4K;
+
+	return PG_LEVEL_2M;
 }
 
 static int tdx_online_cpu(unsigned int cpu)
@@ -3747,6 +3778,8 @@ static int __init __tdx_bringup(void)
 	if (misc_cg_set_capacity(MISC_CG_RES_TDX, tdx_get_nr_guest_keyids()))
 		goto get_sysinfo_err;
 
+	if (enable_tdx_huge_page && !tdx_supports_demote_nointerrupt(tdx_sysinfo))
+		enable_tdx_huge_page = false;
 	/*
	 * Leave hardware virtualization enabled after TDX is enabled
	 * successfully. TDX CPU hotplug depends on this.
-- 
2.43.2
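A usage note on the new knob: since module_param_named() registers
tdx_huge_page with 0444 permissions, it is read-only in sysfs and can only
be chosen at module load (or boot) time, for example:

	modprobe kvm_intel tdx_huge_page=0

or kvm_intel.tdx_huge_page=0 on the kernel command line. Independently of
the administrator's choice, the __tdx_bringup() hunk above forces the knob
off when tdx_supports_demote_nointerrupt() reports that the TDX module
lacks uninterruptible-demote support, so tdx_gmem_max_mapping_level()
never returns PG_LEVEL_2M on such modules.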