From: Sean Christopherson
Reply-To: Sean Christopherson
Date: Wed, 28 Jan 2026 17:15:02 -0800
Message-ID: <20260129011517.3545883-31-seanjc@google.com>
In-Reply-To: <20260129011517.3545883-1-seanjc@google.com>
References: <20260129011517.3545883-1-seanjc@google.com>
Subject: [RFC PATCH v5 30/45] x86/virt/tdx: Add API to demote a 2MB mapping to 512 4KB mappings
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
    x86@kernel.org, Kiryl Shutsemau, Sean Christopherson, Paolo Bonzini
Cc: linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
    kvm@vger.kernel.org, Kai Huang, Rick Edgecombe, Yan Zhao,
    Vishal Annapurve, Ackerley Tng, Sagi Shahar, Binbin Wu, Xiaoyao Li,
    Isaku Yamahata

From: Xiaoyao Li

Introduce SEAMCALL wrapper tdh_mem_page_demote() to invoke
TDH_MEM_PAGE_DEMOTE, which splits a 2MB or a 1GB mapping in S-EPT into
512 4KB or 2MB mappings respectively.  TDH_MEM_PAGE_DEMOTE walks the
S-EPT to locate the huge entry/mapping to split, and replaces the huge
entry with a new S-EPT page table containing the equivalent 512 smaller
mappings.

Parameters "gpa" and "level" specify the huge mapping to split, and
parameter "new_sp" specifies the 4KB page to be added as the S-EPT
page.  Invoke tdx_clflush_page() before adding the new S-EPT page
conservatively to prevent dirty cache lines from writing back later and
corrupting TD memory.

tdh_mem_page_demote() may fail, e.g., due to S-EPT walk error.  Callers
must check the function's return value and can retrieve the extended
error info from the output parameters "ext_err1" and "ext_err2".

The TDX module has many internal locks.  To avoid staying in SEAM mode
for too long, SEAMCALLs return a BUSY error code to the kernel instead
of spinning on the locks.  Depending on the specific SEAMCALL, the
caller may need to handle this error in specific ways (e.g., retry).
Therefore, return the SEAMCALL error code directly to the caller without
attempting to handle it in the core kernel.

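For illustration only, the caller-side BUSY handling described above could be sketched as a bounded retry loop.  This is a userspace sketch under stated assumptions, not code from this series: fake_demote() is a hypothetical stand-in for the SEAMCALL wrapper, and SKETCH_SUCCESS, SKETCH_BUSY and the retry bound are made-up values rather than real TDX status codes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status values for illustration; not real TDX status codes. */
#define SKETCH_SUCCESS	0ULL
#define SKETCH_BUSY	1ULL

/*
 * Hypothetical stand-in for a SEAMCALL wrapper: reports BUSY until the
 * simulated lock contention (busy_left) has drained, then succeeds.
 */
struct fake_module {
	unsigned int busy_left;
	unsigned int calls;
};

static uint64_t fake_demote(struct fake_module *m)
{
	m->calls++;
	if (m->busy_left) {
		m->busy_left--;
		return SKETCH_BUSY;
	}
	return SKETCH_SUCCESS;
}

/*
 * Caller-side policy: retry on BUSY at most max_retries times, and hand
 * any other status (or the final BUSY) back to the caller unchanged.
 */
static uint64_t demote_with_retry(struct fake_module *m,
				  unsigned int max_retries)
{
	uint64_t ret;

	assert(m != NULL);

	ret = fake_demote(m);
	while (ret == SKETCH_BUSY && max_retries--)
		ret = fake_demote(m);

	return ret;
}
```

The retry policy lives in the caller, so the wrapper itself only has to return the raw status.  A real caller would pick its own bound and may need to drop locks or take other SEAMCALL-specific action between attempts.
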
Enable tdh_mem_page_demote() only on TDX modules that support feature
TDX_FEATURES0.ENHANCE_DEMOTE_INTERRUPTIBILITY, with which
TDH_MEM_PAGE_DEMOTE does not return error TDX_INTERRUPTED_RESTARTABLE on
basic TDX (i.e., without TD partition) [2].

This is because error TDX_INTERRUPTED_RESTARTABLE is difficult to
handle.  The TDX module provides no guaranteed maximum retry count to
ensure forward progress of the demotion.  Interrupt storms could then
result in a DoS if the host simply retries endlessly on
TDX_INTERRUPTED_RESTARTABLE.  Disabling interrupts before invoking the
SEAMCALL doesn't work either, because NMIs can also trigger
TDX_INTERRUPTED_RESTARTABLE.  Therefore, the tradeoff for basic TDX is
to disable the TDX_INTERRUPTED_RESTARTABLE error given the reasonable
execution time for demotion. [1]

Allocate (or dequeue from the cache) PAMT pages when Dynamic PAMT is
enabled, as TDH.MEM.PAGE.DEMOTE takes a DPAMT page pair in R12 and R13,
to store physical memory metadata for the 2MB guest private memory
(after a successful split).  Take care to use seamcall_saved_ret() to
handle registers above R11.

Free the Dynamic PAMT pages if SEAMCALL TDH_MEM_PAGE_DEMOTE fails, since
the guest private memory is then still mapped at the 2MB level.

Link: https://lore.kernel.org/kvm/99f5585d759328db973403be0713f68e492b492a.camel@intel.com [1]
Link: https://lore.kernel.org/all/fbf04b09f13bc2ce004ac97ee9c1f2c965f44fdf.camel@intel.com [2]
Signed-off-by: Xiaoyao Li
Co-developed-by: Kirill A. Shutemov
Signed-off-by: Kirill A. Shutemov
Co-developed-by: Isaku Yamahata
Signed-off-by: Isaku Yamahata
Co-developed-by: Yan Zhao
Signed-off-by: Yan Zhao
[sean: squash all demote support into a single patch]
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/tdx.h  |  9 +++++++
 arch/x86/virt/vmx/tdx/tdx.c | 54 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 3 files changed, 64 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 50feea01b066..483441de7fe0 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -15,6 +15,7 @@
 /* Bit definitions of TDX_FEATURES0 metadata field */
 #define TDX_FEATURES0_NO_RBP_MOD		BIT_ULL(18)
 #define TDX_FEATURES0_DYNAMIC_PAMT		BIT_ULL(36)
+#define TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY	BIT_ULL(51)
 
 #ifndef __ASSEMBLER__
 
@@ -140,6 +141,11 @@ static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
 	return sysinfo->features.tdx_features0 & TDX_FEATURES0_DYNAMIC_PAMT;
 }
 
+static inline bool tdx_supports_demote_nointerrupt(const struct tdx_sys_info *sysinfo)
+{
+	return sysinfo->features.tdx_features0 & TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY;
+}
+
 /* Simple structure for pre-allocating Dynamic PAMT pages outside of locks. */
 struct tdx_pamt_cache {
 	struct list_head page_list;
@@ -240,6 +246,9 @@ u64 tdh_mng_key_config(struct tdx_td *td);
 u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
 u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
 u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, enum pg_level level, u64 pfn,
+			struct page *new_sp, struct tdx_pamt_cache *pamt_cache,
+			u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_finalize(struct tdx_td *td);
 u64 tdh_vp_flush(struct tdx_vp *vp);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index cff325fdec79..823ec092b4e4 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1841,6 +1841,9 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
 }
 EXPORT_SYMBOL_FOR_KVM(tdh_mng_rd);
 
+static int alloc_pamt_array(u64 *pa_array, struct tdx_pamt_cache *cache);
+static void free_pamt_array(u64 *pa_array);
+
 /* Number PAMT pages to be provided to TDX module per 2M region of PA */
 static int tdx_dpamt_entry_pages(void)
 {
@@ -1885,6 +1888,57 @@ static void dpamt_copy_regs_array(struct tdx_module_args *args, void *reg,
  */
 #define MAX_NR_DPAMT_ARGS	(sizeof(struct tdx_module_args) / sizeof(u64))
 
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, enum pg_level level, u64 pfn,
+			struct page *new_sp, struct tdx_pamt_cache *pamt_cache,
+			u64 *ext_err1, u64 *ext_err2)
+{
+	bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) && level == PG_LEVEL_2M;
+	u64 pamt_pa_array[MAX_NR_DPAMT_ARGS];
+	struct tdx_module_args args = {
+		.rcx = gpa | pg_level_to_tdx_sept_level(level),
+		.rdx = tdx_tdr_pa(td),
+		.r8 = page_to_phys(new_sp),
+	};
+	u64 ret;
+
+	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
+		return TDX_SW_ERROR;
+
+	if (dpamt) {
+		if (alloc_pamt_array(pamt_pa_array, pamt_cache))
+			return TDX_SW_ERROR;
+
+		dpamt_copy_to_regs(&args, r12, pamt_pa_array);
+	}
+
+	/* Flush the new S-EPT page to be added */
+	tdx_clflush_page(new_sp);
+
+	ret = seamcall_saved_ret(TDH_MEM_PAGE_DEMOTE, &args);
+
+	*ext_err1 = args.rcx;
+	*ext_err2 = args.rdx;
+
+	if (dpamt) {
+		if (ret) {
+			free_pamt_array(pamt_pa_array);
+		} else {
+			/*
+			 * Set the PAMT refcount for the guest private memory,
+			 * i.e. for the hugepage that was just demoted to 512
+			 * smaller pages.
+			 */
+			atomic_t *pamt_refcount;
+
+			pamt_refcount = tdx_find_pamt_refcount(pfn);
+			WARN_ON_ONCE(atomic_cmpxchg_release(pamt_refcount, 0,
+							    PTRS_PER_PMD));
+		}
+	}
+	return ret;
+}
+EXPORT_SYMBOL_FOR_KVM(tdh_mem_page_demote);
+
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 096c78a1d438..a6c0fa53ece9 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -24,6 +24,7 @@
 #define TDH_MNG_KEY_CONFIG		8
 #define TDH_MNG_CREATE			9
 #define TDH_MNG_RD			11
+#define TDH_MEM_PAGE_DEMOTE		15
 #define TDH_MR_EXTEND			16
 #define TDH_MR_FINALIZE			17
 #define TDH_VP_FLUSH			18
-- 
2.53.0.rc1.217.geba53bf80e-goog
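
As a reading aid for the success path of tdh_mem_page_demote() above, the 0 -> 512 refcount transition can be modelled in userspace with C11 atomics.  This is only a sketch under stated assumptions: sketch_set_demoted_refcount() and SKETCH_PTRS_PER_PMD are hypothetical names, atomic_int stands in for the kernel's atomic_t, and the C11 compare-exchange only approximates atomic_cmpxchg_release().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed value: number of 4KB pages covered by one 2MB region on x86-64. */
#define SKETCH_PTRS_PER_PMD	512

/*
 * Move the refcount from 0 to 512 with release ordering on success.
 * Returns true if the swap happened.  Returns false and leaves the
 * refcount untouched if it was already nonzero, which is the case the
 * patch reports via WARN_ON_ONCE().
 */
static bool sketch_set_demoted_refcount(atomic_int *refcount)
{
	int expected = 0;

	assert(refcount != NULL);

	return atomic_compare_exchange_strong_explicit(refcount, &expected,
						       SKETCH_PTRS_PER_PMD,
						       memory_order_release,
						       memory_order_relaxed);
}
```

A compare-and-exchange, rather than a plain store, means an unexpectedly nonzero refcount is detected and not silently overwritten.
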