From nobody Sun Oct 5 23:37:50 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2B4C026563C; Mon, 28 Jul 2025 12:29:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.18 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1753705743; cv=none; b=igSiYLEw0XjDbh1hTIIwnh8bzXIm0rDpvDH7B0UsGV50xba+4PjFtH+8HJu/9L9dPSN4V/rBDdZHI84vNjh30Ptv9itzzpoYwXMZJ5FWBjYTFp1sSxn+ronb4CU+iZYfNCEvkI7y3mhl2F8QDnF3+v2Nj58B92EBFSDxPNYCuaU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1753705743; c=relaxed/simple; bh=zPp2F2i+Po0uyGj+tTdxebQGCP+/7nZdEvYd3tuPUbo=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=AL1QpmJvbdN7WhfazPqoF+1QjoWKYEF3BCYSeuHxXNVPzkvgm6R7hLQLvEcH+EOQFFHhEnpPtli2L+Mdf2f1lV6w7HIFkW2gIV4msMYz2AhyJ2VYF0/MhzMUxjO6r+isa7NugToYojM6Nt9P/wYoJC4y5ofr3JXYLJ8uv3uDn0E= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=hizi4mMN; arc=none smtp.client-ip=198.175.65.18 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="hizi4mMN" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1753705742; x=1785241742; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=zPp2F2i+Po0uyGj+tTdxebQGCP+/7nZdEvYd3tuPUbo=; 
	b=hizi4mMNOiXCvjoniJSfZMNiFIfFd+bzTMCJOqwj7Q+y8JSwxC87m+MP
	 Gmz9c0yZl+Ko3dflggOC7Vaou6vm4DGig3wPUIFa86b9JS7tJ3p6RHvYY
	 /HLMhVQYilqZMqphvV5uOWouGX1vdCw8HGaOTlmjSOta648+TG82TCVpW
	 Zmdgtan6AJ0tuUYMwii7f5hk8E2puSPFOxnWdy3TG96X/YYatPoIwX891
	 hu9G3I+4D69E181ucYkPcbsUxF2nmqlsImaBA5IrCLfXf1bfBurunppz2
	 8V2CWl1LSUMPdsEVCfx0Y4aUv4Nek7yIjSQnuC/22TlF+Kjdr1c9bOGOu
	 Q==;
X-CSE-ConnectionGUID: uBICTQ86RYm48OJEQWxtvQ==
X-CSE-MsgGUID: sLEYCaf6RGateEmCMd6YTA==
X-IronPort-AV: E=McAfee;i="6800,10657,11504"; a="56043300"
X-IronPort-AV: E=Sophos;i="6.16,339,1744095600"; d="scan'208";a="56043300"
Received: from orviesa002.jf.intel.com ([10.64.159.142]) by orvoesa110.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Jul 2025 05:29:01 -0700
X-CSE-ConnectionGUID: lWjsCdmtS8ikJPDDlRgE7g==
X-CSE-MsgGUID: sqzoC42IQQOmCgtMs/urVg==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.16,339,1744095600"; d="scan'208";a="193375589"
Received: from dnelso2-mobl.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.124.220.205]) by orviesa002-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Jul 2025 05:28:56 -0700
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com, hpa@zytor.com, thomas.lendacky@amd.com
Cc: x86@kernel.org, kas@kernel.org, rick.p.edgecombe@intel.com, dwmw@amazon.co.uk, linux-kernel@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org, reinette.chatre@intel.com, isaku.yamahata@intel.com, dan.j.williams@intel.com, ashish.kalra@amd.com, nik.borisov@suse.com, chao.gao@intel.com, sagis@google.com
Subject: [PATCH v5 1/7] x86/kexec: Consolidate relocate_kernel() function parameters
Date: Tue, 29 Jul 2025 00:28:35 +1200
Message-ID: <48b3b8dc2ece4095d29f21a439664c0302f3c979.1753679792.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.50.1
In-Reply-To: 
References: 
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: 
List-Subscribe: 
List-Unsubscribe: 
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

During kexec, the kernel jumps to the new kernel in relocate_kernel(), which is implemented in assembly; 32-bit and 64-bit each have their own version.

Currently, for both 32-bit and 64-bit, the last two parameters of relocate_kernel() are 'unsigned int' values, but each actually conveys only a boolean, i.e., one bit of information. A single 'unsigned int' has enough room for both bits, so there is no need to pass the two booleans in two separate parameters.

Consolidate the last two function parameters of relocate_kernel() into a single 'unsigned int' and pass flags instead.

Only convert the 64-bit version, although the same cleanup could be done for the 32-bit version too. Don't bother changing the 32-bit version while it is working (since an assembly code change would be required).

Signed-off-by: Kai Huang
Reviewed-by: Tom Lendacky
---

v4 -> v5:
 - RELOC_KERNEL_HOST_MEM_ACTIVE -> RELOC_KERNEL_HOST_MEM_ENC_ACTIVE (Tom)
 - Add a comment to explain that only RELOC_KERNEL_PRESERVE_CONTEXT is restored after jumping back from the peer kernel for preserve_context kexec (pointed out by Tom).
 - Use testb instead of testq when testing the flag against %r11 to save 3 bytes (hpa).
v4:
 - new patch

---
 arch/x86/include/asm/kexec.h         | 12 ++++++++++--
 arch/x86/kernel/machine_kexec_64.c   | 22 +++++++++++++---------
 arch/x86/kernel/relocate_kernel_64.S | 25 +++++++++++++++----------
 3 files changed, 38 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index f2ad77929d6e..12cebbcdb6c8 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -13,6 +13,15 @@
 # define KEXEC_DEBUG_EXC_HANDLER_SIZE	6 /* PUSHI, PUSHI, 2-byte JMP */
 #endif

+#ifdef CONFIG_X86_64
+
+#include
+
+#define RELOC_KERNEL_PRESERVE_CONTEXT		BIT(0)
+#define RELOC_KERNEL_HOST_MEM_ENC_ACTIVE	BIT(1)
+
+#endif
+
 # define KEXEC_CONTROL_PAGE_SIZE	4096
 # define KEXEC_CONTROL_CODE_MAX_SIZE	2048

@@ -121,8 +130,7 @@
 typedef unsigned long relocate_kernel_fn(unsigned long indirection_page,
					  unsigned long pa_control_page,
					  unsigned long start_address,
-					  unsigned int preserve_context,
-					  unsigned int host_mem_enc_active);
+					  unsigned int flags);
 #endif
 extern relocate_kernel_fn relocate_kernel;
 #define ARCH_HAS_KIMAGE_ARCH

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 697fb99406e6..5cda8d8d8b13 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -384,16 +384,10 @@ void __nocfi machine_kexec(struct kimage *image)
 {
	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
	relocate_kernel_fn *relocate_kernel_ptr;
-	unsigned int host_mem_enc_active;
+	unsigned int relocate_kernel_flags;
	int save_ftrace_enabled;
	void *control_page;

-	/*
-	 * This must be done before load_segments() since if call depth tracking
-	 * is used then GS must be valid to make any function calls.
-	 */
-	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
-
 #ifdef CONFIG_KEXEC_JUMP
	if (image->preserve_context)
		save_processor_state();
@@ -427,6 +421,17 @@ void __nocfi machine_kexec(struct kimage *image)
	 */
	relocate_kernel_ptr = control_page + (unsigned long)relocate_kernel - reloc_start;

+	relocate_kernel_flags = 0;
+	if (image->preserve_context)
+		relocate_kernel_flags |= RELOC_KERNEL_PRESERVE_CONTEXT;
+
+	/*
+	 * This must be done before load_segments() since if call depth tracking
+	 * is used then GS must be valid to make any function calls.
+	 */
+	if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
+		relocate_kernel_flags |= RELOC_KERNEL_HOST_MEM_ENC_ACTIVE;
+
	/*
	 * The segment registers are funny things, they have both a
	 * visible and an invisible part. Whenever the visible part is
@@ -443,8 +448,7 @@ void __nocfi machine_kexec(struct kimage *image)
	image->start = relocate_kernel_ptr((unsigned long)image->head,
					   virt_to_phys(control_page),
					   image->start,
-					   image->preserve_context,
-					   host_mem_enc_active);
+					   relocate_kernel_flags);

 #ifdef CONFIG_KEXEC_JUMP
	if (image->preserve_context)

diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index ea604f4d0b52..26e945f85d19 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -66,8 +66,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
	 * %rdi indirection_page
	 * %rsi pa_control_page
	 * %rdx start address
-	 * %rcx preserve_context
-	 * %r8  host_mem_enc_active
+	 * %rcx flags: RELOC_KERNEL_*
	 */

	/* Save the CPU context, used for jumping back */
@@ -111,7 +110,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
	/* save indirection list for jumping back */
	movq	%rdi, pa_backup_pages_map(%rip)

-	/* Save the preserve_context to %r11 as swap_pages clobbers %rcx. */
+	/* Save the flags to %r11 as swap_pages clobbers %rcx.
 */
	movq	%rcx, %r11

	/* setup a new stack at the end of the physical control page */
@@ -129,9 +128,8 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
	/*
	 * %rdi indirection page
	 * %rdx start address
-	 * %r8  host_mem_enc_active
	 * %r9  page table page
-	 * %r11 preserve_context
+	 * %r11 flags: RELOC_KERNEL_*
	 * %r13 original CR4 when relocate_kernel() was invoked
	 */

@@ -204,7 +202,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
	 * entries that will conflict with the now unencrypted memory
	 * used by kexec. Flush the caches before copying the kernel.
	 */
-	testq	%r8, %r8
+	testb	$RELOC_KERNEL_HOST_MEM_ENC_ACTIVE, %r11b
	jz	.Lsme_off
	wbinvd
.Lsme_off:
@@ -220,7 +218,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
	movq	%cr3, %rax
	movq	%rax, %cr3

-	testq	%r11, %r11	/* preserve_context */
+	testb	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11b
	jnz	.Lrelocate

	/*
@@ -273,7 +271,13 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
	ANNOTATE_NOENDBR
	andq	$PAGE_MASK, %r8
	lea	PAGE_SIZE(%r8), %rsp
-	movl	$1, %r11d	/* Ensure preserve_context flag is set */
+	/*
+	 * Ensure RELOC_KERNEL_PRESERVE_CONTEXT flag is set so that
+	 * swap_pages() can swap pages correctly.  Note all other
+	 * RELOC_KERNEL_* flags passed to relocate_kernel() are not
+	 * restored.
+	 */
+	movl	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11d
	call	swap_pages
	movq	kexec_va_control_page(%rip), %rax
0:	addq	$virtual_mapped - 0b, %rax
@@ -321,7 +325,7 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
	UNWIND_HINT_END_OF_STACK
	/*
	 * %rdi indirection page
-	 * %r11 preserve_context
+	 * %r11 flags: RELOC_KERNEL_*
	 */
	movq	%rdi, %rcx	/* Put the indirection_page in %rcx */
	xorl	%edi, %edi
@@ -357,7 +361,8 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
	movq	%rdi, %rdx	/* Save destination page to %rdx */
	movq	%rsi, %rax	/* Save source page to %rax */

-	testq	%r11, %r11	/* Only actually swap for ::preserve_context */
+	/* Only actually swap for ::preserve_context */
+	testb	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11b
	jz	.Lnoswap

	/* copy source page to swap page */
-- 
2.50.1

From nobody Sun Oct 5 23:37:50 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com, hpa@zytor.com,
	thomas.lendacky@amd.com
Cc: x86@kernel.org, kas@kernel.org, rick.p.edgecombe@intel.com, dwmw@amazon.co.uk, linux-kernel@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org, reinette.chatre@intel.com, isaku.yamahata@intel.com, dan.j.williams@intel.com, ashish.kalra@amd.com, nik.borisov@suse.com, chao.gao@intel.com, sagis@google.com
Subject: [PATCH v5 2/7] x86/sme: Use percpu boolean to control WBINVD during kexec
Date: Tue, 29 Jul 2025 00:28:36 +1200
Message-ID: 
In-Reply-To: 
References: 
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

TL;DR: Prepare to unify how TDX and SME do cache flushing during kexec by making a percpu boolean control whether to do the WBINVD.

-- Background --

On SME platforms, dirty cacheline aliases with and without the encryption bit can coexist, and the CPU can flush them back to memory in random order. During kexec, the caches must be flushed before jumping to the new kernel; otherwise the dirty cachelines could silently corrupt the memory used by the new kernel due to the different encryption properties.

TDX also needs a cache flush during kexec for the same reason. It would be good to have a generic way to flush the cache instead of scattering checks for each feature all around.

When SME is enabled, the kernel basically encrypts all memory, including the kernel itself, so a simple memory write from the kernel could dirty cachelines. Currently, the kernel uses WBINVD to flush the cache for SME during kexec in two places:

1) in stop_this_cpu(), for all remote CPUs when the kexec-ing CPU stops them;
2) in relocate_kernel(), where the kexec-ing CPU jumps to the new kernel.

-- Solution --

Unlike SME, TDX can only dirty cachelines when it is used (i.e., when SEAMCALLs are performed).
Since there are no more SEAMCALLs after the aforementioned WBINVDs, leverage this for TDX.

To unify the approach for SME and TDX, use a percpu boolean to indicate that the cache may be in an incoherent state and needs flushing during kexec, and set the boolean for SME. TDX can then leverage it.

While SME could use a global flag (since it is enabled at early boot and on all CPUs), the percpu flag fits TDX better: the percpu flag can be set when a CPU makes a SEAMCALL, and cleared when another WBINVD on that CPU obviates the need for a kexec-time WBINVD. Saving the kexec-time WBINVD is valuable, because there is an existing race[*] where kexec could proceed while another CPU is active. WBINVD could make this race worse, so it is worth skipping it when possible.

-- Side effect to SME --

Today the first WBINVD, in stop_this_cpu(), is performed when SME is *supported* by the platform, while the second WBINVD, in relocate_kernel(), is done only when SME is *activated* by the kernel. Make things simple by also doing the second WBINVD when the platform supports SME. This allows the kernel to simply turn on this percpu boolean when bringing up a CPU by checking whether the platform supports SME. No other functional change intended.

[*] The aforementioned race: During kexec, native_stop_other_cpus() is called to stop all remote CPUs before jumping to the new kernel. native_stop_other_cpus() first sends normal REBOOT-vector IPIs to stop the remote CPUs and waits for them to stop. If that times out, it sends NMIs to stop the CPUs that are still alive. The race happens when native_stop_other_cpus() has to send NMIs and could potentially result in a system hang (for more information please see [1]).
Link: https://lore.kernel.org/kvm/b963fcd60abe26c7ec5dc20b42f1a2ebbcc72397.1750934177.git.kai.huang@intel.com/ [1]
Signed-off-by: Kai Huang
Reviewed-by: Tom Lendacky
Tested-by: Tom Lendacky
---

v4 -> v5:
 - Code rebase due to changing RELOC_KERNEL_HOST_MEM_ACTIVE to RELOC_KERNEL_HOST_MEM_ENC_ACTIVE.

v3 -> v4:
 - Simplify the changelog using AI -- Boris
 - Call out "Test CPUID bit directly due to mem_encrypt=off" in the comment -- Boris
 - Add a comment to explain the percpu boolean -- Boris
 - s/wbinvd/WBINVD -- Boris
 - Code update due to patch 1 being added

---
 arch/x86/include/asm/kexec.h         |  4 ++--
 arch/x86/include/asm/processor.h     |  2 ++
 arch/x86/kernel/cpu/amd.c            | 17 +++++++++++++++++
 arch/x86/kernel/machine_kexec_64.c   | 14 ++++++++++----
 arch/x86/kernel/process.c            | 24 +++++++++++-------------
 arch/x86/kernel/relocate_kernel_64.S | 13 ++++++++++---
 6 files changed, 52 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 12cebbcdb6c8..5cfb27f26583 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -17,8 +17,8 @@

 #include

-#define RELOC_KERNEL_PRESERVE_CONTEXT		BIT(0)
-#define RELOC_KERNEL_HOST_MEM_ENC_ACTIVE	BIT(1)
+#define RELOC_KERNEL_PRESERVE_CONTEXT		BIT(0)
+#define RELOC_KERNEL_CACHE_INCOHERENT		BIT(1)

 #endif

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index bde58f6510ac..a24c7805acdb 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
 void microcode_check(struct cpuinfo_x86 *prev_info);
 void store_cpu_caps(struct cpuinfo_x86 *info);

+DECLARE_PER_CPU(bool, cache_state_incoherent);
+
 enum l1tf_mitigations {
	L1TF_MITIGATION_OFF,
	L1TF_MITIGATION_AUTO,

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index a5ece6ebe8a7..66a682be4a1a 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -545,6 +545,23 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
 {
	u64 msr;

+	/*
+	 * Mark using WBINVD is needed during kexec on processors that
+	 * support SME. This provides support for performing a successful
+	 * kexec when going from SME inactive to SME active (or vice-versa).
+	 *
+	 * The cache must be cleared so that if there are entries with the
+	 * same physical address, both with and without the encryption bit,
+	 * they don't race each other when flushed and potentially end up
+	 * with the wrong entry being committed to memory.
+	 *
+	 * Test the CPUID bit directly because with mem_encrypt=off the
+	 * BSP will clear the X86_FEATURE_SME bit and the APs will not
+	 * see it set after that.
+	 */
+	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+		__this_cpu_write(cache_state_incoherent, true);
+
	/*
	 * BIOS support is required for SME and SEV.
	 *   For SME: If BIOS has enabled SME then adjust x86_phys_bits by

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 5cda8d8d8b13..dfb91091f451 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include

 #ifdef CONFIG_ACPI
 /*
@@ -426,11 +427,11 @@ void __nocfi machine_kexec(struct kimage *image)
		relocate_kernel_flags |= RELOC_KERNEL_PRESERVE_CONTEXT;

	/*
-	 * This must be done before load_segments() since if call depth tracking
-	 * is used then GS must be valid to make any function calls.
+	 * This must be done before load_segments() since it resets
+	 * GS to 0 and percpu data needs the correct GS to work.
	 */
-	if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
-		relocate_kernel_flags |= RELOC_KERNEL_HOST_MEM_ENC_ACTIVE;
+	if (this_cpu_read(cache_state_incoherent))
+		relocate_kernel_flags |= RELOC_KERNEL_CACHE_INCOHERENT;

	/*
	 * The segment registers are funny things, they have both a
@@ -441,6 +442,11 @@ void __nocfi machine_kexec(struct kimage *image)
	 *
	 * Take advantage of this here by force loading the segments,
	 * before the GDT is zapped with an invalid value.
+	 *
+	 * load_segments() resets GS to 0. Don't make any function call
+	 * after here since call depth tracking uses percpu variables to
+	 * operate (relocate_kernel() is explicitly ignored by call depth
+	 * tracking).
	 */
	load_segments();

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1b7960cf6eb0..f2bbbeef5477 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -88,6 +88,16 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
 DEFINE_PER_CPU(bool, __tss_limit_invalid);
 EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);

+/*
+ * The cache may be in an incoherent state and needs flushing during kexec.
+ * E.g., on SME/TDX platforms, dirty cacheline aliases with and without
+ * encryption bit(s) can coexist and the cache needs to be flushed before
+ * booting to the new kernel to avoid the silent memory corruption due to
+ * dirty cachelines with different encryption property being written back
+ * to the memory.
+ */
+DEFINE_PER_CPU(bool, cache_state_incoherent);
+
 /*
  * this gets called so that we can store lazy state into memory and copy the
  * current task into the new thread.
@@ -827,19 +837,7 @@ void __noreturn stop_this_cpu(void *dummy)
	disable_local_APIC();
	mcheck_cpu_clear(c);

-	/*
-	 * Use wbinvd on processors that support SME. This provides support
-	 * for performing a successful kexec when going from SME inactive
-	 * to SME active (or vice-versa). The cache must be cleared so that
-	 * if there are entries with the same physical address, both with and
-	 * without the encryption bit, they don't race each other when flushed
-	 * and potentially end up with the wrong entry being committed to
-	 * memory.
-	 *
-	 * Test the CPUID bit directly because the machine might've cleared
-	 * X86_FEATURE_SME due to cmdline options.
-	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if (this_cpu_read(cache_state_incoherent))
		wbinvd();

	/*

diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 26e945f85d19..11e20bb13aca 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -198,14 +198,21 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
	movq	%r9, %cr3

	/*
+	 * If the memory cache is in incoherent state, e.g., due to
+	 * memory encryption, do WBINVD to flush cache.
+	 *
	 * If SME is active, there could be old encrypted cache line
	 * entries that will conflict with the now unencrypted memory
	 * used by kexec. Flush the caches before copying the kernel.
+	 *
+	 * Note SME sets this flag to true when the platform supports
+	 * SME, so the WBINVD is performed even SME is not activated
+	 * by the kernel. But this has no harm.
	 */
-	testb	$RELOC_KERNEL_HOST_MEM_ENC_ACTIVE, %r11b
-	jz	.Lsme_off
+	testb	$RELOC_KERNEL_CACHE_INCOHERENT, %r11b
+	jz	.Lnowbinvd
	wbinvd
-.Lsme_off:
+.Lnowbinvd:

	call	swap_pages

-- 
2.50.1

From nobody Sun Oct 5 23:37:50 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com, hpa@zytor.com, thomas.lendacky@amd.com
Cc: x86@kernel.org, kas@kernel.org, rick.p.edgecombe@intel.com, dwmw@amazon.co.uk, linux-kernel@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org, reinette.chatre@intel.com, isaku.yamahata@intel.com, dan.j.williams@intel.com, ashish.kalra@amd.com, nik.borisov@suse.com, chao.gao@intel.com, sagis@google.com, Farrah Chen
Subject: [PATCH v5 3/7] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
Date: Tue, 29 Jul 2025 00:28:37 +1200
Message-ID: <03d3eecaca3f7680aacc55549bb2bacdd85a048f.1753679792.git.kai.huang@intel.com>
In-Reply-To: 
References: 
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

On TDX platforms, dirty cacheline aliases with and without encryption bits can coexist, and the CPU can flush them back to memory in random order. During kexec, the caches must be flushed before jumping to the new kernel; otherwise the dirty cachelines could silently corrupt the memory used by the new kernel due to the different encryption properties.

A percpu boolean is used to mark whether the cache of a given CPU may be in an incoherent state, and kexec performs WBINVD on the CPUs with that boolean turned on.

For TDX, only the TDX module or TDX guests can generate dirty cachelines of TDX private memory, i.e., they are only generated when the kernel does a SEAMCALL. Set that boolean when the kernel does a SEAMCALL so that kexec can flush the cache correctly.

The kernel provides both the __seamcall*() assembly functions and the seamcall*() wrappers, which additionally handle the "running out of entropy" error in a loop. Most SEAMCALLs are made using seamcall*(), except TDH.VP.ENTER and TDH.PHYMEM.PAGE.RDMD, which use the __seamcall*() variants directly.

To cover the two special cases, add a new helper do_seamcall() which only sets the percpu boolean and then calls __seamcall*(), and change the special cases to use do_seamcall(). To cover all other SEAMCALLs, change seamcall*() to call do_seamcall().

The SEAMCALLs invoked via seamcall*() can be made from both task context and IRQ-disabled context. Given that a SEAMCALL is just a lengthy instruction (e.g., thousands of cycles) from the kernel's point of view, and that preempt_{disable|enable}() is cheap compared to it, just unconditionally disable preemption while setting the boolean and making the SEAMCALL.
Signed-off-by: Kai Huang
Tested-by: Farrah Chen
Reviewed-by: Chao Gao
Reviewed-by: Rick Edgecombe with the name change.
---

v4 -> v5:
 - Remove unneeded 'ret' local variable in do_seamcall() - Chao.

v3 -> v4:
 - Set the boolean for TDH.VP.ENTER and TDH.PHYMEM.PAGE.RDMD. - Rick
 - Update the first paragraph to make it shorter -- Rick
 - Update changelog to mention the two special cases.

---
 arch/x86/include/asm/tdx.h  | 25 ++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.c |  4 ++--
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 7ddef3a69866..488274959cd5 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -102,10 +102,31 @@ u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
 u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
 void tdx_init(void);

+#include
 #include
+#include

 typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);

+static __always_inline u64 do_seamcall(sc_func_t func, u64 fn,
+				       struct tdx_module_args *args)
+{
+	lockdep_assert_preemption_disabled();
+
+	/*
+	 * SEAMCALLs are made to the TDX module and can generate dirty
+	 * cachelines of TDX private memory.  Mark cache state incoherent
+	 * so that the cache can be flushed during kexec.
+	 *
+	 * This needs to be done before actually making the SEAMCALL,
+	 * because kexec-ing CPU could send NMI to stop remote CPUs,
+	 * in which case even disabling IRQ won't help here.
+	 */
+	this_cpu_write(cache_state_incoherent, true);
+
+	return func(fn, args);
+}
+
 static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
 				    struct tdx_module_args *args)
 {
@@ -113,7 +134,9 @@ static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
 	u64 ret;
 
 	do {
-		ret = func(fn, args);
+		preempt_disable();
+		ret = do_seamcall(func, fn, args);
+		preempt_enable();
 	} while (ret == TDX_RND_NO_ENTROPY && --retry);
 
 	return ret;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index c7a9a087ccaf..d6ee4e5a75d2 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1266,7 +1266,7 @@ static bool paddr_is_tdx_private(unsigned long phys)
 		return false;
 
 	/* Get page type from the TDX module */
-	sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
+	sret = do_seamcall(__seamcall_ret, TDH_PHYMEM_PAGE_RDMD, &args);
 
 	/*
 	 * The SEAMCALL will not return success unless there is a
@@ -1522,7 +1522,7 @@ noinstr __flatten u64 tdh_vp_enter(struct tdx_vp *td, struct tdx_module_args *ar
 {
 	args->rcx = tdx_tdvpr_pa(td);
 
-	return __seamcall_saved_ret(TDH_VP_ENTER, args);
+	return do_seamcall(__seamcall_saved_ret, TDH_VP_ENTER, args);
 }
 EXPORT_SYMBOL_GPL(tdh_vp_enter);
 
-- 
2.50.1

From nobody Sun Oct 5 23:37:50 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de,
	peterz@infradead.org, mingo@redhat.com, hpa@zytor.com,
	thomas.lendacky@amd.com
Cc: x86@kernel.org, kas@kernel.org, rick.p.edgecombe@intel.com,
	dwmw@amazon.co.uk, linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	seanjc@google.com, kvm@vger.kernel.org, reinette.chatre@intel.com,
	isaku.yamahata@intel.com, dan.j.williams@intel.com,
	ashish.kalra@amd.com, nik.borisov@suse.com, chao.gao@intel.com,
	sagis@google.com, Farrah Chen
Subject: [PATCH v5 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
Date: Tue, 29 Jul 2025 00:28:38 +1200
Message-ID: <1a8d6eeb9e7ac6bb4959722160fa8032bb3dfa26.1753679792.git.kai.huang@intel.com>

Some early TDX-capable platforms have an erratum: a kernel partial write
(a write transaction of less than a cacheline that lands at the memory
controller) to TDX private memory poisons that memory, and a subsequent
read triggers a machine check.

On those platforms, the old kernel must reset TDX private memory before
jumping to the new kernel, otherwise the new kernel may see an unexpected
machine check.  Currently the kernel doesn't track which pages are TDX
private pages.  For simplicity just fail kexec/kdump on those platforms.
Leverage the existing machine_kexec_prepare() to fail kexec/kdump by
adding a check for the presence of the TDX erratum (which is only checked
for if the kernel is built with TDX host support).  This rejects
kexec/kdump when the kernel is loading the kexec/kdump kernel image.

The alternative is to reject kexec/kdump when the kernel is jumping to
the new kernel.  But for kexec this requires adding a new check (e.g.,
arch_kexec_allowed()) in the common code to fail kernel_kexec() at an
early stage.  Kdump (crash_kexec()) needs a similar check, but it's hard
to justify because crash_kexec() is not supposed to abort.

It's feasible to further relax this limitation, i.e., only fail kexec
when TDX is actually enabled by the kernel.  But this is still a half
measure compared to resetting TDX private memory, so just do the simplest
thing for now.

The impact to userspace is that users will get an error when loading the
kexec/kdump kernel image:

  kexec_load failed: Operation not supported

This might be confusing to users, thus also print the reason in dmesg:

  [..] kexec: not allowed on platform with tdx_pw_mce bug.

Signed-off-by: Kai Huang
Tested-by: Farrah Chen
Reviewed-by: Rick Edgecombe
---
 arch/x86/kernel/machine_kexec_64.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index dfb91091f451..15088d14904f 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -347,6 +347,22 @@ int machine_kexec_prepare(struct kimage *image)
 	unsigned long reloc_end = (unsigned long)__relocate_kernel_end;
 	int result;
 
+	/*
+	 * Some early TDX-capable platforms have an erratum.  A kernel
+	 * partial write (a write transaction of less than cacheline
+	 * lands at memory controller) to TDX private memory poisons that
+	 * memory, and a subsequent read triggers a machine check.
+	 *
+	 * On those platforms the old kernel must reset TDX private
+	 * memory before jumping to the new kernel otherwise the new
+	 * kernel may see unexpected machine check.  For simplicity
+	 * just fail kexec/kdump on those platforms.
+	 */
+	if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) {
+		pr_info_once("Not allowed on platform with tdx_pw_mce bug\n");
+		return -EOPNOTSUPP;
+	}
+
 	/* Setup the identity mapped 64bit page table */
 	result = init_pgtable(image, __pa(control_page));
 	if (result)
-- 
2.50.1

From nobody Sun Oct 5 23:37:50 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de,
	peterz@infradead.org, mingo@redhat.com, hpa@zytor.com,
	thomas.lendacky@amd.com
Cc: x86@kernel.org, kas@kernel.org, rick.p.edgecombe@intel.com,
	dwmw@amazon.co.uk, linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	seanjc@google.com, kvm@vger.kernel.org, reinette.chatre@intel.com,
	isaku.yamahata@intel.com, dan.j.williams@intel.com,
	ashish.kalra@amd.com,
	nik.borisov@suse.com, chao.gao@intel.com, sagis@google.com,
	Farrah Chen
Subject: [PATCH v5 5/7] x86/virt/tdx: Remove the !KEXEC_CORE dependency
Date: Tue, 29 Jul 2025 00:28:39 +1200

During kexec it is now guaranteed that all dirty cachelines of TDX
private memory are flushed before jumping to the new kernel.  The TDX
private memory from the old kernel will remain TDX private memory in the
new kernel, but that is OK because kernel reads/writes to TDX private
memory never cause a machine check, except on the platforms with the TDX
partial write erratum, which has already been handled.

It is safe to allow kexec to work together with TDX now.  Remove the
!KEXEC_CORE dependency.

Signed-off-by: Kai Huang
Tested-by: Farrah Chen
Reviewed-by: Rick Edgecombe
---
 arch/x86/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e924819ae133..41dfe282cf7a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1900,7 +1900,6 @@ config INTEL_TDX_HOST
 	depends on X86_X2APIC
 	select ARCH_KEEP_MEMBLOCK
 	depends on CONTIG_ALLOC
-	depends on !KEXEC_CORE
 	depends on X86_MCE
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
-- 
2.50.1

From nobody Sun Oct 5 23:37:50 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de,
	peterz@infradead.org, mingo@redhat.com, hpa@zytor.com,
	thomas.lendacky@amd.com
Cc: x86@kernel.org, kas@kernel.org, rick.p.edgecombe@intel.com,
	dwmw@amazon.co.uk, linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	seanjc@google.com, kvm@vger.kernel.org, reinette.chatre@intel.com,
	isaku.yamahata@intel.com, dan.j.williams@intel.com,
	ashish.kalra@amd.com, nik.borisov@suse.com, chao.gao@intel.com,
	sagis@google.com, Farrah Chen
Subject: [PATCH v5 6/7] x86/virt/tdx: Update the kexec section in the TDX documentation
Date: Tue, 29 Jul 2025 00:28:40 +1200
Message-ID: <3389378e2e239e4067294b1ae05b0dde65cb5bba.1753679792.git.kai.huang@intel.com>

The TDX host kernel now supports kexec/kdump.  Update the documentation
to reflect that.

Opportunistically, remove the parentheses in "Kexec()" and move this
section under the "Erratum" section, because the updated "Kexec" section
now refers to that erratum.
Signed-off-by: Kai Huang
Tested-by: Farrah Chen
Reviewed-by: Rick Edgecombe
---
 Documentation/arch/x86/tdx.rst | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
index 719043cd8b46..61670e7df2f7 100644
--- a/Documentation/arch/x86/tdx.rst
+++ b/Documentation/arch/x86/tdx.rst
@@ -142,13 +142,6 @@ but depends on the BIOS to behave correctly.
 Note TDX works with CPU logical online/offline, thus the kernel still
 allows to offline logical CPU and online it again.
 
-Kexec()
-~~~~~~~
-
-TDX host support currently lacks the ability to handle kexec.  For
-simplicity only one of them can be enabled in the Kconfig.  This will be
-fixed in the future.
-
 Erratum
 ~~~~~~~
 
@@ -171,6 +164,13 @@ If the platform has such erratum, the kernel prints additional message in
 machine check handler to tell user the machine check may be caused by
 kernel bug on TDX private memory.
 
+Kexec
+~~~~~~~
+
+Currently kexec doesn't work on the TDX platforms with the aforementioned
+erratum.  It fails when loading the kexec kernel image.  Otherwise it
+works normally.
+
 Interaction vs S3 and deeper states
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-- 
2.50.1

From nobody Sun Oct 5 23:37:50 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de,
	peterz@infradead.org, mingo@redhat.com, hpa@zytor.com,
	thomas.lendacky@amd.com
Cc: x86@kernel.org, kas@kernel.org, rick.p.edgecombe@intel.com,
	dwmw@amazon.co.uk, linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	seanjc@google.com, kvm@vger.kernel.org, reinette.chatre@intel.com,
	isaku.yamahata@intel.com, dan.j.williams@intel.com,
	ashish.kalra@amd.com, nik.borisov@suse.com, chao.gao@intel.com,
	sagis@google.com, Farrah Chen, Binbin Wu
Subject: [PATCH v5 7/7] KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs
Date: Tue, 29 Jul 2025 00:28:41 +1200
On TDX platforms, during kexec, the kernel needs to make sure there are
no dirty cachelines of TDX private memory before booting to the new
kernel, to avoid silent memory corruption in the new kernel.

During kexec, the kexec-ing CPU first invokes native_stop_other_cpus()
to stop all remote CPUs before booting to the new kernel.  The remote
CPUs then execute stop_this_cpu() to stop themselves.

The kernel has a percpu boolean to indicate whether the cache of a CPU
may be in an incoherent state.  In stop_this_cpu(), the kernel does
WBINVD if that percpu boolean is true.  TDX turns on that percpu boolean
on a CPU when the kernel does a SEAMCALL.  This makes sure the caches
are flushed during kexec.

However, native_stop_other_cpus() and stop_this_cpu() have a "race"
which is extremely rare but could cause the system to hang.
Specifically, native_stop_other_cpus() first sends a normal reboot IPI
to the remote CPUs and waits one second for them to stop.  If that
times out, native_stop_other_cpus() then sends NMIs to the remote CPUs
to stop them.

The aforementioned race happens when NMIs are sent.  Doing WBINVD in
stop_this_cpu() makes each CPU take longer to stop and increases the
chance of the race happening.

Explicitly flush the cache in tdx_disable_virtualization_cpu(), after
which no more TDX activity can happen on this CPU.  This moves the
WBINVD to an earlier stage than stop_this_cpu(), avoiding a possibly
lengthy operation at a time where it could cause this race.

Signed-off-by: Kai Huang
Acked-by: Paolo Bonzini
Tested-by: Farrah Chen
Reviewed-by: Binbin Wu
Reviewed-by: Chao Gao
---
v4 -> v5:
 - No change

v3 -> v4:
 - Change doing wbinvd() from rebooting notifier to
   tdx_disable_virtualization_cpu() to cover the case where more
   SEAMCALLs can be made after cache flush, i.e., doing kexec when
   there's a TD alive. -- Chao.
 - Add check to skip wbinvd if the boolean is false.
   -- Chao
 - Fix typo in the comment -- Binbin.
---
 arch/x86/include/asm/tdx.h  |  2 ++
 arch/x86/kvm/vmx/tdx.c      | 12 ++++++++++++
 arch/x86/virt/vmx/tdx/tdx.c | 12 ++++++++++++
 3 files changed, 26 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 488274959cd5..b7c978281934 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -217,6 +217,7 @@ u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u6
 u64 tdh_phymem_cache_wb(bool resume);
 u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
 u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
+void tdx_cpu_flush_cache(void);
 #else
 static inline void tdx_init(void) { }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
@@ -224,6 +225,7 @@ static inline int tdx_enable(void)  { return -ENODEV; }
 static inline u32 tdx_get_nr_guest_keyids(void) { return 0; }
 static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
 static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
+static inline void tdx_cpu_flush_cache(void) { }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLER__ */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ec79aacc446f..93477233baae 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -442,6 +442,18 @@ void tdx_disable_virtualization_cpu(void)
 		tdx_flush_vp(&arg);
 	}
 	local_irq_restore(flags);
+
+	/*
+	 * No more TDX activity on this CPU from here.  Flush cache to
+	 * avoid having to do WBINVD in stop_this_cpu() during kexec.
+	 *
+	 * Kexec calls native_stop_other_cpus() to stop remote CPUs
+	 * before booting to new kernel, but that code has a "race"
+	 * when the normal REBOOT IPI times out and NMIs are sent to
+	 * remote CPUs to stop them.  Doing WBINVD in stop_this_cpu()
+	 * could potentially increase the possibility of the "race".
+	 */
+	tdx_cpu_flush_cache();
 }
 
 #define TDX_SEAMCALL_RETRIES	10000
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index d6ee4e5a75d2..c098a6e0382b 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1870,3 +1870,15 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
 	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
 }
 EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
+
+void tdx_cpu_flush_cache(void)
+{
+	lockdep_assert_preemption_disabled();
+
+	if (!this_cpu_read(cache_state_incoherent))
+		return;
+
+	wbinvd();
+	this_cpu_write(cache_state_incoherent, false);
+}
+EXPORT_SYMBOL_GPL(tdx_cpu_flush_cache);
-- 
2.50.1