From nobody Tue Dec 16 11:03:19 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de,
 peterz@infradead.org, mingo@redhat.com, kirill.shutemov@linux.intel.com
Cc: hpa@zytor.com, x86@kernel.org, linux-kernel@vger.kernel.org,
 pbonzini@redhat.com, seanjc@google.com, rick.p.edgecombe@intel.com,
 reinette.chatre@intel.com, isaku.yamahata@intel.com,
 dan.j.williams@intel.com, thomas.lendacky@amd.com, ashish.kalra@amd.com,
 dwmw@amazon.co.uk, bhe@redhat.com, nik.borisov@suse.com, sagis@google.com,
 Dave Young
Subject: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
Date: Thu, 13 Mar 2025 00:34:13 +1300

TL;DR:

Do unconditional WBINVD in stop_this_cpu() on bare metal to cover kexec
support for both AMD SME and Intel TDX. Previously there _was_ an issue
preventing this, but it has since been fixed.

Long version:

AMD SME uses the C-bit to determine whether to encrypt the memory or
not. For the same physical memory address, dirty cachelines with and
without the C-bit can coexist and the CPU can flush them back to memory
in random order. To support kexec for SME, the old kernel uses WBINVD to
flush the cache before booting to the new kernel so that no stale dirty
cachelines are left over by the old kernel which could otherwise corrupt
the new kernel's memory.

TDX uses 'KeyID' bits in the physical address for memory encryption and
has the same requirement. To support kexec for TDX, the old kernel needs
to flush the cache of TDX private memory.

Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
is supported by hardware. Perform unconditional WBINVD to support TDX
instead of adding one more vendor-specific check. Kexec is a slow path,
and the additional WBINVD is acceptable for the sake of simplicity and
maintainability.

Only do WBINVD on bare-metal. Doing WBINVD in a guest triggers an
unexpected exception (#VE or #VC) for TDX and SEV-ES/SEV-SNP guests and
the guest may not be able to handle such an exception (e.g., a TDX guest
panics if it sees such a #VE).

History of SME and kexec WBINVD:

There _was_ an issue preventing unconditional WBINVD but that has been
fixed.

Initial SME kexec support added an unconditional WBINVD in commit

  bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")

This commit caused different Intel systems to hang or reset.

Without a clear root cause, a later commit

  f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")

fixed the Intel system hang issues by only doing WBINVD when hardware
supports SME.

A corner case [*] revealed the root cause of the system hang issues,
which was fixed by commit

  1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")

See [1][2] for more information.

Further testing of unconditional WBINVD on top of the above fix, on the
problematic machines (on which the issues were originally reported),
confirmed that the issues could no longer be reproduced.

See [3][4] for more information.

Therefore, it is safe to do unconditional WBINVD for bare-metal now.

[*] The rework commit (f23d74f6c66c) didn't check whether the CPUID leaf
    is available or not. Querying an unsupported CPUID leaf on Intel
    returns garbage, resulting in an unintended WBINVD which caused some
    issues (followed by the analysis and the discovery of the final
    root cause). The corner case was independently fixed by commit

      9b040453d444: ("x86/smp: Dont access non-existing CPUID leaf")

Link: https://lore.kernel.org/lkml/28a494ca-3173-4072-921c-6c5f5b257e79@amd.com/ [1]
Link: https://lore.kernel.org/lkml/24844584-8031-4b58-ba5c-f85ef2f4c718@amd.com/ [2]
Link: https://lore.kernel.org/lkml/20240221092856.GAZdXCWGJL7c9KLewv@fat_crate.local/ [3]
Link: https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/ [4]
Signed-off-by: Kai Huang
Suggested-by: Borislav Petkov
Cc: Tom Lendacky
Cc: Dave Young
Reviewed-by: Tom Lendacky
---
 arch/x86/kernel/process.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 9c75d701011f..8475d9d2d8c4 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -819,18 +819,19 @@ void __noreturn stop_this_cpu(void *dummy)
 	mcheck_cpu_clear(c);
 
 	/*
-	 * Use wbinvd on processors that support SME. This provides support
-	 * for performing a successful kexec when going from SME inactive
-	 * to SME active (or vice-versa). The cache must be cleared so that
-	 * if there are entries with the same physical address, both with and
-	 * without the encryption bit, they don't race each other when flushed
-	 * and potentially end up with the wrong entry being committed to
-	 * memory.
+	 * Use wbinvd to support kexec for both SME (from inactive to active
+	 * or vice-versa) and TDX. The cache must be cleared so that if there
+	 * are entries with the same physical address, both with and without
+	 * the encryption bit(s), they don't race each other when flushed and
+	 * potentially end up with the wrong entry being committed to memory.
 	 *
-	 * Test the CPUID bit directly because the machine might've cleared
-	 * X86_FEATURE_SME due to cmdline options.
+	 * stop_this_cpu() isn't a fast path, just do unconditional WBINVD for
+	 * bare-metal to cover both SME and TDX. Do not do WBINVD in a guest
+	 * since performing one will result in an exception (#VE or #VC) for
+	 * TDX or SEV-ES/SEV-SNP guests which the guest may not be able to
+	 * handle (e.g., TDX guest panics if it sees #VE).
 	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		wbinvd();
 
 	/*
-- 
2.48.1

From nobody Tue Dec 16 11:03:19 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de,
 peterz@infradead.org, mingo@redhat.com, kirill.shutemov@linux.intel.com
Cc: hpa@zytor.com, x86@kernel.org, linux-kernel@vger.kernel.org,
 pbonzini@redhat.com, seanjc@google.com, rick.p.edgecombe@intel.com,
 reinette.chatre@intel.com, isaku.yamahata@intel.com,
 dan.j.williams@intel.com, thomas.lendacky@amd.com, ashish.kalra@amd.com,
 dwmw@amazon.co.uk, bhe@redhat.com, nik.borisov@suse.com, sagis@google.com,
 Dave Young, David Kaplan
Subject: [RFC PATCH 2/5] x86/kexec: Do unconditional WBINVD for bare-metal in relocate_kernel()
Date: Thu, 13 Mar 2025 00:34:14 +1300

For both SME and TDX, dirty cachelines with and without the encryption
bit(s) of the same physical memory address can coexist and the CPU can
flush them back to memory in random order. During kexec, the caches
must be flushed before jumping to the new kernel to avoid silent memory
corruption of the new kernel.

The WBINVD in stop_this_cpu() flushes caches for all remote CPUs when
they are being stopped. For SME, the WBINVD in relocate_kernel()
flushes the cache for the last running CPU (which is doing kexec).

Similarly, to support kexec for a TDX host, after stopping all remote
CPUs with their caches flushed, the kernel needs to flush the cache for
the last running CPU.

Use the existing WBINVD in relocate_kernel() to cover the TDX host as
well.

Just do unconditional WBINVD to cover both SME and TDX instead of
sprinkling additional vendor-specific checks. Kexec is a slow path, and
the additional WBINVD is acceptable for the sake of simplicity and
maintainability.

But only do WBINVD for bare-metal because TDX guests and SEV-ES/SEV-SNP
guests will get an unexpected (and yet unnecessary) exception (#VE or
#VC) which the kernel is unable to handle at the time of
relocate_kernel() since the kernel has torn down the IDT.

Remove the host_mem_enc_active local variable and directly use
!cpu_feature_enabled(X86_FEATURE_HYPERVISOR) as an argument when calling
relocate_kernel(). cpu_feature_enabled() is always inlined rather than
being a function call, so it is safe to use after load_segments() when
call depth tracking is enabled.

Signed-off-by: Kai Huang
Reviewed-by: Kirill A. Shutemov
Cc: Tom Lendacky
Cc: Dave Young
Cc: David Kaplan
Reviewed-by: Tom Lendacky
Tested-by: David Kaplan
---
 arch/x86/include/asm/kexec.h         |  2 +-
 arch/x86/kernel/machine_kexec_64.c   | 14 ++++++--------
 arch/x86/kernel/relocate_kernel_64.S | 15 ++++++++++-----
 3 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 8ad187462b68..48c313575262 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -123,7 +123,7 @@ relocate_kernel_fn(unsigned long indirection_page,
 		   unsigned long pa_control_page,
 		   unsigned long start_address,
 		   unsigned int preserve_context,
-		   unsigned int host_mem_enc_active);
+		   unsigned int bare_metal);
 #endif
 extern relocate_kernel_fn relocate_kernel;
 #define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index a68f5a0a9f37..0e9808eeb63e 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -346,16 +346,9 @@ void __nocfi machine_kexec(struct kimage *image)
 {
 	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
 	relocate_kernel_fn *relocate_kernel_ptr;
-	unsigned int host_mem_enc_active;
 	int save_ftrace_enabled;
 	void *control_page;
 
-	/*
-	 * This must be done before load_segments() since if call depth tracking
-	 * is used then GS must be valid to make any function calls.
-	 */
-	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
-
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
 		save_processor_state();
@@ -398,6 +391,11 @@ void __nocfi machine_kexec(struct kimage *image)
 	 *
 	 * I take advantage of this here by force loading the
 	 * segments, before I zap the gdt with an invalid value.
+	 *
+	 * load_segments() resets GS to 0. Don't make any function call
+	 * after here since call depth tracking uses per-CPU variables to
+	 * operate (relocate_kernel() is explicitly ignored by call depth
+	 * tracking).
 	 */
 	load_segments();
 	/*
@@ -412,7 +410,7 @@ void __nocfi machine_kexec(struct kimage *image)
 				   virt_to_phys(control_page),
 				   image->start,
 				   image->preserve_context,
-				   host_mem_enc_active);
+				   !cpu_feature_enabled(X86_FEATURE_HYPERVISOR));
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index b44d8863e57f..dc1a59cd8139 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -50,7 +50,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
 	 * %rsi pa_control_page
 	 * %rdx start address
 	 * %rcx preserve_context
-	 * %r8 host_mem_enc_active
+	 * %r8 bare_metal
 	 */
 
 	/* Save the CPU context, used for jumping back */
@@ -107,7 +107,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	/*
 	 * %rdi indirection page
 	 * %rdx start address
-	 * %r8 host_mem_enc_active
+	 * %r8 bare_metal
 	 * %r9 page table page
 	 * %r11 preserve_context
 	 * %r13 original CR4 when relocate_kernel() was invoked
@@ -156,14 +156,19 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	movq	%r9, %cr3
 
 	/*
-	 * If SME is active, there could be old encrypted cache line
+	 * If SME/TDX is active, there could be old encrypted cache line
 	 * entries that will conflict with the now unencrypted memory
 	 * used by kexec. Flush the caches before copying the kernel.
+	 *
+	 * Do WBINVD for bare-metal only to cover both SME and TDX. Doing
+	 * WBINVD in guest results in an unexpected exception (#VE or #VC)
+	 * for TDX and SEV-ES/SNP guests which then crashes the guest (the
+	 * kernel has torn down the IDT).
 	 */
 	testq	%r8, %r8
-	jz .Lsme_off
+	jz .Lno_wbinvd
 	wbinvd
-.Lsme_off:
+.Lno_wbinvd:
 
 	call	swap_pages
 
-- 
2.48.1

From nobody Tue Dec 16 11:03:19 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de,
 peterz@infradead.org, mingo@redhat.com, kirill.shutemov@linux.intel.com
Cc: hpa@zytor.com, x86@kernel.org, linux-kernel@vger.kernel.org,
 pbonzini@redhat.com, seanjc@google.com, rick.p.edgecombe@intel.com,
 reinette.chatre@intel.com, isaku.yamahata@intel.com,
 dan.j.williams@intel.com, thomas.lendacky@amd.com, ashish.kalra@amd.com,
 dwmw@amazon.co.uk, bhe@redhat.com, nik.borisov@suse.com, sagis@google.com
Subject: [RFC PATCH 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
Date: Thu, 13 Mar 2025 00:34:15 +1300
Message-ID: <408103f145360dfa04a41bc836ca2c724f29deb0.1741778537.git.kai.huang@intel.com>

Some early TDX-capable platforms have an erratum: A kernel partial
write (a write transaction of less than a cacheline lands at the memory
controller) to TDX private memory poisons that memory, and a subsequent
read triggers a machine check.

On those platforms, the old kernel must reset TDX private memory before
jumping to the new kernel, otherwise the new kernel may see an
unexpected machine check. Currently the kernel doesn't track which page
is a TDX private page. For simplicity just fail kexec/kdump for those
platforms.

Leverage the existing machine_kexec_prepare() to fail kexec/kdump by
adding a check for the presence of the TDX erratum (which is only
checked for if the kernel is built with TDX host support). This rejects
kexec/kdump when the kernel is loading the kexec/kdump kernel image.

The alternative is to reject kexec/kdump when the kernel is jumping to
the new kernel. But for kexec this requires adding a new check (e.g.,
arch_kexec_allowed()) in the common code to fail kernel_kexec() at an
early stage. Kdump (crash_kexec()) needs a similar check, but it's hard
to justify because crash_kexec() is not supposed to abort.

It's feasible to further relax this limitation, i.e., only fail kexec
when TDX is actually enabled by the kernel. But this is still a half
measure compared to resetting TDX private memory so just do the simplest
thing for now.

The impact on userspace is that users will get an error when loading
the kexec/kdump kernel image:

  kexec_load failed: Operation not supported

This might be confusing to users, thus also print the reason in the
dmesg:

  [..] kexec: Not allowed on platform with tdx_pw_mce bug
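
From a loader's point of view the rejection above shows up only as an
errno, so a tool may want to point the user at dmesg. The following is a
minimal userspace sketch of that idea. kexec_load_hint() is a
hypothetical helper, not part of kexec-tools or of this series, and the
hint text is an assumption based on the message quoted above.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of kexec-tools or this series): map the
 * errno of a failed kexec_load()/kexec_file_load() call to a short hint
 * for the user. Returns NULL when there is nothing useful to add.
 */
static const char *kexec_load_hint(int err)
{
	switch (err) {
	case EOPNOTSUPP:
		/* One possible cause with this patch: the TDX erratum. */
		return "check dmesg for \"tdx_pw_mce\": kexec/kdump may be disabled on this platform";
	case EPERM:
		/* kexec_load(2) documents EPERM for a missing CAP_SYS_BOOT. */
		return "loading a kexec image requires CAP_SYS_BOOT";
	default:
		return NULL;
	}
}
```

A caller would print strerror(errno) as before and append the hint when
one is returned, so the existing "Operation not supported" output stays
unchanged.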

Signed-off-by: Kai Huang
---
 arch/x86/kernel/machine_kexec_64.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 0e9808eeb63e..e438c4163960 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -311,6 +311,22 @@ int machine_kexec_prepare(struct kimage *image)
 	unsigned long reloc_end = (unsigned long)__relocate_kernel_end;
 	int result;
 
+	/*
+	 * Some early TDX-capable platforms have an erratum. A kernel
+	 * partial write (a write transaction of less than cacheline
+	 * lands at memory controller) to TDX private memory poisons that
+	 * memory, and a subsequent read triggers a machine check.
+	 *
+	 * On those platforms the old kernel must reset TDX private
+	 * memory before jumping to the new kernel otherwise the new
+	 * kernel may see unexpected machine check. For simplicity
+	 * just fail kexec/kdump on those platforms.
+	 */
+	if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) {
+		pr_info_once("Not allowed on platform with tdx_pw_mce bug\n");
+		return -EOPNOTSUPP;
+	}
+
 	/* Setup the identity mapped 64bit page table */
 	result = init_pgtable(image, __pa(control_page));
 	if (result)
-- 
2.48.1

From nobody Tue Dec 16 11:03:19 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de,
 peterz@infradead.org, mingo@redhat.com, kirill.shutemov@linux.intel.com
Cc: hpa@zytor.com, x86@kernel.org, linux-kernel@vger.kernel.org,
 pbonzini@redhat.com, seanjc@google.com, rick.p.edgecombe@intel.com,
 reinette.chatre@intel.com, isaku.yamahata@intel.com,
 dan.j.williams@intel.com, thomas.lendacky@amd.com, ashish.kalra@amd.com,
 dwmw@amazon.co.uk, bhe@redhat.com, nik.borisov@suse.com, sagis@google.com
Subject: [RFC PATCH 4/5] x86/virt/tdx: Remove the !KEXEC_CORE dependency
Date: Thu, 13 Mar 2025 00:34:16 +1300
Message-ID: <8ca4ac944560c9c02ef9ba273e2ae8f1cdd31c3a.1741778537.git.kai.huang@intel.com>

During kexec it is now guaranteed that all dirty cachelines of TDX
private memory are flushed before jumping to the new kernel. The TDX
private memory from the old kernel will remain as TDX private memory in
the new kernel, but that is OK because kernel reads/writes of TDX
private memory never cause a machine check, except on platforms with
the TDX partial write erratum, which has already been handled.

It is safe to allow kexec to work together with TDX now. Remove the
!KEXEC_CORE dependency.

Signed-off-by: Kai Huang
---
 arch/x86/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ac4d65ef54a5..2d423964beb9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1977,7 +1977,6 @@ config INTEL_TDX_HOST
 	depends on X86_X2APIC
 	select ARCH_KEEP_MEMBLOCK
 	depends on CONTIG_ALLOC
-	depends on !KEXEC_CORE
 	depends on X86_MCE
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
-- 
2.48.1

From nobody Tue Dec 16 11:03:19 2025
From: Kai Huang
To: dave.hansen@intel.com, bp@alien8.de, tglx@linutronix.de,
 peterz@infradead.org, mingo@redhat.com, kirill.shutemov@linux.intel.com
Cc: hpa@zytor.com, x86@kernel.org, linux-kernel@vger.kernel.org,
 pbonzini@redhat.com, seanjc@google.com, rick.p.edgecombe@intel.com,
 reinette.chatre@intel.com, isaku.yamahata@intel.com,
 dan.j.williams@intel.com, thomas.lendacky@amd.com, ashish.kalra@amd.com,
 dwmw@amazon.co.uk, bhe@redhat.com, nik.borisov@suse.com, sagis@google.com
Subject: [RFC PATCH 5/5] x86/virt/tdx: Update the kexec section in the TDX documentation
Date: Thu, 13 Mar 2025 00:34:17 +1300
Message-ID: <2e88fd265c88cb42cef505e89c466c5625f4a5b6.1741778537.git.kai.huang@intel.com>

The TDX host kernel now supports kexec/kdump. Update the documentation
to reflect that.

Signed-off-by: Kai Huang
---
 Documentation/arch/x86/tdx.rst | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
index 719043cd8b46..8874c210e545 100644
--- a/Documentation/arch/x86/tdx.rst
+++ b/Documentation/arch/x86/tdx.rst
@@ -142,13 +142,6 @@ but depends on the BIOS to behave correctly.
 Note TDX works with CPU logical online/offline, thus the kernel still
 allows to offline logical CPU and online it again.
 
-Kexec()
-~~~~~~~
-
-TDX host support currently lacks the ability to handle kexec. For
-simplicity only one of them can be enabled in the Kconfig. This will be
-fixed in the future.
-
 Erratum
 ~~~~~~~
 
@@ -171,6 +164,16 @@ If the platform has such erratum, the kernel prints additional message in
 machine check handler to tell user the machine check may be caused by
 kernel bug on TDX private memory.
 
+Kexec
+~~~~~~~
+
+Kexec/kdump works normally when TDX is enabled in the kernel. One
+limitation is if the old kernel has ever enabled TDX the new kernel won't
+be able to use TDX anymore.
+
+One exception is kexec/kdump are disabled on the platform with the TDX
+"tdx_pw_mce" erratum. This will be enhanced in the future.
+
 Interaction vs S3 and deeper states
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-- 
2.48.1
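
For readers comparing the old and new WBINVD conditions from patches 1
and 2, the decision can be modelled outside the kernel as two small
predicates over CPUID values. This is only an illustrative sketch: the
struct and function names are made up for this example, the kernel
itself uses boot_cpu_has()/cpu_feature_enabled(), and the sketch assumes
X86_FEATURE_HYPERVISOR corresponds to CPUID leaf 1, ECX bit 31.

```c
#include <stdbool.h>
#include <stdint.h>

#define SME_CPUID_LEAF		0x8000001fu
#define CPUID_SME_BIT		(1u << 0)	/* CPUID 0x8000001f, EAX bit 0 */
#define CPUID_HYPERVISOR_BIT	(1u << 31)	/* CPUID leaf 1, ECX bit 31 */

/* Illustrative snapshot of the CPUID values the two checks look at. */
struct cpuid_snapshot {
	uint32_t extended_cpuid_level;	/* highest extended leaf supported */
	uint32_t leaf_8000001f_eax;	/* only meaningful if the leaf exists */
	uint32_t leaf_1_ecx;
};

/* Old condition: WBINVD only when the hardware reports SME support. */
static bool wbinvd_old_policy(const struct cpuid_snapshot *s)
{
	return s->extended_cpuid_level >= SME_CPUID_LEAF &&
	       (s->leaf_8000001f_eax & CPUID_SME_BIT);
}

/* New condition: WBINVD whenever the hypervisor bit is clear. */
static bool wbinvd_new_policy(const struct cpuid_snapshot *s)
{
	return !(s->leaf_1_ecx & CPUID_HYPERVISOR_BIT);
}
```

With these predicates, a bare-metal machine without the SME leaf is
skipped by the old condition but flushed by the new one, while a guest
that exposes the SME bit is flushed by the old condition and skipped by
the new one.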