TL;DR:
Do unconditional WBINVD in stop_this_cpu() on bare metal to cover kexec
support for both AMD SME and Intel TDX. There _was_ an issue that
prevented doing so, but it has since been fixed.
Long version:
Both AMD SME and Intel TDX can leave caches in an incoherent state due
to memory encryption, which can lead to silent memory corruption during
kexec. To address this issue, it is necessary to flush the caches
before jumping to the second kernel.
Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
is supported by hardware. To support TDX, instead of adding one more
vendor-specific check, perform WBINVD unconditionally. Since kexec is
a slow path, the additional WBINVD is acceptable for the sake of
simplicity and maintainability.
It is important to note that WBINVD should only be done for bare-metal
scenarios, as TDX guests and SEV-ES/SEV-SNP guests may not handle the
unexpected exception (#VE or #VC) caused by WBINVD.
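As a rough illustration of the change described above, the following
standalone C sketch models the old and the new condition as pure
functions. It is not kernel code: the function names and parameters are
hypothetical stand-ins for c->extended_cpuid_level,
cpuid_eax(0x8000001f) and boot_cpu_has(X86_FEATURE_HYPERVISOR).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the old check: flush only when the SME CPUID
 * leaf exists and bit 0 of its EAX is set.
 */
bool wbinvd_old(uint32_t extended_cpuid_level, uint32_t eax_8000001f)
{
	return extended_cpuid_level >= 0x8000001f && (eax_8000001f & 1u);
}

/*
 * Hypothetical model of the new check: flush on bare metal only,
 * regardless of vendor.
 */
bool wbinvd_new(bool hypervisor)
{
	return !hypervisor;
}
```

Under this model, a bare-metal TDX-capable Intel host (no SME bit, no
hypervisor bit) is not flushed by the old check but is flushed by the
new one, while any guest skips the flush.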
Note:
Historically, there _was_ an issue that prevented doing WBINVD
unconditionally, but it has been fixed.
When SME kexec support was initially added in commit
bba4ed011a52 ("x86/mm, kexec: Allow kexec to be used with SME")
WBINVD was done unconditionally. However, it was later reported that
various Intel systems would hang or reset because of that commit.
To try to fix this, a later commit
f23d74f6c66c ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
changed the code to only do WBINVD when the hardware supports SME.
While this commit made the reported issues go away, it didn't pinpoint
the root cause. It also missed a corner case[*], the analysis of which
eventually revealed the root cause and led to the final fix in commit
1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")
See [1][2] for more information.
Further testing of unconditional WBINVD on top of the above fix, on the
machines where the issues were originally reported, confirmed that the
issues could no longer be reproduced.
See [3][4] for more information.
Therefore, it is now safe to do unconditional WBINVD on bare metal.
[*] The commit didn't check whether the CPUID leaf is available.
Querying an unsupported CPUID leaf on Intel returns garbage, which
resulted in an unintended WBINVD and caused the issue that led to the
analysis and the discovery of the final root cause. The corner case was
independently fixed by commit
9b040453d444 ("x86/smp: Dont access non-existing CPUID leaf")
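The corner case in [*] can be sketched as follows. This is a simplified
userspace model, not the kernel code: raw_eax stands for whatever CPUID
returns in EAX for leaf 0x8000001f, max_extended_leaf for the highest
extended leaf the CPU reports, and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SME_LEAF 0x8000001fu

/* Unguarded: trusts bit 0 of EAX even when the leaf does not exist. */
bool sme_bit_unguarded(uint32_t raw_eax)
{
	return raw_eax & 1u;
}

/* Guarded: only looks at EAX when the CPU reports the leaf as available. */
bool sme_bit_guarded(uint32_t max_extended_leaf, uint32_t raw_eax)
{
	if (max_extended_leaf < SME_LEAF)
		return false;
	return raw_eax & 1u;
}
```

With an arbitrary EAX value whose bit 0 happens to be set, the unguarded
variant reports SME and would trigger WBINVD, while the guarded variant
does not when the leaf is out of range.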
Link: https://lore.kernel.org/lkml/28a494ca-3173-4072-921c-6c5f5b257e79@amd.com/ [1]
Link: https://lore.kernel.org/lkml/24844584-8031-4b58-ba5c-f85ef2f4c718@amd.com/ [2]
Link: https://lore.kernel.org/lkml/20240221092856.GAZdXCWGJL7c9KLewv@fat_crate.local/ [3]
Link: https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/ [4]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
---
v6 -> v7:
- Use "Link: <permalink>".
v5 -> v6:
- No change
v4 -> v5:
- Add Tom's tag
v3 -> v4:
- Update part of changelog based on Kirill's version (with minor tweak).
- Use "exception (#VE or #VC)" for TDX and SEV-ES/SEV-SNP in changelog
and comments. (Kirill, Tom)
- Point out "WBINVD is not necessary for TDX and SEV-ES/SEV-SNP guests"
in the comment. (Tom)
v2 -> v3:
- Change to only do WBINVD for bare metal
---
arch/x86/kernel/process.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index f63f8fd00a91..d1a20501e686 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -813,18 +813,17 @@ void __noreturn stop_this_cpu(void *dummy)
mcheck_cpu_clear(c);
/*
- * Use wbinvd on processors that support SME. This provides support
- * for performing a successful kexec when going from SME inactive
- * to SME active (or vice-versa). The cache must be cleared so that
- * if there are entries with the same physical address, both with and
- * without the encryption bit, they don't race each other when flushed
- * and potentially end up with the wrong entry being committed to
- * memory.
+ * The kernel could leave caches in an incoherent state on SME/TDX
+ * capable platforms. Flush the caches to avoid silent memory
+ * corruption on these platforms.
*
- * Test the CPUID bit directly because the machine might've cleared
- * X86_FEATURE_SME due to cmdline options.
+ * stop_this_cpu() isn't a fast path, just do WBINVD for bare-metal
+ * to cover both SME and TDX. It isn't necessary to perform WBINVD
+ * in a guest and performing one could result in an exception (#VE
+ * or #VC) for a TDX or SEV-ES/SEV-SNP guest that the guest may
+ * not be able to handle (e.g., TDX guest panics if it sees #VE).
*/
- if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+ if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
native_wbinvd();
/*
--
2.46.0