[PATCH] x86/virt/tdx: Make TDX and kexec mutually exclusive at runtime

Kai Huang posted 1 patch 8 months ago
Currently, kexec doesn't work well with TDX host support, and only one
of them can be enabled in Kconfig.  However, distributions typically
prefer to use a unified Kconfig with all features enabled.  Therefore,
it would be very useful if both TDX host and kexec could be enabled in
Kconfig simultaneously.

Full support for kexec on a TDX host would require complex work.  The
cache flushing required would need to happen while stopping remote CPUs,
which would require changes to a fragile area of the kernel.  It would
also require resetting TDX private pages, which is non-trivial since the
core kernel does not track them.  Lastly, it would have to rely on a
yet-to-be documented behavior around the TME key (KeyID 0).

Leave the full support and the documentation clarification for future
work, but start with a simple solution: change to make them mutually
exclusive at runtime so that they can be both enabled in the Kconfig.

While there is a little bit of TDX setup at boot, the kexec sensitive
parts of the initialization are enabled when KVM is loaded with a
specific non-default kernel parameter (kvm_intel.tdx=Y).  Use a simple
policy to decide which to run: whichever gets run first disables the
other.  This effectively makes kexec race with TDX when KVM module is
loaded.

Kexec has two phases: the kernel image loading phase and the actual
execution phase.  Specifically, try to disable TDX permanently during
the kernel image loading phase by leveraging the x86 version of
machine_kexec_prepare().  If TDX has already been enabled (thus cannot
be disabled), fail the kexec.

The lock that the TDX disabling operation takes is not held during the
TDX per-CPU initialization, which happens before the main TDX
initialization.  The consequence is that while kexec can't race with
TDX initialization in a way that would leave private memory in a state
that could corrupt the second kernel, it won't exclude the case of the
TDX module being partially initialized.  In this rare scenario, TDX
initialization would simply fail in the second kernel.  Keep the simple
solution simple, and just document the race.

Another option could be to handle this when the kernel actually does
kexec, but this would require adding an arch callout for the operation.
Don't pursue this option to avoid complicating the kexec code.

If TDX cannot be disabled, the users will get an error:

  kexec_load failed: Operation not supported

This could be confusing to the users, thus also tell the reason in the
dmesg:

  [..] kexec: Disabled: TDX is enabled

If TDX can be disabled, also print a message to let users know:

  [..] virt/tdx: explicitly disabled

The reason why this wasn't done earlier was that the Kconfig option was
just a bit simpler and the TDX code was large.  Moving to mutual exclusion at
runtime is an incremental improvement that better meets the needs of
distributions.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

Hi Dave,

So far there have been a couple of attempts to resolve the kexec/TDX
incompatibilities, but they have met complications.

The initial attempt was to support kexec on all TDX host platforms.
It had patches to reset TDX private pages on TDX "partial write #MC"
erratum platforms, but they added complexity, especially since a KVM
patch was also needed.

The second attempt was to fail kexec on those erratum platforms to
remove the code to reset TDX private pages, but we found more general
issues that will take time to work through.

Next we looked at disabling kexec whenever TDX was supported,
effectively making kexec or TDX dependent on the BIOS configuration.
But we thought better of this from a UX perspective, which led to the
solution in this patch.

In the meantime, I'd prefer to go with this simpler solution.  I think
this is a good first step. Please consider it for merging.


---
 Documentation/arch/x86/tdx.rst     | 10 ++++++++--
 arch/x86/Kconfig                   |  1 -
 arch/x86/include/asm/tdx.h         |  2 ++
 arch/x86/kernel/machine_kexec_64.c | 14 ++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.c        | 26 ++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h        |  3 ++-
 6 files changed, 52 insertions(+), 4 deletions(-)

diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
index 719043cd8b46..646b6475a90d 100644
--- a/Documentation/arch/x86/tdx.rst
+++ b/Documentation/arch/x86/tdx.rst
@@ -146,8 +146,14 @@ Kexec()
 ~~~~~~~
 
 TDX host support currently lacks the ability to handle kexec.  For
-simplicity only one of them can be enabled in the Kconfig.  This will be
-fixed in the future.
+simplicity, whichever gets run first disables the other.  I.e., loading
+a kexec kernel image tries to disable TDX permanently; if TDX has
+already been enabled, the load fails.  This will be fixed in the
+future.
+
+It is possible that kexec can race with the per-cpu initialization of
+TDX.  In the case of losing this race, TDX will not be usable in the
+second kernel, but otherwise kexec will happen normally.
 
 Erratum
 ~~~~~~~
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index aeac63b11fc2..be0a41cfcf74 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1924,7 +1924,6 @@ config INTEL_TDX_HOST
 	depends on X86_X2APIC
 	select ARCH_KEEP_MEMBLOCK
 	depends on CONTIG_ALLOC
-	depends on !KEXEC_CORE
 	depends on X86_MCE
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4a1922ec80cf..9f9df689506d 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -119,11 +119,13 @@ static inline u64 sc_retry(sc_func_t func, u64 fn,
 int tdx_cpu_enable(void);
 int tdx_enable(void);
 const char *tdx_dump_mce_info(struct mce *m);
+bool tdx_try_disable(void);
 #else
 static inline void tdx_init(void) { }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
 static inline int tdx_enable(void)  { return -ENODEV; }
 static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
+static inline bool tdx_try_disable(void) { return true; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLER__ */
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 949c9e4bfad2..2a66db8c7f94 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -29,6 +29,7 @@
 #include <asm/set_memory.h>
 #include <asm/cpu.h>
 #include <asm/efi.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_ACPI
 /*
@@ -346,6 +347,19 @@ int machine_kexec_prepare(struct kimage *image)
 	unsigned long reloc_end = (unsigned long)__relocate_kernel_end;
 	int result;
 
+	/*
+	 * Kexec doesn't play nice with TDX because there are issues
+	 * like needing to flush cache and resetting TDX private memory.
+	 *
+	 * The kernel doesn't support those things for TDX.  Try to
+	 * disable TDX permanently so that kexec can move on.  If TDX
+	 * has already been enabled, fail kexec.
+	 */
+	if (!tdx_try_disable()) {
+		pr_info_once("Disabled: TDX is enabled\n");
+		return -EOPNOTSUPP;
+	}
+
 	/* Setup the identity mapped 64bit page table */
 	result = init_pgtable(image, __pa(control_page));
 	if (result)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 7fdb37387886..bcb2ab7505b0 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1456,3 +1456,29 @@ void __init tdx_init(void)
 
 	check_tdx_erratum();
 }
+
+/*
+ * Disable TDX permanently if the module hasn't been initialized
+ * (otherwise does nothing).  Return whether TDX is disabled.
+ *
+ * This function only prevents running concurrently with tdx_enable().
+ * tdx_cpu_enable() can still run successfully even after this function
+ * has disabled TDX.
+ */
+bool tdx_try_disable(void)
+{
+	bool disabled;
+
+	mutex_lock(&tdx_module_lock);
+
+	if (tdx_module_status == TDX_MODULE_UNINITIALIZED) {
+		pr_info("explicitly disabled\n");
+		tdx_module_status = TDX_MODULE_DISABLED;
+	}
+
+	disabled = (tdx_module_status != TDX_MODULE_INITIALIZED);
+
+	mutex_unlock(&tdx_module_lock);
+
+	return disabled;
+}
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 4e3d533cdd61..83ec5fe59f22 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -64,7 +64,8 @@ struct tdmr_info {
 enum tdx_module_status_t {
 	TDX_MODULE_UNINITIALIZED,
 	TDX_MODULE_INITIALIZED,
-	TDX_MODULE_ERROR
+	TDX_MODULE_ERROR,
+	TDX_MODULE_DISABLED
 };
 
 struct tdx_memblock {
-- 
2.49.0
Re: [PATCH] x86/virt/tdx: Make TDX and kexec mutually exclusive at runtime
Posted by Dave Hansen 8 months ago
On 4/16/25 16:02, Kai Huang wrote:
> Full support for kexec on a TDX host would require complex work.
> The cache flushing required would need to happen while stopping
> remote CPUs, which would require changes to a fragile area of the
> kernel.

Doesn't kexec already stop remote CPUs? Doesn't this boil down to a
WBINVD? How is that complex?

> It would also require resetting TDX private pages, which is non-
> trivial since the core kernel does not track them.

Why? The next kernel will just use KeyID-0 which will blast the old
pages away with no side effects... right?

> Lastly, it would have to rely on a yet-to-be documented behavior
> around the TME key (KeyID 0).

I'll happily wait for the documentation if you insist on it (I don't).
Re: [PATCH] x86/virt/tdx: Make TDX and kexec mutually exclusive at runtime
Posted by Edgecombe, Rick P 8 months ago
On Thu, 2025-04-17 at 10:50 -0700, Dave Hansen wrote:
> On 4/16/25 16:02, Kai Huang wrote:
> > Full support for kexec on a TDX host would require complex work.
> > The cache flushing required would need to happen while stopping
> > remote CPUs, which would require changes to a fragile area of the
> > kernel.
> 
> Doesn't kexec already stop remote CPUs? Doesn't this boil down to a
> WBINVD? How is that complex?

When SME added an SME-only WBINVD in stop_this_cpu() it caused a shutdown hang
on some particular HW. It turns out there was an existing race that was made
worse by the slower operation. It went through some attempts to fix it, and
finally tglx patched it up with:

  1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")

But in that patch he said the fix "cannot plug all holes either". So while
looking at doing the WBINVD for TDX kexec, I was advocating for giving this a
harder look before building on top of it. The patches to add TDX kexec support
made the WBINVD happen on all bare metal, not just TDX HW. So whatever races
exist would be exposed to a much wider variety of HW than SME tested out.

> 
> > It would also require resetting TDX private pages, which is non-
> > trivial since the core kernel does not track them.
> 
> Why? The next kernel will just use KeyID-0 which will blast the old
> pages away with no side effects... right?

I believe this is talking about support to work around the #MC errata. Another
version of kexec TDX support used a KVM callback to have it reset all the TDX
guest memory it knows about.

> 
> > Lastly, it would have to rely on a yet-to-be documented behavior
> > around the TME key (KeyID 0).
> 
> I'll happily wait for the documentation if you insist on it (I don't).

Ok, thanks. This one is probably more of a bonus reason on top of the above.
Re: [PATCH] x86/virt/tdx: Make TDX and kexec mutually exclusive at runtime
Posted by Dave Hansen 8 months ago
On 4/17/25 11:21, Edgecombe, Rick P wrote:
> On Thu, 2025-04-17 at 10:50 -0700, Dave Hansen wrote:
>> On 4/16/25 16:02, Kai Huang wrote:
>>> Full support for kexec on a TDX host would require complex work.
>>> The cache flushing required would need to happen while stopping
>>> remote CPUs, which would require changes to a fragile area of the
>>> kernel.
>>
>> Doesn't kexec already stop remote CPUs? Doesn't this boil down to a
>> WBINVD? How is that complex?
> 
> When SME added an SME-only WBINVD in stop_this_cpu() it caused a shutdown hang
> on some particular HW. It turns out there was an existing race that was made
> worse by the slower operation. It went through some attempts to fix it, and
> finally tglx patched it up with:
> 
>   1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")
> 
> But in that patch he said the fix "cannot plug all holes either". So while
> looking at doing the WBINVD for TDX kexec, I was advocating for giving this a
> harder look before building on top of it. The patches to add TDX kexec support
> made the WBINVD happen on all bare metal, not just TDX HW. So whatever races
> exist would be exposed to a much wider variety of HW than SME tested out.

I get it. Adding WBINVD to this same path caused some pain before. But
just turning off the feature that calls this path seems like overkill.

How about we try to push WBINVD out of this path? It should be quite
doable for TDX, I think.

Let's say we had a percpu bool. It gets set on each CPU when SME is
enabled on the system. It also gets set when TDX is enabled. The kexec
code becomes:

-	if (SME)
+	if (per_cpu(newbool))
		wbinvd();

No TDX, no new wbinvd(). If SME, no change.

Now, here's where it gets fun. The bool can get _cleared_ after WBINVD
is executed on a CPU, at least on TDX systems. It then also needs to get
set after TDX might dirty a cacheline.

	TDCALL(); // dirties stuff
	per_cpu(newbool) = 1;

Then you can also do this on_each_cpu():

	wbinvd();
	per_cpu(newbool) = 0;

hopefully at a point after you're sure no more TDCALLs are being made. If
you screw it up, no biggie: the kexec-time one will make up for it,
exposing TDX systems to the kexec timing bugs. But if the on_each_cpu()
thing works in the common case, you get no additional bug exposure.

>>> It would also require resetting TDX private pages, which is non-
>>> trivial since the core kernel does not track them.
>>
>> Why? The next kernel will just use KeyID-0 which will blast the old
>> pages away with no side effects... right?
> 
> I believe this is talking about support to work around the #MC errata. Another
> version of kexec TDX support used a KVM callback to have it reset all the TDX
> guest memory it knows about.

So, let's just not support hardware with that erratum upstream.
Re: [PATCH] x86/virt/tdx: Make TDX and kexec mutually exclusive at runtime
Posted by Edgecombe, Rick P 7 months, 1 week ago
On Thu, 2025-04-17 at 11:56 -0700, Dave Hansen wrote:
> I get it. Adding WBINVD to this same path caused some pain before. But
> just turning off the feature that calls this path seems like overkill.
> 
> How about we try to push WBINVD out of this path? It should be quite
> doable for TDX, I think.
> 
> Let's say we had a percpu bool. It get set when SME is enabled on the
> system on each CPU. It also gets enabled when TDX is enabled. The kexec
> code becomes:
> 
> -	if (SME)
> +	if (per_cpu(newbool))
> 		wbinvd();
> 
> No TDX, no new wbinvd(). If SME, no change.
> 
> Now, here's where it gets fun. The bool can get _cleared_ after WBINVD
> is executed on a CPU, at least on TDX systems. It then also needs to get
> set after TDX might dirty a cacheline.
> 
> 	TDCALL(); // dirties stuff
> 	per_cpu(newbool) = 1;
> 
> Then you can also do this on_each_cpu():
> 
> 	wbinvd();
> 	per_cpu(newbool) = 0;
> 
> hopefully at point after you're sure no more TDCALLs are being made. If
> you screw it up, no biggie: the kexec-time one will make up for it,
> exposing TDX systems to the kexec timing bugs. But if the on_each_cpu()
> thing works in the common case, you get no additional bug exposure.

Kai and I have been discussing this internally. Here I'll try to move the
discussion external and continue it.

The problem with this approach turned out to be that the per_cpu(newbool) = 1
part is racy with the SEAMCALL. The task could reschedule between the SEAMCALL
and the per-cpu set. Disabling preemption around every SEAMCALL should be
possible, but it seems a bit heavyweight for what originally appeared to be an
easy way to reduce, but not eliminate, the chances of hitting the race.

What if instead, 'newbool' was a global. For SME it can be set at boot. For TDX,
we can minimize the chances of the race in a lesser way. TDX only needs to
WBINVD in stop_this_cpu() when:
1. TDX is late loaded by KVM (currently via module parameter, but maybe someday
when the first TD is loaded)
2. When the native_stop_other_cpus is happening as part of a kexec.

There is already special case logic for kexec in native_stop_other_cpus. So we
just need to add a check of whether TDX is loaded, like this:

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 962c3ce39323..147f784d6d0f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -818,19 +818,7 @@ void __noreturn stop_this_cpu(void *dummy)
        disable_local_APIC();
        mcheck_cpu_clear(c);
 
-       /*
-        * Use wbinvd on processors that support SME. This provides support
-        * for performing a successful kexec when going from SME inactive
-        * to SME active (or vice-versa). The cache must be cleared so that
-        * if there are entries with the same physical address, both with and
-        * without the encryption bit, they don't race each other when flushed
-        * and potentially end up with the wrong entry being committed to
-        * memory.
-        *
-        * Test the CPUID bit directly because the machine might've cleared
-        * X86_FEATURE_SME due to cmdline options.
-        */
-       if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+       if (cache_state_incoherent)
                wbinvd();
 
        /*
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 18266cc3d98c..a254f8842ca6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -160,10 +160,19 @@ static void native_stop_other_cpus(int wait)
        if (!atomic_try_cmpxchg(&stopping_cpu, &old_cpu, this_cpu))
                return;
 
-       /* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */
-       if (kexec_in_progress)
+       /*
+        * For kexec, ensure that offline CPUs are out of MWAIT and in HLT.
+        * Also, TDX needs to flush caches when TDX is loaded for the kexec
+        * case. Make sure stop_this_cpu() knows to do this. Otherwise, skip
+        * it, so as not to increase the chances of the NMI-cpumask race.
+        */
+       if (kexec_in_progress) {
                smp_kick_mwait_play_dead();
 
+               if (tdx_check_load_and_block())
+                       cache_state_incoherent = true;
+       }
+
        /*
         * 1) Send an IPI on the reboot vector to all other CPUs.
         *
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index af8798bc62ed..6cca847e7a9c 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1497,6 +1497,18 @@ const struct tdx_sys_info *tdx_get_sysinfo(void)
 }
 EXPORT_SYMBOL_GPL(tdx_get_sysinfo);
 
+/*
+ * Calling this blocks TDX from ever getting loaded, only for use
+ * during shutdown
+ */
+bool tdx_check_load_and_block(void)
+{
+       if (!mutex_trylock(&tdx_module_lock))
+               return true;
+
+       return tdx_module_status != TDX_MODULE_UNINITIALIZED;
+}
+
 u32 tdx_get_nr_guest_keyids(void)
 {
        return tdx_nr_guest_keyids;



This would leave things such that kexec is generally expected to work whenever
TDX is enabled, while the race is reduced when it's possible to tell TDX is not
enabled. It pollutes the layers a bit with assumptions about how things are
called, but it's sort of aligned with how kexec_in_progress already works.

Otherwise, maybe we just go with an SME-like approach that limits the WBINVD to
certain systems more statically (i.e.
boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM)). This was pretty much the original
approach way back. It reduces race exposure too, in the scope of all kernels
out there, and has no layering violations.



Re: [PATCH] x86/virt/tdx: Make TDX and kexec mutually exclusive at runtime
Posted by Edgecombe, Rick P 7 months, 1 week ago
On Wed, 2025-05-07 at 14:03 -0700, Rick Edgecombe wrote:
> Kai and I have been discussing this internally. Here I'll try to move the
> discussion external and continue it.
> 
> The problem with this approach turned out to be per_cpu(newbool) = 1 part is
> racy with the SEAMCALL. The task could reschedule between SEAMCALL and the per-
> cpu set. Disabling preemption around every SEAMCALL should be possible, but it
> seems a bit heavyweight for what originally appeared to be an easy way to reduce
> but not eliminate the chances of hitting the race.


Well, after more offline discussion, this time with Dave, we'll go with the per-
cpu approach. The reasons are:
 - global var is not horrible, but...
 - preempt_disable/enable() is not heavyweight. Since the SEAMCALL is not
preemptible, the non-preemptible cycles difference is only a few instructions,
so "preempt_disable is bad" is not a good reason.
 - Is the complexity still too high for what we are trying to do here (shrink a
race)? No, and it's simpler to understand. wbinvd is per-cpu, so the
needs_to_flush bool should also be per-cpu.

So back to the per-cpu approach.

Re: [PATCH] x86/virt/tdx: Make TDX and kexec mutually exclusive at runtime
Posted by Huang, Kai 7 months, 3 weeks ago
On Thu, 2025-04-17 at 11:56 -0700, Dave Hansen wrote:
> On 4/17/25 11:21, Edgecombe, Rick P wrote:
> > On Thu, 2025-04-17 at 10:50 -0700, Dave Hansen wrote:
> > > On 4/16/25 16:02, Kai Huang wrote:
> > > > Full support for kexec on a TDX host would require complex work.
> > > > The cache flushing required would need to happen while stopping
> > > > remote CPUs, which would require changes to a fragile area of the
> > > > kernel.
> > > 
> > > Doesn't kexec already stop remote CPUs? Doesn't this boil down to a
> > > WBINVD? How is that complex?
> > 
> > When SME added an SME-only WBINVD in stop_this_cpu() it caused a shutdown hang
> > on some particular HW. It turns out there was an existing race that was made
> > worse by the slower operation. It went through some attempts to fix it, and
> > finally tglx patched it up with:
> > 
> >   1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")
> > 
> > But in that patch he said the fix "cannot plug all holes either". So while
> > looking at doing the WBINVD for TDX kexec, I was advocating for giving this a
> > harder look before building on top of it. The patches to add TDX kexec support
> > made the WBINVD happen on all bare metal, not just TDX HW. So whatever races
> > exist would be exposed to a much wider variety of HW than SME tested out.
> 
> I get it. Adding WBINVD to this same path caused some pain before. But
> just turning off the feature that calls this path seems like overkill.
> 
> How about we try to push WBINVD out of this path? It should be quite
> doable for TDX, I think.

Thanks for providing feedback.

> 
> Let's say we had a percpu bool. It get set when SME is enabled on the
> system on each CPU. 
> 

This fits SME, but there's one thing I would like to point out:

For SME there are two phases involving wbinvd():

1) Do wbinvd() in stop_this_cpu() which is for all remote CPUs;
2) Do wbinvd() in relocate_kernel() which is for the last kexec-ing CPU;

And the thing is, currently, the conditions to decide whether to perform
wbinvd() for the above two cases are different:

Case 1) checks whether the _hardware_ supports SME (i.e., via CPUID);
Case 2) checks whether the _kernel_ has ever enabled SME.

We can just expand the case 2) to match case 1).  I don't see any issue here. 
Expanding to doing wbinvd() on bare-metal machines does the same thing anyway.

So I think this approach fits SME easily.  We can just set the per-cpu(newbool)
in early_detect_mem_encrypt() by always checking whether the _hardware_ supports
SME.

> It also gets enabled when TDX is enabled. 
> 

Reading below, I think you mean we only enable the newbool when TDX is enabled
*at runtime*?


> The kexec
> code becomes:
> 
> -	if (SME)
> +	if (per_cpu(newbool))
> 		wbinvd();
> 
> No TDX, no new wbinvd(). If SME, no change.

Yes.

> 
> Now, here's where it gets fun. The bool can get _cleared_ after WBINVD
> is executed on a CPU, at least on TDX systems. 
> 

Right, this only works for TDX.  For SME, wbinvd() is always needed, since any
single write to any memory would generate a new dirty cacheline which may have
different encryption property when the second kernel uses that memory.

> It then also needs to get
> set after TDX might dirty a cacheline.
> 
> 	TDCALL(); // dirties stuff
> 	per_cpu(newbool) = 1;
> 
> Then you can also do this on_each_cpu():
> 
> 	wbinvd();
> 	per_cpu(newbool) = 0;
> 
> hopefully at point after you're sure no more TDCALLs are being made. 
> 

(I assume you meant SEAMCALL rather than TDCALL, since this context is the
TDX host.)

In the initial TDX support KVM is the only user, so KVM knows when no more
SEAMCALLs can be made.  E.g., we can do this when the last VM (or TD) is
destroyed, or when KVM module is unloaded.

> If
> you screw it up, no biggie: the kexec-time one will make up for it,
> exposing TDX systems to the kexec timing bugs. But if the on_each_cpu()
> thing works in the common case, you get no additional bug exposure.

This is true.  But I don't think it's a bug, because kexec can happen at any
time, e.g., when there's a TD still running.

So unless KVM is notified when kexec happens (e.g., via a reboot notifier) and
terminates all TDX guests when notified before the kexec-ing CPU tries to stop
all remote CPUs, KVM cannot guarantee wbinvd() has been done and there will be
no more SEAMCALLs.

So in short:

I think this approach will work for both SME and TDX.  But I am not sure whether
we should/need to make KVM do

	wbinvd();
	per_cpu(newbool) = 0;

unless we also want to make KVM get notified when kexec happens.

Another two concerns are: 1) the above code doesn't work for SME, as mentioned
above; 2) with TDX Connect, other kernel components may need to do a similar
thing too.

Maybe we can start with below?

- setting per-cpu(newbool) for TDX when TDX gets enabled at runtime;
- letting the kexec code catch the wbinvd() for TDX

Or even simpler:

- setting per-cpu(newbool) for TDX when X86_FEATURE_TDX_HOST_PLATFORM is set at
early boot;
- letting the kexec code catch the wbinvd() for TDX


> 
> > > > It would also require resetting TDX private pages, which is non-
> > > > trivial since the core kernel does not track them.
> > > 
> > > Why? The next kernel will just use KeyID-0 which will blast the old
> > > pages away with no side effects... right?
> > 
> > I believe this is talking about support to work around the #MC errata. Another
> > version of kexec TDX support used a KVM callback to have it reset all the TDX
> > guest memory it knows about.
> 
> So, let's just not support hardware with that erratum upstream.

This can be done by refusing kexec at image loading time as shown in this patch:

https://lore.kernel.org/lkml/408103f145360dfa04a41bc836ca2c724f29deb0.1741778537.git.kai.huang@intel.com/

But if we want to make KVM get notified when kexec happens, then resetting TDX
private memory can also be done easily because KVM already has the code to tear
down a TDX guest including resetting TDX private memory and flushing cache
associated with the TDX guest etc.

Re: [PATCH] x86/virt/tdx: Make TDX and kexec mutually exclusive at runtime
Posted by Edgecombe, Rick P 8 months ago
On Thu, 2025-04-17 at 11:02 +1200, Kai Huang wrote:
> Currently, kexec doesn't work well with TDX host support, and only one
> of them can be enabled in Kconfig.
> 

Nit:
"only one of them can be enabled in Kconfig at a time." would be a little
clearer.

>   However, distributions typically
> prefer to use a unified Kconfig with all features enabled.  Therefore,
> it would be very useful if both TDX host and kexec could be enabled in
> Kconfig simultaneously.
> 
> Full support for kexec on a TDX host would require complex work.  The
> cache flushing required would need to happen while stopping remote CPUs,
> which would require changes to a fragile area of the kernel.  It would
> also require resetting TDX private pages, which is non-trivial since the
> core kernel does not track them.  Lastly, it would have to rely on a
> yet-to-be documented behavior around the TME key (KeyID 0).
> 
> Leave the full support and the documentation clarification for future
> work, but start with a simple solution: change to make them mutually
> exclusive at runtime so that they can be both enabled in the Kconfig.
> 
> While there is a little bit of TDX setup at boot, the kexec sensitive
> parts of the initialization are enabled when KVM is loaded with a
> specific non-default kernel parameter (kvm_intel.tdx=Y).  Use a simple
> policy to decide which to run: whichever gets run first disables the
> other.  This effectively makes kexec race with TDX when KVM module is
> loaded.
> 
> Kexec has two phases: the kernel image loading phase and the actual
> execution phase.  Specifically, try to disable TDX permanently during
> the kernel image loading phase by leveraging the x86 version of
> machine_kexec_prepare().  If TDX has already been enabled (thus cannot
> be disabled), fail the kexec.
> 
> The lock that the TDX disabling operation takes is not held during the
> TDX per-CPU initialization, which happens before the main TDX
> initialization.  The consequence is that while kexec can't race with
> TDX initialization in a way that would leave private memory in a state
> that could corrupt the second kernel, it won't exclude the case of the
> TDX module being partially initialized.  In this rare scenario, TDX
> initialization would simply fail in the second kernel.  Keep the simple
> solution simple, and just document the race.
> 
> Another option could be to handle this when the kernel actually does
> kexec, but this would require adding an arch callout for the operation.
> Don't pursue this option to avoid complicating the kexec code.
> 
> If TDX cannot be disabled, the users will get an error:
> 
>   kexec_load failed: Operation not supported
> 
> This could be confusing to the users, thus also tell the reason in the
> dmesg:
> 
>   [..] kexec: Disabled: TDX is enabled
> 
> If TDX can be disabled, also print a message to let users know:
> 
>   [..] virt/tdx: explicitly disabled
> 
> The reason why this wasn't done earlier was the Kconfig option was just
> a bit simpler and the TDX code was large.  Moving to mutual exclusion at
> runtime is an incremental improvement that better meets the needs of
> distributions.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>

I think this is a good solution to start with. It has to be the simplest we could
do. Well, except just tweaking the Kconfig and documenting the race, but this
protection looks simple enough to be worth the cost. It would be good to get
distros on CC to confirm that this is acceptable.

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>