[v3] TDX host: kexec/kdump support

[PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Kai Huang 3 months, 2 weeks ago

On SME platforms, at hardware level dirty cachelines with and without
encryption bit of the same memory can coexist, and the CPU can flush
them back to memory in random order.  During kexec, the caches must be
flushed before jumping to the new kernel to avoid silent memory
corruption when a cacheline with a different encryption property is
written back over whatever encryption properties the new kernel is
using.

TDX also needs cache flush during kexec for the same reason.  It would
be good to implement a generic way to flush cache instead of scattering
checks for each feature all around.

During kexec, the kexec-ing CPU sends IPIs to all remote CPUs to stop
them before it boots to the new kernel.  For SME the kernel basically
encrypts all memory including the kernel itself by manipulating page
tables.  A simple memory write from the kernel could dirty cachelines.

Currently, the kernel uses WBINVD to flush cache for SME during kexec in
two places:

1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
   stops them;
2) the one in the relocate_kernel() where the kexec-ing CPU jumps to the
   new kernel.

Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
SEAMCALLs are performed).  Since there are no more SEAMCALLs after the
aforementioned WBINVDs, leverage this for TDX.

In order to have a generic way to cover both SME and TDX, use a percpu
boolean to indicate the cache may be in an incoherent state thus cache
flush is needed during kexec, and turn on the boolean for SME.  TDX can
then leverage it by also turning the boolean on.

A percpu boolean isn't strictly needed for SME since it is activated at
very early boot time and on all CPUs.  A global flag would be
sufficient.  But using a percpu variable has two benefits.  Foremost,
the state that is being tracked here (percpu cache coherency situation
requiring a flush) is percpu, so a percpu state is a more direct and
natural fit.

Secondly, it will fit TDX's usage better since the percpu var can be set
when a CPU makes a SEAMCALL, and cleared when another WBINVD on the CPU
obviates the need for a kexec-time WBINVD.  Saving kexec-time WBINVD is
valuable, because there is an existing race[*] where kexec could proceed
while another CPU is active.  WBINVD could make this race worse, so it's
worth skipping it when possible.

Today the first WBINVD in the stop_this_cpu() is performed when SME is
*supported* by the platform, and the second WBINVD is done in
relocate_kernel() when SME is *activated* by the kernel.  Make things
simple by changing to do the second WBINVD when the platform supports
SME.  This allows the kernel to simply turn on this percpu boolean when
bringing up a CPU by checking whether the platform supports SME.

No other functional change intended.

Also, currently machine_check() has a comment to explain why no function
call is allowed after load_segments().  After changing to use the new
percpu boolean to control whether to perform WBINVD when calling the
relocate_kernel(), that comment isn't needed anymore.  But it is still a
useful comment, so expand the comment around load_segments() to mention
that due to depth tracking no function call can be made after
load_segments().

[*] The "race" in native_stop_other_cpus()

Commit

  1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")

introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups on
poweroff" issue which was caused by the WBINVD in stop_this_cpu().
Specifically, the new cpumask resolved the below problem mentioned in
that commit:

    CPU0                                    CPU1

     stop_other_cpus()
       send_IPIs(REBOOT);                   stop_this_cpu()
       while (num_online_cpus() > 1);         set_online(false);
       proceed... -> hang
                                              wbinvd()

While it fixed the reported issue, that commit explained the new cpumask
"cannot plug all holes either".

This is because it doesn't address the "race" mentioned in the #3 in the
comment in native_stop_other_cpus():

        /*
         * 1) Send an IPI on the reboot vector to all other CPUs.
         *
         *    The other CPUs should react on it after leaving critical
         *    sections and re-enabling interrupts. They might still hold
         *    locks, but there is nothing which can be done about that.
         *
         * 2) Wait for all other CPUs to report that they reached the
         *    HLT loop in stop_this_cpu()
         *
         * 3) If #2 timed out send an NMI to the CPUs which did not
         *    yet report
         *
         * 4) Wait for all other CPUs to report that they reached the
         *    HLT loop in stop_this_cpu()
         *
         * #3 can obviously race against a CPU reaching the HLT loop late.
         * That CPU will have reported already and the "have all CPUs
         * reached HLT" condition will be true despite the fact that the
         * other CPU is still handling the NMI. Again, there is no
         * protection against that as "disabled" APICs still respond to
         * NMIs.
         */

Consider below case:

    CPU 0					CPU 1

    native_stop_other_cpus()			stop_this_cpu()

	// sends REBOOT IPI to stop remote CPUs
						...
						wbinvd();

	// wait timesout, try NMI
	if (!cpumask_empty(&cpus_stop_mask)) {
		for_each_cpu(cpu, &cpus_stop_mask) {

						...
						cpumask_clear_cpu(cpu,
							&cpus_stop_mask);
						hlt;

			// send NMI 	--->
						wakeup from hlt and run
						stop_this_cpu():

			// WAIT CPUs TO STOP
			while (!cpumask_empty(
			    &cpus_stop_mask) &&
				...)
		}
						...
		proceed ...			wbinvd();
						...
						hlt;

The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs are
stopped, but actually it quits immediately because the remote CPUs have
been cleared in cpus_stop_mask when stop_this_cpu() is called from the
REBOOT IPI.

Doing WBINVD in stop_this_cpu() could potentially increase the chance to
trigger the above "race" despite it's still rare to happen.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/include/asm/kexec.h         |  2 +-
 arch/x86/include/asm/processor.h     |  2 ++
 arch/x86/kernel/cpu/amd.c            | 16 ++++++++++++++++
 arch/x86/kernel/machine_kexec_64.c   | 15 ++++++++++-----
 arch/x86/kernel/process.c            | 16 +++-------------
 arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
 6 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index f2ad77929d6e..d7e93522b93d 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -122,7 +122,7 @@ relocate_kernel_fn(unsigned long indirection_page,
 		   unsigned long pa_control_page,
 		   unsigned long start_address,
 		   unsigned int preserve_context,
-		   unsigned int host_mem_enc_active);
+		   unsigned int cache_incoherent);
 #endif
 extern relocate_kernel_fn relocate_kernel;
 #define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index bde58f6510ac..a24c7805acdb 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
 void microcode_check(struct cpuinfo_x86 *prev_info);
 void store_cpu_caps(struct cpuinfo_x86 *info);
 
+DECLARE_PER_CPU(bool, cache_state_incoherent);
+
 enum l1tf_mitigations {
 	L1TF_MITIGATION_OFF,
 	L1TF_MITIGATION_AUTO,
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index f18f540db58c..4c7fde344216 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -503,6 +503,22 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
 {
 	u64 msr;
 
+	/*
+	 * Mark using wbinvd is needed during kexec on processors that
+	 * support SME. This provides support for performing a successful
+	 * kexec when going from SME inactive to SME active (or vice-versa).
+	 *
+	 * The cache must be cleared so that if there are entries with the
+	 * same physical address, both with and without the encryption bit,
+	 * they don't race each other when flushed and potentially end up
+	 * with the wrong entry being committed to memory.
+	 *
+	 * Test the CPUID bit directly because the machine might've cleared
+	 * X86_FEATURE_SME due to cmdline options.
+	 */
+	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+		__this_cpu_write(cache_state_incoherent, true);
+
 	/*
 	 * BIOS support is required for SME and SEV.
 	 *   For SME: If BIOS has enabled SME then adjust x86_phys_bits by
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 697fb99406e6..4519c7b75c49 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -29,6 +29,7 @@
 #include <asm/set_memory.h>
 #include <asm/cpu.h>
 #include <asm/efi.h>
+#include <asm/processor.h>
 
 #ifdef CONFIG_ACPI
 /*
@@ -384,15 +385,15 @@ void __nocfi machine_kexec(struct kimage *image)
 {
 	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
 	relocate_kernel_fn *relocate_kernel_ptr;
-	unsigned int host_mem_enc_active;
+	unsigned int cache_incoherent;
 	int save_ftrace_enabled;
 	void *control_page;
 
 	/*
-	 * This must be done before load_segments() since if call depth tracking
-	 * is used then GS must be valid to make any function calls.
+	 * This must be done before load_segments(), since it resets
+	 * GS to 0 and percpu data needs the correct GS to work.
 	 */
-	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
+	cache_incoherent = this_cpu_read(cache_state_incoherent);
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
@@ -436,6 +437,10 @@ void __nocfi machine_kexec(struct kimage *image)
 	 *
 	 * Take advantage of this here by force loading the segments,
 	 * before the GDT is zapped with an invalid value.
+	 *
+	 * load_segments() resets GS to 0.  Don't make any function call
+	 * after here since call depth tracking uses percpu variables to
+	 * operate (relocate_kernel() is explicitly ignored by call depth
 	 */
 	load_segments();
 
@@ -444,7 +449,7 @@ void __nocfi machine_kexec(struct kimage *image)
 					   virt_to_phys(control_page),
 					   image->start,
 					   image->preserve_context,
-					   host_mem_enc_active);
+					   cache_incoherent);
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 7b94851bb37e..6b5edfbefa9a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -88,6 +88,8 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
 DEFINE_PER_CPU(bool, __tss_limit_invalid);
 EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);
 
+DEFINE_PER_CPU(bool, cache_state_incoherent);
+
 /*
  * this gets called so that we can store lazy state into memory and copy the
  * current task into the new thread.
@@ -827,19 +829,7 @@ void __noreturn stop_this_cpu(void *dummy)
 	disable_local_APIC();
 	mcheck_cpu_clear(c);
 
-	/*
-	 * Use wbinvd on processors that support SME. This provides support
-	 * for performing a successful kexec when going from SME inactive
-	 * to SME active (or vice-versa). The cache must be cleared so that
-	 * if there are entries with the same physical address, both with and
-	 * without the encryption bit, they don't race each other when flushed
-	 * and potentially end up with the wrong entry being committed to
-	 * memory.
-	 *
-	 * Test the CPUID bit directly because the machine might've cleared
-	 * X86_FEATURE_SME due to cmdline options.
-	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if (this_cpu_read(cache_state_incoherent))
 		wbinvd();
 
 	/*
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index ea604f4d0b52..34b3a5e4fe49 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -67,7 +67,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
 	 * %rsi pa_control_page
 	 * %rdx start address
 	 * %rcx preserve_context
-	 * %r8  host_mem_enc_active
+	 * %r8  cache_incoherent
 	 */
 
 	/* Save the CPU context, used for jumping back */
@@ -129,7 +129,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	/*
 	 * %rdi	indirection page
 	 * %rdx start address
-	 * %r8 host_mem_enc_active
+	 * %r8 cache_incoherent
 	 * %r9 page table page
 	 * %r11 preserve_context
 	 * %r13 original CR4 when relocate_kernel() was invoked
@@ -200,14 +200,21 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	movq	%r9, %cr3
 
 	/*
+	 * If the memory cache is in incoherent state, e.g., due to
+	 * memory encryption, do wbinvd to flush cache.
+	 *
 	 * If SME is active, there could be old encrypted cache line
 	 * entries that will conflict with the now unencrypted memory
 	 * used by kexec. Flush the caches before copying the kernel.
+	 *
+	 * Note SME sets this flag to true when the platform supports
+	 * SME, so the wbinvd is performed even SME is not activated
+	 * by the kernel.  But this has no harm.
 	 */
 	testq	%r8, %r8
-	jz .Lsme_off
+	jz .Lnowbinvd
 	wbinvd
-.Lsme_off:
+.Lnowbinvd:
 
 	call	swap_pages
 
-- 
2.49.0

Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Borislav Petkov 3 months, 1 week ago

On Thu, Jun 26, 2025 at 10:48:47PM +1200, Kai Huang wrote:

...

> Doing WBINVD in stop_this_cpu() could potentially increase the chance to
> trigger the above "race" despite it's still rare to happen.

Oh the amount of text... 

Please run it and all your comments through AI to simplify formulations etc.
It is a lot to read.

> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/include/asm/kexec.h         |  2 +-
>  arch/x86/include/asm/processor.h     |  2 ++
>  arch/x86/kernel/cpu/amd.c            | 16 ++++++++++++++++
>  arch/x86/kernel/machine_kexec_64.c   | 15 ++++++++++-----
>  arch/x86/kernel/process.c            | 16 +++-------------
>  arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
>  6 files changed, 43 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index f2ad77929d6e..d7e93522b93d 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -122,7 +122,7 @@ relocate_kernel_fn(unsigned long indirection_page,
>  		   unsigned long pa_control_page,
>  		   unsigned long start_address,
>  		   unsigned int preserve_context,
> -		   unsigned int host_mem_enc_active);
> +		   unsigned int cache_incoherent);

So preserve_context and cache_incoherent are both a *single* bit of
information. And we use two u32s for that?!?!

How about flags please?

>  #endif
>  extern relocate_kernel_fn relocate_kernel;
>  #define ARCH_HAS_KIMAGE_ARCH
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index bde58f6510ac..a24c7805acdb 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
>  void microcode_check(struct cpuinfo_x86 *prev_info);
>  void store_cpu_caps(struct cpuinfo_x86 *info);
>  

So much text above - not a single comment here explaining what this var is
for.

> +DECLARE_PER_CPU(bool, cache_state_incoherent);
> +
>  enum l1tf_mitigations {
>  	L1TF_MITIGATION_OFF,
>  	L1TF_MITIGATION_AUTO,
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index f18f540db58c..4c7fde344216 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -503,6 +503,22 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>  {
>  	u64 msr;
>  
> +	/*
> +	 * Mark using wbinvd is needed during kexec on processors that

For all text: write insns in caps pls - WBINVD.

> +	 * support SME. This provides support for performing a successful
> +	 * kexec when going from SME inactive to SME active (or vice-versa).
> +	 *
> +	 * The cache must be cleared so that if there are entries with the
> +	 * same physical address, both with and without the encryption bit,
> +	 * they don't race each other when flushed and potentially end up
> +	 * with the wrong entry being committed to memory.
> +	 *
> +	 * Test the CPUID bit directly because the machine might've cleared
> +	 * X86_FEATURE_SME due to cmdline options.

Where?

That same function does the clearing later...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Huang, Kai 3 months, 1 week ago

On Sat, 2025-06-28 at 14:50 +0200, Borislav Petkov wrote:
> On Thu, Jun 26, 2025 at 10:48:47PM +1200, Kai Huang wrote:
> 
> ...
> 
> > Doing WBINVD in stop_this_cpu() could potentially increase the chance to
> > trigger the above "race" despite it's still rare to happen.
> 
> Oh the amount of text... 
> 
> Please run it and all your comments through AI to simplify formulations etc.
> It is a lot to read.

Hi Boris,

Thanks for the comments.

Yeah I agree the text can be improved.  I tried to run AI to simplify but so
far I am not quite happy about the result yet.  I'll try more.

> 
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >  arch/x86/include/asm/kexec.h         |  2 +-
> >  arch/x86/include/asm/processor.h     |  2 ++
> >  arch/x86/kernel/cpu/amd.c            | 16 ++++++++++++++++
> >  arch/x86/kernel/machine_kexec_64.c   | 15 ++++++++++-----
> >  arch/x86/kernel/process.c            | 16 +++-------------
> >  arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
> >  6 files changed, 43 insertions(+), 23 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> > index f2ad77929d6e..d7e93522b93d 100644
> > --- a/arch/x86/include/asm/kexec.h
> > +++ b/arch/x86/include/asm/kexec.h
> > @@ -122,7 +122,7 @@ relocate_kernel_fn(unsigned long indirection_page,
> >  		   unsigned long pa_control_page,
> >  		   unsigned long start_address,
> >  		   unsigned int preserve_context,
> > -		   unsigned int host_mem_enc_active);
> > +		   unsigned int cache_incoherent);
> 
> So preserve_context and cache_incoherent are both a *single* bit of
> information. And we use two u32s for that?!?!
> 
> How about flags please?

Yeah I agree a single u32 + flags is better.  However this is the problem in
the existing code (this patch only does renaming).

I think I can come up with a patch to clean this up and put it as the first
patch of this series, or I can do a patch to clean this up after this series
(either together with this series, or separately at a later time).  Which
way do you prefer?


> 
> >  #endif
> >  extern relocate_kernel_fn relocate_kernel;
> >  #define ARCH_HAS_KIMAGE_ARCH
> > diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> > index bde58f6510ac..a24c7805acdb 100644
> > --- a/arch/x86/include/asm/processor.h
> > +++ b/arch/x86/include/asm/processor.h
> > @@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
> >  void microcode_check(struct cpuinfo_x86 *prev_info);
> >  void store_cpu_caps(struct cpuinfo_x86 *info);
> >  
> 
> So much text above - not a single comment here explaining what this var is
> for.

Agreed a comment is needed.  How about below?

  /*
   * The cache may be in an incoherent state (e.g., due to memory 
   * encryption) and needs flushing during kexec.
   */

> 
> > +DECLARE_PER_CPU(bool, cache_state_incoherent);
> > +
> >  enum l1tf_mitigations {
> >  	L1TF_MITIGATION_OFF,
> >  	L1TF_MITIGATION_AUTO,
> > diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> > index f18f540db58c..4c7fde344216 100644
> > --- a/arch/x86/kernel/cpu/amd.c
> > +++ b/arch/x86/kernel/cpu/amd.c
> > @@ -503,6 +503,22 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
> >  {
> >  	u64 msr;
> >  
> > +	/*
> > +	 * Mark using wbinvd is needed during kexec on processors that
> 
> For all text: write insns in caps pls - WBINVD.

Will do.

> 
> > +	 * support SME. This provides support for performing a successful
> > +	 * kexec when going from SME inactive to SME active (or vice-versa).
> > +	 *
> > +	 * The cache must be cleared so that if there are entries with the
> > +	 * same physical address, both with and without the encryption bit,
> > +	 * they don't race each other when flushed and potentially end up
> > +	 * with the wrong entry being committed to memory.
> > +	 *
> > +	 * Test the CPUID bit directly because the machine might've cleared
> > +	 * X86_FEATURE_SME due to cmdline options.
> 
> Where?
> 
> That same function does the clearing later...

IIUC the X86_FEATURE_SME could be cleared via 'clearcpuid' kernel cmdline.

Please also see my reply to Tom.

Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Borislav Petkov 3 months, 1 week ago

On Mon, Jun 30, 2025 at 11:34:34AM +0000, Huang, Kai wrote:
> Yeah I agree the text can be improved.  I tried to run AI to simplify but so
> far I am not quite happy about the result yet.  I'll try more.

Ask it to simplify it. I get it that you want to be exhaustive in your commit
message but there really is such thing as too much text.

Think of it this way: is the text I'm writing optimal when anyone needs to
read it in the future to know why a change has been done. If not, try to make
it so.

> Yeah I agree a single u32 + flags is better.  However this is the problem in
> the existing code (this patch only does renaming).
> 
> I think I can come up with a patch to clean this up and put it as the first
> patch of this series, or I can do a patch to clean this up after this series
> (either together with this series, or separately at a later time).  Which
> way do you prefer?

Clean ups go first, so yeah, please do a cleanup pre-patch.

>   /*
>    * The cache may be in an incoherent state (e.g., due to memory 
>    * encryption) and needs flushing during kexec.
>    */

Better than nothing. I'd try to explain with 1-2 sentences what can happen due
to memory encryption and why cache invalidation is required. So that the
comment is standalone and is not sending you on a wild goose chasing ride.

> IIUC the X86_FEATURE_SME could be cleared via 'clearcpuid' kernel cmdline.
> 
> Please also see my reply to Tom.

I know but we have never said that clearcpuid= should be used in production.
If you break the kernel using it, you get to keep the pieces. clearcpuid=
taints the kernel and screams bloody murder. So I'm not too worried about
that.

What is more relevant is this:

"I did verify that booting with mem_encrypt=off will start with
X86_FEATURE_SME set, the BSP will clear it and then all APs will not see
it set after that."

which should be put there in the comment.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Huang, Kai 3 months, 1 week ago

On Tue, 2025-07-01 at 14:12 +0200, Borislav Petkov wrote:
> On Mon, Jun 30, 2025 at 11:34:34AM +0000, Huang, Kai wrote:
> > Yeah I agree the text can be improved.  I tried to run AI to simplify but so
> > far I am not quite happy about the result yet.  I'll try more.
> 
> Ask it to simplify it. I get it that you want to be exhaustive in your commit
> message but there really is such thing as too much text.
> 
> Think of it this way: is the text I'm writing optimal when anyone needs to
> read it in the future to know why a change has been done. If not, try to make
> it so.

Yeah this is a good point.  Thanks for that.

After working with AI I came up with below [*].  Basically I:

 - Added a "TL:DR" for quick read, and broke the text into different
sections (e.g., "Background" ..) so it can be read more easily;
 - Removed some trivial things (e.g., the paragraph to explain a comment
change around 'call depth tracking');
 - Simplified the "race" description;
 - Simplified some wording to make sentence shorter.

I appreciate if you can help to see whether it is OK.

Btw, if we remove the "race" description, it could be trimmed down to half,
but I thought it could be still valuable there just in case someone wants to
read it years later.

> 
> > Yeah I agree a single u32 + flags is better.  However this is the problem in
> > the existing code (this patch only does renaming).
> > 
> > I think I can come up with a patch to clean this up and put it as the first
> > patch of this series, or I can do a patch to clean this up after this series
> > (either together with this series, or separately at a later time).  Which
> > way do you prefer?
> 
> Clean ups go first, so yeah, please do a cleanup pre-patch.

Sure will do.  Thanks.

> 
> >   /*
> >    * The cache may be in an incoherent state (e.g., due to memory 
> >    * encryption) and needs flushing during kexec.
> >    */
> 
> Better than nothing. I'd try to explain with 1-2 sentences what can happen due
> to memory encryption and why cache invalidation is required. So that the
> comment is standalone and is not sending you on a wild goose chasing ride.

Yeah agree with the comment being standalone.  How about below?

/*
 * The cache may be in an incoherent state and needs flushing during kexec.
 * E.g., on SME/TDX platforms, dirty cacheline aliases with and without
 * encryption bit(s) can coexist and the cache needs to be flushed before
 * booting to the new kernel to avoid the silent memory corruption due to
 * dirty cachelines with different encryption property being written back
 * to the memory.
 */

> 
> > IIUC the X86_FEATURE_SME could be cleared via 'clearcpuid' kernel cmdline.
> > 
> > Please also see my reply to Tom.
> 
> I know but we have never said that clearcpuid= should be used in production.

Thanks for the info.

> If you break the kernel using it, you get to keep the pieces. clearcpuid=
> taints the kernel and screams bloody murder. So I'm not too worried about
> that.
> 
> What is more relevant is this:
> 
> "I did verify that booting with mem_encrypt=off will start with
> X86_FEATURE_SME set, the BSP will clear it and then all APs will not see
> it set after that."
> 
> which should be put there in the comment.

Yeah agreed.  I read the code again and yeah Tom is right :-)

I didn't want to end up with rewriting the original comment so I came up
with below:

        /*                                              
         * Mark using WBINVD is needed during kexec on processors that     
         * support SME. This provides support for performing a successful
         * kexec when going from SME inactive to SME active (or vice-versa).
         *                                                        
         * The cache must be cleared so that if there are entries with the 
         * same physical address, both with and without the encryption bit,
         * they don't race each other when flushed and potentially end up
         * with the wrong entry being committed to memory.                 
         *                                                              
         * Test the CPUID bit directly because with mem_encrypt=off the    
         * BSP will clear the X86_FEATURE_SME bit and the APs will not     
         * see it set after that.                                      
         */

Does this look good to you?

[*] The updated changelog:

TL;DR:

Use a percpu boolean to control whether to perform WBINVD to unify cache
flushing that is needed during kexec due to memory encryption for both
SME and TDX, and switch SME to use the boolean.

--- Background ---

On SME platforms, dirty cacheline aliases with and without encryption
bit can coexist, and the CPU can flush them back to memory in random
order.  During kexec, the caches must be flushed before jumping to the
new kernel otherwise the dirty cachelines could silently corrupt the
memory used by the new kernel due to different encryption property.

TDX also needs a cache flush during kexec for the same reason.  It would
be good to have a generic way to flush the cache instead of scattering
checks for each feature all around.

When SME is enabled, the kernel basically encrypts all memory including
the kernel itself and a simple memory write from the kernel could dirty
cachelines.  Currently, the kernel uses WBINVD to flush the cache for
SME during kexec in two places:

1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
   stops them;
2) the one in the relocate_kernel() where the kexec-ing CPU jumps to the
   new kernel.

--- Solution ---

Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
SEAMCALLs are performed).  Since there are no more SEAMCALLs after the
aforementioned WBINVDs, leverage this for TDX.

To unify the approach for SME and TDX, use a percpu boolean to indicate
the cache may be in an incoherent state and needs flushing during kexec,
and set the boolean for SME.  TDX can then leverage it.

While SME could use a global flag (since it's enabled at early boot and
enabled on all CPUs), the percpu flag fits TDX better:

The percpu flag can be set when a CPU makes a SEAMCALL, and cleared when
another WBINVD on the CPU obviates the need for a kexec-time WBINVD.
Saving kexec-time WBINVD is valuable, because there is an existing
race[*] where kexec could proceed while another CPU is active.  WBINVD
could make this race worse, so it's worth skipping it when possible.

--- Side effect to SME ---

Today the first WBINVD in the stop_this_cpu() is performed when SME is
*supported* by the platform, and the second WBINVD is done in
relocate_kernel() when SME is *activated* by the kernel.  Make things
simple by changing to do the second WBINVD when the platform supports
SME.  This allows the kernel to simply turn on this percpu boolean when
bringing up a CPU by checking whether the platform supports SME.

No other functional change intended.

--- More Read ---

[*] The "race" in native_stop_other_cpus()

Commit

  1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")

introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups on
poweroff" issue which was caused by the WBINVD in stop_this_cpu().
Specifically, the new cpumask resolved the below problem mentioned in
that commit:

    CPU0                                    CPU1

     stop_other_cpus()
       send_IPIs(REBOOT);                   stop_this_cpu()
       while (num_online_cpus() > 1);         set_online(false);
       proceed... -> hang
                                              wbinvd()

While it fixed the reported issue, that commit explained the new cpumask
"cannot plug all holes either", and there's a "race" could still happen:

    CPU 0                                       CPU 1

    native_stop_other_cpus()                    stop_this_cpu()

        // sends REBOOT IPI to stop remote CPUs
                                                ...
                                                wbinvd();

        // wait timesout, try NMI
        if (!cpumask_empty(&cpus_stop_mask)) {
                for_each_cpu(cpu, &cpus_stop_mask) {

                                                ...
                                                cpumask_clear_cpu(cpu,
                                                        &cpus_stop_mask);
                                                hlt;

                        // send NMI     --->
                                                wakeup from hlt and run
                                                stop_this_cpu():

                        // WAIT CPUs TO STOP
                        while (!cpumask_empty(
                            &cpus_stop_mask) &&
                                ...)
                }
                                                ...
                proceed ...                     wbinvd();
                                                ...
                                                hlt;

The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs are
stopped, but actually it quits immediately because the remote CPUs have
been cleared in cpus_stop_mask when stop_this_cpu() is called from the
REBOOT IPI.

Doing WBINVD in stop_this_cpu() could potentially increase the chance to
trigger the above "race" despite it's still rare to happen.

Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Tom Lendacky 3 months, 1 week ago

On 6/28/25 07:50, Borislav Petkov wrote:
> On Thu, Jun 26, 2025 at 10:48:47PM +1200, Kai Huang wrote:
> 

>> +	 * support SME. This provides support for performing a successful
>> +	 * kexec when going from SME inactive to SME active (or vice-versa).
>> +	 *
>> +	 * The cache must be cleared so that if there are entries with the
>> +	 * same physical address, both with and without the encryption bit,
>> +	 * they don't race each other when flushed and potentially end up
>> +	 * with the wrong entry being committed to memory.
>> +	 *
>> +	 * Test the CPUID bit directly because the machine might've cleared
>> +	 * X86_FEATURE_SME due to cmdline options.
> 
> Where?
> 
> That same function does the clearing later...

I think he means that if this function does clear X86_FEATURE_SME during
the BSP boot, then when the APs boot they won't see the feature set, so
you have to check the CPUID information directly. So maybe that can better
worded.

I did verify that booting with mem_encrypt=off will start with
X86_FEATURE_SME set, the BSP will clear it and then all APs will not see
it set after that.

Thanks,
Tom

>

Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Huang, Kai 3 months, 1 week ago

On Sat, 2025-06-28 at 12:04 -0500, Tom Lendacky wrote:
> On 6/28/25 07:50, Borislav Petkov wrote:
> > On Thu, Jun 26, 2025 at 10:48:47PM +1200, Kai Huang wrote:
> > 
> 
> > > +	 * support SME. This provides support for performing a successful
> > > +	 * kexec when going from SME inactive to SME active (or vice-versa).
> > > +	 *
> > > +	 * The cache must be cleared so that if there are entries with the
> > > +	 * same physical address, both with and without the encryption bit,
> > > +	 * they don't race each other when flushed and potentially end up
> > > +	 * with the wrong entry being committed to memory.
> > > +	 *
> > > +	 * Test the CPUID bit directly because the machine might've cleared
> > > +	 * X86_FEATURE_SME due to cmdline options.
> > 
> > Where?
> > 
> > That same function does the clearing later...
> 
> I think he means that if this function does clear X86_FEATURE_SME during
> the BSP boot, then when the APs boot they won't see the feature set, so
> you have to check the CPUID information directly. So maybe that can better
> worded.
> 
> I did verify that booting with mem_encrypt=off will start with
> X86_FEATURE_SME set, the BSP will clear it and then all APs will not see
> it set after that.

I think I actually mean it could be cleared by 'clearcpuid' commandline. :-)

IIUC the AP taking out feature bits that are not set in BSP happens at the
end of identify_cpu(), so it happens after early_detect_mem_encrypt(), which
is called in init_amd().

So the mem_encrypt=off will eventually clear the X86_FEATURE_SME, but when 
early_detect_mem_encrypt() for BSP and AP, IIRC it will still see the
X86_FEATURE_SME bit.

But IIUC the 'clearcpuid' commandline could still just make
early_detect_mem_encrypt() unable to see X86_FEATURE_SME bit on the SME
capable platform.

Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Tom Lendacky 3 months, 1 week ago

On 6/26/25 05:48, Kai Huang wrote:
> On SME platforms, at hardware level dirty cachelines with and without
> encryption bit of the same memory can coexist, and the CPU can flush
> them back to memory in random order.  During kexec, the caches must be
> flushed before jumping to the new kernel to avoid silent memory
> corruption when a cacheline with a different encryption property is
> written back over whatever encryption properties the new kernel is
> using.
> 
> TDX also needs cache flush during kexec for the same reason.  It would
> be good to implement a generic way to flush cache instead of scattering
> checks for each feature all around.
> 
> During kexec, the kexec-ing CPU sends IPIs to all remote CPUs to stop
> them before it boots to the new kernel.  For SME the kernel basically
> encrypts all memory including the kernel itself by manipulating page
> tables.  A simple memory write from the kernel could dirty cachelines.
> 
> Currently, the kernel uses WBINVD to flush cache for SME during kexec in
> two places:
> 
> 1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
>    stops them;
> 2) the one in the relocate_kernel() where the kexec-ing CPU jumps to the
>    new kernel.
> 
> Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
> SEAMCALLs are performed).  Since there are no more SEAMCALLs after the
> aforementioned WBINVDs, leverage this for TDX.
> 
> In order to have a generic way to cover both SME and TDX, use a percpu
> boolean to indicate the cache may be in an incoherent state thus cache
> flush is needed during kexec, and turn on the boolean for SME.  TDX can
> then leverage it by also turning the boolean on.
> 
> A percpu boolean isn't strictly needed for SME since it is activated at
> very early boot time and on all CPUs.  A global flag would be
> sufficient.  But using a percpu variable has two benefits.  Foremost,
> the state that is being tracked here (percpu cache coherency situation
> requiring a flush) is percpu, so a percpu state is a more direct and
> natural fit.
> 
> Secondly, it will fit TDX's usage better since the percpu var can be set
> when a CPU makes a SEAMCALL, and cleared when another WBINVD on the CPU
> obviates the need for a kexec-time WBINVD.  Saving kexec-time WBINVD is
> valuable, because there is an existing race[*] where kexec could proceed
> while another CPU is active.  WBINVD could make this race worse, so it's
> worth skipping it when possible.
> 
> Today the first WBINVD in the stop_this_cpu() is performed when SME is
> *supported* by the platform, and the second WBINVD is done in
> relocate_kernel() when SME is *activated* by the kernel.  Make things
> simple by changing to do the second WBINVD when the platform supports
> SME.  This allows the kernel to simply turn on this percpu boolean when
> bringing up a CPU by checking whether the platform supports SME.
> 
> No other functional change intended.
> 
> Also, currently machine_check() has a comment to explain why no function
> call is allowed after load_segments().  After changing to use the new
> percpu boolean to control whether to perform WBINVD when calling the
> relocate_kernel(), that comment isn't needed anymore.  But it is still a
> useful comment, so expand the comment around load_segments() to mention
> that due to depth tracking no function call can be made after
> load_segments().
> 
> [*] The "race" in native_stop_other_cpus()
> 
> Commit
> 
>   1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")
> 
> introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups on
> poweroff" issue which was caused by the WBINVD in stop_this_cpu().
> Specifically, the new cpumask resolved the below problem mentioned in
> that commit:
> 
>     CPU0                                    CPU1
> 
>      stop_other_cpus()
>        send_IPIs(REBOOT);                   stop_this_cpu()
>        while (num_online_cpus() > 1);         set_online(false);
>        proceed... -> hang
>                                               wbinvd()
> 
> While it fixed the reported issue, that commit explained the new cpumask
> "cannot plug all holes either".
> 
> This is because it doesn't address the "race" mentioned in the #3 in the
> comment in native_stop_other_cpus():
> 
>         /*
>          * 1) Send an IPI on the reboot vector to all other CPUs.
>          *
>          *    The other CPUs should react on it after leaving critical
>          *    sections and re-enabling interrupts. They might still hold
>          *    locks, but there is nothing which can be done about that.
>          *
>          * 2) Wait for all other CPUs to report that they reached the
>          *    HLT loop in stop_this_cpu()
>          *
>          * 3) If #2 timed out send an NMI to the CPUs which did not
>          *    yet report
>          *
>          * 4) Wait for all other CPUs to report that they reached the
>          *    HLT loop in stop_this_cpu()
>          *
>          * #3 can obviously race against a CPU reaching the HLT loop late.
>          * That CPU will have reported already and the "have all CPUs
>          * reached HLT" condition will be true despite the fact that the
>          * other CPU is still handling the NMI. Again, there is no
>          * protection against that as "disabled" APICs still respond to
>          * NMIs.
>          */
> 
> Consider below case:
> 
>     CPU 0					CPU 1
> 
>     native_stop_other_cpus()			stop_this_cpu()
> 
> 	// sends REBOOT IPI to stop remote CPUs
> 						...
> 						wbinvd();
> 
> 	// wait timesout, try NMI
> 	if (!cpumask_empty(&cpus_stop_mask)) {
> 		for_each_cpu(cpu, &cpus_stop_mask) {
> 
> 						...
> 						cpumask_clear_cpu(cpu,
> 							&cpus_stop_mask);
> 						hlt;
> 
> 			// send NMI 	--->
> 						wakeup from hlt and run
> 						stop_this_cpu():
> 
> 			// WAIT CPUs TO STOP
> 			while (!cpumask_empty(
> 			    &cpus_stop_mask) &&
> 				...)
> 		}
> 						...
> 		proceed ...			wbinvd();
> 						...
> 						hlt;
> 
> The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs are
> stopped, but actually it quits immediately because the remote CPUs have
> been cleared in cpus_stop_mask when stop_this_cpu() is called from the
> REBOOT IPI.
> 
> Doing WBINVD in stop_this_cpu() could potentially increase the chance to
> trigger the above "race" despite it's still rare to happen.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>

> ---
>  arch/x86/include/asm/kexec.h         |  2 +-
>  arch/x86/include/asm/processor.h     |  2 ++
>  arch/x86/kernel/cpu/amd.c            | 16 ++++++++++++++++
>  arch/x86/kernel/machine_kexec_64.c   | 15 ++++++++++-----
>  arch/x86/kernel/process.c            | 16 +++-------------
>  arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
>  6 files changed, 43 insertions(+), 23 deletions(-)
>

Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Huang, Kai 3 months, 1 week ago

On Fri, 2025-06-27 at 10:08 -0500, Tom Lendacky wrote:
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> 
> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> Tested-by: Tom Lendacky <thomas.lendacky@amd.com>

Thanks Tom!!

Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Edgecombe, Rick P 3 months, 1 week ago

On Thu, 2025-06-26 at 22:48 +1200, Kai Huang wrote:
> On SME platforms, at hardware level dirty cachelines with and without
> encryption bit of the same memory can coexist, and the CPU can flush
> them back to memory in random order.
> 

Nit: It's a bit of a run-on sentence. I know we struggled with this one. But I
think this might be easier:

  On SME platforms, dirty cacheline aliases with and without encryption bits
  can coexist, and the CPU can flush them back to memory in random order.

>   During kexec, the caches must be
> flushed before jumping to the new kernel to avoid silent memory
> corruption when a cacheline with a different encryption property is
> written back over whatever encryption properties the new kernel is
> using.
> 
> TDX also needs cache flush during kexec for the same reason.  It would
                ^ a
> be good to implement a generic way to flush cache instead of scattering
                                             ^ the
> checks for each feature all around.
> 
> During kexec, the kexec-ing CPU sends IPIs to all remote CPUs to stop
> them before it boots to the new kernel.  For SME the kernel basically
> encrypts all memory including the kernel itself by manipulating page
> tables.  A simple memory write from the kernel could dirty cachelines.
> 
> Currently, the kernel uses WBINVD to flush cache for SME during kexec in
                                             ^the
> two places:
> 
> 1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
>    stops them;
> 2) the one in the relocate_kernel() where the kexec-ing CPU jumps to the
>    new kernel.
> 
> Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
> SEAMCALLs are performed).  Since there are no more SEAMCALLs after the
> aforementioned WBINVDs, leverage this for TDX.
> 
> In order to have a generic way to cover both SME and TDX, use a percpu
> boolean to indicate the cache may be in an incoherent state thus cache
> flush is needed during kexec, and turn on the boolean for SME.  TDX can
> then leverage it by also turning the boolean on.
> 
> A percpu boolean isn't strictly needed for SME since it is activated at
> very early boot time and on all CPUs.  A global flag would be
> sufficient.  But using a percpu variable has two benefits.  Foremost,
> the state that is being tracked here (percpu cache coherency situation
> requiring a flush) is percpu, so a percpu state is a more direct and
> natural fit.
> 
> Secondly, it will fit TDX's usage better since the percpu var can be set
> when a CPU makes a SEAMCALL, and cleared when another WBINVD on the CPU
> obviates the need for a kexec-time WBINVD.  Saving kexec-time WBINVD is
> valuable, because there is an existing race[*] where kexec could proceed
> while another CPU is active.  WBINVD could make this race worse, so it's
> worth skipping it when possible.
> 
> Today the first WBINVD in the stop_this_cpu() is performed when SME is
> *supported* by the platform, and the second WBINVD is done in
> relocate_kernel() when SME is *activated* by the kernel.  Make things
> simple by changing to do the second WBINVD when the platform supports
> SME.  This allows the kernel to simply turn on this percpu boolean when
> bringing up a CPU by checking whether the platform supports SME.

Since the race is related to stop_this_cpu() it doesn't affect it. But it does
mean that the bring up CPU may not flush the cache if takes a kdump kexec before
the per-cpu var is set? Of course the existing logic doesn't trigger until
cpuinfo_x86 is populated. What is the consequence?

Arguably the supported/enabled part could be moved to a separate earlier patch.
The code change would just get immediately replaced, but the benefit would be
that a bisect would show which part of the change was responsible.

> 
> No other functional change intended.
> 
> Also, currently machine_check() has a comment to explain why no function

Nittiest of the nits: I would move this section above "No function..." and drop
the "also". This is the same old "notes" critique I always make. Just when you
think you have got all the details and start to process, it asks you to update
your model.

> call is allowed after load_segments().  After changing to use the new
> percpu boolean to control whether to perform WBINVD when calling the
> relocate_kernel(), that comment isn't needed anymore.  But it is still a
> useful comment, so expand the comment around load_segments() to mention
> that due to depth tracking no function call can be made after
> load_segments().
> 
> [*] The "race" in native_stop_other_cpus()
> 
> Commit
> 
>   1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")
> 
> introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups on
> poweroff" issue which was caused by the WBINVD in stop_this_cpu().
> Specifically, the new cpumask resolved the below problem mentioned in
> that commit:
> 
>     CPU0                                    CPU1
> 
>      stop_other_cpus()
>        send_IPIs(REBOOT);                   stop_this_cpu()
>        while (num_online_cpus() > 1);         set_online(false);
>        proceed... -> hang
>                                               wbinvd()
> 
> While it fixed the reported issue, that commit explained the new cpumask
> "cannot plug all holes either".
> 
> This is because it doesn't address the "race" mentioned in the #3 in the
> comment in native_stop_other_cpus():
> 
>         /*
>          * 1) Send an IPI on the reboot vector to all other CPUs.
>          *
>          *    The other CPUs should react on it after leaving critical
>          *    sections and re-enabling interrupts. They might still hold
>          *    locks, but there is nothing which can be done about that.
>          *
>          * 2) Wait for all other CPUs to report that they reached the
>          *    HLT loop in stop_this_cpu()
>          *
>          * 3) If #2 timed out send an NMI to the CPUs which did not
>          *    yet report
>          *
>          * 4) Wait for all other CPUs to report that they reached the
>          *    HLT loop in stop_this_cpu()
>          *
>          * #3 can obviously race against a CPU reaching the HLT loop late.
>          * That CPU will have reported already and the "have all CPUs
>          * reached HLT" condition will be true despite the fact that the
>          * other CPU is still handling the NMI. Again, there is no
>          * protection against that as "disabled" APICs still respond to
>          * NMIs.
>          */
> 
> Consider below case:
> 
>     CPU 0					CPU 1
> 
>     native_stop_other_cpus()			stop_this_cpu()
> 
> 	// sends REBOOT IPI to stop remote CPUs
> 						...
> 						wbinvd();
> 
> 	// wait timesout, try NMI
> 	if (!cpumask_empty(&cpus_stop_mask)) {
> 		for_each_cpu(cpu, &cpus_stop_mask) {
> 
> 						...
> 						cpumask_clear_cpu(cpu,
> 							&cpus_stop_mask);
> 						hlt;
> 
> 			// send NMI 	--->
> 						wakeup from hlt and run
> 						stop_this_cpu():
> 
> 			// WAIT CPUs TO STOP
> 			while (!cpumask_empty(
> 			    &cpus_stop_mask) &&
> 				...)
> 		}
> 						...
> 		proceed ...			wbinvd();
> 						...
> 						hlt;
> 
> The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs are
> stopped, but actually it quits immediately because the remote CPUs have
> been cleared in cpus_stop_mask when stop_this_cpu() is called from the
> REBOOT IPI.
> 
> Doing WBINVD in stop_this_cpu() could potentially increase the chance to
> trigger the above "race" despite it's still rare to happen.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/include/asm/kexec.h         |  2 +-
>  arch/x86/include/asm/processor.h     |  2 ++
>  arch/x86/kernel/cpu/amd.c            | 16 ++++++++++++++++
>  arch/x86/kernel/machine_kexec_64.c   | 15 ++++++++++-----
>  arch/x86/kernel/process.c            | 16 +++-------------
>  arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
>  6 files changed, 43 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index f2ad77929d6e..d7e93522b93d 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -122,7 +122,7 @@ relocate_kernel_fn(unsigned long indirection_page,
>  		   unsigned long pa_control_page,
>  		   unsigned long start_address,
>  		   unsigned int preserve_context,
> -		   unsigned int host_mem_enc_active);
> +		   unsigned int cache_incoherent);
>  #endif
>  extern relocate_kernel_fn relocate_kernel;
>  #define ARCH_HAS_KIMAGE_ARCH
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index bde58f6510ac..a24c7805acdb 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
>  void microcode_check(struct cpuinfo_x86 *prev_info);
>  void store_cpu_caps(struct cpuinfo_x86 *info);
>  
> +DECLARE_PER_CPU(bool, cache_state_incoherent);
> +
>  enum l1tf_mitigations {
>  	L1TF_MITIGATION_OFF,
>  	L1TF_MITIGATION_AUTO,
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index f18f540db58c..4c7fde344216 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -503,6 +503,22 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>  {
>  	u64 msr;
>  
> +	/*
> +	 * Mark using wbinvd is needed during kexec on processors that
> +	 * support SME. This provides support for performing a successful
> +	 * kexec when going from SME inactive to SME active (or vice-versa).
> +	 *
> +	 * The cache must be cleared so that if there are entries with the
> +	 * same physical address, both with and without the encryption bit,
> +	 * they don't race each other when flushed and potentially end up
> +	 * with the wrong entry being committed to memory.
> +	 *
> +	 * Test the CPUID bit directly because the machine might've cleared
> +	 * X86_FEATURE_SME due to cmdline options.
> +	 */
> +	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> +		__this_cpu_write(cache_state_incoherent, true);
> +
>  	/*
>  	 * BIOS support is required for SME and SEV.
>  	 *   For SME: If BIOS has enabled SME then adjust x86_phys_bits by
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 697fb99406e6..4519c7b75c49 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -29,6 +29,7 @@
>  #include <asm/set_memory.h>
>  #include <asm/cpu.h>
>  #include <asm/efi.h>
> +#include <asm/processor.h>
>  
>  #ifdef CONFIG_ACPI
>  /*
> @@ -384,15 +385,15 @@ void __nocfi machine_kexec(struct kimage *image)
>  {
>  	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
>  	relocate_kernel_fn *relocate_kernel_ptr;
> -	unsigned int host_mem_enc_active;
> +	unsigned int cache_incoherent;
>  	int save_ftrace_enabled;
>  	void *control_page;
>  
>  	/*
> -	 * This must be done before load_segments() since if call depth tracking
> -	 * is used then GS must be valid to make any function calls.
> +	 * This must be done before load_segments(), since it resets
> +	 * GS to 0 and percpu data needs the correct GS to work.
>  	 */
> -	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> +	cache_incoherent = this_cpu_read(cache_state_incoherent);
>  
>  #ifdef CONFIG_KEXEC_JUMP
>  	if (image->preserve_context)
> @@ -436,6 +437,10 @@ void __nocfi machine_kexec(struct kimage *image)
>  	 *
>  	 * Take advantage of this here by force loading the segments,
>  	 * before the GDT is zapped with an invalid value.
> +	 *
> +	 * load_segments() resets GS to 0.  Don't make any function call
> +	 * after here since call depth tracking uses percpu variables to
> +	 * operate (relocate_kernel() is explicitly ignored by call depth
>  	 */
>  	load_segments();
>  
> @@ -444,7 +449,7 @@ void __nocfi machine_kexec(struct kimage *image)
>  					   virt_to_phys(control_page),
>  					   image->start,
>  					   image->preserve_context,
> -					   host_mem_enc_active);
> +					   cache_incoherent);
>  
>  #ifdef CONFIG_KEXEC_JUMP
>  	if (image->preserve_context)
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 7b94851bb37e..6b5edfbefa9a 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -88,6 +88,8 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
>  DEFINE_PER_CPU(bool, __tss_limit_invalid);
>  EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);
>  
> +DEFINE_PER_CPU(bool, cache_state_incoherent);
> +
>  /*
>   * this gets called so that we can store lazy state into memory and copy the
>   * current task into the new thread.
> @@ -827,19 +829,7 @@ void __noreturn stop_this_cpu(void *dummy)
>  	disable_local_APIC();
>  	mcheck_cpu_clear(c);
>  
> -	/*
> -	 * Use wbinvd on processors that support SME. This provides support
> -	 * for performing a successful kexec when going from SME inactive
> -	 * to SME active (or vice-versa). The cache must be cleared so that
> -	 * if there are entries with the same physical address, both with and
> -	 * without the encryption bit, they don't race each other when flushed
> -	 * and potentially end up with the wrong entry being committed to
> -	 * memory.
> -	 *
> -	 * Test the CPUID bit directly because the machine might've cleared
> -	 * X86_FEATURE_SME due to cmdline options.
> -	 */
> -	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> +	if (this_cpu_read(cache_state_incoherent))
>  		wbinvd();
>  
>  	/*
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index ea604f4d0b52..34b3a5e4fe49 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -67,7 +67,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
>  	 * %rsi pa_control_page
>  	 * %rdx start address
>  	 * %rcx preserve_context
> -	 * %r8  host_mem_enc_active
> +	 * %r8  cache_incoherent
>  	 */
>  
>  	/* Save the CPU context, used for jumping back */
> @@ -129,7 +129,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>  	/*
>  	 * %rdi	indirection page
>  	 * %rdx start address
> -	 * %r8 host_mem_enc_active
> +	 * %r8 cache_incoherent
>  	 * %r9 page table page
>  	 * %r11 preserve_context
>  	 * %r13 original CR4 when relocate_kernel() was invoked
> @@ -200,14 +200,21 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>  	movq	%r9, %cr3
>  
>  	/*
> +	 * If the memory cache is in incoherent state, e.g., due to
> +	 * memory encryption, do wbinvd to flush cache.
> +	 *
>  	 * If SME is active, there could be old encrypted cache line
>  	 * entries that will conflict with the now unencrypted memory
>  	 * used by kexec. Flush the caches before copying the kernel.
> +	 *
> +	 * Note SME sets this flag to true when the platform supports
> +	 * SME, so the wbinvd is performed even SME is not activated
> +	 * by the kernel.  But this has no harm.
>  	 */
>  	testq	%r8, %r8
> -	jz .Lsme_off
> +	jz .Lnowbinvd
>  	wbinvd
> -.Lsme_off:
> +.Lnowbinvd:
>  
>  	call	swap_pages
>

Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Huang, Kai 3 months, 1 week ago

(I'll address all wording comments, as always.)

> > Today the first WBINVD in the stop_this_cpu() is performed when SME is
> > *supported* by the platform, and the second WBINVD is done in
> > relocate_kernel() when SME is *activated* by the kernel.  Make things
> > simple by changing to do the second WBINVD when the platform supports
> > SME.  This allows the kernel to simply turn on this percpu boolean when
> > bringing up a CPU by checking whether the platform supports SME.
> 
> Since the race is related to stop_this_cpu() it doesn't affect it. But it does
> mean that the bring up CPU may not flush the cache if takes a kdump kexec before
> the per-cpu var is set? Of course the existing logic doesn't trigger until
> cpuinfo_x86 is populated. What is the consequence?

See another reply.

> 
> Arguably the supported/enabled part could be moved to a separate earlier patch.
> The code change would just get immediately replaced, but the benefit would be
> that a bisect would show which part of the change was responsible.

I am not a fan of splitting the new variable and the user into different
patches, as long as the patch isn't too big to review.  You need to review
them together anyway I think, so arguably putting them together is easier to
review.

Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Edgecombe, Rick P 3 months, 1 week ago

On Fri, 2025-06-27 at 00:37 +0000, Huang, Kai wrote:
> > Arguably the supported/enabled part could be moved to a separate earlier
> > patch.
> > The code change would just get immediately replaced, but the benefit would
> > be
> > that a bisect would show which part of the change was responsible.
> 
> I am not a fan of splitting the new variable and the user into different
> patches, as long as the patch isn't too big to review.  You need to review
> them together anyway I think, so arguably putting them together is easier to
> review.

How about if Tom thinks there is any risk, we can split them for bisectability
help. Otherwise reduce the churn.

Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Huang, Kai 3 months, 1 week ago

On Fri, 2025-06-27 at 00:39 +0000, Edgecombe, Rick P wrote:
> On Fri, 2025-06-27 at 00:37 +0000, Huang, Kai wrote:
> > > Arguably the supported/enabled part could be moved to a separate earlier
> > > patch.
> > > The code change would just get immediately replaced, but the benefit would
> > > be
> > > that a bisect would show which part of the change was responsible.
> > 
> > I am not a fan of splitting the new variable and the user into different
> > patches, as long as the patch isn't too big to review.  You need to review
> > them together anyway I think, so arguably putting them together is easier to
> > review.
> 
> How about if Tom thinks there is any risk, we can split them for bisectability
> help. Otherwise reduce the churn.

Sure.  But I don't understand how this can impact bisect?

Let's assume we have patch 1 to introduce the boolean w/o user, and patch 2
to modify SME code.  If there's any issue, the bisect will point to the
patch 2.  If we have one patch then the bisect will point to this one patch
too.  Is there any difference?

The downside of splitting into two patches is we need to make sure there's
no build warning when introducing a variable w/o user, for which we might
need to do additional things (e.g., it happens when adding a static function
w/o use).

Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Edgecombe, Rick P 3 months, 1 week ago

On Thu, 2025-06-26 at 10:59 -0700, Rick Edgecombe wrote:
> Since the race is related to stop_this_cpu() it doesn't affect it. But it does
> mean that the bring up CPU may not flush the cache if takes a kdump kexec before
> the per-cpu var is set? Of course the existing logic doesn't trigger until
> cpuinfo_x86 is populated. What is the consequence?

Oh, doh! This stuff all happens before kdump could load. Right?

Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Huang, Kai 3 months, 1 week ago

On Thu, 2025-06-26 at 18:42 +0000, Edgecombe, Rick P wrote:
> On Thu, 2025-06-26 at 10:59 -0700, Rick Edgecombe wrote:
> > Since the race is related to stop_this_cpu() it doesn't affect it. But it does
> > mean that the bring up CPU may not flush the cache if takes a kdump kexec before
> > the per-cpu var is set? Of course the existing logic doesn't trigger until
> > cpuinfo_x86 is populated. What is the consequence?
> 
> Oh, doh! This stuff all happens before kdump could load. Right?

I'll replace kdump with normal kexec in the above context, since kdump case
(crash_kexec()) uses a different path from normal kexec when stopping remote
CPUs.  And there's no WBINVD involved in the carsh dump code path for SME.

The reason I guess is reading encrypted memory from /proc/vmcore is
meaningless anyway therefore even memory corruption can happen it is fine.

Now for normal kexec:

For SME this boolean is never cleared once set when the CPU is brought up
for the first time.  So in practice yes when the kexec is performed the
boolean is already set for all CPUs.

But, you are right there's a case that 'maxcpus' kernel commandline is used
which results in only part of CPUs are brought up during boot, and the user
can manually online the rest CPUs at runtime.

(Side: this 'maxcpus' should not be used for modern machine, see the end).

In this case, IIUC, _theoretically_, the native_stop_other_cpus() could race
with CPU onlining, because I didn't see CPU hotplug is explicitly disabled
before that.

But, the native_stop_other_cpus() only sends IPIs to the CPUs which is
already set in cpu_online_mask, and when one CPU is already in
cpu_online_mask, the code to set the boolean has already been done, so the
WBINVD will be done for that.

The only possible issue that I could find is one CPU becomes online after
cpus_stop_mask is already populated:

	CPU 0						CPU 1

							start_secondary()

	cpumask_copy(&cpus_stop_mask, cpu_online_mask);
	cpumask_clear_cpu(this_cpu, &cpus_stop_mask);		
							...
							ap_starting();
							...
							set_cpu_online();
							...

	if (!cpumask_empty(&cpus_stop_mask)) {                                    
                apic_send_IPI_allbutself(REBOOT_VECTOR);
		...

But again this issue (assuming it exists) is an existing issue which is not
related to this patch.

And I am not 100% sure whether this issue exists, since allowing CPU hotplug
during kexec doesn't seem reasonable to me at least on x86, despite it is
indeed enabled kernel_kexec() common code:

        /*
         * migrate_to_reboot_cpu() disables CPU hotplug assuming that
         * no further code needs to use CPU hotplug (which is true in
         * the reboot case). However, the kexec path depends on using
         * CPU hotplug again; so re-enable it here.
         */
         cpu_hotplug_enable();
         pr_notice("Starting new kernel\n");
         machine_shutdown();

I tried to git blame to find clue but failed since the history is lost
during file move/renaming etc.  I suspect it is for other ARCHs.

At last, I believe 'maxcpus' should not be used for modern platforms anyway
due to below:

static inline bool cpu_bootable(unsigned int cpu)
{
	...

        /*
         * On x86 it's required to boot all logical CPUs at least once so
         * that the init code can get a chance to set CR4.MCE on each
         * CPU. Otherwise, a broadcasted MCE observing CR4.MCE=0b on any
         * core will shutdown the machine.
         */
        return !cpumask_test_cpu(cpu, &cpus_booted_once_mask);
}

So using boolean for SME really shouldn't have any issue.

Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec

Posted by Binbin Wu 3 months, 1 week ago


On 6/27/2025 8:30 AM, Huang, Kai wrote:
[...]
>
> And I am not 100% sure whether this issue exists, since allowing CPU hotplug
> during kexec doesn't seem reasonable to me at least on x86, despite it is
> indeed enabled kernel_kexec() common code:
>
>          /*
>           * migrate_to_reboot_cpu() disables CPU hotplug assuming that
>           * no further code needs to use CPU hotplug (which is true in
>           * the reboot case). However, the kexec path depends on using
>           * CPU hotplug again; so re-enable it here.
>           */
>           cpu_hotplug_enable();
>           pr_notice("Starting new kernel\n");
>           machine_shutdown();
>
> I tried to git blame to find clue but failed since the history is lost
> during file move/renaming etc.  I suspect it is for other ARCHs.
>
Had a check in git history, it's related to powerpc kexec code.

First, commit e8e5c2155b00 ("powerpc/kexec: Fix orphaned offline CPUs across kexec")
add the code to online each present CPU.

Later, commit011e4b02f1da (" powerpc, kexec: Fix "Processor X is stuck" issue
during kexec from ST mode") add the code in the common code kernel_kexec() to
enable hotplug to fix the stuck issue.

[PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
[PATCH v3 2/6] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
[PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
[PATCH v3 4/6] x86/virt/tdx: Remove the !KEXEC_CORE dependency
[PATCH v3 5/6] x86/virt/tdx: Update the kexec section in the TDX documentation
[PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier