[RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()

Posted by Kai Huang 9 months, 1 week ago
TL;DR:

Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
cover kexec support for both AMD SME and Intel TDX.  Previously there
was an issue preventing the kernel from doing so, but it has since been
root caused and fixed.

Long version:

AMD SME uses the C-bit to determine whether to encrypt the memory or
not.  For the same physical memory address, dirty cachelines with and
without the C-bit can coexist and the CPU can flush them back to memory
in random order.  To support kexec for SME, the old kernel uses WBINVD
to flush cache before booting to the new kernel so that no stale dirty
cachelines are left over by the old kernel, which could otherwise corrupt
the new kernel's memory.

TDX uses 'KeyID' bits in the physical address for memory encryption and
has the same requirement.  To support kexec for TDX, the old kernel
needs to flush cache of TDX private memory.

Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
is supported by hardware.  Perform unconditional WBINVD to support TDX
instead of adding one more vendor-specific check.  Kexec is a slow path,
and the additional WBINVD is acceptable for the sake of simplicity and
maintainability.

Only do WBINVD on bare-metal.  Doing WBINVD in a guest triggers an
unexpected exception (#VE or #VC) for TDX and SEV-ES/SEV-SNP guests, and
the guest may not be able to handle such an exception (e.g., a TDX guest
panics if it sees such a #VE).

History of SME and kexec WBINVD:

There was an issue preventing the kernel from doing unconditional
WBINVD, but it has since been root caused and fixed.

Initial SME kexec support added an unconditional WBINVD in commit

  bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")

This commit caused different Intel systems to hang or reset.

Without a clear root cause, a later commit

  f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")

fixed the Intel system hang issues by only doing WBINVD when hardware
supports SME.

A corner case [*] revealed the root cause of the system hang issues and
was fixed by commit

  1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")

See [1][2] for more information.

Further testing of unconditional WBINVD on top of the above fix, on the
problematic machines where the issues were originally reported,
confirmed that the issues could no longer be reproduced.

See [3][4] for more information.

Therefore, it is safe to do unconditional WBINVD for bare-metal now.

[*] The commit didn't check whether the CPUID leaf is available.
Reading an unsupported CPUID leaf on Intel returns garbage, resulting in
an unintended WBINVD which caused the issue (the subsequent analysis
revealed the final root cause).  The corner case was independently
fixed by commit

  9b040453d444: ("x86/smp: Dont access non-existing CPUID leaf")

Link: https://lore.kernel.org/lkml/28a494ca-3173-4072-921c-6c5f5b257e79@amd.com/ [1]
Link: https://lore.kernel.org/lkml/24844584-8031-4b58-ba5c-f85ef2f4c718@amd.com/ [2]
Link: https://lore.kernel.org/lkml/20240221092856.GAZdXCWGJL7c9KLewv@fat_crate.local/ [3]
Link: https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/ [4]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/kernel/process.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 9c75d701011f..8475d9d2d8c4 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -819,18 +819,19 @@ void __noreturn stop_this_cpu(void *dummy)
 	mcheck_cpu_clear(c);
 
 	/*
-	 * Use wbinvd on processors that support SME. This provides support
-	 * for performing a successful kexec when going from SME inactive
-	 * to SME active (or vice-versa). The cache must be cleared so that
-	 * if there are entries with the same physical address, both with and
-	 * without the encryption bit, they don't race each other when flushed
-	 * and potentially end up with the wrong entry being committed to
-	 * memory.
+	 * Use wbinvd to support kexec for both SME (from inactive to active
+	 * or vice-versa) and TDX.  The cache must be cleared so that if there
+	 * are entries with the same physical address, both with and without
+	 * the encryption bit(s), they don't race each other when flushed and
+	 * potentially end up with the wrong entry being committed to memory.
 	 *
-	 * Test the CPUID bit directly because the machine might've cleared
-	 * X86_FEATURE_SME due to cmdline options.
+	 * stop_this_cpu() isn't a fast path, just do unconditional WBINVD for
+	 * bare-metal to cover both SME and TDX.  Do not do WBINVD in a guest
+	 * since performing one will result in an exception (#VE or #VC) for
+	 * TDX or SEV-ES/SEV-SNP guests which the guest may not be able to
+	 * handle (e.g., TDX guest panics if it sees #VE).
 	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		wbinvd();
 
 	/*
-- 
2.48.1
Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
Posted by Edgecombe, Rick P 9 months, 1 week ago
On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> TL;DR:
> 
> Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
> cover kexec support for both AMD SME and Intel TDX.  Previously there
> _was_ some issue preventing from doing so but now it has been fixed.
                             ^ Adding "the kernel" here would read a little
cleaner to me.


When I read "some issue" I start assuming it wasn't fully debugged and it is
"some issue" that no one knows. But below it sounds like it was root caused.

> Long version:

It might make this long version easier to digest if it started with a
"tell them what you are going to tell them" paragraph.

> 
> AMD SME uses the C-bit to determine whether to encrypt the memory or
> not.  For the same physical memory address, dirty cachelines with and
> without the C-bit can coexist and the CPU can flush them back to memory
> in random order.  To support kexec for SME, the old kernel uses WBINVD
> to flush cache before booting to the new kernel so that no stale dirty
> cacheline are left over by the old kernel which could otherwise corrupt
> the new kernel's memory.
> 
> TDX uses 'KeyID' bits in the physical address for memory encryption and
> has the same requirement.  To support kexec for TDX, the old kernel
> needs to flush cache of TDX private memory.

This paragraph is a little jarring because it's not clear how it follows from
the first paragraph. It helps the reader to give some hints on how they should
organize the information as they go along. If it's too much of an info dump, it
puts an extra burden. They have to try to hold all of the facts in their head
until they can put together the bigger picture themselves.

The extra info about TDX using KeyID also seems unnecessary to the point (IIUC).

> 
> Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
> is supported by hardware.  Perform unconditional WBINVD to support TDX
> instead of adding one more vendor-specific check.  Kexec is a slow path,
> and the additional WBINVD is acceptable for the sake of simplicity and
> maintainability.

Out of curiosity, do you know why this was not already needed for non-self snoop
CPUs? Why can't there be other cache modes that get written back after the new
kernel starts using the memory for something else?

> 
> Only do WBINVD on bare-metal.  Doing WBINVD in guest triggers unexpected
                                                ^the
> exception (#VE or #VC) for TDX and SEV-ES/SEV-SNP guests and the guest
> may not be able to handle such exception (e.g., TDX guests panics if it
                                                             ^panic
> sees such #VE).

It's a small thing, but I think you could skip the #VE or #VC info in here. All
they need to know to understand this patch is that TDX and some SEV kernels
cannot handle WBINVD. And TDX panics. (does SEV not?)

> 
> History of SME and kexec WBINVD:
> 
> There _was_ an issue preventing doing unconditional WBINVD but that has
> been fixed.
> 
> Initial SME kexec support added an unconditional WBINVD in commit
> 
>   bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")
> 
> This commit caused different Intel systems to hang or reset.
> 
> Without a clear root cause, a later commit
> 
>   f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
> 
> fixed the Intel system hang issues by only doing WBINVD when hardware
> supports SME.
> 
> A corner case [*] revealed the root cause of the system hang issues and
> was fixed by commit
> 
>   1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")
> 
> See [1][2] for more information.
> 
> Further testing of doing unconditional WBINVD based on the above fix on
> the problematic machines (that issues were originally reported)
> confirmed the issues couldn't be reproduced.
> 
> See [3][4] for more information.
> 
> Therefore, it is safe to do unconditional WBINVD for bare-metal now.

Instead of a play-by-play, it might be more informative to summarize the edges
covered in this history:
 - Don't do anything that writes memory between wbinvd and native_halt(). This
includes function calls that touch the stack.
 - Avoid issuing wbinvd on multiple CPUs at the same time. As tglx implies it is
too expensive.
 - Don't race reboot by watching cpumask instead of num_online_cpus(). But there
is a race still.

Hmm, on the last one tglx says:
    The cpumask cannot plug all holes either, but it's better than a raw
    counter and allows to restrict the NMI fallback IPI to be sent only the
    CPUs which have not reported within the timeout window

Are we opening up more platforms to a race by unconditionally doing the wbinvd?
Can we clarify that nothing bad happens if we lose the race? (and is it
true?)

> 
> [*] The commit didn't check whether the CPUID leaf is available or not.
> Making unsupported CPUID leaf on Intel returns garbage resulting in
> unintended WBINVD which caused some issue (followed by the analysis and
> the reveal of the final root cause).  The corner case was independently
> fixed by commit
> 
>   9b040453d444: ("x86/smp: Dont access non-existing CPUID leaf")
> 
> Link: https://lore.kernel.org/lkml/28a494ca-3173-4072-921c-6c5f5b257e79@amd.com/ [1]
> Link: https://lore.kernel.org/lkml/24844584-8031-4b58-ba5c-f85ef2f4c718@amd.com/ [2]
> Link: https://lore.kernel.org/lkml/20240221092856.GAZdXCWGJL7c9KLewv@fat_crate.local/ [3]
> Link: https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/ [4]
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Suggested-by: Borislav Petkov <bp@alien8.de>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Dave Young <dyoung@redhat.com>
> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> ---
>  arch/x86/kernel/process.c | 21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 9c75d701011f..8475d9d2d8c4 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -819,18 +819,19 @@ void __noreturn stop_this_cpu(void *dummy)
>  	mcheck_cpu_clear(c);
>  
>  	/*
> -	 * Use wbinvd on processors that support SME. This provides support
> -	 * for performing a successful kexec when going from SME inactive
> -	 * to SME active (or vice-versa). The cache must be cleared so that
> -	 * if there are entries with the same physical address, both with and
> -	 * without the encryption bit, they don't race each other when flushed
> -	 * and potentially end up with the wrong entry being committed to
> -	 * memory.
> +	 * Use wbinvd to support kexec for both SME (from inactive to active
> +	 * or vice-versa) and TDX.  The cache must be cleared so that if there
> +	 * are entries with the same physical address, both with and without
> +	 * the encryption bit(s), they don't race each other when flushed and
> +	 * potentially end up with the wrong entry being committed to memory.
>  	 *
> -	 * Test the CPUID bit directly because the machine might've cleared
> -	 * X86_FEATURE_SME due to cmdline options.
> +	 * stop_this_cpu() isn't a fast path, just do unconditional WBINVD for
> +	 * bare-metal to cover both SME and TDX.  Do not do WBINVD in a guest
> +	 * since performing one will result in an exception (#VE or #VC) for
> +	 * TDX or SEV-ES/SEV-SNP guests which the guest may not be able to
> +	 * handle (e.g., TDX guest panics if it sees #VE).
>  	 */
> -	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> +	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
>  		wbinvd();

I see that this already has Tom's RB, but I'm not sure how this works for AMD.
The original SME patch tried to avoid writing to memory by putting the wbinvd
immediately before the halt, but today it is further away. Below this hunk there
are more instructions that could dirty memory before the halt.  Ohh... it's new.
9 months ago 26ba7353caaa ("x86/smp: Add smp_ops.stop_this_cpu() callback") adds
a function call that would touch the stack. I think it's wrong? And probably
introduced after this patch was originally written.

Then the cpuid_eax() could be non-inlined, but probably not. But the
boot_cpu_has() added in this patch could call out to kasan and dirty the stack.

So I think the existing SME case might be theoretically incorrect, and if so
this makes things very slightly worse.

>  
>  	/*

Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
Posted by kirill.shutemov@linux.intel.com 9 months ago
On Thu, Mar 13, 2025 at 06:40:09PM +0000, Edgecombe, Rick P wrote:
> > Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
> > is supported by hardware.  Perform unconditional WBINVD to support TDX
> > instead of adding one more vendor-specific check.  Kexec is a slow path,
> > and the additional WBINVD is acceptable for the sake of simplicity and
> > maintainability.
> 
> Out of curiosity, do you know why this was not already needed for non-self snoop
> CPUs? Why can't there be other cache modes that get written back after the new
> kernel starts using the memory for something else?

KeyID is a hack. Memory controller is aware about KeyID, but not cache.
Cache considers KeyID as part of physical address. Two cache lines for the
same physical address with different KeyID are considered unrelated from
cache coherency PoV.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
Posted by Edgecombe, Rick P 9 months ago
On Mon, 2025-03-17 at 14:52 +0200, kirill.shutemov@linux.intel.com wrote:
> On Thu, Mar 13, 2025 at 06:40:09PM +0000, Edgecombe, Rick P wrote:
> > > Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
> > > is supported by hardware.  Perform unconditional WBINVD to support TDX
> > > instead of adding one more vendor-specific check.  Kexec is a slow path,
> > > and the additional WBINVD is acceptable for the sake of simplicity and
> > > maintainability.
> > 
> > Out of curiosity, do you know why this was not already needed for non-self snoop
> > CPUs? Why can't there be other cache modes that get written back after the new
> > kernel starts using the memory for something else?
> 
> KeyID is a hack. Memory controller is aware about KeyID, but not cache.
> Cache considers KeyID as part of physical address. Two cache lines for the
> same physical address with different KeyID are considered unrelated from
> cache coherency PoV.

Sure, but non-selfsnoop CPUs can have trouble when PAT aliases cachetypes, I
guess. This came up in KVM recently.

So if new kernel maps the same memory with a different memtype I thought it
might be a similar problem.

There is a little bit here, but it doesn't mention snooping. Not an expert in
this area, btw.
https://www.kernel.org/doc/Documentation/x86/pat.txt
Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
Posted by Dave Hansen 9 months ago
On 3/17/25 14:59, Edgecombe, Rick P wrote:
> Sure, but non-selfsnoop CPUs can have trouble when PAT aliases cachetypes, I
> guess. This came up in KVM recently.
> 
> So if new kernel maps the same memory with a different memtype I thought it
> might be a similar problem.
Yeah, both the KeyIDs and memtypes mismatches are places that normal
cache coherency breaks. They break it in different ways for sure, but
it's still broken in a way that software has to work around.

As for kexec vs. PAT memtypes, there are only even theoretical issues on
old hardware.  They _theoretically_ need a WBINVD at kexec. But there
might be enough other things happening during kexec (including other
WBINVD's) to keep us from getting bitten in practice.

I'm not going to lose any sleep over it though.
Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
Posted by Huang, Kai 9 months ago
On Wed, 2025-03-19 at 09:41 -0700, Hansen, Dave wrote:
> As for kexec vs. PAT memtypes, there are only even theoretical issues on
> old hardware.  They _theoretically_ need a WBINVD at kexec. But there
> might be enough other things happening during kexec (including other
> WBINVD's) to keep us from getting bitten in practice.

From this perspective, I think it makes sense to do WBINVD for bare-metal
machines anyway?  Perhaps I can mention this in the changelog?
Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
Posted by Edgecombe, Rick P 9 months ago
On Wed, 2025-03-19 at 09:41 -0700, Dave Hansen wrote:
> They _theoretically_ need a WBINVD at kexec. But there
> might be enough other things happening during kexec (including other
> WBINVD's) to keep us from getting bitten in practice.

Thanks Dave. It seems there have been a number of issues in this step, so just
trying to make sure I understand how it is supposed to work before we change it.
Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
Posted by Huang, Kai 9 months ago
On Thu, 2025-03-13 at 18:40 +0000, Edgecombe, Rick P wrote:
> On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> > TL;DR:
> > 
> > Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
> > cover kexec support for both AMD SME and Intel TDX.  Previously there
> > _was_ some issue preventing from doing so but now it has been fixed.
>                              ^ Adding "the kernel" here would read a little
> cleaner to me.

OK.

> 
> 
> When I read "some issue" I start assuming it wasn't fully debugged and it is
> "some issue" that no one knows. But below it sounds like it was root caused.

I am not sure what's wrong with "some issue".  I used "_was_" to convey this
issue was fixed (thus root caused).  Perhaps the "root caused" part isn't clear?
I can explicitly call it out.

  Previously there _was_ some issue preventing the kernel from doing so but 
  now it has been root caused and fixed. 

> 
> > Long version:
> 
> It might make this easier to digest this long version if it start with a "tell
> them what you are going to tell them" paragraph.
> 
> > 
> > AMD SME uses the C-bit to determine whether to encrypt the memory or
> > not.  For the same physical memory address, dirty cachelines with and
> > without the C-bit can coexist and the CPU can flush them back to memory
> > in random order.  To support kexec for SME, the old kernel uses WBINVD
> > to flush cache before booting to the new kernel so that no stale dirty
> > cacheline are left over by the old kernel which could otherwise corrupt
> > the new kernel's memory.
> > 
> > TDX uses 'KeyID' bits in the physical address for memory encryption and
> > has the same requirement.  To support kexec for TDX, the old kernel
> > needs to flush cache of TDX private memory.
> 
> This paragraph is a little jarring because it's not clear how it follows from
> the first paragraph. It helps the reader to give some hints on how they should
> organize the information as they go along. If it's too much of an info dump, it
> puts an extra burden. They have to try to hold all of the facts in their head
> until they can put together the bigger picture themselves.
> 
> The extra info about TDX using KeyID also seems unnecessary to the point (IIUC).

I added the above two paragraphs mainly to address Reinette's concern regarding
"cache can be in incoherent status due to memory encryption" being too vague.

I also think they are too verbose.  How about placing the first two paragraphs
with what I have (after addressing your comments) in the patch 2:

 For SME and TDX, dirty cachelines with and without encryption bit(s) of 
 the same memory can coexist, and the CPU can flush them back to memory 
 in random order.

?

> 
> > 
> > Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
> > is supported by hardware.  Perform unconditional WBINVD to support TDX
> > instead of adding one more vendor-specific check.  Kexec is a slow path,
> > and the additional WBINVD is acceptable for the sake of simplicity and
> > maintainability.
> 
> Out of curiosity, do you know why this was not already needed for non-self snoop
> CPUs? 
> 

Self-snooping only deals with different memory types (UC, WB etc.) but
doesn't handle memory encryption, which tags additional bit(s) into the
physical address.

> Why can't there be other cache modes that get written back after the new
> kernel starts using the memory for something else?

I assume teaching the cache coherence protocol to understand the additional
encryption bit(s) isn't something that can be supported easily for all vendors.

> 
> > 
> > Only do WBINVD on bare-metal.  Doing WBINVD in guest triggers unexpected
>                                                 ^the

OK.

> > exception (#VE or #VC) for TDX and SEV-ES/SEV-SNP guests and the guest
> > may not be able to handle such exception (e.g., TDX guests panics if it
>                                                              ^panic

Ah good catch.

> > sees such #VE).
> 
> It's a small thing, but I think you could skip the #VE or #VC info in here. All
> they need to know to understand this patch is that TDX and some SEV kernels
> cannot handle WBINVD. And TDX panics. (does SEV not?)

I can remove "(#VE or #VC)", unless Kirill/Tom object.

SEV does not panic here in stop_this_cpu(), but it does panic later in
relocate_kernel():

https://lore.kernel.org/lkml/e1d37efb8951eb1d38493687b10a21b23353e35a.1710811610.git.kai.huang@intel.com/t/#m13e38c50f93afbf1c15f506e96817b2a78444b1d

> 
> > 
> > History of SME and kexec WBINVD:
> > 
> > There _was_ an issue preventing doing unconditional WBINVD but that has
> > been fixed.
> > 
> > Initial SME kexec support added an unconditional WBINVD in commit
> > 
> >   bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")
> > 
> > This commit caused different Intel systems to hang or reset.
> > 
> > Without a clear root cause, a later commit
> > 
> >   f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
> > 
> > fixed the Intel system hang issues by only doing WBINVD when hardware
> > supports SME.
> > 
> > A corner case [*] revealed the root cause of the system hang issues and
> > was fixed by commit
> > 
> >   1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")
> > 
> > See [1][2] for more information.
> > 
> > Further testing of doing unconditional WBINVD based on the above fix on
> > the problematic machines (that issues were originally reported)
> > confirmed the issues couldn't be reproduced.
> > 
> > See [3][4] for more information.
> > 
> > Therefore, it is safe to do unconditional WBINVD for bare-metal now.
> 
> Instead of a play-by-play, it might be more informative to summarize the edges
> covered in this history:

I think the above is also informative.  Boris suggested to keep the discussion
around those relevant commits in the changelog:

https://lore.kernel.org/linux-kernel/20240228110207.GCZd8Sr8mXHA2KTiLz@fat_crate.local/
https://lore.kernel.org/linux-kernel/20240415175912.GAZh1q8GgpY3tpJj5a@fat_crate.local/

>  - Don't do anything that writes memory between wbinvd and native_halt(). This
> includes function calls that touch the stack.

This is a requirement for SME, but changing to unconditional WBINVD on bare-metal
doesn't change this behaviour.  What's the purpose of mentioning it here?

>  - Avoid issuing wbinvd on multiple CPUs at the same time. As tglx implies it is
> too expensive.

Boris suggested to do unconditional WBINVD.  The test by Dave Young also
confirms there was no issue on the once-problematic platforms:

https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/

>  - Don't race reboot by watching cpumask instead of num_online_cpus(). But there
> is a race still.
> 
> Hmm, on the last one tglx says:
>     The cpumask cannot plug all holes either, but it's better than a raw
>     counter and allows to restrict the NMI fallback IPI to be sent only the
>     CPUs which have not reported within the timeout window
> 
> Are we opening up more platforms to a race by unconditionally doing the wbinvd?
> Can we be clarify that nothing bad happens if we lose the race? (and is it
> true?)

IIUC in most cases the REBOOT vector will just do the job and
stop_other_cpus() won't need to send NMIs, thus I believe in most cases
this shouldn't widen the race.

I don't know what kinda platform will need NMI to stop remote CPUs.  For
instance, AFAICT my SPR machine with 192 CPUs never needed to send NMI in my
testings.

Dave Young tested on those once-problematic platforms and doing unconditional
wbinvd was fine.  This patchset (albeit not upstreamed) has also been tested by
customers in their environment.  I believe doing unconditional WBINVD is fine.
Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
Posted by Edgecombe, Rick P 9 months ago
On Mon, 2025-03-17 at 10:11 +0000, Huang, Kai wrote:
> On Thu, 2025-03-13 at 18:40 +0000, Edgecombe, Rick P wrote:
> > On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> > > TL;DR:
> > > 
> > > Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
> > > cover kexec support for both AMD SME and Intel TDX.  Previously there
> > > _was_ some issue preventing from doing so but now it has been fixed.
> >                               ^ Adding "the kernel" here would read a little
> > cleaner to me.
> 
> OK.
> 
> > 
> > 
> > When I read "some issue" I start assuming it wasn't fully debugged and it is
> > "some issue" that no one knows. But below it sounds like it was root caused.
> 
> I am not sure what's wrong with "some issue".  I used "_was_" to convey this
> issue was fixed (thus root caused).  Perhaps the "root caused" part isn't clear?
> I can explicitly call it out.
> 
>   Previously there _was_ some issue preventing the kernel from doing so but 
>   now it has been root caused and fixed. 

The problem is the phrase "some issue" sounds like the issue is not understood.
You could just change it to "an issue". I don't even think you need the "_"
around "_was_". Just "Previously there was an issue preventing the kernel..." is
more reassuring.


[snip]

> > 
> > Instead of a play-by-play, it might be more informative to summarize the edges
> > covered in this history:
> 
> I think the above is also informative.  Boris suggested to keep the discussion
> around those relevant commits in the changelog:
> 
> https://lore.kernel.org/linux-kernel/20240228110207.GCZd8Sr8mXHA2KTiLz@fat_crate.local/

Summary: Kirill says it's too verbose, can be dropped.

> https://lore.kernel.org/linux-kernel/20240415175912.GAZh1q8GgpY3tpJj5a@fat_crate.local/

Summary: Boris says not to drop it and it's ok to be verbose.

But what I'm saying is that the section doesn't summarize the issue well. It
just links to a bunch of commits for the reviewer to go figure it out
themselves. So I think explaining the takeaways instead of only linking to
threads wouldn't be objectionable. You don't need to drop the commit references,
just don't leave so much homework for the reader.

> 
> >   - Don't do anything that writes memory between wbinvd and native_halt(). This
> > includes function calls that touch the stack.
> 
> This is a requirement to SME, but changing to unconditional WBINVD on bare-metal
> doesn't change this behaviour.  What's the purpose of mentioning here?

The purpose is to help the reviewer understand the delicate design of this
function so that they can evaluate whether the changes upset it.

> 
> >   - Avoid issuing wbinvd on multiple CPUs at the same time. As tglx implies it is
> > too expensive.
> 
> Boris suggested to do unconditional WBINVD.  The test by Dave Young also
> confirms there was no issue on the once-problematic platforms:
> 
> https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/

I'm not sure what your point is here. That there is no issue? This log points to
a commit that contradicts the assertions it is making. Especially since to
understand this part of the log, the reviewer is going to have to read those
referenced commits, don't you think it's a problem? It is opening new doubts.

> 
> >   - Don't race reboot by watching cpumask instead of num_online_cpus(). But there
> > is a race still.
> > 
> > Hmm, on the last one tglx says:
> >      The cpumask cannot plug all holes either, but it's better than a raw
> >      counter and allows to restrict the NMI fallback IPI to be sent only the
> >      CPUs which have not reported within the timeout window
> > 
> > Are we opening up more platforms to a race by unconditionally doing the wbinvd?
> > Can we be clarify that nothing bad happens if we lose the race? (and is it
> > true?)
> 
> IIUC in most cases the REBOOT vector will just do the job, stop_other_cpus()
> won't need to send NMIs thus I believe in most cases this shouldn't increase
> race.  
> 
> I don't know what kinda platform will need NMI to stop remote CPUs.  For
> instance, AFAICT my SPR machine with 192 CPUs never needed to send NMI in my
> testings.
> 
> Dave Yong tested on those once-problematic platforms and doing unconditional
> wbinvd was fine.  This patchset (albeit not upstreamed) has also been tested by
> customers in their environment.  I believe doing unconditional WBINVD is fine.

Sounds too much like: "Someone tested it on a platform that used to have a
problem and at least that one is fixed, but we really don't understand what the
issue is".

I haven't tried to understand what race tglx was seeing, or what the consequence
is. I think we should understand and explain why it's harmless, especially since
this bring it up by linking the log that references it.


Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
Posted by Huang, Kai 9 months ago
On Tue, 2025-03-18 at 03:41 +0000, Edgecombe, Rick P wrote:
> On Mon, 2025-03-17 at 10:11 +0000, Huang, Kai wrote:
> > On Thu, 2025-03-13 at 18:40 +0000, Edgecombe, Rick P wrote:
> > > On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> > > > TL;DR:
> > > > 
> > > > Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
> > > > cover kexec support for both AMD SME and Intel TDX.  Previously there
> > > > _was_ some issue preventing from doing so but now it has been fixed.
> > >                               ^ Adding "the kernel" here would read a little
> > > cleaner to me.
> > 
> > OK.
> > 
> > > 
> > > 
> > > When I read "some issue" I start assuming it wasn't fully debugged and it is
> > > "some issue" that no one knows. But below it sounds like it was root caused.
> > 
> > I am not sure what's wrong with "some issue".  I used "_was_" to convey this
> > issue was fixed (thus root caused).  Perhaps the "root caused" part isn't clear?
> > I can explicitly call it out.
> > 
> >   Previously there _was_ some issue preventing the kernel from doing so but 
> >   now it has been root caused and fixed. 
> 
> The problem is the phrase "some issue" sounds like the issue is not understood.
> You could just change it to "an issue". I don't even think you need the "_"
> around "_was_". Just "Previously there was an issue preventing the kernel..." is
> more reassuring.

(sorry for getting back late).

Sure.

> 
> 
> [snip]
> 
> > > 
> > > Instead of a play-by-play, it might be more informative to summarize the edges
> > > covered in this history:
> > 
> > I think the above is also informative.  Boris suggested to keep the discussion
> > around those relevant commits in the changelog:
> > 
> > https://lore.kernel.org/linux-kernel/20240228110207.GCZd8Sr8mXHA2KTiLz@fat_crate.local/
> 
> Summary: Kirill says it's too verbose, can be dropped.
> 
> > https://lore.kernel.org/linux-kernel/20240415175912.GAZh1q8GgpY3tpJj5a@fat_crate.local/
> 
> Summary: Boris says not to drop it and it's ok to be verbose.
> 
> But what I'm saying is that the section doesn't summarize the issue well. It
> just links to a bunch of commits for the reviewer to go figure it out
> themselves. So I think explaining the takeaways instead of only linking to
> threads wouldn't be objectionable. You don't need to drop the commit references,
> just don't leave so much homework for the reader.

I'll refine per discussion with you.

> 
> > 
> > >   - Don't do anything that writes memory between wbinvd and native_halt(). This
> > > includes function calls that touch the stack.
> > 
> > This is a requirement for SME, but changing to unconditional WBINVD on
> > bare-metal doesn't change this behaviour.  What's the purpose of mentioning
> > it here?
> 
> The purpose is to help the reviewer understand the delicate design of this
> function so that they can evaluate whether the changes upset it.

As discussed, I will mention the SME problem that you discovered and the code
change will not make any worse for SME.

> 
> > 
> > >   - Avoid issuing wbinvd on multiple CPUs at the same time. As tglx implies it is
> > > too expensive.
> > 
> > Boris suggested to do unconditional WBINVD.  The test by Dave Young also
> > confirms there was no issue on the once-problematic platforms:
> > 
> > https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/
> 
> I'm not sure what your point is here. That there is no issue? 
> 

I meant that "too expensive" doesn't mean there will be an issue, and indeed no
issue was discovered, at least on the once-problematic platforms, as tested by
Dave Young.

> This log points to
> a commit that contradicts the assertions it is making. Especially since to
> understand this part of the log, the reviewer is going to have to read those
> referenced commits, don't you think it's a problem? It is opening new doubts.

Sorry, I am not able to follow where the "contradiction" is.

Anyway, I have spent some time reading the code around stop_other_cpus() and
stop_this_cpu(), and it is now clear (at least to me) that doing wbinvd on all
bare-metal systems is safe.  Please see below.

> 
> > 
> > >   - Don't race reboot by watching cpumask instead of num_online_cpus(). But there
> > > is a race still.
> > > 
> > > Hmm, on the last one tglx says:
> > >      The cpumask cannot plug all holes either, but it's better than a raw
> > >      counter and allows to restrict the NMI fallback IPI to be sent only the
> > >      CPUs which have not reported within the timeout window
> > > 
> > > Are we opening up more platforms to a race by unconditionally doing the wbinvd?
> > > Can we clarify that nothing bad happens if we lose the race? (and is it
> > > true?)
> > 
> > IIUC in most cases the REBOOT vector will just do the job, stop_other_cpus()
> > won't need to send NMIs thus I believe in most cases this shouldn't increase
> > race.  
> > 
> > I don't know what kind of platform will need NMIs to stop remote CPUs.  For
> > instance, AFAICT my SPR machine with 192 CPUs never needed to send NMIs in my
> > testing.
> > 
> > Dave Young tested on those once-problematic platforms and doing unconditional
> > wbinvd was fine.  This patchset (albeit not upstreamed) has also been tested by
> > customers in their environment.  I believe doing unconditional WBINVD is fine.
> 
> Sounds too much like: "Someone tested it on a platform that used to have a
> problem and at least that one is fixed, but we really don't understand what the
> issue is".

Not exactly.  I think commit 1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more
robust") has actually fixed this issue.

AFAICT the root cause is the race below, mentioned in that commit's changelog:

    CPU0                                    CPU1
    
     stop_other_cpus()
       send_IPIs(REBOOT);                   stop_this_cpu()
       while (num_online_cpus() > 1);         set_online(false);
       proceed... -> hang
                                              wbinvd()
    
So the key is to make sure the "proceed..." part can only happen after all
wbinvd()s are done in the remote CPUs.

To achieve this, the above commit introduced two rounds of
sending-requests-and-waiting:

 1) The first round sends the REBOOT IPI to remote CPUs and waits 1 second
    for all remote CPUs to stop;

 2) If the first round times out, the second round sends NMIs to remote CPUs
    and waits for 10ms, or forever (as requested by the @wait parameter to
    smp_ops.stop_other_cpus()).

For the kexec case, the kernel will just wait until all remote CPUs are stopped,
see:

	kernel_kexec() ->
		machine_shutdown() ->
			native_machine_shutdown() -> 
				stop_other_cpus() ->
					smp_ops->stop_other_cpus(1)
> 
> I haven't tried to understand what race tglx was seeing, or what the consequence
> is. I think we should understand and explain why it's harmless, especially since
> this brings it up by linking the log that references it.
> 

As discussed with you offline, changing to do wbinvd on bare-metal only
increases the time needed to stop remote CPUs.  The potential impact is that
the first round of waiting may time out on some large systems (albeit that
didn't happen, at least on my 192-CPU machine), but this is fine, since the
second round of waiting will actually wait until all wbinvds have been done
and all remote CPUs have actually been stopped.

In other words, AFAICT the "race" you mentioned (presumably it is the one I
mentioned above) cannot happen in kexec() after changing to do wbinvd on
bare-metal.
Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
Posted by Tom Lendacky 9 months ago
On 3/13/25 13:40, Edgecombe, Rick P wrote:
> On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
>> TL;DR:
>>
>> Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
>> cover kexec support for both AMD SME and Intel TDX.  Previously there
>> _was_ some issue preventing from doing so but now it has been fixed.
>                              ^ Adding "the kernel" here would read a little
> cleaner to me.
> 
> 
> When I read "some issue" I start assuming it wasn't fully debugged and it is
> "some issue" that no one knows. But below it sounds like it was root caused.
> 
>> Long version:
> 
> It might make this long version easier to digest if it started with a "tell
> them what you are going to tell them" paragraph.
> 
>>
>> AMD SME uses the C-bit to determine whether to encrypt the memory or
>> not.  For the same physical memory address, dirty cachelines with and
>> without the C-bit can coexist and the CPU can flush them back to memory
>> in random order.  To support kexec for SME, the old kernel uses WBINVD
>> to flush cache before booting to the new kernel so that no stale dirty
>> cacheline are left over by the old kernel which could otherwise corrupt
>> the new kernel's memory.
>>
>> TDX uses 'KeyID' bits in the physical address for memory encryption and
>> has the same requirement.  To support kexec for TDX, the old kernel
>> needs to flush cache of TDX private memory.
> 
> This paragraph is a little jarring because it's not clear how it follows from
> the first paragraph. It helps the reader to give some hints on how they should
> organize the information as they go along. If it's too much of an info dump, it
> puts an extra burden. They have to try to hold all of the facts in their head
> until they can put together the bigger picture themselves.
> 
> The extra info about TDX using KeyID also seems unnecessary to the point (IIUC).
> 
>>
>> Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
>> is supported by hardware.  Perform unconditional WBINVD to support TDX
>> instead of adding one more vendor-specific check.  Kexec is a slow path,
>> and the additional WBINVD is acceptable for the sake of simplicity and
>> maintainability.
> 
> Out of curiosity, do you know why this was not already needed for non-self snoop
> CPUs? Why can't there be other cache modes that get written back after the new
> kernel starts using the memory for something else?
> 
>>
>> Only do WBINVD on bare-metal.  Doing WBINVD in guest triggers unexpected
>                                                 ^the
>> exception (#VE or #VC) for TDX and SEV-ES/SEV-SNP guests and the guest
>> may not be able to handle such exception (e.g., TDX guests panics if it
>                                                              ^panic
>> sees such #VE).
> 
> It's a small thing, but I think you could skip the #VE or #VC info in here. All
> they need to know to understand this patch is that TDX and some SEV kernels
> cannot handle WBINVD. And TDX panics. (does SEV not?)
> 
>>
>> History of SME and kexec WBINVD:
>>
>> There _was_ an issue preventing doing unconditional WBINVD but that has
>> been fixed.
>>
>> Initial SME kexec support added an unconditional WBINVD in commit
>>
>>   bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")
>>
>> This commit caused different Intel systems to hang or reset.
>>
>> Without a clear root cause, a later commit
>>
>>   f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
>>
>> fixed the Intel system hang issues by only doing WBINVD when hardware
>> supports SME.
>>
>> A corner case [*] revealed the root cause of the system hang issues and
>> was fixed by commit
>>
>>   1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")
>>
>> See [1][2] for more information.
>>
>> Further testing of doing unconditional WBINVD based on the above fix on
>> the problematic machines (where the issues were originally reported)
>> confirmed the issues couldn't be reproduced.
>>
>> See [3][4] for more information.
>>
>> Therefore, it is safe to do unconditional WBINVD for bare-metal now.
> 
> Instead of a play-by-play, it might be more informative to summarize the edges
> covered in this history:
>  - Don't do anything that writes memory between wbinvd and native_halt(). This
> includes function calls that touch the stack.
>  - Avoid issuing wbinvd on multiple CPUs at the same time. As tglx implies it is
> too expensive.
>  - Don't race reboot by watching cpumask instead of num_online_cpus(). But there
> is a race still.
> 
> Hmm, on the last one tglx says:
>     The cpumask cannot plug all holes either, but it's better than a raw
>     counter and allows to restrict the NMI fallback IPI to be sent only the
>     CPUs which have not reported within the timeout window
> 
> Are we opening up more platforms to a race by unconditionally doing the wbinvd?
> Can we clarify that nothing bad happens if we lose the race? (and is it
> true?)
> 
>>
>> [*] The commit didn't check whether the CPUID leaf is available or not.
>> Reading an unsupported CPUID leaf on Intel returns garbage, resulting in an
>> unintended WBINVD which caused the issue (followed by the analysis and the
>> revelation of the final root cause).  The corner case was independently
>> fixed by commit
>>
>>   9b040453d444: ("x86/smp: Dont access non-existing CPUID leaf")
>>
>> Link: https://lore.kernel.org/lkml/28a494ca-3173-4072-921c-6c5f5b257e79@amd.com/ [1]
>> Link: https://lore.kernel.org/lkml/24844584-8031-4b58-ba5c-f85ef2f4c718@amd.com/ [2]
>> Link: https://lore.kernel.org/lkml/20240221092856.GAZdXCWGJL7c9KLewv@fat_crate.local/ [3]
>> Link: https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/ [4]
>> Signed-off-by: Kai Huang <kai.huang@intel.com>
>> Suggested-by: Borislav Petkov <bp@alien8.de>
>> Cc: Tom Lendacky <thomas.lendacky@amd.com>
>> Cc: Dave Young <dyoung@redhat.com>
>> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
>> ---
>>  arch/x86/kernel/process.c | 21 +++++++++++----------
>>  1 file changed, 11 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>> index 9c75d701011f..8475d9d2d8c4 100644
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -819,18 +819,19 @@ void __noreturn stop_this_cpu(void *dummy)
>>  	mcheck_cpu_clear(c);
>>  
>>  	/*
>> -	 * Use wbinvd on processors that support SME. This provides support
>> -	 * for performing a successful kexec when going from SME inactive
>> -	 * to SME active (or vice-versa). The cache must be cleared so that
>> -	 * if there are entries with the same physical address, both with and
>> -	 * without the encryption bit, they don't race each other when flushed
>> -	 * and potentially end up with the wrong entry being committed to
>> -	 * memory.
>> +	 * Use wbinvd to support kexec for both SME (from inactive to active
>> +	 * or vice-versa) and TDX.  The cache must be cleared so that if there
>> +	 * are entries with the same physical address, both with and without
>> +	 * the encryption bit(s), they don't race each other when flushed and
>> +	 * potentially end up with the wrong entry being committed to memory.
>>  	 *
>> -	 * Test the CPUID bit directly because the machine might've cleared
>> -	 * X86_FEATURE_SME due to cmdline options.
>> +	 * stop_this_cpu() isn't a fast path, just do unconditional WBINVD for
>> +	 * bare-metal to cover both SME and TDX.  Do not do WBINVD in a guest
>> +	 * since performing one will result in an exception (#VE or #VC) for
>> +	 * TDX or SEV-ES/SEV-SNP guests which the guest may not be able to
>> +	 * handle (e.g., TDX guest panics if it sees #VE).
>>  	 */
>> -	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
>> +	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
>>  		wbinvd();
> 
> I see that this already has Tom's RB, but I'm not sure how this works for AMD.
> The original SME patch tried to avoid writing to memory by putting the wbinvd
> immediately before the halt, but today it is further away. Below this hunk there
> are more instructions that could dirty memory before the halt.  Ohh... it's new.
> 9 months ago 26ba7353caaa ("x86/smp: Add smp_ops.stop_this_cpu() callback") adds
> a function call that would touch the stack. I think it's wrong? And probably
> introduced after this patch was originally written.
> 
> Then the cpuid_eax() could be non-inlined, but probably not. But the
> boot_cpu_has() added in this patch could call out to kasan and dirty the stack.
> 
> So I think the existing SME case might be theoretically incorrect, and if so
> this makes things very slightly worse.

But the wbinvd() is performed after those checks, so everything gets flushed.

Thanks,
Tom

> 
>>  
>>  	/*
> 
Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
Posted by Edgecombe, Rick P 9 months ago
On Fri, 2025-03-14 at 10:11 -0500, Tom Lendacky wrote:
> > I see that this already has Tom's RB, but I'm not sure how this works for
> > AMD.
> > The original SME patch tried to avoid writing to memory by putting the
> > wbinvd
> > immediately before the halt, but today it is further away. Below this hunk
> > there
> > are more instructions that could dirty memory before the halt.  Ohh... it's
> > new.
> > 9 months ago 26ba7353caaa ("x86/smp: Add smp_ops.stop_this_cpu() callback")
> > adds
> > a function call that would touch the stack. I think it's wrong? And probably
> > introduced after this patch was originally written.
> > 
> > Then the cpuid_eax() could be non-inlined, but probably not. But the
> > boot_cpu_has() added in this patch could call out to kasan and dirty the
> > stack.
> > 
> > So I think the existing SME case might be theoretically incorrect, and if so
> > this makes things very slightly worse.
> 
> But the wbinvd() is performed after those checks, so everything gets flushed.

Oh, right, duh. Thanks for checking. Yea those shouldn't matter.

Does the stop_this_cpu() part never come into play for SME either? It looks like
it was added for TDX guest kexec, but is a general ACPI thing.

Regarding the kasan thing, I was looking at this too:
wbinvd()
cpumask_clear_cpu()
  clear_bit()
    instrument_atomic_write()
      kasan_check_write()
        __kasan_check_write() <- non-inline

Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
Posted by Tom Lendacky 9 months ago
On 3/14/25 11:28, Edgecombe, Rick P wrote:
> On Fri, 2025-03-14 at 10:11 -0500, Tom Lendacky wrote:
>>> I see that this already has Tom's RB, but I'm not sure how this works for
>>> AMD.
>>> The original SME patch tried to avoid writing to memory by putting the
>>> wbinvd
>>> immediately before the halt, but today it is further away. Below this hunk
>>> there
>>> are more instructions that could dirty memory before the halt.  Ohh... it's
>>> new.
>>> 9 months ago 26ba7353caaa ("x86/smp: Add smp_ops.stop_this_cpu() callback")
>>> adds
>>> a function call that would touch the stack. I think it's wrong? And probably
>>> introduced after this patch was originally written.
>>>
>>> Then the cpuid_eax() could be non-inlined, but probably not. But the
>>> boot_cpu_has() added in this patch could call out to kasan and dirty the
>>> stack.
>>>
>>> So I think the existing SME case might be theoretically incorrect, and if so
>>> this makes things very slightly worse.
>>
>> But the wbinvd() is performed after those checks, so everything gets flushed.
> 
> Oh, right, duh. Thanks for checking. Yea those shouldn't matter.
> 
> Does the stop_this_cpu() part never come into play for SME either? It looks like
> it was added for TDX guest kexec, but is a general ACPI thing.

It is a general ACPI thing, but I don't know of it being used by our MADT
tables.

> 
> Regarding the kasan thing, I was looking at this too:
> wbinvd()
> cpumask_clear_cpu()
>   clear_bit()
>     instrument_atomic_write()
>       kasan_check_write()
>         __kasan_check_write() <- non-inline

Yes, this does look worrisome. Too bad there isn't a way to turn off KASAN
for a single function.

Thanks,
Tom

> 
Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
Posted by Edgecombe, Rick P 9 months ago
On Fri, 2025-03-14 at 13:18 -0500, Tom Lendacky wrote:
> > Does the stop_this_cpu() part never come into play for SME either? It looks
> > like
> > it was added for TDX guest kexec, but is a general ACPI thing.
> 
> It is a general ACPI thing, but I don't know of it being used by our MADT
> tables.
> 
> > 
> > Regarding the kasan thing, I was looking at this too:
> > wbinvd()
> > cpumask_clear_cpu()
> >    clear_bit()
> >      instrument_atomic_write()
> >        kasan_check_write()
> >          __kasan_check_write() <- non-inline
> 
> Yes, this does look worrisome. Too bad there isn't a way to turn off KASAN
> for a single function.

I wonder why that one has an explicit call, compared to compiler-generated
instrumentation. It makes me wonder if there is just some KASAN-skipping
bit-wise operation that could be used to fix it.

For stop_this_cpu(), not sure how to avoid it. Maybe just a raw jump to the
function pointer or something. It's not supposed to return. If it is actually
not an issue due to the foreseeable lack of real-world HW/firmware, a comment
would be a nice touch.

Ok, well based on your earlier point, the new code actually doesn't make things
worse for SME. For TDX the late function calls are not a problem. So we'll leave
it to you guys.

Thanks for responding so quickly.
Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
Posted by Huang, Kai 9 months, 1 week ago
On Thu, 2025-03-13 at 18:40 +0000, Edgecombe, Rick P wrote:
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index 9c75d701011f..8475d9d2d8c4 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -819,18 +819,19 @@ void __noreturn stop_this_cpu(void *dummy)
> >   	mcheck_cpu_clear(c);
> >   
> >   	/*
> > -	 * Use wbinvd on processors that support SME. This provides support
> > -	 * for performing a successful kexec when going from SME inactive
> > -	 * to SME active (or vice-versa). The cache must be cleared so that
> > -	 * if there are entries with the same physical address, both with and
> > -	 * without the encryption bit, they don't race each other when flushed
> > -	 * and potentially end up with the wrong entry being committed to
> > -	 * memory.
> > +	 * Use wbinvd to support kexec for both SME (from inactive to active
> > +	 * or vice-versa) and TDX.  The cache must be cleared so that if there
> > +	 * are entries with the same physical address, both with and without
> > +	 * the encryption bit(s), they don't race each other when flushed and
> > +	 * potentially end up with the wrong entry being committed to memory.
> >   	 *
> > -	 * Test the CPUID bit directly because the machine might've cleared
> > -	 * X86_FEATURE_SME due to cmdline options.
> > +	 * stop_this_cpu() isn't a fast path, just do unconditional WBINVD for
> > +	 * bare-metal to cover both SME and TDX.  Do not do WBINVD in a guest
> > +	 * since performing one will result in an exception (#VE or #VC) for
> > +	 * TDX or SEV-ES/SEV-SNP guests which the guest may not be able to
> > +	 * handle (e.g., TDX guest panics if it sees #VE).
> >   	 */
> > -	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> > +	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
> >   		wbinvd();
> 
> I see that this already has Tom's RB, but I'm not sure how this works for AMD.
> The original SME patch tried to avoid writing to memory by putting the wbinvd
> immediately before the halt, but today it is further away. Below this hunk there
> are more instructions that could dirty memory before the halt.  Ohh... it's new.
> 9 months ago 26ba7353caaa ("x86/smp: Add smp_ops.stop_this_cpu() callback") adds
> a function call that would touch the stack. I think it's wrong? And probably
> introduced after this patch was originally written.

(I'll reply others separately since today I am a little bit sick.)

That callback is for a TDX guest (which doesn't invoke WBINVD during kexec).  It
doesn't impact an SME host.

I kinda agree the code here can maybe be improved, e.g., the call to
smp_ops.stop_this_cpu() should perhaps happen before the WBINVD.

> 
> Then the cpuid_eax() could be non-inlined, but probably not. But the
> boot_cpu_has() added in this patch could call out to kasan and dirty the stack.

Could you elaborate this since I don't see how kasan is involved?

> 
> So I think the existing SME case might be theoretically incorrect, and if so
> this makes things very slightly worse.

As explained above, the function call is empty for an SME host.