For both SME and TDX, dirty cachelines with and without the encryption
bit(s) of the same physical memory address can coexist and the CPU can
flush them back to memory in random order. During kexec, the caches
must be flushed before jumping to the new kernel to avoid silent memory
corruption to the new kernel.
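
As a concrete illustration of the aliasing (a hedged, standalone sketch;
the encryption-bit position below is an assumption made up for the demo --
SME's C-bit position actually comes from CPUID leaf 0x8000001f, and TDX's
KeyID bits likewise live in upper physical address bits):

	#include <stdint.h>
	#include <stdio.h>

	/* Assumption for illustration only: pretend the encryption bit
	 * is physical address bit 47. */
	#define ENC_BIT	(1ULL << 47)

	int main(void)
	{
		uint64_t pa = 0x12345000ULL;

		/* The cache tags these as two distinct lines even though
		 * both name the same DRAM location; whichever dirty line
		 * is flushed last "wins". */
		printf("plaintext alias: %#llx\n", (unsigned long long)pa);
		printf("encrypted alias: %#llx\n",
		       (unsigned long long)(pa | ENC_BIT));
		return 0;
	}
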
The WBINVD in stop_this_cpu() flushes caches for all remote CPUs when
they are being stopped. For SME, the WBINVD in relocate_kernel()
flushes the cache for the last running CPU (which is doing kexec).
Similarly, to support kexec for TDX host, after stopping all remote CPUs
with cache flushed, the kernel needs to flush cache for the last running
CPU.
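
To make the division of labour concrete, a condensed sketch (not the
literal kernel code: stop_this_cpu() is real, but its body is heavily
simplified and flush_needed() is a hypothetical stand-in for the actual
SME/TDX checks):

	/* Runs on each remote CPU as it is being stopped. */
	void __noreturn stop_this_cpu(void *dummy)
	{
		/* ... disable interrupts, mark this CPU offline ... */

		if (flush_needed())		/* hypothetical SME/TDX check */
			native_wbinvd();	/* flush this CPU's caches */

		for (;;)
			native_halt();
	}

	/*
	 * The kexec-ing CPU never runs stop_this_cpu(); its flush is the
	 * WBINVD in relocate_kernel() (assembly), which this patch extends
	 * to cover TDX in addition to SME.
	 */
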
Use the existing WBINVD in relocate_kernel() to cover TDX host as well.
Just do unconditional WBINVD to cover both SME and TDX instead of
sprinkling additional vendor-specific checks. Kexec is a slow path, and
the additional WBINVD is acceptable for the sake of simplicity and
maintainability.
But only do WBINVD for bare-metal because TDX guests and SEV-ES/SEV-SNP
guests will get an unexpected (and unnecessary) exception (#VE or #VC)
which the kernel is unable to handle at the time of relocate_kernel()
since the kernel has torn down the IDT.
Remove the host_mem_enc_active local variable and directly use
!cpu_feature_enabled(X86_FEATURE_HYPERVISOR) as the argument when calling
relocate_kernel(). cpu_feature_enabled() is always inlined, not a
function call, thus it is safe to use after load_segments() when call
depth tracking is enabled.
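
For background, a hedged sketch of why the helper choice matters here
(simplified; not the actual implementation of either helper):

	/*
	 * cc_platform_has() is an out-of-line function.  With call depth
	 * tracking, function calls are instrumented with per-CPU
	 * accounting that needs a valid GS base -- but load_segments()
	 * zeroes GS, so calling it afterwards would fault.
	 *
	 * cpu_feature_enabled() expands inline to (roughly) either a
	 * compile-time constant or a test against the boot CPU's
	 * capability bits, so no CALL instruction is emitted at the use
	 * site:
	 */
	unsigned int bare_metal = !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
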
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Kaplan <david.kaplan@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: David Kaplan <david.kaplan@amd.com>
---
arch/x86/include/asm/kexec.h | 2 +-
arch/x86/kernel/machine_kexec_64.c | 14 ++++++--------
arch/x86/kernel/relocate_kernel_64.S | 15 ++++++++++-----
3 files changed, 17 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 8ad187462b68..48c313575262 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -123,7 +123,7 @@ relocate_kernel_fn(unsigned long indirection_page,
unsigned long pa_control_page,
unsigned long start_address,
unsigned int preserve_context,
- unsigned int host_mem_enc_active);
+ unsigned int bare_metal);
#endif
extern relocate_kernel_fn relocate_kernel;
#define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index a68f5a0a9f37..0e9808eeb63e 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -346,16 +346,9 @@ void __nocfi machine_kexec(struct kimage *image)
{
unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
relocate_kernel_fn *relocate_kernel_ptr;
- unsigned int host_mem_enc_active;
int save_ftrace_enabled;
void *control_page;
- /*
- * This must be done before load_segments() since if call depth tracking
- * is used then GS must be valid to make any function calls.
- */
- host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
-
#ifdef CONFIG_KEXEC_JUMP
if (image->preserve_context)
save_processor_state();
@@ -398,6 +391,11 @@ void __nocfi machine_kexec(struct kimage *image)
*
* I take advantage of this here by force loading the
* segments, before I zap the gdt with an invalid value.
+ *
+ * load_segments() resets GS to 0. Don't make any function call
+ * after here since call depth tracking uses per-CPU variables to
+ * operate (relocate_kernel() is explicitly ignored by call depth
+ * tracking).
*/
load_segments();
/*
@@ -412,7 +410,7 @@ void __nocfi machine_kexec(struct kimage *image)
virt_to_phys(control_page),
image->start,
image->preserve_context,
- host_mem_enc_active);
+ !cpu_feature_enabled(X86_FEATURE_HYPERVISOR));
#ifdef CONFIG_KEXEC_JUMP
if (image->preserve_context)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index b44d8863e57f..dc1a59cd8139 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -50,7 +50,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
* %rsi pa_control_page
* %rdx start address
* %rcx preserve_context
- * %r8 host_mem_enc_active
+ * %r8 bare_metal
*/
/* Save the CPU context, used for jumping back */
@@ -107,7 +107,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
/*
* %rdi indirection page
* %rdx start address
- * %r8 host_mem_enc_active
+ * %r8 bare_metal
* %r9 page table page
* %r11 preserve_context
* %r13 original CR4 when relocate_kernel() was invoked
@@ -156,14 +156,19 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
movq %r9, %cr3
/*
- * If SME is active, there could be old encrypted cache line
+ * If SME/TDX is active, there could be old encrypted cache line
* entries that will conflict with the now unencrypted memory
* used by kexec. Flush the caches before copying the kernel.
+ *
+ * Do WBINVD for bare-metal only to cover both SME and TDX. Doing
+ * WBINVD in guest results in an unexpected exception (#VE or #VC)
+ * for TDX and SEV-ES/SNP guests which then crashes the guest (the
+ * kernel has torn down the IDT).
*/
testq %r8, %r8
- jz .Lsme_off
+ jz .Lno_wbinvd
wbinvd
-.Lsme_off:
+.Lno_wbinvd:
call swap_pages
--
2.48.1
On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> For both SME and TDX, dirty cachelines with and without the encryption
> bit(s) of the same physical memory address can coexist and the CPU can
> flush them back to memory in random order.
>
A lot going on in this sentence, how about simplifying it:
For SME and TDX, multiple dirty cachelines for the same memory can co-exist, and
the CPU can flush them back to memory in a random order.
> During kexec, the caches
> must be flushed before jumping to the new kernel to avoid silent memory
> corruption to the new kernel.
During kexec, the caches must be flushed before jumping to the new kernel to
avoid silent memory corruption when a cacheline with a different encryption
property is written back over whatever encryption properties the new kernel is
using.
...it distributes some of the details from the first sentence into the second.
Easier to read or no? I'm not sure.
>
> The WBINVD in stop_this_cpu() flushes caches for all remote CPUs when
> they are being stopped. For SME, the WBINVD in relocate_kernel()
> flushes the cache for the last running CPU (which is doing kexec).
>
> Similarly, to support kexec for TDX host, after stopping all remote CPUs
> with cache flushed, the kernel needs to flush cache for the last running
> CPU.
I mentioned this in a previous version. I think you need to give some hint to
where you are going before you start listing facts. Like:
During kexec, WBINVD needs to be executed on each CPU for TDX and SME. On the
remote CPUs this is covered in stop_this_cpu() for both TDX and SME. For the
kexecing CPU, SME handles this in relocate_kernel(). This leaves the TDX case
for the kexec-ing CPU still to implement.
...it first says the overall problem to solve, then explains what is missing in
the current code to solve it. The reader is already thinking of what the
solutions should be and...
>
> Use the existing WBINVD in relocate_kernel() to cover TDX host as well.
...they read the solution just as they are wondering about it. The reader can
feel like the change is aligned with their thinking.
>
> Just do unconditional
>
"Unconditional". Now I'm starting to think about how unconditional wbinvd will
be.
> WBINVD to cover both SME and TDX instead of
> sprinkling additional vendor-specific checks. Kexec is a slow path, and
> the additional WBINVD is acceptable for the sake of simplicity and
> maintainability.
>
> But only do WBINVD for bare-metal
>
But wait, now I'm learning it's not unconditional. I need to re-think what I
just evaluated. And I'm doubting the earlier statements because I just got
surprised.
> because TDX guests and SEV-ES/SEV-SNP
> guests will get an unexpected (and unnecessary) exception (#VE or #VC)
> which the kernel is unable to handle at the time of relocate_kernel()
> since the kernel has torn down the IDT.
>
> Remove the host_mem_enc_active local variable and directly use
> !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) as the argument when calling
> relocate_kernel().
>
Start with the problem here. It just describes another change, and when I
start reading it I'm not sure why.
By problem I mean that host_mem_enc_active doesn't fit the conditional anymore,
so it needs to be changed.
> cpu_feature_enabled() is always inlined, not a
I was just noticing this on the other patch. Actually it could call into some
kasan stuff.
> function call, thus it is safe to use after load_segments() when call
> depth tracking is enabled.
This function call tracking stuff is a wild card at the end. What about
describing the rules this function needs to follow due to call depth tracking,
and explain why the change does that.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Dave Young <dyoung@redhat.com>
> Cc: David Kaplan <david.kaplan@amd.com>
> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> Tested-by: David Kaplan <david.kaplan@amd.com>
> ---
> arch/x86/include/asm/kexec.h | 2 +-
> arch/x86/kernel/machine_kexec_64.c | 14 ++++++--------
> arch/x86/kernel/relocate_kernel_64.S | 15 ++++++++++-----
> 3 files changed, 17 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 8ad187462b68..48c313575262 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -123,7 +123,7 @@ relocate_kernel_fn(unsigned long indirection_page,
> unsigned long pa_control_page,
> unsigned long start_address,
> unsigned int preserve_context,
> - unsigned int host_mem_enc_active);
> + unsigned int bare_metal);
> #endif
> extern relocate_kernel_fn relocate_kernel;
> #define ARCH_HAS_KIMAGE_ARCH
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index a68f5a0a9f37..0e9808eeb63e 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -346,16 +346,9 @@ void __nocfi machine_kexec(struct kimage *image)
> {
> unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
> relocate_kernel_fn *relocate_kernel_ptr;
> - unsigned int host_mem_enc_active;
> int save_ftrace_enabled;
> void *control_page;
>
> - /*
> - * This must be done before load_segments() since if call depth tracking
> - * is used then GS must be valid to make any function calls.
> - */
> - host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> -
> #ifdef CONFIG_KEXEC_JUMP
> if (image->preserve_context)
> save_processor_state();
> @@ -398,6 +391,11 @@ void __nocfi machine_kexec(struct kimage *image)
> *
> * I take advantage of this here by force loading the
> * segments, before I zap the gdt with an invalid value.
> + *
> + * load_segments() resets GS to 0. Don't make any function call
> + * after here since call depth tracking uses per-CPU variables to
> + * operate (relocate_kernel() is explicitly ignored by call depth
> + * tracking).
I think I suggested you should call out the opportunistic change here in the
log. Did you disagree?
> */
> load_segments();
> /*
> @@ -412,7 +410,7 @@ void __nocfi machine_kexec(struct kimage *image)
> virt_to_phys(control_page),
> image->start,
> image->preserve_context,
> - host_mem_enc_active);
> + !cpu_feature_enabled(X86_FEATURE_HYPERVISOR));
>
> #ifdef CONFIG_KEXEC_JUMP
> if (image->preserve_context)
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index b44d8863e57f..dc1a59cd8139 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -50,7 +50,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
> * %rsi pa_control_page
> * %rdx start address
> * %rcx preserve_context
> - * %r8 host_mem_enc_active
> + * %r8 bare_metal
> */
>
> /* Save the CPU context, used for jumping back */
> @@ -107,7 +107,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> /*
> * %rdi indirection page
> * %rdx start address
> - * %r8 host_mem_enc_active
> + * %r8 bare_metal
> * %r9 page table page
> * %r11 preserve_context
> * %r13 original CR4 when relocate_kernel() was invoked
> @@ -156,14 +156,19 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> movq %r9, %cr3
>
> /*
> - * If SME is active, there could be old encrypted cache line
> + * If SME/TDX is active, there could be old encrypted cache line
> * entries that will conflict with the now unencrypted memory
> * used by kexec. Flush the caches before copying the kernel.
> + *
> + * Do WBINVD for bare-metal only to cover both SME and TDX. Doing
> + * WBINVD in guest results in an unexpected exception (#VE or #VC)
> + * for TDX and SEV-ES/SNP guests which then crashes the guest (the
> + * kernel has torn down the IDT).
> */
> testq %r8, %r8
> - jz .Lsme_off
> + jz .Lno_wbinvd
> wbinvd
> -.Lsme_off:
> +.Lno_wbinvd:
>
> call swap_pages
>
On Thu, 2025-03-13 at 23:17 +0000, Edgecombe, Rick P wrote:
> On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> > For both SME and TDX, dirty cachelines with and without the encryption
> > bit(s) of the same physical memory address can coexist and the CPU can
> > flush them back to memory in random order.
> >
>
> A lot going on in this sentence, how about simplifying it:
>
> For SME and TDX, multiple dirty cachelines for the same memory can co-exist, and
> the CPU can flush them back to memory in a random order.
"multiple" isn't accurate at least for SME. How about:
For SME and TDX, dirty cachelines with and without encryption bit(s) of
the same memory can coexist, and the CPU can flush them back to memory
in random order.
>
>
> > During kexec, the caches
> > must be flushed before jumping to the new kernel to avoid silent memory
> > corruption to the new kernel.
>
> During kexec, the caches must be flushed before jumping to the new kernel to
> avoid silent memory corruption when a cacheline with a different encryption
> property is written back over whatever encryption properties the new kernel is
> using.
>
> ...it distributes some of the details from the first sentence into the second.
> Easier to read or no? I'm not sure.
I don't have an opinion. I see no difference.
I tend to keep the original wording since people have already reviewed it.
>
> >
> > The WBINVD in stop_this_cpu() flushes caches for all remote CPUs when
> > they are being stopped. For SME, the WBINVD in relocate_kernel()
> > flushes the cache for the last running CPU (which is doing kexec).
> >
> > Similarly, to support kexec for TDX host, after stopping all remote CPUs
> > with cache flushed, the kernel needs to flush cache for the last running
> > CPU.
>
>
> I mentioned this in a previous version. I think you need to give some hint to
> where you are going before you start listing facts. Like:
>
> During kexec, WBINVD needs to be executed on each CPU for TDX and SME.
>
> On the
> remote CPUs this is covered in stop_this_cpu() for both TDX and SME. For the
> kexecing CPU, SME handles this in relocate_kernel(). This leaves the TDX case
> for the kexec-ing CPU still to implement.
>
> ...it first says the overall problem to solve, then explains what is missing in
> the current code to solve it. The reader is already thinking of what the
> solutions should be and...
Will do.
>
> >
> > Use the existing WBINVD in relocate_kernel() to cover TDX host as well.
>
> ...they read the solution just as they are wondering about it. The reader can
> feel like the change is aligned with their thinking.
>
> >
> > Just do unconditional
> >
>
> "Unconditional". Now I'm starting to think about how unconditional wbinvd will
> be.
>
> > WBINVD to cover both SME and TDX instead of
> > sprinkling additional vendor-specific checks. Kexec is a slow path, and
> > the additional WBINVD is acceptable for the sake of simplicity and
> > maintainability.
> >
> > But only do WBINVD for bare-metal
> >
>
> But wait, now I'm learning it's not unconditional. I need to re-think what I
> just evaluated. And I'm doubting the earlier statements because I just got
> surprised.
Do you mean you got surprised by saying "unconditional" first and then saying
"for bare-metal"? This is literally what the patch title says. I don't see any
problem, but I can mention the "for bare-metal" part when I say
"unconditional" above.
>
> > because TDX guests and SEV-ES/SEV-SNP
> > guests will get an unexpected (and unnecessary) exception (#VE or #VC)
> > which the kernel is unable to handle at the time of relocate_kernel()
> > since the kernel has torn down the IDT.
> >
> > Remove the host_mem_enc_active local variable and directly use
> > !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) as the argument when calling
> > relocate_kernel().
> >
>
> Start with the problem here. It just describes another change, and when I
> start reading it I'm not sure why.
>
> By problem I mean that host_mem_enc_active doesn't fit the conditional anymore,
> so it needs to be changed.
>
> > cpu_feature_enabled() is always inlined, not a
>
> I was just noticing this on the other patch. Actually it could call into some
> kasan stuff.
Can you be more specific, since I am not seeing it?
>
> > function call, thus it is safe to use after load_segments() when call
> > depth tracking is enabled.
>
> This function call tracking stuff is a wild card at the end. What about
> describing the rules this function needs to follow due to call depth tracking,
> and explain why the change does that.
Below is what I had before. Do you think it's better? I replaced it with
the current text since Reinette commented that the original (which
contains the history, etc.) was not necessary.
"
Commit 93c1800b3799 ("x86/kexec: Fix bug with call depth tracking")
moved calling 'cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)' as an argument
of relocate_kernel() to an earlier place before load_segments() by
adding a variable 'host_mem_enc_active'. The reason was the call to
cc_platform_has() after load_segments() caused a fault and system crash
when call depth tracking is active because load_segments() resets GS to
0 but call depth tracking uses per-CPU variable to operate.
Use !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) to check whether the
kernel runs on bare-metal. cpu_feature_enabled() is always inlined,
not a function call, thus it is safe to use it after load_segments()
when call depth tracking is enabled. Remove the 'host_mem_enc_active'
variable and use cpu_feature_enabled() directly as the argument when
calling relocate_kernel().
"
[...]
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -346,16 +346,9 @@ void __nocfi machine_kexec(struct kimage *image)
> > {
> > unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
> > relocate_kernel_fn *relocate_kernel_ptr;
> > - unsigned int host_mem_enc_active;
> > int save_ftrace_enabled;
> > void *control_page;
> >
> > - /*
> > - * This must be done before load_segments() since if call depth tracking
> > - * is used then GS must be valid to make any function calls.
> > - */
> > - host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> > -
> > #ifdef CONFIG_KEXEC_JUMP
> > if (image->preserve_context)
> > save_processor_state();
> > @@ -398,6 +391,11 @@ void __nocfi machine_kexec(struct kimage *image)
> > *
> > * I take advantage of this here by force loading the
> > * segments, before I zap the gdt with an invalid value.
> > + *
> > + * load_segments() resets GS to 0. Don't make any function call
> > + * after here since call depth tracking uses per-CPU variables to
> > + * operate (relocate_kernel() is explicitly ignored by call depth
> > + * tracking).
>
> I think I suggested you should call out the opportunistic change here in the
> log. Did you disagree?
I replied that this was suggested by David Kaplan, but I guess I forgot to reply to the
"opportunistic" part.
I don't think this is an opportunistic change. It's a valid comment after the
'host_mem_enc_active' variable and the comment around it were removed.
On Fri, 2025-03-14 at 09:44 +0000, Huang, Kai wrote:
> On Thu, 2025-03-13 at 23:17 +0000, Edgecombe, Rick P wrote:
> > On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> > > For both SME and TDX, dirty cachelines with and without the encryption
> > > bit(s) of the same physical memory address can coexist and the CPU can
> > > flush them back to memory in random order.
> > >
> >
> > A lot going on in this sentence, how about simplifying it:
> >
> > For SME and TDX, multiple dirty cachelines for the same memory can co-exist, and
> > the CPU can flush them back to memory in a random order.
>
> "multiple" isn't accurate at least for SME. How about:
>
> For SME and TDX, dirty cachelines with and without encryption bit(s) of
> the same memory can coexist, and the CPU can flush them back to memory
> in random order.
Ok.
>
> >
> >
> > > During kexec, the caches
> > > must be flushed before jumping to the new kernel to avoid silent memory
> > > corruption to the new kernel.
> >
> > During kexec, the caches must be flushed before jumping to the new kernel to
> > avoid silent memory corruption when a cacheline with a different encryption
> > property is written back over whatever encryption properties the new kernel is
> > using.
> >
> > ...it distributes some of the details from the first sentence into the second.
> > Easier to read or no? I'm not sure.
>
> I don't have an opinion. I see no difference.
>
> I tend to keep the original wording since people have already reviewed it.
>
> >
> > >
> > > The WBINVD in stop_this_cpu() flushes caches for all remote CPUs when
> > > they are being stopped. For SME, the WBINVD in relocate_kernel()
> > > flushes the cache for the last running CPU (which is doing kexec).
> > >
> > > Similarly, to support kexec for TDX host, after stopping all remote CPUs
> > > with cache flushed, the kernel needs to flush cache for the last running
> > > CPU.
> >
> >
> > I mentioned this in a previous version. I think you need to give some hint to
> > where you are going before you start listing facts. Like:
> >
> > During kexec, WBINVD needs to be executed on each CPU for TDX and SME.
> >
> > On the
> > remote CPUs this is covered in stop_this_cpu() for both TDX and SME. For the
> > kexecing CPU, SME handles this in relocate_kernel(). This leaves the TDX case
> > for the kexec-ing CPU still to implement.
> >
> > ...it first says the overall problem to solve, then explains what is missing in
> > the current code to solve it. The reader is already thinking of what the
> > solutions should be and...
>
> Will do.
>
> >
> > >
> > > Use the existing WBINVD in relocate_kernel() to cover TDX host as well.
> >
> > ...they read the solution just as they are wondering about it. The reader can
> > feel like the change is aligned with their thinking.
> >
> > >
> > > Just do unconditional
> > >
> >
> > "Unconditional". Now I'm starting to think about how unconditional wbinvd will
> > be.
> >
> > > WBINVD to cover both SME and TDX instead of
> > > sprinkling additional vendor-specific checks. Kexec is a slow path, and
> > > the additional WBINVD is acceptable for the sake of simplicity and
> > > maintainability.
> > >
> > > But only do WBINVD for bare-metal
> > >
> >
> > But wait, now I'm learning it's not unconditional. I need to re-think what I
> > just evaluated. And I'm doubting the earlier statements because I just got
> > surprised.
>
> Do you mean you got surprised by saying "unconditional" first and then saying
> "for bare-metal"? This is literally what the patch title says. I don't see any
> problem, but I can mention the "for bare-metal" part when I say
> "unconditional" above.
I don't know how to explain it more. Unconditional means, well, unconditional.
"Only do..." is a condition. It's not unconditional.
So don't say it is or it will be confusing.
>
> >
> > > because TDX guests and SEV-ES/SEV-SNP
> > > guests will get an unexpected (and unnecessary) exception (#VE or #VC)
> > > which the kernel is unable to handle at the time of relocate_kernel()
> > > since the kernel has torn down the IDT.
> > >
> > > Remove the host_mem_enc_active local variable and directly use
> > > !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) as the argument when calling
> > > relocate_kernel().
> > >
> >
> > Start with the problem here. It just describes another change, and when I
> > start reading it I'm not sure why.
> >
> > By problem I mean that host_mem_enc_active doesn't fit the conditional anymore,
> > so it needs to be changed.
> >
> > > cpu_feature_enabled() is always inlined, not a
> >
> > I was just noticing this on the other patch. Actually it could call into some
> > kasan stuff.
>
> Can you be more specific, since I am not seeing it?
You are right, I don't know what I was looking at.
>
> >
> > > function call, thus it is safe to use after load_segments() when call
> > > depth tracking is enabled.
> >
> > This function call tracking stuff is a wild card at the end. What about
> > describing the rules this function needs to follow due to call depth tracking,
> > and explain why the change does that.
>
> Below is what I had before. Do you think it's better? I replaced it with
> the current text since Reinette commented that the original (which
> contains the history, etc.) was not necessary.
>
> "
> Commit 93c1800b3799 ("x86/kexec: Fix bug with call depth tracking")
> moved calling 'cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)' as an argument
> of relocate_kernel() to an earlier place before load_segments() by
> adding a variable 'host_mem_enc_active'. The reason was the call to
> cc_platform_has() after load_segments() caused a fault and system crash
> when call depth tracking is active because load_segments() resets GS to
> 0 but call depth tracking uses per-CPU variable to operate.
That does feel like a lot of details. But I think it still needs more info than
in the current version. Again, just describe the special requirements and why.
You don't need to skip the links, but just jumping right into the history takes
more work to understand what is needed.
>
> Use !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) to check whether the
> kernel runs on bare-metal. cpu_feature_enabled() is always inlined,
> not a function call, thus it is safe to use it after load_segments()
> when call depth tracking is enabled. Remove the 'host_mem_enc_active'
> variable and use cpu_feature_enabled() directly as the argument when
> calling relocate_kernel().
> "
>
>
> [...]
>
> > > --- a/arch/x86/kernel/machine_kexec_64.c
> > > +++ b/arch/x86/kernel/machine_kexec_64.c
> > > @@ -346,16 +346,9 @@ void __nocfi machine_kexec(struct kimage *image)
> > > {
> > > unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
> > > relocate_kernel_fn *relocate_kernel_ptr;
> > > - unsigned int host_mem_enc_active;
> > > int save_ftrace_enabled;
> > > void *control_page;
> > >
> > > - /*
> > > - * This must be done before load_segments() since if call depth tracking
> > > - * is used then GS must be valid to make any function calls.
> > > - */
> > > - host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> > > -
> > > #ifdef CONFIG_KEXEC_JUMP
> > > if (image->preserve_context)
> > > save_processor_state();
> > > @@ -398,6 +391,11 @@ void __nocfi machine_kexec(struct kimage *image)
> > > *
> > > * I take advantage of this here by force loading the
> > > * segments, before I zap the gdt with an invalid value.
> > > + *
> > > + * load_segments() resets GS to 0. Don't make any function call
> > > + * after here since call depth tracking uses per-CPU variables to
> > > + * operate (relocate_kernel() is explicitly ignored by call depth
> > > + * tracking).
> >
> > I think I suggested you should call out the opportunistic change here in the
> > log. Did you disagree?
>
> I replied that this was suggested by David Kaplan, but I guess I forgot to reply to the
> "opportunistic" part.
>
> I don't think this is an opportunistic change. It's a valid comment after the
> 'host_mem_enc_active' variable and the comment around it were removed.
It's valid before too. So it's a separate change. I think based on the recent
history we should be extra careful about slipping extra changes into the series.
I'm not against doing the change in this patch, just want to make sure it's not
perceived as being slipped in.
>
>
>
> >
> > > > --- a/arch/x86/kernel/machine_kexec_64.c
> > > > +++ b/arch/x86/kernel/machine_kexec_64.c
> > > > @@ -346,16 +346,9 @@ void __nocfi machine_kexec(struct kimage *image)
> > > > {
> > > > unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
> > > > relocate_kernel_fn *relocate_kernel_ptr;
> > > > - unsigned int host_mem_enc_active;
> > > > int save_ftrace_enabled;
> > > > void *control_page;
> > > >
> > > > - /*
> > > > - * This must be done before load_segments() since if call depth tracking
> > > > - * is used then GS must be valid to make any function calls.
> > > > - */
> > > > - host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> > > > -
> > > > #ifdef CONFIG_KEXEC_JUMP
> > > > if (image->preserve_context)
> > > > save_processor_state();
> > > > @@ -398,6 +391,11 @@ void __nocfi machine_kexec(struct kimage *image)
> > > > *
> > > > * I take advantage of this here by force loading the
> > > > * segments, before I zap the gdt with an invalid value.
> > > > + *
> > > > + * load_segments() resets GS to 0. Don't make any function call
> > > > + * after here since call depth tracking uses per-CPU variables to
> > > > + * operate (relocate_kernel() is explicitly ignored by call depth
> > > > + * tracking).
> > >
> > > I think I suggested you should call out the opportunistic change here in the
> > > log. Did you disagree?
> >
> > I replied that this was suggested by David Kaplan, but I guess I forgot to reply to the
> > "opportunistic" part.
> >
> > I don't think this is an opportunistic change. It's a valid comment after the
> > 'host_mem_enc_active' variable and the comment around it were removed.
>
> It's valid before too. So it's a separate change.
>
I tried to understand what you mean here, but I am not sure I am following. My
thinking:
Before this code change, in the existing code there's a comment right before the
'host_mem_enc_active' variable to explain why this variable is needed (which is
because of depth tracking).
After we remove 'host_mem_enc_active' and the comment before it, there's no
comment to mention anything about depth tracking here. So comparing to the
existing code, we lost information which is actually helpful.
To still keep the helpful information about the depth tracking, a new comment is
added before load_segments().
Could you explain why this is a separate/extra change?
Nevertheless, are you looking for something like below in the changelog?
With the 'host_mem_enc_active' and the comment around it removed,
the information about depth tracking no longer exists. Expand the
comment around load_segments() to mention that due to depth tracking
no function call can be made after load_segments().
On Wed, 2025-03-19 at 09:57 +0000, Huang, Kai wrote:
> Nevertheless, are you looking for something like below in the changelog?
>
>     With the 'host_mem_enc_active' and the comment around it removed,
>     the information about depth tracking no longer exists. Expand the
>     comment around load_segments() to mention that due to depth tracking
>     no function call can be made after load_segments().

Looks good to me.