[PATCH v5 2/5] x86/kexec: do unconditional WBINVD for bare-metal in relocate_kernel()

Kai Huang posted 5 patches 1 year, 5 months ago
There is a newer version of this series
Posted by Kai Huang 1 year, 5 months ago
Both SME and TDX can leave caches in an incoherent state due to memory
encryption.  During kexec, the caches must be flushed before jumping to
the second kernel to avoid silently corrupting the second kernel's memory.

During kexec, the WBINVD in stop_this_cpu() flushes caches for all
remote cpus when they are being stopped.  For SME, the WBINVD in
relocate_kernel() flushes the cache for the last running cpu (which is
executing the kexec).

Similarly, to support kexec for TDX host, after stopping all remote cpus
with cache flushed, the kernel needs to flush cache for the last running
cpu.

Use the existing WBINVD in relocate_kernel() to cover TDX host as well.

However, instead of sprinkling around vendor-specific checks, just do
unconditional WBINVD to cover both SME and TDX.  Kexec is not a fast path
so having one additional WBINVD for platforms w/o SME/TDX is acceptable.

But only do WBINVD for bare-metal because TDX guests and SEV-ES/SEV-SNP
guests would get an unexpected (and unnecessary) exception (#VE or #VC)
which the kernel is unable to handle at this stage.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
---

v4 -> v5:
 - Add Tom's tag

v3 -> v4:
 - Use "exception (#VE or #VC)" for TDX and SEV-ES/SEV-SNP in changelog
   and comments.  (Kirill, Tom)
 - "Save the bare_metal" -> "Save the bare_metal flag" (Tom)
 - Point out "WBINVD is not necessary for TDX and SEV-ES/SEV-SNP guests"
   in the comment.  (Tom)

v2 -> v3:
 - Change to only do WBINVD for bare metal

---
 arch/x86/include/asm/kexec.h         |  2 +-
 arch/x86/kernel/machine_kexec_64.c   |  2 +-
 arch/x86/kernel/relocate_kernel_64.S | 19 +++++++++++++++----
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index ae5482a2f0ca..b3429c70847d 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -128,7 +128,7 @@ relocate_kernel(unsigned long indirection_page,
 		unsigned long page_list,
 		unsigned long start_address,
 		unsigned int preserve_context,
-		unsigned int host_mem_enc_active);
+		unsigned int bare_metal);
 #endif
 
 #define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 9c9ac606893e..07ca9d3361a3 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -392,7 +392,7 @@ void machine_kexec(struct kimage *image)
 				       (unsigned long)page_list,
 				       image->start,
 				       image->preserve_context,
-				       host_mem_enc_active);
+				       !boot_cpu_has(X86_FEATURE_HYPERVISOR));
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 042c9a0334e9..a1a8a79d6b78 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -52,7 +52,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
 	 * %rsi page_list
 	 * %rdx start address
 	 * %rcx preserve_context
-	 * %r8  host_mem_enc_active
+	 * %r8  bare_metal
 	 */
 
 	/* Save the CPU context, used for jumping back */
@@ -80,7 +80,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
 	pushq $0
 	popfq
 
-	/* Save SME active flag */
+	/* Save the bare_metal flag */
 	movq	%r8, %r12
 
 	/*
@@ -161,9 +161,20 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	movq	%r9, %cr3
 
 	/*
-	 * If SME is active, there could be old encrypted cache line
+	 * The kernel could leave caches in incoherent state on SME/TDX
+	 * capable platforms.  Just do unconditional WBINVD to avoid
+	 * silent memory corruption to the new kernel for these platforms.
+	 *
+	 * For SME, need to flush cache here before copying the kernel.
+	 * When it is active, there could be old encrypted cache line
 	 * entries that will conflict with the now unencrypted memory
-	 * used by kexec. Flush the caches before copying the kernel.
+	 * used by kexec.
+	 *
+	 * Do WBINVD for bare-metal only to cover both SME and TDX.  It
+	 * isn't necessary to perform a WBINVD in a guest and performing
+	 * one could result in an exception (#VE or #VC) for a TDX or
+	 * SEV-ES/SEV-SNP guest that can crash the guest since, at this
+	 * stage, the kernel has torn down the IDT.
 	 */
 	testq	%r12, %r12
 	jz .Lsme_off
-- 
2.45.2
Re: [PATCH v5 2/5] x86/kexec: do unconditional WBINVD for bare-metal in relocate_kernel()
Posted by Borislav Petkov 1 year, 5 months ago
On Fri, Aug 16, 2024 at 12:29:18AM +1200, Kai Huang wrote:
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 9c9ac606893e..07ca9d3361a3 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -392,7 +392,7 @@ void machine_kexec(struct kimage *image)
>  				       (unsigned long)page_list,
>  				       image->start,
>  				       image->preserve_context,
> -				       host_mem_enc_active);
> +				       !boot_cpu_has(X86_FEATURE_HYPERVISOR));

Every time you feel the need to check an X86_FEATURE_ flag, make sure you use
cpu_feature_enabled().

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v5 2/5] x86/kexec: do unconditional WBINVD for bare-metal in relocate_kernel()
Posted by Huang, Kai 1 year, 5 months ago

On 5/09/2024 3:30 am, Borislav Petkov wrote:
> On Fri, Aug 16, 2024 at 12:29:18AM +1200, Kai Huang wrote:
>> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
>> index 9c9ac606893e..07ca9d3361a3 100644
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -392,7 +392,7 @@ void machine_kexec(struct kimage *image)
>>   				       (unsigned long)page_list,
>>   				       image->start,
>>   				       image->preserve_context,
>> -				       host_mem_enc_active);
>> +				       !boot_cpu_has(X86_FEATURE_HYPERVISOR));
> 
> Every time you feel the need to check an X86_FEATURE_ flag, make sure you use
> cpu_feature_enabled().
> 

Thanks for the review.  Yes, will do.
Re: [PATCH v5 2/5] x86/kexec: do unconditional WBINVD for bare-metal in relocate_kernel()
Posted by Huang, Kai 1 year, 5 months ago
>  
>  #define ARCH_HAS_KIMAGE_ARCH
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 9c9ac606893e..07ca9d3361a3 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -392,7 +392,7 @@ void machine_kexec(struct kimage *image)
>  				       (unsigned long)page_list,
>  				       image->start,
>  				       image->preserve_context,
> -				       host_mem_enc_active);
> +				       !boot_cpu_has(X86_FEATURE_HYPERVISOR));
>  
> 

LKP reported the following warning:

All warnings (new ones prefixed by >>):

   arch/x86/kernel/machine_kexec_64.c: In function 'machine_kexec':
>> arch/x86/kernel/machine_kexec_64.c:325:22: warning: variable 'host_mem_enc_active' set but not used [-Wunused-but-set-variable]
     325 |         unsigned int host_mem_enc_active;
         |                      ^~~~~~~~~~~~~~~~~~~

This happened because, while rebasing, I didn't pay enough attention to the
recent code from commit

  93c1800b3799f ("x86/kexec: Fix bug with call depth tracking")

which introduced the host_mem_enc_active variable to avoid a
cc_platform_has() function call after load_segments(), resolving a problem
when call depth tracking is on.

A 100% safe way is to replace 

	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);

... with

	bare_metal = !boot_cpu_has(X86_FEATURE_HYPERVISOR);

but I think we can just remove that variable and directly use

	!boot_cpu_has(X86_FEATURE_HYPERVISOR)

as the last argument to relocate_kernel(), because AFAICT the above
X86_FEATURE_HYPERVISOR bit test always generates inline code, so there is
no additional CALL/RET.

The incremental diff will be:

--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -331,16 +331,9 @@ static void kexec_save_processor_start(struct kimage *image)
 void machine_kexec(struct kimage *image)
 {
        unsigned long page_list[PAGES_NR];
-       unsigned int host_mem_enc_active;
        int save_ftrace_enabled;
        void *control_page;
 
-       /*
-        * This must be done before load_segments() since if call depth tracking
-        * is used then GS must be valid to make any function calls.
-        */
-       host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
-

Am I missing anything?

I'll send out a new version with the above and add some explanation to the
changelog if I don't see any other feedback on the rest of the TDX patches
in the coming days.  Thanks!