Both SME and TDX can leave caches in an incoherent state due to memory
encryption. During kexec, the caches must be flushed before jumping to
the second kernel to avoid silently corrupting memory used by the second
kernel.

During kexec, the WBINVD in stop_this_cpu() flushes the caches of all
remote CPUs as they are being stopped. For SME, the WBINVD in
relocate_kernel() flushes the cache of the last running CPU (the one
executing the kexec).

Similarly for TDX, after all remote CPUs have been stopped with their
caches flushed, the kernel needs to flush the cache of the last running
CPU to support kexec.

Make the WBINVD in relocate_kernel() unconditional to cover both SME
and TDX.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Dave Young <dyoung@redhat.com>
---
arch/x86/include/asm/kexec.h | 3 +--
arch/x86/kernel/machine_kexec_64.c | 3 +--
arch/x86/kernel/relocate_kernel_64.S | 13 +++----------
3 files changed, 5 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 91ca9a9ee3a2..9754794242ad 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -127,8 +127,7 @@ unsigned long
relocate_kernel(unsigned long indirection_page,
unsigned long page_list,
unsigned long start_address,
- unsigned int preserve_context,
- unsigned int host_mem_enc_active);
+ unsigned int preserve_context);
#endif
#define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index bc0a5348b4a6..b9a632479b36 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -357,8 +357,7 @@ void machine_kexec(struct kimage *image)
image->start = relocate_kernel((unsigned long)image->head,
(unsigned long)page_list,
image->start,
- image->preserve_context,
- cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
+ image->preserve_context);
#ifdef CONFIG_KEXEC_JUMP
if (image->preserve_context)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 56cab1bb25f5..66b628686dbc 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -50,7 +50,6 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
* %rsi page_list
* %rdx start address
* %rcx preserve_context
- * %r8 host_mem_enc_active
*/
/* Save the CPU context, used for jumping back */
@@ -78,9 +77,6 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
pushq $0
popfq
- /* Save SME active flag */
- movq %r8, %r12
-
/*
* get physical address of control page now
* this is impossible after page table switch
@@ -160,14 +156,11 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
movq %r9, %cr3
/*
- * If SME is active, there could be old encrypted cache line
- * entries that will conflict with the now unencrypted memory
- * used by kexec. Flush the caches before copying the kernel.
+ * The kernel could leave caches in incoherent state on SME/TDX
+ * capable platforms. Just do unconditional cache flush to avoid
+ * silent memory corruption to the new kernel for these platforms.
*/
- testq %r12, %r12
- jz 1f
wbinvd
-1:
movq %rcx, %r11
call swap_pages
--
2.34.1
On Tue, Mar 19, 2024 at 01:48:45AM +0000, Kai Huang wrote:
> [...]
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Suggested-by: Borislav Petkov <bp@alien8.de>

Well, I suggested what you have in patch 1 but not this.

You can't just slap tags willy nilly to patches.

--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 20/03/2024 4:41 am, Borislav Petkov wrote:
> On Tue, Mar 19, 2024 at 01:48:45AM +0000, Kai Huang wrote:
>> [...]
>> Signed-off-by: Kai Huang <kai.huang@intel.com>
>> Suggested-by: Borislav Petkov <bp@alien8.de>
>
> Well, I suggested what you have in patch 1 but not this.
>
> You can't just slap tags willy nilly to patches.

Apologies. Will remove.
On Tue, Mar 19, 2024 at 01:48:45AM +0000, Kai Huang wrote:
> [...]
> Make the WBINVD in relocate_kernel() unconditional to cover both SME
> and TDX.

Nope. It breaks TDX guest. WBINVD triggers #VE for TDX guests.

--
Kiryl Shutsemau / Kirill A. Shutemov
On 3/19/24 06:13, Kirill A. Shutemov wrote:
> On Tue, Mar 19, 2024 at 01:48:45AM +0000, Kai Huang wrote:
>> [...]
>> Make the WBINVD in relocate_kernel() unconditional to cover both SME
>> and TDX.
>
> Nope. It breaks TDX guest. WBINVD triggers #VE for TDX guests.

Ditto for SEV-ES/SEV-SNP, a #VC is generated and crashes the guest.

Thanks,
Tom
On 20/03/2024 3:38 am, Tom Lendacky wrote:
> On 3/19/24 06:13, Kirill A. Shutemov wrote:
>> [...]
>> Nope. It breaks TDX guest. WBINVD triggers #VE for TDX guests.
>
> Ditto for SEV-ES/SEV-SNP, a #VC is generated and crashes the guest.

Oh I forgot these.

Hi Kirill,

Then I think patch 1 will also break TDX guest after your series to enable
multiple cpus for the second kernel after kexec()?

Hi Tom,

I am not aware of kexec() support status for SEV-ES/SEV-SNP guests. Does
patch 1 break them?
On 3/19/24 16:20, Huang, Kai wrote:
> [...]
> Hi Tom,
>
> I am not aware of kexec() support status for SEV-ES/SEV-SNP guests. Does
> patch 1 break them?

SNP guests can kexec with some patches that are currently in process
around shared to private memory conversions. ES guests can only kexec
with a single vCPU. There was a recent patch series to add support for
multiple vCPUs.

Patch #1 doesn't break either ES or SNP because we still have an IDT and
traditional kernel addressing in place, so the #VC can be handled.

Whereas patch #2 has switched to identity mapping and removed the IDT,
so a #VC causes a triple fault.

Thanks,
Tom
>> Hi Tom,
>>
>> I am not aware of kexec() support status for SEV-ES/SEV-SNP guests.
>> Does patch 1 break them?
>
> SNP guests can kexec with some patches that are currently in process
> around shared to private memory conversions. ES guests can only kexec
> with a single vCPU. There was a recent patch series to add support for
> multiple vCPUs.
>
> Patch #1 doesn't break either ES or SNP because we still have an IDT and
> traditional kernel addressing in place, so the #VC can be handled.

How about plain SEV guest?

> Whereas patch #2 has switched to identity mapping and removed the IDT,
> so a #VC causes a triple fault.

That makes sense. Thanks.

Hi Kirill,

Does TDX guest have similar behaviour -- that WBINVD in stop_this_cpu()
can be handled although it causes #VE, while WBINVD in relocate_kernel()
will just triple fault the guest?
On Thu, Mar 21, 2024 at 09:48:28AM +1300, Huang, Kai wrote:
> [...]
> Does TDX guest have similar behaviour -- that WBINVD in stop_this_cpu()
> can be handled although it causes #VE, while WBINVD in relocate_kernel()
> will just triple fault the guest?

No. We never handle WBINVD #VE. Guest cannot handle WBINVD itself and the
only option is to ask host to do this. We cannot guarantee host will do
anything useful with the request. I guess it can be potential attack
vector if host strategically ignores WBINVD to induce bad guest behaviour.

And it is not good from host PoV either. If it does WBINVD on every guest
request we get guest->host DoS attack possibility.

Tom, I am curious, how do you deal with these problems?

--
Kiryl Shutsemau / Kirill A. Shutemov
On 3/20/24 18:10, Kirill A. Shutemov wrote:
> On Thu, Mar 21, 2024 at 09:48:28AM +1300, Huang, Kai wrote:
>> [...]
>
> No. We never handle WBINVD #VE. Guest cannot handle WBINVD itself and the
> only option is to ask host to do this. We cannot guarantee host will do

Is the WBINVD performed or ignored in that case?

> anything useful with the request. I guess it can be potential attack
> vector if host strategically ignores WBINVD to induce bad guest behaviour.

With SNP, memory is coherent so there isn't a need for a WBINVD within a
guest and so issuing it should not be an issue whether the hypervisor
performs the operation or not. I don't know what can happen in the case
where, say, you have a non-coherent TDISP device attached or such, but
that would be very unusual/unlikely.

> And it is not good from host PoV either. If it does WBINVD on every guest
> request we get guest->host DoS attack possibility.

Yeah, that can happen today, regardless of the type of VM running.

> Tom, I am curious, how do you deal with these problems?

If the WBINVD is being intercepted, then it will generate a #VC and we use
the GHCB protocol to communicate that back to the hypervisor to handle.

Thanks,
Tom
On Thu, Mar 21, 2024 at 04:02:11PM -0500, Tom Lendacky wrote:
> On 3/20/24 18:10, Kirill A. Shutemov wrote:
> > [...]
>
> Is the WBINVD performed or ignored in that case?

We crash the guest if it tries to use WBINVD. There's no legitimate reason
for it.

> With SNP, memory is coherent so there isn't a need for a WBINVD within a
> guest and so issuing it should not be an issue whether the hypervisor
> performs the operation or not. I don't know what can happen in the case
> where, say, you have a non-coherent TDISP device attached or such, but
> that would be very unusual/unlikely.

Looks like SNP is in the same position as TDX.

> > And it is not good from host PoV either. If it does WBINVD on every guest
> > request we get guest->host DoS attack possibility.
>
> Yeah, that can happen today, regardless of the type of VM running.
>
> > Tom, I am curious, how do you deal with these problems?
>
> If the WBINVD is being intercepted, then it will generate a #VC and we use
> the GHCB protocol to communicate that back to the hypervisor to handle.

I would argue that forwarding it to hypervisor is worse than crashing. It
gives false sense of doing something. Hypervisor is outside TCB and
considered hostile.

--
Kiryl Shutsemau / Kirill A. Shutemov
On 3/22/24 05:40, Kirill A. Shutemov wrote:
> On Thu, Mar 21, 2024 at 04:02:11PM -0500, Tom Lendacky wrote:
>> [...]
>> If the WBINVD is being intercepted, then it will generate a #VC and we use
>> the GHCB protocol to communicate that back to the hypervisor to handle.
>
> I would argue that forwarding it to hypervisor is worse than crashing. It
> gives false sense of doing something. Hypervisor is outside TCB and
> considered hostile.

Since the memory is coherent, it really doesn't matter what the hypervisor
does in regards to WBINVD (ignore it or perform it). And the hypervisor
can do anything it wants on any exit, regardless of this intercept.

Thanks,
Tom
On Fri, 2024-03-22 at 09:50 -0500, Tom Lendacky wrote:
> On 3/22/24 05:40, Kirill A. Shutemov wrote:
> > [...]
> > I would argue that forwarding it to hypervisor is worse than crashing. It
> > gives false sense of doing something. Hypervisor is outside TCB and
> > considered hostile.
>
> Since the memory is coherent, it really doesn't matter what the hypervisor
> does in regards to WBINVD (ignore it or perform it). And the hypervisor
> can do anything it wants on any exit, regardless of this intercept.

I guess it makes sense to not handle the #VE due to WBINVD, in the sense that
the guest shouldn't do WBINVD when memory is coherent from the guest's view,
although it would be harmless to do the WBINVD and let the hypervisor handle
it.

Anyway, the current TDX guest doesn't handle #VE due to WBINVD, so I think for
simplicity we just don't do WBINVD in stop_this_cpu() and relocate_kernel() for
both TDX and SNP/SEV-ES guests.

As mentioned in my earlier reply, we can achieve this by skipping WBINVD when
CC_ATTR_GUEST_MEM_ENCRYPT is true:

	if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
		native_wbinvd();

(This skips WBINVD for plain SEV guests too, but that is exactly the current
behaviour of the upstream code, so I don't see any problem.)

Alternatively, we can have a dedicated CPU feature flag such as
X86_FEATURE_NO_WBINVD:

	if (!boot_cpu_has(X86_FEATURE_NO_WBINVD))
		native_wbinvd();

Or, we can just change our mindset to "do unconditional WBINVD, but not in a
virtualized environment":

	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
		native_wbinvd();

Hi Boris/Dave/Tom/Kirill,

Do you have any comments?
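To make the first option above concrete, here is a minimal sketch of how the
CC_ATTR_GUEST_MEM_ENCRYPT check could be shared by the C call sites discussed
in this thread. This is only an illustration, not part of the actual patch;
cc_platform_has() and native_wbinvd() are existing kernel APIs, while the
helper name is made up:

	/*
	 * Hypothetical sketch only, assuming the CC_ATTR_GUEST_MEM_ENCRYPT
	 * variant discussed above: flush caches on bare metal (covers SME
	 * and TDX hosts), but skip the flush inside CoCo guests
	 * (TDX, SEV-ES/SEV-SNP), where an intercepted WBINVD would raise
	 * a #VE/#VC.
	 */
	static inline void kexec_maybe_wbinvd(void)
	{
		if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
			native_wbinvd();
	}

Note that relocate_kernel() itself is assembly running after the switch to the
identity mapping, so no C helper is callable there; the equivalent decision
would have to be made before jumping into it, for example by passing the
result of the check in a register, as the current SME code already does.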
On Mon, Mar 25, 2024 at 01:04:47PM +0000, Huang, Kai wrote:
> [...]
> Or, we can just change our mindset to "do unconditional WBINVD, but not in a
> virtualized environment":
>
> 	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
> 		native_wbinvd();

ACPI_FLUSH_CPU_CACHE() uses the cpu_feature_enabled(X86_FEATURE_HYPERVISOR)
check.

--
Kiryl Shutsemau / Kirill A. Shutemov
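For reference, the ACPI_FLUSH_CPU_CACHE() definition Kirill points to follows
roughly the shape sketched below (paraphrased; see arch/x86/include/asm/acenv.h
in the tree for the authoritative version):

	/*
	 * Roughly the in-tree pattern referenced above: skip the cache flush
	 * when running under a hypervisor, because a guest's ACPI sleep-state
	 * transition does not affect the host and cannot lose dirty cache
	 * lines the way a bare-metal transition can.
	 */
	#define ACPI_FLUSH_CPU_CACHE()					\
	do {								\
		if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))	\
			wbinvd();					\
	} while (0)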
> > [...]
> > Or, we can just change our mindset to "do unconditional WBINVD, but not in
> > a virtualized environment":
> >
> > 	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
> > 		native_wbinvd();
>
> ACPI_FLUSH_CPU_CACHE() uses the cpu_feature_enabled(X86_FEATURE_HYPERVISOR)
> check.

Thanks for pointing this out. Yeah, I think skipping WBINVD in a virtualized
environment makes sense. Will do it this way.
On 3/20/24 15:48, Huang, Kai wrote:
> [...]
>> Patch #1 doesn't break either ES or SNP because we still have an IDT and
>> traditional kernel addressing in place, so the #VC can be handled.
>
> How about plain SEV guest?

A plain SEV guest is fine, since WBINVD is intercepted and would just
exit to the hypervisor (#VC doesn't happen with plain SEV).

Thanks,
Tom
On 21/03/2024 10:06 am, Tom Lendacky wrote:
> [...]
>> How about plain SEV guest?
>
> A plain SEV guest is fine, since WBINVD is intercepted and would just
> exit to the hypervisor (#VC doesn't happen with plain SEV).

Thanks for the info.
On Wed, Mar 20, 2024 at 10:20:50AM +1300, Huang, Kai wrote:
> [...]
> Then I think patch 1 will also break TDX guest after your series to enable
> multiple cpus for the second kernel after kexec()?

Well, not exactly.

My patchset overrides stop_this_cpu() with own implementation for MADT
wakeup method that doesn't have WBINVD. So the patch doesn't break
anything, but if in the future TDX (or SEV) host would use MADT wake up
method instead of IPI we will get back to the problem with missing
WBINVD.

I don't know if we care. There's no reason for host to use MADT wake up
method.

--
Kiryl Shutsemau / Kirill A. Shutemov
On 20/03/2024 1:19 pm, Kirill A. Shutemov wrote:
> [...]
> My patchset overrides stop_this_cpu() with own implementation for MADT
> wakeup method that doesn't have WBINVD. So the patch doesn't break
> anything,

Well, your callback actually only gets called _after_ this WBINVD, so...

I guess I should have checked that by myself. :-)

> but if in the future TDX (or SEV) host would use MADT wake up
> method instead of IPI we will get back to the problem with missing
> WBINVD.
>
> I don't know if we care. There's no reason for host to use MADT wake up
> method.

I don't think MADT wake up will be used in any native environment.

Anyway, regardless of whether patch 1 will break TDX/SEV-ES/SEV-SNP guests,
I think to resolve this we can simply adjust our mindset from ...

	"do unconditional WBINVD"

to ...

	"do unconditional WBINVD when it can be done safely"

For now, AFAICT, only TDX guests and SEV-ES/SEV-SNP guests are such guests.

And they all report the CC_ATTR_GUEST_MEM_ENCRYPT flag as true, so we can
change to only do WBINVD when the kernel does not see that flag:

	if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
		native_wbinvd();

Alternatively, we can have a dedicated X86_FEATURE_NO_WBINVD and get it set
for TDX/SEV-ES/SEV-SNP guests (and any other guests where this holds), and
do:

	if (!boot_cpu_has(X86_FEATURE_NO_WBINVD))
		native_wbinvd();

It seems the first one is too generic (for any CoCo VMs), and the second one
is better.

Any comments?

Hi Boris/Dave,

Do you have any comments?
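As a rough illustration of the second option, a hypothetical
X86_FEATURE_NO_WBINVD flag (this feature bit does not exist today and would
have to be added as a synthetic flag) could be forced on during early
CoCo-guest setup, for example:

	/*
	 * Hypothetical sketch only: X86_FEATURE_NO_WBINVD is not an existing
	 * feature bit.  Force it on for CoCo guests so that callers such as
	 * stop_this_cpu() can use a single boot_cpu_has() check before
	 * issuing WBINVD.
	 */
	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
		setup_force_cpu_cap(X86_FEATURE_NO_WBINVD);

Callers would then use the boot_cpu_has(X86_FEATURE_NO_WBINVD) check shown
above.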
On Wed, Mar 20, 2024 at 01:45:32PM +1300, Huang, Kai wrote:
> [...]
> It seems the first one is too generic (for any CoCo VMs), and the second one
> is better.
>
> Any comments?

I like the cc_platform_has() variant better. There's no good reason to invent
an X86_FEATURE if we don't have cases outside of CC.

--
Kiryl Shutsemau / Kirill A. Shutemov