[v3] TDX host: kexec() support

[PATCH v3 1/5] x86/kexec: do unconditional WBINVD for bare-metal in stop_this_cpu()

Posted by Kai Huang 1 year, 10 months ago

TL;DR:

Do unconditional WBINVD in stop_this_cpu() on bare metal to cover kexec
support for both AMD SME and Intel TDX.  There _was_ an issue that
prevented doing so, but it has since been fixed.

Long version:

Both SME and TDX can leave caches in an incoherent state due to memory
encryption.  During kexec, the caches must be flushed before jumping to
the second kernel to avoid silent memory corruption of the second kernel.

Currently, for SME the kernel only does WBINVD in stop_this_cpu() when
the kernel determines the hardware supports SME.  To support TDX, one
option is to extend that specific check to cover both SME and TDX.

However, instead of sprinkling around vendor-specific checks, it's
better to just do unconditional WBINVD.  Kexec() is a slow path, and it
is acceptable to have an additional WBINVD in order to have simple
and easy to maintain code.

But only do WBINVD on bare metal, because TDX guests and SEV-ES/SEV-SNP
guests will get an unexpected (and unnecessary) exception (#VE or #VC)
that they may not be able to handle (e.g., a TDX guest panics when it
gets #VE due to WBINVD).
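
The change in the decision can be sketched in plain C.  This is a minimal
userspace illustration, not the kernel code: struct cpu_info and both
helper names are hypothetical, and the caller supplies the CPUID values
instead of executing CPUID.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the CPUID state that the two checks look at. */
struct cpu_info {
	uint32_t extended_cpuid_level;	/* highest supported extended leaf */
	uint32_t leaf_8000001f_eax;	/* EAX of leaf 0x8000001f, if it exists */
	bool hypervisor;		/* CPUID.1:ECX[31], set when running as a guest */
};

/* Old behaviour: flush only when the hardware reports SME support. */
static inline bool wbinvd_needed_old(const struct cpu_info *c)
{
	return c->extended_cpuid_level >= 0x8000001fu &&
	       (c->leaf_8000001f_eax & 1u);
}

/* Proposed behaviour: flush on bare metal, regardless of vendor. */
static inline bool wbinvd_needed_new(const struct cpu_info *c)
{
	return !c->hypervisor;
}
```

With these helpers, a bare-metal Intel host without the SME leaf is skipped
by the old check but flushed by the new one, while any guest is skipped by
the new check.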

Note:

Historically, there _was_ an issue preventing doing unconditional WBINVD
but that has been fixed.

When SME kexec() support was initially added in commit

  bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")

WBINVD was done unconditionally.  However, it was later reported that
various Intel systems would hang or reset because of that commit.

To try to fix this, a later commit

  f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")

then changed the code to only do WBINVD when the hardware supports SME.

While this commit made the reported issues go away, it didn't pinpoint
the root cause.  It also didn't handle a corner case[*] correctly, which
led to the discovery of the root cause and the final fix in commit

  1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")

See [1][2] for more information.

Further testing of unconditional WBINVD on top of the above fix, on the
problematic machines (where the issues were originally reported),
confirmed that the issues could no longer be reproduced.

See [3][4] for more information.

Therefore, it is safe to do unconditional WBINVD now.

[*] The commit didn't check whether the CPUID leaf is available.
Reading an unsupported CPUID leaf on Intel returns garbage, which
resulted in an unintended WBINVD that caused some issues (followed by
the analysis that revealed the final root cause).  The corner case was
independently fixed by commit

  9b040453d444: ("x86/smp: Dont access non-existing CPUID leaf")
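
The corner case in [*] can be illustrated with a small userspace sketch.
Everything here is an assumption made for illustration: the cpuid_eax_fn
callback stands in for the CPUID instruction, and the fake implementation
simply returns all-ones to model the "garbage" an unsupported leaf may
return.

```c
#include <stdbool.h>
#include <stdint.h>

#define SME_LEAF 0x8000001fu

/* Hypothetical stand-in for CPUID; real code would execute the instruction. */
typedef uint32_t (*cpuid_eax_fn)(uint32_t leaf);

/* Models a CPU that returns unrelated data for a leaf it does not support. */
static inline uint32_t fake_cpuid_eax_garbage(uint32_t leaf)
{
	(void)leaf;
	return 0xffffffffu;
}

/* Unguarded: trusts whatever comes back, even if the leaf does not exist. */
static inline bool sme_supported_unguarded(cpuid_eax_fn cpuid_eax)
{
	return cpuid_eax(SME_LEAF) & 1u;
}

/* Guarded: only query the leaf when the CPU reports that it exists. */
static inline bool sme_supported_guarded(uint32_t max_ext_leaf,
					 cpuid_eax_fn cpuid_eax)
{
	return max_ext_leaf >= SME_LEAF && (cpuid_eax(SME_LEAF) & 1u);
}
```

On a CPU whose highest extended leaf is below 0x8000001f, the unguarded
variant reports SME (and would trigger the WBINVD), while the guarded one
does not.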

[1]: https://lore.kernel.org/lkml/CALu+AoQKmeixJdkO07t7BtttN7v3RM4_aBKi642bQ3fTBbSAVg@mail.gmail.com/T/#m300f3f9790850b5daa20a71abcc200ae8d94a12a
[2]: https://lore.kernel.org/lkml/CALu+AoQKmeixJdkO07t7BtttN7v3RM4_aBKi642bQ3fTBbSAVg@mail.gmail.com/T/#ma7263a7765483db0dabdeef62a1110940e634846
[3]: https://lore.kernel.org/lkml/CALu+AoQKmeixJdkO07t7BtttN7v3RM4_aBKi642bQ3fTBbSAVg@mail.gmail.com/T/#mc043191f2ff860d649c8466775dc61ac1e0ae320
[4]: https://lore.kernel.org/lkml/CALu+AoQKmeixJdkO07t7BtttN7v3RM4_aBKi642bQ3fTBbSAVg@mail.gmail.com/T/#md23f1a8f6afcc59fa2b0ac1967f18e418e24347c

Signed-off-by: Kai Huang <kai.huang@intel.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Dave Young <dyoung@redhat.com>
---

v2 -> v3:
 - Change to only do WBINVD for bare metal

---
 arch/x86/kernel/process.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b8441147eb5e..5ba8a9c1e47a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -813,18 +813,16 @@ void __noreturn stop_this_cpu(void *dummy)
 	mcheck_cpu_clear(c);
 
 	/*
-	 * Use wbinvd on processors that support SME. This provides support
-	 * for performing a successful kexec when going from SME inactive
-	 * to SME active (or vice-versa). The cache must be cleared so that
-	 * if there are entries with the same physical address, both with and
-	 * without the encryption bit, they don't race each other when flushed
-	 * and potentially end up with the wrong entry being committed to
-	 * memory.
+	 * The kernel could leave caches in incoherent state on SME/TDX
+	 * capable platforms.  Flush cache to avoid silent memory
+	 * corruption for these platforms.
 	 *
-	 * Test the CPUID bit directly because the machine might've cleared
-	 * X86_FEATURE_SME due to cmdline options.
+	 * stop_this_cpu() is not a fast path, just do unconditional
+	 * WBINVD for simplicity.  But only do WBINVD for bare-metal
+	 * as TDX guests and SEV-ES/SEV-SNP guests will get unexpected
+	 * (and unnecessary) #VE and may unable to handle.
 	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		native_wbinvd();
 
 	/*
-- 
2.43.2

Re: [PATCH v3 1/5] x86/kexec: do unconditional WBINVD for bare-metal in stop_this_cpu()

Posted by Tom Lendacky 1 year, 10 months ago

On 4/7/24 07:44, Kai Huang wrote:

> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index b8441147eb5e..5ba8a9c1e47a 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -813,18 +813,16 @@ void __noreturn stop_this_cpu(void *dummy)
>   	mcheck_cpu_clear(c);
>   
>   	/*
> -	 * Use wbinvd on processors that support SME. This provides support
> -	 * for performing a successful kexec when going from SME inactive
> -	 * to SME active (or vice-versa). The cache must be cleared so that
> -	 * if there are entries with the same physical address, both with and
> -	 * without the encryption bit, they don't race each other when flushed
> -	 * and potentially end up with the wrong entry being committed to
> -	 * memory.
> +	 * The kernel could leave caches in incoherent state on SME/TDX
> +	 * capable platforms.  Flush cache to avoid silent memory
> +	 * corruption for these platforms.
>   	 *
> -	 * Test the CPUID bit directly because the machine might've cleared
> -	 * X86_FEATURE_SME due to cmdline options.
> +	 * stop_this_cpu() is not a fast path, just do unconditional
> +	 * WBINVD for simplicity.  But only do WBINVD for bare-metal
> +	 * as TDX guests and SEV-ES/SEV-SNP guests will get unexpected
> +	 * (and unnecessary) #VE and may unable to handle.

In addition to Kirill's comment on #VE...

This last part of the comment reads a bit odd since you say 
unconditional and then say only do WBINVD for bare-metal. Maybe 
something like this makes it a bit clearer?:

For TDX and SEV-ES/SEV-SNP guests, a WBINVD may cause an exception (#VE 
or #VC). However, all exception handling has been torn down at this 
point, so this would cause the guest to crash. Since memory within these 
types of guests is coherent only issue the WBINVD on bare-metal.

And you can expand the comment block out to at least 80 characters to 
make it more compact.

Thanks,
Tom

>   	 */
> -	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> +	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
>   		native_wbinvd();
>   
>   	/*

Re: [PATCH v3 1/5] x86/kexec: do unconditional WBINVD for bare-metal in stop_this_cpu()

Posted by Tom Lendacky 1 year, 10 months ago

On 4/10/24 11:08, Tom Lendacky wrote:
> On 4/7/24 07:44, Kai Huang wrote:
> 
>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>> index b8441147eb5e..5ba8a9c1e47a 100644
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -813,18 +813,16 @@ void __noreturn stop_this_cpu(void *dummy)
>>       mcheck_cpu_clear(c);
>>       /*
>> -     * Use wbinvd on processors that support SME. This provides support
>> -     * for performing a successful kexec when going from SME inactive
>> -     * to SME active (or vice-versa). The cache must be cleared so that
>> -     * if there are entries with the same physical address, both with 
>> and
>> -     * without the encryption bit, they don't race each other when 
>> flushed
>> -     * and potentially end up with the wrong entry being committed to
>> -     * memory.
>> +     * The kernel could leave caches in incoherent state on SME/TDX
>> +     * capable platforms.  Flush cache to avoid silent memory
>> +     * corruption for these platforms.
>>        *
>> -     * Test the CPUID bit directly because the machine might've cleared
>> -     * X86_FEATURE_SME due to cmdline options.
>> +     * stop_this_cpu() is not a fast path, just do unconditional
>> +     * WBINVD for simplicity.  But only do WBINVD for bare-metal
>> +     * as TDX guests and SEV-ES/SEV-SNP guests will get unexpected
>> +     * (and unnecessary) #VE and may unable to handle.
> 
> In addition to Kirill's comment on #VE...
> 
> This last part of the comment reads a bit odd since you say 
> unconditional and then say only do WBINVD for bare-metal. Maybe 
> something like this makes it a bit clearer?:
> 
> For TDX and SEV-ES/SEV-SNP guests, a WBINVD may cause an exception (#VE 
> or #VC). However, all exception handling has been torn down at this 
> point, so this would cause the guest to crash. Since memory within these 
> types of guests is coherent only issue the WBINVD on bare-metal.

Hmmm... actually it was the other WBINVD in patch #2 that caused the 
crash, so what I wrote above isn't accurate. You might want to re-word 
as appropriate.

Thanks,
Tom

> 
> And you can expand the comment block out to at least 80 characters to 
> make it more compact.
> 
> Thanks,
> Tom
> 
>>        */
>> -    if (c->extended_cpuid_level >= 0x8000001f && 
>> (cpuid_eax(0x8000001f) & BIT(0)))
>> +    if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
>>           native_wbinvd();
>>       /*

Re: [PATCH v3 1/5] x86/kexec: do unconditional WBINVD for bare-metal in stop_this_cpu()

Posted by Huang, Kai 1 year, 10 months ago


On 11/04/2024 4:14 am, Tom Lendacky wrote:
> On 4/10/24 11:08, Tom Lendacky wrote:
>> On 4/7/24 07:44, Kai Huang wrote:
>>
>>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>>> index b8441147eb5e..5ba8a9c1e47a 100644
>>> --- a/arch/x86/kernel/process.c
>>> +++ b/arch/x86/kernel/process.c
>>> @@ -813,18 +813,16 @@ void __noreturn stop_this_cpu(void *dummy)
>>>       mcheck_cpu_clear(c);
>>>       /*
>>> -     * Use wbinvd on processors that support SME. This provides support
>>> -     * for performing a successful kexec when going from SME inactive
>>> -     * to SME active (or vice-versa). The cache must be cleared so that
>>> -     * if there are entries with the same physical address, both 
>>> with and
>>> -     * without the encryption bit, they don't race each other when 
>>> flushed
>>> -     * and potentially end up with the wrong entry being committed to
>>> -     * memory.
>>> +     * The kernel could leave caches in incoherent state on SME/TDX
>>> +     * capable platforms.  Flush cache to avoid silent memory
>>> +     * corruption for these platforms.
>>>        *
>>> -     * Test the CPUID bit directly because the machine might've cleared
>>> -     * X86_FEATURE_SME due to cmdline options.
>>> +     * stop_this_cpu() is not a fast path, just do unconditional
>>> +     * WBINVD for simplicity.  But only do WBINVD for bare-metal
>>> +     * as TDX guests and SEV-ES/SEV-SNP guests will get unexpected
>>> +     * (and unnecessary) #VE and may unable to handle.
>>
>> In addition to Kirill's comment on #VE...
>>
>> This last part of the comment reads a bit odd since you say 
>> unconditional and then say only do WBINVD for bare-metal. Maybe 
>> something like this makes it a bit clearer?:
>>
>> For TDX and SEV-ES/SEV-SNP guests, a WBINVD may cause an exception 
>> (#VE or #VC). However, all exception handling has been torn down at 
>> this point, so this would cause the guest to crash. Since memory 
>> within these types of guests is coherent only issue the WBINVD on 
>> bare-metal.
> 
> Hmmm... actually it was the other WBINVD in patch #2 that caused the 
> crash, so what I wrote above isn't accurate. You might want to re-word 
> as appropriate.

Yeah that's why I used "may unable to handle" in the comment, as I 
thought there's no need to be that specific?

I tend not to mention "memory within these types of guests is coherent".
I mean the current upstream kernel _ONLY_ does WBINVD for SME, that
means for all non-SME environment there's no concern to do WBINVD here.

Here we only extend to do WBINVD on more environments, so as long as 
there's no harm to do WBINVD for them it's OK.

How about below?

	/*
	 * The kernel could leave caches in incoherent state on SME/TDX
	 * capable platforms.  Flush cache to avoid silent memory
	 * corruption for these platforms.
	 *
	 * For TDX and SEV-ES/SEV-SNP guests, a WBINVD causes an
	 * exception (#VE or #VC), and the kernel may not be able
	 * to handle such exception (e.g., TDX guest panics if it
	 * sees #VE).  Since stop_this_cpu() isn't a fast path, just
	 * issue the WBINVD on bare-metal instead of sprinkling
	 * around vendor-specific checks.
	 */
> 
> Thanks,
> Tom
> 
>>
>> And you can expand the comment block out to at least 80 characters to 
>> make it more compact.

OK I can do.  I guess I have to change my vim setting to do so, though :-)

Re: [PATCH v3 1/5] x86/kexec: do unconditional WBINVD for bare-metal in stop_this_cpu()

Posted by Tom Lendacky 1 year, 10 months ago

On 4/10/24 17:26, Huang, Kai wrote:
> On 11/04/2024 4:14 am, Tom Lendacky wrote:
>> On 4/10/24 11:08, Tom Lendacky wrote:
>>> On 4/7/24 07:44, Kai Huang wrote:
>>>
>>>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>>>> index b8441147eb5e..5ba8a9c1e47a 100644
>>>> --- a/arch/x86/kernel/process.c
>>>> +++ b/arch/x86/kernel/process.c
>>>> @@ -813,18 +813,16 @@ void __noreturn stop_this_cpu(void *dummy)
>>>>       mcheck_cpu_clear(c);
>>>>       /*
>>>> -     * Use wbinvd on processors that support SME. This provides 
>>>> support
>>>> -     * for performing a successful kexec when going from SME inactive
>>>> -     * to SME active (or vice-versa). The cache must be cleared so 
>>>> that
>>>> -     * if there are entries with the same physical address, both 
>>>> with and
>>>> -     * without the encryption bit, they don't race each other when 
>>>> flushed
>>>> -     * and potentially end up with the wrong entry being committed to
>>>> -     * memory.
>>>> +     * The kernel could leave caches in incoherent state on SME/TDX
>>>> +     * capable platforms.  Flush cache to avoid silent memory
>>>> +     * corruption for these platforms.
>>>>        *
>>>> -     * Test the CPUID bit directly because the machine might've 
>>>> cleared
>>>> -     * X86_FEATURE_SME due to cmdline options.
>>>> +     * stop_this_cpu() is not a fast path, just do unconditional
>>>> +     * WBINVD for simplicity.  But only do WBINVD for bare-metal
>>>> +     * as TDX guests and SEV-ES/SEV-SNP guests will get unexpected
>>>> +     * (and unnecessary) #VE and may unable to handle.
>>>
>>> In addition to Kirill's comment on #VE...
>>>
>>> This last part of the comment reads a bit odd since you say 
>>> unconditional and then say only do WBINVD for bare-metal. Maybe 
>>> something like this makes it a bit clearer?:
>>>
>>> For TDX and SEV-ES/SEV-SNP guests, a WBINVD may cause an exception 
>>> (#VE or #VC). However, all exception handling has been torn down at 
>>> this point, so this would cause the guest to crash. Since memory 
>>> within these types of guests is coherent only issue the WBINVD on 
>>> bare-metal.
>>
>> Hmmm... actually it was the other WBINVD in patch #2 that caused the 
>> crash, so what I wrote above isn't accurate. You might want to re-word 
>> as appropriate.
> 
> Yeah that's why I used "may unable to handle" in the comment, as I 
> thought there's no need to be that specific?

Yes, makes sense.

> 
> I tend not to mention "memory within these types of guests is coherent". 
>   I mean the current upstream kernel _ONLY_ does WBINVD for SME, that 
> means for all non-SME environment there's no concern to do WBINVD here.
> 
> Here we only extend to do WBINVD on more environments, so as long as 
> there's no harm to do WBINVD for them it's OK.
> 
> How about below?
> 
>      /*
>       * The kernel could leave caches in incoherent state on SME/TDX
>       * capable platforms.  Flush cache to avoid silent memory
>       * corruption for these platforms.
>       *
>       * For TDX and SEV-ES/SEV-SNP guests, a WBINVD causes an
>       * exception (#VE or #VC), and the kernel may not be able
>       * to handle such exception (e.g., TDX guest panics if it
>       * sees #VE).  Since stop_this_cpu() isn't a fast path, just
>       * issue the WBINVD on bare-metal instead of sprinkling
>       * around vendor-specific checks.
>       */

I think that's ok, but maybe just adding that the WBINVD is not 
necessary for TDX and SEV-ES/SEV-SNP would make it clearer. Just my 
opinion, though.

Thanks,
Tom

>>
>> Thanks,
>> Tom
>>
>>>
>>> And you can expand the comment block out to at least 80 characters to 
>>> make it more compact.
> 
> OK I can do.  I guess I have to change my vim setting to do so, though :-)

Re: [PATCH v3 1/5] x86/kexec: do unconditional WBINVD for bare-metal in stop_this_cpu()

Posted by Huang, Kai 1 year, 10 months ago


On 12/04/2024 2:13 am, Tom Lendacky wrote:
> On 4/10/24 17:26, Huang, Kai wrote:
>> On 11/04/2024 4:14 am, Tom Lendacky wrote:
>>> On 4/10/24 11:08, Tom Lendacky wrote:
>>>> On 4/7/24 07:44, Kai Huang wrote:
>>>>
>>>>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>>>>> index b8441147eb5e..5ba8a9c1e47a 100644
>>>>> --- a/arch/x86/kernel/process.c
>>>>> +++ b/arch/x86/kernel/process.c
>>>>> @@ -813,18 +813,16 @@ void __noreturn stop_this_cpu(void *dummy)
>>>>>        mcheck_cpu_clear(c);
>>>>>        /*
>>>>> -     * Use wbinvd on processors that support SME. This provides
>>>>> support
>>>>> -     * for performing a successful kexec when going from SME inactive
>>>>> -     * to SME active (or vice-versa). The cache must be cleared so
>>>>> that
>>>>> -     * if there are entries with the same physical address, both
>>>>> with and
>>>>> -     * without the encryption bit, they don't race each other when
>>>>> flushed
>>>>> -     * and potentially end up with the wrong entry being committed to
>>>>> -     * memory.
>>>>> +     * The kernel could leave caches in incoherent state on SME/TDX
>>>>> +     * capable platforms.  Flush cache to avoid silent memory
>>>>> +     * corruption for these platforms.
>>>>>         *
>>>>> -     * Test the CPUID bit directly because the machine might've
>>>>> cleared
>>>>> -     * X86_FEATURE_SME due to cmdline options.
>>>>> +     * stop_this_cpu() is not a fast path, just do unconditional
>>>>> +     * WBINVD for simplicity.  But only do WBINVD for bare-metal
>>>>> +     * as TDX guests and SEV-ES/SEV-SNP guests will get unexpected
>>>>> +     * (and unnecessary) #VE and may unable to handle.
>>>>
>>>> In addition to Kirill's comment on #VE...
>>>>
>>>> This last part of the comment reads a bit odd since you say
>>>> unconditional and then say only do WBINVD for bare-metal. Maybe
>>>> something like this makes it a bit clearer?:
>>>>
>>>> For TDX and SEV-ES/SEV-SNP guests, a WBINVD may cause an exception
>>>> (#VE or #VC). However, all exception handling has been torn down at
>>>> this point, so this would cause the guest to crash. Since memory
>>>> within these types of guests is coherent only issue the WBINVD on
>>>> bare-metal.
>>>
>>> Hmmm... actually it was the other WBINVD in patch #2 that caused the
>>> crash, so what I wrote above isn't accurate. You might want to re-word
>>> as appropriate.
>>
>> Yeah that's why I used "may unable to handle" in the comment, as I
>> thought there's no need to be that specific?
> 
> Yes, makes sense.
> 
>>
>> I tend not to mention "memory within these types of guests is coherent".
>>    I mean the current upstream kernel _ONLY_ does WBINVD for SME, that
>> means for all non-SME environment there's no concern to do WBINVD here.
>>
>> Here we only extend to do WBINVD on more environments, so as long as
>> there's no harm to do WBINVD for them it's OK.
>>
>> How about below?
>>
>>       /*
>>        * The kernel could leave caches in incoherent state on SME/TDX
>>        * capable platforms.  Flush cache to avoid silent memory
>>        * corruption for these platforms.
>>        *
>>        * For TDX and SEV-ES/SEV-SNP guests, a WBINVD causes an
>>        * exception (#VE or #VC), and the kernel may not be able
>>        * to handle such exception (e.g., TDX guest panics if it
>>        * sees #VE).  Since stop_this_cpu() isn't a fast path, just
>>        * issue the WBINVD on bare-metal instead of sprinkling
>>        * around vendor-specific checks.
>>        */
> 
> I think that's ok, but maybe just adding that the WBINVD is not
> necessary for TDX and SEV-ES/SEV-SNP would make it clearer. Just my
> opinion, though.
> 

Yeah can do that.

Will add "WBINVD is not necessary for TDX and SEV-ES/SEV-SNP guests" 
before starting the "Since stop_this_cpu() ...".

Thanks for the feedback.
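
For reference, one possible reading of the comment with that sentence
folded in is sketched below.  It is only an illustration of the wording
discussed in this thread; the next revision may differ.  The stand-ins
(running_on_hypervisor, fake_native_wbinvd(), flush_caches_before_stop())
are hypothetical and exist only so that the fragment compiles outside the
kernel.

```c
#include <stdbool.h>

/* Hypothetical stand-ins so the fragment compiles outside the kernel. */
static bool running_on_hypervisor;	/* models boot_cpu_has(X86_FEATURE_HYPERVISOR) */
static int wbinvd_count;		/* counts calls that model native_wbinvd() */

static inline void fake_native_wbinvd(void)
{
	wbinvd_count++;
}

static inline void flush_caches_before_stop(void)
{
	/*
	 * The kernel could leave caches in incoherent state on SME/TDX
	 * capable platforms.  Flush cache to avoid silent memory
	 * corruption for these platforms.
	 *
	 * For TDX and SEV-ES/SEV-SNP guests, a WBINVD causes an
	 * exception (#VE or #VC), and the kernel may not be able
	 * to handle such exception (e.g., TDX guest panics if it
	 * sees #VE).  WBINVD is not necessary for TDX and
	 * SEV-ES/SEV-SNP guests.  Since stop_this_cpu() isn't a fast
	 * path, just issue the WBINVD on bare-metal instead of
	 * sprinkling around vendor-specific checks.
	 */
	if (!running_on_hypervisor)
		fake_native_wbinvd();
}
```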

Re: [PATCH v3 1/5] x86/kexec: do unconditional WBINVD for bare-metal in stop_this_cpu()

Posted by Kirill A. Shutemov 1 year, 10 months ago

On Mon, Apr 08, 2024 at 12:44:54AM +1200, Kai Huang wrote:
> TL;DR:

The commit message is waaay too verbose for no good reason. You don't
really need to repeat all the history around this code.

> ---
>  arch/x86/kernel/process.c | 18 ++++++++----------
>  1 file changed, 8 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index b8441147eb5e..5ba8a9c1e47a 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -813,18 +813,16 @@ void __noreturn stop_this_cpu(void *dummy)
>  	mcheck_cpu_clear(c);
>  
>  	/*
> -	 * Use wbinvd on processors that support SME. This provides support
> -	 * for performing a successful kexec when going from SME inactive
> -	 * to SME active (or vice-versa). The cache must be cleared so that
> -	 * if there are entries with the same physical address, both with and
> -	 * without the encryption bit, they don't race each other when flushed
> -	 * and potentially end up with the wrong entry being committed to
> -	 * memory.
> +	 * The kernel could leave caches in incoherent state on SME/TDX
> +	 * capable platforms.  Flush cache to avoid silent memory
> +	 * corruption for these platforms.
>  	 *
> -	 * Test the CPUID bit directly because the machine might've cleared
> -	 * X86_FEATURE_SME due to cmdline options.
> +	 * stop_this_cpu() is not a fast path, just do unconditional
> +	 * WBINVD for simplicity.  But only do WBINVD for bare-metal
> +	 * as TDX guests and SEV-ES/SEV-SNP guests will get unexpected
> +	 * (and unnecessary) #VE and may unable to handle.

s/#VE/exception/

On SEV it is #VC, not #VE.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v3 1/5] x86/kexec: do unconditional WBINVD for bare-metal in stop_this_cpu()

Posted by Huang, Kai 1 year, 10 months ago


On 11/04/2024 2:12 am, Kirill A. Shutemov wrote:
> On Mon, Apr 08, 2024 at 12:44:54AM +1200, Kai Huang wrote:
>> TL;DR:
> 
> The commit message is waaay too verbose for no good reason. You don't
> really need to repeat all the history around this code.

Could you be more specific?

I was following Boris's suggestion to summarize all the discussion 
around the "unconditional WBINVD" issue.

https://lore.kernel.org/linux-kernel/20240228110207.GCZd8Sr8mXHA2KTiLz@fat_crate.local/

I can try to improve if I can know specifically what should be trimmed down.

> 
>> ---
>>   arch/x86/kernel/process.c | 18 ++++++++----------
>>   1 file changed, 8 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>> index b8441147eb5e..5ba8a9c1e47a 100644
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -813,18 +813,16 @@ void __noreturn stop_this_cpu(void *dummy)
>>   	mcheck_cpu_clear(c);
>>   
>>   	/*
>> -	 * Use wbinvd on processors that support SME. This provides support
>> -	 * for performing a successful kexec when going from SME inactive
>> -	 * to SME active (or vice-versa). The cache must be cleared so that
>> -	 * if there are entries with the same physical address, both with and
>> -	 * without the encryption bit, they don't race each other when flushed
>> -	 * and potentially end up with the wrong entry being committed to
>> -	 * memory.
>> +	 * The kernel could leave caches in incoherent state on SME/TDX
>> +	 * capable platforms.  Flush cache to avoid silent memory
>> +	 * corruption for these platforms.
>>   	 *
>> -	 * Test the CPUID bit directly because the machine might've cleared
>> -	 * X86_FEATURE_SME due to cmdline options.
>> +	 * stop_this_cpu() is not a fast path, just do unconditional
>> +	 * WBINVD for simplicity.  But only do WBINVD for bare-metal
>> +	 * as TDX guests and SEV-ES/SEV-SNP guests will get unexpected
>> +	 * (and unnecessary) #VE and may unable to handle.
> 
> s/#VE/exception/
> 
> On SEV it is #VC, not #VE.
> 

Thanks.  I think I'll use "exception (#VE or #VC)" which is clearer, as 
Tom typed in the comments to patch 2.

Re: [PATCH v3 1/5] x86/kexec: do unconditional WBINVD for bare-metal in stop_this_cpu()

Posted by Borislav Petkov 1 year, 9 months ago

On Thu, Apr 11, 2024 at 09:54:13AM +1200, Huang, Kai wrote:
> Could you be more specific?
> 
> I was following Boris's suggestion to summarize all the discussion around
> the "unconditional WBINVD" issue.
> 
> https://lore.kernel.org/linux-kernel/20240228110207.GCZd8Sr8mXHA2KTiLz@fat_crate.local/
> 
> I can try to improve if I can know specifically what should be trimmed down.

No, keep it this way. I've yet to see someone complaining about a
too-verbose commit message while doing git archeology.

If it is too verbose to a reader, then that reader can jump over the
paragraphs.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v3 1/5] x86/kexec: do unconditional WBINVD for bare-metal in stop_this_cpu()

Posted by Huang, Kai 1 year, 9 months ago


On 16/04/2024 5:59 am, Borislav Petkov wrote:
> On Thu, Apr 11, 2024 at 09:54:13AM +1200, Huang, Kai wrote:
>> Could you be more specific?
>>
>> I was following Boris's suggestion to summarize all the discussion around
>> the "unconditional WBINVD" issue.
>>
>> https://lore.kernel.org/linux-kernel/20240228110207.GCZd8Sr8mXHA2KTiLz@fat_crate.local/
>>
>> I can try to improve if I can know specifically what should be trimmed down.
> 
> No, keep it this way. I've yet to see someone complaining about a
> too-verbose commit message while doing git archeology.
> 
> If it is too verbose to a reader, then that reader can jump over the
> paragraphs.
> 

Yeah as replied to Kirill I am keeping it.  Thanks for the feedback.

Re: [PATCH v3 1/5] x86/kexec: do unconditional WBINVD for bare-metal in stop_this_cpu()

Posted by Kirill A. Shutemov 1 year, 10 months ago

On Thu, Apr 11, 2024 at 09:54:13AM +1200, Huang, Kai wrote:
> 
> 
> On 11/04/2024 2:12 am, Kirill A. Shutemov wrote:
> > On Mon, Apr 08, 2024 at 12:44:54AM +1200, Kai Huang wrote:
> > > TL;DR:
> > 
> > The commit message is waaay too verbose for no good reason. You don't
> > really need to repeat all the history around this code.
> 
> Could you be more specific?
> 
> > I was following Boris's suggestion to summarize all the discussion around
> the "unconditional WBINVD" issue.
> 
> https://lore.kernel.org/linux-kernel/20240228110207.GCZd8Sr8mXHA2KTiLz@fat_crate.local/
> 
> I can try to improve if I can know specifically what should be trimmed down.

What about something like this:

  x86/mm: Do unconditional WBINVD in stop_this_cpu() for bare metal

  Both AMD SME and Intel TDX can leave caches in an incoherent state due to
  memory encryption, which can lead to silent memory corruption during kexec. To
  address this issue, it is necessary to flush the caches before jumping to the
  second kernel.

  Previously, the kernel only performed WBINVD in stop_this_cpu() when SME
  support was detected. To support TDX as well, instead of adding vendor-specific
  checks, it is proposed to unconditionally perform WBINVD. Kexec() is a slow
  path, and the additional WBINVD is acceptable for the sake of simplicity and
  maintainability.

  It is important to note that WBINVD should only be done for bare-metal
  scenarios, as TDX guests and SEV-ES/SEV-SNP guests may not handle unexpected
  exceptions (#VE or #VC) caused by WBINVD.

  Historically, there were issues with unconditional WBINVD, leading to system
  hangs or resets on different Intel systems. These issues were addressed by a
  series of commits, culminating in the fix provided by commit 1f5e7eb7868e
  ("x86/smp: Make stop_other_cpus() more robust").

  Further testing on problematic machines confirmed that the issues could not be
  reproduced after applying the fix. Therefore, it is now safe to unconditionally
  perform WBINVD in stop_this_cpu().

You can also add links to relevant threads as Link: tags.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v3 1/5] x86/kexec: do unconditional WBINVD for bare-metal in stop_this_cpu()

Posted by Huang, Kai 1 year, 10 months ago

On Thu, 2024-04-11 at 16:31 +0300, Kirill A. Shutemov wrote:
> On Thu, Apr 11, 2024 at 09:54:13AM +1200, Huang, Kai wrote:
> > 
> > 
> > On 11/04/2024 2:12 am, Kirill A. Shutemov wrote:
> > > On Mon, Apr 08, 2024 at 12:44:54AM +1200, Kai Huang wrote:
> > > > TL;DR:
> > > 
> > > The commit message is waaay too verbose for no good reason. You don't
> > > really need to repeat all the history around this code.
> > 
> > Could you be more specific?
> > 
> > > I was following Boris's suggestion to summarize all the discussion around
> > the "unconditional WBINVD" issue.
> > 
> > https://lore.kernel.org/linux-kernel/20240228110207.GCZd8Sr8mXHA2KTiLz@fat_crate.local/
> > 
> > I can try to improve if I can know specifically what should be trimmed down.
> 
> What about something like this:
> 
>   x86/mm: Do unconditional WBINVD in stop_this_cpu() for bare metal
> 
>   Both AMD SME and Intel TDX can leave caches in an incoherent state due to
>   memory encryption, which can lead to silent memory corruption during kexec. To
>   address this issue, it is necessary to flush the caches before jumping to the
>   second kernel.
> 
>   Previously, the kernel only performed WBINVD in stop_this_cpu() when SME
>   support was detected. To support TDX as well, instead of adding vendor-specific
>   checks, it is proposed to unconditionally perform WBINVD. Kexec() is a slow
>   path, and the additional WBINVD is acceptable for the sake of simplicity and
>   maintainability.
> 
>   It is important to note that WBINVD should only be done for bare-metal
>   scenarios, as TDX guests and SEV-ES/SEV-SNP guests may not handle unexpected
>   exceptions (#VE or #VC) caused by WBINVD.
> 
>   Historically, there were issues with unconditional WBINVD, leading to system
>   hangs or resets on different Intel systems. These issues were addressed by a
>   series of commits, culminating in the fix provided by commit 1f5e7eb7868e
>   ("x86/smp: Make stop_other_cpus() more robust").
> 
>   Further testing on problematic machines confirmed that the issues could not be
>   reproduced after applying the fix. Therefore, it is now safe to unconditionally
>   perform WBINVD in stop_this_cpu().
> 
> You can also add links to relevant threads as Link: tags.
> 

Hmm.. The last two paragraphs don't tell the background that the
"unconditional WBINVD" was the original way to do it, etc.  The changelog of
commit 1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust") (and the
commit IDs that it mentions) doesn't tell the full story either.

That means people will need to open all the Links to get the full information. 
I think it is against what Boris suggested.

Yeah I agree having a lengthy changelog is annoying sometimes, but for this
particular case we have a "TL;DR" so it doesn't seem that bad to me. :-)

So for now I would like to keep the text after the "Note:" in my original
changelog, but I will use your first 3 paragraphs above to replace mine.