In order to map the EFI runtime services, set_virtual_address_map
needs to be called. It resides in the lower half of the address
space, which means that LASS needs to be temporarily disabled around
the call. Clearing CR4.LASS is only possible before CR pinning is
set up, so move the CR pinning setup after the EFI initialization.
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
arch/x86/kernel/cpu/common.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b24ad418536e..c249fd0aa3fb 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1953,7 +1953,6 @@ static __init void identify_boot_cpu(void)
 	enable_sep_cpu();
 #endif
 	cpu_detect_tlb(&boot_cpu_data);
-	setup_cr_pinning();
 
 	tsx_init();
 	tdx_init();
@@ -2385,10 +2384,16 @@ void __init arch_cpu_finalize_init(void)
 
 	/*
 	 * This needs to follow the FPU initializtion, since EFI depends on it.
+	 * It also needs to precede the CR pinning setup, because we need to be
+	 * able to temporarily clear the CR4.LASS bit in order to execute the
+	 * set_virtual_address_map call, which resides in lower addresses and
+	 * would trip LASS if enabled.
 	 */
 	if (efi_enabled(EFI_RUNTIME_SERVICES))
 		efi_enter_virtual_mode();
 
+	setup_cr_pinning();
+
 	/*
 	 * Ensure that access to the per CPU representation has the initial
 	 * boot CPU configuration.
--
2.45.2
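
For context before the review replies: the ordering constraint exists
because, once setup_cr_pinning() has run, pinned CR4 bits can no longer be
cleared. Below is a minimal sketch of what the patch makes possible,
assuming the X86_CR4_LASS bit from this series; cr4_clear_bits() and
cr4_set_bits() are the existing CR4 helpers, while the function name and
exact call site here are illustrative, not the series code:

/*
 * Sketch only. After setup_cr_pinning(), clearing a pinned bit like
 * this would warn and be reverted, which is why the pinning setup
 * has to move after EFI initialization.
 */
static void __init efi_call_svam_with_lass_off(void)
{
	/* Allow supervisor execution of lower-half (1:1 mapped) code */
	cr4_clear_bits(X86_CR4_LASS);

	/* SetVirtualAddressMap() runs at its physical, lower-half address */
	efi_enter_virtual_mode();

	/* Restore the protection */
	cr4_set_bits(X86_CR4_LASS);
}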
On 10/28/2024 9:07 AM, Alexander Shishkin wrote:
> In order to map the EFI runtime services, set_virtual_address_map
> needs to be called. It resides in the lower half of the address
> space, which means that LASS needs to be temporarily disabled around
> the call. Clearing CR4.LASS is only possible before CR pinning is
> set up, so move the CR pinning setup after the EFI initialization.
> ...
>
>  	/*
>  	 * This needs to follow the FPU initializtion, since EFI depends on it.
> +	 * It also needs to precede the CR pinning setup, because we need to be
> +	 * able to temporarily clear the CR4.LASS bit in order to execute the
> +	 * set_virtual_address_map call, which resides in lower addresses and
> +	 * would trip LASS if enabled.
>  	/*

It would be helpful to describe why lass_stac()/clac() doesn't work here
and why the heavy-handed CR4 toggling is needed instead.

>  	if (efi_enabled(EFI_RUNTIME_SERVICES))
>  		efi_enter_virtual_mode();
>
> +	setup_cr_pinning();
> +
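
A hypothetical side-by-side of the two mechanisms this review comment
contrasts, assuming the lass_stac()/lass_clac() helpers from this series
(they wrap the STAC/CLAC instructions, which toggle EFLAGS.AC): the AC
window relaxes LASS only for supervisor *data* accesses to the lower half,
while instruction fetches fault regardless of EFLAGS.AC, so executing the
lower-half SetVirtualAddressMap() code needs CR4.LASS cleared outright:

/* Illustrative only; low_addr stands for a lower-half linear address. */
static void lass_window_examples(unsigned long low_addr)
{
	unsigned long val;

	/* Data access: the narrow STAC/CLAC window is enough */
	lass_stac();				/* EFLAGS.AC = 1 */
	val = *(unsigned long *)low_addr;	/* supervisor read, allowed */
	lass_clac();				/* EFLAGS.AC = 0 */
	(void)val;

	/* Execution: AC does not forgive instruction fetches */
	cr4_clear_bits(X86_CR4_LASS);
	((void (*)(void))low_addr)();		/* e.g. SetVirtualAddressMap() */
	cr4_set_bits(X86_CR4_LASS);
}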
> 	/*
> 	 * This needs to follow the FPU initializtion, since EFI depends on it.
> +	 * It also needs to precede the CR pinning setup, because we need to be
> +	 * able to temporarily clear the CR4.LASS bit in order to execute the
> +	 * set_virtual_address_map call, which resides in lower addresses and
> +	 * would trip LASS if enabled.
> 	 */

Why are the temporary mappings used to patch kernel code in the lower half
of the virtual address space? The comments in front of use_temporary_mm() say:

 * Using a temporary mm allows to set temporary mappings that are not accessible
 * by other CPUs. Such mappings are needed to perform sensitive memory writes
 * that override the kernel memory protections (e.g., W^X), without exposing the
 * temporary page-table mappings that are required for these write operations to
 * other CPUs. Using a temporary mm also allows to avoid TLB shootdowns when the
 * mapping is torn down.

But couldn't we map into the upper half and do some/all of:

1) Trust that there aren't stupid bugs that dereference random pointers into
   the temporary mapping?
2) Make a "this CPU only" mapping.
3) Avoid preemption while patching so there is no need for TLB shootdown by
   other CPUs when the temporary mapping is torn down, just flush the local
   TLB.

-Tony
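
For readers without the source handy, the flow Tony is describing is the
text-poking path. A sketch abridged and simplified from __text_poke() in
arch/x86/kernel/alternatives.c (error handling and the page-crossing case
dropped; poking_mm and the randomized lower-half poking_addr are set up
once in poking_init()):

static void text_poke_sketch(struct page *page, pte_t *ptep,
			     void *addr, const void *src, size_t len)
{
	temp_mm_state_t prev;
	unsigned long flags;

	local_irq_save(flags);

	/* Map the target text page writable, visible only in poking_mm */
	set_pte_at(poking_mm, poking_addr, ptep, mk_pte(page, PAGE_KERNEL));

	/* Switch this CPU's CR3 to poking_mm; other CPUs never see it */
	prev = use_temporary_mm(poking_mm);

	/* The supervisor write to a lower-half address that trips LASS */
	memcpy((u8 *)poking_addr + offset_in_page(addr), src, len);

	unuse_temporary_mm(prev);

	/* Tear down locally; no other CPU saw the PTE, so no shootdown */
	pte_clear(poking_mm, poking_addr, ptep);
	flush_tlb_mm_range(poking_mm, poking_addr,
			   poking_addr + PAGE_SIZE, PAGE_SHIFT, false);

	local_irq_restore(flags);
}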
On 10/29/24 15:26, Luck, Tony wrote:
>> 	/*
>> 	 * This needs to follow the FPU initializtion, since EFI depends on it.
>> +	 * It also needs to precede the CR pinning setup, because we need to be
>> +	 * able to temporarily clear the CR4.LASS bit in order to execute the
>> +	 * set_virtual_address_map call, which resides in lower addresses and
>> +	 * would trip LASS if enabled.
>> 	 */
>
> Why are the temporary mappings used to patch kernel code in the lower half
> of the virtual address space?

I was just asking myself the same thing.

The upper half is always mapped uniformly. When you create an MM, you copy
the 256->511th pgd entries verbatim from the init_mm's pgd. If you map
something in the <=255th pgd entries, it isn't (by default) visible to
other mms. That's why a new mm also tends to get you a new process.

> But couldn't we map into the upper half and do some/all of:
>
> 1) Trust that there aren't stupid bugs that dereference random pointers into
>    the temporary mapping?
> 2) Make a "this CPU only" mapping.
> 3) Avoid preemption while patching so there is no need for TLB shootdown by
>    other CPUs when the temporary mapping is torn down, just flush the local
>    TLB.

It's about enforcing W^X semantics. We should limit the time and scope
where mappings have some data both writeable and executable.

If we poke text in the upper half of the address space, any kernel thread
might be exploited to write to what will soon be executable. If we do it in
the lower half in its own mm, you have to compromise the thread doing the
text poking after the mapping is created but before it is invalidated. With
LASS you *ALSO* need to do it in the STAC/CLAC window, which is smaller
than the window when the TLB is valid.

*IF* we switched things to do text poking in the upper half of the address
space, we'd probably want to find a completely unused PGD entry. I'm not
sure off the top of my head if we have a good one for that or if it's worth
the trouble.
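
The verbatim copy Dave mentions happens when a pgd is constructed; a sketch
abridged from pgd_ctor() in arch/x86/mm/pgtable.c (on 64-bit,
KERNEL_PGD_BOUNDARY and KERNEL_PGD_PTRS both work out to 256, i.e. entries
256..511):

static void pgd_ctor_sketch(pgd_t *pgd)
{
	/*
	 * Inherit the kernel half from init_mm's page tables
	 * (swapper_pg_dir); entries 0..255 stay private to this mm,
	 * which is exactly what lower-half text poking relies on.
	 */
	clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
			swapper_pg_dir + KERNEL_PGD_BOUNDARY,
			KERNEL_PGD_PTRS);
}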
> *IF* we switched things to do text poking in the upper half of the
> address space, we'd probably want to find a completely unused PGD entry.
> I'm not sure off the top of my head if we have a good one for that or
> if it's worth the trouble.

I expect that would be easy on 64-bit (no way the kernel needs all the
PGD entries from 256..511) but hard for 32-bit (where kernel address
space is in critically short supply on any machine with >1GB RAM).

-Tony
On 10/29/24 15:59, Luck, Tony wrote:
>> *IF* we switched things to do text poking in the upper half of the
>> address space, we'd probably want to find a completely unused PGD entry.
>> I'm not sure off the top of my head if we have a good one for that or
>> if it's worth the trouble.
>
> I expect that would be easy on 64-bit (no way the kernel needs all the
> PGD entries from 256..511) but hard for 32-bit (where kernel address
> space is in critically short supply on any machine with >1GB RAM).

Yeah, I was talking about 64-bit only. On 32-bit PAE a PGD maps 1/4 of
the address space, which is totally unworkable for stealing.
On 10/29/24 16:03, Dave Hansen wrote:
> On 10/29/24 15:59, Luck, Tony wrote:
>>> *IF* we switched things to do text poking in the upper half of the
>>> address space, we'd probably want to find a completely unused PGD entry.
>>> I'm not sure off the top of my head if we have a good one for that or
>>> if it's worth the trouble.
>> I expect that would be easy on 64-bit (no way the kernel needs all the
>> PGD entries from 256..511) but hard for 32-bit (where kernel address
>> space is in critically short supply on any machine with >1GB RAM).
>
> Yeah, I was talking about 64-bit only. On 32-bit PAE a PGD maps 1/4 of
> the address space, which is totally unworkable for stealing.

But it is also not necessary.

	-hpa
>> Yeah, I was talking about 64-bit only. On 32-bit PAE a PGD maps 1/4 of
>> the address space, which is totally unworkable for stealing.
>
> But it is also not necessary.

So maybe we could make the 64-bit version of use_temporary_mm() use some
reserved address mapping to a reserved PGD in the upper half of address
space, and the 32-bit version continue to use "user" addresses. It's
unclear to me whether adding complexity here would be worth it to remove
the 64-bit STAC/CLAC text patching issues.

-Tony
On Tue, Oct 29, 2024 at 11:18:29PM +0000, Luck, Tony wrote:
> >> Yeah, I was talking about 64-bit only. On 32-bit PAE a PGD maps 1/4 of
> >> the address space, which is totally unworkable for stealing.
> >
> > But it is also not necessary.
>
> So maybe we could make the 64-bit version of use_temporary_mm() use some
> reserved address mapping to a reserved PGD in the upper half of address
> space, and the 32-bit version continue to use "user" addresses. It's
> unclear to me whether adding complexity here would be worth it to remove
> the 64-bit STAC/CLAC text patching issues.

Redesigning use_temporary_mm() is an interesting experiment, but it is out
of scope for the series. LASS blocks LAM enabling. It would be nice to get
LAM re-enabled soonish.

Maybe we can look at use_temporary_mm() again after LASS gets upstream?

-- 
Kiryl Shutsemau / Kirill A. Shutemov
On 10/29/24 16:18, Luck, Tony wrote:
>>> Yeah, I was talking about 64-bit only. On 32-bit PAE a PGD maps 1/4 of
>>> the address space, which is totally unworkable for stealing.
>>
>> But it is also not necessary.
>
> So maybe we could make the 64-bit version of use_temporary_mm() use some
> reserved address mapping to a reserved PGD in the upper half of address
> space, and the 32-bit version continue to use "user" addresses. It's
> unclear to me whether adding complexity here would be worth it to remove
> the 64-bit STAC/CLAC text patching issues.

For 32 bits we can also simply use something further down in the hierarchy.
It's not like we can afford to have the PGD be anything other than RWX on
32 bits.

	-hpa
On 10/29/24 15:59, Luck, Tony wrote:
>> *IF* we switched things to do text poking in the upper half of the
>> address space, we'd probably want to find a completely unused PGD entry.
>> I'm not sure off the top of my head if we have a good one for that or
>> if it's worth the trouble.
>
> I expect that would be easy on 64-bit (no way the kernel needs all the
> PGD entries from 256..511) but hard for 32-bit (where kernel address
> space is in critically short supply on any machine with >1GB RAM).

No LASS on 32 bits...

	-hpa