From: Alexander Shishkin <alexander.shishkin@linux.intel.com>

In order to map the EFI runtime services, set_virtual_address_map()
needs to be called, which resides in the lower half of the address
space. This means that LASS needs to be temporarily disabled around
this call. This can only be done before the CR pinning is set up.

Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
arch_cpu_finalize_init(), defer it until core initcall.

Wrapping efi_enter_virtual_mode() into lass_stac()/clac() is not enough
because AC flag gates data accesses, but not instruction fetch. Clearing
the CR4 bit is required.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/cpu/common.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ec62e2f9ea16..f10f9f618805 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,11 +490,14 @@ void cr4_init(void)
  * parsed), record any of the sensitive CR bits that are set, and
  * enable CR pinning.
  */
-static void __init setup_cr_pinning(void)
+static int __init setup_cr_pinning(void)
 {
 	cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & cr4_pinned_mask;
 	static_key_enable(&cr_pinning.key);
+
+	return 0;
 }
+core_initcall(setup_cr_pinning);
 
 static __init int x86_nofsgsbase_setup(char *arg)
 {
@@ -2082,7 +2085,6 @@ static __init void identify_boot_cpu(void)
 	enable_sep_cpu();
 #endif
 	cpu_detect_tlb(&boot_cpu_data);
-	setup_cr_pinning();
 
 	tsx_init();
 	tdx_init();
--
2.47.2
On 7/7/25 01:03, Kirill A. Shutemov wrote:
> Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
> arch_cpu_finalize_init(), defer it until core initcall.

What are the side effects of this move? Are there other benefits? What
are the risks?

BTW, ideally, you'd get an ack from one of the folks who put the CR
pinning in the kernel in the first place to make sure this isn't
breaking the mechanism in any important ways.
On 7/9/2025 10:00 AM, Dave Hansen wrote:
> On 7/7/25 01:03, Kirill A. Shutemov wrote:
>> Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
>> arch_cpu_finalize_init(), defer it until core initcall.
>
> What are the side effects of this move? Are there other benefits? What
> are the risks?
>

Picking this up from Kirill..

Reevaluating this, core_initcall() seems too late for setup_cr_pinning().

We need to have CR pinning completed, and the associated static key
enabled, before AP bring-up. start_secondary()->cr4_init() depends on
the cr_pinning static key to initialize CR4 for the APs.

To find the optimal location for CR pinning, here are the constraints:

1) The initialization of all the CPU-specific security features, such
   as SMAP, SMEP, UMIP and LASS, must be done.

2) Since EFI needs to toggle CR4.LASS, EFI initialization must be
   completed.

3) Since the APs depend on the BSP for CR initialization, CR pinning
   should happen before AP bring-up.

4) CR pinning should happen before userspace comes up, since that's
   what we are protecting against.

I shortlisted two locations: arch_cpu_finalize_init() and
early_initcall().

a) start_kernel()
     arch_cpu_finalize_init()

arch_cpu_finalize_init() seems like the logical fit, since CR pinning
can be considered as "finalizing" the security-sensitive control
registers. Doing it at the conclusion of CPU initialization makes sense.

b) start_kernel()
     rest_init()
       kernel_init()
         kernel_init_freeable()
           do_pre_smp_initcalls()
             early_initcall()

We could push the pinning out until early_initcall(), since that
happens just before SMP bring-up as well as before the init process is
executed. But I don't see any benefit to doing that. Most of the stuff
between arch_cpu_finalize_init() and rest_init() seems to be
arch-agnostic (except maybe ACPI). Though the likelihood of anything
touching the pinned bits is low, it would be better to have the bits
pinned and to complain if someone modifies them.

I am inclined towards arch_cpu_finalize_init(), but I don't have a
strong preference. Dave, is there any other aspect I should consider?

> BTW, ideally, you'd get an ack from one of the folks who put the CR
> pinning in the kernel in the first place to make sure this isn't
> breaking the mechanism in any important ways.

Kees, do you have any preference or suggestions?
On 7/31/25 16:45, Sohil Mehta wrote:
> On 7/9/2025 10:00 AM, Dave Hansen wrote:
>> On 7/7/25 01:03, Kirill A. Shutemov wrote:
>>> Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
>>> arch_cpu_finalize_init(), defer it until core initcall.
>> What are the side effects of this move? Are there other benefits? What
>> are the risks?
>>
> Picking this up from Kirill.. Reevaluating this, core_initcall() seems
> too late for setup_cr_pinning().
>
> We need to have CR pinning completed, and the associated static key
> enabled before AP bring up. start_secondary()->cr4_init() depends on the
> cr_pinning static key to initialize CR4 for APs.

Sure, if you leave cr4_init() completely as-is.

'cr4_pinned_bits' should be set by the boot CPU. Secondary CPUs should
also read 'cr4_pinned_bits' when setting up their own cr4's,
unconditionally, independent of 'cr_pinning'.

The thing I think we should change is the pinning _enforcement_. The
easiest way to do that is to remove the static_branch_likely() in
cr4_init() and then delay flipping the static branch until just before
userspace starts.

Basically, split up the:

static void __init setup_cr_pinning(void)
{
	cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & cr4_pinned_mask;
	static_key_enable(&cr_pinning.key);
}

code into its two logical pieces:

 1. Populate 'cr4_pinned_bits' from the boot CPU so the secondaries can
    use it
 2. Enable the static key so pinning enforcement is enabled
On 7/31/2025 5:01 PM, Dave Hansen wrote:
>
> 'cr4_pinned_bits' should be set by the boot CPU. Secondary CPUs should
> also read 'cr4_pinned_bits' when setting up their own cr4's,
> unconditionally, independent of 'cr_pinning'.
>
> The thing I think we should change is the pinning _enforcement_. The
> easiest way to do that is to remove the static_branch_likely() in
> cr4_init() and then delay flipping the static branch until just before
> userspace starts.
>

Based on the current implementation and some git history [1], it seems
that cr_pinning is expected to catch two types of issues. One is a
malicious user trying to change the CR registers from userspace. But
another scenario is a regular caller of native_write_cr4() *mistakenly*
clearing a pinned bit.

[1]: https://lore.kernel.org/all/alpine.DEB.2.21.1906141646320.1722@nanos.tec.linutronix.de/

Could deferring enforcement lead to a scenario where we end up with
different CR4 values on different CPUs? Maybe I am misinterpreting this
and protecting against in-kernel errors is not a goal.

In general, you want to delay the CR pinning enforcement until
absolutely needed. I am curious about the motivation. I understand we
should avoid doing it at arbitrary points in the boot. But
arch_cpu_finalize_init() and early_initcall() seem like decent
mileposts to me.

Are you anticipating that we would need to move setup_cr_pinning()
again when another user similar to EFI shows up?

> Basically, split up the:
>
> static void __init setup_cr_pinning(void)
> {
> 	cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & cr4_pinned_mask;
> 	static_key_enable(&cr_pinning.key);
> }
>
> code into its two logical pieces:
>
>  1. Populate 'cr4_pinned_bits' from the boot CPU so the secondaries can
>     use it
>  2. Enable the static key so pinning enforcement is enabled
On 7/31/25 21:43, Sohil Mehta wrote:
...
> Could deferring enforcement lead to a scenario where we end up with
> different CR4 values on different CPUs? Maybe I am misinterpreting this
> and protecting against in-kernel errors is not a goal.

Sure, theoretically. But if that's a concern, it can be checked at the
time that enforcement starts:

	for_each_online_cpu(cpu) {
		unsigned long cr4 = per_cpu(cpu_tlbstate.cr4, cpu);

		if ((cr4 & cr4_pinned_mask) == cr4_pinned_bits)
			continue;

		WARN("blah blah");
	}

Or use smp_call_function() to check each CPU's CR4 directly. Or, the
next time that CPU does a TLB flush that toggles X86_CR4_PGE, it'll get
the warning from the regular pinning path.

So, sure, this does widen the window during boot where a secondary CPU
might get a bad CR4 value, and it would make it harder to track down
where it happened. We _could_ print a pr_debug() message when the bit
gets cleared but not enforce things if anyone is super worried about
this.

> In general, you want to delay the CR pinning enforcement until
> absolutely needed. I am curious about the motivation. I understand we
> should avoid doing it at arbitrary points in the boot. But,
> arch_cpu_finalize_init() and early_initcall() seem to be decent
> mileposts to me.
>
> Are you anticipating that we would need to move setup_cr_pinning() again
> when another user similar to EFI shows up?

Yep.
On Thu, Jul 31, 2025 at 05:01:37PM -0700, Dave Hansen wrote:
> On 7/31/25 16:45, Sohil Mehta wrote:
> > On 7/9/2025 10:00 AM, Dave Hansen wrote:
> > > On 7/7/25 01:03, Kirill A. Shutemov wrote:
> > > > Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
> > > > arch_cpu_finalize_init(), defer it until core initcall.
> > > What are the side effects of this move? Are there other benefits? What
> > > are the risks?
> > >
> > Picking this up from Kirill.. Reevaluating this, core_initcall() seems
> > too late for setup_cr_pinning().
> >
> > We need to have CR pinning completed, and the associated static key
> > enabled before AP bring up. start_secondary()->cr4_init() depends on the
> > cr_pinning static key to initialize CR4 for APs.
>
> Sure, if you leave cr4_init() completely as-is.
>
> 'cr4_pinned_bits' should be set by the boot CPU. Secondary CPUs should
> also read 'cr4_pinned_bits' when setting up their own cr4's,
> unconditionally, independent of 'cr_pinning'.
>
> The thing I think we should change is the pinning _enforcement_. The
> easiest way to do that is to remove the static_branch_likely() in
> cr4_init() and then delay flipping the static branch until just before
> userspace starts.

Yeah, this is fine from my perspective. The goal with the pinning was
about keeping things safe in the face of an attack from userspace that
managed to get at MSR values and keeping them from being trivially
changed.

-- 
Kees Cook
On August 2, 2025 11:51:28 AM PDT, Kees Cook <kees@kernel.org> wrote:
>On Thu, Jul 31, 2025 at 05:01:37PM -0700, Dave Hansen wrote:
>> On 7/31/25 16:45, Sohil Mehta wrote:
>> > On 7/9/2025 10:00 AM, Dave Hansen wrote:
>> >> On 7/7/25 01:03, Kirill A. Shutemov wrote:
>> >>> Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
>> >>> arch_cpu_finalize_init(), defer it until core initcall.
>> >> What are the side effects of this move? Are there other benefits? What
>> >> are the risks?
>> >>
>> > Picking this up from Kirill.. Reevaluating this, core_initcall() seems
>> > too late for setup_cr_pinning().
>> >
>> > We need to have CR pinning completed, and the associated static key
>> > enabled before AP bring up. start_secondary()->cr4_init() depends on the
>> > cr_pinning static key to initialize CR4 for APs.
>>
>> Sure, if you leave cr4_init() completely as-is.
>>
>> 'cr4_pinned_bits' should be set by the boot CPU. Secondary CPUs should
>> also read 'cr4_pinned_bits' when setting up their own cr4's,
>> unconditionally, independent of 'cr_pinning'.
>>
>> The thing I think we should change is the pinning _enforcement_. The
>> easiest way to do that is to remove the static_branch_likely() in
>> cr4_init() and then delay flipping the static branch until just before
>> userspace starts.
>
>Yeah, this is fine from my perspective. The goal with the pinning was
>about keeping things safe in the face of an attack from userspace that
>managed to get at MSR values and keeping them from being trivially
>changed.
>

I have mentioned this before: I would like to see CR4-pinning use a
patchable immediate to make it harder to manipulate. If the mask is
final when alternatives are run, that would be a good time to install
it; the code can just contain a zero immediate (no pinning) or a very
limited set of bits that must never be changed at all up to that point.
On 7/7/2025 1:03 AM, Kirill A. Shutemov wrote:
> From: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>
> In order to map the EFI runtime services, set_virtual_address_map()
> needs to be called, which resides in the lower half of the address
> space. This means that LASS needs to be temporarily disabled around
> this call. This can only be done before the CR pinning is set up.
>
> Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
> arch_cpu_finalize_init(), defer it until core initcall.
>
> Wrapping efi_enter_virtual_mode() into lass_stac()/clac() is not enough
> because AC flag gates data accesses, but not instruction fetch. Clearing
> the CR4 bit is required.
>

I think the wording might need to be reordered. How about?

In order to map the EFI runtime services, set_virtual_address_map()
needs to be called, which resides in the lower half of the address
space. This means that LASS needs to be temporarily disabled around
this call.

Wrapping efi_enter_virtual_mode() into lass_stac()/clac() is not enough
because AC flag gates data accesses, but not instruction fetch. Clearing
the CR4 bit is required.

However, this must be done before the CR pinning is set up. Instead of
arbitrarily moving setup_cr_pinning() after efi_enter_virtual_mode() in
arch_cpu_finalize_init(), defer it until core initcall.

Other than that,
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>

> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/common.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index ec62e2f9ea16..f10f9f618805 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -490,11 +490,14 @@ void cr4_init(void)
>   * parsed), record any of the sensitive CR bits that are set, and
>   * enable CR pinning.
>   */
> -static void __init setup_cr_pinning(void)
> +static int __init setup_cr_pinning(void)
>  {
>  	cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & cr4_pinned_mask;
>  	static_key_enable(&cr_pinning.key);
> +
> +	return 0;
>  }
> +core_initcall(setup_cr_pinning);
>
>  static __init int x86_nofsgsbase_setup(char *arg)
>  {
> @@ -2082,7 +2085,6 @@ static __init void identify_boot_cpu(void)
>  	enable_sep_cpu();
>  #endif
>  	cpu_detect_tlb(&boot_cpu_data);
> -	setup_cr_pinning();
>
>  	tsx_init();
>  	tdx_init();
On Tue, Jul 08, 2025 at 06:19:03PM -0700, Sohil Mehta wrote:
> On 7/7/2025 1:03 AM, Kirill A. Shutemov wrote:
> > From: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> >
> > In order to map the EFI runtime services, set_virtual_address_map()
> > needs to be called, which resides in the lower half of the address
> > space. This means that LASS needs to be temporarily disabled around
> > this call. This can only be done before the CR pinning is set up.
> >
> > Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
> > arch_cpu_finalize_init(), defer it until core initcall.
> >
> > Wrapping efi_enter_virtual_mode() into lass_stac()/clac() is not enough
> > because AC flag gates data accesses, but not instruction fetch. Clearing
> > the CR4 bit is required.
> >
>
> I think the wording might need to be reordered. How about?
>
> In order to map the EFI runtime services, set_virtual_address_map()
> needs to be called, which resides in the lower half of the address
> space. This means that LASS needs to be temporarily disabled around
> this call.
>
> Wrapping efi_enter_virtual_mode() into lass_stac()/clac() is not enough
> because AC flag gates data accesses, but not instruction fetch. Clearing
> the CR4 bit is required.
>
> However, this must be done before the CR pinning is set up. Instead of
> arbitrarily moving setup_cr_pinning() after efi_enter_virtual_mode() in
> arch_cpu_finalize_init(), defer it until core initcall.
>
> Other than that,
> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>

Okay, looks good, thanks!

-- 
Kiryl Shutsemau / Kirill A. Shutemov