[PATCH v10 05/15] x86/cpu: Defer CR pinning enforcement until late_initcall()

Posted by Sohil Mehta 4 months ago
Problem
-------
In order to map the EFI runtime services, set_virtual_address_map()
needs to be called. At that point, the service still resides in the
lower half of the address space, so LASS needs to be temporarily
disabled around this call.

Wrapping efi_enter_virtual_mode() with lass_stac()/clac() is not enough,
because the AC flag only gates data accesses, not instruction fetches.
Clearing the CR4.LASS bit is required to make this work.
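
As a rough illustration only (the helper name below is a placeholder,
not necessarily what a later patch in this series adds; X86_FEATURE_LASS
and X86_CR4_LASS are the definitions introduced by the series), the idea
is:

	/* Sketch only: wrapper name is a placeholder. */
	static void __init efi_enter_virtual_mode_without_lass(void)
	{
		if (cpu_feature_enabled(X86_FEATURE_LASS))
			cr4_clear_bits(X86_CR4_LASS);	/* permit lower-half ifetch */

		efi_enter_virtual_mode();

		if (cpu_feature_enabled(X86_FEATURE_LASS))
			cr4_set_bits(X86_CR4_LASS);
	}

Each cr4_clear_bits()/cr4_set_bits() call ends up in native_write_cr4(),
which is exactly where the CR pinning warning fires today.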

However, pinned CR4 bits are not expected to be modified after
boot CPU init, resulting in a kernel warning.

Solution
--------
One option is to move the CR pinning setup immediately after the runtime
services have been mapped. However, that is a narrow fix that would
require revisiting if something else needs to modify a pinned CR bit.

CR pinning mainly prevents exploits from trivially modifying
security-sensitive CR bits. There is limited benefit to enabling CR
pinning before userspace comes up. Defer CR pinning enforcement until
late_initcall() to allow EFI and future users to modify the CR bits
without any concern for CR pinning.

Save the pinned bits while initializing the boot CPU because they are
needed later to program the value on APs when they come up.

Note
----
This introduces a small window between the boot CPU being initialized
and CR pinning being enforced, where any in-kernel clearing of the
pinned bits could go unnoticed. Later, when enforcement begins, a
warning is triggered as soon as any CR4 bit is modified, such as
X86_CR4_PGE during a TLB flush.
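
For context, a global TLB flush rewrites CR4 roughly like this (a
sketch, not a verbatim copy of the flush code), and every such write
re-runs the pinned-bits check once enforcement is on:

	static inline void flush_tlb_global_sketch(unsigned long cr4)
	{
		/* Toggling PGE flushes all TLB entries, including global ones. */
		native_write_cr4(cr4 ^ X86_CR4_PGE);
		/* Both writes go through the pinning check in native_write_cr4(). */
		native_write_cr4(cr4);
	}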

Currently, this is a purely theoretical concern. There are multiple ways
to resolve it [1] if it becomes a problem in practice.

Link: https://lore.kernel.org/lkml/c59aa7ac-62a6-45ec-b626-de518b25f7d9@intel.com/ [1]
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
v10:
 - Split recording pinned bits and enabling pinning into two functions.
 - Defer pinning until userspace comes up.

This patch does not include any changes to harden the CR pinning
implementation, as that is beyond the scope of this series.
---
 arch/x86/kernel/cpu/common.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 61ab332eaf73..57d5824465b0 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -476,8 +476,8 @@ void cr4_init(void)
 
 	if (boot_cpu_has(X86_FEATURE_PCID))
 		cr4 |= X86_CR4_PCIDE;
-	if (static_branch_likely(&cr_pinning))
-		cr4 = (cr4 & ~cr4_pinned_mask) | cr4_pinned_bits;
+
+	cr4 = (cr4 & ~cr4_pinned_mask) | cr4_pinned_bits;
 
 	__write_cr4(cr4);
 
@@ -487,14 +487,21 @@ void cr4_init(void)
 
 /*
  * Once CPU feature detection is finished (and boot params have been
- * parsed), record any of the sensitive CR bits that are set, and
- * enable CR pinning.
+ * parsed), record any of the sensitive CR bits that are set.
  */
-static void __init setup_cr_pinning(void)
+static void __init record_cr_pinned_bits(void)
 {
 	cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & cr4_pinned_mask;
+}
+
+/* Enables enforcement of the CR pinned bits */
+static int __init enable_cr_pinning(void)
+{
 	static_key_enable(&cr_pinning.key);
+
+	return 0;
 }
+late_initcall(enable_cr_pinning);
 
 static __init int x86_nofsgsbase_setup(char *arg)
 {
@@ -2119,7 +2126,7 @@ static __init void identify_boot_cpu(void)
 	enable_sep_cpu();
 #endif
 	cpu_detect_tlb(&boot_cpu_data);
-	setup_cr_pinning();
+	record_cr_pinned_bits();
 
 	tsx_init();
 	tdx_init();
-- 
2.43.0
Re: [PATCH v10 05/15] x86/cpu: Defer CR pinning enforcement until late_initcall()
Posted by Sohil Mehta 3 months, 3 weeks ago
On 10/6/2025 11:51 PM, Sohil Mehta wrote:
> Save the pinned bits while initializing the boot CPU because they are
> needed later to program the value on APs when they come up.
> 

Because we are deferring CR pinning, there is no need to program the APs
with the pinned bits. The pinned bits would get enabled during AP bring
up like the rest of the CR4 features that are not pinned. This patch can be
simplified to:

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 61ab332eaf73..d041f04c1969 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -476,8 +476,6 @@ void cr4_init(void)

        if (boot_cpu_has(X86_FEATURE_PCID))
                cr4 |= X86_CR4_PCIDE;
-       if (static_branch_likely(&cr_pinning))
-               cr4 = (cr4 & ~cr4_pinned_mask) | cr4_pinned_bits;

        __write_cr4(cr4);

@@ -486,15 +484,17 @@ void cr4_init(void)
 }

 /*
- * Once CPU feature detection is finished (and boot params have been
- * parsed), record any of the sensitive CR bits that are set, and
- * enable CR pinning.
+ * Before userspace starts, record any of the sensitive CR bits that
+ * are set, and enable CR pinning.
  */
-static void __init setup_cr_pinning(void)
+static int __init setup_cr_pinning(void)
 {
        cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & cr4_pinned_mask;
        static_key_enable(&cr_pinning.key);
+
+       return 0;
 }
+late_initcall(setup_cr_pinning);

 static __init int x86_nofsgsbase_setup(char *arg)
 {
@@ -2119,7 +2119,6 @@ static __init void identify_boot_cpu(void)
        enable_sep_cpu();
 #endif
        cpu_detect_tlb(&boot_cpu_data);
-       setup_cr_pinning();

        tsx_init();
        tdx_init();
Re: [PATCH v10 05/15] x86/cpu: Defer CR pinning enforcement until late_initcall()
Posted by Edgecombe, Rick P 4 months ago
On Mon, 2025-10-06 at 23:51 -0700, Sohil Mehta wrote:
> Problem
> -------
> In order to map the EFI runtime services, set_virtual_address_map()
> needs to be called. At that point, the service still resides in the
> lower half of the address space, so LASS needs to be temporarily
> disabled around this call.

Possibly naive EFI question...

efi_runtime_services_32_t has: 
   typedef struct {
   	efi_table_hdr_t hdr;
   	u32 get_time;
   	u32 set_time;
   	u32 get_wakeup_time;
   	u32 set_wakeup_time;
   	u32 set_virtual_address_map;
   	u32 convert_pointer;
   	u32 get_variable;
   	u32 get_next_variable;
   	u32 set_variable;
   	u32 get_next_high_mono_count;
   	u32 reset_system;
   	u32 update_capsule;
   	u32 query_capsule_caps;
   	u32 query_variable_info;
   } efi_runtime_services_32_t;
   

Why is only set_virtual_address_map problematic? Some of the other ones get
called after boot by a bunch of modules by the looks of it.

> 
> Wrapping efi_enter_virtual_mode() with lass_stac()/clac() is not enough,
> because the AC flag only gates data accesses, not instruction fetches.
> Clearing the CR4.LASS bit is required to make this work.
> 
> However, pinned CR4 bits are not expected to be modified after
> boot CPU init, resulting in a kernel warning.
> 
> Solution
> --------
> One option is to move the CR pinning setup immediately after the runtime
> services have been mapped. However, that is a narrow fix that would
> require revisiting if something else needs to modify a pinned CR bit.
> 
> CR pinning mainly prevents exploits from trivially modifying
> security-sensitive CR bits. There is limited benefit to enabling CR
> pinning before userspace comes up. Defer CR pinning enforcement until
> late_initcall() to allow EFI and future users to modify the CR bits
> without any concern for CR pinning.
> 
> Save the pinned bits while initializing the boot CPU because they are
> needed later to program the value on APs when they come up.
> 
> Note
> ----
> This introduces a small window between the boot CPU being initialized
> and CR pinning being enforced, where any in-kernel clearing of the
> pinned bits could go unnoticed. Later, when enforcement begins, a
> warning is triggered as soon as any CR4 bit is modified, such as
> X86_CR4_PGE during a TLB flush.
> 
> Currently, this is a purely theoretical concern. There are multiple ways
> to resolve it [1] if it becomes a problem in practice.
> 
> Link: https://lore.kernel.org/lkml/c59aa7ac-62a6-45ec-b626-de518b25f7d9@intel.com/ [1]
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
> ---
> v10:
>  - Split recording pinned bits and enabling pinning into two functions.
>  - Defer pinning until userspace comes up.
> 
> This patch does not include any changes to harden the CR pinning
> implementation, as that is beyond the scope of this series.
> ---
>  arch/x86/kernel/cpu/common.c | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 61ab332eaf73..57d5824465b0 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -476,8 +476,8 @@ void cr4_init(void)
>  
>  	if (boot_cpu_has(X86_FEATURE_PCID))
>  		cr4 |= X86_CR4_PCIDE;
> -	if (static_branch_likely(&cr_pinning))
> -		cr4 = (cr4 & ~cr4_pinned_mask) | cr4_pinned_bits;
> +
> +	cr4 = (cr4 & ~cr4_pinned_mask) | cr4_pinned_bits;


Can you explain why this change is needed? It relies on cr4_pinned_bits to be
already set, and kind of is "enforcement", but no longer depends on
enable_cr_pinning() being called.


>  
>  	__write_cr4(cr4);
>  
> @@ -487,14 +487,21 @@ void cr4_init(void)
>  
>  /*
>   * Once CPU feature detection is finished (and boot params have been
> - * parsed), record any of the sensitive CR bits that are set, and
> - * enable CR pinning.
> + * parsed), record any of the sensitive CR bits that are set.
>   */
> -static void __init setup_cr_pinning(void)
> +static void __init record_cr_pinned_bits(void)
>  {
>  	cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & cr4_pinned_mask;
> +}
> +
> +/* Enables enforcement of the CR pinned bits */
> +static int __init enable_cr_pinning(void)
> +{
>  	static_key_enable(&cr_pinning.key);
> +
> +	return 0;
>  }
> +late_initcall(enable_cr_pinning);
>  
>  static __init int x86_nofsgsbase_setup(char *arg)
>  {
> @@ -2119,7 +2126,7 @@ static __init void identify_boot_cpu(void)
>  	enable_sep_cpu();
>  #endif
>  	cpu_detect_tlb(&boot_cpu_data);
> -	setup_cr_pinning();
> +	record_cr_pinned_bits();
>  
>  	tsx_init();
>  	tdx_init();

Re: [PATCH v10 05/15] x86/cpu: Defer CR pinning enforcement until late_initcall()
Posted by Sohil Mehta 4 months ago
On 10/7/2025 10:23 AM, Edgecombe, Rick P wrote:

> 
> Why is only set_virtual_address_map problematic? Some of the other ones get
> called after boot by a bunch of modules by the looks of it.
> 

AFAIU, efi_enter_virtual_mode()->set_virtual_address_map maps the
runtime services from physical mode into virtual mode.

After that, all the other runtime services can get called using virtual
addressing. I can find out more if you still have concerns.

>> @@ -476,8 +476,8 @@ void cr4_init(void)
>>  
>>  	if (boot_cpu_has(X86_FEATURE_PCID))
>>  		cr4 |= X86_CR4_PCIDE;
>> -	if (static_branch_likely(&cr_pinning))
>> -		cr4 = (cr4 & ~cr4_pinned_mask) | cr4_pinned_bits;
>> +
>> +	cr4 = (cr4 & ~cr4_pinned_mask) | cr4_pinned_bits;
> 
> 
> Can you explain why this change is needed? It relies on cr4_pinned_bits to be
> already set, and kind of is "enforcement", but no longer depends on
> enable_cr_pinning() being called.
> 

cr4_init() is only called from APs during bring up. The pinned bits are
saved on the BSP and then used to program the CR4 on the APs. It is
independent of pinning *enforcement* which warns when these bits get
modified.

> 
>>  
>>  	__write_cr4(cr4);
>>
Re: [PATCH v10 05/15] x86/cpu: Defer CR pinning enforcement until late_initcall()
Posted by Edgecombe, Rick P 4 months ago
On Tue, 2025-10-07 at 16:05 -0700, Sohil Mehta wrote:
> On 10/7/2025 10:23 AM, Edgecombe, Rick P wrote:
> 
> > 
> > Why is only set_virtual_address_map problematic? Some of the other ones get
> > called after boot by a bunch of modules by the looks of it.
> > 
> 
> AFAIU, efi_enter_virtual_mode()->set_virtual_address_map maps the
> runtime services from physical mode into virtual mode.
> 
> After that, all the other runtime services can get called using virtual
> addressing. I can find out more if you still have concerns.

Ah, looking into this more I see the:
ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space

It looks like the services get mapped there in a special MM. The calls switch to
it, so should stay in high address space. But I also saw this snippet:
   /*
    * Certain firmware versions are way too sentimental and still believe
    * they are exclusive and unquestionable owners of the first physical page,
    * even though they explicitly mark it as EFI_CONVENTIONAL_MEMORY
    * (but then write-access it later during SetVirtualAddressMap()).
    *
    * Create a 1:1 mapping for this page, to avoid triple faults during early
    * boot with such firmware. We are free to hand this page to the BIOS,
    * as trim_bios_range() will reserve the first page and isolate it away
    * from memory allocators anyway.
    */
   if (kernel_map_pages_in_pgd(pgd, 0x0, 0x0, 1, pf)) {
   	pr_err("Failed to create 1:1 mapping for the first page!\n");
   	return 1;
   }

By leaving LASS enabled it seems the kernel will be constraining BIOSs to not do
strange things. Which seems reasonable.

> 
> > > @@ -476,8 +476,8 @@ void cr4_init(void)
> > >  
> > >  	if (boot_cpu_has(X86_FEATURE_PCID))
> > >  		cr4 |= X86_CR4_PCIDE;
> > > -	if (static_branch_likely(&cr_pinning))
> > > -		cr4 = (cr4 & ~cr4_pinned_mask) | cr4_pinned_bits;
> > > +
> > > +	cr4 = (cr4 & ~cr4_pinned_mask) | cr4_pinned_bits;
> > 
> > 
> > Can you explain why this change is needed? It relies on cr4_pinned_bits to be
> > already set, and kind of is "enforcement", but no longer depends on
> > enable_cr_pinning() being called.
> > 
> 
> cr4_init() is only called from APs during bring up. The pinned bits are
> saved on the BSP and then used to program the CR4 on the APs. It is
> independent of pinning *enforcement* which warns when these bits get
> modified.

Sorry, still not following. How is it independent of CR pinning enforcement if
the enforcement is still taking place in this function? And if we don't need to
enforce pinning, why drop the branch?

> 
> > 
> > >  
> > >  	__write_cr4(cr4);
> > >  

Re: [PATCH v10 05/15] x86/cpu: Defer CR pinning enforcement until late_initcall()
Posted by Sohil Mehta 3 months, 4 weeks ago
On 10/8/2025 10:36 AM, Edgecombe, Rick P wrote:
>> cr4_init() is only called from APs during bring up. The pinned bits are
>> saved on the BSP and then used to program the CR4 on the APs. It is
>> independent of pinning *enforcement* which warns when these bits get
>> modified.
> 
> Sorry, still not following. How is it independent of CR pinning enforcement if
> the enforcement is still taking place in this function? And if we don't need to
> enforce pinning, why drop the branch?
> 

It depends on how we define "enforcement". The pinned bit verification
as well as the warning happen in native_write_cr4().
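
Roughly, that path looks like this (paraphrased, not a verbatim copy of
the current code):

	void native_write_cr4(unsigned long val)
	{
		unsigned long bits_changed = 0;

	set_register:
		asm volatile("mov %0,%%cr4" : "+r" (val) : : "memory");

		if (static_branch_likely(&cr_pinning)) {
			if (unlikely((val & cr4_pinned_mask) != cr4_pinned_bits)) {
				bits_changed = (val & cr4_pinned_mask) ^ cr4_pinned_bits;
				val = (val & ~cr4_pinned_mask) | cr4_pinned_bits;
				goto set_register;
			}
			/* Warn only after the register has been corrected. */
			WARN_ONCE(bits_changed, "pinned CR4 bits changed: 0x%lx!?\n",
				  bits_changed);
		}
	}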

When APs start, we need to program *a* CR4 value for them. Currently, with
early CR pinning, we use the saved pinned bits on the BSP along with
X86_CR4_PCIDE. Because cr4_init() is only called during boot, the static
branch is always going to be false with deferred pinning. Your
suggestion implies that we only use X86_CR4_PCIDE as the initial CR4
value. It could work, but I have some doubts because CR4 initialization
has had multiple issues in the past.

Not directly related but see commits:
7652ac920185 ("x86/asm: Move native_write_cr0/4() out of line")
c7ad5ad297e6 ("x86/mm/64: Initialize CR4.PCIDE early")

As we discussed in another thread, CR pinning has expanded beyond the
original security related bits. They have become bits that are never
expected to be modified once initialized. I wonder whether we could run
into issues if the initial CR4 value on the APs doesn't have one of the
pinned bits set. From a cursory look, everything should be fine (except
maybe FRED). I could give it a try.

But, is there a preference here? There is no additional cost of setting
the pinned bits because we definitely need to program X86_CR4_PCIDE. Do
we set the pinned bits along with that, or wait for the AP to go through
the init flow and set them one by one?
Re: [PATCH v10 05/15] x86/cpu: Defer CR pinning enforcement until late_initcall()
Posted by Sohil Mehta 3 months, 3 weeks ago
On 10/10/2025 1:45 PM, Sohil Mehta wrote:
> 
> As we discussed in another thread, CR pinning has expanded beyond the
> original security related bits. They have become bits that are never
> expected to be modified once initialized. I wonder whether we could run
> into issues if the initial CR4 value on the APs doesn't have one of the
> pinned bits set. From a cursory look, everything should be fine (except
> maybe FRED). I could give it a try.
> 

I tried this. Getting rid of the cr4_pinned_bits setting during
cr4_init() seems to be working fine.

Xin says that there may be an existing issue with FRED, as CR4.FRED is
set before programming the FRED config MSRs in
cpu_init_fred_exceptions(). Any exceptions during that brief window,
though unlikely, would cause a triple fault. I think not setting
CR4.FRED might help the issue, but I am not sure. I'll let Xin or Peter
evaluate this.


> But, is there a preference here? There is no additional cost of setting
> the pinned bits because we definitely need to program X86_CR4_PCIDE. Do
> we set the pinned bits along with that, or wait for the AP to go through
> the init flow and set them one by one?
> 

As we are planning to defer CR pinning enforcement, I am inclined to
get rid of the following check in cr4_init().

	if (static_branch_likely(&cr_pinning))
		cr4 = (cr4 & ~cr4_pinned_mask) | cr4_pinned_bits;

AFAIU, this change shouldn't harm FRED. Resolving the existing FRED
issue can be done in a separate patch.