From: Alexander Shishkin <alexander.shishkin@linux.intel.com>

In order to map the EFI runtime services, set_virtual_address_map()
needs to be called, which resides in the lower half of the address
space. This means that LASS needs to be temporarily disabled around
this call. This can only be done before the CR pinning is set up.

Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
arch_cpu_finalize_init(), defer it until core initcall.

Wrapping efi_enter_virtual_mode() into lass_stac()/clac() is not enough
because AC flag gates data accesses, but not instruction fetch. Clearing
the CR4 bit is required.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/cpu/common.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ec62e2f9ea16..f10f9f618805 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,11 +490,14 @@ void cr4_init(void)
  * parsed), record any of the sensitive CR bits that are set, and
  * enable CR pinning.
  */
-static void __init setup_cr_pinning(void)
+static int __init setup_cr_pinning(void)
 {
 	cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & cr4_pinned_mask;
 	static_key_enable(&cr_pinning.key);
+
+	return 0;
 }
+core_initcall(setup_cr_pinning);
 
 static __init int x86_nofsgsbase_setup(char *arg)
 {
@@ -2082,7 +2085,6 @@ static __init void identify_boot_cpu(void)
 	enable_sep_cpu();
 #endif
 	cpu_detect_tlb(&boot_cpu_data);
-	setup_cr_pinning();
 
 	tsx_init();
 	tdx_init();
--
2.47.2
On 7/7/25 01:03, Kirill A. Shutemov wrote:
> Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
> arch_cpu_finalize_init(), defer it until core initcall.

What are the side effects of this move? Are there other benefits? What
are the risks?

BTW, ideally, you'd get an ack from one of the folks who put the CR
pinning in the kernel in the first place to make sure this isn't
breaking the mechanism in any important ways.
On 7/9/2025 10:00 AM, Dave Hansen wrote:
> On 7/7/25 01:03, Kirill A. Shutemov wrote:
>> Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
>> arch_cpu_finalize_init(), defer it until core initcall.
>
> What are the side effects of this move? Are there other benefits? What
> are the risks?
>

Picking this up from Kirill..

Reevaluating this, core_initcall() seems too late for setup_cr_pinning().

We need to have CR pinning completed, and the associated static key
enabled, before AP bring-up. start_secondary()->cr4_init() depends on
the cr_pinning static key to initialize CR4 for the APs.

To find the optimal location for CR pinning, here are the constraints:

1) The initialization of all the CPU-specific security features, such
   as SMAP, SMEP, UMIP and LASS, must be done.

2) Since EFI needs to toggle CR4.LASS, EFI initialization must be
   completed.

3) Since the APs depend on the BSP for CR initialization, CR pinning
   should happen before AP bring-up.

4) CR pinning should happen before userspace comes up, since that's
   what we are protecting against.

I shortlisted two locations: arch_cpu_finalize_init() and
early_initcall().

a) start_kernel()
     arch_cpu_finalize_init()

arch_cpu_finalize_init() seems like the logical fit, since CR pinning
can be considered as "finalizing" the security-sensitive control
registers. Doing it at the conclusion of CPU initialization makes sense.

b) start_kernel()
     rest_init()
       kernel_init()
         kernel_init_freeable()
           do_pre_smp_initcalls()
             early_initcall()

We could push the pinning out until early_initcall(), since that
happens just before SMP bring-up as well as before the init process is
executed. But I don't see any benefit to doing that. Most of the stuff
between arch_cpu_finalize_init() and rest_init() seems to be
arch-agnostic (except maybe ACPI). Though the likelihood of anything
touching the pinned bits is low, it would be better to have the bits
pinned and to complain if someone modifies them.

I am inclined towards arch_cpu_finalize_init(), but I don't have a
strong preference. Dave, is there any other aspect I should consider?

> BTW, ideally, you'd get an ack from one of the folks who put the CR
> pinning in the kernel in the first place to make sure this isn't
> breaking the mechanism in any important ways.

Kees, do you have any preference or suggestions?
On 7/31/25 16:45, Sohil Mehta wrote:
> On 7/9/2025 10:00 AM, Dave Hansen wrote:
>> On 7/7/25 01:03, Kirill A. Shutemov wrote:
>>> Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
>>> arch_cpu_finalize_init(), defer it until core initcall.
>> What are the side effects of this move? Are there other benefits? What
>> are the risks?
>>
> Picking this up from Kirill.. Reevaluating this, core_initcall() seems
> too late for setup_cr_pinning().
>
> We need to have CR pinning completed, and the associated static key
> enabled before AP bring up. start_secondary()->cr4_init() depends on the
> cr_pinning static key to initialize CR4 for APs.

Sure, if you leave cr4_init() completely as-is.

'cr4_pinned_bits' should be set by the boot CPU. Secondary CPUs should
also read 'cr4_pinned_bits' when setting up their own cr4's,
unconditionally, independent of 'cr_pinning'.

The thing I think we should change is the pinning _enforcement_. The
easiest way to do that is to remove the static_branch_likely() in
cr4_init() and then delay flipping the static branch until just before
userspace starts.

Basically, split up the:

static void __init setup_cr_pinning(void)
{
	cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & cr4_pinned_mask;
	static_key_enable(&cr_pinning.key);
}

code into its two logical pieces:

 1. Populate 'cr4_pinned_bits' from the boot CPU so the secondaries can
    use it
 2. Enable the static key so pinning enforcement is enabled
On 7/31/2025 5:01 PM, Dave Hansen wrote:
>
> 'cr4_pinned_bits' should be set by the boot CPU. Secondary CPUs should
> also read 'cr4_pinned_bits' when setting up their own cr4's,
> unconditionally, independent of 'cr_pinning'.
>
> The thing I think we should change is the pinning _enforcement_. The
> easiest way to do that is to remove the static_branch_likely() in
> cr4_init() and then delay flipping the static branch until just before
> userspace starts.
>

Based on the current implementation and some git history [1], it seems
that cr_pinning is expected to catch two types of issues. One is a
malicious user trying to change the CR registers from userspace. But
another scenario is a regular caller of native_write_cr4() *mistakenly*
clearing a pinned bit.

[1]: https://lore.kernel.org/all/alpine.DEB.2.21.1906141646320.1722@nanos.tec.linutronix.de/

Could deferring enforcement lead to a scenario where we end up with
different CR4 values on different CPUs? Maybe I am misinterpreting this
and protecting against in-kernel errors is not a goal.

In general, you want to delay the CR pinning enforcement until
absolutely needed. I am curious about the motivation. I understand we
should avoid doing it at arbitrary points in the boot. But
arch_cpu_finalize_init() and early_initcall() seem like decent
mileposts to me.

Are you anticipating that we would need to move setup_cr_pinning()
again when another user similar to EFI shows up?

> Basically, split up the:
>
> static void __init setup_cr_pinning(void)
> {
> 	cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & cr4_pinned_mask;
> 	static_key_enable(&cr_pinning.key);
> }
>
> code into its two logical pieces:
>
>  1. Populate 'cr4_pinned_bits' from the boot CPU so the secondaries can
>     use it
>  2. Enable the static key so pinning enforcement is enabled
On 7/31/25 21:43, Sohil Mehta wrote:
...
> Could deferring enforcement lead to a scenario where we end up with
> different CR4 values on different CPUs? Maybe I am misinterpreting this
> and protecting against in-kernel errors is not a goal.

Sure, theoretically. But if that's a concern, it can be checked at the
time that enforcement starts:

	for_each_online_cpu(cpu) {
		unsigned long cr4 = per_cpu(cpu_tlbstate.cr4, cpu);

		if ((cr4 & cr4_pinned_mask) == cr4_pinned_bits)
			continue;

		WARN("blah blah");
	}

Or use smp_call_function() to check each CPU's CR4 directly. Or, the
next time that CPU does a TLB flush that toggles X86_CR4_PGE, it'll get
the warning from the regular pinning path.

So, sure, this does widen the window during boot where a secondary CPU
might get a bad CR4 value, and it would make it harder to track down
where it happened. We _could_ print a pr_debug() message when the bit
gets cleared but not enforce things if anyone is super worried about
this.

> In general, you want to delay the CR pinning enforcement until
> absolutely needed. I am curious about the motivation. I understand we
> should avoid doing it at arbitrary points in the boot. But,
> arch_cpu_finalize_init() and early_initcall() seem to be decent
> mileposts to me.
>
> Are you anticipating that we would need to move setup_cr_pinning() again
> when another user similar to EFI shows up?

Yep.
On Thu, Jul 31, 2025 at 05:01:37PM -0700, Dave Hansen wrote:
> On 7/31/25 16:45, Sohil Mehta wrote:
> > On 7/9/2025 10:00 AM, Dave Hansen wrote:
> > > On 7/7/25 01:03, Kirill A. Shutemov wrote:
> > > > Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
> > > > arch_cpu_finalize_init(), defer it until core initcall.
> > > What are the side effects of this move? Are there other benefits? What
> > > are the risks?
> > >
> > Picking this up from Kirill.. Reevaluating this, core_initcall() seems
> > too late for setup_cr_pinning().
> >
> > We need to have CR pinning completed, and the associated static key
> > enabled before AP bring up. start_secondary()->cr4_init() depends on the
> > cr_pinning static key to initialize CR4 for APs.
>
> Sure, if you leave cr4_init() completely as-is.
>
> 'cr4_pinned_bits' should be set by the boot CPU. Secondary CPUs should
> also read 'cr4_pinned_bits' when setting up their own cr4's,
> unconditionally, independent of 'cr_pinning'.
>
> The thing I think we should change is the pinning _enforcement_. The
> easiest way to do that is to remove the static_branch_likely() in
> cr4_init() and then delay flipping the static branch until just before
> userspace starts.

Yeah, this is fine from my perspective. The goal with the pinning was
about keeping things safe in the face of an attack from userspace that
managed to get at MSR values and keeping them from being trivially
changed.

-- 
Kees Cook
On August 2, 2025 11:51:28 AM PDT, Kees Cook <kees@kernel.org> wrote:
>On Thu, Jul 31, 2025 at 05:01:37PM -0700, Dave Hansen wrote:
>> On 7/31/25 16:45, Sohil Mehta wrote:
>> > On 7/9/2025 10:00 AM, Dave Hansen wrote:
>> >> On 7/7/25 01:03, Kirill A. Shutemov wrote:
>> >>> Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
>> >>> arch_cpu_finalize_init(), defer it until core initcall.
>> >> What are the side effects of this move? Are there other benefits? What
>> >> are the risks?
>> >>
>> > Picking this up from Kirill.. Reevaluating this, core_initcall() seems
>> > too late for setup_cr_pinning().
>> >
>> > We need to have CR pinning completed, and the associated static key
>> > enabled before AP bring up. start_secondary()->cr4_init() depends on the
>> > cr_pinning static key to initialize CR4 for APs.
>>
>> Sure, if you leave cr4_init() completely as-is.
>>
>> 'cr4_pinned_bits' should be set by the boot CPU. Secondary CPUs should
>> also read 'cr4_pinned_bits' when setting up their own cr4's,
>> unconditionally, independent of 'cr_pinning'.
>>
>> The thing I think we should change is the pinning _enforcement_. The
>> easiest way to do that is to remove the static_branch_likely() in
>> cr4_init() and then delay flipping the static branch until just before
>> userspace starts.
>
>Yeah, this is fine from my perspective. The goal with the pinning was
>about keeping things safe in the face of an attack from userspace that
>managed to get at MSR values and keeping them from being trivially
>changed.
>

I have mentioned this before: I would like to see CR4-pinning use a
patchable immediate to make it harder to manipulate. If the mask is
final when alternatives are run, that would be a good time to install
it; the code can just contain a zero immediate (no pinning) or a very
limited set of bits that must never be changed at all up to that point.
On 7/7/2025 1:03 AM, Kirill A. Shutemov wrote:
> From: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>
> In order to map the EFI runtime services, set_virtual_address_map()
> needs to be called, which resides in the lower half of the address
> space. This means that LASS needs to be temporarily disabled around
> this call. This can only be done before the CR pinning is set up.
>
> Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
> arch_cpu_finalize_init(), defer it until core initcall.
>
> Wrapping efi_enter_virtual_mode() into lass_stac()/clac() is not enough
> because AC flag gates data accesses, but not instruction fetch. Clearing
> the CR4 bit is required.
>

I think the wording might need to be reordered. How about?

In order to map the EFI runtime services, set_virtual_address_map()
needs to be called, which resides in the lower half of the address
space. This means that LASS needs to be temporarily disabled around
this call.

Wrapping efi_enter_virtual_mode() into lass_stac()/clac() is not enough
because AC flag gates data accesses, but not instruction fetch. Clearing
the CR4 bit is required.

However, this must be done before the CR pinning is set up. Instead of
arbitrarily moving setup_cr_pinning() after efi_enter_virtual_mode() in
arch_cpu_finalize_init(), defer it until core initcall.

Other than that,
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>

> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/common.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index ec62e2f9ea16..f10f9f618805 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -490,11 +490,14 @@ void cr4_init(void)
>   * parsed), record any of the sensitive CR bits that are set, and
>   * enable CR pinning.
>   */
> -static void __init setup_cr_pinning(void)
> +static int __init setup_cr_pinning(void)
>  {
>  	cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & cr4_pinned_mask;
>  	static_key_enable(&cr_pinning.key);
> +
> +	return 0;
>  }
> +core_initcall(setup_cr_pinning);
>
>  static __init int x86_nofsgsbase_setup(char *arg)
>  {
> @@ -2082,7 +2085,6 @@ static __init void identify_boot_cpu(void)
>  	enable_sep_cpu();
>  #endif
>  	cpu_detect_tlb(&boot_cpu_data);
> -	setup_cr_pinning();
>
>  	tsx_init();
>  	tdx_init();
On Tue, Jul 08, 2025 at 06:19:03PM -0700, Sohil Mehta wrote:
> On 7/7/2025 1:03 AM, Kirill A. Shutemov wrote:
> > From: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> >
> > In order to map the EFI runtime services, set_virtual_address_map()
> > needs to be called, which resides in the lower half of the address
> > space. This means that LASS needs to be temporarily disabled around
> > this call. This can only be done before the CR pinning is set up.
> >
> > Instead of moving setup_cr_pinning() below efi_enter_virtual_mode() in
> > arch_cpu_finalize_init(), defer it until core initcall.
> >
> > Wrapping efi_enter_virtual_mode() into lass_stac()/clac() is not enough
> > because AC flag gates data accesses, but not instruction fetch. Clearing
> > the CR4 bit is required.
> >
>
> I think the wording might need to be reordered. How about?
>
> In order to map the EFI runtime services, set_virtual_address_map()
> needs to be called, which resides in the lower half of the address
> space. This means that LASS needs to be temporarily disabled around
> this call.
>
> Wrapping efi_enter_virtual_mode() into lass_stac()/clac() is not enough
> because AC flag gates data accesses, but not instruction fetch. Clearing
> the CR4 bit is required.
>
> However, this must be done before the CR pinning is set up. Instead of
> arbitrarily moving setup_cr_pinning() after efi_enter_virtual_mode() in
> arch_cpu_finalize_init(), defer it until core initcall.
>
> Other than that,
> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>

Okay, looks good, thanks!

-- 
Kiryl Shutsemau / Kirill A. Shutemov