[RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3

Posted by Valentin Schneider 2 months, 1 week ago
Deferring kernel range TLB flushes requires the guarantee that upon
entering the kernel, no stale entry may be accessed. The simplest way to
provide such a guarantee is to issue an unconditional flush upon switching
to the kernel CR3, as this is the pivoting point where such stale entries
may be accessed.

As this is only relevant to NOHZ_FULL, restrict the mechanism to NOHZ_FULL
CPUs.

Note that the COALESCE_TLBI config option is introduced in a later commit,
when the whole feature is implemented.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/entry/calling.h      | 26 +++++++++++++++++++++++---
 arch/x86/kernel/asm-offsets.c |  1 +
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 813451b1ddecc..19fb6de276eac 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -9,6 +9,7 @@
 #include <asm/ptrace-abi.h>
 #include <asm/msr.h>
 #include <asm/nospec-branch.h>
+#include <asm/invpcid.h>

 /*

@@ -171,8 +172,27 @@ For 32-bit we have the following conventions - kernel is built with
	andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
 .endm

-.macro COALESCE_TLBI
+.macro COALESCE_TLBI scratch_reg:req
 #ifdef CONFIG_COALESCE_TLBI
+	/* No point in doing this for housekeeping CPUs */
+	movslq  PER_CPU_VAR(cpu_number), \scratch_reg
+	bt	\scratch_reg, tick_nohz_full_mask(%rip)
+	jnc	.Lend_tlbi_\@
+
+	ALTERNATIVE "jmp .Lcr4_\@", "", X86_FEATURE_INVPCID
+	movq $(INVPCID_TYPE_ALL_INCL_GLOBAL), \scratch_reg
+	/* descriptor is all zeroes, point at the zero page */
+	invpcid empty_zero_page(%rip), \scratch_reg
+	jmp .Lend_tlbi_\@
+.Lcr4_\@:
+	/* Note: this gives CR4 pinning the finger */
+	movq PER_CPU_VAR(cpu_tlbstate + TLB_STATE_cr4), \scratch_reg
+	xorq $(X86_CR4_PGE), \scratch_reg
+	movq \scratch_reg, %cr4
+	xorq $(X86_CR4_PGE), \scratch_reg
+	movq \scratch_reg, %cr4
+
+.Lend_tlbi_\@:
	movl     $1, PER_CPU_VAR(kernel_cr3_loaded)
 #endif // CONFIG_COALESCE_TLBI
 .endm
@@ -188,7 +208,7 @@ For 32-bit we have the following conventions - kernel is built with
	mov	%cr3, \scratch_reg
	ADJUST_KERNEL_CR3 \scratch_reg
	mov	\scratch_reg, %cr3
-	COALESCE_TLBI
+	COALESCE_TLBI \scratch_reg
 .Lend_\@:
 .endm

@@ -256,7 +276,7 @@ For 32-bit we have the following conventions - kernel is built with

	ADJUST_KERNEL_CR3 \scratch_reg
	movq	\scratch_reg, %cr3
-	COALESCE_TLBI
+	COALESCE_TLBI \scratch_reg

 .Ldone_\@:
 .endm
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 6259b474073bc..f5abdcbb150d9 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -105,6 +105,7 @@ static void __used common(void)

	/* TLB state for the entry code */
	OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
+	OFFSET(TLB_STATE_cr4, tlb_state, cr4);

	/* Layout info for cpu_entry_area */
	OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
--
2.51.0
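
For reference, here is a rough C-level sketch of what the COALESCE_TLBI asm
above does -- a hedged illustration, not part of the patch. It leans on
existing helpers (invpcid_flush_all(), native_write_cr4(), and the
cpu_tlbstate.cr4 copy that the asm-offsets hunk exposes); the CR4.PGE toggle
mirrors what native_flush_tlb_global() already does, except that the asm
writes CR4 directly and thus, as its comment says, gives CR4 pinning the
finger. kernel_cr3_loaded is the per-CPU flag introduced elsewhere in this
series.

	/* Hedged C sketch of the asm above; the macro is authoritative. */
	static __always_inline void coalesce_tlbi(void)
	{
		/* No point in doing this for housekeeping CPUs */
		if (cpumask_test_cpu(raw_smp_processor_id(), tick_nohz_full_mask)) {
			if (cpu_feature_enabled(X86_FEATURE_INVPCID)) {
				/* INVPCID type 3: flush everything, globals included */
				invpcid_flush_all();
			} else {
				/* Toggling CR4.PGE flushes all entries, globals too */
				unsigned long cr4 = this_cpu_read(cpu_tlbstate.cr4);

				native_write_cr4(cr4 ^ X86_CR4_PGE);
				native_write_cr4(cr4);
			}
		}
		/* Unconditional: tell the deferral code kernel CR3 is now live */
		this_cpu_write(kernel_cr3_loaded, 1);
	}
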
Re: [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
Posted by Frederic Weisbecker 1 month, 3 weeks ago
On Fri, Oct 10, 2025 at 05:38:37PM +0200, Valentin Schneider wrote:
> Deferring kernel range TLB flushes requires the guarantee that upon
> entering the kernel, no stale entry may be accessed. The simplest way to
> provide such a guarantee is to issue an unconditional flush upon switching
> to the kernel CR3, as this is the pivoting point where such stale entries
> may be accessed.
> 
> As this is only relevant to NOHZ_FULL, restrict the mechanism to NOHZ_FULL
> CPUs.
> 
> Note that the COALESCE_TLBI config option is introduced in a later commit,
> when the whole feature is implemented.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  arch/x86/entry/calling.h      | 26 +++++++++++++++++++++++---
>  arch/x86/kernel/asm-offsets.c |  1 +
>  2 files changed, 24 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 813451b1ddecc..19fb6de276eac 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -9,6 +9,7 @@
>  #include <asm/ptrace-abi.h>
>  #include <asm/msr.h>
>  #include <asm/nospec-branch.h>
> +#include <asm/invpcid.h>
> 
>  /*
> 
> @@ -171,8 +172,27 @@ For 32-bit we have the following conventions - kernel is built with
> 	andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
>  .endm
> 
> -.macro COALESCE_TLBI
> +.macro COALESCE_TLBI scratch_reg:req
>  #ifdef CONFIG_COALESCE_TLBI
> +	/* No point in doing this for housekeeping CPUs */
> +	movslq  PER_CPU_VAR(cpu_number), \scratch_reg
> +	bt	\scratch_reg, tick_nohz_full_mask(%rip)
> +	jnc	.Lend_tlbi_\@

I assume it's not possible to have a static call/branch to
take care of all this ?

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
Posted by Valentin Schneider 1 month, 3 weeks ago
On 28/10/25 16:59, Frederic Weisbecker wrote:
> On Fri, Oct 10, 2025 at 05:38:37PM +0200, Valentin Schneider wrote:
>> @@ -171,8 +172,27 @@ For 32-bit we have the following conventions - kernel is built with
>>      andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
>>  .endm
>>
>> -.macro COALESCE_TLBI
>> +.macro COALESCE_TLBI scratch_reg:req
>>  #ifdef CONFIG_COALESCE_TLBI
>> +	/* No point in doing this for housekeeping CPUs */
>> +	movslq  PER_CPU_VAR(cpu_number), \scratch_reg
>> +	bt	\scratch_reg, tick_nohz_full_mask(%rip)
>> +	jnc	.Lend_tlbi_\@
>
> I assume it's not possible to have a static call/branch to
> take care of all this ?
>

I think technically yes, but that would have to be a per-cpu patchable
location, which would mean something like each CPU having its own copy of
that text page... Unless there's some existing way to statically optimize

  if (cpumask_test_cpu(smp_processor_id(), mask))

where @mask is a boot-time constant (i.e. the nohz_full mask).

> Thanks.
>
> --
> Frederic Weisbecker
> SUSE Labs
Re: [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
Posted by Frederic Weisbecker 1 month, 3 weeks ago
On Wed, Oct 29, 2025 at 11:16:23AM +0100, Valentin Schneider wrote:
> On 28/10/25 16:59, Frederic Weisbecker wrote:
> > On Fri, Oct 10, 2025 at 05:38:37PM +0200, Valentin Schneider wrote:
> >> @@ -171,8 +172,27 @@ For 32-bit we have the following conventions - kernel is built with
> >>      andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
> >>  .endm
> >>
> >> -.macro COALESCE_TLBI
> >> +.macro COALESCE_TLBI scratch_reg:req
> >>  #ifdef CONFIG_COALESCE_TLBI
> >> +	/* No point in doing this for housekeeping CPUs */
> >> +	movslq  PER_CPU_VAR(cpu_number), \scratch_reg
> >> +	bt	\scratch_reg, tick_nohz_full_mask(%rip)
> >> +	jnc	.Lend_tlbi_\@
> >
> > I assume it's not possible to have a static call/branch to
> > take care of all this ?
> >
> 
> I think technically yes, but that would have to be a per-cpu patchable
> location, which would mean something like each CPU having its own copy of
> that text page... Unless there's some existing way to statically optimize
> 
>   if (cpumask_test_cpu(smp_processor_id(), mask))
> 
> where @mask is a boot-time constant (i.e. the nohz_full mask).

Or just check the housekeeping_overridden static key before everything. It is
enabled only if nohz_full, isolcpus or a cpuset isolated partition (well, it's
on the way for the last one) is in use, but those are all niche, which means
you spare 99.999% of kernel use cases.
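
In C terms, an untested sketch of that early-out (housekeeping_overridden
comes from kernel/sched/isolation.c via <linux/sched/isolation.h>,
tick_nohz_full_cpu() from <linux/tick.h>; the real check would of course
have to be done from asm):

	/* Untested sketch: shape of the suggested early-out */
	static __always_inline bool coalesce_tlbi_wanted(void)
	{
		/* Default-off key: stock kernels take a single NOP here */
		if (!static_branch_unlikely(&housekeeping_overridden))
			return false;
		/* Isolation in use: only nohz_full CPUs defer kernel flushes */
		return tick_nohz_full_cpu(raw_smp_processor_id());
	}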

Thanks.

> 
> > Thanks.
> >
> > --
> > Frederic Weisbecker
> > SUSE Labs
> 
> 

-- 
Frederic Weisbecker
SUSE Labs
Re: [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
Posted by Valentin Schneider 1 month, 2 weeks ago
On 29/10/25 11:31, Frederic Weisbecker wrote:
> On Wed, Oct 29, 2025 at 11:16:23AM +0100, Valentin Schneider wrote:
>> On 28/10/25 16:59, Frederic Weisbecker wrote:
>> > On Fri, Oct 10, 2025 at 05:38:37PM +0200, Valentin Schneider wrote:
>> >> @@ -171,8 +172,27 @@ For 32-bit we have the following conventions - kernel is built with
>> >>      andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
>> >>  .endm
>> >>
>> >> -.macro COALESCE_TLBI
>> >> +.macro COALESCE_TLBI scratch_reg:req
>> >>  #ifdef CONFIG_COALESCE_TLBI
>> >> +	/* No point in doing this for housekeeping CPUs */
>> >> +	movslq  PER_CPU_VAR(cpu_number), \scratch_reg
>> >> +	bt	\scratch_reg, tick_nohz_full_mask(%rip)
>> >> +	jnc	.Lend_tlbi_\@
>> >
>> > I assume it's not possible to have a static call/branch to
>> > take care of all this ?
>> >
>>
>> I think technically yes, but that would have to be a per-cpu patchable
>> location, which would mean something like each CPU having its own copy of
>> that text page... Unless there's some existing way to statically optimize
>>
>>   if (cpumask_test_cpu(smp_processor_id(), mask))
>>
>> where @mask is a boot-time constant (i.e. the nohz_full mask).
>
> Or just check the housekeeping_overridden static key before everything. It is
> enabled only if nohz_full, isolcpus or a cpuset isolated partition (well, it's
> on the way for the last one) is in use, but those are all niche, which means
> you spare 99.999% of kernel use cases.
>

Oh right, if NOHZ_FULL is actually in use.

Yeah that housekeeping key could do since, at least for the cmdline
approach, it's set during start_kernel(). I need to have a think about the
runtime cpuset case.

Given we have ALTERNATIVEs in there, I assume something like a
boot-time-driven static key could do, but I haven't found out yet if and
how that can be shoved into an ASM file.
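
One option that stays within what calling.h already does would be a
software-defined feature bit plus ALTERNATIVE, the same way X86_FEATURE_PTI
gates SWITCH_TO_KERNEL_CR3. Hypothetical sketch -- X86_FEATURE_COALESCE_TLBI
does not exist, it would have to be set via setup_force_cpu_cap() during
boot iff nohz_full is on the cmdline:

	/* Becomes NOPs at boot when nohz_full is in use; jmp otherwise */
	ALTERNATIVE "jmp .Lend_tlbi_\@", "", X86_FEATURE_COALESCE_TLBI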

> Thanks.
>
>>
>> > Thanks.
>> >
>> > --
>> > Frederic Weisbecker
>> > SUSE Labs
>>
>>
>
> --
> Frederic Weisbecker
> SUSE Labs
Re: [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
Posted by Frederic Weisbecker 1 month, 2 weeks ago
On Wed, Oct 29, 2025 at 03:13:59PM +0100, Valentin Schneider wrote:
> On 29/10/25 11:31, Frederic Weisbecker wrote:
> > On Wed, Oct 29, 2025 at 11:16:23AM +0100, Valentin Schneider wrote:
> >> On 28/10/25 16:59, Frederic Weisbecker wrote:
> >> > On Fri, Oct 10, 2025 at 05:38:37PM +0200, Valentin Schneider wrote:
> >> >> @@ -171,8 +172,27 @@ For 32-bit we have the following conventions - kernel is built with
> >> >>      andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
> >> >>  .endm
> >> >>
> >> >> -.macro COALESCE_TLBI
> >> >> +.macro COALESCE_TLBI scratch_reg:req
> >> >>  #ifdef CONFIG_COALESCE_TLBI
> >> >> +	/* No point in doing this for housekeeping CPUs */
> >> >> +	movslq  PER_CPU_VAR(cpu_number), \scratch_reg
> >> >> +	bt	\scratch_reg, tick_nohz_full_mask(%rip)
> >> >> +	jnc	.Lend_tlbi_\@
> >> >
> >> > I assume it's not possible to have a static call/branch to
> >> > take care of all this ?
> >> >
> >>
> >> I think technically yes, but that would have to be a per-cpu patchable
> >> location, which would mean something like each CPU having its own copy of
> >> that text page... Unless there's some existing way to statically optimize
> >>
> >>   if (cpumask_test_cpu(smp_processor_id(), mask))
> >>
> >> where @mask is a boot-time constant (i.e. the nohz_full mask).
> >
> > Or just check the housekeeping_overridden static key before everything. It is
> > enabled only if nohz_full, isolcpus or a cpuset isolated partition (well, it's
> > on the way for the last one) is in use, but those are all niche, which means
> > you spare 99.999% of kernel use cases.
> >
> 
> Oh right, if NOHZ_FULL is actually in use.
> 
> Yeah that housekeeping key could do since, at least for the cmdline
> approach, it's set during start_kernel(). I need to have a think about the
> runtime cpuset case.

You can ignore the runtime thing and simply check the static key before reading
the housekeeping mask. For now nohz_full can only be enabled on the cmdline.

> Given we have ALTERNATIVEs in there, I assume something like a
> boot-time-driven static key could do, but I haven't found out yet if and
> how that can be shoved into an ASM file.

Right, I thought I had seen static keys in ASM already but I can't find them
anymore. arch/x86/include/asm/jump_label.h is full of reusable magic
though.
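
The core of that magic: a static-branch site is just a 5-byte NOP plus a
__jump_table record of <site, target, key> that jump_label later patches
into a jmp when the key flips. A hypothetical asm-side macro built from
those ingredients might look like the sketch below (simplified: the
branch-type bits normally encoded in the key word are omitted):

	.macro STATIC_BRANCH_JMP_IF_TRUE key:req target:req
	.Lsb_\@:
		.byte 0x0f, 0x1f, 0x44, 0x00, 0x00	/* NOP5 patch site */
		.pushsection __jump_table, "aw"
		.balign 8
		.long	.Lsb_\@ - ., \target - .	/* relative site, target */
		.quad	\key - .			/* relative key address */
		.popsection
	.endm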

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [RFC PATCH v6 27/29] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
Posted by Valentin Schneider 1 month, 2 weeks ago
On 29/10/25 15:49, Frederic Weisbecker wrote:
> On Wed, Oct 29, 2025 at 03:13:59PM +0100, Valentin Schneider wrote:
>> Given we have ALTERNATIVEs in there, I assume something like a
>> boot-time-driven static key could do, but I haven't found out yet if and
>> how that can be shoved into an ASM file.
>
> Right, I thought I had seen static keys in ASM already but I can't find them
> anymore. arch/x86/include/asm/jump_label.h is full of reusable magic
> though.
>

I got something ugly that /seems/ to work; now to spend twice the time
cleaning it up :-)

> Thanks.
>
> --
> Frederic Weisbecker
> SUSE Labs