[RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3

Posted by Valentin Schneider 1 month ago
Deferring kernel range TLB flushes requires the guarantee that upon
entering the kernel, no stale entry may be accessed. The simplest way to
provide such a guarantee is to issue an unconditional flush upon switching
to the kernel CR3, as this is the pivoting point where such stale entries
may be accessed.

As this is only relevant to NOHZ_FULL, restrict the mechanism to NOHZ_FULL
CPUs.

Note that the COALESCE_TLBI config option is introduced in a later commit,
when the whole feature is implemented.
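
For clarity, the COALESCE_TLBI asm below amounts to roughly the following C
(a sketch only: the real code has to stay in asm since it runs right after the
CR3 write and before any C code, and the helper name coalesce_tlbi() is
invented here):

/* Sketch: what the COALESCE_TLBI asm macro does, expressed in C. */
static __always_inline void coalesce_tlbi(void)
{
	/* Key is only enabled when CPU isolation (e.g. nohz_full=) is set up */
	if (!static_branch_unlikely(&housekeeping_overridden))
		return;

	/* No point in doing this for housekeeping CPUs */
	if (tick_nohz_full_cpu(smp_processor_id())) {
		if (cpu_feature_enabled(X86_FEATURE_INVPCID))
			invpcid_flush_all();       /* flush everything, incl. globals */
		else
			native_flush_tlb_global(); /* the asm fallback toggles CR4.PGE */
	}

	this_cpu_write(kernel_cr3_loaded, 1);
}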

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/entry/calling.h      | 25 ++++++++++++++++++++++---
 arch/x86/kernel/asm-offsets.c |  1 +
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 0187c0ea2fddb..620203ef04e9f 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -10,6 +10,7 @@
 #include <asm/msr.h>
 #include <asm/nospec-branch.h>
 #include <asm/jump_label.h>
+#include <asm/invpcid.h>

 /*

@@ -171,9 +172,27 @@ For 32-bit we have the following conventions - kernel is built with
	andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
 .endm

-.macro COALESCE_TLBI
+.macro COALESCE_TLBI scratch_reg:req
 #ifdef CONFIG_COALESCE_TLBI
	STATIC_BRANCH_FALSE_LIKELY housekeeping_overridden, .Lend_\@
+	/* No point in doing this for housekeeping CPUs */
+	movslq  PER_CPU_VAR(cpu_number), \scratch_reg
+	bt	\scratch_reg, tick_nohz_full_mask(%rip)
+	jnc	.Lend_tlbi_\@
+
+	ALTERNATIVE "jmp .Lcr4_\@", "", X86_FEATURE_INVPCID
+	movq $(INVPCID_TYPE_ALL_INCL_GLOBAL), \scratch_reg
+	/* descriptor is all zeroes, point at the zero page */
+	invpcid empty_zero_page(%rip), \scratch_reg
+	jmp .Lend_tlbi_\@
+.Lcr4_\@:
+	/* Note: this gives CR4 pinning the finger */
+	movq PER_CPU_VAR(cpu_tlbstate + TLB_STATE_cr4), \scratch_reg
+	xorq $(X86_CR4_PGE), \scratch_reg
+	movq \scratch_reg, %cr4
+	xorq $(X86_CR4_PGE), \scratch_reg
+	movq \scratch_reg, %cr4
+.Lend_tlbi_\@:
	movl     $1, PER_CPU_VAR(kernel_cr3_loaded)
 .Lend_\@:
 #endif // CONFIG_COALESCE_TLBI
@@ -192,7 +211,7 @@ For 32-bit we have the following conventions - kernel is built with
	mov	%cr3, \scratch_reg
	ADJUST_KERNEL_CR3 \scratch_reg
	mov	\scratch_reg, %cr3
-	COALESCE_TLBI
+	COALESCE_TLBI \scratch_reg
 .Lend_\@:
 .endm

@@ -260,7 +279,7 @@ For 32-bit we have the following conventions - kernel is built with

	ADJUST_KERNEL_CR3 \scratch_reg
	movq	\scratch_reg, %cr3
-	COALESCE_TLBI
+	COALESCE_TLBI \scratch_reg

 .Ldone_\@:
 .endm
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 32ba599a51f88..deb92e9c8923d 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -106,6 +106,7 @@ static void __used common(void)

	/* TLB state for the entry code */
	OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
+	OFFSET(TLB_STATE_cr4, tlb_state, cr4);

	/* Layout info for cpu_entry_area */
	OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
--
2.51.0
Re: [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
Posted by Andy Lutomirski 3 weeks, 6 days ago

On Fri, Nov 14, 2025, at 7:14 AM, Valentin Schneider wrote:
> Deferring kernel range TLB flushes requires the guarantee that upon
> entering the kernel, no stale entry may be accessed. The simplest way to
> provide such a guarantee is to issue an unconditional flush upon switching
> to the kernel CR3, as this is the pivoting point where such stale entries
> may be accessed.
>

Doing this together with the PTI CR3 switch has no actual benefit: MOV CR3 doesn’t flush global pages. And doing this in asm is pretty gross.  We don’t even get a free sync_core() out of it because INVPCID is not documented as being serializing.

Why can’t we do it in C?  What’s the actual risk?  In order to trip over a stale TLB entry, we would need to dereference a pointer to newly allocated kernel virtual memory that was not valid prior to our entry into user mode. I can imagine BPF doing this, but plain noinstr C in the entry path?  Especially noinstr C *that has RCU disabled*?  We already can’t follow an RCU pointer, and ISTM the only style of kernel code that might do this would use RCU to protect the pointer, and we are already doomed if we follow an RCU pointer to any sort of memory.

We do need to watch out for NMI/MCE hitting before we flush.
Re: [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
Posted by Valentin Schneider 3 weeks, 6 days ago
On 19/11/25 06:31, Andy Lutomirski wrote:
> On Fri, Nov 14, 2025, at 7:14 AM, Valentin Schneider wrote:
>> Deferring kernel range TLB flushes requires the guarantee that upon
>> entering the kernel, no stale entry may be accessed. The simplest way to
>> provide such a guarantee is to issue an unconditional flush upon switching
>> to the kernel CR3, as this is the pivoting point where such stale entries
>> may be accessed.
>>
>
> Doing this together with the PTI CR3 switch has no actual benefit: MOV CR3 doesn’t flush global pages. And doing this in asm is pretty gross.  We don’t even get a free sync_core() out of it because INVPCID is not documented as being serializing.
>
> Why can’t we do it in C?  What’s the actual risk?  In order to trip over a stale TLB entry, we would need to dereference a pointer to newly allocated kernel virtual memory that was not valid prior to our entry into user mode. I can imagine BPF doing this, but plain noinstr C in the entry path?  Especially noinstr C *that has RCU disabled*?  We already can’t follow an RCU pointer, and ISTM the only style of kernel code that might do this would use RCU to protect the pointer, and we are already doomed if we follow an RCU pointer to any sort of memory.
>

So v4 and earlier had the TLB flush faff done in C in the context_tracking entry
just like sync_core().
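
(For reference, a minimal sketch of that kind of approach; this is not the
actual v4 code and the names below are made up:)

/* Remote side: instead of IPI'ing a CPU that is executing in userspace,
 * mark it as having a kernel-range TLB flush pending. */
static DEFINE_PER_CPU(bool, kernel_tlb_flush_deferred);

/* Run from the noinstr context-tracking kernel-entry path, before any
 * vmalloc'd memory can be touched, much like the deferred sync_core(). */
static noinstr void ct_flush_deferred_kernel_tlb(void)
{
	if (this_cpu_xchg(kernel_tlb_flush_deferred, false))
		native_flush_tlb_global();
}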

My biggest issue with it was that I couldn't figure out a way to instrument
memory accesses such that I would get an idea of where vmalloc'd accesses
happen - even with a hackish thing just to survey the landscape. So while I
agree with your reasoning wrt entry noinstr code, I don't have any way to
prove it.
That's unlike the text_poke sync_core() deferral for which I have all of
that nice objtool instrumentation.

Dave also pointed out that the whole stale entry flush deferral is a risky
move, and that the sanest thing would be to execute the deferred flush just
after switching to the kernel CR3.

See the thread surrounding:
  https://lore.kernel.org/lkml/20250114175143.81438-30-vschneid@redhat.com/

mainly Dave's reply and subthread:
  https://lore.kernel.org/lkml/352317e3-c7dc-43b4-b4cb-9644489318d0@intel.com/

> We do need to watch out for NMI/MCE hitting before we flush.
Re: [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
Posted by Andy Lutomirski 3 weeks, 6 days ago
On Wed, Nov 19, 2025, at 7:44 AM, Valentin Schneider wrote:
> On 19/11/25 06:31, Andy Lutomirski wrote:
>> On Fri, Nov 14, 2025, at 7:14 AM, Valentin Schneider wrote:
>>> Deferring kernel range TLB flushes requires the guarantee that upon
>>> entering the kernel, no stale entry may be accessed. The simplest way to
>>> provide such a guarantee is to issue an unconditional flush upon switching
>>> to the kernel CR3, as this is the pivoting point where such stale entries
>>> may be accessed.
>>>
>>
>> Doing this together with the PTI CR3 switch has no actual benefit: MOV CR3 doesn’t flush global pages. And doing this in asm is pretty gross.  We don’t even get a free sync_core() out of it because INVPCID is not documented as being serializing.
>>
>> Why can’t we do it in C?  What’s the actual risk?  In order to trip over a stale TLB entry, we would need to dereference a pointer to newly allocated kernel virtual memory that was not valid prior to our entry into user mode. I can imagine BPF doing this, but plain noinstr C in the entry path?  Especially noinstr C *that has RCU disabled*?  We already can’t follow an RCU pointer, and ISTM the only style of kernel code that might do this would use RCU to protect the pointer, and we are already doomed if we follow an RCU pointer to any sort of memory.
>>
>
> So v4 and earlier had the TLB flush faff done in C in the context_tracking entry
> just like sync_core().
>
> My biggest issue with it was that I couldn't figure out a way to instrument
> memory accesses such that I would get an idea of where vmalloc'd accesses
> happen - even with a hackish thing just to survey the landscape. So while I
> agree with your reasoning wrt entry noinstr code, I don't have any way to
> prove it.
> That's unlike the text_poke sync_core() deferral for which I have all of
> that nice objtool instrumentation.
>
> Dave also pointed out that the whole stale entry flush deferral is a risky
> move, and that the sanest thing would be to execute the deferred flush just
> after switching to the kernel CR3.
>
> See the thread surrounding:
>   https://lore.kernel.org/lkml/20250114175143.81438-30-vschneid@redhat.com/
>
> mainly Dave's reply and subthread:
>   https://lore.kernel.org/lkml/352317e3-c7dc-43b4-b4cb-9644489318d0@intel.com/
>
>> We do need to watch out for NMI/MCE hitting before we flush.

I read a decent fraction of that thread.

Let's consider what we're worried about:

1. Architectural access to a kernel virtual address that has been unmapped, in asm or early C.  If it hasn't been remapped, then we oops anyway.  If it has, then that means we're accessing a pointer where either the pointer has changed or the pointee has been remapped while we're in user mode, and that's a very strange thing to do for anything that the asm points to or that early C points to, unless RCU is involved.  But RCU is already disallowed in the entry paths that might be in extended quiescent states, so I think this is mostly a nonissue.

2. Non-speculative access via GDT access, etc.  We can't control this at all, but we're not about to move the GDT, IDT, LDT etc of a running task while that task is in user mode.  We do move the LDT, but that's quite thoroughly synchronized via IPI.  (Should probably be double checked.  I wrote that code, but that doesn't mean I remember it exactly.)

3. Speculative TLB fills.  We can't control this at all.  We have had actual machine checks, on AMD IIRC, due to messing this up.  This is why we can't defer a flush after freeing a page table.

4. Speculative or other nonarchitectural loads.  One would hope that these are not dangerous.  For example, an early version of TDX would machine check if we did a speculative load from TDX memory, but that was fixed.  I don't see why this would be materially different between actual userspace execution (without LASS, anyway), kernel asm, and kernel C.

5. Writes to page table dirty bits.  I don't think we use these.

In any case, the current implementation in your series is really, really, utterly horrifically slow.  It's probably fine for a task that genuinely sits in usermode forever, but I don't think it's likely to be something that we'd be willing to enable for normal kernels and normal tasks.  And it would be really nice for the don't-interrupt-user-code stuff to move toward being always available rather than further from it.


I admit that I'm kind of with dhansen: Zen 3+ can use INVLPGB and doesn't need any of this.  Some Intel CPUs support RAR and will eventually be able to use RAR, possibly even for sync_core().
Re: [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
Posted by Valentin Schneider 3 weeks, 4 days ago
On 19/11/25 09:31, Andy Lutomirski wrote:
> Let's consider what we're worried about:
>
> 1. Architectural access to a kernel virtual address that has been unmapped, in asm or early C.  If it hasn't been remapped, then we oops anyway.  If it has, then that means we're accessing a pointer where either the pointer has changed or the pointee has been remapped while we're in user mode, and that's a very strange thing to do for anything that the asm points to or that early C points to, unless RCU is involved.  But RCU is already disallowed in the entry paths that might be in extended quiescent states, so I think this is mostly a nonissue.
>
> 2. Non-speculative access via GDT access, etc.  We can't control this at all, but we're not about to move the GDT, IDT, LDT etc of a running task while that task is in user mode.  We do move the LDT, but that's quite thoroughly synchronized via IPI.  (Should probably be double checked.  I wrote that code, but that doesn't mean I remember it exactly.)
>
> 3. Speculative TLB fills.  We can't control this at all.  We have had actual machine checks, on AMD IIRC, due to messing this up.  This is why we can't defer a flush after freeing a page table.
>
> 4. Speculative or other nonarchitectural loads.  One would hope that these are not dangerous.  For example, an early version of TDX would machine check if we did a speculative load from TDX memory, but that was fixed.  I don't see why this would be materially different between actual userspace execution (without LASS, anyway), kernel asm, and kernel C.
>
> 5. Writes to page table dirty bits.  I don't think we use these.
>
> In any case, the current implementation in your series is really, really,
> utterly horrifically slow.

Quite so :-)

> It's probably fine for a task that genuinely sits in usermode forever,
> but I don't think it's likely to be something that we'd be willing to
> enable for normal kernels and normal tasks.  And it would be really nice
> for the don't-interrupt-user-code stuff to move toward being always
> available rather than further from it.
>

Well, following Frederic's suggestion of using the "is NOHZ_FULL actually in
use" static key in the ASM bits, none of the ugly bits get involved unless
you do have 'nohz_full=' on the cmdline - not perfect, but it's something.
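
(Roughly, the gating amounts to this; a sketch assuming the asm-side
STATIC_BRANCH_FALSE_LIKELY behaves like the C-side static_branch_unlikely(),
i.e. is patched to a NOP while the key is off:)

	/* With no isolation setup on the cmdline, housekeeping_overridden
	 * stays disabled, this branch is a patched NOP and the whole flush
	 * block is skipped: */
	if (static_branch_unlikely(&housekeeping_overridden))
		coalesce_tlbi();	/* hypothetical C name for the asm macro */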

RHEL kernels ship with NO_HZ_FULL=y [1], so we do care about it not impacting
performance too much when it's merely compiled in and not actually used.

[1]: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-10/-/blob/main/redhat/configs/common/generic/CONFIG_NO_HZ_FULL
>
> I admit that I'm kind of with dhansen: Zen 3+ can use INVLPGB and doesn't
> need any of this.  Some Intel CPUs support RAR and will eventually be
> able to use RAR, possibly even for sync_core().

Yeah that INVLPGB thing looks really nice, and AFAICT arm64 is similarly
covered with TLBI VMALLE1IS.

My goal here is to poke around and find out what's the minimal amount of
ugly we can get away with to suppress those IPIs on existing fleets, but
there's still too much ugly :/