[PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
Posted by Valentin Schneider 1 month ago
Context
=======

We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:

  64359.052209596    NetworkManager       0    1405     smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
    smp_call_function_many_cond+0x1
    smp_call_function+0x39
    on_each_cpu+0x2a
    flush_tlb_kernel_range+0x7b
    __purge_vmap_area_lazy+0x70
    _vm_unmap_aliases.part.42+0xdf
    change_page_attr_set_clr+0x16a
    set_memory_ro+0x26
    bpf_int_jit_compile+0x2f9
    bpf_prog_select_runtime+0xc6
    bpf_prepare_filter+0x523
    sk_attach_filter+0x13
    sock_setsockopt+0x92c
    __sys_setsockopt+0x16a
    __x64_sys_setsockopt+0x20
    do_syscall_64+0x87
    entry_SYSCALL_64_after_hwframe+0x65

The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, providing it can be executed "early
enough" in the entry code.

The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.

Deferral approach
=================

Storing each and every callback, like a secondary call_single_queue, turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.

Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
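
To give an idea of the shape of this, here is an illustrative sketch - the
identifiers below are made up for the example rather than taken from the
patches, and the real code also checks that the target CPU really is in
userspace (falling back to a regular IPI if it isn't):

  enum deferred_ipi_type {
          DEFER_SYNC_CORE,        /* kernel text patching sync */
          DEFER_TLB_FLUSH,        /* kernel range TLB flush    */
  };

  static DEFINE_PER_CPU(unsigned long, deferred_ipi_work);

  /* Housekeeping side: replaces the IPI when the target sits in userspace */
  static void defer_ipi_to_cpu(int cpu, enum deferred_ipi_type type)
  {
          set_bit(type, per_cpu_ptr(&deferred_ipi_work, cpu));
  }

  /* Isolated CPU, early in the user->kernel transition */
  static void deferred_ipi_flush(void)
  {
          unsigned long work = xchg(this_cpu_ptr(&deferred_ipi_work), 0);

          if (work & BIT(DEFER_SYNC_CORE))
                  sync_core();                   /* after a remote text_poke() */
          if (work & BIT(DEFER_TLB_FLUSH))
                  flush_deferred_kernel_tlb();   /* placeholder for the deferred flush */
  }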

Kernel entry vs execution of the deferred operation
===================================================

This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].

There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can itself be executed (before we start getting into
context_tracking.c proper), i.e.:

  idtentry_func_foo()                <--- we're in the kernel
    irqentry_enter()
      irqentry_enter_from_user_mode()
	enter_from_user_mode()
	  [...]
	    ct_kernel_enter_state()
	      ct_work_flush()        <--- deferred operation is executed here

This means one must take extra care about what can happen in the early entry code,
and ensure that <bad things> cannot happen. For instance, we really don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core(). Patches doing the actual deferral have
more detail on this.

The annoying one: TLB flush deferral
====================================

While leveraging the context_tracking subsystem works for deferring things like
kernel text synchronization, it falls apart when it comes to kernel range TLB
flushes. Consider the following execution flow:

  <userspace>
  
  !interrupt!

  SWITCH_TO_KERNEL_CR3        <--- vmalloc range becomes accessible

  idtentry_func_foo()
    irqentry_enter()
      irqentry_enter_from_user_mode()
	enter_from_user_mode()
	  [...]
	    ct_kernel_enter_state()
	      ct_work_flush() <--- deferred flush would be done here


Since there is no sane way to assert no stale entry is accessed during
kernel entry, any code executed between SWITCH_TO_KERNEL_CR3 and
ct_work_flush() is at risk of accessing a stale entry.

Dave had suggested hacking up something within SWITCH_TO_KERNEL_CR3 itself,
which is what has been implemented in the new RFC patches.
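
Conceptually, the new SWITCH_TO_KERNEL_CR3 hack boils down to the following
pseudo-C sketch (the real thing is entry assembly; kernel_cr3_loaded is the
software signal discussed below, while deferred_kernel_tlbi and
build_kernel_cr3() are stand-in names for this example):

  static __always_inline void switch_to_kernel_cr3(void)
  {
          native_write_cr3(build_kernel_cr3());   /* vmalloc range now mapped */
          this_cpu_write(kernel_cr3_loaded, 1);

          /*
           * Flush before anything can hit a stale kernel TLB entry; nothing
           * between the CR3 write and this point may touch vmalloc space.
           */
          if (this_cpu_read(deferred_kernel_tlbi)) {
                  this_cpu_write(deferred_kernel_tlbi, 0);
                  invpcid_flush_all();    /* or a CR4.PGE toggle without INVPCID */
          }
  }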

How bad is it?
==============

Code
++++

I'm happy that the COALESCE_TLBI asm code fits in ~half a screen,
although it open-codes native_write_cr4() without the pinning logic.

I hate the kernel_cr3_loaded signal; it's a kludgy context_tracking.state
duplicate but I need *some* sort of signal to drive the TLB flush deferral and
the context_tracking.state one is set too late in kernel entry. I couldn't
find any fitting existing signals for this.

I'm also unhappy to introduce two different IPI deferral mechanisms. I tried
shoving the text_poke_sync() into SWITCH_TO_KERNEL_CR3, but it got ugly(er) really
fast.

Performance
+++++++++++

Tested by measuring the duration of 10M `syscall(SYS_getpid)` calls on
NOHZ_FULL CPUs, with rteval (hackbench + kernel compilation) running on the
housekeeping CPUs:

o Xeon E5-2699:   base avg 770ns,  patched avg 1340ns (74% increase)
o Xeon E7-8890:   base avg 1040ns, patched avg 1320ns (27% increase)
o Xeon Gold 6248: base avg 270ns,  patched avg 273ns  (1.1% increase)

I don't get that last one; I did spend a ridiculous amount of time making sure
the flush was being executed, and AFAICT yes, it was. What I take out of this is
that it can be a pretty massive increase in the entry overhead (for NOHZ_FULL
CPUs), and that's something I want to hear thoughts on.
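
For reference, the measurement loop is essentially the below (a sketch, not
the exact harness; the task is assumed to be affined to an isolated CPU):

  #include <stdio.h>
  #include <time.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int main(void)
  {
          const long iters = 10 * 1000 * 1000;
          struct timespec t0, t1;

          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (long i = 0; i < iters; i++)
                  syscall(SYS_getpid);
          clock_gettime(CLOCK_MONOTONIC, &t1);

          double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
          printf("avg getpid duration: %.0f ns\n", ns / iters);
          return 0;
  }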

Noise
+++++

Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL10 userspace.

Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:

$ trace-cmd record -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
	           -e "ipi_send_cpu"     -f "cpu & CPUS{$ISOL_CPUS}" \
		   rteval --onlyload --loads-cpulist=$HK_CPUS \
		   --hackbench-runlowmem=True --duration=$DURATION

This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 6 hours.

v6.17
o ~5400 IPIs received, so about ~200 interfering IPIs per isolated CPU
o About one interfering IPI just shy of every 2 minutes

v6.17 + patches
o Zilch!

Patches
=======

o Patches 1-2 are standalone objtool cleanups.

o Patches 3-4 add an RCU testing feature.

o Patches 5-6 add infrastructure for annotating static keys and static calls
  that may be used in noinstr code (courtesy of Josh).
o Patches 7-21 use said annotations on relevant keys / calls.
o Patch 22 enforces proper usage of said annotations (courtesy of Josh).

o Patch 23 deals with detecting NOINSTR text in modules.

o Patches 24-25 deal with kernel text sync IPIs.

o Patch 26 adds ASM support for static keys.

o Patches 27-31 deal with kernel range TLB flush IPIs.

Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v7

Acknowledgements
================

Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o Dave Hansen for patiently educating me about mm
o All of the folks who attended various (too many?) talks about this and
  provided precious feedback.  

Links
=====

[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: http://lore.kernel.org/r/eef09bdc-7546-462b-9ac0-661a44d2ceae@intel.com
[6]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/

Revisions
=========

v6 -> v7
++++++++

o Rebased onto latest v6.18-rc5 (6fa9041b7177f)
o Collected Acks (Sean, Frederic)

o Fixed <asm/context_tracking_work.h> include (Shrikanth)
o Fixed ct_set_cpu_work() CT_RCU_WATCHING logic (Frederic)

o Wrote more verbose comments about NOINSTR static keys and calls (Petr)

o [NEW PATCH] Annotated one more static key: cpu_buf_vm_clear
o [NEW PATCH] Added ASM-accessible static key helpers to gate NO_HZ_FULL logic
  in early entry code (Frederic)

v5 -> v6
++++++++

o Rebased onto v6.17
o Small conflict fixes due to the cpu_buf_idle_clear and smp_text_poke() renames

o Added the TLB flush craziness

v4 -> v5
++++++++

o Rebased onto v6.15-rc3
o Collected Reviewed-by

o Annotated a few more static keys
o Added proper checking of noinstr sections that are in loadable code such as
  KVM early entry (Sean Christopherson)

o Switched to checking for CT_RCU_WATCHING instead of CT_STATE_KERNEL or
  CT_STATE_IDLE, which means deferral is now behaving sanely for IRQ/NMI
  entry from idle (thanks to Frederic!)

o Ditched the vmap TLB flush deferral (for now)  
  

RFCv3 -> v4
+++++++++++

o Rebased onto v6.13-rc6

o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)

o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups

o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ

RFCv2 -> RFCv3
++++++++++++++

o Rebased onto v6.12-rc6

o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral


RFCv1 -> RFCv2
++++++++++++++

o Rebased onto v6.5-rc1

o Updated the trace filter patches (Steven)

o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
  existing .state field (Peter, Frederic)
  
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
  rcutorture case for a low-size counter (Paul) 

o Fixed flush_tlb_kernel_range_deferrable() definition

Josh Poimboeuf (3):
  jump_label: Add annotations for validating noinstr usage
  static_call: Add read-only-after-init static calls
  objtool: Add noinstr validation for static branches/calls

Valentin Schneider (28):
  objtool: Make validate_call() recognize indirect calls to pv_ops[]
  objtool: Flesh out warning related to pv_ops[] calls
  rcu: Add a small-width RCU watching counter debug option
  rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
  x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
  x86/idle: Mark x86_idle static call as __ro_after_init
  x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
  riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
  loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
  perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
  sched/clock: Mark sched_clock_running key as __ro_after_init
  KVM: VMX: Mark __kvm_is_using_evmcs static key as __ro_after_init
  x86/bugs: Mark cpu_buf_vm_clear key as allowed in .noinstr
  x86/speculation/mds: Mark cpu_buf_idle_clear key as allowed in
    .noinstr
  sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
  KVM: VMX: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as
    allowed in .noinstr
  stackleak: Mark stack_erasing_bypass key as allowed in .noinstr
  module: Add MOD_NOINSTR_TEXT mem_type
  context-tracking: Introduce work deferral infrastructure
  context_tracking,x86: Defer kernel text patching IPIs
  x86/jump_label: Add ASM support for static_branch_likely()
  x86/mm: Make INVPCID type macros available to assembly
  x86/mm/pti: Introduce a kernel/user CR3 software signal
  x86/mm/pti: Implement a TLB flush immediately after a switch to kernel
    CR3
  x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under
    CONFIG_COALESCE_TLBI=y
  x86/entry: Add an option to coalesce TLB flushes

 arch/Kconfig                                  |   9 ++
 arch/arm/kernel/paravirt.c                    |   2 +-
 arch/arm64/kernel/paravirt.c                  |   2 +-
 arch/loongarch/kernel/paravirt.c              |   2 +-
 arch/riscv/kernel/paravirt.c                  |   2 +-
 arch/x86/Kconfig                              |  18 +++
 arch/x86/entry/calling.h                      |  40 +++++++
 arch/x86/entry/syscall_64.c                   |   4 +
 arch/x86/events/amd/brs.c                     |   2 +-
 arch/x86/include/asm/context_tracking_work.h  |  18 +++
 arch/x86/include/asm/invpcid.h                |  14 ++-
 arch/x86/include/asm/jump_label.h             |  33 +++++-
 arch/x86/include/asm/text-patching.h          |   1 +
 arch/x86/include/asm/tlbflush.h               |   6 +
 arch/x86/kernel/alternative.c                 |  39 ++++++-
 arch/x86/kernel/asm-offsets.c                 |   1 +
 arch/x86/kernel/cpu/bugs.c                    |  14 ++-
 arch/x86/kernel/kprobes/core.c                |   4 +-
 arch/x86/kernel/kprobes/opt.c                 |   4 +-
 arch/x86/kernel/module.c                      |   2 +-
 arch/x86/kernel/paravirt.c                    |   4 +-
 arch/x86/kernel/process.c                     |   2 +-
 arch/x86/kvm/vmx/vmx.c                        |  11 +-
 arch/x86/kvm/vmx/vmx_onhyperv.c               |   2 +-
 arch/x86/mm/tlb.c                             |  34 ++++--
 include/asm-generic/sections.h                |  15 +++
 include/linux/context_tracking.h              |  21 ++++
 include/linux/context_tracking_state.h        |  54 +++++++--
 include/linux/context_tracking_work.h         |  24 ++++
 include/linux/jump_label.h                    |  30 ++++-
 include/linux/module.h                        |   6 +-
 include/linux/objtool.h                       |  14 +++
 include/linux/static_call.h                   |  19 ++++
 kernel/context_tracking.c                     |  72 +++++++++++-
 kernel/kprobes.c                              |   8 +-
 kernel/kstack_erase.c                         |   6 +-
 kernel/module/main.c                          |  76 ++++++++++---
 kernel/rcu/Kconfig.debug                      |  15 +++
 kernel/sched/clock.c                          |   7 +-
 kernel/time/Kconfig                           |   5 +
 mm/vmalloc.c                                  |  34 +++++-
 tools/objtool/Documentation/objtool.txt       |  34 ++++++
 tools/objtool/check.c                         | 106 +++++++++++++++---
 tools/objtool/include/objtool/check.h         |   1 +
 tools/objtool/include/objtool/elf.h           |   1 +
 tools/objtool/include/objtool/special.h       |   1 +
 tools/objtool/special.c                       |  15 ++-
 .../selftests/rcutorture/configs/rcu/TREE04   |   1 +
 48 files changed, 736 insertions(+), 99 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h

--
2.51.0
Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
Posted by Andy Lutomirski 1 month ago

On Fri, Nov 14, 2025, at 7:01 AM, Valentin Schneider wrote:
> Context
> =======
>
> We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
> pure-userspace application get regularly interrupted by IPIs sent from
> housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
> leading to various on_each_cpu() calls, e.g.:
>

> The heart of this series is the thought that while we cannot remove NOHZ_FULL
> CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
> the callbacks immediately. Anything that only affects kernelspace can wait
> until the next user->kernel transition, providing it can be executed "early
> enough" in the entry code.
>

I want to point out that there's another option here, although anyone trying to implement it would be fighting against quite a lot of history.

Logically, each CPU is in one of a handful of states: user mode, idle, normal kernel mode (possibly subdivided into IRQ, etc), and a handful of very narrow windows, hopefully uninstrumented and not accessing any PTEs that might be invalid, in the entry and exit paths where any state in memory could be out of sync with actual CPU state.  (The latter includes right after the CPU switches to kernel mode, for example.)  And NMI and MCE and whatever weird "security" entry types that Intel and AMD love to add.

The way the kernel *currently* deals with this has two big historical oddities:

1. The entry and exit code cares about ti_flags, which is per-*task*, which means that atomically poking it from other CPUs involves the runqueue lock or other shenanigans (see the idle nr_polling code for example), and also that it's not accessible from the user page tables if PTI is on.

2. The actual heavyweight atomic part (context tracking) was built for RCU, and it's sort of bolted on, and, as you've observed in this series, it's really quite awkward to do things that aren't RCU using context tracking.

If this were a greenfield project, I think there's a straightforward approach that's much nicer: stick everything into a single percpu flags structure.  Imagine we have cpu_flags, which tracks both the current state of the CPU and what work needs to be done on state changes.  On exit to user mode, we would atomically set the mode to USER and make sure we don't touch anything like vmalloc space after that.  On entry back to kernel mode, we would avoid vmalloc space, etc, then atomically switch to kernel mode and read out whatever deferred work is needed.  As an optimization, if nothing in the current configuration needs atomic state tracking, the state could be left at USER_OR_KERNEL and the overhead of an extra atomic op at entry and exit could be avoided.

And RCU would hook into *that* instead of having its own separate set of hooks.

I think that actually doing this would be a big improvement and would also be a serious project.  There's a lot of code that would get touched, and the existing context tracking code is subtle and confusing.  And, as mentioned, ti_flags has the wrong scope.

It's *possible* that one could avoid making ti_flags percpu either by extensive use of the runqueue locks or by borrowing a kludge from the idle code.  For the latter, right now, the reason that the wake-from-idle code works is that the optimized path only happens if the idle thread/cpu is "polling", and it's impossible for the idle ti_flags to be polling while the CPU isn't actually idle.  We could similarly observe that, if a ti_flags says it's in USER mode *and* is on, say, cpu 3, then cpu 3 is most definitely in USER mode.  So someone could try shoving the CPU number into ti_flags :-p   (USER means actually user or in the late exit / early entry path.)

Anyway, benefits of this whole approach would include considerably (IMO) increased comprehensibility compared to the current tangled ct code and much more straightforward addition of new things that happen to a target CPU conditionally depending on its mode.  And, if the flags word was actually per cpu, it could be mapped such that SWITCH_TO_KERNEL_CR3 would use it -- there could be a single CR3 write (and maybe CR4/invpcid depending on whether a zapped mapping is global) and the flush bit could depend on whether a flush is needed.  And there would be basically no chance that a bug that accessed invalidated-but-not-flushed kernel data could be undetected -- in PTI mode, any such access would page fault!  Similarly, if kernel text pokes deferred the flush and serialization, the only code that could execute before noticing the deferred flush would be the user-CR3 code.

Oh, and another primitive would be possible: one CPU could plausibly execute another CPU's interrupts or soft-irqs or whatever by taking a special lock that would effectively pin the remote CPU in user mode -- you'd set a flag in the target cpu_flags saying "pin in USER mode" and the transition on that CPU to kernel mode would then spin on entry to kernel mode and wait for the lock to be released.  This could plausibly get a lot of the on_each_cpu callers to switch over in one fell swoop: anything that needs to synchronize to the remote CPU but does not need to poke its actual architectural state could be executed locally while the remote CPU is pinned.

--Andy
Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
Posted by Andy Lutomirski 1 month ago

On Fri, Nov 14, 2025, at 8:20 AM, Andy Lutomirski wrote:
> On Fri, Nov 14, 2025, at 7:01 AM, Valentin Schneider wrote:
>> Context
>> =======
>>
>> We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
>> pure-userspace application get regularly interrupted by IPIs sent from
>> housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
>> leading to various on_each_cpu() calls, e.g.:
>>
>
>> The heart of this series is the thought that while we cannot remove NOHZ_FULL
>> CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
>> the callbacks immediately. Anything that only affects kernelspace can wait
>> until the next user->kernel transition, providing it can be executed "early
>> enough" in the entry code.
>>
>
> I want to point out that there's another option here, although anyone 
> trying to implement it would be fighting against quite a lot of history.
>
> Logically, each CPU is in one of a handful of states: user mode, idle, 
> normal kernel mode (possibly subdivided into IRQ, etc), and a handful 
> of very narrow windows, hopefully uninstrumented and not accessing any 
> PTEs that might be invalid, in the entry and exit paths where any state 
> in memory could be out of sync with actual CPU state.  (The latter 
> includes right after the CPU switches to kernel mode, for example.)  
> And NMI and MCE and whatever weird "security" entry types that Intel 
> and AMD love to add.
>
> The way the kernel *currently* deals with this has two big historical oddities:
>
> 1. The entry and exit code cares about ti_flags, which is per-*task*, 
> which means that atomically poking it from other CPUs involves the 
> runqueue lock or other shenanigans (see the idle nr_polling code for 
> example), and also that it's not accessible from the user page tables 
> if PTI is on.
>
> 2. The actual heavyweight atomic part (context tracking) was built for 
> RCU, and it's sort or bolted on, and, as you've observed in this 
> series, it's really quite awkward to do things that aren't RCU using 
> context tracking.
>
> If this were a greenfield project, I think there's a straightforward 
> approach that's much nicer: stick everything into a single percpu flags 
> structure.  Imagine we have cpu_flags, which tracks both the current 
> state of the CPU and what work needs to be done on state changes.  On 
> exit to user mode, we would atomically set the mode to USER and make 
> sure we don't touch anything like vmalloc space after that.  On entry 
> back to kernel mode, we would avoid vmalloc space, etc, then atomically 
> switch to kernel mode and read out whatever deferred work is needed.  
> As an optimization, if nothing in the current configuration needs 
> atomic state tracking, the state could be left at USER_OR_KERNEL and 
> the overhead of an extra atomic op at entry and exit could be avoided.
>
> And RCU would hook into *that* instead of having its own separate set of hooks.
>
> I think that actually doing this would be a big improvement and would 
> also be a serious project.  There's a lot of code that would get 
> touched, and the existing context tracking code is subtle and 
> confusing.  And, as mentioned, ti_flags has the wrong scope.
>
> It's *possible* that one could avoid making ti_flags percpu either by 
> extensive use of the runqueue locks or by borrowing a kludge from the 
> idle code.  For the latter, right now, the reason that the 
> wake-from-idle code works is that the optimized path only happens if 
> the idle thread/cpu is "polling", and it's impossible for the idle 
> ti_flags to be polling while the CPU isn't actually idle.  We could 
> similarly observe that, if a ti_flags says it's in USER mode *and* is 
> on, say, cpu 3, then cpu 3 is most definitely in USER mode.  So someone 
> could try shoving the CPU number into ti_flags :-p   (USER means 
> actually user or in the late exit / early entry path.)
>
> Anyway, benefits of this whole approach would include considerably 
> (IMO) increased comprehensibility compared to the current tangled ct 
> code and much more straightforward addition of new things that happen 
> to a target CPU conditionally depending on its mode.  And, if the flags 
> word was actually per cpu, it could be mapped such that 
> SWITCH_TO_KERNEL_CR3 would use it -- there could be a single CR3 write 
> (and maybe CR4/invpcid depending on whether a zapped mapping is global) 
> and the flush bit could depend on whether a flush is needed.  And there 
> would be basically no chance that a bug that accessed 
> invalidated-but-not-flushed kernel data could be undetected -- in PTI 
> mode, any such access would page fault!  Similarly, if kernel text 
> pokes deferred the flush and serialization, the only code that could 
> execute before noticing the deferred flush would be the user-CR3 code.
>
> Oh, any another primitive would be possible: one CPU could plausibly 
> execute another CPU's interrupts or soft-irqs or whatever by taking a 
> special lock that would effectively pin the remote CPU in user mode -- 
> you'd set a flag in the target cpu_flags saying "pin in USER mode" and 
> the transition on that CPU to kernel mode would then spin on entry to 
> kernel mode and wait for the lock to be released.  This could plausibly 
> get a lot of the on_each_cpu callers to switch over in one fell swoop: 
> anything that needs to synchronize to the remote CPU but does not need 
> to poke its actual architectural state could be executed locally while 
> the remote CPU is pinned.

Following up, I think that x86 can do this all with a single atomic (in the common case) per usermode round trip.  Imagine:

struct fancy_cpu_state {
  u32 work; // <-- writable by any CPU
  u32 status; // <-- readable anywhere; writable locally
};

status includes KERNEL, USER, and maybe INDETERMINATE.  (INDETERMINATE means USER but we're not committing to doing work.)

Exit to user mode:

atomic_set(&my_state->status, USER);

(or, in the lazy case, set to INDETERMINATE instead.)

Entry from user mode, with IRQs off, before switching to kernel CR3:

if (my_state->status == INDETERMINATE) {
  // we were lazy and we never promised to do work atomically.
  atomic_set(&my_state->status, KERNEL);
  this_entry_work = 0;
} else {
  // we were not lazy and we promised we would do work atomically
  atomic exchange the entire state to { .work = 0, .status = KERNEL }
  this_entry_work = (whatever we just read);
}

if (PTI) {
  switch to kernel CR3 *and flush if this_entry_work says to flush*
} else {
  flush if this_entry_work says to flush;
}

do the rest of the work;
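
In kernel-ish C, the entry path above might look roughly like this (a very
rough sketch with invented names; the point being that work and status share
one naturally-aligned u64 so they can be swapped in a single fully-ordered op):

  enum { KERNEL, USER, INDETERMINATE };

  union fancy_cpu_state_packed {
          struct fancy_cpu_state s;
          atomic64_t full;
  };

  static DEFINE_PER_CPU_ALIGNED(union fancy_cpu_state_packed, cpu_state);

  static u32 enter_kernel_mode(void)
  {
          union fancy_cpu_state_packed *st = this_cpu_ptr(&cpu_state);
          union fancy_cpu_state_packed new = {
                  .s = { .work = 0, .status = KERNEL },
          };
          union fancy_cpu_state_packed old;

          if (READ_ONCE(st->s.status) == INDETERMINATE) {
                  /* lazy: we never promised to pick up work atomically */
                  WRITE_ONCE(st->s.status, KERNEL);
                  return 0;
          }

          /* fully-ordered exchange of { work, status } in one go */
          old.full.counter = atomic64_xchg(&st->full, new.full.counter);
          return old.s.work;
  }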



I suppose that a lot of the stuff in ti_flags could merge into here, but it could be done one bit at a time when people feel like doing so.  And I imagine, but I'm very far from confident, that RCU could use this instead of the current context tracking code.


The idea behind INDETERMINATE is that there are plenty of workloads that frequently switch between user and kernel mode and that would rather accept a few IPIs to avoid the heavyweight atomic operation on user -> kernel transitions.  So the default behavior could be to do KERNEL -> INDETERMINATE instead of KERNEL -> USER, but code that wants to be in user mode for a long time could go all the way to USER.  We could make it sort of automatic by noticing that we're returning from an IRQ without a context switch and go to USER (so we would get at most one unneeded IPI per normal user entry), and we could have some nice API for a program that intends to hang out in user mode for a very long time (cpu isolation users, for example) to tell the kernel to go immediately into USER mode.  (Don't we already have something that could be used for this purpose?)

Hmm, now I wonder if it would make sense for the default behavior of Linux to be like that.  We could call it ONEHZ.  It's like NOHZ_FULL except that user threads that don't do syscalls get one single timer tick instead of many or none.


Anyway, I think my proposal is pretty good *if* RCU could be made to use it -- the existing context tracking code is fairly expensive, and I don't think we want to invent a new context-tracking-like mechanism if we still need to do the existing thing.

--Andy
Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
Posted by Paul E. McKenney 1 month ago
On Fri, Nov 14, 2025 at 09:22:35AM -0800, Andy Lutomirski wrote:
> 
> 
> On Fri, Nov 14, 2025, at 8:20 AM, Andy Lutomirski wrote:
> > On Fri, Nov 14, 2025, at 7:01 AM, Valentin Schneider wrote:
> >> Context
> >> =======
> >>
> >> We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
> >> pure-userspace application get regularly interrupted by IPIs sent from
> >> housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
> >> leading to various on_each_cpu() calls, e.g.:
> >>
> >
> >> The heart of this series is the thought that while we cannot remove NOHZ_FULL
> >> CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
> >> the callbacks immediately. Anything that only affects kernelspace can wait
> >> until the next user->kernel transition, providing it can be executed "early
> >> enough" in the entry code.
> >>
> >
> > I want to point out that there's another option here, although anyone 
> > trying to implement it would be fighting against quite a lot of history.
> >
> > Logically, each CPU is in one of a handful of states: user mode, idle, 
> > normal kernel mode (possibly subdivided into IRQ, etc), and a handful 
> > of very narrow windows, hopefully uninstrumented and not accessing any 
> > PTEs that might be invalid, in the entry and exit paths where any state 
> > in memory could be out of sync with actual CPU state.  (The latter 
> > includes right after the CPU switches to kernel mode, for example.)  
> > And NMI and MCE and whatever weird "security" entry types that Intel 
> > and AMD love to add.
> >
> > The way the kernel *currently* deals with this has two big historical oddities:
> >
> > 1. The entry and exit code cares about ti_flags, which is per-*task*, 
> > which means that atomically poking it from other CPUs involves the 
> > runqueue lock or other shenanigans (see the idle nr_polling code for 
> > example), and also that it's not accessible from the user page tables 
> > if PTI is on.
> >
> > 2. The actual heavyweight atomic part (context tracking) was built for 
> > RCU, and it's sort or bolted on, and, as you've observed in this 
> > series, it's really quite awkward to do things that aren't RCU using 
> > context tracking.
> >
> > If this were a greenfield project, I think there's a straightforward 
> > approach that's much nicer: stick everything into a single percpu flags 
> > structure.  Imagine we have cpu_flags, which tracks both the current 
> > state of the CPU and what work needs to be done on state changes.  On 
> > exit to user mode, we would atomically set the mode to USER and make 
> > sure we don't touch anything like vmalloc space after that.  On entry 
> > back to kernel mode, we would avoid vmalloc space, etc, then atomically 
> > switch to kernel mode and read out whatever deferred work is needed.  
> > As an optimization, if nothing in the current configuration needs 
> > atomic state tracking, the state could be left at USER_OR_KERNEL and 
> > the overhead of an extra atomic op at entry and exit could be avoided.
> >
> > And RCU would hook into *that* instead of having its own separate set of hooks.

Please note that RCU needs to sample a given CPU's idle state from other
CPUs, and to have pretty heavy-duty ordering guarantees.  This is needed
to avoid RCU needing to wake up idle CPUs on the one hand or relying on
scheduling-clock interrupts waking up idle CPUs on the other.

Or am I missing the point of your suggestion?

> > I think that actually doing this would be a big improvement and would 
> > also be a serious project.  There's a lot of code that would get 
> > touched, and the existing context tracking code is subtle and 
> > confusing.  And, as mentioned, ti_flags has the wrong scope.

Serious care would certainly be needed!  ;-)

> > It's *possible* that one could avoid making ti_flags percpu either by 
> > extensive use of the runqueue locks or by borrowing a kludge from the 
> > idle code.  For the latter, right now, the reason that the 
> > wake-from-idle code works is that the optimized path only happens if 
> > the idle thread/cpu is "polling", and it's impossible for the idle 
> > ti_flags to be polling while the CPU isn't actually idle.  We could 
> > similarly observe that, if a ti_flags says it's in USER mode *and* is 
> > on, say, cpu 3, then cpu 3 is most definitely in USER mode.  So someone 
> > could try shoving the CPU number into ti_flags :-p   (USER means 
> > actually user or in the late exit / early entry path.)
> >
> > Anyway, benefits of this whole approach would include considerably 
> > (IMO) increased comprehensibility compared to the current tangled ct 
> > code and much more straightforward addition of new things that happen 
> > to a target CPU conditionally depending on its mode.  And, if the flags 
> > word was actually per cpu, it could be mapped such that 
> > SWITCH_TO_KERNEL_CR3 would use it -- there could be a single CR3 write 
> > (and maybe CR4/invpcid depending on whether a zapped mapping is global) 
> > and the flush bit could depend on whether a flush is needed.  And there 
> > would be basically no chance that a bug that accessed 
> > invalidated-but-not-flushed kernel data could be undetected -- in PTI 
> > mode, any such access would page fault!  Similarly, if kernel text 
> > pokes deferred the flush and serialization, the only code that could 
> > execute before noticing the deferred flush would be the user-CR3 code.
> >
> > Oh, any another primitive would be possible: one CPU could plausibly 
> > execute another CPU's interrupts or soft-irqs or whatever by taking a 
> > special lock that would effectively pin the remote CPU in user mode -- 
> > you'd set a flag in the target cpu_flags saying "pin in USER mode" and 
> > the transition on that CPU to kernel mode would then spin on entry to 
> > kernel mode and wait for the lock to be released.  This could plausibly 
> > get a lot of the on_each_cpu callers to switch over in one fell swoop: 
> > anything that needs to synchronize to the remote CPU but does not need 
> > to poke its actual architectural state could be executed locally while 
> > the remote CPU is pinned.

It would be necessary to arrange for the remote CPU to remain pinned
while the local CPU executed on its behalf.  Does the above approach
make that happen without re-introducing our current context-tracking
overhead and complexity?

> Following up, I think that x86 can do this all with a single atomic (in the common case) per usermode round trip.  Imagine:
> 
> struct fancy_cpu_state {
>   u32 work; // <-- writable by any CPU
>   u32 status; // <-- readable anywhere; writable locally
> };
> 
> status includes KERNEL, USER, and maybe INDETERMINATE.  (INDETERMINATE means USER but we're not committing to doing work.)
> 
> Exit to user mode:
> 
> atomic_set(&my_state->status, USER);

We need ordering in the RCU nohz_full case.  If the grace-period kthread
sees the status as USER, all the preceding KERNEL code's effects must
be visible to the grace-period kthread.

> (or, in the lazy case, set to INDETERMINATE instead.)
> 
> Entry from user mode, with IRQs off, before switching to kernel CR3:
> 
> if (my_state->status == INDETERMINATE) {
>   // we were lazy and we never promised to do work atomically.
>   atomic_set(&my_state->status, KERNEL);
>   this_entry_work = 0;
> } else {
>   // we were not lazy and we promised we would do work atomically
>   atomic exchange the entire state to { .work = 0, .status = KERNEL }
>   this_entry_work = (whatever we just read);
> }

If this atomic exchange is fully ordered (as opposed to, say, _relaxed),
then this works in that if the grace-period kthread sees USER, its prior
references are guaranteed not to see later kernel-mode references from
that CPU.

> if (PTI) {
>   switch to kernel CR3 *and flush if this_entry_work says to flush*
> } else {
>   flush if this_entry_work says to flush;
> }
> 
> do the rest of the work;
> 
> 
> 
> I suppose that a lot of the stuff in ti_flags could merge into here, but it could be done one bit at a time when people feel like doing so.  And I imagine, but I'm very far from confident, that RCU could use this instead of the current context tracking code.

RCU currently needs pretty heavy-duty ordering to reliably detect the
other CPUs' quiescent states without needing to wake them from idle, or,
in the nohz_full case, interrupt their userspace execution.  Not saying
it is impossible, but it will need extreme care.

> The idea behind INDETERMINATE is that there are plenty of workloads that frequently switch between user and kernel mode and that would rather accept a few IPIs to avoid the heavyweight atomic operation on user -> kernel transitions.  So the default behavior could be to do KERNEL -> INDETERMINATE instead of KERNEL -> USER, but code that wants to be in user mode for a long time could go all the way to USER.  We could make it sort of automatic by noticing that we're returning from an IRQ without a context switch and go to USER (so we would get at most one unneeded IPI per normal user entry), and we could have some nice API for a program that intends to hang out in user mode for a very long time (cpu isolation users, for example) to tell the kernel to go immediately into USER mode.  (Don't we already have something that could be used for this purpose?)

RCU *could* do an smp_call_function_single() when the CPU failed
to respond, perhaps in a manner similar to how it already forces a
given CPU out of nohz_full state if that CPU has been executing in the
kernel for too long.  The real-time guys might not be amused, though.
Especially those real-time guys hitting sub-microsecond latencies.

> Hmm, now I wonder if it would make sense for the default behavior of Linux to be like that.  We could call it ONEHZ.  It's like NOHZ_FULL except that user threads that don't do syscalls get one single timer tick instead of many or none.
> 
> 
> Anyway, I think my proposal is pretty good *if* RCU could be made to use it -- the existing context tracking code is fairly expensive, and I don't think we want to invent a new context-tracking-like mechanism if we still need to do the existing thing.

If you build with CONFIG_NO_HZ_FULL=n, do you still get the heavyweight
operations when transitioning between kernel and user execution?

							Thanx, Paul
Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
Posted by Andy Lutomirski 1 month ago

On Fri, Nov 14, 2025, at 10:14 AM, Paul E. McKenney wrote:
> On Fri, Nov 14, 2025 at 09:22:35AM -0800, Andy Lutomirski wrote:
>> 

>> > Oh, any another primitive would be possible: one CPU could plausibly 
>> > execute another CPU's interrupts or soft-irqs or whatever by taking a 
>> > special lock that would effectively pin the remote CPU in user mode -- 
>> > you'd set a flag in the target cpu_flags saying "pin in USER mode" and 
>> > the transition on that CPU to kernel mode would then spin on entry to 
>> > kernel mode and wait for the lock to be released.  This could plausibly 
>> > get a lot of the on_each_cpu callers to switch over in one fell swoop: 
>> > anything that needs to synchronize to the remote CPU but does not need 
>> > to poke its actual architectural state could be executed locally while 
>> > the remote CPU is pinned.
>
> It would be necessary to arrange for the remote CPU to remain pinned
> while the local CPU executed on its behalf.  Does the above approach
> make that happen without re-introducing our current context-tracking
> overhead and complexity?

Using the pseudo-implementation farther down, I think this would be like:

if (my_state->status == INDETERMINATE) {
   // we were lazy and we never promised to do work atomically.
   atomic_set(&my_state->status, KERNEL);
   this_entry_work = 0;
   /* we are definitely not pinned in this path */
} else {
   // we were not lazy and we promised we would do work atomically
   atomic exchange the entire state to { .work = 0, .status = KERNEL }
   this_entry_work = (whatever we just read);
   if (this_entry_work & PINNED) {
     atomic_t *this_cpu_pin_count = this_cpu_ptr(&pin_count);
     while (atomic_read(this_cpu_pin_count)) {
       cpu_relax();
     }
   }
 }

and we'd have something like:

bool try_pin_remote_cpu(int cpu)
{
    atomic_t *remote_pin_count = ...;
    struct fancy_cpu_state *remote_state = ...;
    atomic_inc(remote_pin_count);  // optimistic

    // Hmm, we do not want that read to get reordered with the inc, so we probably
    // need a full barrier or seq_cst.  How does Linux spell that?  C++ has atomic::load
    // with seq_cst and maybe the optimizer can do the right thing.  Maybe it's:
    smp_mb__after_atomic();

    if (atomic_read(&remote_state->status) == USER) {
      // Okay, it's genuinely pinned.
      return true;

      // egads, if this is some arch with very weak ordering,
      // do we need to be concerned that we just took a lock but we
      // just did a relaxed read and therefore a subsequent access
      // that thinks it's locked might appear to precede the load and therefore
      // somehow get surprisingly seen out of order by the target cpu?
      // maybe we wanted atomic_read_acquire above instead?
    } else {
      // We might not have successfully pinned it
      atomic_dec(remote_pin_count);
      return false;
    }
}

void unpin_remote_cpu(int cpu)
{
    atomic_dec(remote_pin_count);
}

and we'd use it like:

if (try_pin_remote_cpu(cpu)) {
  // do something useful
} else {
  send IPI;
}

but we'd really accumulate the set of CPUs that need the IPIs and do them all at once.
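
i.e. something along the lines of (a sketch, reusing the helpers above):

  static void run_on_cpus_pinned_or_ipi(const struct cpumask *cpus,
                                        smp_call_func_t func, void *info)
  {
          cpumask_var_t ipi_mask;
          int cpu;

          if (!zalloc_cpumask_var(&ipi_mask, GFP_KERNEL))
                  return;

          for_each_cpu(cpu, cpus) {
                  if (try_pin_remote_cpu(cpu)) {
                          func(info);             /* run it on the pinned CPU's behalf */
                          unpin_remote_cpu(cpu);
                  } else {
                          cpumask_set_cpu(cpu, ipi_mask);
                  }
          }

          /* one batched round of IPIs for everyone we couldn't pin */
          on_each_cpu_mask(ipi_mask, func, info, true);
          free_cpumask_var(ipi_mask);
  }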

I ran the theorem prover that lives inside my head on this code using the assumption that the machine is a well-behaved x86 system and it said "yeah, looks like it might be correct".  I trust an actual formalized system or someone like you who is genuinely very good at this stuff much more than I trust my initial impression :)

>
>> Following up, I think that x86 can do this all with a single atomic (in the common case) per usermode round trip.  Imagine:
>> 
>> struct fancy_cpu_state {
>>   u32 work; // <-- writable by any CPU
>>   u32 status; // <-- readable anywhere; writable locally
>> };
>> 
>> status includes KERNEL, USER, and maybe INDETERMINATE.  (INDETERMINATE means USER but we're not committing to doing work.)
>> 
>> Exit to user mode:
>> 
>> atomic_set(&my_state->status, USER);
>
> We need ordering in the RCU nohz_full case.  If the grace-period kthread
> sees the status as USER, all the preceding KERNEL code's effects must
> be visible to the grace-period kthread.

Sorry, I'm speaking lazy x86 programmer here.  Maybe I mean atomic_set_release.  I want, roughly, the property that anyone who remotely observes USER can rely on the target cpu subsequently going through the atomic exchange path above.  I think even relaxed ought to be good enough for that on most architectures, but there are some potentially nasty complications involving the fact that this mixes operations on a double word and a single word that's part of the double word.

>
>> (or, in the lazy case, set to INDETERMINATE instead.)
>> 
>> Entry from user mode, with IRQs off, before switching to kernel CR3:
>> 
>> if (my_state->status == INDETERMINATE) {
>>   // we were lazy and we never promised to do work atomically.
>>   atomic_set(&my_state->status, KERNEL);
>>   this_entry_work = 0;
>> } else {
>>   // we were not lazy and we promised we would do work atomically
>>   atomic exchange the entire state to { .work = 0, .status = KERNEL }
>>   this_entry_work = (whatever we just read);
>> }
>
> If this atomic exchange is fully ordered (as opposed to, say, _relaxed),
> then this works in that if the grace-period kthread sees USER, its prior
> references are guaranteed not to see later kernel-mode references from
> that CPU.

Yep, that's the intent of my pseudocode.  On x86 this would be a plain 64-bit lock xchg -- I don't think cmpxchg is needed.  (I like to write lock even when it's implicit to avoid needing to trust myself to remember precisely which instructions imply it.)

>
>> if (PTI) {
>>   switch to kernel CR3 *and flush if this_entry_work says to flush*
>> } else {
>>   flush if this_entry_work says to flush;
>> }
>> 
>> do the rest of the work;
>> 
>> 
>> 
>> I suppose that a lot of the stuff in ti_flags could merge into here, but it could be done one bit at a time when people feel like doing so.  And I imagine, but I'm very far from confident, that RCU could use this instead of the current context tracking code.
>
> RCU currently needs pretty heavy-duty ordering to reliably detect the
> other CPUs' quiescent states without needing to wake them from idle, or,
> in the nohz_full case, interrupt their userspace execution.  Not saying
> it is impossible, but it will need extreme care.

Is the thingy above heavy-duty enough?  Perhaps more relevantly, could RCU do this *instead* of the current CT hooks on architectures that implement it, and/or could RCU arrange for the ct hooks to be cheap no-ops on architectures that support the thingy above?  By "could" I mean "could be done without absolutely massive refactoring and without the resulting code being an unmaintainable disaster"?  (I'm sure any sufficiently motivated human or LLM could pull off the unmaintainable disaster version.  I bet I could even do it myself!)

>
>> The idea behind INDETERMINATE is that there are plenty of workloads that frequently switch between user and kernel mode and that would rather accept a few IPIs to avoid the heavyweight atomic operation on user -> kernel transitions.  So the default behavior could be to do KERNEL -> INDETERMINATE instead of KERNEL -> USER, but code that wants to be in user mode for a long time could go all the way to USER.  We could make it sort of automatic by noticing that we're returning from an IRQ without a context switch and go to USER (so we would get at most one unneeded IPI per normal user entry), and we could have some nice API for a program that intends to hang out in user mode for a very long time (cpu isolation users, for example) to tell the kernel to go immediately into USER mode.  (Don't we already have something that could be used for this purpose?)
>
> RCU *could* do an smp_call_function_single() when the CPU failed
> to respond, perhaps in a manner similar to how it already forces a
> given CPU out of nohz_full state if that CPU has been executing in the
> kernel for too long.  The real-time guys might not be amused, though.
> Especially those real-time guys hitting sub-microsecond latencies.

It wouldn't be outrageous to have real-time imply the full USER transition.

>
>> Hmm, now I wonder if it would make sense for the default behavior of Linux to be like that.  We could call it ONEHZ.  It's like NOHZ_FULL except that user threads that don't do syscalls get one single timer tick instead of many or none.
>> 
>> 
>> Anyway, I think my proposal is pretty good *if* RCU could be made to use it -- the existing context tracking code is fairly expensive, and I don't think we want to invent a new context-tracking-like mechanism if we still need to do the existing thing.
>
> If you build with CONFIG_NO_HZ_FULL=n, do you still get the heavyweight
> operations when transitioning between kernel and user execution?

No.  And I even tested approximately this a couple weeks ago for unrelated reasons.  The only unnecessary heavy-weight thing we're doing in a syscall loop with mitigations off is the RDTSC for stack randomization.

But I did that test on a machine that would absolutely benefit from the IPI suppression that the OP is talking about here, and I think it would be really quite nice if a default distro kernel with a more or less default distribution could be easily convinced to run a user thread without interrupting it.  It's a little bit had that NO_HZ_FULL is still an exotic non-default thing, IMO.

--Andy
Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
Posted by Thomas Gleixner 1 month ago
On Fri, Nov 14 2025 at 10:45, Andy Lutomirski wrote:
> But I did that test on a machine that would absolutely benefit from
> the IPI suppression that the OP is talking about here, and I think it
> would be really quite nice if a default distro kernel with a more or
> less default distribution could be easily convinced to run a user
> thread without interrupting it.  It's a little bit had that NO_HZ_FULL
> is still an exotic non-default thing, IMO.

Feel free to help with getting it over the finish line. Feeling sad/bad
(or whatever you wanted to say with 'had') does not change anything as
you should know :)

Thanks,

        tglx
Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
Posted by Paul E. McKenney 1 month ago
On Fri, Nov 14, 2025 at 10:45:08AM -0800, Andy Lutomirski wrote:
> 
> 
> On Fri, Nov 14, 2025, at 10:14 AM, Paul E. McKenney wrote:
> > On Fri, Nov 14, 2025 at 09:22:35AM -0800, Andy Lutomirski wrote:
> >> 
> 
> >> > Oh, any another primitive would be possible: one CPU could plausibly 
> >> > execute another CPU's interrupts or soft-irqs or whatever by taking a 
> >> > special lock that would effectively pin the remote CPU in user mode -- 
> >> > you'd set a flag in the target cpu_flags saying "pin in USER mode" and 
> >> > the transition on that CPU to kernel mode would then spin on entry to 
> >> > kernel mode and wait for the lock to be released.  This could plausibly 
> >> > get a lot of the on_each_cpu callers to switch over in one fell swoop: 
> >> > anything that needs to synchronize to the remote CPU but does not need 
> >> > to poke its actual architectural state could be executed locally while 
> >> > the remote CPU is pinned.
> >
> > It would be necessary to arrange for the remote CPU to remain pinned
> > while the local CPU executed on its behalf.  Does the above approach
> > make that happen without re-introducing our current context-tracking
> > overhead and complexity?
> 
> Using the pseudo-implementation farther down, I think this would be like:
> 
> if (my_state->status == INDETERMINATE) {
>    // we were lazy and we never promised to do work atomically.
>    atomic_set(&my_state->status, KERNEL);
>    this_entry_work = 0;
>    /* we are definitely not pinned in this path */
> } else {
>    // we were not lazy and we promised we would do work atomically
>    atomic exchange the entire state to { .work = 0, .status = KERNEL }
>    this_entry_work = (whatever we just read);
>    if (this_entry_work & PINNED) {
>      u32 this_cpu_pin_count = this_cpu_ptr(pin_count);
>      while (atomic_read(&this_cpu_pin_count)) {
>        cpu_relax();
>      }
>    }
>  }
> 
> and we'd have something like:
> 
> bool try_pin_remote_cpu(int cpu)
> {
>     u32 *remote_pin_count = ...;
>     struct fancy_cpu_state *remote_state = ...;
>     atomic_inc(remote_pin_count);  // optimistic
> 
>     // Hmm, we do not want that read to get reordered with the inc, so we probably
>     // need a full barrier or seq_cst.  How does Linux spell that?  C++ has atomic::load
>     // with seq_cst and maybe the optimizer can do the right thing.  Maybe it's:
>     smp_mb__after_atomic();
> 
>     if (atomic_read(&remote_state->status) == USER) {
>       // Okay, it's genuinely pinned.
>       return true;
> 
>       // egads, if this is some arch with very weak ordering,
>       // do we need to be concerned that we just took a lock but we
>       // just did a relaxed read and therefore a subsequent access
>       // that thinks it's locked might appear to precede the load and therefore
>       // somehow get surprisingly seen out of order by the target cpu?
>       // maybe we wanted atomic_read_acquire above instead?
>     } else {
>       // We might not have successfully pinned it
>       atomic_dec(remote_pin_count);
>     }
> }
> 
> void unpin_remote_cpu(int cpu)
> {
>     atomic_dec(remote_pin_count();
> }
> 
> and we'd use it like:
> 
> if (try_pin_remote_cpu(cpu)) {
>   // do something useful
> } else {
>   send IPI;
> }
> 
> but we'd really accumulate the set of CPUs that need the IPIs and do them all at once.
> 
> I ran the theorem prover that lives inside my head on this code using the assumption that the machine is a well-behaved x86 system and it said "yeah, looks like it might be correct".  I trust an actual formalized system or someone like you who is genuinely very good at this stuff much more than I trust my initial impression :)

Let's start with requirements, non-traditional though that might be. ;-)

An "RCU idle" CPU is either in deep idle or executing in nohz_full
userspace.

1.	If the RCU grace-period kthread sees an RCU-idle CPU, then:

	a.	Everything that this CPU did before entering RCU-idle
		state must be visible to sufficiently later code executed
		by the grace-period kthread, and:

	b.	Everything that the CPU will do after exiting RCU-idle
		state must *not* have been visible to the grace-period
		kthread sufficiently prior to having sampled this
		CPU's state.

2.	If the RCU grace-period kthread sees an RCU-nonidle CPU, then
	it depends on whether this is the same nonidle sojourn as was
	initially seen.  (If the kthread initially saw the CPU in an
	RCU-idle state, it would not have bothered resampling.)

	a.	If this is the same nonidle sojourn, then there are no
		ordering requirements.	RCU must continue to wait on
		this CPU.

	b.	Otherwise, everything that this CPU did before entering
		its last RCU-idle state must be visible to sufficiently
		later code executed by the grace-period kthread.
		Similar to (1a) above.

3.	If a given CPU quickly switches into and out of RCU-idle
	state, and it is always in RCU-nonidle state whenever the RCU
	grace-period kthread looks, RCU must still realize that this
	CPU has passed through at least one quiescent state.

	This is why we have a counter for RCU rather than just a
	simple state.

The usual way to handle (1a) and (2b) is to make the update marking entry
to the RCU-idle state have release semantics and to make the operation
that the RCU grace-period kthread uses to sample the CPU's state have
acquire semantics.

The usual way to handle (1b) is to have a full barrier after the update
marking exit from the RCU-idle state and another full barrier before the
operation that the RCU grace-period kthread uses to sample the CPU's
state.  The "sufficiently" allows some wiggle room on the placement
of both full barriers.  A full barrier can be smp_mb() or some fully
ordered atomic operation.
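
For concreteness, a minimal sketch of that ordering, with my_ct/remote_ct
and the ->rcu_watching field being made-up stand-ins for the real
context-tracking state (and with requirement 3's counter deliberately
left out):

	/* CPU entering RCU-idle (deep idle or nohz_full userspace): */
	atomic_set_release(&my_ct->rcu_watching, 0);	/* (1a)/(2b): release */

	/* CPU exiting RCU-idle: */
	atomic_set(&my_ct->rcu_watching, 1);
	smp_mb();	/* (1b): pairs with the GP kthread's full barrier */

	/* RCU grace-period kthread sampling a remote CPU: */
	smp_mb();	/* (1b): orders the kthread's prior accesses */
	int watching = atomic_read_acquire(&remote_ct->rcu_watching);
			/* pairs with the release above for (1a)/(2b) */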

I will let Valentin and Frederic check the current code.  ;-)

> >> Following up, I think that x86 can do this all with a single atomic (in the common case) per usermode round trip.  Imagine:
> >> 
> >> struct fancy_cpu_state {
> >>   u32 work; // <-- writable by any CPU
> >>   u32 status; // <-- readable anywhere; writable locally
> >> };
> >> 
> >> status includes KERNEL, USER, and maybe INDETERMINATE.  (INDETERMINATE means USER but we're not committing to doing work.)
> >> 
> >> Exit to user mode:
> >> 
> >> atomic_set(&my_state->status, USER);
> >
> > We need ordering in the RCU nohz_full case.  If the grace-period kthread
> > sees the status as USER, all the preceding KERNEL code's effects must
> > be visible to the grace-period kthread.
> 
> Sorry, I'm speaking lazy x86 programmer here.  Maybe I mean atomic_set_release.  I want, roughly, the property that anyone who remotely observes USER can rely on the target cpu subsequently going through the atomic exchange path above.  I think even relaxed ought to be good enough for that on most architectures, but there are some potentially nasty complications involving the fact that this mixes operations on a double word and a single word that's part of the double word.

Please see the requirements laid out above.

> >> (or, in the lazy case, set to INDETERMINATE instead.)
> >> 
> >> Entry from user mode, with IRQs off, before switching to kernel CR3:
> >> 
> >> if (my_state->status == INDETERMINATE) {
> >>   // we were lazy and we never promised to do work atomically.
> >>   atomic_set(&my_state->status, KERNEL);
> >>   this_entry_work = 0;
> >> } else {
> >>   // we were not lazy and we promised we would do work atomically
> >>   atomic exchange the entire state to { .work = 0, .status = KERNEL }
> >>   this_entry_work = (whatever we just read);
> >> }
> >
> > If this atomic exchange is fully ordered (as opposed to, say, _relaxed),
> > then this works in that if the grace-period kthread sees USER, its prior
> > references are guaranteed not to see later kernel-mode references from
> > that CPU.
> 
> Yep, that's the intent of my pseudocode.  On x86 this would be a plain 64-bit lock xchg -- I don't think cmpxchg is needed.  (I like to write lock even when it's implicit to avoid needing to trust myself to remember precisely which instructions imply it.)

OK, then the atomic exchange provides all the ordering that the update
ever needs.
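
In Linux spelling, and assuming the combined { work, status } pair fits
in a single atomic64_t (the field layout below is purely illustrative),
the non-lazy entry path could be as simple as:

	/* Entry from user mode, IRQs off, non-lazy case: */
	u64 old = atomic64_xchg(&my_state->packed,	/* fully ordered */
				STATE_STATUS_KERNEL);	/* work bits cleared */
	u32 this_entry_work = old & STATE_WORK_MASK;	/* made-up layout */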

> >> if (PTI) {
> >>   switch to kernel CR3 *and flush if this_entry_work says to flush*
> >> } else {
> >>   flush if this_entry_work says to flush;
> >> }
> >> 
> >> do the rest of the work;
> >> 
> >> 
> >> 
> >> I suppose that a lot of the stuff in ti_flags could merge into here, but it could be done one bit at a time when people feel like doing so.  And I imagine, but I'm very far from confident, that RCU could use this instead of the current context tracking code.
> >
> > RCU currently needs pretty heavy-duty ordering to reliably detect the
> > other CPUs' quiescent states without needing to wake them from idle, or,
> > in the nohz_full case, interrupt their userspace execution.  Not saying
> > it is impossible, but it will need extreme care.
> 
> Is the thingy above heavy-duty enough?  Perhaps more relevantly, could RCU do this *instead* of the current CT hooks on architectures that implement it, and/or could RCU arrange for the ct hooks to be cheap no-ops on architectures that support the thingy above?  By "could" I mean "could be done without absolutely massive refactoring and without the resulting code being an unmaintainable disaster"?  (I'm sure any sufficiently motivated human or LLM could pull off the unmaintainable disaster version.  I bet I could even do it myself!)

I write broken and unmaintainable code all the time, having more than 50
years of experience doing so.  This is one reason we have rcutorture.  ;-)

RCU used to do its own CT-like hooks.  Merging them into the actual
context-tracking code reduced the entry/exit overhead significantly.

I am not seeing how the third requirement above is met, though.  I have
not verified the sampling code that the RCU grace-period kthread is
supposed to use because I am not seeing it right off-hand.
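
To spell out what the counter buys (with made-up names rather than the
real ct_rcu_watching accessors): the grace-period kthread snapshots the
count, and any later change in it proves that an RCU-idle sojourn
happened in between, even if both samples find the CPU nonidle:

	/* CPU side: bump on every idle<->nonidle transition, so even a
	 * brief RCU-idle sojourn leaves a permanent trace: */
	atomic_inc(&my_ct->watching_count);

	/* GP kthread side, roughly: */
	int snap = atomic_read_acquire(&remote_ct->watching_count);
	if (snap & 1)		/* made-up encoding: odd means RCU-idle */
		return true;	/* in a quiescent state right now */
	/* ... later ... any change means at least one quiescent state: */
	return atomic_read_acquire(&remote_ct->watching_count) != snap;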

> >> The idea behind INDETERMINATE is that there are plenty of workloads that frequently switch between user and kernel mode and that would rather accept a few IPIs to avoid the heavyweight atomic operation on user -> kernel transitions.  So the default behavior could be to do KERNEL -> INDETERMINATE instead of KERNEL -> USER, but code that wants to be in user mode for a long time could go all the way to USER.  We could make it sort of automatic by noticing that we're returning from an IRQ without a context switch and go to USER (so we would get at most one unneeded IPI per normal user entry), and we could have some nice API for a program that intends to hang out in user mode for a very long time (cpu isolation users, for example) to tell the kernel to go immediately into USER mode.  (Don't we already have something that could be used for this purpose?)
> >
> > RCU *could* do an smp_call_function_single() when the CPU failed
> > to respond, perhaps in a manner similar to how it already forces a
> > given CPU out of nohz_full state if that CPU has been executing in the
> > kernel for too long.  The real-time guys might not be amused, though.
> > Especially those real-time guys hitting sub-microsecond latencies.
> 
> It wouldn't be outrageous to have real-time imply the full USER transition.

We could have special code for RT, but this of course increases the
complexity.  Which might be justified by sufficient speedup.

> >> Hmm, now I wonder if it would make sense for the default behavior of Linux to be like that.  We could call it ONEHZ.  It's like NOHZ_FULL except that user threads that don't do syscalls get one single timer tick instead of many or none.
> >> 
> >> 
> >> Anyway, I think my proposal is pretty good *if* RCU could be made to use it -- the existing context tracking code is fairly expensive, and I don't think we want to invent a new context-tracking-like mechanism if we still need to do the existing thing.
> >
> > If you build with CONFIG_NO_HZ_FULL=n, do you still get the heavyweight
> > operations when transitioning between kernel and user execution?
> 
> No.  And I even tested approximately this a couple weeks ago for unrelated reasons.  The only unnecessary heavy-weight thing we're doing in a syscall loop with mitigations off is the RDTSC for stack randomization.

OK, so why not just build your kernels with CONFIG_NO_HZ_FULL=n and be happy?

> But I did that test on a machine that would absolutely benefit from the IPI suppression that the OP is talking about here, and I think it would be really quite nice if a default distro kernel with a more or less default configuration could be easily convinced to run a user thread without interrupting it.  It's a little bit sad that NO_HZ_FULL is still an exotic non-default thing, IMO.

And yes, many distros enable NO_HZ_FULL by default.  I will refrain from
suggesting additional static branches.  ;-)

							Thanx, Paul
Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
Posted by Andy Lutomirski 1 month ago

On Fri, Nov 14, 2025, at 12:03 PM, Paul E. McKenney wrote:
> On Fri, Nov 14, 2025 at 10:45:08AM -0800, Andy Lutomirski wrote:
>> 
>> 
>> On Fri, Nov 14, 2025, at 10:14 AM, Paul E. McKenney wrote:
>> > On Fri, Nov 14, 2025 at 09:22:35AM -0800, Andy Lutomirski wrote:
>> >> 
>> 
>> >> > Oh, and another primitive would be possible: one CPU could plausibly 
>> >> > execute another CPU's interrupts or soft-irqs or whatever by taking a 
>> >> > special lock that would effectively pin the remote CPU in user mode -- 
>> >> > you'd set a flag in the target cpu_flags saying "pin in USER mode" and 
>> >> > the transition on that CPU to kernel mode would then spin on entry to 
>> >> > kernel mode and wait for the lock to be released.  This could plausibly 
>> >> > get a lot of the on_each_cpu callers to switch over in one fell swoop: 
>> >> > anything that needs to synchronize to the remote CPU but does not need 
>> >> > to poke its actual architectural state could be executed locally while 
>> >> > the remote CPU is pinned.
>> >
>> > It would be necessary to arrange for the remote CPU to remain pinned
>> > while the local CPU executed on its behalf.  Does the above approach
>> > make that happen without re-introducing our current context-tracking
>> > overhead and complexity?
>> 
>> Using the pseudo-implementation farther down, I think this would be like:
>> 
>> if (my_state->status == INDETERMINATE) {
>>    // we were lazy and we never promised to do work atomically.
>>    atomic_set(&my_state->status, KERNEL);
>>    this_entry_work = 0;
>>    /* we are definitely not pinned in this path */
>> } else {
>>    // we were not lazy and we promised we would do work atomically
>>    atomic exchange the entire state to { .work = 0, .status = KERNEL }
>>    this_entry_work = (whatever we just read);
>>    if (this_entry_work & PINNED) {
>>      u32 *this_cpu_pin_count = this_cpu_ptr(pin_count);
>>      while (atomic_read(this_cpu_pin_count)) {
>>        cpu_relax();
>>      }
>>    }
>>  }
>> 
>> and we'd have something like:
>> 
>> bool try_pin_remote_cpu(int cpu)
>> {
>>     u32 *remote_pin_count = ...;
>>     struct fancy_cpu_state *remote_state = ...;
>>     atomic_inc(remote_pin_count);  // optimistic
>> 
>>     // Hmm, we do not want that read to get reordered with the inc, so we probably
>>     // need a full barrier or seq_cst.  How does Linux spell that?  C++ has atomic::load
>>     // with seq_cst and maybe the optimizer can do the right thing.  Maybe it's:
>>     smp_mb__after_atomic();
>> 
>>     if (atomic_read(&remote_state->status) == USER) {
>>       // Okay, it's genuinely pinned.
>>       return true;
>> 
>>       // egads, if this is some arch with very weak ordering,
>>       // do we need to be concerned that we just took a lock but we
>>       // just did a relaxed read and therefore a subsequent access
>>       // that thinks it's locked might appear to precede the load and therefore
>>       // somehow get surprisingly seen out of order by the target cpu?
>>       // maybe we wanted atomic_read_acquire above instead?
>>     } else {
>>       // We might not have successfully pinned it
>>       atomic_dec(remote_pin_count);
>>       return false;
>>     }
>> }
>> 
>> void unpin_remote_cpu(int cpu)
>> {
>>     atomic_dec(remote_pin_count);
>> }
>> 
>> and we'd use it like:
>> 
>> if (try_pin_remote_cpu(cpu)) {
>>   // do something useful
>> } else {
>>   send IPI;
>> }
>> 
>> but we'd really accumulate the set of CPUs that need the IPIs and do them all at once.
>> 
>> I ran the theorem prover that lives inside my head on this code using the assumption that the machine is a well-behaved x86 system and it said "yeah, looks like it might be correct".  I trust an actual formalized system or someone like you who is genuinely very good at this stuff much more than I trust my initial impression :)
>
> Let's start with requirements, non-traditional though that might be. ;-)

That's ridiculous! :-)

>
> An "RCU idle" CPU is either in deep idle or executing in nohz_full
> userspace.
>
> 1.	If the RCU grace-period kthread sees an RCU-idle CPU, then:
>
> 	a.	Everything that this CPU did before entering RCU-idle
> 		state must be visible to sufficiently later code executed
> 		by the grace-period kthread, and:
>
> 	b.	Everything that the CPU will do after exiting RCU-idle
> 		state must *not* have been visible to the grace-period
> 		kthread sufficiently prior to having sampled this
> 		CPU's state.
>
> 2.	If the RCU grace-period kthread sees an RCU-nonidle CPU, then
> 	it depends on whether this is the same nonidle sojourn as was
> 	initially seen.  (If the kthread initially saw the CPU in an
> 	RCU-idle state, it would not have bothered resampling.)
>
> 	a.	If this is the same nonidle sojourn, then there are no
> 		ordering requirements.	RCU must continue to wait on
> 		this CPU.
>
> 	b.	Otherwise, everything that this CPU did before entering
> 		its last RCU-idle state must be visible to sufficiently
> 		later code executed by the grace-period kthread.
> 		Similar to (1a) above.
>
> 3.	If a given CPU quickly switches into and out of RCU-idle
> 	state, and it is always in RCU-nonidle state whenever the RCU
> 	grace-period kthread looks, RCU must still realize that this
> 	CPU has passed through at least one quiescent state.
>
> 	This is why we have a counter for RCU rather than just a
> 	simple state.

...

> I am not seeing how the third requirement above is met, though.  I have
> not verified the sampling code that the RCU grace-period kthread is
> supposed to use because I am not seeing it right off-hand.

This is ct_rcu_watching_cpu_acquire, right?  Lemme think.  Maybe there's even a way to do everything I'm suggesting without changing that interface.


>> It wouldn't be outrageous to have real-time imply the full USER transition.
>
> We could have special code for RT, but this of course increases the
> complexity.  Which might be justified by sufficient speedup.

I'm thinking that the code would be arranged so that going to user mode in the USER or the INDETERMINATE state would be fully valid and would just have different performance characteristics.  So the only extra complexity here would be the actual logic to choose which state to go to.  And...
>
> And yes, many distros enable NO_HZ_FULL by default.  I will refrain from
> suggesting additional static branches.  ;-)

I'm not even suggesting a static branch.  I think the potential performance wins are big enough to justify a bona fide ordinary if statement or two :)  I'm also contemplating whether it could make sense to make this whole thing unconditionally configured in, if the performance in the case where no one uses it (i.e. everything is INDETERMINATE instead of USER) is good enough.
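
Concretely, I'd expect the exit-path choice to be nothing more than this
(state names from the pseudocode above; the predicate for "this CPU wants
to stay in user mode for a long while" is a placeholder):

	/* Exit to user mode, IRQs off: */
	if (cpu_wants_long_user_residency())	/* e.g. nohz_full/isolated, placeholder */
		atomic_set_release(&my_state->status, USER);		/* commit to atomic entry work */
	else
		atomic_set_release(&my_state->status, INDETERMINATE);	/* lazy: accept an IPI instead */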

I will ponder.
Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
Posted by Paul E. McKenney 1 month ago
On Fri, Nov 14, 2025 at 04:29:31PM -0800, Andy Lutomirski wrote:
> 
> 
> On Fri, Nov 14, 2025, at 12:03 PM, Paul E. McKenney wrote:
> > On Fri, Nov 14, 2025 at 10:45:08AM -0800, Andy Lutomirski wrote:
> >> 
> >> 
> >> On Fri, Nov 14, 2025, at 10:14 AM, Paul E. McKenney wrote:
> >> > On Fri, Nov 14, 2025 at 09:22:35AM -0800, Andy Lutomirski wrote:
> >> >> 
> >> 
> >> >> > Oh, and another primitive would be possible: one CPU could plausibly 
> >> >> > execute another CPU's interrupts or soft-irqs or whatever by taking a 
> >> >> > special lock that would effectively pin the remote CPU in user mode -- 
> >> >> > you'd set a flag in the target cpu_flags saying "pin in USER mode" and 
> >> >> > the transition on that CPU to kernel mode would then spin on entry to 
> >> >> > kernel mode and wait for the lock to be released.  This could plausibly 
> >> >> > get a lot of the on_each_cpu callers to switch over in one fell swoop: 
> >> >> > anything that needs to synchronize to the remote CPU but does not need 
> >> >> > to poke its actual architectural state could be executed locally while 
> >> >> > the remote CPU is pinned.
> >> >
> >> > It would be necessary to arrange for the remote CPU to remain pinned
> >> > while the local CPU executed on its behalf.  Does the above approach
> >> > make that happen without re-introducing our current context-tracking
> >> > overhead and complexity?
> >> 
> >> Using the pseudo-implementation farther down, I think this would be like:
> >> 
> >> if (my_state->status == INDETERMINATE) {
> >>    // we were lazy and we never promised to do work atomically.
> >>    atomic_set(&my_state->status, KERNEL);
> >>    this_entry_work = 0;
> >>    /* we are definitely not pinned in this path */
> >> } else {
> >>    // we were not lazy and we promised we would do work atomically
> >>    atomic exchange the entire state to { .work = 0, .status = KERNEL }
> >>    this_entry_work = (whatever we just read);
> >>    if (this_entry_work & PINNED) {
> >>      u32 *this_cpu_pin_count = this_cpu_ptr(pin_count);
> >>      while (atomic_read(this_cpu_pin_count)) {
> >>        cpu_relax();
> >>      }
> >>    }
> >>  }
> >> 
> >> and we'd have something like:
> >> 
> >> bool try_pin_remote_cpu(int cpu)
> >> {
> >>     u32 *remote_pin_count = ...;
> >>     struct fancy_cpu_state *remote_state = ...;
> >>     atomic_inc(remote_pin_count);  // optimistic
> >> 
> >>     // Hmm, we do not want that read to get reordered with the inc, so we probably
> >>     // need a full barrier or seq_cst.  How does Linux spell that?  C++ has atomic::load
> >>     // with seq_cst and maybe the optimizer can do the right thing.  Maybe it's:
> >>     smp_mb__after_atomic();
> >> 
> >>     if (atomic_read(&remote_state->status) == USER) {
> >>       // Okay, it's genuinely pinned.
> >>       return true;
> >> 
> >>       // egads, if this is some arch with very weak ordering,
> >>       // do we need to be concerned that we just took a lock but we
> >>       // just did a relaxed read and therefore a subsequent access
> >>       // that thinks it's locked might appear to precede the load and therefore
> >>       // somehow get surprisingly seen out of order by the target cpu?
> >>       // maybe we wanted atomic_read_acquire above instead?
> >>     } else {
> >>       // We might not have successfully pinned it
> >>       atomic_dec(remote_pin_count);
> >>       return false;
> >>     }
> >> }
> >> 
> >> void unpin_remote_cpu(int cpu)
> >> {
> >>     atomic_dec(remote_pin_count);
> >> }
> >> 
> >> and we'd use it like:
> >> 
> >> if (try_pin_remote_cpu(cpu)) {
> >>   // do something useful
> >> } else {
> >>   send IPI;
> >> }
> >> 
> >> but we'd really accumulate the set of CPUs that need the IPIs and do them all at once.
> >> 
> >> I ran the theorem prover that lives inside my head on this code using the assumption that the machine is a well-behaved x86 system and it said "yeah, looks like it might be correct".  I trust an actual formalized system or someone like you who is genuinely very good at this stuff much more than I trust my initial impression :)
> >
> > Let's start with requirements, non-traditional though that might be. ;-)
> 
> That's ridiculous! :-)

;-) ;-) ;-)

> > An "RCU idle" CPU is either in deep idle or executing in nohz_full
> > userspace.
> >
> > 1.	If the RCU grace-period kthread sees an RCU-idle CPU, then:
> >
> > 	a.	Everything that this CPU did before entering RCU-idle
> > 		state must be visible to sufficiently later code executed
> > 		by the grace-period kthread, and:
> >
> > 	b.	Everything that the CPU will do after exiting RCU-idle
> > 		state must *not* have been visible to the grace-period
> > 		kthread sufficiently prior to having sampled this
> > 		CPU's state.
> >
> > 2.	If the RCU grace-period kthread sees an RCU-nonidle CPU, then
> > 	it depends on whether this is the same nonidle sojourn as was
> > 	initially seen.  (If the kthread initially saw the CPU in an
> > 	RCU-idle state, it would not have bothered resampling.)
> >
> > 	a.	If this is the same nonidle sojourn, then there are no
> > 		ordering requirements.	RCU must continue to wait on
> > 		this CPU.
> >
> > 	b.	Otherwise, everything that this CPU did before entering
> > 		its last RCU-idle state must be visible to sufficiently
> > 		later code executed by the grace-period kthread.
> > 		Similar to (1a) above.
> >
> > 3.	If a given CPU quickly switches into and out of RCU-idle
> > 	state, and it is always in RCU-nonidle state whenever the RCU
> > 	grace-period kthread looks, RCU must still realize that this
> > 	CPU has passed through at least one quiescent state.
> >
> > 	This is why we have a counter for RCU rather than just a
> > 	simple state.

Sigh...

4.	Entering and exiting either idle or nohz_full usermode execution
	at task level can race with interrupts and NMIs attempting to
	make this same transition.  (And you had much to do with the
	current code that mediates the NMI races!)
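
The general shape of that mediation is a per-CPU nesting count, so that
only the outermost entry/exit performs the RCU-visible transition; the
names below are placeholders (not the real ct_nmi_enter()/ct_nmi_exit()),
and this glosses over the genuinely hard part, namely an NMI landing
between the count update and the transition itself:

	/* IRQ/NMI entry: */
	if (atomic_fetch_inc(&my_ct->nesting) == 0)	/* outermost entry */
		mark_rcu_watching();	/* fully ordered transition, placeholder */

	/* IRQ/NMI exit: */
	if (atomic_dec_return(&my_ct->nesting) == 0)	/* outermost exit */
		mark_rcu_idle();	/* release-ordered transition, placeholder */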

> ...
> 
> > I am not seeing how the third requirement above is met, though.  I have
> > not verified the sampling code that the RCU grace-period kthread is
> > supposed to use because I am not seeing it right off-hand.
> 
> This is ct_rcu_watching_cpu_acquire, right?  Lemme think.  Maybe there's even a way to do everything I'm suggesting without changing that interface.

Yes.

Also, the ordering can sometimes be implemented at a distance, but it
really does have to be there.

> >> It wouldn't be outrageous to have real-time imply the full USER transition.
> >
> > We could have special code for RT, but this of course increases the
> > complexity.  Which might be justified by sufficient speedup.
> 
> I'm thinking that the code would be arranged so that going to user mode in the USER or the INDETERMINATE state would be fully valid and would just have different performance characteristics.  So the only extra complexity here would be the actual logic to choose which state to go to.  And...
> >
> > And yes, many distros enable NO_HZ_FULL by default.  I will refrain from
> > suggesting additional static branches.  ;-)
> 
> I'm not even suggesting a static branch.  I think the potential performance wins are big enough to justify a bona fide ordinary if statement or two :)  I'm also contemplating whether it could make sense to make this whole thing unconditionally configured in, if the performance in the case where no one uses it (i.e. everything is INDETERMINATE instead of USER) is good enough.

Both sound like they would be nice to have...

> I will ponder.

...give or take feasibility, of course!  ;-)

							Thanx, Paul