Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:
64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x7b
__purge_vmap_area_lazy+0x70
_vm_unmap_aliases.part.42+0xdf
change_page_attr_set_clr+0x16a
set_memory_ro+0x26
bpf_int_jit_compile+0x2f9
bpf_prog_select_runtime+0xc6
bpf_prepare_filter+0x523
sk_attach_filter+0x13
sock_setsockopt+0x92c
__sys_setsockopt+0x16a
__x64_sys_setsockopt+0x20
do_syscall_64+0x87
entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, providing it can be executed "early
enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, like a secondary call_single_queue, turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
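
Concretely, each deferrable IPI gets a work bit, and kernel entry runs the
callback for every bit found pending. The enum and arch hook below mirror the
ct_work bits quoted further down this thread; the dispatch loop is only my
illustration of what ct_work_flush() conceptually does, not the actual
implementation:

#include <linux/bits.h>
#include <linux/bitops.h>
#include <linux/bug.h>
#include <asm/sync_core.h>

/* One bit per deferrable IPI "type". */
enum {
	CT_WORK_SYNC_OFFSET,
	CT_WORK_MAX_OFFSET
};

enum ct_work {
	CT_WORK_SYNC = BIT(CT_WORK_SYNC_OFFSET), /* deferred text-patching sync_core() */
	CT_WORK_MAX  = BIT(CT_WORK_MAX_OFFSET)
};

/* Arch hook: maps an IPI "type" to the operation it stands for. */
static __always_inline void arch_context_tracking_work(enum ct_work work)
{
	switch (work) {
	case CT_WORK_SYNC:
		sync_core();
		break;
	case CT_WORK_MAX:
		WARN_ON_ONCE(true);
	}
}

/* Entry-side dispatch sketch: run each pending work bit exactly once. */
static void ct_work_flush_sketch(unsigned long pending)
{
	unsigned int bit;

	for_each_set_bit(bit, &pending, CT_WORK_MAX_OFFSET)
		arch_context_tracking_work(BIT(bit));
}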
Kernel entry vs execution of the deferred operation
===================================================
This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].
There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can be itself executed (before we start getting into
context_tracking.c proper), i.e.:
idtentry_func_foo() <--- we're in the kernel
irqentry_enter()
irqentry_enter_from_user_mode()
enter_from_user_mode()
[...]
ct_kernel_enter_state()
ct_work_flush() <--- deferred operation is executed here
This means one must take extra care about what can happen in the early entry code,
and ensure that <bad things> cannot happen. For instance, we really don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core(). Patches doing the actual deferral have
more detail on this.
The annoying one: TLB flush deferral
====================================
While leveraging the context_tracking subsystem works for deferring things like
kernel text synchronization, it falls apart when it comes to kernel range TLB
flushes. Consider the following execution flow:
<userspace>
!interrupt!
SWITCH_TO_KERNEL_CR3 <--- vmalloc range becomes accessible
idtentry_func_foo()
irqentry_enter()
irqentry_enter_from_user_mode()
enter_from_user_mode()
[...]
ct_kernel_enter_state()
ct_work_flush() <--- deferred flush would be done here
Since there is no sane way to assert no stale entry is accessed during
kernel entry, any code executed between SWITCH_TO_KERNEL_CR3 and
ct_work_flush() is at risk of accessing a stale entry.
Dave had suggested hacking up something within SWITCH_TO_KERNEL_CR3 itself,
which is what has been implemented in the new RFC patches.
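
For illustration, here is a C rendition of that idea. kernel_cr3_loaded is the
software signal discussed below; everything else (deferred_kernel_flush, the
helper names, the exact flush primitive) is hypothetical, and the real series
does this in the SWITCH_TO_KERNEL_CR3 asm rather than in C:

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/percpu.h>
#include <asm/tlbflush.h>

/* 1 while this CPU runs with the kernel CR3, 0 while it runs with the user CR3. */
static DEFINE_PER_CPU(int, kernel_cr3_loaded);
/* Set remotely when a kernel-range TLB flush was deferred for this CPU. */
static DEFINE_PER_CPU(atomic_t, deferred_kernel_flush);

/* Remote flusher: returns true if the IPI to @cpu can be skipped. */
static bool defer_kernel_flush(int cpu)
{
	atomic_set(per_cpu_ptr(&deferred_kernel_flush, cpu), 1);
	smp_mb(); /* pairs with the fully ordered xchg in the entry path below */

	/*
	 * If the target already runs (or just started running) with the kernel
	 * CR3, fall back to the IPI; a spurious double flush is harmless.
	 */
	return !READ_ONCE(per_cpu(kernel_cr3_loaded, cpu));
}

/* Entry path: conceptually what SWITCH_TO_KERNEL_CR3 does before any kernel access. */
static void switch_to_kernel_cr3_flush(void)
{
	this_cpu_write(kernel_cr3_loaded, 1);

	/* atomic_xchg() is fully ordered, closing the race with defer_kernel_flush(). */
	if (atomic_xchg(this_cpu_ptr(&deferred_kernel_flush), 0))
		__flush_tlb_all(); /* flush before touching e.g. the vmalloc range */
}

/* Exit path: clear the signal again when switching back to the user CR3. */
static void switch_to_user_cr3(void)
{
	this_cpu_write(kernel_cr3_loaded, 0);
}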
How bad is it?
==============
Code
++++
I'm happy that the COALESCE_TLBI asm code fits in ~half a screen,
although it open-codes native_write_cr4() without the pinning logic.
I hate the kernel_cr3_loaded signal; it's a kludgy context_tracking.state
duplicate but I need *some* sort of signal to drive the TLB flush deferral and
the context_tracking.state one is set too late in kernel entry. I couldn't
find any fitting existing signals for this.
I'm also unhappy to introduce two different IPI deferral mechanisms. I tried
shoving the text_poke_sync() into SWITCH_TO_KERNEL_CR3, but it got ugly(er) really
fast.
Performance
+++++++++++
Tested by measuring the duration of 10M `syscall(SYS_getpid)` calls on
NOHZ_FULL CPUs, with rteval (hackbench + kernel compilation) running on the
housekeeping CPUs:
o Xeon E5-2699: base avg 770ns, patched avg 1340ns (74% increase)
o Xeon E7-8890: base avg 1040ns, patched avg 1320ns (27% increase)
o Xeon Gold 6248: base avg 270ns, patched avg 273ns (~1% increase)
I don't get that last one; I did spend a ridiculous amount of time making sure
the flush was being executed, and AFAICT yes, it was. What I take out of this is
that it can be a pretty massive increase in the entry overhead (for NOHZ_FULL
CPUs), and that's something I want to hear thoughts on.
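
For reference, the measurement loop is essentially of the following shape. This
is a minimal sketch rather than the exact harness used, and pinning the task to
an isolated NOHZ_FULL CPU (e.g. via taskset or sched_setaffinity()) is assumed
to be done beforehand:

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	const long iters = 10 * 1000 * 1000; /* 10M getpid syscalls */
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters; i++)
		syscall(SYS_getpid);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("avg %.1f ns/syscall\n", ns / iters);
	return 0;
}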
Noise
+++++
Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL10 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
$ trace-cmd record -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
rteval --onlyload --loads-cpulist=$HK_CPUS \
--hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 6 hours.
v6.17
o ~5400 IPIs received, so about ~200 interfering IPIs per isolated CPU
o About one interfering IPI just shy of every 2 minutes
v6.17 + patches
o Zilch!
Patches
=======
o Patches 1-2 are standalone objtool cleanups.
o Patches 3-4 add an RCU testing feature.
o Patches 5-6 add infrastructure for annotating static keys and static calls
that may be used in noinstr code (courtesy of Josh).
o Patches 7-20 use said annotations on relevant keys / calls.
o Patch 21 enforces proper usage of said annotations (courtesy of Josh).
o Patch 22 deals with detecting NOINSTR text in modules
o Patches 23-24 deal with kernel text sync IPIs
o Patches 25-29 deal with kernel range TLB flush IPIs
Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v6
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o Dave Hansen for patiently educating me about mm
o All of the folks who attended various (too many?) talks about this and
provided precious feedback.
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: http://lore.kernel.org/r/eef09bdc-7546-462b-9ac0-661a44d2ceae@intel.com
[6]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/
Revisions
=========
v5 -> v6
++++++++
o Rebased onto v6.17
o Small conflict fixes due to the cpu_buf_idle_clear and smp_text_poke() renames
o Added the TLB flush craziness
v4 -> v5
++++++++
o Rebased onto v6.15-rc3
o Collected Reviewed-by
o Annotated a few more static keys
o Added proper checking of noinstr sections that are in loadable code such as
KVM early entry (Sean Christopherson)
o Switched to checking for CT_RCU_WATCHING instead of CT_STATE_KERNEL or
CT_STATE_IDLE, which means deferral is now behaving sanely for IRQ/NMI
entry from idle (thanks to Frederic!)
o Ditched the vmap TLB flush deferral (for now)
RFCv3 -> v4
+++++++++++
o Rebased onto v6.13-rc6
o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)
o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups
o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ
RFCv2 -> RFCv3
++++++++++++++
o Rebased onto v6.12-rc6
o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
rcutorture case for a low-size counter (Paul)
o Fixed flush_tlb_kernel_range_deferrable() definition
Josh Poimboeuf (3):
jump_label: Add annotations for validating noinstr usage
static_call: Add read-only-after-init static calls
objtool: Add noinstr validation for static branches/calls
Valentin Schneider (26):
objtool: Make validate_call() recognize indirect calls to pv_ops[]
objtool: Flesh out warning related to pv_ops[] calls
rcu: Add a small-width RCU watching counter debug option
rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
x86/idle: Mark x86_idle static call as __ro_after_init
x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
sched/clock: Mark sched_clock_running key as __ro_after_init
KVM: VMX: Mark __kvm_is_using_evmcs static key as __ro_after_init
x86/speculation/mds: Mark cpu_buf_idle_clear key as allowed in
.noinstr
sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
KVM: VMX: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as
allowed in .noinstr
stackleak: Mark stack_erasing_bypass key as allowed in .noinstr
module: Add MOD_NOINSTR_TEXT mem_type
context-tracking: Introduce work deferral infrastructure
context_tracking,x86: Defer kernel text patching IPIs
x86/mm: Make INVPCID type macros available to assembly
x86/mm/pti: Introduce a kernel/user CR3 software signal
x86/mm/pti: Implement a TLB flush immediately after a switch to kernel
CR3
x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under
CONFIG_COALESCE_TLBI=y
x86/entry: Add an option to coalesce TLB flushes
arch/Kconfig | 9 ++
arch/arm/kernel/paravirt.c | 2 +-
arch/arm64/kernel/paravirt.c | 2 +-
arch/loongarch/kernel/paravirt.c | 2 +-
arch/riscv/kernel/paravirt.c | 2 +-
arch/x86/Kconfig | 18 +++
arch/x86/entry/calling.h | 36 ++++++
arch/x86/entry/syscall_64.c | 4 +
arch/x86/events/amd/brs.c | 2 +-
arch/x86/include/asm/context_tracking_work.h | 18 +++
arch/x86/include/asm/invpcid.h | 14 ++-
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/include/asm/tlbflush.h | 6 +
arch/x86/kernel/alternative.c | 39 ++++++-
arch/x86/kernel/asm-offsets.c | 1 +
arch/x86/kernel/cpu/bugs.c | 2 +-
arch/x86/kernel/kprobes/core.c | 4 +-
arch/x86/kernel/kprobes/opt.c | 4 +-
arch/x86/kernel/module.c | 2 +-
arch/x86/kernel/paravirt.c | 4 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 11 +-
arch/x86/kvm/vmx/vmx_onhyperv.c | 2 +-
arch/x86/mm/tlb.c | 34 ++++--
include/asm-generic/sections.h | 15 +++
include/linux/context_tracking.h | 21 ++++
include/linux/context_tracking_state.h | 54 +++++++--
include/linux/context_tracking_work.h | 26 +++++
include/linux/jump_label.h | 30 ++++-
include/linux/module.h | 6 +-
include/linux/objtool.h | 7 ++
include/linux/static_call.h | 19 ++++
kernel/context_tracking.c | 69 +++++++++++-
kernel/kprobes.c | 8 +-
kernel/kstack_erase.c | 6 +-
kernel/module/main.c | 76 ++++++++++---
kernel/rcu/Kconfig.debug | 15 +++
kernel/sched/clock.c | 7 +-
kernel/time/Kconfig | 5 +
mm/vmalloc.c | 34 +++++-
tools/objtool/Documentation/objtool.txt | 34 ++++++
tools/objtool/check.c | 106 +++++++++++++++---
tools/objtool/include/objtool/check.h | 1 +
tools/objtool/include/objtool/elf.h | 1 +
tools/objtool/include/objtool/special.h | 1 +
tools/objtool/special.c | 15 ++-
.../selftests/rcutorture/configs/rcu/TREE04 | 1 +
47 files changed, 682 insertions(+), 96 deletions(-)
create mode 100644 arch/x86/include/asm/context_tracking_work.h
create mode 100644 include/linux/context_tracking_work.h
--
2.51.0
Hello,
On 10/10/25 17:38, Valentin Schneider wrote:
...
> Performance
> +++++++++++
>
> Tested by measuring the duration of 10M `syscall(SYS_getpid)` calls on
> NOHZ_FULL CPUs, with rteval (hackbench + kernel compilation) running on the
> housekeeping CPUs:
>
> o Xeon E5-2699: base avg 770ns, patched avg 1340ns (74% increase)
> o Xeon E7-8890: base avg 1040ns, patched avg 1320ns (27% increase)
> o Xeon Gold 6248: base avg 270ns, patched avg 273ns (~1% increase)
>
> I don't get that last one, I did spend a ridiculous amount of time making sure
> the flush was being executed, and AFAICT yes, it was. What I take out of this is
> that it can be a pretty massive increase in the entry overhead (for NOHZ_FULL
> CPUs), and that's something I want to hear thoughts on
>
> Noise
> +++++
>
> Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
> RHEL10 userspace.
>
> Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
> and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
>
> $ trace-cmd record -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
> -e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
> rteval --onlyload --loads-cpulist=$HK_CPUS \
> --hackbench-runlowmem=True --duration=$DURATION
>
> This only records IPIs sent to isolated CPUs, so any event there is interference
> (with a bit of fuzz at the start/end of the workload when spawning the
> processes). All tests were done with a duration of 6 hours.
>
> v6.17
> o ~5400 IPIs received, so about ~200 interfering IPI per isolated CPU
> o About one interfering IPI just shy of every 2 minutes
>
> v6.17 + patches
> o Zilch!
Nice. :)
About performance, can we assume housekeeping CPUs are not affected by
the change (they don't seem to use the trick anyway) or do we want/need
to collect some numbers on them as well just in case (maybe more
throughput oriented)?
Thanks,
Juri
On 14/10/25 14:58, Juri Lelli wrote:
>> Noise
>> +++++
>>
>> Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
>> RHEL10 userspace.
>>
>> Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
>> and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
>>
>> $ trace-cmd record -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
>> -e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
>> rteval --onlyload --loads-cpulist=$HK_CPUS \
>> --hackbench-runlowmem=True --duration=$DURATION
>>
>> This only records IPIs sent to isolated CPUs, so any event there is interference
>> (with a bit of fuzz at the start/end of the workload when spawning the
>> processes). All tests were done with a duration of 6 hours.
>>
>> v6.17
>> o ~5400 IPIs received, so about ~200 interfering IPI per isolated CPU
>> o About one interfering IPI just shy of every 2 minutes
>>
>> v6.17 + patches
>> o Zilch!
>
> Nice. :)
>
> About performance, can we assume housekeeping CPUs are not affected by
> the change (they don't seem to use the trick anyway) or do we want/need
> to collect some numbers on them as well just in case (maybe more
> throughput oriented)?
>
So for the text_poke IPI yes, because this is all done through
context_tracking, which doesn't involve housekeeping CPUs.
For the TLB flush faff the HK CPUs get two extra writes per kernel entry
cycle (one at entry and one at exit, for that stupid signal) which I expect
to be noticeable but small-ish. I can definitely go and measure that.
> Thanks,
> Juri
On 14/10/25 17:26, Valentin Schneider wrote:
> On 14/10/25 14:58, Juri Lelli wrote:
>>> Noise
>>> +++++
>>>
>>> Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
>>> RHEL10 userspace.
>>>
>>> Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
>>> and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
>>>
>>> $ trace-cmd record -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
>>> -e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
>>> rteval --onlyload --loads-cpulist=$HK_CPUS \
>>> --hackbench-runlowmem=True --duration=$DURATION
>>>
>>> This only records IPIs sent to isolated CPUs, so any event there is interference
>>> (with a bit of fuzz at the start/end of the workload when spawning the
>>> processes). All tests were done with a duration of 6 hours.
>>>
>>> v6.17
>>> o ~5400 IPIs received, so about ~200 interfering IPI per isolated CPU
>>> o About one interfering IPI just shy of every 2 minutes
>>>
>>> v6.17 + patches
>>> o Zilch!
>>
>> Nice. :)
>>
>> About performance, can we assume housekeeping CPUs are not affected by
>> the change (they don't seem to use the trick anyway) or do we want/need
>> to collect some numbers on them as well just in case (maybe more
>> throughput oriented)?
>>
>
> So for the text_poke IPI yes, because this is all done through
> context_tracking which doesn't imply housekeeping CPUs.
>
> For the TLB flush faff the HK CPUs get two extra writes per kernel entry
> cycle (one at entry and one at exit, for that stupid signal) which I expect
> to be noticeable but small-ish. I can definitely go and measure that.
>
On that same Xeon E5-2699 system with the same tuning, the average time
taken for 300M gettid syscalls on housekeeping CPUs is
v6.17: 698.64ns ± 2.35ns
v6.17 + series: 702.60ns ± 3.43ns
So noticeable (~.6% worse) but not horrible?
>> Thanks,
>> Juri
On 15/10/25 15:16, Valentin Schneider wrote:
> On 14/10/25 17:26, Valentin Schneider wrote:
> > On 14/10/25 14:58, Juri Lelli wrote:
> >>> Noise
> >>> +++++
> >>>
> >>> Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
> >>> RHEL10 userspace.
> >>>
> >>> Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
> >>> and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
> >>>
> >>> $ trace-cmd record -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
> >>> -e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
> >>> rteval --onlyload --loads-cpulist=$HK_CPUS \
> >>> --hackbench-runlowmem=True --duration=$DURATION
> >>>
> >>> This only records IPIs sent to isolated CPUs, so any event there is interference
> >>> (with a bit of fuzz at the start/end of the workload when spawning the
> >>> processes). All tests were done with a duration of 6 hours.
> >>>
> >>> v6.17
> >>> o ~5400 IPIs received, so about ~200 interfering IPI per isolated CPU
> >>> o About one interfering IPI just shy of every 2 minutes
> >>>
> >>> v6.17 + patches
> >>> o Zilch!
> >>
> >> Nice. :)
> >>
> >> About performance, can we assume housekeeping CPUs are not affected by
> >> the change (they don't seem to use the trick anyway) or do we want/need
> >> to collect some numbers on them as well just in case (maybe more
> >> throughput oriented)?
> >>
> >
> > So for the text_poke IPI yes, because this is all done through
> > context_tracking which doesn't imply housekeeping CPUs.
> >
> > For the TLB flush faff the HK CPUs get two extra writes per kernel entry
> > cycle (one at entry and one at exit, for that stupid signal) which I expect
> > to be noticeable but small-ish. I can definitely go and measure that.
> >
>
> On that same Xeon E5-2699 system with the same tuning, the average time
> taken for 300M gettid syscalls on housekeeping CPUs is
> v6.17: 698.64ns ± 2.35ns
> v6.17 + series: 702.60ns ± 3.43ns
>
> So noticeable (~.6% worse) but not horrible?
Yeah, seems reasonable.
Thanks for collecting numbers!
+Cc Phil Auld

On Fri, Oct 10, 2025 at 05:38:10PM +0200, Valentin Schneider wrote:
> Patches
> =======
>
> o Patches 1-2 are standalone objtool cleanups.

Would be nice to get these merged.

> o Patches 3-4 add an RCU testing feature.

I'm taking this one.

>
> o Patches 5-6 add infrastructure for annotating static keys and static calls
> that may be used in noinstr code (courtesy of Josh).
> o Patches 7-20 use said annotations on relevant keys / calls.
> o Patch 21 enforces proper usage of said annotations (courtesy of Josh).
>
> o Patch 22 deals with detecting NOINSTR text in modules

Not sure how to route those. If we wait for each individual subsystem,
this may take a while.

> o Patches 23-24 deal with kernel text sync IPIs

I would be fine taking those (the concerns I had are just details)
but they depend on all the annotations. Alternatively I can take the whole
thing but then we'll need some acks.

> o Patches 25-29 deal with kernel range TLB flush IPIs

I'll leave these more time for now ;o)
And if they ever go somewhere, it should be through x86 tree.

Also, here is another candidate usecase for this deferral thing.
I remember Phil Auld complaining that stop_machine() on CPU offlining was
a big problem for nohz_full. Especially while we are working on
a cpuset interface to toggle nohz_full but this will require the CPUs
to be offline.

So my point is that when a CPU goes offline, stop_machine() puts all
CPUs into a loop with IRQs disabled. CPUs in userspace could possibly
escape that since they don't touch the kernel anyway. But as soon as
they enter the kernel, they should either acquire the final state of
stop_machine if completed or join the global loop if in the middle.

Thanks.

--
Frederic Weisbecker
SUSE Labs
On 28/10/25 17:25, Frederic Weisbecker wrote: > +Cc Phil Auld > > Le Fri, Oct 10, 2025 at 05:38:10PM +0200, Valentin Schneider a écrit : >> Patches >> ======= >> >> o Patches 1-2 are standalone objtool cleanups. > > Would be nice to get these merged. > >> o Patches 3-4 add an RCU testing feature. > > I'm taking this one. > Thanks! >> >> o Patches 5-6 add infrastructure for annotating static keys and static calls >> that may be used in noinstr code (courtesy of Josh). >> o Patches 7-20 use said annotations on relevant keys / calls. >> o Patch 21 enforces proper usage of said annotations (courtesy of Josh). >> >> o Patch 22 deals with detecting NOINSTR text in modules > > Not sure how to route those. If we wait for each individual subsystem, > this may take a while. > At least the __ro_after_init ones could go as their own thing since they're standalone, but yeah they're the ones touching all sorts of subsystems :/ >> o Patches 23-24 deal with kernel text sync IPIs > > I would be fine taking those (the concerns I had are just details) > but they depend on all the annotations. Alternatively I can take the whole > thing but then we'll need some acks. > >> o Patches 25-29 deal with kernel range TLB flush IPIs > > I'll leave these more time for now ;o) > And if they ever go somewhere, it should be through x86 tree. > > Also, here is another candidate usecase for this deferral thing. > I remember Phil Auld complaining that stop_machine() on CPU offlining was > a big problem for nohz_full. Especially while we are working on > a cpuset interface to toggle nohz_full but this will require the CPUs > to be offline. > Yeah that does ring a bell... > So my point is that when a CPU goes offline, stop_machine() puts all > CPUs into a loop with IRQs disabled. CPUs in userspace could possibly > escape that since they don't touch the kernel anyway. But as soon as > they enter the kernel, they should either acquire the final state of > stop_machine if completed or join the global loop if in the middle. > I need to have a think about that one; one pain point I see is the context tracking work has to be NMI safe since e.g. an NMI can take us out of userspace. Another is that NOHZ-full CPUs need to be special cased in the stop machine queueing / completion. /me goes fetch a new notebook > Thanks. > > -- > Frederic Weisbecker > SUSE Labs
On Wed, Oct 29, 2025 at 11:32:58AM +0100, Valentin Schneider wrote:
> I need to have a think about that one; one pain point I see is the context
> tracking work has to be NMI safe since e.g. an NMI can take us out of
> userspace. Another is that NOHZ-full CPUs need to be special cased in the
> stop machine queueing / completion.
>
> /me goes fetch a new notebook
Something like the below (untested) ?
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 485b32881fde..2940e28ecea6 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@
#define _ASM_X86_CONTEXT_TRACKING_WORK_H
#include <asm/sync_core.h>
+#include <linux/stop_machine.h>
static __always_inline void arch_context_tracking_work(enum ct_work work)
{
@@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(enum ct_work work)
case CT_WORK_SYNC:
sync_core();
break;
+ case CT_WORK_STOP_MACHINE:
+ stop_machine_poll_wait();
+ break;
case CT_WORK_MAX:
WARN_ON_ONCE(true);
}
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 2facc621be06..b63200bd73d6 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -6,12 +6,14 @@
enum {
CT_WORK_SYNC_OFFSET,
+ CT_WORK_STOP_MACHINE_OFFSET,
CT_WORK_MAX_OFFSET
};
enum ct_work {
- CT_WORK_SYNC = BIT(CT_WORK_SYNC_OFFSET),
- CT_WORK_MAX = BIT(CT_WORK_MAX_OFFSET)
+ CT_WORK_SYNC = BIT(CT_WORK_SYNC_OFFSET),
+ CT_WORK_STOP_MACHINE = BIT(CT_WORK_STOP_MACHINE_OFFSET),
+ CT_WORK_MAX = BIT(CT_WORK_MAX_OFFSET)
};
#include <asm/context_tracking_work.h>
diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 72820503514c..0efe88e84b8a 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -36,6 +36,7 @@ bool stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
void stop_machine_park(int cpu);
void stop_machine_unpark(int cpu);
void stop_machine_yield(const struct cpumask *cpumask);
+void stop_machine_poll_wait(void);
extern void print_stop_info(const char *log_lvl, struct task_struct *task);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 3fe6b0c99f3d..8f0281b0db64 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -22,6 +22,7 @@
#include <linux/atomic.h>
#include <linux/nmi.h>
#include <linux/sched/wake_q.h>
+#include <linux/sched/isolation.h>
/*
* Structure to determine completion condition and record errors. May
@@ -176,6 +177,68 @@ struct multi_stop_data {
atomic_t thread_ack;
};
+static DEFINE_PER_CPU(int, stop_machine_poll);
+
+void stop_machine_poll_wait(void)
+{
+ int *poll = this_cpu_ptr(&stop_machine_poll);
+
+ while (*poll)
+ cpu_relax();
+ /* Enforce the work in stop machine to be visible */
+ smp_mb();
+}
+
+static void stop_machine_poll_start(struct multi_stop_data *msdata)
+{
+ int cpu;
+
+ if (housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
+ return;
+
+ /* Random target can't be known in advance */
+ if (!msdata->active_cpus)
+ return;
+
+ for_each_cpu_andnot(cpu, cpu_online_mask, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)) {
+ int *poll = per_cpu_ptr(&stop_machine_poll, cpu);
+
+ if (cpumask_test_cpu(cpu, msdata->active_cpus))
+ continue;
+
+ *poll = 1;
+
+ /*
+ * Act as a full barrier so that if the work is queued, polling is
+ * visible.
+ */
+ if (ct_set_cpu_work(cpu, CT_WORK_STOP_MACHINE))
+ msdata->num_threads--;
+ else
+ *poll = 0;
+ }
+}
+
+static void stop_machine_poll_complete(struct multi_stop_data *msdata)
+{
+ int cpu;
+
+ if (housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
+ return;
+
+ for_each_cpu_andnot(cpu, cpu_online_mask, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)) {
+ int *poll = per_cpu_ptr(&stop_machine_poll, cpu);
+
+ if (cpumask_test_cpu(cpu, msdata->active_cpus))
+ continue;
+ /*
+ * The RmW in ack_state() fully orders the work performed in stop_machine()
+ * with polling.
+ */
+ *poll = 0;
+ }
+}
+
static void set_state(struct multi_stop_data *msdata,
enum multi_stop_state newstate)
{
@@ -186,10 +249,13 @@ static void set_state(struct multi_stop_data *msdata,
}
/* Last one to ack a state moves to the next state. */
-static void ack_state(struct multi_stop_data *msdata)
+static bool ack_state(struct multi_stop_data *msdata)
{
- if (atomic_dec_and_test(&msdata->thread_ack))
+ if (atomic_dec_and_test(&msdata->thread_ack)) {
set_state(msdata, msdata->state + 1);
+ return true;
+ }
+ return false;
}
notrace void __weak stop_machine_yield(const struct cpumask *cpumask)
@@ -240,7 +306,8 @@ static int multi_cpu_stop(void *data)
default:
break;
}
- ack_state(msdata);
+ if (ack_state(msdata) && msdata->state == MULTI_STOP_EXIT)
+ stop_machine_poll_complete(msdata);
} else if (curstate > MULTI_STOP_PREPARE) {
/*
* At this stage all other CPUs we depend on must spin
@@ -615,6 +682,8 @@ int stop_machine_cpuslocked(cpu_stop_fn_t fn, void *data,
return ret;
}
+ stop_machine_poll_start(&msdata);
+
/* Set the initial state and stop all online cpus. */
set_state(&msdata, MULTI_STOP_PREPARE);
return stop_cpus(cpu_online_mask, multi_cpu_stop, &msdata);
On 29/10/25 18:15, Frederic Weisbecker wrote:
> On Wed, Oct 29, 2025 at 11:32:58AM +0100, Valentin Schneider wrote:
>> I need to have a think about that one; one pain point I see is the context
>> tracking work has to be NMI safe since e.g. an NMI can take us out of
>> userspace. Another is that NOHZ-full CPUs need to be special cased in the
>> stop machine queueing / completion.
>>
>> /me goes fetch a new notebook
>
> Something like the below (untested) ?
>
Some minor nits below but otherwise that looks promising.
One problem I'm having however is reasoning about the danger zone; what
forbidden actions could a NO_HZ_FULL CPU take when entering the kernel
while take_cpu_down() is happening?
I'm actually not familiar with why we actually use stop_machine() for CPU
hotplug; I see things like CPUHP_AP_SMPCFD_DYING::smpcfd_dying_cpu() or
CPUHP_AP_TICK_DYING::tick_cpu_dying() expect other CPUs to be patiently
spinning in multi_cpu_stop(), and I *think* nothing in the entry code up to
context_tracking entry would disrupt that, but it's not a small thing to
reason about.
AFAICT we need to reason about every .teardown callback from
CPUHP_TEARDOWN_CPU to CPUHP_AP_OFFLINE and their explicit & implicit
dependencies on other CPUs being STOP'd.
> @@ -176,6 +177,68 @@ struct multi_stop_data {
> atomic_t thread_ack;
> };
>
> +static DEFINE_PER_CPU(int, stop_machine_poll);
> +
> +void stop_machine_poll_wait(void)
That needs to be noinstr, and AFAICT there's no problem with doing just that.
> +{
> + int *poll = this_cpu_ptr(&stop_machine_poll);
> +
> + while (*poll)
> + cpu_relax();
> + /* Enforce the work in stop machine to be visible */
> + smp_mb();
> +}
> +
> +static void stop_machine_poll_start(struct multi_stop_data *msdata)
> +{
> + int cpu;
> +
> + if (housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
I think that wants a negation
> + return;
> +
> + /* Random target can't be known in advance */
> + if (!msdata->active_cpus)
> + return;
On Wed, Nov 05, 2025 at 05:24:29PM +0100, Valentin Schneider wrote:
> On 29/10/25 18:15, Frederic Weisbecker wrote:
> > On Wed, Oct 29, 2025 at 11:32:58AM +0100, Valentin Schneider wrote:
> >> I need to have a think about that one; one pain point I see is the context
> >> tracking work has to be NMI safe since e.g. an NMI can take us out of
> >> userspace. Another is that NOHZ-full CPUs need to be special cased in the
> >> stop machine queueing / completion.
> >>
> >> /me goes fetch a new notebook
> >
> > Something like the below (untested) ?
> >
>
> Some minor nits below but otherwise that looks promising.
>
> One problem I'm having however is reasoning about the danger zone; what
> forbidden actions could a NO_HZ_FULL CPU take when entering the kernel
> while take_cpu_down() is happening?
>
> I'm actually not familiar with why we actually use stop_machine() for CPU
> hotplug; I see things like CPUHP_AP_SMPCFD_DYING::smpcfd_dying_cpu() or
> CPUHP_AP_TICK_DYING::tick_cpu_dying() expect other CPUs to be patiently
> spinning in multi_cpu_stop(), and I *think* nothing in the entry code up to
> context_tracking entry would disrupt that, but it's not a small thing to
> reason about.
>
> AFAICT we need to reason about every .teardown callback from
> CPUHP_TEARDOWN_CPU to CPUHP_AP_OFFLINE and their explicit & implicit
> dependencies on other CPUs being STOP'd.
You're raising a very interesting question. The initial point of stop_machine()
is to synchronize this:
set_cpu_online(cpu, 0)
migrate timers;
migrate hrtimers;
flush IPIs;
etc...
against this pattern:
preempt_disable()
if (cpu_online(cpu))
queue something; // could be timer, IPI, etc...
preempt_enable()
There have been attempts:
https://lore.kernel.org/all/20241218171531.2217275-1-costa.shul@redhat.com/
And really it should be fine to just do:
set_cpu_online(cpu, 0)
synchronize_rcu()
migrate / flush stuff
Probably we should try that instead of the busy loop I proposed
which only papers over the problem.
Of course there are other assumptions. For example the tick
timekeeper is migrated easily knowing that all online CPUs are
not idle (cf: tick_cpu_dying()). So I expect a few traps, with RCU
for example and indeed all these hotplug callbacks must be audited
one by one.
I'm not entirely unfamiliar with many of them. Let me see what I can do...
Thanks.
--
Frederic Weisbecker
SUSE Labs
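
To spell out why synchronize_rcu() suffices in the scheme described above: the
"queue something if the CPU is online" pattern runs with preemption disabled,
which is an RCU read-side critical section, so waiting for a grace period after
clearing the online bit waits out every such section. A minimal sketch, with
hypothetical helper names (queue/migrate callbacks are placeholders, not real
kernel APIs):

#include <linux/cpumask.h>
#include <linux/preempt.h>
#include <linux/rcupdate.h>

/* Reader side: the existing "queue something if the CPU is online" pattern. */
static bool queue_work_on_cpu_sketch(int cpu, void (*queue)(int))
{
	bool queued = false;

	preempt_disable(); /* also an RCU read-side critical section */
	if (cpu_online(cpu)) {
		queue(cpu);
		queued = true;
	}
	preempt_enable();

	return queued;
}

/* Offline side: clear the bit, then wait out all concurrent readers. */
static void cpu_go_offline_sketch(int cpu, void (*migrate)(int))
{
	set_cpu_online(cpu, false);
	synchronize_rcu(); /* all preempt-disabled readers have finished */
	migrate(cpu);      /* now safe to migrate/flush whatever they queued */
}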
On 28/10/25 17:25, Frederic Weisbecker wrote:
> +Cc Phil Auld
>
> On Fri, Oct 10, 2025 at 05:38:10PM +0200, Valentin Schneider wrote:
>> Patches
>> =======
>>
>> o Patches 1-2 are standalone objtool cleanups.
>
> Would be nice to get these merged.
>
>> o Patches 3-4 add an RCU testing feature.
>
> I'm taking this one.
>

Thanks!

>>
>> o Patches 5-6 add infrastructure for annotating static keys and static calls
>> that may be used in noinstr code (courtesy of Josh).
>> o Patches 7-20 use said annotations on relevant keys / calls.
>> o Patch 21 enforces proper usage of said annotations (courtesy of Josh).
>>
>> o Patch 22 deals with detecting NOINSTR text in modules
>
> Not sure how to route those. If we wait for each individual subsystem,
> this may take a while.
>

At least the __ro_after_init ones could go as their own thing since they're
standalone, but yeah they're the ones touching all sorts of subsystems :/

>> o Patches 23-24 deal with kernel text sync IPIs
>
> I would be fine taking those (the concerns I had are just details)
> but they depend on all the annotations. Alternatively I can take the whole
> thing but then we'll need some acks.
>
>> o Patches 25-29 deal with kernel range TLB flush IPIs
>
> I'll leave these more time for now ;o)
> And if they ever go somewhere, it should be through x86 tree.
>
> Also, here is another candidate usecase for this deferral thing.
> I remember Phil Auld complaining that stop_machine() on CPU offlining was
> a big problem for nohz_full. Especially while we are working on
> a cpuset interface to toggle nohz_full but this will require the CPUs
> to be offline.
>

Yeah that does ring a bell...

> So my point is that when a CPU goes offline, stop_machine() puts all
> CPUs into a loop with IRQs disabled. CPUs in userspace could possibly
> escape that since they don't touch the kernel anyway. But as soon as
> they enter the kernel, they should either acquire the final state of
> stop_machine if completed or join the global loop if in the middle.
>

I need to have a think about that one; one pain point I see is the context
tracking work has to be NMI safe since e.g. an NMI can take us out of
userspace. Another is that NOHZ-full CPUs need to be special cased in the
stop machine queueing / completion.

/me goes fetch a new notebook

> Thanks.
>
> --
> Frederic Weisbecker
> SUSE Labs