Changes in v8:
- Use MIN(SMP_CACHE_BYTES, 64) for FLUSH_TLB_INFO_ALIGN in
[PATCH v8 11/14].
- In flush_tlb_mm_range(), name the stack object "info" directly instead
of keeping a temporary "_info" plus pointer in [PATCH v8 12/14].
Changes in v7:
- Split the x86/mm flush_tlb_info stack-storage change into
[PATCH v7 10/14], [PATCH v7 11/14], and [PATCH v7 12/14].
- Rewrite the flush_tlb_info changelog with the historical context, test
data, Link tag, and Acked-by tag.
- Add a preparatory KVM patch for kvm_flush_tlb_multi() in
[PATCH v7 13/14].
- Refactor flush_tlb_mm_range() and arch_tlbbatch_flush() to avoid gotos
in [PATCH v7 14/14].
- Drop the flush_tlb_kernel_range() preemption optimization for now.
- Tighten the smp_task_ipi_mask_alloc()/smp_task_ipi_mask_free()
declarations in [PATCH v7 4/14].
- Update the smp_call_function_many_cond() comment wording.
- Add Paul E. McKenney's Tested-by tags and Sebastian's Reviewed-by tag.
Changes in v6:
- Make the task-local cpumask selection explicit and drop the preemptible()
check in smp_call_function_many_cond(). The early put_cpu() decision now
depends only on whether a task-local cpumask is available.
- Keep smp_task_ipi_mask() private to kernel/smp.c in [PATCH v6 4/12].
- Add #include <linux/slab.h> to kernel/smp.c in [PATCH v6 4/12] for
kmalloc()/kfree(), fixing the kernel test robot build failure reported
at:
https://lore.kernel.org/oe-kbuild-all/202605241101.w6T2LApw-lkp@intel.com/
- Update the csd_lock_wait() comment in [PATCH v6 6/12].
- Add Sebastian's Reviewed-by tags to the reviewed patches.
Changes in v5:
- Replace "smp: Remove get_cpu from smp_call_function_any" with a new
approach that extracts a common __smp_call_function_single() to safely
keep the remote CPU selection and IPI dispatch process within a single
preemption-disabled region in [PATCH v5 3/12].
- Fix a typo in comments (s/cpumask_stack/task_mask/) and remove the
obsolete "Preemption must be disabled" constraint from the kernel-doc
in [PATCH v5 6/12].
- Adjust the WARN_ON_ONCE() validation condition to avoid a false positive
warning caused by CPU hotplug races when use_cpus_read_lock is false in
[PATCH v5 9/12].
- Move the preemptible() check in smp_call_function_many_cond() from
[PATCH v5 4/12] to [PATCH v5 6/12].
- Rebase to commit 4ac4d6549a65 ("sched: Use trace_call__<tp>() to save a
static branch").
Changes in v4:
- Use a task-local IPI cpumask rather than an on-stack cpumask in
[PATCH v4 4/12] (suggested by Sebastian).
- Skip freeing csd memory in smpcfd_dead_cpu() to guarantee safe csd
memory access, instead of using an RCU mechanism, in [PATCH v4 5/12]
(suggested by Sebastian).
- Align flush_tlb_info with SMP_CACHE_BYTES to avoid performance
degradation caused by unnecessary cache line movements in
[PATCH v4 10/12] (suggested by Sebastian and Nadav).
- Collect Acked-bys and Reviewed-bys.
Changes in v3:
- Add benchmarks to measure the performance impact of changing
flush_tlb_info to a stack variable in [PATCH v3 10/12] (suggested by
Peter).
- Adjust the rcu_read_unlock() location in [PATCH v3 5/12] (suggested
by Muchun).
- Use raw_smp_processor_id() to avoid the warning [1] from
check_preemption_disabled() in [PATCH v3 12/12].
- Collect Acked-bys and Reviewed-by.
[1]: https://lore.kernel.org/lkml/20260302075216.2170675-1-zhouchuyi@bytedance.com/T/#mc39999cbeb3f50be176f0903d0fa4075688b073d
Changes in v2:
- Simplify the code comments in [PATCH v2 2/12] (pointed out by Peter and
Muchun).
- Adjust the preemption disabling logic in smp_call_function_any() in
[PATCH v2 3/12] (suggested by Peter).
- Use an on-stack cpumask only when !CONFIG_CPUMASK_OFFSTACK in
[PATCH v2 4/12] (pointed out by Peter).
- Add [PATCH v2 5/12] to replace migrate_disable() with an RCU mechanism.
- Adjust the preemption disabling logic to allow flush_tlb_multi() to be
preemptible and migratable in [PATCH v2 11/12].
- Collect Acked-bys and Reviewed-bys.
Introduction
============
The vast majority of smp_call_function*() callers block until remote CPUs
complete the IPI function execution. As smp_call_function*() runs with
preemption disabled throughout, scheduling latency increases dramatically
with the number of remote CPUs and other factors (such as interrupts being
disabled).
On x86-64, TLB flushes are performed via IPIs. During process exit or when
process-mapped pages are reclaimed, the flushing CPU must therefore wait
for many IPIs to complete, which increases scheduling latency for other
threads on that CPU. In our production environment, we observed IPI
wait-induced scheduling latency of up to 16ms on a 16-core machine. Our
goal is to allow preemption while waiting for IPI completion, in order to
improve real-time performance.
Background
==========
In our production environments, latency-sensitive workloads (DPDK) are
configured with the highest priority so that they can preempt
lower-priority tasks at any time. We found that DPDK's wake-up latency is
primarily caused by the current CPU running with preemption disabled.
Therefore, we recorded the longest preemption-disabled period in each
30-second interval and then calculated the P50/P99 of these per-interval
maxima:
CPU      p50(ns)     p99(ns)
cpu0      254956     5465050
cpu1      115801      120782
cpu2       43324       72957
cpu3      256637    16723307
cpu4       58979       87237
cpu5       47464       79815
cpu6       48881       81371
cpu7       52263       82294
cpu8      263555     4657713
cpu9       44935       73962
cpu10      37659       65026
cpu11     257008     2706878
cpu12      49669       90006
cpu13      45186       74666
cpu14      60705       83866
cpu15      51311       86885
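The two-step aggregation described above (per-interval maximum, then
P50/P99 over those maxima) could be reproduced offline with a small
userspace helper such as the sketch below. This is not the tooling that
produced the table: the function names are hypothetical, and the
nearest-rank percentile definition is an assumption, since the method used
for the table is not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort() comparator for uint64_t values, ascending order. */
static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Hypothetical helper: fold one preemption-disabled duration into the
 * running maximum of the current 30-second interval.
 */
void window_record(uint64_t *window_max_ns, uint64_t duration_ns)
{
	if (duration_ns > *window_max_ns)
		*window_max_ns = duration_ns;
}

/*
 * Hypothetical helper: nearest-rank percentile over the per-interval
 * maxima. Sorts the array in place and returns the sample at 1-based
 * rank ceil(pct * n / 100). The nearest-rank method is an assumption.
 */
uint64_t percentile_ns(uint64_t *max_ns, size_t n, unsigned int pct)
{
	size_t rank;

	assert(n > 0 && pct >= 1 && pct <= 100);
	qsort(max_ns, n, sizeof(*max_ns), cmp_u64);
	rank = (pct * n + 99) / 100;	/* integer ceil(pct * n / 100) */
	return max_ns[rank - 1];
}
```

A caller would reset the running maximum to zero at the start of each
interval, append it to an array when the interval ends, and finally call
percentile_ns() with pct set to 50 and 99 for each CPU.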
We also collected the distribution of preemption-disabled periods
exceeding 1ms on each CPU over several hours (CPUs whose counts were all
zero are omitted):
CPU       1~10ms   10~50ms  50~100ms
cpu0          29         5         0
cpu3          38        13         0
cpu8          34         6         0
cpu11         24        10         0
Preemption-disabled periods of several milliseconds, or even more than
10ms, mostly originate from TLB flushes:
@stack[
trace_preempt_on+143
trace_preempt_on+143
preempt_count_sub+67
arch_tlbbatch_flush/flush_tlb_mm_range
task_exit/page_reclaim/...
]
Further analysis confirms that the majority of the time is consumed in
csd_lock_wait().
Currently, smp_call_function*() always disables preemption, mainly to
protect its internal per-CPU data structures and to synchronize with CPU
offline operations. This patchset attempts to make csd_lock_wait()
preemptible, thereby shrinking the preemption-disabled critical section
and improving kernel real-time performance.
Effect
======
After applying this patchset, we no longer observe preemption being
disabled for more than 1ms on the arch_tlbbatch_flush/flush_tlb_mm_range
path. The overall P99 of the longest preemption-disabled period per
30-second interval is reduced to around 1.5ms. The remaining latency is
primarily due to lock contention.
metric   before patch   after patch   reduced by
------   ------------   -----------   ----------
p99(ns)      16723307       1556034      ~90.70%
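The "reduced by" figure is the relative change between the two P99 values.
As a quick cross-check, a minimal sketch (reduction_pct() is a
hypothetical helper name, not part of the series):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: relative reduction from before_ns to after_ns,
 * expressed in percent.
 */
double reduction_pct(uint64_t before_ns, uint64_t after_ns)
{
	assert(before_ns > 0 && after_ns <= before_ns);
	return 100.0 * (double)(before_ns - after_ns) / (double)before_ns;
}
```

Calling reduction_pct(16723307, 1556034) yields roughly 90.70, which
matches the table.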
Chuyi Zhou (14):
smp: Disable preemption explicitly in __csd_lock_wait()
smp: Enable preemption early in smp_call_function_single()
smp: Refactor remote CPU selection in smp_call_function_any()
smp: Use task-local IPI cpumask in smp_call_function_many_cond()
smp: Alloc percpu csd data in smpcfd_prepare_cpu() only once
smp: Enable preemption early in smp_call_function_many_cond()
smp: Remove preempt_disable() from smp_call_function()
smp: Remove preempt_disable() from on_each_cpu_cond_mask()
scftorture: Remove preempt_disable() in scftorture_invoke_one()
x86/mm: Factor out flush_tlb_info initialization
x86/mm: Cap flush_tlb_info alignment at 64 bytes
x86/mm: Move flush_tlb_info back to the stack
x86/kvm: Disable preemption in kvm_flush_tlb_multi()
x86/mm: Re-enable preemption before flush_tlb_multi()
arch/x86/include/asm/tlbflush.h | 5 +-
arch/x86/kernel/kvm.c | 4 +-
arch/x86/mm/tlb.c | 95 +++++++------------
include/linux/sched.h | 6 ++
include/linux/smp.h | 11 +++
kernel/fork.c | 9 +-
kernel/scftorture.c | 13 +--
kernel/smp.c | 161 ++++++++++++++++++++++++--------
8 files changed, 193 insertions(+), 111 deletions(-)
--
2.20.1