Hello everyone,

This is the fourth version with minor changes from the last, and some
more benchmarking data down below. Any and all feedback is highly
appreciated.

This series is based on:
git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
at commit 1a6151017ee5 ("sched: psi: pass enqueue/dequeue flags to psi
callbacks directly")

Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()"), an idle CPU in TIF_POLLING_NRFLAG
mode can be pulled out of idle by setting TIF_NEED_RESCHED instead of
sending an actual IPI. This affects at least three scenarios, described
below:

1. A need_resched() check within a call function does not necessarily
   indicate a task wakeup, since a CPU intending to send an IPI to an
   idle target in TIF_POLLING_NRFLAG mode can simply queue the
   SMP-call-function and set the TIF_NEED_RESCHED flag to pull the
   polling target out of idle. The SMP-call-function will be executed
   by flush_smp_call_function_queue() on the idle-exit path. On x86,
   where mwait_idle_with_hints() sets TIF_POLLING_NRFLAG for long
   idling, this leads to the idle load balancer bailing out early,
   since the need_resched() check in nohz_csd_func() returns true in
   most instances.

2. A TIF_POLLING_NRFLAG idling CPU woken up to process an IPI will end
   up calling schedule() even in cases where the call function does
   not wake up a new task on the idle CPU, thus delaying the idle
   re-entry.

3. Julia Lawall reported a case where a softirq raised from an
   SMP-call-function on an idle CPU will wake up ksoftirqd, since
   flush_smp_call_function_queue() executes in the idle thread's
   context. This can throw off the idle load balancer by making the
   idle CPU appear busy, since ksoftirqd just woke up on the said
   CPU [1].

The solution to (2.) was sent independently in [2] since it was not
dependent on the changes enclosed in this series, which reworks some
PREEMPT_RT-specific bits.

(1.) was solved by dropping the need_resched() check in
nohz_csd_func() (please refer to Patch 2/3 for the full version of the
explanation), which led to a splat on PREEMPT_RT kernels [3]. Since
flush_smp_call_function_queue() and the following
do_softirq_post_smp_call_flush() run with interrupts disabled, it is
not ideal for the IRQ handlers to raise a softirq, prolonging the
preempt-disabled section, especially on PREEMPT_RT kernels. Patch 1
works around this by allowing processing of SCHED_SOFTIRQ without
raising a warning, since the idle load balancer has checkpoints to
detect wakeup of tasks on the idle CPU, in which case it immediately
bails out of the load balancing process and proceeds to call
schedule().

With the above solution, the problem discussed in (3.) becomes even
more prominent, with idle load balancing waking up ksoftirqd
unnecessarily. Previous versions attempted to solve this by
introducing a per-CPU variable to keep track of an impending call to
do_softirq(). Sebastian suggested using __raise_softirq_irqoff() for
SCHED_SOFTIRQ, since it avoids the wakeup of ksoftirqd, and since
SCHED_SOFTIRQ is raised from the IPI handler, it is guaranteed to be
serviced on IRQ exit or via do_softirq_post_smp_call_flush().

Chenyu had reported a regression when running a modified version of
ipistorm that performs a fixed set of IPIs between two CPUs on his
setup with the whole v1 applied. Since this version gets rid of the
per-CPU variable updates at the flush_smp_call_function_queue()
boundary, the results from testing are identical to the base kernel.
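For reference, below is an abridged sketch of the two helpers from
mainline kernel/softirq.c (comments paraphrased by me) showing why
__raise_softirq_irqoff() avoids the ksoftirqd wakeup that
raise_softirq_irqoff() can trigger:

  /* Only marks the softirq as pending; never wakes ksoftirqd. */
  void __raise_softirq_irqoff(unsigned int nr)
  {
          lockdep_assert_irqs_disabled();
          trace_softirq_raise(nr);
          or_softirq_pending(1UL << nr);
  }

  inline void raise_softirq_irqoff(unsigned int nr)
  {
          __raise_softirq_irqoff(nr);

          /*
           * Outside of (soft)irq context there is no guaranteed
           * point where the pending softirq will be processed, so
           * ksoftirqd is woken up to service it soon.
           */
          if (!in_interrupt() && should_wakeup_ksoftirqd())
                  wakeup_softirqd();
  }

Since nohz_csd_func() runs either in hard IRQ context or from
flush_smp_call_function_queue() on the idle-exit path, the first
variant is sufficient: the pending SCHED_SOFTIRQ is processed on IRQ
exit or by do_softirq_post_smp_call_flush() without involving
ksoftirqd.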
Julia Lawall reported a reduction in the number of load balancing
attempts at the NUMA level on a v6.11-based tip:sched/core. The issue
was root-caused to commit 3dcac251b066 ("sched/core: Introduce SM_IDLE
and an idle re-entry fast-path in __schedule()"), which would skip the
newidle_balance() if schedule_idle() was called without any task
wakeups on the idle CPUs, to favor a faster idle re-entry. To rule out
any surprises from this series in particular, I tested the bt.B.x
benchmark on which she originally observed this behavior. Following
are the numbers from a dual socket Intel Ice Lake Xeon server
(2 x 32C/32T, SMT off):

  ==================================================================
  Test          : bt.B.x (OMP variant)
  Units         : % improvement over base kernel in Mop/s throughput
  Interpretation: Higher is better
  ======================= Intel Ice Lake Xeon ======================
  kernel:                                                  [pct imp]
  performance gov, boost off, idle=poll, smt off               0.89%
  ==================================================================

I did not see any discernible difference with this one over the base
kernel.

[1] https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/
[2] https://lore.kernel.org/lkml/20240809092240.6921-1-kprateek.nayak@amd.com/
[3] https://lore.kernel.org/lkml/225e6d74-ed43-51dd-d1aa-c75c86dd58eb@amd.com/
[4] https://lore.kernel.org/lkml/20240710150557.GB27299@noisy.programming.kicks-ass.net/
---
v3..v4:

o Use __raise_softirq_irqoff() for SCHED_SOFTIRQ. (Sebastian)
o Updated benchmark numbers and reworded the cover letter.

v3: https://lore.kernel.org/lkml/20241014090339.2478-1-kprateek.nayak@amd.com/

v2..v3:

o Removed ifdefs around local_lock_t. (Peter)
o Reworded Patch 1 to add more details on raising SCHED_SOFTIRQ from
  flush_smp_call_function_queue() and why it should be okay on
  PREEMPT_RT.
o Updated the trace data in Patch 5.
o More benchmarking.

v2: https://lore.kernel.org/lkml/20240904111223.1035-1-kprateek.nayak@amd.com/

v1..v2:

o Broke the PREEMPT_RT unification and idle load balance fixes into a
  separate series (this one) and posted the SM_IDLE fast-path
  enhancements separately.
o Worked around the splat on PREEMPT_RT kernels caused by raising
  SCHED_SOFTIRQ from nohz_csd_func() in the context of
  flush_smp_call_function_queue(), which is undesirable on PREEMPT_RT
  kernels. (Please refer to commit 1a90bfd22020 ("smp: Make softirq
  handling RT safe in flush_smp_call_function_queue()"))
o Reuse softirq_ctrl::cnt from PREEMPT_RT to prevent unnecessary
  wakeups of ksoftirqd. (Peter) This unifies should_wakeup_ksoftirqd()
  and adds an interface to indicate an impending call to do_softirq()
  (set_do_softirq_pending()) and to clear it just before fulfilling
  the promise (clr_do_softirq_pending()).
o More benchmarking.

v1: https://lore.kernel.org/lkml/20240710090210.41856-1-kprateek.nayak@amd.com/
--
K Prateek Nayak (3):
  softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT
    kernel
  sched/core: Remove the unnecessary need_resched() check in
    nohz_csd_func()
  sched/core: Prevent wakeup of ksoftirqd during idle load balance

 kernel/sched/core.c | 15 +++++++++++++--
 kernel/softirq.c    | 15 +++++++++++----
 2 files changed, 24 insertions(+), 6 deletions(-)

--
2.34.1