Hi,
This series adds a new scheduling model, PREEMPT_AUTO, which, like
PREEMPT_DYNAMIC, allows dynamic switching between the none/voluntary/full
preemption models. Unlike PREEMPT_DYNAMIC, it does not depend
on explicit preemption points for the voluntary models.
The series is based on Thomas' original proposal which he outlined
in [1], [2] and in his PoC [3].
v2 mostly reworks v1, one of the main changes being less
noisy need-resched-lazy interfaces.
More details in the changelog below.
The v1 of the series is at [4] and the RFC at [5].
Design
==
PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
PREEMPT_COUNT). This means that the scheduler can always safely
preempt. (This is identical to CONFIG_PREEMPT.)
Having that, the next step is to make the rescheduling policy dependent
on the chosen scheduling model. Currently, the scheduler uses a single
need-resched bit (TIF_NEED_RESCHED) to signal that a reschedule is needed.
PREEMPT_AUTO extends this by adding an additional need-resched bit
(TIF_NEED_RESCHED_LAZY) which, together with TIF_NEED_RESCHED, allows the
scheduler to express two kinds of rescheduling intent: schedule at
the earliest opportunity (TIF_NEED_RESCHED), or express a need for
rescheduling while allowing the task on the runqueue to run to
timeslice completion (TIF_NEED_RESCHED_LAZY).
The scheduler decides which need-resched bit to set based on
the preemption model in use:
                  TIF_NEED_RESCHED        TIF_NEED_RESCHED_LAZY

 none             never                   always [*]
 voluntary        higher sched class      other tasks [*]
 full             always                  never

 [*] some details elided.
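To make the policy table concrete, here is a minimal user-space sketch of the
bit-selection logic. The pick_resched_bit() helper and the PM_ / RESCHED_ names
are hypothetical, chosen only to mirror the table; the series itself wires the
equivalent policy into resched_curr() and the preempt=none/voluntary/full patches.

/*
 * Minimal user-space sketch of the policy table above -- not the series'
 * code. pick_resched_bit() and the PM_ / RESCHED_ names are hypothetical.
 */
#include <stdbool.h>
#include <stdio.h>

typedef enum { RESCHED_LAZY, RESCHED_NOW } resched_t;
enum preempt_model { PM_NONE, PM_VOLUNTARY, PM_FULL };

static resched_t pick_resched_bit(enum preempt_model model,
                                  bool higher_sched_class)
{
        switch (model) {
        case PM_NONE:
                return RESCHED_LAZY;    /* never eager; some details elided */
        case PM_VOLUNTARY:
                /* eager only for a higher scheduling class (RT, deadline) */
                return higher_sched_class ? RESCHED_NOW : RESCHED_LAZY;
        case PM_FULL:
        default:
                return RESCHED_NOW;     /* always eager */
        }
}

int main(void)
{
        printf("voluntary, RT task waiting -> %s\n",
               pick_resched_bit(PM_VOLUNTARY, true) == RESCHED_NOW ?
               "TIF_NEED_RESCHED" : "TIF_NEED_RESCHED_LAZY");
        return 0;
}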
The last part of the puzzle is when preemption happens, or, alternatively
stated, when the need-resched bits are checked:
                        exit-to-user    ret-to-kernel    preempt_count()

 NEED_RESCHED_LAZY           Y                N                 N
 NEED_RESCHED                Y                Y                 Y
Using NEED_RESCHED_LAZY allows for run-to-completion semantics when the
none/voluntary preemption policies are in effect, and eager semantics
under full preemption.
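A small, hypothetical sketch of the check-site table above; the real checks
live in the entry/irqentry code and the preempt_count() machinery, this only
models the idea:

/*
 * Illustrative sketch of the check sites in the table above (hypothetical
 * helper, not the entry-code implementation).
 */
#include <stdbool.h>

enum resched_site { EXIT_TO_USER, RET_TO_KERNEL, PREEMPT_COUNT_ZERO };

static bool should_reschedule(enum resched_site site,
                              bool need_resched, bool need_resched_lazy)
{
        if (need_resched)
                return true;                    /* honored at all three sites */
        if (need_resched_lazy)
                return site == EXIT_TO_USER;    /* lazy: only at user exit */
        return false;
}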
In addition, since this is driven purely by the scheduler (not
depending on cond_resched() placement and the like), there is enough
flexibility in the scheduler to cope with edge cases -- e.g. a kernel
task not relinquishing the CPU under NEED_RESCHED_LAZY can be handled by
simply upgrading to a full NEED_RESCHED, which can use more coercive
instruments like the resched IPI to induce a context switch.
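As a rough sketch of that upgrade path (the series handles this in the
tick-expiry patches; all names below are illustrative, not the kernel code):

/*
 * Hypothetical sketch of upgrading a stale lazy request to an eager one.
 * The series handles this in the scheduler tick path ("sched/fair: handle
 * tick expiry under lazy preemption"); this only models the idea.
 */
#include <stdbool.h>

struct task_state {
        bool need_resched;              /* TIF_NEED_RESCHED */
        bool need_resched_lazy;         /* TIF_NEED_RESCHED_LAZY */
        bool ran_full_tick_with_lazy;   /* didn't schedule despite lazy bit */
};

static void maybe_upgrade_lazy(struct task_state *t,
                               void (*send_resched_ipi)(void))
{
        if (t->need_resched_lazy && t->ran_full_tick_with_lazy) {
                t->need_resched = true;         /* coerce a context switch */
                t->need_resched_lazy = false;
                if (send_resched_ipi)
                        send_resched_ipi();     /* e.g. for a remote CPU */
        }
}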
Performance
==
The performance in the basic tests (perf bench sched messaging, kernbench,
cyclictest) matches or improves on what we see under PREEMPT_DYNAMIC.
(See patches
"sched: support preempt=none under PREEMPT_AUTO"
"sched: support preempt=full under PREEMPT_AUTO"
"sched: handle preempt=voluntary under PREEMPT_AUTO")
For a macro test, a colleague in Oracle's Exadata team tried two
OLTP benchmarks (on a 5.4.17-based Oracle kernel, with the v1 series
backported).
In both tests the data was cached on remote nodes (cells), and the
database nodes (compute) served client queries, with clients being
local in the first test and remote in the second.
Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
                                     PREEMPT_VOLUNTARY                    PREEMPT_AUTO
                                                                       (preempt=voluntary)
                           ==============================     =============================
            clients   throughput   cpu-usage                throughput   cpu-usage          Gain
                      (tx/min)     (utime %/stime %)         (tx/min)    (utime %/stime %)
            -------   ----------   -----------------        ----------   -----------------  -------

 OLTP            384    9,315,653   25/ 6                    9,253,252   25/ 6              -0.7%
 benchmark      1536   13,177,565   50/10                   13,657,306   50/10              +3.6%
 (local clients) 3456  14,063,017   63/12                   14,179,706   64/12              +0.8%

 OLTP             96    8,973,985   17/ 2                    8,924,926   17/ 2              -0.5%
 benchmark       384   22,577,254   60/ 8                   22,211,419   59/ 8              -1.6%
 (remote clients, 2304 25,882,857   82/11                   25,536,100   82/11              -1.3%
  90/10 RW ratio)
(Both sets of tests have a fair amount of network traffic since the query
tables etc. are cached on the cells. Additionally, the first set,
given the local clients, stresses the scheduler a bit more than the
second.)
The comparative performance for both tests is fairly close,
more or less within the margin of error.
Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):
"
a) Base kernel (6.7),
b) v1, PREEMPT_AUTO, preempt=voluntary
c) v1, PREEMPT_DYNAMIC, preempt=voluntary
d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
Workloads I tested and their %gain,
                 case b       case c       case d
 NAS             +2.7%        +1.9%        +2.1%
 Hashjoin        +0.0%        +0.0%        +0.0%
 Graph500        -6.0%        +0.0%        +0.0%
 XSBench         +1.7%        +0.0%        +1.2%
(Note about the Graph500 numbers at [8].)
Did kernbench etc test from Mel's mmtests suite also. Did not notice
much difference.
"
One case where there is a significant performance drop is on powerpc,
seen running hackbench on a 320-core system (a test on a smaller system is
fine). In theory there's no reason for this to only happen on powerpc
since most of the code is common, but I haven't been able to reproduce
it on x86 so far.
All in all, I think the tests above show that this scheduling model has legs.
However, the none/voluntary models under PREEMPT_AUTO are conceptually
different enough from the current none/voluntary models that there
likely are workloads where performance would be subpar. That needs more
extensive testing to figure out the weak points.
Series layout
==
Patches 1,2
"sched/core: Move preempt_model_*() helpers from sched.h to preempt.h"
"sched/core: Drop spinlocks on contention iff kernel is preemptible"
condition spin_needbreak() on the dynamic preempt_model_*().
Not really required but a useful bugfix for PREEMPT_DYNAMIC and PREEMPT_AUTO.
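For context, the idea behind the spin_needbreak() change can be sketched
roughly as below. This is a hypothetical standalone illustration, not the
kernel implementation; in the actual patches the check is the runtime
preempt_model_preemptible() query.

/*
 * Rough sketch of the spin_needbreak() fix: only ask a lock holder to drop
 * a contended lock when the kernel is preemptible at runtime -- otherwise
 * the waiter cannot preempt anyway and the holder just does wasted work.
 * Hypothetical standalone code, not the kernel implementation.
 */
#include <stdbool.h>

static bool kernel_is_preemptible;  /* stands in for preempt_model_preemptible() */

static bool lock_is_contended(void *lock)
{
        (void)lock;
        return false;               /* placeholder for a real contention check */
}

static bool spin_needbreak_sketch(void *lock)
{
        if (!kernel_is_preemptible)
                return false;
        return lock_is_contended(lock);
}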
Patch 3
"sched: make test_*_tsk_thread_flag() return bool"
is a minor cleanup.
Patch 4,
"preempt: introduce CONFIG_PREEMPT_AUTO"
introduces the new scheduling model.
Patches 5-7,
"thread_info: selector for TIF_NEED_RESCHED[_LAZY]"
"thread_info: define __tif_need_resched(resched_t)"
"sched: define *_tsk_need_resched_lazy() helpers"
introduce new thread_info/task helper interfaces or make changes to
pre-existing ones that will be used in the rest of the series.
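As a hedged sketch of what the selector interface might look like -- only
__tif_need_resched(resched_t) appears in the patch titles; the enum values
and bit numbers below are assumptions for illustration:

/*
 * Hedged sketch of a resched_t selector. The TIF_ bit numbers and the
 * NR_now/NR_lazy names are assumptions; only the idea of selecting between
 * TIF_NEED_RESCHED and TIF_NEED_RESCHED_LAZY comes from the series.
 */
#include <stdbool.h>

#define TIF_NEED_RESCHED        3       /* illustrative bit numbers */
#define TIF_NEED_RESCHED_LAZY   4

typedef enum { NR_now, NR_lazy } resched_t;

static inline int tif_resched_bit(resched_t rs)
{
        return rs == NR_now ? TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY;
}

/* Takes the flags word explicitly so the sketch stays self-contained. */
static inline bool __tif_need_resched(unsigned long ti_flags, resched_t rs)
{
        return ti_flags & (1UL << tif_resched_bit(rs));
}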
Patches 8-11,
"entry: handle lazy rescheduling at user-exit"
"entry/kvm: handle lazy rescheduling at guest-entry"
"entry: irqentry_exit only preempts for TIF_NEED_RESCHED"
"sched: __schedule_loop() doesn't need to check for need_resched_lazy()"
make changes/document the rescheduling points.
Patches 12-13,
"sched: separate PREEMPT_DYNAMIC config logic"
"sched: allow runtime config for PREEMPT_AUTO"
reuse the PREEMPT_DYNAMIC runtime configuration logic.
Patches 14-18,
"rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO"
"rcu: fix header guard for rcu_all_qs()"
"preempt,rcu: warn on PREEMPT_RCU=n, preempt=full"
"rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y"
"rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y"
add changes needed for RCU.
Patches 19-20,
"x86/thread_info: define TIF_NEED_RESCHED_LAZY"
"powerpc: add support for PREEMPT_AUTO"
add x86 and powerpc support.
Patches 21-24,
"sched: prepare for lazy rescheduling in resched_curr()"
"sched: default preemption policy for PREEMPT_AUTO"
"sched: handle idle preemption for PREEMPT_AUTO"
"sched: schedule eagerly in resched_cpu()"
are preparatory patches for adding PREEMPT_AUTO. Among other things
they add the default need-resched policy for !PREEMPT_AUTO,
PREEMPT_AUTO, and the idle task.
Patches 25-26,
"sched/fair: refactor update_curr(), entity_tick()",
"sched/fair: handle tick expiry under lazy preemption"
handle the 'hog' problem, where a kernel task does not voluntarily
schedule out.
Patches 27-29,
"sched: support preempt=none under PREEMPT_AUTO"
"sched: support preempt=full under PREEMPT_AUTO"
"sched: handle preempt=voluntary under PREEMPT_AUTO"
add support for the three preemption models.
Patches 30-33,
"sched: latency warn for TIF_NEED_RESCHED_LAZY",
"tracing: support lazy resched",
"Documentation: tracing: add TIF_NEED_RESCHED_LAZY",
"osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y"
handle remaining bits and pieces to do with TIF_NEED_RESCHED_LAZY.
And finally, patches 34-35,
"kconfig: decompose ARCH_NO_PREEMPT"
"arch: decompose ARCH_NO_PREEMPT"
decompose ARCH_NO_PREEMPT which might make it easier to support
CONFIG_PREEMPTION on some architectures.
Changelog
==
v2: rebased to v6.9, addresses review comments, folds in some other patches.
- the lazy interfaces are less noisy now: the current interfaces stay
unchanged so non-scheduler code doesn't need to change.
This also means that lazy preemption becomes a scheduler-internal detail,
which works well with the core idea of lazy scheduling.
(Mark Rutland, Thomas Gleixner)
- preempt=none model now respects the leftmost deadline (Juri Lelli)
- Add need-resched flag combination state in tracing headers (Steven Rostedt)
- Decompose ARCH_NO_PREEMPT
- Changes for RCU (and TASKS_RCU) will go in separately [6]
- spin_needbreak() should be conditioned on preempt_model_*() at
runtime (patches from Sean Christopherson [7])
- powerpc support from Shrikanth Hegde
RFC:
- Addresses review comments and is generally a more focused
version of the RFC.
- Lots of code reorganization.
- Bugfixes all over.
- need_resched() now only checks for TIF_NEED_RESCHED instead
of TIF_NEED_RESCHED|TIF_NEED_RESCHED_LAZY.
- set_nr_if_polling() now does not check for TIF_NEED_RESCHED_LAZY.
- Tighten idle related checks.
- RCU changes to force context-switches when a quiescent state is
urgently needed.
- Does not break live-patching anymore
Also at: github.com/terminus/linux preempt-v2
Please review.
Thanks
Ankur
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
[1] https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
[2] https://lore.kernel.org/lkml/87led2wdj0.ffs@tglx/
[3] https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
[4] https://lore.kernel.org/lkml/20240213055554.1802415-1-ankur.a.arora@oracle.com/
[5] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/
[6] https://lore.kernel.org/lkml/20240507093530.3043-1-urezki@gmail.com/
[7] https://lore.kernel.org/lkml/20240312193911.1796717-1-seanjc@google.com/
[8] https://lore.kernel.org/lkml/af122806-8325-4302-991f-9c0dc1857bfe@amd.com/
[9] https://lore.kernel.org/lkml/17cc54c4-2e75-4964-9155-84db081ce209@linux.ibm.com/
Ankur Arora (32):
sched: make test_*_tsk_thread_flag() return bool
preempt: introduce CONFIG_PREEMPT_AUTO
thread_info: selector for TIF_NEED_RESCHED[_LAZY]
thread_info: define __tif_need_resched(resched_t)
sched: define *_tsk_need_resched_lazy() helpers
entry: handle lazy rescheduling at user-exit
entry/kvm: handle lazy rescheduling at guest-entry
entry: irqentry_exit only preempts for TIF_NEED_RESCHED
sched: __schedule_loop() doesn't need to check for need_resched_lazy()
sched: separate PREEMPT_DYNAMIC config logic
sched: allow runtime config for PREEMPT_AUTO
rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO
rcu: fix header guard for rcu_all_qs()
preempt,rcu: warn on PREEMPT_RCU=n, preempt=full
rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y
rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y
x86/thread_info: define TIF_NEED_RESCHED_LAZY
sched: prepare for lazy rescheduling in resched_curr()
sched: default preemption policy for PREEMPT_AUTO
sched: handle idle preemption for PREEMPT_AUTO
sched: schedule eagerly in resched_cpu()
sched/fair: refactor update_curr(), entity_tick()
sched/fair: handle tick expiry under lazy preemption
sched: support preempt=none under PREEMPT_AUTO
sched: support preempt=full under PREEMPT_AUTO
sched: handle preempt=voluntary under PREEMPT_AUTO
sched: latency warn for TIF_NEED_RESCHED_LAZY
tracing: support lazy resched
Documentation: tracing: add TIF_NEED_RESCHED_LAZY
osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y
kconfig: decompose ARCH_NO_PREEMPT
arch: decompose ARCH_NO_PREEMPT
Sean Christopherson (2):
sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
sched/core: Drop spinlocks on contention iff kernel is preemptible
Shrikanth Hegde (1):
powerpc: add support for PREEMPT_AUTO
.../admin-guide/kernel-parameters.txt | 5 +-
Documentation/trace/ftrace.rst | 6 +-
arch/Kconfig | 7 +
arch/alpha/Kconfig | 3 +-
arch/hexagon/Kconfig | 3 +-
arch/m68k/Kconfig | 3 +-
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/thread_info.h | 5 +-
arch/powerpc/kernel/interrupt.c | 5 +-
arch/um/Kconfig | 3 +-
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 6 +-
include/linux/entry-common.h | 2 +-
include/linux/entry-kvm.h | 2 +-
include/linux/preempt.h | 43 ++-
include/linux/rcutree.h | 2 +-
include/linux/sched.h | 101 +++---
include/linux/spinlock.h | 14 +-
include/linux/thread_info.h | 71 +++-
include/linux/trace_events.h | 6 +-
init/Makefile | 1 +
kernel/Kconfig.preempt | 37 ++-
kernel/entry/common.c | 16 +-
kernel/entry/kvm.c | 4 +-
kernel/rcu/Kconfig | 2 +-
kernel/rcu/tree.c | 13 +-
kernel/rcu/tree_plugin.h | 11 +-
kernel/sched/core.c | 311 ++++++++++++------
kernel/sched/deadline.c | 9 +-
kernel/sched/debug.c | 13 +-
kernel/sched/fair.c | 56 ++--
kernel/sched/rt.c | 6 +-
kernel/sched/sched.h | 27 +-
kernel/trace/trace.c | 30 +-
kernel/trace/trace_osnoise.c | 22 +-
kernel/trace/trace_output.c | 16 +-
36 files changed, 598 insertions(+), 265 deletions(-)
--
2.31.1
On Mon, May 27, 2024, Ankur Arora wrote:
> Patches 1,2
>   "sched/core: Move preempt_model_*() helpers from sched.h to preempt.h"
>   "sched/core: Drop spinlocks on contention iff kernel is preemptible"
> condition spin_needbreak() on the dynamic preempt_model_*().

...

> Not really required but a useful bugfix for PREEMPT_DYNAMIC and PREEMPT_AUTO.

> Sean Christopherson (2):
>   sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
>   sched/core: Drop spinlocks on contention iff kernel is preemptible

Peter and/or Thomas, would it be possible to get these applied to tip-tree sooner
than later? They fix a real bug that affects KVM to varying degrees.
On Wed, Jun 05, 2024 at 08:44:50AM -0700, Sean Christopherson wrote:
> On Mon, May 27, 2024, Ankur Arora wrote:
> > Patches 1,2
> >   "sched/core: Move preempt_model_*() helpers from sched.h to preempt.h"
> >   "sched/core: Drop spinlocks on contention iff kernel is preemptible"
> > condition spin_needbreak() on the dynamic preempt_model_*().
>
> ...
>
> > Not really required but a useful bugfix for PREEMPT_DYNAMIC and PREEMPT_AUTO.
>
> > Sean Christopherson (2):
> >   sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
> >   sched/core: Drop spinlocks on contention iff kernel is preemptible
>
> Peter and/or Thomas, would it be possible to get these applied to tip-tree sooner
> than later? They fix a real bug that affects KVM to varying degrees.

It so happens I've queued them for sched/core earlier today (see queue/sched/core).
If the robot comes back happy, I'll push them into tip.

Thanks!
On 5/28/24 6:04 AM, Ankur Arora wrote:
> Hi,
>
> This series adds a new scheduling model PREEMPT_AUTO, which like
> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
> on explicit preemption points for the voluntary models.
>
> The series is based on Thomas' original proposal which he outlined
> in [1], [2] and in his PoC [3].
>
> v2 mostly reworks v1, with one of the main changes having less
> noisy need-resched-lazy related interfaces.
> More details in the changelog below.
>
Hi Ankur. Thanks for the series.
nit: had to manually apply patches 11, 12, 13 since they didn't apply cleanly on
tip/master and tip/sched/core. Mostly due to some word differences in the change.
tip/master was at:
commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
Merge: 5d145493a139 47ff30cc1be7
Author: Ingo Molnar <mingo@kernel.org>
Date: Tue May 28 12:44:26 2024 +0200
Merge branch into tip/master: 'x86/percpu'
> The v1 of the series is at [4] and the RFC at [5].
>
> Design
> ==
>
> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
> PREEMPT_COUNT). This means that the scheduler can always safely
> preempt. (This is identical to CONFIG_PREEMPT.)
>
> Having that, the next step is to make the rescheduling policy dependent
> on the chosen scheduling model. Currently, the scheduler uses a single
> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a
> reschedule is needed.
> PREEMPT_AUTO extends this by adding an additional need-resched bit
> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the
> scheduler to express two kinds of rescheduling intent: schedule at
> the earliest opportunity (TIF_NEED_RESCHED), or express a need for
> rescheduling while allowing the task on the runqueue to run to
> timeslice completion (TIF_NEED_RESCHED_LAZY).
>
> The scheduler decides which need-resched bits are chosen based on
> the preemption model in use:
>
> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY
>
> none never always [*]
> voluntary higher sched class other tasks [*]
> full always never
>
> [*] some details elided.
>
> The last part of the puzzle is, when does preemption happen, or
> alternately stated, when are the need-resched bits checked:
>
> exit-to-user ret-to-kernel preempt_count()
>
> NEED_RESCHED_LAZY Y N N
> NEED_RESCHED Y Y Y
>
> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
> none/voluntary preemption policies are in effect. And eager semantics
> under full preemption.
>
> In addition, since this is driven purely by the scheduler (not
> depending on cond_resched() placement and the like), there is enough
> flexibility in the scheduler to cope with edge cases -- ex. a kernel
> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by
> simply upgrading to a full NEED_RESCHED which can use more coercive
> instruments like resched IPI to induce a context-switch.
>
> Performance
> ==
> The performance in the basic tests (perf bench sched messaging, kernbench,
> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC.
> (See patches
> "sched: support preempt=none under PREEMPT_AUTO"
> "sched: support preempt=full under PREEMPT_AUTO"
> "sched: handle preempt=voluntary under PREEMPT_AUTO")
>
> For a macro test, a colleague in Oracle's Exadata team tried two
> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series
> backported.)
>
> In both tests the data was cached on remote nodes (cells), and the
> database nodes (compute) served client queries, with clients being
> local in the first test and remote in the second.
>
> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
>
>
> PREEMPT_VOLUNTARY PREEMPT_AUTO
> (preempt=voluntary)
> ============================== =============================
> clients throughput cpu-usage throughput cpu-usage Gain
> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %)
> ------- ---------- ----------------- ---------- ----------------- -------
>
>
> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7%
> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6%
> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8%
>
>
> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5%
> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6%
> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3%
> 90/10 RW ratio)
>
>
> (Both sets of tests have a fair amount of NW traffic since the query
> tables etc are cached on the cells. Additionally, the first set,
> given the local clients, stress the scheduler a bit more than the
> second.)
>
> The comparative performance for both the tests is fairly close,
> more or less within a margin of error.
>
> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):
>
> "
> a) Base kernel (6.7),
> b) v1, PREEMPT_AUTO, preempt=voluntary
> c) v1, PREEMPT_DYNAMIC, preempt=voluntary
> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
>
> Workloads I tested and their %gain,
> case b case c case d
> NAS +2.7% +1.9% +2.1%
> Hashjoin, +0.0% +0.0% +0.0%
> Graph500, -6.0% +0.0% +0.0%
> XSBench +1.7% +0.0% +1.2%
>
> (Note about the Graph500 numbers at [8].)
>
> Did kernbench etc test from Mel's mmtests suite also. Did not notice
> much difference.
> "
>
> One case where there is a significant performance drop is on powerpc,
> seen running hackbench on a 320 core system (a test on a smaller system is
> fine.) In theory there's no reason for this to only happen on powerpc
> since most of the code is common, but I haven't been able to reproduce
> it on x86 so far.
>
> All in all, I think the tests above show that this scheduling model has legs.
> However, the none/voluntary models under PREEMPT_AUTO are conceptually
> different enough from the current none/voluntary models that there
> likely are workloads where performance would be subpar. That needs more
> extensive testing to figure out the weak points.
>
>
>
Did test it again on PowerPC. Unfortunately the numbers show there is still a
regression compared to 6.10-rc1. This is done with preempt=none. I tried again
on the smaller system too to confirm. For now I have done the comparison for
hackbench, where the highest regression was seen in v1.

perf stat collected for 20 iterations shows higher context switches and higher
migrations. Could it be that the LAZY bit is causing more context switches? Or
could it be something else? Could it be that more exit-to-user transitions
happen on PowerPC? Will continue to debug.

Meanwhile, will do more tests with other micro-benchmarks and post the results.
More details below.
CONFIG_HZ = 100
./hackbench -pipe 60 process 100000 loops
====================================================================================
On the larger system (40 cores, 320 CPUs)
====================================================================================
                            6.10-rc1         +preempt_auto
                            preempt=none     preempt=none
 20 iterations avg value
 hackbench pipe(60)         26.403           32.368  ( -31.1%)
++++++++++++++++++
baseline 6.10-rc1:
++++++++++++++++++
Performance counter stats for 'system wide' (20 runs):
168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% )
6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% )
246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% )
1,759,232 page-faults # 19.716 /sec ( +- 6.61% )
577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% )
226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% )
37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% )
166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% )
26.403 +- 0.166 seconds time elapsed ( +- 0.63% )
++++++++++++
preempt auto
++++++++++++
Performance counter stats for 'system wide' (20 runs):
207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% )
9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% )
631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% )
1,756,583 page-faults # 16.112 /sec ( +- 6.59% )
700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% )
254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% )
42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% )
231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% )
32.368 +- 0.200 seconds time elapsed ( +- 0.62% )
============================================================================================
Smaller system (12 cores, 96 CPUs)
============================================================================================
                            6.10-rc1         +preempt_auto
                            preempt=none     preempt=none
 20 iterations avg value
 hackbench pipe(60)         55.930           65.75   ( -17.6%)
++++++++++++++++++
baseline 6.10-rc1:
++++++++++++++++++
Performance counter stats for 'system wide' (20 runs):
107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% )
1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% )
44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% )
1,698,710 page-faults # 30.010 /sec ( +- 6.58% )
412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% )
192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% )
30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% )
99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% )
55.930 +- 0.509 seconds time elapsed ( +- 0.91% )
+++++++++++++++++
v2_preempt_auto
+++++++++++++++++
Performance counter stats for 'system wide' (20 runs):
126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% )
2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% )
147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% )
1,710,637 page-faults # 25.593 /sec ( +- 6.55% )
483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% )
210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% )
33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% )
134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% )
65.75 +- 1.06 seconds time elapsed ( +- 1.61% )
Shrikanth Hegde <sshegde@linux.ibm.com> writes: > On 5/28/24 6:04 AM, Ankur Arora wrote: >> Hi, >> >> This series adds a new scheduling model PREEMPT_AUTO, which like >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full >> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend >> on explicit preemption points for the voluntary models. >> >> The series is based on Thomas' original proposal which he outlined >> in [1], [2] and in his PoC [3]. >> >> v2 mostly reworks v1, with one of the main changes having less >> noisy need-resched-lazy related interfaces. >> More details in the changelog below. >> > > Hi Ankur. Thanks for the series. > > nit: had to manually patch 11,12,13 since it didnt apply cleanly on > tip/master and tip/sched/core. Mostly due some word differences in the change. > > tip/master was at: > commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD) > Merge: 5d145493a139 47ff30cc1be7 > Author: Ingo Molnar <mingo@kernel.org> > Date: Tue May 28 12:44:26 2024 +0200 > > Merge branch into tip/master: 'x86/percpu' > > > >> The v1 of the series is at [4] and the RFC at [5]. >> >> Design >> == >> >> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus >> PREEMPT_COUNT). This means that the scheduler can always safely >> preempt. (This is identical to CONFIG_PREEMPT.) >> >> Having that, the next step is to make the rescheduling policy dependent >> on the chosen scheduling model. Currently, the scheduler uses a single >> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a >> reschedule is needed. >> PREEMPT_AUTO extends this by adding an additional need-resched bit >> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the >> scheduler to express two kinds of rescheduling intent: schedule at >> the earliest opportunity (TIF_NEED_RESCHED), or express a need for >> rescheduling while allowing the task on the runqueue to run to >> timeslice completion (TIF_NEED_RESCHED_LAZY). >> >> The scheduler decides which need-resched bits are chosen based on >> the preemption model in use: >> >> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY >> >> none never always [*] >> voluntary higher sched class other tasks [*] >> full always never >> >> [*] some details elided. >> >> The last part of the puzzle is, when does preemption happen, or >> alternately stated, when are the need-resched bits checked: >> >> exit-to-user ret-to-kernel preempt_count() >> >> NEED_RESCHED_LAZY Y N N >> NEED_RESCHED Y Y Y >> >> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when >> none/voluntary preemption policies are in effect. And eager semantics >> under full preemption. >> >> In addition, since this is driven purely by the scheduler (not >> depending on cond_resched() placement and the like), there is enough >> flexibility in the scheduler to cope with edge cases -- ex. a kernel >> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by >> simply upgrading to a full NEED_RESCHED which can use more coercive >> instruments like resched IPI to induce a context-switch. >> >> Performance >> == >> The performance in the basic tests (perf bench sched messaging, kernbench, >> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC. 
>> (See patches >> "sched: support preempt=none under PREEMPT_AUTO" >> "sched: support preempt=full under PREEMPT_AUTO" >> "sched: handle preempt=voluntary under PREEMPT_AUTO") >> >> For a macro test, a colleague in Oracle's Exadata team tried two >> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series >> backported.) >> >> In both tests the data was cached on remote nodes (cells), and the >> database nodes (compute) served client queries, with clients being >> local in the first test and remote in the second. >> >> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs) >> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs >> >> >> PREEMPT_VOLUNTARY PREEMPT_AUTO >> (preempt=voluntary) >> ============================== ============================= >> clients throughput cpu-usage throughput cpu-usage Gain >> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %) >> ------- ---------- ----------------- ---------- ----------------- ------- >> >> >> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7% >> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6% >> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8% >> >> >> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5% >> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6% >> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3% >> 90/10 RW ratio) >> >> >> (Both sets of tests have a fair amount of NW traffic since the query >> tables etc are cached on the cells. Additionally, the first set, >> given the local clients, stress the scheduler a bit more than the >> second.) >> >> The comparative performance for both the tests is fairly close, >> more or less within a margin of error. >> >> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM): >> >> " >> a) Base kernel (6.7), >> b) v1, PREEMPT_AUTO, preempt=voluntary >> c) v1, PREEMPT_DYNAMIC, preempt=voluntary >> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y >> >> Workloads I tested and their %gain, >> case b case c case d >> NAS +2.7% +1.9% +2.1% >> Hashjoin, +0.0% +0.0% +0.0% >> Graph500, -6.0% +0.0% +0.0% >> XSBench +1.7% +0.0% +1.2% >> >> (Note about the Graph500 numbers at [8].) >> >> Did kernbench etc test from Mel's mmtests suite also. Did not notice >> much difference. >> " >> >> One case where there is a significant performance drop is on powerpc, >> seen running hackbench on a 320 core system (a test on a smaller system is >> fine.) In theory there's no reason for this to only happen on powerpc >> since most of the code is common, but I haven't been able to reproduce >> it on x86 so far. >> >> All in all, I think the tests above show that this scheduling model has legs. >> However, the none/voluntary models under PREEMPT_AUTO are conceptually >> different enough from the current none/voluntary models that there >> likely are workloads where performance would be subpar. That needs more >> extensive testing to figure out the weak points. >> >> >> > Did test it again on PowerPC. Unfortunately numbers shows there is regression > still compared to 6.10-rc1. This is done with preempt=none. I tried again on the > smaller system too to confirm. For now I have done the comparison for the hackbench > where highest regression was seen in v1. > > perf stat collected for 20 iterations show higher context switch and higher migrations. > Could it be that LAZY bit is causing more context switches? or could it be something > else? Could it be that more exit-to-user happens in PowerPC? will continue to debug. 
Thanks for trying it out. As you point out, context-switches and migrations are signficantly higher. Definitely unexpected. I ran the same test on an x86 box (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference. 6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% ) 6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% ) 6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% ) 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% ) 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% ) 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% ) Clearly there's something different going on powerpc. I'm travelling right now, but will dig deeper into this once I get back. Meanwhile can you check if the increased context-switches are voluntary or involuntary (or what the division is)? Thanks Ankur > Meanwhile, will do more test with other micro-benchmarks and post the results. > > > More details below. > CONFIG_HZ = 100 > ./hackbench -pipe 60 process 100000 loops > > ==================================================================================== > On the larger system. (40 Cores, 320CPUS) > ==================================================================================== > 6.10-rc1 +preempt_auto > preempt=none preempt=none > 20 iterations avg value > hackbench pipe(60) 26.403 32.368 ( -31.1%) > > ++++++++++++++++++ > baseline 6.10-rc1: > ++++++++++++++++++ > Performance counter stats for 'system wide' (20 runs): > 168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% ) > 6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% ) > 246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% ) > 1,759,232 page-faults # 19.716 /sec ( +- 6.61% ) > 577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% ) > 226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% ) > 37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% ) > 166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% ) > > 26.403 +- 0.166 seconds time elapsed ( +- 0.63% ) > > ++++++++++++ > preempt auto > ++++++++++++ > Performance counter stats for 'system wide' (20 runs): > 207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% ) > 9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% ) > 631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% ) > 1,756,583 page-faults # 16.112 /sec ( +- 6.59% ) > 700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% ) > 254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% ) > 42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% ) > 231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% ) > > 32.368 +- 0.200 seconds time elapsed ( +- 0.62% ) > > > ============================================================================================ > Smaller system ( 12Cores, 96CPUS) > ============================================================================================ > 6.10-rc1 +preempt_auto > preempt=none preempt=none > 20 iterations avg value > hackbench pipe(60) 55.930 65.75 ( -17.6%) > > ++++++++++++++++++ > baseline 6.10-rc1: > ++++++++++++++++++ > Performance counter stats for 'system wide' (20 runs): > 107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% ) > 1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% ) > 44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% ) > 1,698,710 page-faults # 30.010 
/sec ( +- 6.58% ) > 412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% ) > 192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% ) > 30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% ) > 99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% ) > > 55.930 +- 0.509 seconds time elapsed ( +- 0.91% ) > > > +++++++++++++++++ > v2_preempt_auto > +++++++++++++++++ > Performance counter stats for 'system wide' (20 runs): > 126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% ) > 2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% ) > 147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% ) > 1,710,637 page-faults # 25.593 /sec ( +- 6.55% ) > 483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% ) > 210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% ) > 33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% ) > 134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% ) > > 65.75 +- 1.06 seconds time elapsed ( +- 1.61% ) So, the context-switches are meaningfully higher. -- ankur
On 6/1/24 5:17 PM, Ankur Arora wrote: > > Shrikanth Hegde <sshegde@linux.ibm.com> writes: > >> On 5/28/24 6:04 AM, Ankur Arora wrote: >>> Hi, >>> >>> This series adds a new scheduling model PREEMPT_AUTO, which like >>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full >>> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend >>> on explicit preemption points for the voluntary models. >>> >>> The series is based on Thomas' original proposal which he outlined >>> in [1], [2] and in his PoC [3]. >>> >>> v2 mostly reworks v1, with one of the main changes having less >>> noisy need-resched-lazy related interfaces. >>> More details in the changelog below. >>> >> >> Hi Ankur. Thanks for the series. >> >> nit: had to manually patch 11,12,13 since it didnt apply cleanly on >> tip/master and tip/sched/core. Mostly due some word differences in the change. >> >> tip/master was at: >> commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD) >> Merge: 5d145493a139 47ff30cc1be7 >> Author: Ingo Molnar <mingo@kernel.org> >> Date: Tue May 28 12:44:26 2024 +0200 >> >> Merge branch into tip/master: 'x86/percpu' >> >> >> >>> The v1 of the series is at [4] and the RFC at [5]. >>> >>> Design >>> == >>> >>> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus >>> PREEMPT_COUNT). This means that the scheduler can always safely >>> preempt. (This is identical to CONFIG_PREEMPT.) >>> >>> Having that, the next step is to make the rescheduling policy dependent >>> on the chosen scheduling model. Currently, the scheduler uses a single >>> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a >>> reschedule is needed. >>> PREEMPT_AUTO extends this by adding an additional need-resched bit >>> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the >>> scheduler to express two kinds of rescheduling intent: schedule at >>> the earliest opportunity (TIF_NEED_RESCHED), or express a need for >>> rescheduling while allowing the task on the runqueue to run to >>> timeslice completion (TIF_NEED_RESCHED_LAZY). >>> >>> The scheduler decides which need-resched bits are chosen based on >>> the preemption model in use: >>> >>> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY >>> >>> none never always [*] >>> voluntary higher sched class other tasks [*] >>> full always never >>> >>> [*] some details elided. >>> >>> The last part of the puzzle is, when does preemption happen, or >>> alternately stated, when are the need-resched bits checked: >>> >>> exit-to-user ret-to-kernel preempt_count() >>> >>> NEED_RESCHED_LAZY Y N N >>> NEED_RESCHED Y Y Y >>> >>> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when >>> none/voluntary preemption policies are in effect. And eager semantics >>> under full preemption. >>> >>> In addition, since this is driven purely by the scheduler (not >>> depending on cond_resched() placement and the like), there is enough >>> flexibility in the scheduler to cope with edge cases -- ex. a kernel >>> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by >>> simply upgrading to a full NEED_RESCHED which can use more coercive >>> instruments like resched IPI to induce a context-switch. >>> >>> Performance >>> == >>> The performance in the basic tests (perf bench sched messaging, kernbench, >>> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC. 
>>> (See patches >>> "sched: support preempt=none under PREEMPT_AUTO" >>> "sched: support preempt=full under PREEMPT_AUTO" >>> "sched: handle preempt=voluntary under PREEMPT_AUTO") >>> >>> For a macro test, a colleague in Oracle's Exadata team tried two >>> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series >>> backported.) >>> >>> In both tests the data was cached on remote nodes (cells), and the >>> database nodes (compute) served client queries, with clients being >>> local in the first test and remote in the second. >>> >>> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs) >>> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs >>> >>> >>> PREEMPT_VOLUNTARY PREEMPT_AUTO >>> (preempt=voluntary) >>> ============================== ============================= >>> clients throughput cpu-usage throughput cpu-usage Gain >>> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %) >>> ------- ---------- ----------------- ---------- ----------------- ------- >>> >>> >>> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7% >>> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6% >>> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8% >>> >>> >>> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5% >>> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6% >>> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3% >>> 90/10 RW ratio) >>> >>> >>> (Both sets of tests have a fair amount of NW traffic since the query >>> tables etc are cached on the cells. Additionally, the first set, >>> given the local clients, stress the scheduler a bit more than the >>> second.) >>> >>> The comparative performance for both the tests is fairly close, >>> more or less within a margin of error. >>> >>> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM): >>> >>> " >>> a) Base kernel (6.7), >>> b) v1, PREEMPT_AUTO, preempt=voluntary >>> c) v1, PREEMPT_DYNAMIC, preempt=voluntary >>> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y >>> >>> Workloads I tested and their %gain, >>> case b case c case d >>> NAS +2.7% +1.9% +2.1% >>> Hashjoin, +0.0% +0.0% +0.0% >>> Graph500, -6.0% +0.0% +0.0% >>> XSBench +1.7% +0.0% +1.2% >>> >>> (Note about the Graph500 numbers at [8].) >>> >>> Did kernbench etc test from Mel's mmtests suite also. Did not notice >>> much difference. >>> " >>> >>> One case where there is a significant performance drop is on powerpc, >>> seen running hackbench on a 320 core system (a test on a smaller system is >>> fine.) In theory there's no reason for this to only happen on powerpc >>> since most of the code is common, but I haven't been able to reproduce >>> it on x86 so far. >>> >>> All in all, I think the tests above show that this scheduling model has legs. >>> However, the none/voluntary models under PREEMPT_AUTO are conceptually >>> different enough from the current none/voluntary models that there >>> likely are workloads where performance would be subpar. That needs more >>> extensive testing to figure out the weak points. >>> >>> >>> >> Did test it again on PowerPC. Unfortunately numbers shows there is regression >> still compared to 6.10-rc1. This is done with preempt=none. I tried again on the >> smaller system too to confirm. For now I have done the comparison for the hackbench >> where highest regression was seen in v1. >> >> perf stat collected for 20 iterations show higher context switch and higher migrations. >> Could it be that LAZY bit is causing more context switches? or could it be something >> else? 
Could it be that more exit-to-user happens in PowerPC? will continue to debug. > > Thanks for trying it out. > > As you point out, context-switches and migrations are signficantly higher. > > Definitely unexpected. I ran the same test on an x86 box > (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference. > > 6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% ) > 6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% ) > 6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% ) > > 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% ) > 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% ) > 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% ) > > Clearly there's something different going on powerpc. I'm travelling > right now, but will dig deeper into this once I get back. > > Meanwhile can you check if the increased context-switches are voluntary or > involuntary (or what the division is)? Used "pidstat -w -p ALL 1 10" to capture 10 seconds data at 1 second interval for context switches per second while running "hackbench -pipe 60 process 100000 loops" preempt=none 6.10 preempt_auto ============================================================================= voluntary context switches 7632166.19 9391636.34(+23%) involuntary context switches 2305544.07 3527293.94(+53%) Numbers vary between multiple runs. But trend seems to be similar. Both the context switches increase involuntary seems to increase at higher rate. BTW, ran Unixbench as well. It shows slight regression. stress-ng numbers didn't seem conclusive. schench(old) showed slightly lower latency when the number of threads were low. at higher thread count showed higher tail latency. But it doesn't seem very convincing numbers. All these were done under preempt=none in both 6.10 and preempt_auto. Unixbench 6.10 preempt_auto ===================================================================== 1 X Execl Throughput : 5345.70, 5109.68(-4.42) 4 X Execl Throughput : 15610.54, 15087.92(-3.35) 1 X Pipe-based Context Switching : 183172.30, 177069.52(-3.33) 4 X Pipe-based Context Switching : 615471.66, 602773.74(-2.06) 1 X Process Creation : 10778.92, 10443.76(-3.11) 4 X Process Creation : 24327.06, 25150.42(+3.38) 1 X Shell Scripts (1 concurrent) : 10416.76, 10222.28(-1.87) 4 X Shell Scripts (1 concurrent) : 36051.00, 35206.90(-2.34) 1 X Shell Scripts (8 concurrent) : 5004.22, 4907.32(-1.94) 4 X Shell Scripts (8 concurrent) : 12676.08, 12418.18(-2.03) > > > Thanks > Ankur > >> Meanwhile, will do more test with other micro-benchmarks and post the results. >> >> >> More details below. >> CONFIG_HZ = 100 >> ./hackbench -pipe 60 process 100000 loops >> >> ==================================================================================== >> On the larger system. 
(40 Cores, 320CPUS) >> ==================================================================================== >> 6.10-rc1 +preempt_auto >> preempt=none preempt=none >> 20 iterations avg value >> hackbench pipe(60) 26.403 32.368 ( -31.1%) >> >> ++++++++++++++++++ >> baseline 6.10-rc1: >> ++++++++++++++++++ >> Performance counter stats for 'system wide' (20 runs): >> 168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% ) >> 6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% ) >> 246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% ) >> 1,759,232 page-faults # 19.716 /sec ( +- 6.61% ) >> 577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% ) >> 226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% ) >> 37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% ) >> 166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% ) >> >> 26.403 +- 0.166 seconds time elapsed ( +- 0.63% ) >> >> ++++++++++++ >> preempt auto >> ++++++++++++ >> Performance counter stats for 'system wide' (20 runs): >> 207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% ) >> 9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% ) >> 631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% ) >> 1,756,583 page-faults # 16.112 /sec ( +- 6.59% ) >> 700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% ) >> 254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% ) >> 42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% ) >> 231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% ) >> >> 32.368 +- 0.200 seconds time elapsed ( +- 0.62% ) >> >> >> ============================================================================================ >> Smaller system ( 12Cores, 96CPUS) >> ============================================================================================ >> 6.10-rc1 +preempt_auto >> preempt=none preempt=none >> 20 iterations avg value >> hackbench pipe(60) 55.930 65.75 ( -17.6%) >> >> ++++++++++++++++++ >> baseline 6.10-rc1: >> ++++++++++++++++++ >> Performance counter stats for 'system wide' (20 runs): >> 107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% ) >> 1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% ) >> 44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% ) >> 1,698,710 page-faults # 30.010 /sec ( +- 6.58% ) >> 412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% ) >> 192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% ) >> 30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% ) >> 99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% ) >> >> 55.930 +- 0.509 seconds time elapsed ( +- 0.91% ) >> >> >> +++++++++++++++++ >> v2_preempt_auto >> +++++++++++++++++ >> Performance counter stats for 'system wide' (20 runs): >> 126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% ) >> 2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% ) >> 147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% ) >> 1,710,637 page-faults # 25.593 /sec ( +- 6.55% ) >> 483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% ) >> 210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% ) >> 33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% ) >> 134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% ) >> >> 65.75 +- 1.06 seconds time elapsed ( +- 1.61% ) > > So, the context-switches are meaningfully higher. > > -- > ankur
On 6/4/24 1:02 PM, Shrikanth Hegde wrote:
>
> On 6/1/24 5:17 PM, Ankur Arora wrote:
>>
>> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>>
>>> On 5/28/24 6:04 AM, Ankur Arora wrote:
>>>> ...
>>>
>>> Did test it again on PowerPC. Unfortunately the numbers show there is still a regression
>>> compared to 6.10-rc1. This is done with preempt=none. I tried again on the
>>> smaller system too to confirm. For now I have done the comparison for the hackbench
>>> where the highest regression was seen in v1.
>>>
>>> perf stat collected for 20 iterations shows higher context switches and higher migrations.
>>> Could it be that the LAZY bit is causing more context switches? Or could it be something
>>> else? Could it be that more exit-to-user happens in PowerPC? Will continue to debug.
>>
>> Thanks for trying it out.
>>
>> As you point out, context-switches and migrations are significantly higher.
>>
>> Definitely unexpected. I ran the same test on an x86 box
>> (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.
>>
>> 6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% )
>> 6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% )
>> 6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% )
>>
>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% )
>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% )
>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% )
>>
>> Clearly there's something different going on powerpc. I'm travelling
>> right now, but will dig deeper into this once I get back.
>>
>> Meanwhile can you check if the increased context-switches are voluntary or
>> involuntary (or what the division is)?
>
> Used "pidstat -w -p ALL 1 10" to capture 10 seconds of data at 1 second intervals for
> context switches per second while running "hackbench -pipe 60 process 100000 loops"
>
>                                   preempt=none 6.10    preempt_auto
> =============================================================================
> voluntary context switches        7632166.19           9391636.34 (+23%)
> involuntary context switches      2305544.07           3527293.94 (+53%)
>
> Numbers vary between multiple runs, but the trend seems to be similar. Both kinds of
> context switches increase; the involuntary ones seem to increase at a higher rate.

Continued data from the hackbench regression. preempt=none in both the cases.
From mpstat, I see slightly higher idle time and more irq time with preempt_auto.

6.10-rc1:
=========
10:09:50 AM  CPU   %usr  %nice   %sys  %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
09:45:23 AM  all   4.14   0.00  77.57     0.00  16.92   0.00    0.00    0.00    0.00   1.37
09:45:24 AM  all   4.42   0.00  77.62     0.00  16.76   0.00    0.00    0.00    0.00   1.20
09:45:25 AM  all   4.43   0.00  77.45     0.00  16.94   0.00    0.00    0.00    0.00   1.18
09:45:26 AM  all   4.45   0.00  77.87     0.00  16.68   0.00    0.00    0.00    0.00   0.99

PREEMPT_AUTO:
===========
10:09:50 AM  CPU   %usr  %nice   %sys  %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
10:09:56 AM  all   3.11   0.00  72.59     0.00  21.34   0.00    0.00    0.00    0.00   2.96
10:09:57 AM  all   3.31   0.00  73.10     0.00  20.99   0.00    0.00    0.00    0.00   2.60
10:09:58 AM  all   3.40   0.00  72.83     0.00  20.85   0.00    0.00    0.00    0.00   2.92
10:10:00 AM  all   3.21   0.00  72.87     0.00  21.19   0.00    0.00    0.00    0.00   2.73
10:10:01 AM  all   3.02   0.00  72.18     0.00  21.08   0.00    0.00    0.00    0.00   3.71

Used bcc tools hardirq and softirq to see if irq are increasing. softirq implied there are more
timer,sched softirq. Numbers vary between different samples, but trend seems to be similar.

6.10-rc1:
=========
SOFTIRQ          TOTAL_usecs
tasklet                   71
block                    145
net_rx                  7914
rcu                   136988
timer                 304357
sched                1404497

PREEMPT_AUTO:
===========
SOFTIRQ          TOTAL_usecs
tasklet                   80
block                    139
net_rx                  6907
rcu                   223508
timer                 492767
sched                1794441

Would any specific setting of RCU matter for this?
This is what I have in the config.

# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_TREE_SRCU=y
CONFIG_NEED_SRCU_NMI_SAFE=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_NEED_TASKS_RCU=y
CONFIG_TASKS_RCU=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
# CONFIG_RCU_LAZY is not set
# end of RCU Subsystem

# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_CONTEXT_TRACKING_USER=y
# CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem
Shrikanth Hegde <sshegde@linux.ibm.com> writes:

> On 6/4/24 1:02 PM, Shrikanth Hegde wrote:
>> ...
>
> ...
>
> Used bcc tools hardirq and softirq to see if irq are increasing. softirq implied there are more
> timer,sched softirq. Numbers vary between different samples, but trend seems to be similar.

Yeah, the %sys is lower and %irq, higher. Can you also see where the
increased %irq is? For instance are the resched IPIs numbers greater?

> Would any specific setting of RCU matter for this?
> This is what I have in the config.

Don't see how it could matter unless the RCU settings are changing
between the two tests? In my testing I'm also using TREE_RCU=y,
PREEMPT_RCU=n.

Let me see if I can find a test which shows a similar trend to what you
are seeing. And, then maybe see if tracing sched-switch might point to
an interesting difference between x86 and powerpc.

Thanks for all the detail.

Ankur

> ...

--
ankur
On 6/10/24 12:53 PM, Ankur Arora wrote:
>
>> ...
>>
>> Used bcc tools hardirq and softirq to see if irq are increasing. softirq implied there are more
>> timer,sched softirq. Numbers vary between different samples, but trend seems to be similar.
>
> Yeah, the %sys is lower and %irq, higher. Can you also see where the
> increased %irq is? For instance are the resched IPIs numbers greater?

Hi Ankur,

Used mpstat -I ALL to capture this info for 20 seconds.

HARDIRQ per second:
===================
6.10:
===================
       18          19          22          23    48    49     50    51       LOC   BCT  LOC2     SPU   PMI   MCE   NMI      WDG        DBL
-------------------------------------------------------------------------------------------------------------------------------------------
417956.86  1114642.30  1712683.65  2058664.99  0.00  0.00  18.30  0.39  31978.37  0.00  0.35  351.98  0.00  0.00  0.00  6405.54  329189.45

Preempt_auto:
===================
       18          19          22          23    48    49     50    51       LOC   BCT  LOC2     SPU   PMI   MCE   NMI      WDG        DBL
-------------------------------------------------------------------------------------------------------------------------------------------
609509.69  1910413.99  1923503.52  2061876.33  0.00  0.00  19.14  0.30  31916.59  0.00  0.45  497.88  0.00  0.00  0.00  6825.49   88247.85

18,19,22,23 are called XIVE interrupts. These are IPI interrupts. I am not sure which type of IPI these are. Will have to see why they are increasing.

SOFTIRQ per second:
===================
6.10:
===================
  HI    TIMER  NET_TX  NET_RX  BLOCK  IRQ_POLL  TASKLET     SCHED  HRTIMER      RCU
0.00  3966.47    0.00   18.25   0.59      0.00     0.34  12811.00     0.00  9693.95

Preempt_auto:
===================
  HI    TIMER  NET_TX  NET_RX  BLOCK  IRQ_POLL  TASKLET     SCHED  HRTIMER       RCU
0.00  4871.67    0.00   18.94   0.40      0.00     0.25  13518.66     0.00  15732.77

Note: RCU softirq seems to increase significantly. Not sure which one triggers. still trying to figure out why.
It maybe irq triggering to softirq or softirq causing more IPI.

Also, noticed the below config difference which gets removed with preempt auto. This happens because PREEMPTION makes them N. Made the changes in kernel/Kconfig.locks to get them
enabled. I still see the same regression in hackbench. These configs still may need attention?

6.10                              |  preempt auto
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y   |  CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK=y       |  --------------------------------
CONFIG_INLINE_READ_UNLOCK_IRQ=y   |  --------------------------------
CONFIG_INLINE_WRITE_UNLOCK=y      |  --------------------------------
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y  |  --------------------------------

> ...
>
> --
> ankur
On 6/15/24 8:34 PM, Shrikanth Hegde wrote:
>
> On 6/10/24 12:53 PM, Ankur Arora wrote:
>> ...
>
> ...
>
> Note: RCU softirq seems to increase significantly. Not sure which one triggers. still trying to figure out why.
> It maybe irq triggering to softirq or softirq causing more IPI.
>
> Also, noticed the below config difference which gets removed with preempt auto. This happens because PREEMPTION makes them N. Made the changes in kernel/Kconfig.locks to get them
> enabled. I still see the same regression in hackbench. These configs still may need attention?
>
> ...

Did an experiment keeping the number of CPUs constant, while changing the number of sockets they span across.
When all CPUs belong to the same socket, there is no regression w.r.t. PREEMPT_AUTO. Regression starts when the CPUs start
spanning across sockets.

Since Preempt auto by default enables preempt count, I think that may cause the regression. I see Powerpc uses generic implementation
which may not scale well.

Will try to shift to percpu based method and see. will get back if I can get that done successfully.
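(For context, the "generic implementation" referred to above lives in include/asm-generic/preempt.h and keeps the count in thread_info. A simplified sketch, with details varying by kernel version:)

/* Simplified sketch of include/asm-generic/preempt.h (thread_info based). */
static __always_inline volatile int *preempt_count_ptr(void)
{
	/* current_thread_info() itself needs a load before the field can be read */
	return &current_thread_info()->preempt_count;
}

static __always_inline int preempt_count(void)
{
	return READ_ONCE(current_thread_info()->preempt_count);
}

static __always_inline void __preempt_count_add(int val)
{
	*preempt_count_ptr() += val;
}

static __always_inline void __preempt_count_sub(int val)
{
	*preempt_count_ptr() -= val;
}

(So every add/sub goes through current_thread_info(), i.e. a dependent load before touching the field; that is the access pattern being contrasted with a fixed offset off the paca further down the thread.)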
Shrikanth Hegde <sshegde@linux.ibm.com> writes:

> On 6/15/24 8:34 PM, Shrikanth Hegde wrote:
>> ...
>
> Did an experiment keeping the number of CPUs constant, while changing the number of sockets they span across.
> When all CPUs belong to the same socket, there is no regression w.r.t. PREEMPT_AUTO. Regression starts when the CPUs start
> spanning across sockets.

Ah. That's really interesting. So, upto 160 CPUs was okay?

> Since Preempt auto by default enables preempt count, I think that may cause the regression. I see Powerpc uses generic implementation
> which may not scale well.

Yeah this would explain why I don't see similar behaviour on a 384 CPU
x86 box.

Also, IIRC the powerpc numbers on preempt=full were significantly worse
than preempt=none. That test might also be worth doing once you have the
percpu based method working.

> Will try to shift to percpu based method and see. will get back if I can get that done successfully.

Sounds good to me.

Thanks
Ankur
On 6/19/24 8:10 AM, Ankur Arora wrote:
>>>
>>> SOFTIRQ per second:
>>> ===================
>>> 6.10:
>>> ===================
>>> HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
>>> 0.00 3966.47 0.00 18.25 0.59 0.00 0.34 12811.00 0.00 9693.95
>>>
>>> Preempt_auto:
>>> ===================
>>> HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
>>> 0.00 4871.67 0.00 18.94 0.40 0.00 0.25 13518.66 0.00 15732.77
>>>
>>> Note: RCU softirq seems to increase significantly. Not sure which one triggers. still trying to figure out why.
>>> It maybe irq triggering to softirq or softirq causing more IPI.
>>
>>> Did an experiment keeping the number of CPUs constant, while changing the number of sockets they span across.
>>> When all CPUs belong to the same socket, there is no regression w.r.t. PREEMPT_AUTO. Regression starts when the CPUs start
>>> spanning across sockets.
>
> Ah. That's really interesting. So, upto 160 CPUs was okay?
No. In both cases the CPUs are limited to 96. In one case they are within a single NUMA node and in the other case they are spread across two NUMA nodes.
>
>> Since Preempt auto by default enables preempt count, I think that may cause the regression. I see Powerpc uses generic implementation
>> which may not scale well.
>
> Yeah this would explain why I don't see similar behaviour on a 384 CPU
> x86 box.
>
> Also, IIRC the powerpc numbers on preempt=full were significantly worse
> than preempt=none. That test might also be worth doing once you have the
> percpu based method working.
>
>> Will try to shift to percpu based method and see. will get back if I can get that done successfully.
>
> Sounds good to me.
>
Did give it a try. Made the preempt count per-CPU by adding it as a paca field. Unfortunately it didn't
improve the performance. It's more or less the same as preempt_auto.

The issue still remains elusive. The likely crux is that somehow IPI interrupts and SOFTIRQs are increasing
with preempt_auto. Doing some more data collection with perf/ftrace. Will share that soon.
This is the patch I tried, to make preempt_count per-CPU for powerpc. It boots and runs the workload.
Implemented a simpler one instead of folding need-resched into the preempt count; in a hacky way I avoided
the tif_need_resched() calls since that didn't affect the throughput. Hence kept it simple. Below is the patch
for reference. It didn't help fix the regression, unless I implemented it wrongly.
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 1d58da946739..374642288061 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -268,6 +268,7 @@ struct paca_struct {
u16 slb_save_cache_ptr;
#endif
#endif /* CONFIG_PPC_BOOK3S_64 */
+ int preempt_count;
#ifdef CONFIG_STACKPROTECTOR
unsigned long canary;
#endif
diff --git a/arch/powerpc/include/asm/preempt.h b/arch/powerpc/include/asm/preempt.h
new file mode 100644
index 000000000000..406dad1a0cf6
--- /dev/null
+++ b/arch/powerpc/include/asm/preempt.h
@@ -0,0 +1,106 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_PREEMPT_H
+#define __ASM_PREEMPT_H
+
+#include <linux/thread_info.h>
+
+#ifdef CONFIG_PPC64
+#include <asm/paca.h>
+#endif
+#include <asm/percpu.h>
+#include <asm/smp.h>
+
+#define PREEMPT_ENABLED (0)
+
+/*
+ * We mask the PREEMPT_NEED_RESCHED bit so as not to confuse all current users
+ * that think a non-zero value indicates we cannot preempt.
+ */
+static __always_inline int preempt_count(void)
+{
+ return READ_ONCE(local_paca->preempt_count);
+}
+
+static __always_inline void preempt_count_set(int pc)
+{
+ WRITE_ONCE(local_paca->preempt_count, pc);
+}
+
+/*
+ * must be macros to avoid header recursion hell
+ */
+#define init_task_preempt_count(p) do { } while (0)
+
+#define init_idle_preempt_count(p, cpu) do { } while (0)
+
+static __always_inline void set_preempt_need_resched(void)
+{
+}
+
+static __always_inline void clear_preempt_need_resched(void)
+{
+}
+
+static __always_inline bool test_preempt_need_resched(void)
+{
+ return false;
+}
+
+/*
+ * The various preempt_count add/sub methods
+ */
+
+static __always_inline void __preempt_count_add(int val)
+{
+ preempt_count_set(preempt_count() + val);
+}
+
+static __always_inline void __preempt_count_sub(int val)
+{
+ preempt_count_set(preempt_count() - val);
+}
+
+static __always_inline bool __preempt_count_dec_and_test(void)
+{
+ /*
+ * Because of load-store architectures cannot do per-cpu atomic
+ * operations; we cannot use PREEMPT_NEED_RESCHED because it might get
+ * lost.
+ */
+ preempt_count_set(preempt_count() - 1);
+ if (preempt_count() == 0 && tif_need_resched())
+ return true;
+ else
+ return false;
+}
+
+/*
+ * Returns true when we need to resched and can (barring IRQ state).
+ */
+static __always_inline bool should_resched(int preempt_offset)
+{
+ return unlikely(preempt_count() == preempt_offset && tif_need_resched());
+}
+
+//EXPORT_SYMBOL(per_cpu_preempt_count);
+
+#ifdef CONFIG_PREEMPTION
+extern asmlinkage void preempt_schedule(void);
+extern asmlinkage void preempt_schedule_notrace(void);
+
+#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
+
+void dynamic_preempt_schedule(void);
+void dynamic_preempt_schedule_notrace(void);
+#define __preempt_schedule() dynamic_preempt_schedule()
+#define __preempt_schedule_notrace() dynamic_preempt_schedule_notrace()
+
+#else /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_KEY*/
+
+#define __preempt_schedule() preempt_schedule()
+#define __preempt_schedule_notrace() preempt_schedule_notrace()
+
+#endif /* CONFIG_PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_KEY*/
+#endif /* CONFIG_PREEMPTION */
+
+#endif /* __ASM_PREEMPT_H */
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 0d170e2be2b6..bf2199384751 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -52,8 +52,8 @@
* low level task data.
*/
struct thread_info {
- int preempt_count; /* 0 => preemptable,
- <0 => BUG */
+ //int preempt_count; // 0 => preemptable,
+ // <0 => BUG
#ifdef CONFIG_SMP
unsigned int cpu;
#endif
@@ -77,7 +77,6 @@ struct thread_info {
*/
#define INIT_THREAD_INFO(tsk) \
{ \
- .preempt_count = INIT_PREEMPT_COUNT, \
.flags = 0, \
}
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 7502066c3c53..f90245b8359f 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -204,6 +204,7 @@ void __init initialise_paca(struct paca_struct *new_paca, int cpu)
#ifdef CONFIG_PPC_64S_HASH_MMU
new_paca->slb_shadow_ptr = NULL;
#endif
+ new_paca->preempt_count = PREEMPT_DISABLED;
#ifdef CONFIG_PPC_BOOK3E_64
/* For now -- if we have threads this will be adjusted later */
diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c
index 85050be08a23..2adab682aab9 100644
--- a/arch/powerpc/kexec/core_64.c
+++ b/arch/powerpc/kexec/core_64.c
@@ -33,6 +33,8 @@
#include <asm/ultravisor.h>
#include <asm/crashdump-ppc64.h>
+#include <linux/percpu-defs.h>
+
int machine_kexec_prepare(struct kimage *image)
{
int i;
@@ -324,7 +326,7 @@ void default_machine_kexec(struct kimage *image)
* XXX: the task struct will likely be invalid once we do the copy!
*/
current_thread_info()->flags = 0;
- current_thread_info()->preempt_count = HARDIRQ_OFFSET;
+ local_paca->preempt_count = HARDIRQ_OFFSET;
/* We need a static PACA, too; copy this CPU's PACA over and switch to
* it. Also poison per_cpu_offset and NULL lppaca to catch anyone using
Shrikanth Hegde <sshegde@linux.ibm.com> writes:
> On 6/19/24 8:10 AM, Ankur Arora wrote:
>
> ...
>
>>> Will try to shift to percpu based method and see. will get back if I can get that done successfully.
>>
>> Sounds good to me.
>>
>
> Did give it a try. Made the preempt count per-CPU by adding it as a paca field. Unfortunately it didn't
> improve the performance. It's more or less the same as preempt_auto.
>
> The issue still remains elusive. The likely crux is that somehow IPI interrupts and SOFTIRQs are increasing
> with preempt_auto. Doing some more data collection with perf/ftrace. Will share that soon.
True. But, just looking at IPC for now:
>> baseline 6.10-rc1:
>> ++++++++++++++++++
>> Performance counter stats for 'system wide' (20 runs):
>> 577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% )
>> 226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% )
>> preempt auto
>> Performance counter stats for 'system wide' (20 runs):
>> 700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% )
>> 254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% )
>> 42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% )
>> 231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% )
Not sure if comparing IPC is worthwhile given the substantially higher
number of instructions under execution (roughly 12% more instructions for about 21% more cycles). But, that is meaningfully worse.
This was also true on the 12 core system:
>> baseline 6.10-rc1:
>> Performance counter stats for 'system wide' (20 runs):
>> 412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% )
>> 192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% )
>> v2_preempt_auto
>> Performance counter stats for 'system wide' (20 runs):
>> 483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% )
>> 210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% )
Just to get rid of the preempt_auto aspect completely, maybe you could
try seeing what perf stat -d shows for:
CONFIG_PREEMPT vs CONFIG_PREEMPT_NONE vs (CONFIG_PREEMPT_DYNAMIC, preempt=none).
> This is the patch I tried, to make preempt_count per-CPU for powerpc. It boots and runs the workload.
> Implemented a simpler one instead of folding need-resched into the preempt count; in a hacky way I avoided
> the tif_need_resched() calls since that didn't affect the throughput. Hence kept it simple. Below is the patch
> for reference. It didn't help fix the regression, unless I implemented it wrongly.
>
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index 1d58da946739..374642288061 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -268,6 +268,7 @@ struct paca_struct {
> u16 slb_save_cache_ptr;
> #endif
> #endif /* CONFIG_PPC_BOOK3S_64 */
> + int preempt_count;
I don't know powerpc at all. But, would this cacheline be hotter
than current_thread_info()::preempt_count?
Thanks
Ankur
> #ifdef CONFIG_STACKPROTECTOR
> unsigned long canary;
> #endif
>
> ...
--
ankur
Ankur Arora <ankur.a.arora@oracle.com> writes:
> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>> ...
>> This was the patch which I tried to make it per cpu for powerpc: It boots and runs workload.
>> Implemented a simpler one instead of folding need resched into preempt count. By hacky way avoided
>> tif_need_resched calls as didnt affect the throughput. Hence kept it simple. Below is the patch
>> for reference. It didn't help fix the regression unless I implemented it wrongly.
>>
>> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
>> index 1d58da946739..374642288061 100644
>> --- a/arch/powerpc/include/asm/paca.h
>> +++ b/arch/powerpc/include/asm/paca.h
>> @@ -268,6 +268,7 @@ struct paca_struct {
>> u16 slb_save_cache_ptr;
>> #endif
>> #endif /* CONFIG_PPC_BOOK3S_64 */
>> + int preempt_count;
>
> I don't know powerpc at all. But, would this cacheline be hotter
> than current_thread_info()::preempt_count?
>
>> #ifdef CONFIG_STACKPROTECTOR
>> unsigned long canary;
>> #endif
Assuming stack protector is enabled (it is in defconfig), that cache
line should be quite hot, because the canary is loaded as part of the
epilogue of many functions.
Putting preempt_count in the paca also means it's a single load/store to
access the value, just paca (in r13) + static offset. With the
preempt_count in thread_info it's two loads, one to load current from
the paca and then another to get the preempt_count.
It could be worthwhile to move preempt_count into the paca, but I'm not
convinced preempt_count is accessed enough for it to be a major
performance issue.
cheers
On 6/27/24 11:26 AM, Michael Ellerman wrote:
> Ankur Arora <ankur.a.arora@oracle.com> writes:
>> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>>> ...
>>> This was the patch which I tried to make it per cpu for powerpc: It boots and runs workload.
>>> Implemented a simpler one instead of folding need resched into preempt count. By hacky way avoided
>>> tif_need_resched calls as didnt affect the throughput. Hence kept it simple. Below is the patch
>>> for reference. It didn't help fix the regression unless I implemented it wrongly.
>>>
>>> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
>>> index 1d58da946739..374642288061 100644
>>> --- a/arch/powerpc/include/asm/paca.h
>>> +++ b/arch/powerpc/include/asm/paca.h
>>> @@ -268,6 +268,7 @@ struct paca_struct {
>>> u16 slb_save_cache_ptr;
>>> #endif
>>> #endif /* CONFIG_PPC_BOOK3S_64 */
>>> + int preempt_count;
>>
>> I don't know powerpc at all. But, would this cacheline be hotter
>> than current_thread_info()::preempt_count?
>>
>>> #ifdef CONFIG_STACKPROTECTOR
>>> unsigned long canary;
>>> #endif
>
> Assuming stack protector is enabled (it is in defconfig), that cache
> line should quite be hot, because the canary is loaded as part of the
> epilogue of many functions.
Thanks Michael for taking a look at it.
Yes. CONFIG_STACKPROTECTOR=y
Which cacheline to put it in is still a question if we are going to pursue this.
> Putting preempt_count in the paca also means it's a single load/store to
> access the value, just paca (in r13) + static offset. With the
> preempt_count in thread_info it's two loads, one to load current from
> the paca and then another to get the preempt_count.
>
> It could be worthwhile to move preempt_count into the paca, but I'm not
> convinced preempt_count is accessed enough for it to be a major
> performance issue.
With PREEMPT_COUNT enabled, the count would be accessed on every preempt_enable/disable.
That means on every spin lock/unlock, get/set cpu etc. Those might be
quite frequent, no? But w.r.t. preempt auto it didn't change the performance per se.
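
(For reference, simplified excerpts showing why: with CONFIG_PREEMPTION, include/linux/preempt.h expands preempt_disable()/preempt_enable() as below, and the spinlock helpers in include/linux/spinlock_api_smp.h bracket every lock/unlock with exactly these, so each round trip does a count increment, a decrement and a need-resched test:)

#define preempt_disable() \
do { \
	preempt_count_inc(); \
	barrier(); \
} while (0)

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0)

/* ... and e.g. the unlock helper ends with exactly that: */
static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
	spin_release(&lock->dep_map, _RET_IP_);
	do_raw_spin_unlock(lock);
	preempt_enable();	/* count decrement + need-resched test on every unlock */
}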
>
> cheers
Shrikanth Hegde <sshegde@linux.ibm.com> writes:
> On 6/27/24 11:26 AM, Michael Ellerman wrote:
>> Ankur Arora <ankur.a.arora@oracle.com> writes:
>>> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>>>> ...
>>>> This was the patch which I tried to make it per cpu for powerpc: It boots and runs workload.
>>>> Implemented a simpler one instead of folding need resched into preempt count. By hacky way avoided
>>>> tif_need_resched calls as didnt affect the throughput. Hence kept it simple. Below is the patch
>>>> for reference. It didn't help fix the regression unless I implemented it wrongly.
>>>>
>>>> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
>>>> index 1d58da946739..374642288061 100644
>>>> --- a/arch/powerpc/include/asm/paca.h
>>>> +++ b/arch/powerpc/include/asm/paca.h
>>>> @@ -268,6 +268,7 @@ struct paca_struct {
>>>> u16 slb_save_cache_ptr;
>>>> #endif
>>>> #endif /* CONFIG_PPC_BOOK3S_64 */
>>>> + int preempt_count;
>>>
>>> I don't know powerpc at all. But, would this cacheline be hotter
>>> than current_thread_info()::preempt_count?
>>>
>>>> #ifdef CONFIG_STACKPROTECTOR
>>>> unsigned long canary;
>>>> #endif
>>
>> Assuming stack protector is enabled (it is in defconfig), that cache
>> line should quite be hot, because the canary is loaded as part of the
>> epilogue of many functions.
>
> Thanks Michael for taking a look at it.
>
> Yes. CONFIG_STACKPROTECTOR=y
> which cacheline is a question still if we are going to pursue this.
>> Putting preempt_count in the paca also means it's a single load/store to
>> access the value, just paca (in r13) + static offset. With the
>> preempt_count in thread_info it's two loads, one to load current from
>> the paca and then another to get the preempt_count.
>>
>> It could be worthwhile to move preempt_count into the paca, but I'm not
>> convinced preempt_count is accessed enough for it to be a major
>> performance issue.
Yeah, that makes sense. I'm working on making the x86 preempt_count
and related code similar to powerpc. Let's see how that does on x86.
> With PREEMPT_COUNT enabled, this would mean for every preempt_enable/disable.
> That means for every spin lock/unlock, get/set cpu etc. Those might be
> quite frequent. no? But w.r.t to preempt auto it didn't change the performance per se.
Yeah and you had mentioned that folding the NR bit (or not) doesn't
seem to matter either. Hackbench does a lot of remote wakeups, which
should mean that the target's thread_info::flags cacheline would be
bouncing around, so I would have imagined that that would be noticeable.
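
(For readers following along: "folding the NR bit" refers to the x86-style trick of keeping an inverted need-resched bit inside the preempt count itself, so that the preempt_enable() path collapses to a single decrement-and-test. A loose, untested sketch of what that could look like on top of the paca-based helpers in the patch above; the constant and behaviour mirror x86's PREEMPT_NEED_RESCHED and are not part of the posted patch:)

#define PREEMPT_NEED_RESCHED	0x80000000

static __always_inline void set_preempt_need_resched(void)
{
	/* the bit is stored inverted: clearing it means "resched needed" */
	WRITE_ONCE(local_paca->preempt_count,
		   READ_ONCE(local_paca->preempt_count) & ~PREEMPT_NEED_RESCHED);
}

static __always_inline bool __preempt_count_dec_and_test(void)
{
	int pc = READ_ONCE(local_paca->preempt_count) - 1;

	WRITE_ONCE(local_paca->preempt_count, pc);
	/* reaches zero only when the count is 0 _and_ the inverted bit is clear */
	return unlikely(!pc);
}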
--
ankur
On 7/3/24 10:57, Ankur Arora wrote:
>
> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>
Hi.
Sorry for the delayed response.
I could see this hackbench pipe regression with a preempt=full kernel on 6.10-rc as well, i.e. without PREEMPT_AUTO.
There seem to be more wakeups in the read path, which implies the pipe was more often empty. Correspondingly there is more contention
on the pipe mutex lock with preempt=full. But why, I am not sure. One difference on powerpc is the page size, but
here the pipe isn't getting full; it's not the write side that is blocked.
preempt=none: Time taken for 20 groups in seconds : 25.70
preempt=full: Time taken for 20 groups in seconds : 54.56
----------------
hackbench (pipe)
----------------
top 3 callstacks of __schedule collected with bpftrace.
preempt=none preempt=full
__schedule+12 |@[
schedule+64 | __schedule+12
interrupt_exit_user_prepare_main+600 | preempt_schedule+84
interrupt_exit_user_prepare+88 | _raw_spin_unlock_irqrestore+124
interrupt_return_srr_user+8 | __wake_up_sync_key+108
, hackbench]: 482228 | pipe_write+1772
@[ | vfs_write+1052
__schedule+12 | ksys_write+248
schedule+64 | system_call_exception+296
pipe_write+1452 | system_call_vectored_common+348
vfs_write+940 |, hackbench]: 538591
ksys_write+248 |@[
system_call_exception+292 | __schedule+12
system_call_vectored_common+348 | schedule+76
, hackbench]: 1427161 | schedule_preempt_disabled+52
@[ | __mutex_lock.constprop.0+1748
__schedule+12 | pipe_write+132
schedule+64 | vfs_write+1052
interrupt_exit_user_prepare_main+600 | ksys_write+248
syscall_exit_prepare+336 | system_call_exception+296
system_call_vectored_common+360 | system_call_vectored_common+348
, hackbench]: 8151309 |, hackbench]: 5388301
@[ |@[
__schedule+12 | __schedule+12
schedule+64 | schedule+76
pipe_read+1100 | pipe_read+1100
vfs_read+716 | vfs_read+716
ksys_read+252 | ksys_read+252
system_call_exception+292 | system_call_exception+296
system_call_vectored_common+348 | system_call_vectored_common+348
, hackbench]: 18132753 |, hackbench]: 64424110
--------------------------------------------
hackbench (messaging) - one that uses sockets
--------------------------------------------
Here there is no regression with preempt=full.
preempt=none: Time taken for 20 groups in seconds : 55.51
preempt=full: Time taken for 20 groups in seconds : 55.10
Similar bpftrace data was collected for the socket-based hackbench. The highest caller of __schedule doesn't change much.
preempt=none preempt=full
| __schedule+12
| preempt_schedule+84
| _raw_spin_unlock+108
@[ | unix_stream_sendmsg+660
__schedule+12 | sock_write_iter+372
schedule+64 | vfs_write+1052
schedule_timeout+412 | ksys_write+248
sock_alloc_send_pskb+684 | system_call_exception+296
unix_stream_sendmsg+448 | system_call_vectored_common+348
sock_write_iter+372 |, hackbench]: 819290
vfs_write+940 |@[
ksys_write+248 | __schedule+12
system_call_exception+292 | schedule+76
system_call_vectored_common+348 | schedule_timeout+476
, hackbench]: 3424197 | sock_alloc_send_pskb+684
@[ | unix_stream_sendmsg+444
__schedule+12 | sock_write_iter+372
schedule+64 | vfs_write+1052
interrupt_exit_user_prepare_main+600 | ksys_write+248
syscall_exit_prepare+336 | system_call_exception+296
system_call_vectored_common+360 | system_call_vectored_common+348
, hackbench]: 9800144 |, hackbench]: 3386594
@[ |@[
__schedule+12 | __schedule+12
schedule+64 | schedule+76
schedule_timeout+412 | schedule_timeout+476
unix_stream_data_wait+528 | unix_stream_data_wait+468
unix_stream_read_generic+872 | unix_stream_read_generic+804
unix_stream_recvmsg+196 | unix_stream_recvmsg+196
sock_recvmsg+164 | sock_recvmsg+156
sock_read_iter+200 | sock_read_iter+200
vfs_read+716 | vfs_read+716
ksys_read+252 | ksys_read+252
system_call_exception+292 | system_call_exception+296
system_call_vectored_common+348 | system_call_vectored_common+348
, hackbench]: 25375142 |, hackbench]: 27275685
On Mon, 12 Aug 2024 at 10:33, Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
>
> top 3 callstacks of __schedule collected with bpftrace.
>
> preempt=none preempt=full
>
> __schedule+12 |@[
> schedule+64 | __schedule+12
> interrupt_exit_user_prepare_main+600 | preempt_schedule+84
> interrupt_exit_user_prepare+88 | _raw_spin_unlock_irqrestore+124
> interrupt_return_srr_user+8 | __wake_up_sync_key+108
> , hackbench]: 482228 | pipe_write+1772
> @[ | vfs_write+1052
> __schedule+12 | ksys_write+248
> schedule+64 | system_call_exception+296
> pipe_write+1452 | system_call_vectored_common+348
> vfs_write+940 |, hackbench]: 538591
> ksys_write+248 |@[
> system_call_exception+292 | __schedule+12
> system_call_vectored_common+348 | schedule+76
> , hackbench]: 1427161 | schedule_preempt_disabled+52
> @[ | __mutex_lock.constprop.0+1748
> __schedule+12 | pipe_write+132
> schedule+64 | vfs_write+1052
> interrupt_exit_user_prepare_main+600 | ksys_write+248
> syscall_exit_prepare+336 | system_call_exception+296
> system_call_vectored_common+360 | system_call_vectored_common+348
> , hackbench]: 8151309 |, hackbench]: 5388301
> @[ |@[
> __schedule+12 | __schedule+12
> schedule+64 | schedule+76
> pipe_read+1100 | pipe_read+1100
> vfs_read+716 | vfs_read+716
> ksys_read+252 | ksys_read+252
> system_call_exception+292 | system_call_exception+296
> system_call_vectored_common+348 | system_call_vectored_common+348
> , hackbench]: 18132753 |, hackbench]: 64424110
>
So the pipe performance is very sensitive, partly because the pipe
overhead is normally very low.
So we've seen it in lots of benchmarks where the benchmark then gets
wildly different results depending on whether you get the good "optimal
pattern".
And I think your "preempt=none" pattern is the one you really want,
where all the pipe IO scheduling is basically done at exactly the
(optimized) pipe points, ie where the writer blocks because there is
no room (if it's a throughput benchmark), and the reader blocks
because there is no data (for the ping-pong or pipe ring latency
benchmarks).
And then when you get that "perfect" behavior, you typically also get
the best performance when all readers and all writers are on the same
CPU, so you get no unnecessary cache ping-pong either.
And that's a *very* typical pipe benchmark, where there are no costs
to generating the pipe data and no costs involved with consuming it
(ie the actual data isn't really *used* by the benchmark).
In real (non-benchmark) loads, you typically want to spread the
consumer and producer apart on different CPUs, so that the real load
then uses multiple CPUs on the data. But the benchmark case - having
no real data load - likes the "stay on the same CPU" thing.
Your traces for "preempt=none" very much look like that "both reader
and writer sleep synchronously" case, which is the optimal benchmark
case.
And then with "preempt=full", you see that "oh damn, reader and writer
actually hit the pipe mutex contention, because they are presumably
running at the same time on different CPUs, and didn't get into that
nice serial synchronous pattern. So now you not only have that mutex
overhead (which doesn't exist in the reader and writer synchronize),
you also end up with the cost of cache misses *and* the cost of
scheduling on two different CPU's where both of them basically go into
idle while waiting for the other end.
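The preempt_schedule+84 <- _raw_spin_unlock_irqrestore+124 <- __wake_up_sync_key+108 stack in the preempt=full
column is exactly that: the fully-preemptible kernel switches the writer out the moment it re-enables preemption
while dropping the wait-queue lock. A toy userspace model of that behaviour (not the kernel's actual
preempt_count machinery, just the shape of it):

  /* Toy model of why preempt=full reschedules at spin_unlock: re-enabling
   * preemption with NEED_RESCHED set calls into the scheduler right away. */
  #include <stdbool.h>
  #include <stdio.h>

  static int  preempt_count;        /* models the per-CPU preempt counter   */
  static bool need_resched;         /* models TIF_NEED_RESCHED on the waker */

  static void preempt_schedule(void)    /* stand-in for the real preempt_schedule() */
  {
          printf("__schedule() via preempt_schedule()\n");
  }

  static void preempt_disable(void) { preempt_count++; }

  static void preempt_enable(void)
  {
          /* preempt=full: dropping the last preempt_count level while
           * NEED_RESCHED is set preempts immediately. */
          if (--preempt_count == 0 && need_resched)
                  preempt_schedule();
  }

  int main(void)
  {
          preempt_disable();        /* e.g. taking the wait-queue spinlock          */
          need_resched = true;      /* waking the reader marks the CPU for resched  */
          preempt_enable();         /* spin_unlock path: writer is switched out     */
          return 0;
  }

Under preempt=none the same wakeup only sets the need-resched flag and the switch is deferred to the next return
to user space, which is why the left column's stacks all go through the syscall/interrupt exit paths instead.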
I'm not convinced this is solvable, because it really is an effect
that comes from "benchmarking is doing something odd that we
*shouldn't* generally optimize for".
I also absolutely detest the pipe mutex - 99% of what it protects
should be using either just atomic cmpxchg or possibly a spinlock, and
that's actually what the "use pipes for events" code does. However,
the actual honest user read()/write() code needs to do user space
accesses, and so it wants a sleeping lock.
We could - and probably at some point should - split the pipe mutex
into two: one that protects the writer side, one that protects the
reader side. Then with the common situation of a single reader and a
single writer, the mutex would never be contended. Then the rendezvous
between that "one reader" and "one writer" would be done using
atomics.
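A toy userspace sketch of that split (illustrative only, with made-up names; real pipe code would still have to
handle blocking, watch-queues and the user-copy faults mentioned above):

  /* Toy model of the "split the pipe mutex" idea: one lock serializes
   * writers, one serializes readers, and a single reader and single
   * writer rendezvous only through atomic head/tail counters. */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stddef.h>

  #define RING_SIZE 4096                      /* arbitrary for the sketch */

  struct split_pipe {
          pthread_mutex_t rd_lock;            /* serializes concurrent readers only */
          pthread_mutex_t wr_lock;            /* serializes concurrent writers only */
          _Atomic size_t  head;               /* advanced by the writer             */
          _Atomic size_t  tail;               /* advanced by the reader             */
          char            buf[RING_SIZE];
  };

  size_t split_pipe_write(struct split_pipe *p, const char *src, size_t len)
  {
          size_t head, tail, space, n;

          pthread_mutex_lock(&p->wr_lock);
          head  = atomic_load_explicit(&p->head, memory_order_relaxed);
          tail  = atomic_load_explicit(&p->tail, memory_order_acquire);
          space = RING_SIZE - (head - tail);
          n = len < space ? len : space;
          for (size_t i = 0; i < n; i++)
                  p->buf[(head + i) % RING_SIZE] = src[i];
          /* Publish the data before moving head. */
          atomic_store_explicit(&p->head, head + n, memory_order_release);
          pthread_mutex_unlock(&p->wr_lock);
          return n;                           /* 0 means "pipe full": caller would sleep */
  }

  size_t split_pipe_read(struct split_pipe *p, char *dst, size_t len)
  {
          size_t head, tail, avail, n;

          pthread_mutex_lock(&p->rd_lock);
          tail  = atomic_load_explicit(&p->tail, memory_order_relaxed);
          head  = atomic_load_explicit(&p->head, memory_order_acquire);
          avail = head - tail;
          n = len < avail ? len : avail;
          for (size_t i = 0; i < n; i++)
                  dst[i] = p->buf[(tail + i) % RING_SIZE];
          /* Free the space only after the data has been copied out. */
          atomic_store_explicit(&p->tail, tail + n, memory_order_release);
          pthread_mutex_unlock(&p->rd_lock);
          return n;                           /* 0 means "pipe empty": caller would sleep */
  }

With one reader and one writer, each side only ever takes its own lock, so the rendezvous happens purely through
the atomic head/tail and the sleeping-lock contention seen in the traces goes away.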
But it would be more complex, and it's already complicated by the
whole "you can also use pipes for atomic messaging for watch-queues".
Anyway, preempt=none has always excelled at certain things. This is
one of them.
Linus
Linus Torvalds <torvalds@linux-foundation.org> writes:

> [...]
Thanks. That was very clarifying. -- ankur