Hi,
This series adds a new scheduling model, PREEMPT_AUTO, which, like
PREEMPT_DYNAMIC, allows dynamic switching between the none/voluntary/full
preemption models. Unlike PREEMPT_DYNAMIC, it does not depend
on explicit preemption points for the voluntary models.
The series is based on Thomas' original proposal which he outlined
in [1], [2] and in his PoC [3].
v2 mostly reworks v1, one of the main changes being less
noisy need-resched-lazy interfaces.
More details in the changelog below.
The v1 of the series is at [4] and the RFC at [5].
Design
==
PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
PREEMPT_COUNT). This means that the scheduler can always safely
preempt. (This is identical to CONFIG_PREEMPT.)
Having that, the next step is to make the rescheduling policy dependent
on the chosen scheduling model. Currently, the scheduler uses a single
need-resched bit (TIF_NEED_RESCHED) to signal that a reschedule is needed.
PREEMPT_AUTO extends this by adding an additional need-resched bit
(TIF_NEED_RESCHED_LAZY) which, together with TIF_NEED_RESCHED, allows the
scheduler to express two kinds of rescheduling intent: schedule at
the earliest opportunity (TIF_NEED_RESCHED), or express a need for
rescheduling while allowing the task on the runqueue to run to
timeslice completion (TIF_NEED_RESCHED_LAZY).
The scheduler decides which need-resched bit to set based on
the preemption model in use:
                  TIF_NEED_RESCHED        TIF_NEED_RESCHED_LAZY

 none             never                   always [*]
 voluntary        higher sched class      other tasks [*]
 full             always                  never

 [*] some details elided.
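To make the policy table concrete, here is a minimal user-space sketch of the
bit-selection logic. The pick_resched_bit() helper and the PM_ / RESCHED_ names
are hypothetical, chosen only to mirror the table; the series itself wires the
equivalent policy into resched_curr() and the preempt=none/voluntary/full patches.

/*
 * Minimal user-space sketch of the policy table above -- not the series'
 * code. pick_resched_bit() and the PM_ / RESCHED_ names are hypothetical.
 */
#include <stdbool.h>
#include <stdio.h>

typedef enum { RESCHED_LAZY, RESCHED_NOW } resched_t;
enum preempt_model { PM_NONE, PM_VOLUNTARY, PM_FULL };

static resched_t pick_resched_bit(enum preempt_model model,
                                  bool higher_sched_class)
{
        switch (model) {
        case PM_NONE:
                return RESCHED_LAZY;    /* never eager; some details elided */
        case PM_VOLUNTARY:
                /* eager only for a higher scheduling class (RT, deadline) */
                return higher_sched_class ? RESCHED_NOW : RESCHED_LAZY;
        case PM_FULL:
        default:
                return RESCHED_NOW;     /* always eager */
        }
}

int main(void)
{
        printf("voluntary, RT task waiting -> %s\n",
               pick_resched_bit(PM_VOLUNTARY, true) == RESCHED_NOW ?
               "TIF_NEED_RESCHED" : "TIF_NEED_RESCHED_LAZY");
        return 0;
}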
The last part of the puzzle is when preemption happens, or, alternatively
stated, when the need-resched bits are checked:
                        exit-to-user    ret-to-kernel    preempt_count()

 NEED_RESCHED_LAZY           Y                N                 N
 NEED_RESCHED                Y                Y                 Y
Using NEED_RESCHED_LAZY allows for run-to-completion semantics when the
none/voluntary preemption policies are in effect, and eager semantics
under full preemption.
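A small, hypothetical sketch of the check-site table above; the real checks
live in the entry/irqentry code and the preempt_count() machinery, this only
models the idea:

/*
 * Illustrative sketch of the check sites in the table above (hypothetical
 * helper, not the entry-code implementation).
 */
#include <stdbool.h>

enum resched_site { EXIT_TO_USER, RET_TO_KERNEL, PREEMPT_COUNT_ZERO };

static bool should_reschedule(enum resched_site site,
                              bool need_resched, bool need_resched_lazy)
{
        if (need_resched)
                return true;                    /* honored at all three sites */
        if (need_resched_lazy)
                return site == EXIT_TO_USER;    /* lazy: only at user exit */
        return false;
}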
In addition, since this is driven purely by the scheduler (not
depending on cond_resched() placement and the like), there is enough
flexibility in the scheduler to cope with edge cases -- e.g. a kernel
task not relinquishing the CPU under NEED_RESCHED_LAZY can be handled by
simply upgrading to a full NEED_RESCHED, which can use more coercive
instruments like the resched IPI to induce a context switch.
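As a rough sketch of that upgrade path (the series handles this in the
tick-expiry patches; all names below are illustrative, not the kernel code):

/*
 * Hypothetical sketch of upgrading a stale lazy request to an eager one.
 * The series handles this in the scheduler tick path ("sched/fair: handle
 * tick expiry under lazy preemption"); this only models the idea.
 */
#include <stdbool.h>

struct task_state {
        bool need_resched;              /* TIF_NEED_RESCHED */
        bool need_resched_lazy;         /* TIF_NEED_RESCHED_LAZY */
        bool ran_full_tick_with_lazy;   /* didn't schedule despite lazy bit */
};

static void maybe_upgrade_lazy(struct task_state *t,
                               void (*send_resched_ipi)(void))
{
        if (t->need_resched_lazy && t->ran_full_tick_with_lazy) {
                t->need_resched = true;         /* coerce a context switch */
                t->need_resched_lazy = false;
                if (send_resched_ipi)
                        send_resched_ipi();     /* e.g. for a remote CPU */
        }
}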
Performance
==
The performance in the basic tests (perf bench sched messaging, kernbench,
cyclictest) matches or improves on what we see under PREEMPT_DYNAMIC.
(See patches
"sched: support preempt=none under PREEMPT_AUTO"
"sched: support preempt=full under PREEMPT_AUTO"
"sched: handle preempt=voluntary under PREEMPT_AUTO")
For a macro test, a colleague in Oracle's Exadata team tried two
OLTP benchmarks (on a 5.4.17-based Oracle kernel, with the v1 series
backported).
In both tests the data was cached on remote nodes (cells), and the
database nodes (compute) served client queries, with clients being
local in the first test and remote in the second.
Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
                                     PREEMPT_VOLUNTARY                    PREEMPT_AUTO
                                                                       (preempt=voluntary)
                           ==============================     =============================
            clients   throughput   cpu-usage                throughput   cpu-usage          Gain
                      (tx/min)     (utime %/stime %)         (tx/min)    (utime %/stime %)
            -------   ----------   -----------------        ----------   -----------------  -------

 OLTP            384    9,315,653   25/ 6                    9,253,252   25/ 6              -0.7%
 benchmark      1536   13,177,565   50/10                   13,657,306   50/10              +3.6%
 (local clients) 3456  14,063,017   63/12                   14,179,706   64/12              +0.8%

 OLTP             96    8,973,985   17/ 2                    8,924,926   17/ 2              -0.5%
 benchmark       384   22,577,254   60/ 8                   22,211,419   59/ 8              -1.6%
 (remote clients, 2304 25,882,857   82/11                   25,536,100   82/11              -1.3%
  90/10 RW ratio)
(Both sets of tests have a fair amount of network traffic since the query
tables etc. are cached on the cells. Additionally, the first set,
given the local clients, stresses the scheduler a bit more than the
second.)
The comparative performance for both tests is fairly close,
more or less within the margin of error.
Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):
"
a) Base kernel (6.7),
b) v1, PREEMPT_AUTO, preempt=voluntary
c) v1, PREEMPT_DYNAMIC, preempt=voluntary
d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
Workloads I tested and their %gain,
                 case b       case c       case d
 NAS             +2.7%        +1.9%        +2.1%
 Hashjoin        +0.0%        +0.0%        +0.0%
 Graph500        -6.0%        +0.0%        +0.0%
 XSBench         +1.7%        +0.0%        +1.2%
(Note about the Graph500 numbers at [8].)
Did kernbench etc test from Mel's mmtests suite also. Did not notice
much difference.
"
One case where there is a significant performance drop is on powerpc,
seen running hackbench on a 320-core system (a test on a smaller system is
fine). In theory there's no reason for this to only happen on powerpc
since most of the code is common, but I haven't been able to reproduce
it on x86 so far.
All in all, I think the tests above show that this scheduling model has legs.
However, the none/voluntary models under PREEMPT_AUTO are conceptually
different enough from the current none/voluntary models that there
likely are workloads where performance would be subpar. That needs more
extensive testing to figure out the weak points.
Series layout
==
Patches 1,2
"sched/core: Move preempt_model_*() helpers from sched.h to preempt.h"
"sched/core: Drop spinlocks on contention iff kernel is preemptible"
condition spin_needbreak() on the dynamic preempt_model_*().
Not really required but a useful bugfix for PREEMPT_DYNAMIC and PREEMPT_AUTO.
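For context, the idea behind the spin_needbreak() change can be sketched
roughly as below. This is a hypothetical standalone illustration, not the
kernel implementation; in the actual patches the check is the runtime
preempt_model_preemptible() query.

/*
 * Rough sketch of the spin_needbreak() fix: only ask a lock holder to drop
 * a contended lock when the kernel is preemptible at runtime -- otherwise
 * the waiter cannot preempt anyway and the holder just does wasted work.
 * Hypothetical standalone code, not the kernel implementation.
 */
#include <stdbool.h>

static bool kernel_is_preemptible;  /* stands in for preempt_model_preemptible() */

static bool lock_is_contended(void *lock)
{
        (void)lock;
        return false;               /* placeholder for a real contention check */
}

static bool spin_needbreak_sketch(void *lock)
{
        if (!kernel_is_preemptible)
                return false;
        return lock_is_contended(lock);
}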
Patch 3
"sched: make test_*_tsk_thread_flag() return bool"
is a minor cleanup.
Patch 4,
"preempt: introduce CONFIG_PREEMPT_AUTO"
introduces the new scheduling model.
Patches 5-7,
"thread_info: selector for TIF_NEED_RESCHED[_LAZY]"
"thread_info: define __tif_need_resched(resched_t)"
"sched: define *_tsk_need_resched_lazy() helpers"
introduce new thread_info/task helper interfaces or make changes to
pre-existing ones that will be used in the rest of the series.
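As a hedged sketch of what the selector interface might look like -- only
__tif_need_resched(resched_t) appears in the patch titles; the enum values
and bit numbers below are assumptions for illustration:

/*
 * Hedged sketch of a resched_t selector. The TIF_ bit numbers and the
 * NR_now/NR_lazy names are assumptions; only the idea of selecting between
 * TIF_NEED_RESCHED and TIF_NEED_RESCHED_LAZY comes from the series.
 */
#include <stdbool.h>

#define TIF_NEED_RESCHED        3       /* illustrative bit numbers */
#define TIF_NEED_RESCHED_LAZY   4

typedef enum { NR_now, NR_lazy } resched_t;

static inline int tif_resched_bit(resched_t rs)
{
        return rs == NR_now ? TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY;
}

/* Takes the flags word explicitly so the sketch stays self-contained. */
static inline bool __tif_need_resched(unsigned long ti_flags, resched_t rs)
{
        return ti_flags & (1UL << tif_resched_bit(rs));
}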
Patches 8-11,
"entry: handle lazy rescheduling at user-exit"
"entry/kvm: handle lazy rescheduling at guest-entry"
"entry: irqentry_exit only preempts for TIF_NEED_RESCHED"
"sched: __schedule_loop() doesn't need to check for need_resched_lazy()"
make changes/document the rescheduling points.
Patches 12-13,
"sched: separate PREEMPT_DYNAMIC config logic"
"sched: allow runtime config for PREEMPT_AUTO"
reuse the PREEMPT_DYNAMIC runtime configuration logic.
Patches 14-18,
"rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO"
"rcu: fix header guard for rcu_all_qs()"
"preempt,rcu: warn on PREEMPT_RCU=n, preempt=full"
"rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y"
"rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y"
add changes needed for RCU.
Patches 19-20,
"x86/thread_info: define TIF_NEED_RESCHED_LAZY"
"powerpc: add support for PREEMPT_AUTO"
add x86 and powerpc support.
Patches 21-24,
"sched: prepare for lazy rescheduling in resched_curr()"
"sched: default preemption policy for PREEMPT_AUTO"
"sched: handle idle preemption for PREEMPT_AUTO"
"sched: schedule eagerly in resched_cpu()"
are preparatory patches for adding PREEMPT_AUTO. Among other things
they add the default need-resched policy for !PREEMPT_AUTO,
PREEMPT_AUTO, and the idle task.
Patches 25-26,
"sched/fair: refactor update_curr(), entity_tick()",
"sched/fair: handle tick expiry under lazy preemption"
handle the 'hog' problem, where a kernel task does not voluntarily
schedule out.
Patches 27-29,
"sched: support preempt=none under PREEMPT_AUTO"
"sched: support preempt=full under PREEMPT_AUTO"
"sched: handle preempt=voluntary under PREEMPT_AUTO"
add support for the three preemption models.
Patches 30-33,
"sched: latency warn for TIF_NEED_RESCHED_LAZY",
"tracing: support lazy resched",
"Documentation: tracing: add TIF_NEED_RESCHED_LAZY",
"osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y"
handle remaining bits and pieces to do with TIF_NEED_RESCHED_LAZY.
And finally, patches 34-35,
"kconfig: decompose ARCH_NO_PREEMPT"
"arch: decompose ARCH_NO_PREEMPT"
decompose ARCH_NO_PREEMPT which might make it easier to support
CONFIG_PREEMPTION on some architectures.
Changelog
==
v2: rebased to v6.9, addresses review comments, folds in some other patches.
- the lazy interfaces are less noisy now: the current interfaces stay
unchanged so non-scheduler code doesn't need to change.
This also means that lazy preemption becomes a scheduler-internal detail,
which works well with the core idea of lazy scheduling.
(Mark Rutland, Thomas Gleixner)
- preempt=none model now respects the leftmost deadline (Juri Lelli)
- Add need-resched flag combination state in tracing headers (Steven Rostedt)
- Decompose ARCH_NO_PREEMPT
- Changes for RCU (and TASKS_RCU) will go in separately [6]
- spin_needbreak() should be conditioned on preempt_model_*() at
runtime (patches from Sean Christopherson [7])
- powerpc support from Shrikanth Hegde
RFC:
- Addresses review comments and is generally a more focused
version of the RFC.
- Lots of code reorganization.
- Bugfixes all over.
- need_resched() now only checks for TIF_NEED_RESCHED instead
of TIF_NEED_RESCHED|TIF_NEED_RESCHED_LAZY.
- set_nr_if_polling() now does not check for TIF_NEED_RESCHED_LAZY.
- Tighten idle related checks.
- RCU changes to force context-switches when a quiescent state is
urgently needed.
- Does not break live-patching anymore
Also at: github.com/terminus/linux preempt-v2
Please review.
Thanks
Ankur
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
[1] https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
[2] https://lore.kernel.org/lkml/87led2wdj0.ffs@tglx/
[3] https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
[4] https://lore.kernel.org/lkml/20240213055554.1802415-1-ankur.a.arora@oracle.com/
[5] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/
[6] https://lore.kernel.org/lkml/20240507093530.3043-1-urezki@gmail.com/
[7] https://lore.kernel.org/lkml/20240312193911.1796717-1-seanjc@google.com/
[8] https://lore.kernel.org/lkml/af122806-8325-4302-991f-9c0dc1857bfe@amd.com/
[9] https://lore.kernel.org/lkml/17cc54c4-2e75-4964-9155-84db081ce209@linux.ibm.com/
Ankur Arora (32):
sched: make test_*_tsk_thread_flag() return bool
preempt: introduce CONFIG_PREEMPT_AUTO
thread_info: selector for TIF_NEED_RESCHED[_LAZY]
thread_info: define __tif_need_resched(resched_t)
sched: define *_tsk_need_resched_lazy() helpers
entry: handle lazy rescheduling at user-exit
entry/kvm: handle lazy rescheduling at guest-entry
entry: irqentry_exit only preempts for TIF_NEED_RESCHED
sched: __schedule_loop() doesn't need to check for need_resched_lazy()
sched: separate PREEMPT_DYNAMIC config logic
sched: allow runtime config for PREEMPT_AUTO
rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO
rcu: fix header guard for rcu_all_qs()
preempt,rcu: warn on PREEMPT_RCU=n, preempt=full
rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y
rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y
x86/thread_info: define TIF_NEED_RESCHED_LAZY
sched: prepare for lazy rescheduling in resched_curr()
sched: default preemption policy for PREEMPT_AUTO
sched: handle idle preemption for PREEMPT_AUTO
sched: schedule eagerly in resched_cpu()
sched/fair: refactor update_curr(), entity_tick()
sched/fair: handle tick expiry under lazy preemption
sched: support preempt=none under PREEMPT_AUTO
sched: support preempt=full under PREEMPT_AUTO
sched: handle preempt=voluntary under PREEMPT_AUTO
sched: latency warn for TIF_NEED_RESCHED_LAZY
tracing: support lazy resched
Documentation: tracing: add TIF_NEED_RESCHED_LAZY
osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y
kconfig: decompose ARCH_NO_PREEMPT
arch: decompose ARCH_NO_PREEMPT
Sean Christopherson (2):
sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
sched/core: Drop spinlocks on contention iff kernel is preemptible
Shrikanth Hegde (1):
powerpc: add support for PREEMPT_AUTO
.../admin-guide/kernel-parameters.txt | 5 +-
Documentation/trace/ftrace.rst | 6 +-
arch/Kconfig | 7 +
arch/alpha/Kconfig | 3 +-
arch/hexagon/Kconfig | 3 +-
arch/m68k/Kconfig | 3 +-
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/thread_info.h | 5 +-
arch/powerpc/kernel/interrupt.c | 5 +-
arch/um/Kconfig | 3 +-
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 6 +-
include/linux/entry-common.h | 2 +-
include/linux/entry-kvm.h | 2 +-
include/linux/preempt.h | 43 ++-
include/linux/rcutree.h | 2 +-
include/linux/sched.h | 101 +++---
include/linux/spinlock.h | 14 +-
include/linux/thread_info.h | 71 +++-
include/linux/trace_events.h | 6 +-
init/Makefile | 1 +
kernel/Kconfig.preempt | 37 ++-
kernel/entry/common.c | 16 +-
kernel/entry/kvm.c | 4 +-
kernel/rcu/Kconfig | 2 +-
kernel/rcu/tree.c | 13 +-
kernel/rcu/tree_plugin.h | 11 +-
kernel/sched/core.c | 311 ++++++++++++------
kernel/sched/deadline.c | 9 +-
kernel/sched/debug.c | 13 +-
kernel/sched/fair.c | 56 ++--
kernel/sched/rt.c | 6 +-
kernel/sched/sched.h | 27 +-
kernel/trace/trace.c | 30 +-
kernel/trace/trace_osnoise.c | 22 +-
kernel/trace/trace_output.c | 16 +-
36 files changed, 598 insertions(+), 265 deletions(-)
--
2.31.1
On Mon, May 27, 2024, Ankur Arora wrote:
> Patches 1,2
>   "sched/core: Move preempt_model_*() helpers from sched.h to preempt.h"
>   "sched/core: Drop spinlocks on contention iff kernel is preemptible"
> condition spin_needbreak() on the dynamic preempt_model_*().

...

> Not really required but a useful bugfix for PREEMPT_DYNAMIC and PREEMPT_AUTO.

> Sean Christopherson (2):
>   sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
>   sched/core: Drop spinlocks on contention iff kernel is preemptible

Peter and/or Thomas, would it be possible to get these applied to tip-tree sooner
than later? They fix a real bug that affects KVM to varying degrees.
On Wed, Jun 05, 2024 at 08:44:50AM -0700, Sean Christopherson wrote:
> On Mon, May 27, 2024, Ankur Arora wrote:
> > Patches 1,2
> >   "sched/core: Move preempt_model_*() helpers from sched.h to preempt.h"
> >   "sched/core: Drop spinlocks on contention iff kernel is preemptible"
> > condition spin_needbreak() on the dynamic preempt_model_*().
>
> ...
>
> > Not really required but a useful bugfix for PREEMPT_DYNAMIC and PREEMPT_AUTO.
>
> > Sean Christopherson (2):
> >   sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
> >   sched/core: Drop spinlocks on contention iff kernel is preemptible
>
> Peter and/or Thomas, would it be possible to get these applied to tip-tree sooner
> than later? They fix a real bug that affects KVM to varying degrees.

It so happens I've queued them for sched/core earlier today (see queue/sched/core).
If the robot comes back happy, I'll push them into tip.

Thanks!
On 5/28/24 6:04 AM, Ankur Arora wrote:
> Hi,
>
> This series adds a new scheduling model PREEMPT_AUTO, which like
> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
> on explicit preemption points for the voluntary models.
>
> The series is based on Thomas' original proposal which he outlined
> in [1], [2] and in his PoC [3].
>
> v2 mostly reworks v1, with one of the main changes having less
> noisy need-resched-lazy related interfaces.
> More details in the changelog below.
>
Hi Ankur. Thanks for the series.
nit: had to manually apply patches 11, 12, 13 since they didn't apply cleanly on
tip/master and tip/sched/core. Mostly due to some word differences in the change.
tip/master was at:
commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
Merge: 5d145493a139 47ff30cc1be7
Author: Ingo Molnar <mingo@kernel.org>
Date: Tue May 28 12:44:26 2024 +0200
Merge branch into tip/master: 'x86/percpu'
> The v1 of the series is at [4] and the RFC at [5].
>
> Design
> ==
>
> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
> PREEMPT_COUNT). This means that the scheduler can always safely
> preempt. (This is identical to CONFIG_PREEMPT.)
>
> Having that, the next step is to make the rescheduling policy dependent
> on the chosen scheduling model. Currently, the scheduler uses a single
> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a
> reschedule is needed.
> PREEMPT_AUTO extends this by adding an additional need-resched bit
> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the
> scheduler to express two kinds of rescheduling intent: schedule at
> the earliest opportunity (TIF_NEED_RESCHED), or express a need for
> rescheduling while allowing the task on the runqueue to run to
> timeslice completion (TIF_NEED_RESCHED_LAZY).
>
> The scheduler decides which need-resched bits are chosen based on
> the preemption model in use:
>
> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY
>
> none never always [*]
> voluntary higher sched class other tasks [*]
> full always never
>
> [*] some details elided.
>
> The last part of the puzzle is, when does preemption happen, or
> alternately stated, when are the need-resched bits checked:
>
> exit-to-user ret-to-kernel preempt_count()
>
> NEED_RESCHED_LAZY Y N N
> NEED_RESCHED Y Y Y
>
> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
> none/voluntary preemption policies are in effect. And eager semantics
> under full preemption.
>
> In addition, since this is driven purely by the scheduler (not
> depending on cond_resched() placement and the like), there is enough
> flexibility in the scheduler to cope with edge cases -- ex. a kernel
> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by
> simply upgrading to a full NEED_RESCHED which can use more coercive
> instruments like resched IPI to induce a context-switch.
>
> Performance
> ==
> The performance in the basic tests (perf bench sched messaging, kernbench,
> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC.
> (See patches
> "sched: support preempt=none under PREEMPT_AUTO"
> "sched: support preempt=full under PREEMPT_AUTO"
> "sched: handle preempt=voluntary under PREEMPT_AUTO")
>
> For a macro test, a colleague in Oracle's Exadata team tried two
> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series
> backported.)
>
> In both tests the data was cached on remote nodes (cells), and the
> database nodes (compute) served client queries, with clients being
> local in the first test and remote in the second.
>
> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
>
>
> PREEMPT_VOLUNTARY PREEMPT_AUTO
> (preempt=voluntary)
> ============================== =============================
> clients throughput cpu-usage throughput cpu-usage Gain
> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %)
> ------- ---------- ----------------- ---------- ----------------- -------
>
>
> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7%
> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6%
> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8%
>
>
> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5%
> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6%
> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3%
> 90/10 RW ratio)
>
>
> (Both sets of tests have a fair amount of NW traffic since the query
> tables etc are cached on the cells. Additionally, the first set,
> given the local clients, stress the scheduler a bit more than the
> second.)
>
> The comparative performance for both the tests is fairly close,
> more or less within a margin of error.
>
> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):
>
> "
> a) Base kernel (6.7),
> b) v1, PREEMPT_AUTO, preempt=voluntary
> c) v1, PREEMPT_DYNAMIC, preempt=voluntary
> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
>
> Workloads I tested and their %gain,
> case b case c case d
> NAS +2.7% +1.9% +2.1%
> Hashjoin, +0.0% +0.0% +0.0%
> Graph500, -6.0% +0.0% +0.0%
> XSBench +1.7% +0.0% +1.2%
>
> (Note about the Graph500 numbers at [8].)
>
> Did kernbench etc test from Mel's mmtests suite also. Did not notice
> much difference.
> "
>
> One case where there is a significant performance drop is on powerpc,
> seen running hackbench on a 320 core system (a test on a smaller system is
> fine.) In theory there's no reason for this to only happen on powerpc
> since most of the code is common, but I haven't been able to reproduce
> it on x86 so far.
>
> All in all, I think the tests above show that this scheduling model has legs.
> However, the none/voluntary models under PREEMPT_AUTO are conceptually
> different enough from the current none/voluntary models that there
> likely are workloads where performance would be subpar. That needs more
> extensive testing to figure out the weak points.
>
>
>
Did test it again on PowerPC. Unfortunately the numbers show there is still a
regression compared to 6.10-rc1. This is done with preempt=none. I tried again
on the smaller system too to confirm. For now I have done the comparison for
hackbench, where the highest regression was seen in v1.

perf stat collected for 20 iterations shows higher context switches and higher
migrations. Could it be that the LAZY bit is causing more context switches? Or
could it be something else? Could it be that more exit-to-user transitions
happen on PowerPC? Will continue to debug.

Meanwhile, will do more tests with other micro-benchmarks and post the results.
More details below.
CONFIG_HZ = 100
./hackbench -pipe 60 process 100000 loops
====================================================================================
On the larger system (40 cores, 320 CPUs)
====================================================================================
                            6.10-rc1         +preempt_auto
                            preempt=none     preempt=none
 20 iterations avg value
 hackbench pipe(60)         26.403           32.368  ( -31.1%)
++++++++++++++++++
baseline 6.10-rc1:
++++++++++++++++++
Performance counter stats for 'system wide' (20 runs):
168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% )
6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% )
246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% )
1,759,232 page-faults # 19.716 /sec ( +- 6.61% )
577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% )
226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% )
37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% )
166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% )
26.403 +- 0.166 seconds time elapsed ( +- 0.63% )
++++++++++++
preempt auto
++++++++++++
Performance counter stats for 'system wide' (20 runs):
207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% )
9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% )
631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% )
1,756,583 page-faults # 16.112 /sec ( +- 6.59% )
700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% )
254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% )
42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% )
231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% )
32.368 +- 0.200 seconds time elapsed ( +- 0.62% )
============================================================================================
Smaller system (12 cores, 96 CPUs)
============================================================================================
                            6.10-rc1         +preempt_auto
                            preempt=none     preempt=none
 20 iterations avg value
 hackbench pipe(60)         55.930           65.75   ( -17.6%)
++++++++++++++++++
baseline 6.10-rc1:
++++++++++++++++++
Performance counter stats for 'system wide' (20 runs):
107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% )
1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% )
44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% )
1,698,710 page-faults # 30.010 /sec ( +- 6.58% )
412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% )
192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% )
30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% )
99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% )
55.930 +- 0.509 seconds time elapsed ( +- 0.91% )
+++++++++++++++++
v2_preempt_auto
+++++++++++++++++
Performance counter stats for 'system wide' (20 runs):
126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% )
2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% )
147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% )
1,710,637 page-faults # 25.593 /sec ( +- 6.55% )
483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% )
210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% )
33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% )
134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% )
65.75 +- 1.06 seconds time elapsed ( +- 1.61% )
Shrikanth Hegde <sshegde@linux.ibm.com> writes: > On 5/28/24 6:04 AM, Ankur Arora wrote: >> Hi, >> >> This series adds a new scheduling model PREEMPT_AUTO, which like >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full >> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend >> on explicit preemption points for the voluntary models. >> >> The series is based on Thomas' original proposal which he outlined >> in [1], [2] and in his PoC [3]. >> >> v2 mostly reworks v1, with one of the main changes having less >> noisy need-resched-lazy related interfaces. >> More details in the changelog below. >> > > Hi Ankur. Thanks for the series. > > nit: had to manually patch 11,12,13 since it didnt apply cleanly on > tip/master and tip/sched/core. Mostly due some word differences in the change. > > tip/master was at: > commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD) > Merge: 5d145493a139 47ff30cc1be7 > Author: Ingo Molnar <mingo@kernel.org> > Date: Tue May 28 12:44:26 2024 +0200 > > Merge branch into tip/master: 'x86/percpu' > > > >> The v1 of the series is at [4] and the RFC at [5]. >> >> Design >> == >> >> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus >> PREEMPT_COUNT). This means that the scheduler can always safely >> preempt. (This is identical to CONFIG_PREEMPT.) >> >> Having that, the next step is to make the rescheduling policy dependent >> on the chosen scheduling model. Currently, the scheduler uses a single >> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a >> reschedule is needed. >> PREEMPT_AUTO extends this by adding an additional need-resched bit >> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the >> scheduler to express two kinds of rescheduling intent: schedule at >> the earliest opportunity (TIF_NEED_RESCHED), or express a need for >> rescheduling while allowing the task on the runqueue to run to >> timeslice completion (TIF_NEED_RESCHED_LAZY). >> >> The scheduler decides which need-resched bits are chosen based on >> the preemption model in use: >> >> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY >> >> none never always [*] >> voluntary higher sched class other tasks [*] >> full always never >> >> [*] some details elided. >> >> The last part of the puzzle is, when does preemption happen, or >> alternately stated, when are the need-resched bits checked: >> >> exit-to-user ret-to-kernel preempt_count() >> >> NEED_RESCHED_LAZY Y N N >> NEED_RESCHED Y Y Y >> >> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when >> none/voluntary preemption policies are in effect. And eager semantics >> under full preemption. >> >> In addition, since this is driven purely by the scheduler (not >> depending on cond_resched() placement and the like), there is enough >> flexibility in the scheduler to cope with edge cases -- ex. a kernel >> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by >> simply upgrading to a full NEED_RESCHED which can use more coercive >> instruments like resched IPI to induce a context-switch. >> >> Performance >> == >> The performance in the basic tests (perf bench sched messaging, kernbench, >> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC. 
>> (See patches >> "sched: support preempt=none under PREEMPT_AUTO" >> "sched: support preempt=full under PREEMPT_AUTO" >> "sched: handle preempt=voluntary under PREEMPT_AUTO") >> >> For a macro test, a colleague in Oracle's Exadata team tried two >> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series >> backported.) >> >> In both tests the data was cached on remote nodes (cells), and the >> database nodes (compute) served client queries, with clients being >> local in the first test and remote in the second. >> >> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs) >> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs >> >> >> PREEMPT_VOLUNTARY PREEMPT_AUTO >> (preempt=voluntary) >> ============================== ============================= >> clients throughput cpu-usage throughput cpu-usage Gain >> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %) >> ------- ---------- ----------------- ---------- ----------------- ------- >> >> >> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7% >> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6% >> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8% >> >> >> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5% >> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6% >> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3% >> 90/10 RW ratio) >> >> >> (Both sets of tests have a fair amount of NW traffic since the query >> tables etc are cached on the cells. Additionally, the first set, >> given the local clients, stress the scheduler a bit more than the >> second.) >> >> The comparative performance for both the tests is fairly close, >> more or less within a margin of error. >> >> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM): >> >> " >> a) Base kernel (6.7), >> b) v1, PREEMPT_AUTO, preempt=voluntary >> c) v1, PREEMPT_DYNAMIC, preempt=voluntary >> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y >> >> Workloads I tested and their %gain, >> case b case c case d >> NAS +2.7% +1.9% +2.1% >> Hashjoin, +0.0% +0.0% +0.0% >> Graph500, -6.0% +0.0% +0.0% >> XSBench +1.7% +0.0% +1.2% >> >> (Note about the Graph500 numbers at [8].) >> >> Did kernbench etc test from Mel's mmtests suite also. Did not notice >> much difference. >> " >> >> One case where there is a significant performance drop is on powerpc, >> seen running hackbench on a 320 core system (a test on a smaller system is >> fine.) In theory there's no reason for this to only happen on powerpc >> since most of the code is common, but I haven't been able to reproduce >> it on x86 so far. >> >> All in all, I think the tests above show that this scheduling model has legs. >> However, the none/voluntary models under PREEMPT_AUTO are conceptually >> different enough from the current none/voluntary models that there >> likely are workloads where performance would be subpar. That needs more >> extensive testing to figure out the weak points. >> >> >> > Did test it again on PowerPC. Unfortunately numbers shows there is regression > still compared to 6.10-rc1. This is done with preempt=none. I tried again on the > smaller system too to confirm. For now I have done the comparison for the hackbench > where highest regression was seen in v1. > > perf stat collected for 20 iterations show higher context switch and higher migrations. > Could it be that LAZY bit is causing more context switches? or could it be something > else? Could it be that more exit-to-user happens in PowerPC? will continue to debug. 
Thanks for trying it out. As you point out, context-switches and migrations are signficantly higher. Definitely unexpected. I ran the same test on an x86 box (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference. 6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% ) 6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% ) 6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% ) 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% ) 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% ) 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% ) Clearly there's something different going on powerpc. I'm travelling right now, but will dig deeper into this once I get back. Meanwhile can you check if the increased context-switches are voluntary or involuntary (or what the division is)? Thanks Ankur > Meanwhile, will do more test with other micro-benchmarks and post the results. > > > More details below. > CONFIG_HZ = 100 > ./hackbench -pipe 60 process 100000 loops > > ==================================================================================== > On the larger system. (40 Cores, 320CPUS) > ==================================================================================== > 6.10-rc1 +preempt_auto > preempt=none preempt=none > 20 iterations avg value > hackbench pipe(60) 26.403 32.368 ( -31.1%) > > ++++++++++++++++++ > baseline 6.10-rc1: > ++++++++++++++++++ > Performance counter stats for 'system wide' (20 runs): > 168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% ) > 6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% ) > 246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% ) > 1,759,232 page-faults # 19.716 /sec ( +- 6.61% ) > 577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% ) > 226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% ) > 37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% ) > 166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% ) > > 26.403 +- 0.166 seconds time elapsed ( +- 0.63% ) > > ++++++++++++ > preempt auto > ++++++++++++ > Performance counter stats for 'system wide' (20 runs): > 207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% ) > 9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% ) > 631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% ) > 1,756,583 page-faults # 16.112 /sec ( +- 6.59% ) > 700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% ) > 254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% ) > 42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% ) > 231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% ) > > 32.368 +- 0.200 seconds time elapsed ( +- 0.62% ) > > > ============================================================================================ > Smaller system ( 12Cores, 96CPUS) > ============================================================================================ > 6.10-rc1 +preempt_auto > preempt=none preempt=none > 20 iterations avg value > hackbench pipe(60) 55.930 65.75 ( -17.6%) > > ++++++++++++++++++ > baseline 6.10-rc1: > ++++++++++++++++++ > Performance counter stats for 'system wide' (20 runs): > 107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% ) > 1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% ) > 44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% ) > 1,698,710 page-faults # 30.010 
/sec ( +- 6.58% ) > 412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% ) > 192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% ) > 30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% ) > 99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% ) > > 55.930 +- 0.509 seconds time elapsed ( +- 0.91% ) > > > +++++++++++++++++ > v2_preempt_auto > +++++++++++++++++ > Performance counter stats for 'system wide' (20 runs): > 126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% ) > 2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% ) > 147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% ) > 1,710,637 page-faults # 25.593 /sec ( +- 6.55% ) > 483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% ) > 210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% ) > 33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% ) > 134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% ) > > 65.75 +- 1.06 seconds time elapsed ( +- 1.61% ) So, the context-switches are meaningfully higher. -- ankur
On 6/1/24 5:17 PM, Ankur Arora wrote: > > Shrikanth Hegde <sshegde@linux.ibm.com> writes: > >> On 5/28/24 6:04 AM, Ankur Arora wrote: >>> Hi, >>> >>> This series adds a new scheduling model PREEMPT_AUTO, which like >>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full >>> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend >>> on explicit preemption points for the voluntary models. >>> >>> The series is based on Thomas' original proposal which he outlined >>> in [1], [2] and in his PoC [3]. >>> >>> v2 mostly reworks v1, with one of the main changes having less >>> noisy need-resched-lazy related interfaces. >>> More details in the changelog below. >>> >> >> Hi Ankur. Thanks for the series. >> >> nit: had to manually patch 11,12,13 since it didnt apply cleanly on >> tip/master and tip/sched/core. Mostly due some word differences in the change. >> >> tip/master was at: >> commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD) >> Merge: 5d145493a139 47ff30cc1be7 >> Author: Ingo Molnar <mingo@kernel.org> >> Date: Tue May 28 12:44:26 2024 +0200 >> >> Merge branch into tip/master: 'x86/percpu' >> >> >> >>> The v1 of the series is at [4] and the RFC at [5]. >>> >>> Design >>> == >>> >>> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus >>> PREEMPT_COUNT). This means that the scheduler can always safely >>> preempt. (This is identical to CONFIG_PREEMPT.) >>> >>> Having that, the next step is to make the rescheduling policy dependent >>> on the chosen scheduling model. Currently, the scheduler uses a single >>> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a >>> reschedule is needed. >>> PREEMPT_AUTO extends this by adding an additional need-resched bit >>> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the >>> scheduler to express two kinds of rescheduling intent: schedule at >>> the earliest opportunity (TIF_NEED_RESCHED), or express a need for >>> rescheduling while allowing the task on the runqueue to run to >>> timeslice completion (TIF_NEED_RESCHED_LAZY). >>> >>> The scheduler decides which need-resched bits are chosen based on >>> the preemption model in use: >>> >>> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY >>> >>> none never always [*] >>> voluntary higher sched class other tasks [*] >>> full always never >>> >>> [*] some details elided. >>> >>> The last part of the puzzle is, when does preemption happen, or >>> alternately stated, when are the need-resched bits checked: >>> >>> exit-to-user ret-to-kernel preempt_count() >>> >>> NEED_RESCHED_LAZY Y N N >>> NEED_RESCHED Y Y Y >>> >>> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when >>> none/voluntary preemption policies are in effect. And eager semantics >>> under full preemption. >>> >>> In addition, since this is driven purely by the scheduler (not >>> depending on cond_resched() placement and the like), there is enough >>> flexibility in the scheduler to cope with edge cases -- ex. a kernel >>> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by >>> simply upgrading to a full NEED_RESCHED which can use more coercive >>> instruments like resched IPI to induce a context-switch. >>> >>> Performance >>> == >>> The performance in the basic tests (perf bench sched messaging, kernbench, >>> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC. 
>>> (See patches >>> "sched: support preempt=none under PREEMPT_AUTO" >>> "sched: support preempt=full under PREEMPT_AUTO" >>> "sched: handle preempt=voluntary under PREEMPT_AUTO") >>> >>> For a macro test, a colleague in Oracle's Exadata team tried two >>> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series >>> backported.) >>> >>> In both tests the data was cached on remote nodes (cells), and the >>> database nodes (compute) served client queries, with clients being >>> local in the first test and remote in the second. >>> >>> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs) >>> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs >>> >>> >>> PREEMPT_VOLUNTARY PREEMPT_AUTO >>> (preempt=voluntary) >>> ============================== ============================= >>> clients throughput cpu-usage throughput cpu-usage Gain >>> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %) >>> ------- ---------- ----------------- ---------- ----------------- ------- >>> >>> >>> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7% >>> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6% >>> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8% >>> >>> >>> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5% >>> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6% >>> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3% >>> 90/10 RW ratio) >>> >>> >>> (Both sets of tests have a fair amount of NW traffic since the query >>> tables etc are cached on the cells. Additionally, the first set, >>> given the local clients, stress the scheduler a bit more than the >>> second.) >>> >>> The comparative performance for both the tests is fairly close, >>> more or less within a margin of error. >>> >>> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM): >>> >>> " >>> a) Base kernel (6.7), >>> b) v1, PREEMPT_AUTO, preempt=voluntary >>> c) v1, PREEMPT_DYNAMIC, preempt=voluntary >>> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y >>> >>> Workloads I tested and their %gain, >>> case b case c case d >>> NAS +2.7% +1.9% +2.1% >>> Hashjoin, +0.0% +0.0% +0.0% >>> Graph500, -6.0% +0.0% +0.0% >>> XSBench +1.7% +0.0% +1.2% >>> >>> (Note about the Graph500 numbers at [8].) >>> >>> Did kernbench etc test from Mel's mmtests suite also. Did not notice >>> much difference. >>> " >>> >>> One case where there is a significant performance drop is on powerpc, >>> seen running hackbench on a 320 core system (a test on a smaller system is >>> fine.) In theory there's no reason for this to only happen on powerpc >>> since most of the code is common, but I haven't been able to reproduce >>> it on x86 so far. >>> >>> All in all, I think the tests above show that this scheduling model has legs. >>> However, the none/voluntary models under PREEMPT_AUTO are conceptually >>> different enough from the current none/voluntary models that there >>> likely are workloads where performance would be subpar. That needs more >>> extensive testing to figure out the weak points. >>> >>> >>> >> Did test it again on PowerPC. Unfortunately numbers shows there is regression >> still compared to 6.10-rc1. This is done with preempt=none. I tried again on the >> smaller system too to confirm. For now I have done the comparison for the hackbench >> where highest regression was seen in v1. >> >> perf stat collected for 20 iterations show higher context switch and higher migrations. >> Could it be that LAZY bit is causing more context switches? or could it be something >> else? 
Could it be that more exit-to-user happens in PowerPC? will continue to debug. > > Thanks for trying it out. > > As you point out, context-switches and migrations are signficantly higher. > > Definitely unexpected. I ran the same test on an x86 box > (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference. > > 6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% ) > 6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% ) > 6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% ) > > 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% ) > 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% ) > 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% ) > > Clearly there's something different going on powerpc. I'm travelling > right now, but will dig deeper into this once I get back. > > Meanwhile can you check if the increased context-switches are voluntary or > involuntary (or what the division is)? Used "pidstat -w -p ALL 1 10" to capture 10 seconds data at 1 second interval for context switches per second while running "hackbench -pipe 60 process 100000 loops" preempt=none 6.10 preempt_auto ============================================================================= voluntary context switches 7632166.19 9391636.34(+23%) involuntary context switches 2305544.07 3527293.94(+53%) Numbers vary between multiple runs. But trend seems to be similar. Both the context switches increase involuntary seems to increase at higher rate. BTW, ran Unixbench as well. It shows slight regression. stress-ng numbers didn't seem conclusive. schench(old) showed slightly lower latency when the number of threads were low. at higher thread count showed higher tail latency. But it doesn't seem very convincing numbers. All these were done under preempt=none in both 6.10 and preempt_auto. Unixbench 6.10 preempt_auto ===================================================================== 1 X Execl Throughput : 5345.70, 5109.68(-4.42) 4 X Execl Throughput : 15610.54, 15087.92(-3.35) 1 X Pipe-based Context Switching : 183172.30, 177069.52(-3.33) 4 X Pipe-based Context Switching : 615471.66, 602773.74(-2.06) 1 X Process Creation : 10778.92, 10443.76(-3.11) 4 X Process Creation : 24327.06, 25150.42(+3.38) 1 X Shell Scripts (1 concurrent) : 10416.76, 10222.28(-1.87) 4 X Shell Scripts (1 concurrent) : 36051.00, 35206.90(-2.34) 1 X Shell Scripts (8 concurrent) : 5004.22, 4907.32(-1.94) 4 X Shell Scripts (8 concurrent) : 12676.08, 12418.18(-2.03) > > > Thanks > Ankur > >> Meanwhile, will do more test with other micro-benchmarks and post the results. >> >> >> More details below. >> CONFIG_HZ = 100 >> ./hackbench -pipe 60 process 100000 loops >> >> ==================================================================================== >> On the larger system. 
(40 Cores, 320CPUS) >> ==================================================================================== >> 6.10-rc1 +preempt_auto >> preempt=none preempt=none >> 20 iterations avg value >> hackbench pipe(60) 26.403 32.368 ( -31.1%) >> >> ++++++++++++++++++ >> baseline 6.10-rc1: >> ++++++++++++++++++ >> Performance counter stats for 'system wide' (20 runs): >> 168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% ) >> 6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% ) >> 246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% ) >> 1,759,232 page-faults # 19.716 /sec ( +- 6.61% ) >> 577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% ) >> 226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% ) >> 37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% ) >> 166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% ) >> >> 26.403 +- 0.166 seconds time elapsed ( +- 0.63% ) >> >> ++++++++++++ >> preempt auto >> ++++++++++++ >> Performance counter stats for 'system wide' (20 runs): >> 207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% ) >> 9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% ) >> 631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% ) >> 1,756,583 page-faults # 16.112 /sec ( +- 6.59% ) >> 700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% ) >> 254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% ) >> 42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% ) >> 231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% ) >> >> 32.368 +- 0.200 seconds time elapsed ( +- 0.62% ) >> >> >> ============================================================================================ >> Smaller system ( 12Cores, 96CPUS) >> ============================================================================================ >> 6.10-rc1 +preempt_auto >> preempt=none preempt=none >> 20 iterations avg value >> hackbench pipe(60) 55.930 65.75 ( -17.6%) >> >> ++++++++++++++++++ >> baseline 6.10-rc1: >> ++++++++++++++++++ >> Performance counter stats for 'system wide' (20 runs): >> 107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% ) >> 1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% ) >> 44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% ) >> 1,698,710 page-faults # 30.010 /sec ( +- 6.58% ) >> 412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% ) >> 192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% ) >> 30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% ) >> 99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% ) >> >> 55.930 +- 0.509 seconds time elapsed ( +- 0.91% ) >> >> >> +++++++++++++++++ >> v2_preempt_auto >> +++++++++++++++++ >> Performance counter stats for 'system wide' (20 runs): >> 126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% ) >> 2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% ) >> 147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% ) >> 1,710,637 page-faults # 25.593 /sec ( +- 6.55% ) >> 483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% ) >> 210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% ) >> 33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% ) >> 134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% ) >> >> 65.75 +- 1.06 seconds time elapsed ( +- 1.61% ) > > So, the context-switches are meaningfully higher. > > -- > ankur
On 6/4/24 1:02 PM, Shrikanth Hegde wrote:
>
> On 6/1/24 5:17 PM, Ankur Arora wrote:
>>
>> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>>
>>> On 5/28/24 6:04 AM, Ankur Arora wrote:
>>>> ...
>>>
>>> Did test it again on PowerPC. Unfortunately the numbers show there is still a regression
>>> compared to 6.10-rc1. This is done with preempt=none. I tried again on the
>>> smaller system too to confirm. For now I have done the comparison for the hackbench
>>> where the highest regression was seen in v1.
>>>
>>> perf stat collected for 20 iterations shows higher context switches and higher migrations.
>>> Could it be that the LAZY bit is causing more context switches? Or could it be something
>>> else? Could it be that more exit-to-user happens in PowerPC? Will continue to debug.
>>
>> Thanks for trying it out.
>>
>> As you point out, context-switches and migrations are significantly higher.
>>
>> Definitely unexpected. I ran the same test on an x86 box
>> (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.
>>
>> 6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% )
>> 6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% )
>> 6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% )
>>
>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% )
>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% )
>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% )
>>
>> Clearly there's something different going on powerpc. I'm travelling
>> right now, but will dig deeper into this once I get back.
>>
>> Meanwhile can you check if the increased context-switches are voluntary or
>> involuntary (or what the division is)?
>
> Used "pidstat -w -p ALL 1 10" to capture 10 seconds of data at 1 second intervals for
> context switches per second while running "hackbench -pipe 60 process 100000 loops"
>
>                                   preempt=none 6.10    preempt_auto
> =============================================================================
> voluntary context switches        7632166.19           9391636.34 (+23%)
> involuntary context switches      2305544.07           3527293.94 (+53%)
>
> Numbers vary between multiple runs, but the trend seems to be similar. Both kinds of
> context switches increase; the involuntary ones seem to increase at a higher rate.

Continued data from the hackbench regression. preempt=none in both the cases.
From mpstat, I see slightly higher idle time and more irq time with preempt_auto.

6.10-rc1:
=========
10:09:50 AM  CPU   %usr  %nice   %sys  %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
09:45:23 AM  all   4.14   0.00  77.57     0.00  16.92   0.00    0.00    0.00    0.00   1.37
09:45:24 AM  all   4.42   0.00  77.62     0.00  16.76   0.00    0.00    0.00    0.00   1.20
09:45:25 AM  all   4.43   0.00  77.45     0.00  16.94   0.00    0.00    0.00    0.00   1.18
09:45:26 AM  all   4.45   0.00  77.87     0.00  16.68   0.00    0.00    0.00    0.00   0.99

PREEMPT_AUTO:
===========
10:09:50 AM  CPU   %usr  %nice   %sys  %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
10:09:56 AM  all   3.11   0.00  72.59     0.00  21.34   0.00    0.00    0.00    0.00   2.96
10:09:57 AM  all   3.31   0.00  73.10     0.00  20.99   0.00    0.00    0.00    0.00   2.60
10:09:58 AM  all   3.40   0.00  72.83     0.00  20.85   0.00    0.00    0.00    0.00   2.92
10:10:00 AM  all   3.21   0.00  72.87     0.00  21.19   0.00    0.00    0.00    0.00   2.73
10:10:01 AM  all   3.02   0.00  72.18     0.00  21.08   0.00    0.00    0.00    0.00   3.71

Used bcc tools hardirq and softirq to see if irq are increasing. softirq implied there are more
timer,sched softirq. Numbers vary between different samples, but trend seems to be similar.

6.10-rc1:
=========
SOFTIRQ          TOTAL_usecs
tasklet                   71
block                    145
net_rx                  7914
rcu                   136988
timer                 304357
sched                1404497

PREEMPT_AUTO:
===========
SOFTIRQ          TOTAL_usecs
tasklet                   80
block                    139
net_rx                  6907
rcu                   223508
timer                 492767
sched                1794441

Would any specific setting of RCU matter for this?
This is what I have in the config.

# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_TREE_SRCU=y
CONFIG_NEED_SRCU_NMI_SAFE=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_NEED_TASKS_RCU=y
CONFIG_TASKS_RCU=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
# CONFIG_RCU_LAZY is not set
# end of RCU Subsystem

# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_CONTEXT_TRACKING_USER=y
# CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem
Shrikanth Hegde <sshegde@linux.ibm.com> writes:

> On 6/4/24 1:02 PM, Shrikanth Hegde wrote:
>> ...
>
> ...
>
> Used bcc tools hardirq and softirq to see if irq are increasing. softirq implied there are more
> timer,sched softirq. Numbers vary between different samples, but trend seems to be similar.

Yeah, the %sys is lower and %irq, higher. Can you also see where the
increased %irq is? For instance are the resched IPIs numbers greater?

> Would any specific setting of RCU matter for this?
> This is what I have in the config.

Don't see how it could matter unless the RCU settings are changing
between the two tests? In my testing I'm also using TREE_RCU=y,
PREEMPT_RCU=n.

Let me see if I can find a test which shows a similar trend to what you
are seeing. And, then maybe see if tracing sched-switch might point to
an interesting difference between x86 and powerpc.

Thanks for all the detail.

Ankur

> ...

--
ankur
On 6/10/24 12:53 PM, Ankur Arora wrote:
>
>> ...
>>
>> Used bcc tools hardirq and softirq to see if irq are increasing. softirq implied there are more
>> timer,sched softirq. Numbers vary between different samples, but trend seems to be similar.
>
> Yeah, the %sys is lower and %irq, higher. Can you also see where the
> increased %irq is? For instance are the resched IPIs numbers greater?

Hi Ankur,

Used mpstat -I ALL to capture this info for 20 seconds.

HARDIRQ per second:
===================
6.10:
===================
       18          19          22          23    48    49     50    51       LOC   BCT  LOC2     SPU   PMI   MCE   NMI      WDG        DBL
-------------------------------------------------------------------------------------------------------------------------------------------
417956.86  1114642.30  1712683.65  2058664.99  0.00  0.00  18.30  0.39  31978.37  0.00  0.35  351.98  0.00  0.00  0.00  6405.54  329189.45

Preempt_auto:
===================
       18          19          22          23    48    49     50    51       LOC   BCT  LOC2     SPU   PMI   MCE   NMI      WDG        DBL
-------------------------------------------------------------------------------------------------------------------------------------------
609509.69  1910413.99  1923503.52  2061876.33  0.00  0.00  19.14  0.30  31916.59  0.00  0.45  497.88  0.00  0.00  0.00  6825.49   88247.85

18,19,22,23 are called XIVE interrupts. These are IPI interrupts. I am not sure which type of IPI these are. Will have to see why they are increasing.

SOFTIRQ per second:
===================
6.10:
===================
  HI    TIMER  NET_TX  NET_RX  BLOCK  IRQ_POLL  TASKLET     SCHED  HRTIMER      RCU
0.00  3966.47    0.00   18.25   0.59      0.00     0.34  12811.00     0.00  9693.95

Preempt_auto:
===================
  HI    TIMER  NET_TX  NET_RX  BLOCK  IRQ_POLL  TASKLET     SCHED  HRTIMER       RCU
0.00  4871.67    0.00   18.94   0.40      0.00     0.25  13518.66     0.00  15732.77

Note: RCU softirq seems to increase significantly. Not sure which one triggers. still trying to figure out why.
It maybe irq triggering to softirq or softirq causing more IPI.

Also, noticed the below config difference which gets removed with preempt auto. This happens because PREEMPTION makes them N. Made the changes in kernel/Kconfig.locks to get them
enabled. I still see the same regression in hackbench. These configs still may need attention?

6.10                              |  preempt auto
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y   |  CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK=y       |  --------------------------------
CONFIG_INLINE_READ_UNLOCK_IRQ=y   |  --------------------------------
CONFIG_INLINE_WRITE_UNLOCK=y      |  --------------------------------
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y  |  --------------------------------

> ...
>
> --
> ankur
On 6/15/24 8:34 PM, Shrikanth Hegde wrote:
>
> On 6/10/24 12:53 PM, Ankur Arora wrote:
>> ...
>
> ...
>
> Note: RCU softirq seems to increase significantly. Not sure which one triggers. still trying to figure out why.
> It maybe irq triggering to softirq or softirq causing more IPI.
>
> Also, noticed the below config difference which gets removed with preempt auto. This happens because PREEMPTION makes them N. Made the changes in kernel/Kconfig.locks to get them
> enabled. I still see the same regression in hackbench. These configs still may need attention?
>
> ...

Did an experiment keeping the number of CPUs constant, while changing the number of sockets they span across.
When all CPUs belong to the same socket, there is no regression w.r.t. PREEMPT_AUTO. Regression starts when the CPUs start
spanning across sockets.

Since Preempt auto by default enables preempt count, I think that may cause the regression. I see Powerpc uses generic implementation
which may not scale well.

Will try to shift to percpu based method and see. will get back if I can get that done successfully.
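(For context, the "generic implementation" referred to above lives in include/asm-generic/preempt.h and keeps the count in thread_info. A simplified sketch, with details varying by kernel version:)

/* Simplified sketch of include/asm-generic/preempt.h (thread_info based). */
static __always_inline volatile int *preempt_count_ptr(void)
{
	/* current_thread_info() itself needs a load before the field can be read */
	return &current_thread_info()->preempt_count;
}

static __always_inline int preempt_count(void)
{
	return READ_ONCE(current_thread_info()->preempt_count);
}

static __always_inline void __preempt_count_add(int val)
{
	*preempt_count_ptr() += val;
}

static __always_inline void __preempt_count_sub(int val)
{
	*preempt_count_ptr() -= val;
}

(So every add/sub goes through current_thread_info(), i.e. a dependent load before touching the field; that is the access pattern being contrasted with a fixed offset off the paca further down the thread.)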
Shrikanth Hegde <sshegde@linux.ibm.com> writes:

> On 6/15/24 8:34 PM, Shrikanth Hegde wrote:
>> ...
>
> Did an experiment keeping the number of CPUs constant, while changing the number of sockets they span across.
> When all CPUs belong to the same socket, there is no regression w.r.t. PREEMPT_AUTO. Regression starts when the CPUs start
> spanning across sockets.

Ah. That's really interesting. So, upto 160 CPUs was okay?

> Since Preempt auto by default enables preempt count, I think that may cause the regression. I see Powerpc uses generic implementation
> which may not scale well.

Yeah this would explain why I don't see similar behaviour on a 384 CPU
x86 box.

Also, IIRC the powerpc numbers on preempt=full were significantly worse
than preempt=none. That test might also be worth doing once you have the
percpu based method working.

> Will try to shift to percpu based method and see. will get back if I can get that done successfully.

Sounds good to me.

Thanks
Ankur
On 6/19/24 8:10 AM, Ankur Arora wrote:
>>>
>>> SOFTIRQ per second:
>>> ===================
>>> 6.10:
>>> ===================
>>> HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
>>> 0.00 3966.47 0.00 18.25 0.59 0.00 0.34 12811.00 0.00 9693.95
>>>
>>> Preempt_auto:
>>> ===================
>>> HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
>>> 0.00 4871.67 0.00 18.94 0.40 0.00 0.25 13518.66 0.00 15732.77
>>>
>>> Note: RCU softirq seems to increase significantly. Not sure which one triggers. still trying to figure out why.
>>> It maybe irq triggering to softirq or softirq causing more IPI.
>>
>>> Did an experiment keeping the number of CPUs constant, while changing the number of sockets they span across.
>>> When all CPUs belong to the same socket, there is no regression w.r.t. PREEMPT_AUTO. Regression starts when the CPUs start
>>> spanning across sockets.
>
> Ah. That's really interesting. So, upto 160 CPUs was okay?
No. In both cases the CPUs are limited to 96. In one case they are within a single NUMA node and in the other case they are spread across two NUMA nodes.
>
>> Since Preempt auto by default enables preempt count, I think that may cause the regression. I see Powerpc uses generic implementation
>> which may not scale well.
>
> Yeah this would explain why I don't see similar behaviour on a 384 CPU
> x86 box.
>
> Also, IIRC the powerpc numbers on preempt=full were significantly worse
> than preempt=none. That test might also be worth doing once you have the
> percpu based method working.
>
>> Will try to shift to percpu based method and see. will get back if I can get that done successfully.
>
> Sounds good to me.
>
Did give it a try. Made the preempt count per-CPU by adding it as a paca field. Unfortunately it didn't
improve the performance. It's more or less the same as preempt_auto.

The issue still remains elusive. The likely crux is that somehow IPI interrupts and SOFTIRQs are increasing
with preempt_auto. Doing some more data collection with perf/ftrace. Will share that soon.
This is the patch I tried, to make preempt_count per-CPU for powerpc. It boots and runs the workload.
Implemented a simpler one instead of folding need-resched into the preempt count; in a hacky way I avoided
the tif_need_resched() calls since that didn't affect the throughput. Hence kept it simple. Below is the patch
for reference. It didn't help fix the regression, unless I implemented it wrongly.
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 1d58da946739..374642288061 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -268,6 +268,7 @@ struct paca_struct {
u16 slb_save_cache_ptr;
#endif
#endif /* CONFIG_PPC_BOOK3S_64 */
+ int preempt_count;
#ifdef CONFIG_STACKPROTECTOR
unsigned long canary;
#endif
diff --git a/arch/powerpc/include/asm/preempt.h b/arch/powerpc/include/asm/preempt.h
new file mode 100644
index 000000000000..406dad1a0cf6
--- /dev/null
+++ b/arch/powerpc/include/asm/preempt.h
@@ -0,0 +1,106 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_PREEMPT_H
+#define __ASM_PREEMPT_H
+
+#include <linux/thread_info.h>
+
+#ifdef CONFIG_PPC64
+#include <asm/paca.h>
+#endif
+#include <asm/percpu.h>
+#include <asm/smp.h>
+
+#define PREEMPT_ENABLED (0)
+
+/*
+ * We mask the PREEMPT_NEED_RESCHED bit so as not to confuse all current users
+ * that think a non-zero value indicates we cannot preempt.
+ */
+static __always_inline int preempt_count(void)
+{
+ return READ_ONCE(local_paca->preempt_count);
+}
+
+static __always_inline void preempt_count_set(int pc)
+{
+ WRITE_ONCE(local_paca->preempt_count, pc);
+}
+
+/*
+ * must be macros to avoid header recursion hell
+ */
+#define init_task_preempt_count(p) do { } while (0)
+
+#define init_idle_preempt_count(p, cpu) do { } while (0)
+
+static __always_inline void set_preempt_need_resched(void)
+{
+}
+
+static __always_inline void clear_preempt_need_resched(void)
+{
+}
+
+static __always_inline bool test_preempt_need_resched(void)
+{
+ return false;
+}
+
+/*
+ * The various preempt_count add/sub methods
+ */
+
+static __always_inline void __preempt_count_add(int val)
+{
+ preempt_count_set(preempt_count() + val);
+}
+
+static __always_inline void __preempt_count_sub(int val)
+{
+ preempt_count_set(preempt_count() - val);
+}
+
+static __always_inline bool __preempt_count_dec_and_test(void)
+{
+ /*
+ * Because of load-store architectures cannot do per-cpu atomic
+ * operations; we cannot use PREEMPT_NEED_RESCHED because it might get
+ * lost.
+ */
+ preempt_count_set(preempt_count() - 1);
+ if (preempt_count() == 0 && tif_need_resched())
+ return true;
+ else
+ return false;
+}
+
+/*
+ * Returns true when we need to resched and can (barring IRQ state).
+ */
+static __always_inline bool should_resched(int preempt_offset)
+{
+ return unlikely(preempt_count() == preempt_offset && tif_need_resched());
+}
+
+//EXPORT_SYMBOL(per_cpu_preempt_count);
+
+#ifdef CONFIG_PREEMPTION
+extern asmlinkage void preempt_schedule(void);
+extern asmlinkage void preempt_schedule_notrace(void);
+
+#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
+
+void dynamic_preempt_schedule(void);
+void dynamic_preempt_schedule_notrace(void);
+#define __preempt_schedule() dynamic_preempt_schedule()
+#define __preempt_schedule_notrace() dynamic_preempt_schedule_notrace()
+
+#else /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_KEY*/
+
+#define __preempt_schedule() preempt_schedule()
+#define __preempt_schedule_notrace() preempt_schedule_notrace()
+
+#endif /* CONFIG_PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_KEY*/
+#endif /* CONFIG_PREEMPTION */
+
+#endif /* __ASM_PREEMPT_H */
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 0d170e2be2b6..bf2199384751 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -52,8 +52,8 @@
* low level task data.
*/
struct thread_info {
- int preempt_count; /* 0 => preemptable,
- <0 => BUG */
+ //int preempt_count; // 0 => preemptable,
+ // <0 => BUG
#ifdef CONFIG_SMP
unsigned int cpu;
#endif
@@ -77,7 +77,6 @@ struct thread_info {
*/
#define INIT_THREAD_INFO(tsk) \
{ \
- .preempt_count = INIT_PREEMPT_COUNT, \
.flags = 0, \
}
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 7502066c3c53..f90245b8359f 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -204,6 +204,7 @@ void __init initialise_paca(struct paca_struct *new_paca, int cpu)
#ifdef CONFIG_PPC_64S_HASH_MMU
new_paca->slb_shadow_ptr = NULL;
#endif
+ new_paca->preempt_count = PREEMPT_DISABLED;
#ifdef CONFIG_PPC_BOOK3E_64
/* For now -- if we have threads this will be adjusted later */
diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c
index 85050be08a23..2adab682aab9 100644
--- a/arch/powerpc/kexec/core_64.c
+++ b/arch/powerpc/kexec/core_64.c
@@ -33,6 +33,8 @@
#include <asm/ultravisor.h>
#include <asm/crashdump-ppc64.h>
+#include <linux/percpu-defs.h>
+
int machine_kexec_prepare(struct kimage *image)
{
int i;
@@ -324,7 +326,7 @@ void default_machine_kexec(struct kimage *image)
* XXX: the task struct will likely be invalid once we do the copy!
*/
current_thread_info()->flags = 0;
- current_thread_info()->preempt_count = HARDIRQ_OFFSET;
+ local_paca->preempt_count = HARDIRQ_OFFSET;
/* We need a static PACA, too; copy this CPU's PACA over and switch to
* it. Also poison per_cpu_offset and NULL lppaca to catch anyone using
Shrikanth Hegde <sshegde@linux.ibm.com> writes:
> On 6/19/24 8:10 AM, Ankur Arora wrote:
>
> ...
>
>>> Will try to shift to percpu based method and see. will get back if I can get that done successfully.
>>
>> Sounds good to me.
>>
>
> Did give it a try. Made the preempt count per-CPU by adding it as a paca field. Unfortunately it didn't
> improve the performance. It's more or less the same as preempt_auto.
>
> The issue still remains elusive. The likely crux is that somehow IPI interrupts and SOFTIRQs are increasing
> with preempt_auto. Doing some more data collection with perf/ftrace. Will share that soon.
True. But, just looking at IPC for now:
>> baseline 6.10-rc1:
>> ++++++++++++++++++
>> Performance counter stats for 'system wide' (20 runs):
>> 577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% )
>> 226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% )
>> preempt auto
>> Performance counter stats for 'system wide' (20 runs):
>> 700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% )
>> 254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% )
>> 42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% )
>> 231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% )
Not sure if comparing IPC is worthwhile given the substantially higher
number of instructions under execution (roughly 12% more instructions for about 21% more cycles). But, that is meaningfully worse.
This was also true on the 12 core system:
>> baseline 6.10-rc1:
>> Performance counter stats for 'system wide' (20 runs):
>> 412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% )
>> 192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% )
>> v2_preempt_auto
>> Performance counter stats for 'system wide' (20 runs):
>> 483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% )
>> 210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% )
Just to get rid of the preempt_auto aspect completely, maybe you could
try seeing what perf stat -d shows for:
CONFIG_PREEMPT vs CONFIG_PREEMPT_NONE vs (CONFIG_PREEMPT_DYNAMIC, preempt=none).
> This is the patch I tried, to make preempt_count per-CPU for powerpc. It boots and runs the workload.
> Implemented a simpler one instead of folding need-resched into the preempt count; in a hacky way I avoided
> the tif_need_resched() calls since that didn't affect the throughput. Hence kept it simple. Below is the patch
> for reference. It didn't help fix the regression, unless I implemented it wrongly.
>
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index 1d58da946739..374642288061 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -268,6 +268,7 @@ struct paca_struct {
> u16 slb_save_cache_ptr;
> #endif
> #endif /* CONFIG_PPC_BOOK3S_64 */
> + int preempt_count;
I don't know powerpc at all. But, would this cacheline be hotter
than current_thread_info()::preempt_count?
Thanks
Ankur
> #ifdef CONFIG_STACKPROTECTOR
> unsigned long canary;
> #endif
>
> ...
--
ankur
Ankur Arora <ankur.a.arora@oracle.com> writes:
> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>> ...
>> This was the patch which I tried to make it per cpu for powerpc: It boots and runs workload.
>> Implemented a simpler one instead of folding need resched into preempt count. By hacky way avoided
>> tif_need_resched calls as didnt affect the throughput. Hence kept it simple. Below is the patch
>> for reference. It didn't help fix the regression unless I implemented it wrongly.
>>
>> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
>> index 1d58da946739..374642288061 100644
>> --- a/arch/powerpc/include/asm/paca.h
>> +++ b/arch/powerpc/include/asm/paca.h
>> @@ -268,6 +268,7 @@ struct paca_struct {
>> u16 slb_save_cache_ptr;
>> #endif
>> #endif /* CONFIG_PPC_BOOK3S_64 */
>> + int preempt_count;
>
> I don't know powerpc at all. But, would this cacheline be hotter
> than current_thread_info()::preempt_count?
>
>> #ifdef CONFIG_STACKPROTECTOR
>> unsigned long canary;
>> #endif
Assuming stack protector is enabled (it is in defconfig), that cache
line should be quite hot, because the canary is loaded as part of the
epilogue of many functions.
Putting preempt_count in the paca also means it's a single load/store to
access the value, just paca (in r13) + static offset. With the
preempt_count in thread_info it's two loads, one to load current from
the paca and then another to get the preempt_count.
It could be worthwhile to move preempt_count into the paca, but I'm not
convinced preempt_count is accessed enough for it to be a major
performance issue.
cheers
On 6/27/24 11:26 AM, Michael Ellerman wrote:
> Ankur Arora <ankur.a.arora@oracle.com> writes:
>> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>>> ...
>>> This was the patch which I tried to make it per cpu for powerpc: It boots and runs workload.
>>> Implemented a simpler one instead of folding need resched into preempt count. By hacky way avoided
>>> tif_need_resched calls as didnt affect the throughput. Hence kept it simple. Below is the patch
>>> for reference. It didn't help fix the regression unless I implemented it wrongly.
>>>
>>> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
>>> index 1d58da946739..374642288061 100644
>>> --- a/arch/powerpc/include/asm/paca.h
>>> +++ b/arch/powerpc/include/asm/paca.h
>>> @@ -268,6 +268,7 @@ struct paca_struct {
>>> u16 slb_save_cache_ptr;
>>> #endif
>>> #endif /* CONFIG_PPC_BOOK3S_64 */
>>> + int preempt_count;
>>
>> I don't know powerpc at all. But, would this cacheline be hotter
>> than current_thread_info()::preempt_count?
>>
>>> #ifdef CONFIG_STACKPROTECTOR
>>> unsigned long canary;
>>> #endif
>
> Assuming stack protector is enabled (it is in defconfig), that cache
> line should quite be hot, because the canary is loaded as part of the
> epilogue of many functions.
Thanks Michael for taking a look at it.
Yes. CONFIG_STACKPROTECTOR=y
Which cacheline to put it in is still a question if we are going to pursue this.
> Putting preempt_count in the paca also means it's a single load/store to
> access the value, just paca (in r13) + static offset. With the
> preempt_count in thread_info it's two loads, one to load current from
> the paca and then another to get the preempt_count.
>
> It could be worthwhile to move preempt_count into the paca, but I'm not
> convinced preempt_count is accessed enough for it to be a major
> performance issue.
With PREEMPT_COUNT enabled, the count would be accessed on every preempt_enable/disable.
That means on every spin lock/unlock, get/set cpu etc. Those might be
quite frequent, no? But w.r.t. preempt auto it didn't change the performance per se.
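
(For reference, simplified excerpts showing why: with CONFIG_PREEMPTION, include/linux/preempt.h expands preempt_disable()/preempt_enable() as below, and the spinlock helpers in include/linux/spinlock_api_smp.h bracket every lock/unlock with exactly these, so each round trip does a count increment, a decrement and a need-resched test:)

#define preempt_disable() \
do { \
	preempt_count_inc(); \
	barrier(); \
} while (0)

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0)

/* ... and e.g. the unlock helper ends with exactly that: */
static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
	spin_release(&lock->dep_map, _RET_IP_);
	do_raw_spin_unlock(lock);
	preempt_enable();	/* count decrement + need-resched test on every unlock */
}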
>
> cheers
Shrikanth Hegde <sshegde@linux.ibm.com> writes:
> On 6/27/24 11:26 AM, Michael Ellerman wrote:
>> Ankur Arora <ankur.a.arora@oracle.com> writes:
>>> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>>>> ...
>>>> This was the patch which I tried to make it per cpu for powerpc: It boots and runs workload.
>>>> Implemented a simpler one instead of folding need resched into preempt count. By hacky way avoided
>>>> tif_need_resched calls as didnt affect the throughput. Hence kept it simple. Below is the patch
>>>> for reference. It didn't help fix the regression unless I implemented it wrongly.
>>>>
>>>> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
>>>> index 1d58da946739..374642288061 100644
>>>> --- a/arch/powerpc/include/asm/paca.h
>>>> +++ b/arch/powerpc/include/asm/paca.h
>>>> @@ -268,6 +268,7 @@ struct paca_struct {
>>>> u16 slb_save_cache_ptr;
>>>> #endif
>>>> #endif /* CONFIG_PPC_BOOK3S_64 */
>>>> + int preempt_count;
>>>
>>> I don't know powerpc at all. But, would this cacheline be hotter
>>> than current_thread_info()::preempt_count?
>>>
>>>> #ifdef CONFIG_STACKPROTECTOR
>>>> unsigned long canary;
>>>> #endif
>>
>> Assuming stack protector is enabled (it is in defconfig), that cache
>> line should quite be hot, because the canary is loaded as part of the
>> epilogue of many functions.
>
> Thanks Michael for taking a look at it.
>
> Yes. CONFIG_STACKPROTECTOR=y
> which cacheline is a question still if we are going to pursue this.
>> Putting preempt_count in the paca also means it's a single load/store to
>> access the value, just paca (in r13) + static offset. With the
>> preempt_count in thread_info it's two loads, one to load current from
>> the paca and then another to get the preempt_count.
>>
>> It could be worthwhile to move preempt_count into the paca, but I'm not
>> convinced preempt_count is accessed enough for it to be a major
>> performance issue.
Yeah, that makes sense. I'm working on making the x86 preempt_count
and related code similar to powerpc. Let's see how that does on x86.
> With PREEMPT_COUNT enabled, this would mean for every preempt_enable/disable.
> That means for every spin lock/unlock, get/set cpu etc. Those might be
> quite frequent. no? But w.r.t to preempt auto it didn't change the performance per se.
Yeah and you had mentioned that folding the NR bit (or not) doesn't
seem to matter either. Hackbench does a lot of remote wakeups, which
should mean that the target's thread_info::flags cacheline would be
bouncing around, so I would have imagined that that would be noticeable.
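
(For readers following along: "folding the NR bit" refers to the x86-style trick of keeping an inverted need-resched bit inside the preempt count itself, so that the preempt_enable() path collapses to a single decrement-and-test. A loose, untested sketch of what that could look like on top of the paca-based helpers in the patch above; the constant and behaviour mirror x86's PREEMPT_NEED_RESCHED and are not part of the posted patch:)

#define PREEMPT_NEED_RESCHED	0x80000000

static __always_inline void set_preempt_need_resched(void)
{
	/* the bit is stored inverted: clearing it means "resched needed" */
	WRITE_ONCE(local_paca->preempt_count,
		   READ_ONCE(local_paca->preempt_count) & ~PREEMPT_NEED_RESCHED);
}

static __always_inline bool __preempt_count_dec_and_test(void)
{
	int pc = READ_ONCE(local_paca->preempt_count) - 1;

	WRITE_ONCE(local_paca->preempt_count, pc);
	/* reaches zero only when the count is 0 _and_ the inverted bit is clear */
	return unlikely(!pc);
}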
--
ankur
On 7/3/24 10:57, Ankur Arora wrote:
>
> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>
Hi.
Sorry for the delayed response.
I could see this hackbench pipe regression with a preempt=full kernel on 6.10-rc as well, i.e. without PREEMPT_AUTO.
There seem to be more wakeups in the read path, which implies the pipe was more often empty. Correspondingly there is more contention
on the pipe mutex lock with preempt=full. But why, I am not sure. One difference on powerpc is the page size, but
here the pipe isn't getting full; it's not the write side that is blocked.
preempt=none: Time taken for 20 groups in seconds : 25.70
preempt=full: Time taken for 20 groups in seconds : 54.56
----------------
hackbench (pipe)
----------------
top 3 callstacks of __schedule collected with bpftrace.
preempt=none preempt=full
__schedule+12 |@[
schedule+64 | __schedule+12
interrupt_exit_user_prepare_main+600 | preempt_schedule+84
interrupt_exit_user_prepare+88 | _raw_spin_unlock_irqrestore+124
interrupt_return_srr_user+8 | __wake_up_sync_key+108
, hackbench]: 482228 | pipe_write+1772
@[ | vfs_write+1052
__schedule+12 | ksys_write+248
schedule+64 | system_call_exception+296
pipe_write+1452 | system_call_vectored_common+348
vfs_write+940 |, hackbench]: 538591
ksys_write+248 |@[
system_call_exception+292 | __schedule+12
system_call_vectored_common+348 | schedule+76
, hackbench]: 1427161 | schedule_preempt_disabled+52
@[ | __mutex_lock.constprop.0+1748
__schedule+12 | pipe_write+132
schedule+64 | vfs_write+1052
interrupt_exit_user_prepare_main+600 | ksys_write+248
syscall_exit_prepare+336 | system_call_exception+296
system_call_vectored_common+360 | system_call_vectored_common+348
, hackbench]: 8151309 |, hackbench]: 5388301
@[ |@[
__schedule+12 | __schedule+12
schedule+64 | schedule+76
pipe_read+1100 | pipe_read+1100
vfs_read+716 | vfs_read+716
ksys_read+252 | ksys_read+252
system_call_exception+292 | system_call_exception+296
system_call_vectored_common+348 | system_call_vectored_common+348
, hackbench]: 18132753 |, hackbench]: 64424110
--------------------------------------------
hackbench (messaging) - one that uses sockets
--------------------------------------------
Here there is no regression with preempt=full.
preempt=none: Time taken for 20 groups in seconds : 55.51
preempt=full: Time taken for 20 groups in seconds : 55.10
Similar bpftrace data was collected for the socket-based hackbench. The highest caller of __schedule doesn't change much.
preempt=none preempt=full
| __schedule+12
| preempt_schedule+84
| _raw_spin_unlock+108
@[ | unix_stream_sendmsg+660
__schedule+12 | sock_write_iter+372
schedule+64 | vfs_write+1052
schedule_timeout+412 | ksys_write+248
sock_alloc_send_pskb+684 | system_call_exception+296
unix_stream_sendmsg+448 | system_call_vectored_common+348
sock_write_iter+372 |, hackbench]: 819290
vfs_write+940 |@[
ksys_write+248 | __schedule+12
system_call_exception+292 | schedule+76
system_call_vectored_common+348 | schedule_timeout+476
, hackbench]: 3424197 | sock_alloc_send_pskb+684
@[ | unix_stream_sendmsg+444
__schedule+12 | sock_write_iter+372
schedule+64 | vfs_write+1052
interrupt_exit_user_prepare_main+600 | ksys_write+248
syscall_exit_prepare+336 | system_call_exception+296
system_call_vectored_common+360 | system_call_vectored_common+348
, hackbench]: 9800144 |, hackbench]: 3386594
@[ |@[
__schedule+12 | __schedule+12
schedule+64 | schedule+76
schedule_timeout+412 | schedule_timeout+476
unix_stream_data_wait+528 | unix_stream_data_wait+468
unix_stream_read_generic+872 | unix_stream_read_generic+804
unix_stream_recvmsg+196 | unix_stream_recvmsg+196
sock_recvmsg+164 | sock_recvmsg+156
sock_read_iter+200 | sock_read_iter+200
vfs_read+716 | vfs_read+716
ksys_read+252 | ksys_read+252
system_call_exception+292 | system_call_exception+296
system_call_vectored_common+348 | system_call_vectored_common+348
, hackbench]: 25375142 |, hackbench]: 27275685
On Mon, 12 Aug 2024 at 10:33, Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
>
> top 3 callstacks of __schedule collected with bpftrace.
>
> preempt=none preempt=full
>
> __schedule+12 |@[
> schedule+64 | __schedule+12
> interrupt_exit_user_prepare_main+600 | preempt_schedule+84
> interrupt_exit_user_prepare+88 | _raw_spin_unlock_irqrestore+124
> interrupt_return_srr_user+8 | __wake_up_sync_key+108
> , hackbench]: 482228 | pipe_write+1772
> @[ | vfs_write+1052
> __schedule+12 | ksys_write+248
> schedule+64 | system_call_exception+296
> pipe_write+1452 | system_call_vectored_common+348
> vfs_write+940 |, hackbench]: 538591
> ksys_write+248 |@[
> system_call_exception+292 | __schedule+12
> system_call_vectored_common+348 | schedule+76
> , hackbench]: 1427161 | schedule_preempt_disabled+52
> @[ | __mutex_lock.constprop.0+1748
> __schedule+12 | pipe_write+132
> schedule+64 | vfs_write+1052
> interrupt_exit_user_prepare_main+600 | ksys_write+248
> syscall_exit_prepare+336 | system_call_exception+296
> system_call_vectored_common+360 | system_call_vectored_common+348
> , hackbench]: 8151309 |, hackbench]: 5388301
> @[ |@[
> __schedule+12 | __schedule+12
> schedule+64 | schedule+76
> pipe_read+1100 | pipe_read+1100
> vfs_read+716 | vfs_read+716
> ksys_read+252 | ksys_read+252
> system_call_exception+292 | system_call_exception+296
> system_call_vectored_common+348 | system_call_vectored_common+348
> , hackbench]: 18132753 |, hackbench]: 64424110
>
So the pipe performance is very sensitive, partly because the pipe
overhead is normally very low.
So we've seen it in lots of benchmarks where the benchmark then gets
wildly different results depending on whether you get the good "optimal
pattern".
And I think your "preempt=none" pattern is the one you really want,
where all the pipe IO scheduling is basically done at exactly the
(optimized) pipe points, ie where the writer blocks because there is
no room (if it's a throughput benchmark), and the reader blocks
because there is no data (for the ping-pong or pipe ring latency
benchmarks).
And then when you get that "perfect" behavior, you typically also get
the best performance when all readers and all writers are on the same
CPU, so you get no unnecessary cache ping-pong either.
And that's a *very* typical pipe benchmark, where there are no costs
to generating the pipe data and no costs involved with consuming it
(ie the actual data isn't really *used* by the benchmark).
In real (non-benchmark) loads, you typically want to spread the
consumer and producer apart on different CPUs, so that the real load
then uses multiple CPUs on the data. But the benchmark case - having
no real data load - likes the "stay on the same CPU" thing.
Your traces for "preempt=none" very much look like that "both reader
and writer sleep synchronously" case, which is the optimal benchmark
case.
And then with "preempt=full", you see that "oh damn, reader and writer
actually hit the pipe mutex contention, because they are presumably
running at the same time on different CPUs, and didn't get into that
nice serial synchronous pattern. So now you not only have that mutex
overhead (which doesn't exist in the reader and writer synchronize),
you also end up with the cost of cache misses *and* the cost of
scheduling on two different CPU's where both of them basically go into
idle while waiting for the other end.
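The preempt_schedule+84 <- _raw_spin_unlock_irqrestore+124 <- __wake_up_sync_key+108 stack in the preempt=full
column is exactly that: the fully-preemptible kernel switches the writer out the moment it re-enables preemption
while dropping the wait-queue lock. A toy userspace model of that behaviour (not the kernel's actual
preempt_count machinery, just the shape of it):

  /* Toy model of why preempt=full reschedules at spin_unlock: re-enabling
   * preemption with NEED_RESCHED set calls into the scheduler right away. */
  #include <stdbool.h>
  #include <stdio.h>

  static int  preempt_count;        /* models the per-CPU preempt counter   */
  static bool need_resched;         /* models TIF_NEED_RESCHED on the waker */

  static void preempt_schedule(void)    /* stand-in for the real preempt_schedule() */
  {
          printf("__schedule() via preempt_schedule()\n");
  }

  static void preempt_disable(void) { preempt_count++; }

  static void preempt_enable(void)
  {
          /* preempt=full: dropping the last preempt_count level while
           * NEED_RESCHED is set preempts immediately. */
          if (--preempt_count == 0 && need_resched)
                  preempt_schedule();
  }

  int main(void)
  {
          preempt_disable();        /* e.g. taking the wait-queue spinlock          */
          need_resched = true;      /* waking the reader marks the CPU for resched  */
          preempt_enable();         /* spin_unlock path: writer is switched out     */
          return 0;
  }

Under preempt=none the same wakeup only sets the need-resched flag and the switch is deferred to the next return
to user space, which is why the left column's stacks all go through the syscall/interrupt exit paths instead.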
I'm not convinced this is solvable, because it really is an effect
that comes from "benchmarking is doing something odd that we
*shouldn't* generally optimize for".
I also absolutely detest the pipe mutex - 99% of what it protects
should be using either just atomic cmpxchg or possibly a spinlock, and
that's actually what the "use pipes for events" code does. However,
the actual honest user read()/write() code needs to do user space
accesses, and so it wants a sleeping lock.
We could - and probably at some point should - split the pipe mutex
into two: one that protects the writer side, one that protects the
reader side. Then with the common situation of a single reader and a
single writer, the mutex would never be contended. Then the rendezvous
between that "one reader" and "one writer" would be done using
atomics.
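A toy userspace sketch of that split (illustrative only, with made-up names; real pipe code would still have to
handle blocking, watch-queues and the user-copy faults mentioned above):

  /* Toy model of the "split the pipe mutex" idea: one lock serializes
   * writers, one serializes readers, and a single reader and single
   * writer rendezvous only through atomic head/tail counters. */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stddef.h>

  #define RING_SIZE 4096                      /* arbitrary for the sketch */

  struct split_pipe {
          pthread_mutex_t rd_lock;            /* serializes concurrent readers only */
          pthread_mutex_t wr_lock;            /* serializes concurrent writers only */
          _Atomic size_t  head;               /* advanced by the writer             */
          _Atomic size_t  tail;               /* advanced by the reader             */
          char            buf[RING_SIZE];
  };

  size_t split_pipe_write(struct split_pipe *p, const char *src, size_t len)
  {
          size_t head, tail, space, n;

          pthread_mutex_lock(&p->wr_lock);
          head  = atomic_load_explicit(&p->head, memory_order_relaxed);
          tail  = atomic_load_explicit(&p->tail, memory_order_acquire);
          space = RING_SIZE - (head - tail);
          n = len < space ? len : space;
          for (size_t i = 0; i < n; i++)
                  p->buf[(head + i) % RING_SIZE] = src[i];
          /* Publish the data before moving head. */
          atomic_store_explicit(&p->head, head + n, memory_order_release);
          pthread_mutex_unlock(&p->wr_lock);
          return n;                           /* 0 means "pipe full": caller would sleep */
  }

  size_t split_pipe_read(struct split_pipe *p, char *dst, size_t len)
  {
          size_t head, tail, avail, n;

          pthread_mutex_lock(&p->rd_lock);
          tail  = atomic_load_explicit(&p->tail, memory_order_relaxed);
          head  = atomic_load_explicit(&p->head, memory_order_acquire);
          avail = head - tail;
          n = len < avail ? len : avail;
          for (size_t i = 0; i < n; i++)
                  dst[i] = p->buf[(tail + i) % RING_SIZE];
          /* Free the space only after the data has been copied out. */
          atomic_store_explicit(&p->tail, tail + n, memory_order_release);
          pthread_mutex_unlock(&p->rd_lock);
          return n;                           /* 0 means "pipe empty": caller would sleep */
  }

With one reader and one writer, each side only ever takes its own lock, so the rendezvous happens purely through
the atomic head/tail and the sleeping-lock contention seen in the traces goes away.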
But it would be more complex, and it's already complicated by the
whole "you can also use pipes for atomic messaging for watch-queues".
Anyway, preempt=none has always excelled at certain things. This is
one of them.
Linus
Linus Torvalds <torvalds@linux-foundation.org> writes:

> [...]
Thanks. That was very clarifying. -- ankur