As mentioned last time[1], after previous submissions of the Proxy
Execution series, I got feedback that the patch series was getting a
bit unwieldy to review, and Qais suggested I break out just the
cleanups/preparatory components of the patch series and submit them
on their own in the hope we can start to merge the less complex bits
and discussion can focus on the more complicated portions afterwards.
This so far has not been very successful, with the submission &
RESEND of the v8 preparatory changes not getting much in the way of
review.

Nonetheless, for v9 of this series, I'm again only submitting those
early cleanup/preparatory changes here (which have not changed since
the v8 submissions, but to avoid confusion with the git branch names,
I'm labeling it as v9). In the meantime, I've continued to put a lot
of effort into the full series, mostly focused on polishing the
series for correctness, and fixing some hard to trip races.

If you are interested, the full v9 series can be found here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v9-6.8
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v9-6.8

New in v9:
(In the git tree. Again, none of the preparatory patches submitted
here have changed since v8)
---------
* Change to force mutex lock handoff when we have a blocked donor
  (preserves optimistic spinning elsewhere, but still prioritizes
  donor when present on unlock)

* Do return migration whenever we're not on the wake_cpu (should
  address placement concerns brought up earlier by Xuewen Yan)

* Closed hole where we might mark a task as BO_RUNNABLE without
  doing return migration

* Much improved handling of balance callbacks when we need to
  pick_again

* Fixes for cases where we put_prev_task() but left a dangling
  pointer to rq_selected() when deactivating a task (as it could
  then be migrated away while we still have a reference to it),
  by selecting idle before deactivating next.

* Fixes for dangling references to rq->curr (which had been
  put_prev_task'ed) when we drop rq lock for proxy_migration

* Fixes for ttwu / find_proxy_task() races if the lock owner was
  being return migrated, and ttwu hadn't yet set_task_cpu() and
  activated it, which allowed that task to be scheduled on two
  cpus at the same time.

* Fix for live-lock between activate_blocked_tasks() and
  proxy_enqueue_on_owner() if activated owner went right back to
  sleep (which also simplifies the locking in
  activate_blocked_tasks())

* Cleanups to avoid locked BO_WAKING->BO_RUNNABLE transition in
  try_to_wake_up() if proxy execution isn't enabled

* Fix for psi_dequeue, as proxy changes assumptions around
  voluntary sleeps.

* Numerous typos, comment improvements, and other fixups
  suggested by Metin

* And more!

Performance:
---------
K Prateek Nayak provided some feedback on the v8 series here[2].
Given the potential extra overhead of doing rq migrations/return
migrations/etc for the proxy case, it's not completely surprising a
few of K Prateek's test cases saw ~3-5% regressions, but I'm hoping
to look into this soon to see if we can reduce those further. The
donor mutex handoff in this revision may help some.

Issues still to address:
---------
* The chain migration functionality needs further iterations and
  better validation to ensure it truly maintains the RT/DL load
  balancing invariants.

* CFS load balancing. There was concern that blocked tasks may
  carry forward load (PELT) to the lock owner's CPU, so the CPU
  may look like it is overloaded. Needs investigation.
* The sleeping owner handling (where we deactivate waiting tasks
  and enqueue them onto a list, then reactivate them when the owner
  wakes up) doesn't feel great. This is in part because when we want
  to activate tasks, we're already holding a task.pi_lock and a
  rq_lock, just not the locks for the task we're activating, nor the
  rq we're enqueuing it onto. So there has to be a bit of lock
  juggling to drop and acquire the right locks (in the right order).
  It feels like there's got to be a better way. Also needs some
  rework to get rid of the recursion. (An illustrative sketch of
  this lock juggling is appended just after this cover letter.)

Credit/Disclaimer:
---------
As mentioned previously, this Proxy Execution series has a long
history: First described in a paper[3] by Watkins, Straub, Niehaus,
then from patches from Peter Zijlstra, extended with lots of work by
Juri Lelli, Valentin Schneider, and Connor O'Brien. (and thank you
to Steven Rostedt for providing additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely mine.

Thanks so much!
-john

[1] https://lore.kernel.org/lkml/20240224001153.2584030-1-jstultz@google.com/
[2] https://lore.kernel.org/lkml/c26251d2-e1bf-e5c7-0636-12ad886e1ea8@amd.com/
[3] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@android.com

Connor O'Brien (2):
  sched: Add do_push_task helper
  sched: Consolidate pick_*_task to task_is_pushable helper

John Stultz (1):
  sched: Split out __schedule() deactivate task logic into a helper

Juri Lelli (2):
  locking/mutex: Make mutex::wait_lock irq safe
  locking/mutex: Expose __mutex_owner()

Peter Zijlstra (2):
  locking/mutex: Remove wakeups from under mutex::wait_lock
  sched: Split scheduler and execution contexts

 kernel/locking/mutex.c       |  60 +++++++----------
 kernel/locking/mutex.h       |  25 +++++++
 kernel/locking/rtmutex.c     |  26 +++++---
 kernel/locking/rwbase_rt.c   |   4 +-
 kernel/locking/rwsem.c       |   4 +-
 kernel/locking/spinlock_rt.c |   3 +-
 kernel/locking/ww_mutex.h    |  49 ++++++++------
 kernel/sched/core.c          | 122 +++++++++++++++++++++--------------
 kernel/sched/deadline.c      |  53 ++++++---------
 kernel/sched/fair.c          |  18 +++---
 kernel/sched/rt.c            |  59 +++++++----------
 kernel/sched/sched.h         |  44 ++++++++++++-
 12 files changed, 268 insertions(+), 199 deletions(-)

--
2.44.0.291.gc1ea87d7ee-goog
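[To make the "lock juggling" in the sleeping-owner bullet above
concrete, here is a minimal illustrative sketch. It is not the
series' actual activate_blocked_tasks() code: the helper name is
made up, and the wake-list walk, the recursion, and the already-held
pi_lock are all omitted. It only shows the core constraint the text
describes, namely that everything held must be dropped so the
blocked task's pi_lock and its rq's lock can be taken in the
canonical order (p->pi_lock first, then rq->lock).]

```
/*
 * Illustrative sketch only -- not the series' actual code. We arrive
 * holding this_rq's lock (irqs disabled), but activating @blocked
 * requires *its* pi_lock and *its* rq's lock, and the lock order is
 * always p->pi_lock before rq->lock, so drop everything and retake.
 */
static void sketch_activate_blocked(struct rq *this_rq,
				    struct task_struct *blocked)
{
	struct rq_flags rf;
	struct rq *target_rq;

	get_task_struct(blocked);	/* pin it across the unlocked window */
	raw_spin_rq_unlock(this_rq);	/* rq lock must drop before pi_lock */

	raw_spin_lock(&blocked->pi_lock);
	target_rq = __task_rq_lock(blocked, &rf);
	if (!task_on_rq_queued(blocked))
		activate_task(target_rq, blocked, ENQUEUE_WAKEUP);
	__task_rq_unlock(target_rq, &rf);
	raw_spin_unlock(&blocked->pi_lock);

	put_task_struct(blocked);
	raw_spin_rq_lock(this_rq);	/* this_rq's state may have changed */
}
```

[Once this_rq's lock is dropped, everything previously observed under
it is stale, so callers have to re-validate or pick again after
retaking it - which is the part of the design that "doesn't feel
great".]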
Hello John,
On 3/15/2024 10:09 AM, John Stultz wrote:
> As mentioned last time[1], after previous submissions of the
> Proxy Execution series, I got feedback that the patch series was
> getting a bit unwieldy to review, and Qais suggested I break out
> just the cleanups/preparatory components of the patch series and
> submit them on their own in the hope we can start to merge the
> less complex bits and discussion can focus on the more
> complicated portions afterwards. This so far has not been very
> successful, with the submission & RESEND of the v8 preparatory
> changes not getting much in the way of review.
>
> Nonetheless, for v9 of this series, I’m again only submitting
> those early cleanup/preparatory changes here (which have not
> changed since the v8 submissions, but to avoid confusion with the
> git branch names, I’m labeling it as v9). In the meantime, I’ve
> continued to put a lot of effort into the full series, mostly
> focused on polishing the series for correctness, and fixing some
> hard to trip races.
>
> If you are interested, the full v9 series can be found here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v9-6.8
> https://github.com/johnstultz-work/linux-dev.git proxy-exec-v9-6.8
Tested v9 of the series.
tl;dr
o I still see a small regression for hackbench. I'll get some perf
profiles for the same and leave them in this thread soon (I do
not have them at the moment unfortunately)
o There is a regression for some combinations in schbench. I'll
have to recheck if I can consistently reproduce this or not and
look at the perf profile to see if something is sticking out.
Rest of the benchmark results look good. I'll leave them below and
go digging at the regressions.
o System Details
- 3rd Generation EPYC System
- 2 x 64C/128T
- NPS1 mode
o Kernels
tip: tip:sched/core at commit 8cec3dd9e593
("sched/core: Simplify code by removing
duplicate #ifdefs")
proxy-exec-full: tip + proxy execution commits from
"proxy-exec-v9-6.8"
o Results
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) proxy_exec_v9[pct imp](CV)
1-groups 1.00 [ -0.00]( 1.80) 1.03 [ -2.88]( 2.71)
2-groups 1.00 [ -0.00]( 1.76) 1.02 [ -2.32]( 1.71)
4-groups 1.00 [ -0.00]( 1.82) 1.03 [ -2.79]( 0.84)
8-groups 1.00 [ -0.00]( 1.40) 1.02 [ -1.89]( 0.89)
16-groups 1.00 [ -0.00]( 3.38) 1.01 [ -0.53]( 1.61)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) proxy_exec_v9[pct imp](CV)
1 1.00 [ 0.00]( 0.44) 0.99 [ -1.30]( 0.66)
2 1.00 [ 0.00]( 0.39) 0.98 [ -1.76]( 0.64)
4 1.00 [ 0.00]( 0.40) 0.99 [ -1.12]( 0.63)
8 1.00 [ 0.00]( 0.16) 0.97 [ -2.94]( 1.49)
16 1.00 [ 0.00]( 3.00) 1.01 [ 0.92]( 2.18)
32 1.00 [ 0.00]( 0.84) 1.01 [ 0.66]( 1.22)
64 1.00 [ 0.00]( 1.66) 1.00 [ -0.39]( 0.24)
128 1.00 [ 0.00]( 1.04) 0.99 [ -1.23]( 2.26)
256 1.00 [ 0.00]( 0.26) 1.02 [ 1.92]( 1.09)
512 1.00 [ 0.00]( 0.15) 1.02 [ 1.84]( 0.17)
1024 1.00 [ 0.00]( 0.20) 1.03 [ 2.71]( 0.33)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) proxy_exec_v9[pct imp](CV)
Copy 1.00 [ 0.00]( 6.19) 1.11 [ 11.16]( 2.57)
Scale 1.00 [ 0.00]( 6.47) 0.98 [ -2.43]( 7.68)
Add 1.00 [ 0.00]( 6.50) 0.99 [ -0.74]( 7.25)
Triad 1.00 [ 0.00]( 5.70) 1.03 [ 2.95]( 4.41)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) proxy_exec_v9[pct imp](CV)
Copy 1.00 [ 0.00]( 3.22) 1.04 [ 4.29]( 3.02)
Scale 1.00 [ 0.00]( 6.17) 1.02 [ 1.97]( 1.55)
Add 1.00 [ 0.00]( 5.12) 1.02 [ 2.48]( 1.55)
Triad 1.00 [ 0.00]( 2.29) 1.01 [ 1.06]( 1.49)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) proxy_exec_v9[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.17) 0.98 [ -1.99]( 0.24)
2-clients 1.00 [ 0.00]( 0.49) 0.98 [ -1.86]( 0.45)
4-clients 1.00 [ 0.00]( 0.65) 0.98 [ -1.65]( 0.30)
8-clients 1.00 [ 0.00]( 0.56) 0.98 [ -1.73]( 0.41)
16-clients 1.00 [ 0.00]( 0.78) 0.98 [ -1.52]( 0.34)
32-clients 1.00 [ 0.00]( 0.62) 0.98 [ -1.90]( 0.73)
64-clients 1.00 [ 0.00]( 1.41) 0.99 [ -1.46]( 1.39)
128-clients 1.00 [ 0.00]( 0.83) 0.98 [ -1.63]( 0.89)
256-clients 1.00 [ 0.00]( 4.60) 1.01 [ 1.47]( 2.12)
512-clients 1.00 [ 0.00](54.18) 1.02 [ 2.25](56.18)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) proxy_exec_v9[pct imp](CV)
1 1.00 [ -0.00](34.63) 1.43 [-43.33]( 2.73)
2 1.00 [ -0.00]( 2.70) 0.89 [ 10.81](23.82)
4 1.00 [ -0.00]( 4.70) 1.04 [ -4.44](12.54)
8 1.00 [ -0.00]( 5.09) 0.87 [ 13.21](14.08)
16 1.00 [ -0.00]( 5.08) 1.03 [ -3.39]( 4.10)
32 1.00 [ -0.00]( 2.91) 1.14 [-14.44]( 0.56)
64 1.00 [ -0.00]( 2.73) 1.04 [ -4.17]( 2.77)
128 1.00 [ -0.00]( 7.89) 1.07 [ -7.14]( 2.83)
256 1.00 [ -0.00](28.55) 0.69 [ 31.37](19.96)
512 1.00 [ -0.00]( 2.11) 1.01 [ -1.20]( 1.07)
--
I'll leave more test results on the thread as I get to them. It has been
a slightly busy season so sorry about the delays.
>
>
> New in v9:
> (In the git tree. Again, none of the preparatory patches
> submitted here have changed since v8)
Since these changes in this preparatory series have remained the same,
please feel free to add:
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---------
> * Change to force mutex lock handoff when we have a blocked donor
> (preserves optimistic spinning elsewhere, but still prioritizes
> donor when present on unlock)
>
> * Do return migration whenever we’re not on the wake_cpu (should
> address placement concerns brought up earlier by Xuewen Yan)
>
> * Closed hole where we might mark a task as BO_RUNNABLE without
> doing return migration
>
> * Much improved handling of balance callbacks when we need to
> pick_again
>
> * Fixes for cases where we put_prev_task() but left a dangling
> pointer to rq_selected() when deactivating a task (as it could
> then be migrated away while we still have a reference to it),
> by selecting idle before deactivating next.
>
> * Fixes for dangling references to rq->curr (which had been
> put_prev_task’ed) when we drop rq lock for proxy_migration
>
> * Fixes for ttwu / find_proxy_task() races if the lock owner was
> being return migrated, and ttwu hadn’t yet set_task_cpu() and
> activated it, which allowed that task to be scheduled on two
> cpus at the same time.
>
> * Fix for live-lock between activate_blocked_tasks() and
> proxy_enqueue_on_owner() if activated owner went right back to
> sleep (which also simplifies the locking in
> activate_blocked_tasks())
>
> * Cleanups to avoid locked BO_WAKING->BO_RUNNABLE transition in
> try_to_wake_up() if proxy execution isn't enabled
>
> * Fix for psi_dequeue, as proxy changes assumptions around
> voluntary sleeps.
>
> * Numerous typos, comment improvements, and other fixups
> suggested by Metin
>
> * And more!
>
>
> Performance:
> ---------
> K Prateek Nayak provided some feedback on the v8 series here[2].
> Given the potential extra overhead of doing rq migrations/return
> migrations/etc for the proxy case, it’s not completely surprising
> a few of K Prateek’s test cases saw ~3-5% regressions, but I’m
> hoping to look into this soon to see if we can reduce those
> further. The donor mutex handoff in this revision may help some.
>
>
> Issues still to address:
> ---------
> * The chain migration functionality needs further iterations and
> better validation to ensure it truly maintains the RT/DL load
> balancing invariants.
>
> * CFS load balancing. There was concern that blocked tasks may
> carry forward load (PELT) to the lock owner's CPU, so the CPU
> may look like it is overloaded. Needs investigation.
>
> * The sleeping owner handling (where we deactivate waiting tasks
> and enqueue them onto a list, then reactivate them when the
> owner wakes up) doesn’t feel great. This is in part because
> when we want to activate tasks, we’re already holding a
> task.pi_lock and a rq_lock, just not the locks for the task
> we’re activating, nor the rq we’re enqueuing it onto. So there
> has to be a bit of lock juggling to drop and acquire the right
> locks (in the right order). It feels like there’s got to be a
> better way. Also needs some rework to get rid of the recursion.
>
>
> Credit/Disclaimer:
> ---------
> As mentioned previously, this Proxy Execution series has a long
> history: First described in a paper[3] by Watkins, Straub,
> Niehaus, then from patches from Peter Zijlstra, extended with
> lots of work by Juri Lelli, Valentin Schneider, and Connor
> O'Brien. (and thank you to Steven Rostedt for providing
> additional details here!)
>
> So again, many thanks to those above, as all the credit for this
> series really is due to them - while the mistakes are likely
> mine.
>
> Thanks so much!
> -john
>
> [1] https://lore.kernel.org/lkml/20240224001153.2584030-1-jstultz@google.com/
> [2] https://lore.kernel.org/lkml/c26251d2-e1bf-e5c7-0636-12ad886e1ea8@amd.com/
> [3] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
>
>[..snip..]
>
--
Thanks and Regards,
Prateek
On Sun, Mar 24, 2024 at 11:44 PM 'K Prateek Nayak' via kernel-team <kernel-team@android.com> wrote:
> On 3/15/2024 10:09 AM, John Stultz wrote:
> > [..snip..]
>
> Tested v9 of the series.
>
> tl;dr
>
> o I still see a small regression for hackbench. I'll get some perf
>   profiles for the same and leave them in this thread soon (I do
>   not have them at the moment unfortunately)
>
> o There is a regression for some combinations in schbench. I'll
>   have to recheck if I can consistently reproduce this or not and
>   look at the perf profile to see if something is sticking out.

Much appreciated for the extra testing of the whole series!

I've not had much time to dig further on performance tuning since v8
(v9 mostly focused on fixes), so I didn't expect much change. As I do
some more detailed analysis of behavior via tracing for correctness,
I'm hoping there will be some findings there to further improve
things, but I'm not sure all of the overhead of handling proxy
rq-migrations and re-selecting the next task will be avoidable. So I
expect some potential throughput impact, but hopefully the latency
improvements will be worth it.

> > New in v9:
> > (In the git tree. Again, none of the preparatory patches
> > submitted here have changed since v8)
>
> Since these changes in this preparatory series have remained the same,
> please feel free to add:
>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

Just added to my WIP tree. I'll likely be sending a RESEND of the v9
prep patches in the next few days, and will add them there as well.

Thank you so much again!
-john
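[For context on the re-pick overhead John mentions above: the heart
of proxy execution is a pick-time walk from the selected task (the
scheduler context) to the mutex owner that actually gets run (the
execution context). The following is a rough conceptual sketch, not
the series' real find_proxy_task(); the blocked_on field name and
the helper's shape are assumptions for illustration, though
__mutex_owner() is exposed by one of the prep patches in this set.]

```
/*
 * Conceptual sketch, not the series' actual find_proxy_task().
 * "blocked_on" (the mutex a task is waiting on) is an assumed
 * field name used here for illustration.
 */
static struct task_struct *sketch_find_proxy_task(struct rq *rq,
						  struct task_struct *selected)
{
	struct task_struct *p = selected;

	while (p->blocked_on) {
		struct task_struct *owner = __mutex_owner(p->blocked_on);

		if (!owner)
			return NULL;	/* lock was released: pick again */
		if (task_cpu(owner) != cpu_of(rq))
			return NULL;	/* owner elsewhere: proxy-migrate */
		p = owner;		/* follow the blocked-on chain */
	}
	/* @p runs (execution context); @selected keeps being accounted. */
	return p;
}
```

[Each NULL exit forces either another pick or an rq migration under
freshly juggled locks, which is where the throughput cost versus the
latency win comes from.]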
On 15/03/2024 4:39 am, John Stultz wrote:
> As mentioned last time[1], after previous submissions of the
> Proxy Execution series, I got feedback that the patch series was
> getting a bit unwieldy to review, and Qais suggested I break out
> just the cleanups/preparatory components of the patch series and
> submit them on their own in the hope we can start to merge the
> less complex bits and discussion can focus on the more
> complicated portions afterwards. This so far has not been very
> successful, with the submission & RESEND of the v8 preparatory
> changes not getting much in the way of review.
>
> Nonetheless, for v9 of this series, I’m again only submitting
> those early cleanup/preparatory changes here (which have not
> changed since the v8 submissions, but to avoid confusion with the
> git branch names, I’m labeling it as v9). In the meantime, I’ve
> continued to put a lot of effort into the full series, mostly
> focused on polishing the series for correctness, and fixing some
> hard to trip races.
>
> If you are interested, the full v9 series can be found here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v9-6.8
> https://github.com/johnstultz-work/linux-dev.git proxy-exec-v9-6.8
>
>
> New in v9:
> (In the git tree. Again, none of the preparatory patches
> submitted here have changed since v8)
> ---------
> * Change to force mutex lock handoff when we have a blocked donor
> (preserves optimistic spinning elsewhere, but still prioritizes
> donor when present on unlock)
>
> * Do return migration whenever we’re not on the wake_cpu (should
> address placement concerns brought up earlier by Xuewen Yan)
>
> * Closed hole where we might mark a task as BO_RUNNABLE without
> doing return migration
>
> * Much improved handling of balance callbacks when we need to
> pick_again
>
> * Fixes for cases where we put_prev_task() but left a dangling
> pointer to rq_selected() when deactivating a task (as it could
> then be migrated away while we still have a reference to it),
> by selecting idle before deactivating next.
>
> * Fixes for dangling references to rq->curr (which had been
> put_prev_task’ed) when we drop rq lock for proxy_migration
>
> * Fixes for ttwu / find_proxy_task() races if the lock owner was
> being return migrated, and ttwu hadn’t yet set_task_cpu() and
> activated it, which allowed that task to be scheduled on two
> cpus at the same time.
>
> * Fix for live-lock between activate_blocked_tasks() and
> proxy_enqueue_on_owner() if activated owner went right back to
> sleep (which also simplifies the locking in
> activate_blocked_tasks())
>
> * Cleanups to avoid locked BO_WAKING->BO_RUNNABLE transition in
> try_to_wake_up() if proxy execution isn't enabled
>
> * Fix for psi_dequeue, as proxy changes assumptions around
> voluntary sleeps.
>
> * Numerous typos, comment improvements, and other fixups
> suggested by Metin
>
> * And more!
>
>
> Performance:
> ---------
> K Prateek Nayak provided some feedback on the v8 series here[2].
> Given the potential extra overhead of doing rq migrations/return
> migrations/etc for the proxy case, it’s not completely surprising
> a few of K Prateek’s test cases saw ~3-5% regressions, but I’m
> hoping to look into this soon to see if we can reduce those
> further. The donor mutex handoff in this revision may help some.
>
>
> Issues still to address:
> ---------
> * The chain migration functionality needs further iterations and
> better validation to ensure it truly maintains the RT/DL load
> balancing invariants.
>
> * CFS load balancing. There was concern that blocked tasks may
> carry forward load (PELT) to the lock owner's CPU, so the CPU
> may look like it is overloaded. Needs investigation.
>
> * The sleeping owner handling (where we deactivate waiting tasks
> and enqueue them onto a list, then reactivate them when the
> owner wakes up) doesn’t feel great. This is in part because
> when we want to activate tasks, we’re already holding a
> task.pi_lock and a rq_lock, just not the locks for the task
> we’re activating, nor the rq we’re enqueuing it onto. So there
> has to be a bit of lock juggling to drop and acquire the right
> locks (in the right order). It feels like there’s got to be a
> better way. Also needs some rework to get rid of the recursion.
>
>
> Credit/Disclaimer:
> ---------
> As mentioned previously, this Proxy Execution series has a long
> history: First described in a paper[3] by Watkins, Straub,
> Niehaus, then from patches from Peter Zijlstra, extended with
> lots of work by Juri Lelli, Valentin Schneider, and Connor
> O'Brien. (and thank you to Steven Rostedt for providing
> additional details here!)
>
> So again, many thanks to those above, as all the credit for this
> series really is due to them - while the mistakes are likely
> mine.
>
> Thanks so much!
> -john
>
> [1] https://lore.kernel.org/lkml/20240224001153.2584030-1-jstultz@google.com/
> [2] https://lore.kernel.org/lkml/c26251d2-e1bf-e5c7-0636-12ad886e1ea8@amd.com/
> [3] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
>
> Cc: Joel Fernandes <joelaf@google.com>
> Cc: Qais Yousef <qyousef@google.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Youssef Esmat <youssefesmat@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: kernel-team@android.com
>
>
> Connor O'Brien (2):
> sched: Add do_push_task helper
> sched: Consolidate pick_*_task to task_is_pushable helper
>
> John Stultz (1):
> sched: Split out __schedule() deactivate task logic into a helper
>
> Juri Lelli (2):
> locking/mutex: Make mutex::wait_lock irq safe
> locking/mutex: Expose __mutex_owner()
>
> Peter Zijlstra (2):
> locking/mutex: Remove wakeups from under mutex::wait_lock
> sched: Split scheduler and execution contexts
>
> kernel/locking/mutex.c | 60 +++++++----------
> kernel/locking/mutex.h | 25 +++++++
> kernel/locking/rtmutex.c | 26 +++++---
> kernel/locking/rwbase_rt.c | 4 +-
> kernel/locking/rwsem.c | 4 +-
> kernel/locking/spinlock_rt.c | 3 +-
> kernel/locking/ww_mutex.h | 49 ++++++++------
> kernel/sched/core.c | 122 +++++++++++++++++++++--------------
> kernel/sched/deadline.c | 53 ++++++---------
> kernel/sched/fair.c | 18 +++---
> kernel/sched/rt.c | 59 +++++++----------
> kernel/sched/sched.h | 44 ++++++++++++-
> 12 files changed, 268 insertions(+), 199 deletions(-)
>
I could cleanly apply this patch series to the `linux-next/next-20240318`
branch.

Here is the checkpatch report:
```
$ ./scripts/checkpatch.pl -g 6aa2a9e7aea6d..HEAD
---------------------------------------------------------------------
Commit a9124906b27f ("locking/mutex: Make mutex::wait_lock irq safe")
---------------------------------------------------------------------
total: 0 errors, 0 warnings, 128 lines checked
Commit a9124906b27f ("locking/mutex: Make mutex::wait_lock irq safe")
has no obvious style problems and is ready for submission.
-------------------------------------------------------------
Commit c3b81eb4858a ("locking/mutex: Expose __mutex_owner()")
-------------------------------------------------------------
WARNING: Possible repeated word: 'at'
#91: FILE: kernel/locking/mutex.h:26:
+ * NULL means not owned. Since task_struct pointers are aligned at
+ * at least L1_CACHE_BYTES, we have low bits to store extra state.
total: 0 errors, 1 warnings, 62 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or
--fix-inplace.
Commit c3b81eb4858a ("locking/mutex: Expose __mutex_owner()") has style
problems, please review.
------------------------------------------------------
Commit 6adb713bd389 ("sched: Add do_push_task helper")
------------------------------------------------------
total: 0 errors, 0 warnings, 64 lines checked
Commit 6adb713bd389 ("sched: Add do_push_task helper") has no obvious
style problems and is ready for submission.
---------------------------------------------------------------------------------
Commit 037345d6896a ("sched: Consolidate pick_*_task to task_is_pushable
helper")
---------------------------------------------------------------------------------
total: 0 errors, 0 warnings, 61 lines checked
Commit 037345d6896a ("sched: Consolidate pick_*_task to task_is_pushable
helper") has no obvious style problems and is ready for submission.
-----------------------------------------------------------------------------------------
Commit 6c9aa2fd15de ("sched: Split out __schedule() deactivate task
logic into a helper")
-----------------------------------------------------------------------------------------
total: 0 errors, 0 warnings, 84 lines checked
Commit 6c9aa2fd15de ("sched: Split out __schedule() deactivate task
logic into a helper") has no obvious style problems and is ready for
submission.
---------------------------------------------------------------------
Commit 6af85d44be35 ("sched: Split scheduler and execution contexts")
---------------------------------------------------------------------
WARNING: Duplicate signature
#46:
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
WARNING: Do not crash the kernel unless it is absolutely
unavoidable--use WARN_ON_ONCE() plus recovery code (if feasible) instead
of BUG() or variants
#557: FILE: kernel/sched/rt.c:1979:
+ BUG_ON(task_current_selected(rq, p));
total: 0 errors, 2 warnings, 539 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or
--fix-inplace.
Commit 6af85d44be35 ("sched: Split scheduler and execution contexts")
has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
```
I think only the duplicate signature warning is a real problem and will
reply to the offending mail.
Issued `make defconfig` on an x86_64 machine, enabled lock & RCU
debugging config options, and built the kernel.

Then, I ran the commands below on a QEMU Debian guest (arch=x86_64,
cores=32, threads=2, memory=2GiB) before and after this patch series:
```
$ insmod torture.ko && insmod scftorture.ko verbose=1 onoff_holdoff=5
onoff_interval=1; sleep 60; rmmod scftorture torture
$ insmod torture.ko random_shuffle=1 && insmod locktorture.ko
torture_type=mutex_lock rt_boost=1 rt_boost_factor=50 nested_locks=3
writer_fifo=1; sleep 60; rmmod locktorture torture
$ taskset -c 0-31,32-63 perf bench sched messaging -p -t -l 100000 -g 2
$ perf bench sched pipe -l 100000 -T
```
Logs before this patch series:
```
$ insmod torture.ko && insmod scftorture.ko verbose=1 onoff_holdoff=5
onoff_interval=1; sleep 60; rmmod scftorture torture
[ 35.912508] torture module --- scftorture: disable_onoff_at_boot=0
ftrace_dump_at_shutdown=0 verbose_sleep_frequency=0
verbose_sleep_duration=1 random_shuffle=0
[ 35.914627] scftorture: --- Start of test: verbose=1 holdoff=0
longwait=0 nthreads=-1 onoff_holdoff=5 onoff_interval=1 shutdown_secs=0
stat_interval=60 stutter=5 use_cpus_read_lock=0, weight_resched=-1,
weight_single=-1, weight_single_rpc=-1, weight_single_wait=-1, weight_many1
[ 35.918824] scftorture: !!! built as module, weight_resched ignored
[ 35.919842] scf_sel_dump: 32.820 smp_call_function_single(nowait)
[ 35.921035] scf_sel_dump: 32.820 smp_call_function_single_rpc(wait)
[ 35.922183] scf_sel_dump: 32.820 smp_call_function_single(wait)
[ 35.923334] scf_sel_dump: 0.512 smp_call_function_many(nowait)
[ 35.924372] scf_sel_dump: 0.512 smp_call_function_many(wait)
[ 35.925410] scf_sel_dump: 0.256 smp_call_function(nowait)
[ 35.926398] scf_sel_dump: 0.256 smp_call_function(wait)
[ 35.927373] scftorture-torture: Creating torture_onoff task
[ 35.928665] scftorture-torture: torture_onoff task started
[ 35.928677] scftorture-torture: Creating torture_stutter task
[ 35.929696] scftorture-torture: torture_onoff begin holdoff
[ 35.930902] scftorture: Starting 64 smp_call_function() threads
[ 35.930904] scftorture-torture: torture_stutter task started
[ 35.933755] scftorture-torture: Creating scftorture_invoker task
[ 35.934966] scftorture-torture: Creating scftorture_invoker task
[ 35.934980] scftorture: scftorture_invoker 0: task started
...
[ 99.689799] scftorture: scf_invoked_count VER: 87177502 resched: 0
single: 28726722/28492519 single_ofl: 13877596/13766523 single_rpc:
28492768 single_rpc_ofl: 0 many: 445547/446215 all: 221861/223143 onoff:
1527/1527:1559/1559 14,74:8,642 23749:29438 (HZ=1000) ste: 0 stnmie: 0
[ 99.694023] scftorture: --- End of test: SUCCESS: verbose=1
holdoff=0 longwait=0 nthreads=64 onoff_holdoff=5 onoff_interval=1
shutdown_secs=0 stat_interval=60 stutter=5 use_cpus_read_lock=0,
weight_resched=-1, weight_single=-1, weight_single_rpc=-1,
weight_single_wait=-1, weig1
$ insmod torture.ko random_shuffle=1 && insmod locktorture.ko
torture_type=mutex_lock rt_boost=1 rt_boost_factor=50 nested_locks=3
writer_fifo=1; sleep 60; rmmod locktorture torture
[ 114.543156] torture module --- mutex_lock: disable_onoff_at_boot=0
ftrace_dump_at_shutdown=0 verbose_sleep_frequency=0
verbose_sleep_duration=1 random_shuffle=1
[ 114.545291] mutex_lock-torture:--- Start of test [debug]:
acq_writer_lim=0 bind_readers=0-63 bind_writers=0-63 call_rcu_chains=0
long_hold=100 nested_locks=3 nreaders_stress=0 nwriters_stress=128
onoff_holdoff=0 onoff_interval=0 rt_boost=1 rt_boost_factor=50
shuffle_interval=3 1
[ 114.549640] mutex_lock-torture: Creating torture_shuffle task
[ 114.550880] mutex_lock-torture: Creating torture_stutter task
[ 114.550905] mutex_lock-torture: torture_shuffle task started
[ 114.552006] mutex_lock-torture: Creating lock_torture_writer task
[ 114.552021] mutex_lock-torture: torture_stutter task started
...
[ 177.883162] mutex_lock-torture: Stopping lock_torture_stats task
[ 177.884249] Writes: Total: 72982638 Max/Min: 748221/404806 Fail: 0
[ 177.885483] mutex_lock-torture: lock_torture_stats is stopping
[ 177.886516] Writes: Total: 72982638 Max/Min: 748221/404806 Fail: 0
[ 177.887630] mutex_lock-torture:--- End of test: SUCCESS [debug]:
acq_writer_lim=0 bind_readers=0-63 bind_writers=0-63 call_rcu_chains=0
long_hold=100 nested_locks=3 nreaders_stress=0 nwriters_stress=128
onoff_holdoff=0 onoff_interval=0 rt_boost=1 rt_boost_factor=50 shuffle_inte1
$ taskset -c 0-31,32-63 perf bench sched messaging -p -t -l 100000 -g 2
# Running 'sched/messaging' benchmark:0-31,32-63 perf bench sched
messaging -p -t -l 100000 -g 2
# 20 sender and receiver threads per group
# 2 groups == 80 threads run
Total time: 13.835 [sec]
$ perf bench sched pipe -l 100000 -T
# Running 'sched/pipe' benchmark:
# Executed 100000 pipe operations between two threads
Total time: 2.772 [sec]
27.725990 usecs/op
36067 ops/sec
```
Logs after applying this patch series:
```
$ insmod torture.ko && insmod scftorture.ko verbose=1 onoff_holdoff=5
onoff_interval=1; sleep 60; rmmod scftorture torture
[ 17.840768] torture module --- scftorture: disable_onoff_at_boot=0
ftrace_dump_at_shutdown=0 verbose_sleep_frequency=0
verbose_sleep_duration=1 random_shuffle=0
[ 17.842941] scftorture: --- Start of test: verbose=1 holdoff=0
longwait=0 nthreads=-1 onoff_holdoff=5 onoff_interval=1 shutdown_secs=0
stat_interval=60 stutter=5 use_cpus_read_lock=0, weight_resched=-1,
weight_single=-1, weight_single_rpc=-1, weight_single_wait=-1, weight_many1
[ 17.847211] scftorture: !!! built as module, weight_resched ignored
[ 17.848135] scf_sel_dump: 32.820 smp_call_function_single(nowait)
[ 17.849230] scf_sel_dump: 32.820 smp_call_function_single_rpc(wait)
[ 17.850475] scf_sel_dump: 32.820 smp_call_function_single(wait)
[ 17.851523] scf_sel_dump: 0.512 smp_call_function_many(nowait)
[ 17.852559] scf_sel_dump: 0.512 smp_call_function_many(wait)
[ 17.853602] scf_sel_dump: 0.256 smp_call_function(nowait)
[ 17.854541] scf_sel_dump: 0.256 smp_call_function(wait)
[ 17.855490] scftorture-torture: Creating torture_onoff task
[ 17.856673] scftorture-torture: torture_onoff task started
[ 17.856694] scftorture-torture: Creating torture_stutter task
[ 17.857701] scftorture-torture: torture_onoff begin holdoff
[ 17.859816] scftorture: Starting 64 smp_call_function() threads
[ 17.859816] scftorture-torture: torture_stutter task started
[ 17.860839] scftorture-torture: Creating scftorture_invoker task
[ 17.860930] scftorture-torture: Creating scftorture_invoker task
[ 17.860940] scftorture: scftorture_invoker 0: task started
...
[ 78.914652] scftorture: scf_invoked_count VER: 86855271 resched: 0
single: 27751906/27523464 single_ofl: 12966034/12854343 single_rpc:
27530433 single_rpc_ofl: 0 many: 430178/430527 all: 215119/214473 onoff:
1441/1441:1474/1474 14,268:8,177 22685:28034 (HZ=1000) ste: 0 stnmie:0
[ 78.919133] scftorture: --- End of test: SUCCESS: verbose=1
holdoff=0 longwait=0 nthreads=64 onoff_holdoff=5 onoff_interval=1
shutdown_secs=0 stat_interval=60 stutter=5 use_cpus_read_lock=0,
weight_resched=-1, weight_single=-1, weight_single_rpc=-1,
weight_single_wait=-1, weig1
$ insmod torture.ko random_shuffle=1 && insmod locktorture.ko
torture_type=mutex_lock rt_boost=1 rt_boost_factor=50 nested_locks=3
writer_fifo=1; sleep 60; rmmod locktorture torture
[ 83.340115] torture module --- mutex_lock: disable_onoff_at_boot=0
ftrace_dump_at_shutdown=0 verbose_sleep_frequency=0
verbose_sleep_duration=1 random_shuffle=1
[ 83.342142] mutex_lock-torture:--- Start of test [debug]:
acq_writer_lim=0 bind_readers=0-63 bind_writers=0-63 call_rcu_chains=0
long_hold=100 nested_locks=3 nreaders_stress=0 nwriters_stress=128
onoff_holdoff=0 onoff_interval=0 rt_boost=1 rt_boost_factor=50
shuffle_interval=3 1
[ 83.346306] mutex_lock-torture: Creating torture_shuffle task
[ 83.347383] mutex_lock-torture: Creating torture_stutter task
[ 83.347395] mutex_lock-torture: torture_shuffle task started
[ 83.348549] mutex_lock-torture: Creating lock_torture_writer task
[ 83.348555] mutex_lock-torture: torture_stutter task started
...
[ 146.611980] mutex_lock-torture: Stopping lock_torture_stats task
[ 146.613012] Writes: Total: 75575274 Max/Min: 1085187/396665 ???
Fail: 0
[ 146.614183] mutex_lock-torture: lock_torture_stats is stopping
[ 146.615160] Writes: Total: 75575274 Max/Min: 1085187/396665 ???
Fail: 0
[ 146.616273] mutex_lock-torture:--- End of test: SUCCESS [debug]:
acq_writer_lim=0 bind_readers=0-63 bind_writers=0-63 call_rcu_chains=0
long_hold=100 nested_locks=3 nreaders_stress=0 nwriters_stress=128
onoff_holdoff=0 onoff_interval=0 rt_boost=1 rt_boost_factor=50 shuffle_inte1
$ taskset -c 0-31,32-63 perf bench sched messaging -p -t -l 100000 -g 2
# Running 'sched/messaging' benchmark:0-31,32-63 perf bench sched
messaging -p -t -l 100000 -g 2
# 20 sender and receiver threads per group
# 2 groups == 80 threads run
Total time: 13.548 [sec]
$ perf bench sched pipe -l 100000 -T
# Running 'sched/pipe' benchmark:
# Executed 100000 pipe operations between two threads
Total time: 2.350 [sec]
23.503460 usecs/op
42546 ops/sec
```
Looks like there are slight differences (both improvements and
regressions) before & after this patch series. I believe they should be
ignored, since I ran the tests only a few times and it's just a virtual
machine.
I also ported commits in
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v9-6.8
to an Android v6.1 kernel and tested on my Pixel 6.
In conclusion, I'm happy to give my:
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Thanks,
Metin