[PATCH v18 0/8] Single RunQueue Proxy Execution (v18)

Posted by John Stultz 3 months, 2 weeks ago
Hey All,

After not getting much response from the v17 series (and
resending it), I was going to continue to just iterate resending
the v17 single runqueue focused series. However, Suleiman had a
very good suggestion for improving the larger patch series and a
few of the tweaks for those changes trickled back into the set
I’m submitting here.

Unfortunately those later changes also uncovered some stability
problems with the full proxy-exec patch series, which took a
painfully long time (stress testing taking 30-60 hours to trip
the problem) to resolve. However, after finally sorting those
issues out it has been running well, so I can now send out the
next revision (v18) of the set.

So here is v18 of the Proxy Execution series, a generalized form
of priority inheritance.

As I’m trying to submit this work in smallish digestible pieces,
in this series, I’m only submitting for review the logic that
allows us to do the proxying if the lock owner is on the same
runqueue as the blocked waiter: Introducing the
CONFIG_SCHED_PROXY_EXEC option and boot-argument, reworking the
task_struct::blocked_on pointer and wrapper functions, the
initial sketch of the find_proxy_task() logic, some fixes for
using split contexts, and finally same-runqueue proxying. 
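
For reference, the enable/disable plumbing follows the usual
static-key-plus-boot-argument pattern, roughly like the
simplified, hypothetical sketch below (the sched_proxy_exec()
helper is referenced later in this letter; the remaining names
and the exact argument spelling are assumed here, see the first
patch for the real details):

	#ifdef CONFIG_SCHED_PROXY_EXEC
	DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);

	static int __init setup_sched_proxy_exec(char *str)
	{
		bool enabled;

		if (kstrtobool(str, &enabled))
			return -EINVAL;
		if (enabled)
			static_branch_enable(&__sched_proxy_exec);
		else
			static_branch_disable(&__sched_proxy_exec);
		return 0;
	}
	early_param("sched_proxy_exec", setup_sched_proxy_exec);

	static inline bool sched_proxy_exec(void)
	{
		return static_branch_likely(&__sched_proxy_exec);
	}
	#endif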

As I mentioned above, the series I’m submitting here has only
barely changed from v17. The main differences are a slightly
different order of checks for cases where we don’t actually do
anything yet (more on why below), and the use of READ_ONCE() for
the on_rq reads to avoid the compiler fusing loads, which bit me
with the full series.
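
As a minimal illustration of the load-fusing concern (simplified,
not the exact code from the series): without READ_ONCE(), the
compiler may read p->on_rq once and reuse the stale value, so a
wait loop like the following might never observe an update made
from another cpu:

	/* Force a fresh load of p->on_rq on every iteration. */
	while (!READ_ONCE(p->on_rq))
		cpu_relax();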

However for the full proxy-exec series, there are a few
differences:
* Suleiman Souhlal noticed an inefficiency: we were evaluating
  whether the lock owner’s task_cpu() is the current cpu before
  checking whether the lock owner is on_rq at all. With v17 this
  could result in us proxy-migrating a donor to a remote cpu,
  only to then realize the task wasn’t even on the runqueue, and
  doing the sleeping-owner enqueueing anyway. Suleiman suggested
  we instead evaluate on_rq first, so we can immediately do the
  sleeping-owner enqueueing, and only proxy-migrate the donor
  (which requires the more costly lock juggling) if the owner is
  actually on a runqueue (a rough sketch of this reordering
  follows this list). While not a huge logical change, it did
  uncover other problems, which needed to be resolved.

* One issue found was a race where the
  do_activate_blocked_waiter() call from the sleeping-owner
  wakeup could be delayed until after the task had already been
  woken up elsewhere. If that task was running and called into
  schedule() to block, it would be dequeued from the runqueue,
  but before we switched to the new task,
  do_activate_blocked_waiter() might try to activate it on a
  different cpu. Clearly do_activate_blocked_waiter() needed to
  check the task’s on_cpu value as well.

* I found that we can still hit wakeups that end up skipping the
  BO_WAKING -> BO_RUNNABLE transition (causing find_proxy_task()
  to end up spinning, waiting for that transition), so I
  re-added the logic to handle doing return migrations from
  find_proxy_task() if we hit that case.

* Hupu suggested a tweak in ttwu_runnable() to evaluate
  proxy_needs_return() slightly earlier.

* Kuyo Chang reported and isolated a fix for a problem with
  __task_is_pushable() in the !sched_proxy_exec case, which was
  folded into the “sched: Fix rt/dl load balancing via chain
  level balance” patch.

* Reworked some of the logic around releasing the rq->donor
  reference on migrations, using rq->idle directly.

* Suleiman also pointed out that some added task_struct elements
  were not being initialized in the init_task code path, so that
  was good to fix.
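
To make the reordering Suleiman suggested concrete, here is a
rough, hypothetical sketch of the reshaped owner checks in
find_proxy_task(). The proxy_enqueue_on_owner() and
proxy_migrate_task() helper names are illustrative stand-ins,
not necessarily the exact code in the series:

	/*
	 * Check on_rq before comparing task_cpu(), so a sleeping
	 * owner is handled directly rather than first paying for
	 * a pointless proxy migration of the donor to a remote
	 * cpu.
	 */
	if (!READ_ONCE(owner->on_rq)) {
		/*
		 * Sleeping owner: enqueue the donor on the owner
		 * and pick again. (The delayed-wakeup race above
		 * is why do_activate_blocked_waiter() must also
		 * check on_cpu before activating the task.)
		 */
		proxy_enqueue_on_owner(rq, owner, donor);
		return NULL;
	}
	if (task_cpu(owner) != cpu_of(rq)) {
		/*
		 * Runnable on another cpu: only now is the more
		 * costly lock juggling of proxy migration
		 * justified.
		 */
		return proxy_migrate_task(rq, rf, donor, task_cpu(owner));
	}
	/* Owner is on this runqueue: run it on the donor’s behalf. */
	return owner;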

You can find the full series here:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v18-6.16-rc3/
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v18-6.16-rc3

Issues still to address with the full series (not the patches
submitted here):
* Peter suggested that, instead of using
  (blocked_on_state == BO_WAKING) when tasks become unblocked to
  protect against running proxy-migrated tasks on cpus they are
  not affined to, we could dequeue tasks first and then wake
  them. This does look cleaner in many ways, but the locking
  rework is significant and I’ve not worked out all the kinks
  with it yet. I am also a little worried that we may trip other
  wakeup paths that might not do the dequeue first. However, I
  have adopted this approach for the find_proxy_task() forced
  return migration, and it’s working well (a rough sketch of the
  idea follows this list).

* Need to sort out what is needed for sched_ext to be ok with
  proxy-execution enabled.

* K Prateek Nayak did some testing a bit over a year ago with an
  earlier version of the series and saw ~3-5% regressions in
  some cases. This needs re-evaluating now that the
  proxy-migration avoidance optimization Suleiman suggested is
  implemented.

* The chain migration functionality needs further iteration and
  better validation to ensure it truly maintains the RT/DL load
  balancing invariants (despite this currently being broken in
  vanilla upstream with RT_PUSH_IPI).
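
For what it’s worth, the rough shape of the dequeue-first idea
from the first item above, purely illustrative and not working
code:

	/*
	 * Illustrative only: rather than gating on
	 * blocked_on_state == BO_WAKING, dequeue the task from
	 * the rq it was proxy-migrated to, drop the rq lock, and
	 * let the normal wakeup path pick a cpu the task is
	 * actually affined to.
	 */
	if (task_on_rq_queued(p) && !cpumask_test_cpu(cpu_of(rq), p->cpus_ptr)) {
		deactivate_task(rq, p, DEQUEUE_SLEEP);
		rq_unlock(rq, &rf);
		wake_up_process(p);
		rq_lock(rq, &rf);
	}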

I’d really appreciate any feedback or review thoughts on the
full series as well. I’m trying to keep the chunks small,
reviewable and iteratively testable, but if you have any
suggestions on how to improve the series, I’m all ears.

Credit/Disclaimer:
------------------
As always, this Proxy Execution series has a long history with
lots of developers that deserve credit: 

First described in a paper[1] by Watkins, Straub, and Niehaus,
then prototyped in patches from Peter Zijlstra, and extended
with lots of work by Juri Lelli, Valentin Schneider, and Connor
O'Brien. (And thank you to Steven Rostedt for providing
additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely mine.

Thanks so much!
-john

[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>  
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com

John Stultz (4):
  sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
  sched: Move update_curr_task logic into update_curr_se
  sched: Fix runtime accounting w/ split exec & sched contexts
  sched: Add an initial sketch of the find_proxy_task() function

Peter Zijlstra (2):
  locking/mutex: Rework task_struct::blocked_on
  sched: Start blocked_on chain processing in find_proxy_task()

Valentin Schneider (2):
  locking/mutex: Add p->blocked_on wrappers for correctness checks
  sched: Fix proxy/current (push,pull)ability

 .../admin-guide/kernel-parameters.txt         |   5 +
 include/linux/sched.h                         |  72 ++++-
 init/Kconfig                                  |  12 +
 kernel/fork.c                                 |   3 +-
 kernel/locking/mutex-debug.c                  |   9 +-
 kernel/locking/mutex.c                        |  18 ++
 kernel/locking/mutex.h                        |   3 +-
 kernel/locking/ww_mutex.h                     |  16 +-
 kernel/sched/core.c                           | 257 +++++++++++++++++-
 kernel/sched/deadline.c                       |   7 +
 kernel/sched/fair.c                           |  65 +++--
 kernel/sched/rt.c                             |   5 +
 kernel/sched/sched.h                          |  22 +-
 13 files changed, 448 insertions(+), 46 deletions(-)

-- 
2.50.0.727.gbf7dc18ff4-goog
Re: [PATCH v18 0/8] Single RunQueue Proxy Execution (v18)
Posted by K Prateek Nayak 3 months, 1 week ago
Hello John,

On 6/26/2025 2:00 AM, John Stultz wrote:
> Hey All,
> 
> After not getting much response from the v17 series (and
> resending it), I was going to continue to just iterate resending
> the v17 single runqueue focused series. However, Suleiman had a
> very good suggestion for improving the larger patch series and a
> few of the tweaks for those changes trickled back into the set
> I’m submitting here.
> 
> Unfortunately those later changes also uncovered some stability
> problems with the full proxy-exec patch series, which took a
> painfully long time (stress testing taking 30-60 hours to trip
> the problem) to resolve. However, after finally sorting those
> issues out it has been running well, so I can now send out the
> next revision (v18) of the set.
> 
> So here is v18 of the Proxy Execution series, a generalized form
> of priority inheritance.

Sorry for the lack of response on the previous version but here
are the test results for v18.

tl;dr I don't see anything major. The few regressions I see are
for data points with a lot of deviation, so I think they can be
safely ignored.

Full results are below:

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel details

tip:        tip:sched/urgent at commit 914873bc7df9 ("Merge tag
             'x86-build-2025-05-25' of
             git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")

proxy_exec: tip + this series as is with CONFIG_SCHED_PROXY_EXEC=y

o Benchmark results

     ==================================================================
     Test          : hackbench
     Units         : Normalized time in seconds
     Interpretation: Lower is better
     Statistic     : AMean
     ==================================================================
     Case:           tip[pct imp](CV)      proxy_exec[pct imp](CV)
      1-groups     1.00 [ -0.00](13.74)     1.03 [ -3.20]( 8.80)
      2-groups     1.00 [ -0.00]( 9.58)     1.04 [ -4.45]( 6.58)
      4-groups     1.00 [ -0.00]( 2.10)     1.02 [ -2.17]( 1.85)
      8-groups     1.00 [ -0.00]( 1.51)     0.99 [  1.42]( 1.47)
     16-groups     1.00 [ -0.00]( 1.10)     1.00 [  0.42]( 1.23)
     
     
     ==================================================================
     Test          : tbench
     Units         : Normalized throughput
     Interpretation: Higher is better
     Statistic     : AMean
     ==================================================================
     Clients:    tip[pct imp](CV)      proxy_exec[pct imp](CV)
         1     1.00 [  0.00]( 0.82)     1.02 [  1.78]( 1.06)
         2     1.00 [  0.00]( 1.13)     1.03 [  3.30]( 1.05)
         4     1.00 [  0.00]( 1.12)     1.02 [  1.86]( 1.05)
         8     1.00 [  0.00]( 0.93)     1.02 [  1.74]( 0.72)
        16     1.00 [  0.00]( 0.38)     1.02 [  2.28]( 1.35)
        32     1.00 [  0.00]( 0.66)     1.01 [  1.44]( 0.85)
        64     1.00 [  0.00]( 1.18)     1.02 [  1.98]( 1.28)
       128     1.00 [  0.00]( 1.12)     1.00 [  0.31]( 0.89)
       256     1.00 [  0.00]( 0.42)     1.00 [ -0.49]( 0.91)
       512     1.00 [  0.00]( 0.14)     1.01 [  0.94]( 0.33)
      1024     1.00 [  0.00]( 0.26)     1.01 [  0.95]( 0.24)
     
     
     ==================================================================
     Test          : stream-10
     Units         : Normalized Bandwidth, MB/s
     Interpretation: Higher is better
     Statistic     : HMean
     ==================================================================
     Test:       tip[pct imp](CV)      proxy_exec[pct imp](CV)
      Copy     1.00 [  0.00]( 8.37)     0.98 [ -2.35]( 8.36)
     Scale     1.00 [  0.00]( 2.85)     0.93 [ -7.21]( 7.24)
       Add     1.00 [  0.00]( 3.39)     0.93 [ -7.50]( 6.56)
     Triad     1.00 [  0.00]( 6.39)     1.04 [  4.18]( 7.77)
     
     
     ==================================================================
     Test          : stream-100
     Units         : Normalized Bandwidth, MB/s
     Interpretation: Higher is better
     Statistic     : HMean
     ==================================================================
     Test:       tip[pct imp](CV)      proxy_exec[pct imp](CV)
      Copy     1.00 [  0.00]( 3.91)     1.02 [  2.00]( 2.92)
     Scale     1.00 [  0.00]( 4.34)     0.99 [ -0.58]( 3.88)
       Add     1.00 [  0.00]( 4.14)     1.02 [  1.96]( 1.71)
     Triad     1.00 [  0.00]( 1.00)     0.99 [ -0.50]( 2.43)
     
     
     ==================================================================
     Test          : netperf
     Units         : Normalized Throughput
     Interpretation: Higher is better
     Statistic     : AMean
     ==================================================================
     Clients:         tip[pct imp](CV)      proxy_exec[pct imp](CV)
      1-clients     1.00 [  0.00]( 0.41)     1.02 [  2.40]( 0.32)
      2-clients     1.00 [  0.00]( 0.58)     1.02 [  2.21]( 0.30)
      4-clients     1.00 [  0.00]( 0.35)     1.02 [  2.20]( 0.63)
      8-clients     1.00 [  0.00]( 0.48)     1.02 [  1.98]( 0.50)
     16-clients     1.00 [  0.00]( 0.66)     1.02 [  2.19]( 0.49)
     32-clients     1.00 [  0.00]( 1.15)     1.02 [  2.17]( 0.75)
     64-clients     1.00 [  0.00]( 1.38)     1.01 [  1.43]( 1.39)
     128-clients    1.00 [  0.00]( 0.87)     1.01 [  0.60]( 1.09)
     256-clients    1.00 [  0.00]( 5.36)     1.01 [  0.54]( 4.29)
     512-clients    1.00 [  0.00](54.39)     0.99 [ -0.61](52.23)
     
     
     ==================================================================
     Test          : schbench
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)      proxy_exec[pct imp](CV)
       1     1.00 [ -0.00]( 8.54)     0.76 [ 23.91](23.47)
       2     1.00 [ -0.00]( 1.15)     0.90 [ 10.00]( 8.11)
       4     1.00 [ -0.00](13.46)     1.10 [-10.42](10.94)
       8     1.00 [ -0.00]( 7.14)     0.89 [ 10.53]( 3.92)
      16     1.00 [ -0.00]( 3.49)     1.00 [ -0.00]( 8.93)
      32     1.00 [ -0.00]( 1.06)     0.96 [  4.26](10.99)
      64     1.00 [ -0.00]( 5.48)     1.08 [ -8.14]( 4.03)
     128     1.00 [ -0.00](10.45)     1.09 [ -8.64](13.37)
     256     1.00 [ -0.00](31.14)     1.12 [-11.66](16.77)
     512     1.00 [ -0.00]( 1.52)     0.98 [  2.02]( 1.50)
     
     
     ==================================================================
     Test          : new-schbench-requests-per-second
     Units         : Normalized Requests per second
     Interpretation: Higher is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)      proxy_exec[pct imp](CV)
       1     1.00 [  0.00]( 1.07)     1.00 [ -0.29]( 0.53)
       2     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.15)
       4     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.30)
       8     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.00)
      16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
      32     1.00 [  0.00]( 3.41)     1.03 [  3.50]( 0.27)
      64     1.00 [  0.00]( 1.05)     1.00 [ -0.38]( 4.45)
     128     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.19)
     256     1.00 [  0.00]( 0.72)     0.99 [ -0.61]( 0.63)
     512     1.00 [  0.00]( 0.57)     1.00 [ -0.24]( 0.33)
     
     
     ==================================================================
     Test          : new-schbench-wakeup-latency
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)      proxy_exec[pct imp](CV)
       1     1.00 [ -0.00]( 9.11)     0.81 [ 18.75](10.25)
       2     1.00 [ -0.00]( 0.00)     0.86 [ 14.29](11.08)
       4     1.00 [ -0.00]( 3.78)     1.29 [-28.57](17.25)
       8     1.00 [ -0.00]( 0.00)     1.17 [-16.67]( 3.60)
      16     1.00 [ -0.00]( 7.56)     1.00 [ -0.00]( 6.88)
      32     1.00 [ -0.00](15.11)     0.80 [ 20.00]( 0.00)
      64     1.00 [ -0.00]( 9.63)     0.95 [  5.00]( 7.32)
     128     1.00 [ -0.00]( 4.86)     0.96 [  3.52]( 8.69)
     256     1.00 [ -0.00]( 2.34)     0.95 [  4.70]( 2.78)
     512     1.00 [ -0.00]( 0.40)     0.99 [  0.77]( 0.20)
     
     
     ==================================================================
     Test          : new-schbench-request-latency
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)      proxy_exec[pct imp](CV)
       1     1.00 [ -0.00]( 2.73)     1.02 [ -1.82]( 3.15)
       2     1.00 [ -0.00]( 0.87)     1.02 [ -2.16]( 1.90)
       4     1.00 [ -0.00]( 1.21)     1.04 [ -3.77]( 2.76)
       8     1.00 [ -0.00]( 0.27)     1.01 [ -1.31]( 2.01)
      16     1.00 [ -0.00]( 4.04)     1.00 [  0.27]( 0.77)
      32     1.00 [ -0.00]( 7.35)     0.89 [ 11.07]( 1.68)
      64     1.00 [ -0.00]( 3.54)     1.02 [ -1.55]( 1.47)
     128     1.00 [ -0.00]( 0.37)     1.00 [  0.41]( 0.11)
     256     1.00 [ -0.00]( 9.57)     0.91 [  8.84]( 3.64)
     512     1.00 [ -0.00]( 1.82)     1.02 [ -1.93]( 1.21)


     ==================================================================
     Test          : Various longer running benchmarks
     Units         : %diff in throughput reported
     Interpretation: Higher is better
     Statistic     : Median
     ==================================================================
     Benchmarks:                  %diff
     ycsb-cassandra               0.82%
     ycsb-mongodb                -0.45%
     deathstarbench-1x            2.44%
     deathstarbench-2x            1.88%
     deathstarbench-3x            0.09%
     deathstarbench-6x            1.94%
     hammerdb+mysql 16VU          3.65%
     hammerdb+mysql 64VU         -0.59%


> 
> As I’m trying to submit this work in smallish digestible pieces,
> in this series, I’m only submitting for review the logic that
> allows us to do the proxying if the lock owner is on the same
> runqueue as the blocked waiter: Introducing the
> CONFIG_SCHED_PROXY_EXEC option and boot-argument, reworking the
> task_struct::blocked_on pointer and wrapper functions, the
> initial sketch of the find_proxy_task() logic, some fixes for
> using split contexts, and finally same-runqueue proxying.
> 
> As I mentioned above, the series I’m submitting here has only
> barely changed from v17. The main differences are a slightly
> different order of checks for cases where we don’t actually do
> anything yet (more on why below), and the use of READ_ONCE() for
> the on_rq reads to avoid the compiler fusing loads, which bit me
> with the full series.

For this series (Single RunQueue Proxy), feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

I'll go and test the full series next and reply with the
results on this same thread sometime next week. Meanwhile I'll
try to queue a longer locktorture run over the weekend. I'll
let you know if I see anything out of the ordinary on my setup.

-- 
Thanks and Regards,
Prateek