[PATCH v4 00/13] Generalized Priority Inheritance via Proxy Execution v3

John Stultz posted 13 patches 2 years, 8 months ago
[PATCH v4 00/13] Generalized Priority Inheritance via Proxy Execution v3
Posted by John Stultz 2 years, 8 months ago
After having to catch up on other work after OSPM[1], I've finally
gotten back to focusing on Proxy Execution and wanted to send out this
next iteration of the patch series for review, testing, and feedback.
(Many thanks to folks who provided feedback on the last revision!)

As mentioned previously, this Proxy Execution series has a long history:
First described in a paper[2] by Watkins, Straub, Niehaus, then
implemented in patches from Peter Zijlstra, and extended with lots of
work by Juri Lelli, Valentin Schneider, and Connor O'Brien. (And thank
you to Steven Rostedt for providing additional details here!)

So again, many thanks to those above, as all the credit for this series
really is due to them - while the mistakes are likely mine.

Overview:
-----------
Proxy Execution is a generalized form of priority inheritance. Classic
priority inheritance works well for real-time tasks where there is a
straightforward priority order to how things are run. But it breaks
down when used between CFS or DEADLINE tasks, as there are lots
of parameters involved outside of just the task’s nice value when
selecting the next task to run (via pick_next_task()).  So ideally we
want to imbue the mutex holder with all the scheduler attributes of 
the blocked waiting task.

Proxy Execution does this via a few changes:
* Keeping tasks that are blocked on a mutex *on* the runqueue
* Keeping additional tracking of which mutex a task is blocked on, and
  which task holds a specific mutex.
* Special handling for when we select a blocked task to run, so that we
  instead run the mutex holder. 

The first of these is the most difficult to grasp (I do get the mental
friction here: blocked tasks on the *run*queue sounds like nonsense!
Personally I like to think of the runqueue in this model more like a
“task-selection queue”).
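
To make the tracking a bit more concrete, the extra state boils down to
roughly the following (a simplified sketch; the actual patches have the
real layout, locking rules, and comments):

struct task_struct {
        /* ... */
        struct mutex            *blocked_on;    /* mutex this task is blocked on, if any */
        raw_spinlock_t          blocked_lock;   /* serializes changes to blocked_on */
        /* ... */
};

The owner side mostly comes for free, since mutex::owner already records
the holding task; the series just exposes the existing __mutex_owner()
helper so the scheduler can map a blocked task's mutex back to its owner.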

By leaving blocked tasks on the runqueue, we allow pick_next_task() to
choose the task that should run next (even if it’s blocked waiting on a
mutex). If we do select a blocked task, we look at the task’s blocked_on
mutex and from there look at the mutex’s owner task. And in the simple
case, the task which owns the mutex is what we then choose to run,
allowing it to release the mutex.
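
Ignoring locking and all the hard cases, the simple-case resolution is
just a walk down the blocked_on chain to the owner (the real proxy() /
find_exec_ctx() code has to cope with sleeping owners, owners on other
CPUs, migrations, etc., so treat this purely as a sketch of the idea):

static struct task_struct *proxy_resolve_sketch(struct task_struct *picked)
{
        struct task_struct *owner = picked;

        while (owner->blocked_on) {
                struct task_struct *next = __mutex_owner(owner->blocked_on);

                if (!next)
                        return NULL;    /* owner released the lock; re-pick */
                owner = next;
        }

        return owner;   /* run this task on behalf of the picked task */
}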

This means that instead of just tracking “curr”, the scheduler needs to
track both the scheduler context (what was picked and all the state used
for scheduling decisions), and the execution context (what we’re
running).

In this way, the mutex owner is run “on behalf” of the blocked task
that was picked to run, essentially inheriting the scheduler context of
the blocked task.
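
In code terms the runqueue conceptually grows a second task pointer
alongside rq->curr (the field name below is just a placeholder, not
necessarily what the series uses; it is accessed through the
rq_selected() macro mentioned further down):

struct rq {
        /* ... */
        struct task_struct __rcu        *curr;          /* execution context: what is actually running */
        struct task_struct __rcu        *curr_selected; /* scheduler context: what pick_next_task() chose */
        /* ... */
};

#define rq_selected(rq)         ((rq)->curr_selected)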

As Connor outlined in a previous submission of this patch series,  this
raises a number of complicated situations:  The mutex owner might itself
be blocked on another mutex, or it could be sleeping, running on a
different CPU, in the process of migrating between CPUs, etc.

But the functionality provided by Proxy Execution is useful, as in
Android we have a number of cases where we are seeing priority inversion
(not unbounded, but longer than we’d like) between “foreground” and
“background” SCHED_NORMAL applications, so having a generalized solution
would be very valuable.

New in v4:
-------
* Fixed a deadlock caused by wait/wound mutexes creating circular
  blocked_on references, by clearing the blocked_on pointer on the task
  we are waking to wound/die.

* Tried to resolve an issue Dietmar raised with RT balancing where the
  proxy migration and push_rt_task() were fighting, ping-ponging tasks
  back and forth, caused by push_rt_task() migrating tasks even if they
  were in the chain that ends with the current running task. Though this
  likely needs more work, as this change resulted in different migration
  quirks (see below).

* Fixed a number of null-pointer dereferences that the changes were
  occasionally tripping over.

* Reworked patch that exposes __mutex_owner() to the scheduler to ensure
  it doesn’t expose it any more than necessary, as suggested by Peter. 

* To address some of Peter’s complaints, backed out the rq_curr()
  wrapper, and reworked rq_selected() to be a macro to avoid needing
  multiple accessors for READ_ONCE/rcu_dereference() type accesses. 

* Removed verbose legacy comments from previous developers of the series
  as Dietmar was finding them distracting when reviewing the diffs
  (Though, to ensure I heed the warnings from previous experienced
  travelers, I’ve preserved the comments/questions in a separate patch
  in my own development tree).

* Dropped patch that added *_task_blocked_on() wrappers to check locking
  correctness. Mostly as Peter didn’t seem happy with the wrappers in
  other patches, but I still think it's useful for testing (and the
  blocked_on locking still needs some work), so I’m carrying it in my
  personal development tree.

Issues still to address:
-----------
* Occasionally getting null scheduler entities from pick_next_entity() in
  CFS. I’m a little stumped as to where this is going awry just yet, and
  it delayed sending this out, but I figured it was worth getting it out
  for review on the other issues while I chase this down.

* Better deadlock handling in proxy(): With the ww_mutex issues
  resolved, we shouldn’t see circular blocked_on references, but a
  number of the bugs I’ve been chasing recently come from getting stuck
  with proxy() returning null, forcing a reselection over and over. These
  are still bugs to address, but my current thinking is that if we get
  stuck like this, we can start to remove the selected mutex-blocked
  tasks from the rq, and let them be woken from the mutex waiters list
  as is done currently? Thoughts here would be appreciated.

* More work on migration correctness (RT/DL load balancing, etc). I’m
  still seeing occasional trouble as CPU counts go up, which seems to be
  due to a bunch of tasks being proxy-migrated to a CPU, then having to
  migrate them all away at once (seeing lots of pick-again iterations).
  This may actually be correct, due to chain migration, but it ends up
  looking similar to a deadlock.

* “rq_selected()” naming. Peter doesn’t like it, but I’ve not thought of
  a better name. Open to suggestions.

* As discussed at OSPM, I want to split pick_next_task() up into two
  phases: selecting and setting the next task. Currently,
  pick_next_task() assumes the returned task will be run, which results
  in various side-effects in sched class logic when it’s run. This
  causes trouble should proxy() require us to re-select a task due to
  migration or other edge cases.

* CFS load balancing. Blocked tasks may carry forward load (PELT) to the
  lock owner's CPU, so the CPU may look like it is overloaded.

* I still want to push down the split scheduler and execution context
  awareness further through the scheduling code, as lots of logic still
  assumes there’s only a single “rq->curr” task.

* Optimization to avoid migrating blocked tasks (allowing for optimistic
  spinning) if the runnable lock-owner at the end of the blocked_on chain
  is already running.  


Performance:
-----------
This patch series switches mutexes to use handoff mode rather than
optimistic spinning. This is a potential concern where locks are under
high contention. However, so far in our initial performance analysis (on
both x86 and mobile devices) we’ve not seen major regressions. That
said, Chenyu did report a regression[3], which we’ll need to look
further into. As mentioned above, there may be some optimizations that
can help here, but my focus is on getting the code working well before I
spend time optimizing.

Review and feedback would be greatly appreciated!

If folks find it easier to test/tinker with, this patch series can also
be found here (along with some debug patches):
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v4-6.4-rc3

Thanks so much!
-john

[1] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
[2] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
[3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com

Connor O'Brien (1):
  sched: Attempt to fix rt/dl load balancing via chain level balance

John Stultz (3):
  sched: Unnest ttwu_runnable in prep for proxy-execution
  sched: Fix runtime accounting w/ proxy-execution
  sched: Fixups to find_exec_ctx

Juri Lelli (2):
  locking/mutex: make mutex::wait_lock irq safe
  locking/mutex: Expose __mutex_owner()

Peter Zijlstra (6):
  sched: Unify runtime accounting across classes
  locking/ww_mutex: Remove wakeups from under mutex::wait_lock
  locking/mutex: Rework task_struct::blocked_on
  locking/mutex: Add task_struct::blocked_lock to serialize changes to
    the blocked_on state
  sched: Split scheduler execution context
  sched: Add proxy execution

Valentin Schneider (1):
  sched/rt: Fix proxy/current (push,pull)ability

 include/linux/sched.h        |  10 +-
 include/linux/ww_mutex.h     |   3 +
 init/Kconfig                 |   7 +
 init/init_task.c             |   1 +
 kernel/Kconfig.locks         |   2 +-
 kernel/fork.c                |   6 +-
 kernel/locking/mutex-debug.c |   9 +-
 kernel/locking/mutex.c       | 113 ++++--
 kernel/locking/mutex.h       |  25 ++
 kernel/locking/ww_mutex.h    |  54 ++-
 kernel/sched/core.c          | 719 +++++++++++++++++++++++++++++++++--
 kernel/sched/cpudeadline.c   |  12 +-
 kernel/sched/cpudeadline.h   |   3 +-
 kernel/sched/cpupri.c        |  28 +-
 kernel/sched/cpupri.h        |   6 +-
 kernel/sched/deadline.c      | 187 +++++----
 kernel/sched/fair.c          |  99 +++--
 kernel/sched/rt.c            | 242 +++++++-----
 kernel/sched/sched.h         |  75 +++-
 kernel/sched/stop_task.c     |  13 +-
 20 files changed, 1284 insertions(+), 330 deletions(-)

-- 
2.41.0.rc0.172.g3f132b7071-goog
Re: [PATCH v4 00/13] Generalized Priority Inheritance via Proxy Execution v3
Posted by K Prateek Nayak 2 years, 7 months ago
Hello John,

On 6/1/2023 11:28 AM, John Stultz wrote:
> After having to catch up on other work after OSPM[1], I've finally
> gotten back to focusing on Proxy Execution and wanted to send out this
> next iteration of the patch series for review, testing, and feedback.
> (Many thanks to folks who provided feedback on the last revision!)
> 
> As mentioned previously, this Proxy Execution series has a long history:
> First described in a paper[2] by Watkins, Straub, Niehaus, then from
> patches from Peter Zijlstra, extended with lots of work by Juri Lelli,
> Valentin Schneider, and Connor O'Brien. (and thank you to Steven Rostedt
> for providing additional details here!)
> 
> So again, many thanks to those above, as all the credit for this series
> really is due to them - while the mistakes are likely mine.
> 
> Overview:
> —----------
> Proxy Execution is a generalized form of priority inheritance. Classic
> priority inheritance works well for real-time tasks where there is a
> straight forward priority order to how things are run. But it breaks
> down when used between CFS or DEADLINE tasks, as there are lots
> of parameters involved outside of just the task’s nice value when
> selecting the next task to run (via pick_next_task()).  So ideally we
> want to imbue the mutex holder with all the scheduler attributes of 
> the blocked waiting task.
> 
> Proxy Execution does this via a few changes:
> * Keeping tasks that are blocked on a mutex *on* the runqueue
> * Keeping additional tracking of which mutex a task is blocked on, and
>   which task holds a specific mutex.
> * Special handling for when we select a blocked task to run, so that we
>   instead run the mutex holder. 
> 
> The first of these is the most difficult to grasp (I do get the mental
> friction here: blocked tasks on the *run*queue sounds like nonsense!
> Personally I like to think of the runqueue in this model more like a
> “task-selection queue”).
> 
> By leaving blocked tasks on the runqueue, we allow pick_next_task() to
> choose the task that should run next (even if it’s blocked waiting on a
> mutex). If we do select a blocked task, we look at the task’s blocked_on
> mutex and from there look at the mutex’s owner task. And in the simple
> case, the task which owns the mutex is what we then choose to run,
> allowing it to release the mutex.
> 
> This means that instead of just tracking “curr”, the scheduler needs to
> track both the scheduler context (what was picked and all the state used
> for scheduling decisions), and the execution context (what we’re
> running)
> 
> In this way, the mutex owner is run “on behalf” of the blocked task
> that was picked to run, essentially inheriting the scheduler context of
> the blocked task.
> 
> As Connor outlined in a previous submission of this patch series,  this
> raises a number of complicated situations:  The mutex owner might itself
> be blocked on another mutex, or it could be sleeping, running on a
> different CPU, in the process of migrating between CPUs, etc.
> 
> But the functionality provided by Proxy Execution is useful, as in
> Android we have a number of cases where we are seeing priority inversion
> (not unbounded, but longer than we’d like) between “foreground” and
> “background” SCHED_NORMAL applications, so having a generalized solution
> would be very useful.
> 
> New in v4:
> —------
> * Fixed deadlock that was caused by wait/wound mutexes having circular
>   blocked_on references by clearing the blocked_on pointer on the task
>   we are waking to wound/die. 
> 
> * Tried to resolve an issue Dietmar raised with RT balancing where the
>   proxy migration and push_rt_task() were fighting ping-ponging tasks
>   back and forth, caused by push_rt_task() migrating tasks even if they
>   were in the chain that ends with the current running task. Though this
>   likely needs more work, as this change resulted in different migration
>   quirks (see below).
> 
> * Fixed a number of null-pointer traversals that the changes were
>   occasionally tripping on
> 
> * Reworked patch that exposes __mutex_owner() to the scheduler to ensure
>   it doesn’t expose it any more than necessary, as suggested by Peter. 
> 
> * To address some of Peter’s complaints, backed out the rq_curr()
>   wrapper, and reworked rq_selected() to be a macro to avoid needing
>   multiple accessors for READ_ONCE/rcu_dereference() type accesses. 
> 
> * Removed verbose legacy comments from previous developers of the series
>   as Dietmar was finding them distracting when reviewing the diffs
>   (Though, to ensure I heed the warnings from previous experienced
>   travelers, I’ve preserved the comments/questions in a separate patch
>   in my own development tree).
> 
> * Dropped patch that added *_task_blocked_on() wrappers to check locking
>   correctness. Mostly as Peter didn’t seem happy with the wrappers in
>   other patches, but I still think it's useful for testing (and the
>   blocked_on locking still needs some work), so I’m carrying it in my
>   personal development tree.
> 
> Issues still to address:
> —----------
> * Occasional getting null scheduler entities from pick_next_entity() in
>   CFS. I’m a little stumped as to where this is going awry just yet, and
>   delayed sending this out, but figured it was worth getting it out for
>   review on the other issues while I chase this down.

I'm consistently hitting the above issue early during boot when trying
to test the series on a dual-socket 3rd Generation EPYC platform
(2 x 64C/128T). Sharing the trace below:

    [   20.821808] ------------[ cut here ]------------
    [   20.826432] kernel BUG at kernel/sched/core.c:7462!
    [   20.831322] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
    [   20.836545] CPU: 250 PID: 0 Comm: swapper/250 Not tainted 6.4.0-rc4-proxy-execution-v4+ #474
    [   20.844976] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    [   20.852544] RIP: 0010:__schedule+0x18b6/0x20a0
    [   20.856998] Code: 0f 85 51 04 00 00 83 ad 50 ff ff ff 01 0f 85 05 e9 ff ff f3 0f 1e fa 48 8b 35 0e 0c fe 00 48 c7 c7 33 a1 c1 85 e8 ca 37 23 ff <0f> 0b 4c 89 ff 4c 8b 6d 98 e8 1c 82 00 00 4c 89 f7 e8 14 82 00 00
    [   20.875744] RSP: 0018:ffffbd1e4d1d7dd0 EFLAGS: 00010082
    [   20.880970] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000005
    [   20.888104] RDX: ffff9d4d0006b000 RSI: 0000000000000200 RDI: ffff9d4d0004d400
    [   20.895235] RBP: ffffbd1e4d1d7e98 R08: 0000000000000024 R09: ffffffffff7edbd0
    [   20.902369] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9d4d12e25a20
    [   20.909501] R13: ffff9dcbffab3840 R14: ffffbd1e4d1d7e50 R15: ffff9dcbff2b3840
    [   20.916632] FS:  0000000000000000(0000) GS:ffff9dcbffa80000(0000) knlGS:0000000000000000
    [   20.924709] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [   20.930449] CR2: 00007f92120c4800 CR3: 000000011477a002 CR4: 0000000000770ee0
    [   20.937581] PKRU: 55555554
    [   20.940292] Call Trace:
    [   20.942741]  <TASK>
    [   20.944845]  ? show_regs+0x6e/0x80
    [   20.948259]  ? die+0x3c/0xa0
    [   20.951146]  ? do_trap+0xd4/0xf0
    [   20.954377]  ? do_error_trap+0x75/0xa0
    [   20.958129]  ? __schedule+0x18b6/0x20a0
    [   20.961971]  ? exc_invalid_op+0x57/0x80
    [   20.965808]  ? __schedule+0x18b6/0x20a0
    [   20.969648]  ? asm_exc_invalid_op+0x1f/0x30
    [   20.973835]  ? __schedule+0x18b6/0x20a0
    [   20.977672]  ? cpuidle_enter_state+0xde/0x710
    [   20.982033]  schedule_idle+0x2e/0x50
    [   20.985614]  do_idle+0x15d/0x240
    [   20.988845]  cpu_startup_entry+0x24/0x30
    [   20.992772]  start_secondary+0x126/0x160
    [   20.996695]  secondary_startup_64_no_verify+0x10b/0x10b
    [   21.001924]  </TASK>
    [   21.004117] Modules linked in: sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_devintf ipmi_msghandler msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore 
    ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 
    i2c_algo_bit drm_shmem_helper drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect aesni_intel sysimgblt crypto_simd crc32_pclmul cryptd crct10dif_pclmul sha512_ssse3 xhci_pci tg3 drm 
    xhci_pci_renesas megaraid_sas wmi
    [   21.055707] Dumping ftrace buffer:
    [   21.059291] ---------------------------------
    [   21.063697]   <idle>-0       250dn.2. 21175635us : __schedule: JDB: BUG!!! pick next retry_count > 50
    [   21.072915] ---------------------------------
    [   21.077282] ---[ end trace 0000000000000000 ]---

    $ sed -n 7460,7462p kernel/sched/core.c
      if (retry_count++ > 50) {
              trace_printk("JDB: BUG!!! pick next retry_count > 50\n");
              BUG();

Hope this helps with the debugging. If you have a fix in mind that you
would like me to test, please do let me know.

> 
> * Better deadlock handling in proxy(): With the ww_mutex issues
>   resolved, we shouldn’t see circular blocked_on references, but a
>   number of the bugs I’ve been chasing recently come from getting stuck
>   with proxy() returning null forcing a reselection over and over. These
>   are still bugs to address, but my current thinking is that if we get
>   stuck like this, we can start to remove the selected mutex blocked
>   tasks from the rq, and let them be woken from the mutex waiters list
>   as is done currently? Thoughts here would be appreciated.
> 
> * More work on migration correctness (RT/DL load balancing,etc). I’m
>   still seeing occasional trouble as cpu counts go up which seems to be
>   due to a bunch of tasks being proxy migrated to a cpu, then having to
>   migrate them all away at once (seeing lots of pick again iterations).
>   This may actually be correct, due to chain migration, but it ends up
>   looking similar to a deadlock.
> 
> * “rq_selected()” naming. Peter doesn’t like it, but I’ve not thought of
>   a better name. Open to suggestions.
> 
> * As discussed at OSPM, I want to split pick_next_task() up into two
>   phases selecting and setting the next tasks, as currently
>   pick_next_task() assumes the returned task will be run which results
>   in various side-effects in sched class logic when it’s run. This
>   causes trouble should proxy() require us to re-select a task due to
>   migration or other edge cases.
> 
> * CFS load balancing. Blocked tasks may carry forward load (PELT) to the
>   lock owner's CPU, so CPU may look like it is overloaded.
> 
> * I still want to push down the split scheduler and execution context
>   awareness further through the scheduling code, as lots of logic still
>   assumes there’s only a single “rq->curr” task.
> 
> * Optimization to avoid migrating blocked tasks (allowing for optimistic
>   spinning) if the runnable lock-owner at the end of the blocked_on chain
>   is already running.  
> 
> 
> Performance:
> —----------
> This patch series switches mutexes to use handoff mode rather than
> optimistic spinning. This is a potential concern where locks are under
> high contention. However, so far in our initial performance analysis (on
> both x86 and mobile devices) we’ve not seen major regressions. That
> said, Chenyu did report a regression[3], which we’ll need to look
> further into. As mentioned above, there may be some optimizations that
> can help here, but my focus is on getting the code working well before I
> spend time optimizing.
> 
> Review and feedback would be greatly appreciated!
> 
> If folks find it easier to test/tinker with, this patch series can also
> be found here (along with some debug patches):
>   https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v4-6.4-rc3

I'm using the same tree with the HEAD at commit 821c8a48233f ("HACK:
sched: Add BUG_ONs for proxy related loops")

> 
> Thanks so much!
> -john
> 
> [1] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
> [2] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
> [3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/
> 
> Cc: Joel Fernandes <joelaf@google.com>
> Cc: Qais Yousef <qyousef@google.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Youssef Esmat <youssefesmat@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E . McKenney" <paulmck@kernel.org>
> Cc: kernel-team@android.com
> 
> Connor O'Brien (1):
>   sched: Attempt to fix rt/dl load balancing via chain level balance
> 
> John Stultz (3):
>   sched: Unnest ttwu_runnable in prep for proxy-execution
>   sched: Fix runtime accounting w/ proxy-execution
>   sched: Fixups to find_exec_ctx
> 
> Juri Lelli (2):
>   locking/mutex: make mutex::wait_lock irq safe
>   locking/mutex: Expose __mutex_owner()
> 
> Peter Zijlstra (6):
>   sched: Unify runtime accounting across classes
>   locking/ww_mutex: Remove wakeups from under mutex::wait_lock
>   locking/mutex: Rework task_struct::blocked_on
>   locking/mutex: Add task_struct::blocked_lock to serialize changes to
>     the blocked_on state
>   sched: Split scheduler execution context
>   sched: Add proxy execution
> 
> Valentin Schneider (1):
>   sched/rt: Fix proxy/current (push,pull)ability
> 
>  include/linux/sched.h        |  10 +-
>  include/linux/ww_mutex.h     |   3 +
>  init/Kconfig                 |   7 +
>  init/init_task.c             |   1 +
>  kernel/Kconfig.locks         |   2 +-
>  kernel/fork.c                |   6 +-
>  kernel/locking/mutex-debug.c |   9 +-
>  kernel/locking/mutex.c       | 113 ++++--
>  kernel/locking/mutex.h       |  25 ++
>  kernel/locking/ww_mutex.h    |  54 ++-
>  kernel/sched/core.c          | 719 +++++++++++++++++++++++++++++++++--
>  kernel/sched/cpudeadline.c   |  12 +-
>  kernel/sched/cpudeadline.h   |   3 +-
>  kernel/sched/cpupri.c        |  28 +-
>  kernel/sched/cpupri.h        |   6 +-
>  kernel/sched/deadline.c      | 187 +++++----
>  kernel/sched/fair.c          |  99 +++--
>  kernel/sched/rt.c            | 242 +++++++-----
>  kernel/sched/sched.h         |  75 +++-
>  kernel/sched/stop_task.c     |  13 +-
>  20 files changed, 1284 insertions(+), 330 deletions(-)
> 

--
Thanks and Regards,
Prateek
Re: [PATCH v4 00/13] Generalized Priority Inheritance via Proxy Execution v3
Posted by John Stultz 2 years, 7 months ago
On Mon, Jun 12, 2023 at 10:21 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 6/1/2023 11:28 AM, John Stultz wrote:
> > After having to catch up on other work after OSPM[1], I've finally
> > gotten back to focusing on Proxy Execution and wanted to send out this
> > next iteration of the patch series for review, testing, and feedback.
> > (Many thanks to folks who provided feedback on the last revision!)
> >
> > As mentioned previously, this Proxy Execution series has a long history:
> > First described in a paper[2] by Watkins, Straub, Niehaus, then from
> > patches from Peter Zijlstra, extended with lots of work by Juri Lelli,
> > Valentin Schneider, and Connor O'Brien. (and thank you to Steven Rostedt
> > for providing additional details here!)
> >
> > So again, many thanks to those above, as all the credit for this series
> > really is due to them - while the mistakes are likely mine.
> >
> > Overview:
> > —----------
> > Proxy Execution is a generalized form of priority inheritance. Classic
> > priority inheritance works well for real-time tasks where there is a
> > straight forward priority order to how things are run. But it breaks
> > down when used between CFS or DEADLINE tasks, as there are lots
> > of parameters involved outside of just the task’s nice value when
> > selecting the next task to run (via pick_next_task()).  So ideally we
> > want to imbue the mutex holder with all the scheduler attributes of
> > the blocked waiting task.
> >
> > Proxy Execution does this via a few changes:
> > * Keeping tasks that are blocked on a mutex *on* the runqueue
> > * Keeping additional tracking of which mutex a task is blocked on, and
> >   which task holds a specific mutex.
> > * Special handling for when we select a blocked task to run, so that we
> >   instead run the mutex holder.
> >
> > The first of these is the most difficult to grasp (I do get the mental
> > friction here: blocked tasks on the *run*queue sounds like nonsense!
> > Personally I like to think of the runqueue in this model more like a
> > “task-selection queue”).
> >
> > By leaving blocked tasks on the runqueue, we allow pick_next_task() to
> > choose the task that should run next (even if it’s blocked waiting on a
> > mutex). If we do select a blocked task, we look at the task’s blocked_on
> > mutex and from there look at the mutex’s owner task. And in the simple
> > case, the task which owns the mutex is what we then choose to run,
> > allowing it to release the mutex.
> >
> > This means that instead of just tracking “curr”, the scheduler needs to
> > track both the scheduler context (what was picked and all the state used
> > for scheduling decisions), and the execution context (what we’re
> > running)
> >
> > In this way, the mutex owner is run “on behalf” of the blocked task
> > that was picked to run, essentially inheriting the scheduler context of
> > the blocked task.
> >
> > As Connor outlined in a previous submission of this patch series,  this
> > raises a number of complicated situations:  The mutex owner might itself
> > be blocked on another mutex, or it could be sleeping, running on a
> > different CPU, in the process of migrating between CPUs, etc.
> >
> > But the functionality provided by Proxy Execution is useful, as in
> > Android we have a number of cases where we are seeing priority inversion
> > (not unbounded, but longer than we’d like) between “foreground” and
> > “background” SCHED_NORMAL applications, so having a generalized solution
> > would be very useful.
> >
> > New in v4:
> > —------
> > * Fixed deadlock that was caused by wait/wound mutexes having circular
> >   blocked_on references by clearing the blocked_on pointer on the task
> >   we are waking to wound/die.
> >
> > * Tried to resolve an issue Dietmar raised with RT balancing where the
> >   proxy migration and push_rt_task() were fighting ping-ponging tasks
> >   back and forth, caused by push_rt_task() migrating tasks even if they
> >   were in the chain that ends with the current running task. Though this
> >   likely needs more work, as this change resulted in different migration
> >   quirks (see below).
> >
> > * Fixed a number of null-pointer traversals that the changes were
> >   occasionally tripping on
> >
> > * Reworked patch that exposes __mutex_owner() to the scheduler to ensure
> >   it doesn’t expose it any more than necessary, as suggested by Peter.
> >
> > * To address some of Peter’s complaints, backed out the rq_curr()
> >   wrapper, and reworked rq_selected() to be a macro to avoid needing
> >   multiple accessors for READ_ONCE/rcu_dereference() type accesses.
> >
> > * Removed verbose legacy comments from previous developers of the series
> >   as Dietmar was finding them distracting when reviewing the diffs
> >   (Though, to ensure I heed the warnings from previous experienced
> >   travelers, I’ve preserved the comments/questions in a separate patch
> >   in my own development tree).
> >
> > * Dropped patch that added *_task_blocked_on() wrappers to check locking
> >   correctness. Mostly as Peter didn’t seem happy with the wrappers in
> >   other patches, but I still think it's useful for testing (and the
> >   blocked_on locking still needs some work), so I’m carrying it in my
> >   personal development tree.
> >
> > Issues still to address:
> > —----------
> > * Occasional getting null scheduler entities from pick_next_entity() in
> >   CFS. I’m a little stumped as to where this is going awry just yet, and
> >   delayed sending this out, but figured it was worth getting it out for
> >   review on the other issues while I chase this down.
>
> I'm consistently hitting the above issue early during the boot when I was
> trying to test the series on a dual socket 3rd Generation EPYC platform
> (2 x 64C/128T). Sharing the trace below:
>
>     [   20.821808] ------------[ cut here ]------------
>     [   20.826432] kernel BUG at kernel/sched/core.c:7462!
>     [   20.831322] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>     [   20.836545] CPU: 250 PID: 0 Comm: swapper/250 Not tainted 6.4.0-rc4-proxy-execution-v4+ #474
>     [   20.844976] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
>     [   20.852544] RIP: 0010:__schedule+0x18b6/0x20a0
>     [   20.856998] Code: 0f 85 51 04 00 00 83 ad 50 ff ff ff 01 0f 85 05 e9 ff ff f3 0f 1e fa 48 8b 35 0e 0c fe 00 48 c7 c7 33 a1 c1 85 e8 ca 37 23 ff <0f> 0b 4c 89 ff 4c 8b 6d 98 e8 1c 82 00 00 4c 89 f7 e8 14 82 00 00
>     [   20.875744] RSP: 0018:ffffbd1e4d1d7dd0 EFLAGS: 00010082
>     [   20.880970] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000005
>     [   20.888104] RDX: ffff9d4d0006b000 RSI: 0000000000000200 RDI: ffff9d4d0004d400
>     [   20.895235] RBP: ffffbd1e4d1d7e98 R08: 0000000000000024 R09: ffffffffff7edbd0
>     [   20.902369] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9d4d12e25a20
>     [   20.909501] R13: ffff9dcbffab3840 R14: ffffbd1e4d1d7e50 R15: ffff9dcbff2b3840
>     [   20.916632] FS:  0000000000000000(0000) GS:ffff9dcbffa80000(0000) knlGS:0000000000000000
>     [   20.924709] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     [   20.930449] CR2: 00007f92120c4800 CR3: 000000011477a002 CR4: 0000000000770ee0
>     [   20.937581] PKRU: 55555554
>     [   20.940292] Call Trace:
>     [   20.942741]  <TASK>
>     [   20.944845]  ? show_regs+0x6e/0x80
>     [   20.948259]  ? die+0x3c/0xa0
>     [   20.951146]  ? do_trap+0xd4/0xf0
>     [   20.954377]  ? do_error_trap+0x75/0xa0
>     [   20.958129]  ? __schedule+0x18b6/0x20a0
>     [   20.961971]  ? exc_invalid_op+0x57/0x80
>     [   20.965808]  ? __schedule+0x18b6/0x20a0
>     [   20.969648]  ? asm_exc_invalid_op+0x1f/0x30
>     [   20.973835]  ? __schedule+0x18b6/0x20a0
>     [   20.977672]  ? cpuidle_enter_state+0xde/0x710
>     [   20.982033]  schedule_idle+0x2e/0x50
>     [   20.985614]  do_idle+0x15d/0x240
>     [   20.988845]  cpu_startup_entry+0x24/0x30
>     [   20.992772]  start_secondary+0x126/0x160
>     [   20.996695]  secondary_startup_64_no_verify+0x10b/0x10b
>     [   21.001924]  </TASK>
>     [   21.004117] Modules linked in: sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_devintf ipmi_msghandler msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore
>     ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200
>     i2c_algo_bit drm_shmem_helper drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect aesni_intel sysimgblt crypto_simd crc32_pclmul cryptd crct10dif_pclmul sha512_ssse3 xhci_pci tg3 drm
>     xhci_pci_renesas megaraid_sas wmi
>     [   21.055707] Dumping ftrace buffer:
>     [   21.059291] ---------------------------------
>     [   21.063697]   <idle>-0       250dn.2. 21175635us : __schedule: JDB: BUG!!! pick next retry_count > 50
>     [   21.072915] ---------------------------------
>     [   21.077282] ---[ end trace 0000000000000000 ]---
>
>     $ sed -n 7460,7462p kernel/sched/core.c
>       if (retry_count++ > 50) {
>               trace_printk("JDB: BUG!!! pick next retry_count > 50\n");
>               BUG();
>
> Hope it helps during the debug. If you have a fix in mind that you
> would like me to test, please do let me know.

Thank you for the testing and feedback here! I really appreciate it!
And my apologies that you're hitting trouble here!

> > * Better deadlock handling in proxy(): With the ww_mutex issues
> >   resolved, we shouldn’t see circular blocked_on references, but a
> >   number of the bugs I’ve been chasing recently come from getting stuck
> >   with proxy() returning null forcing a reselection over and over. These
> >   are still bugs to address, but my current thinking is that if we get
> >   stuck like this, we can start to remove the selected mutex blocked
> >   tasks from the rq, and let them be woken from the mutex waiters list
> >   as is done currently? Thoughts here would be appreciated.
> >
> > * More work on migration correctness (RT/DL load balancing,etc). I’m
> >   still seeing occasional trouble as cpu counts go up which seems to be
> >   due to a bunch of tasks being proxy migrated to a cpu, then having to
> >   migrate them all away at once (seeing lots of pick again iterations).
> >   This may actually be correct, due to chain migration, but it ends up
> >   looking similar to a deadlock.

So I suspect what you're seeing is a combination of the two items
above. With 128 threads, my deadlock detection BUG() at 50 is probably
far too low, as migration chains can get pretty long.
Clearly BUG'ing at a fixed count is the wrong approach (but was
helpful for quickly catching problems and debugging in my
environment).

My main priority is trying to get the null se crashes resolved (almost
have my hands around it, but took a little break from it end of last
week), and once I have something there I'll update and share my WIP
tree:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-WIP

It will include some extra trace logging, and I'll reach out to see if
you can capture the issue again.

Thanks so much again for your interest and help in testing!
-john
Re: [PATCH v4 00/13] Generalized Priority Inheritance via Proxy Execution v3
Posted by K Prateek Nayak 2 years, 7 months ago
Hello John,

On 6/13/2023 12:22 AM, John Stultz wrote:
> On Mon, Jun 12, 2023 at 10:21 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>> On 6/1/2023 11:28 AM, John Stultz wrote:
>>> [..snip..]
>>>
>>> Issues still to address:
>>> —----------
>>> * Occasional getting null scheduler entities from pick_next_entity() in
>>>   CFS. I’m a little stumped as to where this is going awry just yet, and
>>>   delayed sending this out, but figured it was worth getting it out for
>>>   review on the other issues while I chase this down.
>>
>> I'm consistently hitting the above issue early during the boot when I was
>> trying to test the series on a dual socket 3rd Generation EPYC platform
>> (2 x 64C/128T). Sharing the trace below:
>>
>>     [   20.821808] ------------[ cut here ]------------
>>     [   20.826432] kernel BUG at kernel/sched/core.c:7462!
>>     [   20.831322] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>     [   20.836545] CPU: 250 PID: 0 Comm: swapper/250 Not tainted 6.4.0-rc4-proxy-execution-v4+ #474
>>     [   20.844976] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
>>     [   20.852544] RIP: 0010:__schedule+0x18b6/0x20a0
>>     [   20.856998] Code: 0f 85 51 04 00 00 83 ad 50 ff ff ff 01 0f 85 05 e9 ff ff f3 0f 1e fa 48 8b 35 0e 0c fe 00 48 c7 c7 33 a1 c1 85 e8 ca 37 23 ff <0f> 0b 4c 89 ff 4c 8b 6d 98 e8 1c 82 00 00 4c 89 f7 e8 14 82 00 00
>>     [   20.875744] RSP: 0018:ffffbd1e4d1d7dd0 EFLAGS: 00010082
>>     [   20.880970] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000005
>>     [   20.888104] RDX: ffff9d4d0006b000 RSI: 0000000000000200 RDI: ffff9d4d0004d400
>>     [   20.895235] RBP: ffffbd1e4d1d7e98 R08: 0000000000000024 R09: ffffffffff7edbd0
>>     [   20.902369] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9d4d12e25a20
>>     [   20.909501] R13: ffff9dcbffab3840 R14: ffffbd1e4d1d7e50 R15: ffff9dcbff2b3840
>>     [   20.916632] FS:  0000000000000000(0000) GS:ffff9dcbffa80000(0000) knlGS:0000000000000000
>>     [   20.924709] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>     [   20.930449] CR2: 00007f92120c4800 CR3: 000000011477a002 CR4: 0000000000770ee0
>>     [   20.937581] PKRU: 55555554
>>     [   20.940292] Call Trace:
>>     [   20.942741]  <TASK>
>>     [   20.944845]  ? show_regs+0x6e/0x80
>>     [   20.948259]  ? die+0x3c/0xa0
>>     [   20.951146]  ? do_trap+0xd4/0xf0
>>     [   20.954377]  ? do_error_trap+0x75/0xa0
>>     [   20.958129]  ? __schedule+0x18b6/0x20a0
>>     [   20.961971]  ? exc_invalid_op+0x57/0x80
>>     [   20.965808]  ? __schedule+0x18b6/0x20a0
>>     [   20.969648]  ? asm_exc_invalid_op+0x1f/0x30
>>     [   20.973835]  ? __schedule+0x18b6/0x20a0
>>     [   20.977672]  ? cpuidle_enter_state+0xde/0x710
>>     [   20.982033]  schedule_idle+0x2e/0x50
>>     [   20.985614]  do_idle+0x15d/0x240
>>     [   20.988845]  cpu_startup_entry+0x24/0x30
>>     [   20.992772]  start_secondary+0x126/0x160
>>     [   20.996695]  secondary_startup_64_no_verify+0x10b/0x10b
>>     [   21.001924]  </TASK>
>>     [   21.004117] Modules linked in: sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_devintf ipmi_msghandler msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore
>>     ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200
>>     i2c_algo_bit drm_shmem_helper drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect aesni_intel sysimgblt crypto_simd crc32_pclmul cryptd crct10dif_pclmul sha512_ssse3 xhci_pci tg3 drm
>>     xhci_pci_renesas megaraid_sas wmi
>>     [   21.055707] Dumping ftrace buffer:
>>     [   21.059291] ---------------------------------
>>     [   21.063697]   <idle>-0       250dn.2. 21175635us : __schedule: JDB: BUG!!! pick next retry_count > 50
>>     [   21.072915] ---------------------------------
>>     [   21.077282] ---[ end trace 0000000000000000 ]---
>>
>>     $ sed -n 7460,7462p kernel/sched/core.c
>>       if (retry_count++ > 50) {
>>               trace_printk("JDB: BUG!!! pick next retry_count > 50\n");
>>               BUG();
>>
>> Hope it helps during the debug. If you have a fix in mind that you
>> would like me to test, please do let me know.
> 
> Thank you for the testing and feedback here! I really appreciate it!
> And my apologies that you're hitting trouble here!

No worries! Better to hit the snags now than later :)

> 
>>> * Better deadlock handling in proxy(): With the ww_mutex issues
>>>   resolved, we shouldn’t see circular blocked_on references, but a
>>>   number of the bugs I’ve been chasing recently come from getting stuck
>>>   with proxy() returning null forcing a reselection over and over. These
>>>   are still bugs to address, but my current thinking is that if we get
>>>   stuck like this, we can start to remove the selected mutex blocked
>>>   tasks from the rq, and let them be woken from the mutex waiters list
>>>   as is done currently? Thoughts here would be appreciated.
>>>
>>> * More work on migration correctness (RT/DL load balancing,etc). I’m
>>>   still seeing occasional trouble as cpu counts go up which seems to be
>>>   due to a bunch of tasks being proxy migrated to a cpu, then having to
>>>   migrate them all away at once (seeing lots of pick again iterations).
>>>   This may actually be correct, due to chain migration, but it ends up
>>>   looking similar to a deadlock.
> 
> So I suspect what you're seeing is a combination of the two items
> above. With 128 threads, my deadlock detection BUG() at 50 is probably
> far too low, as migration chains can get pretty long.
> Clearly BUG'ing at a fixed count is the wrong approach (but was
> helpful for quickly catching problems and debugging in my
> environment).

Ah! I see. Thank you for clarifying. Let me check if going to commit
259a8134aa27 ("sched: Potential fixup, not sure why rq_selected() is used
here") helps.

> 
> My main priority is trying to get the null se crashes resolved (almost
> have my hands around it, but took a little break from it end of last
> week), and once I have something there I'll update and share my WIP
> tree:
>   https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-WIP

I'll keep an eye out for any updates on the branch.

> 
> To include some extra trace logging and I'll reach out to see if you
> can capture the issue again.
> 
> Thanks so much again for your interest and help in testing!
> -john

 
--
Thanks and Regards,
Prateek
Re: [PATCH v4 00/13] Generalized Priority Inheritance via Proxy Execution v3
Posted by Dietmar Eggemann 2 years, 7 months ago
On 01/06/2023 07:58, John Stultz wrote:
> After having to catch up on other work after OSPM[1], I've finally
> gotten back to focusing on Proxy Execution and wanted to send out this
> next iteration of the patch series for review, testing, and feedback.
> (Many thanks to folks who provided feedback on the last revision!)
> 
> As mentioned previously, this Proxy Execution series has a long history:
> First described in a paper[2] by Watkins, Straub, Niehaus, then from
> patches from Peter Zijlstra, extended with lots of work by Juri Lelli,
> Valentin Schneider, and Connor O'Brien. (and thank you to Steven Rostedt
> for providing additional details here!)
> 
> So again, many thanks to those above, as all the credit for this series
> really is due to them - while the mistakes are likely mine.
> 
> Overview:
> —----------
> Proxy Execution is a generalized form of priority inheritance. Classic
> priority inheritance works well for real-time tasks where there is a
> straight forward priority order to how things are run. But it breaks
> down when used between CFS or DEADLINE tasks, as there are lots
> of parameters involved outside of just the task’s nice value when
> selecting the next task to run (via pick_next_task()).  So ideally we
> want to imbue the mutex holder with all the scheduler attributes of 
> the blocked waiting task.
> 
> Proxy Execution does this via a few changes:
> * Keeping tasks that are blocked on a mutex *on* the runqueue
> * Keeping additional tracking of which mutex a task is blocked on, and
>   which task holds a specific mutex.
> * Special handling for when we select a blocked task to run, so that we
>   instead run the mutex holder. 
> 
> The first of these is the most difficult to grasp (I do get the mental
> friction here: blocked tasks on the *run*queue sounds like nonsense!
> Personally I like to think of the runqueue in this model more like a
> “task-selection queue”).
> 
> By leaving blocked tasks on the runqueue, we allow pick_next_task() to
> choose the task that should run next (even if it’s blocked waiting on a
> mutex). If we do select a blocked task, we look at the task’s blocked_on
> mutex and from there look at the mutex’s owner task. And in the simple
> case, the task which owns the mutex is what we then choose to run,
> allowing it to release the mutex.
> 
> This means that instead of just tracking “curr”, the scheduler needs to
> track both the scheduler context (what was picked and all the state used
> for scheduling decisions), and the execution context (what we’re
> running)
> 
> In this way, the mutex owner is run “on behalf” of the blocked task
> that was picked to run, essentially inheriting the scheduler context of
> the blocked task.
> 
> As Connor outlined in a previous submission of this patch series,  this
> raises a number of complicated situations:  The mutex owner might itself
> be blocked on another mutex, or it could be sleeping, running on a
> different CPU, in the process of migrating between CPUs, etc.
> 
> But the functionality provided by Proxy Execution is useful, as in
> Android we have a number of cases where we are seeing priority inversion
> (not unbounded, but longer than we’d like) between “foreground” and
> “background” SCHED_NORMAL applications, so having a generalized solution
> would be very useful.
> 
> New in v4:
> —------
> * Fixed deadlock that was caused by wait/wound mutexes having circular
>   blocked_on references by clearing the blocked_on pointer on the task
>   we are waking to wound/die.

I always get this when running `insmod ./test-ww_mutex.ko` with the
default SCHED_FEAT(TTWU_QUEUE, true) and this fix applied. I don't
understand the issue fully yet:

qemu-system-x86_64 ... -smp cores=64 -enable-kvm ...

[   21.109134] Beginning ww mutex selftests
[   26.397545] ------------[ cut here ]------------
[   26.397951] WARNING: CPU: 41 PID: 0 at kernel/sched/core.c:4126 sched_ttwu_pending+0xc5/0x120
[   26.398590] Modules linked in: test_ww_mutex(+)
[   26.398916] CPU: 41 PID: 0 Comm: swapper/41 Not tainted 6.4.0-rc1-00054-gb4baf2e792df-dirty #9
[   26.399506] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[   26.400193] RIP: 0010:sched_ttwu_pending+0xc5/0x120
[   26.400515] Code: c8 75 ba 41 c7 46 48 00 00 00 00 4c 89 f7 e8 32 b5 d4 00 41 f7 c4 00 02 00 00 74 01
                     fb 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc <0f> 0b 44 8b 45 14 8b 8d 20 05 00 00 48
                     8d 95 18 07 00 00 48 c7 c6
[   26.401840] RSP: 0018:ffffa31940990fc0 EFLAGS: 00010006
[   26.402178] RAX: 0000000000000012 RBX: ffffffffffffffc8 RCX: 00000006256a6d58
[   26.402631] RDX: 000000000001c9f4 RSI: ffff9dc5012fe180 RDI: ffffffff97320a40
[   26.403096] RBP: ffff9dc50552d140 R08: 00000006256a6d58 R09: 0000000000000029
[   26.403607] R10: 0000000000000000 R11: ffffa31940990ff8 R12: 0000000000000086
[   26.404117] R13: ffffffffffffffc8 R14: ffff9dc57d86b3c0 R15: 0000000000000000
[   26.404691] FS:  0000000000000000(0000) GS:ffff9dc57d840000(0000) knlGS:0000000000000000
[   26.405236] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   26.405663] CR2: 00007ffeda3d7b00 CR3: 0000000013e2e003 CR4: 0000000000370ee0
[   26.406236] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   26.406715] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   26.407219] Call Trace:
[   26.407390]  <IRQ>
[   26.407571]  __sysvec_call_function_single+0x28/0xc0
[   26.407988]  sysvec_call_function_single+0x69/0x90
[   26.408312]  </IRQ>
[   26.408467]  <TASK>
[   26.408612]  asm_sysvec_call_function_single+0x1a/0x20
[   26.408992] RIP: 0010:default_idle+0xf/0x20
[   26.409267] Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
                     f3 0f 1e fa 66 90 0f 00 2d d3 00 40 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f
                     84 00 00 00 00 00 90 90 90 90 90
[   26.410629] RSP: 0018:ffffa319401cbed8 EFLAGS: 00000252
[   26.411073] RAX: ffff9dc57d867f80 RBX: ffff9dc5012fe180 RCX: 4000000000000000
[   26.411625] RDX: 0000000000000001 RSI: 0000000000000087 RDI: 00000000000ed25c
[   26.411788] ------------[ cut here ]------------

extra debug:

sched_ttwu_pending [kworker/u128:87 738] task_cpu(p)=29 cpu_of(rq)=41

[...]
Re: [PATCH v4 00/13] Generalized Priority Inheritance via Proxy Execution v3
Posted by John Stultz 2 years, 7 months ago
On Tue, Jun 13, 2023 at 10:36 AM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 01/06/2023 07:58, John Stultz wrote:
> > After having to catch up on other work after OSPM[1], I've finally
> > gotten back to focusing on Proxy Execution and wanted to send out this
> > next iteration of the patch series for review, testing, and feedback.
> > (Many thanks to folks who provided feedback on the last revision!)
> >
> > As mentioned previously, this Proxy Execution series has a long history:
> > First described in a paper[2] by Watkins, Straub, Niehaus, then from
> > patches from Peter Zijlstra, extended with lots of work by Juri Lelli,
> > Valentin Schneider, and Connor O'Brien. (and thank you to Steven Rostedt
> > for providing additional details here!)
> >
> > So again, many thanks to those above, as all the credit for this series
> > really is due to them - while the mistakes are likely mine.
> >
> > Overview:
> > —----------
> > Proxy Execution is a generalized form of priority inheritance. Classic
> > priority inheritance works well for real-time tasks where there is a
> > straight forward priority order to how things are run. But it breaks
> > down when used between CFS or DEADLINE tasks, as there are lots
> > of parameters involved outside of just the task’s nice value when
> > selecting the next task to run (via pick_next_task()).  So ideally we
> > want to imbue the mutex holder with all the scheduler attributes of
> > the blocked waiting task.
> >
> > Proxy Execution does this via a few changes:
> > * Keeping tasks that are blocked on a mutex *on* the runqueue
> > * Keeping additional tracking of which mutex a task is blocked on, and
> >   which task holds a specific mutex.
> > * Special handling for when we select a blocked task to run, so that we
> >   instead run the mutex holder.
> >
> > The first of these is the most difficult to grasp (I do get the mental
> > friction here: blocked tasks on the *run*queue sounds like nonsense!
> > Personally I like to think of the runqueue in this model more like a
> > “task-selection queue”).
> >
> > By leaving blocked tasks on the runqueue, we allow pick_next_task() to
> > choose the task that should run next (even if it’s blocked waiting on a
> > mutex). If we do select a blocked task, we look at the task’s blocked_on
> > mutex and from there look at the mutex’s owner task. And in the simple
> > case, the task which owns the mutex is what we then choose to run,
> > allowing it to release the mutex.
> >
> > This means that instead of just tracking “curr”, the scheduler needs to
> > track both the scheduler context (what was picked and all the state used
> > for scheduling decisions), and the execution context (what we’re
> > running)
> >
> > In this way, the mutex owner is run “on behalf” of the blocked task
> > that was picked to run, essentially inheriting the scheduler context of
> > the blocked task.
> >
> > As Connor outlined in a previous submission of this patch series,  this
> > raises a number of complicated situations:  The mutex owner might itself
> > be blocked on another mutex, or it could be sleeping, running on a
> > different CPU, in the process of migrating between CPUs, etc.
> >
> > But the functionality provided by Proxy Execution is useful, as in
> > Android we have a number of cases where we are seeing priority inversion
> > (not unbounded, but longer than we’d like) between “foreground” and
> > “background” SCHED_NORMAL applications, so having a generalized solution
> > would be very useful.
> >
> > New in v4:
> > —------
> > * Fixed deadlock that was caused by wait/wound mutexes having circular
> >   blocked_on references by clearing the blocked_on pointer on the task
> >   we are waking to wound/die.
>
> I always get this when running `insmod ./test-ww_mutex.ko` with default
> SCHED_FEAT(TTWU_QUEUE, true) with this fix. Don't understand the issue
> fully yet:
>
> qemu-system-x86_64 ... -smp cores=64 -enable-kvm ...
>
> [   21.109134] Beginning ww mutex selftests
> [   26.397545] ------------[ cut here ]------------
> [   26.397951] WARNING: CPU: 41 PID: 0 at kernel/sched/core.c:4126 sched_ttwu_pending+0xc5/0x120

Thanks for testing it out and sharing this!

Yeah. I've seen this as well, and I suspect this is tied to the
runqueue accounting issues that are causing the null sched entity
issue I've been chasing.

One issue seems to be the blocked_on manipulation done in the
try_to_wake_up() path. For mutex-blocked tasks, try_to_wake_up() needs
to clear blocked_on so the task can actually run and try to acquire
the mutex. This is why __mutex_lock_slowpath_common() sets blocked_on
twice. I’m seeing some trouble where try_to_wake_up() ends up
migrating the still technically blocked task (as it’s not yet acquired
the mutex, just waking to try to take it) back to p->wake_cpu -
but the issue is that the task has not been deactivated from the
current runqueue.

The "fix" for ww_mutex circular dependencies resolution in
__ww_mutex_die() is similar, as it clears the blocked_on value and
tries to wake the task, but then ttwu often migrates the task without
deactivating it from the current rq.

Unfortunately once the rq accounting is gone wrong, the failures seem
multi-modal, so trying to get enough tracing logs in place to get your
head around what's going wrong in one case is hard because run-to-run
it will fail differently.

In my current tree I’ve extended this so we have a separate
task->blocked_on_waking flag, so we don’t mess with the
task->blocked_on state in ttwu & __ww_mutex_die() and can distinguish
between waking a blocked task and waking an unblocked task. However,
that in itself isn’t working yet, so I’m currently deconstructing the
patch series a bit to try to understand that logic better and to see if
we can avoid as much special-casing in the ttwu path (I’d like to be
able to take mutex-blocked tasks and also just drop them from the
runqueue - as we do now - and have things work, so we have fallback
options if proxy() migrations are taking too long to resolve).
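
Roughly, the shape of that idea is the following (names and details are
illustrative of the direction, not the exact code in my tree, and the
irq/locking rules are elided):

static void wake_blocked_waiter_sketch(struct task_struct *p)
{
        raw_spin_lock(&p->blocked_lock);
        /*
         * Mark the waiter as waking instead of clearing p->blocked_on,
         * so ttwu can tell a blocked-but-waking task apart from a
         * normal wakeup without losing the blocked_on link.
         */
        p->blocked_on_waking = true;
        raw_spin_unlock(&p->blocked_lock);

        wake_up_process(p);
}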

thanks
-john