[PATCH 00/14] sched: Support shared runqueue locking
Posted by Peter Zijlstra 5 hours ago
Hi,

As mentioned [1], a fair amount of sched_ext weirdness (current and proposed)
is down to the core code not quite working right for shared runqueue stuff.

Instead of endlessly hacking around that, bite the bullet and fix it all up.

With these patches, it should be possible to clean up pick_task_scx() to not
rely on balance_scx(). Additionally, it should be possible to fix that RT issue
and the dl_server issue without further propagating lock breaks.
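
The gist, as a minimal sketch (illustrative only; "dsq" stands in for a
sched_ext dispatch queue, the real code is in the patches): task_rq_lock()
on a task whose class uses a shared runqueue now nests three locks:

    /* illustrative lock nesting, not the actual implementation */
    raw_spin_lock_irqsave(&p->pi_lock, flags);  /* #0 p->pi_lock */
    raw_spin_rq_lock(rq);                       /* #1 rq->__lock */
    raw_spin_lock(&dsq->lock);                  /* #2 shared queue lock */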

As is, these patches boot and run/pass selftests/sched_ext with lockdep on.

I meant to do more sched_ext cleanups, but since this has all already taken
longer than I would've liked (real life interrupted :/), I figured I should
post this as is and let TJ/Andrea poke at it.

Patches are also available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/cleanup


[1] https://lkml.kernel.org/r/20250904202858.GN4068168@noisy.programming.kicks-ass.net


---
 include/linux/cleanup.h  |   5 +
 include/linux/sched.h    |   6 +-
 kernel/cgroup/cpuset.c   |   2 +-
 kernel/kthread.c         |  15 +-
 kernel/sched/core.c      | 370 +++++++++++++++++++++--------------------------
 kernel/sched/deadline.c  |  26 ++--
 kernel/sched/ext.c       | 104 +++++++------
 kernel/sched/fair.c      |  23 ++-
 kernel/sched/idle.c      |  14 +-
 kernel/sched/rt.c        |  13 +-
 kernel/sched/sched.h     | 225 ++++++++++++++++++++--------
 kernel/sched/stats.h     |   2 +-
 kernel/sched/stop_task.c |  14 +-
 kernel/sched/syscalls.c  |  80 ++++------
 14 files changed, 495 insertions(+), 404 deletions(-)
Re: [PATCH 00/14] sched: Support shared runqueue locking
Posted by Andrea Righi 3 hours ago
Hi Peter,

thanks for jumping on this. Comments below.

On Wed, Sep 10, 2025 at 05:44:09PM +0200, Peter Zijlstra wrote:
> Hi,
> 
> As mentioned [1], a fair amount of sched_ext weirdness (current and proposed)
> is down to the core code not quite working right for shared runqueue stuff.
> 
> Instead of endlessly hacking around that, bite the bullet and fix it all up.
> 
> With these patches, it should be possible to clean up pick_task_scx() to not
> rely on balance_scx(). Additionally, it should be possible to fix that RT issue
> and the dl_server issue without further propagating lock breaks.
> 
> As is, these patches boot and run/pass selftests/sched_ext with lockdep on.
> 
> I meant to do more sched_ext cleanups, but since this has all already taken
> longer than I would've liked (real life interrupted :/), I figured I should
> post this as is and let TJ/Andrea poke at it.
> 
> Patches are also available at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/cleanup
> 
> 
> [1] https://lkml.kernel.org/r/20250904202858.GN4068168@noisy.programming.kicks-ass.net

I've done a quick test with this patch set applied and I was able to
trigger this:

[   49.746281] ============================================
[   49.746457] WARNING: possible recursive locking detected
[   49.746559] 6.17.0-rc4-virtme #85 Not tainted
[   49.746666] --------------------------------------------
[   49.746763] stress-ng-race-/5818 is trying to acquire lock:
[   49.746856] ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: dispatch_dequeue+0x125/0x1f0
[   49.747052]
[   49.747052] but task is already holding lock:
[   49.747234] ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x170
[   49.747416]
[   49.747416] other info that might help us debug this:
[   49.747557]  Possible unsafe locking scenario:
[   49.747557]
[   49.747689]        CPU0
[   49.747740]        ----
[   49.747793]   lock(&dsq->lock);
[   49.747867]   lock(&dsq->lock);
[   49.747950]
[   49.747950]  *** DEADLOCK ***
[   49.747950]
[   49.748086]  May be due to missing lock nesting notation
[   49.748086]
[   49.748197] 3 locks held by stress-ng-race-/5818:
[   49.748335]  #0: ffff890e0f0fce70 (&p->pi_lock){-.-.}-{2:2}, at: task_rq_lock+0x38/0x170
[   49.748474]  #1: ffff890e3b6bcc98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0xa0
[   49.748652]  #2: ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x170

Reproducer:

 $ cd tools/sched_ext
 $ make scx_simple
 $ sudo ./build/bin/scx_simple
 ... and in another shell
 $ stress-ng --race-sched 0

I added an explicit BUG_ON() to see where the double locking is happening:

[   15.160400] Call Trace:
[   15.160706]  dequeue_task_scx+0x14a/0x270
[   15.160857]  move_queued_task+0x7d/0x2d0
[   15.160952]  affine_move_task+0x6ca/0x700
[   15.161210]  __set_cpus_allowed_ptr+0x64/0xa0
[   15.161348]  __sched_setaffinity+0x72/0x100
[   15.161459]  sched_setaffinity+0x261/0x2f0
[   15.161569]  __x64_sys_sched_setaffinity+0x50/0x80
[   15.161705]  do_syscall_64+0xbb/0x370
[   15.161816]  entry_SYSCALL_64_after_hwframe+0x77/0x7f

Are we missing a DEQUEUE_LOCKED in the sched_setaffinity() path?
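
Something along these lines is what I mean (untested sketch, just to
illustrate the question):

    static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
                                       struct task_struct *p, int new_cpu)
    {
            lockdep_assert_rq_held(rq);

            /* was: deactivate_task(rq, p, DEQUEUE_NOCLOCK);
             * tell the dequeue path the shared queue lock is held: */
            deactivate_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_LOCKED);
            set_task_cpu(p, new_cpu);
            ...
    }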

Thanks,
-Andrea
Re: [PATCH 00/14] sched: Support shared runqueue locking
Posted by Peter Zijlstra 2 hours ago
On Wed, Sep 10, 2025 at 07:32:12PM +0200, Andrea Righi wrote:

> [   15.160400] Call Trace:
> [   15.160706]  dequeue_task_scx+0x14a/0x270
> [   15.160857]  move_queued_task+0x7d/0x2d0
> [   15.160952]  affine_move_task+0x6ca/0x700
> [   15.161210]  __set_cpus_allowed_ptr+0x64/0xa0
> [   15.161348]  __sched_setaffinity+0x72/0x100
> [   15.161459]  sched_setaffinity+0x261/0x2f0
> [   15.161569]  __x64_sys_sched_setaffinity+0x50/0x80
> [   15.161705]  do_syscall_64+0xbb/0x370
> [   15.161816]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> Are we missing a DEQUEUE_LOCKED in the sched_setaffinity() path?

Yeah, the affine_move_task->move_queued_task path is messed up. It
relied on raw_spin_lock_irqsave(&p->pi_lock); rq_lock(rq); being
equivalent to task_rq_lock(), which is no longer true.
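
That is, roughly (sketch, locking only):

    /* the open-coded pair: */
    raw_spin_lock_irqsave(&p->pi_lock, flags);
    rq_lock(rq, &rf);           /* takes rq->__lock only */

    /* ... no longer equals: */
    rq = task_rq_lock(p, &rf);  /* pi_lock + rq->__lock + the class's
                                 * shared queue lock (e.g. dsq->lock) */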

I fixed a few such sites earlier today but missed this one.

I'll go untangle it, but probably something for tomorrow, I'm bound to
make a mess of it now :-)
Re: [PATCH 00/14] sched: Support shared runqueue locking
Posted by Andrea Righi 2 hours ago
On Wed, Sep 10, 2025 at 08:35:55PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 10, 2025 at 07:32:12PM +0200, Andrea Righi wrote:
> 
> > [   15.160400] Call Trace:
> > [   15.160706]  dequeue_task_scx+0x14a/0x270
> > [   15.160857]  move_queued_task+0x7d/0x2d0
> > [   15.160952]  affine_move_task+0x6ca/0x700
> > [   15.161210]  __set_cpus_allowed_ptr+0x64/0xa0
> > [   15.161348]  __sched_setaffinity+0x72/0x100
> > [   15.161459]  sched_setaffinity+0x261/0x2f0
> > [   15.161569]  __x64_sys_sched_setaffinity+0x50/0x80
> > [   15.161705]  do_syscall_64+0xbb/0x370
> > [   15.161816]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > 
> > Are we missing a DEQUEUE_LOCKED in the sched_setaffinity() path?
> 
> Yeah, the affine_move_task->move_queued_task path is messed up. It
> relied on raw_spin_lock_irqsave(&p->pi_lock); rq_lock(rq); being
> equivalent to task_rq_lock(), which is no longer true.
> 
> I fixed a few such sites earlier today but missed this one.
> 
> I'll go untangle it, but probably something for tomorrow, I'm bound to
> make a mess of it now :-)

Sure! I'll run more tests in the meantime. For now, that's the only issue
I've found. :)

Thanks!
-Andrea
Re: [PATCH 00/14] sched: Support shared runqueue locking
Posted by Peter Zijlstra 3 hours ago
On Wed, Sep 10, 2025 at 07:32:12PM +0200, Andrea Righi wrote:

> I've done a quick test with this patch set applied and I was able to
> trigger this:
> 
> [   49.746281] ============================================
> [   49.746457] WARNING: possible recursive locking detected
> [   49.746559] 6.17.0-rc4-virtme #85 Not tainted
> [   49.746666] --------------------------------------------
> [   49.746763] stress-ng-race-/5818 is trying to acquire lock:
> [   49.746856] ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: dispatch_dequeue+0x125/0x1f0
> [   49.747052]
> [   49.747052] but task is already holding lock:
> [   49.747234] ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x170
> [   49.747416]
> [   49.747416] other info that might help us debug this:
> [   49.747557]  Possible unsafe locking scenario:
> [   49.747557]
> [   49.747689]        CPU0
> [   49.747740]        ----
> [   49.747793]   lock(&dsq->lock);
> [   49.747867]   lock(&dsq->lock);
> [   49.747950]
> [   49.747950]  *** DEADLOCK ***
> [   49.747950]
> [   49.748086]  May be due to missing lock nesting notation
> [   49.748086]
> [   49.748197] 3 locks held by stress-ng-race-/5818:
> [   49.748335]  #0: ffff890e0f0fce70 (&p->pi_lock){-.-.}-{2:2}, at: task_rq_lock+0x38/0x170
> [   49.748474]  #1: ffff890e3b6bcc98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0xa0
> [   49.748652]  #2: ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x170
> 
> Reproducer:
> 
>  $ cd tools/sched_ext
>  $ make scx_simple
>  $ sudo ./build/bin/scx_simple
>  ... and in another shell
>  $ stress-ng --race-sched 0

Heh, the selftests thing was bound to not cover everything. I'll have a
poke at it. Thanks!