While performing the rq locking dance in dispatch_to_local_dsq(), we may
trigger the following lock imbalance condition, in particular when
multiple tasks are rapidly changing CPU affinity (e.g., when running
`stress-ng --race-sched 0`):

[ 13.413579] =====================================
[ 13.413660] WARNING: bad unlock balance detected!
[ 13.413729] 6.13.0-virtme #15 Not tainted
[ 13.413792] -------------------------------------
[ 13.413859] kworker/1:1/80 is trying to release lock (&rq->__lock) at:
[ 13.413954] [<ffffffff873c6c48>] dispatch_to_local_dsq+0x108/0x1a0
[ 13.414111] but there are no more locks to release!
[ 13.414176]
[ 13.414176] other info that might help us debug this:
[ 13.414258] 1 lock held by kworker/1:1/80:
[ 13.414318] #0: ffff8b66feb41698 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
[ 13.414612]
[ 13.414612] stack backtrace:
[ 13.415255] CPU: 1 UID: 0 PID: 80 Comm: kworker/1:1 Not tainted 6.13.0-virtme #15
[ 13.415505] Workqueue: 0x0 (events)
[ 13.415567] Sched_ext: dsp_local_on (enabled+all), task: runnable_at=-2ms
[ 13.415570] Call Trace:
[ 13.415700] <TASK>
[ 13.415744] dump_stack_lvl+0x78/0xe0
[ 13.415806] ? dispatch_to_local_dsq+0x108/0x1a0
[ 13.415884] print_unlock_imbalance_bug+0x11b/0x130
[ 13.415965] ? dispatch_to_local_dsq+0x108/0x1a0
[ 13.416226] lock_release+0x231/0x2c0
[ 13.416326] _raw_spin_unlock+0x1b/0x40
[ 13.416422] dispatch_to_local_dsq+0x108/0x1a0
[ 13.416554] flush_dispatch_buf+0x199/0x1d0
[ 13.416652] balance_one+0x194/0x370
[ 13.416751] balance_scx+0x61/0x1e0
[ 13.416848] prev_balance+0x43/0xb0
[ 13.416947] __pick_next_task+0x6b/0x1b0
[ 13.417052] __schedule+0x20d/0x1740
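
This happens because dispatch_to_local_dsq() is racing with
dispatch_dequeue(): when the latter wins, we incorrectly assume that the
task has been moved to dst_rq. As a result, the final "switch back to
@rq lock" step releases dst_rq's lock, which we never acquired, while
src_rq's lock is still held.

Fix this by correctly assuming that the task is still on src_rq in this
specific scenario.
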
Fixes: 4d3ca89bdd31 ("sched_ext: Refactor consume_remote_task()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index a24d48cebfb7..7500b1a26757 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2617,6 +2617,8 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
 		/* if the destination CPU is idle, wake it up */
 		if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
 			resched_curr(dst_rq);
+	} else {
+		dst_rq = src_rq;
 	}
 
 	/* switch back to @rq lock */
--
2.48.1
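To make the race described in the commit message concrete, here is a minimal userspace model of the lock hand-off in dispatch_to_local_dsq(). It is a sketch, not kernel code. `struct rq` is a hypothetical stand-in that only records whether its lock is held, and `rq_lock()`, `rq_unlock()`, `dispatch_model()` and `scenario_ok()` are illustrative names. `still_owned` stands in for the check that detects whether dispatch_dequeue() won the race, and `with_fix` toggles the `dst_rq = src_rq` assignment from the patch. The model also assumes that moving the task drops src_rq's lock and takes dst_rq's.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct rq: only records whether its lock is held. */
struct rq {
	bool locked;
};

static bool unlock_imbalance; /* released a lock that was not held */
static bool double_lock;      /* took a lock that was already held */

static void rq_lock(struct rq *rq)
{
	if (rq->locked)
		double_lock = true; /* a real spinlock would self-deadlock here */
	rq->locked = true;
}

static void rq_unlock(struct rq *rq)
{
	if (!rq->locked) {
		unlock_imbalance = true; /* what lockdep reports as "bad unlock balance" */
		return;
	}
	rq->locked = false;
}

/*
 * Simplified model of dispatch_to_local_dsq()'s lock hand-off. @rq is held on
 * entry and should be the only lock held on return. @still_owned is false when
 * dispatch_dequeue() won the race; @with_fix enables "dst_rq = src_rq".
 */
static void dispatch_model(struct rq *rq, struct rq *src_rq, struct rq *dst_rq,
			   bool still_owned, bool with_fix)
{
	/* switch to @src_rq lock */
	if (rq != src_rq) {
		rq_unlock(rq);
		rq_lock(src_rq);
	}

	if (still_owned) {
		/* assumed: moving the task drops src_rq's lock and takes dst_rq's */
		if (src_rq != dst_rq) {
			rq_unlock(src_rq);
			rq_lock(dst_rq);
		}
	} else if (with_fix) {
		/* the task never left src_rq, so src_rq's lock is the one we hold */
		dst_rq = src_rq;
	}

	/* switch back to @rq lock */
	if (rq != dst_rq) {
		rq_unlock(dst_rq);
		rq_lock(rq);
	}
}

/* Runs one scenario with a remote dst_rq; true if only @rq ends up locked. */
static bool scenario_ok(bool src_is_rq, bool still_owned, bool with_fix)
{
	struct rq a = { .locked = true };  /* @rq: held on entry */
	struct rq b = { .locked = false }; /* a remote src_rq */
	struct rq c = { .locked = false }; /* a remote dst_rq */
	struct rq *src = src_is_rq ? &a : &b;

	unlock_imbalance = false;
	double_lock = false;
	dispatch_model(&a, src, &c, still_owned, with_fix);
	return !unlock_imbalance && !double_lock &&
	       a.locked && !b.locked && !c.locked;
}
```

For example, `scenario_ok(false, false, false)` (dispatch_dequeue() wins, no fix) returns false. In that case the switch-back step releases dst_rq's lock, which was never taken, and leaves src_rq locked. `scenario_ok(false, false, true)` returns true. When src_rq is @rq itself (`src_is_rq = true`), the unfixed path also re-takes a lock it already holds.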

Hello Andrea,

On 25. 1. 24. 08:42, Andrea Righi wrote:
> kernel/sched/ext.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index a24d48cebfb7..7500b1a26757 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2617,6 +2617,8 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
>  		/* if the destination CPU is idle, wake it up */
>  		if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
>  			resched_curr(dst_rq);
> +	} else {
> +		dst_rq = src_rq;
>  	}

The fix makes sense to me. Since this is a very specific and
tricky case, it will be better to include detailed comments in
the else part so anyone can easily understand why the else part
is necessary.

Regards,
Changwoo Min

On Fri, Jan 24, 2025 at 11:21:33AM +0900, Changwoo Min wrote:
> Hello Andrea,
>
> On 25. 1. 24. 08:42, Andrea Righi wrote:
> > kernel/sched/ext.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index a24d48cebfb7..7500b1a26757 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -2617,6 +2617,8 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
> >  		/* if the destination CPU is idle, wake it up */
> >  		if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
> >  			resched_curr(dst_rq);
> > +	} else {
> > +		dst_rq = src_rq;
> >  	}
>
> The fix makes sense to me. Since this is a very specific and
> tricky case, it will be better to include detailed comments in
> the else part so anyone can easily understand why the else part
> is necessary.

Good idea, I'll send a v2 including a comment in the else part.

Thanks!
-Andrea