While performing the rq locking dance in dispatch_to_local_dsq(), we may
trigger the following lock imbalance condition, in particular when
multiple tasks are rapidly changing CPU affinity (e.g., when running
`stress-ng --race-sched 0`):
[ 13.413579] =====================================
[ 13.413660] WARNING: bad unlock balance detected!
[ 13.413729] 6.13.0-virtme #15 Not tainted
[ 13.413792] -------------------------------------
[ 13.413859] kworker/1:1/80 is trying to release lock (&rq->__lock) at:
[ 13.413954] [<ffffffff873c6c48>] dispatch_to_local_dsq+0x108/0x1a0
[ 13.414111] but there are no more locks to release!
[ 13.414176]
[ 13.414176] other info that might help us debug this:
[ 13.414258] 1 lock held by kworker/1:1/80:
[ 13.414318] #0: ffff8b66feb41698 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
[ 13.414612]
[ 13.414612] stack backtrace:
[ 13.415255] CPU: 1 UID: 0 PID: 80 Comm: kworker/1:1 Not tainted 6.13.0-virtme #15
[ 13.415505] Workqueue: 0x0 (events)
[ 13.415567] Sched_ext: dsp_local_on (enabled+all), task: runnable_at=-2ms
[ 13.415570] Call Trace:
[ 13.415700] <TASK>
[ 13.415744] dump_stack_lvl+0x78/0xe0
[ 13.415806] ? dispatch_to_local_dsq+0x108/0x1a0
[ 13.415884] print_unlock_imbalance_bug+0x11b/0x130
[ 13.415965] ? dispatch_to_local_dsq+0x108/0x1a0
[ 13.416226] lock_release+0x231/0x2c0
[ 13.416326] _raw_spin_unlock+0x1b/0x40
[ 13.416422] dispatch_to_local_dsq+0x108/0x1a0
[ 13.416554] flush_dispatch_buf+0x199/0x1d0
[ 13.416652] balance_one+0x194/0x370
[ 13.416751] balance_scx+0x61/0x1e0
[ 13.416848] prev_balance+0x43/0xb0
[ 13.416947] __pick_next_task+0x6b/0x1b0
[ 13.417052] __schedule+0x20d/0x1740
This happens because dispatch_to_local_dsq() is racing with
dispatch_dequeue(): when the latter wins the race, we incorrectly assume
that the task has been moved to dst_rq.

Fix this by correctly assuming that the task is still in src_rq in this
specific scenario.
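To make the interleaving concrete, here is a minimal userspace model of
the race (toy_task, toy_dispatch() and friends are illustrative
stand-ins, not the kernel code):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Toy stand-in for the task state involved in the race. */
struct toy_task {
        atomic_int holding_cpu; /* -1 once a racing dequeue revokes ownership */
        int rq;                 /* id of the runqueue the task sits on */
};

/* Models the dequeue side winning the race and revoking ownership. */
static void *toy_dequeue(void *arg)
{
        struct toy_task *p = arg;

        atomic_store(&p->holding_cpu, -1);
        return NULL;
}

/* Models dispatch_to_local_dsq(): migrate only if we still own the task. */
static void toy_dispatch(struct toy_task *p, int this_cpu, int src_rq, int dst_rq)
{
        if (atomic_load(&p->holding_cpu) == this_cpu)
                p->rq = dst_rq; /* we won: the task moves to dst_rq */
        /* otherwise dequeue won and the task never left src_rq */

        printf("task ended up on rq %d (src_rq=%d)\n", p->rq, src_rq);
}

int main(void)
{
        struct toy_task p = { .holding_cpu = 0, .rq = 0 };
        pthread_t t;

        pthread_create(&t, NULL, toy_dequeue, &p);
        pthread_join(&t, NULL);         /* force the "dequeue wins" interleaving */
        toy_dispatch(&p, 0, 0, 1);      /* prints rq 0: no migration happened */
        return 0;
}

In the kernel, the unlock path after this check must then release
whichever rq lock is actually held, which is what this patch addresses.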
Fixes: 4d3ca89bdd31 ("sched_ext: Refactor consume_remote_task()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
ChangeLog v1 -> v2:
- more comments to clarify the race with dequeue
- rebase to tip
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 3f3f6baac917..92b69c57b400 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2598,7 +2598,10 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
 			raw_spin_rq_lock(src_rq);
 	}
 
-	/* task_rq couldn't have changed if we're still the holding cpu */
+	/*
+	 * If p->scx.holding_cpu still matches the current CPU, task_rq(p)
+	 * has not changed and we can safely move the task to @dst_rq.
+	 */
 	if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
 	    !WARN_ON_ONCE(src_rq != task_rq(p))) {
 		/*
@@ -2617,6 +2620,13 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
 		/* if the destination CPU is idle, wake it up */
 		if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
 			resched_curr(dst_rq);
+	} else {
+		/*
+		 * Otherwise, if dequeue wins the race, we no longer have
+		 * exclusive ownership of the task and we must keep it in
+		 * its original @src_dsq.
+		 */
+		dst_rq = src_rq;
 	}
 
 	/* switch back to @rq lock */
--
2.48.1
Hello, Andrea.

Thanks for debugging this.

On Fri, Jan 24, 2025 at 08:24:25AM +0100, Andrea Righi wrote:
...
> Fix this by correctly assuming that the task is still in src_rq in this
> specific scenario.

But I find this a bit misleading. It's more that we didn't do anything and
thus didn't switch the held lock, and so the assumption that the unlock
path makes is wrong.

...
> +	} else {
> +		/*
> +		 * Otherwise, if dequeue wins the race, we no longer have
> +		 * exclusive ownership of the task and we must keep it in
> +		 * its original @src_dsq.
> +		 */
> +		dst_rq = src_rq;

Both the code and the explanation are confusing to me. This is working
around the incorrect assumption the following block is making - that we'd
be locking $dst_rq when control reaches that point. Can you instead add a
new variable $locked_rq which tracks which rq is currently locked and then
use that in the unlock path? It starts as $rq, is set to $src_rq when
switching to that, and then set to $dst_rq in the migration block. Then,
the unlock path can test @rq against $locked_rq. scx_dsq_move() uses a
similar approach for reference.

Thanks.

--
tejun
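For reference, a rough sketch of the $locked_rq pattern being suggested,
with stubbed lock helpers (struct rq here, rq_lock()/rq_unlock() and the
still_holding flag are illustrative stand-ins, not the actual kernel API
or the follow-up patch):

#include <stdbool.h>
#include <stdio.h>

struct rq { int id; };

/* Stub lock helpers; the real code uses raw_spin_rq_lock()/unlock(). */
static void rq_lock(struct rq *rq)   { printf("lock   rq %d\n", rq->id); }
static void rq_unlock(struct rq *rq) { printf("unlock rq %d\n", rq->id); }

static void toy_dispatch(struct rq *rq, struct rq *src_rq, struct rq *dst_rq,
                         bool still_holding)
{
        struct rq *locked_rq = rq;      /* always names the rq we currently hold */

        /* switch to src_rq's lock */
        if (locked_rq != src_rq) {
                rq_unlock(locked_rq);
                locked_rq = src_rq;
                rq_lock(src_rq);
        }

        if (still_holding) {
                /* migration succeeded: the held lock moves to dst_rq */
                rq_unlock(locked_rq);
                locked_rq = dst_rq;
                rq_lock(dst_rq);
        }

        /* unlock path: test against what is actually held, not dst_rq */
        if (rq != locked_rq) {
                rq_unlock(locked_rq);
                rq_lock(rq);
        }
}

int main(void)
{
        struct rq a = { 0 }, b = { 1 }, c = { 2 };

        /* dequeue won the race: we correctly release b (src_rq), not c */
        toy_dispatch(&a, &b, &c, false);
        return 0;
}

The point of the pattern is that the unlock path never has to guess which
branch was taken; it simply releases whatever $locked_rq records.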
On Fri, Jan 24, 2025 at 09:08:39AM -1000, Tejun Heo wrote:
> Hello, Andrea.
>
> Thanks for debugging this.
>
> On Fri, Jan 24, 2025 at 08:24:25AM +0100, Andrea Righi wrote:
> ...
> > Fix this by correctly assuming that the task is still in src_rq in this
> > specific scenario.
>
> But I find this a bit misleading. It's more that we didn't do anything and
> thus didn't switch the held lock, and so the assumption that the unlock
> path makes is wrong.
>
> ...
> > +	} else {
> > +		/*
> > +		 * Otherwise, if dequeue wins the race, we no longer have
> > +		 * exclusive ownership of the task and we must keep it in
> > +		 * its original @src_dsq.
> > +		 */
> > +		dst_rq = src_rq;
>
> Both the code and the explanation are confusing to me. This is working
> around the incorrect assumption the following block is making - that we'd
> be locking $dst_rq when control reaches that point. Can you instead add a
> new variable $locked_rq which tracks which rq is currently locked and then
> use that in the unlock path? It starts as $rq, is set to $src_rq when
> switching to that, and then set to $dst_rq in the migration block. Then,
> the unlock path can test @rq against $locked_rq. scx_dsq_move() uses a
> similar approach for reference.

I see, that makes sense, I'll send a new patch following this approach.

Thanks!
-Andrea