[PATCH] sched_ext: Fix lock imbalance in dispatch_to_local_dsq()

Andrea Righi posted 1 patch 6 days, 5 hours ago
There is a newer version of this series
Posted by Andrea Righi 6 days, 5 hours ago
While performing the rq locking dance in dispatch_to_local_dsq(), we may
trigger the following lock imbalance condition, in particular when
multiple tasks are rapidly changing CPU affinity (e.g., when running
`stress-ng --race-sched 0`):

[   13.413579] =====================================
[   13.413660] WARNING: bad unlock balance detected!
[   13.413729] 6.13.0-virtme #15 Not tainted
[   13.413792] -------------------------------------
[   13.413859] kworker/1:1/80 is trying to release lock (&rq->__lock) at:
[   13.413954] [<ffffffff873c6c48>] dispatch_to_local_dsq+0x108/0x1a0
[   13.414111] but there are no more locks to release!
[   13.414176]
[   13.414176] other info that might help us debug this:
[   13.414258] 1 lock held by kworker/1:1/80:
[   13.414318]  #0: ffff8b66feb41698 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
[   13.414612]
[   13.414612] stack backtrace:
[   13.415255] CPU: 1 UID: 0 PID: 80 Comm: kworker/1:1 Not tainted 6.13.0-virtme #15
[   13.415505] Workqueue:  0x0 (events)
[   13.415567] Sched_ext: dsp_local_on (enabled+all), task: runnable_at=-2ms
[   13.415570] Call Trace:
[   13.415700]  <TASK>
[   13.415744]  dump_stack_lvl+0x78/0xe0
[   13.415806]  ? dispatch_to_local_dsq+0x108/0x1a0
[   13.415884]  print_unlock_imbalance_bug+0x11b/0x130
[   13.415965]  ? dispatch_to_local_dsq+0x108/0x1a0
[   13.416226]  lock_release+0x231/0x2c0
[   13.416326]  _raw_spin_unlock+0x1b/0x40
[   13.416422]  dispatch_to_local_dsq+0x108/0x1a0
[   13.416554]  flush_dispatch_buf+0x199/0x1d0
[   13.416652]  balance_one+0x194/0x370
[   13.416751]  balance_scx+0x61/0x1e0
[   13.416848]  prev_balance+0x43/0xb0
[   13.416947]  __pick_next_task+0x6b/0x1b0
[   13.417052]  __schedule+0x20d/0x1740

This happens because dispatch_to_local_dsq() races with
dispatch_dequeue(): when the latter wins, the task is not moved, but we
still incorrectly assume that it has been moved to dst_rq and try to
release the dst_rq lock, which we never acquired.

Fix this by setting dst_rq to src_rq in this specific scenario, since
the task is still on src_rq.

Fixes: 4d3ca89bdd31 ("sched_ext: Refactor consume_remote_task()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index a24d48cebfb7..7500b1a26757 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2617,6 +2617,8 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
 		/* if the destination CPU is idle, wake it up */
 		if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
 			resched_curr(dst_rq);
+	} else {
+		dst_rq = src_rq;
 	}
 
 	/* switch back to @rq lock */
-- 
2.48.1
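
The race and the fix described in the patch above can be illustrated with
a small user-space model. This is only a sketch under stated assumptions,
not the kernel code: `struct toy_rq`, `toy_lock()`, `toy_unlock()`,
`toy_dispatch()` and `toy_run()` are hypothetical names, the "locks" are
plain booleans, and the model assumes that a successful move hands the
lock over from src_rq to dst_rq. The `dequeue_won` flag stands in for
dispatch_dequeue() winning the race.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a runqueue: it only tracks whether its lock is held. */
struct toy_rq {
	bool locked;
};

/* Locking errors seen so far (bad unlocks, double locks, leaked locks). */
static int toy_errors;

static void toy_lock(struct toy_rq *rq)
{
	if (rq->locked)
		toy_errors++;	/* would deadlock on a real spinlock */
	rq->locked = true;
}

static void toy_unlock(struct toy_rq *rq)
{
	if (!rq->locked)
		toy_errors++;	/* the "bad unlock balance" case */
	rq->locked = false;
}

/* Entered with @rq locked; must return with only @rq locked. */
static void toy_dispatch(struct toy_rq *rq, struct toy_rq *src_rq,
			 struct toy_rq *dst_rq, bool dequeue_won,
			 bool apply_fix)
{
	assert(rq->locked);

	/* switch to the src_rq lock */
	if (rq != src_rq) {
		toy_unlock(rq);
		toy_lock(src_rq);
	}

	if (!dequeue_won) {
		/* Assumed: moving the task swaps the src lock for the dst lock. */
		if (src_rq != dst_rq) {
			toy_unlock(src_rq);
			toy_lock(dst_rq);
		}
	} else if (apply_fix) {
		/* Task was not moved, so the lock we hold is still src_rq's. */
		dst_rq = src_rq;
	}

	/* switch back to the rq lock */
	if (rq != dst_rq) {
		toy_unlock(dst_rq);
		toy_lock(rq);
	}
}

/* Run one dispatch across three distinct toy runqueues; return the error count. */
int toy_run(bool dequeue_won, bool apply_fix)
{
	struct toy_rq cpu0 = { false }, cpu1 = { false }, cpu2 = { false };

	toy_errors = 0;
	toy_lock(&cpu0);
	toy_dispatch(&cpu0, &cpu1, &cpu2, dequeue_won, apply_fix);

	/* On exit only cpu0 (the caller's rq) may be locked. */
	if (!cpu0.locked || cpu1.locked || cpu2.locked)
		toy_errors++;
	return toy_errors;
}
```

In this model, toy_run(true, false) returns a nonzero error count (an
unlock of a lock that is not held, plus a leaked src_rq lock), while
toy_run(true, true) returns zero. The no-race case returns zero with or
without the fix.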
Re: [PATCH] sched_ext: Fix lock imbalance in dispatch_to_local_dsq()
Posted by Changwoo Min 6 days, 2 hours ago
Hello Andrea,

On 25. 1. 24. 08:42, Andrea Righi wrote:
>   kernel/sched/ext.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index a24d48cebfb7..7500b1a26757 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2617,6 +2617,8 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
>   		/* if the destination CPU is idle, wake it up */
>   		if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
>   			resched_curr(dst_rq);
> +	} else {
> +		dst_rq = src_rq;
>   	}

The fix makes sense to me. Since this is a very specific and
tricky case, it would be better to add a detailed comment to
the else branch so that anyone can easily understand why it
is necessary.

Regards,
Changwoo Min
Re: [PATCH] sched_ext: Fix lock imbalance in dispatch_to_local_dsq()
Posted by Andrea Righi 5 days, 22 hours ago
On Fri, Jan 24, 2025 at 11:21:33AM +0900, Changwoo Min wrote:
> Hello Andrea,
> 
> On 25. 1. 24. 08:42, Andrea Righi wrote:
> >   kernel/sched/ext.c | 2 ++
> >   1 file changed, 2 insertions(+)
> > 
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index a24d48cebfb7..7500b1a26757 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -2617,6 +2617,8 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
> >   		/* if the destination CPU is idle, wake it up */
> >   		if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
> >   			resched_curr(dst_rq);
> > +	} else {
> > +		dst_rq = src_rq;
> >   	}
> 
> The fix makes sense to me. Since this is a very specific and
> tricky case, it would be better to add a detailed comment to
> the else branch so that anyone can easily understand why it
> is necessary.

Good idea, I'll send a v2 including a comment in the else part.

Thanks!
-Andrea