[PATCH sched_ext/for-7.1] sched_ext: Reduce DSQ lock contention in consume_dispatch_q()

Posted by Andrea Righi 3 weeks, 2 days ago
Replace raw_spin_lock() with raw_spin_trylock() when taking the DSQ lock
in consume_dispatch_q(). If the lock is contended, kick the current CPU
to retry on the next balance instead of spinning.

Under high load, multiple CPUs can contend on the same DSQ lock. With
raw_spin_lock(), waiters spin on the same cache line, wasting cycles and
increasing cache coherency traffic, which can slow down the lock holder.
With trylock, waiters back off and retry later, so the holder can complete
faster and the backing-off CPUs get a chance to consume other DSQs or run
tasks.

In bypass mode, scx_kick_cpu() is suppressed, so fall back to
raw_spin_lock() to guarantee forward progress.

Since this slightly changes the behavior of scx_bpf_dsq_move_to_local(),
update the documentation to clarify that a false return value means no
eligible task could be consumed from the DSQ. This covers both the case
of an empty DSQ and any other condition that prevented task consumption.

Benchmarks that generate many enqueue/dispatch events (e.g., schbench)
show around 2-3x higher throughput with most of the scx schedulers with
this change applied.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 9202c6d7a7713..8f48472f70f18 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2451,6 +2451,7 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
 			       struct scx_dispatch_q *dsq, u64 enq_flags)
 {
 	struct task_struct *p;
+	s32 cpu = cpu_of(rq);
 retry:
 	/*
 	 * The caller can't expect to successfully consume a task if the task's
@@ -2460,7 +2461,19 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
 	if (list_empty(&dsq->list))
 		return false;
 
-	raw_spin_lock(&dsq->lock);
+	/*
+	 * Use trylock to avoid spinning on a contended DSQ; if we fail to
+	 * acquire the lock, kick the CPU to retry on the next balance.
+	 *
+	 * In bypass mode simply spin to acquire the lock, since
+	 * scx_kick_cpu() is suppressed.
+	 */
+	if (scx_bypassing(sch, cpu)) {
+		raw_spin_lock(&dsq->lock);
+	} else if (!raw_spin_trylock(&dsq->lock)) {
+		scx_kick_cpu(sch, cpu, 0);
+		return false;
+	}
 
 	nldsq_for_each_task(p, dsq) {
 		struct rq *task_rq = task_rq(p);
@@ -8185,8 +8198,8 @@ __bpf_kfunc void scx_bpf_dispatch_cancel(const struct bpf_prog_aux *aux)
  * before trying to move from the specified DSQ. It may also grab rq locks and
  * thus can't be called under any BPF locks.
  *
- * Returns %true if a task has been moved, %false if there isn't any task to
- * move.
+ * Returns %true if a task has been moved, %false if no eligible task could
+ * be consumed from @dsq_id.
  */
 __bpf_kfunc bool scx_bpf_dsq_move_to_local___v2(u64 dsq_id, u64 enq_flags,
 						const struct bpf_prog_aux *aux)
-- 
2.53.0
Re: [PATCH sched_ext/for-7.1] sched_ext: Reduce DSQ lock contention in consume_dispatch_q()
Posted by Tejun Heo 3 weeks, 1 day ago
Hello, Andrea.

On Sun, Mar 15, 2026 at 12:52:31AM +0100, Andrea Righi wrote:
...
> Benchmarks that generate many enqueue/dispatch events (e.g., schbench)
> show around 2-3x higher throughput with most of the scx schedulers with
> this change applied.

Can you share more details about the benchmark setup and results?

> +	/*
> +	 * Use trylock to avoid spinning on a contended DSQ; if we fail to
> +	 * acquire the lock, kick the CPU to retry on the next balance.
> +	 *
> +	 * In bypass mode simply spin to acquire the lock, since
> +	 * scx_kick_cpu() is suppressed.
> +	 */
> +	if (scx_bypassing(sch, cpu)) {
> +		raw_spin_lock(&dsq->lock);
> +	} else if (!raw_spin_trylock(&dsq->lock)) {
> +		scx_kick_cpu(sch, cpu, 0);
> +		return false;
> +	}

But I'm not sure this is what we wanna do. If we *really* want to do this,
maybe we can add a try_move variant; however, I'm pretty deeply skeptical
about the approach for a few reasons.

- If a shared DSQ becomes a bottleneck, the right thing to do would be
  introducing multiple DSQs and sharding them.

- This likely is trading off fairness to gain bandwidth and this approach
  depending on machine / workload may lead to severe starvation. One can
  argue that controlled trade off between fairness and bandwidth is useful
  for some use cases. However, even if that is the case, I don't think
  trylock is the way to get there. If we think that low overhead high
  fan-out shared queue is desirable, it'd be better to introduce dedicated
  data structure which can do so in a controlled manner.

Thanks.

-- 
tejun
Re: [PATCH sched_ext/for-7.1] sched_ext: Reduce DSQ lock contention in consume_dispatch_q()
Posted by Andrea Righi 3 weeks, 1 day ago
On Sat, Mar 14, 2026 at 10:58:05PM -1000, Tejun Heo wrote:
> Hello, Andrea.
> 
> On Sun, Mar 15, 2026 at 12:52:31AM +0100, Andrea Righi wrote:
> ...
> > Benchmarks that generate many enqueue/dispatch events (e.g., schbench)
> > show around 2-3x higher throughput with most of the scx schedulers with
> > this change applied.
> 
> Can you share more details about the benchmark setup and results?

Just running schbench and perf bench for now, it definitely needs more
testing, but I wanted to send a patch to start a discussion about this (I
should have added the RFC in the subject, sorry).

> 
> > +	/*
> > +	 * Use trylock to avoid spinning on a contended DSQ; if we fail to
> > +	 * acquire the lock, kick the CPU to retry on the next balance.
> > +	 *
> > +	 * In bypass mode simply spin to acquire the lock, since
> > +	 * scx_kick_cpu() is suppressed.
> > +	 */
> > +	if (scx_bypassing(sch, cpu)) {
> > +		raw_spin_lock(&dsq->lock);
> > +	} else if (!raw_spin_trylock(&dsq->lock)) {
> > +		scx_kick_cpu(sch, cpu, 0);
> > +		return false;
> > +	}
> 
> But I'm not sure this is what we wanna do. If we *really* want to do this,
> maybe we can add a try_move variant; however, I'm pretty deeply skeptical
> about the approach for a few reasons.
> 
> - If a shared DSQ becomes a bottleneck, the right thing to do would be
>   introducing multiple DSQs and sharding them.

True, but then we also need a load balancer with multiple DSQs and moving
tasks across DSQs is also not very efficient. With a shared DSQ we do
really well with latency, but under intense scheduling activity (e.g.,
schbench) we get poor performance, so all those scheduling-related
benchmarks get a bad score with most of the scx schedulers.

With this applied, all the scx schedulers I tried (scx_cosmos,
scx_bpfland, scx_p2dq, scx_lavd) get pretty much the same score as EEVDF
(or even slightly better) with schbench, without any noticeable impact on
latency (tested avg fps and tail latency with a few games).

> 
> - This likely is trading off fairness to gain bandwidth and this approach
>   depending on machine / workload may lead to severe starvation. One can
>   argue that controlled trade off between fairness and bandwidth is useful
>   for some use cases. However, even if that is the case, I don't think
>   trylock is the way to get there. If we think that low overhead high
>   fan-out shared queue is desirable, it'd be better to introduce dedicated
>   data structure which can do so in a controlled manner.

True, and I think with moderate CPU activity this may increase latency due
to the additional kick/balance step when trylock fails (maybe control this
behavior with a flag?).

That said, the throughput benefits seem significant. While schbench is
probably an extreme case, the improvement there is substantial (2-3x),
which suggests this approach might also benefit some more realistic
workloads. I'm planning to run additional tests over the next few days to
better understand this.

Based on the schbench results, it seems like a missed opportunity to drop
this entirely. Can you elaborate more on the dedicated data structure you
mentioned? Do you have something specific in mind?

Thanks,
-Andrea
Re: [PATCH sched_ext/for-7.1] sched_ext: Reduce DSQ lock contention in consume_dispatch_q()
Posted by Tejun Heo 3 weeks, 1 day ago
(cc'ing Ryan Newton and quoting whole body)

On Sun, Mar 15, 2026 at 10:40:38AM +0100, Andrea Righi wrote:
> On Sat, Mar 14, 2026 at 10:58:05PM -1000, Tejun Heo wrote:
> > Hello, Andrea.
> > 
> > On Sun, Mar 15, 2026 at 12:52:31AM +0100, Andrea Righi wrote:
> > ...
> > > Benchmarks that generate many enqueue/dispatch events (e.g., schbench)
> > > show around 2-3x higher throughput with most of the scx schedulers with
> > > this change applied.
> > 
> > Can you share more details about the benchmark setup and results?
> 
> Just running schbench and perf bench for now, it definitely needs more
> testing, but I wanted to send a patch to start a discussion about this (I
> should have added the RFC in the subject, sorry).
> 
> > 
> > > +	/*
> > > +	 * Use trylock to avoid spinning on a contended DSQ; if we fail to
> > > +	 * acquire the lock, kick the CPU to retry on the next balance.
> > > +	 *
> > > +	 * In bypass mode simply spin to acquire the lock, since
> > > +	 * scx_kick_cpu() is suppressed.
> > > +	 */
> > > +	if (scx_bypassing(sch, cpu)) {
> > > +		raw_spin_lock(&dsq->lock);
> > > +	} else if (!raw_spin_trylock(&dsq->lock)) {
> > > +		scx_kick_cpu(sch, cpu, 0);
> > > +		return false;
> > > +	}
> > 
> > But I'm not sure this is what we wanna do. If we *really* want to do this,
> > maybe we can add a try_move variant; however, I'm pretty deeply skeptical
> > about the approach for a few reasons.
> > 
> > - If a shared DSQ becomes a bottleneck, the right thing to do would be
> >   introducing multiple DSQs and sharding them.
> 
> True, but then we also need a load balancer with multiple DSQs and moving
> tasks across DSQs is also not very efficient. With a shared DSQ we do
> really well with latency, but under intense scheduling activity (e.g.,
> schbench) we get poor performance, so all those scheduling-related
> benchmarks get a bad score with most of the scx schedulers.
> 
> With this applied, all the scx schedulers I tried (scx_cosmos,
> scx_bpfland, scx_p2dq, scx_lavd) get pretty much the same score as EEVDF
> (or even slightly better) with schbench, without any noticeable impact on
> latency (tested avg fps and tail latency with a few games).
> 
> > 
> > - This likely is trading off fairness to gain bandwidth and this approach
> >   depending on machine / workload may lead to severe starvation. One can
> >   argue that controlled trade off between fairness and bandwidth is useful
> >   for some use cases. However, even if that is the case, I don't think
> >   trylock is the way to get there. If we think that low overhead high
> >   fan-out shared queue is desirable, it'd be better to introduce dedicated
> >   data structure which can do so in a controlled manner.
> 
> True, and I think with moderate CPU activity this may increase latency due
> to the additional kick/balance step when trylock fails (maybe control this
> behavior with a flag?).
> 
> That said, the throughput benefits seem significant. While schbench is
> probably an extreme case, the improvement there is substantial (2-3x),
> which suggests this approach might also benefit some more realistic
> workloads. I'm planning to run additional tests over the next few days to
> better understand this.
> 
> Based on the schbench results, it seems like a missed opportunity to drop
> this entirely. Can you elaborate more on the dedicated data structure you
> mentioned? Do you have something specific in mind?

IIRC, Ryan was experimenting with some data structures which trade off
fairness for scalability in BPF and saw promising results. Ryan, can you
please share what you did?

Thanks.

-- 
tejun