lock (write-heavy) and first_task (read-mostly, lockless RCU peek) share
the same cache line in struct scx_dispatch_q. Every lock acquire/release
by a dispatching CPU invalidates the line for all CPUs performing
lockless first_task peeks, causing unnecessary cache coherence traffic,
especially across NUMA nodes.
Add ____cacheline_aligned_in_smp to first_task to place it on its own
cache line, eliminating this false sharing on SMP systems. On
uniprocessor builds the annotation is a no-op, so no space is wasted.
On SMP, the trade-off is increased struct size: each scx_dispatch_q
grows by up to ~56 bytes of padding. There are two instances embedded
per-CPU in scx_rq (local_dsq and bypass_dsq), plus any dynamically
allocated custom DSQs, so the total overhead scales with the number of
CPUs and active DSQs.
Signed-off-by: David Carlier <devnexen@gmail.com>
---
include/linux/sched/ext.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d..2988df68a97a 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -70,7 +70,7 @@ enum scx_dsq_id_flags {
*/
struct scx_dispatch_q {
raw_spinlock_t lock;
- struct task_struct __rcu *first_task; /* lockless peek at head */
+ struct task_struct __rcu *first_task ____cacheline_aligned_in_smp; /* lockless peek at head */
struct list_head list; /* tasks in dispatch order */
struct rb_root priq; /* used to order by p->scx.dsq_vtime */
u32 nr;
--
2.51.0
On Sat, Feb 28, 2026 at 01:06:47PM +0000, David Carlier wrote:
> lock (write-heavy) and first_task (read-mostly, lockless RCU peek) share
> the same cache line in struct scx_dispatch_q. Every lock acquire/release
> by a dispatching CPU invalidates the line for all CPUs performing
> lockless first_task peeks, causing unnecessary cache coherence traffic,
> especially across NUMA nodes.
>
> Add ____cacheline_aligned_in_smp to first_task to place it on its own
> cache line, eliminating this false sharing on SMP systems. On
> uniprocessor builds the annotation is a no-op, so no space is wasted.
>
> On SMP, the trade-off is increased struct size: each scx_dispatch_q
> grows by up to ~56 bytes of padding. There are two instances embedded
> per-CPU in scx_rq (local_dsq and bypass_dsq), plus any dynamically
> allocated custom DSQs, so the total overhead scales with the number of
> CPUs and active DSQs.

But first_task is read-mostly. How could it be? David, from now on, I'm not
going to apply these patches unless you provide backing experimental data.

Thanks.

--
tejun
Hi Tejun,

You're right, I got the access pattern wrong. Looking at it more carefully,
first_task is written via rcu_assign_pointer() on every enqueue and on
dequeues when the removed task is the head, all under dsq->lock. Since the
lock acquisition already brings the cache line into exclusive state, writing
first_task on the same line is essentially free. The only lockless reader is
scx_bpf_dsq_nr_queued(), which isn't a hot path ...

Understood on requiring experimental data going forward. I'll make sure to
back any performance-related patches with benchmark numbers and profiling
output (perf c2c / perf stat).

Sorry for the noise (again..).

On Sat, 28 Feb 2026 at 17:28, Tejun Heo <tj@kernel.org> wrote:
>
> On Sat, Feb 28, 2026 at 01:06:47PM +0000, David Carlier wrote:
> > lock (write-heavy) and first_task (read-mostly, lockless RCU peek) share
> > the same cache line in struct scx_dispatch_q. Every lock acquire/release
> > by a dispatching CPU invalidates the line for all CPUs performing
> > lockless first_task peeks, causing unnecessary cache coherence traffic,
> > especially across NUMA nodes.
> >
> > Add ____cacheline_aligned_in_smp to first_task to place it on its own
> > cache line, eliminating this false sharing on SMP systems. On
> > uniprocessor builds the annotation is a no-op, so no space is wasted.
> >
> > On SMP, the trade-off is increased struct size: each scx_dispatch_q
> > grows by up to ~56 bytes of padding. There are two instances embedded
> > per-CPU in scx_rq (local_dsq and bypass_dsq), plus any dynamically
> > allocated custom DSQs, so the total overhead scales with the number of
> > CPUs and active DSQs.
>
> But first_task is read-mostly. How could it be? David, from now on, I'm not
> going to apply these patches unless you provide backing experimental data.
>
> Thanks.
>
> --
> tejun