lock (write-heavy) and first_task (read-mostly, lockless RCU peek) share
the same cache line in struct scx_dispatch_q. Every lock acquire/release
by a dispatching CPU invalidates the line for all CPUs performing
lockless first_task peeks, causing unnecessary cache coherence traffic,
especially across NUMA nodes.
Add ____cacheline_aligned_in_smp to first_task to place it on its own
cache line, eliminating this false sharing on SMP systems. On
uniprocessor builds the annotation is a no-op, so no space is wasted.
On SMP, the trade-off is increased struct size: each scx_dispatch_q
grows by up to ~56 bytes of padding. There are two instances embedded
per-CPU in scx_rq (local_dsq and bypass_dsq), plus any dynamically
allocated custom DSQs, so the total overhead scales with the number of
CPUs and active DSQs.
Signed-off-by: David Carlier <devnexen@gmail.com>
---
include/linux/sched/ext.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d..2988df68a97a 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -70,7 +70,7 @@ enum scx_dsq_id_flags {
*/
struct scx_dispatch_q {
raw_spinlock_t lock;
- struct task_struct __rcu *first_task; /* lockless peek at head */
+ struct task_struct __rcu *first_task ____cacheline_aligned_in_smp; /* lockless peek at head */
struct list_head list; /* tasks in dispatch order */
struct rb_root priq; /* used to order by p->scx.dsq_vtime */
u32 nr;
--
2.51.0
On Sat, Feb 28, 2026 at 01:06:47PM +0000, David Carlier wrote:
> lock (write-heavy) and first_task (read-mostly, lockless RCU peek) share
> the same cache line in struct scx_dispatch_q. Every lock acquire/release
> by a dispatching CPU invalidates the line for all CPUs performing
> lockless first_task peeks, causing unnecessary cache coherence traffic,
> especially across NUMA nodes.
>
> Add ____cacheline_aligned_in_smp to first_task to place it on its own
> cache line, eliminating this false sharing on SMP systems. On
> uniprocessor builds the annotation is a no-op, so no space is wasted.
>
> On SMP, the trade-off is increased struct size: each scx_dispatch_q
> grows by up to ~56 bytes of padding. There are two instances embedded
> per-CPU in scx_rq (local_dsq and bypass_dsq), plus any dynamically
> allocated custom DSQs, so the total overhead scales with the number of
> CPUs and active DSQs.

But first_task is read-mostly. How could it be? David, from now on, I'm not
going to apply these patches unless you provide backing experimental data.

Thanks.

--
tejun
Hi Tejun,

You're right, I got the access pattern wrong. Looking at it more carefully,
first_task is written via rcu_assign_pointer() on every enqueue and on
dequeues when the removed task is the head, all under dsq->lock. Since the
lock acquisition already brings the cache line into exclusive state, writing
first_task on the same line is essentially free. The only lockless reader is
scx_bpf_dsq_nr_queued(), which isn't a hot path ...

Understood on requiring experimental data going forward. I'll make sure to
back any performance-related patches with benchmark numbers and profiling
output (perf c2c / perf stat).

Sorry for the noise (again..).

On Sat, 28 Feb 2026 at 17:28, Tejun Heo <tj@kernel.org> wrote:
>
> On Sat, Feb 28, 2026 at 01:06:47PM +0000, David Carlier wrote:
> > lock (write-heavy) and first_task (read-mostly, lockless RCU peek) share
> > the same cache line in struct scx_dispatch_q. Every lock acquire/release
> > by a dispatching CPU invalidates the line for all CPUs performing
> > lockless first_task peeks, causing unnecessary cache coherence traffic,
> > especially across NUMA nodes.
> >
> > Add ____cacheline_aligned_in_smp to first_task to place it on its own
> > cache line, eliminating this false sharing on SMP systems. On
> > uniprocessor builds the annotation is a no-op, so no space is wasted.
> >
> > On SMP, the trade-off is increased struct size: each scx_dispatch_q
> > grows by up to ~56 bytes of padding. There are two instances embedded
> > per-CPU in scx_rq (local_dsq and bypass_dsq), plus any dynamically
> > allocated custom DSQs, so the total overhead scales with the number of
> > CPUs and active DSQs.
>
> But first_task is read-mostly. How could it be? David, from now on, I'm not
> going to apply these patches unless you provide backing experimental data.
>
> Thanks.
>
> --
> tejun