[PATCH 00/14] sched: Support shared runqueue locking
Posted by Peter Zijlstra 4 months, 4 weeks ago
Hi,

As mentioned [1], a fair amount of sched ext weirdness (current and proposed)
is down to the core code not quite working right for shared runqueue stuff.

Instead of endlessly hacking around that, bite the bullet and fix it all up.

With these patches, it should be possible to clean up pick_task_scx() to not
rely on balance_scx(). Additionally it should be possible to fix that RT issue,
and the dl_server issue without further propagating lock breaks.
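
For orientation, a minimal sketch of the locking shape this adds -- an
illustration inferred from the patch titles in this series ("sched: Add
shared runqueue locking to __task_rq_lock()", "sched/ext: Implement
p->srq_lock support"), not code from the series itself:

/*
 * Sketch only: __task_rq_lock() additionally takes the class-provided
 * shared runqueue lock (p->srq_lock, e.g. sched_ext's dsq->lock), so
 * anyone holding task_rq_lock() also pins the task's shared queue.
 * Everything beyond __task_rq_lock()/p->srq_lock is an assumption.
 */
struct rq *__task_rq_lock_sketch(struct task_struct *p, struct rq_flags *rf)
{
	struct rq *rq;

	lockdep_assert_held(&p->pi_lock);

	for (;;) {
		rq = task_rq(p);
		raw_spin_rq_lock(rq);
		if (p->srq_lock)
			raw_spin_lock(p->srq_lock);	/* nests inside rq->__lock */
		if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
			rq_pin_lock(rq, rf);
			return rq;
		}
		if (p->srq_lock)
			raw_spin_unlock(p->srq_lock);
		raw_spin_rq_unlock(rq);
	}
}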

As is, these patches boot and run/pass selftests/sched_ext with lockdep on.

I meant to do more sched_ext cleanups, but since this has all already taken
longer than I would've liked (real life interrupted :/), I figured I should
post this as is and let TJ/Andrea poke at it.

Patches are also available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/cleanup


[1] https://lkml.kernel.org/r/20250904202858.GN4068168@noisy.programming.kicks-ass.net


---
 include/linux/cleanup.h  |   5 +
 include/linux/sched.h    |   6 +-
 kernel/cgroup/cpuset.c   |   2 +-
 kernel/kthread.c         |  15 +-
 kernel/sched/core.c      | 370 +++++++++++++++++++++--------------------------
 kernel/sched/deadline.c  |  26 ++--
 kernel/sched/ext.c       | 104 +++++++------
 kernel/sched/fair.c      |  23 ++-
 kernel/sched/idle.c      |  14 +-
 kernel/sched/rt.c        |  13 +-
 kernel/sched/sched.h     | 225 ++++++++++++++++++++--------
 kernel/sched/stats.h     |   2 +-
 kernel/sched/stop_task.c |  14 +-
 kernel/sched/syscalls.c  |  80 ++++------
 14 files changed, 495 insertions(+), 404 deletions(-)
Re: [PATCH 00/14] sched: Support shared runqueue locking
Posted by Christian Loehle 4 months, 3 weeks ago
On 9/10/25 16:44, Peter Zijlstra wrote:
> Hi,
> 
> As mentioned [1], a fair amount of sched ext weirdness (current and proposed)
> is down to the core code not quite working right for shared runqueue stuff.
> 
> Instead of endlessly hacking around that, bite the bullet and fix it all up.
> 
> With these patches, it should be possible to clean up pick_task_scx() to not
> rely on balance_scx(). Additionally it should be possible to fix that RT issue,
> and the dl_server issue without further propagating lock breaks.
> 
> As is, these patches boot and run/pass selftests/sched_ext with lockdep on.
> 
> I meant to do more sched_ext cleanups, but since this has all already taken
> longer than I would've liked (real life interrupted :/), I figured I should
> post this as is and let TJ/Andrea poke at it.
> 
> Patches are also available at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/cleanup
> 
> 
> [1] https://lkml.kernel.org/r/20250904202858.GN4068168@noisy.programming.kicks-ass.net
> 
> 
> ---
>  include/linux/cleanup.h  |   5 +
>  include/linux/sched.h    |   6 +-
>  kernel/cgroup/cpuset.c   |   2 +-
>  kernel/kthread.c         |  15 +-
>  kernel/sched/core.c      | 370 +++++++++++++++++++++--------------------------
>  kernel/sched/deadline.c  |  26 ++--
>  kernel/sched/ext.c       | 104 +++++++------
>  kernel/sched/fair.c      |  23 ++-
>  kernel/sched/idle.c      |  14 +-
>  kernel/sched/rt.c        |  13 +-
>  kernel/sched/sched.h     | 225 ++++++++++++++++++++--------
>  kernel/sched/stats.h     |   2 +-
>  kernel/sched/stop_task.c |  14 +-
>  kernel/sched/syscalls.c  |  80 ++++------
>  14 files changed, 495 insertions(+), 404 deletions(-)
> 
> 

Hi Peter,

A couple of issues popped up when testing this [1] (that don't trigger on [2]):

When booting (arm64 orion o6) I get:

[    1.298020] sched: DL replenish lagged too much
[    1.298364] ------------[ cut here ]------------
[    1.298377] WARNING: CPU: 4 PID: 0 at kernel/sched/deadline.c:239 inactive_task_timer+0x3d0/0x474
[    1.298413] Modules linked in:
[    1.298436] CPU: 4 UID: 0 PID: 0 Comm: swapper/4 Tainted: G S                  6.17.0-rc4-cix-build+ #56 PREEMPT 
[    1.298455] Tainted: [S]=CPU_OUT_OF_SPEC
[    1.298463] Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 0.3.0-1 2025-04-28T03:35:34+00:00
[    1.298473] pstate: 034000c9 (nzcv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[    1.298486] pc : inactive_task_timer+0x3d0/0x474
[    1.298505] lr : inactive_task_timer+0x394/0x474
[    1.298522] sp : ffff800083d4be00
[    1.298530] x29: ffff800083d4be20 x28: ffff00008362d888 x27: ffff800082ab1f88
[    1.298561] x26: ffff800082ab4a98 x25: ffff0001fef50c18 x24: 0000000000019999
[    1.298589] x23: 000000000000cccc x22: ffff0001fef51708 x21: ffff00008362d640
[    1.298616] x20: ffff0001fef50c00 x19: ffff00008362d7f0 x18: fffffffffff0b580
[    1.298642] x17: ffff80017c966000 x16: ffff800083d48000 x15: 0000000000000028
[    1.298669] x14: 0000000000000000 x13: 00000000000c4000 x12: 00000000000000c5
[    1.298695] x11: 0000000000004bb8 x10: 0000000000004bb8 x9 : 0000000000000000
[    1.298722] x8 : 0000000000000000 x7 : 0000000000000011 x6 : ffff0001fef51bc0
[    1.298747] x5 : ffff0001fef50c00 x4 : 00000000000000cc x3 : 0000000000000000
[    1.298773] x2 : ffff80017c966000 x1 : 0000000000000000 x0 : ffffffffffff3333
[    1.298800] Call trace:
[    1.298808]  inactive_task_timer+0x3d0/0x474 (P)
[    1.298830]  __hrtimer_run_queues+0x3c4/0x440
[    1.298852]  hrtimer_interrupt+0xe4/0x244
[    1.298871]  arch_timer_handler_phys+0x2c/0x44
[    1.298893]  handle_percpu_devid_irq+0xa8/0x1f0
[    1.298916]  handle_irq_desc+0x40/0x58
[    1.298933]  generic_handle_domain_irq+0x1c/0x28
[    1.298950]  gic_handle_irq+0x4c/0x11c
[    1.298965]  call_on_irq_stack+0x30/0x48
[    1.298982]  do_interrupt_handler+0x80/0x84
[    1.299001]  el1_interrupt+0x34/0x64
[    1.299022]  el1h_64_irq_handler+0x18/0x24
[    1.299037]  el1h_64_irq+0x6c/0x70
[    1.299052]  finish_task_switch.isra.0+0xac/0x2bc (P)
[    1.299070]  __schedule+0x45c/0xffc
[    1.299088]  schedule_idle+0x28/0x48
[    1.299104]  do_idle+0x184/0x27c
[    1.299121]  cpu_startup_entry+0x34/0x3c
[    1.299137]  secondary_start_kernel+0x134/0x154
[    1.299158]  __secondary_switched+0xc0/0xc4
[    1.299179] irq event stamp: 1634
[    1.299189] hardirqs last  enabled at (1633): [<ffff800081486354>] el1_interrupt+0x54/0x64
[    1.299210] hardirqs last disabled at (1634): [<ffff800081486324>] el1_interrupt+0x24/0x64
[    1.299229] softirqs last  enabled at (1614): [<ffff8000800bf7b0>] handle_softirqs+0x4a0/0x4b8
[    1.299248] softirqs last disabled at (1609): [<ffff800080010600>] __do_softirq+0x14/0x20
[    1.299262] ---[ end trace 0000000000000000 ]---

and when running actual tests (e.g. iterating through all scx schedulers under load):

[  146.532691] ================================
[  146.536947] WARNING: inconsistent lock state
[  146.541204] 6.17.0-rc4-cix-build+ #58 Tainted: G S      W
[  146.547457] --------------------------------
[  146.551713] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[  146.557705] rcu_tasks_trace/79 [HC0[0]:SC0[0]:HE0:SE1] takes:
[  146.563438] ffff000089c90e58 (&dsq->lock){?.-.}-{2:2}, at: __task_rq_lock+0x88/0x194
[  146.571179] {IN-HARDIRQ-W} state was registered at:
[  146.576042]   lock_acquire+0x1c8/0x338
[  146.579788]   _raw_spin_lock+0x48/0x60
[  146.583536]   dispatch_enqueue+0x130/0x3e8
[  146.587632]   do_enqueue_task+0x2f0/0x464
[  146.591629]   enqueue_task_scx+0x1b0/0x290
[  146.595712]   enqueue_task+0x84/0x18c
[  146.599360]   ttwu_do_activate+0x84/0x25c
[  146.603361]   try_to_wake_up+0x310/0x5f8
[  146.607271]   wake_up_process+0x18/0x24
[  146.611094]   kick_pool+0x9c/0x17c
[  146.614483]   __queue_work+0x544/0x7a8
[  146.618223]   __queue_delayed_work+0x118/0x15c
[  146.622653]   mod_delayed_work_on+0xcc/0xe0
[  146.626823]   kblockd_mod_delayed_work_on+0x20/0x30
[  146.631696]   blk_mq_kick_requeue_list+0x1c/0x28
[  146.636307]   blk_flush_complete_seq+0xd4/0x2a4
[  146.640824]   flush_end_io+0x1e0/0x3f4
[  146.644559]   blk_mq_end_request+0x60/0x154
[  146.648733]   nvme_end_req+0x30/0x78
[  146.652306]   nvme_complete_rq+0x7c/0x218
[  146.656302]   nvme_pci_complete_rq+0x98/0x110
[  146.660650]   nvme_poll_cq+0x1cc/0x3b4
[  146.664385]   nvme_irq+0x34/0x88
[  146.667600]   __handle_irq_event_percpu+0x88/0x304
[  146.672384]   handle_irq_event+0x4c/0xa8
[  146.676293]   handle_fasteoi_irq+0x108/0x20c
[  146.680555]   handle_irq_desc+0x40/0x58
[  146.684378]   generic_handle_domain_irq+0x1c/0x28
[  146.689068]   gic_handle_irq+0x4c/0x11c
[  146.692891]   call_on_irq_stack+0x30/0x48
[  146.696891]   do_interrupt_handler+0x80/0x84
[  146.701151]   el1_interrupt+0x34/0x64
[  146.704810]   el1h_64_irq_handler+0x18/0x24
[  146.708979]   el1h_64_irq+0x6c/0x70
[  146.712453]   cpuidle_enter_state+0x12c/0x53c
[  146.716796]   cpuidle_enter+0x38/0x50
[  146.720458]   do_idle+0x204/0x27c
[  146.723759]   cpu_startup_entry+0x38/0x3c
[  146.727755]   secondary_start_kernel+0x134/0x154
[  146.732370]   __secondary_switched+0xc0/0xc4
[  146.736638] irq event stamp: 1754
[  146.739938] hardirqs last  enabled at (1753): [<ffff800081497184>] _raw_spin_unlock_irqrestore+0x6c/0x70
[  146.749405] hardirqs last disabled at (1754): [<ffff8000814965e4>] _raw_spin_lock_irqsave+0x84/0x88
[  146.758437] softirqs last  enabled at (1664): [<ffff800080195598>] rcu_tasks_invoke_cbs+0x100/0x394
[  146.767476] softirqs last disabled at (1660): [<ffff800080195598>] rcu_tasks_invoke_cbs+0x100/0x394
[  146.776506]
[  146.776506] other info that might help us debug this:
[  146.783019]  Possible unsafe locking scenario:
[  146.783019]
[  146.788923]        CPU0
[  146.791356]        ----
[  146.793788]   lock(&dsq->lock);
[  146.796915]   <Interrupt>
[  146.799521]     lock(&dsq->lock);
[  146.802821]
[  146.802821]  *** DEADLOCK ***
[  146.802821]
[  146.808725] 3 locks held by rcu_tasks_trace/79:
[  146.813242]  #0: ffff800082e6e650 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{4:4}, at: rcu_tasks_one_gp+0x328/0x570
[  146.823403]  #1: ffff800082adc1f0 (cpu_hotplug_lock){++++}-{0:0}, at: cpus_read_lock+0x10/0x1c
[  146.832014]  #2: ffff000089c90e58 (&dsq->lock){?.-.}-{2:2}, at: __task_rq_lock+0x88/0x194
[  146.840178] 

[  146.840178] stack backtrace:
[  146.844521] CPU: 10 UID: 0 PID: 79 Comm: rcu_tasks_trace Tainted: G S      W           6.17.0-rc4-cix-build+ #58 PREEMPT 
[  146.855463] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
[  146.860240] Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 0.3.0-1 2025-04-28T03:35:34+00:00
[  146.872136] Sched_ext: simple (enabled+all), task: runnable_at=-4ms
[  146.872138] Call trace:
[  146.880822]  show_stack+0x18/0x24 (C)
[  146.884471]  dump_stack_lvl+0x90/0xd0
[  146.888131]  dump_stack+0x18/0x24
[  146.891432]  print_usage_bug.part.0+0x29c/0x364
[  146.895950]  mark_lock+0x778/0x978
[  146.899338]  mark_held_locks+0x58/0x90
[  146.903074]  lockdep_hardirqs_on_prepare+0x100/0x210
[  146.908025]  trace_hardirqs_on+0x5c/0x1cc
[  146.912025]  _raw_spin_unlock_irqrestore+0x6c/0x70
[  146.916803]  task_call_func+0x110/0x164
[  146.920625]  trc_wait_for_one_reader.part.0+0x5c/0x3b8
[  146.925750]  check_all_holdout_tasks_trace+0x124/0x480
[  146.930874]  rcu_tasks_wait_gp+0x1f0/0x3b4
[  146.934957]  rcu_tasks_one_gp+0x4a4/0x570
[  146.938953]  rcu_tasks_kthread+0xd4/0xe0
[  146.942862]  kthread+0x148/0x208
[  146.946079]  ret_from_fork+0x10/0x20

(This actually locks up the system without any further print FWIW).

I'll keep testing and start debugging now, but if I can help you with anything immediately, please
do shout.


[1]
This refers to sched/cleanup at the time of writing:
e127838bf8f9 sched: Cleanup NOCLOCK
ce024feefe1c sched/ext: Implement p->srq_lock support
6ef342071dd7 sched: Add {DE,EN}QUEUE_LOCKED
ed738ce6f9fb sched: Add shared runqueue locking to __task_rq_lock()
94f197f28834 sched: Add flags to {put_prev,set_next}_task() methods
254d43c94105 sched: Add locking comments to sched_class methods
f8864b505a17 sched: Make __do_set_cpus_allowed() use the sched_change pattern
d0e9cfb835d3 sched: Rename do_set_cpus_allowed()
cfcabf45249d sched: Fix do_set_cpus_allowed() locking
f7b9b39041fb sched: Fix migrate_disable_switch() locking
91128b33456a sched: Move sched_class::prio_changed() into the change pattern
c59dc6ce071b sched: Cleanup sched_delayed handling for class switches
13ea43940095 sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern
f0b336327a1b sched: Re-arrange the {EN,DE}QUEUE flags
b55442cb4ec1 sched: Employ sched_change guards

[2]
5b726e9bf954 sched/fair: Get rid of throttled_lb_pair()
Re: [PATCH 00/14] sched: Support shared runqueue locking
Posted by Andrea Righi 4 months, 4 weeks ago
Hi Peter,

thanks for jumping on this. Comment below.

On Wed, Sep 10, 2025 at 05:44:09PM +0200, Peter Zijlstra wrote:
> Hi,
> 
> As mentioned [1], a fair amount of sched ext weirdness (current and proposed)
> is down to the core code not quite working right for shared runqueue stuff.
> 
> Instead of endlessly hacking around that, bite the bullet and fix it all up.
> 
> With these patches, it should be possible to clean up pick_task_scx() to not
> rely on balance_scx(). Additionally it should be possible to fix that RT issue,
> and the dl_server issue without further propagating lock breaks.
> 
> As is, these patches boot and run/pass selftests/sched_ext with lockdep on.
> 
> I meant to do more sched_ext cleanups, but since this has all already taken
> longer than I would've liked (real life interrupted :/), I figured I should
> post this as is and let TJ/Andrea poke at it.
> 
> Patches are also available at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/cleanup
> 
> 
> [1] https://lkml.kernel.org/r/20250904202858.GN4068168@noisy.programming.kicks-ass.net

I've done a quick test with this patch set applied and I was able to
trigger this:

[   49.746281] ============================================
[   49.746457] WARNING: possible recursive locking detected
[   49.746559] 6.17.0-rc4-virtme #85 Not tainted
[   49.746666] --------------------------------------------
[   49.746763] stress-ng-race-/5818 is trying to acquire lock:
[   49.746856] ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: dispatch_dequeue+0x125/0x1f0
[   49.747052]
[   49.747052] but task is already holding lock:
[   49.747234] ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x170
[   49.747416]
[   49.747416] other info that might help us debug this:
[   49.747557]  Possible unsafe locking scenario:
[   49.747557]
[   49.747689]        CPU0
[   49.747740]        ----
[   49.747793]   lock(&dsq->lock);
[   49.747867]   lock(&dsq->lock);
[   49.747950]
[   49.747950]  *** DEADLOCK ***
[   49.747950]
[   49.748086]  May be due to missing lock nesting notation
[   49.748086]
[   49.748197] 3 locks held by stress-ng-race-/5818:
[   49.748335]  #0: ffff890e0f0fce70 (&p->pi_lock){-.-.}-{2:2}, at: task_rq_lock+0x38/0x170
[   49.748474]  #1: ffff890e3b6bcc98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0xa0
[   49.748652]  #2: ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x170

Reproducer:

 $ cd tools/sched_ext
 $ make scx_simple
 $ sudo ./build/bin/scx_simple
 ... and in another shell
 $ stress-ng --race-sched 0

I added an explicit BUG_ON() to see where the double locking is happening:

[   15.160400] Call Trace:
[   15.160706]  dequeue_task_scx+0x14a/0x270
[   15.160857]  move_queued_task+0x7d/0x2d0
[   15.160952]  affine_move_task+0x6ca/0x700
[   15.161210]  __set_cpus_allowed_ptr+0x64/0xa0
[   15.161348]  __sched_setaffinity+0x72/0x100
[   15.161459]  sched_setaffinity+0x261/0x2f0
[   15.161569]  __x64_sys_sched_setaffinity+0x50/0x80
[   15.161705]  do_syscall_64+0xbb/0x370
[   15.161816]  entry_SYSCALL_64_after_hwframe+0x77/0x7f

Are we missing a DEQUEUE_LOCKED in the sched_setaffinity() path?

Thanks,
-Andrea
Re: [PATCH 00/14] sched: Support shared runqueue locking
Posted by Peter Zijlstra 4 months, 4 weeks ago
On Wed, Sep 10, 2025 at 07:32:12PM +0200, Andrea Righi wrote:

> Reproducer:
> 
>  $ cd tools/sched_ext
>  $ make scx_simple

FWIW, I only have one machine where this works. Most of my machines this
results in an endless stream of build fail; same for the selftest stuff.

No clues given, just endless build fail :-(
Re: [PATCH 00/14] sched: Support shared runqueue locking
Posted by Peter Zijlstra 4 months, 4 weeks ago
On Thu, Sep 11, 2025 at 04:00:22PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 10, 2025 at 07:32:12PM +0200, Andrea Righi wrote:
> 
> > Reproducer:
> > 
> >  $ cd tools/sched_ext
> >  $ make scx_simple
> 
> FWIW, I only have one machine where this works. Most of my machines this
> results in an endless stream of build fail; same for the selftest stuff.
> 
> No clues given, just endless build fail :-(

Ah, I need to do: make O=/build-path/. The one machine it worked on had
an actual test kernel installed and booted.
Re: [PATCH 00/14] sched: Support shared runqueue locking
Posted by Andrea Righi 4 months, 4 weeks ago
On Thu, Sep 11, 2025 at 04:30:00PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 11, 2025 at 04:00:22PM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 10, 2025 at 07:32:12PM +0200, Andrea Righi wrote:
> > 
> > > Reproducer:
> > > 
> > >  $ cd tools/sched_ext
> > >  $ make scx_simple
> > 
> > FWIW, I only have one machine where this works. Most of my machines this
> > results in an endless stream of build fail; same for the selftest stuff.
> > 
> > No clues given, just endless build fail :-(
> 
> Ah, I need to do: make O=/build-path/. The one machine it worked on had
> an actual test kernel installed and booted.

Maybe you need this?
https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/commit/?h=for-6.18&id=de68c05189cc4508c3ac4e1e44da1ddb16b1bceb

In case you're getting build failures with the likely() macro.

-Andrea
Re: [PATCH 00/14] sched: Support shared runqueue locking
Posted by Peter Zijlstra 4 months, 4 weeks ago
On Wed, Sep 10, 2025 at 07:32:12PM +0200, Andrea Righi wrote:

> [   15.160400] Call Trace:
> [   15.160706]  dequeue_task_scx+0x14a/0x270
> [   15.160857]  move_queued_task+0x7d/0x2d0
> [   15.160952]  affine_move_task+0x6ca/0x700
> [   15.161210]  __set_cpus_allowed_ptr+0x64/0xa0
> [   15.161348]  __sched_setaffinity+0x72/0x100
> [   15.161459]  sched_setaffinity+0x261/0x2f0
> [   15.161569]  __x64_sys_sched_setaffinity+0x50/0x80
> [   15.161705]  do_syscall_64+0xbb/0x370
> [   15.161816]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> Are we missing a DEQUEUE_LOCKED in the sched_setaffinity() path?

Yeah, the affine_move_task->move_queued_task path is messed up. It
relied on raw_spin_lock_irqsave(&p->pi_lock); rq_lock(rq); being
equivalent to task_rq_lock(), which is no longer true.
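
To make the broken equivalence concrete -- schematically, not code from the
series:

	/* the stopper/migration path open-codes the pair: */
	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
	rq_lock(rq, &rf);		/* stops at rq->__lock */

	/* while e.g. sched_setaffinity() goes through the helper: */
	rq = task_rq_lock(p, &rf);	/* pi_lock + rq->__lock + shared rq lock */

With this series the two no longer leave the callee in the same locking
state, so move_queued_task() can no longer guess whether the shared runqueue
lock is held -- the caller has to say (see the flags plumbing below).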

I fixed a few such sites earlier today but missed this one.

I'll go untangle it, but probably something for tomorrow, I'm bound to
make a mess of it now :-)
Re: [PATCH 00/14] sched: Support shared runqueue locking
Posted by Peter Zijlstra 4 months, 4 weeks ago
On Wed, Sep 10, 2025 at 08:35:55PM +0200, Peter Zijlstra wrote:

> I'll go untangle it, but probably something for tomorrow, I'm bound to
> make a mess of it now :-)

Best I could come up with is something like this. I tried a few other
approaches, but they all turned into a bigger mess.

Let me go try and run this.

---
 kernel/sched/core.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2481,11 +2481,11 @@ static inline bool is_cpu_allowed(struct
  * Returns (locked) new rq. Old rq's lock is released.
  */
 static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
-				   struct task_struct *p, int new_cpu)
+				   struct task_struct *p, int new_cpu, int flags)
 {
 	lockdep_assert_rq_held(rq);
 
-	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+	deactivate_task(rq, p, flags | DEQUEUE_NOCLOCK);
 	set_task_cpu(p, new_cpu);
 	rq_unlock(rq, rf);
 
@@ -2493,7 +2493,7 @@ static struct rq *move_queued_task(struc
 
 	rq_lock(rq, rf);
 	WARN_ON_ONCE(task_cpu(p) != new_cpu);
-	activate_task(rq, p, 0);
+	activate_task(rq, p, flags);
 	wakeup_preempt(rq, p, 0);
 
 	return rq;
@@ -2533,7 +2533,7 @@ static struct rq *__migrate_task(struct
 	if (!is_cpu_allowed(p, dest_cpu))
 		return rq;
 
-	rq = move_queued_task(rq, rf, p, dest_cpu);
+	rq = move_queued_task(rq, rf, p, dest_cpu, 0);
 
 	return rq;
 }
@@ -3007,7 +3007,7 @@ static int affine_move_task(struct rq *r
 
 		if (!is_migration_disabled(p)) {
 			if (task_on_rq_queued(p))
-				rq = move_queued_task(rq, rf, p, dest_cpu);
+				rq = move_queued_task(rq, rf, p, dest_cpu, DEQUEUE_LOCKED);
 
 			if (!pending->stop_pending) {
 				p->migration_pending = NULL;
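
A note on the shape of this fix, with assumed semantics: per the earlier
patch "sched: Add {DE,EN}QUEUE_LOCKED", DEQUEUE_LOCKED presumably tells the
class that the caller already holds the shared runqueue lock, so the
class-side dequeue must not take it again. Schematically (an assumption,
not the series' code):

	if (!(flags & DEQUEUE_LOCKED))
		raw_spin_lock(&dsq->lock);
	/* unlink the task from the dsq */
	if (!(flags & DEQUEUE_LOCKED))
		raw_spin_unlock(&dsq->lock);

Hence affine_move_task() -- running under task_rq_lock(), which now takes
that lock -- passes DEQUEUE_LOCKED, while __migrate_task(), reached from the
open-coded locking in the stopper path, passes 0.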
Re: [PATCH 00/14] sched: Support shared runqueue locking
Posted by Andrea Righi 4 months, 4 weeks ago
On Thu, Sep 11, 2025 at 11:58:45AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 10, 2025 at 08:35:55PM +0200, Peter Zijlstra wrote:
> 
> > I'll go untangle it, but probably something for tomorrow, I'm bound to
> > make a mess of it now :-)
> 
> Best I could come up with is something like this. I tried a few other
> approaches, but they all turned into a bigger mess.
> 
> Let me go try and run this.

With this one it's complaining about lockdep_assert_held(p->srq_lock):

[   19.055730] WARNING: CPU: 2 PID: 368 at kernel/sched/core.c:10840 sched_change_begin+0x2ac/0x3e0
...
[   19.056468] RIP: 0010:sched_change_begin+0x2ac/0x3e0
...
[   19.057217] RSP: 0018:ffffa9f7805bbde8 EFLAGS: 00010046
[   19.057359] RAX: 0000000000000000 RBX: ffff97ae04880000 RCX: 0000000000000001
[   19.057464] RDX: 0000000000000046 RSI: ffff97ae01715518 RDI: ffff97ae027f0b68
[   19.057568] RBP: 0000000000000082 R08: 0000000000000001 R09: 0000000000000001
[   19.057706] R10: 0000000000000001 R11: 0000000000000001 R12: ffff97ae3bdbcc80
[   19.057833] R13: ffff97ae93c48000 R14: ffff97ae3b717f20 R15: 0000000000000000
[   19.057973] FS:  00007f18999edb00(0000) GS:ffff97ae93c48000(0000) knlGS:0000000000000000
[   19.058112] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   19.058223] CR2: 000055e1e6b0246c CR3: 0000000102ce8000 CR4: 0000000000750ef0
[   19.058460] PKRU: 55555554
[   19.058561] Call Trace:
[   19.058604]  <TASK>
[   19.058675]  __set_cpus_allowed_ptr_locked+0x17c/0x230
[   19.058769]  __set_cpus_allowed_ptr+0x64/0xa0
[   19.058853]  __sched_setaffinity+0x72/0x100
[   19.058920]  sched_setaffinity+0x261/0x2f0
[   19.058985]  __x64_sys_sched_setaffinity+0x50/0x80
[   19.059084]  do_syscall_64+0xbb/0x370
[   19.059158]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[   19.059236] RIP: 0033:0x7f189a3bd25b
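
An assumption read off this warning rather than the series' code:
sched_change_begin() presumably requires the task's shared runqueue lock,
along the lines of:

	/* assumed shape of the check at kernel/sched/core.c:10840 */
	lockdep_assert_rq_held(task_rq(p));
	if (p->srq_lock)
		lockdep_assert_held(p->srq_lock);	/* the WARN tripping here */

i.e. this path reaches the change pattern without holding the lock that
p->srq_lock currently points at.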

Thanks,
-Andrea

> 
> ---
>  kernel/sched/core.c |   10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2481,11 +2481,11 @@ static inline bool is_cpu_allowed(struct
>   * Returns (locked) new rq. Old rq's lock is released.
>   */
>  static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
> -				   struct task_struct *p, int new_cpu)
> +				   struct task_struct *p, int new_cpu, int flags)
>  {
>  	lockdep_assert_rq_held(rq);
>  
> -	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> +	deactivate_task(rq, p, flags | DEQUEUE_NOCLOCK);
>  	set_task_cpu(p, new_cpu);
>  	rq_unlock(rq, rf);
>  
> @@ -2493,7 +2493,7 @@ static struct rq *move_queued_task(struc
>  
>  	rq_lock(rq, rf);
>  	WARN_ON_ONCE(task_cpu(p) != new_cpu);
> -	activate_task(rq, p, 0);
> +	activate_task(rq, p, flags);
>  	wakeup_preempt(rq, p, 0);
>  
>  	return rq;
> @@ -2533,7 +2533,7 @@ static struct rq *__migrate_task(struct
>  	if (!is_cpu_allowed(p, dest_cpu))
>  		return rq;
>  
> -	rq = move_queued_task(rq, rf, p, dest_cpu);
> +	rq = move_queued_task(rq, rf, p, dest_cpu, 0);
>  
>  	return rq;
>  }
> @@ -3007,7 +3007,7 @@ static int affine_move_task(struct rq *r
>  
>  		if (!is_migration_disabled(p)) {
>  			if (task_on_rq_queued(p))
> -				rq = move_queued_task(rq, rf, p, dest_cpu);
> +				rq = move_queued_task(rq, rf, p, dest_cpu, DEQUEUE_LOCKED);
>  
>  			if (!pending->stop_pending) {
>  				p->migration_pending = NULL;
Re: [PATCH 00/14] sched: Support shared runqueue locking
Posted by Andrea Righi 4 months, 4 weeks ago
On Wed, Sep 10, 2025 at 08:35:55PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 10, 2025 at 07:32:12PM +0200, Andrea Righi wrote:
> 
> > [   15.160400] Call Trace:
> > [   15.160706]  dequeue_task_scx+0x14a/0x270
> > [   15.160857]  move_queued_task+0x7d/0x2d0
> > [   15.160952]  affine_move_task+0x6ca/0x700
> > [   15.161210]  __set_cpus_allowed_ptr+0x64/0xa0
> > [   15.161348]  __sched_setaffinity+0x72/0x100
> > [   15.161459]  sched_setaffinity+0x261/0x2f0
> > [   15.161569]  __x64_sys_sched_setaffinity+0x50/0x80
> > [   15.161705]  do_syscall_64+0xbb/0x370
> > [   15.161816]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > 
> > Are we missing a DEQUEUE_LOCKED in the sched_setaffinity() path?
> 
> Yeah, the affine_move_task->move_queued_task path is messed up. It
> relied on raw_spin_lock_irqsave(&p->pi_lock); rq_lock(rq); being
> equivalent to task_rq_lock(), which is no longer true.
> 
> I fixed a few such sites earlier today but missed this one.
> 
> I'll go untangle it, but probably something for tomorrow, I'm bound to
> make a mess of it now :-)

Sure! I’ll run more tests in the meantime. For now, that's the only issue
I've found. :)

Thanks!
-Andrea
Re: [PATCH 00/14] sched: Support shared runqueue locking
Posted by Peter Zijlstra 4 months, 4 weeks ago
On Wed, Sep 10, 2025 at 07:32:12PM +0200, Andrea Righi wrote:

> I've done a quick test with this patch set applied and I was able to
> trigger this:
> 
> [   49.746281] ============================================
> [   49.746457] WARNING: possible recursive locking detected
> [   49.746559] 6.17.0-rc4-virtme #85 Not tainted
> [   49.746666] --------------------------------------------
> [   49.746763] stress-ng-race-/5818 is trying to acquire lock:
> [   49.746856] ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: dispatch_dequeue+0x125/0x1f0
> [   49.747052]
> [   49.747052] but task is already holding lock:
> [   49.747234] ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x170
> [   49.747416]
> [   49.747416] other info that might help us debug this:
> [   49.747557]  Possible unsafe locking scenario:
> [   49.747557]
> [   49.747689]        CPU0
> [   49.747740]        ----
> [   49.747793]   lock(&dsq->lock);
> [   49.747867]   lock(&dsq->lock);
> [   49.747950]
> [   49.747950]  *** DEADLOCK ***
> [   49.747950]
> [   49.748086]  May be due to missing lock nesting notation
> [   49.748086]
> [   49.748197] 3 locks held by stress-ng-race-/5818:
> [   49.748335]  #0: ffff890e0f0fce70 (&p->pi_lock){-.-.}-{2:2}, at: task_rq_lock+0x38/0x170
> [   49.748474]  #1: ffff890e3b6bcc98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0xa0
> [   49.748652]  #2: ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x170
> 
> Reproducer:
> 
>  $ cd tools/sched_ext
>  $ make scx_simple
>  $ sudo ./build/bin/scx_simple
>  ... and in another shell
>  $ stress-ng --race-sched 0

Heh, the selftests thing was bound to not cover everything. I'll have a
poke at it. Thanks!