[PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled

Adam Li posted 3 patches 1 year, 2 months ago
[PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled
Posted by Adam Li 1 year, 2 months ago
Enabling NEXT_BUDDY triggers warning, and rcu stall:

[  124.977300] cfs_rq->next->sched_delayed
[  124.977310] WARNING: CPU: 51 PID: 2150 at kernel/sched/fair.c:5621 pick_task_fair+0x130/0x150
[  125.049547] CPU: 51 UID: 0 PID: 2150 Comm: kworker/51:1 Tainted: G            E      6.12.0.adam+ #1
<snip>
[  125.163561] Call trace:
[  125.165996]  pick_task_fair+0x130/0x150 (P)
[  125.170167]  pick_task_fair+0x130/0x150 (L)
[  125.174338]  pick_next_task_fair+0x48/0x3c0
[  125.178512]  __pick_next_task+0x4c/0x220
[  125.182426]  pick_next_task+0x44/0x980
[  125.186163]  __schedule+0x3d0/0x628
[  125.189645]  schedule+0x3c/0xe0
[  125.192776]  worker_thread+0x1a4/0x368
[  125.196516]  kthread+0xfc/0x110
[  125.199647]  ret_from_fork+0x10/0x20
[  125.203213] ---[ end trace 0000000000000000 ]---
<snip>
[  211.151849] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  211.159759] rcu:     (detected by 141, t=15003 jiffies, g=5629, q=26516 ncpus=384)
<snip>

Do not set next buddy if sched_delayed is set.

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbdca89c677f..cd1188b7f3df 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8748,6 +8748,8 @@ static void set_next_buddy(struct sched_entity *se)
 			return;
 		if (se_is_idle(se))
 			return;
+		if (se->sched_delayed)
+			return;
 		cfs_rq_of(se)->next = se;
 	}
 }
-- 
2.25.1
Re: [PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled
Posted by K Prateek Nayak 1 year, 2 months ago
Hello Adam,

On 11/27/2024 11:26 AM, Adam Li wrote:
> Enabling NEXT_BUDDY triggers warning, and rcu stall:
> 
> [  124.977300] cfs_rq->next->sched_delayed

I could reproduce this with a run of "perf bench sched messaging" but
given that we hit this warning, it also means that either
set_next_buddy() has incorrectly set a delayed entity as next buddy, or
clear_next_buddy() did not clear a delayed entity.

I also see PSI splats like:

     psi: inconsistent task state! task=2524:kworker/u1028:2 cpu=154 psi_flags=10 clear=14 set=0

but the PSI flags it has set "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL)" and
the flags it is trying to clear
"(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL | TSK_RUNNING)" seem to be only
possible if you have picked a dequeued entity for running before its
wakeup, which is also perhaps why the "nr_running" computation goes awry
and pick_eevdf() returns NULL (which it should never since
pick_next_entity() is only called when rq->cfs.nr_running is > 0)

> [  124.977310] WARNING: CPU: 51 PID: 2150 at kernel/sched/fair.c:5621 pick_task_fair+0x130/0x150
> [  125.049547] CPU: 51 UID: 0 PID: 2150 Comm: kworker/51:1 Tainted: G            E      6.12.0.adam+ #1
> <snip>
> [  125.163561] Call trace:
> [  125.165996]  pick_task_fair+0x130/0x150 (P)
> [  125.170167]  pick_task_fair+0x130/0x150 (L)
> [  125.174338]  pick_next_task_fair+0x48/0x3c0
> [  125.178512]  __pick_next_task+0x4c/0x220
> [  125.182426]  pick_next_task+0x44/0x980
> [  125.186163]  __schedule+0x3d0/0x628
> [  125.189645]  schedule+0x3c/0xe0
> [  125.192776]  worker_thread+0x1a4/0x368
> [  125.196516]  kthread+0xfc/0x110
> [  125.199647]  ret_from_fork+0x10/0x20
> [  125.203213] ---[ end trace 0000000000000000 ]---
> <snip>
> [  211.151849] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  211.159759] rcu:     (detected by 141, t=15003 jiffies, g=5629, q=26516 ncpus=384)
> <snip>
> 
> Do not set next buddy if sched_delayed is set.
> 
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
> ---
>   kernel/sched/fair.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fbdca89c677f..cd1188b7f3df 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8748,6 +8748,8 @@ static void set_next_buddy(struct sched_entity *se)
>   			return;
>   		if (se_is_idle(se))
>   			return;
> +		if (se->sched_delayed)
> +			return;

I tried to put a SCHED_WARN_ON() here to track where this comes from and
seems like it is usually from attach_task() in the load balancing path
pulling a delayed task which is set as the next buddy in
check_preempt_wakeup_fair()

Can you please try the following diff instead of the first two patches
and see if you still hit these warnings, stalls, and pick_eevdf()
returning NULL?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff7cae9274c5..61e74eb5af22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5478,6 +5478,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
  	bool sleep = flags & DEQUEUE_SLEEP;
  
  	update_curr(cfs_rq);
+	clear_buddies(cfs_rq, se);
  
  	if (flags & DEQUEUE_DELAYED) {
  		SCHED_WARN_ON(!se->sched_delayed);
@@ -5520,8 +5521,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
  
  	update_stats_dequeue_fair(cfs_rq, se, flags);
  
-	clear_buddies(cfs_rq, se);
-
  	update_entity_lag(cfs_rq, se);
  	if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
  		se->deadline -= se->vruntime;
@@ -8767,7 +8766,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
  	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
  		return;
  
-	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK)) {
+	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
  		set_next_buddy(pse);
  	}
  
--

If you are still encountering pick_eevdf() returning NULL, there could
be a larger issues (with eligibility computation, etc.) that the second
patch can hide which can lead to bigger problems later. Thank you.

>   		cfs_rq_of(se)->next = se;
>   	}
>   }

-- 
Thanks and Regards,
Prateek
Re: [PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled
Posted by Peter Zijlstra 1 year, 2 months ago
On Thu, Nov 28, 2024 at 12:59:54PM +0530, K Prateek Nayak wrote:

> Can you please try the following diff instead of the first two patches
> and see if you still hit these warnings, stalls, and pick_eevdf()
> returning NULL?
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff7cae9274c5..61e74eb5af22 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5478,6 +5478,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	bool sleep = flags & DEQUEUE_SLEEP;
>  	update_curr(cfs_rq);
> +	clear_buddies(cfs_rq, se);
>  	if (flags & DEQUEUE_DELAYED) {
>  		SCHED_WARN_ON(!se->sched_delayed);
> @@ -5520,8 +5521,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	update_stats_dequeue_fair(cfs_rq, se, flags);
> -	clear_buddies(cfs_rq, se);
> -
>  	update_entity_lag(cfs_rq, se);
>  	if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
>  		se->deadline -= se->vruntime;
> @@ -8767,7 +8766,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>  	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
>  		return;
> -	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK)) {
> +	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
>  		set_next_buddy(pse);
>  	}


Prateek, I've presumed your SoB on this change:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/urgent&id=d1139307fe97ffefcf90212772f7516732a11034

Holler if you want it modified.

Thanks!
Re: [PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled
Posted by K Prateek Nayak 1 year, 2 months ago
Hello Peter,

On 12/3/2024 9:35 PM, Peter Zijlstra wrote:
> On Thu, Nov 28, 2024 at 12:59:54PM +0530, K Prateek Nayak wrote:
> 
>> Can you please try the following diff instead of the first two patches
>> and see if you still hit these warnings, stalls, and pick_eevdf()
>> returning NULL?
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ff7cae9274c5..61e74eb5af22 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5478,6 +5478,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>   	bool sleep = flags & DEQUEUE_SLEEP;
>>   	update_curr(cfs_rq);
>> +	clear_buddies(cfs_rq, se);
>>   	if (flags & DEQUEUE_DELAYED) {
>>   		SCHED_WARN_ON(!se->sched_delayed);
>> @@ -5520,8 +5521,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>   	update_stats_dequeue_fair(cfs_rq, se, flags);
>> -	clear_buddies(cfs_rq, se);
>> -
>>   	update_entity_lag(cfs_rq, se);
>>   	if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
>>   		se->deadline -= se->vruntime;
>> @@ -8767,7 +8766,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>   	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
>>   		return;
>> -	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK)) {
>> +	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
>>   		set_next_buddy(pse);
>>   	}
> 
> 
> Prateek, I've presumed your SoB on this change:
> 
>    https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/urgent&id=d1139307fe97ffefcf90212772f7516732a11034

No objections from my side! While at it, perhaps you can also squash in:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ed4af8be76b..eadcd64c03e9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5519,8 +5519,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
  
  		if (sched_feat(DELAY_DEQUEUE) && delay &&
  		    !entity_eligible(cfs_rq, se)) {
-			if (cfs_rq->next == se)
-				cfs_rq->next = NULL;
  			update_load_avg(cfs_rq, se, 0);
  			set_delayed(se);
  			return false;
--

Since we do a clear_buddy() upfront, we no longer need this special case
for delayed entities. Tested it on top of queue:sched/urgent with
hackbench and I didn't run into any problems / splats. Thank you.

> 
> Holler if you want it modified.
> 
> Thanks!

-- 
Thanks and Regards,
Prateek
Re: [PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled
Posted by Peter Zijlstra 1 year, 2 months ago
On Thu, Nov 28, 2024 at 12:59:54PM +0530, K Prateek Nayak wrote:

> I tried to put a SCHED_WARN_ON() here to track where this comes from and
> seems like it is usually from attach_task() in the load balancing path
> pulling a delayed task which is set as the next buddy in
> check_preempt_wakeup_fair()
> 
> Can you please try the following diff instead of the first two patches
> and see if you still hit these warnings, stalls, and pick_eevdf()
> returning NULL?
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff7cae9274c5..61e74eb5af22 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5478,6 +5478,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	bool sleep = flags & DEQUEUE_SLEEP;
>  	update_curr(cfs_rq);
> +	clear_buddies(cfs_rq, se);
>  	if (flags & DEQUEUE_DELAYED) {
>  		SCHED_WARN_ON(!se->sched_delayed);
> @@ -5520,8 +5521,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	update_stats_dequeue_fair(cfs_rq, se, flags);
> -	clear_buddies(cfs_rq, se);
> -
>  	update_entity_lag(cfs_rq, se);
>  	if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
>  		se->deadline -= se->vruntime;

So this puts the clear_buddies() before the whole delayed thing, and
should be sufficient afaict, no?

> @@ -8767,7 +8766,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>  	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
>  		return;
> -	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK)) {
> +	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
>  		set_next_buddy(pse);
>  	}

But then this should never happen, which is after a wakeup, p and the
whole hierarchy up should be runnable at this point.

Or should I go find more wake-up juice and try again :-)



Anyway..  I'm sure I started a patch series cleaning up the whole next
buddy thing months ago (there's more problems here), but I can't seem to
find it in a hurry :/
Re: [PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled
Posted by K Prateek Nayak 1 year, 2 months ago
Hello Peter,

On 11/29/2024 3:25 PM, Peter Zijlstra wrote:
> On Thu, Nov 28, 2024 at 12:59:54PM +0530, K Prateek Nayak wrote:
> 
>> I tried to put a SCHED_WARN_ON() here to track where this comes from and
>> seems like it is usually from attach_task() in the load balancing path
>> pulling a delayed task which is set as the next buddy in
>> check_preempt_wakeup_fair()
>>
>> Can you please try the following diff instead of the first two patches
>> and see if you still hit these warnings, stalls, and pick_eevdf()
>> returning NULL?
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ff7cae9274c5..61e74eb5af22 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5478,6 +5478,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>   	bool sleep = flags & DEQUEUE_SLEEP;
>>   	update_curr(cfs_rq);
>> +	clear_buddies(cfs_rq, se);
>>   	if (flags & DEQUEUE_DELAYED) {
>>   		SCHED_WARN_ON(!se->sched_delayed);
>> @@ -5520,8 +5521,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>   	update_stats_dequeue_fair(cfs_rq, se, flags);
>> -	clear_buddies(cfs_rq, se);
>> -
>>   	update_entity_lag(cfs_rq, se);
>>   	if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
>>   		se->deadline -= se->vruntime;
> 
> So this puts the clear_buddies() before the whole delayed thing, and
> should be sufficient afaict, no?
> 
>> @@ -8767,7 +8766,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>   	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
>>   		return;
>> -	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK)) {
>> +	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
>>   		set_next_buddy(pse);
>>   	}
> 
> But then this should never happen, which is after a wakeup, p and the
> whole hierarchy up should be runnable at this point.
> 
> Or should I go find more wake-up juice and try again :-)

The motivation there was this splat from my testing:

     Kernel panic - not syncing: Encountered delayed entity in check_preempt_wakeup_fair()
     CPU: 190 UID: 1000 PID: 4215 Comm: sched-messaging Tainted: G        W          6.12.0-rc4-test+ #156
     Tainted: [W]=WARN
     Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
     Call Trace:
      <TASK>
      panic+0x399/0x3f0
      check_preempt_wakeup_fair+0x21b/0x220
      wakeup_preempt+0x64/0x70
      sched_balance_rq+0x970/0x12e0
      ? update_load_avg+0x7e/0x7e0
      sched_balance_newidle+0x1e2/0x490
      pick_next_task_fair+0x32/0x3b0
      __pick_next_task+0x3d/0x1a0
      __schedule+0x1da/0x1540
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? aa_file_perm+0x121/0x4d0
      schedule+0x28/0x110
      pipe_read+0x345/0x470
      ? __pfx_autoremove_wake_function+0x10/0x10
      vfs_read+0x2f1/0x330
      ksys_read+0xaf/0xe0
      do_syscall_64+0x6f/0x110
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? current_time+0x31/0xf0
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? atime_needs_update+0x9c/0x110
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? touch_atime+0x1e/0x100
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? pipe_read+0x3ec/0x470
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? vfs_read+0x2f1/0x330
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? ksys_read+0xcc/0xe0
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? syscall_exit_to_user_mode+0x51/0x1a0
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? do_syscall_64+0x7b/0x110
      ? ksys_read+0xcc/0xe0
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? syscall_exit_to_user_mode+0x51/0x1a0
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? do_syscall_64+0x7b/0x110
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? ksys_read+0xcc/0xe0
      ? do_syscall_64+0x7b/0x110
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? syscall_exit_to_user_mode+0x51/0x1a0
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? do_syscall_64+0x7b/0x110
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? do_syscall_64+0x7b/0x110
      entry_SYSCALL_64_after_hwframe+0x76/0x7e
     RIP: 0033:0x7fa8b851481c
     Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 e9 c1 f7 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 2f c2 f7 ff 48
     RSP: 002b:00007fa8ae0d8cb0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
     RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa8b851481c
     RDX: 0000000000000064 RSI: 00007fa8ae0d8cf0 RDI: 0000000000000027
     RBP: 00007fa8ae0d8d90 R08: 0000000000000000 R09: 00007ffee0fdf97f
     R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000064
     R13: 00007fa8ae0d8cf0 R14: 00000000000005aa R15: 000055c2e80accb0
      </TASK>
--

newidle balance pulls a delayed entity which goes through the
check_preempt_wakeup_fair() path in attach_task() and is set as the next
buddy. On a second thought this is perhaps not required since even if
this delayed entity is picked, it'll go thorough a full dequeue and the
clear_buddies() change above should take care of it.

> 
> 
> 
> Anyway..  I'm sure I started a patch series cleaning up the whole next
> buddy thing months ago (there's more problems here), but I can't seem to
> find it in a hurry :/
> 
> 
> 

-- 
Thanks and Regards,
Prateek
Re: [PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled
Posted by K Prateek Nayak 1 year, 2 months ago
On 11/29/2024 11:16 PM, K Prateek Nayak wrote:
> [..snip..]
> newidle balance pulls a delayed entity which goes through the
> check_preempt_wakeup_fair() path in attach_task() and is set as the next
> buddy. On a second thought this is perhaps not required since even if
> this delayed entity is picked, it'll go thorough a full dequeue and the
> clear_buddies() change above should take care of it.

That said, it'll still trigger the following warning in
pick_next_entity():

     /* ->next will never be delayed */
     SCHED_WARN_ON(cfs_rq->next->sched_delayed);

which is also why I added that check in check_preempt_wakeup_fair() :)

-- 
Thanks and Regards,
Prateek
[PATCH] sched/fair: Untangle NEXT_BUDDY and pick_next_task()
Posted by Peter Zijlstra 1 year, 2 months ago
On Fri, Nov 29, 2024 at 10:55:00AM +0100, Peter Zijlstra wrote:

> Anyway..  I'm sure I started a patch series cleaning up the whole next
> buddy thing months ago (there's more problems here), but I can't seem to
> find it in a hurry :/

There was this..

---
Subject: sched/fair: Untangle NEXT_BUDDY and pick_next_task()
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Nov 29 10:36:59 CET 2024

There are 3 sites using set_next_buddy() and only one is conditional
on NEXT_BUDDY, the other two sites are unconditional; to note:

  - yield_to_task()
  - cgroup dequeue / pick optimization

However, having NEXT_BUDDY control both the wakeup-preemption and the
picking side of things means its near useless.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |    4 ++--
 kernel/sched/features.h |    9 +++++++++
 2 files changed, 11 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5613,9 +5613,9 @@ static struct sched_entity *
 pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 {
 	/*
-	 * Enabling NEXT_BUDDY will affect latency but not fairness.
+	 * Picking the ->next buddy will affect latency but not fairness.
 	 */
-	if (sched_feat(NEXT_BUDDY) &&
+	if (sched_feat(PICK_BUDDY) &&
 	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
 		/* ->next will never be delayed */
 		SCHED_WARN_ON(cfs_rq->next->sched_delayed);
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -32,6 +32,15 @@ SCHED_FEAT(PREEMPT_SHORT, true)
 SCHED_FEAT(NEXT_BUDDY, false)
 
 /*
+ * Allow completely ignoring cfs_rq->next; which can be set from various
+ * places:
+ *   - NEXT_BUDDY (wakeup preemption)
+ *   - yield_to_task()
+ *   - cgroup dequeue / pick
+ */
+SCHED_FEAT(PICK_BUDDY, true)
+
+/*
  * Consider buddies to be cache hot, decreases the likeliness of a
  * cache buddy being migrated away, increases cache locality.
  */
> 
>
Re: [PATCH] sched/fair: Untangle NEXT_BUDDY and pick_next_task()
Posted by Adam Li 1 year, 2 months ago
On 11/29/2024 6:15 PM, Peter Zijlstra wrote:
> On Fri, Nov 29, 2024 at 10:55:00AM +0100, Peter Zijlstra wrote:
> 
>> Anyway..  I'm sure I started a patch series cleaning up the whole next
>> buddy thing months ago (there's more problems here), but I can't seem to
>> find it in a hurry :/
> 
> There was this..
> 
Hi Peter and Prateek,

I tested the two patches on 6.13-rc1 + patch "sched/fair: Fix NEXT_BUDDY"
(https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/urgent&id=d1139307fe97ffefcf90212772f7516732a11034)
 
1) sched/fair: Untangle NEXT_BUDDY and pick_next_task()
2) sched/fair: Add CGROUP_BUDDY feature

With all 2^3=8 combinations: (NO_)NEXT_BUDDY, (NO_)CGROUP_BUDDY, (NO_)PICK_BUDDY,
there is no warning or panic. There is no significant performance difference for
Specjbb workload.

And there is no much performance difference before and after the two patches.

Before the patches, I think the default setting 'NO_NEXT_BUDDY' logically
equals to 'NO_PICK_BUDDY && CGROUP_BUDDY && NO_NEXT_BUDDY'.
After the patches, the default becomes 'PICK_BUDDY && CGROUP_BUDDY && NO_NEXT_BUDDY'.

Thanks,
-adam
> ---
> Subject: sched/fair: Untangle NEXT_BUDDY and pick_next_task()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Nov 29 10:36:59 CET 2024
> 
> There are 3 sites using set_next_buddy() and only one is conditional
> on NEXT_BUDDY, the other two sites are unconditional; to note:
> 
>   - yield_to_task()
>   - cgroup dequeue / pick optimization
> 
> However, having NEXT_BUDDY control both the wakeup-preemption and the
> picking side of things means its near useless.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c     |    4 ++--
>  kernel/sched/features.h |    9 +++++++++
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5613,9 +5613,9 @@ static struct sched_entity *
>  pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
>  {
>  	/*
> -	 * Enabling NEXT_BUDDY will affect latency but not fairness.
> +	 * Picking the ->next buddy will affect latency but not fairness.
>  	 */
> -	if (sched_feat(NEXT_BUDDY) &&
> +	if (sched_feat(PICK_BUDDY) &&
>  	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
>  		/* ->next will never be delayed */
>  		SCHED_WARN_ON(cfs_rq->next->sched_delayed);
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -32,6 +32,15 @@ SCHED_FEAT(PREEMPT_SHORT, true)
>  SCHED_FEAT(NEXT_BUDDY, false)
>  
>  /*
> + * Allow completely ignoring cfs_rq->next; which can be set from various
> + * places:
> + *   - NEXT_BUDDY (wakeup preemption)
> + *   - yield_to_task()
> + *   - cgroup dequeue / pick
> + */
> +SCHED_FEAT(PICK_BUDDY, true)
> +
> +/*
>   * Consider buddies to be cache hot, decreases the likeliness of a
>   * cache buddy being migrated away, increases cache locality.
>   */
>>
>>
Re: [PATCH] sched/fair: Untangle NEXT_BUDDY and pick_next_task()
Posted by Peter Zijlstra 1 year, 2 months ago
On Fri, Nov 29, 2024 at 11:15:41AM +0100, Peter Zijlstra wrote:
> On Fri, Nov 29, 2024 at 10:55:00AM +0100, Peter Zijlstra wrote:
> 
> > Anyway..  I'm sure I started a patch series cleaning up the whole next
> > buddy thing months ago (there's more problems here), but I can't seem to
> > find it in a hurry :/
> 
> There was this..

And this I think.

Adam, what was the reason you were enabling NEXT_BUDDY in the first
place?

I think someone (Ingo?) was proposing we kill the wakeup preempt thing;
and I suspect you don't actually care about that but instead want either
the cgroup or the yield_to_task()/KVM thing working.

---
Subject: sched/fair: Add CGROUP_BUDDY feature
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Nov 29 10:49:45 CET 2024

Add a feature to toggle the cgroup optimization.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |    3 ++-
 kernel/sched/features.h |    8 +++++++-
 2 files changed, 9 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7126,7 +7126,8 @@ static int dequeue_entities(struct rq *r
 			 * Bias pick_next to pick a task from this cfs_rq, as
 			 * p is sleeping when it is within its sched_slice.
 			 */
-			if (task_sleep && se && !throttled_hierarchy(cfs_rq))
+			if (sched_feat(CGROUP_BUDDY) &&
+			    task_sleep && se && !throttled_hierarchy(cfs_rq))
 				set_next_buddy(se);
 			break;
 		}
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -32,11 +32,17 @@ SCHED_FEAT(PREEMPT_SHORT, true)
 SCHED_FEAT(NEXT_BUDDY, false)
 
 /*
+ * Optimization for cgroup scheduling where a dequeue + pick tries
+ * to share as much of the hierarchy as possible.
+ */
+SCHED_FEAT(CGROUP_BUDDY, true)
+
+/*
  * Allow completely ignoring cfs_rq->next; which can be set from various
  * places:
  *   - NEXT_BUDDY (wakeup preemption)
  *   - yield_to_task()
- *   - cgroup dequeue / pick
+ *   - CGROUP_BUDDY (cgroup dequeue / pick)
  */
 SCHED_FEAT(PICK_BUDDY, true)
Re: [PATCH] sched/fair: Untangle NEXT_BUDDY and pick_next_task()
Posted by Adam Li 1 year, 2 months ago
On 11/29/2024 6:18 PM, Peter Zijlstra wrote:
> On Fri, Nov 29, 2024 at 11:15:41AM +0100, Peter Zijlstra wrote:
>> On Fri, Nov 29, 2024 at 10:55:00AM +0100, Peter Zijlstra wrote:
>>
>>> Anyway..  I'm sure I started a patch series cleaning up the whole next
>>> buddy thing months ago (there's more problems here), but I can't seem to
>>> find it in a hurry :/
>>
>> There was this..
> 
> And this I think.
> 
> Adam, what was the reason you were enabling NEXT_BUDDY in the first
> place?
> 
Hi Peter,

I am tuning Specjbb critical-jOPS, which is latency sensitive.
NEXT_BUDDY affects schedule latency so I tried to enable NEXT_BUDDY.
However Specjbb critical-jOPS drops with NEXT_BUDDY enabled (after my patch fixing panic).

I will test your new NEXT_BUDDY patches.

> I think someone (Ingo?) was proposing we kill the wakeup preempt thing;
> and I suspect you don't actually care about that but instead want either
> the cgroup or the yield_to_task()/KVM thing working.
> 
> ---
> Subject: sched/fair: Add CGROUP_BUDDY feature
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Nov 29 10:49:45 CET 2024
> 
> Add a feature to toggle the cgroup optimization.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c     |    3 ++-
>  kernel/sched/features.h |    8 +++++++-
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7126,7 +7126,8 @@ static int dequeue_entities(struct rq *r
>  			 * Bias pick_next to pick a task from this cfs_rq, as
>  			 * p is sleeping when it is within its sched_slice.
>  			 */
> -			if (task_sleep && se && !throttled_hierarchy(cfs_rq))
> +			if (sched_feat(CGROUP_BUDDY) &&
> +			    task_sleep && se && !throttled_hierarchy(cfs_rq))
>  				set_next_buddy(se);
>  			break;
>  		}
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -32,11 +32,17 @@ SCHED_FEAT(PREEMPT_SHORT, true)
>  SCHED_FEAT(NEXT_BUDDY, false)
>  
>  /*
> + * Optimization for cgroup scheduling where a dequeue + pick tries
> + * to share as much of the hierarchy as possible.
> + */
> +SCHED_FEAT(CGROUP_BUDDY, true)
> +
> +/*
>   * Allow completely ignoring cfs_rq->next; which can be set from various
>   * places:
>   *   - NEXT_BUDDY (wakeup preemption)
>   *   - yield_to_task()
> - *   - cgroup dequeue / pick
> + *   - CGROUP_BUDDY (cgroup dequeue / pick)
>   */
>  SCHED_FEAT(PICK_BUDDY, true)
>  

Thanks,
-adam
Re: [PATCH] sched/fair: Untangle NEXT_BUDDY and pick_next_task()
Posted by Peter Zijlstra 1 year, 2 months ago
On Fri, Nov 29, 2024 at 06:37:06PM +0800, Adam Li wrote:
> On 11/29/2024 6:18 PM, Peter Zijlstra wrote:
> > On Fri, Nov 29, 2024 at 11:15:41AM +0100, Peter Zijlstra wrote:
> >> On Fri, Nov 29, 2024 at 10:55:00AM +0100, Peter Zijlstra wrote:
> >>
> >>> Anyway..  I'm sure I started a patch series cleaning up the whole next
> >>> buddy thing months ago (there's more problems here), but I can't seem to
> >>> find it in a hurry :/
> >>
> >> There was this..
> > 
> > And this I think.
> > 
> > Adam, what was the reason you were enabling NEXT_BUDDY in the first
> > place?
> > 
> Hi Peter,
> 
> I am tuning Specjbb critical-jOPS, which is latency sensitive.

There is a lot to latency, sometimes it's best to not preempt. I think
Prateek has found a fair number of workloads where SCHED_BATCH has been
helpful.

> NEXT_BUDDY affects schedule latency so I tried to enable NEXT_BUDDY.
> However Specjbb critical-jOPS drops with NEXT_BUDDY enabled (after my patch fixing panic).

Yes, picking outside of the EEVDF policy can make worse decisions for
latency.

The yield_to_task() can help performance for KVM (the only user AFAIK
-- oh DMA fences seem to also use it these days).

And the CGROUP_BUDDY thing can sometimes help when using cgroups.

But the wakeup thing is very situational -- it's disabled for a reason.
Unfortunately it seems to also have disabled the other users, which
wasn't intended.
	
> I will test your new NEXT_BUDDY patches.

We still need Prateek's fix. That ensures a delayed task will ever end
up being ->next.
[tip: sched/core] sched/fair: Untangle NEXT_BUDDY and pick_next_task()
Posted by tip-bot2 for Peter Zijlstra 1 year, 1 month ago
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2a77e4be12cb58bbf774e7c717c8bb80e128b7a4
Gitweb:        https://git.kernel.org/tip/2a77e4be12cb58bbf774e7c717c8bb80e128b7a4
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 29 Nov 2024 11:15:41 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 09 Dec 2024 11:48:13 +01:00

sched/fair: Untangle NEXT_BUDDY and pick_next_task()

There are 3 sites using set_next_buddy() and only one is conditional
on NEXT_BUDDY, the other two sites are unconditional; to note:

  - yield_to_task()
  - cgroup dequeue / pick optimization

However, having NEXT_BUDDY control both the wakeup-preemption and the
picking side of things means its near useless.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241129101541.GA33464@noisy.programming.kicks-ass.net
---
 kernel/sched/fair.c     |  4 ++--
 kernel/sched/features.h |  9 +++++++++
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b505d3d..2c4ebfc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5630,9 +5630,9 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 	struct sched_entity *se;
 
 	/*
-	 * Enabling NEXT_BUDDY will affect latency but not fairness.
+	 * Picking the ->next buddy will affect latency but not fairness.
 	 */
-	if (sched_feat(NEXT_BUDDY) &&
+	if (sched_feat(PICK_BUDDY) &&
 	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
 		/* ->next will never be delayed */
 		SCHED_WARN_ON(cfs_rq->next->sched_delayed);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a3d331d..3c12d9f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -32,6 +32,15 @@ SCHED_FEAT(PREEMPT_SHORT, true)
 SCHED_FEAT(NEXT_BUDDY, false)
 
 /*
+ * Allow completely ignoring cfs_rq->next; which can be set from various
+ * places:
+ *   - NEXT_BUDDY (wakeup preemption)
+ *   - yield_to_task()
+ *   - cgroup dequeue / pick
+ */
+SCHED_FEAT(PICK_BUDDY, true)
+
+/*
  * Consider buddies to be cache hot, decreases the likeliness of a
  * cache buddy being migrated away, increases cache locality.
  */
Re: [PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled
Posted by Adam Li 1 year, 2 months ago
On 11/28/2024 3:29 PM, K Prateek Nayak wrote:
> Hello Adam,
> 
Hi Prateek,
Thanks for comments.

> On 11/27/2024 11:26 AM, Adam Li wrote:
>> Enabling NEXT_BUDDY triggers warning, and rcu stall:
>>
>> [  124.977300] cfs_rq->next->sched_delayed
> 
> I could reproduce this with a run of "perf bench sched messaging" but
> given that we hit this warning, it also means that either
> set_next_buddy() has incorrectly set a delayed entity as next buddy, or
> clear_next_buddy() did not clear a delayed entity.
> 
Yes. The logic of this patch is a delayed entity should not be set as next buddy.

> I also see PSI splats like:
> 
>     psi: inconsistent task state! task=2524:kworker/u1028:2 cpu=154 psi_flags=10 clear=14 set=0
> 
> but the PSI flags it has set "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL)" and
> the flags it is trying to clear
> "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL | TSK_RUNNING)" seem to be only
> possible if you have picked a dequeued entity for running before its
> wakeup, which is also perhaps why the "nr_running" computation goes awry
> and pick_eevdf() returns NULL (which it should never since
> pick_next_entity() is only called when rq->cfs.nr_running is > 0)
IIUC, one path for pick_eevdf() to return NULL is:
pick_eevdf():
<snip>
	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
		curr = NULL; <--- curr is set to NULL
<snip>
found:
	if (!best || (curr && entity_before(curr, best)))
		best = curr; <--- curr and best are both NULL

	return best;  <--- return NULL

>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index fbdca89c677f..cd1188b7f3df 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8748,6 +8748,8 @@ static void set_next_buddy(struct sched_entity *se)
>>               return;
>>           if (se_is_idle(se))
>>               return;
>> +        if (se->sched_delayed)
>> +            return;
> 
> I tried to put a SCHED_WARN_ON() here to track where this comes from and
> seems like it is usually from attach_task() in the load balancing path
> pulling a delayed task which is set as the next buddy in
> check_preempt_wakeup_fair()
> 
> Can you please try the following diff instead of the first two patches
> and see if you still hit these warnings, stalls, and pick_eevdf()
> returning NULL?
> 
Tested. Run specjbb with NEXT_BUDDY enabled, warnings, stalls and panic disappear.

Regards,
-adam
Re: [PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled
Posted by K Prateek Nayak 1 year, 2 months ago
Hello Adam,

On 11/29/2024 8:51 AM, Adam Li wrote:
> On 11/28/2024 3:29 PM, K Prateek Nayak wrote:
>> Hello Adam,
>>
> Hi Prateek,
> Thanks for comments.
> 
>> On 11/27/2024 11:26 AM, Adam Li wrote:
>>> Enabling NEXT_BUDDY triggers warning, and rcu stall:
>>>
>>> [  124.977300] cfs_rq->next->sched_delayed
>>
>> I could reproduce this with a run of "perf bench sched messaging" but
>> given that we hit this warning, it also means that either
>> set_next_buddy() has incorrectly set a delayed entity as next buddy, or
>> clear_next_buddy() did not clear a delayed entity.
>>
> Yes. The logic of this patch is a delayed entity should not be set as next buddy.
> 
>> I also see PSI splats like:
>>
>>      psi: inconsistent task state! task=2524:kworker/u1028:2 cpu=154 psi_flags=10 clear=14 set=0
>>
>> but the PSI flags it has set "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL)" and
>> the flags it is trying to clear
>> "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL | TSK_RUNNING)" seem to be only
>> possible if you have picked a dequeued entity for running before its
>> wakeup, which is also perhaps why the "nr_running" computation goes awry
>> and pick_eevdf() returns NULL (which it should never since
>> pick_next_entity() is only called when rq->cfs.nr_running is > 0)
> IIUC, one path for pick_eevdf() to return NULL is:
> pick_eevdf():
> <snip>
> 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> 		curr = NULL; <--- curr is set to NULL

"on_rq" is only cleared when the entity is dequeued so "curr" is in fact
going to sleep (proper sleep) and we've reached at pick_eevdf(),
otherwise, if "curr" is not eligible, there is at least one more tasks
on the cfs_rq which implies best has be found and will be non-null.

> <snip>
> found:
> 	if (!best || (curr && entity_before(curr, best)))
> 		best = curr; <--- curr and best are both NULL

Say "curr" is going to sleep, and there is no "best", in which case
"curr" is already blocked and "cfs_rq->nr_running" should be 0 and it
should have not reached pick_eevdf() in the first place since
pick_next_entity() is only called by pick_task_fair() if
"cfs_rq->nr_running" is non-zero.

So as long as "cfs_rq->nr_running" is non-zero, pick_eevdf() should
return a valid runnable entity. Failure to do so perhaps points to
"entity_eligible()" check going sideways somewhere or a bug in
"nr_running" accounting.

Chenyu had proposed a similar fix long back in
https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/
but the consensus was it was covering up a larger problem which
then boiled down to avg_vruntime being computed incorrectly
https://lore.kernel.org/lkml/ZiAWTU5xb%2FJMn%2FHs@chenyu5-mobl2/

> 
> 	return best;  <--- return NULL
> 
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index fbdca89c677f..cd1188b7f3df 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8748,6 +8748,8 @@ static void set_next_buddy(struct sched_entity *se)
>>>                return;
>>>            if (se_is_idle(se))
>>>                return;
>>> +        if (se->sched_delayed)
>>> +            return;
>>
>> I tried to put a SCHED_WARN_ON() here to track where this comes from and
>> seems like it is usually from attach_task() in the load balancing path
>> pulling a delayed task which is set as the next buddy in
>> check_preempt_wakeup_fair()
>>
>> Can you please try the following diff instead of the first two patches
>> and see if you still hit these warnings, stalls, and pick_eevdf()
>> returning NULL?
>>
> Tested. Run specjbb with NEXT_BUDDY enabled, warnings, stalls and panic disappear.

Thank you for testing. I'll let Peter come back on which approach he
prefers :)

> 
> Regards,
> -adam

-- 
Thanks and Regards,
Prateek

Re: [PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled
Posted by Adam Li 1 year, 2 months ago
On 11/29/2024 12:28 PM, K Prateek Nayak wrote:

>>> I also see PSI splats like:
>>>
>>>      psi: inconsistent task state! task=2524:kworker/u1028:2 cpu=154 psi_flags=10 clear=14 set=0
>>>
>>> but the PSI flags it has set "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL)" and
>>> the flags it is trying to clear
>>> "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL | TSK_RUNNING)" seem to be only
>>> possible if you have picked a dequeued entity for running before its
>>> wakeup, which is also perhaps why the "nr_running" computation goes awry
>>> and pick_eevdf() returns NULL (which it should never since
>>> pick_next_entity() is only called when rq->cfs.nr_running is > 0)
>> IIUC, one path for pick_eevdf() to return NULL is:
>> pick_eevdf():
>> <snip>
>>     if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>>         curr = NULL; <--- curr is set to NULL
> 
> "on_rq" is only cleared when the entity is dequeued so "curr" is in fact
> going to sleep (proper sleep) and we've reached at pick_eevdf(),
> otherwise, if "curr" is not eligible, there is at least one more tasks
> on the cfs_rq which implies best has be found and will be non-null.
> 
if curr->sched_delayed == 1, the condition: '(curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))'
can be true. Please correct me if wrong.

>> <snip>
>> found:
>>     if (!best || (curr && entity_before(curr, best)))
>>         best = curr; <--- curr and best are both NULL
> 
> Say "curr" is going to sleep, and there is no "best", in which case
> "curr" is already blocked and "cfs_rq->nr_running" should be 0 and it
> should have not reached pick_eevdf() in the first place since
> pick_next_entity() is only called by pick_task_fair() if
> "cfs_rq->nr_running" is non-zero.
> 
> So as long as "cfs_rq->nr_running" is non-zero, pick_eevdf() should
> return a valid runnable entity. Failure to do so perhaps points to
> "entity_eligible()" check going sideways somewhere or a bug in
> "nr_running" accounting.
> 
> Chenyu had proposed a similar fix long back in
> https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/
> but the consensus was it was covering up a larger problem which
> then boiled down to avg_vruntime being computed incorrectly
> https://lore.kernel.org/lkml/ZiAWTU5xb%2FJMn%2FHs@chenyu5-mobl2/
> 
Thanks for the information.
From the timeline, it seems the issue is before 152e11f6df29 ("sched/fair: Implement delayed dequeue").
DELAY_DEQUEUE may introduce risk for pick_eevdf() return NULL.

After patch 1 ("Fix warning if NEXT_BUDDY enabled"), the NULL pointer panic disappears.
Patch 2 ("Fix panic if pick_eevdf() returns NULL") is a safe guard.

Thanks,
-adam

Re: [PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled
Posted by K Prateek Nayak 1 year, 2 months ago
Hello Adam,

On 11/29/2024 1:10 PM, Adam Li wrote:
> On 11/29/2024 12:28 PM, K Prateek Nayak wrote:
> 
>>>> I also see PSI splats like:
>>>>
>>>>       psi: inconsistent task state! task=2524:kworker/u1028:2 cpu=154 psi_flags=10 clear=14 set=0
>>>>
>>>> but the PSI flags it has set "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL)" and
>>>> the flags it is trying to clear
>>>> "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL | TSK_RUNNING)" seem to be only
>>>> possible if you have picked a dequeued entity for running before its
>>>> wakeup, which is also perhaps why the "nr_running" computation goes awry
>>>> and pick_eevdf() returns NULL (which it should never since
>>>> pick_next_entity() is only called when rq->cfs.nr_running is > 0)
>>> IIUC, one path for pick_eevdf() to return NULL is:
>>> pick_eevdf():
>>> <snip>
>>>      if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>>>          curr = NULL; <--- curr is set to NULL
>>
>> "on_rq" is only cleared when the entity is dequeued so "curr" is in fact
>> going to sleep (proper sleep) and we've reached at pick_eevdf(),
>> otherwise, if "curr" is not eligible, there is at least one more tasks
>> on the cfs_rq which implies best has be found and will be non-null.
>>
> if curr->sched_delayed == 1, the condition: '(curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))'
> can be true. Please correct me if wrong.

If curr is sched_delayed, that means it is going to sleep and it is
ineligible but it can only be ineligible if there is at least one more
task with a lower vruntime and hence best must be found. A delayed
entity will also not decrement the "nr_running" and it'll be queued back
from put_prev_entity() to be picked off later.

Essentially, I believe curr can never be ineligible in absence of other
eligible tasks.

> 
>>> <snip>
>>> found:
>>>      if (!best || (curr && entity_before(curr, best)))
>>>          best = curr; <--- curr and best are both NULL
>>
>> Say "curr" is going to sleep, and there is no "best", in which case
>> "curr" is already blocked and "cfs_rq->nr_running" should be 0 and it
>> should have not reached pick_eevdf() in the first place since
>> pick_next_entity() is only called by pick_task_fair() if
>> "cfs_rq->nr_running" is non-zero.
>>
>> So as long as "cfs_rq->nr_running" is non-zero, pick_eevdf() should
>> return a valid runnable entity. Failure to do so perhaps points to
>> "entity_eligible()" check going sideways somewhere or a bug in
>> "nr_running" accounting.
>>
>> Chenyu had proposed a similar fix long back in
>> https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/
>> but the consensus was it was covering up a larger problem which
>> then boiled down to avg_vruntime being computed incorrectly
>> https://lore.kernel.org/lkml/ZiAWTU5xb%2FJMn%2FHs@chenyu5-mobl2/
>>
> Thanks for the information.
>  From the timeline, it seems the issue is before 152e11f6df29 ("sched/fair: Implement delayed dequeue").
> DELAY_DEQUEUE may introduce risk for pick_eevdf() return NULL.

Ideally it shouldn't since delayed entities are still captured in
"cfs_rq->nr_running" and they'll eventually become eligible and be
picked off but I may be wrong and I hope someone corrects me in which
case :)

> 
> After patch 1 ("Fix warning if NEXT_BUDDY enabled"), the NULL pointer panic disappears.
> Patch 2 ("Fix panic if pick_eevdf() returns NULL") is a safe guard.
> 
> Thanks,
> -adam
> 

-- 
Thanks and Regards,
Prateek

[tip: sched/core] sched/fair: Fix NEXT_BUDDY
Posted by tip-bot2 for K Prateek Nayak 1 year, 1 month ago
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     493afbd187c4c9cc1642792c0d9ba400c3d6d90d
Gitweb:        https://git.kernel.org/tip/493afbd187c4c9cc1642792c0d9ba400c3d6d90d
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Thu, 28 Nov 2024 12:59:54 +05:30
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 09 Dec 2024 11:48:09 +01:00

sched/fair: Fix NEXT_BUDDY

Adam reports that enabling NEXT_BUDDY insta triggers a WARN in
pick_next_entity().

Moving clear_buddies() up before the delayed dequeue bits ensures
no ->next buddy becomes delayed. Further ensure no new ->next buddy
ever starts as delayed.

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Reported-by: Adam Li <adamli@os.amperecomputing.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Adam Li <adamli@os.amperecomputing.com>
Link: https://lkml.kernel.org/r/670a0d54-e398-4b1f-8a6e-90784e2fdf89@amd.com
---
 kernel/sched/fair.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 05b8f1e..9d7a2dd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5478,6 +5478,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	bool sleep = flags & DEQUEUE_SLEEP;
 
 	update_curr(cfs_rq);
+	clear_buddies(cfs_rq, se);
 
 	if (flags & DEQUEUE_DELAYED) {
 		SCHED_WARN_ON(!se->sched_delayed);
@@ -5494,8 +5495,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 		if (sched_feat(DELAY_DEQUEUE) && delay &&
 		    !entity_eligible(cfs_rq, se)) {
-			if (cfs_rq->next == se)
-				cfs_rq->next = NULL;
 			update_load_avg(cfs_rq, se, 0);
 			se->sched_delayed = 1;
 			return false;
@@ -5520,8 +5519,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_stats_dequeue_fair(cfs_rq, se, flags);
 
-	clear_buddies(cfs_rq, se);
-
 	update_entity_lag(cfs_rq, se);
 	if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
 		se->deadline -= se->vruntime;
@@ -8774,7 +8771,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int 
 	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
 		return;
 
-	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK)) {
+	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
 		set_next_buddy(pse);
 	}