[PATCH] sched/fair: Untangle NEXT_BUDDY and pick_next_task()

Peter Zijlstra posted 1 patch 1 year, 2 months ago
kernel/sched/fair.c     |    4 ++--
kernel/sched/features.h |    9 +++++++++
2 files changed, 11 insertions(+), 2 deletions(-)
[PATCH] sched/fair: Untangle NEXT_BUDDY and pick_next_task()
Posted by Peter Zijlstra 1 year, 2 months ago
On Fri, Nov 29, 2024 at 10:55:00AM +0100, Peter Zijlstra wrote:

> Anyway..  I'm sure I started a patch series cleaning up the whole next
> buddy thing months ago (there's more problems here), but I can't seem to
> find it in a hurry :/

There was this..

---
Subject: sched/fair: Untangle NEXT_BUDDY and pick_next_task()
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Nov 29 10:36:59 CET 2024

There are 3 sites using set_next_buddy() and only one is conditional
on NEXT_BUDDY, the other two sites are unconditional; to note:

  - yield_to_task()
  - cgroup dequeue / pick optimization

However, having NEXT_BUDDY control both the wakeup-preemption and the
picking side of things means its near useless.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |    4 ++--
 kernel/sched/features.h |    9 +++++++++
 2 files changed, 11 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5613,9 +5613,9 @@ static struct sched_entity *
 pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 {
 	/*
-	 * Enabling NEXT_BUDDY will affect latency but not fairness.
+	 * Picking the ->next buddy will affect latency but not fairness.
 	 */
-	if (sched_feat(NEXT_BUDDY) &&
+	if (sched_feat(PICK_BUDDY) &&
 	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
 		/* ->next will never be delayed */
 		SCHED_WARN_ON(cfs_rq->next->sched_delayed);
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -32,6 +32,15 @@ SCHED_FEAT(PREEMPT_SHORT, true)
 SCHED_FEAT(NEXT_BUDDY, false)
 
 /*
+ * Allow completely ignoring cfs_rq->next; which can be set from various
+ * places:
+ *   - NEXT_BUDDY (wakeup preemption)
+ *   - yield_to_task()
+ *   - cgroup dequeue / pick
+ */
+SCHED_FEAT(PICK_BUDDY, true)
+
+/*
  * Consider buddies to be cache hot, decreases the likeliness of a
  * cache buddy being migrated away, increases cache locality.
  */
> 
>
Re: [PATCH] sched/fair: Untangle NEXT_BUDDY and pick_next_task()
Posted by Adam Li 1 year, 2 months ago
On 11/29/2024 6:15 PM, Peter Zijlstra wrote:
> On Fri, Nov 29, 2024 at 10:55:00AM +0100, Peter Zijlstra wrote:
> 
>> Anyway..  I'm sure I started a patch series cleaning up the whole next
>> buddy thing months ago (there's more problems here), but I can't seem to
>> find it in a hurry :/
> 
> There was this..
> 
Hi Peter and Prateek,

I tested the two patches on 6.13-rc1 + patch "sched/fair: Fix NEXT_BUDDY"
(https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/urgent&id=d1139307fe97ffefcf90212772f7516732a11034)
 
1) sched/fair: Untangle NEXT_BUDDY and pick_next_task()
2) sched/fair: Add CGROUP_BUDDY feature

With all 2^3=8 combinations: (NO_)NEXT_BUDDY, (NO_)CGROUP_BUDDY, (NO_)PICK_BUDDY,
there is no warning or panic. There is no significant performance difference for
Specjbb workload.

And there is no much performance difference before and after the two patches.

Before the patches, I think the default setting 'NO_NEXT_BUDDY' logically
equals to 'NO_PICK_BUDDY && CGROUP_BUDDY && NO_NEXT_BUDDY'.
After the patches, the default becomes 'PICK_BUDDY && CGROUP_BUDDY && NO_NEXT_BUDDY'.

Thanks,
-adam
> ---
> Subject: sched/fair: Untangle NEXT_BUDDY and pick_next_task()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Nov 29 10:36:59 CET 2024
> 
> There are 3 sites using set_next_buddy() and only one is conditional
> on NEXT_BUDDY, the other two sites are unconditional; to note:
> 
>   - yield_to_task()
>   - cgroup dequeue / pick optimization
> 
> However, having NEXT_BUDDY control both the wakeup-preemption and the
> picking side of things means its near useless.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c     |    4 ++--
>  kernel/sched/features.h |    9 +++++++++
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5613,9 +5613,9 @@ static struct sched_entity *
>  pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
>  {
>  	/*
> -	 * Enabling NEXT_BUDDY will affect latency but not fairness.
> +	 * Picking the ->next buddy will affect latency but not fairness.
>  	 */
> -	if (sched_feat(NEXT_BUDDY) &&
> +	if (sched_feat(PICK_BUDDY) &&
>  	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
>  		/* ->next will never be delayed */
>  		SCHED_WARN_ON(cfs_rq->next->sched_delayed);
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -32,6 +32,15 @@ SCHED_FEAT(PREEMPT_SHORT, true)
>  SCHED_FEAT(NEXT_BUDDY, false)
>  
>  /*
> + * Allow completely ignoring cfs_rq->next; which can be set from various
> + * places:
> + *   - NEXT_BUDDY (wakeup preemption)
> + *   - yield_to_task()
> + *   - cgroup dequeue / pick
> + */
> +SCHED_FEAT(PICK_BUDDY, true)
> +
> +/*
>   * Consider buddies to be cache hot, decreases the likeliness of a
>   * cache buddy being migrated away, increases cache locality.
>   */
>>
>>
Re: [PATCH] sched/fair: Untangle NEXT_BUDDY and pick_next_task()
Posted by Peter Zijlstra 1 year, 2 months ago
On Fri, Nov 29, 2024 at 11:15:41AM +0100, Peter Zijlstra wrote:
> On Fri, Nov 29, 2024 at 10:55:00AM +0100, Peter Zijlstra wrote:
> 
> > Anyway..  I'm sure I started a patch series cleaning up the whole next
> > buddy thing months ago (there's more problems here), but I can't seem to
> > find it in a hurry :/
> 
> There was this..

And this I think.

Adam, what was the reason you were enabling NEXT_BUDDY in the first
place?

I think someone (Ingo?) was proposing we kill the wakeup preempt thing;
and I suspect you don't actually care about that but instead want either
the cgroup or the yield_to_task()/KVM thing working.

---
Subject: sched/fair: Add CGROUP_BUDDY feature
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Nov 29 10:49:45 CET 2024

Add a feature to toggle the cgroup optimization.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |    3 ++-
 kernel/sched/features.h |    8 +++++++-
 2 files changed, 9 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7126,7 +7126,8 @@ static int dequeue_entities(struct rq *r
 			 * Bias pick_next to pick a task from this cfs_rq, as
 			 * p is sleeping when it is within its sched_slice.
 			 */
-			if (task_sleep && se && !throttled_hierarchy(cfs_rq))
+			if (sched_feat(CGROUP_BUDDY) &&
+			    task_sleep && se && !throttled_hierarchy(cfs_rq))
 				set_next_buddy(se);
 			break;
 		}
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -32,11 +32,17 @@ SCHED_FEAT(PREEMPT_SHORT, true)
 SCHED_FEAT(NEXT_BUDDY, false)
 
 /*
+ * Optimization for cgroup scheduling where a dequeue + pick tries
+ * to share as much of the hierarchy as possible.
+ */
+SCHED_FEAT(CGROUP_BUDDY, true)
+
+/*
  * Allow completely ignoring cfs_rq->next; which can be set from various
  * places:
  *   - NEXT_BUDDY (wakeup preemption)
  *   - yield_to_task()
- *   - cgroup dequeue / pick
+ *   - CGROUP_BUDDY (cgroup dequeue / pick)
  */
 SCHED_FEAT(PICK_BUDDY, true)
Re: [PATCH] sched/fair: Untangle NEXT_BUDDY and pick_next_task()
Posted by Adam Li 1 year, 2 months ago
On 11/29/2024 6:18 PM, Peter Zijlstra wrote:
> On Fri, Nov 29, 2024 at 11:15:41AM +0100, Peter Zijlstra wrote:
>> On Fri, Nov 29, 2024 at 10:55:00AM +0100, Peter Zijlstra wrote:
>>
>>> Anyway..  I'm sure I started a patch series cleaning up the whole next
>>> buddy thing months ago (there's more problems here), but I can't seem to
>>> find it in a hurry :/
>>
>> There was this..
> 
> And this I think.
> 
> Adam, what was the reason you were enabling NEXT_BUDDY in the first
> place?
> 
Hi Peter,

I am tuning Specjbb critical-jOPS, which is latency sensitive.
NEXT_BUDDY affects schedule latency so I tried to enable NEXT_BUDDY.
However Specjbb critical-jOPS drops with NEXT_BUDDY enabled (after my patch fixing panic).

I will test your new NEXT_BUDDY patches.

> I think someone (Ingo?) was proposing we kill the wakeup preempt thing;
> and I suspect you don't actually care about that but instead want either
> the cgroup or the yield_to_task()/KVM thing working.
> 
> ---
> Subject: sched/fair: Add CGROUP_BUDDY feature
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Nov 29 10:49:45 CET 2024
> 
> Add a feature to toggle the cgroup optimization.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c     |    3 ++-
>  kernel/sched/features.h |    8 +++++++-
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7126,7 +7126,8 @@ static int dequeue_entities(struct rq *r
>  			 * Bias pick_next to pick a task from this cfs_rq, as
>  			 * p is sleeping when it is within its sched_slice.
>  			 */
> -			if (task_sleep && se && !throttled_hierarchy(cfs_rq))
> +			if (sched_feat(CGROUP_BUDDY) &&
> +			    task_sleep && se && !throttled_hierarchy(cfs_rq))
>  				set_next_buddy(se);
>  			break;
>  		}
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -32,11 +32,17 @@ SCHED_FEAT(PREEMPT_SHORT, true)
>  SCHED_FEAT(NEXT_BUDDY, false)
>  
>  /*
> + * Optimization for cgroup scheduling where a dequeue + pick tries
> + * to share as much of the hierarchy as possible.
> + */
> +SCHED_FEAT(CGROUP_BUDDY, true)
> +
> +/*
>   * Allow completely ignoring cfs_rq->next; which can be set from various
>   * places:
>   *   - NEXT_BUDDY (wakeup preemption)
>   *   - yield_to_task()
> - *   - cgroup dequeue / pick
> + *   - CGROUP_BUDDY (cgroup dequeue / pick)
>   */
>  SCHED_FEAT(PICK_BUDDY, true)
>  

Thanks,
-adam
Re: [PATCH] sched/fair: Untangle NEXT_BUDDY and pick_next_task()
Posted by Peter Zijlstra 1 year, 2 months ago
On Fri, Nov 29, 2024 at 06:37:06PM +0800, Adam Li wrote:
> On 11/29/2024 6:18 PM, Peter Zijlstra wrote:
> > On Fri, Nov 29, 2024 at 11:15:41AM +0100, Peter Zijlstra wrote:
> >> On Fri, Nov 29, 2024 at 10:55:00AM +0100, Peter Zijlstra wrote:
> >>
> >>> Anyway..  I'm sure I started a patch series cleaning up the whole next
> >>> buddy thing months ago (there's more problems here), but I can't seem to
> >>> find it in a hurry :/
> >>
> >> There was this..
> > 
> > And this I think.
> > 
> > Adam, what was the reason you were enabling NEXT_BUDDY in the first
> > place?
> > 
> Hi Peter,
> 
> I am tuning Specjbb critical-jOPS, which is latency sensitive.

There is a lot to latency, sometimes it's best to not preempt. I think
Prateek has found a fair number of workloads where SCHED_BATCH has been
helpful.

> NEXT_BUDDY affects schedule latency so I tried to enable NEXT_BUDDY.
> However Specjbb critical-jOPS drops with NEXT_BUDDY enabled (after my patch fixing panic).

Yes, picking outside of the EEVDF policy can make worse decisions for
latency.

The yield_to_task() can help performance for KVM (the only user AFAIK
-- oh DMA fences seem to also use it these days).

And the CGROUP_BUDDY thing can sometimes help when using cgroups.

But the wakeup thing is very situational -- it's disabled for a reason.
Unfortunately it seems to also have disabled the other users, which
wasn't intended.
	
> I will test your new NEXT_BUDDY patches.

We still need Prateek's fix. That ensures a delayed task will ever end
up being ->next.
[tip: sched/core] sched/fair: Untangle NEXT_BUDDY and pick_next_task()
Posted by tip-bot2 for Peter Zijlstra 1 year, 1 month ago
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2a77e4be12cb58bbf774e7c717c8bb80e128b7a4
Gitweb:        https://git.kernel.org/tip/2a77e4be12cb58bbf774e7c717c8bb80e128b7a4
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 29 Nov 2024 11:15:41 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 09 Dec 2024 11:48:13 +01:00

sched/fair: Untangle NEXT_BUDDY and pick_next_task()

There are 3 sites using set_next_buddy() and only one is conditional
on NEXT_BUDDY, the other two sites are unconditional; to note:

  - yield_to_task()
  - cgroup dequeue / pick optimization

However, having NEXT_BUDDY control both the wakeup-preemption and the
picking side of things means its near useless.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241129101541.GA33464@noisy.programming.kicks-ass.net
---
 kernel/sched/fair.c     |  4 ++--
 kernel/sched/features.h |  9 +++++++++
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b505d3d..2c4ebfc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5630,9 +5630,9 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 	struct sched_entity *se;
 
 	/*
-	 * Enabling NEXT_BUDDY will affect latency but not fairness.
+	 * Picking the ->next buddy will affect latency but not fairness.
 	 */
-	if (sched_feat(NEXT_BUDDY) &&
+	if (sched_feat(PICK_BUDDY) &&
 	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
 		/* ->next will never be delayed */
 		SCHED_WARN_ON(cfs_rq->next->sched_delayed);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a3d331d..3c12d9f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -32,6 +32,15 @@ SCHED_FEAT(PREEMPT_SHORT, true)
 SCHED_FEAT(NEXT_BUDDY, false)
 
 /*
+ * Allow completely ignoring cfs_rq->next; which can be set from various
+ * places:
+ *   - NEXT_BUDDY (wakeup preemption)
+ *   - yield_to_task()
+ *   - cgroup dequeue / pick
+ */
+SCHED_FEAT(PICK_BUDDY, true)
+
+/*
  * Consider buddies to be cache hot, decreases the likeliness of a
  * cache buddy being migrated away, increases cache locality.
  */