[PATCH RFC] sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails
Posted by Chris Mason 3 months, 2 weeks ago
schbench (https://github.com/masoncl/schbench.git) is showing a
regression from previous production kernels that bisected down to:

sched/fair: Remove sysctl_sched_migration_cost condition (c5b0a7eefc)

The schbench command line was:

schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0

This creates 4 message threads pinned to CPUs 0-3, and 256x4 worker
threads spread across the rest of the CPUs.  Neither the worker threads
nor the message threads do any work; they just wake each other up and go
back to sleep as soon as possible.

The end result is the first 4 CPUs are pegged waking up those 1024
workers, and the rest of the CPUs are constantly banging in and out of
idle.  If I take a v6.9 Linus kernel and revert that one commit,
performance goes from 3.4M RPS to 5.4M RPS.

schedstat shows there are ~100x more new idle balance operations, and
profiling shows the worker threads are spending ~20% of their CPU time
on new idle balance.  schedstat also shows that almost all of these new
idle balance attempts are failing to find busy groups.

The fix used here is to crank up the cost of the balance whenever it
fails to find a busy group or busy queue.

Signed-off-by: Chris Mason <clm@fb.com>
---
 kernel/sched/fair.c | 52 ++++++++++++++++++++++++---------------------
 1 file changed, 28 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb2..0c4701564ce01 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11744,6 +11744,32 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
 	}
 }
 
+static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
+{
+	if (cost > sd->max_newidle_lb_cost) {
+		/*
+		 * Track max cost of a domain to make sure to not delay the
+		 * next wakeup on the CPU.   Add a bit to the min so we don't
+		 * have to worry about <= vs <, and to handle the sysctl set at
+		 * zero.
+		 */
+		sd->max_newidle_lb_cost = min(cost, sysctl_sched_migration_cost + 200);
+		sd->last_decay_max_lb_cost = jiffies;
+	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
+		/*
+		 * Decay the newidle max times by ~1% per second to ensure that
+		 * it is not outdated and the current max cost is actually
+		 * shorter.
+		 */
+		sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
+		sd->last_decay_max_lb_cost = jiffies;
+
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
@@ -11782,12 +11808,14 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 
 	group = sched_balance_find_src_group(&env);
 	if (!group) {
+		update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
 		schedstat_inc(sd->lb_nobusyg[idle]);
 		goto out_balanced;
 	}
 
 	busiest = sched_balance_find_src_rq(&env, group);
 	if (!busiest) {
+		update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
 		schedstat_inc(sd->lb_nobusyq[idle]);
 		goto out_balanced;
 	}
@@ -12168,30 +12196,6 @@ void update_max_interval(void)
 	max_load_balance_interval = HZ*num_online_cpus()/10;
 }
 
-static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
-{
-	if (cost > sd->max_newidle_lb_cost) {
-		/*
-		 * Track max cost of a domain to make sure to not delay the
-		 * next wakeup on the CPU.
-		 */
-		sd->max_newidle_lb_cost = cost;
-		sd->last_decay_max_lb_cost = jiffies;
-	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
-		/*
-		 * Decay the newidle max times by ~1% per second to ensure that
-		 * it is not outdated and the current max cost is actually
-		 * shorter.
-		 */
-		sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
-		sd->last_decay_max_lb_cost = jiffies;
-
-		return true;
-	}
-
-	return false;
-}
-
 /*
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
-- 
2.47.1
Re: [PATCH RFC] sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails
Posted by Peter Zijlstra 3 months, 2 weeks ago
On Tue, Jun 24, 2025 at 01:48:08PM -0700, Chris Mason wrote:
> schbench (https://github.com/masoncl/schbench.git) is showing a
> regression from previous production kernels that bisected down to:
> 
> sched/fair: Remove sysctl_sched_migration_cost condition (c5b0a7eefc)
> 
> The schbench command line was:
> 
> schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0
> 
> This creates 4 message threads pinned to CPUs 0-3, and 256x4 worker
> threads spread across the rest of the CPUs.  Neither the worker threads
> or the message threads do any work, they just wake each other up and go
> back to sleep as soon as possible.
> 
> The end result is the first 4 CPUs are pegged waking up those 1024
> workers, and the rest of the CPUs are constantly banging in and out of
> idle.  If I take a v6.9 Linus kernel and revert that one commit,
> performance goes from 3.4M RPS to 5.4M RPS.
> 
> schedstat shows there are ~100x  more new idle balance operations, and
> profiling shows the worker threads are spending ~20% of their CPU time
> on new idle balance.  schedstats also shows that almost all of these new
> idle balance attemps are failing to find busy groups.
> 
> The fix used here is to crank up the cost of the balance whenever it
> fails to find a busy group or busy queue.
> 
> Signed-off-by: Chris Mason <clm@fb.com>
> ---
>  kernel/sched/fair.c | 52 ++++++++++++++++++++++++---------------------
>  1 file changed, 28 insertions(+), 24 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a14da5396fb2..0c4701564ce01 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11744,6 +11744,32 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
>  	}
>  }
>  
> +static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
> +{
> +	if (cost > sd->max_newidle_lb_cost) {
> +		/*
> +		 * Track max cost of a domain to make sure to not delay the
> +		 * next wakeup on the CPU.   Add a bit to the min so we don't
> +		 * have to worry about <= vs <, and to handle the sysctl set at
> +		 * zero.
> +		 */
> +		sd->max_newidle_lb_cost = min(cost, sysctl_sched_migration_cost + 200);
> +		sd->last_decay_max_lb_cost = jiffies;
> +	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
> +		/*
> +		 * Decay the newidle max times by ~1% per second to ensure that
> +		 * it is not outdated and the current max cost is actually
> +		 * shorter.
> +		 */
> +		sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
> +		sd->last_decay_max_lb_cost = jiffies;
> +
> +		return true;
> +	}
> +
> +	return false;
> +}
> +

For the non-RFC version, please split this into a code move and a code
change -- I had to stare waaay too long to spot the difference (if we
keep this code movement at all).

>  /*
>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
>   * tasks if there is an imbalance.
> @@ -11782,12 +11808,14 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>  
>  	group = sched_balance_find_src_group(&env);
>  	if (!group) {
> +		update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
>  		schedstat_inc(sd->lb_nobusyg[idle]);
>  		goto out_balanced;
>  	}
>  
>  	busiest = sched_balance_find_src_rq(&env, group);
>  	if (!busiest) {
> +		update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
>  		schedstat_inc(sd->lb_nobusyq[idle]);
>  		goto out_balanced;
>  	}

So sched_balance_rq() is used for pretty much all load-balancing, not
just newidle.

Either make this conditional like:

	if (idle == CPU_NEWLY_IDLE)
		update_newidle_cost(...);

or do it at the callsite, where we find !pulled_task (i.e. failure).

Specifically, we already do update_newidle_cost() there, perhaps inflate
the cost there instead?

	if (!pulled_task)
		domain_cost += sysctl_sched_migration_cost;

or whatever.
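
I.e., something like this in the SD_BALANCE_NEWIDLE leg of
sched_balance_newidle() (untested sketch, assuming the loop still times
the sched_balance_rq() call into domain_cost):

	if (sd->flags & SD_BALANCE_NEWIDLE) {

		pulled_task = sched_balance_rq(this_cpu, this_rq,
					       sd, CPU_NEWLY_IDLE,
					       &continue_balancing);

		t1 = sched_clock_cpu(this_cpu);
		domain_cost = t1 - t0;

		/* failed pull: charge extra so the next newidle balance
		 * on this domain is throttled harder */
		if (!pulled_task)
			domain_cost += sysctl_sched_migration_cost;

		update_newidle_cost(sd, domain_cost);

		curr_cost += domain_cost;
		t0 = t1;
	}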
Re: [PATCH RFC] sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails
Posted by Chris Mason 3 months, 2 weeks ago
On 6/26/25 3:00 AM, Peter Zijlstra wrote:
> On Tue, Jun 24, 2025 at 01:48:08PM -0700, Chris Mason wrote:

[ ... ]

> For the non-RFC version, please split this into a code move and a code
> change -- I had to stare waaay to long to spot the difference (if we
> keep this code movement at all).

Sure

> 
>>  /*
>>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
>>   * tasks if there is an imbalance.
>> @@ -11782,12 +11808,14 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>  
>>  	group = sched_balance_find_src_group(&env);
>>  	if (!group) {
>> +		update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
>>  		schedstat_inc(sd->lb_nobusyg[idle]);
>>  		goto out_balanced;
>>  	}
>>  
>>  	busiest = sched_balance_find_src_rq(&env, group);
>>  	if (!busiest) {
>> +		update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
>>  		schedstat_inc(sd->lb_nobusyq[idle]);
>>  		goto out_balanced;
>>  	}
> 
> So sched_balance_rq() is used for pretty much all load-balancing, not
> just newidle.
> 
> Either make this conditional like:
> 
> 	if (idle == CPU_NEWLY_IDLE)
> 		update_newidle_cost(...);
> 
> or do it all the callsite, where we find !pulled_task (ie failure).
> 
> Specifically, we already do update_newidle_cost() there, perhaps inflate
> the cost there instead?
> 
> 	if (!pulled_tasks)
> 		domain_cost += sysctl_sched_migration_cost;

Got it, I'll play with that.  Vincent, was there a benchmark I can use
to see if I've regressed the case you were focused on?

-chris
Re: [PATCH RFC] sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails
Posted by Vincent Guittot 3 months, 2 weeks ago
On Thu, 26 Jun 2025 at 12:58, Chris Mason <clm@meta.com> wrote:
>
> On 6/26/25 3:00 AM, Peter Zijlstra wrote:
> > On Tue, Jun 24, 2025 at 01:48:08PM -0700, Chris Mason wrote:
>
> [ ... ]
>
> > For the non-RFC version, please split this into a code move and a code
> > change -- I had to stare waaay to long to spot the difference (if we
> > keep this code movement at all).
>
> Sure
>
> >
> >>  /*
> >>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
> >>   * tasks if there is an imbalance.
> >> @@ -11782,12 +11808,14 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> >>
> >>      group = sched_balance_find_src_group(&env);
> >>      if (!group) {
> >> +            update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
> >>              schedstat_inc(sd->lb_nobusyg[idle]);
> >>              goto out_balanced;
> >>      }
> >>
> >>      busiest = sched_balance_find_src_rq(&env, group);
> >>      if (!busiest) {
> >> +            update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
> >>              schedstat_inc(sd->lb_nobusyq[idle]);
> >>              goto out_balanced;
> >>      }
> >
> > So sched_balance_rq() is used for pretty much all load-balancing, not
> > just newidle.
> >
> > Either make this conditional like:
> >
> >       if (idle == CPU_NEWLY_IDLE)
> >               update_newidle_cost(...);
> >
> > or do it all the callsite, where we find !pulled_task (ie failure).
> >
> > Specifically, we already do update_newidle_cost() there, perhaps inflate
> > the cost there instead?
> >
> >       if (!pulled_tasks)
> >               domain_cost += sysctl_sched_migration_cost;
>
> Got it, I'll play with that.  Vincent, was there a benchmark I can use
> to see if I've regressed the case you were focused on?

It's not a public benchmark, but I had some unitary tests with tasks
waiting on a busy CPU while other CPUs become idle for a "long" time
(but still less than 500us on average).  This is even more true with
frequency scaling, which will minimize the idle duration by decreasing
the frequency.
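
The rough shape of it is something like below, just to give an idea
(this is not the actual test; the pinning and the run/sleep durations
are only indicative):

	/*
	 * A few spinners pinned to CPU0 so there is always a task waiting
	 * there, plus one task per remaining CPU that burns a short burst
	 * and then sleeps ~400us, so those CPUs keep dipping into idle.
	 */
	#define _GNU_SOURCE
	#include <pthread.h>
	#include <sched.h>
	#include <unistd.h>

	static void pin(int cpu)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
	}

	static void *spinner(void *arg)		/* keeps CPU0 busy */
	{
		pin(0);
		for (;;)
			;
		return NULL;
	}

	static void *blinker(void *arg)		/* short idle on other CPUs */
	{
		pin((long)arg);
		for (;;) {
			volatile unsigned long i;

			for (i = 0; i < 20000; i++)	/* run briefly */
				;
			usleep(400);			/* idle ~400us */
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t;
		long cpu, nr = sysconf(_SC_NPROCESSORS_ONLN);

		for (cpu = 0; cpu < 3; cpu++)	/* waiters on CPU0 */
			pthread_create(&t, NULL, spinner, NULL);
		for (cpu = 1; cpu < nr; cpu++)
			pthread_create(&t, NULL, blinker, (void *)cpu);
		pause();
		return 0;
	}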

>
> -chris
>
Re: [PATCH RFC] sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails
Posted by Chris Mason 3 months, 2 weeks ago
On 6/26/25 10:26 AM, Vincent Guittot wrote:
> On Thu, 26 Jun 2025 at 12:58, Chris Mason <clm@meta.com> wrote:

>> Got it, I'll play with that.  Vincent, was there a benchmark I can use
>> to see if I've regressed the case you were focused on?
> 
> It's not a public benchmark but I had some unitary tests with tasks
> waiting on a busy CPU while other CPUs become idle for a "long" time
> (but still less than 500us in average). This is even more true with
> frequency scaling which will minimize the idle duration by decreasing
> the frequency

Ok, I don't think I'll be able to reliably recreate that on my own; can
I ask you to rerun against the v2 I sent out?

-chris
Re: [PATCH RFC] sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails
Posted by Vincent Guittot 3 months, 1 week ago
On Thu, 26 Jun 2025 at 16:55, Chris Mason <clm@meta.com> wrote:
>
> On 6/26/25 10:26 AM, Vincent Guittot wrote:
> > On Thu, 26 Jun 2025 at 12:58, Chris Mason <clm@meta.com> wrote:
>
> >> Got it, I'll play with that.  Vincent, was there a benchmark I can use
> >> to see if I've regressed the case you were focused on?
> >
> > It's not a public benchmark but I had some unitary tests with tasks
> > waiting on a busy CPU while other CPUs become idle for a "long" time
> > (but still less than 500us in average). This is even more true with
> > frequency scaling which will minimize the idle duration by decreasing
> > the frequency
>
> Ok, I don't think I'll be able to reliably recreate that on my own, can
> I ask you to rerun against the v2 I sent out?

Yes, I will run some tests

Vincent

>
> -chris
Re: [PATCH RFC] sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails
Posted by Vincent Guittot 3 months ago

On Fri, 27 Jun 2025 at 15:32, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Thu, 26 Jun 2025 at 16:55, Chris Mason <clm@meta.com> wrote:
> >
> > On 6/26/25 10:26 AM, Vincent Guittot wrote:
> > > On Thu, 26 Jun 2025 at 12:58, Chris Mason <clm@meta.com> wrote:
> >
> > >> Got it, I'll play with that.  Vincent, was there a benchmark I can use
> > >> to see if I've regressed the case you were focused on?
> > >
> > > It's not a public benchmark but I had some unitary tests with tasks
> > > waiting on a busy CPU while other CPUs become idle for a "long" time
> > > (but still less than 500us in average). This is even more true with
> > > frequency scaling which will minimize the idle duration by decreasing
> > > the frequency
> >
> > Ok, I don't think I'll be able to reliably recreate that on my own, can
> > I ask you to rerun against the v2 I sent out?
>
> Yes, I will run some tests

I ran some tests and I haven't seen any differences with or without
this patch, so it looks good to me.
I thought that you were bumping max_newidle_lb_cost to
sysctl_sched_migration_cost when newidle balance failed to pull a task,
but in fact you increase the actual cost, which is ok.
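
That is, I first read it as the first line below, while the patch
actually does the second (with the cap at sysctl_sched_migration_cost +
200 inside update_newidle_cost()):

	/* what I first read it as (not what the patch does) */
	sd->max_newidle_lb_cost = sysctl_sched_migration_cost;

	/* what the patch does: grow the tracked cost by 50%, capped */
	update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);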

Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

>
> Vincent
>
> >
> > -chris
Re: [PATCH RFC] sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails
Posted by Vincent Guittot 3 months, 2 weeks ago
On Thu, 26 Jun 2025 at 09:00, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jun 24, 2025 at 01:48:08PM -0700, Chris Mason wrote:
> > schbench (https://github.com/masoncl/schbench.git) is showing a
> > regression from previous production kernels that bisected down to:
> >
> > sched/fair: Remove sysctl_sched_migration_cost condition (c5b0a7eefc)
> >
> > The schbench command line was:
> >
> > schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0
> >
> > This creates 4 message threads pinned to CPUs 0-3, and 256x4 worker
> > threads spread across the rest of the CPUs.  Neither the worker threads
> > or the message threads do any work, they just wake each other up and go
> > back to sleep as soon as possible.
> >
> > The end result is the first 4 CPUs are pegged waking up those 1024
> > workers, and the rest of the CPUs are constantly banging in and out of
> > idle.  If I take a v6.9 Linus kernel and revert that one commit,
> > performance goes from 3.4M RPS to 5.4M RPS.
> >
> > schedstat shows there are ~100x  more new idle balance operations, and
> > profiling shows the worker threads are spending ~20% of their CPU time
> > on new idle balance.  schedstats also shows that almost all of these new
> > idle balance attemps are failing to find busy groups.
> >
> > The fix used here is to crank up the cost of the balance whenever it
> > fails to find a busy group or busy queue.
> >
> > Signed-off-by: Chris Mason <clm@fb.com>
> > ---
> >  kernel/sched/fair.c | 52 ++++++++++++++++++++++++---------------------
> >  1 file changed, 28 insertions(+), 24 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 7a14da5396fb2..0c4701564ce01 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11744,6 +11744,32 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
> >       }
> >  }
> >
> > +static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
> > +{
> > +     if (cost > sd->max_newidle_lb_cost) {
> > +             /*
> > +              * Track max cost of a domain to make sure to not delay the
> > +              * next wakeup on the CPU.   Add a bit to the min so we don't
> > +              * have to worry about <= vs <, and to handle the sysctl set at
> > +              * zero.
> > +              */
> > +             sd->max_newidle_lb_cost = min(cost, sysctl_sched_migration_cost + 200);
> > +             sd->last_decay_max_lb_cost = jiffies;
> > +     } else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
> > +             /*
> > +              * Decay the newidle max times by ~1% per second to ensure that
> > +              * it is not outdated and the current max cost is actually
> > +              * shorter.
> > +              */
> > +             sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
> > +             sd->last_decay_max_lb_cost = jiffies;
> > +
> > +             return true;
> > +     }
> > +
> > +     return false;
> > +}
> > +
>
> For the non-RFC version, please split this into a code move and a code
> change -- I had to stare waaay to long to spot the difference (if we
> keep this code movement at all).
>
> >  /*
> >   * Check this_cpu to ensure it is balanced within domain. Attempt to move
> >   * tasks if there is an imbalance.
> > @@ -11782,12 +11808,14 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> >
> >       group = sched_balance_find_src_group(&env);
> >       if (!group) {
> > +             update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
> >               schedstat_inc(sd->lb_nobusyg[idle]);
> >               goto out_balanced;
> >       }
> >
> >       busiest = sched_balance_find_src_rq(&env, group);
> >       if (!busiest) {
> > +             update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
> >               schedstat_inc(sd->lb_nobusyq[idle]);
> >               goto out_balanced;
> >       }
>
> So sched_balance_rq() is used for pretty much all load-balancing, not
> just newidle.
>
> Either make this conditional like:
>
>         if (idle == CPU_NEWLY_IDLE)
>                 update_newidle_cost(...);
>
> or do it all the callsite, where we find !pulled_task (ie failure).
>
> Specifically, we already do update_newidle_cost() there, perhaps inflate
> the cost there instead?
>
>         if (!pulled_tasks)
>                 domain_cost += sysctl_sched_migration_cost;

yeah, sched_balance_newidle() looks like a better place to
artificially increase the cost.

>
> or whatever.
>