introduce sched-idle balancing

[RFC PATCH 0/5] introduce sched-idle balancing

Posted by Abel Wu 4 years, 4 months ago

Current load balancing is mainly based on cpu capacity
and task util, which makes sense in the POV of overall
throughput. While there still might be some improvement
can be done by reducing number of overloaded cfs rqs if
sched-idle or idle rq exists.

An CFS runqueue is considered overloaded when there are
more than one pullable non-idle tasks on it (since sched-
idle cpus are treated as idle cpus). And idle tasks are
counted towards rq->cfs.idle_h_nr_running, that is either
assigned SCHED_IDLE policy or placed under idle cgroups.

The overloaded cfs rqs can cause performance issues to
both task types:

  - for latency critical tasks like SCHED_NORMAL,
    time of waiting in the rq will increase and
    result in higher pct99 latency, and

  - batch tasks may not be able to make full use
    of cpu capacity if sched-idle rq exists, thus
    presents poorer throughput.

So in short, the goal of the sched-idle balancing is to
let the *non-idle tasks* make full use of cpu resources.
To achieve that, we mainly do two things:

  - pull non-idle tasks for sched-idle or idle rqs
    from the overloaded ones, and

  - prevent pulling the last non-idle task in an rq

The mask of overloaded cpus is updated in periodic tick
and the idle path at the LLC domain basis. This cpumask
will also be used in SIS as a filter, improving idle cpu
searching.

Tests are done in an Intel Xeon E5-2650 v4 server with
2 NUMA nodes each of which has 12 cores, and with SMT2
enabled, so 48 CPUs in total. Test results are listed
as follows.

  - we used perf messaging test to test throughput
    at different load (groups).

      perf bench sched messaging -g [N] -l 40000

	N	w/o	w/	diff
	1	2.897	2.834	-2.17%
	3	5.156	4.904	-4.89%
	5	7.850	7.617	-2.97%
	10	15.140	14.574	-3.74%
	20	29.387	27.602	-6.07%

    the result shows approximate 2~6% improvement.

  - and schbench to test latency performance in two
    scenarios: quiet and noisy. In quiet test, we
    run schbench in a normal cpu cgroup in a quiet
    system, while the noisy test additionally runs
    perf messaging workload inside an idle cgroup
    as nosie.

      schbench -m 2 -t 24 -i 60 -r 60
      perf bench sched messaging -g 1 -l 4000000

	[quiet]
			w/o	w/
	50.0th		31	31
	75.0th		45	45
	90.0th		55	55
	95.0th		62	61
	*99.0th		85	86
	99.5th		565	318
	99.9th		11536	10992
	max		13029	13067

	[nosiy]
			w/o	w/
	50.0th		34	32
	75.0th		48	45
	90.0th		58	55
	95.0th		65	61
	*99.0th		2364	208
	99.5th		6696	2068
	99.9th		12688	8816
	max		15209	14191

    it can be seen that the quiet test results are
    quite similar, but the p99 latency is greatly
    improved in the nosiy test.

Comments and tests are appreciated!

Abel Wu (5):
  sched/fair: record overloaded cpus
  sched/fair: introduce sched-idle balance
  sched/fair: add stats for sched-idle balancing
  sched/fair: filter out overloaded cpus in sis
  sched/fair: favor cpu capacity for idle tasks

 include/linux/sched/idle.h     |   1 +
 include/linux/sched/topology.h |  15 ++++
 kernel/sched/core.c            |   1 +
 kernel/sched/fair.c            | 187 ++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h           |   6 ++
 kernel/sched/stats.c           |   5 +-
 kernel/sched/topology.c        |   4 +-
 7 files changed, 215 insertions(+), 4 deletions(-)

-- 
2.11.0

Re: [RFC PATCH 0/5] introduce sched-idle balancing

Posted by Mel Gorman 4 years, 4 months ago

On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> Current load balancing is mainly based on cpu capacity
> and task util, which makes sense in the POV of overall
> throughput. While there still might be some improvement
> can be done by reducing number of overloaded cfs rqs if
> sched-idle or idle rq exists.
> 
> An CFS runqueue is considered overloaded when there are
> more than one pullable non-idle tasks on it (since sched-
> idle cpus are treated as idle cpus). And idle tasks are
> counted towards rq->cfs.idle_h_nr_running, that is either
> assigned SCHED_IDLE policy or placed under idle cgroups.
> 

It's not clear how your tests evaluated the balancing of SCHED_IDLE tasks
versus the existing idle balancing and isolated that impact. I suspect
the tests may primarily measured the effect of the SIS filter.

> So in short, the goal of the sched-idle balancing is to
> let the *non-idle tasks* make full use of cpu resources.
> To achieve that, we mainly do two things:
> 
>   - pull non-idle tasks for sched-idle or idle rqs
>     from the overloaded ones, and
> 
>   - prevent pulling the last non-idle task in an rq
> 
> The mask of overloaded cpus is updated in periodic tick
> and the idle path at the LLC domain basis. This cpumask
> will also be used in SIS as a filter, improving idle cpu
> searching.
> 

As the overloaded mask may be updated on each idle, it could be a
significant source of cache misses between CPUs sharing the domain for
workloads that rapidly idle so there should be data on whether cache misses
are increased heavily. It also potentially delays the CPU reaching idle
but it may not be by much.

The filter may be out of date. It takes up to one tick to detect
overloaded and the filter to have a positive impact. As a CPU is not
guaranteed to enter idle if there is at least one CPU-bound task, it may
also be up to 1 tick before the mask is cleared. I'm not sure this is a
serious problem though as SIS would not pick the CPU with the CPU-bound
task anyway.

At minimum, the filter should be split out and considered first as it
is the most likely reason why a performance difference was measured. It
has some oddities like why nr_overloaded is really a boolean and as
it's under rq lock, it's not clear why it's atomic. The changelog
would ideally contain some comment on the impact to cache misses
if any and some sort of proof that SIS search depth is reduced which
https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
may be some help.

At that point, compare the idle task balancing on top to isolate how
much it improves things if any and identify why existing balancing is
insufficient. Split out the can_migrate_task change beforehand in case it
is the main source of difference as opposed to the new balancing mechanism.

-- 
Mel Gorman
SUSE Labs

Re: [RFC PATCH 0/5] introduce sched-idle balancing

Posted by Abel Wu 4 years, 4 months ago

Hi Mel,

On 2/25/22 12:47 AM, Mel Gorman Wrote:
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>> Current load balancing is mainly based on cpu capacity
>> and task util, which makes sense in the POV of overall
>> throughput. While there still might be some improvement
>> can be done by reducing number of overloaded cfs rqs if
>> sched-idle or idle rq exists.
>>
>> An CFS runqueue is considered overloaded when there are
>> more than one pullable non-idle tasks on it (since sched-
>> idle cpus are treated as idle cpus). And idle tasks are
>> counted towards rq->cfs.idle_h_nr_running, that is either
>> assigned SCHED_IDLE policy or placed under idle cgroups.
>>
> 
> It's not clear how your tests evaluated the balancing of SCHED_IDLE tasks
> versus the existing idle balancing and isolated that impact. I suspect
> the tests may primarily measured the effect of the SIS filter.

The sched-idle balancing doesn't really care about the idle tasks.
It tries to improve the non-idle tasks' performance by spreading
them out to make full use of cpu capacity.

I will do some individual tests to SIS and sched-idle balancer
each, and keep you informed.

> 
>> So in short, the goal of the sched-idle balancing is to
>> let the *non-idle tasks* make full use of cpu resources.
>> To achieve that, we mainly do two things:
>>
>>    - pull non-idle tasks for sched-idle or idle rqs
>>      from the overloaded ones, and
>>
>>    - prevent pulling the last non-idle task in an rq
>>
>> The mask of overloaded cpus is updated in periodic tick
>> and the idle path at the LLC domain basis. This cpumask
>> will also be used in SIS as a filter, improving idle cpu
>> searching.
>>
> 
> As the overloaded mask may be updated on each idle, it could be a
> significant source of cache misses between CPUs sharing the domain for
> workloads that rapidly idle so there should be data on whether cache misses
> are increased heavily. It also potentially delays the CPU reaching idle
> but it may not be by much.

Yes, that's why I cached overloaded status in rq->overloaded. So in
this case of short running tasks, when cpus rapidly/frequently go
idle, the cpumask/counter are not actually updated if the cached
status is already 0 (not overloaded).

> 
> The filter may be out of date. It takes up to one tick to detect
> overloaded and the filter to have a positive impact. As a CPU is not
> guaranteed to enter idle if there is at least one CPU-bound task, it may
> also be up to 1 tick before the mask is cleared. I'm not sure this is a
> serious problem though as SIS would not pick the CPU with the CPU-bound
> task anyway.

Yes, it can be out of date, but increasing the accuracy means more
frequent update which would introduce cache issues you mentioned
above. Rate limit the updating to tick at the LLC basis might be an
acceptable tradeoff I presume.

> 
> At minimum, the filter should be split out and considered first as it
> is the most likely reason why a performance difference was measured. It
> has some oddities like why nr_overloaded is really a boolean and as
> it's under rq lock, it's not clear why it's atomic. The changelog
> would ideally contain some comment on the impact to cache misses
> if any and some sort of proof that SIS search depth is reduced which
> https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
> may be some help.
> 
> At that point, compare the idle task balancing on top to isolate how
> much it improves things if any and identify why existing balancing is
> insufficient. Split out the can_migrate_task change beforehand in case it
> is the main source of difference as opposed to the new balancing mechanism.
> 

The nr_overloaded sits in shared domain structure thus shared in
LLC domain and needs to be atomic_t, while rq->overloaded is a
boolean updated under rq lock. Maybe the naming can cause some
confusion, please lighten me up if you have better idea :)

And yes, I agree it would be nice if test result on SIS search
depth can be shown, and I actually did the test, but the result
didn't show a reduction in depth due to sched-idle balancing
will also consume sched-idle/idle cpus. I will apply your patch
and make some further tests on that, thanks.

Best Regards,
	Abel

Re: [RFC PATCH 0/5] introduce sched-idle balancing

Posted by Mel Gorman 4 years, 4 months ago

On Fri, Feb 25, 2022 at 04:15:06PM +0800, Abel Wu wrote:
> > As the overloaded mask may be updated on each idle, it could be a
> > significant source of cache misses between CPUs sharing the domain for
> > workloads that rapidly idle so there should be data on whether cache misses
> > are increased heavily. It also potentially delays the CPU reaching idle
> > but it may not be by much.
> 
> Yes, that's why I cached overloaded status in rq->overloaded. So in
> this case of short running tasks, when cpus rapidly/frequently go
> idle, the cpumask/counter are not actually updated if the cached
> status is already 0 (not overloaded).
> 

Which is a good idea in some respects. It tries to limit the number of
updates and treats it as a boolean but it's probably prone to races.

> > The filter may be out of date. It takes up to one tick to detect
> > overloaded and the filter to have a positive impact. As a CPU is not
> > guaranteed to enter idle if there is at least one CPU-bound task, it may
> > also be up to 1 tick before the mask is cleared. I'm not sure this is a
> > serious problem though as SIS would not pick the CPU with the CPU-bound
> > task anyway.
> 
> Yes, it can be out of date, but increasing the accuracy means more
> frequent update which would introduce cache issues you mentioned
> above. Rate limit the updating to tick at the LLC basis might be an
> acceptable tradeoff I presume.
> 
> > 
> > At minimum, the filter should be split out and considered first as it
> > is the most likely reason why a performance difference was measured. It
> > has some oddities like why nr_overloaded is really a boolean and as
> > it's under rq lock, it's not clear why it's atomic. The changelog
> > would ideally contain some comment on the impact to cache misses
> > if any and some sort of proof that SIS search depth is reduced which
> > https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
> > may be some help.
> > 
> > At that point, compare the idle task balancing on top to isolate how
> > much it improves things if any and identify why existing balancing is
> > insufficient. Split out the can_migrate_task change beforehand in case it
> > is the main source of difference as opposed to the new balancing mechanism.
> > 
> 
> The nr_overloaded sits in shared domain structure thus shared in
> LLC domain and needs to be atomic_t, while rq->overloaded is a
> boolean updated under rq lock. Maybe the naming can cause some
> confusion, please lighten me up if you have better idea :)
> 

The naming doesn't help because it's not really "the number of overloaded
rq's". atomic_t would be slightly safer against parallel updates but
it's race prone. I didn't think about it deeply but I suspect that two
separate rq's could disagree on what the boolean value should be if one rq
is overloaded, the other is not and they are updating via the idle path at
the same time. This probably can happen because the locking is rq based
and the cpumask is shared. On the flip side, making it an accurate count
would result in more updates and incur cache misses as well as probably
needing a cpumask check instead of a nr_overloaded comparison to determine
if the rq is already accounted for so it costs more. You are very likely
trading accuracy versus cost of update.

Whichever choice you make, add comments on the pros/cons and describe
the limitation of either approach. e.g. if overloaded is effectively a
boolean, describe in a comment the limitations.

> And yes, I agree it would be nice if test result on SIS search
> depth can be shown, and I actually did the test, but the result
> didn't show a reduction in depth due to sched-idle balancing
> will also consume sched-idle/idle cpus. I will apply your patch
> and make some further tests on that, thanks.
> 

Just remember to use the patch to measure changes in SIS depth but
performance figures should not include the patch as the schedstat
overhead distorts results.

Also place the filter first and do any measurements of any change to
balancing versus the filter. I'm suggesting placing the filter first
because it's less controversial than a new balancer. Just be aware that
the filter alone is not a guarantee of merging as there have been a few
approaches to filtering and so far all of them had downsides on either cost
or accuracy. IIRC the only active approach to reducing search cost in SIS
is https://lore.kernel.org/all/20220207034013.599214-1-yu.c.chen@intel.com/
and it's likely to get a new version due to
https://lore.kernel.org/all/20220207135253.GF23216@worktop.programming.kicks-ass.net/.
It also updates sched_domain_shared but with a single boolean instead of
an atomic+cpumask.

-- 
Mel Gorman
SUSE Labs

Re: [RFC PATCH 0/5] introduce sched-idle balancing

Posted by Abel Wu 4 years, 4 months ago

Hi Mel, thanks a lot for your review!

On 2/25/22 6:16 PM, Mel Gorman Wrote:
> On Fri, Feb 25, 2022 at 04:15:06PM +0800, Abel Wu wrote:
>>> As the overloaded mask may be updated on each idle, it could be a
>>> significant source of cache misses between CPUs sharing the domain for
>>> workloads that rapidly idle so there should be data on whether cache misses
>>> are increased heavily. It also potentially delays the CPU reaching idle
>>> but it may not be by much.
>>
>> Yes, that's why I cached overloaded status in rq->overloaded. So in
>> this case of short running tasks, when cpus rapidly/frequently go
>> idle, the cpumask/counter are not actually updated if the cached
>> status is already 0 (not overloaded).
>>
> 
> Which is a good idea in some respects. It tries to limit the number of
> updates and treats it as a boolean but it's probably prone to races.
> 
>>> The filter may be out of date. It takes up to one tick to detect
>>> overloaded and the filter to have a positive impact. As a CPU is not
>>> guaranteed to enter idle if there is at least one CPU-bound task, it may
>>> also be up to 1 tick before the mask is cleared. I'm not sure this is a
>>> serious problem though as SIS would not pick the CPU with the CPU-bound
>>> task anyway.
>>
>> Yes, it can be out of date, but increasing the accuracy means more
>> frequent update which would introduce cache issues you mentioned
>> above. Rate limit the updating to tick at the LLC basis might be an
>> acceptable tradeoff I presume.
>>
>>>
>>> At minimum, the filter should be split out and considered first as it
>>> is the most likely reason why a performance difference was measured. It
>>> has some oddities like why nr_overloaded is really a boolean and as
>>> it's under rq lock, it's not clear why it's atomic. The changelog
>>> would ideally contain some comment on the impact to cache misses
>>> if any and some sort of proof that SIS search depth is reduced which
>>> https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
>>> may be some help.
>>>
>>> At that point, compare the idle task balancing on top to isolate how
>>> much it improves things if any and identify why existing balancing is
>>> insufficient. Split out the can_migrate_task change beforehand in case it
>>> is the main source of difference as opposed to the new balancing mechanism.
>>>
>>
>> The nr_overloaded sits in shared domain structure thus shared in
>> LLC domain and needs to be atomic_t, while rq->overloaded is a
>> boolean updated under rq lock. Maybe the naming can cause some
>> confusion, please lighten me up if you have better idea :)
>>
> 
> The naming doesn't help because it's not really "the number of overloaded
> rq's". atomic_t would be slightly safer against parallel updates but
> it's race prone. I didn't think about it deeply but I suspect that two
> separate rq's could disagree on what the boolean value should be if one rq
> is overloaded, the other is not and they are updating via the idle path at
> the same time. This probably can happen because the locking is rq based
> and the cpumask is shared. On the flip side, making it an accurate count
> would result in more updates and incur cache misses as well as probably
> needing a cpumask check instead of a nr_overloaded comparison to determine
> if the rq is already accounted for so it costs more. You are very likely
> trading accuracy versus cost of update.

The boolean value (rq->overloaded) is accessed under rq lock, and almost
accessed by its own rq except the very rare case in sched_idle_balance()
where a double check failed on cfs_rq_overloaded(). So this value should
be accurate and has good data locality.

But as you said, the nr_overloaded and cpu mask are race prone in the
following pattern in my patches:

	if (nr_overloaded > 0)
		/* nr_overloaded can be zero now */
		read(overloaded_mask);

since the mask is accessed without rq locked, the cost might not be too
much. This is quite similar with the idle_cpu() usage in SIS I guess.

> 
> Whichever choice you make, add comments on the pros/cons and describe
> the limitation of either approach. e.g. if overloaded is effectively a
> boolean, describe in a comment the limitations.

OK, will do.

> 
>> And yes, I agree it would be nice if test result on SIS search
>> depth can be shown, and I actually did the test, but the result
>> didn't show a reduction in depth due to sched-idle balancing
>> will also consume sched-idle/idle cpus. I will apply your patch
>> and make some further tests on that, thanks.
>>
> 
> Just remember to use the patch to measure changes in SIS depth but
> performance figures should not include the patch as the schedstat
> overhead distorts results.

Yes, agreed.

> 
> Also place the filter first and do any measurements of any change to
> balancing versus the filter. I'm suggesting placing the filter first
> because it's less controversial than a new balancer. Just be aware that
> the filter alone is not a guarantee of merging as there have been a few
> approaches to filtering and so far all of them had downsides on either cost

Yes, understood. I will adjust the patches as you suggested and send v2
together with more tests next week.

> or accuracy. IIRC the only active approach to reducing search cost in SIS
> is https://lore.kernel.org/all/20220207034013.599214-1-yu.c.chen@intel.com/
> and it's likely to get a new version due to
> https://lore.kernel.org/all/20220207135253.GF23216@worktop.programming.kicks-ass.net/.
> It also updates sched_domain_shared but with a single boolean instead of
> an atomic+cpumask.
> 

Chen Yu's patch disables idle cpu searching in SIS when the LLC domain
is overloaded (that is 85% capacity usage) and Peter suggested him use
this metric to replace/improve SIS_PROP feature to make search depth
varying gently.

I don't think either of the two approaches conflict with mine, as they
are to reduce the effort of searching when system is busy and cpus are
not likely to be idle, and mine is to consume sched-idle/idle cpus by
themselves by pulling non-idle tasks from overloaded rqs so there will
be fewer sched-idle/idle cpus.

Thanks and best regards,
	Abel

Re: [RFC PATCH 0/5] introduce sched-idle balancing

Posted by Josh Don 4 years, 3 months ago

On Fri, Feb 25, 2022 at 5:36 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
[snip]
> > Also place the filter first and do any measurements of any change to
> > balancing versus the filter. I'm suggesting placing the filter first
> > because it's less controversial than a new balancer. Just be aware that
> > the filter alone is not a guarantee of merging as there have been a few
> > approaches to filtering and so far all of them had downsides on either cost
>
> Yes, understood. I will adjust the patches as you suggested and send v2
> together with more tests next week.

+1 to trying the filter rather than introducing a new balance path.

We've found the sched_idle_cpu() checks in the wakeup path to be
adequate in allowing non-idle tasks to fully consume cpu resources
(but that of course relies on wakeup balancing, and not periodic
balancing).

Please cc me on the next series.

Thanks,
Josh

Re: [RFC PATCH 0/5] introduce sched-idle balancing

Posted by Abel Wu 4 years, 4 months ago

Ping :)

On 2/17/22 11:43 PM, Abel Wu Wrote:
> Current load balancing is mainly based on cpu capacity
> and task util, which makes sense in the POV of overall
> throughput. While there still might be some improvement
> can be done by reducing number of overloaded cfs rqs if
> sched-idle or idle rq exists.
> 
> An CFS runqueue is considered overloaded when there are
> more than one pullable non-idle tasks on it (since sched-
> idle cpus are treated as idle cpus). And idle tasks are
> counted towards rq->cfs.idle_h_nr_running, that is either
> assigned SCHED_IDLE policy or placed under idle cgroups.
> 
> The overloaded cfs rqs can cause performance issues to
> both task types:
> 
>    - for latency critical tasks like SCHED_NORMAL,
>      time of waiting in the rq will increase and
>      result in higher pct99 latency, and
> 
>    - batch tasks may not be able to make full use
>      of cpu capacity if sched-idle rq exists, thus
>      presents poorer throughput.
> 
> So in short, the goal of the sched-idle balancing is to
> let the *non-idle tasks* make full use of cpu resources.
> To achieve that, we mainly do two things:
> 
>    - pull non-idle tasks for sched-idle or idle rqs
>      from the overloaded ones, and
> 
>    - prevent pulling the last non-idle task in an rq
> 
> The mask of overloaded cpus is updated in periodic tick
> and the idle path at the LLC domain basis. This cpumask
> will also be used in SIS as a filter, improving idle cpu
> searching.
> 
> Tests are done in an Intel Xeon E5-2650 v4 server with
> 2 NUMA nodes each of which has 12 cores, and with SMT2
> enabled, so 48 CPUs in total. Test results are listed
> as follows.
> 
>    - we used perf messaging test to test throughput
>      at different load (groups).
> 
>        perf bench sched messaging -g [N] -l 40000
> 
> 	N	w/o	w/	diff
> 	1	2.897	2.834	-2.17%
> 	3	5.156	4.904	-4.89%
> 	5	7.850	7.617	-2.97%
> 	10	15.140	14.574	-3.74%
> 	20	29.387	27.602	-6.07%
> 
>      the result shows approximate 2~6% improvement.
> 
>    - and schbench to test latency performance in two
>      scenarios: quiet and noisy. In quiet test, we
>      run schbench in a normal cpu cgroup in a quiet
>      system, while the noisy test additionally runs
>      perf messaging workload inside an idle cgroup
>      as nosie.
> 
>        schbench -m 2 -t 24 -i 60 -r 60
>        perf bench sched messaging -g 1 -l 4000000
> 
> 	[quiet]
> 			w/o	w/
> 	50.0th		31	31
> 	75.0th		45	45
> 	90.0th		55	55
> 	95.0th		62	61
> 	*99.0th		85	86
> 	99.5th		565	318
> 	99.9th		11536	10992
> 	max		13029	13067
> 
> 	[nosiy]
> 			w/o	w/
> 	50.0th		34	32
> 	75.0th		48	45
> 	90.0th		58	55
> 	95.0th		65	61
> 	*99.0th		2364	208
> 	99.5th		6696	2068
> 	99.9th		12688	8816
> 	max		15209	14191
> 
>      it can be seen that the quiet test results are
>      quite similar, but the p99 latency is greatly
>      improved in the nosiy test.
> 
> Comments and tests are appreciated!
> 
> Abel Wu (5):
>    sched/fair: record overloaded cpus
>    sched/fair: introduce sched-idle balance
>    sched/fair: add stats for sched-idle balancing
>    sched/fair: filter out overloaded cpus in sis
>    sched/fair: favor cpu capacity for idle tasks
> 
>   include/linux/sched/idle.h     |   1 +
>   include/linux/sched/topology.h |  15 ++++
>   kernel/sched/core.c            |   1 +
>   kernel/sched/fair.c            | 187 ++++++++++++++++++++++++++++++++++++++++-
>   kernel/sched/sched.h           |   6 ++
>   kernel/sched/stats.c           |   5 +-
>   kernel/sched/topology.c        |   4 +-
>   7 files changed, 215 insertions(+), 4 deletions(-)
>

Re: [RFC PATCH 0/5] introduce sched-idle balancing

Posted by Peter Zijlstra 4 years, 4 months ago

On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> Current load balancing is mainly based on cpu capacity
> and task util, which makes sense in the POV of overall
> throughput. While there still might be some improvement
> can be done by reducing number of overloaded cfs rqs if
> sched-idle or idle rq exists.

I'm much confused, there is an explicit new-idle balancer and a periodic
idle balancer already there.

Re: [RFC PATCH 0/5] introduce sched-idle balancing

Posted by Vincent Guittot 4 years, 4 months ago

On Thu, 24 Feb 2022 at 16:20, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> > Current load balancing is mainly based on cpu capacity
> > and task util, which makes sense in the POV of overall
> > throughput. While there still might be some improvement
> > can be done by reducing number of overloaded cfs rqs if
> > sched-idle or idle rq exists.
>
> I'm much confused, there is an explicit new-idle balancer and a periodic
> idle balancer already there.

I agree, You failed to explain why newly_idle and periodic idle load
balance are not enough and we need this new one

Re: [RFC PATCH 0/5] introduce sched-idle balancing

Posted by Abel Wu 4 years, 4 months ago

On 2/24/22 11:29 PM, Vincent Guittot Wrote:
> On Thu, 24 Feb 2022 at 16:20, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>>> Current load balancing is mainly based on cpu capacity
>>> and task util, which makes sense in the POV of overall
>>> throughput. While there still might be some improvement
>>> can be done by reducing number of overloaded cfs rqs if
>>> sched-idle or idle rq exists.
>>
>> I'm much confused, there is an explicit new-idle balancer and a periodic
>> idle balancer already there.
> 
> I agree, You failed to explain why newly_idle and periodic idle load
> balance are not enough and we need this new one

Hi Vincent, sorry for not giving a clearer explanation. Please check
my previous email replying to Peter, thanks.

Best Regards,
	Abel

Re: [RFC PATCH 0/5] introduce sched-idle balancing

Posted by Abel Wu 4 years, 4 months ago

Hi Peter,

On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>> Current load balancing is mainly based on cpu capacity
>> and task util, which makes sense in the POV of overall
>> throughput. While there still might be some improvement
>> can be done by reducing number of overloaded cfs rqs if
>> sched-idle or idle rq exists.
> 
> I'm much confused, there is an explicit new-idle balancer and a periodic
> idle balancer already there.

The two balancers are triggered on the rqs that have no tasks on them,
and load_balance() seems don't show a preference for non-idle tasks so
there might be possibility that only idle tasks are pulled during load
balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a
result the normal tasks, mostly latency-critical ones in our case, on
that overloaded rq still suffer waiting for each other. I observed this
through perf sched.

IOW the main difference from the POV of load_balance() between the
latency-critical tasks and the idle ones is load.

The sched-idle balancer is triggered on the sched-idle rqs periodically
and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
make full use of cpu resources.

The sched-idle balancer only focuses on non-idle tasks' performance, so
it can introduce overall load imbalance, and that's why I put it before
load_balance().

Best Regards,
	Abel

Re: [RFC PATCH 0/5] introduce sched-idle balancing

Posted by Vincent Guittot 4 years, 4 months ago

On Fri, 25 Feb 2022 at 07:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> Hi Peter,
>
> On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
> > On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> >> Current load balancing is mainly based on cpu capacity
> >> and task util, which makes sense in the POV of overall
> >> throughput. While there still might be some improvement
> >> can be done by reducing number of overloaded cfs rqs if
> >> sched-idle or idle rq exists.
> >
> > I'm much confused, there is an explicit new-idle balancer and a periodic
> > idle balancer already there.
>
> The two balancers are triggered on the rqs that have no tasks on them,
> and load_balance() seems don't show a preference for non-idle tasks so

The load balance will happen at the idle pace if a sched_idle task is
running on the cpu so you will have an ILB on each cpu that run a
sched-idle task

> there might be possibility that only idle tasks are pulled during load
> balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a

There is a LB_MIN feature (disable by default) that filters task with
very low load ( < 16) which includes sched-idle task which has a max
load of 3

> result the normal tasks, mostly latency-critical ones in our case, on
> that overloaded rq still suffer waiting for each other. I observed this
> through perf sched.
>
> IOW the main difference from the POV of load_balance() between the
> latency-critical tasks and the idle ones is load.
>
> The sched-idle balancer is triggered on the sched-idle rqs periodically
> and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
> the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
> make full use of cpu resources.
>
> The sched-idle balancer only focuses on non-idle tasks' performance, so
> it can introduce overall load imbalance, and that's why I put it before
> load_balance().

According to the very low weight of a sched-idle task, I don't expect
much imbalance because of sched-idle tasks. But this also depends of
the number of sched-idle task.


>
> Best Regards,
>         Abel

Re: [RFC PATCH 0/5] introduce sched-idle balancing

Posted by Abel Wu 4 years, 4 months ago

On 2/25/22 4:29 PM, Vincent Guittot Wrote:
> On Fri, 25 Feb 2022 at 07:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
>>
>> Hi Peter,
>>
>> On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
>>> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>>>> Current load balancing is mainly based on cpu capacity
>>>> and task util, which makes sense in the POV of overall
>>>> throughput. While there still might be some improvement
>>>> can be done by reducing number of overloaded cfs rqs if
>>>> sched-idle or idle rq exists.
>>>
>>> I'm much confused, there is an explicit new-idle balancer and a periodic
>>> idle balancer already there.
>>
>> The two balancers are triggered on the rqs that have no tasks on them,
>> and load_balance() seems don't show a preference for non-idle tasks so
> 
> The load balance will happen at the idle pace if a sched_idle task is
> running on the cpu so you will have an ILB on each cpu that run a
> sched-idle task

I'm afraid I don't quite follow you, since sched-idle balancer doesn't
touch the ILB part, can you elaborate on this? Thanks.

> 
>> there might be possibility that only idle tasks are pulled during load
>> balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a
> 
> There is a LB_MIN feature (disable by default) that filters task with
> very low load ( < 16) which includes sched-idle task which has a max
> load of 3

This feature might not that friendly to the situation that only
sched-idle tasks are running in the system. And this situation
can last more than half a day in our co-location systems in which
the training/batch tasks are placed under idle groups or directly
assigned to SCHED_IDLE.

> 
>> result the normal tasks, mostly latency-critical ones in our case, on
>> that overloaded rq still suffer waiting for each other. I observed this
>> through perf sched.
>>
>> IOW the main difference from the POV of load_balance() between the
>> latency-critical tasks and the idle ones is load.
>>
>> The sched-idle balancer is triggered on the sched-idle rqs periodically
>> and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
>> the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
>> make full use of cpu resources.
>>
>> The sched-idle balancer only focuses on non-idle tasks' performance, so
>> it can introduce overall load imbalance, and that's why I put it before
>> load_balance().
> 
> According to the very low weight of a sched-idle task, I don't expect
> much imbalance because of sched-idle tasks. But this also depends of
> the number of sched-idle task.
> 
> 
>>
>> Best Regards,
>>          Abel

Re: [RFC PATCH 0/5] introduce sched-idle balancing

Posted by Vincent Guittot 4 years, 4 months ago

On Fri, 25 Feb 2022 at 11:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> On 2/25/22 4:29 PM, Vincent Guittot Wrote:
> > On Fri, 25 Feb 2022 at 07:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
> >>
> >> Hi Peter,
> >>
> >> On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
> >>> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> >>>> Current load balancing is mainly based on cpu capacity
> >>>> and task util, which makes sense in the POV of overall
> >>>> throughput. While there still might be some improvement
> >>>> can be done by reducing number of overloaded cfs rqs if
> >>>> sched-idle or idle rq exists.
> >>>
> >>> I'm much confused, there is an explicit new-idle balancer and a periodic
> >>> idle balancer already there.
> >>
> >> The two balancers are triggered on the rqs that have no tasks on them,
> >> and load_balance() seems don't show a preference for non-idle tasks so
> >
> > The load balance will happen at the idle pace if a sched_idle task is
> > running on the cpu so you will have an ILB on each cpu that run a
> > sched-idle task
>
> I'm afraid I don't quite follow you, since sched-idle balancer doesn't
> touch the ILB part, can you elaborate on this? Thanks.

I was referring to your sentence " The two balancers are triggered on
the rqs that have no tasks on them". When there is only sched-idle
tasks on a rq, the load_balance behave like the Idle Load Balance when
there is no task i.e. as often

>
> >
> >> there might be possibility that only idle tasks are pulled during load
> >> balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a
> >
> > There is a LB_MIN feature (disable by default) that filters task with
> > very low load ( < 16) which includes sched-idle task which has a max
> > load of 3

but we could easily change this like if !sched_idle_cpus then LB can
migrate only cfs tasks otherwise can migrate sched_idle task as well.
Instead of creating another side channel

>
> This feature might not that friendly to the situation that only
> sched-idle tasks are running in the system. And this situation
> can last more than half a day in our co-location systems in which
> the training/batch tasks are placed under idle groups or directly
> assigned to SCHED_IDLE.
>
> >
> >> result the normal tasks, mostly latency-critical ones in our case, on
> >> that overloaded rq still suffer waiting for each other. I observed this
> >> through perf sched.
> >>
> >> IOW the main difference from the POV of load_balance() between the
> >> latency-critical tasks and the idle ones is load.
> >>
> >> The sched-idle balancer is triggered on the sched-idle rqs periodically
> >> and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
> >> the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
> >> make full use of cpu resources.
> >>
> >> The sched-idle balancer only focuses on non-idle tasks' performance, so
> >> it can introduce overall load imbalance, and that's why I put it before
> >> load_balance().
> >
> > According to the very low weight of a sched-idle task, I don't expect
> > much imbalance because of sched-idle tasks. But this also depends of
> > the number of sched-idle task.
> >
> >
> >>
> >> Best Regards,
> >>          Abel