Current load balancing is mainly based on cpu capacity
and task util, which makes sense from the POV of overall
throughput. Still, there may be room for improvement by
reducing the number of overloaded cfs rqs when sched-idle
or idle rqs exist.
A CFS runqueue is considered overloaded when there is
more than one pullable non-idle task on it (since
sched-idle cpus are treated as idle cpus). Idle tasks are
those either assigned the SCHED_IDLE policy or placed under
idle cgroups, and they are counted in
rq->cfs.idle_h_nr_running.
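The overloaded criterion can be modelled in a few lines of user-space C. This is a minimal sketch under stated assumptions, not code from the series: the struct cfs_rq_model and both helpers are hypothetical stand-ins, only the two counter names mirror the kernel fields, and the "pullable" qualifier (affinity and similar restrictions) is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the relevant cfs_rq counters. */
struct cfs_rq_model {
	unsigned int h_nr_running;      /* all runnable CFS tasks in the hierarchy */
	unsigned int idle_h_nr_running; /* SCHED_IDLE tasks and tasks in idle cgroups */
};

/* Number of runnable tasks that are not idle tasks. */
unsigned int nr_non_idle_tasks(const struct cfs_rq_model *cfs)
{
	assert(cfs->idle_h_nr_running <= cfs->h_nr_running);
	return cfs->h_nr_running - cfs->idle_h_nr_running;
}

/* Overloaded: more than one non-idle task competes for this CPU. */
bool cfs_rq_overloaded_model(const struct cfs_rq_model *cfs)
{
	return nr_non_idle_tasks(cfs) > 1;
}
```

With these definitions, an rq holding two normal tasks and one idle task is overloaded, while an rq holding one normal task and any number of idle tasks is not.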
Overloaded cfs rqs can cause performance issues for
both task types:
- for latency-critical tasks like SCHED_NORMAL,
time spent waiting in the rq increases and
results in higher p99 latency, and
- batch tasks may not be able to make full use
of cpu capacity if sched-idle rqs exist, and
thus show poorer throughput.
So in short, the goal of the sched-idle balancing is to
let the *non-idle tasks* make full use of cpu resources.
To achieve that, we mainly do two things:
- pull non-idle tasks from the overloaded rqs
onto sched-idle or idle rqs, and
- prevent pulling the last non-idle task off an rq
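The two rules above can be written as a pair of predicates. This is only a sketch: struct rq_model, should_pull_from() and may_migrate() are hypothetical names made up for illustration, and letting idle tasks migrate freely is an assumption of the sketch rather than something stated by the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one runqueue. */
struct rq_model {
	unsigned int nr_running; /* all runnable tasks */
	unsigned int nr_idle;    /* runnable idle tasks among them */
};

unsigned int rq_nr_non_idle(const struct rq_model *rq)
{
	assert(rq->nr_idle <= rq->nr_running);
	return rq->nr_running - rq->nr_idle;
}

/*
 * Rule 1: a destination without non-idle tasks (idle or sched-idle)
 * pulls from a source with more than one non-idle task.
 */
bool should_pull_from(const struct rq_model *dst, const struct rq_model *src)
{
	return rq_nr_non_idle(dst) == 0 && rq_nr_non_idle(src) > 1;
}

/* Rule 2: never take the last non-idle task away from its rq. */
bool may_migrate(const struct rq_model *src, bool task_is_idle)
{
	if (task_is_idle)
		return true; /* assumption: idle tasks follow the regular rules */
	return rq_nr_non_idle(src) > 1;
}
```

Rule 2 keeps a pull from merely moving the problem: taking the only non-idle task off an rq would leave that rq sched-idle or idle in turn.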
The mask of overloaded cpus is updated in the periodic
tick and in the idle path, on a per-LLC-domain basis. This
cpumask is also used in SIS as a filter to improve idle
cpu searching.
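The intended effect of the SIS filter can be illustrated with a toy scan over a 64-bit mask. This is a sketch of the idea only: cpumask_model_t, scan_for_idle_cpu() and filter_overloaded() are hypothetical, the real select_idle_sibling() path has further limits and heuristics that are not modelled, and whether the depth actually drops in practice has to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_CPUS 64

/* One bit per CPU of the LLC domain; 64 CPUs are enough for the sketch. */
typedef uint64_t cpumask_model_t;

/*
 * Walk the candidate CPUs in ascending order and return the first idle
 * one, or -1. *probes reports how many candidates were examined.
 */
int scan_for_idle_cpu(cpumask_model_t candidates, cpumask_model_t idle,
		      int *probes)
{
	assert(probes != NULL);
	*probes = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		cpumask_model_t bit = (cpumask_model_t)1 << cpu;

		if (!(candidates & bit))
			continue;
		(*probes)++;
		if (idle & bit)
			return cpu;
	}
	return -1;
}

/* Drop the CPUs that are known to be overloaded from the candidate set. */
cpumask_model_t filter_overloaded(cpumask_model_t candidates,
				  cpumask_model_t overloaded)
{
	return candidates & ~overloaded;
}
```

For example, with CPUs 0-7 as candidates, CPUs 0-3 overloaded and only CPU 5 idle, the unfiltered scan examines six CPUs before finding CPU 5, while the filtered scan examines two.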
Tests were done on an Intel Xeon E5-2650 v4 server with
2 NUMA nodes, each of which has 12 cores, and with SMT2
enabled, so 48 CPUs in total. Test results are listed
as follows.
- we used the perf messaging test to measure throughput
at different loads (groups).
perf bench sched messaging -g [N] -l 40000
  N     w/o       w/        diff
  1     2.897     2.834     -2.17%
  3     5.156     4.904     -4.89%
  5     7.850     7.617     -2.97%
 10    15.140    14.574     -3.74%
 20    29.387    27.602     -6.07%
The results show approximately 2~6% improvement.
- and schbench to test latency in two scenarios:
quiet and noisy. In the quiet test, we run
schbench in a normal cpu cgroup on a quiet
system, while the noisy test additionally runs
the perf messaging workload inside an idle
cgroup as noise.
schbench -m 2 -t 24 -i 60 -r 60
perf bench sched messaging -g 1 -l 4000000
[quiet]
            w/o       w/
 50.0th      31        31
 75.0th      45        45
 90.0th      55        55
 95.0th      62        61
*99.0th      85        86
 99.5th     565       318
 99.9th   11536     10992
 max      13029     13067
[noisy]
            w/o       w/
 50.0th      34        32
 75.0th      48        45
 90.0th      58        55
 95.0th      65        61
*99.0th    2364       208
 99.5th    6696      2068
 99.9th   12688      8816
 max      15209     14191
It can be seen that the quiet test results are
quite similar, but the p99 latency is greatly
improved in the noisy test.
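For reference, the diff column of the perf messaging table is consistent with a plain relative change of the "w/" figure against the "w/o" figure. The helper below is a hypothetical sketch of that arithmetic, assuming the figures are the run times reported by the benchmark, so a negative value means the patched kernel finished faster.

```c
#include <assert.h>

/* Relative change of the patched result against the baseline, in percent. */
double pct_diff(double without_patch, double with_patch)
{
	assert(without_patch > 0.0);
	return (with_patch - without_patch) / without_patch * 100.0;
}
```

For the first row, pct_diff(2.897, 2.834) evaluates to roughly -2.17, matching the table.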
Comments and tests are appreciated!
Abel Wu (5):
sched/fair: record overloaded cpus
sched/fair: introduce sched-idle balance
sched/fair: add stats for sched-idle balancing
sched/fair: filter out overloaded cpus in sis
sched/fair: favor cpu capacity for idle tasks
include/linux/sched/idle.h | 1 +
include/linux/sched/topology.h | 15 ++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 187 ++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 6 ++
kernel/sched/stats.c | 5 +-
kernel/sched/topology.c | 4 +-
7 files changed, 215 insertions(+), 4 deletions(-)
--
2.11.0
On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> Current load balancing is mainly based on cpu capacity
> and task util, which makes sense in the POV of overall
> throughput. While there still might be some improvement
> can be done by reducing number of overloaded cfs rqs if
> sched-idle or idle rq exists.
>
> An CFS runqueue is considered overloaded when there are
> more than one pullable non-idle tasks on it (since sched-
> idle cpus are treated as idle cpus). And idle tasks are
> counted towards rq->cfs.idle_h_nr_running, that is either
> assigned SCHED_IDLE policy or placed under idle cgroups.
>

It's not clear how your tests evaluated the balancing of SCHED_IDLE tasks
versus the existing idle balancing and isolated that impact. I suspect
the tests may primarily measured the effect of the SIS filter.

> So in short, the goal of the sched-idle balancing is to
> let the *non-idle tasks* make full use of cpu resources.
> To achieve that, we mainly do two things:
>
> - pull non-idle tasks for sched-idle or idle rqs
> from the overloaded ones, and
>
> - prevent pulling the last non-idle task in an rq
>
> The mask of overloaded cpus is updated in periodic tick
> and the idle path at the LLC domain basis. This cpumask
> will also be used in SIS as a filter, improving idle cpu
> searching.
>

As the overloaded mask may be updated on each idle, it could be a
significant source of cache misses between CPUs sharing the domain for
workloads that rapidly idle so there should be data on whether cache misses
are increased heavily. It also potentially delays the CPU reaching idle
but it may not be by much.

The filter may be out of date. It takes up to one tick to detect
overloaded and the filter to have a positive impact. As a CPU is not
guaranteed to enter idle if there is at least one CPU-bound task, it may
also be up to 1 tick before the mask is cleared. I'm not sure this is a
serious problem though as SIS would not pick the CPU with the CPU-bound
task anyway.

At minimum, the filter should be split out and considered first as it
is the most likely reason why a performance difference was measured. It
has some oddities like why nr_overloaded is really a boolean and as
it's under rq lock, it's not clear why it's atomic. The changelog
would ideally contain some comment on the impact to cache misses
if any and some sort of proof that SIS search depth is reduced which
https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
may be some help.

At that point, compare the idle task balancing on top to isolate how
much it improves things if any and identify why existing balancing is
insufficient. Split out the can_migrate_task change beforehand in case it
is the main source of difference as opposed to the new balancing mechanism.

--
Mel Gorman
SUSE Labs
Hi Mel,

On 2/25/22 12:47 AM, Mel Gorman Wrote:
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>> Current load balancing is mainly based on cpu capacity
>> and task util, which makes sense in the POV of overall
>> throughput. While there still might be some improvement
>> can be done by reducing number of overloaded cfs rqs if
>> sched-idle or idle rq exists.
>>
>> An CFS runqueue is considered overloaded when there are
>> more than one pullable non-idle tasks on it (since sched-
>> idle cpus are treated as idle cpus). And idle tasks are
>> counted towards rq->cfs.idle_h_nr_running, that is either
>> assigned SCHED_IDLE policy or placed under idle cgroups.
>>
>
> It's not clear how your tests evaluated the balancing of SCHED_IDLE tasks
> versus the existing idle balancing and isolated that impact. I suspect
> the tests may primarily measured the effect of the SIS filter.

The sched-idle balancing doesn't really care about the idle tasks.
It tries to improve the non-idle tasks' performance by spreading
them out to make full use of cpu capacity.

I will do some individual tests to SIS and sched-idle balancer
each, and keep you informed.

>
>> So in short, the goal of the sched-idle balancing is to
>> let the *non-idle tasks* make full use of cpu resources.
>> To achieve that, we mainly do two things:
>>
>> - pull non-idle tasks for sched-idle or idle rqs
>> from the overloaded ones, and
>>
>> - prevent pulling the last non-idle task in an rq
>>
>> The mask of overloaded cpus is updated in periodic tick
>> and the idle path at the LLC domain basis. This cpumask
>> will also be used in SIS as a filter, improving idle cpu
>> searching.
>>
>
> As the overloaded mask may be updated on each idle, it could be a
> significant source of cache misses between CPUs sharing the domain for
> workloads that rapidly idle so there should be data on whether cache misses
> are increased heavily. It also potentially delays the CPU reaching idle
> but it may not be by much.

Yes, that's why I cached overloaded status in rq->overloaded. So in
this case of short running tasks, when cpus rapidly/frequently go
idle, the cpumask/counter are not actually updated if the cached
status is already 0 (not overloaded).

>
> The filter may be out of date. It takes up to one tick to detect
> overloaded and the filter to have a positive impact. As a CPU is not
> guaranteed to enter idle if there is at least one CPU-bound task, it may
> also be up to 1 tick before the mask is cleared. I'm not sure this is a
> serious problem though as SIS would not pick the CPU with the CPU-bound
> task anyway.

Yes, it can be out of date, but increasing the accuracy means more
frequent update which would introduce cache issues you mentioned
above. Rate limit the updating to tick at the LLC basis might be an
acceptable tradeoff I presume.

>
> At minimum, the filter should be split out and considered first as it
> is the most likely reason why a performance difference was measured. It
> has some oddities like why nr_overloaded is really a boolean and as
> it's under rq lock, it's not clear why it's atomic. The changelog
> would ideally contain some comment on the impact to cache misses
> if any and some sort of proof that SIS search depth is reduced which
> https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
> may be some help.
>
> At that point, compare the idle task balancing on top to isolate how
> much it improves things if any and identify why existing balancing is
> insufficient. Split out the can_migrate_task change beforehand in case it
> is the main source of difference as opposed to the new balancing mechanism.
>

The nr_overloaded sits in shared domain structure thus shared in
LLC domain and needs to be atomic_t, while rq->overloaded is a
boolean updated under rq lock. Maybe the naming can cause some
confusion, please lighten me up if you have better idea :)

And yes, I agree it would be nice if test result on SIS search
depth can be shown, and I actually did the test, but the result
didn't show a reduction in depth due to sched-idle balancing
will also consume sched-idle/idle cpus. I will apply your patch
and make some further tests on that, thanks.

Best Regards,
Abel
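The caching scheme described in this reply (a per-rq boolean updated under the rq lock, in front of a counter and mask shared across the LLC) could look roughly like the sketch below. It is a user-space model using C11 atomics, not the code from the series: struct llc_shared_model, struct rq_model, llc_init() and update_overloaded() are hypothetical, and skipping the shared update in both directions whenever the cached status is unchanged is an assumption of the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-LLC shared state: a counter plus one bit per CPU. */
struct llc_shared_model {
	atomic_int nr_overloaded;
	_Atomic uint64_t overloaded_mask;
};

/* Hypothetical per-CPU runqueue; 'overloaded' is the cached status. */
struct rq_model {
	int cpu;
	bool overloaded;
	struct llc_shared_model *llc;
};

void llc_init(struct llc_shared_model *llc)
{
	atomic_init(&llc->nr_overloaded, 0);
	atomic_init(&llc->overloaded_mask, 0);
}

/*
 * Called with the (imaginary) rq lock held. Returns true only when the
 * shared cache lines were written, i.e. when the status really changed.
 */
bool update_overloaded(struct rq_model *rq, bool status)
{
	assert(rq->cpu >= 0 && rq->cpu < 64);
	uint64_t bit = (uint64_t)1 << rq->cpu;

	if (rq->overloaded == status)
		return false; /* fast path: no shared write at all */

	if (status) {
		atomic_fetch_or(&rq->llc->overloaded_mask, bit);
		atomic_fetch_add(&rq->llc->nr_overloaded, 1);
	} else {
		atomic_fetch_sub(&rq->llc->nr_overloaded, 1);
		atomic_fetch_and(&rq->llc->overloaded_mask, ~bit);
	}
	rq->overloaded = status;
	return true;
}
```

In this model, a CPU that goes idle many times in a row while already marked as not overloaded never touches the shared counter or mask.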
On Fri, Feb 25, 2022 at 04:15:06PM +0800, Abel Wu wrote:
> > As the overloaded mask may be updated on each idle, it could be a
> > significant source of cache misses between CPUs sharing the domain for
> > workloads that rapidly idle so there should be data on whether cache misses
> > are increased heavily. It also potentially delays the CPU reaching idle
> > but it may not be by much.
>
> Yes, that's why I cached overloaded status in rq->overloaded. So in
> this case of short running tasks, when cpus rapidly/frequently go
> idle, the cpumask/counter are not actually updated if the cached
> status is already 0 (not overloaded).
>

Which is a good idea in some respects. It tries to limit the number of
updates and treats it as a boolean but it's probably prone to races.

> > The filter may be out of date. It takes up to one tick to detect
> > overloaded and the filter to have a positive impact. As a CPU is not
> > guaranteed to enter idle if there is at least one CPU-bound task, it may
> > also be up to 1 tick before the mask is cleared. I'm not sure this is a
> > serious problem though as SIS would not pick the CPU with the CPU-bound
> > task anyway.
>
> Yes, it can be out of date, but increasing the accuracy means more
> frequent update which would introduce cache issues you mentioned
> above. Rate limit the updating to tick at the LLC basis might be an
> acceptable tradeoff I presume.
>
> >
> > At minimum, the filter should be split out and considered first as it
> > is the most likely reason why a performance difference was measured. It
> > has some oddities like why nr_overloaded is really a boolean and as
> > it's under rq lock, it's not clear why it's atomic. The changelog
> > would ideally contain some comment on the impact to cache misses
> > if any and some sort of proof that SIS search depth is reduced which
> > https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
> > may be some help.
> >
> > At that point, compare the idle task balancing on top to isolate how
> > much it improves things if any and identify why existing balancing is
> > insufficient. Split out the can_migrate_task change beforehand in case it
> > is the main source of difference as opposed to the new balancing mechanism.
> >
>
> The nr_overloaded sits in shared domain structure thus shared in
> LLC domain and needs to be atomic_t, while rq->overloaded is a
> boolean updated under rq lock. Maybe the naming can cause some
> confusion, please lighten me up if you have better idea :)
>

The naming doesn't help because it's not really "the number of overloaded
rq's". atomic_t would be slightly safer against parallel updates but
it's race prone. I didn't think about it deeply but I suspect that two
separate rq's could disagree on what the boolean value should be if one rq
is overloaded, the other is not and they are updating via the idle path at
the same time. This probably can happen because the locking is rq based
and the cpumask is shared. On the flip side, making it an accurate count
would result in more updates and incur cache misses as well as probably
needing a cpumask check instead of a nr_overloaded comparison to determine
if the rq is already accounted for so it costs more. You are very likely
trading accuracy versus cost of update.

Whichever choice you make, add comments on the pros/cons and describe
the limitation of either approach. e.g. if overloaded is effectively a
boolean, describe in a comment the limitations.

> And yes, I agree it would be nice if test result on SIS search
> depth can be shown, and I actually did the test, but the result
> didn't show a reduction in depth due to sched-idle balancing
> will also consume sched-idle/idle cpus. I will apply your patch
> and make some further tests on that, thanks.
>

Just remember to use the patch to measure changes in SIS depth but
performance figures should not include the patch as the schedstat
overhead distorts results.

Also place the filter first and do any measurements of any change to
balancing versus the filter. I'm suggesting placing the filter first
because it's less controversial than a new balancer. Just be aware that
the filter alone is not a guarantee of merging as there have been a few
approaches to filtering and so far all of them had downsides on either cost
or accuracy. IIRC the only active approach to reducing search cost in SIS
is https://lore.kernel.org/all/20220207034013.599214-1-yu.c.chen@intel.com/
and it's likely to get a new version due to
https://lore.kernel.org/all/20220207135253.GF23216@worktop.programming.kicks-ass.net/.
It also updates sched_domain_shared but with a single boolean instead of
an atomic+cpumask.

--
Mel Gorman
SUSE Labs
Hi Mel, thanks a lot for your review!

On 2/25/22 6:16 PM, Mel Gorman Wrote:
> On Fri, Feb 25, 2022 at 04:15:06PM +0800, Abel Wu wrote:
>>> As the overloaded mask may be updated on each idle, it could be a
>>> significant source of cache misses between CPUs sharing the domain for
>>> workloads that rapidly idle so there should be data on whether cache misses
>>> are increased heavily. It also potentially delays the CPU reaching idle
>>> but it may not be by much.
>>
>> Yes, that's why I cached overloaded status in rq->overloaded. So in
>> this case of short running tasks, when cpus rapidly/frequently go
>> idle, the cpumask/counter are not actually updated if the cached
>> status is already 0 (not overloaded).
>>
>
> Which is a good idea in some respects. It tries to limit the number of
> updates and treats it as a boolean but it's probably prone to races.
>
>>> The filter may be out of date. It takes up to one tick to detect
>>> overloaded and the filter to have a positive impact. As a CPU is not
>>> guaranteed to enter idle if there is at least one CPU-bound task, it may
>>> also be up to 1 tick before the mask is cleared. I'm not sure this is a
>>> serious problem though as SIS would not pick the CPU with the CPU-bound
>>> task anyway.
>>
>> Yes, it can be out of date, but increasing the accuracy means more
>> frequent update which would introduce cache issues you mentioned
>> above. Rate limit the updating to tick at the LLC basis might be an
>> acceptable tradeoff I presume.
>>
>>>
>>> At minimum, the filter should be split out and considered first as it
>>> is the most likely reason why a performance difference was measured. It
>>> has some oddities like why nr_overloaded is really a boolean and as
>>> it's under rq lock, it's not clear why it's atomic. The changelog
>>> would ideally contain some comment on the impact to cache misses
>>> if any and some sort of proof that SIS search depth is reduced which
>>> https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
>>> may be some help.
>>>
>>> At that point, compare the idle task balancing on top to isolate how
>>> much it improves things if any and identify why existing balancing is
>>> insufficient. Split out the can_migrate_task change beforehand in case it
>>> is the main source of difference as opposed to the new balancing mechanism.
>>>
>>
>> The nr_overloaded sits in shared domain structure thus shared in
>> LLC domain and needs to be atomic_t, while rq->overloaded is a
>> boolean updated under rq lock. Maybe the naming can cause some
>> confusion, please lighten me up if you have better idea :)
>>
>
> The naming doesn't help because it's not really "the number of overloaded
> rq's". atomic_t would be slightly safer against parallel updates but
> it's race prone. I didn't think about it deeply but I suspect that two
> separate rq's could disagree on what the boolean value should be if one rq
> is overloaded, the other is not and they are updating via the idle path at
> the same time. This probably can happen because the locking is rq based
> and the cpumask is shared. On the flip side, making it an accurate count
> would result in more updates and incur cache misses as well as probably
> needing a cpumask check instead of a nr_overloaded comparison to determine
> if the rq is already accounted for so it costs more. You are very likely
> trading accuracy versus cost of update.

The boolean value (rq->overloaded) is accessed under rq lock, and almost
accessed by its own rq except the very rare case in sched_idle_balance()
where a double check failed on cfs_rq_overloaded(). So this value should
be accurate and has good data locality.

But as you said, the nr_overloaded and cpu mask are race prone in the
following pattern in my patches:

	if (nr_overloaded > 0)
		/* nr_overloaded can be zero now */
		read(overloaded_mask);

since the mask is accessed without rq locked, the cost might not be too
much. This is quite similar with the idle_cpu() usage in SIS I guess.

>
> Whichever choice you make, add comments on the pros/cons and describe
> the limitation of either approach. e.g. if overloaded is effectively a
> boolean, describe in a comment the limitations.

OK, will do.

>
>> And yes, I agree it would be nice if test result on SIS search
>> depth can be shown, and I actually did the test, but the result
>> didn't show a reduction in depth due to sched-idle balancing
>> will also consume sched-idle/idle cpus. I will apply your patch
>> and make some further tests on that, thanks.
>>
>
> Just remember to use the patch to measure changes in SIS depth but
> performance figures should not include the patch as the schedstat
> overhead distorts results.

Yes, agreed.

>
> Also place the filter first and do any measurements of any change to
> balancing versus the filter. I'm suggesting placing the filter first
> because it's less controversial than a new balancer. Just be aware that
> the filter alone is not a guarantee of merging as there have been a few
> approaches to filtering and so far all of them had downsides on either cost

Yes, understood. I will adjust the patches as you suggested and send v2
together with more tests next week.

> or accuracy. IIRC the only active approach to reducing search cost in SIS
> is https://lore.kernel.org/all/20220207034013.599214-1-yu.c.chen@intel.com/
> and it's likely to get a new version due to
> https://lore.kernel.org/all/20220207135253.GF23216@worktop.programming.kicks-ass.net/.
> It also updates sched_domain_shared but with a single boolean instead of
> an atomic+cpumask.
>

Chen Yu's patch disables idle cpu searching in SIS when the LLC domain
is overloaded (that is 85% capacity usage) and Peter suggested him use
this metric to replace/improve SIS_PROP feature to make search depth
varying gently.

I don't think either of the two approaches conflict with mine, as they
are to reduce the effort of searching when system is busy and cpus are
not likely to be idle, and mine is to consume sched-idle/idle cpus by
themselves by pulling non-idle tasks from overloaded rqs so there will
be fewer sched-idle/idle cpus.

Thanks and best regards,
Abel
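The lockless read pattern discussed in this reply can be made tolerant of the race by treating both values as hints. The following is a user-space sketch with C11 atomics under stated assumptions: struct llc_hint_model and pick_overloaded_cpu() are hypothetical, and the caller is assumed to recheck the chosen runqueue under its lock before pulling anything.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-LLC shared hints, read without any rq lock held. */
struct llc_hint_model {
	atomic_int nr_overloaded;
	_Atomic uint64_t overloaded_mask;
};

/*
 * Return the lowest-numbered CPU currently flagged as overloaded, or -1.
 * The result is only a hint: it may already be stale when it is returned.
 */
int pick_overloaded_cpu(struct llc_hint_model *llc)
{
	uint64_t mask;

	assert(llc != NULL);
	if (atomic_load_explicit(&llc->nr_overloaded, memory_order_relaxed) <= 0)
		return -1;

	/* The counter may have dropped to zero by now, so the mask can be empty. */
	mask = atomic_load_explicit(&llc->overloaded_mask, memory_order_relaxed);
	for (int cpu = 0; cpu < 64; cpu++) {
		if (mask & ((uint64_t)1 << cpu))
			return cpu;
	}
	return -1;
}
```

An empty mask after a positive counter is handled the same way as "nothing to pull", so the race costs a wasted check rather than a wrong decision.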
On Fri, Feb 25, 2022 at 5:36 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
[snip]
> > Also place the filter first and do any measurements of any change to
> > balancing versus the filter. I'm suggesting placing the filter first
> > because it's less controversial than a new balancer. Just be aware that
> > the filter alone is not a guarantee of merging as there have been a few
> > approaches to filtering and so far all of them had downsides on either cost
>
> Yes, understood. I will adjust the patches as you suggested and send v2
> together with more tests next week.

+1 to trying the filter rather than introducing a new balance path.
We've found the sched_idle_cpu() checks in the wakeup path to be
adequate in allowing non-idle tasks to fully consume cpu resources
(but that of course relies on wakeup balancing, and not periodic
balancing).

Please cc me on the next series.

Thanks,
Josh
Ping :)

On 2/17/22 11:43 PM, Abel Wu Wrote:
[snip]
On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> Current load balancing is mainly based on cpu capacity
> and task util, which makes sense in the POV of overall
> throughput. While there still might be some improvement
> can be done by reducing number of overloaded cfs rqs if
> sched-idle or idle rq exists.

I'm much confused, there is an explicit new-idle balancer and a periodic
idle balancer already there.
On Thu, 24 Feb 2022 at 16:20, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> > Current load balancing is mainly based on cpu capacity
> > and task util, which makes sense in the POV of overall
> > throughput. While there still might be some improvement
> > can be done by reducing number of overloaded cfs rqs if
> > sched-idle or idle rq exists.
>
> I'm much confused, there is an explicit new-idle balancer and a periodic
> idle balancer already there.

I agree, You failed to explain why newly_idle and periodic idle load
balance are not enough and we need this new one
On 2/24/22 11:29 PM, Vincent Guittot Wrote:
> On Thu, 24 Feb 2022 at 16:20, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>>> Current load balancing is mainly based on cpu capacity
>>> and task util, which makes sense in the POV of overall
>>> throughput. While there still might be some improvement
>>> can be done by reducing number of overloaded cfs rqs if
>>> sched-idle or idle rq exists.
>>
>> I'm much confused, there is an explicit new-idle balancer and a periodic
>> idle balancer already there.
>
> I agree, You failed to explain why newly_idle and periodic idle load
> balance are not enough and we need this new one

Hi Vincent, sorry for not giving a clearer explanation. Please check
my previous email replying to Peter, thanks.

Best Regards,
Abel
Hi Peter,

On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>> Current load balancing is mainly based on cpu capacity
>> and task util, which makes sense in the POV of overall
>> throughput. While there still might be some improvement
>> can be done by reducing number of overloaded cfs rqs if
>> sched-idle or idle rq exists.
>
> I'm much confused, there is an explicit new-idle balancer and a periodic
> idle balancer already there.

The two balancers are triggered on the rqs that have no tasks on them,
and load_balance() seems don't show a preference for non-idle tasks so
there might be possibility that only idle tasks are pulled during load
balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a
result the normal tasks, mostly latency-critical ones in our case, on
that overloaded rq still suffer waiting for each other. I observed this
through perf sched.

IOW the main difference from the POV of load_balance() between the
latency-critical tasks and the idle ones is load.

The sched-idle balancer is triggered on the sched-idle rqs periodically
and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
make full use of cpu resources.

The sched-idle balancer only focuses on non-idle tasks' performance, so
it can introduce overall load imbalance, and that's why I put it before
load_balance().

Best Regards,
Abel
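The distinction this reply relies on, between an rq with no tasks and an rq that runs only idle tasks, can be sketched as a small classifier. This is a hypothetical user-space model: enum rq_state, struct rq_counts, classify_rq() and triggers_sched_idle_balance() are invented for illustration, and only CFS tasks are modelled.

```c
#include <assert.h>
#include <stdbool.h>

enum rq_state {
	RQ_IDLE,       /* nothing runnable */
	RQ_SCHED_IDLE, /* runnable tasks exist, all of them idle tasks */
	RQ_NORMAL,     /* exactly one non-idle task */
	RQ_OVERLOADED  /* more than one non-idle task */
};

/* Hypothetical counters of one runqueue (CFS tasks only). */
struct rq_counts {
	unsigned int nr_running;
	unsigned int idle_h_nr_running;
};

enum rq_state classify_rq(const struct rq_counts *rq)
{
	unsigned int non_idle;

	assert(rq->idle_h_nr_running <= rq->nr_running);
	non_idle = rq->nr_running - rq->idle_h_nr_running;

	if (rq->nr_running == 0)
		return RQ_IDLE;
	if (non_idle == 0)
		return RQ_SCHED_IDLE;
	return non_idle == 1 ? RQ_NORMAL : RQ_OVERLOADED;
}

/* Per the description above: sched-idle rqs and newly-idle rqs pull. */
bool triggers_sched_idle_balance(enum rq_state state)
{
	return state == RQ_IDLE || state == RQ_SCHED_IDLE;
}
```

Under this model an rq with three SCHED_IDLE tasks is a pull destination just like an empty one, while an rq with a single normal task is neither a source nor a destination.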
On Fri, 25 Feb 2022 at 07:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> Hi Peter,
>
> On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
> > On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> >> Current load balancing is mainly based on cpu capacity
> >> and task util, which makes sense in the POV of overall
> >> throughput. While there still might be some improvement
> >> can be done by reducing number of overloaded cfs rqs if
> >> sched-idle or idle rq exists.
> >
> > I'm much confused, there is an explicit new-idle balancer and a periodic
> > idle balancer already there.
>
> The two balancers are triggered on the rqs that have no tasks on them,
> and load_balance() seems don't show a preference for non-idle tasks so

The load balance will happen at the idle pace if a sched_idle task is
running on the cpu so you will have an ILB on each cpu that run a
sched-idle task

> there might be possibility that only idle tasks are pulled during load
> balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a

There is a LB_MIN feature (disable by default) that filters task with
very low load ( < 16) which includes sched-idle task which has a max
load of 3

> result the normal tasks, mostly latency-critical ones in our case, on
> that overloaded rq still suffer waiting for each other. I observed this
> through perf sched.
>
> IOW the main difference from the POV of load_balance() between the
> latency-critical tasks and the idle ones is load.
>
> The sched-idle balancer is triggered on the sched-idle rqs periodically
> and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
> the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
> make full use of cpu resources.
>
> The sched-idle balancer only focuses on non-idle tasks' performance, so
> it can introduce overall load imbalance, and that's why I put it before
> load_balance().

According to the very low weight of a sched-idle task, I don't expect
much imbalance because of sched-idle tasks. But this also depends of
the number of sched-idle task.

>
> Best Regards,
> Abel
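The load threshold mentioned in this reply can be sketched as a simple predicate. This is a hypothetical model, not the kernel's code: the *_MODEL constants and lb_min_skips_task() are made up for illustration, with the values 3 (weight of a SCHED_IDLE task), 1024 (weight of a nice-0 task) and 16 (the LB_MIN cut-off quoted above); any further conditions the kernel applies are not modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define WEIGHT_IDLEPRIO_MODEL 3UL    /* load weight of a SCHED_IDLE task */
#define NICE_0_WEIGHT_MODEL   1024UL /* load weight of a nice-0 task */
#define LB_MIN_LOAD_MODEL     16UL   /* cut-off quoted in the reply above */

/* A SCHED_IDLE task always falls under the cut-off. */
static_assert(WEIGHT_IDLEPRIO_MODEL < LB_MIN_LOAD_MODEL,
	      "sched-idle weight must be below the LB_MIN cut-off");

/* With the feature enabled, tasks with very low load are not migrated. */
bool lb_min_skips_task(bool lb_min_enabled, unsigned long task_load)
{
	return lb_min_enabled && task_load < LB_MIN_LOAD_MODEL;
}
```

With the feature enabled, a task of load 3 is skipped and a task of load 1024 is not; with it disabled (the default per the reply), nothing is filtered.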
On 2/25/22 4:29 PM, Vincent Guittot Wrote:
> On Fri, 25 Feb 2022 at 07:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
>>
>> Hi Peter,
>>
>> On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
>>> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>>>> Current load balancing is mainly based on cpu capacity
>>>> and task util, which makes sense in the POV of overall
>>>> throughput. While there still might be some improvement
>>>> can be done by reducing number of overloaded cfs rqs if
>>>> sched-idle or idle rq exists.
>>>
>>> I'm much confused, there is an explicit new-idle balancer and a periodic
>>> idle balancer already there.
>>
>> The two balancers are triggered on the rqs that have no tasks on them,
>> and load_balance() seems don't show a preference for non-idle tasks so
>
> The load balance will happen at the idle pace if a sched_idle task is
> running on the cpu so you will have an ILB on each cpu that run a
> sched-idle task

I'm afraid I don't quite follow you, since sched-idle balancer doesn't
touch the ILB part, can you elaborate on this? Thanks.

>
>> there might be possibility that only idle tasks are pulled during load
>> balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a
>
> There is a LB_MIN feature (disable by default) that filters task with
> very low load ( < 16) which includes sched-idle task which has a max
> load of 3

This feature might not that friendly to the situation that only
sched-idle tasks are running in the system. And this situation can
last more than half a day in our co-location systems in which the
training/batch tasks are placed under idle groups or directly assigned
to SCHED_IDLE.

>
>> result the normal tasks, mostly latency-critical ones in our case, on
>> that overloaded rq still suffer waiting for each other. I observed this
>> through perf sched.
>>
>> IOW the main difference from the POV of load_balance() between the
>> latency-critical tasks and the idle ones is load.
>>
>> The sched-idle balancer is triggered on the sched-idle rqs periodically
>> and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
>> the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
>> make full use of cpu resources.
>>
>> The sched-idle balancer only focuses on non-idle tasks' performance, so
>> it can introduce overall load imbalance, and that's why I put it before
>> load_balance().
>
> According to the very low weight of a sched-idle task, I don't expect
> much imbalance because of sched-idle tasks. But this also depends of
> the number of sched-idle task.
>
>
>>
>> Best Regards,
>> Abel
On Fri, 25 Feb 2022 at 11:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> On 2/25/22 4:29 PM, Vincent Guittot Wrote:
> > On Fri, 25 Feb 2022 at 07:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
> >>
> >> Hi Peter,
> >>
> >> On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
> >>> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> >>>> Current load balancing is mainly based on cpu capacity
> >>>> and task util, which makes sense in the POV of overall
> >>>> throughput. While there still might be some improvement
> >>>> can be done by reducing number of overloaded cfs rqs if
> >>>> sched-idle or idle rq exists.
> >>>
> >>> I'm much confused, there is an explicit new-idle balancer and a periodic
> >>> idle balancer already there.
> >>
> >> The two balancers are triggered on the rqs that have no tasks on them,
> >> and load_balance() seems don't show a preference for non-idle tasks so
> >
> > The load balance will happen at the idle pace if a sched_idle task is
> > running on the cpu so you will have an ILB on each cpu that run a
> > sched-idle task
>
> I'm afraid I don't quite follow you, since the sched-idle balancer
> doesn't touch the ILB part. Can you elaborate on this? Thanks.

I was referring to your sentence "The two balancers are triggered on
the rqs that have no tasks on them". When there are only sched-idle
tasks on a rq, load_balance() behaves like the Idle Load Balance when
there is no task, i.e. as often.

> >
> >> there might be possibility that only idle tasks are pulled during load
> >> balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a
> >
> > There is a LB_MIN feature (disable by default) that filters task with
> > very low load ( < 16) which includes sched-idle task which has a max
> > load of 3

But we could easily change this, e.g. if !sched_idle_cpus then LB can
migrate only cfs tasks, otherwise it can migrate sched_idle tasks as
well, instead of creating another side channel.

>
> This feature might not be that friendly to the situation where only
> sched-idle tasks are running in the system. And this situation
> can last more than half a day in our co-location systems, in which
> the training/batch tasks are placed under idle groups or directly
> assigned to SCHED_IDLE.
>
> >
> >> result the normal tasks, mostly latency-critical ones in our case, on
> >> that overloaded rq still suffer waiting for each other. I observed this
> >> through perf sched.
> >>
> >> IOW the main difference from the POV of load_balance() between the
> >> latency-critical tasks and the idle ones is load.
> >>
> >> The sched-idle balancer is triggered on the sched-idle rqs periodically
> >> and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
> >> the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
> >> make full use of cpu resources.
> >>
> >> The sched-idle balancer only focuses on non-idle tasks' performance, so
> >> it can introduce overall load imbalance, and that's why I put it before
> >> load_balance().
> >
> > According to the very low weight of a sched-idle task, I don't expect
> > much imbalance because of sched-idle tasks. But this also depends on
> > the number of sched-idle tasks.
> >
> >
> >>
> >> Best Regards,
> >> Abel
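
For illustration only, the filter suggested in the reply above (restrict
load balance to non-idle cfs tasks unless sched-idle CPUs exist) could be
sketched roughly as below. This is a simplified user-space model, not
kernel code and not taken from the patch: struct demo_task,
demo_can_migrate() and the nr_sched_idle_cpus parameter are hypothetical
names, and reading "sched_idle_cpus" as a count of CPUs running only
sched-idle tasks is an assumption.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a task; not the kernel's task_struct. */
struct demo_task {
	bool sched_idle;	/* SCHED_IDLE policy or placed under an idle cgroup */
};

/*
 * Hypothetical migration filter modelled on the suggestion in the thread.
 * nr_sched_idle_cpus is an assumed count of CPUs that currently run only
 * sched-idle tasks.
 */
bool demo_can_migrate(const struct demo_task *p, unsigned int nr_sched_idle_cpus)
{
	/* No sched-idle CPUs: only non-idle cfs tasks may be migrated. */
	if (!nr_sched_idle_cpus)
		return !p->sched_idle;

	/* Otherwise sched-idle tasks may be migrated as well. */
	return true;
}
```

In this model a sched-idle task is skipped only when nr_sched_idle_cpus is
zero. Whether such a check would sit next to the existing LB_MIN filtering
is left open in the thread.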