SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first
non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY).

Above the SMT sched domain, all domains have a child. SD_PREFER_SIBLING
is always honored, regardless of the scheduling domain at which the load
balance takes place.

There are cases, however, in which the busiest CPU's sched domain has a
child but the destination CPU's does not. Consider, for instance, a
non-SMT core (or an SMT core with only one online sibling) doing load
balance with an SMT core at the MC level. SD_PREFER_SIBLING will not be
honored, and we are left with a fully busy SMT core and an idle non-SMT
core.

Avoid this inconsistent behavior: use the prefer_sibling behavior of the
current scheduling domain, not that of its child.

The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we
will not spread load among NUMA sched groups, as desired.
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
kernel/sched/fair.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df7bcbf634a8..a37ad59f20ea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10004,7 +10004,6 @@ static void update_idle_cpu_scan(struct lb_env *env,
static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
- struct sched_domain *child = env->sd->child;
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
@@ -10045,9 +10044,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
sg = sg->next;
} while (sg != env->sd->groups);
- /* Tag domain that child domain prefers tasks go to siblings first */
- sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
-
+ /*
+ * Tag domain that @env::sd prefers to spread excess tasks among
+ * sibling sched groups.
+ */
+ sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
if (env->sd->flags & SD_NUMA)
env->fbq_type = fbq_classify_group(&sds->busiest_stat);
@@ -10346,7 +10347,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto out_balanced;
}
- /* Try to move all excess tasks to child's sibling domain */
if (sds.prefer_sibling && local->group_type == group_has_spare &&
busiest->sum_nr_running > local->sum_nr_running + 1)
goto force_balance;
--
2.25.1
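To illustrate the behavioral difference the patch describes, here is a
minimal, self-contained sketch (not kernel code; the structure and flag
value are simplified stand-ins) of which domain's flags the old and new
logic consult when the destination CPU's domain has no SMT child:

	#include <stdio.h>
	#include <stddef.h>

	#define SD_PREFER_SIBLING 0x1	/* stand-in for the kernel flag bit */

	struct sched_domain {
		int flags;
		struct sched_domain *child;
	};

	int main(void)
	{
		/* MC domain of a non-SMT core: no SMT child below it. */
		struct sched_domain mc_non_smt = {
			.flags = SD_PREFER_SIBLING,
			.child = NULL,
		};

		/* Old logic: prefer_sibling comes from the child domain's flags. */
		int old_prefer = mc_non_smt.child &&
				 (mc_non_smt.child->flags & SD_PREFER_SIBLING);

		/* New logic (this patch): the current domain's flags are used. */
		int new_prefer = !!(mc_non_smt.flags & SD_PREFER_SIBLING);

		/* Prints old=0 new=1: only the new logic spreads tasks here. */
		printf("old=%d new=%d\n", old_prefer, new_prefer);
		return 0;
	}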
On Mon, Feb 06, 2023 at 08:58:34PM -0800, Ricardo Neri wrote:
> SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first
> non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY).
>
> Above the SMT sched domain, all domains have a child. The SD_PREFER_
> SIBLING is honored always regardless of the scheduling domain at which the
> load balance takes place.
>
> There are cases, however, in which the busiest CPU's sched domain has
> child but the destination CPU's does not. Consider, for instance a non-SMT
> core (or an SMT core with only one online sibling) doing load balance with
> an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are
> left with a fully busy SMT core and an idle non-SMT core.
>
> Avoid inconsistent behavior. Use the prefer_sibling behavior at the current
> scheduling domain, not its child.
>
> The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we
> will not spread load among NUMA sched groups, as desired.
>

Like many of the others; I don't much like this.

Why not simply detect this asymmetric having of SMT and kill the
PREFER_SIBLING flag on the SMT leafs in that case?

Specifically, I'm thinking something in the degenerate area where it
looks if a given domain has equal depth children or so.

Note that this should not be tied to having special hardware, you can
create the very same weirdness by just offlining a few SMT siblings and
leaving a few on.
On 10/02/23 11:08, Peter Zijlstra wrote: > On Mon, Feb 06, 2023 at 08:58:34PM -0800, Ricardo Neri wrote: >> SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first >> non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY). >> >> Above the SMT sched domain, all domains have a child. The SD_PREFER_ >> SIBLING is honored always regardless of the scheduling domain at which the >> load balance takes place. >> >> There are cases, however, in which the busiest CPU's sched domain has >> child but the destination CPU's does not. Consider, for instance a non-SMT >> core (or an SMT core with only one online sibling) doing load balance with >> an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are >> left with a fully busy SMT core and an idle non-SMT core. >> >> Avoid inconsistent behavior. Use the prefer_sibling behavior at the current >> scheduling domain, not its child. >> >> The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we >> will not spread load among NUMA sched groups, as desired. >> > > Like many of the others; I don't much like this. > > Why not simply detect this asymmetric having of SMT and kill the > PREFER_SIBLING flag on the SMT leafs in that case? > > Specifically, I'm thinking something in the degenerate area where it > looks if a given domain has equal depth children or so. > > Note that this should not be tied to having special hardware, you can > create the very same weirdness by just offlining a few SMT siblings and > leaving a few on. So something like have SD_PREFER_SIBLING affect the SD it's on (and not its parent), but remove it from the lowest non-degenerated topology level? (+ add it to the first NUMA level to keep things as they are, even if TBF I find relying on it for NUMA balancing a bit odd).
On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
> So something like have SD_PREFER_SIBLING affect the SD it's on (and not
> its parent), but remove it from the lowest non-degenerated topology level?

So I was rather confused about the whole moving it between levels things
this morning -- conceptually, prefer siblings says you want to try
sibling domains before filling up your current domain. Now, balancing
between siblings happens one level up, hence looking at child->flags
makes perfect sense.

But looking at the current domain and still calling it prefer sibling
makes absolutely no sense what so ever.

In that confusion I think I also got the polarity wrong, I thought you
wanted to kill prefer_sibling for the assymetric SMT cases, instead you
want to force enable it as long as there is one SMT child around.

Whichever way around it we do it, I'm thinking perhaps some renaming
might be in order to clarify things.

How about adding a flag SD_SPREAD_TASKS, which is the effective toggle
of the behaviour, but have it be set by children with SD_PREFER_SIBLING
or something.

OTOH, there's also

	if (busiest->group_weight == 1 || sds->prefer_sibling) {

which explicitly also takes the group-of-one (the !child case) into
account, but that's not consistently done.

	sds->prefer_sibling = !child || child->flags & SD_PREFER_SIBLING;

seems an interesting option, except perhaps ASYM_CPUCAPACITY -- I
forget, can CPUs of different capacity be in the same leaf domain? With
big.LITTLE I think not, they had their own cache domains and so you get
at least MC domains per capacity, but DynamiQ might have totally wrecked
that party.

> (+ add it to the first NUMA level to keep things as they are, even if TBF I
> find relying on it for NUMA balancing a bit odd).

Arguably it ought to perhaps be one of those node_reclaim_distance
things. The thing is that NUMA-1 is often fairly quick, esp. these days
where it's basically on die numa.
On 10/02/23 17:53, Peter Zijlstra wrote: > On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote: > >> So something like have SD_PREFER_SIBLING affect the SD it's on (and not >> its parent), but remove it from the lowest non-degenerated topology level? > > So I was rather confused about the whole moving it between levels things > this morning -- conceptually, prefer siblings says you want to try > sibling domains before filling up your current domain. Now, balancing > between siblings happens one level up, hence looking at child->flags > makes perfect sense. > > But looking at the current domain and still calling it prefer sibling > makes absolutely no sense what so ever. > True :-) > In that confusion I think I also got the polarity wrong, I thought you > wanted to kill prefer_sibling for the assymetric SMT cases, instead you > want to force enable it as long as there is one SMT child around. > > Whichever way around it we do it, I'm thinking perhaps some renaming > might be in order to clarify things. > > How about adding a flag SD_SPREAD_TASKS, which is the effective toggle > of the behaviour, but have it be set by children with SD_PREFER_SIBLING > or something. > Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah something along those lines. > OTOH, there's also > > if (busiest->group_weight == 1 || sds->prefer_sibling) { > > which explicitly also takes the group-of-one (the !child case) into > account, but that's not consistently done. > > sds->prefer_sibling = !child || child->flags & SD_PREFER_SIBLING; > > seems an interesting option, > except perhaps ASYM_CPUCAPACITY -- I > forget, can CPUs of different capacity be in the same leaf domain? With > big.LITTLE I think not, they had their own cache domains and so you get > at least MC domains per capacity, but DynamiQ might have totally wrecked > that party. Yeah, newer systems can have different capacities in one MC domain, cf: b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan") > >> (+ add it to the first NUMA level to keep things as they are, even if TBF I >> find relying on it for NUMA balancing a bit odd). > > Arguably it ought to perhaps be one of those node_reclaim_distance > things. The thing is that NUMA-1 is often fairly quick, esp. these days > where it's basically on die numa. Right, makes sense, thanks.
On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote: > On 10/02/23 17:53, Peter Zijlstra wrote: > > On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote: > > > >> So something like have SD_PREFER_SIBLING affect the SD it's on (and not > >> its parent), but remove it from the lowest non-degenerated topology level? > > > > So I was rather confused about the whole moving it between levels things > > this morning -- conceptually, prefer siblings says you want to try > > sibling domains before filling up your current domain. Now, balancing > > between siblings happens one level up, hence looking at child->flags > > makes perfect sense. > > > > But looking at the current domain and still calling it prefer sibling > > makes absolutely no sense what so ever. > > > > True :-) > > > In that confusion I think I also got the polarity wrong, I thought you > > wanted to kill prefer_sibling for the assymetric SMT cases, instead you > > want to force enable it as long as there is one SMT child around. Exactly. > > > > Whichever way around it we do it, I'm thinking perhaps some renaming > > might be in order to clarify things. > > > > How about adding a flag SD_SPREAD_TASKS, which is the effective toggle > > of the behaviour, but have it be set by children with SD_PREFER_SIBLING > > or something. > > > > Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah > something along those lines. I sense a consesus towards SD_SPREAD_TASKS. > > > OTOH, there's also > > > > if (busiest->group_weight == 1 || sds->prefer_sibling) { > > > > which explicitly also takes the group-of-one (the !child case) into > > account, but that's not consistently done. > > > > sds->prefer_sibling = !child || child->flags & SD_PREFER_SIBLING; This would need a special provision for SD_ASYM_CPUCAPACITY. > > > > seems an interesting option, > > > except perhaps ASYM_CPUCAPACITY -- I > > forget, can CPUs of different capacity be in the same leaf domain? With > > big.LITTLE I think not, they had their own cache domains and so you get > > at least MC domains per capacity, but DynamiQ might have totally wrecked > > that party. > > Yeah, newer systems can have different capacities in one MC domain, cf: > > b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan") > > > > >> (+ add it to the first NUMA level to keep things as they are, even if TBF I > >> find relying on it for NUMA balancing a bit odd). > > > > Arguably it ought to perhaps be one of those node_reclaim_distance > > things. The thing is that NUMA-1 is often fairly quick, esp. these days > > where it's basically on die numa. To conserve the current behavior the NUMA level would need to have SD_SPREAD_TASKS. It will be cleared along with SD_BALANCE_{EXEC, FORK} and SD_WAKE_AFFINE if the numa distance is larger than node_reclaim_distance, yes? Thanks and BR, Ricardo
On 10/02/2023 19:31, Ricardo Neri wrote: > On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote: >> On 10/02/23 17:53, Peter Zijlstra wrote: >>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote: >>> >>>> So something like have SD_PREFER_SIBLING affect the SD it's on (and not >>>> its parent), but remove it from the lowest non-degenerated topology level? >>> >>> So I was rather confused about the whole moving it between levels things >>> this morning -- conceptually, prefer siblings says you want to try >>> sibling domains before filling up your current domain. Now, balancing >>> between siblings happens one level up, hence looking at child->flags >>> makes perfect sense. >>> >>> But looking at the current domain and still calling it prefer sibling >>> makes absolutely no sense what so ever. >>> >> >> True :-) >> >>> In that confusion I think I also got the polarity wrong, I thought you >>> wanted to kill prefer_sibling for the assymetric SMT cases, instead you >>> want to force enable it as long as there is one SMT child around. > > Exactly. > >>> >>> Whichever way around it we do it, I'm thinking perhaps some renaming >>> might be in order to clarify things. >>> >>> How about adding a flag SD_SPREAD_TASKS, which is the effective toggle >>> of the behaviour, but have it be set by children with SD_PREFER_SIBLING >>> or something. >>> >> >> Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah >> something along those lines. > > I sense a consesus towards SD_SPREAD_TASKS. Can you not detect the E-core dst_cpu case on MC with: + if (child) + sds->prefer_sibling = child->flags & SD_PREFER_SIBLING; + else if (sds->busiest) + sds->prefer_sibling = sds->busiest->group_weight > 1; + [...]
On Mon, Feb 13, 2023 at 01:17:09PM +0100, Dietmar Eggemann wrote: > On 10/02/2023 19:31, Ricardo Neri wrote: > > On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote: > >> On 10/02/23 17:53, Peter Zijlstra wrote: > >>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote: > >>> > >>>> So something like have SD_PREFER_SIBLING affect the SD it's on (and not > >>>> its parent), but remove it from the lowest non-degenerated topology level? > >>> > >>> So I was rather confused about the whole moving it between levels things > >>> this morning -- conceptually, prefer siblings says you want to try > >>> sibling domains before filling up your current domain. Now, balancing > >>> between siblings happens one level up, hence looking at child->flags > >>> makes perfect sense. > >>> > >>> But looking at the current domain and still calling it prefer sibling > >>> makes absolutely no sense what so ever. > >>> > >> > >> True :-) > >> > >>> In that confusion I think I also got the polarity wrong, I thought you > >>> wanted to kill prefer_sibling for the assymetric SMT cases, instead you > >>> want to force enable it as long as there is one SMT child around. > > > > Exactly. > > > >>> > >>> Whichever way around it we do it, I'm thinking perhaps some renaming > >>> might be in order to clarify things. > >>> > >>> How about adding a flag SD_SPREAD_TASKS, which is the effective toggle > >>> of the behaviour, but have it be set by children with SD_PREFER_SIBLING > >>> or something. > >>> > >> > >> Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah > >> something along those lines. > > > > I sense a consesus towards SD_SPREAD_TASKS. > > Can you not detect the E-core dst_cpu case on MC with: > > + if (child) > + sds->prefer_sibling = child->flags & SD_PREFER_SIBLING; > + else if (sds->busiest) > + sds->prefer_sibling = sds->busiest->group_weight > 1; Whose child wants the prefer_sibling setting? In update_sd_lb_stats(), it is set based on the flags of the destination CPU's sched domain. But when used in find_busiest_group() tasks are spread from the busiest group's child domain. Your proposed code, also needs a check for SD_PREFER_SIBLING, no? Thanks and BR, Ricardo
On Mon, Feb 13, 2023 at 10:43:28PM -0800, Ricardo Neri wrote: > On Mon, Feb 13, 2023 at 01:17:09PM +0100, Dietmar Eggemann wrote: > > On 10/02/2023 19:31, Ricardo Neri wrote: > > > On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote: > > >> On 10/02/23 17:53, Peter Zijlstra wrote: > > >>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote: > > >>> > > >>>> So something like have SD_PREFER_SIBLING affect the SD it's on (and not > > >>>> its parent), but remove it from the lowest non-degenerated topology level? > > >>> > > >>> So I was rather confused about the whole moving it between levels things > > >>> this morning -- conceptually, prefer siblings says you want to try > > >>> sibling domains before filling up your current domain. Now, balancing > > >>> between siblings happens one level up, hence looking at child->flags > > >>> makes perfect sense. > > >>> > > >>> But looking at the current domain and still calling it prefer sibling > > >>> makes absolutely no sense what so ever. > > >>> > > >> > > >> True :-) > > >> > > >>> In that confusion I think I also got the polarity wrong, I thought you > > >>> wanted to kill prefer_sibling for the assymetric SMT cases, instead you > > >>> want to force enable it as long as there is one SMT child around. > > > > > > Exactly. > > > > > >>> > > >>> Whichever way around it we do it, I'm thinking perhaps some renaming > > >>> might be in order to clarify things. > > >>> > > >>> How about adding a flag SD_SPREAD_TASKS, which is the effective toggle > > >>> of the behaviour, but have it be set by children with SD_PREFER_SIBLING > > >>> or something. > > >>> > > >> > > >> Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah > > >> something along those lines. > > > > > > I sense a consesus towards SD_SPREAD_TASKS. > > > > Can you not detect the E-core dst_cpu case on MC with: > > > > + if (child) > > + sds->prefer_sibling = child->flags & SD_PREFER_SIBLING; > > + else if (sds->busiest) > > + sds->prefer_sibling = sds->busiest->group_weight > 1; > > Whose child wants the prefer_sibling setting? In update_sd_lb_stats(), it > is set based on the flags of the destination CPU's sched domain. But when > used in find_busiest_group() tasks are spread from the busiest group's > child domain. > > Your proposed code, also needs a check for SD_PREFER_SIBLING, no? I tweaked the solution that Dietmar proposed: - sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING; + if (sds->busiest) + sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING; This comes from the observation that the prefer_sibling setting acts on busiest group. It then depends on whether the busiest group, not the local group, has child sched sched domains. Today it works because in most cases both the local and the busiest groups have child domains with the SD_ PREFER_SIBLING flag. This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as prefer_sibling would not be set in that case. It would also conserve the current behavior at the NUMA level. We would not need to implement SD_SPREAD_TASKS. This would both fix the SMT vs non-SMT bug and be less invasive. Thoughts? Thanks and BR, Ricardo
On 16/02/2023 06:21, Ricardo Neri wrote: > On Mon, Feb 13, 2023 at 10:43:28PM -0800, Ricardo Neri wrote: >> On Mon, Feb 13, 2023 at 01:17:09PM +0100, Dietmar Eggemann wrote: >>> On 10/02/2023 19:31, Ricardo Neri wrote: >>>> On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote: >>>>> On 10/02/23 17:53, Peter Zijlstra wrote: >>>>>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote: [...] >>> Can you not detect the E-core dst_cpu case on MC with: >>> >>> + if (child) >>> + sds->prefer_sibling = child->flags & SD_PREFER_SIBLING; >>> + else if (sds->busiest) >>> + sds->prefer_sibling = sds->busiest->group_weight > 1; >> >> Whose child wants the prefer_sibling setting? In update_sd_lb_stats(), it >> is set based on the flags of the destination CPU's sched domain. But when >> used in find_busiest_group() tasks are spread from the busiest group's >> child domain. >> >> Your proposed code, also needs a check for SD_PREFER_SIBLING, no? > > I tweaked the solution that Dietmar proposed: > > - sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING; > + if (sds->busiest) > + sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING; Maybe: sds->prefer_sibling = !!(sds->busiest->flags & SD_PREFER_SIBLING); 1 vs 2048 ? > This comes from the observation that the prefer_sibling setting acts on > busiest group. It then depends on whether the busiest group, not the local > group, has child sched sched domains. Today it works because in most cases > both the local and the busiest groups have child domains with the SD_ > PREFER_SIBLING flag. > > This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as > prefer_sibling would not be set in that case. > > It would also conserve the current behavior at the NUMA level. We would > not need to implement SD_SPREAD_TASKS. > > This would both fix the SMT vs non-SMT bug and be less invasive. Yeah, much better! I always forget that we have those flags on SGs now as well. Luckily, we just need to check busiest sg to cover all cases.
On Thu, Feb 23, 2023 at 11:09:55AM +0100, Dietmar Eggemann wrote: > On 16/02/2023 06:21, Ricardo Neri wrote: > > On Mon, Feb 13, 2023 at 10:43:28PM -0800, Ricardo Neri wrote: > >> On Mon, Feb 13, 2023 at 01:17:09PM +0100, Dietmar Eggemann wrote: > >>> On 10/02/2023 19:31, Ricardo Neri wrote: > >>>> On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote: > >>>>> On 10/02/23 17:53, Peter Zijlstra wrote: > >>>>>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote: > > [...] > > >>> Can you not detect the E-core dst_cpu case on MC with: > >>> > >>> + if (child) > >>> + sds->prefer_sibling = child->flags & SD_PREFER_SIBLING; > >>> + else if (sds->busiest) > >>> + sds->prefer_sibling = sds->busiest->group_weight > 1; > >> > >> Whose child wants the prefer_sibling setting? In update_sd_lb_stats(), it > >> is set based on the flags of the destination CPU's sched domain. But when > >> used in find_busiest_group() tasks are spread from the busiest group's > >> child domain. > >> > >> Your proposed code, also needs a check for SD_PREFER_SIBLING, no? > > > > I tweaked the solution that Dietmar proposed: > > > > - sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING; > > + if (sds->busiest) > > + sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING; > > Maybe: > > sds->prefer_sibling = !!(sds->busiest->flags & SD_PREFER_SIBLING); > > 1 vs 2048 ? Sure, I can do this. > > > This comes from the observation that the prefer_sibling setting acts on > > busiest group. It then depends on whether the busiest group, not the local > > group, has child sched sched domains. Today it works because in most cases > > both the local and the busiest groups have child domains with the SD_ > > PREFER_SIBLING flag. > > > > This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as > > prefer_sibling would not be set in that case. > > > > It would also conserve the current behavior at the NUMA level. We would > > not need to implement SD_SPREAD_TASKS. > > > > This would both fix the SMT vs non-SMT bug and be less invasive. > > Yeah, much better! I always forget that we have those flags on SGs now > as well. Luckily, we just need to check busiest sg to cover all cases. Right. I can add a comment to clarify from where the flags come.
On Wed, Feb 15, 2023 at 09:21:05PM -0800, Ricardo Neri wrote:
> I tweaked the solution that Dietmar proposed:
>
> -	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> +	if (sds->busiest)
> +		sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING;
>
> This comes from the observation that the prefer_sibling setting acts on
> busiest group. It then depends on whether the busiest group, not the local
> group, has child sched sched domains. Today it works because in most cases
> both the local and the busiest groups have child domains with the SD_
> PREFER_SIBLING flag.
>
> This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as
> prefer_sibling would not be set in that case.
>
> It would also conserve the current behavior at the NUMA level. We would
> not need to implement SD_SPREAD_TASKS.
>
> This would both fix the SMT vs non-SMT bug and be less invasive.
>
> Thoughts?

That does look nice. Be sure to put in a nice comment too.
On Thu, Feb 16, 2023 at 01:16:42PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 15, 2023 at 09:21:05PM -0800, Ricardo Neri wrote:
>
> > I tweaked the solution that Dietmar proposed:
> >
> > -	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> > +	if (sds->busiest)
> > +		sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING;
> >
> > This comes from the observation that the prefer_sibling setting acts on
> > busiest group. It then depends on whether the busiest group, not the local
> > group, has child sched sched domains. Today it works because in most cases
> > both the local and the busiest groups have child domains with the SD_
> > PREFER_SIBLING flag.
> >
> > This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as
> > prefer_sibling would not be set in that case.
> >
> > It would also conserve the current behavior at the NUMA level. We would
> > not need to implement SD_SPREAD_TASKS.
> >
> > This would both fix the SMT vs non-SMT bug and be less invasive.
> >
> > Thoughts?
>
> That does look nice. Be sure to put in a nice comment too.

Will do!
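For reference, a minimal sketch of how that check could read once
commented, assuming the !! normalization Dietmar suggests above; the
comment wording here is illustrative and not the final patch:

	if (sds->busiest) {
		/*
		 * sched_group::flags is a copy of the flags of the child
		 * domain the group was built from. Spread tasks if the
		 * busiest group prefers siblings (e.g. it has SMT children),
		 * even when the local CPU's domain has no child of its own.
		 */
		sds->prefer_sibling =
			!!(sds->busiest->flags & SD_PREFER_SIBLING);
	}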
On 2023-02-06 at 20:58:34 -0800, Ricardo Neri wrote: > SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first > non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY). > > Above the SMT sched domain, all domains have a child. The SD_PREFER_ > SIBLING is honored always regardless of the scheduling domain at which the > load balance takes place. > > There are cases, however, in which the busiest CPU's sched domain has > child but the destination CPU's does not. Consider, for instance a non-SMT > core (or an SMT core with only one online sibling) doing load balance with > an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are > left with a fully busy SMT core and an idle non-SMT core. > > Avoid inconsistent behavior. Use the prefer_sibling behavior at the current > scheduling domain, not its child. > > The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we > will not spread load among NUMA sched groups, as desired. > > Cc: Ben Segall <bsegall@google.com> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> > Cc: Len Brown <len.brown@intel.com> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> > Cc: Steven Rostedt <rostedt@goodmis.org> > Cc: Tim C. Chen <tim.c.chen@intel.com> > Cc: Valentin Schneider <vschneid@redhat.com> > Cc: x86@kernel.org > Cc: linux-kernel@vger.kernel.org > Suggested-by: Valentin Schneider <vschneid@redhat.com> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> > --- > Changes since v2: > * Introduced this patch. > > Changes since v1: > * N/A > --- > kernel/sched/fair.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index df7bcbf634a8..a37ad59f20ea 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -10004,7 +10004,6 @@ static void update_idle_cpu_scan(struct lb_env *env, > > static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds) > { > - struct sched_domain *child = env->sd->child; > struct sched_group *sg = env->sd->groups; > struct sg_lb_stats *local = &sds->local_stat; > struct sg_lb_stats tmp_sgs; > @@ -10045,9 +10044,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > sg = sg->next; > } while (sg != env->sd->groups); > > - /* Tag domain that child domain prefers tasks go to siblings first */ > - sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING; > - > + /* > + * Tag domain that @env::sd prefers to spread excess tasks among > + * sibling sched groups. > + */ > + sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING; > This does help fix the issue that non-SMT core fails to pull task from busy SMT-cores. And it also semantically changes the definination of prefer sibling. Do we also need to change this: if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child) sd->child->flags &= ~SD_PREFER_SIBLING; might be: if ((sd->flags & SD_ASYM_CPUCAPACITY)) sd->flags &= ~SD_PREFER_SIBLING; thanks, Chenyu >
>> static inline void update_sd_lb_stats(struct lb_env *env, struct >> sd_lb_stats *sds) { >> - struct sched_domain *child = env->sd->child; >> struct sched_group *sg = env->sd->groups; >> struct sg_lb_stats *local = &sds->local_stat; >> struct sg_lb_stats tmp_sgs; >> @@ -10045,9 +10044,11 @@ static inline void update_sd_lb_stats(struct >lb_env *env, struct sd_lb_stats *sd >> sg = sg->next; >> } while (sg != env->sd->groups); >> >> - /* Tag domain that child domain prefers tasks go to siblings first */ >> - sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING; >> - >> + /* >> + * Tag domain that @env::sd prefers to spread excess tasks among >> + * sibling sched groups. >> + */ >> + sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING; >> >This does help fix the issue that non-SMT core fails to pull task from busy SMT- >cores. >And it also semantically changes the definination of prefer sibling. Do we also >need to change this: > if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child) > sd->child->flags &= ~SD_PREFER_SIBLING; might be: > if ((sd->flags & SD_ASYM_CPUCAPACITY)) > sd->flags &= ~SD_PREFER_SIBLING; > Yu, I think you are talking about the code in sd_init() where SD_PREFER_SIBLING is first set to "ON" and updated depending on SD_ASYM_CPUCAPACITY. The intention of the code is if there are cpus in the scheduler domain that have differing cpu capacities, we do not want to do spreading among the child groups in the sched domain. So the flag is turned off in the child group level and not the parent level. But with your above change, the parent's flag is turned off, leaving the child level flag on. This moves the level where spreading happens (SD_PREFER_SIBLING on) up one level which is undesired (see table below). SD_PREFER_SIBLING after init original code proposed SD Level SD_ASYM_CPUCAPACITY root ON ON OFF (note: SD_PREFER_SIBLING unused at this level) first level ON OFF OFF second level OFF OFF ON third level OFF ON ON Tim
On Thu, 2023-02-09 at 20:00 +0000, Chen, Tim C wrote:
> > >  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds) {
> > > -	struct sched_domain *child = env->sd->child;
> > >  	struct sched_group *sg = env->sd->groups;
> > >  	struct sg_lb_stats *local = &sds->local_stat;
> > >  	struct sg_lb_stats tmp_sgs;
> > > @@ -10045,9 +10044,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > >  		sg = sg->next;
> > >  	} while (sg != env->sd->groups);
> > >
> > > -	/* Tag domain that child domain prefers tasks go to siblings first */
> > > -	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> > > -
> > > +	/*
> > > +	 * Tag domain that @env::sd prefers to spread excess tasks among
> > > +	 * sibling sched groups.
> > > +	 */
> > > +	sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
> > >
> > This does help fix the issue that non-SMT core fails to pull task
> > from busy SMT-cores.
> > And it also semantically changes the definination of prefer sibling.
> > Do we also need to change this:
> > 	if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> > 		sd->child->flags &= ~SD_PREFER_SIBLING;
> > might be:
> > 	if ((sd->flags & SD_ASYM_CPUCAPACITY))
> > 		sd->flags &= ~SD_PREFER_SIBLING;
> >
>
> Yu,
>
> I think you are talking about the code in sd_init()
> where SD_PREFER_SIBLING is first set to "ON" and updated depending on
> SD_ASYM_CPUCAPACITY. The intention of the code is if there are cpus in
> the scheduler domain that have differing cpu capacities, we do not want
> to do spreading among the child groups in the sched domain.
> So the flag is turned off in the child group level and not the parent
> level. But with your above change, the parent's flag is turned off,
> leaving the child level flag on.
> This moves the level where spreading happens (SD_PREFER_SIBLING on)
> up one level which is undesired (see table below).
>

Sorry got a bad mail client messing up the table format. Updated below

SD Level       SD_ASYM_CPUCAPACITY   SD_PREFER_SIBLING after init
                                     original code   proposed
root           ON                    ON              OFF (note: SD_PREFER_SIBLING unused at this level)
first level    ON                    OFF             OFF
second level   OFF                   OFF             ON
third level    OFF                   ON              ON

Tim
On 2023-02-09 at 15:05:03 -0800, Tim Chen wrote: > On Thu, 2023-02-09 at 20:00 +0000, Chen, Tim C wrote: > > > > static inline void update_sd_lb_stats(struct lb_env *env, struct > > > > sd_lb_stats *sds) { > > > > - struct sched_domain *child = env->sd->child; > > > > struct sched_group *sg = env->sd->groups; > > > > struct sg_lb_stats *local = &sds->local_stat; > > > > struct sg_lb_stats tmp_sgs; > > > > @@ -10045,9 +10044,11 @@ static inline void > > > > update_sd_lb_stats(struct > > > lb_env *env, struct sd_lb_stats *sd > > > > sg = sg->next; > > > > } while (sg != env->sd->groups); > > > > > > > > - /* Tag domain that child domain prefers tasks go to > > > > siblings first */ > > > > - sds->prefer_sibling = child && child->flags & > > > > SD_PREFER_SIBLING; > > > > - > > > > + /* > > > > + * Tag domain that @env::sd prefers to spread excess > > > > tasks among > > > > + * sibling sched groups. > > > > + */ > > > > + sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING; > > > > > > > This does help fix the issue that non-SMT core fails to pull task > > > from busy SMT- > > > cores. > > > And it also semantically changes the definination of prefer > > > sibling. Do we also > > > need to change this: > > > if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child) > > > sd->child->flags &= ~SD_PREFER_SIBLING; might be: > > > if ((sd->flags & SD_ASYM_CPUCAPACITY)) > > > sd->flags &= ~SD_PREFER_SIBLING; > > > > > > > Yu, > > > > I think you are talking about the code in sd_init() > > where SD_PREFER_SIBLING is first set > > to "ON" and updated depending on SD_ASYM_CPUCAPACITY. The intention > > of the code > > is if there are cpus in the scheduler domain that have differing cpu > > capacities, > > we do not want to do spreading among the child groups in the sched > > domain. > > So the flag is turned off in the child group level and not the parent > > level. But with your above > > change, the parent's flag is turned off, leaving the child level flag > > on. > > This moves the level where spreading happens (SD_PREFER_SIBLING on) > > up one level which is undesired (see table below). > > Yes, it moves the flag 1 level up. And if I understand correctly, with Ricardo's patch applied, we have changed the original meaning of SD_PREFER_SIBLING: Original: Tasks in this sched domain want to be migrated to another sched domain. After init change: Tasks in the sched group under this sched domain want to be migrated to a sibling group. > > > Sorry got a bad mail client messing up the table format. Updated below > > SD_ASYM_CPUCAPACITY SD_PREFER_SIBLING after init > original code proposed > SD Level > root ON ON OFF (note: SD_PREFER_SIBLING unused at this level) SD_PREFER_SIBLING is hornored in root level after the init proposal. > first level ON OFF OFF Before the init proposed, tasks in first level sd do not want to be spreaded to a sibling sd. After the init proposeal, tasks in all sched groups under root sd, do not want to be spreaded to a sibling sched group(AKA first level sd) thanks, Chenyu > second level OFF OFF ON > third level OFF ON ON > > Tim
On Thu, Feb 09, 2023 at 03:05:03PM -0800, Tim Chen wrote: > On Thu, 2023-02-09 at 20:00 +0000, Chen, Tim C wrote: > > > > static inline void update_sd_lb_stats(struct lb_env *env, struct > > > > sd_lb_stats *sds) { > > > > - struct sched_domain *child = env->sd->child; > > > > struct sched_group *sg = env->sd->groups; > > > > struct sg_lb_stats *local = &sds->local_stat; > > > > struct sg_lb_stats tmp_sgs; > > > > @@ -10045,9 +10044,11 @@ static inline void > > > > update_sd_lb_stats(struct > > > lb_env *env, struct sd_lb_stats *sd > > > > sg = sg->next; > > > > } while (sg != env->sd->groups); > > > > > > > > - /* Tag domain that child domain prefers tasks go to > > > > siblings first */ > > > > - sds->prefer_sibling = child && child->flags & > > > > SD_PREFER_SIBLING; > > > > - > > > > + /* > > > > + * Tag domain that @env::sd prefers to spread excess > > > > tasks among > > > > + * sibling sched groups. > > > > + */ > > > > + sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING; > > > > > > > This does help fix the issue that non-SMT core fails to pull task > > > from busy SMT- > > > cores. > > > And it also semantically changes the definination of prefer > > > sibling. Do we also > > > need to change this: > > > if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child) > > > sd->child->flags &= ~SD_PREFER_SIBLING; might be: > > > if ((sd->flags & SD_ASYM_CPUCAPACITY)) > > > sd->flags &= ~SD_PREFER_SIBLING; > > > > > > > Yu, > > > > I think you are talking about the code in sd_init() > > where SD_PREFER_SIBLING is first set > > to "ON" and updated depending on SD_ASYM_CPUCAPACITY. The intention > > of the code > > is if there are cpus in the scheduler domain that have differing cpu > > capacities, > > we do not want to do spreading among the child groups in the sched > > domain. > > So the flag is turned off in the child group level and not the parent > > level. But with your above > > change, the parent's flag is turned off, leaving the child level flag > > on. > > This moves the level where spreading happens (SD_PREFER_SIBLING on) > > up one level which is undesired (see table below). But my patch moves the level at which we act on prefer_sibling: it now checks the SD_PREFER_SIBLING flag at the current level, not its child. Thus, removing SD_PREFER_SIBLING from a level with SD_ASYM_CPUCAPACITY prevents spreading among CPUs of different CPU capacity, no? Thanks and BR, Ricardo
On Tue, 7 Feb 2023 at 05:50, Ricardo Neri <ricardo.neri-calderon@linux.intel.com> wrote: > > SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first > non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY). > > Above the SMT sched domain, all domains have a child. The SD_PREFER_ > SIBLING is honored always regardless of the scheduling domain at which the > load balance takes place. > > There are cases, however, in which the busiest CPU's sched domain has > child but the destination CPU's does not. Consider, for instance a non-SMT > core (or an SMT core with only one online sibling) doing load balance with > an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are > left with a fully busy SMT core and an idle non-SMT core. > > Avoid inconsistent behavior. Use the prefer_sibling behavior at the current > scheduling domain, not its child. > > The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we > will not spread load among NUMA sched groups, as desired. This is a significant change in the behavior of the numa system. It would be good to get figures or confirmation that demonstrate that it's ok to remove prefer_sibling behavior at the 1st numa level. > > Cc: Ben Segall <bsegall@google.com> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> > Cc: Len Brown <len.brown@intel.com> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> > Cc: Steven Rostedt <rostedt@goodmis.org> > Cc: Tim C. Chen <tim.c.chen@intel.com> > Cc: Valentin Schneider <vschneid@redhat.com> > Cc: x86@kernel.org > Cc: linux-kernel@vger.kernel.org > Suggested-by: Valentin Schneider <vschneid@redhat.com> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> > --- > Changes since v2: > * Introduced this patch. > > Changes since v1: > * N/A > --- > kernel/sched/fair.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index df7bcbf634a8..a37ad59f20ea 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -10004,7 +10004,6 @@ static void update_idle_cpu_scan(struct lb_env *env, > > static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds) > { > - struct sched_domain *child = env->sd->child; > struct sched_group *sg = env->sd->groups; > struct sg_lb_stats *local = &sds->local_stat; > struct sg_lb_stats tmp_sgs; > @@ -10045,9 +10044,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > sg = sg->next; > } while (sg != env->sd->groups); > > - /* Tag domain that child domain prefers tasks go to siblings first */ > - sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING; > - > + /* > + * Tag domain that @env::sd prefers to spread excess tasks among > + * sibling sched groups. > + */ > + sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING; > > if (env->sd->flags & SD_NUMA) > env->fbq_type = fbq_classify_group(&sds->busiest_stat); > @@ -10346,7 +10347,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env) > goto out_balanced; > } > > - /* Try to move all excess tasks to child's sibling domain */ > if (sds.prefer_sibling && local->group_type == group_has_spare && > busiest->sum_nr_running > local->sum_nr_running + 1) > goto force_balance; > -- > 2.25.1 >
On Wed, Feb 08, 2023 at 08:48:05AM +0100, Vincent Guittot wrote:
> On Tue, 7 Feb 2023 at 05:50, Ricardo Neri
> <ricardo.neri-calderon@linux.intel.com> wrote:
> >
> > SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first
> > non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY).
> >
> > Above the SMT sched domain, all domains have a child. The SD_PREFER_
> > SIBLING is honored always regardless of the scheduling domain at which the
> > load balance takes place.
> >
> > There are cases, however, in which the busiest CPU's sched domain has
> > child but the destination CPU's does not. Consider, for instance a non-SMT
> > core (or an SMT core with only one online sibling) doing load balance with
> > an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are
> > left with a fully busy SMT core and an idle non-SMT core.
> >
> > Avoid inconsistent behavior. Use the prefer_sibling behavior at the current
> > scheduling domain, not its child.
> >
> > The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we
> > will not spread load among NUMA sched groups, as desired.
>
> This is a significant change in the behavior of the numa system. It
> would be good to get figures or confirmation that demonstrate that
> it's ok to remove prefer_sibling behavior at the 1st numa level.
>

Thank you very much for your feedback Vincent!

You are right. It does change the behavior at the first numa level. Peter
did not like this change either. It also made things confusing for
SD_ASYM_CPUCAPACITY vs SD_PREFER_SIBLING.

I'll work in a different approach.