[RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop

K Prateek Nayak posted 8 patches 9 months, 1 week ago
[RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop
Posted by K Prateek Nayak 9 months, 1 week ago
Allow update_sg_lb_stats() to retrieve the group stats cached in
sg_lb_stats_prop saved by another CPU performing load balancing around
the same time (same jiffy)

The current implementation, without invalidation of cached stats, has a few
limitations, namely that stats reuse is limited to busy load balancing since
stats can only be updated once per jiffy. Newidle balance can happen
frequently and concurrently on many CPUs, which can result in readers seeing
inconsistent values for the propagated stats.

For this iteration, the focus is to reduce the time taken for busy load
balancing, allowing the busy CPU to resume running the task as quickly as
possible.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 81 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 60517a732c10..3b402f294f0b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10275,6 +10275,75 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 	return check_cpu_capacity(rq, sd);
 }
 
+static inline int can_retrieve_stats(struct sched_domain *sd, enum cpu_idle_type idle)
+{
+	/*
+	 * Only under periodic load balancing can we ensure that no concurrent
+	 * CPU modifies the stats being propagated upwards, since
+	 * should_we_balance() can allow multiple concurrent newidle balances
+	 * to progress, and an idle -> busy transition for idle balance will
+	 * require the stats to be recomputed since idleness metrics will
+	 * change with migration.
+	 */
+	if (idle)
+		return 0;
+
+	/*
+	 * If individual groups are separate NUMA domains, migrations can cause
+	 * preferred task statistics to change and will require recomputing of
+	 * stats.
+	 */
+	if (sd->child && (sd->child->flags & SD_NUMA))
+		return 0;
+
+	/*
+	 * misfit_task_load requires recalculation on SD_ASYM_CPUCAPACITY
+	 * domains. Skip caching stats for them.
+	 */
+	if (sd->flags & SD_ASYM_CPUCAPACITY)
+		return 0;
+
+	/*
+	 * TODO: For CPU_IDLE case, invalidate stats for an idle -> busy
+	 * transition but for the time being, save some cycles during busy
+	 * load balancing.
+	 */
+	return 1;
+}
+
+static inline int retrieve_cached_stats(struct sched_group *group, struct sg_lb_stats *sg_stats)
+{
+	struct sched_domain_shared *sg_share = group->shared;
+	unsigned long current_jiffy = jiffies;
+	struct sg_lb_stats_prop *lb_prop;
+
+	if (!sg_share)
+		return 0;
+
+	lb_prop = (struct sg_lb_stats_prop *)sg_share->private;
+	if (!lb_prop)
+		return 0;
+
+	/* Stale stats */
+	if (READ_ONCE(lb_prop->last_update) != current_jiffy)
+		return 0;
+
+	/*
+	 * Pairs against the update to sgs_prop->last_update to
+	 * prevent readers from seeing an inconsistent value of
+	 * the propagated stats from a concurrent update.
+	 */
+	smp_rmb();
+	*sg_stats = lb_prop->sg_stats;
+
+	/*
+	 * If the stats were read within the same jiffy, the reader
+	 * cannot have seen an inconsistent state since stats are
+	 * only updated once per jiffy.
+	 */
+	return time_before_eq(jiffies, current_jiffy);
+}
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10292,10 +10361,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	int i, nr_running, local_group, sd_flags = env->sd->flags;
 	bool balancing_at_rd = !env->sd->parent;
 
-	memset(sgs, 0, sizeof(*sgs));
-
 	local_group = group == sds->local;
 
+	/*
+	 * If stats can be retrieved, we are doing busy load balancing.
+	 * Skip right ahead to group_classify() since group_asym_packing and
+	 * group_smt_balance are not possible under busy load balancing.
+	 */
+	if (can_retrieve_stats(env->sd, env->idle) &&
+	    retrieve_cached_stats(group, sgs))
+		goto group_classify;
+
+	memset(sgs, 0, sizeof(*sgs));
+
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
 		unsigned long load = cpu_load(rq);
@@ -10360,6 +10438,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	if (!local_group && smt_balance(env, sgs, group))
 		sgs->group_smt_balance = 1;
 
+group_classify:
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
 	/* Computing avg_load makes sense only when group is overloaded */
-- 
2.43.0
Re: [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop
Posted by Chen, Yu C 9 months ago
On 3/13/2025 5:37 PM, K Prateek Nayak wrote:
> Allow update_sg_lb_stats() to retrieve the group stats cached in
> sg_lb_stats_prop saved by another CPU performing load balancing around
> the same time (same jiffy)
> 

If I understand correctly, we allow update_sg_lb_stats() to retrieve
cached sg stats if another CPU in the same sched group has done load
balancing within the last jiffy, say 10ms for CONFIG_HZ_100.

There are two roles: the writer who updates the cached stats, and
the reader who reads the cached stats. Are both the cache writer and
the cache reader triggered only during busy periodic load balancing?
If yes, considering that periodic load balancing is usually triggered
on 1 CPU in each SD (should_we_balance()), and that the interval
increases with the number of CPUs in that sd, I just wonder if 10 ms
is a little short to find cached stats on a large LLC?
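
(For reference, my rough reading of how the busy interval scales,
paraphrasing get_sd_balance_interval() and sd_init(); exact rounding and
clamping details are elided:)

/* paraphrased sketch, not verbatim kernel code */
unsigned long interval = sd->balance_interval;	/* starts around the sd's weight, in ms */

if (cpu_busy)
	interval *= sd->busy_factor;		/* e.g. 16x for busy periodic balancing */

interval = msecs_to_jiffies(interval);
interval = clamp(interval, 1UL, max_load_balance_interval);

So on a wide LLC the busy balance interval can be many jiffies.
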
thanks,
Chenyu


> The current implementation, without invalidation of cached stats, has a few
> limitations, namely that stats reuse is limited to busy load balancing since
> stats can only be updated once per jiffy. Newidle balance can happen
> frequently and concurrently on many CPUs, which can result in readers seeing
> inconsistent values for the propagated stats.
> 
> For this iteration, the focus is to reduce the time taken for busy load
> balancing, allowing the busy CPU to resume running the task as quickly as
> possible.
> 
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>   kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 81 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 60517a732c10..3b402f294f0b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10275,6 +10275,75 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>   	return check_cpu_capacity(rq, sd);
>   }
>   
> +static inline int can_retrieve_stats(struct sched_domain *sd, enum cpu_idle_type idle)
> +{
> +	/*
> +	 * Only under periodic load balancing can we ensure that no concurrent
> +	 * CPU modifies the stats being propagated upwards, since
> +	 * should_we_balance() can allow multiple concurrent newidle balances
> +	 * to progress, and an idle -> busy transition for idle balance will
> +	 * require the stats to be recomputed since idleness metrics will
> +	 * change with migration.
> +	 */
> +	if (idle)
> +		return 0;
> +
> +	/*
> +	 * If individual groups are separate NUMA domains, migrations can cause
> +	 * preferred task statistics to change and will require recomputing of
> +	 * stats.
> +	 */
> +	if (sd->child && (sd->child->flags & SD_NUMA))
> +		return 0;
> +
> +	/*
> +	 * misfit_task_load requires recalculation on SD_ASYM_CPUCAPACITY
> +	 * domains. Skip caching stats for them.
> +	 */
> +	if (sd->flags & SD_ASYM_CPUCAPACITY)
> +		return 0;
> +
> +	/*
> +	 * TODO: For CPU_IDLE case, invalidate stats for an idle -> busy
> +	 * transition but for the time being, save some cycles during busy
> +	 * load balancing.
> +	 */
> +	return 1;
> +}
> +
> +static inline int retrieve_cached_stats(struct sched_group *group, struct sg_lb_stats *sg_stats)
> +{
> +	struct sched_domain_shared *sg_share = group->shared;
> +	unsigned long current_jiffy = jiffies;
> +	struct sg_lb_stats_prop *lb_prop;
> +
> +	if (!sg_share)
> +		return 0;
> +
> +	lb_prop = (struct sg_lb_stats_prop *)sg_share->private;
> +	if (!lb_prop)
> +		return 0;
> +
> +	/* Stale stats */
> +	if (READ_ONCE(lb_prop->last_update) != current_jiffy)
> +		return 0;
> +
> +	/*
> +	 * Pairs against the update to sgs_prop->last_update to
> +	 * prevent readers from seeing an inconsistent value of
> +	 * the propagated stats from a concurrent update.
> +	 */
> +	smp_rmb();
> +	*sg_stats = lb_prop->sg_stats;
> +
> +	/*
> +	 * If the stats were read within the same jiffy, the reader
> +	 * cannot have seen an inconsistent state since stats are
> +	 * only updated once per jiffy.
> +	 */
> +	return time_before_eq(jiffies, current_jiffy);
> +}
> +
>   /**
>    * update_sg_lb_stats - Update sched_group's statistics for load balancing.
>    * @env: The load balancing environment.
> @@ -10292,10 +10361,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   	int i, nr_running, local_group, sd_flags = env->sd->flags;
>   	bool balancing_at_rd = !env->sd->parent;
>   
> -	memset(sgs, 0, sizeof(*sgs));
> -
>   	local_group = group == sds->local;
>   
> +	/*
> +	 * If stats can be retrieved, we are doing busy load balancing.
> +	 * Skip right ahead to group_classify() since group_asym_packing and
> +	 * group_smt_balance are not possible under busy load balancing.
> +	 */
> +	if (can_retrieve_stats(env->sd, env->idle) &&
> +	    retrieve_cached_stats(group, sgs))
> +		goto group_classify;
> +
> +	memset(sgs, 0, sizeof(*sgs));
> +
>   	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>   		struct rq *rq = cpu_rq(i);
>   		unsigned long load = cpu_load(rq);
> @@ -10360,6 +10438,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   	if (!local_group && smt_balance(env, sgs, group))
>   		sgs->group_smt_balance = 1;
>   
> +group_classify:
>   	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>   
>   	/* Computing avg_load makes sense only when group is overloaded */
Re: [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop
Posted by K Prateek Nayak 9 months ago
Hello Chenyu,

Thank you for taking a look at the series.

On 3/17/2025 11:34 PM, Chen, Yu C wrote:
> On 3/13/2025 5:37 PM, K Prateek Nayak wrote:
>> Allow update_sg_lb_stats() to retrieve the group stats cached in
>> sg_lb_stats_prop saved by another CPU performing load balancing around
>> the same time (same jiffy)
>>
> 
> If I understand correctly, we allow update_sg_lb_stats() to retrieve
> cached sg stats if another CPU in the same sched group has done load
> balancing within the last jiffy, say 10ms for CONFIG_HZ_100.

Quick disclaimer: All of this is best effort currently.

Periodic load balancing is the easy case to start with since it happens
at most once a jiffy, so "last_update" as a jiffy counter should be
good enough (in most cases).
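
To make the pairing with the smp_rmb() in retrieve_cached_stats() a bit
more concrete, the writer side is conceptually along these lines (only a
sketch with a made-up helper name; the actual caching patch may differ):

/* Sketch only: hypothetical helper, not the exact code from the series */
static inline void cache_group_stats(struct sched_group *group,
				     struct sg_lb_stats *sgs)
{
	struct sched_domain_shared *sg_share = group->shared;
	struct sg_lb_stats_prop *lb_prop;

	if (!sg_share)
		return;

	lb_prop = (struct sg_lb_stats_prop *)sg_share->private;
	if (!lb_prop)
		return;

	/* Someone already cached stats for this jiffy; don't overwrite. */
	if (READ_ONCE(lb_prop->last_update) == jiffies)
		return;

	lb_prop->sg_stats = *sgs;

	/*
	 * Publish the stats before the timestamp so that a reader that
	 * observes the new last_update also observes the new stats.
	 * Pairs with the smp_rmb() in retrieve_cached_stats().
	 */
	smp_wmb();
	WRITE_ONCE(lb_prop->last_update, jiffies);
}

The reader in this patch then only trusts the cached copy if last_update
matches the current jiffy.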

Secondly, and this is slightly harder to solve, is getting all the
CPUs to actually sync. Currently it is best effort since the tick can
fire late due to interrupts being disabled on a CPU, SCHED_SOFTIRQ may
run at different times depending on how much work is done at the tick
before reaching the softirq handler, etc.

But assuming some amount of sync, I would like:

- During busy balance only one CPU gets to proceed as per
   should_we_balance() heuristics. In addition to that, since all CPUs
   are busy (should_we_balance() would have allowed the first idle CPU
   to go ahead otherwise) the "idle_cpus" and "overloaded" situations
   may change and those are hard to propagate.

- By the time this CPU does busy balancing, other groups below it
   hopefully had enough time to reach update_sd_lb_stats() and cache
   their copy for this jiffy in there. If not - the load balancing CPU
   will recompute.

- Since stats at a higher domain are used only once, there was no need
   to invalidate them, which I couldn't get right back then (or maybe
   even now :)

> 
> There are two roles,  writer who updates the cached stats,
> the reader who reads the cache stats. For both cache writer and
> the cache reader, do we trigger them only when it is in busy periodic
> load balance? If yes, consider the periodic load balance is usually
> triggered on 1 CPU in each SD(should_we_balance()),  and the
> interval increases with the number of CPUs in that sd, just wonder
> if 10 ms is a little short to find a cached stats on large LLC?

So the reader is always the CPU going to the higher domain and
recomputing stats. The writer should have updated the stats by then,
or the reader won't care and will recompute them.

At the very least, since the CPU has to look at the local stats too,
the logic ensures that at least those are reused and not recomputed.

Beyond the annotated PATCH 9, I've moved to a versioning scheme that
could also be reused for newidle balancing with stats invalidation,
and that should help reuse stats more. There are some numbers in the
empty PATCH 9.
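
Purely as an illustration (not necessarily what the newer patches do), a
seqcount-style versioning of the cached stats would let a reader detect a
concurrent writer and retry, roughly:

/* Illustrative sketch only; assumes lb_prop grows a seqcount_t member */
unsigned int seq;

do {
	seq = read_seqcount_begin(&lb_prop->seq);
	*sg_stats = lb_prop->sg_stats;
} while (read_seqcount_retry(&lb_prop->seq, seq));

with the writer wrapping its update in write_seqcount_begin()/end().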

-- 
Thanks and Regards,
Prateek

> thanks,
> Chenyu
> 
> 
>> The current implementation, without invalidation of cached stats, has a few
>> limitations, namely that stats reuse is limited to busy load balancing since
>> stats can only be updated once per jiffy. Newidle balance can happen
>> frequently and concurrently on many CPUs, which can result in readers seeing
>> inconsistent values for the propagated stats.
>>
>> For this iteration, the focus is to reduce the time taken for busy load
>> balancing, allowing the busy CPU to resume running the task as quickly as
>> possible.
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---