Balancing the number of idle CPUs between groups is done if:
- the busiest group is overloaded: sum_nr_running > #CPUs
- the local group has spare capacity: sum_nr_running <= #CPUs
To avoid pulling too many tasks and moving the imbalance to the
local group, the number of tasks pulled is half of:
(local->idle_cpus - busiest->idle_cpus)
Halving the imbalance currently leads to the following scenario.
On a Juno with 2 clusters, CLU0 with 4 CPUs and CLU1 with 2 CPUs, and
6 long-running tasks:
- 1 task runs on the 2-CPU cluster
- 5 tasks run on the 4-CPU cluster
Running the load balancer from the idle CPU (in CLU1):
- Local group:   CLU1: idle_cpus=1; nr_running=1; type=group_has_spare
- Busiest group: CLU0: idle_cpus=0; nr_running=5; type=group_overloaded
Half of (local->idle_cpus - busiest->idle_cpus), i.e. (1 - 0) >> 1, is 0.
No task is migrated and the imbalanced task placement persists.
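For reference, a minimal userspace sketch of the above arithmetic (it only
mirrors the calculation described here, it is not the fair.c code itself):

#include <stdio.h>

/* Current behaviour: even out the number of idle CPUs, then halve the result. */
static unsigned int halved_idle_imbalance(unsigned int local_idle,
					  unsigned int busiest_idle)
{
	unsigned int imbalance = 0;

	if (local_idle > busiest_idle)
		imbalance = local_idle - busiest_idle;

	return imbalance >> 1;
}

int main(void)
{
	/* CLU1 (local) has 1 idle CPU, CLU0 (busiest) has 0 idle CPUs. */
	printf("imbalance = %u\n", halved_idle_imbalance(1, 0)); /* prints 0 */
	return 0;
}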
Balancing the number of idle CPUs is only relevant if the busiest group
has idle CPUs. Otherwise it is better to have an equal ratio
of #tasks / #CPUs.
sibling_imbalance() was introduced by commit 7ff1693236f5 ("sched/fair:
Implement prefer sibling imbalance calculation between asymmetric
groups") to cope with groups of asymmetric sizes, which is also the
case here.
Try to stay conservative and only balance the ratio
of #tasks / #CPUs if the busiest group has no idle CPUs.
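As an illustration, a simplified userspace sketch of this #tasks / #CPUs
balancing applied to the Juno case above (it deliberately omits the rounding
and corner cases handled by the real sibling_imbalance(), so it is not the
exact kernel implementation):

#include <stdio.h>

/*
 * Move enough tasks so that busiest and local end up with roughly the
 * same nr_running / #CPUs ratio.
 */
static long ratio_imbalance(long busiest_nr, long busiest_cpus,
			    long local_nr, long local_cpus)
{
	long diff = local_cpus * busiest_nr - busiest_cpus * local_nr;

	if (diff <= 0)
		return 0;

	return diff / (local_cpus + busiest_cpus);
}

int main(void)
{
	/* CLU0 (busiest): 5 tasks on 4 CPUs, CLU1 (local): 1 task on 2 CPUs. */
	printf("imbalance = %ld\n", ratio_imbalance(5, 4, 1, 2)); /* prints 1 */
	return 0;
}

Pulling that one task leaves 4 tasks on the 4-CPU cluster and 2 tasks on the
2-CPU cluster.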
Note that a similar check present in update_pick_idlest()
is not updated.
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
---
kernel/sched/fair.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa14a9982b9f1..9dac3536d9c19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11235,20 +11235,18 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
return;
}
- if (busiest->group_weight == 1 || sds->prefer_sibling) {
+ env->migration_type = migrate_task;
+ if (busiest->group_weight == 1 || sds->prefer_sibling || !busiest->idle_cpus) {
/*
- * When prefer sibling, evenly spread running tasks on
- * groups.
+ * When prefer sibling, or when busiest has no idle CPU,
+ * evenly spread running tasks on groups.
*/
- env->migration_type = migrate_task;
env->imbalance = sibling_imbalance(env, sds, busiest, local);
} else {
-
/*
* If there is no overload, we just want to even the number of
* idle CPUs.
*/
- env->migration_type = migrate_task;
env->imbalance = local->idle_cpus;
lsub_positive(&env->imbalance, busiest->idle_cpus);
}
--
2.43.0
Hello Pierre,
On 2/5/2026 8:38 PM, Pierre Gondois wrote:
> Halving the imbalance currently leads to the following scenario.
> On a Juno with 2 clusters, CLU0 with 4 CPUs and CLU1 with 2 CPUs, and
> 6 long-running tasks:
> - 1 task runs on the 2-CPU cluster
> - 5 tasks run on the 4-CPU cluster
> Running the load balancer from the idle CPU (in CLU1):
> - Local group:   CLU1: idle_cpus=1; nr_running=1; type=group_has_spare
> - Busiest group: CLU0: idle_cpus=0; nr_running=5; type=group_overloaded
> Half of (local->idle_cpus - busiest->idle_cpus), i.e. (1 - 0) >> 1, is 0.
> No task is migrated and the imbalanced task placement persists.
...
> ---
> kernel/sched/fair.c | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aa14a9982b9f1..9dac3536d9c19 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11235,20 +11235,18 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> return;
> }
>
> - if (busiest->group_weight == 1 || sds->prefer_sibling) {
> + env->migration_type = migrate_task;
> + if (busiest->group_weight == 1 || sds->prefer_sibling || !busiest->idle_cpus) {
I suppose you also have SD_ASYM_CPUCAPACITY set on your sd, which is why
"sds->prefer_sibling" is false here.
Instead of checking for "busiest->idle_cpus", would it make sense to
enter this case for sibling_imbalance() when we have:
capacity_greater(capacity_of(env->dst_cpu), sds->busiest->sgc->min_capacity)
since it could very well be the case that the smaller cluster is
actually idle because task_fits_cpu() returned false for CPUs there?
I couldn't actually spot any case where we compare the capacities
of the local and busiest groups for <= fully_loaded, but let me know if
I've missed something.
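Something along these lines is what I had in mind (untested, only to
illustrate the alternative condition, not a proposed implementation):

	if (busiest->group_weight == 1 || sds->prefer_sibling ||
	    capacity_greater(capacity_of(env->dst_cpu),
			     sds->busiest->sgc->min_capacity)) {
		...
	}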
> /*
> - * When prefer sibling, evenly spread running tasks on
> - * groups.
> + * When prefer sibling, or when busiest has no idle CPU,
> + * evenly spread running tasks on groups.
> */
> - env->migration_type = migrate_task;
> env->imbalance = sibling_imbalance(env, sds, busiest, local);
I'm slightly skeptical of spreading the tasks evenly without considering
the capacity difference when we are on SD_ASYM_CPUCAPACITY. I suppose
we'll filter out the target in sched_balance_find_src_rq() and bail out
if we only see lower-capacity CPUs in the busiest group.
> } else {
> -
> /*
> * If there is no overload, we just want to even the number of
> * idle CPUs.
> */
> - env->migration_type = migrate_task;
> env->imbalance = local->idle_cpus;
> lsub_positive(&env->imbalance, busiest->idle_cpus);
> }
--
Thanks and Regards,
Prateek