[PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS

Posted by Vincent Guittot 1 year, 3 months ago
With EAS, a group should be set overloaded only if at least 1 CPU in the
group is overutilized, but it can happen that a CPU is fully utilized by
tasks only because the compute capacity of the CPU is clamped. In such a
case, the CPU is not overutilized and, as a result, the group should not
be set overloaded either.
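
For context, cpu_overutilized() (paraphrased below; the exact form varies
by kernel version) only reports a CPU as overutilized when its utilization,
once the rq's uclamp aggregates are applied, no longer fits the CPU's
capacity:

static inline int cpu_overutilized(int cpu)
{
	unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
	unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);

	/* Return true only if the utilization doesn't fit CPU's capacity */
	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
}

So a CPU saturated by tasks clamped with a low UCLAMP_MAX stays "not
overutilized" even though it has no spare cycles.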

group_overloaded being a higher priority than group_misfit, such a group
can be selected as the busiest group instead of the group with a misfit
task, which prevents load_balance from selecting the CPU with the misfit
task and pulling the latter onto a fitting CPU.
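
For reference, busiest-group selection ranks group types in ascending
order of priority; a simplified copy of the enum from kernel/sched/fair.c
(comments trimmed, exact members vary by kernel version) shows why
group_overloaded wins over group_misfit_task in update_sd_pick_busiest():

enum group_type {
	group_has_spare = 0,
	group_fully_busy,
	group_misfit_task,	/* a task doesn't fit the CPU's capacity */
	group_smt_balance,
	group_asym_packing,
	group_imbalanced,
	group_overloaded,	/* highest priority: beats group_misfit_task */
};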

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea057b311f6..e67d6029b269 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9806,6 +9806,7 @@ struct sg_lb_stats {
 	enum group_type group_type;
 	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
 	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
+	unsigned long group_overutilized;	/* No CPU is overutilized in the group */
 	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
@@ -10039,6 +10040,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 static inline bool
 group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 {
+	/*
+	 * With EAS and uclamp, 1 CPU in the group must be overutilized to
+	 * consider the group overloaded.
+	 */
+	if (sched_energy_enabled() && !sgs->group_overutilized)
+		return false;
+
 	if (sgs->sum_nr_running <= sgs->group_weight)
 		return false;
 
@@ -10252,8 +10260,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (nr_running > 1)
 			*sg_overloaded = 1;
 
-		if (cpu_overutilized(i))
+		if (cpu_overutilized(i)) {
 			*sg_overutilized = 1;
+			sgs->group_overutilized = 1;
+		}
 
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
-- 
2.34.1
Re: [PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS
Posted by Pierre Gondois 1 year, 3 months ago
Hello Vincent,

I have been trying this patch with the following workload, on a Pixel6
(4 littles, 2 mid, 2 big):
a. 5 tasks with: [UCLAMP_MIN:0, UCLAMP_MAX:1, duty_cycle=100%, cpuset:0-2]
b. 1 task with: [duty_cycle=100%, cpuset:0-7] but starting on CPU4

a.
There are many UCLAMP_MAX tasks so as to pass the following condition
and tag the group as overloaded:
group_is_overloaded()
\-(sgs->sum_nr_running <= sgs->group_weight)
These tasks should put the group in an overloaded, but not overutilized,
state.

b. to see if a CPU-bound task is migrated to the big cluster; a minimal
sketch approximating the a. tasks is given below.
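
For reference, one of the a. tasks can be approximated with the small C
sketch below. This is only an illustration (not the exact tool used for
the test): the struct sched_attr layout and the SCHED_FLAG_UTIL_CLAMP_*
flags come from sched_setattr(2), and the cpuset is emulated with
sched_setaffinity():

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* struct sched_attr as documented in sched_setattr(2), uclamp fields incl. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

#define SCHED_FLAG_KEEP_POLICY		0x08
#define SCHED_FLAG_KEEP_PARAMS		0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		/* only set the clamps, keep the current policy/params */
		.sched_flags	= SCHED_FLAG_KEEP_POLICY |
				  SCHED_FLAG_KEEP_PARAMS |
				  SCHED_FLAG_UTIL_CLAMP_MIN |
				  SCHED_FLAG_UTIL_CLAMP_MAX,
		.sched_util_min	= 0,	/* UCLAMP_MIN:0 */
		.sched_util_max	= 1,	/* UCLAMP_MAX:1 */
	};
	cpu_set_t set;

	CPU_ZERO(&set);			/* cpuset:0-2 */
	CPU_SET(0, &set);
	CPU_SET(1, &set);
	CPU_SET(2, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		return 1;
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		return 1;

	for (;;)			/* duty_cycle=100%: spin forever */
		;
}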

---
- Without patch 5 [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
- Without this patch
The migration does effectively happen, due to the load_balancer selecting
the little cluster over the mid cluster.
The little cluster puts the system in an overutilized state.

---
- Without patch 5 [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
- With this patch
The load_balancer effectively selects the medium cluster over the little
cluster (since none of the little CPUs is overutilized), and migrates
task b. to a big CPU.

Note:
This is true most of the time, but whenever a non-UCLAMP_MAX task wakes up
on one of CPU0-3 (where the UCLAMP_MAX tasks are pinned), the cluster
becomes overutilized and the new mechanism is bypassed.
The same thing happens if a task with [UCLAMP_MIN:0, UCLAMP_MAX:1024,
duty_cycle=100%, cpuset:0] is added to the workload.

---
- With patch 5 [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
- Without this patch

The task b. gets an opportunity to migrate to a big CPU through the sched_tick.
However, when both patches are applied, the migration is triggered by the
load_balancer.

---
So FWIW, from a mechanism PoV and independently from patch 5:
Tested-by: Pierre Gondois <pierre.gondois@arm.com>


On 8/30/24 15:03, Vincent Guittot wrote:
> With EAS, a group should be set overloaded only if at least 1 CPU in the
> group is overutilized, but it can happen that a CPU is fully utilized by
> tasks only because the compute capacity of the CPU is clamped. In such a
> case, the CPU is not overutilized and, as a result, the group should not
> be set overloaded either.
> 
> group_overloaded being a higher priority than group_misfit, such a group
> can be selected as the busiest group instead of the group with a misfit
> task, which prevents load_balance from selecting the CPU with the misfit
> task and pulling the latter onto a fitting CPU.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   kernel/sched/fair.c | 12 +++++++++++-
>   1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fea057b311f6..e67d6029b269 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9806,6 +9806,7 @@ struct sg_lb_stats {
>   	enum group_type group_type;
>   	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
>   	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
> +	unsigned long group_overutilized;	/* No CPU is overutilized in the group */
>   	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
>   #ifdef CONFIG_NUMA_BALANCING
>   	unsigned int nr_numa_running;
> @@ -10039,6 +10040,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>   static inline bool
>   group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>   {
> +	/*
> +	 * With EAS and uclamp, 1 CPU in the group must be overutilized to
> +	 * consider the group overloaded.
> +	 */
> +	if (sched_energy_enabled() && !sgs->group_overutilized)
> +		return false;
> +
>   	if (sgs->sum_nr_running <= sgs->group_weight)
>   		return false;
>   
> @@ -10252,8 +10260,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   		if (nr_running > 1)
>   			*sg_overloaded = 1;
>   
> -		if (cpu_overutilized(i))
> +		if (cpu_overutilized(i)) {
>   			*sg_overutilized = 1;
> +			sgs->group_overutilized = 1;
> +		}
>   
>   #ifdef CONFIG_NUMA_BALANCING
>   		sgs->nr_numa_running += rq->nr_numa_running;
Re: [PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS
Posted by Hongyan Xia 1 year, 3 months ago
On 30/08/2024 14:03, Vincent Guittot wrote:
> With EAS, a group should be set overloaded only if at least 1 CPU in the
> group is overutilized, but it can happen that a CPU is fully utilized by
> tasks only because the compute capacity of the CPU is clamped. In such a
> case, the CPU is not overutilized and, as a result, the group should not
> be set overloaded either.
> 
> group_overloaded being a higher priority than group_misfit, such a group
> can be selected as the busiest group instead of the group with a misfit
> task, which prevents load_balance from selecting the CPU with the misfit
> task and pulling the latter onto a fitting CPU.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   kernel/sched/fair.c | 12 +++++++++++-
>   1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fea057b311f6..e67d6029b269 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9806,6 +9806,7 @@ struct sg_lb_stats {
>   	enum group_type group_type;
>   	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
>   	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
> +	unsigned long group_overutilized;	/* No CPU is overutilized in the group */

Does this have to be unsigned long? I think a narrower type like bool
(or int, to be consistent with the other fields) better expresses the
intention.

Also, the comment is a bit confusing to me. All the other fields' comments
are phrased positively, but this one's is phrased in a negative tone.

>   	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
>   #ifdef CONFIG_NUMA_BALANCING
>   	unsigned int nr_numa_running;
> @@ -10039,6 +10040,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>   static inline bool
>   group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>   {
> +	/*
> +	 * With EAS and uclamp, 1 CPU in the group must be overutilized to
> +	 * consider the group overloaded.
> +	 */
> +	if (sched_energy_enabled() && !sgs->group_overutilized)
> +		return false;
> +
>   	if (sgs->sum_nr_running <= sgs->group_weight)
>   		return false;
>   
> @@ -10252,8 +10260,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   		if (nr_running > 1)
>   			*sg_overloaded = 1;
>   
> -		if (cpu_overutilized(i))
> +		if (cpu_overutilized(i)) {
>   			*sg_overutilized = 1;
> +			sgs->group_overutilized = 1;
> +		}
>   
>   #ifdef CONFIG_NUMA_BALANCING
>   		sgs->nr_numa_running += rq->nr_numa_running;
Re: [PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS
Posted by Vincent Guittot 1 year, 3 months ago
On Mon, 2 Sept 2024 at 11:01, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>
> On 30/08/2024 14:03, Vincent Guittot wrote:
> > With EAS, a group should be set overloaded only if at least 1 CPU in the
> > group is overutilized, but it can happen that a CPU is fully utilized by
> > tasks only because the compute capacity of the CPU is clamped. In such a
> > case, the CPU is not overutilized and, as a result, the group should not
> > be set overloaded either.
> >
> > group_overloaded being a higher priority than group_misfit, such a group
> > can be selected as the busiest group instead of the group with a misfit
> > task, which prevents load_balance from selecting the CPU with the misfit
> > task and pulling the latter onto a fitting CPU.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >   kernel/sched/fair.c | 12 +++++++++++-
> >   1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index fea057b311f6..e67d6029b269 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9806,6 +9806,7 @@ struct sg_lb_stats {
> >       enum group_type group_type;
> >       unsigned int group_asym_packing;        /* Tasks should be moved to preferred CPU */
> >       unsigned int group_smt_balance;         /* Task on busy SMT be moved */
> > +     unsigned long group_overutilized;       /* No CPU is overutilized in the group */
>
> Does this have to be unsigned long? I think a narrower type like bool
> (or int, to be consistent with the other fields) better expresses the
> intention.

Yes, an unsigned int is enough.

>
> Also, the comment is a bit confusing to me. All the other fields' comments
> are phrased positively, but this one's is phrased in a negative tone.

That comes from the first way I implemented it; I then forgot to update
the comment. It should be:
/* At least one CPU is overutilized in the group */

>
> >       unsigned long group_misfit_task_load;   /* A CPU has a task too big for its capacity */
> >   #ifdef CONFIG_NUMA_BALANCING
> >       unsigned int nr_numa_running;
> > @@ -10039,6 +10040,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
> >   static inline bool
> >   group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
> >   {
> > +     /*
> > +      * With EAS and uclamp, 1 CPU in the group must be overutilized to
> > +      * consider the group overloaded.
> > +      */
> > +     if (sched_energy_enabled() && !sgs->group_overutilized)
> > +             return false;
> > +
> >       if (sgs->sum_nr_running <= sgs->group_weight)
> >               return false;
> >
> > @@ -10252,8 +10260,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >               if (nr_running > 1)
> >                       *sg_overloaded = 1;
> >
> > -             if (cpu_overutilized(i))
> > +             if (cpu_overutilized(i)) {
> >                       *sg_overutilized = 1;
> > +                     sgs->group_overutilized = 1;
> > +             }
> >
> >   #ifdef CONFIG_NUMA_BALANCING
> >               sgs->nr_numa_running += rq->nr_numa_running;