[PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0

Posted by Qais Yousef 2 years, 7 months ago
When uclamp_max is being used, the util of the task could be higher than
the spare capacity of the CPU, but due to the uclamp_max value we
force-fit it there.

The way the condition for checking max_spare_cap in
find_energy_efficient_cpu() was constructed, it ignored any CPU whose
spare_cap is less than or _equal_ to max_spare_cap. Since we initialize
max_spare_cap to 0, this led to never setting max_spare_cap_cpu and
hence never performing compute_energy() for this cluster, missing an
opportunity for a more energy-efficient placement that honours the
uclamp_max setting.

	max_spare_cap = 0;
	cpu_cap = capacity_of(cpu) - task_util(p);  // 0 if task_util(p) is high

	...

	util_fits_cpu(...);		// will return true if uclamp_max forces it to fit

	...

	// this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
	if (cpu_cap > max_spare_cap) {
		max_spare_cap = cpu_cap;
		max_spare_cap_cpu = cpu;
	}

prev_spare_cap suffers from a similar problem.

Fix the logic by converting the variables to long and treating a value
of -1 as 'not populated' instead of 0, which is a viable and correct
spare capacity value.

Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/sched/fair.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6c8e7f52935..7a21ee74139f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7382,11 +7382,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	for (; pd; pd = pd->next) {
 		unsigned long util_min = p_util_min, util_max = p_util_max;
 		unsigned long cpu_cap, cpu_thermal_cap, util;
-		unsigned long cur_delta, max_spare_cap = 0;
+		long prev_spare_cap = -1, max_spare_cap = -1;
 		unsigned long rq_util_min, rq_util_max;
-		unsigned long prev_spare_cap = 0;
+		unsigned long cur_delta, base_energy;
 		int max_spare_cap_cpu = -1;
-		unsigned long base_energy;
 		int fits, max_fits = -1;
 
 		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
@@ -7461,7 +7460,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			}
 		}
 
-		if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
+		if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
 			continue;
 
 		eenv_pd_busy_time(&eenv, cpus, p);
@@ -7469,7 +7468,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		base_energy = compute_energy(&eenv, pd, cpus, p, -1);
 
 		/* Evaluate the energy impact of using prev_cpu. */
-		if (prev_spare_cap > 0) {
+		if (prev_spare_cap > -1) {
 			prev_delta = compute_energy(&eenv, pd, cpus, p,
 						    prev_cpu);
 			/* CPU utilization has changed */
-- 
2.25.1
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Vincent Guittot 2 years, 7 months ago
On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:
> [...]

I think that you also need the change below to make sure that the
signed comparison will be used. I have quickly checked the assembly
code for aarch64 and your patch keeps using an unsigned comparison (b.ls):

   ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
ffff8000080e4c94: f94067e0 ldr x0, [sp, #200]
ffff8000080e4c98: eb00003f cmp x1, x0
ffff8000080e4c9c: 54ff98a9 b.ls ffff8000080e3fb0 <select_task_rq_fair+0x570>  // b.plast

Whereas the change below makes it use the signed version (b.le):

   ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
ffff8000080e4c94: f94067e0 ldr x0, [sp, #200]
ffff8000080e4c98: eb00003f cmp x1, x0
ffff8000080e4c9c: 54ff98ad b.le ffff8000080e3fb0 <select_task_rq_fair+0x570>

-- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7522,7 +7522,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
                                prev_spare_cap = cpu_cap;
                                prev_fits = fits;
                        } else if ((fits > max_fits) ||
-                                  ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
+                                  ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
                                /*
                                 * Find the CPU with the maximum spare capacity
                                 * among the remaining CPUs in the performance
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Qais Yousef 2 years, 7 months ago
On 02/07/23 10:45, Vincent Guittot wrote:
> [...]

Isn't it better to go back to the v1 form then? The inconsistent type
paired with the cast is getting too ugly for me :(

I don't think we can convert cpu_cap to long without having to do more work as
it is used with 'util'.


Cheers

--
Qais Yousef
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Hongyan Xia 2 years, 3 months ago
Hi Qais,

On 2023-02-11 17:28, Qais Yousef wrote:
> On 02/07/23 10:45, Vincent Guittot wrote:
>> [...]
> 
> > Isn't it better to go back to the v1 form then? The inconsistent type
> > paired with the cast is getting too ugly for me :(
> 
> I don't think we can convert cpu_cap to long without having to do more work as
> it is used with 'util'.

Sorry if I'm missing something obvious, but why can't we convert util to
long? The only place I see that mixes cpu_cap with util is

	lsub_positive(&cpu_cap, util);

but at this point both cpu_cap and util should have sane values (top bit 
not set) so doing clamped subtraction should just work fine?
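
For reference, lsub_positive() is the clamped subtraction helper in
kernel/sched/fair.c; roughly (a sketch, see the kernel source for the
exact definition):

	/*
	 * Subtract _val from *_ptr, clamping at zero so an unsigned
	 * value cannot underflow.
	 */
	#define lsub_positive(_ptr, _val) do {			\
		typeof(_ptr) ptr = (_ptr);			\
		*ptr -= min_t(typeof(*ptr), *ptr, _val);	\
	} while (0)

so lsub_positive(&cpu_cap, util) yields cpu_cap - util when
cpu_cap >= util, and 0 otherwise.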

Hongyan
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Qais Yousef 2 years, 2 months ago
On 06/07/23 12:50, Hongyan Xia wrote:
> [...]

I have this series completely paged out now. I'll consider this comment
when I prepare v3 - I hope to have some downtime in a few weeks to send
some fixes I have been sitting on for a while.


Cheers

--
Qais Yousef
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Qais Yousef 2 years, 1 month ago
On 06/07/23 12:50, Hongyan Xia wrote:
> [...]

I think none of the options are very pretty, but the least evil is what
Vincent suggested: simply do the cast in this one place. Util is used in
more places and is assumed to be unsigned long, so the chances of unhappy
accidents are higher, and I don't feel like auditing more code for
correctness given we have a simple, straightforward alternative now :)


Thanks

--
Qais Yousef
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Vincent Guittot 2 years, 6 months ago
On Sat, 11 Feb 2023 at 18:28, Qais Yousef <qyousef@layalina.io> wrote:
> [...]
>
> Isn't it better to go back to the v1 form then? The inconsistent type
> paired with the cast is getting too ugly for me :(

Casting the cpu capacity to a long in the condition was a good way to
fix this unsigned/signed comparison and make it consistent with the use
of -1 as the default value IMO:
    ((long)cpu_cap > max_spare_cap)
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Qais Yousef 2 years, 6 months ago
On 02/20/23 18:02, Vincent Guittot wrote:
> [...]
> Casting the cpu capacity to a long in the condition was a good way to
> fix this unsigned/signed comparison and make it consistent with the use
> of -1 as the default value IMO:
>     ((long)cpu_cap > max_spare_cap)

Fair enough. Our boolean brains differ :-) I'll add the cast.

Do you see the energy calculation issue Dietmar was trying to highlight? As it
stands I only have boolean tweaks to do for v3.


Cheers

--
Qais Yousef

Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Vincent Guittot 2 years, 6 months ago
On Tue, 21 Feb 2023 at 13:05, Qais Yousef <qyousef@layalina.io> wrote:
> [...]
> Fair enough. Our boolean brains differ :-) I'll add the cast.
>
> Do you see the energy calculation issue Dietmar was trying to highlight? As it
> stands I only have boolean tweaks to do for v3.

I haven't looked too much at the uclamp_max impact on energy
calculation. Nevertheless, I wonder if one solution could be to not
clamp the utilization to the cpu max capacity in this case. The fact
that utilization can go above cpu capacity when we clamp its frequency
reflects that it would need more compute capacity or it will run longer.
I will try to look more deeply into this use case.


Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Qais Yousef 2 years, 6 months ago
On 02/22/23 11:59, Vincent Guittot wrote:

> [...]

Okay thanks!


--
Qais Yousef
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Dietmar Eggemann 2 years, 7 months ago
On 07/02/2023 10:45, Vincent Guittot wrote:
> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:
>> [...]
>>         util_fits_cpu(...);             // will return true if uclamp_max forces it to fit

s/true/1/ ?

>> [...]
>> Fix the logic by converting the variables to long and treating a value
>> of -1 as 'not populated' instead of 0, which is a viable and correct
>> spare capacity value.

The issue I see here is that in case we don't have any spare capacity
left, the energy calculation (in terms of CPU utilization) isn't correct
anymore.

Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()`
you never know how big the `busy_time` for the PD really is at this
moment.

eenv_pd_busy_time()

  for_each_cpu(cpu, pd_cpus)
    busy_time += effective_cpu_util(..., ENERGY_UTIL, ...)
    ^^^^^^^^^

with:

  sum_util = min(busy_time + task_busy_time, pd_cap)
                 ^^^^^^^^^

  freq = (1.25 * max_util / max) * max_freq

  energy = (perf_state(freq)->cost / max) * sum_util


energy is no longer related to CPU utilization (since there is no idle
time/spare capacity left).

So EAS keeps packing on the cheaper PD/clamped OPP.
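
To make the saturation concrete, here are hypothetical numbers (chosen
only for illustration) plugged into the formulas above:

	pd_cap         = 1784;	/* e.g. 4 little CPUs x 446 */
	busy_time      = 1700;	/* PD almost fully busy already */
	task_busy_time = 900;

	sum_util = min(busy_time + task_busy_time, pd_cap);	/* = 1784 */

Once busy_time approaches pd_cap, placing the task raises sum_util by at
most pd_cap - busy_time (84 here) instead of the task's full 900, so the
energy delta computed for the saturated PD is artificially small and the
PD keeps winning the placement comparison.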

E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440
tasks all running on little PD, cpumask=0x39. The big PD is idle but
never beats prev_cpu in terms of energy consumption for the wakee.

[...]
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Qais Yousef 2 years, 7 months ago
On 02/09/23 19:02, Dietmar Eggemann wrote:
> On 07/02/2023 10:45, Vincent Guittot wrote:
> [...]
> 
> The issue I see here is that in case we don't have any spare capacity left,
> the energy calculation (in terms of CPU utilization) isn't correct anymore.
> 
> Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()`
> you never know how big the `busy_time` for the PD really is in this moment.
> 
> eenv_pd_busy_time()
> 
>   for_each_cpu(cpu, pd_cpus)
>     busy_time += effective_cpu_util(..., ENERGY_UTIL, ...)
>     ^^^^^^^^^
> 
> with:
> 
>   sum_util = min(busy_time + task_busy_time, pd_cap)
>                  ^^^^^^^^^
> 
>   freq = (1.25 * max_util / max) * max_freq
> 
>   energy = (perf_state(freq)->cost / max) * sum_util
> 
> 
> energy is no longer related to CPU utilization (since there is no idle
> time/spare capacity left).

Am I right that what you're saying is that the energy calculation for the PD
will be capped to a certain value and this is why you think the energy is
incorrect?

What should be the correct energy (in theory at least)?

> 
> So EAS keeps packing on the cheaper PD/clamped OPP.

Which is the desired behavior for uclamp_max?

The only issue I see is that we want to distribute within a pd, which is
something I was going to work on and send later - but I can lump it into
this series if it helps.

> 
> E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440
> tasks all running on little PD, cpumask=0x39. The big PD is idle but
> never beats prev_cpu in terms of energy consumption for the wakee.

IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
is twofold:

	1. Prevent tasks from consuming too much energy.
	2. Keep them away from expensive CPUs.

2 is actually very important for 2 reasons:

	a. Because of max aggregation - any uncapped task that wakes up will
	   cause a frequency spike on this 'expensive' cpu. We don't have
	   a mechanism to downmigrate it - which is another thing I'm working
	   on.
	b. It is desired to keep these bigger cpus idle and ready for more
	   important work.

For 2, generally we don't want these tasks to steal bandwidth from the CPUs
that we'd like to preserve for other types of work.

Of course userspace has control by selecting the right uclamp_max value. They
can increase it to allow a spill to the next pd - or keep it low to steer the
tasks more strongly onto a specific pd.


Cheers

--
Qais Yousef
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Hongyan Xia 2 years, 3 months ago
Hi Qais,

On 2023-02-11 17:50, Qais Yousef wrote:
> [...]
>>
>> So EAS keeps packing on the cheaper PD/clamped OPP.
> 
> Which is the desired behavior for uclamp_max?
> 
> The only issue I see is that we want to distribute within a pd. Which is
> something I was going to work on and send after later - but can lump it in this
> series if it helps.

I more or less share the same concern as Dietmar, which is packing
things on the same small CPU when everyone has a spare cpu_cap of 0.

I wonder if this could be useful: On the side of cfs_rq->avg.util_avg,
we have a cfs_rq->avg.util_avg_uclamp_max. It keeps track of util_avg,
but each task on the rq is capped at its uclamp_max value, so even if
there are two always-running tasks with uclamp_max values of 100 and no
idle time, the cfs_rq only sees cpu_util() of 200 and still has a
remaining capacity of 1024 - 200, not 0. This also helps balance the
load when rqs have no idle time. Even if two CPUs both have no idle
time, with one running a single task clamped at 100 and the other
running 2 such tasks, the first sees a remaining capacity of 1024 - 100
while the 2nd sees 1024 - 200, so we still prefer the first one.
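
A rough sketch of that bookkeeping (the names below are hypothetical,
not an existing kernel API, and the iterator is pseudocode):

	/*
	 * Hypothetical per-rq signal: like util_avg, but each task's
	 * contribution is capped at its effective uclamp_max.
	 */
	static unsigned long cpu_util_uclamp_max(struct rq *rq)
	{
		struct task_struct *p;
		unsigned long sum = 0;

		for_each_rq_cfs_task(p, rq)	/* hypothetical iterator */
			sum += min(task_util(p),
				   uclamp_eff_value(p, UCLAMP_MAX));

		return sum;
	}

	/* Spare capacity then stays meaningful even with no idle time: */
	spare_cap = capacity_of(cpu) -
		    min(cpu_util_uclamp_max(rq), capacity_of(cpu));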

And I wonder if this could also help calculating energy when there's no 
idle time under uclamp_max. Instead of seeing a util_avg at 1024, we 
actually see a lower value. This is also what cpu_util_next() does in 
Android's sum aggregation, but I'm thinking of maintaining it right 
beside util_avg so that we don't have to sum up everything every time.

Hongyan
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Qais Yousef 2 years, 2 months ago
On 06/07/23 15:52, Hongyan Xia wrote:
> [...]

If I understood correctly, you're suggesting doing accounting of the sum
of uclamp_max for all the enqueued tasks?

I think we discussed this in the past. I can't remember the details now,
but adding additional accounting seemed undesirable.

And I had an issue with treating uclamp_max as a bandwidth hint rather
than a performance-requirement hint. Limiting a task to 200 means it
can't run faster than this, but it doesn't mean it is not allowed to
consume more bandwidth than 200. Nice values and the cfs bandwidth
controller should be used for that.

> And I wonder if this could also help calculating energy when there's no idle
> time under uclamp_max. Instead of seeing a util_avg at 1024, we actually see
> a lower value. This is also what cpu_util_next() does in Android's sum
> aggregation, but I'm thinking of maintaining it right beside util_avg so
> that we don't have to sum up everything every time.

I haven't thought about how to improve the EM calculations to be honest;
I see this as a secondary problem compared to the other issue we need to
fix first.

It seems load_avg can grow without bound; can you look at using this
signal to distribute within a cluster, and as a hint that we might be
better off spilling to others if they're already running at a perf level
<= uclamp_max?


Thanks

--
Qais Yousef
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Lukasz Luba 2 years, 3 months ago
Hi Qais,

I have a question regarding the 'soft cpu affinity'.

On 2/11/23 17:50, Qais Yousef wrote:
> [...]
> IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
> is twofold:
> 
> 	1. Prevent tasks from consuming too much energy.
> 	2. Keep them away from expensive CPUs.
> 
> 2 is actually very important for 2 reasons:
> 
> 	a. Because of max aggregation - any uncapped task that wakes up will
> 	   cause a frequency spike on this 'expensive' cpu. We don't have
> 	   a mechanism to downmigrate it - which is another thing I'm working
> 	   on.
> 	b. It is desired to keep these bigger cpus idle and ready for more
> 	   important work.
> 
> For 2, generally we don't want these tasks to steal bandwidth from the CPUs
> that we'd like to preserve for other types of work.

I'm a bit afraid of such a 'strong force'. That means the task would
not go via EAS if we set uclamp_max to e.g. 90 while the little capacity
is 125. Or am I missing something?

This might effectively use more energy for those tasks which can run on
any CPU and for which EAS would figure out a good energy placement. I'm
worried about this, since we have L3+littles in one DVFS domain and the
L3 will only get bigger in the future.

IMO to keep the big cpus idle more, we should give them a big energy
wake-up cost. That's the 3rd feature to the EM I presented at OSPM2023.

> 
> Of course userspace has control by selecting the right uclamp_max value. They
> can increase it to allow a spill to the next pd - or keep it low to steer the
> tasks more strongly onto a specific pd.

This would be interesting to see in practice. I think we need such an
experiment for such changes.

Regards,
Lukasz
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Dietmar Eggemann 2 years, 7 months ago
On 11/02/2023 18:50, Qais Yousef wrote:
> On 02/09/23 19:02, Dietmar Eggemann wrote:
>> On 07/02/2023 10:45, Vincent Guittot wrote:
>>> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:

[...]

>> [...]
> 
> Am I right that what you're saying is that the energy calculation for the PD
> will be capped to a certain value and this is why you think the energy is
> incorrect?
> 
> What should be the correct energy (in theory at least)?

The energy value for the perf_state is correct but the decision based
on the `energy diff` as to which PD to put the task in might not be.

In case CPUx already has some tasks, its `pd_busy_time` contribution
is still capped by its capacity_orig:

eenv_pd_busy_time() -> cpu_util_next()
                         return min(util, capacity_orig_of(cpu))

In this case we can't use the `energy diff` anymore to make the decision
where to put the task.

The base_energy of CPUx's PD is already so high that the
`energy diff = cur_energy - base_energy` is small enough that CPUx
is selected as the target again.
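
Expressed with the calls used in the patch (illustrative only):

	base_energy = compute_energy(&eenv, pd, cpus, p, -1);
	cur_energy  = compute_energy(&eenv, pd, cpus, p, dst_cpu);

	/*
	 * With sum_util already capped at pd_cap, cur_energy barely
	 * exceeds base_energy, so the saturated PD keeps producing the
	 * smallest energy diff and keeps attracting the task.
	 */
	energy_diff = cur_energy - base_energy;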

>> So EAS keeps packing on the cheaper PD/clamped OPP.

Sometimes yes, but there are occurrences in which a big CPU ends up
with the majority of the tasks. It depends on the start condition.

> Which is the desired behavior for uclamp_max?

Not sure. Essentially, EAS can't do its job properly if there is no idle 
time left. As soon as uclamp_max restricts the system (like in my
example) task placement decisions via EAS (min energy diff based) can be
wrong. 

> The only issue I see is that we want to distribute within a pd. Which is
> something I was going to work on and send after later - but can lump it in this
> series if it helps.

I assume you refer to

    } else if ((fits > max_fits) ||
        ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {

here?

>> E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440
>> tasks all running on little PD, cpumask=0x39. The big PD is idle but
>> never beats prev_cpu in terms of energy consumption for the wakee.
> 
> IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
> is twofold:
> 
> 	1. Prevent tasks from consuming too much energy.
> 	2. Keep them away from expensive CPUs.

Like I tried to reason above, the energy-diff-based task placement
can't always ensure this.

> 
> 2 is actually very important for 2 reasons:
> 
> 	a. Because of max aggregation - any uncapped task that wakes up will
> 	   cause a frequency spike on this 'expensive' cpu. We don't have
> 	   a mechanism to downmigrate it - which is another thing I'm working
> 	   on.

True. And it can also lead to CPU overutilization which then leads to
load-balancing kicking in.

> 	b. It is desired to keep these bigger cpus idle and ready for more
> 	   important work.

But this is not happening all the time. Using the same workload
(6 8ms/16ms uclamp_max=440) on Juno-r0 [446 1024 1024 446 446 446]
sometimes ends up with all 6 tasks on big CPU1.

A corresponding EAS task placement looks like this one:

<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[1]=0 cpu_util[f|e]=[446 994]  
<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[2]=0 cpu_util[f|e]=[445 1004]

<idle>-0 [005] select_task_rq_fair: CPU5 cpu=1 busy_time=994 util=994 pd_cap=2048
<idle>-0 [005] select_task_rq_fair: CPU5 cpu=2 busy_time=1998 util=1004 pd_cap=2048

<idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=821866 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=1998 cpu_cap=1024
<idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=1  energy=842434 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=2048 cpu_cap=1024
                                                            ^^^^^^^^^^^^^
                                                            diff = 20,568

<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[0]=0 cpu_util[f|e]=[440 446] 
<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[3]=0 cpu_util[f|e]=[6 6]
<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[4]=0 cpu_util[f|e]=[440 446] 
<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[5]=0 cpu_util[f|e]=[440 446] 

<idle>-0 [005] select_task_rq_fair: CPU5 cpu=0 busy_time=446 util=446 pd_cap=1784
<idle>-0 [005] select_task_rq_fair: CPU5 cpu=3 busy_time=452 util=2 pd_cap=1784
<idle>-0 [005] select_task_rq_fair: CPU5 cpu=4 busy_time=898 util=446 pd_cap=1784
<idle>-0 [005] select_task_rq_fair: CPU5 cpu=5 busy_time=907 util=1 pd_cap=1784

<idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=242002 pd_busy_time=907 task_busy_time=921 max_util=440 busy_time=907 cpu_cap=446
<idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=5  energy=476000 pd_busy_time=907 task_busy_time=921 max_util=440 busy_time=1784 cpu_cap=446
                                                            ^^^^^^^^^^^^^
                                                            diff = 233,998

<idle>-0 [005] select_task_rq_fair: p=[task0-5 2551] target=1

> For 2, generally we don't want these tasks to steal bandwidth from these CPUs
> that we'd like to preserve for other types of work.
> 
> Of course userspace has control by selecting the right uclamp_max value. They
> can increase it to allow a spill to the next pd - or keep it low to steer them more
> strongly on a specific pd.

[...]
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Qais Yousef 2 years, 7 months ago
On 02/14/23 13:47, Dietmar Eggemann wrote:
> On 11/02/2023 18:50, Qais Yousef wrote:
> > On 02/09/23 19:02, Dietmar Eggemann wrote:
> >> On 07/02/2023 10:45, Vincent Guittot wrote:
> >>> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:
> 
> [...]
> 
> >>>> Fix the logic by converting the variables into long and treating -1
> >>>> value as 'not populated' instead of 0 which is a viable and correct
> >>>> spare capacity value.
> >>
> >> The issue I see here is in case we don't have any spare capacity left,
>> the energy calculation (in terms of CPU utilization) isn't correct anymore.
> >>
> >> Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()`
> >> you never know how big the `busy_time` for the PD really is in this moment.
> >>
> >> eenv_pd_busy_time()
> >>
> >>   for_each_cpu(cpu, pd_cpus)
> >>     busy_time += effective_cpu_util(..., ENERGY_UTIL, ...)
> >>     ^^^^^^^^^
> >>
> >> with:
> >>
> >>   sum_util = min(busy_time + task_busy_time, pd_cap)
> >>                  ^^^^^^^^^
> >>
> >>   freq = (1.25 * max_util / max) * max_freq
> >>
> >>   energy = (perf_state(freq)->cost / max) * sum_util
> >>
> >>
> >> energy is not related to CPU utilization anymore (since there is no idle
>> time/spare capacity left).
> > 
> > Am I right that what you're saying is that the energy calculation for the PD
> > will be capped to a certain value and this is why you think the energy is
> > incorrect?
> > 
> > What should be the correct energy (in theory at least)?
> 
> The energy value for the perf_state is correct but the decision based
> on the `energy diff` about which PD to put the task in might not be.
> 
> In case CPUx already has some tasks, its `pd_busy_time` contribution 
> is still capped by its capacity_orig.
> 
> eenv_pd_busy_time() -> cpu_util_next()
>                          return min(util, capacity_orig_of(cpu))
> 
> In this case we can't use `energy diff` anymore to make the decision 
> where to put the task.
> 
> The base_energy of CPUx's PD is already so high that the 
> `energy diff = cur_energy - base_energy` is small enough so that CPUx 
> is selected as target again.

I'm sorry, bear with me as I'm still struggling to fully understand the case.

You're thinking that if all the CPUs in the pd are _already_ fully busy before
adding this newly woken up task there, then we'll end up with the wrong
energy_diff? IOW, when the pd is fully loaded, then the cost of adding extra
load will always look cheap, is that what you're saying?

> 
> >> So EAS keeps packing on the cheaper PD/clamped OPP.
> 
> Sometimes yes, but there are occurrences in which a big CPU ends up
> with the majority of the tasks. It depends on the start condition.

It should depend on the energy cost, yes. Which does depend on the current
state of the system.

> 
> > Which is the desired behavior for uclamp_max?
> 
> Not sure. Essentially, EAS can't do its job properly if there is no idle 

This "not sure" statement is making me worried. Are you not sure about how
uclamp_max should work in force fitting big tasks into small cores? Or just
about handling some of the corner cases? I understood the former, not the
latter.

> time left. As soon as uclamp_max restricts the system (like in my
> example) task placement decisions via EAS (min energy diff based) can be
> wrong. 

I'm assuming 'no idle time' refers to the pd being fully loaded already
_before_ placing the newly woken up task. If you refer to it not having idle
time _after_ placing it - then I'm confused as one of the goals of uclamp_max
is to act as a task placement hint and if it can't do that in these simple
scenarios, it won't be very useful? I can appreciate it failing at some corner
cases. But for example if a little core is idle and a 1024 task wakes up with
uclamp_max that makes the little a candidate; then yeah it'll not leave any
idle time on that cpu - but placing it there (if it is the energy efficient cpu)
is the desired effect, no?

> 
> > The only issue I see is that we want to distribute within a pd. Which is
> > something I was going to work on and send later - but I can lump it into this
> > series if it helps.
> 
> I assume you refer to
> 
>     } else if ((fits > max_fits) ||
>         ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
> 
> here?

Yes. If there are multiple cpus with the same max_spare_cap maybe we can
distribute among them rather than always pick the first one.
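
Something along these lines, perhaps (pick_spread_cpu() and the random
tie-break are hypothetical, nothing like this has been posted; it just
illustrates spreading instead of always taking the first candidate):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Among CPUs whose spare capacity ties with the best seen so far,
 * reservoir-sample so each tied CPU is picked with equal probability. */
static int pick_spread_cpu(const long *spare_cap, int nr_cpus)
{
	long best = -1;
	int nr_ties = 0, pick = -1, cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (spare_cap[cpu] > best) {
			best = spare_cap[cpu];
			nr_ties = 1;
			pick = cpu;
		} else if (spare_cap[cpu] == best) {
			/* k-th tie replaces the pick with probability 1/k */
			if (rand() % ++nr_ties == 0)
				pick = cpu;
		}
	}
	return pick;
}

int main(void)
{
	/* e.g. four littles, three of them with identical spare capacity */
	long spare_cap[] = { 6, 6, 0, 6 };

	srand(time(NULL));
	printf("picked cpu%d\n", pick_spread_cpu(spare_cap, 4));
	return 0;
}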

> 
> >> E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440
> >> tasks all running on little PD, cpumask=0x39. The big PD is idle but
> >> never beats prev_cpu in terms of energy consumption for the wakee.
> > 
> > IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
> > is twofold:
> > 
> > 	1. Prevent tasks from consuming too much energy.
> > 	2. Keep them away from expensive CPUs.
> 
> Like I tried to reason about above, the energy diff based task placement
> can't always assure this.

Re 'assure': it is not expected to be a 100% guarantee. But it shouldn't fail
this simply either.

> 
> > 
> > 2 is actually very important for 2 reasons:
> > 
> > 	a. Because of max aggregation - any uncapped task that wakes up will
> > 	   cause a frequency spike on this 'expensive' cpu. We don't have
> > 	   a mechanism to downmigrate it - which is another thing I'm working
> > 	   on.
> 
> True. And it can also lead to CPU overutilization which then leads to
> load-balancing kicking in.

Yep.

> 
> > 	b. It is desired to keep these bigger cpus idle, ready for more important
> > 	   work.
> 
> But this is not happening all the time. Using the same workload
> (6 8ms/16ms uclamp_max=440) on Juno-r0 [446 1024 1024 446 446 446]
> sometimes ends up with all 6 tasks on big CPU1.

This seems similar to a case I see on pinebook pro but with uclamp_min.

$ grep '' /sys/devices/system/cpu/cpu*/cpu_capacity
/sys/devices/system/cpu/cpu0/cpu_capacity:381
/sys/devices/system/cpu/cpu1/cpu_capacity:381
/sys/devices/system/cpu/cpu2/cpu_capacity:381
/sys/devices/system/cpu/cpu3/cpu_capacity:381
/sys/devices/system/cpu/cpu4/cpu_capacity:1024
/sys/devices/system/cpu/cpu5/cpu_capacity:1024

ts, dst_cpu, p_util, uclamp_min, uclamp_max, energy
253049258065, 4, 0, 381, 1024, 10562
253049269732, 3, 0, 381, 1024, 18814
253065609673, 4, 0, 381, 1024, 10562
253065621340, 3, 0, 381, 1024, 17874
253082033696, 4, 0, 381, 1024, 14669
253082045071, 2, 0, 381, 1024, 17403
253098438637, 4, 0, 381, 1024, 14082
253098450011, 3, 0, 381, 1024, 17403
253114803452, 4, 0, 381, 1024, 17016
253114814827, 0, 0, 381, 1024, 16933
253131192435, 4, 0, 381, 1024, 15843
253131204101, 2, 0, 381, 1024, 16933
253147557125, 4, 0, 381, 1024, 18776
253147568500, 0, 0, 381, 1024, 15992
253163935608, 4, 0, 381, 1024, 17603
253163946108, 0, 0, 381, 1024, 15522
253180299382, 4, 0, 381, 1024, 17016
253180306965, 3, 0, 381, 1024, 26811
253196598074, 4, 0, 381, 1024, 16429
253196606532, 0, 0, 381, 1024, 25870
253212953723, 4, 0, 381, 1024, 15256
253212965681, 0, 0, 381, 1024, 25400
253229376288, 4, 0, 381, 1024, 19363

With uclamp_max the energy looks different, but p_util is different too, which
has an impact on compute_energy() calculations.

ts, dst_cpu, p_util, uclamp_min, uclamp_max, energy
760154735422179, 4, 407, 0, 381, 237058
760154735426845, 0, 407, 0, 381, 192382
760154751547464, 4, 407, 0, 381, 237058
760154751552131, 0, 407, 0, 381, 191912
760154767690833, 4, 407, 0, 381, 237058
760154767696667, 0, 407, 0, 381, 202730
760154783818744, 4, 407, 0, 381, 237058
760154783823994, 0, 407, 0, 381, 198967
760154799944613, 4, 407, 0, 381, 237058
760154799949280, 0, 407, 0, 381, 198967
760154816074565, 4, 407, 0, 381, 237058
760154816079232, 0, 407, 0, 381, 195675
760154832201309, 4, 407, 0, 381, 237058
760154832205976, 0, 407, 0, 381, 195204
760154848328053, 4, 407, 0, 381, 237058
760154848333012, 0, 407, 0, 381, 193793
760154864453631, 4, 407, 0, 381, 237058
760154864458297, 0, 407, 0, 381, 193793
760154880578333, 4, 407, 0, 381, 237058
760154880583000, 0, 407, 0, 381, 192852
760154896705369, 4, 407, 0, 381, 237058
760154896710619, 0, 407, 0, 381, 192852

I'm not sure if there's an error in compute_energy to be honest - but as you
said, depending on the conditions of the system the most energy efficient cpu
could be different.

Without this patch we don't even call compute_energy() for the little core; but
now it is a viable candidate.

> 
> A corresponding EAS task placement looks like this one:
> 
> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[1]=0 cpu_util[f|e]=[446 994]  
> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[2]=0 cpu_util[f|e]=[445 1004]
> 
> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=1 busy_time=994 util=994 pd_cap=2048
> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=2 busy_time=1998 util=1004 pd_cap=2048
> 
> <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=821866 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=1998 cpu_cap=1024
> <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=1  energy=842434 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=2048 cpu_cap=1024
>                                                             ^^^^^^^^^^^^^
>                                                             diff = 20,568

This is not necessarily a problem, unless the energy calculations are wrong of
course.

Setting uclamp_max near the top capacity of the littles will hopefully make it
run there more often - but we know that the top frequencies of the little are
not necessarily efficient ones either.

Does lowering uclamp_max more towards the 'knee' of the little help keep the
tasks there?
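
For intuition on the 'knee', here is a toy OPP table (the numbers below are
invented, not from Juno or any real energy model) showing capacity-per-mW
falling off sharply at the higher OPPs:

#include <stdio.h>

/* Invented OPPs for a little CPU topping out at capacity 446. Power
 * grows roughly cubically with frequency, so efficiency (capacity per
 * mW) drops fast above some 'knee' point. */
struct opp { unsigned long cap; unsigned long power_mw; };

int main(void)
{
	static const struct opp opps[] = {
		{ 150,  30 }, { 250,  60 }, { 350, 120 }, { 446, 300 },
	};
	unsigned int i;

	for (i = 0; i < sizeof(opps) / sizeof(opps[0]); i++)
		printf("cap=%3lu power=%3lumW cap/mW(x100)=%lu\n",
		       opps[i].cap, opps[i].power_mw,
		       opps[i].cap * 100 / opps[i].power_mw);
	return 0;
}

With numbers like these, a uclamp_max of 446 still allows the least efficient
OPP, while something nearer 250 would not.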

Note what this patch does is make sure that the little is a 'candidate'. Energy
efficiency is what ultimately decides which candidate cpu/pd to pick.

Being more strict about uclamp_max choice might be necessary if there's
a stronger desire by userspace to keep the tasks on a specific cluster.

If there are errors in calculating energy, I'd appreciate your help on how to
resolve them.


Thanks!

--
Qais Yousef

> 
> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[0]=0 cpu_util[f|e]=[440 446] 
> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[3]=0 cpu_util[f|e]=[6 6]
> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[4]=0 cpu_util[f|e]=[440 446] 
> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[5]=0 cpu_util[f|e]=[440 446] 
> 
> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=0 busy_time=446 util=446 pd_cap=1784
> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=3 busy_time=452 util=2 pd_cap=1784
> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=4 busy_time=898 util=446 pd_cap=1784
> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=5 busy_time=907 util=1 pd_cap=1784
> 
> <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=242002 pd_busy_time=907 task_busy_time=921 max_util=440 busy_time=907 cpu_cap=446
> <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=5  energy=476000 pd_busy_time=907 task_busy_time=921 max_util=440 busy_time=1784 cpu_cap=446
>                                                             ^^^^^^^^^^^^^
>                                                             diff = 233,998
> 
> <idle>-0 [005] select_task_rq_fair: p=[task0-5 2551] target=1
> 
> > For 2, generally we don't want these tasks to steal bandwidth from these CPUs
> > that we'd like to preserve for other types of work.
> > 
> > Of course userspace has control by selecting the right uclamp_max value. They
> > can increase it to allow a spill to the next pd - or keep it low to steer them more
> > strongly on a specific pd.
> 
> [...]
>
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Dietmar Eggemann 2 years, 6 months ago
On 14/02/2023 19:09, Qais Yousef wrote:
> On 02/14/23 13:47, Dietmar Eggemann wrote:
>> On 11/02/2023 18:50, Qais Yousef wrote:
>>> On 02/09/23 19:02, Dietmar Eggemann wrote:
>>>> On 07/02/2023 10:45, Vincent Guittot wrote:
>>>>> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:

[...]

>>
>> The energy value for the perf_state is correct but the decision based
>> on the `energy diff` about which PD to put the task in might not be.
>>
>> In case CPUx already has some tasks, its `pd_busy_time` contribution 
>> is still capped by its capacity_orig.
>>
>> eenv_pd_busy_time() -> cpu_util_next()
>>                          return min(util, capacity_orig_of(cpu))
>>
>> In this case we can't use `energy diff` anymore to make the decision 
>> where to put the task.
>>
>> The base_energy of CPUx's PD is already so high that the 
>> `energy diff = cur_energy - base_energy` is small enough so that CPUx 
>> is selected as target again.
> 
> I'm sorry, bear with me as I'm still struggling to fully understand the case.
> 
> You're thinking that if all the CPUs in the pd are _already_ fully busy before
> adding this newly woken up task there, then we'll end up with the wrong
> energy_diff? IOW, when the pd is fully loaded, then the cost of adding extra
> load will always look cheap, is that what you're saying?

Sort of. The key to this is:

  compute_energy()

    if (dst_cpu >= 0)
     busy_time = min(pd_capacity, pd_busy_time + task_busy_time);
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                  pd + task contribution

If `pd + task contribution` > `pd_capacity` then we're saturated and the
energy diff is bogus. This includes the case in which `pd contribution` alone
already exceeds `pd_capacity`.

>>>> So EAS keeps packing on the cheaper PD/clamped OPP.
>>
>> Sometimes yes, but there are occurrences in which a big CPU ends up
>> with the majority of the tasks. It depends on the start condition.
> 
> It should depend on the energy cost, yes. Which does depend on the current
> state of the system.

I analyzed one of the last traces I got with my example:

During the initial wakeup the system is in CPU OU. So feec() returns
`target = -1` and `sis()->sic()` (a) does the initial placement for all
the 6 tasks.

CPU  (a)     (b) (c) (d)    (e)
      ^       ^   ^   ^      ^
 0   t1-------->| |
 1   t0 t3    |t5 |t1 |t4|   |   |   |   | ...
 2   t2--------------------->|
 3
 4   t4------------>| |
 5   t5---->| |

(b) feec() for t5:

    eenv_pd_busy_time()

      pd_busy_time = 1998 = 994 (CPU1) + 1004 (CPU2)

    compute_energy()

      busy_time = min(pd_capacity, pd_busy_time + task_busy_time)

                = min(2048, 1998 + 921)

                = 2048

    We go into saturation here and EAS assumes it can place t5 for the
    cost of 2048 - 1998 = 50 util where in reality t5 has a util of
    ~921. And that's why t5 ends up on CPU1 too. (This arithmetic is
    redone in plain C after (e) below.)

(c) & (d) similar to (b)

(e) From here on no wakeups anymore. Only tick preemption on CPU1 every
    4ms (250Hz) between the 5 tasks and t2 finishing on CPU2 eventually.
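
The arithmetic in step (b) can be redone in plain C (the constants are taken
from the trace above; this models only the clipping step, not the full
compute_energy()):

#include <stdio.h>

int main(void)
{
	unsigned long pd_cap = 2048, pd_busy = 1998, task_busy = 921;
	unsigned long busy = pd_busy + task_busy;

	/* compute_energy() clips the combined busy time to the PD capacity */
	if (busy > pd_cap)
		busy = pd_cap;

	/* EAS charges t5 only the clipped delta, not its real busy time */
	printf("charged=%lu real=%lu\n", busy - pd_busy, task_busy);
	return 0;
}

This prints charged=50 real=921, i.e. t5 looks almost free on the already
saturated PD.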

>>> Which is the desired behavior for uclamp_max?
>>
>> Not sure. Essentially, EAS can't do its job properly if there is no idle 
> 
> This "not sure" statement is making me worried. Are you not sure about how
> uclamp_max should work in force fitting big tasks into small cores? Or just
> about handling some of the corner cases? I understood the former, not the
> latter.

I'm not sure if we can call the issue that EAS doesn't work in
saturation anymore a corner case. In case uclamp_max has an effect, it
can quite easily create these saturation scenarios in which EAS is still
enabled but it can't do its job anymore.
Making sure that we probe each PD is not getting rid of this more
fundamental issue.

>> time left. As soon as uclamp_max restricts the system (like in my
>> example) task placement decisions via EAS (min energy diff based) can be
>> wrong. 
> 
> I'm assuming 'no idle time' refers to the pd being fully loaded already
> _before_ placing the newly woken up task. If you refer to it not having idle
> time _after_ placing it - then I'm confused as one of the goals of uclamp_max
> is to act as a task placement hint and if it can't do that in these simple

We can't consider an individual task here. Placing `t0` might happen before
placing `t1`, and it is when placing `t1` that we might then run into this
saturation.

> scenarios, it won't be very useful? I can appreciate it failing at some corner
> cases. But for example if a little core is idle and a 1024 task wakes up with
> uclamp_max that makes the little a candidate; then yeah it'll not leave any
> idle time on that cpu - but placing it there (if it is the energy efficient cpu)
> is the desired effect, no?

Not having idle time means EAS can't do its job properly and uclamp_max
can create these scenarios.

>>> The only issue I see is that we want to distribute within a pd. Which is
>>> something I was going to work on and send later - but I can lump it into this
>>> series if it helps.
>>
>> I assume you refer to
>>
>>     } else if ((fits > max_fits) ||
>>         ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
>>
>> here?
> 
> Yes. If there are multiple cpus with the same max_spare_cap maybe we can
> distribute among them rather than always pick the first one.

This can mitigate the issue but not solve it.

[...]

>>> 	b. It is desired to keep these bigger cpus idle, ready for more important
>>> 	   work.
>>
>> But this is not happening all the time. Using the same workload
>> (6 8ms/16ms uclamp_max=440) on Juno-r0 [446 1024 1024 446 446 446]
>> sometimes ends up with all 6 tasks on big CPU1.
> 
> This seems similar to a case I see on pinebook pro but with uclamp_min.
> 
> $ grep '' /sys/devices/system/cpu/cpu*/cpu_capacity
> /sys/devices/system/cpu/cpu0/cpu_capacity:381
> /sys/devices/system/cpu/cpu1/cpu_capacity:381
> /sys/devices/system/cpu/cpu2/cpu_capacity:381
> /sys/devices/system/cpu/cpu3/cpu_capacity:381
> /sys/devices/system/cpu/cpu4/cpu_capacity:1024
> /sys/devices/system/cpu/cpu5/cpu_capacity:1024
> 
> ts, dst_cpu, p_util, uclamp_min, uclamp_max, energy
> 253049258065, 4, 0, 381, 1024, 10562
> 253049269732, 3, 0, 381, 1024, 18814

So the energy after placing p on CPU4 (big) is lower than placing it on
CPU3 (LITTLE). But to see the energy-diff we would need to see the base
energy (dst_cpu = -1) as well.

Why is p_util = 0 here?

> 253065609673, 4, 0, 381, 1024, 10562
> 253065621340, 3, 0, 381, 1024, 17874

[..]

> 
> With uclamp_max the energy looks different, but p_util is different too, which
> has an impact on compute_energy() calculations.

p_util is p->se.avg.util_avg here?

> ts, dst_cpu, p_util, uclamp_min, uclamp_max, energy
> 760154735422179, 4, 407, 0, 381, 237058
> 760154735426845, 0, 407, 0, 381, 192382
> 
> I'm not sure if there's an error in compute_energy to be honest - but as you
> said, depending on the conditions of the system the most energy efficient cpu
> could be different.
> 
> Without this patch we don't even call compute_energy() for the little core; but
> now it is a viable candidate.

Understood. But this doesn't fix the `EAS not working under saturation`
problem.

>> A corresponding EAS task placement looks like this one:
>>
>> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[1]=0 cpu_util[f|e]=[446 994]  
>> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[2]=0 cpu_util[f|e]=[445 1004]
>>
>> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=1 busy_time=994 util=994 pd_cap=2048
>> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=2 busy_time=1998 util=1004 pd_cap=2048
>>
>> <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=821866 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=1998 cpu_cap=1024
>> <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=1  energy=842434 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=2048 cpu_cap=1024
>>                                                             ^^^^^^^^^^^^^
>>                                                             diff = 20,568
> 
> This is not necessarily a problem, unless the energy calculations are wrong of
> course.

The problem is that placing p=[task0-5 2551] 'costs' only the
energy-equivalence of 2048-1998 = 50 util instead of 921.
> 
> Setting uclamp_max near the top capacity of the littles will hopefully make it
> run there more often - but we know that the top frequencies of the little are
> not necessarily efficient ones either.
> 
> Does lowering uclamp_max more towards the 'knee' of the little help keep the
> tasks there?
> 
> Note what this patch does is make sure that the little is a 'candidate'. Energy
> efficiency is what ultimately decides which candidate cpu/pd to pick.
> 
> Being more strict about uclamp_max choice might be necessary if there's
> a stronger desire by userspace to keep the tasks on a specific cluster.
> 
> If there are errors in calculating energy, I'd appreciate your help on how to
> resolve them.

I don't think there is an error in the energy calculation.
But EAS doesn't work properly in these saturation cases which uclamp_max
can create. And considering each PD won't solve this. I'm afraid we do
need a solution for this saturation issue.
Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Posted by Qais Yousef 2 years, 6 months ago
On 02/21/23 13:20, Dietmar Eggemann wrote:

> I analyzed one of the last traces I got with my example:
> 
> During the initial wakeup the system is in CPU OU. So feec() returns
> `target = -1` and `sis()->sic()` (a) does the initial placement for all
> the 6 tasks.
> 
> CPU  (a)     (b) (c) (d)    (e)
>       ^       ^   ^   ^      ^
>  0   t1-------->| |
>  1   t0 t3    |t5 |t1 |t4|   |   |   |   | ...
>  2   t2--------------------->|
>  3
>  4   t4------------>| |
>  5   t5---->| |
> 
> (b) feec() for t5:
> 
>     eenv_pd_busy_time()
> 
>       pd_busy_time = 1998 = 994 (CPU1) + 1004 (CPU2)
> 
>     compute_energy()
> 
>       busy_time = min(pd_capacity, pd_busy_time + task_busy_time)
> 
>                 = min(2048, 1998 + 921)
> 
>                 = 2048
> 
>     We go into saturation here and EAS assumes it can place t5 for the
>     cost of 2048 - 1998 = 50 util where in reality t5 has a util of
>     ~921. And that's why t5 ends up on CPU1 too.

So let me rephrase, as it was hard to follow your line of thought.

The problem you're highlighting is that with uclamp_max we can end up with
a fully utilized pd. But the energy cost of this always-running pd is capped to
a constant value. And you think this should be modeled better to reflect that
it'll run for a longer period of time, hence cost more energy.

There are multiple compounding issues that make this difficult to reflect
accurately, and that make me think this best-effort is not really as bad as you
think:

1. The capacity of little cores has become small, 158 on pixel 6. This makes
   the definition of a 'busy' task from the little's point of view rather
   relaxed/small. But there's a strong desire to enable uclamp_max to act as
   a soft affinity to prevent these tasks from interfering with more important
   work on bigger cores and from potentially wasting power.

2. If the task is a truly long running 1024 task (even on a big core), then
   knowing its actual utilization and runtime would be impossible and we'd have
   to cap it to something. It's basically infinity.

3. Because of uclamp_max capping, the util_avg could be wrong - it depends on
   how big the task actually is compared to uclamp_max. In the worst case
   scenario, if no idle time is seen, util_avg can easily grow into 1024 -
   although if the task were allowed to run uncapped it might actually be 300
   or something in reality.

4. Because of max aggregation, the uclamp_max tasks are a frequency spike
   hazard. Again, if the task is a true 1024 (and capping those tasks is one of
   the major use cases for uclamp_max, if not The Major use case) then as soon
   as an uncapped task wakes up on the CPU, there'd be a frequency spike
   regardless of how small this task is. (A toy model of this spike follows
   after this list.)

   If at wake-up it thought a big core is okay, and the task ends up running
   for the next few 100s of ms with random rt/kworkers waking up on that core,
   the big core will run often enough at the most energy inefficient point of
   the system. Which is a bigger problem, as the power cost will be very high
   and noticeable. And this is already why uclamp_max isn't being used today.

   And because this is an always running task for the foreseeable future from
   the scheduler PoV, and given the lack of a downmigration mechanism to move
   these tasks away (I have patches for that - but I'm trying to send fixes one
   by one), these frequency spikes pose a bigger hazard of wasting power than
   a one-off miscalculation at wakeup time. The decision at wake-up is a sample
   taken at that exact instant of time. But generally uclamp_max is desired for
   those long running tasks whose potential energy cost over a long period of
   time (many ticks' worth of real time) is the worrisome part. Immediately
   after that sampling time the conditions could change even today and render
   our decision completely wrong. And there's no fixing that. It's the nature
   of the beast.

5. One of the desired properties of uclamp_max in practice is to act as a soft
   affinity. In Android, background tasks are restricted by cpuset. But this
   could potentially lead to lost opportunities for better energy placement
   under normal lightly loaded circumstances where there are a lot of small
   tasks running but nothing particularly busy. EAS shines in those scenarios,
   but fails for long running tasks - and this is where uclamp_max is expected
   to help.
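
For point 4, here is a toy model of the max aggregation spike (this is only
a model; uclamp_rq_util_with() in the kernel differs in detail, and the task
values below are made up):

#include <stdio.h>

struct task { unsigned long util; unsigned long uclamp_max; };

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

int main(void)
{
	struct task capped  = { .util = 1024, .uclamp_max = 440 };
	struct task kworker = { .util = 50,   .uclamp_max = 1024 };
	unsigned long util = capped.util;
	unsigned long clamp = capped.uclamp_max;

	/* alone, the capped task drives a low frequency request */
	printf("capped alone: freq util = %lu\n", min_ul(util, clamp));

	/* a tiny uncapped wakee lifts the rq clamp to 1024, so the full
	 * unclamped utilization suddenly becomes visible to cpufreq */
	util += kworker.util;
	clamp = max_ul(clamp, kworker.uclamp_max);
	printf("with kworker: freq util = %lu\n", min_ul(util, clamp));
	return 0;
}

The request jumps from 440 to 1024 even though the wakee itself only adds
50 util.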

> 
> (c) & (d) similar to (b)
> 
> (e) From here on no wakeups anymore. Only tick preemption on CPU1 every
>     4ms (250Hz) between the 5 tasks and t2 finishing on CPU2 eventually.
> 
> >>> Which is the desired behavior for uclamp_max?
> >>
> >> Not sure. Essentially, EAS can't do its job properly if there is no idle 
> > 
> > This "not sure" statement is making me worried. Are you not sure about how
> > uclamp_max should work in force fitting big tasks into small cores? Or just
> > about handling some of the corner cases? I understood the former, not the
> > latter.
> 
> I'm not sure if we can call the issue that EAS doesn't work in
> saturation anymore a corner case. In case uclamp_max has an effect, it
> can quite easily create these saturation scenarios in which EAS is still
> enabled but it can't do its job anymore.
> Making sure that we probe each PD is not getting rid of this more
> fundamental issue.

I disagree with there being a fundamental issue in regard to the energy
calculation. I see a potential nice-to-have improvement for estimating the
energy of long running tasks.

What I see as a fundamental error, and what this series is fixing, is that the
hint to keep tasks on little cores doesn't work. Being able to consider a more
energy efficient CPU is less of a problem, and I'm not sure it's really
affecting anyone in practice. Today's mainline will not consider any busy task
as a candidate for a little core even if uclamp_max says it's okay. The task
has to be small enough (< 158 on pixel 6) for EAS to consider it, but those
tasks are not the ones that need the uclamp_max hint.

We could try to remove the min(). But I worry this is premature before first
introducing my fixes to teach the load balancer to do downmigration. And
I think we need to do more to address the frequency spike problem - I have
proposals ready for that too.

As it stands we can't use uclamp_max in products, and there is a series of
fixes required to achieve that. And what you see as a fundamental issue is not
one of the blocking issues I am aware of. It seems a nice enhancement to make
the wakeup energy estimation even better. But I do think it requires a crystal
ball and it'll remain a best effort even after that.

I think the current outcome is the desired behavior to be honest. Maybe we can
do better, but that's a separate problem/question and not a fundamental blocker
to enabling uclamp_max to do what it says on the tin.

I can add this as another limitation in the uclamp doc.

The only thing we probably should do now is add a limit to how much cramming
we can do before we say it's too much for this pd, even with uclamp_max. Maybe
we can base the limit on load_avg, as I think it'll continue to grow, unlike
util_avg which will settle at 1024.
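
A very rough sketch of that idea (fits_with_load() and the margin are
hypothetical, nothing like this exists in the posted series):

#include <stdio.h>

#define CRAM_LOAD_MARGIN	20	/* allow ~20% above capacity_orig */

/* Unlike util_avg, load_avg keeps growing when there is no idle time,
 * so it can flag a CPU that is already crammed well beyond its capacity
 * even when uclamp_max says the task 'fits'. */
static int fits_with_load(unsigned long load_avg, unsigned long capacity_orig)
{
	return load_avg * 100 <= capacity_orig * (100 + CRAM_LOAD_MARGIN);
}

int main(void)
{
	/* e.g. a little CPU with capacity_orig = 446 */
	printf("load 400: %d\n", fits_with_load(400, 446));	/* 1: ok */
	printf("load 900: %d\n", fits_with_load(900, 446));	/* 0: too crammed */
	return 0;
}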


Thanks!

--
Qais Yousef