effective_cpu_util() may return a utilization that is bigger than the
actual_cpu_capacity, and this can cause errors in the pd_busy_time
calculation.
Clamp the cpu_busy_time with eenv->cpu_cap, which is the
actual_cpu_capacity.
Fixes: 3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()")
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
---
kernel/sched/fair.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a5b1ae0aa55..8939d725023a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7870,7 +7870,9 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
 	for_each_cpu(cpu, pd_cpus) {
 		unsigned long util = cpu_util(cpu, p, -1, 0);
 
-		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+		util = effective_cpu_util(cpu, util, NULL, NULL);
+		util = min(eenv->cpu_cap, util);
+		busy_time += util;
 	}
 
 	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
--
2.25.1
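[To illustrate the miscalculation being fixed, a minimal standalone sketch with invented numbers: two CPUs in one PD whose actual capacity has been capped at 400, so eenv->cpu_cap = 400 and eenv->pd_cap = 800.]

/*
 * Illustrative sketch of the pd_busy_time error (invented numbers, not
 * taken from the patch itself).
 */
#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
	unsigned long cpu_cap = 400, pd_cap = 800;
	/* effective_cpu_util() can exceed cpu_cap, e.g. 700 on CPU0: */
	unsigned long util[2] = { 700, 100 };
	unsigned long before = 0, after = 0;

	for (int i = 0; i < 2; i++) {
		before += util[i];		/* old: uncapped sum */
		after += MIN(cpu_cap, util[i]);	/* new: per-CPU clamp */
	}

	/* Old: min(800, 800) = 800 -> the PD looks completely saturated. */
	printf("pd_busy_time before: %lu\n", MIN(pd_cap, before));
	/* New: min(800, 500) = 500 -> CPU0's overflow no longer hides CPU1's headroom. */
	printf("pd_busy_time after:  %lu\n", MIN(pd_cap, after));

	return 0;
}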
On 6/6/24 08:06, Xuewen Yan wrote:
> effective_cpu_util() may return a utilization that is bigger than the
> actual_cpu_capacity, and this can cause errors in the pd_busy_time
> calculation.
> Clamp the cpu_busy_time with eenv->cpu_cap, which is the
> actual_cpu_capacity.
>
> Fixes: 3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()")
> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
> ---
> kernel/sched/fair.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a5b1ae0aa55..8939d725023a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7870,7 +7870,9 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
> for_each_cpu(cpu, pd_cpus) {
> unsigned long util = cpu_util(cpu, p, -1, 0);
>
> - busy_time += effective_cpu_util(cpu, util, NULL, NULL);
> + util = effective_cpu_util(cpu, util, NULL, NULL);
> + util = min(eenv->cpu_cap, util);
> + busy_time += util;
> }
>
> eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
I can reproduce the issue and the fix, so
Tested-by: Christian Loehle <christian.loehle@arm.com>
(@Qais, this is on a non-overutilized system).
I'm unsure about the other callers of effective_cpu_util(), or rather
sched_cpu_util() in particular, which include thermal and powercap; they
are probably off too.
Anyway, I'll try to reproduce for them too.
Kind Regards,
Christian
On 06/06/24 15:06, Xuewen Yan wrote:
> effective_cpu_util() may return a utilization that is bigger than the
> actual_cpu_capacity, and this can cause errors in the pd_busy_time
> calculation.
> Clamp the cpu_busy_time with eenv->cpu_cap, which is the
> actual_cpu_capacity.
I actually think capping by pd_cap is something we should remove. Saturated
systems aren't calculated properly, especially when uclamp_max is used.
But this might be a bigger change, and out of scope of what you're proposing..
Did this 'wrong' calculation cause an actual problem for task placement?
I assume the pd looked 'busier' because some CPUs were too busy.
Was the system in an overutilized state? Usually if one CPU is that busy,
overutilized should be set and we'd skip EAS - unless uclamp_max was used.
[...]
Hi Qais
On Mon, Jun 10, 2024 at 6:55 AM Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/06/24 15:06, Xuewen Yan wrote:
> [...]
>
> I actually think capping by pd_cap is something we should remove. Saturated
> systems aren't calculated properly, especially when uclamp_max is used.
>
> But this might be a bigger change, and out of scope of what you're proposing..
I agree, there are other things to consider before doing this.
>
> Did this 'wrong' calculation cause an actual problem for task placement?
> I assume the pd looked 'busier' because some CPUs were too busy.
This will not only affect calculations in scenarios with high temperatures.
Sometimes, users set scaling_max_freq to actively limit the CPU frequency,
so that even if the CPU load is large, the CPU frequency will not be very high.
At this time, even if tasks are placed on other CPUs in the same PD,
the energy increment may not be too large, thus affecting core selection.
>
> Was the system in an overutilized state? Usually if one CPU is that busy,
> overutilized should be set and we'd skip EAS - unless uclamp_max was used.
As Christian said, this case occurs not only in the overutilized scenario;
it holds whenever the actual CPU capacity, reduced because of a lowered max
CPU frequency, is smaller than the CPU utilization.
Thanks!
---
xuewen
On 06/12/24 16:11, Xuewen Yan wrote:
> Hi Qais
>
> On Mon, Jun 10, 2024 at 6:55 AM Qais Yousef <qyousef@layalina.io> wrote:
> [...]
> > Did this 'wrong' calculation cause an actual problem for task placement?
> > I assume the pd looked 'busier' because some CPUs were too busy.
>
> This will not only affect calculations in scenarios with high temperatures.
> Sometimes, users set scaling_max_freq to actively limit the CPU frequency,
> so that even if the CPU load is large, the CPU frequency will not be very high.
> At this time, even if tasks are placed on other CPUs in the same PD,
> the energy increment may not be too large, thus affecting core selection.
>
> > Was the system in an overutilized state? Usually if one CPU is that busy,
> > overutilized should be set and we'd skip EAS - unless uclamp_max was used.
>
> As Christian said, this case occurs not only in the overutilized scenario;
> it holds whenever the actual CPU capacity, reduced because of a lowered max
> CPU frequency, is smaller than the CPU utilization.
Hmm. Sorry I might be thick here, but shouldn't fits_capacity() use
capacity_of() which should return capacity based on get_actual_cpu_capacity()
to compare if we are overutilized? If we are higher than this value that you
need to cap, then the CPU must be overutilized and we shouldn't be in feec() in
the first place, no? Unless the rq is capped with uclamp_max that is.
I generally think our current definition of overutilized is wrong, and the use
case you're talking about should hold true if it's just a single CPU that is
overutilized. But I can't see how you end up in feec() if the util is higher
than the CPU's capacity in our current code base.
What did I miss?
And should effective_cpu_util() return a value higher than
get_actual_cpu_capacity()?
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 8b1194c39161..91acc0f92ae4 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -286,7 +286,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	unsigned long util, irq, scale;
 	struct rq *rq = cpu_rq(cpu);
 
-	scale = arch_scale_cpu_capacity(cpu);
+	scale = get_actual_cpu_capacity(cpu);
 
 	/*
 	 * Early check to see if IRQ/steal time saturates the CPU, can be
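[For context, get_actual_cpu_capacity() accounts for frequency and thermal pressure. Its definition at the time was roughly the following; this is a sketch quoted from memory of kernel/sched/fair.c, not guaranteed verbatim, and it relies on kernel-internal helpers.]

/* Sketch of get_actual_cpu_capacity() -- treat as approximate. */
static unsigned long get_actual_cpu_capacity(int cpu)
{
	unsigned long capacity = arch_scale_cpu_capacity(cpu);

	/* Remove the larger of HW (thermal) pressure and cpufreq limits. */
	capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu));

	return capacity;
}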
On Mon, 17 Jun 2024 at 00:20, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/12/24 16:11, Xuewen Yan wrote:
[...]
> And should effective_cpu_util() return a value higher than
> get_actual_cpu_capacity()?

I don't think it should, because we want to return the effective
utilization, not the actual compute capacity. A utilization of a CPU, or
a group of CPUs, above the actual or original capacity mainly means that
we will have to run longer.

By capping the utilization we filter out this information:

  capacity orig = 800
  util_avg = 700

If we cap the capacity to 400, the CPU is expected to run twice as long
for the same amount of work.

> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
[...]
On 06/17/24 11:07, Vincent Guittot wrote:
> > And should effective_cpu_util() return a value higher than
> > get_actual_cpu_capacity()?
>
> I don't think it should, because we want to return the effective
> utilization, not the actual compute capacity.
[...]
> If we cap the capacity to 400, the CPU is expected to run twice as long
> for the same amount of work.

Okay, makes sense. Wouldn't the util be 'wrong' (to a degree that will
depend on the min/max freq ratio) though?

We still cap with arch_scale_cpu_capacity(); I guess we know at this
stage it is 100% wrong if we allow returning higher values?
On Mon, 17 Jun 2024 at 12:53, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/17/24 11:07, Vincent Guittot wrote:
[...]
> Okay, makes sense. Wouldn't the util be 'wrong' (to a degree that will
> depend on the min/max freq ratio) though?
>
> We still cap with arch_scale_cpu_capacity(); I guess we know at this
> stage it is 100% wrong if we allow returning higher values?

I think that capping utilization to max capacity generates some energy
estimation error, because it filters out the fact that we run longer in
some cases.
On 06/18/24 17:23, Vincent Guittot wrote:
> On Mon, 17 Jun 2024 at 12:53, Qais Yousef <qyousef@layalina.io> wrote:
[...]
> I think that capping utilization to max capacity generates some energy
> estimation error, because it filters out the fact that we run longer in
> some cases.

Yes, I think so too, and that was my first statement. But I think this is
a bigger change to do separately.

I *think* we have another source of error: we take util/cpu_cap as the
percentage of time the CPU is busy. We assume an implicit multiplication
with a time period, T. I am not sure this implicit assumption is accurate
and that things are aligned properly, especially with how utilization
loses temporal info due to invariance: util can be low while the actual
runtime is much longer. I'm not sure this implicit multiplication handles
that properly. Besides, with performance domains having shared CPUs, I am
not sure this period is aligned across all CPUs for the implicit
multiplication to work as intended.

I have yet to study this properly, but I thought I'd mention it, as
I think this (energy estimation) is increasingly becoming an important
area to improve on.
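[A back-of-the-envelope sketch of Vincent's example, using his numbers: work proportional to util_avg takes util/capacity of a period to execute, so halving the capacity doubles the runtime -- unless util is clamped to the capacity, which erases that signal.]

#include <stdio.h>

int main(void)
{
	double util_avg = 700.0;
	double capacity_orig = 800.0;
	double capacity_capped = 400.0;

	/* 700/800 = 0.875 vs 700/400 = 1.75: twice as long when capped. */
	printf("busy fraction at orig capacity:   %.3f\n", util_avg / capacity_orig);
	printf("busy fraction at capped capacity: %.3f\n", util_avg / capacity_capped);

	/* Clamping util to the capacity hides this: min(700, 400)/400 = 1.0. */
	double clamped = util_avg < capacity_capped ? util_avg : capacity_capped;
	printf("busy fraction after clamping:     %.3f\n", clamped / capacity_capped);

	return 0;
}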
On Tue, Jun 18, 2024 at 11:39 PM Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/18/24 17:23, Vincent Guittot wrote:
> [...]
> > I think that capping utilization to max capacity generates some energy
> > estimation error, because it filters out the fact that we run longer in
> > some cases.
>
> Yes, I think so too, and that was my first statement. But I think this is
> a bigger change to do separately.
I saw that sched_cpu_util() is used by teo.c and cpufreq_cooling.c.
If we change arch_scale_cpu_capacity() to get_actual_cpu_capacity(), might it
cause some errors?

For example, in teo.c:

static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
{
	return sched_cpu_util(cpu) > cpu_data->util_threshold;
}

It may cause teo_cpu_is_utilized() to return false forever if the
actual_cpu_capacity is smaller than util_threshold.
However, the util_threshold is derived from the original CPU capacity.
In cpufreq_cooling.c, maybe we should change:
diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index 280071be30b1..a8546d69cc10 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -164,7 +164,7 @@ static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
 {
 	unsigned long util = sched_cpu_util(cpu);
 
-	return (util * 100) / arch_scale_cpu_capacity(cpu);
+	return (util * 100) / get_actual_cpu_capacity(cpu);
 }
 #else /* !CONFIG_SMP */
 static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
Because if we still use arch_scale_cpu_capacity(), the load percentage may be
underestimated, which may affect the thermal IPA governor's power calculations.
[...]
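[To make the teo.c concern above concrete, a standalone sketch with invented numbers; the threshold derivation below is simplified and is not teo's actual code.]

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	unsigned long capacity_orig = 1024;
	unsigned long util_threshold = capacity_orig / 2;	/* invented derivation */
	unsigned long capacity_actual = 400;			/* heavily capped CPU */

	/*
	 * If sched_cpu_util() were clamped to capacity_actual, it could
	 * never exceed a threshold derived from capacity_orig that sits
	 * above capacity_actual: the CPU would never look "utilized".
	 */
	unsigned long util = capacity_actual;	/* best case after clamping */
	bool is_utilized = util > util_threshold;

	printf("is_utilized: %s\n", is_utilized ? "true" : "false");	/* false, forever */
	return 0;
}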
On 06/19/24 11:05, Xuewen Yan wrote:
> On Tue, Jun 18, 2024 at 11:39 PM Qais Yousef <qyousef@layalina.io> wrote:
> [...]
> > Yes, I think so too, and that was my first statement. But I think this is
> > a bigger change to do separately.
>
> I saw that sched_cpu_util() is used by teo.c and cpufreq_cooling.c.
> If we change arch_scale_cpu_capacity() to get_actual_cpu_capacity(), might it
> cause some errors?
The plan is to revert this now.
>
> For example, in teo.c:
>
> static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
> {
>         return sched_cpu_util(cpu) > cpu_data->util_threshold;
> }
>
> It may cause teo_cpu_is_utilized() to return false forever if the
> actual_cpu_capacity is smaller than util_threshold.
> However, the util_threshold is derived from the original CPU capacity.
>
> In cpufreq_cooling.c, maybe we should change:
>
> diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
> index 280071be30b1..a8546d69cc10 100644
> --- a/drivers/thermal/cpufreq_cooling.c
> +++ b/drivers/thermal/cpufreq_cooling.c
> @@ -164,7 +164,7 @@ static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
>  {
>  	unsigned long util = sched_cpu_util(cpu);
>
> -	return (util * 100) / arch_scale_cpu_capacity(cpu);
> +	return (util * 100) / get_actual_cpu_capacity(cpu);
>  }
>  #else /* !CONFIG_SMP */
>  static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
>
>
> Because if we still use arch_scale_cpu_capacity(), the load percentage may be
> underestimated, which may affect the thermal IPA governor's power calculations.
I am not sure about this one. But looks plausible. Vincent?
On Wed, 19 Jun 2024 at 20:10, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/19/24 11:05, Xuewen Yan wrote:
> > I saw that sched_cpu_util() is used by teo.c and cpufreq_cooling.c.
> > If we change arch_scale_cpu_capacity() to get_actual_cpu_capacity(), might it
> > cause some errors?
>
> The plan is to revert this now.
>
> [...]
>
> > Because if we still use arch_scale_cpu_capacity(), the load percentage may be
> > underestimated, which may affect the thermal IPA governor's power calculations.
>
> I am not sure about this one. But looks plausible. Vincent?
I don't see why we should change them? We don't want to change
sched_cpu_util() either.

AFAICT, the only outcome of this thread is that we should use
get_actual_cpu_capacity() instead of arch_scale_cpu_capacity() in
util_fits_cpu(). Capping the utilization only makes the estimation
worse.
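[A sketch of what that direction could look like, assuming the then-current util_fits_cpu() reads the capacity via arch_scale_cpu_capacity(); an illustrative hunk with the context elided, not the patch that was eventually posted.]

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ static inline int util_fits_cpu(...) @@
-	capacity_orig = arch_scale_cpu_capacity(cpu);
+	capacity_orig = get_actual_cpu_capacity(cpu);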
On 06/20/24 09:45, Vincent Guittot wrote:
> On Wed, 19 Jun 2024 at 20:10, Qais Yousef <qyousef@layalina.io> wrote:
> [...]
> > I am not sure about this one. But looks plausible. Vincent?
>
> I don't see why we should change them? We don't want to change
> sched_cpu_util() either.
> AFAICT, the only outcome of this thread is that we should use
> get_actual_cpu_capacity() instead of arch_scale_cpu_capacity() in
> util_fits_cpu(). Capping the utilization only makes the estimation
> worse.
Yes, my bad. Only util_fits_cpu() is needed now.
On 20/06/2024 13:37, Qais Yousef wrote:
> On 06/20/24 09:45, Vincent Guittot wrote:
>> On Wed, 19 Jun 2024 at 20:10, Qais Yousef <qyousef@layalina.io> wrote:
>>>
>>> On 06/19/24 11:05, Xuewen Yan wrote:
>>>> On Tue, Jun 18, 2024 at 11:39 PM Qais Yousef <qyousef@layalina.io> wrote:
>>>>>
>>>>> On 06/18/24 17:23, Vincent Guittot wrote:
>>>>>> On Mon, 17 Jun 2024 at 12:53, Qais Yousef <qyousef@layalina.io> wrote:
>>>>>>>
>>>>>>> On 06/17/24 11:07, Vincent Guittot wrote:
[...]
>>>> diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
>>>> index 280071be30b1..a8546d69cc10 100644
>>>> --- a/drivers/thermal/cpufreq_cooling.c
>>>> +++ b/drivers/thermal/cpufreq_cooling.c
>>>> @@ -164,7 +164,7 @@ static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
>>>>  {
>>>>  	unsigned long util = sched_cpu_util(cpu);
>>>>
>>>> -	return (util * 100) / arch_scale_cpu_capacity(cpu);
>>>> +	return (util * 100) / get_actual_cpu_capacity(cpu);
>>>>  }
>>>>  #else /* !CONFIG_SMP */
>>>>  static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
>>>>
>>>>
>>>> Because if we still use arch_scale_cpu_capacity(), the load percentage may be
>>>> underestimated, which may affect the thermal IPA governor's power calculations.
>>>
>>> I am not sure about this one. But looks plausible. Vincent?
>>
>> I don't see why we should change them? We don't want to change
>> sched_cpu_util() either.
>> AFAICT, the only outcome of this thread is that we should use
>> get_actual_cpu_capacity() instead of arch_scale_cpu_capacity() in
>> util_fits_cpu(). Capping the utilization only makes the estimation
>> worse.
>
> Yes, my bad. Only util_fits_cpu() is needed now.
Looks like that's for the uclamp part (the second part) of util_fits_cpu().

For the first part we use capacity_of(), which is based on
get_actual_cpu_capacity() [via scale_rt_capacity()] and changes every 4ms
[at 250 Hz].

Our assumption is that hw_load_avg() and cpufreq_get_pressure() change far
less frequently than that, right?

So we can use capacity_of() and get_actual_cpu_capacity() in the same
code path.
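[For reference, the relationship described here is roughly the following; a sketch of kernel/sched/fair.c from memory, not verbatim. capacity_of() returns rq->cpu_capacity, which scale_rt_capacity() periodically recomputes starting from get_actual_cpu_capacity() and subtracting RT/DL/IRQ pressure.]

/* Sketch from memory of scale_rt_capacity() -- treat as approximate. */
static unsigned long scale_rt_capacity(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long max = get_actual_cpu_capacity(cpu);
	unsigned long used, irq;

	irq = cpu_util_irq(rq);
	if (unlikely(irq >= max))
		return 1;

	used = cpu_util_rt(rq);		/* RT pressure */
	used += cpu_util_dl(rq);	/* DL pressure */
	if (unlikely(used >= max))
		return 1;

	/* Scale the remaining capacity by the non-IRQ fraction of time. */
	return scale_irq_capacity(max - used, irq, max);
}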
On Tue, 18 Jun 2024 at 17:39, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/18/24 17:23, Vincent Guittot wrote:
[...]
> I *think* we have another source of error: we take util/cpu_cap as the
> percentage of time the CPU is busy. We assume an implicit multiplication
> with a time period, T. I am not sure this implicit assumption is accurate
> and that things are aligned properly, especially with how utilization
> loses temporal info due to invariance: util can be low while the actual
> runtime is much longer.

I'm not sure I get what you mean by "how utilization loses the temporal
info due to invariance".

Utilization aims to estimate the number of instructions to execute,
whatever the CPU of the system; once divided by the compute capacity of
the OPP of a CPU, it estimates how long it will take to do the job. So if
the capacity of an OPP of a CPU is low, it will reflect that the actual
runtime will be much longer. A low utilization means that you don't have
many instructions to execute, but says nothing about the speed at which
you will execute them.

Then problems start when we cap utilization to the CPU capacity, as an
example, because we cap this temporal info.

> I'm not sure this implicit multiplication handles that properly. Besides,
> with performance domains having shared CPUs, I am not sure this period is
> aligned across all CPUs for the implicit multiplication to work as
> intended.

It's all about averages, because it's too expensive, if not impossible,
to know when the instructions will be executed on the other CPUs. We can
only take the edge case (currently the worst case).

Beside that, there is the impact of uclamp making the selected OPP not
always sustainable but sometimes temporary.

> I have yet to study this properly, but I thought I'd mention it, as
> I think this (energy estimation) is increasingly becoming an important
> area to improve on.
On 06/18/24 23:05, Vincent Guittot wrote:
> On Tue, 18 Jun 2024 at 17:39, Qais Yousef <qyousef@layalina.io> wrote:
[...]
> I'm not sure I get what you mean by "how utilization loses the temporal
> info due to invariance".

The utilization value itself doesn't tell us the length of the task's
runtime, only its compute demand.

> Utilization aims to estimate the number of instructions to execute,
> whatever the CPU of the system; once divided by the compute capacity of
> the OPP of a CPU, it estimates how long it will take to do the job.

Yes for the number of instructions. And yes, the *ratio* can potentially
be a proxy for the *percentage* of time we are running. But we have no
idea about absolute runtime.

AFAIU, there's an assumption that this percentage of running time is
multiplied by an 'unidentified' period value to get a proxy for the time
the perf domain will run for. This time, multiplied by power, gives us
the energy. I am just not sure if we're losing information with all of
these transformations. I need to investigate more.

And we assume a periodic time interval for which we say the CPU will be
busy for this percentage of the time. I am not sure if at every wakeup
this period needs to be aligned. I think this will matter the most for
calculating the base_energy.

I am not sure if this makes sense :). I need to study the details more
anyway and collect some data. But my worry is generally whether our
approximation of runtime is good enough and how to improve it.

> So if the capacity of an OPP of a CPU is low, it will reflect that the
> actual runtime will be much longer. A low utilization means that you
> don't have many instructions to execute, but says nothing about the
> speed at which you will execute them.

Yes. But I am worried about whether the actual absolute time is being
approximated well enough or not.

> Then problems start when we cap utilization to the CPU capacity, as an
> example, because we cap this temporal info.

Yes. We agree on the existence of this problem.

> It's all about averages, because it's too expensive, if not impossible,
> to know when the instructions will be executed on the other CPUs. We can
> only take the edge case (currently the worst case).

Yes..

> Beside that, there is the impact of uclamp making the selected OPP not
> always sustainable but sometimes temporary.
[...]
Hi Qais,

On Mon, Jun 17, 2024 at 6:20 AM Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/12/24 16:11, Xuewen Yan wrote:
[...]
> Hmm. Sorry I might be thick here, but shouldn't fits_capacity() use
> capacity_of() which should return capacity based on get_actual_cpu_capacity()
> to compare if we are overutilized? If we are higher than this value that you
> need to cap, then the CPU must be overutilized and we shouldn't be in feec() in
> the first place, no? Unless the rq is capped with uclamp_max that is.

Sorry, I missed that fits_capacity() uses capacity_of(); without
uclamp_max, the rd is overutilized and would not use feec().

But I noticed that with uclamp_max, if the rq's uclamp_max is smaller
than SCHED_CAPACITY_SCALE but bigger than the actual_cpu_capacity,
util_fits_cpu() would return true and the rd is not overutilized.
Is this behavior intentional?

> And should effective_cpu_util() return a value higher than
> get_actual_cpu_capacity()?

I also thought about changing this at first, but because this function
is called in many places, I am not 100% sure that changing it will not
have any unexpected consequences, so I only changed feec() :)

cheers
---
xuewen

> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
[...]
On 06/17/24 15:27, Xuewen Yan wrote:
> Hi Qais,
>
> On Mon, Jun 17, 2024 at 6:20 AM Qais Yousef <qyousef@layalina.io> wrote:
[...]
> Sorry, I missed that fits_capacity() uses capacity_of(); without
> uclamp_max, the rd is overutilized and would not use feec().
>
> But I noticed that with uclamp_max, if the rq's uclamp_max is smaller
> than SCHED_CAPACITY_SCALE but bigger than the actual_cpu_capacity,
> util_fits_cpu() would return true and the rd is not overutilized.
> Is this behavior intentional?

Hmm. To a great extent, yes. We didn't want to take all types of rq
pressure into account for uclamp_max. But this corner case could be
debatable.

Is this the source of your problem? If you change util_fits_cpu() to
return false here, would this fix the problem you're seeing?

> I also thought about changing this at first, but because this function
> is called in many places, I am not 100% sure that changing it will not
> have any unexpected consequences, so I only changed feec() :)

Yes, I had similar doubts. But the question had to be asked :-)
On Mon, 17 Jun 2024 at 13:03, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/17/24 15:27, Xuewen Yan wrote:
> > Sorry, I missed that fits_capacity() uses capacity_of(); without
> > uclamp_max, the rd is overutilized and would not use feec().
> >
> > But I noticed that with uclamp_max, if the rq's uclamp_max is smaller
> > than SCHED_CAPACITY_SCALE but bigger than the actual_cpu_capacity,
> > util_fits_cpu() would return true and the rd is not overutilized.
> > Is this behavior intentional?
>
> Hmm. To a great extent, yes. We didn't want to take all types of rq
> pressure into account for uclamp_max. But this corner case could be
> debatable.

Shouldn't we use get_actual_cpu_capacity() instead of
arch_scale_cpu_capacity() everywhere in util_fits_cpu()?
get_actual_cpu_capacity() appeared recently, and there were discussions
about whether to use the thermal load_avg, but everything is fixed now.
I think that using get_actual_cpu_capacity() everywhere in
util_fits_cpu() would make sense and would cover the case reported by
Xuewen just above.

> Is this the source of your problem? If you change util_fits_cpu() to
> return false here, would this fix the problem you're seeing?
[...]
On 06/18/24 17:20, Vincent Guittot wrote:
> > > Sorry, I missed that fits_capacity() uses capacity_of(); without
> > > uclamp_max, the rd is overutilized and would not use feec().
> > > But I noticed that with uclamp_max, if the rq's uclamp_max is smaller
> > > than SCHED_CAPACITY_SCALE but bigger than the actual_cpu_capacity,
> > > util_fits_cpu() would return true and the rd is not overutilized.
> > > Is this behavior intentional?
> >
> > Hmm. To a great extent, yes. We didn't want to take all types of rq
> > pressure into account for uclamp_max. But this corner case could be
> > debatable.
>
> Shouldn't we use get_actual_cpu_capacity() instead of
> arch_scale_cpu_capacity() everywhere in util_fits_cpu()?
> get_actual_cpu_capacity() appeared recently, and there were discussions
> about whether to use the thermal load_avg, but everything is fixed now.
> I think that using get_actual_cpu_capacity() everywhere in
> util_fits_cpu() would make sense and would cover the case reported by
> Xuewen just above.

Yes, agreed. I think we need both patches. Although we need to confirm
that uclamp_max is what is causing the situation Xuewen is seeing.
Otherwise we have a race somewhere that needs to be understood.
On 06/17/24 12:03, Qais Yousef wrote:
> > Sorry, I missed that fits_capacity() uses capacity_of(); without
> > uclamp_max, the rd is overutilized and would not use feec().
> >
> > But I noticed that with uclamp_max, if the rq's uclamp_max is smaller
> > than SCHED_CAPACITY_SCALE but bigger than the actual_cpu_capacity,
> > util_fits_cpu() would return true and the rd is not overutilized.
> > Is this behavior intentional?
>
> Hmm. To a great extent, yes. We didn't want to take all types of rq
> pressure into account for uclamp_max. But this corner case could be
> debatable.
>
> Is this the source of your problem? If you change util_fits_cpu() to
> return false here, would this fix the problem you're seeing?

FWIW, if this happens due to uclamp_max, then this patch to do the
capping is still needed.

I think it's good to understand first how we end up in feec() when a CPU
is supposed to be overutilized. uclamp_max is the only way to override
this decision AFAICT..
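[To make the corner case concrete, a simplified sketch; the helper and its logic are hypothetical, not the kernel's actual util_fits_cpu() code.]

#include <stdbool.h>

/*
 * Hypothetical, simplified model of the corner case discussed above:
 * uclamp_max is compared against the *original* capacity, so a rq capped
 * below SCHED_CAPACITY_SCALE but above the pressure-reduced actual
 * capacity still "fits" and the rd is not marked overutilized.
 */
static bool sketch_util_fits_cpu(unsigned long util, unsigned long uclamp_max,
				 unsigned long capacity_orig,
				 unsigned long capacity_actual)
{
	if (util <= capacity_actual)
		return true;	/* fits even the pressure-reduced capacity */

	/*
	 * util exceeds the actual capacity; an intentional uclamp_max cap
	 * below the original capacity is still treated as fitting, without
	 * consulting capacity_actual.
	 */
	return uclamp_max < capacity_orig;
}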
On Tue, Jun 18, 2024 at 10:58 PM Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/17/24 12:03, Qais Yousef wrote:
[...]
> FWIW, if this happens due to uclamp_max, then this patch to do the
> capping is still needed.
>
> I think it's good to understand first how we end up in feec() when a CPU
> is supposed to be overutilized. uclamp_max is the only way to override
> this decision AFAICT..

Sorry for the late reply...
In our own tree, we removed the check for rd overutilization in feec(),
so the above case often occurs.
And now it seems that on the mainline, uclamp_max is the only way to
override this.

Thanks!

BR
---
xuewen
On 19/06/2024 04:46, Xuewen Yan wrote:
> On Tue, Jun 18, 2024 at 10:58 PM Qais Yousef <qyousef@layalina.io> wrote:
>>
>> On 06/17/24 12:03, Qais Yousef wrote:

[...]

> Sorry for the late reply...
> In our own tree, we removed the check for rd overutilization in feec(),
> so the above case often occurs.

How do you schedule hackbench on this thing then? Via EAS, or do you just
exclude this kind of workload?

> And now it seems that on the mainline, uclamp_max is the only way to
> override this.

[...]
On Fri, 21 Jun 2024 at 12:41, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 19/06/2024 04:46, Xuewen Yan wrote:
[...]
> How do you schedule hackbench on this thing then? Via EAS, or do you just
> exclude this kind of workload?

I don't know the details for Xuewen, but that's also what I'm doing as
part of the rework I'm doing on EAS. As I said at OSPM, I'm also calling
feec() every time, even when overutilized. feec() returns -1 when it
can't find a suitable CPU, and we then fall back to the default
performance path.

> > And now it seems that on the mainline, uclamp_max is the only way to
> > override this.
>
> [...]
On Fri, Jun 21, 2024 at 9:01 PM Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Fri, 21 Jun 2024 at 12:41, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >
> > On 19/06/2024 04:46, Xuewen Yan wrote:
> > > On Tue, Jun 18, 2024 at 10:58 PM Qais Yousef <qyousef@layalina.io> wrote:
> > >>
> > >> On 06/17/24 12:03, Qais Yousef wrote:
> >
> > [...]
> >
> > > Sorry for the late reply...
> > > In our own tree, we removed the check for rd overutil in feec(), so
> > > the above case often occurs.
> >
> > How do you schedule hackbench on this thing then? Via EAS, or do you
> > just exclude this kind of workload?
>
> Don't know the details for Xuewen, but that's also what I'm doing as
> part of the rework I'm doing on EAS. As I said at OSPM, I'm also calling
> feec() every time, even when overutilized. feec() returns -1 when it
> can't find a suitable cpu and we then fall back to the default
> performance path.
>

Aha, we do the same. We use EAS on Android, and there are often only one
or two tasks in the overutilized state, but we don't want this scenario
to affect the placement of other small tasks.
Only when all CPUs are in the overutilized state does feec() return -1,
and then we fall back to the default performance path.

BR
---
xuewen
On 06/19/24 10:46, Xuewen Yan wrote:
> On Tue, Jun 18, 2024 at 10:58 PM Qais Yousef <qyousef@layalina.io> wrote:
> >
> > On 06/17/24 12:03, Qais Yousef wrote:
> > >
> > > > Sorry, I missed that fits_capacity() uses capacity_of(), and without
> > > > uclamp_max the rd is over-utilized and would not use feec().
> > > > But I noticed that with uclamp_max, if the rq's uclamp_max is smaller
> > > > than SCHED_CAPACITY_SCALE and bigger than the actual_cpu_capacity,
> > > > util_fits_cpu() would return true, and the rd is not over-utilized.
> > > > Is this setting intentional?
> > >
> > > Hmm. To a great extent, yes. We didn't want to take all types of rq
> > > pressure into account for uclamp_max. But this corner case could be
> > > debatable.
> > >
> > > Is this the source of your problem? If you change util_fits_cpu() to
> > > return false here, would this fix the problem you're seeing?
> >
> > FWIW, if this happens due to uclamp_max, then this patch to do the
> > capping is still needed.
> >
> > I think it's good to understand first how we end up in feec() when a
> > CPU is supposed to be overutilized. uclamp_max is the only way to
> > override this decision AFAICT..
>
> Sorry for the late reply...
> In our own tree, we removed the check for rd overutil in feec(), so
> the above case often occurs.

Yeah, the current definition of overutilized is not good enough. So I
can see the need to do that..

> And now it seems that on the mainline, uclamp_max is the only way to
> override this.

Thanks for checking. Your patch is correct, and if we combine it with
another patch to use get_actual_cpu_capacity() in util_fits_cpu(), they
are good improvements.

Are you still happy to send updated patches?


Thanks!

--
Qais Yousef
On Thu, Jun 20, 2024 at 2:03 AM Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/19/24 10:46, Xuewen Yan wrote:
> > On Tue, Jun 18, 2024 at 10:58 PM Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > On 06/17/24 12:03, Qais Yousef wrote:
> > > >
> > > > > Sorry, I missed that fits_capacity() uses capacity_of(), and without
> > > > > uclamp_max the rd is over-utilized and would not use feec().
> > > > > But I noticed that with uclamp_max, if the rq's uclamp_max is smaller
> > > > > than SCHED_CAPACITY_SCALE and bigger than the actual_cpu_capacity,
> > > > > util_fits_cpu() would return true, and the rd is not over-utilized.
> > > > > Is this setting intentional?
> > > >
> > > > Hmm. To a great extent, yes. We didn't want to take all types of rq
> > > > pressure into account for uclamp_max. But this corner case could be
> > > > debatable.
> > > >
> > > > Is this the source of your problem? If you change util_fits_cpu() to
> > > > return false here, would this fix the problem you're seeing?
> > >
> > > FWIW, if this happens due to uclamp_max, then this patch to do the
> > > capping is still needed.
> > >
> > > I think it's good to understand first how we end up in feec() when a
> > > CPU is supposed to be overutilized. uclamp_max is the only way to
> > > override this decision AFAICT..
> >
> > Sorry for the late reply...
> > In our own tree, we removed the check for rd overutil in feec(), so
> > the above case often occurs.
>
> Yeah, the current definition of overutilized is not good enough. So I
> can see the need to do that..
>
> > And now it seems that on the mainline, uclamp_max is the only way to
> > override this.
>
> Thanks for checking. Your patch is correct, and if we combine it with
> another patch to use get_actual_cpu_capacity() in util_fits_cpu(), they
> are good improvements.
>
> Are you still happy to send updated patches?

Okay, I'll send the new patches soon:)

Thanks!
--
xuewen
On 06/06/2024 09:06, Xuewen Yan wrote:
> Because effective_cpu_util() may return a util which is
> bigger than the actual_cpu_capacity, this could cause
> pd_busy_time calculation errors.
Doesn't effective_cpu_util() return either scale or min(scale, util),
with scale = arch_scale_cpu_capacity(cpu)? So the util sum over the PD
cannot exceed eenv->cpu_cap?
Looks like this was the case with 3e8c6c9aac42 already.
> So clamp the cpu_busy_time with the eenv->cpu_cap, which is
> the actual_cpu_capacity.
>
> Fixes: 3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()")
> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
> ---
> kernel/sched/fair.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a5b1ae0aa55..8939d725023a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7870,7 +7870,9 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
> for_each_cpu(cpu, pd_cpus) {
> unsigned long util = cpu_util(cpu, p, -1, 0);
>
> - busy_time += effective_cpu_util(cpu, util, NULL, NULL);
> + util = effective_cpu_util(cpu, util, NULL, NULL);
> + util = min(eenv->cpu_cap, util);
> + busy_time += util;
> }
>
> eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
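For context, a condensed sketch of the capping Dietmar is asking about
(simplified from effective_cpu_util(); cpu_util_others_sketch() is an
illustrative stand-in for the RT/DL/IRQ contributions):

unsigned long effective_cpu_util_sketch(int cpu, unsigned long util_cfs)
{
	/* scale is the original capacity, not the pressure-reduced one */
	unsigned long scale = arch_scale_cpu_capacity(cpu);
	unsigned long util = util_cfs + cpu_util_others_sketch(cpu);

	/*
	 * The per-CPU result is capped at scale, while eenv->cpu_cap
	 * holds the smaller actual_cpu_capacity; the gap between the
	 * two is what the rest of the thread turns on.
	 */
	return min(scale, util);
}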
Hi Dietmar
On Fri, Jun 7, 2024 at 3:19 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 06/06/2024 09:06, Xuewen Yan wrote:
> > Because effective_cpu_util() may return a util which is
> > bigger than the actual_cpu_capacity, this could cause
> > pd_busy_time calculation errors.
>
> Doesn't effective_cpu_util() return either scale or min(scale, util),
> with scale = arch_scale_cpu_capacity(cpu)? So the util sum over the PD
> cannot exceed eenv->cpu_cap?
In effective_cpu_util(), scale = arch_scale_cpu_capacity(cpu), so the
per-CPU result is only capped at the arch scale.
Although there is the clamp of eenv->pd_cap, let us consider the
following simple scenario:
The pd cpus are 4-7, the arch_scale_capacity is 1024, and because of
a cpufreq limit,
the cpu_actual_cap = 512. Then eenv->cpu_cap = 512 and eenv->pd_cap = 2048;
effective_cpu_util(4) = 1024;
effective_cpu_util(5) = 1024;
effective_cpu_util(6) = 256;
effective_cpu_util(7) = 0;
Without this patch, eenv->pd_busy_time = 2048, because of the clamp to
eenv->pd_cap.
However, cpu4's and cpu5's util cannot actually exceed the actual_cap
(512), so cpu_util for cpu4/5 should be 512 instead of 1024.
With this patch, eenv->pd_busy_time = 512+512+256 = 1280, which is
more reasonable.
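To make the arithmetic concrete, a small standalone sketch with the
capacities above (illustrative values only):

#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
	unsigned long cpu_cap = 512, pd_cap = 2048;
	unsigned long util[] = { 1024, 1024, 256, 0 };	/* CPUs 4-7 */
	unsigned long old_busy = 0, new_busy = 0;
	int i;

	for (i = 0; i < 4; i++) {
		old_busy += util[i];			/* unclamped sum */
		new_busy += MIN(cpu_cap, util[i]);	/* per-CPU clamp */
	}
	/* without the patch: min(2048, 2304) = 2048 */
	printf("old pd_busy_time = %lu\n", MIN(pd_cap, old_busy));
	/* with the patch: 512 + 512 + 256 + 0 = 1280 */
	printf("new pd_busy_time = %lu\n", MIN(pd_cap, new_busy));
	return 0;
}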
BR
--
xuewen
> Looks like this was the case with 3e8c6c9aac42 already.
>
> > So clamp the cpu_busy_time with the eenv->cpu_cap, which is
> > the actual_cpu_capacity.
> >
> > Fixes: 3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()")
> > Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
> > ---
> > kernel/sched/fair.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a5b1ae0aa55..8939d725023a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7870,7 +7870,9 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
> > for_each_cpu(cpu, pd_cpus) {
> > unsigned long util = cpu_util(cpu, p, -1, 0);
> >
> > - busy_time += effective_cpu_util(cpu, util, NULL, NULL);
> > + util = effective_cpu_util(cpu, util, NULL, NULL);
> > + util = min(eenv->cpu_cap, util);
> > + busy_time += util;
> > }
> >
> > eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
>
On 07/06/2024 10:20, Xuewen Yan wrote:
> Hi Dietmar
>
> On Fri, Jun 7, 2024 at 3:19 PM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 06/06/2024 09:06, Xuewen Yan wrote:
>>> Because effective_cpu_util() may return a util which is
>>> bigger than the actual_cpu_capacity, this could cause
>>> pd_busy_time calculation errors.
>>
>> Doesn't effective_cpu_util() return either scale or min(scale, util),
>> with scale = arch_scale_cpu_capacity(cpu)? So the util sum over the PD
>> cannot exceed eenv->cpu_cap?
>
> In effective_cpu_util(), scale = arch_scale_cpu_capacity(cpu), so the
> per-CPU result is only capped at the arch scale.
> Although there is the clamp of eenv->pd_cap, let us consider the
> following simple scenario:
> The pd cpus are 4-7, the arch_scale_capacity is 1024, and because of
> a cpufreq limit,
Ah, this is due to:
find_energy_efficient_cpu()
...
for (; pd; pd = pd->next)
...
cpu_actual_cap = get_actual_cpu_capacity(cpu)
for_each_cpu(cpu, cpus)
...
eenv.pd_cap += cpu_actual_cap
and:
get_actual_cpu_capacity()
...
capacity = arch_scale_cpu_capacity(cpu)
capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu))
which got introduced by f1f8d0a22422 ("sched/cpufreq: Take cpufreq
feedback into account").
[...]
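Put as code, the pressure adjustment Dietmar quotes amounts to the
following condensed sketch (the example numbers continue the scenario
from this thread):

static unsigned long get_actual_cpu_capacity_sketch(int cpu)
{
	/* e.g. 1024 for the big CPUs in the example above */
	unsigned long capacity = arch_scale_cpu_capacity(cpu);

	/* e.g. a cpufreq pressure of 512 leaves an actual capacity of 512 */
	capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu));

	return capacity;
}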
On Fri, Jun 7, 2024 at 6:30 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 07/06/2024 10:20, Xuewen Yan wrote:
> > Hi Dietmar
> >
> > On Fri, Jun 7, 2024 at 3:19 PM Dietmar Eggemann
> > <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 06/06/2024 09:06, Xuewen Yan wrote:
> >>> Because effective_cpu_util() may return a util which is
> >>> bigger than the actual_cpu_capacity, this could cause
> >>> pd_busy_time calculation errors.
> >>
> >> Doesn't effective_cpu_util() return either scale or min(scale, util),
> >> with scale = arch_scale_cpu_capacity(cpu)? So the util sum over the PD
> >> cannot exceed eenv->cpu_cap?
> >
> > In effective_cpu_util(), scale = arch_scale_cpu_capacity(cpu), so the
> > per-CPU result is only capped at the arch scale.
> > Although there is the clamp of eenv->pd_cap, let us consider the
> > following simple scenario:
> > The pd cpus are 4-7, the arch_scale_capacity is 1024, and because of
> > a cpufreq limit,
>
> Ah, this is due to:
>
> find_energy_efficient_cpu()
>
> ...
> for (; pd; pd = pd->next)
> ...
> cpu_actual_cap = get_actual_cpu_capacity(cpu)
>
> for_each_cpu(cpu, cpus)
> ...
> eenv.pd_cap += cpu_actual_cap
>
> and:
>
> get_actual_cpu_capacity()
>
> ...
> capacity = arch_scale_cpu_capacity(cpu)
>
> capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu))
>
> which got introduced by f1f8d0a22422 ("sched/cpufreq: Take cpufreq
> feedback into account").
I don't think it was introduced by f1f8d0a22422, because f1f8d0a22422
just replaced the cpu_thermal_cap with get_actual_cpu_capacity(cpu).
The eenv.cpu_cap was introduced by 3e8c6c9aac42 ("sched/fair: Remove
task_util from effective utilization in feec()").
BR
---
xuewen
>
> [...]
On 07/06/2024 12:37, Xuewen Yan wrote:
> On Fri, Jun 7, 2024 at 6:30 PM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 07/06/2024 10:20, Xuewen Yan wrote:
>>> Hi Dietmar
>>>
>>> On Fri, Jun 7, 2024 at 3:19 PM Dietmar Eggemann
>>> <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 06/06/2024 09:06, Xuewen Yan wrote:
>>>>> Because effective_cpu_util() may return a util which is
>>>>> bigger than the actual_cpu_capacity, this could cause
>>>>> pd_busy_time calculation errors.
>>>>
>>>> Doesn't effective_cpu_util() return either scale or min(scale, util),
>>>> with scale = arch_scale_cpu_capacity(cpu)? So the util sum over the PD
>>>> cannot exceed eenv->cpu_cap?
>>>
>>> In effective_cpu_util(), scale = arch_scale_cpu_capacity(cpu), so the
>>> per-CPU result is only capped at the arch scale.
>>> Although there is the clamp of eenv->pd_cap, let us consider the
>>> following simple scenario:
>>> The pd cpus are 4-7, the arch_scale_capacity is 1024, and because of
>>> a cpufreq limit,
>>
>> Ah, this is due to:
>>
>> find_energy_efficient_cpu()
>>
>> ...
>> for (; pd; pd = pd->next)
>> ...
>> cpu_actual_cap = get_actual_cpu_capacity(cpu)
>>
>> for_each_cpu(cpu, cpus)
>> ...
>> eenv.pd_cap += cpu_actual_cap
>>
>> and:
>>
>> get_actual_cpu_capacity()
>>
>> ...
>> capacity = arch_scale_cpu_capacity(cpu)
>>
>> capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu))
>>
>> which got introduced by f1f8d0a22422 ("sched/cpufreq: Take cpufreq
>> feedback into account").
>
> I don't think it was introduced by f1f8d0a22422, because f1f8d0a22422
> just replaced the cpu_thermal_cap with get_actual_cpu_capacity(cpu).
> The eenv.cpu_cap was introduced by 3e8c6c9aac42 ("sched/fair: Remove
> task_util from effective utilization in feec()").
Yes, you're right. 3e8c6c9aac42 changed it from per-CPU to per-PD
capping.
In case we want to go back to per-CPU then we should remove the
eenv->pd_cap capping in eenv_pd_busy_time().
-->8--
@@ -7864,16 +7864,15 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
struct cpumask *pd_cpus,
struct task_struct *p)
{
- unsigned long busy_time = 0;
int cpu;
for_each_cpu(cpu, pd_cpus) {
unsigned long util = cpu_util(cpu, p, -1, 0);
- busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+ util = effective_cpu_util(cpu, util, NULL, NULL);
+ util = min(util, eenv->cpu_cap);
+ eenv->pd_busy_time += util;
}
-
- eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
}
I'm wondering whether we would need the:
if (dst_cpu >= 0)
busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
in compute_energy() anymore, since we only get a candidate CPU in feec()
after checking with util_fits_cpu() whether the cpu can accommodate p:
feec()
...
for_each_cpu()
util = cpu_util(cpu, p, cpu, ...)
cpu_cap = capacity_of()
...
fits = util_fits_cpu(util, ..., cpu);
if (!fits)
continue
/* check if candidate CPU */
On Fri, Jun 21, 2024 at 6:22 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 07/06/2024 12:37, Xuewen Yan wrote:
> > On Fri, Jun 7, 2024 at 6:30 PM Dietmar Eggemann
> > <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 07/06/2024 10:20, Xuewen Yan wrote:
> >>> Hi Dietmar
> >>>
> >>> On Fri, Jun 7, 2024 at 3:19 PM Dietmar Eggemann
> >>> <dietmar.eggemann@arm.com> wrote:
> >>>>
> >>>> On 06/06/2024 09:06, Xuewen Yan wrote:
> >>>>> Because effective_cpu_util() may return a util which is
> >>>>> bigger than the actual_cpu_capacity, this could cause
> >>>>> pd_busy_time calculation errors.
> >>>>
> >>>> Doesn't effective_cpu_util() return either scale or min(scale, util),
> >>>> with scale = arch_scale_cpu_capacity(cpu)? So the util sum over the PD
> >>>> cannot exceed eenv->cpu_cap?
> >>>
> >>> In effective_cpu_util(), scale = arch_scale_cpu_capacity(cpu), so the
> >>> per-CPU result is only capped at the arch scale.
> >>> Although there is the clamp of eenv->pd_cap, let us consider the
> >>> following simple scenario:
> >>> The pd cpus are 4-7, the arch_scale_capacity is 1024, and because of
> >>> a cpufreq limit,
> >>
> >> Ah, this is due to:
> >>
> >> find_energy_efficient_cpu()
> >>
> >> ...
> >> for (; pd; pd = pd->next)
> >> ...
> >> cpu_actual_cap = get_actual_cpu_capacity(cpu)
> >>
> >> for_each_cpu(cpu, cpus)
> >> ...
> >> eenv.pd_cap += cpu_actual_cap
> >>
> >> and:
> >>
> >> get_actual_cpu_capacity()
> >>
> >> ...
> >> capacity = arch_scale_cpu_capacity(cpu)
> >>
> >> capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu))
> >>
> >> which got introduced by f1f8d0a22422 ("sched/cpufreq: Take cpufreq
> >> feedback into account").
> >
> > I don't think it was introduced by f1f8d0a22422, because f1f8d0a22422
> > just replaced the cpu_thermal_cap with get_actual_cpu_capacity(cpu).
> > The eenv.cpu_cap was introduced by 3e8c6c9aac42 ("sched/fair: Remove
> > task_util from effective utilization in feec()").
>
> Yes, you're right. 3e8c6c9aac42 changed it from per-CPU to per-PD
> capping.
>
> In case we want to go back to per-CPU then we should remove the
> eenv->pd_cap capping in eenv_pd_busy_time().
>
> -->8--
>
> @@ -7864,16 +7864,15 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
> struct cpumask *pd_cpus,
> struct task_struct *p)
> {
> - unsigned long busy_time = 0;
> int cpu;
>
> for_each_cpu(cpu, pd_cpus) {
> unsigned long util = cpu_util(cpu, p, -1, 0);
>
> - busy_time += effective_cpu_util(cpu, util, NULL, NULL);
> + util = effective_cpu_util(cpu, util, NULL, NULL);
> + util = min(util, eenv->cpu_cap);
> + eenv->pd_busy_time += util;
> }
> -
> - eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
> }
Okay, the pd-busy clamp is indeed unnecessary.
>
>
>
> I'm wondering whether we would need the:
>
> if (dst_cpu >= 0)
> busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
>
> in compute_energy() anymore, since we only get a candidate CPU in feec()
> after checking with util_fits_cpu() whether the cpu can accommodate p:
I think this condition is still necessary: pd_busy_time is now clamped
per CPU, but without this min() the sum of busy_time and task_busy_time
could still exceed eenv->pd_cap, i.e. the pd's actual capacity.
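For reference, the condition under discussion sits in compute_energy();
a condensed sketch follows (simplified, not the verbatim kernel code):

static unsigned long compute_energy_sketch(struct energy_env *eenv,
					   struct perf_domain *pd,
					   unsigned long max_util, int dst_cpu)
{
	unsigned long busy_time = eenv->pd_busy_time;

	/*
	 * pd_busy_time is built from per-CPU clamped values, but adding
	 * the waking task's busy time for the dst_cpu case could still
	 * push the sum past the pd's actual capacity, so the min() with
	 * eenv->pd_cap is kept.
	 */
	if (dst_cpu >= 0)
		busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);

	return em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
}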
>
> feec()
>
> ...
>
> for_each_cpu()
>
> util = cpu_util(cpu, p, cpu, ...)
> cpu_cap = capacity_of()
>
> ...
>
> fits = util_fits_cpu(util, ..., cpu);
> if (!fits)
> continue
>
> /* check if candidate CPU */