effective_cpu_util() may return a utilization that is bigger than the
actual_cpu_capacity, and this can cause errors in the pd_busy_time
calculation.
Clamp the cpu_busy_time with eenv->cpu_cap, which is the
actual_cpu_capacity.
Fixes: 3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()")
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
---
kernel/sched/fair.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a5b1ae0aa55..8939d725023a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7870,7 +7870,9 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
 	for_each_cpu(cpu, pd_cpus) {
 		unsigned long util = cpu_util(cpu, p, -1, 0);
 
-		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+		util = effective_cpu_util(cpu, util, NULL, NULL);
+		util = min(eenv->cpu_cap, util);
+		busy_time += util;
 	}
 
 	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
--
2.25.1
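[To illustrate the miscalculation being fixed, a minimal standalone sketch with invented numbers: two CPUs in one PD whose actual capacity has been capped at 400, so eenv->cpu_cap = 400 and eenv->pd_cap = 800.]

/*
 * Illustrative sketch of the pd_busy_time error (invented numbers, not
 * taken from the patch itself).
 */
#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
	unsigned long cpu_cap = 400, pd_cap = 800;
	/* effective_cpu_util() can exceed cpu_cap, e.g. 700 on CPU0: */
	unsigned long util[2] = { 700, 100 };
	unsigned long before = 0, after = 0;

	for (int i = 0; i < 2; i++) {
		before += util[i];		/* old: uncapped sum */
		after += MIN(cpu_cap, util[i]);	/* new: per-CPU clamp */
	}

	/* Old: min(800, 800) = 800 -> the PD looks completely saturated. */
	printf("pd_busy_time before: %lu\n", MIN(pd_cap, before));
	/* New: min(800, 500) = 500 -> CPU0's overflow no longer hides CPU1's headroom. */
	printf("pd_busy_time after:  %lu\n", MIN(pd_cap, after));

	return 0;
}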
On 6/6/24 08:06, Xuewen Yan wrote:
> effective_cpu_util() may return a utilization that is bigger than the
> actual_cpu_capacity, and this can cause errors in the pd_busy_time
> calculation.
> Clamp the cpu_busy_time with eenv->cpu_cap, which is the
> actual_cpu_capacity.
>
> Fixes: 3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()")
> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
> ---
> kernel/sched/fair.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a5b1ae0aa55..8939d725023a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7870,7 +7870,9 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
> for_each_cpu(cpu, pd_cpus) {
> unsigned long util = cpu_util(cpu, p, -1, 0);
>
> - busy_time += effective_cpu_util(cpu, util, NULL, NULL);
> + util = effective_cpu_util(cpu, util, NULL, NULL);
> + util = min(eenv->cpu_cap, util);
> + busy_time += util;
> }
>
> eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
I can reproduce the issue and the fix, so
Tested-by: Christian Loehle <christian.loehle@arm.com>
(@Qais, this is on a non-overutilized system).
I'm unsure about the other callers of effective_cpu_util(), or rather
sched_cpu_util() in particular, which include thermal and powercap; they
are probably off too.
Anyway, I'll try to reproduce for them too.
Kind Regards,
Christian
On 06/06/24 15:06, Xuewen Yan wrote:
> effective_cpu_util() may return a utilization that is bigger than the
> actual_cpu_capacity, and this can cause errors in the pd_busy_time
> calculation.
> Clamp the cpu_busy_time with eenv->cpu_cap, which is the
> actual_cpu_capacity.
I actually think capping by pd_cap is something we should remove. Saturated
systems aren't calculated properly, especially when uclamp_max is used.
But this might be a bigger change, and out of scope of what you're proposing..
Did this 'wrong' calculation cause an actual problem for task placement?
I assume the pd looked 'busier' because some CPUs were too busy.
Was the system in an overutilized state? Usually if one CPU is that busy,
overutilized should be set and we'd skip EAS - unless uclamp_max was used.
[...]
Hi Qais
On Mon, Jun 10, 2024 at 6:55 AM Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/06/24 15:06, Xuewen Yan wrote:
> [...]
>
> I actually think capping by pd_cap is something we should remove. Saturated
> systems aren't calculated properly, especially when uclamp_max is used.
>
> But this might be a bigger change, and out of scope of what you're proposing..
I agree, there are other things to consider before doing this.
>
> Did this 'wrong' calculation cause an actual problem for task placement?
> I assume the pd looked 'busier' because some CPUs were too busy.
This will not only affect calculations in scenarios with high temperatures.
Sometimes, users set scaling_max_freq to actively limit the CPU frequency,
so that even if the CPU load is large, the CPU frequency will not be very high.
At this time, even if tasks are placed on other CPUs in the same PD,
the energy increment may not be too large, thus affecting core selection.
>
> Was the system in an overutilized state? Usually if one CPU is that busy,
> overutilized should be set and we'd skip EAS - unless uclamp_max was used.
As Christian said, this case occurs not only in the overutilized scenario;
it holds whenever the actual CPU capacity, reduced because of a lowered max
CPU frequency, is smaller than the CPU utilization.
Thanks!
---
xuewen
On 06/12/24 16:11, Xuewen Yan wrote:
> Hi Qais
>
> On Mon, Jun 10, 2024 at 6:55 AM Qais Yousef <qyousef@layalina.io> wrote:
> [...]
> > Did this 'wrong' calculation cause an actual problem for task placement?
> > I assume the pd looked 'busier' because some CPUs were too busy.
>
> This will not only affect calculations in scenarios with high temperatures.
> Sometimes, users set scaling_max_freq to actively limit the CPU frequency,
> so that even if the CPU load is large, the CPU frequency will not be very high.
> At this time, even if tasks are placed on other CPUs in the same PD,
> the energy increment may not be too large, thus affecting core selection.
>
> > Was the system in an overutilized state? Usually if one CPU is that busy,
> > overutilized should be set and we'd skip EAS - unless uclamp_max was used.
>
> As Christian said, this case occurs not only in the overutilized scenario;
> it holds whenever the actual CPU capacity, reduced because of a lowered max
> CPU frequency, is smaller than the CPU utilization.
Hmm. Sorry I might be thick here, but shouldn't fits_capacity() use
capacity_of() which should return capacity based on get_actual_cpu_capacity()
to compare if we are overutilized? If we are higher than this value that you
need to cap, then the CPU must be overutilized and we shouldn't be in feec() in
the first place, no? Unless the rq is capped with uclamp_max that is.
I generally think our current definition of overutilized is wrong, and the use
case you're talking about should hold true if it's just a single CPU that is
overutilized. But I can't see how you end up in feec() if the util is higher
than the CPU's capacity in our current code base.
What did I miss?
And should effective_cpu_util() return a value higher than
get_actual_cpu_capacity()?
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 8b1194c39161..91acc0f92ae4 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -286,7 +286,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	unsigned long util, irq, scale;
 	struct rq *rq = cpu_rq(cpu);
 
-	scale = arch_scale_cpu_capacity(cpu);
+	scale = get_actual_cpu_capacity(cpu);
 
 	/*
 	 * Early check to see if IRQ/steal time saturates the CPU, can be
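[For context, get_actual_cpu_capacity() accounts for frequency and thermal pressure. Its definition at the time was roughly the following; this is a sketch quoted from memory of kernel/sched/fair.c, not guaranteed verbatim, and it relies on kernel-internal helpers.]

/* Sketch of get_actual_cpu_capacity() -- treat as approximate. */
static unsigned long get_actual_cpu_capacity(int cpu)
{
	unsigned long capacity = arch_scale_cpu_capacity(cpu);

	/* Remove the larger of HW (thermal) pressure and cpufreq limits. */
	capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu));

	return capacity;
}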
On Mon, 17 Jun 2024 at 00:20, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/12/24 16:11, Xuewen Yan wrote:
[...]
> And should effective_cpu_util() return a value higher than
> get_actual_cpu_capacity()?

I don't think it should, because we want to return the effective
utilization, not the actual compute capacity. A utilization of a CPU, or
a group of CPUs, above the actual or original capacity mainly means that
we will have to run longer.

By capping the utilization we filter out this information:

  capacity orig = 800
  util_avg = 700

If we cap the capacity to 400, the CPU is expected to run twice as long
for the same amount of work.

> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
[...]
On 06/17/24 11:07, Vincent Guittot wrote:
> > And should effective_cpu_util() return a value higher than
> > get_actual_cpu_capacity()?
>
> I don't think it should, because we want to return the effective
> utilization, not the actual compute capacity.
[...]
> If we cap the capacity to 400, the CPU is expected to run twice as long
> for the same amount of work.

Okay, makes sense. Wouldn't the util be 'wrong' (to a degree that will
depend on the min/max freq ratio) though?

We still cap with arch_scale_cpu_capacity(); I guess we know at this
stage it is 100% wrong if we allow returning higher values?
On Mon, 17 Jun 2024 at 12:53, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/17/24 11:07, Vincent Guittot wrote:
[...]
> Okay, makes sense. Wouldn't the util be 'wrong' (to a degree that will
> depend on the min/max freq ratio) though?
>
> We still cap with arch_scale_cpu_capacity(); I guess we know at this
> stage it is 100% wrong if we allow returning higher values?

I think that capping utilization to max capacity generates some energy
estimation error, because it filters out the fact that we run longer in
some cases.
On 06/18/24 17:23, Vincent Guittot wrote:
> On Mon, 17 Jun 2024 at 12:53, Qais Yousef <qyousef@layalina.io> wrote:
[...]
> I think that capping utilization to max capacity generates some energy
> estimation error, because it filters out the fact that we run longer in
> some cases.

Yes, I think so too, and that was my first statement. But I think this is
a bigger change to do separately.

I *think* we have another source of error: we take util/cpu_cap as the
percentage of time the CPU is busy. We assume an implicit multiplication
with a time period, T. I am not sure this implicit assumption is accurate
and that things are aligned properly, especially with how utilization
loses temporal info due to invariance: util can be low while the actual
runtime is much longer. I'm not sure this implicit multiplication handles
that properly. Besides, with performance domains having shared CPUs, I am
not sure this period is aligned across all CPUs for the implicit
multiplication to work as intended.

I have yet to study this properly, but I thought I'd mention it, as
I think this (energy estimation) is increasingly becoming an important
area to improve on.
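[A back-of-the-envelope sketch of Vincent's example, using his numbers: work proportional to util_avg takes util/capacity of a period to execute, so halving the capacity doubles the runtime -- unless util is clamped to the capacity, which erases that signal.]

#include <stdio.h>

int main(void)
{
	double util_avg = 700.0;
	double capacity_orig = 800.0;
	double capacity_capped = 400.0;

	/* 700/800 = 0.875 vs 700/400 = 1.75: twice as long when capped. */
	printf("busy fraction at orig capacity:   %.3f\n", util_avg / capacity_orig);
	printf("busy fraction at capped capacity: %.3f\n", util_avg / capacity_capped);

	/* Clamping util to the capacity hides this: min(700, 400)/400 = 1.0. */
	double clamped = util_avg < capacity_capped ? util_avg : capacity_capped;
	printf("busy fraction after clamping:     %.3f\n", clamped / capacity_capped);

	return 0;
}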
On Tue, Jun 18, 2024 at 11:39 PM Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/18/24 17:23, Vincent Guittot wrote:
> [...]
> > I think that capping utilization to max capacity generates some energy
> > estimation error, because it filters out the fact that we run longer in
> > some cases.
>
> Yes, I think so too, and that was my first statement. But I think this is
> a bigger change to do separately.
I saw that sched_cpu_util() is used by teo.c and cpufreq_cooling.c.
If we change arch_scale_cpu_capacity() to get_actual_cpu_capacity(), might it
cause some errors?

For example, in teo.c:

static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
{
	return sched_cpu_util(cpu) > cpu_data->util_threshold;
}

It may cause teo_cpu_is_utilized() to return false forever if the
actual_cpu_capacity is smaller than util_threshold.
However, the util_threshold is derived from the original CPU capacity.
In cpufreq_cooling.c, maybe we should change:
diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index 280071be30b1..a8546d69cc10 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -164,7 +164,7 @@ static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
 {
 	unsigned long util = sched_cpu_util(cpu);
 
-	return (util * 100) / arch_scale_cpu_capacity(cpu);
+	return (util * 100) / get_actual_cpu_capacity(cpu);
 }
 #else /* !CONFIG_SMP */
 static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
Because if we still use arch_scale_cpu_capacity(), the load percentage may be
underestimated, which may affect the thermal IPA governor's power calculations.
[...]
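[To make the teo.c concern above concrete, a standalone sketch with invented numbers; the threshold derivation below is simplified and is not teo's actual code.]

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	unsigned long capacity_orig = 1024;
	unsigned long util_threshold = capacity_orig / 2;	/* invented derivation */
	unsigned long capacity_actual = 400;			/* heavily capped CPU */

	/*
	 * If sched_cpu_util() were clamped to capacity_actual, it could
	 * never exceed a threshold derived from capacity_orig that sits
	 * above capacity_actual: the CPU would never look "utilized".
	 */
	unsigned long util = capacity_actual;	/* best case after clamping */
	bool is_utilized = util > util_threshold;

	printf("is_utilized: %s\n", is_utilized ? "true" : "false");	/* false, forever */
	return 0;
}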
On 06/19/24 11:05, Xuewen Yan wrote:
> On Tue, Jun 18, 2024 at 11:39 PM Qais Yousef <qyousef@layalina.io> wrote:
> [...]
> > Yes, I think so too, and that was my first statement. But I think this is
> > a bigger change to do separately.
>
> I saw that sched_cpu_util() is used by teo.c and cpufreq_cooling.c.
> If we change arch_scale_cpu_capacity() to get_actual_cpu_capacity(), might it
> cause some errors?
The plan is to revert this now.
>
> For example, in teo.c:
>
> static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
> {
>         return sched_cpu_util(cpu) > cpu_data->util_threshold;
> }
>
> It may cause teo_cpu_is_utilized() to return false forever if the
> actual_cpu_capacity is smaller than util_threshold.
> However, the util_threshold is derived from the original CPU capacity.
>
> In cpufreq_cooling.c, maybe we should change:
>
> diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
> index 280071be30b1..a8546d69cc10 100644
> --- a/drivers/thermal/cpufreq_cooling.c
> +++ b/drivers/thermal/cpufreq_cooling.c
> @@ -164,7 +164,7 @@ static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
>  {
>  	unsigned long util = sched_cpu_util(cpu);
>
> -	return (util * 100) / arch_scale_cpu_capacity(cpu);
> +	return (util * 100) / get_actual_cpu_capacity(cpu);
>  }
>  #else /* !CONFIG_SMP */
>  static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
>
>
> Because if we still use arch_scale_cpu_capacity(), the load percentage may be
> underestimated, which may affect the thermal IPA governor's power calculations.
I am not sure about this one. But looks plausible. Vincent?
On Wed, 19 Jun 2024 at 20:10, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/19/24 11:05, Xuewen Yan wrote:
> > I saw that sched_cpu_util() is used by teo.c and cpufreq_cooling.c.
> > If we change arch_scale_cpu_capacity() to get_actual_cpu_capacity(), might it
> > cause some errors?
>
> The plan is to revert this now.
>
> [...]
>
> > Because if we still use arch_scale_cpu_capacity(), the load percentage may be
> > underestimated, which may affect the thermal IPA governor's power calculations.
>
> I am not sure about this one. But looks plausible. Vincent?
I don't see why we should change them? We don't want to change
sched_cpu_util() either.

AFAICT, the only outcome of this thread is that we should use
get_actual_cpu_capacity() instead of arch_scale_cpu_capacity() in
util_fits_cpu(). Capping the utilization only makes the estimation
worse.
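[A sketch of what that direction could look like, assuming the then-current util_fits_cpu() reads the capacity via arch_scale_cpu_capacity(); an illustrative hunk with the context elided, not the patch that was eventually posted.]

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ static inline int util_fits_cpu(...) @@
-	capacity_orig = arch_scale_cpu_capacity(cpu);
+	capacity_orig = get_actual_cpu_capacity(cpu);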
On 06/20/24 09:45, Vincent Guittot wrote:
> On Wed, 19 Jun 2024 at 20:10, Qais Yousef <qyousef@layalina.io> wrote:
> [...]
> > I am not sure about this one. But looks plausible. Vincent?
>
> I don't see why we should change them? We don't want to change
> sched_cpu_util() either.
> AFAICT, the only outcome of this thread is that we should use
> get_actual_cpu_capacity() instead of arch_scale_cpu_capacity() in
> util_fits_cpu(). Capping the utilization only makes the estimation
> worse.
Yes, my bad. Only util_fits_cpu() is needed now.
On 20/06/2024 13:37, Qais Yousef wrote:
> On 06/20/24 09:45, Vincent Guittot wrote:
>> On Wed, 19 Jun 2024 at 20:10, Qais Yousef <qyousef@layalina.io> wrote:
>>>
>>> On 06/19/24 11:05, Xuewen Yan wrote:
>>>> On Tue, Jun 18, 2024 at 11:39 PM Qais Yousef <qyousef@layalina.io> wrote:
>>>>>
>>>>> On 06/18/24 17:23, Vincent Guittot wrote:
>>>>>> On Mon, 17 Jun 2024 at 12:53, Qais Yousef <qyousef@layalina.io> wrote:
>>>>>>>
>>>>>>> On 06/17/24 11:07, Vincent Guittot wrote:
[...]
>>>> diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
>>>> index 280071be30b1..a8546d69cc10 100644
>>>> --- a/drivers/thermal/cpufreq_cooling.c
>>>> +++ b/drivers/thermal/cpufreq_cooling.c
>>>> @@ -164,7 +164,7 @@ static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
>>>>  {
>>>>  	unsigned long util = sched_cpu_util(cpu);
>>>>
>>>> -	return (util * 100) / arch_scale_cpu_capacity(cpu);
>>>> +	return (util * 100) / get_actual_cpu_capacity(cpu);
>>>>  }
>>>>  #else /* !CONFIG_SMP */
>>>>  static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
>>>>
>>>>
>>>> Because if we still use arch_scale_cpu_capacity(), the load percentage may be
>>>> underestimated, which may affect the thermal IPA governor's power calculations.
>>>
>>> I am not sure about this one. But looks plausible. Vincent?
>>
>> I don't see why we should change them? We don't want to change
>> sched_cpu_util() either.
>> AFAICT, the only outcome of this thread is that we should use
>> get_actual_cpu_capacity() instead of arch_scale_cpu_capacity() in
>> util_fits_cpu(). Capping the utilization only makes the estimation
>> worse.
>
> Yes, my bad. Only util_fits_cpu() is needed now.
Looks like that's for the uclamp part (the second part) of util_fits_cpu().

For the first part we use capacity_of(), which is based on
get_actual_cpu_capacity() [via scale_rt_capacity()] and changes every 4ms
[at 250 Hz].

Our assumption is that hw_load_avg() and cpufreq_get_pressure() change far
less frequently than that, right?

So we can use capacity_of() and get_actual_cpu_capacity() in the same
code path.
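[For reference, the relationship described here is roughly the following; a sketch of kernel/sched/fair.c from memory, not verbatim. capacity_of() returns rq->cpu_capacity, which scale_rt_capacity() periodically recomputes starting from get_actual_cpu_capacity() and subtracting RT/DL/IRQ pressure.]

/* Sketch from memory of scale_rt_capacity() -- treat as approximate. */
static unsigned long scale_rt_capacity(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long max = get_actual_cpu_capacity(cpu);
	unsigned long used, irq;

	irq = cpu_util_irq(rq);
	if (unlikely(irq >= max))
		return 1;

	used = cpu_util_rt(rq);		/* RT pressure */
	used += cpu_util_dl(rq);	/* DL pressure */
	if (unlikely(used >= max))
		return 1;

	/* Scale the remaining capacity by the non-IRQ fraction of time. */
	return scale_irq_capacity(max - used, irq, max);
}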
On Tue, 18 Jun 2024 at 17:39, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/18/24 17:23, Vincent Guittot wrote:
[...]
> I *think* we have another source of error: we take util/cpu_cap as the
> percentage of time the CPU is busy. We assume an implicit multiplication
> with a time period, T. I am not sure this implicit assumption is accurate
> and that things are aligned properly, especially with how utilization
> loses temporal info due to invariance: util can be low while the actual
> runtime is much longer.

I'm not sure I get what you mean by "how utilization loses the temporal
info due to invariance".

Utilization aims to estimate the number of instructions to execute,
whatever the CPU of the system; once divided by the compute capacity of
the OPP of a CPU, it estimates how long it will take to do the job. So if
the capacity of an OPP of a CPU is low, it will reflect that the actual
runtime will be much longer. A low utilization means that you don't have
many instructions to execute, but says nothing about the speed at which
you will execute them.

Then problems start when we cap utilization to the CPU capacity, as an
example, because we cap this temporal info.

> I'm not sure this implicit multiplication handles that properly. Besides,
> with performance domains having shared CPUs, I am not sure this period is
> aligned across all CPUs for the implicit multiplication to work as
> intended.

It's all about averages, because it's too expensive, if not impossible,
to know when the instructions will be executed on the other CPUs. We can
only take the edge case (currently the worst case).

Beside that, there is the impact of uclamp making the selected OPP not
always sustainable but sometimes temporary.

> I have yet to study this properly, but I thought I'd mention it, as
> I think this (energy estimation) is increasingly becoming an important
> area to improve on.
On 06/18/24 23:05, Vincent Guittot wrote:
> On Tue, 18 Jun 2024 at 17:39, Qais Yousef <qyousef@layalina.io> wrote:
[...]
> I'm not sure I get what you mean by "how utilization loses the temporal
> info due to invariance".

The utilization value itself doesn't tell us the length of the task's
runtime, only its compute demand.

> Utilization aims to estimate the number of instructions to execute,
> whatever the CPU of the system; once divided by the compute capacity of
> the OPP of a CPU, it estimates how long it will take to do the job.

Yes for the number of instructions. And yes, the *ratio* can potentially
be a proxy for the *percentage* of time we are running. But we have no
idea about absolute runtime.

AFAIU, there's an assumption that this percentage of running time is
multiplied by an 'unidentified' period value to get a proxy for the time
the perf domain will run for. This time, multiplied by power, gives us
the energy. I am just not sure if we're losing information with all of
these transformations. I need to investigate more.

And we assume a periodic time interval for which we say the CPU will be
busy for this percentage of the time. I am not sure if at every wakeup
this period needs to be aligned. I think this will matter the most for
calculating the base_energy.

I am not sure if this makes sense :). I need to study the details more
anyway and collect some data. But my worry is generally whether our
approximation of runtime is good enough and how to improve it.

> So if the capacity of an OPP of a CPU is low, it will reflect that the
> actual runtime will be much longer. A low utilization means that you
> don't have many instructions to execute, but says nothing about the
> speed at which you will execute them.

Yes. But I am worried about whether the actual absolute time is being
approximated well enough or not.

> Then problems start when we cap utilization to the CPU capacity, as an
> example, because we cap this temporal info.

Yes. We agree on the existence of this problem.

> It's all about averages, because it's too expensive, if not impossible,
> to know when the instructions will be executed on the other CPUs. We can
> only take the edge case (currently the worst case).

Yes..

> Beside that, there is the impact of uclamp making the selected OPP not
> always sustainable but sometimes temporary.
[...]
Hi Qais,

On Mon, Jun 17, 2024 at 6:20 AM Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/12/24 16:11, Xuewen Yan wrote:
[...]
> Hmm. Sorry I might be thick here, but shouldn't fits_capacity() use
> capacity_of() which should return capacity based on get_actual_cpu_capacity()
> to compare if we are overutilized? If we are higher than this value that you
> need to cap, then the CPU must be overutilized and we shouldn't be in feec() in
> the first place, no? Unless the rq is capped with uclamp_max that is.

Sorry, I missed that fits_capacity() uses capacity_of(); without
uclamp_max, the rd is overutilized and would not use feec().

But I noticed that with uclamp_max, if the rq's uclamp_max is smaller
than SCHED_CAPACITY_SCALE but bigger than the actual_cpu_capacity,
util_fits_cpu() would return true and the rd is not overutilized.
Is this behavior intentional?

> And should effective_cpu_util() return a value higher than
> get_actual_cpu_capacity()?

I also thought about changing this at first, but because this function
is called in many places, I am not 100% sure that changing it will not
have any unexpected consequences, so I only changed feec() :)

cheers
---
xuewen

> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
[...]
On 06/17/24 15:27, Xuewen Yan wrote:
> Hi Qais,
>
> On Mon, Jun 17, 2024 at 6:20 AM Qais Yousef <qyousef@layalina.io> wrote:
[...]
> Sorry, I missed that fits_capacity() uses capacity_of(); without
> uclamp_max, the rd is overutilized and would not use feec().
>
> But I noticed that with uclamp_max, if the rq's uclamp_max is smaller
> than SCHED_CAPACITY_SCALE but bigger than the actual_cpu_capacity,
> util_fits_cpu() would return true and the rd is not overutilized.
> Is this behavior intentional?

Hmm. To a great extent, yes. We didn't want to take all types of rq
pressure into account for uclamp_max. But this corner case could be
debatable.

Is this the source of your problem? If you change util_fits_cpu() to
return false here, would this fix the problem you're seeing?

> I also thought about changing this at first, but because this function
> is called in many places, I am not 100% sure that changing it will not
> have any unexpected consequences, so I only changed feec() :)

Yes, I had similar doubts. But the question had to be asked :-)
On Mon, 17 Jun 2024 at 13:03, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/17/24 15:27, Xuewen Yan wrote:
> > Sorry, I missed that fits_capacity() uses capacity_of(); without
> > uclamp_max, the rd is overutilized and would not use feec().
> >
> > But I noticed that with uclamp_max, if the rq's uclamp_max is smaller
> > than SCHED_CAPACITY_SCALE but bigger than the actual_cpu_capacity,
> > util_fits_cpu() would return true and the rd is not overutilized.
> > Is this behavior intentional?
>
> Hmm. To a great extent, yes. We didn't want to take all types of rq
> pressure into account for uclamp_max. But this corner case could be
> debatable.

Shouldn't we use get_actual_cpu_capacity() instead of
arch_scale_cpu_capacity() everywhere in util_fits_cpu()?
get_actual_cpu_capacity() appeared recently, and there were discussions
about whether to use the thermal load_avg, but everything is fixed now.
I think that using get_actual_cpu_capacity() everywhere in
util_fits_cpu() would make sense and would cover the case reported by
Xuewen just above.

> Is this the source of your problem? If you change util_fits_cpu() to
> return false here, would this fix the problem you're seeing?
[...]
On 06/18/24 17:20, Vincent Guittot wrote:
> > > Sorry, I missed that fits_capacity() uses capacity_of(); without
> > > uclamp_max, the rd is overutilized and would not use feec().
> > > But I noticed that with uclamp_max, if the rq's uclamp_max is smaller
> > > than SCHED_CAPACITY_SCALE but bigger than the actual_cpu_capacity,
> > > util_fits_cpu() would return true and the rd is not overutilized.
> > > Is this behavior intentional?
> >
> > Hmm. To a great extent, yes. We didn't want to take all types of rq
> > pressure into account for uclamp_max. But this corner case could be
> > debatable.
>
> Shouldn't we use get_actual_cpu_capacity() instead of
> arch_scale_cpu_capacity() everywhere in util_fits_cpu()?
> get_actual_cpu_capacity() appeared recently, and there were discussions
> about whether to use the thermal load_avg, but everything is fixed now.
> I think that using get_actual_cpu_capacity() everywhere in
> util_fits_cpu() would make sense and would cover the case reported by
> Xuewen just above.

Yes, agreed. I think we need both patches. Although we need to confirm
that uclamp_max is what is causing the situation Xuewen is seeing.
Otherwise we have a race somewhere that needs to be understood.
On 06/17/24 12:03, Qais Yousef wrote:
> > Sorry, I missed that fits_capacity() uses capacity_of(); without
> > uclamp_max, the rd is overutilized and would not use feec().
> >
> > But I noticed that with uclamp_max, if the rq's uclamp_max is smaller
> > than SCHED_CAPACITY_SCALE but bigger than the actual_cpu_capacity,
> > util_fits_cpu() would return true and the rd is not overutilized.
> > Is this behavior intentional?
>
> Hmm. To a great extent, yes. We didn't want to take all types of rq
> pressure into account for uclamp_max. But this corner case could be
> debatable.
>
> Is this the source of your problem? If you change util_fits_cpu() to
> return false here, would this fix the problem you're seeing?

FWIW, if this happens due to uclamp_max, then this patch to do the
capping is still needed.

I think it's good to understand first how we end up in feec() when a CPU
is supposed to be overutilized. uclamp_max is the only way to override
this decision AFAICT..
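[To make the corner case concrete, a simplified sketch; the helper and its logic are hypothetical, not the kernel's actual util_fits_cpu() code.]

#include <stdbool.h>

/*
 * Hypothetical, simplified model of the corner case discussed above:
 * uclamp_max is compared against the *original* capacity, so a rq capped
 * below SCHED_CAPACITY_SCALE but above the pressure-reduced actual
 * capacity still "fits" and the rd is not marked overutilized.
 */
static bool sketch_util_fits_cpu(unsigned long util, unsigned long uclamp_max,
				 unsigned long capacity_orig,
				 unsigned long capacity_actual)
{
	if (util <= capacity_actual)
		return true;	/* fits even the pressure-reduced capacity */

	/*
	 * util exceeds the actual capacity; an intentional uclamp_max cap
	 * below the original capacity is still treated as fitting, without
	 * consulting capacity_actual.
	 */
	return uclamp_max < capacity_orig;
}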
On Tue, Jun 18, 2024 at 10:58 PM Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/17/24 12:03, Qais Yousef wrote:
[...]
> FWIW, if this happens due to uclamp_max, then this patch to do the
> capping is still needed.
>
> I think it's good to understand first how we end up in feec() when a CPU
> is supposed to be overutilized. uclamp_max is the only way to override
> this decision AFAICT..

Sorry for the late reply...
In our own tree, we removed the check for rd overutilization in feec(),
so the above case often occurs.
And now it seems that on the mainline, uclamp_max is the only way to
override this.

Thanks!

BR
---
xuewen
On 19/06/2024 04:46, Xuewen Yan wrote:
> On Tue, Jun 18, 2024 at 10:58 PM Qais Yousef <qyousef@layalina.io> wrote:
>>
>> On 06/17/24 12:03, Qais Yousef wrote:

[...]

> Sorry for the late reply...
> In our own tree, we removed the check for rd overutilization in feec(),
> so the above case often occurs.

How do you schedule hackbench on this thing then? Via EAS, or do you just
exclude this kind of workload?

> And now it seems that on the mainline, uclamp_max is the only way to
> override this.

[...]
On Fri, 21 Jun 2024 at 12:41, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 19/06/2024 04:46, Xuewen Yan wrote:
[...]
> How do you schedule hackbench on this thing then? Via EAS, or do you just
> exclude this kind of workload?

I don't know the details for Xuewen, but that's also what I'm doing as
part of the rework I'm doing on EAS. As I said at OSPM, I'm also calling
feec() every time, even when overutilized. feec() returns -1 when it
can't find a suitable CPU, and we then fall back to the default
performance path.

> > And now it seems that on the mainline, uclamp_max is the only way to
> > override this.
>
> [...]
On Fri, Jun 21, 2024 at 9:01 PM Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Fri, 21 Jun 2024 at 12:41, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >
> > On 19/06/2024 04:46, Xuewen Yan wrote:
> > > On Tue, Jun 18, 2024 at 10:58 PM Qais Yousef <qyousef@layalina.io> wrote:
> > >>
> > >> On 06/17/24 12:03, Qais Yousef wrote:
> >
> > [...]
> >
> > > Sorry for the late reply...
> > > In our own tree, we removed the check for rd overutil in feec(), so
> > > the above case often occurs.
> >
> > How do you schedule hackbench on this thing then? Via EAS, or do you
> > just exclude this kind of workload?
>
> Don't know the details for Xuewen, but that's also what I'm doing as
> part of the rework I'm doing on EAS. As I said at OSPM, I'm also calling
> feec() every time, even when overutilized. feec() returns -1 when it
> can't find a suitable cpu and we then fall back to the default
> performance path.
>

Aha, we do the same. We use EAS on Android, and there are often only one
or two tasks in the overutilized state, but we don't want this scenario
to affect the placement of other small tasks.
Only when all CPUs are in the overutilized state does feec() return -1,
and then we fall back to the default performance path.

BR
---
xuewen
On 06/19/24 10:46, Xuewen Yan wrote:
> On Tue, Jun 18, 2024 at 10:58 PM Qais Yousef <qyousef@layalina.io> wrote:
> >
> > On 06/17/24 12:03, Qais Yousef wrote:
> > >
> > > > Sorry, I missed that fits_capacity() uses capacity_of(), and without
> > > > uclamp_max the rd is over-utilized and would not use feec().
> > > > But I noticed that with uclamp_max, if the rq's uclamp_max is smaller
> > > > than SCHED_CAPACITY_SCALE and bigger than the actual_cpu_capacity,
> > > > util_fits_cpu() would return true, and the rd is not over-utilized.
> > > > Is this setting intentional?
> > >
> > > Hmm. To a great extent, yes. We didn't want to take all types of rq
> > > pressure into account for uclamp_max. But this corner case could be
> > > debatable.
> > >
> > > Is this the source of your problem? If you change util_fits_cpu() to
> > > return false here, would this fix the problem you're seeing?
> >
> > FWIW, if this happens due to uclamp_max, then this patch to do the
> > capping is still needed.
> >
> > I think it's good to understand first how we end up in feec() when a
> > CPU is supposed to be overutilized. uclamp_max is the only way to
> > override this decision AFAICT..
>
> Sorry for the late reply...
> In our own tree, we removed the check for rd overutil in feec(), so
> the above case often occurs.

Yeah, the current definition of overutilized is not good enough. So I
can see the need to do that..

> And now it seems that on the mainline, uclamp_max is the only way to
> override this.

Thanks for checking. Your patch is correct, and if we combine it with
another patch to use get_actual_cpu_capacity() in util_fits_cpu(), they
are good improvements.

Are you still happy to send updated patches?


Thanks!

--
Qais Yousef
On Thu, Jun 20, 2024 at 2:03 AM Qais Yousef <qyousef@layalina.io> wrote:
>
> On 06/19/24 10:46, Xuewen Yan wrote:
> > On Tue, Jun 18, 2024 at 10:58 PM Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > On 06/17/24 12:03, Qais Yousef wrote:
> > > >
> > > > > Sorry, I missed that fits_capacity() uses capacity_of(), and without
> > > > > uclamp_max the rd is over-utilized and would not use feec().
> > > > > But I noticed that with uclamp_max, if the rq's uclamp_max is smaller
> > > > > than SCHED_CAPACITY_SCALE and bigger than the actual_cpu_capacity,
> > > > > util_fits_cpu() would return true, and the rd is not over-utilized.
> > > > > Is this setting intentional?
> > > >
> > > > Hmm. To a great extent, yes. We didn't want to take all types of rq
> > > > pressure into account for uclamp_max. But this corner case could be
> > > > debatable.
> > > >
> > > > Is this the source of your problem? If you change util_fits_cpu() to
> > > > return false here, would this fix the problem you're seeing?
> > >
> > > FWIW, if this happens due to uclamp_max, then this patch to do the
> > > capping is still needed.
> > >
> > > I think it's good to understand first how we end up in feec() when a
> > > CPU is supposed to be overutilized. uclamp_max is the only way to
> > > override this decision AFAICT..
> >
> > Sorry for the late reply...
> > In our own tree, we removed the check for rd overutil in feec(), so
> > the above case often occurs.
>
> Yeah, the current definition of overutilized is not good enough. So I
> can see the need to do that..
>
> > And now it seems that on the mainline, uclamp_max is the only way to
> > override this.
>
> Thanks for checking. Your patch is correct, and if we combine it with
> another patch to use get_actual_cpu_capacity() in util_fits_cpu(), they
> are good improvements.
>
> Are you still happy to send updated patches?

Okay, I'll send the new patches soon:)

Thanks!
--
xuewen
On 06/06/2024 09:06, Xuewen Yan wrote:
> Because effective_cpu_util() may return a util which is
> bigger than the actual_cpu_capacity, this could cause
> pd_busy_time calculation errors.
Doesn't effective_cpu_util() return either scale or min(scale, util),
with scale = arch_scale_cpu_capacity(cpu)? So the util sum over the PD
cannot exceed eenv->cpu_cap?
Looks like this was the case with 3e8c6c9aac42 already.
> So clamp the cpu_busy_time with the eenv->cpu_cap, which is
> the actual_cpu_capacity.
>
> Fixes: 3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()")
> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
> ---
> kernel/sched/fair.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a5b1ae0aa55..8939d725023a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7870,7 +7870,9 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
> for_each_cpu(cpu, pd_cpus) {
> unsigned long util = cpu_util(cpu, p, -1, 0);
>
> - busy_time += effective_cpu_util(cpu, util, NULL, NULL);
> + util = effective_cpu_util(cpu, util, NULL, NULL);
> + util = min(eenv->cpu_cap, util);
> + busy_time += util;
> }
>
> eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
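For context, a condensed sketch of the capping Dietmar is asking about
(simplified from effective_cpu_util(); cpu_util_others_sketch() is an
illustrative stand-in for the RT/DL/IRQ contributions):

unsigned long effective_cpu_util_sketch(int cpu, unsigned long util_cfs)
{
	/* scale is the original capacity, not the pressure-reduced one */
	unsigned long scale = arch_scale_cpu_capacity(cpu);
	unsigned long util = util_cfs + cpu_util_others_sketch(cpu);

	/*
	 * The per-CPU result is capped at scale, while eenv->cpu_cap
	 * holds the smaller actual_cpu_capacity; the gap between the
	 * two is what the rest of the thread turns on.
	 */
	return min(scale, util);
}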
Hi Dietmar
On Fri, Jun 7, 2024 at 3:19 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 06/06/2024 09:06, Xuewen Yan wrote:
> > Because effective_cpu_util() may return a util which is
> > bigger than the actual_cpu_capacity, this could cause
> > pd_busy_time calculation errors.
>
> Doesn't effective_cpu_util() return either scale or min(scale, util),
> with scale = arch_scale_cpu_capacity(cpu)? So the util sum over the PD
> cannot exceed eenv->cpu_cap?
In effective_cpu_util(), scale = arch_scale_cpu_capacity(cpu), so the
per-CPU result is only capped at the arch scale.
Although there is the clamp of eenv->pd_cap, let us consider the
following simple scenario:
The pd cpus are 4-7, the arch_scale_capacity is 1024, and because of
a cpufreq limit,
the cpu_actual_cap = 512. Then eenv->cpu_cap = 512 and eenv->pd_cap = 2048;
effective_cpu_util(4) = 1024;
effective_cpu_util(5) = 1024;
effective_cpu_util(6) = 256;
effective_cpu_util(7) = 0;
Without this patch, eenv->pd_busy_time = 2048, because of the clamp to
eenv->pd_cap.
However, cpu4's and cpu5's util cannot actually exceed the actual_cap
(512), so cpu_util for cpu4/5 should be 512 instead of 1024.
With this patch, eenv->pd_busy_time = 512+512+256 = 1280, which is
more reasonable.
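To make the arithmetic concrete, a small standalone sketch with the
capacities above (illustrative values only):

#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
	unsigned long cpu_cap = 512, pd_cap = 2048;
	unsigned long util[] = { 1024, 1024, 256, 0 };	/* CPUs 4-7 */
	unsigned long old_busy = 0, new_busy = 0;
	int i;

	for (i = 0; i < 4; i++) {
		old_busy += util[i];			/* unclamped sum */
		new_busy += MIN(cpu_cap, util[i]);	/* per-CPU clamp */
	}
	/* without the patch: min(2048, 2304) = 2048 */
	printf("old pd_busy_time = %lu\n", MIN(pd_cap, old_busy));
	/* with the patch: 512 + 512 + 256 + 0 = 1280 */
	printf("new pd_busy_time = %lu\n", MIN(pd_cap, new_busy));
	return 0;
}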
BR
--
xuewen
> Looks like this was the case with 3e8c6c9aac42 already.
>
> > So clamp the cpu_busy_time with the eenv->cpu_cap, which is
> > the actual_cpu_capacity.
> >
> > Fixes: 3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()")
> > Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
> > ---
> > kernel/sched/fair.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a5b1ae0aa55..8939d725023a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7870,7 +7870,9 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
> > for_each_cpu(cpu, pd_cpus) {
> > unsigned long util = cpu_util(cpu, p, -1, 0);
> >
> > - busy_time += effective_cpu_util(cpu, util, NULL, NULL);
> > + util = effective_cpu_util(cpu, util, NULL, NULL);
> > + util = min(eenv->cpu_cap, util);
> > + busy_time += util;
> > }
> >
> > eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
>
On 07/06/2024 10:20, Xuewen Yan wrote:
> Hi Dietmar
>
> On Fri, Jun 7, 2024 at 3:19 PM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 06/06/2024 09:06, Xuewen Yan wrote:
>>> Because effective_cpu_util() may return a util which is
>>> bigger than the actual_cpu_capacity, this could cause
>>> pd_busy_time calculation errors.
>>
>> Doesn't effective_cpu_util() return either scale or min(scale, util),
>> with scale = arch_scale_cpu_capacity(cpu)? So the util sum over the PD
>> cannot exceed eenv->cpu_cap?
>
> In effective_cpu_util(), scale = arch_scale_cpu_capacity(cpu), so the
> per-CPU result is only capped at the arch scale.
> Although there is the clamp of eenv->pd_cap, let us consider the
> following simple scenario:
> The pd cpus are 4-7, the arch_scale_capacity is 1024, and because of
> a cpufreq limit,
Ah, this is due to:
find_energy_efficient_cpu()
...
for (; pd; pd = pd->next)
...
cpu_actual_cap = get_actual_cpu_capacity(cpu)
for_each_cpu(cpu, cpus)
...
eenv.pd_cap += cpu_actual_cap
and:
get_actual_cpu_capacity()
...
capacity = arch_scale_cpu_capacity(cpu)
capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu))
which got introduced by f1f8d0a22422 ("sched/cpufreq: Take cpufreq
feedback into account").
[...]
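Put as code, the pressure adjustment Dietmar quotes amounts to the
following condensed sketch (the example numbers continue the scenario
from this thread):

static unsigned long get_actual_cpu_capacity_sketch(int cpu)
{
	/* e.g. 1024 for the big CPUs in the example above */
	unsigned long capacity = arch_scale_cpu_capacity(cpu);

	/* e.g. a cpufreq pressure of 512 leaves an actual capacity of 512 */
	capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu));

	return capacity;
}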
On Fri, Jun 7, 2024 at 6:30 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 07/06/2024 10:20, Xuewen Yan wrote:
> > Hi Dietmar
> >
> > On Fri, Jun 7, 2024 at 3:19 PM Dietmar Eggemann
> > <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 06/06/2024 09:06, Xuewen Yan wrote:
> >>> Because effective_cpu_util() may return a util which is
> >>> bigger than the actual_cpu_capacity, this could cause
> >>> pd_busy_time calculation errors.
> >>
> >> Doesn't effective_cpu_util() return either scale or min(scale, util),
> >> with scale = arch_scale_cpu_capacity(cpu)? So the util sum over the PD
> >> cannot exceed eenv->cpu_cap?
> >
> > In effective_cpu_util(), scale = arch_scale_cpu_capacity(cpu), so the
> > per-CPU result is only capped at the arch scale.
> > Although there is the clamp of eenv->pd_cap, let us consider the
> > following simple scenario:
> > The pd cpus are 4-7, the arch_scale_capacity is 1024, and because of
> > a cpufreq limit,
>
> Ah, this is due to:
>
> find_energy_efficient_cpu()
>
> ...
> for (; pd; pd = pd->next)
> ...
> cpu_actual_cap = get_actual_cpu_capacity(cpu)
>
> for_each_cpu(cpu, cpus)
> ...
> eenv.pd_cap += cpu_actual_cap
>
> and:
>
> get_actual_cpu_capacity()
>
> ...
> capacity = arch_scale_cpu_capacity(cpu)
>
> capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu))
>
> which got introduced by f1f8d0a22422 ("sched/cpufreq: Take cpufreq
> feedback into account").
I don't think it was introduced by f1f8d0a22422, because f1f8d0a22422
just replaced the cpu_thermal_cap with get_actual_cpu_capacity(cpu).
The eenv.cpu_cap was introduced by 3e8c6c9aac42 ("sched/fair: Remove
task_util from effective utilization in feec()").
BR
---
xuewen
>
> [...]
On 07/06/2024 12:37, Xuewen Yan wrote:
> On Fri, Jun 7, 2024 at 6:30 PM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 07/06/2024 10:20, Xuewen Yan wrote:
>>> Hi Dietmar
>>>
>>> On Fri, Jun 7, 2024 at 3:19 PM Dietmar Eggemann
>>> <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 06/06/2024 09:06, Xuewen Yan wrote:
>>>>> Because effective_cpu_util() may return a util which is
>>>>> bigger than the actual_cpu_capacity, this could cause
>>>>> pd_busy_time calculation errors.
>>>>
>>>> Doesn't effective_cpu_util() return either scale or min(scale, util),
>>>> with scale = arch_scale_cpu_capacity(cpu)? So the util sum over the PD
>>>> cannot exceed eenv->cpu_cap?
>>>
>>> In effective_cpu_util(), scale = arch_scale_cpu_capacity(cpu), so the
>>> per-CPU result is only capped at the arch scale.
>>> Although there is the clamp of eenv->pd_cap, let us consider the
>>> following simple scenario:
>>> The pd cpus are 4-7, the arch_scale_capacity is 1024, and because of
>>> a cpufreq limit,
>>
>> Ah, this is due to:
>>
>> find_energy_efficient_cpu()
>>
>> ...
>> for (; pd; pd = pd->next)
>> ...
>> cpu_actual_cap = get_actual_cpu_capacity(cpu)
>>
>> for_each_cpu(cpu, cpus)
>> ...
>> eenv.pd_cap += cpu_actual_cap
>>
>> and:
>>
>> get_actual_cpu_capacity()
>>
>> ...
>> capacity = arch_scale_cpu_capacity(cpu)
>>
>> capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu))
>>
>> which got introduced by f1f8d0a22422 ("sched/cpufreq: Take cpufreq
>> feedback into account").
>
> I don't think it was introduced by f1f8d0a22422, because f1f8d0a22422
> just replaced the cpu_thermal_cap with get_actual_cpu_capacity(cpu).
> The eenv.cpu_cap was introduced by 3e8c6c9aac42 ("sched/fair: Remove
> task_util from effective utilization in feec()").
Yes, you're right. 3e8c6c9aac42 changed it from per-CPU to per-PD
capping.
In case we want to go back to per-CPU then we should remove the
eenv->pd_cap capping in eenv_pd_busy_time().
-->8--
@@ -7864,16 +7864,15 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
struct cpumask *pd_cpus,
struct task_struct *p)
{
- unsigned long busy_time = 0;
int cpu;
for_each_cpu(cpu, pd_cpus) {
unsigned long util = cpu_util(cpu, p, -1, 0);
- busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+ util = effective_cpu_util(cpu, util, NULL, NULL);
+ util = min(util, eenv->cpu_cap);
+ eenv->pd_busy_time += util;
}
-
- eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
}
I'm wondering whether we would need the:
if (dst_cpu >= 0)
busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
in compute_energy() anymore, since we only get a candidate CPU in feec()
after checking with util_fits_cpu() whether the cpu can accommodate p:
feec()
...
for_each_cpu()
util = cpu_util(cpu, p, cpu, ...)
cpu_cap = capacity_of()
...
fits = util_fits_cpu(util, ..., cpu);
if (!fits)
continue
/* check if candidate CPU */
On Fri, Jun 21, 2024 at 6:22 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 07/06/2024 12:37, Xuewen Yan wrote:
> > On Fri, Jun 7, 2024 at 6:30 PM Dietmar Eggemann
> > <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 07/06/2024 10:20, Xuewen Yan wrote:
> >>> Hi Dietmar
> >>>
> >>> On Fri, Jun 7, 2024 at 3:19 PM Dietmar Eggemann
> >>> <dietmar.eggemann@arm.com> wrote:
> >>>>
> >>>> On 06/06/2024 09:06, Xuewen Yan wrote:
> >>>>> Because effective_cpu_util() may return a util which is
> >>>>> bigger than the actual_cpu_capacity, this could cause
> >>>>> pd_busy_time calculation errors.
> >>>>
> >>>> Doesn't effective_cpu_util() return either scale or min(scale, util),
> >>>> with scale = arch_scale_cpu_capacity(cpu)? So the util sum over the PD
> >>>> cannot exceed eenv->cpu_cap?
> >>>
> >>> In effective_cpu_util(), scale = arch_scale_cpu_capacity(cpu), so the
> >>> per-CPU result is only capped at the arch scale.
> >>> Although there is the clamp of eenv->pd_cap, let us consider the
> >>> following simple scenario:
> >>> The pd cpus are 4-7, the arch_scale_capacity is 1024, and because of
> >>> a cpufreq limit,
> >>
> >> Ah, this is due to:
> >>
> >> find_energy_efficient_cpu()
> >>
> >> ...
> >> for (; pd; pd = pd->next)
> >> ...
> >> cpu_actual_cap = get_actual_cpu_capacity(cpu)
> >>
> >> for_each_cpu(cpu, cpus)
> >> ...
> >> eenv.pd_cap += cpu_actual_cap
> >>
> >> and:
> >>
> >> get_actual_cpu_capacity()
> >>
> >> ...
> >> capacity = arch_scale_cpu_capacity(cpu)
> >>
> >> capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu))
> >>
> >> which got introduced by f1f8d0a22422 ("sched/cpufreq: Take cpufreq
> >> feedback into account").
> >
> > I don't think it was introduced by f1f8d0a22422, because f1f8d0a22422
> > just replaced the cpu_thermal_cap with get_actual_cpu_capacity(cpu).
> > The eenv.cpu_cap was introduced by 3e8c6c9aac42 ("sched/fair: Remove
> > task_util from effective utilization in feec()").
>
> Yes, you're right. 3e8c6c9aac42 changed it from per-CPU to per-PD
> capping.
>
> In case we want to go back to per-CPU then we should remove the
> eenv->pd_cap capping in eenv_pd_busy_time().
>
> -->8--
>
> @@ -7864,16 +7864,15 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
> struct cpumask *pd_cpus,
> struct task_struct *p)
> {
> - unsigned long busy_time = 0;
> int cpu;
>
> for_each_cpu(cpu, pd_cpus) {
> unsigned long util = cpu_util(cpu, p, -1, 0);
>
> - busy_time += effective_cpu_util(cpu, util, NULL, NULL);
> + util = effective_cpu_util(cpu, util, NULL, NULL);
> + util = min(util, eenv->cpu_cap);
> + eenv->pd_busy_time += util;
> }
> -
> - eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
> }
Okay, the pd-busy clamp is indeed unnecessary.
>
>
>
> I'm wondering whether we would need the:
>
> if (dst_cpu >= 0)
> busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
>
> in compute_energy() anymore, since we only get a candidate CPU in feec()
> after checking with util_fits_cpu() whether the cpu can accommodate p:
I think this condition is still necessary: pd_busy_time is now clamped
per CPU, but without this min() the sum of busy_time and task_busy_time
could still exceed eenv->pd_cap, i.e. the pd's actual capacity.
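For reference, the condition under discussion sits in compute_energy();
a condensed sketch follows (simplified, not the verbatim kernel code):

static unsigned long compute_energy_sketch(struct energy_env *eenv,
					   struct perf_domain *pd,
					   unsigned long max_util, int dst_cpu)
{
	unsigned long busy_time = eenv->pd_busy_time;

	/*
	 * pd_busy_time is built from per-CPU clamped values, but adding
	 * the waking task's busy time for the dst_cpu case could still
	 * push the sum past the pd's actual capacity, so the min() with
	 * eenv->pd_cap is kept.
	 */
	if (dst_cpu >= 0)
		busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);

	return em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
}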
>
> feec()
>
> ...
>
> for_each_cpu()
>
> util = cpu_util(cpu, p, cpu, ...)
> cpu_cap = capacity_of()
>
> ...
>
> fits = util_fits_cpu(util, ..., cpu);
> if (!fits)
> continue
>
> /* check if candidate CPU */