cpufreq,base/arch_topology: Calculate cpu_capacity according to boost

[RFC PATCH] cpufreq,base/arch_topology: Calculate cpu_capacity according to boost

Posted by Dietmar Eggemann 7 months, 2 weeks ago

I noticed on my Arm64 big.Little platform (Juno-r0, scmi-cpufreq) that
the cpu_scale values (/sys/devices/system/cpu/cpu*/cpu_capacity) of the
little CPU changed in v6.14 from 446 to 505. I bisected and found that
commit dd016f379ebc ("cpufreq: Introduce a more generic way to set
default per-policy boost flag") (1) introduced this change.
Juno's scmi FW marks the 2 topmost OPPs of each CPUfreq policy (policy0:
775000 850000, policy1: 950000 1100000) as boost OPPs.

The reason is that the 'policy->boost_enabled = true' is now done after
'cpufreq_table_validate_and_sort() -> cpufreq_frequency_table_cpuinfo()'
in cpufreq_online() so that 'policy->cpuinfo.max_freq' is set to the
'highest non-boost' instead of the 'highest boost' frequency.

This is before the CPUFREQ_CREATE_POLICY notifier is fired in
cpufreq_online() to which the cpu_capacity setup code in
[drivers/base/arch_topology.c] has registered.

Its notifier_call init_cpu_capacity_callback() uses
'policy->cpuinfo.max_freq' to set the per-cpu
capacity_freq_ref so that the cpu_capacity can be calculated as:

cpu_capacity = raw_cpu_capacity (2) * capacity_freq_ref /
				      'max system-wide cpu frequency'

(2) Juno's little CPU has 'capacity-dmips-mhz = <578>'.

So before (1) for a little CPU:

cpu_capacity = 578 * 850000 / 1100000 = 446

and after:

cpu_capacity = 578 * 700000 / 800000 = 505

This issue can also be seen on Arm64 boards with cpufreq-dt drivers
using the 'turbo-mode' dt property for boosted OPPs.

What's actually needed IMHO is to calculate cpu_capacity according to
the boost value. I.e.:

(a) The infrastructure to adjust cpu_capacity in arch_topology.c has to
    be kept alive after boot.

(b) There has to be some kind of notification from cpufreq.c to
    arch_topology.c about the toggling of boost. I'm abusing
    CPUFREQ_CREATE_POLICY for this right now. Could we perhaps add a
    CPUFREQ_MOD_POLICY for this?

(c) Allow unconditional set of policy->cpuinfo.max_freq in case boost
    is set to 0 in cpufreq_frequency_table_cpuinfo().
    This currently clashes with the commented feature that in case the
    driver has set a higher value it should stay untouched.

Tested on Arm64 Juno (scmi-cpufreq) and Hikey 960 (cpufreq-dt +
added 'turbo-mode' to the topmost OPPs in dts file).

This is probably related what Christian Loehle tried to address in
https://lkml.kernel.org/r/3cc5b83b-f81c-4bd7-b7ff-4d02db4e25d8@arm.com .

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 drivers/base/arch_topology.c | 11 -----------
 drivers/cpufreq/cpufreq.c    |  3 +++
 drivers/cpufreq/freq_table.c |  8 +-------
 3 files changed, 4 insertions(+), 18 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1c3221ff1d1f..0a3916dc9644 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -378,8 +378,6 @@ void acpi_processor_init_invariance_cppc(void)
 
 #ifdef CONFIG_CPU_FREQ
 static cpumask_var_t cpus_to_visit;
-static void parsing_done_workfn(struct work_struct *work);
-static DECLARE_WORK(parsing_done_work, parsing_done_workfn);
 
 static int
 init_cpu_capacity_callback(struct notifier_block *nb,
@@ -408,10 +406,8 @@ init_cpu_capacity_callback(struct notifier_block *nb,
 		if (raw_capacity) {
 			topology_normalize_cpu_scale();
 			schedule_work(&update_topology_flags_work);
-			free_raw_capacity();
 		}
 		pr_debug("cpu_capacity: parsing done\n");
-		schedule_work(&parsing_done_work);
 	}
 
 	return 0;
@@ -447,13 +443,6 @@ static int __init register_cpufreq_notifier(void)
 }
 core_initcall(register_cpufreq_notifier);
 
-static void parsing_done_workfn(struct work_struct *work)
-{
-	cpufreq_unregister_notifier(&init_cpu_capacity_notifier,
-					 CPUFREQ_POLICY_NOTIFIER);
-	free_cpumask_var(cpus_to_visit);
-}
-
 #else
 core_initcall(free_raw_capacity);
 #endif
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index d7426e1d8bdd..6fdfcb6815d7 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2798,6 +2798,9 @@ int cpufreq_boost_set_sw(struct cpufreq_policy *policy, int state)
 		return ret;
 	}
 
+	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
+				     CPUFREQ_CREATE_POLICY, policy);
+
 	ret = freq_qos_update_request(policy->max_freq_req, policy->max);
 	if (ret < 0)
 		return ret;
diff --git a/drivers/cpufreq/freq_table.c b/drivers/cpufreq/freq_table.c
index 35de513af6c9..06068a12ec53 100644
--- a/drivers/cpufreq/freq_table.c
+++ b/drivers/cpufreq/freq_table.c
@@ -51,13 +51,7 @@ int cpufreq_frequency_table_cpuinfo(struct cpufreq_policy *policy,
 	}
 
 	policy->min = policy->cpuinfo.min_freq = min_freq;
-	policy->max = max_freq;
-	/*
-	 * If the driver has set its own cpuinfo.max_freq above max_freq, leave
-	 * it as is.
-	 */
-	if (policy->cpuinfo.max_freq < max_freq)
-		policy->max = policy->cpuinfo.max_freq = max_freq;
+	policy->max = policy->cpuinfo.max_freq = max_freq;
 
 	if (policy->min == ~0)
 		return -EINVAL;
-- 
2.34.1

Re: [RFC PATCH] cpufreq,base/arch_topology: Calculate cpu_capacity according to boost

Posted by Dietmar Eggemann 6 months, 3 weeks ago

+cc Vincent Guittot <vincent.guittot@linaro.org>
+cc Ionela Voinescu <ionela.voinescu@arm.com>

On 26/06/2025 11:30, Dietmar Eggemann wrote:
> I noticed on my Arm64 big.Little platform (Juno-r0, scmi-cpufreq) that
> the cpu_scale values (/sys/devices/system/cpu/cpu*/cpu_capacity) of the
> little CPU changed in v6.14 from 446 to 505. I bisected and found that
> commit dd016f379ebc ("cpufreq: Introduce a more generic way to set
> default per-policy boost flag") (1) introduced this change.
> Juno's scmi FW marks the 2 topmost OPPs of each CPUfreq policy (policy0:
> 775000 850000, policy1: 950000 1100000) as boost OPPs.
> 
> The reason is that the 'policy->boost_enabled = true' is now done after
> 'cpufreq_table_validate_and_sort() -> cpufreq_frequency_table_cpuinfo()'
> in cpufreq_online() so that 'policy->cpuinfo.max_freq' is set to the
> 'highest non-boost' instead of the 'highest boost' frequency.
> 
> This is before the CPUFREQ_CREATE_POLICY notifier is fired in
> cpufreq_online() to which the cpu_capacity setup code in
> [drivers/base/arch_topology.c] has registered.
> 
> Its notifier_call init_cpu_capacity_callback() uses
> 'policy->cpuinfo.max_freq' to set the per-cpu
> capacity_freq_ref so that the cpu_capacity can be calculated as:
> 
> cpu_capacity = raw_cpu_capacity (2) * capacity_freq_ref /
> 				      'max system-wide cpu frequency'
> 
> (2) Juno's little CPU has 'capacity-dmips-mhz = <578>'.
> 
> So before (1) for a little CPU:
> 
> cpu_capacity = 578 * 850000 / 1100000 = 446
> 
> and after:
> 
> cpu_capacity = 578 * 700000 / 800000 = 505
> 
> This issue can also be seen on Arm64 boards with cpufreq-dt drivers
> using the 'turbo-mode' dt property for boosted OPPs.
> 
> What's actually needed IMHO is to calculate cpu_capacity according to
> the boost value. I.e.:
> 
> (a) The infrastructure to adjust cpu_capacity in arch_topology.c has to
>     be kept alive after boot.
> 
> (b) There has to be some kind of notification from cpufreq.c to
>     arch_topology.c about the toggling of boost. I'm abusing
>     CPUFREQ_CREATE_POLICY for this right now. Could we perhaps add a
>     CPUFREQ_MOD_POLICY for this?
> 
> (c) Allow unconditional set of policy->cpuinfo.max_freq in case boost
>     is set to 0 in cpufreq_frequency_table_cpuinfo().
>     This currently clashes with the commented feature that in case the
>     driver has set a higher value it should stay untouched.
> 
> Tested on Arm64 Juno (scmi-cpufreq) and Hikey 960 (cpufreq-dt +
> added 'turbo-mode' to the topmost OPPs in dts file).
> 
> This is probably related what Christian Loehle tried to address in
> https://lkml.kernel.org/r/3cc5b83b-f81c-4bd7-b7ff-4d02db4e25d8@arm.com .

Christian L. reminded me that since commit dd016f379ebc we also have a
performance regression on a system with boosted OPPs using schedutil
CPUfreq governor.

The reason is that per cpu 'capacity_freq_ref' is set in
drivers/base/arch_topology.c only during system boot so far based on the
highest non-boosted OPP since boost is disabled per default.

Schedutil uses capacity_freq_ref (*) in get_next_freq() to calculate the
next frequency request:

   next_freq = max_freq * util / max
               ^^^^^^^^
                 (*)

In case the boost OPPs will be enabled:

   echo 1 > /sys/devices/system/cpu/cpufreq/boost

'capacity_freq_ref' stays at the highest non-boosted OPP's so schedutil
won't request any boosted OPPs for util values > ''highest non boosted
OPP'/'highest boosted OPP' * max'. The 'highest non boosted OPP' will be
used by schedutil instead.

This performance regression will go away with the proposed patch as well.

Calling drivers/base/arch_topology.c's init_cpu_capacity_callback() in
the event that boost is toggled makes sure that 'capacity_freq_ref' will
be set to the highest boosted (0->1) or highest non-boosted (1->0) OPP.

[...]

Re: [RFC PATCH] cpufreq,base/arch_topology: Calculate cpu_capacity according to boost

Posted by Vincent Guittot 6 months ago

On Mon, 14 Jul 2025 at 14:17, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> +cc Vincent Guittot <vincent.guittot@linaro.org>
> +cc Ionela Voinescu <ionela.voinescu@arm.com>
>
> On 26/06/2025 11:30, Dietmar Eggemann wrote:
> > I noticed on my Arm64 big.Little platform (Juno-r0, scmi-cpufreq) that
> > the cpu_scale values (/sys/devices/system/cpu/cpu*/cpu_capacity) of the
> > little CPU changed in v6.14 from 446 to 505. I bisected and found that
> > commit dd016f379ebc ("cpufreq: Introduce a more generic way to set
> > default per-policy boost flag") (1) introduced this change.
> > Juno's scmi FW marks the 2 topmost OPPs of each CPUfreq policy (policy0:
> > 775000 850000, policy1: 950000 1100000) as boost OPPs.
> >
> > The reason is that the 'policy->boost_enabled = true' is now done after
> > 'cpufreq_table_validate_and_sort() -> cpufreq_frequency_table_cpuinfo()'
> > in cpufreq_online() so that 'policy->cpuinfo.max_freq' is set to the
> > 'highest non-boost' instead of the 'highest boost' frequency.
> >
> > This is before the CPUFREQ_CREATE_POLICY notifier is fired in
> > cpufreq_online() to which the cpu_capacity setup code in
> > [drivers/base/arch_topology.c] has registered.
> >
> > Its notifier_call init_cpu_capacity_callback() uses
> > 'policy->cpuinfo.max_freq' to set the per-cpu
> > capacity_freq_ref so that the cpu_capacity can be calculated as:
> >
> > cpu_capacity = raw_cpu_capacity (2) * capacity_freq_ref /
> >                                     'max system-wide cpu frequency'
> >
> > (2) Juno's little CPU has 'capacity-dmips-mhz = <578>'.
> >
> > So before (1) for a little CPU:
> >
> > cpu_capacity = 578 * 850000 / 1100000 = 446
> >
> > and after:
> >
> > cpu_capacity = 578 * 700000 / 800000 = 505
> >
> > This issue can also be seen on Arm64 boards with cpufreq-dt drivers
> > using the 'turbo-mode' dt property for boosted OPPs.
> >
> > What's actually needed IMHO is to calculate cpu_capacity according to
> > the boost value. I.e.:
> >
> > (a) The infrastructure to adjust cpu_capacity in arch_topology.c has to
> >     be kept alive after boot.

If we adjust the cpu_capacity at runtime this will create oscillation
in PELT values. We should stay with one single capacity all time :
- either include boost value but when boost is disable we will never
reach the max capacity of the cpu which could imply that the cpu will
never be overloaded (from scheduler pov)
- either not include boost_value but allow to go above cpu max compute
capacity which is something we already discussed for x86 and the turbo
freq in the past.

> >
> > (b) There has to be some kind of notification from cpufreq.c to
> >     arch_topology.c about the toggling of boost. I'm abusing
> >     CPUFREQ_CREATE_POLICY for this right now. Could we perhaps add a
> >     CPUFREQ_MOD_POLICY for this?
> >
> > (c) Allow unconditional set of policy->cpuinfo.max_freq in case boost
> >     is set to 0 in cpufreq_frequency_table_cpuinfo().
> >     This currently clashes with the commented feature that in case the
> >     driver has set a higher value it should stay untouched.
> >
> > Tested on Arm64 Juno (scmi-cpufreq) and Hikey 960 (cpufreq-dt +
> > added 'turbo-mode' to the topmost OPPs in dts file).
> >
> > This is probably related what Christian Loehle tried to address in
> > https://lkml.kernel.org/r/3cc5b83b-f81c-4bd7-b7ff-4d02db4e25d8@arm.com .
>
> Christian L. reminded me that since commit dd016f379ebc we also have a
> performance regression on a system with boosted OPPs using schedutil
> CPUfreq governor.
>
> The reason is that per cpu 'capacity_freq_ref' is set in
> drivers/base/arch_topology.c only during system boot so far based on the
> highest non-boosted OPP since boost is disabled per default.
>
> Schedutil uses capacity_freq_ref (*) in get_next_freq() to calculate the
> next frequency request:
>
>    next_freq = max_freq * util / max
>                ^^^^^^^^
>                  (*)
>
> In case the boost OPPs will be enabled:
>
>    echo 1 > /sys/devices/system/cpu/cpufreq/boost
>
> 'capacity_freq_ref' stays at the highest non-boosted OPP's so schedutil
> won't request any boosted OPPs for util values > ''highest non boosted
> OPP'/'highest boosted OPP' * max'. The 'highest non boosted OPP' will be
> used by schedutil instead.
>
> This performance regression will go away with the proposed patch as well.
>
> Calling drivers/base/arch_topology.c's init_cpu_capacity_callback() in
> the event that boost is toggled makes sure that 'capacity_freq_ref' will
> be set to the highest boosted (0->1) or highest non-boosted (1->0) OPP.
>
> [...]
>
>
>
>
>

Re: [RFC PATCH] cpufreq,base/arch_topology: Calculate cpu_capacity according to boost

Posted by Christian Loehle 6 months ago

On 8/4/25 14:01, Vincent Guittot wrote:
> On Mon, 14 Jul 2025 at 14:17, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> +cc Vincent Guittot <vincent.guittot@linaro.org>
>> +cc Ionela Voinescu <ionela.voinescu@arm.com>
>>
>> On 26/06/2025 11:30, Dietmar Eggemann wrote:
>>> I noticed on my Arm64 big.Little platform (Juno-r0, scmi-cpufreq) that
>>> the cpu_scale values (/sys/devices/system/cpu/cpu*/cpu_capacity) of the
>>> little CPU changed in v6.14 from 446 to 505. I bisected and found that
>>> commit dd016f379ebc ("cpufreq: Introduce a more generic way to set
>>> default per-policy boost flag") (1) introduced this change.
>>> Juno's scmi FW marks the 2 topmost OPPs of each CPUfreq policy (policy0:
>>> 775000 850000, policy1: 950000 1100000) as boost OPPs.
>>>
>>> The reason is that the 'policy->boost_enabled = true' is now done after
>>> 'cpufreq_table_validate_and_sort() -> cpufreq_frequency_table_cpuinfo()'
>>> in cpufreq_online() so that 'policy->cpuinfo.max_freq' is set to the
>>> 'highest non-boost' instead of the 'highest boost' frequency.
>>>
>>> This is before the CPUFREQ_CREATE_POLICY notifier is fired in
>>> cpufreq_online() to which the cpu_capacity setup code in
>>> [drivers/base/arch_topology.c] has registered.
>>>
>>> Its notifier_call init_cpu_capacity_callback() uses
>>> 'policy->cpuinfo.max_freq' to set the per-cpu
>>> capacity_freq_ref so that the cpu_capacity can be calculated as:
>>>
>>> cpu_capacity = raw_cpu_capacity (2) * capacity_freq_ref /
>>>                                     'max system-wide cpu frequency'
>>>
>>> (2) Juno's little CPU has 'capacity-dmips-mhz = <578>'.
>>>
>>> So before (1) for a little CPU:
>>>
>>> cpu_capacity = 578 * 850000 / 1100000 = 446
>>>
>>> and after:
>>>
>>> cpu_capacity = 578 * 700000 / 800000 = 505
>>>
>>> This issue can also be seen on Arm64 boards with cpufreq-dt drivers
>>> using the 'turbo-mode' dt property for boosted OPPs.
>>>
>>> What's actually needed IMHO is to calculate cpu_capacity according to
>>> the boost value. I.e.:
>>>
>>> (a) The infrastructure to adjust cpu_capacity in arch_topology.c has to
>>>     be kept alive after boot.
> 
> If we adjust the cpu_capacity at runtime this will create oscillation
> in PELT values. We should stay with one single capacity all time :
> - either include boost value but when boost is disable we will never
> reach the max capacity of the cpu which could imply that the cpu will
> never be overloaded (from scheduler pov)

overutilized I'm assuming, that's the issue I was worried about here.
Strictly speaking the platform doesn't guarantee that the capacity can
be reached and sustained indefinitely. Whether the frequency is marked
as boost or not.

> - either not include boost_value but allow to go above cpu max compute
> capacity which is something we already discussed for x86 and the turbo
> freq in the past.
> 

But that currently breaks schedutil, i.e. boost frequencies will never
be used with schedutil. There's also some other locations where capacities
>1024 just break some assumptions (e.g. the kernel/sched/ext.c cpuperf
interface defines SCX_CPUPERF_ONE).


So we have either:
a) Potential wrong capacity estimation of CPUs when boost is disabled
(but capacity calculation assumed enabled).
b) Boost frequencies completely unused by schedutil.
c) Oscillating PELT values due to boost enable/disable.

Isn't c) (what Dietmar proposed here) by far the smallest evil of these
three?
I've also found a) very hard to actually trigger, although it's obviously
a problem that depends on the platform.

Re: [RFC PATCH] cpufreq,base/arch_topology: Calculate cpu_capacity according to boost

Posted by Vincent Guittot 6 months ago

On Mon, 4 Aug 2025 at 15:18, Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 8/4/25 14:01, Vincent Guittot wrote:
> > On Mon, 14 Jul 2025 at 14:17, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> +cc Vincent Guittot <vincent.guittot@linaro.org>
> >> +cc Ionela Voinescu <ionela.voinescu@arm.com>
> >>
> >> On 26/06/2025 11:30, Dietmar Eggemann wrote:
> >>> I noticed on my Arm64 big.Little platform (Juno-r0, scmi-cpufreq) that
> >>> the cpu_scale values (/sys/devices/system/cpu/cpu*/cpu_capacity) of the
> >>> little CPU changed in v6.14 from 446 to 505. I bisected and found that
> >>> commit dd016f379ebc ("cpufreq: Introduce a more generic way to set
> >>> default per-policy boost flag") (1) introduced this change.
> >>> Juno's scmi FW marks the 2 topmost OPPs of each CPUfreq policy (policy0:
> >>> 775000 850000, policy1: 950000 1100000) as boost OPPs.
> >>>
> >>> The reason is that the 'policy->boost_enabled = true' is now done after
> >>> 'cpufreq_table_validate_and_sort() -> cpufreq_frequency_table_cpuinfo()'
> >>> in cpufreq_online() so that 'policy->cpuinfo.max_freq' is set to the
> >>> 'highest non-boost' instead of the 'highest boost' frequency.
> >>>
> >>> This is before the CPUFREQ_CREATE_POLICY notifier is fired in
> >>> cpufreq_online() to which the cpu_capacity setup code in
> >>> [drivers/base/arch_topology.c] has registered.
> >>>
> >>> Its notifier_call init_cpu_capacity_callback() uses
> >>> 'policy->cpuinfo.max_freq' to set the per-cpu
> >>> capacity_freq_ref so that the cpu_capacity can be calculated as:
> >>>
> >>> cpu_capacity = raw_cpu_capacity (2) * capacity_freq_ref /
> >>>                                     'max system-wide cpu frequency'
> >>>
> >>> (2) Juno's little CPU has 'capacity-dmips-mhz = <578>'.
> >>>
> >>> So before (1) for a little CPU:
> >>>
> >>> cpu_capacity = 578 * 850000 / 1100000 = 446
> >>>
> >>> and after:
> >>>
> >>> cpu_capacity = 578 * 700000 / 800000 = 505
> >>>
> >>> This issue can also be seen on Arm64 boards with cpufreq-dt drivers
> >>> using the 'turbo-mode' dt property for boosted OPPs.
> >>>
> >>> What's actually needed IMHO is to calculate cpu_capacity according to
> >>> the boost value. I.e.:
> >>>
> >>> (a) The infrastructure to adjust cpu_capacity in arch_topology.c has to
> >>>     be kept alive after boot.
> >
> > If we adjust the cpu_capacity at runtime this will create oscillation
> > in PELT values. We should stay with one single capacity all time :
> > - either include boost value but when boost is disable we will never
> > reach the max capacity of the cpu which could imply that the cpu will
> > never be overloaded (from scheduler pov)
>
> overutilized I'm assuming, that's the issue I was worried about here.

no I was referring to group_is_overloaded which use /Sum of CPU's capacity
That' also true for EAS and overutilized

> Strictly speaking the platform doesn't guarantee that the capacity can
> be reached and sustained indefinitely. Whether the frequency is marked
> as boost or not.

Regarding thermal mitigation and user max freq, we take that into account

>
> > - either not include boost_value but allow to go above cpu max compute
> > capacity which is something we already discussed for x86 and the turbo
> > freq in the past.
> >
>
> But that currently breaks schedutil, i.e. boost frequencies will never
> be used with schedutil. There's also some other locations where capacities

We should allow capacity to go above 1024 to reflect HW reality with
turbo and here enabling/diabling boost

> >1024 just break some assumptions (e.g. the kernel/sched/ext.c cpuperf
> interface defines SCX_CPUPERF_ONE).
>
>
> So we have either:
> a) Potential wrong capacity estimation of CPUs when boost is disabled
> (but capacity calculation assumed enabled).
> b) Boost frequencies completely unused by schedutil.
> c) Oscillating PELT values due to boost enable/disable.
>
> Isn't c) (what Dietmar proposed here) by far the smallest evil of these
> three?

No, this breaks PELT invariance

> I've also found a) very hard to actually trigger, although it's obviously
> a problem that depends on the platform.